Top Related Projects
- MiDaS: Code for robust monocular depth estimation described in "Ranftl et al., Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer, TPAMI 2022"
- Detectron2: A platform for object detection, segmentation, and other visual recognition tasks.
- DROID-SLAM: Deep visual SLAM for monocular, stereo, and RGB-D cameras
- ZoeDepth: Metric depth estimation from a single image
Quick Overview
DPT (Dense Prediction Transformers) is a GitHub repository by Intel ISL that implements vision transformers for dense prediction tasks. It focuses on monocular depth estimation and semantic segmentation, offering state-of-the-art performance on various benchmarks.
Pros
- High-performance models for dense prediction tasks
- Pretrained weights available for quick implementation
- Supports multiple vision tasks (depth estimation, segmentation)
- Well-documented with clear instructions for usage
Cons
- Requires significant computational resources for training
- Limited to specific vision tasks, not a general-purpose library
- Dependency on specific versions of PyTorch and other libraries
- Relatively complex architecture, which may be challenging for beginners
Code Examples
- Loading a pretrained DPT model for depth estimation:
import torch
from dpt.models import DPTDepthModel

# Select the device once; the inference example below reuses it
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Weights file from the download links in the repository
model = DPTDepthModel(
    path="weights/dpt_large-midas-2f21e586.pt",
    backbone="vitl16_384",
    non_negative=True,
    enable_attention_hooks=False,
)
model.to(device)
model.eval()
- Preprocessing an input image for the DPT model:
import cv2
from torchvision.transforms import Compose
from dpt.transforms import Resize, NormalizeImage, PrepareForNet

# Load an image as float RGB in [0, 1], the format the transforms expect
img = cv2.imread("input/example.jpg")  # illustrative path; any RGB image works
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB) / 255.0

transform = Compose([
    Resize(
        384,
        384,
        resize_target=None,
        keep_aspect_ratio=True,
        ensure_multiple_of=32,
        resize_method="minimal",
        image_interpolation_method=cv2.INTER_CUBIC,
    ),
    NormalizeImage(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
    PrepareForNet(),
])
img_input = transform({"image": img})["image"]
- Running inference with the DPT model:
with torch.no_grad():
    sample = torch.from_numpy(img_input).to(device).unsqueeze(0)
    prediction = model(sample)
    # Resize the prediction back to the original image resolution
    prediction = torch.nn.functional.interpolate(
        prediction.unsqueeze(1),
        size=img.shape[:2],
        mode="bicubic",
        align_corners=False,
    ).squeeze()
output = prediction.cpu().numpy()
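The prediction is a relative inverse-depth map. A simple way to inspect it is to normalize it to 8 bits and write it out as an image; this is a minimal sketch (the filename is illustrative), not the repository's own write_depth utility:
import numpy as np

# Scale the relative inverse-depth map to [0, 255] for visualization
depth_min, depth_max = output.min(), output.max()
depth_vis = (255 * (output - depth_min) / (depth_max - depth_min)).astype(np.uint8)
cv2.imwrite("depth_vis.png", depth_vis)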
Getting Started
To get started with DPT:
1. Clone the repository:
git clone https://github.com/isl-org/DPT.git
cd DPT
2. Install dependencies:
pip install -r requirements.txt
3. Download pretrained weights from the links provided in the repository.
4. Use the provided scripts, or integrate the models into your own code as shown in the examples above.
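If you only need predictions and would rather not clone the repository, the DPT models are also served through the MiDaS torch.hub entry point (shown again in the MiDaS comparison below):
import torch

# DPT-Large weights are downloaded automatically on first use
model = torch.hub.load("intel-isl/MiDaS", "DPT_Large")
model.eval()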
Competitor Comparisons
Code for robust monocular depth estimation described in "Ranftl et al., Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer, TPAMI 2022"
Pros of MiDaS
- More established project with a longer history and larger user base
- Supports a wider range of pre-trained models for different use cases
- Better documentation and examples for ease of use
Cons of MiDaS
- Slightly lower accuracy compared to DPT for certain tasks
- Less focus on real-time performance optimization
- Fewer options for fine-tuning on custom datasets
Code Comparison
MiDaS:
model_type = "DPT_Large"
midas = torch.hub.load("intel-isl/MiDaS", model_type)
midas.to(device).eval()
DPT:
model = DPTDepthModel(
path=path_to_model,
backbone="vitb_rn50_384",
non_negative=True,
)
model.eval()
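The torch.hub entry point also ships matching preprocessing transforms, which avoids hand-building the Compose pipeline shown earlier. A sketch based on the documented MiDaS hub API; img is assumed to be an HxWx3 RGB numpy array and device is defined as in the loading example:
midas_transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
transform = midas_transforms.dpt_transform  # pairs with the DPT_Large model

input_batch = transform(img).to(device)
with torch.no_grad():
    prediction = midas(input_batch)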
Both repositories focus on monocular depth estimation, with DPT being a more recent advancement building upon MiDaS. DPT offers improved accuracy and performance for certain tasks, while MiDaS provides a broader range of pre-trained models and better documentation. DPT is more suitable for users seeking state-of-the-art results, while MiDaS may be preferable for those requiring a more established and well-documented solution with a variety of model options.
Detectron2 is a platform for object detection, segmentation, and other visual recognition tasks.
Pros of Detectron2
- More comprehensive and feature-rich object detection framework
- Extensive documentation and community support
- Supports a wider range of computer vision tasks
Cons of Detectron2
- Steeper learning curve due to its complexity
- Heavier resource requirements for training and inference
Code Comparison
DPT (Dense Prediction Transformers):
from dpt.models import DPTDepthModel

model = DPTDepthModel(path="weights/dpt_large-midas-2f21e586.pt", backbone="vitl16_384")
prediction = model(sample)  # sample: preprocessed 1x3xHxW tensor
Detectron2:
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
predictor = DefaultPredictor(cfg)
outputs = predictor(image)  # image: BGR numpy array (OpenCV convention)
Key Differences
- DPT focuses on depth estimation and monocular tasks
- Detectron2 offers a broader range of object detection and segmentation capabilities
- DPT uses transformer-based architecture, while Detectron2 primarily uses CNN-based models
- Detectron2 provides more flexibility in model configuration and customization
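To make the flexibility point concrete: Detectron2 returns predictions as a structured Instances object rather than a dense map. A short sketch of unpacking the outputs above with the standard Detectron2 API:
instances = outputs["instances"].to("cpu")
boxes = instances.pred_boxes.tensor.numpy()  # (N, 4) XYXY boxes
scores = instances.scores.numpy()            # (N,) confidence scores
classes = instances.pred_classes.numpy()     # (N,) class indices
masks = instances.pred_masks.numpy()         # (N, H, W) boolean masks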
DROID-SLAM: Deep visual SLAM for monocular, stereo, and RGB-D cameras
Pros of DROID-SLAM
- Focuses on real-time dense SLAM, providing more comprehensive 3D reconstruction
- Utilizes deep learning for feature extraction and matching, potentially improving accuracy in challenging environments
- Includes loop closure detection for improved mapping consistency
Cons of DROID-SLAM
- May require more computational resources due to its dense reconstruction approach
- Potentially less suitable for scenarios where only depth estimation is needed
- Could have a steeper learning curve for implementation and customization
Code Comparison
DROID-SLAM:
class DROID(nn.Module):
    def __init__(self, args):
        super(DROID, self).__init__()
        self.update_op = DroidUpdateOp.apply
        self.nets = nn.ModuleDict()
        self.nets['update'] = UpdateModule(args)
DPT:
class DPTDepthModel(DPT):
    def __init__(self, path=None, non_negative=True, **kwargs):
        features = kwargs["features"] if "features" in kwargs else 256
        # ... output head construction omitted ...
        super().__init__(head, **kwargs)
        if path is not None:
            self.load(path)
Both repositories focus on different aspects of 3D vision. DROID-SLAM is geared towards real-time dense SLAM, while DPT specializes in monocular depth estimation. DROID-SLAM may offer more comprehensive 3D reconstruction but could be more resource-intensive. DPT, on the other hand, might be more suitable for applications that primarily require depth information without full SLAM capabilities.
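The trade-off in the last paragraph can be made concrete: with known camera intrinsics, a single depth map from DPT already yields a 3D point cloud with no SLAM machinery at all. A minimal sketch; the intrinsics fx, fy, cx, cy are hypothetical placeholders, and depth is assumed to be a metric depth map (e.g. from the KITTI- or NYUv2-finetuned models mentioned in the README below):
import numpy as np

def depth_to_pointcloud(depth, fx, fy, cx, cy):
    # Back-project an HxW metric depth map to an (H*W, 3) point cloud
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)

# Hypothetical pinhole intrinsics, for illustration only
points = depth_to_pointcloud(depth, fx=525.0, fy=525.0, cx=319.5, cy=239.5)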
Metric depth estimation from a single image
Pros of ZoeDepth
- Predicts metric (absolute) depth, where DPT's default models output only relative depth
- Faster inference time, especially on mobile devices
- More lightweight model architecture
Cons of ZoeDepth
- Less extensive documentation compared to DPT
- Fewer pre-trained models available
- Limited support for older hardware
Code Comparison
ZoeDepth:
import torch

# Documented torch.hub entry point for the combined NYU+KITTI model
model = torch.hub.load("isl-org/ZoeDepth", "ZoeD_NK", pretrained=True)
depth_map = model.infer_pil(image)  # image: PIL.Image; returns metric depth
DPT:
import torch

# DPT-Large via the MiDaS torch.hub entry point
model = torch.hub.load("intel-isl/MiDaS", "DPT_Large")
depth_map = model(input_batch)  # input_batch: preprocessed 1x3xHxW tensor
Both repositories focus on depth estimation, but they target different outputs: ZoeDepth predicts metric depth and ships convenience helpers such as infer_pil that handle preprocessing internally, while DPT returns relative depth and expects the caller to build the preprocessing pipeline. ZoeDepth is the more convenient choice when absolute depth values are required, while DPT offers broader documentation and a wider range of pretrained models, making it well suited to research and experimentation.
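Because DPT's default models predict relative inverse depth, comparing them with metric output such as ZoeDepth's usually requires aligning scale and shift first. Below is a sketch of the least-squares alignment used in the MiDaS/DPT evaluation protocol; pred and gt are hypothetical arrays of relative predictions and metric ground truth:
import numpy as np

def align_scale_shift(pred, gt):
    # Least-squares fit of s, t such that s * pred + t approximates gt
    A = np.stack([pred.ravel(), np.ones(pred.size)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, gt.ravel(), rcond=None)
    return s * pred + t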
README
Vision Transformers for Dense Prediction
This repository contains code and models for our paper:
Vision Transformers for Dense Prediction
René Ranftl, Alexey Bochkovskiy, Vladlen Koltun
Changelog
- [March 2021] Initial release of inference code and models
Setup
1. Download the model weights and place them in the weights folder. Download links for the monodepth and segmentation weights are provided in the repository.
2. Set up dependencies:
pip install -r requirements.txt
The code was tested with Python 3.7, PyTorch 1.8.0, OpenCV 4.5.1, and timm 0.4.5.
Usage
1. Place one or more input images in the folder input.
2. Run a monocular depth estimation model:
python run_monodepth.py
Or run a semantic segmentation model:
python run_segmentation.py
3. The results are written to the folders output_monodepth and output_semseg, respectively.
Use the flag -t to switch between different models. Possible options are dpt_hybrid (default) and dpt_large.
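The monodepth script writes both a PFM file and a 16-bit PNG per input image. To load a PNG result back for further processing, a minimal sketch (the filename is illustrative; values are relative inverse depth unless an absolute-depth model is used):
import cv2

depth = cv2.imread("output_monodepth/example.png", cv2.IMREAD_UNCHANGED)
depth = depth.astype("float32") / 65535.0  # undo the 16-bit quantization to [0, 1]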
Additional models:
- Monodepth finetuned on KITTI: dpt_hybrid_kitti-cb926ef4.pt
- Monodepth finetuned on NYUv2: dpt_hybrid_nyu-2ce69ec7.pt
Run with:
python run_monodepth.py -t [dpt_hybrid_kitti|dpt_hybrid_nyu]
Evaluation
Hints on how to evaluate monodepth models can be found here: https://github.com/intel-isl/DPT/blob/main/EVALUATION.md
Citation
Please cite our papers if you use this code or any of the models.
@inproceedings{Ranftl2021,
    author    = {Ren\'{e} Ranftl and Alexey Bochkovskiy and Vladlen Koltun},
    title     = {Vision Transformers for Dense Prediction},
    booktitle = {IEEE/CVF International Conference on Computer Vision (ICCV)},
    year      = {2021},
}
@article{Ranftl2020,
    author  = {Ren\'{e} Ranftl and Katrin Lasinger and David Hafner and Konrad Schindler and Vladlen Koltun},
    title   = {Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer},
    journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)},
    year    = {2020},
}
Acknowledgements
Our work builds on and uses code from timm and PyTorch-Encoding. We'd like to thank the authors for making these libraries available.
License
MIT License