
isl-org / DPT

Dense Prediction Transformers

Top Related Projects

  • MiDaS: Code for robust monocular depth estimation described in "Ranftl et al., Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer", TPAMI 2022
  • Detectron2: A platform for object detection, segmentation and other visual recognition tasks
  • ZoeDepth: Metric depth estimation from a single image

Quick Overview

DPT (Dense Prediction Transformers) is a GitHub repository by Intel ISL that applies vision transformers to dense prediction tasks. It focuses on monocular depth estimation and semantic segmentation, offering state-of-the-art performance on various benchmarks.

Pros

  • High-performance models for dense prediction tasks
  • Pretrained weights available for quick implementation
  • Supports multiple vision tasks (depth estimation, segmentation)
  • Well-documented with clear instructions for usage

Cons

  • Requires significant computational resources for training
  • Limited to specific vision tasks, not a general-purpose library
  • Dependency on specific versions of PyTorch and other libraries
  • Relatively complex architecture, which may be challenging for beginners

Code Examples

  1. Loading a pretrained DPT model for depth estimation:
import torch
from dpt.models import DPTDepthModel

model = DPTDepthModel(
    path="weights/dpt_large-midas-2f21e586.pt",
    backbone="vitl16_384",
    non_negative=True,
    enable_attention_hooks=False,
)
model.eval()

# Move the model to GPU if available; `device` is reused in the inference example below
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
  2. Preprocessing an input image for the DPT model:
import cv2
from torchvision.transforms import Compose
from dpt.transforms import Resize, NormalizeImage, PrepareForNet

# Read an RGB image and scale it to [0, 1] (the path is an example)
img = cv2.cvtColor(cv2.imread("input/example.jpg"), cv2.COLOR_BGR2RGB) / 255.0

transform = Compose([
    Resize(
        384,
        384,
        resize_target=None,
        keep_aspect_ratio=True,
        ensure_multiple_of=32,
        resize_method="minimal",
        image_interpolation_method=cv2.INTER_CUBIC,
    ),
    NormalizeImage(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
    PrepareForNet(),
])

img_input = transform({"image": img})["image"]
  3. Running inference with the DPT model:
with torch.no_grad():
    sample = torch.from_numpy(img_input).to(device).unsqueeze(0)
    prediction = model.forward(sample)
    # Resize the prediction back to the original image resolution
    prediction = torch.nn.functional.interpolate(
        prediction.unsqueeze(1),
        size=img.shape[:2],
        mode="bicubic",
        align_corners=False,
    ).squeeze()
    output = prediction.cpu().numpy()
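
The resulting output is a relative (inverse) depth map. As a minimal sketch for inspecting it, you can normalize the values and write a 16-bit PNG; the normalization and output path below are illustrative and not part of the repository's own I/O utilities:

import cv2
import numpy as np

# Normalize the relative depth prediction to [0, 1] for visualization
depth_min, depth_max = output.min(), output.max()
if depth_max - depth_min > 1e-8:
    depth_vis = (output - depth_min) / (depth_max - depth_min)
else:
    depth_vis = np.zeros_like(output)

# Save as a 16-bit PNG (OpenCV supports uint16 PNG output)
cv2.imwrite("output_monodepth/example.png", (depth_vis * 65535).astype("uint16"))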

Getting Started

To get started with DPT:

  1. Clone the repository:

    git clone https://github.com/isl-org/DPT.git
    cd DPT
    
  2. Install dependencies:

    pip install -r requirements.txt
    
  3. Download pretrained weights from the provided links in the repository.

  4. Use the provided scripts or integrate the models into your own code as shown in the code examples above.

Competitor Comparisons

MiDaS

Code for robust monocular depth estimation described in "Ranftl et al., Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer", TPAMI 2022

Pros of MiDaS

  • More established project with a longer history and larger user base
  • Supports a wider range of pre-trained models for different use cases
  • Better documentation and examples for ease of use

Cons of MiDaS

  • Slightly lower accuracy compared to DPT for certain tasks
  • Less focus on real-time performance optimization
  • Fewer options for fine-tuning on custom datasets

Code Comparison

MiDaS:

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_type = "DPT_Large"
midas = torch.hub.load("intel-isl/MiDaS", model_type)
midas.to(device).eval()

DPT:

model = DPTDepthModel(
    path=path_to_model,
    backbone="vitb_rn50_384",
    non_negative=True,
)
model.eval()

Both repositories focus on monocular depth estimation, with DPT being a more recent advancement building upon MiDaS. DPT offers improved accuracy and performance for certain tasks, while MiDaS provides a broader range of pre-trained models and better documentation. DPT is more suitable for users seeking state-of-the-art results, while MiDaS may be preferable for those requiring a more established and well-documented solution with a variety of model options.
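
For context, MiDaS also distributes its matching preprocessing transforms through torch.hub, so a complete inference pass can stay quite short. A minimal sketch following the MiDaS README usage (the image path is an illustrative example):

import cv2
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the model and the transforms that match it
midas = torch.hub.load("intel-isl/MiDaS", "DPT_Large").to(device).eval()
midas_transforms = torch.hub.load("intel-isl/MiDaS", "transforms")

img = cv2.cvtColor(cv2.imread("input/example.jpg"), cv2.COLOR_BGR2RGB)
batch = midas_transforms.dpt_transform(img).to(device)

with torch.no_grad():
    prediction = midas(batch)
depth = prediction.squeeze().cpu().numpy()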

Detectron2

Detectron2 is a platform for object detection, segmentation and other visual recognition tasks.

Pros of Detectron2

  • More comprehensive and feature-rich object detection framework
  • Extensive documentation and community support
  • Supports a wider range of computer vision tasks

Cons of Detectron2

  • Steeper learning curve due to its complexity
  • Heavier resource requirements for training and inference

Code Comparison

DPT (Dense Prediction Transformers):

from dpt.models import DPTDepthModel

model = DPTDepthModel(path="weights/dpt_large-midas-2f21e586.pt", backbone="vitl16_384")
depth_map = model(input_tensor)  # preprocessed (1, 3, H, W) batch

Detectron2:

from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
predictor = DefaultPredictor(cfg)
outputs = predictor(image)

Key Differences

  • DPT focuses on depth estimation and monocular tasks
  • Detectron2 offers a broader range of object detection and segmentation capabilities
  • DPT uses transformer-based architecture, while Detectron2 primarily uses CNN-based models
  • Detectron2 provides more flexibility in model configuration and customization
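
To illustrate the last point, most Detectron2 behaviour is driven through its config object. A small sketch of common tweaks (the threshold, device, and class count below are arbitrary example values):

from detectron2 import model_zoo
from detectron2.config import get_cfg

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))

# Typical adjustments: confidence threshold, device, and class count for a custom dataset
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5
cfg.MODEL.DEVICE = "cpu"
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 3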

DROID-SLAM

Pros of DROID-SLAM

  • Focuses on real-time dense SLAM, providing more comprehensive 3D reconstruction
  • Utilizes deep learning for feature extraction and matching, potentially improving accuracy in challenging environments
  • Includes loop closure detection for improved mapping consistency

Cons of DROID-SLAM

  • May require more computational resources due to its dense reconstruction approach
  • Potentially less suitable for scenarios where only depth estimation is needed
  • Could have a steeper learning curve for implementation and customization

Code Comparison

DROID-SLAM:

class DROID(nn.Module):
    def __init__(self, args):
        super(DROID, self).__init__()
        self.update_op = DroidUpdateOp.apply
        self.nets = nn.ModuleDict()
        self.nets['update'] = UpdateModule(args)

DPT:

class DPTDepthModel(DPT):
    def __init__(self, path=None, non_negative=True, **kwargs):
        features = kwargs["features"] if "features" in kwargs else 256
        # ... construction of the depth-regression head omitted ...
        super().__init__(head, **kwargs)
        if path is not None:
            self.load(path)

Both repositories focus on different aspects of 3D vision. DROID-SLAM is geared towards real-time dense SLAM, while DPT specializes in monocular depth estimation. DROID-SLAM may offer more comprehensive 3D reconstruction but could be more resource-intensive. DPT, on the other hand, might be more suitable for applications that primarily require depth information without full SLAM capabilities.

ZoeDepth

Metric depth estimation from a single image

Pros of ZoeDepth

  • Improved performance and accuracy in depth estimation
  • Faster inference time, especially on mobile devices
  • More lightweight model architecture

Cons of ZoeDepth

  • Less extensive documentation compared to DPT
  • Fewer pre-trained models available
  • Limited support for older hardware

Code Comparison

ZoeDepth:

import torch

# ZoeD_NK is the variant trained on NYU Depth v2 + KITTI
model = torch.hub.load("isl-org/ZoeDepth", "ZoeD_NK", pretrained=True)
depth_map = model.infer_pil(image)  # PIL image in, metric depth map out

DPT:

from dpt.models import DPTDepthModel

model = DPTDepthModel(path="weights/dpt_large-midas-2f21e586.pt", backbone="vitl16_384")
depth_map = model(input_tensor)  # preprocessed (1, 3, H, W) batch

Both repositories focus on depth estimation, but ZoeDepth offers improved performance and efficiency, particularly for mobile applications. DPT provides more comprehensive documentation and a wider range of pre-trained models. The code comparison shows that ZoeDepth has a slightly simpler API for model initialization and inference, while DPT follows a more traditional PyTorch approach. Overall, ZoeDepth is better suited for lightweight, mobile-friendly applications, while DPT may be preferred for more extensive research and experimentation.

README

Vision Transformers for Dense Prediction

This repository contains code and models for our paper:

Vision Transformers for Dense Prediction
René Ranftl, Alexey Bochkovskiy, Vladlen Koltun

Changelog

  • [March 2021] Initial release of inference code and models

Setup

  1. Download the model weights and place them in the weights folder:

Monodepth:

Segmentation:

  2. Set up dependencies:

    pip install -r requirements.txt
    

    The code was tested with Python 3.7, PyTorch 1.8.0, OpenCV 4.5.1, and timm 0.4.5
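
If you want to double-check that your environment matches these tested versions, a quick sketch (not part of the repository):

import cv2
import timm
import torch

print("torch :", torch.__version__)
print("opencv:", cv2.__version__)
print("timm  :", timm.__version__)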

Usage

  1. Place one or more input images in the folder input.

  2. Run a monocular depth estimation model:

    python run_monodepth.py
    

    Or run a semantic segmentation model:

    python run_segmentation.py
    
  3. The results are written to the folders output_monodepth and output_semseg, respectively.

Use the flag -t to switch between different models. Possible options are dpt_hybrid (default) and dpt_large.
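
For example, to run the larger model:

    python run_monodepth.py -t dpt_large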

Additional models:

Monodepth models fine-tuned on KITTI (dpt_hybrid_kitti) and NYU Depth v2 (dpt_hybrid_nyu) are also available. Run them with

    python run_monodepth.py -t [dpt_hybrid_kitti|dpt_hybrid_nyu]

Evaluation

Hints on how to evaluate monodepth models can be found here: https://github.com/intel-isl/DPT/blob/main/EVALUATION.md

Citation

Please cite our papers if you use this code or any of the models.

@article{Ranftl2021,
	author    = {Ren\'{e} Ranftl and Alexey Bochkovskiy and Vladlen Koltun},
	title     = {Vision Transformers for Dense Prediction},
	journal   = {ArXiv preprint},
	year      = {2021},
}
@article{Ranftl2020,
	author    = {Ren\'{e} Ranftl and Katrin Lasinger and David Hafner and Konrad Schindler and Vladlen Koltun},
	title     = {Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer},
	journal   = {IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)},
	year      = {2020},
}

Acknowledgements

Our work builds on and uses code from timm and PyTorch-Encoding. We'd like to thank the authors for making these libraries available.

License

MIT License