
isl-org / DPT

Dense Prediction Transformers

Top Related Projects

  • MiDaS: Code for robust monocular depth estimation described in "Ranftl et al., Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer", TPAMI 2022
  • Detectron2: A platform for object detection, segmentation and other visual recognition tasks
  • ZoeDepth: Metric depth estimation from a single image

Quick Overview

DPT (Dense Prediction Transformers) is a GitHub repository by Intel ISL that applies vision transformers to dense prediction tasks. It focuses on monocular depth estimation and semantic segmentation, offering state-of-the-art performance on various benchmarks.

Pros

  • High-performance models for dense prediction tasks
  • Pretrained weights available for quick implementation
  • Supports multiple vision tasks (depth estimation, segmentation)
  • Well-documented with clear instructions for usage

Cons

  • Requires significant computational resources for training
  • Limited to specific vision tasks, not a general-purpose library
  • Dependency on specific versions of PyTorch and other libraries
  • Relatively complex architecture, which may be challenging for beginners

Code Examples

  1. Loading a pretrained DPT model for depth estimation:
import torch
from dpt.models import DPTDepthModel

model = DPTDepthModel(
    path="weights/dpt_large-midas-2f21e586.pt",
    backbone="vitl16_384",
    non_negative=True,
    enable_attention_hooks=False,
)
model.eval()

# Move the model to GPU if available; `device` is reused in the inference example below
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
  2. Preprocessing an input image for the DPT model:
import cv2
from torchvision.transforms import Compose
from dpt.transforms import Resize, NormalizeImage, PrepareForNet

# Read an RGB image and scale it to [0, 1] (the path is an example)
img = cv2.cvtColor(cv2.imread("input/example.jpg"), cv2.COLOR_BGR2RGB) / 255.0

transform = Compose([
    Resize(
        384,
        384,
        resize_target=None,
        keep_aspect_ratio=True,
        ensure_multiple_of=32,
        resize_method="minimal",
        image_interpolation_method=cv2.INTER_CUBIC,
    ),
    NormalizeImage(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
    PrepareForNet(),
])

img_input = transform({"image": img})["image"]
  3. Running inference with the DPT model:
with torch.no_grad():
    sample = torch.from_numpy(img_input).to(device).unsqueeze(0)
    prediction = model.forward(sample)
    # Resize the prediction back to the original image resolution
    prediction = torch.nn.functional.interpolate(
        prediction.unsqueeze(1),
        size=img.shape[:2],
        mode="bicubic",
        align_corners=False,
    ).squeeze()
    output = prediction.cpu().numpy()
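
The resulting output is a relative (inverse) depth map. As a minimal sketch for inspecting it, you can normalize the values and write a 16-bit PNG; the normalization and output path below are illustrative and not part of the repository's own I/O utilities:

import cv2
import numpy as np

# Normalize the relative depth prediction to [0, 1] for visualization
depth_min, depth_max = output.min(), output.max()
if depth_max - depth_min > 1e-8:
    depth_vis = (output - depth_min) / (depth_max - depth_min)
else:
    depth_vis = np.zeros_like(output)

# Save as a 16-bit PNG (OpenCV supports uint16 PNG output)
cv2.imwrite("output_monodepth/example.png", (depth_vis * 65535).astype("uint16"))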

Getting Started

To get started with DPT:

  1. Clone the repository:

    git clone https://github.com/isl-org/DPT.git
    cd DPT
    
  2. Install dependencies:

    pip install -r requirements.txt
    
  3. Download pretrained weights from the provided links in the repository.

  4. Use the provided scripts or integrate the models into your own code as shown in the code examples above.

Competitor Comparisons

MiDaS

Code for robust monocular depth estimation described in "Ranftl et al., Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer", TPAMI 2022

Pros of MiDaS

  • More established project with a longer history and larger user base
  • Supports a wider range of pre-trained models for different use cases
  • Better documentation and examples for ease of use

Cons of MiDaS

  • Slightly lower accuracy compared to DPT for certain tasks
  • Less focus on real-time performance optimization
  • Fewer options for fine-tuning on custom datasets

Code Comparison

MiDaS:

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_type = "DPT_Large"
midas = torch.hub.load("intel-isl/MiDaS", model_type)
midas.to(device).eval()

DPT:

model = DPTDepthModel(
    path=path_to_model,
    backbone="vitb_rn50_384",
    non_negative=True,
)
model.eval()

Both repositories focus on monocular depth estimation, with DPT being a more recent advancement building upon MiDaS. DPT offers improved accuracy and performance for certain tasks, while MiDaS provides a broader range of pre-trained models and better documentation. DPT is more suitable for users seeking state-of-the-art results, while MiDaS may be preferable for those requiring a more established and well-documented solution with a variety of model options.
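
For context, MiDaS also distributes its matching preprocessing transforms through torch.hub, so a complete inference pass can stay quite short. A minimal sketch following the MiDaS README usage (the image path is an illustrative example):

import cv2
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the model and the transforms that match it
midas = torch.hub.load("intel-isl/MiDaS", "DPT_Large").to(device).eval()
midas_transforms = torch.hub.load("intel-isl/MiDaS", "transforms")

img = cv2.cvtColor(cv2.imread("input/example.jpg"), cv2.COLOR_BGR2RGB)
batch = midas_transforms.dpt_transform(img).to(device)

with torch.no_grad():
    prediction = midas(batch)
depth = prediction.squeeze().cpu().numpy()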

Detectron2

Detectron2 is a platform for object detection, segmentation and other visual recognition tasks.

Pros of Detectron2

  • More comprehensive and feature-rich object detection framework
  • Extensive documentation and community support
  • Supports a wider range of computer vision tasks

Cons of Detectron2

  • Steeper learning curve due to its complexity
  • Heavier resource requirements for training and inference

Code Comparison

DPT (Dense Prediction Transformers):

from dpt.models import DPTDepthModel

model = DPTDepthModel(path="weights/dpt_large-midas-2f21e586.pt", backbone="vitl16_384")
depth_map = model(input_tensor)  # preprocessed (1, 3, H, W) batch

Detectron2:

from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
predictor = DefaultPredictor(cfg)
outputs = predictor(image)

Key Differences

  • DPT focuses on depth estimation and monocular tasks
  • Detectron2 offers a broader range of object detection and segmentation capabilities
  • DPT uses transformer-based architecture, while Detectron2 primarily uses CNN-based models
  • Detectron2 provides more flexibility in model configuration and customization
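
To illustrate the last point, most Detectron2 behaviour is driven through its config object. A small sketch of common tweaks (the threshold, device, and class count below are arbitrary example values):

from detectron2 import model_zoo
from detectron2.config import get_cfg

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))

# Typical adjustments: confidence threshold, device, and class count for a custom dataset
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5
cfg.MODEL.DEVICE = "cpu"
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 3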

DROID-SLAM

Pros of DROID-SLAM

  • Focuses on real-time dense SLAM, providing more comprehensive 3D reconstruction
  • Utilizes deep learning for feature extraction and matching, potentially improving accuracy in challenging environments
  • Includes loop closure detection for improved mapping consistency

Cons of DROID-SLAM

  • May require more computational resources due to its dense reconstruction approach
  • Potentially less suitable for scenarios where only depth estimation is needed
  • Could have a steeper learning curve for implementation and customization

Code Comparison

DROID-SLAM:

class DROID(nn.Module):
    def __init__(self, args):
        super(DROID, self).__init__()
        self.update_op = DroidUpdateOp.apply
        self.nets = nn.ModuleDict()
        self.nets['update'] = UpdateModule(args)

DPT:

class DPTDepthModel(DPT):
    def __init__(self, path=None, non_negative=True, **kwargs):
        features = kwargs["features"] if "features" in kwargs else 256
        # ... construction of the depth-regression head omitted ...
        super().__init__(head, **kwargs)
        if path is not None:
            self.load(path)

Both repositories focus on different aspects of 3D vision. DROID-SLAM is geared towards real-time dense SLAM, while DPT specializes in monocular depth estimation. DROID-SLAM may offer more comprehensive 3D reconstruction but could be more resource-intensive. DPT, on the other hand, might be more suitable for applications that primarily require depth information without full SLAM capabilities.

ZoeDepth

Metric depth estimation from a single image

Pros of ZoeDepth

  • Improved performance and accuracy in depth estimation
  • Faster inference time, especially on mobile devices
  • More lightweight model architecture

Cons of ZoeDepth

  • Less extensive documentation compared to DPT
  • Fewer pre-trained models available
  • Limited support for older hardware

Code Comparison

ZoeDepth:

import torch

# ZoeD_NK is the variant trained on NYU Depth v2 + KITTI
model = torch.hub.load("isl-org/ZoeDepth", "ZoeD_NK", pretrained=True)
depth_map = model.infer_pil(image)  # PIL image in, metric depth map out

DPT:

from dpt.models import DPTDepthModel

model = DPTDepthModel(path="weights/dpt_large-midas-2f21e586.pt", backbone="vitl16_384")
depth_map = model(input_tensor)  # preprocessed (1, 3, H, W) batch

Both repositories focus on depth estimation, but ZoeDepth offers improved performance and efficiency, particularly for mobile applications. DPT provides more comprehensive documentation and a wider range of pre-trained models. The code comparison shows that ZoeDepth has a slightly simpler API for model initialization and inference, while DPT follows a more traditional PyTorch approach. Overall, ZoeDepth is better suited for lightweight, mobile-friendly applications, while DPT may be preferred for more extensive research and experimentation.

README

Vision Transformers for Dense Prediction

This repository contains code and models for our paper:

Vision Transformers for Dense Prediction
René Ranftl, Alexey Bochkovskiy, Vladlen Koltun

Changelog

  • [March 2021] Initial release of inference code and models

Setup

  1. Download the model weights and place them in the weights folder:

Monodepth:

Segmentation:

  2. Set up dependencies:

    pip install -r requirements.txt
    

    The code was tested with Python 3.7, PyTorch 1.8.0, OpenCV 4.5.1, and timm 0.4.5
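
If you want to double-check that your environment matches these tested versions, a quick sketch (not part of the repository):

import cv2
import timm
import torch

print("torch :", torch.__version__)
print("opencv:", cv2.__version__)
print("timm  :", timm.__version__)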

Usage

  1. Place one or more input images in the folder input.

  2. Run a monocular depth estimation model:

    python run_monodepth.py
    

    Or run a semantic segmentation model:

    python run_segmentation.py
    
  3. The results are written to the folders output_monodepth and output_semseg, respectively.

Use the flag -t to switch between different models. Possible options are dpt_hybrid (default) and dpt_large.
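
For example, to run the larger model:

    python run_monodepth.py -t dpt_large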

Additional models:

Monodepth models fine-tuned on KITTI (dpt_hybrid_kitti) and NYU Depth v2 (dpt_hybrid_nyu) are also available. Run them with

    python run_monodepth.py -t [dpt_hybrid_kitti|dpt_hybrid_nyu]

Evaluation

Hints on how to evaluate monodepth models can be found here: https://github.com/intel-isl/DPT/blob/main/EVALUATION.md

Citation

Please cite our papers if you use this code or any of the models.

@article{Ranftl2021,
	author    = {Ren\'{e} Ranftl and Alexey Bochkovskiy and Vladlen Koltun},
	title     = {Vision Transformers for Dense Prediction},
	journal   = {ArXiv preprint},
	year      = {2021},
}
@article{Ranftl2020,
	author    = {Ren\'{e} Ranftl and Katrin Lasinger and David Hafner and Konrad Schindler and Vladlen Koltun},
	title     = {Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer},
	journal   = {IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)},
	year      = {2020},
}

Acknowledgements

Our work builds on and uses code from timm and PyTorch-Encoding. We'd like to thank the authors for making these libraries available.

License

MIT License