isl-org / MiDaS

Code for robust monocular depth estimation described in "Ranftl et al., Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer, TPAMI 2022"


Top Related Projects

  • Detectron2 is a platform for object detection, segmentation and other visual recognition tasks.
  • TRI-ML Monocular Depth Estimation Repository
  • [ICCV 2019] Monocular depth estimation from a single image

Quick Overview

MiDaS is an open-source project from Intel Labs for monocular depth estimation, i.e. estimating depth from a single image. It provides state-of-the-art pre-trained models that can be used for computer vision tasks such as 3D reconstruction, augmented reality, and robotics.

Pros

  • High-quality depth estimation from a single image
  • Multiple model variants for different performance-speed trade-offs
  • Pre-trained models available for easy use
  • Supports various input formats and resolutions

Cons

  • Requires significant computational resources for training and inference
  • May struggle with complex scenes or unusual lighting conditions
  • Limited to monocular depth estimation (single image input)
  • Dependency on specific deep learning frameworks

Code Examples

  1. Loading and using a MiDaS model:
import cv2
import torch
import numpy as np
from midas.model_loader import load_model

# Load model (the weights file must first be downloaded to the weights folder)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_type = "dpt_large_384"
model, transform, net_w, net_h = load_model(device, "weights/dpt_large_384.pt", model_type, optimize=False)

# Load image, convert BGR -> RGB and scale to [0, 1] as expected by the transform
img = cv2.cvtColor(cv2.imread("input.jpg"), cv2.COLOR_BGR2RGB) / 255.0
img_input = transform({"image": img})["image"]

# Compute depth and resize the prediction to the original image resolution
with torch.no_grad():
    sample = torch.from_numpy(img_input).to(device).unsqueeze(0)
    prediction = model.forward(sample)
    prediction = torch.nn.functional.interpolate(
        prediction.unsqueeze(1),
        size=img.shape[:2],
        mode="bicubic",
        align_corners=False,
    ).squeeze()

output = prediction.cpu().numpy()
  2. Visualizing the depth map:
import matplotlib.pyplot as plt

plt.imshow(output, cmap='plasma')
plt.colorbar(label='Depth')
plt.title('Depth Map')
plt.show()
  3. Batch processing multiple images:
import glob
import os

os.makedirs("output_folder", exist_ok=True)

for file in glob.glob("input_folder/*.jpg"):
    # Same preprocessing as above: BGR -> RGB, scaled to [0, 1]
    img = cv2.cvtColor(cv2.imread(file), cv2.COLOR_BGR2RGB) / 255.0
    img_input = transform({"image": img})["image"]

    with torch.no_grad():
        sample = torch.from_numpy(img_input).to(device).unsqueeze(0)
        prediction = model.forward(sample)
        prediction = torch.nn.functional.interpolate(
            prediction.unsqueeze(1),
            size=img.shape[:2],
            mode="bicubic",
            align_corners=False,
        ).squeeze()

    # Scale the relative depth to [0, 255] before saving as an 8-bit image
    output = prediction.cpu().numpy()
    output = (255 * (output - output.min()) / (output.max() - output.min() + 1e-8)).astype(np.uint8)
    name = os.path.splitext(os.path.basename(file))[0]
    cv2.imwrite(os.path.join("output_folder", f"{name}_depth.png"), output)

Getting Started

To get started with MiDaS:

  1. Clone the repository:

    git clone https://github.com/isl-org/MiDaS.git
    
  2. Set up dependencies (the repository provides a conda environment file):

    conda env create -f environment.yaml
    conda activate midas-py310

  3. Download the pre-trained weights for the model you want to use and place them in the weights folder (see the Setup section of the README below for the available models).
  4. Use the model as shown in the code examples above.

Competitor Comparisons

Detectron2 is a platform for object detection, segmentation and other visual recognition tasks.

Pros of Detectron2

  • Broader scope: Supports multiple computer vision tasks (object detection, segmentation, etc.)
  • Extensive documentation and community support
  • Modular architecture for easy customization and extension

Cons of Detectron2

  • Steeper learning curve due to its complexity and wide range of features
  • Potentially higher computational requirements for some tasks
  • May be overkill for projects focused solely on depth estimation

Code Comparison

MiDaS (depth estimation):

import torch
from midas.model_loader import load_model

device = torch.device("cuda")
model_type = "dpt_large_384"
model, transform, net_w, net_h = load_model(device, "weights/dpt_large_384.pt", model_type, optimize=False)

# img: an RGB image scaled to [0, 1]
sample = torch.from_numpy(transform({"image": img})["image"]).to(device).unsqueeze(0)
prediction = model(sample)

Detectron2 (object detection):

from detectron2.engine import DefaultPredictor
from detectron2.config import get_cfg

cfg = get_cfg()
cfg.merge_from_file("path/to/config.yaml")
predictor = DefaultPredictor(cfg)

outputs = predictor(image)

TRI-ML Monocular Depth Estimation Repository

Pros of packnet-sfm

  • Self-supervised learning approach, requiring no ground truth depth data for training
  • Capable of estimating absolute scale in monocular depth prediction
  • Includes pose estimation for ego-motion, useful for SLAM applications

Cons of packnet-sfm

  • More complex architecture and training process compared to MiDaS
  • May require more computational resources for training and inference
  • Less generalization to diverse datasets without fine-tuning

Code Comparison

MiDaS:

midas = torch.hub.load("intel-isl/MiDaS", "MiDaS")
midas.eval()
prediction = midas(input_image)

packnet-sfm:

model = PackNet01(config)
model.load_state_dict(torch.load('model.ckpt'))
depth, pose = model(image_sequence)

MiDaS focuses on simplicity and ease of use, with a straightforward inference process. packnet-sfm offers more flexibility and additional features like pose estimation, but requires more setup and configuration.

Both projects aim to solve monocular depth estimation, but packnet-sfm takes a self-supervised approach with additional capabilities, while MiDaS prioritizes simplicity and generalization across diverse datasets.

[ICCV 2019] Monocular depth estimation from a single image

Pros of monodepth2

  • Self-supervised training approach, requiring no ground truth depth data
  • Supports multi-scale depth estimation for improved accuracy
  • Includes pre-trained models for various datasets and architectures

Cons of monodepth2

  • Limited to monocular depth estimation
  • May struggle with complex scenes or unusual camera motions
  • Requires careful hyperparameter tuning for optimal performance

Code Comparison

MiDaS:

midas = torch.hub.load("intel-isl/MiDaS", "MiDaS")
midas.to(device).eval()
prediction = midas(input_image)

monodepth2:

encoder = networks.ResnetEncoder(18, False)
depth_decoder = networks.DepthDecoder(num_ch_enc=encoder.num_ch_enc, scales=range(4))
loaded_dict = torch.load("model_path.pth")
depth_decoder.load_state_dict(loaded_dict)
outputs = depth_decoder(encoder(input_image))

MiDaS offers a simpler API for inference, while monodepth2 provides more flexibility in model architecture and training. MiDaS is designed for robust performance across various datasets, whereas monodepth2 focuses on self-supervised learning from monocular videos.

DROID-SLAM: Deep Visual SLAM for Monocular, Stereo, and RGB-D Cameras

Pros of DROID-SLAM

  • Performs simultaneous localization and mapping (SLAM), providing a more comprehensive 3D reconstruction
  • Utilizes deep learning for feature extraction and matching, potentially improving accuracy in challenging scenarios
  • Offers real-time performance on GPU-equipped systems

Cons of DROID-SLAM

  • Requires more computational resources due to its complex SLAM pipeline
  • May struggle with certain types of scenes or motions where traditional SLAM methods excel
  • Has a steeper learning curve for implementation and fine-tuning

Code Comparison

MiDaS:

midas = torch.hub.load("intel-isl/MiDaS", "MiDaS")
midas.to(device).eval()
prediction = midas(img)

DROID-SLAM:

slam = DROID(args)
for image in images:
    slam.track(image)
poses, points, colors = slam.get_map()

The code snippets show that MiDaS focuses on single-image depth estimation, while DROID-SLAM processes a sequence of images to build a 3D map and estimate camera poses. DROID-SLAM's API is more complex due to its SLAM functionality.


README

Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer

This repository contains code to compute depth from a single image. It accompanies our paper:

Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer
René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, Vladlen Koltun

and our preprint:

Vision Transformers for Dense Prediction
René Ranftl, Alexey Bochkovskiy, Vladlen Koltun

For the latest release MiDaS 3.1, a technical report and video are available.

MiDaS was trained on up to 12 datasets (ReDWeb, DIML, Movies, MegaDepth, WSVD, TartanAir, HRWSI, ApolloScape, BlendedMVS, IRS, KITTI, NYU Depth V2) with multi-objective optimization. The original model that was trained on 5 datasets (MIX 5 in the paper) can be found here. The figure below shows an overview of the different MiDaS models; the bubble size scales with number of parameters.

Setup

  1. Pick one or more models and download the corresponding weights to the weights folder:

  • MiDaS 3.1
  • MiDaS 3.0: Legacy transformer models dpt_large_384 and dpt_hybrid_384
  • MiDaS 2.1: Legacy convolutional models midas_v21_384 and midas_v21_small_256
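
For example, a single weights file can be downloaded directly into the weights folder. The command below is illustrative; the asset URL pattern is an assumption based on the v3.1 release and should be confirmed on the releases page:

    wget -P weights https://github.com/isl-org/MiDaS/releases/download/v3_1/dpt_beit_large_512.pt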

  2. Set up dependencies:

    conda env create -f environment.yaml
    conda activate midas-py310
    

optional

For the Next-ViT model, execute

git submodule add https://github.com/isl-org/Next-ViT midas/external/next_vit

For the OpenVINO model, install

pip install openvino

Usage

  1. Place one or more input images in the folder input.

  2. Run the model with

    python run.py --model_type <model_type> --input_path input --output_path output
    

    where <model_type> is chosen from dpt_beit_large_512, dpt_beit_large_384, dpt_beit_base_384, dpt_swin2_large_384, dpt_swin2_base_384, dpt_swin2_tiny_256, dpt_swin_large_384, dpt_next_vit_large_384, dpt_levit_224, dpt_large_384, dpt_hybrid_384, midas_v21_384, midas_v21_small_256, openvino_midas_v21_small_256.

  3. The resulting depth maps are written to the output folder.

optional

  1. By default, the inference resizes the height of input images to the size of a model to fit into the encoder. This size is given by the numbers in the model names of the accuracy table. Some models do not only support a single inference height but a range of different heights. Feel free to explore different heights by appending the extra command line argument --height. Unsupported height values will throw an error. Note that using this argument may decrease the model accuracy.
  2. By default, the inference keeps the aspect ratio of input images when feeding them into the encoder if this is supported by a model (all models except for Swin, Swin2, LeViT). In order to resize to a square resolution, disregarding the aspect ratio while preserving the height, use the command line argument --square.
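
For example (illustrative invocations; the height value is an assumption and must be one that the chosen model actually supports):

    python run.py --model_type dpt_beit_large_512 --input_path input --output_path output --height 352
    python run.py --model_type dpt_beit_large_512 --input_path input --output_path output --square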

via Camera

If you want the input images to be grabbed from the camera and shown in a window, omit the input and output paths and choose a model type as shown above:

python run.py --model_type <model_type> --side

The argument --side is optional and causes both the input RGB image and the output depth map to be shown side-by-side for comparison.

via Docker

  1. Make sure you have installed Docker and the NVIDIA Docker runtime.

  2. Build the Docker image:

    docker build -t midas .
    
  3. Run inference:

    docker run --rm --gpus all -v $PWD/input:/opt/MiDaS/input -v $PWD/output:/opt/MiDaS/output -v $PWD/weights:/opt/MiDaS/weights midas
    

    This command passes through all of your NVIDIA GPUs to the container, mounts the input and output directories and then runs the inference.

via PyTorch Hub

The pretrained model is also available on PyTorch Hub.
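
A minimal sketch of hub-based inference is shown below. It assumes the hub entry points "DPT_Large", "DPT_Hybrid" and "MiDaS_small" and the hub transforms module with dpt_transform and small_transform attributes; consult the PyTorch Hub page for the authoritative example.

import cv2
import torch

# Pick a hub model: "DPT_Large", "DPT_Hybrid" or "MiDaS_small"
model_type = "DPT_Large"
midas = torch.hub.load("intel-isl/MiDaS", model_type)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
midas.to(device)
midas.eval()

# The matching input transforms are also published on the hub
midas_transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
transform = midas_transforms.dpt_transform if "DPT" in model_type else midas_transforms.small_transform

img = cv2.cvtColor(cv2.imread("input.jpg"), cv2.COLOR_BGR2RGB)
input_batch = transform(img).to(device)

with torch.no_grad():
    prediction = midas(input_batch)
    # Resize the prediction to the original image resolution
    prediction = torch.nn.functional.interpolate(
        prediction.unsqueeze(1), size=img.shape[:2], mode="bicubic", align_corners=False
    ).squeeze()

depth = prediction.cpu().numpy()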

via TensorFlow or ONNX

See README in the tf subdirectory.

Currently only supports MiDaS v2.1.

via Mobile (iOS / Android)

See README in the mobile subdirectory.

via ROS1 (Robot Operating System)

See README in the ros subdirectory.

Currently only supports MiDaS v2.1. DPT-based models to be added.

Accuracy

We provide a zero-shot error $\epsilon_d$ which is evaluated for 6 different datasets (see paper). Lower error values are better. $\color{green}{\textsf{Overall model quality is represented by the improvement}}$ (Imp.) with respect to MiDaS 3.0 DPTL-384. The models are grouped by the height used for inference, whereas the square training resolution is given by the numbers in the model names. The table also shows the number of parameters (in millions) and the frames per second for inference at the training resolution (for GPU RTX 3090):

| MiDaS Model | DIW WHDR | Eth3d AbsRel | Sintel AbsRel | TUM δ1 | KITTI δ1 | NYUv2 δ1 | Imp. (%) | Par. (M) | FPS |
|---|---|---|---|---|---|---|---|---|---|
| Inference height 512 | | | | | | | | | |
| v3.1 BEiT L-512 | 0.1137 | 0.0659 | 0.2366 | 6.13 | 11.56* | 1.86* | 19 | 345 | 5.7 |
| v3.1 BEiT L-512 $\square$ | 0.1121 | 0.0614 | 0.2090 | 6.46 | 5.00* | 1.90* | 34 | 345 | 5.7 |
| Inference height 384 | | | | | | | | | |
| v3.1 BEiT L-512 | 0.1245 | 0.0681 | 0.2176 | 6.13 | 6.28* | 2.16* | 28 | 345 | 12 |
| v3.1 Swin2 L-384 $\square$ | 0.1106 | 0.0732 | 0.2442 | 8.87 | 5.84* | 2.92* | 22 | 213 | 41 |
| v3.1 Swin2 B-384 $\square$ | 0.1095 | 0.0790 | 0.2404 | 8.93 | 5.97* | 3.28* | 22 | 102 | 39 |
| v3.1 Swin L-384 $\square$ | 0.1126 | 0.0853 | 0.2428 | 8.74 | 6.60* | 3.34* | 17 | 213 | 49 |
| v3.1 BEiT L-384 | 0.1239 | 0.0667 | 0.2545 | 7.17 | 9.84* | 2.21* | 17 | 344 | 13 |
| v3.1 Next-ViT L-384 | 0.1031 | 0.0954 | 0.2295 | 9.21 | 6.89* | 3.47* | 16 | 72 | 30 |
| v3.1 BEiT B-384 | 0.1159 | 0.0967 | 0.2901 | 9.88 | 26.60* | 3.91* | -31 | 112 | 31 |
| v3.0 DPT L-384 | 0.1082 | 0.0888 | 0.2697 | 9.97 | 8.46 | 8.32 | 0 | 344 | 61 |
| v3.0 DPT H-384 | 0.1106 | 0.0934 | 0.2741 | 10.89 | 11.56 | 8.69 | -10 | 123 | 50 |
| v2.1 Large 384 | 0.1295 | 0.1155 | 0.3285 | 12.51 | 16.08 | 8.71 | -32 | 105 | 47 |
| Inference height 256 | | | | | | | | | |
| v3.1 Swin2 T-256 $\square$ | 0.1211 | 0.1106 | 0.2868 | 13.43 | 10.13* | 5.55* | -11 | 42 | 64 |
| v2.1 Small 256 | 0.1344 | 0.1344 | 0.3370 | 14.53 | 29.27 | 13.43 | -76 | 21 | 90 |
| Inference height 224 | | | | | | | | | |
| v3.1 LeViT 224 $\square$ | 0.1314 | 0.1206 | 0.3148 | 18.21 | 15.27* | 8.64* | -40 | 51 | 73 |

* No zero-shot error, because models are also trained on KITTI and NYU Depth V2
$\square$ Validation performed at square resolution, either because the transformer encoder backbone of a model does not support non-square resolutions (Swin, Swin2, LeViT) or for comparison with these models. All other validations keep the aspect ratio. A difference in resolution limits the comparability of the zero-shot error and the improvement, because these quantities are averages over the pixels of an image and do not take into account the advantage of more details due to a higher resolution.
Best values per column and same validation height in bold

Improvement

The improvement in the above table is defined as the relative zero-shot error with respect to MiDaS v3.0 DPTL-384 and averaging over the datasets. So, if $\epsilon_d$ is the zero-shot error for dataset $d$, then the $\color{green}{\textsf{improvement}}$ is given by $100(1-(1/6)\sum_d\epsilon_d/\epsilon_{d,\rm{DPT_{L-384}}})$%.
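
As a quick check, the formula can be evaluated with the height-512 "v3.1 BEiT L-512" row and the "v3.0 DPT L-384" reference row, using the values copied from the table above (including the starred KITTI and NYUv2 entries):

# Zero-shot errors in table order: DIW, Eth3d, Sintel, TUM, KITTI, NYUv2
beit_l_512 = [0.1137, 0.0659, 0.2366, 6.13, 11.56, 1.86]   # v3.1 BEiT L-512, height 512
dpt_l_384  = [0.1082, 0.0888, 0.2697, 9.97, 8.46, 8.32]    # v3.0 DPT L-384 reference

improvement = 100 * (1 - sum(e / r for e, r in zip(beit_l_512, dpt_l_384)) / len(dpt_l_384))
print(f"{improvement:.0f}%")  # ~19%, matching the Imp. column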

Note that the improvements of 10% for MiDaS v2.0 → v2.1 and 21% for MiDaS v2.1 → v3.0 are not visible from the improvement column (Imp.) in the table but would require an evaluation with respect to MiDaS v2.1 Large384 and v2.0 Large384 respectively instead of v3.0 DPTL-384.

Depth map comparison

Zoom in for better visibility

Speed on Camera Feed

Test configuration

  • Windows 10
  • 11th Gen Intel Core i7-1185G7 3.00GHz
  • 16GB RAM
  • Camera resolution 640x480
  • openvino_midas_v21_small_256

Speed: 22 FPS

Applications

MiDaS is used in the following other projects from Intel Labs:

  • ZoeDepth (code available here): MiDaS computes the relative depth map given an image. For metric depth estimation, ZoeDepth can be used, which combines MiDaS with a metric depth binning module appended to the decoder.
  • LDM3D (Hugging Face model available here): LDM3D is an extension of vanilla stable diffusion designed to generate joint image and depth data from a text prompt. The depth maps used for supervision when training LDM3D have been computed using MiDaS.
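
Returning to the ZoeDepth item above, metric depth via torch.hub typically looks like the sketch below; the entry point name ("ZoeD_N") and the infer_pil call are assumptions based on the ZoeDepth repository and should be checked there.

import torch
from PIL import Image

# Assumed ZoeDepth hub entry point; see the ZoeDepth repository for exact usage
zoe = torch.hub.load("isl-org/ZoeDepth", "ZoeD_N", pretrained=True)
zoe = zoe.to("cuda" if torch.cuda.is_available() else "cpu").eval()

image = Image.open("input.jpg").convert("RGB")
metric_depth = zoe.infer_pil(image)  # depth in meters, unlike MiDaS' relative depth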

Changelog

Citation

Please cite our paper if you use this code or any of the models:

@ARTICLE {Ranftl2022,
    author  = "Ren\'{e} Ranftl and Katrin Lasinger and David Hafner and Konrad Schindler and Vladlen Koltun",
    title   = "Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-Shot Cross-Dataset Transfer",
    journal = "IEEE Transactions on Pattern Analysis and Machine Intelligence",
    year    = "2022",
    volume  = "44",
    number  = "3"
}

If you use a DPT-based model, please also cite:

@article{Ranftl2021,
	author    = {Ren\'{e} Ranftl and Alexey Bochkovskiy and Vladlen Koltun},
	title     = {Vision Transformers for Dense Prediction},
	journal   = {ICCV},
	year      = {2021},
}

Please cite the technical report for MiDaS 3.1 models:

@article{birkl2023midas,
      title={MiDaS v3.1 -- A Model Zoo for Robust Monocular Relative Depth Estimation},
      author={Reiner Birkl and Diana Wofk and Matthias M{\"u}ller},
      journal={arXiv preprint arXiv:2307.14460},
      year={2023}
}

For ZoeDepth, please use

@article{bhat2023zoedepth,
  title={Zoedepth: Zero-shot transfer by combining relative and metric depth},
  author={Bhat, Shariq Farooq and Birkl, Reiner and Wofk, Diana and Wonka, Peter and M{\"u}ller, Matthias},
  journal={arXiv preprint arXiv:2302.12288},
  year={2023}
}

and for LDM3D

@article{stan2023ldm3d,
  title={LDM3D: Latent Diffusion Model for 3D},
  author={Stan, Gabriela Ben Melech and Wofk, Diana and Fox, Scottie and Redden, Alex and Saxton, Will and Yu, Jean and Aflalo, Estelle and Tseng, Shao-Yen and Nonato, Fabio and Muller, Matthias and others},
  journal={arXiv preprint arXiv:2305.10853},
  year={2023}
}

Acknowledgements

Our work builds on and uses code from timm and Next-ViT. We'd like to thank the authors for making these libraries available.

License

MIT License