nianticlabs / monodepth2

[ICCV 2019] Monocular depth estimation from a single image

Top Related Projects

  • MiDaS – Code for robust monocular depth estimation described in "Ranftl et al., Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer, TPAMI 2022"
  • DPT – Dense Prediction Transformers
  • PackNet-SFM – TRI-ML Monocular Depth Estimation Repository

Quick Overview

Monodepth2 is an open-source project for self-supervised monocular depth estimation. It provides a framework for training and testing deep learning models that can predict depth from a single image, without the need for ground truth depth data during training.

Pros

  • Self-supervised learning approach, eliminating the need for expensive ground truth depth data
  • State-of-the-art performance on various benchmarks for monocular depth estimation
  • Flexible architecture that supports multiple input resolutions and different backbone networks
  • Includes pre-trained models for quick deployment and testing

Cons

  • Requires significant computational resources for training, especially on high-resolution images
  • Performance can be affected by challenging lighting conditions or complex scenes
  • May struggle with objects or scenes not well-represented in the training data
  • Limited to estimating relative depth, not absolute depth measurements

Code Examples

  1. Loading a pre-trained model and predicting disparity for a single image (this assumes you are running from the root of the monodepth2 repo, with a pretrained model unzipped into models/):

import torch
from PIL import Image
from torchvision import transforms

import networks
from layers import disp_to_depth

model_path = "models/mono+stereo_640x192"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

encoder = networks.ResnetEncoder(18, False)
depth_decoder = networks.DepthDecoder(num_ch_enc=encoder.num_ch_enc, scales=range(4))

# the encoder checkpoint also stores the training resolution, so filter out the extra keys
loaded_dict_enc = torch.load(model_path + "/encoder.pth", map_location=device)
feed_height, feed_width = loaded_dict_enc["height"], loaded_dict_enc["width"]
encoder.load_state_dict({k: v for k, v in loaded_dict_enc.items() if k in encoder.state_dict()})
depth_decoder.load_state_dict(torch.load(model_path + "/depth.pth", map_location=device))

encoder.to(device).eval()
depth_decoder.to(device).eval()

# load an image and resize it to the resolution the model was trained at
input_image = Image.open("assets/test_image.jpg").convert("RGB")
input_image = input_image.resize((feed_width, feed_height), Image.LANCZOS)
input_image = transforms.ToTensor()(input_image).unsqueeze(0).to(device)

# predict disparity for the single image
with torch.no_grad():
    features = encoder(input_image)
    outputs = depth_decoder(features)

  2. Visualizing the depth prediction from the previous example:

import matplotlib.pyplot as plt

def visualize_depth(depth):
    plt.imshow(depth, cmap='plasma')
    plt.colorbar(label='Depth')
    plt.title('Depth Prediction')
    plt.show()

# convert the sigmoid disparity output to depth; the map is at the network's input resolution
disp = outputs[("disp", 0)]
_, depth = disp_to_depth(disp, 0.1, 100)
depth_np = depth.squeeze().cpu().numpy()

visualize_depth(depth_np)

  3. Training a new model. The repository's Trainer expects the argparse options produced by MonodepthOptions, so training is normally launched through train.py:

from options import MonodepthOptions
from trainer import Trainer

# equivalent command line:
#   python train.py --model_name my_model --data_path /path/to/kitti/dataset \
#     --log_dir /path/to/log/directory --split eigen_zhou --height 192 --width 640 \
#     --batch_size 12 --num_epochs 20 --learning_rate 1e-4
opts = MonodepthOptions().parse()  # parses the flags above from sys.argv
trainer = Trainer(opts)
trainer.train()

Getting Started

  1. Clone the repository:

    git clone https://github.com/nianticlabs/monodepth2.git
    cd monodepth2
    
  2. Install dependencies:

    pip install -r requirements.txt
    
  3. Download pre-trained models:

    wget https://storage.googleapis.com/niantic-lon-static/research/monodepth2/mono+stereo_640x192.zip
    unzip mono+stereo_640x192.zip
    
  4. Run inference on a single image:

    python test_simple.py --image_path assets/test_image.jpg --model_name mono+stereo_640x192
    

Competitor Comparisons

MiDaS

Code for robust monocular depth estimation described in "Ranftl et al., Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer, TPAMI 2022"

Pros of MiDaS

  • Supports a wider range of input resolutions and aspect ratios
  • Offers pre-trained models for various architectures (e.g., ResNet, EfficientNet)
  • Provides better generalization across different datasets and scenes

Cons of MiDaS

  • Slower inference time compared to Monodepth2
  • Requires more computational resources for training and inference
  • Less focus on real-time applications

Code Comparison

MiDaS:

model_type = "DPT_Large"
midas = torch.hub.load("intel-isl/MiDaS", model_type)
midas.to(device)
midas.eval()

Monodepth2:

encoder = networks.ResnetEncoder(18, False)
depth_decoder = networks.DepthDecoder(num_ch_enc=encoder.num_ch_enc, scales=range(4))
loaded_dict = torch.load("model_path.pth")

MiDaS offers a simpler model loading process through PyTorch Hub, while Monodepth2 requires manual instantiation of encoder and decoder components. MiDaS provides more flexibility in model selection, whereas Monodepth2 focuses on a specific architecture. Both projects aim to estimate depth from single images, but MiDaS emphasizes robustness across diverse scenes, while Monodepth2 targets efficient, real-time performance.

DPT

Dense Prediction Transformers

Pros of DPT

  • Utilizes a more advanced transformer-based architecture, potentially offering better performance on complex scenes
  • Supports multiple vision tasks beyond depth estimation, including semantic segmentation and surface normal estimation
  • Provides pre-trained models for various datasets and tasks, enabling easier adaptation to different use cases

Cons of DPT

  • Generally requires more computational resources due to its larger model size and transformer architecture
  • May have slower inference times compared to Monodepth2, especially on less powerful hardware
  • Has a more complex codebase and architecture, which could be harder to understand and modify for some users

Code Comparison

Monodepth2 (model prediction):

features = self.models["encoder"](input_image)
outputs = self.models["depth"](features)

DPT (model prediction):

features = self.pretrained.forward_features(x)
features = self.scratch.forward_features(features)
out = self.scratch.output_conv(features)

Both repositories provide depth estimation functionality, but DPT offers a more versatile and potentially more powerful approach at the cost of increased complexity and resource requirements. Monodepth2 may be more suitable for simpler tasks or resource-constrained environments, while DPT could be preferred for more advanced applications or when multiple vision tasks are needed.

PackNet-SFM

TRI-ML Monocular Depth Estimation Repository

Pros of PackNet-SFM

  • Improved depth estimation accuracy, especially in challenging scenarios
  • Better generalization to unseen environments
  • Incorporates 3D geometry constraints for more robust predictions

Cons of PackNet-SFM

  • Higher computational requirements due to more complex architecture
  • Longer training time compared to Monodepth2
  • May require more data for optimal performance

Code Comparison

PackNet-SFM:

class PackNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = Encoder()
        self.decoder = Decoder()
        self.packing = PackingLayer()

Monodepth2:

class DepthDecoder(nn.Module):
    def __init__(self, num_ch_enc, scales=range(4)):
        super().__init__()
        self.num_output_channels = 1
        self.use_skips = True
        self.upsample_mode = 'nearest'

PackNet-SFM uses a more sophisticated architecture with packing and unpacking layers, while Monodepth2 employs a simpler encoder-decoder structure. PackNet-SFM's approach allows for better preservation of fine details and improved depth estimation, but at the cost of increased computational complexity.

Both projects are actively maintained and offer pretrained models, but PackNet-SFM may be more suitable for applications requiring higher accuracy, while Monodepth2 might be preferred for scenarios with limited computational resources or faster inference requirements.

README

Monodepth2

This is the reference PyTorch implementation for training and testing depth estimation models using the method described in

Digging into Self-Supervised Monocular Depth Prediction

Clément Godard, Oisin Mac Aodha, Michael Firman and Gabriel J. Brostow

ICCV 2019 (arXiv pdf)

[Example input and output GIF]

This code is for non-commercial use; please see the license file for terms.

If you find our work useful in your research please consider citing our paper:

@article{monodepth2,
  title     = {Digging into Self-Supervised Monocular Depth Prediction},
  author    = {Cl{\'{e}}ment Godard and
               Oisin {Mac Aodha} and
               Michael Firman and
               Gabriel J. Brostow},
  booktitle = {The International Conference on Computer Vision (ICCV)},
  month     = {October},
  year      = {2019}
}

⚙️ Setup

Assuming a fresh Anaconda distribution, you can install the dependencies with:

conda install pytorch=0.4.1 torchvision=0.2.1 -c pytorch
pip install tensorboardX==1.4
conda install opencv=3.3.1   # just needed for evaluation

We ran our experiments with PyTorch 0.4.1, CUDA 9.1, Python 3.6.6, and Ubuntu 18.04. We have also successfully trained models with PyTorch 1.0, and our code is compatible with Python 2.7. You may have issues installing OpenCV version 3.3.1 if you use Python 3.7; we recommend creating a Python 3.6.6 virtual environment with conda create -n monodepth2 python=3.6.6 anaconda.

🖼️ Prediction for a single image

You can predict scaled disparity for a single image with:

python test_simple.py --image_path assets/test_image.jpg --model_name mono+stereo_640x192

or, if you are using a stereo-trained model, you can estimate metric depth with

python test_simple.py --image_path assets/test_image.jpg --model_name mono+stereo_640x192 --pred_metric_depth

On its first run either of these commands will download the mono+stereo_640x192 pretrained model (99MB) into the models/ folder. We provide the following options for --model_name:

| --model_name              | Training modality | ImageNet pretrained? | Model resolution | KITTI abs. rel. error | delta < 1.25 |
|---------------------------|-------------------|----------------------|------------------|-----------------------|--------------|
| mono_640x192              | Mono              | Yes                  | 640 x 192        | 0.115                 | 0.877        |
| stereo_640x192            | Stereo            | Yes                  | 640 x 192        | 0.109                 | 0.864        |
| mono+stereo_640x192       | Mono + Stereo     | Yes                  | 640 x 192        | 0.106                 | 0.874        |
| mono_1024x320             | Mono              | Yes                  | 1024 x 320       | 0.115                 | 0.879        |
| stereo_1024x320           | Stereo            | Yes                  | 1024 x 320       | 0.107                 | 0.874        |
| mono+stereo_1024x320      | Mono + Stereo     | Yes                  | 1024 x 320       | 0.106                 | 0.876        |
| mono_no_pt_640x192        | Mono              | No                   | 640 x 192        | 0.132                 | 0.845        |
| stereo_no_pt_640x192      | Stereo            | No                   | 640 x 192        | 0.130                 | 0.831        |
| mono+stereo_no_pt_640x192 | Mono + Stereo     | No                   | 640 x 192        | 0.127                 | 0.836        |

You can also download models trained on the odometry split with monocular and mono+stereo training modalities.

Finally, we provide ResNet-50 depth estimation models, both trained with ImageNet pretrained weights and trained from scratch. Make sure to set --num_layers 50 if using these.
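
For example, to evaluate one of the ResNet-50 models (the weights folder path below is a placeholder):

python evaluate_depth.py --load_weights_folder /path/to/resnet50_weights --eval_mono --num_layers 50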

💾 KITTI training data

You can download the entire raw KITTI dataset by running:

wget -i splits/kitti_archives_to_download.txt -P kitti_data/

Then unzip with

cd kitti_data
unzip "*.zip"
cd ..

Warning: it weighs about 175GB, so make sure you have enough space to unzip too!

Our default settings expect that you have converted the png images to jpeg with this command, which also deletes the raw KITTI .png files:

find kitti_data/ -name '*.png' | parallel 'convert -quality 92 -sampling-factor 2x2,1x1,1x1 {.}.png {.}.jpg && rm {}'

or you can skip this conversion step and train from raw png files by adding the flag --png when training, at the expense of slower load times.

The above conversion command creates images which match our experiments, where KITTI .png images were converted to .jpg on Ubuntu 16.04 with default chroma subsampling 2x2,1x1,1x1. We found that Ubuntu 18.04 defaults to 2x2,2x2,2x2, which gives different results, hence the explicit parameter in the conversion command.

You can also place the KITTI dataset wherever you like and point towards it with the --data_path flag during training and evaluation.
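
For example (the dataset location below is a placeholder):

python train.py --model_name mono_model --data_path /path/to/my/kitti_data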

Splits

The train/test/validation splits are defined in the splits/ folder. By default, the code will train a depth model using Zhou's subset of the standard Eigen split of KITTI, which is designed for monocular training. You can also train a model using the new benchmark split or the odometry split by setting the --split flag.
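
For example, to train on the odometry split (using the same flags referenced in the odometry evaluation section below):

python train.py --model_name odom_model --split odom --dataset kitti_odom --data_path /path/to/kitti/odometry/dataset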

Custom dataset

You can train on a custom monocular or stereo dataset by writing a new dataloader class which inherits from MonoDataset – see the KITTIDataset class in datasets/kitti_dataset.py for an example.
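
A minimal sketch of such a dataloader is shown below; the folder layout, file naming, and intrinsics are illustrative assumptions, so see KITTIDataset for the real conventions:

import os

import numpy as np
import PIL.Image as pil

from .mono_dataset import MonoDataset  # when placed inside the datasets/ package


class MyDataset(MonoDataset):
    def __init__(self, *args, **kwargs):
        super(MyDataset, self).__init__(*args, **kwargs)
        # intrinsics normalised by image width/height, following the KITTIDataset convention
        self.K = np.array([[0.58, 0, 0.5, 0],
                           [0, 1.92, 0.5, 0],
                           [0, 0, 1, 0],
                           [0, 0, 0, 1]], dtype=np.float32)
        self.full_res_shape = (1242, 375)

    def check_depth(self):
        # no ground-truth depth for a purely self-supervised dataset
        return False

    def get_color(self, folder, frame_index, side, do_flip):
        # assumes images live at <data_path>/<folder>/<frame_index>.jpg; 'side' is unused here
        path = os.path.join(self.data_path, folder,
                            "{:010d}{}".format(frame_index, self.img_ext))
        color = self.loader(path)
        if do_flip:
            color = color.transpose(pil.FLIP_LEFT_RIGHT)
        return color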

⏳ Training

By default models and tensorboard event files are saved to ~/tmp/<model_name>. This can be changed with the --log_dir flag.
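
For example, to log to a custom directory and, if TensorBoard is installed, monitor training (paths are placeholders):

python train.py --model_name mono_model --log_dir /path/to/logs
tensorboard --logdir /path/to/logs/mono_model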

Monocular training:

python train.py --model_name mono_model

Stereo training:

Our code defaults to using Zhou's subsampled Eigen training data. For stereo-only training we have to specify that we want to use the full Eigen training set – see paper for details.

python train.py --model_name stereo_model \
  --frame_ids 0 --use_stereo --split eigen_full

Monocular + stereo training:

python train.py --model_name mono+stereo_model \
  --frame_ids 0 -1 1 --use_stereo

GPUs

The code can only be run on a single GPU. You can specify which GPU to use with the CUDA_VISIBLE_DEVICES environment variable:

CUDA_VISIBLE_DEVICES=2 python train.py --model_name mono_model

All our experiments were performed on a single NVIDIA Titan Xp.

| Training modality | Approximate GPU memory | Approximate training time |
|-------------------|------------------------|---------------------------|
| Mono              | 9GB                    | 12 hours                  |
| Stereo            | 6GB                    | 8 hours                   |
| Mono + Stereo     | 11GB                   | 15 hours                  |

💽 Finetuning a pretrained model

Add the following to the training command to load an existing model for finetuning:

python train.py --model_name finetuned_mono --load_weights_folder ~/tmp/mono_model/models/weights_19

🔧 Other training options

Run python train.py -h (or look at options.py) to see the range of other training options, such as learning rates and ablation settings.
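
For example, two of the available flags (the values here are illustrative, not recommendations):

python train.py --model_name mono_model --learning_rate 5e-5 --disparity_smoothness 1e-3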

📊 KITTI evaluation

To prepare the ground truth depth maps run:

python export_gt_depth.py --data_path kitti_data --split eigen
python export_gt_depth.py --data_path kitti_data --split eigen_benchmark

...assuming that you have placed the KITTI dataset in the default location of ./kitti_data/.

The following example command evaluates the epoch 19 weights of a model named mono_model:

python evaluate_depth.py --load_weights_folder ~/tmp/mono_model/models/weights_19/ --eval_mono

For stereo models, you must use the --eval_stereo flag (see note below):

python evaluate_depth.py --load_weights_folder ~/tmp/stereo_model/models/weights_19/ --eval_stereo

If you train your own model with our code, you are likely to see slight differences from the published results due to randomization in weight initialization and data loading.

An additional parameter --eval_split can be set. The three different values possible for eval_split are explained here:

| --eval_split    | Test set size | For models trained with...                         | Description |
|-----------------|---------------|----------------------------------------------------|-------------|
| eigen           | 697           | --split eigen_zhou (default) or --split eigen_full | The standard Eigen test files |
| eigen_benchmark | 652           | --split eigen_zhou (default) or --split eigen_full | Evaluate with the improved ground truth from the new KITTI depth benchmark |
| benchmark       | 500           | --split benchmark                                  | The new KITTI depth benchmark test files |

Because no ground truth is available for the new KITTI depth benchmark, no scores will be reported when --eval_split benchmark is set. Instead, a set of .png images will be saved to disk ready for upload to the evaluation server.

External disparities evaluation

Finally, you can also use evaluate_depth.py to evaluate raw disparities (or inverse depth) from other methods by using the --ext_disp_to_eval flag:

python evaluate_depth.py --ext_disp_to_eval ~/other_method_disp.npy

📷📷 Note on stereo evaluation

Our stereo models are trained with an effective baseline of 0.1 units, while the actual KITTI stereo rig has a baseline of 0.54m. This means a scaling of 5.4 must be applied for evaluation. In addition, for models trained with stereo supervision we disable median scaling. Setting the --eval_stereo flag when evaluating will automatically disable median scaling and scale predicted depths by 5.4.
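
Concretely, the flag implies roughly the following scaling logic (a simplified sketch; apart from STEREO_SCALE_FACTOR, the names below are assumptions for illustration):

import numpy as np

STEREO_SCALE_FACTOR = 5.4  # 0.54 m real baseline / 0.1 unit effective training baseline

def scale_prediction(pred_depth, gt_depth, eval_stereo):
    """Hypothetical helper mirroring the scaling behaviour described above."""
    if eval_stereo:
        # stereo-trained models: fixed scaling to metres, median scaling disabled
        return pred_depth * STEREO_SCALE_FACTOR
    # monocular models: per-image median scaling against the ground truth
    return pred_depth * np.median(gt_depth) / np.median(pred_depth)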

⤴️⤵️ Odometry evaluation

We include code for evaluating poses predicted by models trained with --split odom --dataset kitti_odom --data_path /path/to/kitti/odometry/dataset.

For this evaluation, the KITTI odometry dataset (color, 65GB) and ground truth poses zip files must be downloaded. As above, we assume that the pngs have been converted to jpgs.

If this data has been unzipped to folder kitti_odom, a model can be evaluated with:

python evaluate_pose.py --eval_split odom_9 --load_weights_folder ./odom_split.M/models/weights_29 --data_path kitti_odom/
python evaluate_pose.py --eval_split odom_10 --load_weights_folder ./odom_split.M/models/weights_29 --data_path kitti_odom/

📦 Precomputed results

You can download our precomputed disparity predictions from the following links:

| Training modality | Input size | .npy filesize | Eigen disparities |
|-------------------|------------|---------------|-------------------|
| Mono              | 640 x 192  | 343 MB        | Download 🔗       |
| Stereo            | 640 x 192  | 343 MB        | Download 🔗       |
| Mono + Stereo     | 640 x 192  | 343 MB        | Download 🔗       |
| Mono              | 1024 x 320 | 914 MB        | Download 🔗       |
| Stereo            | 1024 x 320 | 914 MB        | Download 🔗       |
| Mono + Stereo     | 1024 x 320 | 914 MB        | Download 🔗       |

👩‍⚖️ License

Copyright © Niantic, Inc. 2019. Patent Pending. All rights reserved. Please see the license file for terms.