Top Related Projects
- monodepth2 — [ICCV 2019] Monocular depth estimation from a single image
- SfmLearner-Pytorch — Pytorch version of SfmLearner from Tinghui Zhou et al.
- SfMLearner — An unsupervised learning framework for depth and ego-motion estimation from monocular videos
Quick Overview
PackNet-SfM is a deep learning project for self-supervised monocular depth estimation and camera ego-motion prediction. It implements a novel architecture called PackNet, which uses space-to-depth and depth-to-space transformations for efficient, detail-preserving feature learning in depth estimation tasks.
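The core idea behind PackNet's packing and unpacking blocks can be illustrated with a plain NumPy sketch of the space-to-depth and depth-to-space rearrangements (the real PackNet combines these with 3D convolutions; the function names below are illustrative, not the project's API):

```python
import numpy as np

def space_to_depth(x, r):
    """Fold r x r spatial blocks into channels: (C, H, W) -> (C*r*r, H/r, W/r)."""
    c, h, w = x.shape
    x = x.reshape(c, h // r, r, w // r, r)
    x = x.transpose(0, 2, 4, 1, 3)
    return x.reshape(c * r * r, h // r, w // r)

def depth_to_space(x, r):
    """Inverse: unfold channels back into r x r spatial blocks."""
    c, h, w = x.shape
    x = x.reshape(c // (r * r), r, r, h, w)
    x = x.transpose(0, 3, 1, 4, 2)
    return x.reshape(c // (r * r), h * r, w * r)

# The two transforms are lossless inverses, so downsampling by packing
# preserves detail, unlike pooling or strided convolutions.
img = np.arange(2 * 4 * 4, dtype=np.float32).reshape(2, 4, 4)
packed = space_to_depth(img, 2)        # shape (8, 2, 2)
restored = depth_to_space(packed, 2)   # shape (2, 4, 4)
assert np.array_equal(img, restored)
```

This losslessness is what the paper means by "detail-preserving": spatial resolution is traded for channels rather than discarded.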
Pros
- Self-supervised learning approach, requiring no ground truth depth data for training
- State-of-the-art performance in monocular depth estimation and pose prediction
- Efficient architecture design with PackNet for improved feature learning
- Includes pre-trained models and datasets for easy experimentation
Cons
- Requires significant computational resources for training and inference
- Limited documentation for advanced usage and customization
- Dependency on specific versions of libraries, which may cause compatibility issues
- Primarily focused on research, potentially limiting industrial applications
Code Examples
- Loading a pre-trained model and performing inference (illustrative API sketch; see `scripts/infer.py` in the repository for the exact entry point):

```python
from packnet_sfm.models.model_wrapper import ModelWrapper
from packnet_sfm.datasets.augmentations import resize_image
from packnet_sfm.utils.image import load_image

model_wrapper = ModelWrapper.load("packnet_model")
image = resize_image(load_image("path/to/image.jpg"), (192, 640))
depth = model_wrapper.depth(image)
```
- Training a PackNet model (illustrative; `train_dataset`, `val_dataset` and `config` are assumed to be defined elsewhere):

```python
from packnet_sfm.trainers.horovod_trainer import HorovodTrainer
from packnet_sfm.models.SfmModel import SfmModel

model = SfmModel()
trainer = HorovodTrainer(model, train_dataset, val_dataset, config)
trainer.fit()
```
- Evaluating depth estimation performance (illustrative; `test_dataset` and `gt_depths` are assumed to be defined elsewhere):

```python
from packnet_sfm.models.model_wrapper import ModelWrapper
from packnet_sfm.utils.depth import compute_depth_metrics

model_wrapper = ModelWrapper.load("packnet_model")
pred_depths = model_wrapper.depth_all(test_dataset)
metrics = compute_depth_metrics(gt_depths, pred_depths)
```
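For reference, the standard monocular depth metrics reported by such evaluations (Abs.Rel., Sqr.Rel., RMSE, RMSElog, d < 1.25) can be computed from ground-truth and predicted depth arrays as below. This is a minimal NumPy sketch of the usual Eigen et al. metrics, not the project's own `compute_depth_metrics`:

```python
import numpy as np

def depth_metrics(gt, pred):
    """Standard monocular depth metrics over valid (gt > 0) pixels."""
    mask = gt > 0
    gt, pred = gt[mask], pred[mask]
    thresh = np.maximum(gt / pred, pred / gt)
    return {
        "abs_rel": np.mean(np.abs(gt - pred) / gt),        # Abs.Rel.
        "sq_rel": np.mean((gt - pred) ** 2 / gt),          # Sqr.Rel.
        "rmse": np.sqrt(np.mean((gt - pred) ** 2)),        # RMSE
        "rmse_log": np.sqrt(np.mean((np.log(gt) - np.log(pred)) ** 2)),
        "a1": np.mean(thresh < 1.25),                      # d < 1.25
    }

# A perfect prediction scores zero error and delta-accuracy of 1.
gt = np.random.uniform(1.0, 80.0, size=(4, 4))
print(depth_metrics(gt, gt))
```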
Getting Started
1. Clone the repository:

```bash
git clone https://github.com/TRI-ML/packnet-sfm.git
cd packnet-sfm
```

2. Install dependencies:

```bash
pip install -r requirements.txt
```

3. Download pre-trained models:

```bash
./scripts/download_models.sh
```

4. Run inference on a sample image (illustrative API sketch; `visualize_depth` is an assumed helper):

```python
from packnet_sfm.models.model_wrapper import ModelWrapper
from packnet_sfm.datasets.augmentations import resize_image
from packnet_sfm.utils.image import load_image

model = ModelWrapper.load("PackNet01_MR_velsup_CStoK")
image = resize_image(load_image("data/test_image.jpg"), (192, 640))
depth = model.depth(image)
visualize_depth(depth)  # assumed plotting helper
```
Competitor Comparisons
[ICCV 2019] Monocular depth estimation from a single image
Pros of monodepth2
- Simpler architecture, making it easier to understand and implement
- Faster training and inference times
- More lightweight, requiring less computational resources
Cons of monodepth2
- Generally lower accuracy compared to PackNet-SfM
- Less robust to challenging scenarios and diverse environments
- Limited ability to handle complex geometric structures
Code Comparison
monodepth2:
```python
class DepthDecoder(nn.Module):
    def __init__(self, num_ch_enc, scales=range(4), num_output_channels=1, use_skips=True):
        super(DepthDecoder, self).__init__()
        self.num_output_channels = num_output_channels
        self.use_skips = use_skips
        self.upsample_mode = 'nearest'
        self.scales = scales
```
PackNet-SfM:
```python
class PackNet01(nn.Module):
    def __init__(self, version=None, dropout=None, depth_scales=4):
        super().__init__()
        self.version = version
        self.depth_scales = depth_scales
        self.encoder = PackNetEncoder(version=self.version, dropout=dropout)
        self.decoder = PackNetDecoder(version=self.version, dropout=dropout, depth_scales=self.depth_scales)
```
The code snippets show that PackNet-SfM uses a more complex architecture with specialized encoder and decoder modules, while monodepth2 employs a simpler depth decoder structure. This reflects the trade-off between complexity and performance in these two approaches to monocular depth estimation.
Pytorch version of SfmLearner from Tinghui Zhou et al.
Pros of SfmLearner-Pytorch
- Simpler implementation, making it easier to understand and modify
- Lighter weight, requiring less computational resources
- Faster training and inference times
Cons of SfmLearner-Pytorch
- Generally lower accuracy compared to PackNet-SfM
- Less robust to challenging scenarios like occlusions and dynamic objects
- Limited ability to handle high-resolution images
Code Comparison
SfmLearner-Pytorch:
```python
class SfMLearner(nn.Module):
    def __init__(self, img_size, intrinsics, use_ssim=True):
        super(SfMLearner, self).__init__()
        self.pose_net = PoseExpNet(nb_ref_imgs=2)
        self.depth_net = DepthNet(img_size)
```
PackNet-SfM:
```python
class PackNet01(nn.Module):
    def __init__(self, **kwargs):
        super().__init__()
        self.encoder = PackNetEncoder(**kwargs)
        self.decoder = PackNetDecoder(**kwargs)
        self.pose_network = PoseNetwork(**kwargs)
```
The code comparison shows that PackNet-SfM uses a more complex architecture with separate encoder and decoder modules, while SfmLearner-Pytorch has a simpler structure with separate pose and depth networks. This reflects the overall design philosophy of each project, with PackNet-SfM focusing on higher accuracy and robustness at the cost of increased complexity.
An unsupervised learning framework for depth and ego-motion estimation from monocular videos
Pros of SfMLearner
- Pioneering work in self-supervised depth and ego-motion estimation
- Simpler architecture, potentially easier to understand and implement
- Established benchmark in the field with numerous citations
Cons of SfMLearner
- Generally lower accuracy compared to more recent methods like PackNet-SfM
- Less robust to dynamic objects and occlusions
- Limited to monocular depth estimation without leveraging additional sensors
Code Comparison
SfMLearner:
```python
def get_multi_scale_intrinsics(raw_cam_mat, num_scales):
    proj_cam2pix = []
    for s in range(num_scales):
        proj_cam2pix.append(raw_cam_mat.copy())
        proj_cam2pix[-1][:2, :] = proj_cam2pix[-1][:2, :] / (2**s)
    return proj_cam2pix
```
PackNet-SfM:
```python
class PackNet(nn.Module):
    def __init__(self, version='1A'):
        super().__init__()
        self.encoder = PackNetEncoder(version=version)
        self.decoder = PackNetDecoder(version=version)
        self.depth_net = DepthDecoder(num_ch_enc=self.encoder.num_ch_enc, scales=range(4))
```
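For intuition about the SfMLearner helper above: `get_multi_scale_intrinsics` halves the focal lengths and principal point at each pyramid scale, matching images downsampled by powers of two. A self-contained NumPy re-implementation of that snippet (NumPy used purely for illustration):

```python
import numpy as np

def get_multi_scale_intrinsics(raw_cam_mat, num_scales):
    """Scale the top two rows (fx, skew, cx / fy, cy) of a 3x3 intrinsics matrix per pyramid level."""
    proj_cam2pix = []
    for s in range(num_scales):
        K = raw_cam_mat.copy()
        K[:2, :] = K[:2, :] / (2 ** s)  # focal lengths and principal point shrink with the image
        proj_cam2pix.append(K)
    return proj_cam2pix

# Illustrative intrinsics for a 1280x384-style camera.
K = np.array([[720.0,   0.0, 640.0],
              [  0.0, 720.0, 192.0],
              [  0.0,   0.0,   1.0]])
Ks = get_multi_scale_intrinsics(K, 4)
# At scale s, fx becomes 720 / 2**s, consistent with images downsampled by 2**s.
```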
PackNet-SfM: 3D Packing for Self-Supervised Monocular Depth Estimation
Install // Datasets // Training // Evaluation // Models // License // References
** UPDATE **: We have released a new depth estimation repository here, containing code related to our latest publications. It is an updated version of this repository, so if you are familiar with PackNet-SfM you should be able to migrate easily. Future publications will be included in our new repository, and this one will remain as is. Thank you very much for your support over these past couple of years!
Official PyTorch implementation of self-supervised monocular depth estimation methods invented by the ML Team at Toyota Research Institute (TRI), in particular for PackNet: 3D Packing for Self-Supervised Monocular Depth Estimation (CVPR 2020 oral), Vitor Guizilini, Rares Ambrus, Sudeep Pillai, Allan Raventos and Adrien Gaidon. Although self-supervised (i.e. trained only on monocular videos), PackNet outperforms other self, semi, and fully supervised methods. Furthermore, it gets better with input resolution and number of parameters, generalizes better, and can run in real-time (with TensorRT). See References for more info on our models.
This is also the official implementation of Neural Ray Surfaces for Self-Supervised Learning of Depth and Ego-motion (3DV 2020 oral), Igor Vasiljevic, Vitor Guizilini, Rares Ambrus, Sudeep Pillai, Wolfram Burgard, Greg Shakhnarovich and Adrien Gaidon. Neural Ray Surfaces (NRS) generalize self-supervised depth and pose estimation beyond the pinhole model to all central cameras, allowing the learning of meaningful depth and pose on non-pinhole cameras such as fisheye and catadioptric.
Install
You need a machine with recent Nvidia drivers and a GPU with at least 6GB of memory (more for the bigger models at higher resolution). We recommend using docker (see nvidia-docker2 instructions) for a reproducible environment. To set up your environment, type in a terminal (only tested on Ubuntu 18.04):
```bash
git clone https://github.com/TRI-ML/packnet-sfm.git
cd packnet-sfm
# if you want to use docker (recommended)
make docker-build
```
We will list below all commands as if run directly inside our container. To run any of the commands in a container, you can either start the container in interactive mode with `make docker-start-interactive` to land in a shell where you can type those commands, or you can do it in one step:
```bash
# single GPU
make docker-run COMMAND="some-command"
# multi-GPU
make docker-run-mpi COMMAND="some-command"
```
For instance, to verify that the environment is setup correctly, you can run a simple overfitting test:
```bash
# download a tiny subset of KITTI
curl -s https://tri-ml-public.s3.amazonaws.com/github/packnet-sfm/datasets/KITTI_tiny.tar | tar xv -C /data/datasets/
# in docker
make docker-run COMMAND="python3 scripts/train.py configs/overfit_kitti.yaml"
```
If you want to use features related to AWS (for dataset access) and Weights & Biases (WANDB) (for experiment management/visualization), then you should create associated accounts and configure your shell with the following environment variables:
```bash
export AWS_SECRET_ACCESS_KEY="something"
export AWS_ACCESS_KEY_ID="something"
export AWS_DEFAULT_REGION="something"
export WANDB_ENTITY="something"
export WANDB_API_KEY="something"
```
To enable WANDB logging and AWS checkpoint syncing, you can then set the corresponding configuration parameters in `configs/<your config>.yaml` (cf. configs/default_config.py for defaults and docs):
```yaml
wandb:
    dry_run: True                                 # Wandb dry-run (not logging)
    name: ''                                      # Wandb run name
    project: os.environ.get("WANDB_PROJECT", "")  # Wandb project
    entity: os.environ.get("WANDB_ENTITY", "")    # Wandb entity
    tags: []                                      # Wandb tags
    dir: ''                                       # Wandb save folder
checkpoint:
    s3_path: ''     # s3 path for AWS model syncing
    s3_frequency: 1 # How often to s3 sync
```
If you encounter out-of-memory issues, try a lower `batch_size` parameter in the config file.
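For example, batch size is set per dataset split; the exact location follows `configs/default_config.py`, and the value here is illustrative:

```yaml
datasets:
    train:
        batch_size: 1   # lower this if you run out of GPU memory
```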
NB: if you would rather not use docker, you could create a conda environment by following the steps in the Dockerfile and mixing `conda` and `pip` at your own risk...
Datasets
Datasets are assumed to be downloaded in `/data/datasets/<dataset-name>` (which can be a symbolic link).
Dense Depth for Autonomous Driving (DDAD)
Together with PackNet, we introduce Dense Depth for Automated Driving (DDAD): a new dataset that leverages diverse logs from TRI's fleet of well-calibrated self-driving cars equipped with cameras and high-accuracy long-range LiDARs. Compared to existing benchmarks, DDAD enables much more accurate 360 degree depth evaluation at range, see the official DDAD repository for more info and instructions. You can also download DDAD directly via:
```bash
curl -s https://tri-ml-public.s3.amazonaws.com/github/DDAD/datasets/DDAD.tar | tar -xv -C /data/datasets/
```
KITTI
The KITTI (raw) dataset used in our experiments can be downloaded from the KITTI website. For convenience, we provide the standard splits used for training and evaluation: eigen_zhou, eigen_train, eigen_val and eigen_test, as well as pre-computed ground-truth depth maps: original and improved. The full KITTI_raw dataset, as used in our experiments, can be directly downloaded here or with the following command:
```bash
# KITTI_raw
curl -s https://tri-ml-public.s3.amazonaws.com/github/packnet-sfm/datasets/KITTI_raw.tar | tar -xv -C /data/datasets/
```
Tiny DDAD/KITTI
For simple tests, we also provide a "tiny" version of DDAD and KITTI:
```bash
# DDAD_tiny
curl -s https://tri-ml-public.s3.amazonaws.com/github/packnet-sfm/datasets/DDAD_tiny.tar | tar -xv -C /data/datasets/
# KITTI_tiny
curl -s https://tri-ml-public.s3.amazonaws.com/github/packnet-sfm/datasets/KITTI_tiny.tar | tar -xv -C /data/datasets/
```
OmniCam
The raw data for the catadioptric OmniCam dataset can be downloaded from the Omnicam website. For convenience, we provide the dataset for testing the Neural Ray Surfaces (NRS) model. The dataset can be downloaded with the following command:
```bash
# omnicam
curl -s https://tri-ml-public.s3.amazonaws.com/github/packnet-sfm/datasets/OmniCam.tar | tar -xv -C /data/datasets/
```
The ray surface template we used for training on OmniCam can be found here.
Training
PackNet can be trained from scratch in a fully self-supervised way (from video only, cf. CVPR'20), in a semi-supervised way (with sparse lidar using our reprojected 3D loss, cf. CoRL'19), and it can also use a fixed pre-trained semantic segmentation network to guide the representation learning further (cf. ICLR'20).
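The self-supervised signal shared by these variants is a photometric reprojection loss between the target frame and views synthesized from adjacent frames. A common form in this family of methods mixes SSIM with an L1 term; the sketch below is a simplified NumPy version that computes SSIM over the whole image rather than local windows, so it illustrates the loss shape, not the project's implementation:

```python
import numpy as np

def ssim_global(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Simplified SSIM over the whole image (real implementations use local windows)."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

def photometric_loss(target, warped, alpha=0.85):
    """alpha * (1 - SSIM)/2 + (1 - alpha) * L1: the usual self-supervised photometric loss."""
    ssim_term = (1.0 - ssim_global(target, warped)) / 2.0
    l1_term = np.abs(target - warped).mean()
    return alpha * ssim_term + (1.0 - alpha) * l1_term

# A perfectly reprojected view incurs a near-zero loss; any photometric
# mismatch (bad depth or pose) increases it, which is the training signal.
img = np.random.rand(192, 640)
loss = photometric_loss(img, img)
```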
Any training, including fine-tuning, can be done by passing either a `.yaml` config file or a `.ckpt` model checkpoint to scripts/train.py:

```bash
python3 scripts/train.py <config.yaml or checkpoint.ckpt>
```
If you pass a config file, training will start from scratch using the parameters in that config file. Example config files are in configs.
If you instead pass a `.ckpt` file, training will continue from the current checkpoint state.
Note that it is also possible to define checkpoints within the config file itself. These can be set either individually for the depth and/or pose networks, or by defining a checkpoint for the model itself, which includes all sub-networks (setting the model checkpoint will overwrite the depth and pose checkpoints). In this case, a new training session will start and the networks will be initialized with the model state in the `.ckpt` file(s). Below we provide the locations in the config file where these checkpoints are defined:
```yaml
checkpoint:
    # Folder where .ckpt files will be saved during training
    filepath: /path/to/where/checkpoints/will/be/saved
model:
    # Checkpoint for the model (depth + pose)
    checkpoint_path: /path/to/model.ckpt
    depth_net:
        # Checkpoint for the depth network
        checkpoint_path: /path/to/depth_net.ckpt
    pose_net:
        # Checkpoint for the pose network
        checkpoint_path: /path/to/pose_net.ckpt
```
Every aspect of the training configuration can be controlled by modifying the yaml config file. This includes the model configuration (self-supervised, semi-supervised, loss parameters, etc.), the depth and pose network configurations (choice of architecture and different parameters), optimizers and schedulers (learning rates, weight decay, etc.), datasets (name, splits, depth types, etc.) and much more. For a comprehensive list, please refer to configs/default_config.py.
Evaluation
Similar to the training case, to evaluate a trained model (cf. above or our pre-trained models) you need to provide a `.ckpt` checkpoint, optionally followed by a `.yaml` config file that overrides the configuration stored in the checkpoint.

```bash
python3 scripts/eval.py --checkpoint <checkpoint.ckpt> [--config <config.yaml>]
```
You can also run inference directly on a single image or folder:

```bash
python3 scripts/infer.py --checkpoint <checkpoint.ckpt> --input <image or folder> --output <image or folder> [--image_shape <input shape (h,w)>]
```
Models
DDAD
Model | Abs.Rel. | Sqr.Rel | RMSE | RMSElog | d < 1.25 |
---|---|---|---|---|---|
ResNet18, Self-Supervised, 384x640, ImageNet → DDAD (D) | 0.213 | 4.975 | 18.051 | 0.340 | 0.761 |
PackNet, Self-Supervised, 384x640, DDAD (D) | 0.162 | 3.917 | 13.452 | 0.269 | 0.823 |
ResNet18, Self-Supervised, 384x640, ImageNet → DDAD (D)* | 0.227 | 11.293 | 17.368 | 0.303 | 0.758 |
PackNet, Self-Supervised, 384x640, DDAD (D)* | 0.173 | 7.164 | 14.363 | 0.249 | 0.835 |
PackNetSAN, Supervised, 384x640, DDAD (D)* | 0.086/0.038 | 1.609/0.546 | 10.700/5.951 | 0.185/0.115 | 0.909/0.976 |
*: Note that this repository's results differ slightly from the ones reported in our CVPR'20 paper (first two rows), although conclusions are the same. Since CVPR'20, we have officially released an updated DDAD dataset to account for privacy constraints and improve scene distribution. Please use the latest numbers when comparing to the official DDAD release.
KITTI
Model | Abs.Rel. | Sqr.Rel | RMSE | RMSElog | d < 1.25 |
---|---|---|---|---|---|
ResNet18, Self-Supervised, 192x640, ImageNet → KITTI (K) | 0.116 | 0.811 | 4.902 | 0.198 | 0.865 |
PackNet, Self-Supervised, 192x640, KITTI (K) | 0.111 | 0.800 | 4.576 | 0.189 | 0.880 |
PackNet, Self-Supervised Scale-Aware, 192x640, CS → K | 0.108 | 0.758 | 4.506 | 0.185 | 0.887 |
PackNet, Self-Supervised Scale-Aware, 384x1280, CS → K | 0.106 | 0.838 | 4.545 | 0.186 | 0.895 |
PackNet, Semi-Supervised (densified GT), 192x640, CS → K | 0.072 | 0.335 | 3.220 | 0.115 | 0.934 |
PackNetSAN, Supervised (densified GT), 352x1216, K | 0.052/0.016 | 0.175/0.028 | 2.230/0.902 | 0.083/0.032 | 0.970/0.997 |
All experiments followed the Eigen et al. protocol for training and evaluation, with Zhou et al's preprocessing to remove static training frames. The PackNet model pre-trained on Cityscapes used for fine-tuning on KITTI can be found here.
OmniCam
Our NRS model for OmniCam can be found here.
Precomputed Depth Maps
For convenience, we also provide pre-computed depth maps for supervised training and evaluation:
- PackNet, Self-Supervised Scale-Aware, 192x640, CS → K | eigen_train_files | eigen_zhou_files | eigen_val_files | eigen_test_files
- PackNet, Semi-Supervised (densified GT), 192x640, CS → K | eigen_train_files | eigen_zhou_files | eigen_val_files | eigen_test_files
License
The source code is released under the MIT license.
References
PackNet relies on symmetric packing and unpacking blocks to jointly learn to compress and decompress detail-preserving representations using 3D convolutions. It also uses depth superresolution, which we introduce in SuperDepth (ICRA 2019). Our network can also output metrically scaled depth thanks to our weak velocity supervision (CVPR 2020).
We also experimented with sparse supervision from as few as 4-beam LiDAR sensors, using a novel reprojection loss that minimizes distance errors in the image plane (CoRL 2019). By enforcing a sparsity-inducing data augmentation policy for ego-motion learning, we were also able to effectively regularize the pose network and enable stronger generalization performance (CoRL 2019). In a follow-up work, we propose the injection of semantic information directly into the decoder layers of the depth networks, using pixel-adaptive convolutions to create semantic-aware features and further improve performance (ICLR 2020).
Depending on the application, please use the following citations when referencing our work:
3D Packing for Self-Supervised Monocular Depth Estimation (CVPR 2020 oral)
Vitor Guizilini, Rares Ambrus, Sudeep Pillai, Allan Raventos and Adrien Gaidon, [paper], [video]
```bibtex
@inproceedings{packnet,
  author = {Vitor Guizilini and Rares Ambrus and Sudeep Pillai and Allan Raventos and Adrien Gaidon},
  title = {3D Packing for Self-Supervised Monocular Depth Estimation},
  booktitle = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  primaryClass = {cs.CV},
  year = {2020},
}
```
Sparse Auxiliary Networks for Unified Monocular Depth Prediction and Completion (CVPR 2021)
Vitor Guizilini, Rares Ambrus, Wolfram Burgard and Adrien Gaidon, [paper]
```bibtex
@inproceedings{packnet-san,
  author = {Vitor Guizilini and Rares Ambrus and Wolfram Burgard and Adrien Gaidon},
  title = {Sparse Auxiliary Networks for Unified Monocular Depth Prediction and Completion},
  booktitle = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  primaryClass = {cs.CV},
  year = {2021},
}
```
Neural Ray Surfaces for Self-Supervised Learning of Depth and Ego-motion (3DV 2020 oral)
Igor Vasiljevic, Vitor Guizilini, Rares Ambrus, Sudeep Pillai, Wolfram Burgard, Greg Shakhnarovich, Adrien Gaidon, [paper], [video]
```bibtex
@inproceedings{vasiljevic2020neural,
  title = {Neural Ray Surfaces for Self-Supervised Learning of Depth and Ego-motion},
  author = {Vasiljevic, Igor and Guizilini, Vitor and Ambrus, Rares and Pillai, Sudeep and Burgard, Wolfram and Shakhnarovich, Greg and Gaidon, Adrien},
  booktitle = {International Conference on 3D Vision},
  primaryClass = {cs.CV},
  year = {2020}
}
```
Semantically-Guided Representation Learning for Self-Supervised Monocular Depth (ICLR 2020)
Vitor Guizilini, Rui Hou, Jie Li, Rares Ambrus and Adrien Gaidon, [paper]
```bibtex
@inproceedings{packnet-semguided,
  author = {Vitor Guizilini and Rui Hou and Jie Li and Rares Ambrus and Adrien Gaidon},
  title = {Semantically-Guided Representation Learning for Self-Supervised Monocular Depth},
  booktitle = {International Conference on Learning Representations (ICLR)},
  month = {April},
  year = {2020},
}
```
Robust Semi-Supervised Monocular Depth Estimation with Reprojected Distances (CoRL 2019 spotlight)
Vitor Guizilini, Jie Li, Rares Ambrus, Sudeep Pillai and Adrien Gaidon, [paper],[video]
```bibtex
@inproceedings{packnet-semisup,
  author = {Vitor Guizilini and Jie Li and Rares Ambrus and Sudeep Pillai and Adrien Gaidon},
  title = {Robust Semi-Supervised Monocular Depth Estimation with Reprojected Distances},
  booktitle = {Conference on Robot Learning (CoRL)},
  month = {October},
  year = {2019},
}
```
Two Stream Networks for Self-Supervised Ego-Motion Estimation (CoRL 2019 spotlight)
Rares Ambrus, Vitor Guizilini, Jie Li, Sudeep Pillai and Adrien Gaidon, [paper]
```bibtex
@inproceedings{packnet-twostream,
  author = {Rares Ambrus and Vitor Guizilini and Jie Li and Sudeep Pillai and Adrien Gaidon},
  title = {{Two Stream Networks for Self-Supervised Ego-Motion Estimation}},
  booktitle = {Conference on Robot Learning (CoRL)},
  month = {October},
  year = {2019},
}
```
SuperDepth: Self-Supervised, Super-Resolved Monocular Depth Estimation (ICRA 2019)
Sudeep Pillai, Rares Ambrus and Adrien Gaidon, [paper], [video]
```bibtex
@inproceedings{superdepth,
  author = {Sudeep Pillai and Rares Ambrus and Adrien Gaidon},
  title = {SuperDepth: Self-Supervised, Super-Resolved Monocular Depth Estimation},
  booktitle = {IEEE International Conference on Robotics and Automation (ICRA)},
  month = {May},
  year = {2019},
}
```