Top Related Projects
An unsupervised learning framework for depth and ego-motion estimation from monocular videos
TRI-ML Monocular Depth Estimation Repository
[ICCV 2019] Monocular depth estimation from a single image
Quick Overview
ClementPinard/SfmLearner-Pytorch is a PyTorch implementation of the SfMLearner model for unsupervised depth and ego-motion estimation from monocular videos. It provides a framework for training and testing the model on various datasets, including KITTI and Cityscapes.
Pros
- Implements a state-of-the-art unsupervised learning approach for depth and ego-motion estimation
- Supports multiple datasets and provides pre-trained models
- Well-documented codebase with clear instructions for setup and usage
- Includes evaluation scripts for benchmarking results
Cons
- Requires significant computational resources for training
- Limited to monocular video input, not suitable for stereo or multi-view setups
- May struggle with complex scenes or rapid camera movements
- Dependency on specific versions of PyTorch and other libraries
Code Examples
- Loading a pre-trained model:
import torch
from models import DispNetS, PoseExpNet
# Load pre-trained DispNet model
disp_net = DispNetS().cuda()
weights = torch.load('pretrained_model_disp.pth.tar')
disp_net.load_state_dict(weights['state_dict'])
disp_net.eval()
# Load pre-trained PoseExpNet model
pose_exp_net = PoseExpNet(nb_ref_imgs=2).cuda()
weights = torch.load('pretrained_model_pose.pth.tar')
pose_exp_net.load_state_dict(weights['state_dict'])
pose_exp_net.eval()
- Predicting depth and pose:
import torch
from torchvision import transforms
# Prepare input data
img_transform = transforms.Compose([
    transforms.Resize((128, 416)),
    transforms.ToTensor(),
    transforms.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])
])
# tgt_img is a PIL image, ref_imgs a list of PIL images
tgt_img = img_transform(tgt_img).unsqueeze(0).cuda()
ref_imgs = [img_transform(img).unsqueeze(0).cuda() for img in ref_imgs]
# Predict depth and pose
with torch.no_grad():
    disp = disp_net(tgt_img)
    depth = 1 / disp
    # PoseExpNet returns (explainability mask, pose); pose is [B, nb_ref_imgs, 6]
    _, pose = pose_exp_net(tgt_img, ref_imgs)
- Visualizing results:
import matplotlib.pyplot as plt
# Visualize depth map
plt.figure(figsize=(10, 5))
plt.imshow(depth[0, 0].cpu().numpy(), cmap='magma')
plt.colorbar()
plt.title('Predicted Depth Map')
plt.show()
# Visualize pose (translation only: first 3 of the 6 pose values, one row per reference frame)
plt.figure(figsize=(10, 5))
plt.plot(pose[0, :, :3].cpu().numpy())
plt.title('Predicted Camera Translation')
plt.legend(['x', 'y', 'z'])
plt.show()
Getting Started
- Clone the repository:
git clone https://github.com/ClementPinard/SfmLearner-Pytorch.git
cd SfmLearner-Pytorch
- Install dependencies:
pip install -r requirements.txt
- Download a pre-trained model:
wget https://github.com/ClementPinard/SfmLearner-Pytorch/releases/download/v0.2/exp_pose_model_best.pth.tar
wget https://github.com/ClementPinard/SfmLearner-Pytorch/releases/download/v0.2/exp_mask_model_best.pth.tar
- Run inference on a sample video:
python test_vo.py --pretrained-posenet exp_pose_model_best.pth.tar --pretrained-dispnet exp_mask_model_best.pth.tar --dataset-dir
Competitor Comparisons
An unsupervised learning framework for depth and ego-motion estimation from monocular videos
Pros of SfMLearner
- Original implementation by the authors of the SfMLearner paper
- Supports TensorFlow, which may be preferred by some researchers
- Includes pre-trained models for immediate use
Cons of SfMLearner
- Less actively maintained (last update in 2018)
- Limited to TensorFlow 1.x, which is now deprecated
- Lacks some modern features and optimizations
Code Comparison
SfMLearner (TensorFlow):
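import tensorflow as tf
def get_multi_scale_intrinsics(raw_cam_mat, num_scales):
    proj_cam2pix = []
    for s in range(num_scales):
        # Halving the image s times divides fx, fy, cx, cy by 2**s;
        # the last row [0, 0, 1] must stay unscaled
        row_scale = tf.constant([[1.0 / 2**s], [1.0 / 2**s], [1.0]])
        proj_cam2pix.append(raw_cam_mat * row_scale)
    proj_cam2pix = tf.stack(proj_cam2pix)
    return proj_cam2pix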
SfmLearner-Pytorch:
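import torch
def get_multi_scale_intrinsics(raw_cam_mat, num_scales):
    proj_cam2pix = []
    for s in range(num_scales):
        # Halving the image s times divides fx, fy, cx, cy by 2**s;
        # the last row [0, 0, 1] must stay unscaled
        row_scale = raw_cam_mat.new_tensor([[1.0 / 2**s], [1.0 / 2**s], [1.0]])
        proj_cam2pix.append(raw_cam_mat * row_scale)
    proj_cam2pix = torch.stack(proj_cam2pix)
    return proj_cam2pix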
The code structure is similar, with the main difference being the use of TensorFlow vs. PyTorch functions.
TRI-ML Monocular Depth Estimation Repository
Pros of packnet-sfm
- More advanced architecture with PackNet and 3D convolutions for improved depth estimation
- Supports self-supervised learning on stereo and monocular video sequences
- Includes pre-trained models and benchmarking tools for easier evaluation
Cons of packnet-sfm
- More complex implementation, potentially harder to understand and modify
- Requires more computational resources due to 3D convolutions
- Less flexibility in terms of input data formats compared to SfmLearner-Pytorch
Code Comparison
SfmLearner-Pytorch:
class DepthNet(nn.Module):
    def __init__(self, num_layers=50, pretrained=True):
        super(DepthNet, self).__init__()
        self.encoder = resnet_encoder(num_layers, pretrained)
        self.decoder = depth_decoder(self.encoder.num_ch_enc)
packnet-sfm:
class PackNet01(nn.Module):
    def __init__(self, version='1A'):
        super().__init__()
        self.encoder = PackNetEncoder01(version=version)
        self.decoder = PackNetDecoder01(version=version)
        self.depth = PackNetDepth(version=version)
The code comparison shows that packnet-sfm uses a more specialized architecture with PackNet modules, while SfmLearner-Pytorch relies on a more standard ResNet-based encoder-decoder structure.
[ICCV 2019] Monocular depth estimation from a single image
Pros of monodepth2
- More advanced and recent implementation, incorporating improvements in self-supervised depth estimation
- Supports multi-GPU training and mixed precision, leading to faster training times
- Includes pre-trained models for immediate use and evaluation
Cons of monodepth2
- More complex architecture, potentially harder to understand and modify for beginners
- Requires more computational resources due to its advanced features
- Limited to monocular depth estimation, while SfmLearner-Pytorch also handles camera pose estimation
Code Comparison
SfmLearner-Pytorch:
class SfMLearner(nn.Module):
    def __init__(self, img_size, pretrained=False):
        super(SfMLearner, self).__init__()
        self.img_size = img_size
        self.encoder = ResnetModel(18, pretrained)
        self.decoder = DepthDecoder(self.encoder.resnet_encoder.num_ch_enc)
monodepth2:
class DepthDecoder(nn.Module):
    def __init__(self, num_ch_enc, scales=range(4), num_output_channels=1, use_skips=True):
        super(DepthDecoder, self).__init__()
        self.num_output_channels = num_output_channels
        self.use_skips = use_skips
        self.upsample_mode = 'nearest'
        self.scales = scales
Both repositories implement depth estimation models, but monodepth2 offers a more sophisticated approach with additional features and optimizations. SfmLearner-Pytorch may be more suitable for those interested in understanding the basics of structure from motion and depth estimation.
README
SfMLearner Pytorch version
This codebase implements the system described in the paper:
Unsupervised Learning of Depth and Ego-Motion from Video
Tinghui Zhou, Matthew Brown, Noah Snavely, David G. Lowe
In CVPR 2017 (Oral).
See the project webpage for more details.
Original Author : Tinghui Zhou (tinghuiz@berkeley.edu)
Pytorch implementation : Clément Pinard (clement.pinard@ensta-paristech.fr)
Preamble
This codebase was developed and tested with Pytorch 1.0.1, CUDA 10 and Ubuntu 16.04. The original code was developed in TensorFlow; you can access it here
Prerequisite
pip3 install -r requirements.txt
or manually install the following packages:
pytorch >= 1.0.1
pebble
matplotlib
imageio
scipy
scikit-image
argparse
tensorboardX
blessings
progressbar2
path.py
Note
Because it uses the latest PyTorch features, it is not compatible with earlier versions of PyTorch.
If you don't have an up-to-date PyTorch, the tags can help you check out the commits corresponding to your PyTorch version.
What has been done
- Training has been tested on KITTI and CityScapes.
- Dataset preparation has been largely improved, and now stores image sequences in folders, making sure that the movement between consecutive frames is always large enough.
- That way, training is now significantly faster, running at ~0.14s per step vs ~0.2s per step initially (on a single GTX980Ti).
- In addition, you no longer need to prepare data for a particular sequence length, as stacking is done on the fly.
- You can still choose the former stacked frames dataset format.
- Convergence is now almost as good as in the original paper with the same hyperparameters.
- You can now compare with ground truth for your validation set. It is still possible to validate without it, but you can now see that minimizing the photometric error is not equivalent to optimizing the depth map.
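The on-the-fly stacking described in this list can be pictured as follows. This is a minimal sketch rather than the repository's loader: the function name `make_samples` and the dictionary keys are hypothetical, and it assumes the target frame sits at the center of an odd-length window (the convention discussed in Issue 48).

```python
def make_samples(frames, sequence_length=3):
    """Group an ordered list of frames into (target, references) samples.

    Hypothetical helper: the target is the center frame of each window and
    the references are its neighbors, so no pre-stacked images are needed.
    """
    if sequence_length % 2 == 0 or sequence_length < 3:
        raise ValueError("sequence_length must be an odd number >= 3")
    half = (sequence_length - 1) // 2
    # Offsets of the reference frames relative to the target, e.g. [-1, 1]
    shifts = [s for s in range(-half, half + 1) if s != 0]
    samples = []
    for i in range(half, len(frames) - half):
        samples.append({
            "tgt": frames[i],
            "ref_imgs": [frames[i + s] for s in shifts],
        })
    return samples


frames = ["0000.jpg", "0001.jpg", "0002.jpg", "0003.jpg"]
for sample in make_samples(frames, sequence_length=3):
    print(sample["tgt"], sample["ref_imgs"])
```

Because the window is built at load time, the same folder of frames can serve `--sequence-length 3` and `--sequence-length 5` without being prepared again.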
Differences with official Implementation
- The smooth loss differs from the official repo: it is applied to depth instead of disparity. The original disparity smooth loss did not work well (don't know why!) and did not converge at all with the weight value used (0.5).
- Loss is divided by 2.3 when downscaling instead of 2. This is the result of empirical experiments, so the optimal value has clearly not been carefully determined.
- As a consequence, with a smooth loss of 2.0, the depth test is better, but the pose test is worse. To revert the smooth loss back to the original, you can change it here
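To make the per-scale weighting concrete, here is a small dependency-free sketch of a smoothness loss computed on depth maps, with the weight divided by 2.3 at each coarser scale. The function names are hypothetical, depth maps are plain nested lists, and the choice of second-order differences as the smoothness term is an assumption of this sketch, not something stated above.

```python
def second_order_smoothness(depth):
    """Mean absolute second-order finite differences of a 2D depth map (list of rows)."""
    def dx(img):  # horizontal difference
        return [[row[j + 1] - row[j] for j in range(len(row) - 1)] for row in img]

    def dy(img):  # vertical difference
        return [[img[i + 1][j] - img[i][j] for j in range(len(img[0]))]
                for i in range(len(img) - 1)]

    def mean_abs(img):
        values = [abs(v) for row in img for v in row]
        return sum(values) / len(values) if values else 0.0

    gx, gy = dx(depth), dy(depth)
    return (mean_abs(dx(gx)) + mean_abs(dy(gx))
            + mean_abs(dx(gy)) + mean_abs(dy(gy)))


def multi_scale_smooth_loss(depth_pyramid, scale_divisor=2.3):
    """Sum the smoothness term over scales (fine to coarse), shrinking the weight each time."""
    loss, weight = 0.0, 1.0
    for depth in depth_pyramid:
        loss += weight * second_order_smoothness(depth)
        weight /= scale_divisor  # 2.3 here, where the official code uses 2
    return loss


# A planar ramp has no curvature, so it is not penalized
ramp = [[float(i + j) for j in range(4)] for i in range(4)]
print(multi_scale_smooth_loss([ramp]))
```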
Preparing training data
Preparation is roughly the same command as in the original code.
For KITTI, first download the dataset using this script provided on the official website, and then run the following command. The --with-depth option will save resized copies of the ground truth to help you set hyperparameters. The --with-pose option will dump the sequence poses in the same format as the Odometry dataset (see pose evaluation).
python3 data/prepare_train_data.py /path/to/raw/kitti/dataset/ --dataset-format 'kitti_raw' --dump-root /path/to/resulting/formatted/data/ --width 416 --height 128 --num-threads 4 [--static-frames /path/to/static_frames.txt] [--with-depth] [--with-pose]
For Cityscapes, download the following packages: 1) leftImg8bit_sequence_trainvaltest.zip, 2) camera_trainvaltest.zip. You will probably need to contact the administrators to be able to get them. Then run the following command:
python3 data/prepare_train_data.py /path/to/cityscapes/dataset/ --dataset-format 'cityscapes' --dump-root /path/to/resulting/formatted/data/ --width 416 --height 171 --num-threads 4
Notice that for Cityscapes the img_height is set to 171 because we crop out the bottom part of the image that contains the car logo, and the resulting image will have height 128.
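The relation between 171 and 128 can be checked with a line of arithmetic. This sketch assumes the bottom quarter of the resized image is discarded (a keep ratio of 0.75); that ratio is an assumption chosen because it reproduces the two numbers above, and `cropped_height` is a hypothetical helper.

```python
def cropped_height(resized_height, keep_ratio=0.75):
    """Height left after cropping the bottom of the image (keep_ratio is assumed)."""
    return int(resized_height * keep_ratio)


print(cropped_height(171))
```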
Training
Once the data are formatted following the above instructions, you should be able to train the model by running the following command
python3 train.py /path/to/the/formatted/data/ -b4 -m0.2 -s0.1 --epoch-size 3000 --sequence-length 3 --log-output [--with-gt]
You can then start a tensorboard session in this folder by
tensorboard --logdir=checkpoints/
and visualize the training progress by opening http://localhost:6006 in your browser. If everything is set up properly, you should start seeing reasonable depth predictions after ~30K iterations when training on KITTI.
Evaluation
Disparity map generation can be done with run_inference.py
python3 run_inference.py --pretrained /path/to/dispnet --dataset-dir /path/pictures/dir --output-dir /path/to/output/dir
This will run inference on all pictures inside dataset-dir and save a jpg of disparity (or depth) to output-dir for each one. See the script help (-h) for more options.
Disparity evaluation is available:
python3 test_disp.py --pretrained-dispnet /path/to/dispnet --pretrained-posenet /path/to/posenet --dataset-dir /path/to/KITTI_raw --dataset-list /path/to/test_files_list
The test file list is available in the kitti eval folder. For a fair comparison with the original paper's evaluation code, don't specify a posenet. If you do specify one, it will be used to resolve the scale factor ambiguity; the only ground truth used is then the vehicle speed, which is a far more realistic assumption for measuring quality in real conditions, but you will obviously get worse results.
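The two ways of resolving the scale ambiguity can be sketched as below. Both functions are hypothetical illustrations, not the code in test_disp.py. The first assumes the common protocol of matching the median of the prediction to the median of the ground truth. The second assumes the scale is the ratio between the distance actually travelled (speed times frame interval) and the norm of the predicted translation.

```python
import math
from statistics import median


def median_scale_factor(pred_depths, gt_depths):
    """Scale that aligns the median predicted depth with the median ground truth."""
    return median(gt_depths) / median(pred_depths)


def speed_scale_factor(pred_translation, speed_mps, frame_interval_s):
    """Scale from vehicle speed: true displacement over predicted displacement."""
    true_displacement = speed_mps * frame_interval_s
    pred_displacement = math.sqrt(sum(t * t for t in pred_translation))
    return true_displacement / pred_displacement


pred = [0.5, 1.0, 2.0]   # unscaled network output
gt = [5.0, 10.0, 20.0]   # metric depths at the same pixels
print(median_scale_factor(pred, gt))
print(speed_scale_factor([0.0, 0.0, 0.1], speed_mps=10.0, frame_interval_s=0.1))
```

The median variant uses ground-truth depth for every test image, while the speed variant needs only a speedometer reading, which is why it is the harsher but more realistic test.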
Pose evaluation is also available on the Odometry dataset. Be sure to download both color images and poses!
python3 test_pose.py /path/to/posenet --dataset-dir /path/to/KITTI_odometry --sequences [09]
ATE (Absolute Trajectory Error) is computed along with RE (Rotation Error). The RE between R1 and R2 is defined as the angle of R1*R2^-1 when converted to axis/angle. It corresponds to RE = arccos( (trace(R1 @ R2^-1) - 1) / 2).
While ATE is often said to be enough for trajectory estimation, RE seems important here as sequences are only seq_length frames long.
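The RE formula can be checked numerically with a short, dependency-free sketch. The helper names `rotation_error` and `rot_z` are hypothetical, and matrices are plain nested lists. The sketch relies on the fact that the inverse of a rotation matrix is its transpose.

```python
import math


def rotation_error(R1, R2):
    """Angle in radians of R1 @ R2^-1 for two 3x3 rotation matrices (nested lists)."""
    # For rotations R2^-1 == R2^T, so trace(R1 @ R2^T) = sum_ij R1[i][j] * R2[i][j]
    trace = sum(R1[i][j] * R2[i][j] for i in range(3) for j in range(3))
    # Clamp to [-1, 1] to guard against round-off before arccos
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.acos(cos_angle)


def rot_z(angle):
    """Rotation of `angle` radians about the z axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


# Two rotations about the same axis differ by the difference of their angles
print(rotation_error(rot_z(0.3), rot_z(0.1)))
```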
Pretrained Nets
Arguments used :
python3 train.py /path/to/the/formatted/data/ -b4 -m0 -s2.0 --epoch-size 1000 --sequence-length 5 --log-output --with-gt
Depth Results
| Abs Rel | Sq Rel | RMSE | RMSE(log) | Acc.1 | Acc.2 | Acc.3 |
|---|---|---|---|---|---|---|
| 0.181 | 1.341 | 6.236 | 0.262 | 0.733 | 0.901 | 0.964 |
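For readers unfamiliar with the column names, the sketch below computes these quantities using the definitions conventionally used in monocular depth benchmarks. The README does not spell the definitions out, so treat them as an assumption. In particular, Acc.k is taken to be the fraction of pixels whose ratio max(pred/gt, gt/pred) is below 1.25^k, and `depth_metrics` is a hypothetical helper operating on flat lists of positive depths.

```python
import math


def depth_metrics(pred, gt):
    """Conventional depth error metrics over paired, strictly positive depth values."""
    n = len(gt)
    pairs = list(zip(pred, gt))
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / n
    sq_rel = sum((p - g) ** 2 / g for p, g in pairs) / n
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)
    rmse_log = math.sqrt(sum((math.log(p) - math.log(g)) ** 2 for p, g in pairs) / n)
    # Accuracy under threshold: share of pixels within a factor 1.25**k of the truth
    ratios = [max(p / g, g / p) for p, g in pairs]
    acc = [sum(r < 1.25 ** k for r in ratios) / n for k in (1, 2, 3)]
    return {
        "abs_rel": abs_rel, "sq_rel": sq_rel, "rmse": rmse, "rmse_log": rmse_log,
        "acc1": acc[0], "acc2": acc[1], "acc3": acc[2],
    }


print(depth_metrics(pred=[1.0, 2.0, 4.0], gt=[1.0, 2.0, 2.0]))
```

Lower is better for the four error columns, and higher is better for the three accuracy columns.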
Pose Results
5-frame snippets were used.
| | Seq. 09 | Seq. 10 |
|---|---|---|
| ATE | 0.0179 (std. 0.0110) | 0.0141 (std. 0.0115) |
| RE | 0.0018 (std. 0.0009) | 0.0018 (std. 0.0011) |
Discussion
Here I try to link the issues that I think raised interesting questions about scale factor, pose inference, and training hyperparameters
- Issue 48 : Why is target frame at the center of the sequence ?
- Issue 39 : Getting pose vector without the scale factor uncertainty
- Issue 46 : Is Interpolated groundtruth better than sparse groundtruth ?
- Issue 45 : How come the inverse warp is absolute and pose and depth are only relative ?
- Issue 32 : Discussion about validation set, and optimal batch size
- Issue 25 : Why filter out static frames ?
- Issue 24 : Filtering pixels out of the photometric loss
- Issue 60 : Inverse warp is only one way !
Other Implementations
TensorFlow by tinghuiz (original code, and paper author)