DROID-SLAM


Top Related Projects

ORB_SLAM2

9,851

Real-Time SLAM for Monocular, Stereo and RGB-D Cameras, with Loop Detection and Relocalization Capabilities

ORB_SLAM3

7,407

ORB-SLAM3: An Accurate Open-Source Library for Visual, Visual-Inertial and Multi-Map SLAM

VINS-Mono

5,380

A Robust and Versatile Monocular Visual-Inertial State Estimator

Quick Overview

DROID-SLAM is an open-source visual SLAM (Simultaneous Localization and Mapping) system developed by the Princeton Vision & Learning Lab. It uses deep learning to perform dense reconstruction and camera tracking from monocular, stereo, or RGB-D input.

Pros

High accuracy and robustness in challenging environments
Real-time performance on a single consumer GPU (the README lists 11G of GPU memory as the minimum for the demos)
Ability to handle both indoor and outdoor scenes
Integrates deep learning with traditional SLAM techniques

Cons

Requires a CUDA-capable GPU (at least 11G of memory for the demos, 24G for training)
May have higher computational requirements compared to traditional SLAM methods
Limited documentation for advanced customization
Dependency on specific deep learning frameworks

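Code Examples

The snippets in this section sketch a simplified interface for illustration; class and method names may not match the repository, whose demo.py is the reference entry point.

Initializing DROID-SLAM: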

from droid_slam import DROID

# Initialize DROID-SLAM
slam = DROID(device="cuda:0")

Processing a video sequence:

import cv2

# Open video file
video = cv2.VideoCapture("path/to/video.mp4")

while True:
    ret, frame = video.read()
    if not ret:
        break
    
    # Process frame with DROID-SLAM
    slam.track(frame)

# Finalize reconstruction
poses, points, colors = slam.get_map()

Visualizing the reconstructed map:

import open3d as o3d

# Create point cloud
pcd = o3d.geometry.PointCloud()
pcd.points = o3d.utility.Vector3dVector(points)
pcd.colors = o3d.utility.Vector3dVector(colors / 255.0)

# Visualize
o3d.visualization.draw_geometries([pcd])

Getting Started

Install dependencies:

pip install torch torchvision opencv-python open3d

Clone the repository (with submodules):

git clone --recursive https://github.com/princeton-vl/DROID-SLAM.git
cd DROID-SLAM

Install the third-party modules and DROID-SLAM:

pip install thirdparty/lietorch
pip install thirdparty/pytorch_scatter
pip install -e .

Download pre-trained weights:

./tools/download_model.sh

Run DROID-SLAM on a video:

from droid_slam import DROID
import cv2

slam = DROID(device="cuda:0")
video = cv2.VideoCapture("path/to/video.mp4")

while True:
    ret, frame = video.read()
    if not ret:
        break
    slam.track(frame)

poses, points, colors = slam.get_map()

Competitor Comparisons

ORB_SLAM2

Pros of ORB_SLAM2

Well-established and widely used in the robotics community
Efficient feature-based SLAM algorithm with loop closure
Supports monocular, stereo, and RGB-D cameras

Cons of ORB_SLAM2

May struggle in low-texture environments
Requires careful parameter tuning for optimal performance
Less robust to dynamic objects in the scene

Code Comparison

ORB_SLAM2 (C++):

// Feature extraction and matching
void Frame::ExtractORB(int flag, const cv::Mat &im)
{
    (*mpORBextractor)(im,cv::Mat(),mvKeys,mDescriptors);
}

DROID-SLAM (Python):

# Feature extraction using deep learning (method on the network class)
def extract_features(self, images):
    return self.fnet(images)

DROID-SLAM uses deep learning for feature extraction, potentially offering more robust performance in challenging environments. ORB_SLAM2 relies on traditional computer vision techniques, which can be faster but may be less adaptable to diverse scenes.

ORB_SLAM3

Pros of ORB_SLAM3

Well-established and widely used in the SLAM community
Supports various sensor inputs (monocular, stereo, RGB-D)
Real-time performance on CPU

Cons of ORB_SLAM3

Relies on handcrafted features, which may not be optimal for all environments
Can struggle in low-texture or dynamic scenes

Code Comparison

ORB_SLAM3:

// Feature extraction and matching
void Frame::ExtractORB(int flag, const cv::Mat &im)
{
    (*mpORBextractorLeft)(im, cv::Mat(), mvKeys, mDescriptors);
}

DROID-SLAM:

# Feature extraction using deep learning
def extract_features(self, images):
    return self.fnet(images)

ORB_SLAM3 uses traditional computer vision techniques for feature extraction, while DROID-SLAM leverages deep learning for this task. This fundamental difference affects their performance in various scenarios and computational requirements.

VINS-Mono

Pros of VINS-Mono

Utilizes both visual and inertial data for more robust state estimation
Designed specifically for aerial robotics applications
Includes loop closure for improved accuracy in long-term operation

Cons of VINS-Mono

May have higher computational requirements due to fusion of multiple sensor inputs
Potentially more complex setup and calibration process
Less suitable for pure visual odometry scenarios

Code Comparison

VINS-Mono (C++):

void FeatureManager::setDepth(const VectorXd &x)
{
    int feature_index = -1;
    for (auto &it_per_id : feature)
    {
        it_per_id.used_num = it_per_id.feature_per_frame.size();
        if (!(it_per_id.used_num >= 2 && it_per_id.start_frame < WINDOW_SIZE - 2))
            continue;

        it_per_id.estimated_depth = 1.0 / x(++feature_index);
    }
}

DROID-SLAM (Python):

def update(self, tstamp, image, depth=None, intrinsics=None):
    if intrinsics is not None:
        self.video.intrinsics[self.fidx] = intrinsics

    self.video.poses[self.fidx] = self.video.poses[self.fidx-1].clone()
    self.video.images[self.fidx] = image
    self.video.timestamps[self.fidx] = tstamp

    self.fidx += 1
    return self.fidx - 1


README

DROID-SLAM

DROID-SLAM: Deep Visual SLAM for Monocular, Stereo, and RGB-D Cameras
Zachary Teed and Jia Deng

@article{teed2021droid,
  title={{DROID-SLAM: Deep Visual SLAM for Monocular, Stereo, and RGB-D Cameras}},
  author={Teed, Zachary and Deng, Jia},
  journal={Advances in neural information processing systems},
  year={2021}
}

Requirements

To run the code you will need ...

Inference: Running the demos will require a GPU with at least 11G of memory.
Training: Training requires a GPU with at least 24G of memory. We train on 4 x RTX-3090 GPUs.
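
As a rough pre-flight check against the memory figures above, the following sketch compares a device's total memory with a threshold. It assumes that "11G" and "24G" mean decimal gigabytes (10^9 bytes), which the README does not specify, and the helper names are illustrative rather than part of DROID-SLAM.

```python
from typing import Optional

GB = 10**9  # assumption: the README's "G" is read as decimal gigabytes


def has_enough_memory(total_bytes: int, required_gb: float) -> bool:
    """Return True if a device with total_bytes of memory meets the threshold."""
    return total_bytes >= required_gb * GB


def local_gpu_bytes(index: int = 0) -> Optional[int]:
    """Total memory of a local CUDA device, or None if torch/CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_properties(index).total_memory


if __name__ == "__main__":
    total = local_gpu_bytes()
    if total is None:
        print("No CUDA device visible to PyTorch")
    else:
        print("inference (11G):", has_enough_memory(total, 11))
        print("training  (24G):", has_enough_memory(total, 24))
```

Total memory is only an upper bound on what is usable; other processes on the same GPU reduce the amount actually available.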

Getting Started

Clone the repo using the --recursive flag

git clone --recursive https://github.com/princeton-vl/DROID-SLAM.git

If you forgot --recursive

git submodule update --init --recursive .

Installing

Requires CUDA to be installed on your machine. If you run into issues, make sure the PyTorch and CUDA major versions match with the following check (minor version mismatch should be fine).

nvidia-smi
python -c "import torch; print(torch.version.cuda)"
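
The comparison the check above asks for can be sketched in a few lines of Python. This is a minimal illustration, assuming the two version strings (for example the value printed by torch.version.cuda and the CUDA version shown by nvidia-smi) are copied in by hand; the function names are hypothetical.

```python
import re
from typing import Optional


def cuda_major(version: Optional[str]) -> Optional[int]:
    """Extract the major number from a string such as '12.1'; None if absent."""
    if not version:
        return None
    match = re.match(r"\s*(\d+)", version)
    return int(match.group(1)) if match else None


def majors_match(torch_cuda: Optional[str], system_cuda: Optional[str]) -> bool:
    """True when both versions are known and share the same major number."""
    a, b = cuda_major(torch_cuda), cuda_major(system_cuda)
    return a is not None and a == b


if __name__ == "__main__":
    # Example values only; substitute the strings reported on your machine.
    print(majors_match("12.1", "12.4"))
    print(majors_match("11.8", "12.4"))
```

A CPU-only PyTorch build reports no CUDA version at all, which this sketch treats as a mismatch.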

python3 -m venv .venv
source .venv/bin/activate

# install requirements (tested up to torch 2.7)
pip install -r requirements.txt

# optional (for visualization)
pip install moderngl moderngl-window

# install third-party modules (this will take a while)
pip install thirdparty/lietorch
pip install thirdparty/pytorch_scatter

# install droid-backends
pip install -e .

Demos

Download the model from Google Drive (droid.pth) or with the provided script:
```
./tools/download_model.sh
```
Download some sample videos using the provided script.
```
./tools/download_sample_data.sh
```

Run the demo on any of the samples (all demos can be run on a GPU with 11G of memory). To save the reconstruction with full resolution depth maps use the --reconstruction_path flag. If you ran with --reconstruction_path my_reconstruction.pth, you can view the reconstruction in high resolution by running

python view_reconstruction.py my_reconstruction.pth

Asynchronous and Multi-GPU Inference: You can run the demos in asynchronous mode by running with --asynchronous. In this setting, the frontend and backend run in separate Python processes. You can additionally enable multi-GPU inference by setting the devices of the frontend and backend processes with the --frontend_device and --backend_device arguments.

Visualization currently doesn't work in the multi-GPU setting. You will need to run with --disable_vis.

Example demo commands for the sample data:

python demo.py --imagedir=data/sfm_bench/rgb --calib=calib/eth.txt

python demo.py --imagedir=data/mav0/cam0/data --calib=calib/euroc.txt --t0=150

python demo.py --imagedir=data/rgbd_dataset_freiburg3_cabinet/rgb --calib=calib/tum3.txt

Running on your own data: All you need is a calibration file. Calibration files are in the form

fx fy cx cy [k1 k2 p1 p2 [ k3 [ k4 k5 k6 ]]]

with parameters in brackets optional.
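
To make the format concrete, here is a minimal sketch of a parser for one calibration line. It assumes the values are whitespace-separated on a single line and that only the groupings implied by the brackets (4, 8, 9, or 12 values) are valid; parse_calibration is a hypothetical helper, not a function from the repository.

```python
from typing import List, Tuple


def parse_calibration(line: str) -> Tuple[List[List[float]], List[float]]:
    """Split a calibration line into a 3x3 intrinsics matrix and distortion terms."""
    values = [float(token) for token in line.split()]
    # fx fy cx cy, then optionally k1 k2 p1 p2, then k3, then k4 k5 k6
    if len(values) not in (4, 8, 9, 12):
        raise ValueError(f"expected 4, 8, 9, or 12 values, got {len(values)}")
    fx, fy, cx, cy = values[:4]
    intrinsics = [
        [fx, 0.0, cx],
        [0.0, fy, cy],
        [0.0, 0.0, 1.0],
    ]
    distortion = values[4:]  # empty list when no distortion terms are given
    return intrinsics, distortion


if __name__ == "__main__":
    # Made-up numbers for illustration, not taken from a real camera.
    K, dist = parse_calibration("500.0 510.0 320.0 240.0 0.1 -0.05 0.0 0.0")
    print(K)
    print(dist)
```

The ordering of the optional terms (k1 k2 p1 p2 k3 k4 k5 k6) is the same ordering OpenCV uses for its distortion coefficient vectors.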

Evaluation

We provide evaluation scripts for TartanAir, EuRoC, TUM, and ETH3D-SLAM. EuRoC and TUM can be run on a 1080Ti. The TartanAir and ETH3D-SLAM datasets will require 24G of memory.

Asynchronous and Multi-GPU Inference: You can run evaluation in asynchronous mode by running with --asynchronous. In this setting, the frontend and backend run in separate Python processes. You can additionally enable multi-GPU inference by setting the devices of the frontend and backend processes with the --frontend_device and --backend_device arguments. For example:

python evaluation_scripts/test_tartanair.py \
  --datapath data/tartanair_test/mono \
  --gt_path data/tartanair_test/mono_gt \
  --frontend_device cuda:0 \
  --backend_device cuda:1 \
  --asynchronous \
  --disable_vis

Note: Running with --asynchronous will typically produce better results, but this mode is not deterministic.
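
When launching many runs from a script, the flag combinations above can be assembled programmatically. The sketch below is a hypothetical helper that uses only the flags shown in this README; it adds --disable_vis whenever the frontend and backend are placed on different devices, following the note that visualization does not work in the multi-GPU setting.

```python
from typing import List, Optional


def build_eval_command(
    script: str,
    datapath: str,
    gt_path: str,
    frontend_device: Optional[str] = None,
    backend_device: Optional[str] = None,
    asynchronous: bool = False,
    disable_vis: bool = False,
) -> List[str]:
    """Assemble an argument list suitable for subprocess.run."""
    cmd = ["python", script, "--datapath", datapath, "--gt_path", gt_path]
    multi_gpu = (
        frontend_device is not None
        and backend_device is not None
        and frontend_device != backend_device
    )
    if frontend_device:
        cmd += ["--frontend_device", frontend_device]
    if backend_device:
        cmd += ["--backend_device", backend_device]
    if asynchronous:
        cmd.append("--asynchronous")
    if disable_vis or multi_gpu:
        # Visualization is not supported across multiple GPUs.
        cmd.append("--disable_vis")
    return cmd


if __name__ == "__main__":
    print(" ".join(build_eval_command(
        "evaluation_scripts/test_tartanair.py",
        "data/tartanair_test/mono",
        "data/tartanair_test/mono_gt",
        frontend_device="cuda:0",
        backend_device="cuda:1",
        asynchronous=True,
    )))
```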

TartanAir (Mono + Stereo)

Download the TartanAir test set with this command.

./tools/download_tartanair_test.sh

Or from these links: Images, Groundtruth

Monocular evaluation:

python evaluation_scripts/test_tartanair.py \
  --datapath datasets/tartanair_test/mono \
  --gt_path datasets/tartanair_test/mono_gt \
  --disable_vis

Stereo evaluation:

python evaluation_scripts/test_tartanair.py \
  --datapath datasets/tartanair_test/stereo \
  --gt_path datasets/tartanair_test/stereo_gt \
  --stereo --disable_vis

Evaluating on the validation split:

Download the TartanAir dataset using the script thirdparty/tartanair_tools/download_training.py and put it in datasets/TartanAir

# monocular eval
./tools/validate_tartanair.sh --plot_curve

# stereo eval
./tools/validate_tartanair.sh --plot_curve --stereo

EuRoC (Mono + Stereo)

Download the EuRoC sequences (ASL format):

./tools/download_euroc.sh

Then run evaluation:

# monocular eval (single gpu)
./tools/evaluate_euroc.sh

# monocular eval (multi gpu)
./tools/evaluate_euroc.sh --asynchronous --frontend_device cuda:0 --backend_device cuda:1

# stereo eval (single gpu)
./tools/evaluate_euroc.sh --stereo

# stereo eval (multi gpu)
./tools/evaluate_euroc.sh --stereo --asynchronous --frontend_device cuda:0 --backend_device cuda:1

TUM-RGBD (Mono)

Download the TUM-RGBD sequences:

./tools/download_tum.sh

Then run evaluation:

# monocular eval (single gpu)
./tools/evaluate_tum.sh

# monocular eval (multi gpu)
./tools/evaluate_tum.sh --asynchronous --frontend_device cuda:0 --backend_device cuda:1

ETH3D (RGB-D)

Download the ETH3D dataset:

./tools/download_eth3d.sh

# RGB-D eval (single gpu)
./tools/evaluate_eth3d.sh > eth3d_results.txt
python evaluation_scripts/parse_results.py eth3d_results.txt

# RGB-D eval (multi gpu)
./tools/evaluate_eth3d.sh --asynchronous --frontend_device cuda:0 --backend_device cuda:1 > eth3d_results_async.txt
python evaluation_scripts/parse_results.py eth3d_results_async.txt

Training

First download the TartanAir dataset. The download script can be found in thirdparty/tartanair_tools/download_training.py. You will only need the rgb and depth data.

python download_training.py --rgb --depth

You can then run the training script. We use 4 x RTX-3090 GPUs for training, which takes approximately 1 week. If you use a different number of GPUs, adjust the learning rate accordingly.

Note: On the first training run, covisibility is computed between all pairs of frames. This can take several hours, but the results are cached so that future training runs will start immediately.

python train.py --datapath=<path to tartanair> --gpus=4 --lr=0.00025
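
The README does not say how the learning rate should be adjusted for a different GPU count. One common heuristic is linear scaling with the number of GPUs, anchored at the 4-GPU setting shown above; the sketch below assumes that rule and is not the authors' stated method.

```python
BASE_GPUS = 4        # GPU count used in the README's training command
BASE_LR = 0.00025    # learning rate used in the README's training command


def scaled_lr(num_gpus: int, base_lr: float = BASE_LR, base_gpus: int = BASE_GPUS) -> float:
    """Linear-scaling heuristic (an assumption): lr grows in proportion to GPU count."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return base_lr * num_gpus / base_gpus


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        print(f"--gpus={n} --lr={scaled_lr(n):.6g}")
```

Treat the result as a starting point and validate it empirically, since the best learning rate also depends on the effective batch size.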

Acknowledgements

Data from TartanAir was used to train our model. We additionally use evaluation tools from evo and tartanair_tools.
