detr

End-to-End Object Detection with Transformers


Top Related Projects

transformers

146,142

🤗 Transformers: the model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal models, for both inference and training.

yolov5

54,362

YOLOv5 🚀 in PyTorch > ONNX > CoreML > TFLite

models

77,618

Models and examples built with TensorFlow

Mask_RCNN

25,251

Mask R-CNN for object detection and instance segmentation on Keras and TensorFlow

detectron2

32,239

Detectron2 is a platform for object detection, segmentation and other visual recognition tasks.

mmdetection

31,487

OpenMMLab Detection Toolbox and Benchmark

Quick Overview

DETR (DEtection TRansformer) is an end-to-end object detection model developed by Facebook AI Research. It uses a transformer architecture to directly predict object bounding boxes and classes, eliminating the need for many hand-designed components like anchor generation and non-maximum suppression.

Pros

Simplifies the object detection pipeline by removing the need for many hand-designed components
Achieves competitive performance with state-of-the-art object detection systems
Easily extensible to more complex tasks like panoptic segmentation
Provides a more interpretable object detection process

Cons

May require more computational resources compared to traditional object detection methods
Performance on small objects can be suboptimal compared to some other methods
Training can be slower than some traditional object detection approaches
Requires a good understanding of transformer architectures for fine-tuning and customization

Code Examples

Loading a pre-trained DETR model:

import torch

# Download the DETR ResNet-50 model and its pretrained weights via torch hub
model = torch.hub.load('facebookresearch/detr', 'detr_resnet50', pretrained=True)
model.eval()  # inference mode: disables dropout

Performing inference on an image:

from PIL import Image
import torchvision.transforms as T

transform = T.Compose([
    T.Resize(800),
    T.ToTensor(),
    T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])

img = Image.open('path/to/image.jpg').convert('RGB')  # force 3 channels to match Normalize
img_tensor = transform(img).unsqueeze(0)  # add a batch dimension: [1, 3, H, W]

with torch.no_grad():
    outputs = model(img_tensor)

Visualizing the results:

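from matplotlib import pyplot as plt

# Drop the last ("no object") class, then keep queries with confidence above 0.7
probas = outputs['pred_logits'].softmax(-1)[0, :, :-1]
keep = probas.max(-1).values > 0.7

# rescale_bboxes and plot_results are helper functions defined in the DETR
# hands-on Colab notebook; they are not part of torch or torchvision
bboxes_scaled = rescale_bboxes(outputs['pred_boxes'][0, keep], img.size)
plot_results(img, probas[keep], bboxes_scaled)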
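
The snippet above calls a `rescale_bboxes` helper. DETR predicts boxes as normalized (center x, center y, width, height) values in [0, 1], so a helper of this kind has to convert them to corner coordinates in pixels. The following is a minimal pure-Python sketch of that arithmetic. The function name `cxcywh_to_xyxy_pixels` is hypothetical, and it works on plain lists rather than tensors, so it illustrates the conversion rather than reproducing the notebook's helper.

```python
def cxcywh_to_xyxy_pixels(boxes, image_size):
    """Convert normalized (cx, cy, w, h) boxes to pixel (x0, y0, x1, y1) boxes.

    boxes: iterable of 4-number sequences with values in [0, 1]
    image_size: (width, height) in pixels, as returned by PIL's Image.size
    """
    img_w, img_h = image_size
    converted = []
    for cx, cy, w, h in boxes:
        x0 = (cx - 0.5 * w) * img_w  # left edge
        y0 = (cy - 0.5 * h) * img_h  # top edge
        x1 = (cx + 0.5 * w) * img_w  # right edge
        y1 = (cy + 0.5 * h) * img_h  # bottom edge
        converted.append((x0, y0, x1, y1))
    return converted


# A box centered in a 640x480 image, covering half of its width and height
print(cxcywh_to_xyxy_pixels([(0.5, 0.5, 0.5, 0.5)], (640, 480)))
```

To feed it model output, the kept rows of `outputs['pred_boxes']` would first need to be converted to a list, for example with `.tolist()`.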

Getting Started

To get started with DETR:

Install the required dependencies:

pip install torch torchvision

Clone the repository:

git clone https://github.com/facebookresearch/detr.git
cd detr

Run inference on an image:

import torch
import requests
import torchvision.transforms as T
from PIL import Image

model = torch.hub.load('facebookresearch/detr', 'detr_resnet50', pretrained=True)
model.eval()  # inference mode: disables dropout

url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
im = Image.open(requests.get(url, stream=True).raw).convert('RGB')

# standard PyTorch mean-std input image normalization
transform = T.Compose([
    T.Resize(800),
    T.ToTensor(),
    T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])

# run inference without tracking gradients
with torch.no_grad():
    outputs = model(transform(im).unsqueeze(0))
print(outputs['pred_logits'], outputs['pred_boxes'])
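
The raw `pred_logits` printed above are unnormalized scores for each object query, with the last entry standing for "no object" (which is why the visualization example drops it with `[:-1]`). The sketch below shows, in plain Python and on made-up numbers, how a softmax followed by a confidence threshold turns such scores into detections. The function name `keep_confident` and the tiny two-class example are assumptions for illustration only.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def keep_confident(query_logits, threshold=0.7):
    """Return (query_index, class_index, probability) for confident queries.

    query_logits holds one list of logits per object query; the last logit
    of each list is treated as the "no object" class and is never reported.
    """
    detections = []
    for index, logits in enumerate(query_logits):
        probs = softmax(logits)[:-1]  # drop the "no object" probability
        best_class = max(range(len(probs)), key=probs.__getitem__)
        if probs[best_class] > threshold:
            detections.append((index, best_class, probs[best_class]))
    return detections


# Two queries, two real classes plus "no object"; values are illustrative
example = [
    [4.0, 0.0, 0.5],  # confident about class 0
    [0.1, 0.2, 3.0],  # mostly "no object"
]
print(keep_confident(example))
```

Only the first query survives the threshold here; the second one puts most of its probability on "no object" and is discarded.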

Competitor Comparisons

transformers

146,142

🤗 Transformers: the model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal models, for both inference and training.

Pros of transformers

Broader scope: Covers a wide range of NLP tasks and models
Extensive documentation and community support
Regular updates and new model implementations

Cons of transformers

Larger codebase, potentially more complex to navigate
May have higher computational requirements for some tasks
Less specialized for computer vision tasks compared to DETR

Code comparison

DETR (object detection):

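model = torch.hub.load('facebookresearch/detr', 'detr_resnet50', pretrained=True)
img = Image.open(image_path)
# the model expects a normalized, batched tensor rather than a PIL image;
# `transform` is the Resize/ToTensor/Normalize pipeline from the inference example above
outputs = model(transform(img).unsqueeze(0))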

transformers (text classification):

from transformers import pipeline
classifier = pipeline("sentiment-analysis")
result = classifier("I love this movie!")[0]
print(f"Label: {result['label']}, Score: {result['score']:.4f}")

Summary

transformers offers a comprehensive suite of NLP tools and models, with strong community support and frequent updates. It's ideal for a wide range of text-based tasks. DETR, on the other hand, is more specialized for object detection in computer vision. While transformers has a broader scope, it may be more complex and resource-intensive for some applications compared to the more focused DETR.

yolov5

54,362

YOLOv5 🚀 in PyTorch > ONNX > CoreML > TFLite

Pros of YOLOv5

Faster inference speed and real-time performance
Easier to deploy and use in production environments
More extensive documentation and community support

Cons of YOLOv5

Potentially lower accuracy than DETR, especially in complex scenes
Less flexible architecture, primarily designed for object detection

Code Comparison

DETR (PyTorch):

model = torch.hub.load('facebookresearch/detr', 'detr_resnet50', pretrained=True)
model.eval()
outputs = model(images)

YOLOv5 (PyTorch):

model = torch.hub.load('ultralytics/yolov5', 'yolov5s')
results = model(images)
results.print()

Both DETR and YOLOv5 are popular object detection models, but they have different strengths and use cases. DETR offers a novel approach using transformers, potentially achieving higher accuracy, especially for complex scenes. However, YOLOv5 excels in speed and ease of use, making it more suitable for real-time applications and deployment on edge devices. The choice between the two depends on the specific requirements of the project, balancing factors such as accuracy, speed, and deployment constraints.

models

77,618

Models and examples built with TensorFlow

Pros of models

Broader scope: Covers a wide range of machine learning models and applications
Extensive documentation and tutorials for various use cases
Active community and frequent updates from Google's TensorFlow team

Cons of models

Larger and more complex repository, potentially overwhelming for beginners
Less focused on a specific task compared to DETR's object detection specialization
May require more setup and configuration for specific use cases

Code Comparison

DETR (PyTorch):

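import torch.nn as nn

class DETR(nn.Module):
    def __init__(self, backbone, transformer, num_classes, num_queries):
        super().__init__()
        self.backbone = backbone
        self.transformer = transformer
        hidden_dim = transformer.d_model  # width of the transformer embeddings
        # one extra output for the "no object" class
        self.class_embed = nn.Linear(hidden_dim, num_classes + 1)
        # MLP is a small feed-forward helper from the DETR codebase (3 layers, 4 box coordinates)
        self.bbox_embed = MLP(hidden_dim, hidden_dim, 4, 3)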

models (TensorFlow):

class DetectionModel(tf.keras.Model):
    def __init__(self, num_classes):
        super(DetectionModel, self).__init__()
        self.backbone = tf.keras.applications.ResNet50(include_top=False)
        self.conv = tf.keras.layers.Conv2D(num_classes, 3, padding='same')
        self.reshape = tf.keras.layers.Reshape((-1, num_classes))

Mask_RCNN

25,251

Mask R-CNN for object detection and instance segmentation on Keras and TensorFlow

Pros of Mask_RCNN

Well-established and widely adopted in the computer vision community
Provides instance segmentation in addition to object detection
Extensive documentation and community support

Cons of Mask_RCNN

Generally slower inference time compared to DETR
More complex architecture, potentially harder to modify or extend
May require more computational resources for training and inference

Code Comparison

Mask_RCNN:

import mrcnn.model as modellib

model = modellib.MaskRCNN(mode="inference", config=config, model_dir=MODEL_DIR)
model.load_weights(WEIGHTS_PATH, by_name=True)
results = model.detect([image], verbose=1)

DETR:

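import torch
from models import build_model

# build_model returns the model, the loss criterion and the post-processors
model, criterion, postprocessors = build_model(args)
checkpoint = torch.load(args.resume, map_location='cpu')
model.load_state_dict(checkpoint['model'])
outputs = model(images)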

Both repositories provide powerful object detection capabilities, but they differ in their approaches and features. Mask_RCNN offers instance segmentation and has a more established presence in the community, while DETR introduces a novel end-to-end approach with potentially faster inference times and a simpler architecture.

detectron2

32,239

Detectron2 is a platform for object detection, segmentation and other visual recognition tasks.

Pros of Detectron2

More comprehensive and feature-rich, offering a wider range of object detection and segmentation models
Better documentation and community support, making it easier for beginners to get started
Faster training and inference times for many models

Cons of Detectron2

Steeper learning curve due to its extensive features and options
Larger codebase, which can be more challenging to customize or modify
Potentially higher computational requirements for some models

Code Comparison

Detectron2:

from detectron2.engine import DefaultPredictor
from detectron2.config import get_cfg

cfg = get_cfg()
cfg.merge_from_file("path/to/config.yaml")
predictor = DefaultPredictor(cfg)
outputs = predictor(image)

DETR:

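import torch

# DETRDemo is the minimal model class defined in the standalone Colab notebook;
# it is not an importable package, so copy its definition before running this
model = DETRDemo(num_classes=91)
model.load_state_dict(torch.load("path/to/weights.pth", map_location="cpu"))
model.eval()
outputs = model(image)  # image: a normalized tensor of shape [1, 3, H, W]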

Both repositories provide powerful object detection capabilities, but Detectron2 offers a more comprehensive toolkit with a wider range of models and features. DETR, on the other hand, introduces a novel end-to-end approach using transformers, which can be simpler to understand and implement for specific use cases.

mmdetection

31,487

OpenMMLab Detection Toolbox and Benchmark

Pros of mmdetection

Extensive model zoo with a wide variety of pre-trained models
Modular design allowing easy customization and extension
Comprehensive documentation and tutorials

Cons of mmdetection

Steeper learning curve due to its complexity
Potentially slower inference compared to DETR's end-to-end approach

Code Comparison

mmdetection:

from mmdet.apis import init_detector, inference_detector

config_file = 'configs/faster_rcnn/faster_rcnn_r50_fpn_1x_coco.py'
checkpoint_file = 'checkpoints/faster_rcnn_r50_fpn_1x_coco_20200130-047c8118.pth'
model = init_detector(config_file, checkpoint_file, device='cuda:0')
result = inference_detector(model, 'test.jpg')

DETR:

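import torch
from models import build_model

# build_model takes the parsed command-line arguments from main.py (args),
# not a num_classes keyword, and returns three values
model, _, postprocessors = build_model(args)
checkpoint = torch.load('path/to/checkpoint.pth', map_location='cpu')
model.load_state_dict(checkpoint['model'])
model.eval()
outputs = model(images)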

Both repositories offer powerful object detection capabilities, but mmdetection provides a more comprehensive toolkit with multiple algorithms, while DETR focuses on a novel end-to-end approach using transformers.


README

DE⫶TR: End-to-End Object Detection with Transformers

PyTorch training code and pretrained models for DETR (DEtection TRansformer). We replace the full complex hand-crafted object detection pipeline with a Transformer, and match Faster R-CNN with a ResNet-50, obtaining 42 AP on COCO using half the computation power (FLOPs) and the same number of parameters. Inference in 50 lines of PyTorch.

DETR

What it is. Unlike traditional computer vision techniques, DETR approaches object detection as a direct set prediction problem. It consists of a set-based global loss, which forces unique predictions via bipartite matching, and a Transformer encoder-decoder architecture. Given a fixed small set of learned object queries, DETR reasons about the relations of the objects and the global image context to directly output the final set of predictions in parallel. Due to this parallel nature, DETR is very fast and efficient.
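
To make the bipartite matching step concrete, the sketch below assigns each ground-truth object to a distinct prediction so that the total matching cost is minimal. It is an illustration only: the cost values are made up, the function name `match_predictions` is hypothetical, and it uses a brute-force search over permutations rather than an efficient assignment solver such as the Hungarian algorithm, so it is only practical for tiny examples.

```python
from itertools import permutations


def match_predictions(cost):
    """Return the minimum-cost one-to-one assignment of targets to predictions.

    cost[i][j] is the cost of matching prediction i to target j.
    Assumes there are at least as many predictions as targets.
    Returns (assignment, total_cost), where assignment[j] is the index of
    the prediction matched to target j.
    """
    num_preds = len(cost)
    num_targets = len(cost[0])
    best_assignment, best_cost = None, float("inf")
    # Try every way of picking a distinct prediction for each target
    for candidate in permutations(range(num_preds), num_targets):
        total = sum(cost[pred][tgt] for tgt, pred in enumerate(candidate))
        if total < best_cost:
            best_assignment, best_cost = candidate, total
    return best_assignment, best_cost


# 3 predictions (rows) and 2 ground-truth objects (columns); values are illustrative
cost = [
    [0.9, 0.2],
    [0.1, 0.8],
    [0.5, 0.6],
]
assignment, total = match_predictions(cost)
print(assignment, total)
```

In this toy example target 0 is matched to prediction 1 and target 1 to prediction 0, while prediction 2 stays unmatched. Because every target gets exactly one prediction, duplicate detections of the same object are penalized by the loss.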

About the code. We believe that object detection should not be more difficult than classification, and should not require complex libraries for training and inference. DETR is very simple to implement and experiment with, and we provide a standalone Colab Notebook showing how to do inference with DETR in only a few lines of PyTorch code. Training code follows this idea - it is not a library, but simply a main.py importing model and criterion definitions with standard training loops.

Additionally, we provide a Detectron2 wrapper in the d2/ folder. See the readme there for more information.

For details see End-to-End Object Detection with Transformers by Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko.

See our blog post to learn more about end to end object detection with transformers.

Model Zoo

We provide baseline DETR and DETR-DC5 models, and plan to include more in the future. AP is computed on COCO 2017 val5k, and inference time is measured over the first 100 val5k COCO images, using a TorchScript transformer.

name	backbone	schedule (epochs)	inf_time (s)	box AP	url	size
DETR	R50	500	0.036	42.0	model | logs	159 MB
DETR-DC5	R50	500	0.083	43.3	model | logs	159 MB
DETR	R101	500	0.050	43.5	model | logs	232 MB
DETR-DC5	R101	500	0.097	44.9	model | logs	232 MB
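
As a reading aid for the inf_time column, per-image latency can be converted to throughput by taking its reciprocal. The sketch below does this for the four rows of the table, assuming inf_time is expressed in seconds per image; the dictionary layout and names are chosen for illustration only.

```python
# Per-image inference time in seconds, copied from the table above
inf_time = {
    "DETR R50": 0.036,
    "DETR-DC5 R50": 0.083,
    "DETR R101": 0.050,
    "DETR-DC5 R101": 0.097,
}

# Throughput in images per second is the reciprocal of the latency
fps = {name: 1.0 / seconds for name, seconds in inf_time.items()}

for name, value in fps.items():
    print(f"{name}: {value:.1f} images/s")
```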

COCO val5k evaluation results can be found in this gist.

The models are also available via torch hub. To load DETR R50 with pretrained weights, simply run:

model = torch.hub.load('facebookresearch/detr:main', 'detr_resnet50', pretrained=True)

COCO panoptic val5k models:

name	backbone	box AP	segm AP	PQ	url	size
DETR	R50	38.8	31.1	43.4	download	165 MB
DETR-DC5	R50	40.2	31.9	44.6	download	165 MB
DETR	R101	40.1	33.0	45.1	download	237 MB

Check out our panoptic colab to see how to use and visualize DETR's panoptic segmentation prediction.

Notebooks

We provide a few notebooks in colab to help you get a grasp on DETR:

DETR's hands on Colab Notebook: Shows how to load a model from hub, generate predictions, then visualize the attention of the model (similar to the figures of the paper)
Standalone Colab Notebook: In this notebook, we demonstrate how to implement a simplified version of DETR from the ground up in 50 lines of Python, then visualize the predictions. It is a good starting point if you want to gain a better understanding of the architecture and poke around before diving into the codebase.
Panoptic Colab Notebook: Demonstrates how to use DETR for panoptic segmentation and plot the predictions.

Usage - Object detection

There are no extra compiled components in DETR and package dependencies are minimal, so the code is very simple to use. The instructions below install the dependencies via conda. First, clone the repository locally:

git clone https://github.com/facebookresearch/detr.git

Then, install PyTorch 1.5+ and torchvision 0.6+:

conda install -c pytorch pytorch torchvision

Install pycocotools (for evaluation on COCO) and scipy (for training):

conda install cython scipy
pip install -U 'git+https://github.com/cocodataset/cocoapi.git#subdirectory=PythonAPI'

That's it: you should now be able to train and evaluate detection models.

(Optional) To work with panoptic segmentation, install panopticapi:

pip install git+https://github.com/cocodataset/panopticapi.git

Data preparation

Download and extract COCO 2017 train and val images with annotations from http://cocodataset.org. We expect the directory structure to be the following:

path/to/coco/
  annotations/  # annotation json files
  train2017/    # train images
  val2017/      # val images
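
Before launching training, it can help to confirm that the dataset folder matches this layout. The following is a minimal sketch that checks only for the three sub-folders listed above; the helper name `missing_coco_dirs` is hypothetical and not part of the DETR codebase.

```python
from pathlib import Path

# Sub-folders expected under the COCO root, as listed above
EXPECTED_DIRS = ("annotations", "train2017", "val2017")


def missing_coco_dirs(coco_root):
    """Return the expected sub-folders that are absent under coco_root."""
    root = Path(coco_root)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_coco_dirs("path/to/coco")
    if missing:
        print("Missing folders:", ", ".join(missing))
    else:
        print("COCO directory layout looks complete")
```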

Training

To train baseline DETR on a single node with 8 GPUs for 300 epochs, run:

python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --coco_path /path/to/coco

A single epoch takes 28 minutes, so 300 epochs of training take around 6 days on a single machine with 8 V100 cards. To ease reproduction of our results, we provide results and training logs for a 150-epoch schedule (3 days on a single machine), achieving 39.5/60.3 AP/AP50.
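
The training-time estimates follow directly from the per-epoch figure. A short sketch of the arithmetic, assuming the 28 minutes per epoch quoted above stays constant over the whole run (the function name `training_days` is illustrative):

```python
MINUTES_PER_EPOCH = 28  # reported for a single machine with 8 V100 cards


def training_days(epochs, minutes_per_epoch=MINUTES_PER_EPOCH):
    """Convert an epoch count into wall-clock days of training."""
    return epochs * minutes_per_epoch / (60 * 24)


print(f"300 epochs: {training_days(300):.1f} days")
print(f"150 epochs: {training_days(150):.1f} days")
```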

We train DETR with AdamW, setting the learning rate to 1e-4 in the transformer and 1e-5 in the backbone. Horizontal flips, scales and crops are used for augmentation. Images are rescaled to have a minimum size of 800 and a maximum size of 1333. The transformer is trained with a dropout of 0.1, and the whole model is trained with gradient clipping of 0.1.
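
The rescaling rule can be read as: scale the shorter side to 800 while keeping the aspect ratio, unless that would push the longer side past 1333, in which case the longer side is set to 1333 instead. A minimal sketch of that rule under this reading follows; the function name `resized_size` is hypothetical, and the exact rounding in the repository's transform may differ.

```python
def resized_size(width, height, min_size=800, max_size=1333):
    """Scale (width, height) so the shorter side is min_size, capped by max_size.

    The aspect ratio is preserved. If scaling the shorter side to min_size
    would push the longer side past max_size, the longer side is set to
    max_size instead.
    """
    short, long = min(width, height), max(width, height)
    scale = min_size / short
    if long * scale > max_size:
        scale = max_size / long
    return round(width * scale), round(height * scale)


print(resized_size(640, 480))   # landscape image: shorter side reaches 800
print(resized_size(2000, 500))  # very wide image: longer side capped at 1333
```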

Evaluation

To evaluate DETR R50 on COCO val5k with a single GPU run:

python main.py --batch_size 2 --no_aux_loss --eval --resume https://dl.fbaipublicfiles.com/detr/detr-r50-e632da11.pth --coco_path /path/to/coco

We provide results for all DETR detection models in this gist. Note that numbers vary depending on batch size (number of images) per GPU. Non-DC5 models were trained with batch size 2, and DC5 with 1, so DC5 models show a significant drop in AP if evaluated with more than 1 image per GPU.

Multinode training

Distributed training is available via Slurm and submitit:

pip install submitit

Train the baseline DETR-6-6 model on 4 nodes for 300 epochs:

python run_with_submitit.py --timeout 3000 --coco_path /path/to/coco

Usage - Segmentation

We show that it is relatively straightforward to extend DETR to predict segmentation masks. We mainly demonstrate strong panoptic segmentation results.

Data preparation

For panoptic segmentation, you need the panoptic annotations in addition to the COCO dataset (see above for the COCO dataset). You need to download and extract the annotations. We expect the directory structure to be the following:

path/to/coco_panoptic/
  annotations/  # annotation json files
  panoptic_train2017/    # train panoptic annotations
  panoptic_val2017/      # val panoptic annotations

Training

We recommend training segmentation in two stages: first train DETR to detect all the boxes, and then train the segmentation head. For panoptic segmentation, DETR must learn to detect boxes for both stuff and things classes. You can train it on a single node with 8 GPUs for 300 epochs with:

python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --coco_path /path/to/coco  --coco_panoptic_path /path/to/coco_panoptic --dataset_file coco_panoptic --output_dir /output/path/box_model

For instance segmentation, you can simply train a normal box model (or use one of the pre-trained models we provide).

Once you have a box model checkpoint, you need to freeze it and train the segmentation head in isolation. For panoptic segmentation, you can train on a single node with 8 GPUs for 25 epochs:

python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --masks --epochs 25 --lr_drop 15 --coco_path /path/to/coco  --coco_panoptic_path /path/to/coco_panoptic  --dataset_file coco_panoptic --frozen_weights /output/path/box_model/checkpoint.pth --output_dir /output/path/segm_model

For instance segmentation only, simply remove the dataset_file and coco_panoptic_path arguments from the above command line.

License

DETR is released under the Apache 2.0 license. Please see the LICENSE file for more information.

Contributing

We actively welcome your pull requests! Please see CONTRIBUTING.md and CODE_OF_CONDUCT.md for more info.
