
facebookresearch/detr

End-to-End Object Detection with Transformers



Quick Overview

DETR (DEtection TRansformer) is an end-to-end object detection model developed by Facebook AI Research. It uses a transformer architecture to directly predict object bounding boxes and classes, eliminating the need for many hand-designed components like anchor generation and non-maximum suppression.

Pros

  • Simplifies the object detection pipeline by removing the need for many hand-designed components
  • Achieves competitive performance with state-of-the-art object detection systems
  • Easily extensible to more complex tasks like panoptic segmentation
  • Provides a more interpretable object detection process

Cons

  • May require more computational resources compared to traditional object detection methods
  • Performance on small objects can be suboptimal compared to some other methods
  • Training can be slower than some traditional object detection approaches
  • Requires a good understanding of transformer architectures for fine-tuning and customization

Code Examples

  1. Loading a pre-trained DETR model:
import torch

# download the ResNet-50 DETR weights from torch hub
model = torch.hub.load('facebookresearch/detr', 'detr_resnet50', pretrained=True)
model.eval()  # switch to inference mode
  2. Performing inference on an image:
from PIL import Image
import torchvision.transforms as T

transform = T.Compose([
    T.Resize(800),
    T.ToTensor(),
    T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])

img = Image.open('path/to/image.jpg')
img_tensor = transform(img).unsqueeze(0)

with torch.no_grad():
    outputs = model(img_tensor)
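  3. Visualizing the results:
# rescale_bboxes and plot_results are helper functions you need to define
# yourself (the DETR Colab notebooks contain helpers with these names)

# per-query class probabilities, dropping the trailing "no object" class
probas = outputs['pred_logits'].softmax(-1)[0, :, :-1]
# keep only predictions with confidence above 0.7
keep = probas.max(-1).values > 0.7

# convert the normalized boxes to pixel coordinates of the original image
bboxes_scaled = rescale_bboxes(outputs['pred_boxes'][0, keep], img.size)
plot_results(img, probas[keep], bboxes_scaled)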
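
The post-processing in the snippet above can be sketched without any framework. The following is a minimal illustration in plain Python, assuming that the last logit of each query is the "no object" class and that `pred_boxes` holds normalized (center_x, center_y, width, height) values. The function names and the toy inputs are hypothetical and are not taken from the repository.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cxcywh_to_xyxy_pixels(box, image_size):
    """Convert a normalized (cx, cy, w, h) box to pixel (x0, y0, x1, y1).

    image_size is (width, height), the same order as PIL's Image.size.
    """
    cx, cy, w, h = box
    img_w, img_h = image_size
    return (
        (cx - w / 2) * img_w,
        (cy - h / 2) * img_h,
        (cx + w / 2) * img_w,
        (cy + h / 2) * img_h,
    )

def keep_confident(logits_per_query, boxes_per_query, image_size, threshold=0.7):
    """Return (class_index, score, pixel_box) for every confident query."""
    kept = []
    for logits, box in zip(logits_per_query, boxes_per_query):
        probs = softmax(logits)[:-1]  # drop the trailing "no object" class
        score = max(probs)
        if score > threshold:
            pixel_box = cxcywh_to_xyxy_pixels(box, image_size)
            kept.append((probs.index(score), score, pixel_box))
    return kept

# Hypothetical input: two queries, two real classes plus "no object".
logits = [[4.0, 0.0, 0.0], [0.0, 0.0, 5.0]]
boxes = [(0.5, 0.5, 0.2, 0.4), (0.1, 0.1, 0.1, 0.1)]
detections = keep_confident(logits, boxes, image_size=(640, 480))
```

Only the first query survives the threshold here, because the second one puts most of its probability mass on "no object". With real model outputs the same steps would normally be done on tensors rather than on Python lists.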

Getting Started

To get started with DETR:

  1. Install the required dependencies:
pip install torch torchvision
  2. Clone the repository:
git clone https://github.com/facebookresearch/detr.git
cd detr
  3. Run inference on an image:
import torch
import requests
import torchvision.transforms as T
from PIL import Image

model = torch.hub.load('facebookresearch/detr', 'detr_resnet50', pretrained=True)
model.eval()  # disable dropout for inference

url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
im = Image.open(requests.get(url, stream=True).raw)

# standard PyTorch mean-std input image normalization
transform = T.Compose([
    T.Resize(800),
    T.ToTensor(),
    T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])

# run inference without tracking gradients
with torch.no_grad():
    outputs = model(transform(im).unsqueeze(0))
print(outputs['pred_logits'], outputs['pred_boxes'])

Competitor Comparisons

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.

Pros of transformers

  • Broader scope: Covers a wide range of NLP tasks and models
  • Extensive documentation and community support
  • Regular updates and new model implementations

Cons of transformers

  • Larger codebase, potentially more complex to navigate
  • May have higher computational requirements for some tasks
  • Less specialized for computer vision tasks compared to DETR

Code comparison

DETR (object detection):

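model = torch.hub.load('facebookresearch/detr', 'detr_resnet50', pretrained=True)
img = Image.open(image_path)
# the model expects a normalized, batched tensor rather than a PIL image;
# transform is the resize/normalize pipeline from the examples above
outputs = model(transform(img).unsqueeze(0))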

transformers (text classification):

from transformers import pipeline
classifier = pipeline("sentiment-analysis")
result = classifier("I love this movie!")[0]
print(f"Label: {result['label']}, Score: {result['score']:.4f}")

Summary

transformers offers a comprehensive suite of NLP tools and models, with strong community support and frequent updates. It's ideal for a wide range of text-based tasks. DETR, on the other hand, is more specialized for object detection in computer vision. While transformers has a broader scope, it may be more complex and resource-intensive for some applications compared to the more focused DETR.


YOLOv5 🚀 in PyTorch > ONNX > CoreML > TFLite

Pros of YOLOv5

  • Faster inference speed and real-time performance
  • Easier to deploy and use in production environments
  • More extensive documentation and community support

Cons of YOLOv5

  • Accuracy can be lower than DETR's, depending on the model variant chosen
  • Less flexible architecture, primarily designed for object detection

Code Comparison

DETR (PyTorch):

model = torch.hub.load('facebookresearch/detr', 'detr_resnet50', pretrained=True)
model.eval()
outputs = model(images)

YOLOv5 (PyTorch):

model = torch.hub.load('ultralytics/yolov5', 'yolov5s')
results = model(images)
results.print()

Both DETR and YOLOv5 are popular object detection models, but they have different strengths and use cases. DETR offers a novel approach using transformers, potentially achieving higher accuracy, especially for complex scenes. However, YOLOv5 excels in speed and ease of use, making it more suitable for real-time applications and deployment on edge devices. The choice between the two depends on the specific requirements of the project, balancing factors such as accuracy, speed, and deployment constraints.


Models and examples built with TensorFlow

Pros of models

  • Broader scope: Covers a wide range of machine learning models and applications
  • Extensive documentation and tutorials for various use cases
  • Active community and frequent updates from Google's TensorFlow team

Cons of models

  • Larger and more complex repository, potentially overwhelming for beginners
  • Less focused on a specific task compared to DETR's object detection specialization
  • May require more setup and configuration for specific use cases

Code Comparison

DETR (PyTorch):

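class DETR(nn.Module):
    def __init__(self, backbone, transformer, num_classes, num_queries):
        super().__init__()
        self.backbone = backbone
        self.transformer = transformer
        hidden_dim = transformer.d_model  # width of the transformer embeddings
        # learned object queries, one embedding per predicted slot
        self.query_embed = nn.Embedding(num_queries, hidden_dim)
        # one extra output for the "no object" class
        self.class_embed = nn.Linear(hidden_dim, num_classes + 1)
        # 3-layer MLP that regresses the 4 box coordinates
        self.bbox_embed = MLP(hidden_dim, hidden_dim, 4, 3)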

models (TensorFlow):

class DetectionModel(tf.keras.Model):
    def __init__(self, num_classes):
        super(DetectionModel, self).__init__()
        self.backbone = tf.keras.applications.ResNet50(include_top=False)
        self.conv = tf.keras.layers.Conv2D(num_classes, 3, padding='same')
        self.reshape = tf.keras.layers.Reshape((-1, num_classes))

Mask R-CNN for object detection and instance segmentation on Keras and TensorFlow

Pros of Mask_RCNN

  • Well-established and widely adopted in the computer vision community
  • Provides instance segmentation in addition to object detection
  • Extensive documentation and community support

Cons of Mask_RCNN

  • Generally slower inference time compared to DETR
  • More complex architecture, potentially harder to modify or extend
  • May require more computational resources for training and inference

Code Comparison

Mask_RCNN:

import mrcnn.model as modellib

model = modellib.MaskRCNN(mode="inference", config=config, model_dir=MODEL_DIR)
model.load_weights(WEIGHTS_PATH, by_name=True)
results = model.detect([image], verbose=1)

DETR:

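import torch
from models import build_model

# build_model returns the model, the training criterion and the postprocessors
model, criterion, postprocessors = build_model(args)
checkpoint = torch.load(args.resume, map_location='cpu')
model.load_state_dict(checkpoint['model'])
outputs = model(images)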

Both repositories provide powerful object detection capabilities, but they differ in their approaches and features. Mask_RCNN offers instance segmentation and has a more established presence in the community, while DETR introduces a novel end-to-end approach with potentially faster inference times and a simpler architecture.

Detectron2 is a platform for object detection, segmentation and other visual recognition tasks.

Pros of Detectron2

  • More comprehensive and feature-rich, offering a wider range of object detection and segmentation models
  • Better documentation and community support, making it easier for beginners to get started
  • Faster training and inference times for many models

Cons of Detectron2

  • Steeper learning curve due to its extensive features and options
  • Larger codebase, which can be more challenging to customize or modify
  • Potentially higher computational requirements for some models

Code Comparison

Detectron2:

from detectron2.engine import DefaultPredictor
from detectron2.config import get_cfg

cfg = get_cfg()
cfg.merge_from_file("path/to/config.yaml")
predictor = DefaultPredictor(cfg)
outputs = predictor(image)

DETR:

import torch

# load the pretrained ResNet-50 model from torch hub
model = torch.hub.load('facebookresearch/detr', 'detr_resnet50', pretrained=True)
model.eval()
# image: a normalized, batched image tensor
outputs = model(image)

Both repositories provide powerful object detection capabilities, but Detectron2 offers a more comprehensive toolkit with a wider range of models and features. DETR, on the other hand, introduces a novel end-to-end approach using transformers, which can be simpler to understand and implement for specific use cases.

OpenMMLab Detection Toolbox and Benchmark

Pros of mmdetection

  • Extensive model zoo with a wide variety of pre-trained models
  • Modular design allowing easy customization and extension
  • Comprehensive documentation and tutorials

Cons of mmdetection

  • Steeper learning curve due to its complexity
  • Potentially slower inference compared to DETR's end-to-end approach

Code Comparison

mmdetection:

from mmdet.apis import init_detector, inference_detector

config_file = 'configs/faster_rcnn/faster_rcnn_r50_fpn_1x_coco.py'
checkpoint_file = 'checkpoints/faster_rcnn_r50_fpn_1x_coco_20200130-047c8118.pth'
model = init_detector(config_file, checkpoint_file, device='cuda:0')
result = inference_detector(model, 'test.jpg')

DETR:

import torch
from models import build_model

# args is the argparse namespace built by main.py;
# build_model returns (model, criterion, postprocessors)
model, _, postprocessors = build_model(args)
checkpoint = torch.load(args.resume, map_location='cpu')
model.load_state_dict(checkpoint['model'])
model.eval()
outputs = model(images)

Both repositories offer powerful object detection capabilities, but mmdetection provides a more comprehensive toolkit with multiple algorithms, while DETR focuses on a novel end-to-end approach using transformers.


README

DE⫶TR: End-to-End Object Detection with Transformers


PyTorch training code and pretrained models for DETR (DEtection TRansformer). We replace the full complex hand-crafted object detection pipeline with a Transformer, and match Faster R-CNN with a ResNet-50, obtaining 42 AP on COCO using half the computation power (FLOPs) and the same number of parameters. Inference in 50 lines of PyTorch.

DETR

What it is. Unlike traditional computer vision techniques, DETR approaches object detection as a direct set prediction problem. It consists of a set-based global loss, which forces unique predictions via bipartite matching, and a Transformer encoder-decoder architecture. Given a fixed small set of learned object queries, DETR reasons about the relations of the objects and the global image context to directly output the final set of predictions in parallel. Due to this parallel nature, DETR is very fast and efficient.
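
To make the bipartite matching step concrete, here is a toy sketch in plain Python. It assigns each ground-truth object to a distinct prediction so that the total cost is minimal. The cost function (negative class probability plus L1 box distance, equally weighted) and all names are simplifying assumptions for illustration, and the search is a brute force over permutations rather than the Hungarian algorithm that practical implementations use (for example `scipy.optimize.linear_sum_assignment`).

```python
from itertools import permutations

def match_cost(pred, target):
    """Cost of assigning one prediction to one ground-truth object.

    Simplified, illustrative cost: negative probability of the target
    class plus the L1 distance between the two boxes.
    """
    class_cost = -pred["probs"][target["label"]]
    box_cost = sum(abs(p - t) for p, t in zip(pred["box"], target["box"]))
    return class_cost + box_cost

def best_matching(preds, targets):
    """Return a tuple giving, for each target, the index of its prediction.

    Each prediction is used at most once. Brute force is fine for a toy
    example but far too slow for realistic numbers of queries.
    """
    best, best_cost = None, float("inf")
    for assignment in permutations(range(len(preds)), len(targets)):
        cost = sum(match_cost(preds[i], t) for i, t in zip(assignment, targets))
        if cost < best_cost:
            best, best_cost = assignment, cost
    return best

# Hypothetical data: three predictions (queries), two ground-truth objects.
preds = [
    {"probs": [0.1, 0.9], "box": (0.5, 0.5, 0.2, 0.2)},
    {"probs": [0.8, 0.2], "box": (0.2, 0.2, 0.1, 0.1)},
    {"probs": [0.5, 0.5], "box": (0.9, 0.9, 0.1, 0.1)},
]
targets = [
    {"label": 0, "box": (0.2, 0.2, 0.1, 0.1)},
    {"label": 1, "box": (0.5, 0.5, 0.2, 0.2)},
]
matching = best_matching(preds, targets)
```

In this example target 0 is matched to prediction 1 and target 1 to prediction 0. Prediction 2 stays unmatched, which in DETR-style training means it is supervised to predict "no object". Because every target gets exactly one prediction, duplicate detections are penalized by the loss itself.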

About the code. We believe that object detection should not be more difficult than classification, and should not require complex libraries for training and inference. DETR is very simple to implement and experiment with, and we provide a standalone Colab Notebook showing how to do inference with DETR in only a few lines of PyTorch code. Training code follows this idea - it is not a library, but simply a main.py importing model and criterion definitions with standard training loops.

Additionally, we provide a Detectron2 wrapper in the d2/ folder. See the readme there for more information.

For details see End-to-End Object Detection with Transformers by Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko.

See our blog post to learn more about end to end object detection with transformers.

Model Zoo

We provide baseline DETR and DETR-DC5 models, and plan to include more in the future. AP is computed on COCO 2017 val5k, and inference time is measured over the first 100 val5k COCO images, with a TorchScript transformer.

    name      backbone  schedule (epochs)  inf_time  box AP  url           size
0   DETR      R50       500                0.036     42.0    model | logs  159Mb
1   DETR-DC5  R50       500                0.083     43.3    model | logs  159Mb
2   DETR      R101      500                0.050     43.5    model | logs  232Mb
3   DETR-DC5  R101      500                0.097     44.9    model | logs  232Mb

COCO val5k evaluation results can be found in this gist.
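
As a quick way to compare the detection models above, the per-image inference times can be turned into throughput. This is a small sketch that assumes the `inf_time` column is in seconds per image, which the table does not state explicitly. The dictionary and function names are illustrative.

```python
# (name, backbone) -> inference time per image, copied from the table above.
# Assumption: the inf_time column is measured in seconds.
INF_TIME = {
    ("DETR", "R50"): 0.036,
    ("DETR-DC5", "R50"): 0.083,
    ("DETR", "R101"): 0.050,
    ("DETR-DC5", "R101"): 0.097,
}

def images_per_second(seconds_per_image):
    """Throughput implied by a per-image latency."""
    return 1.0 / seconds_per_image

fps = {key: round(images_per_second(t), 1) for key, t in INF_TIME.items()}

for (name, backbone), value in fps.items():
    print(f"{name:9s} {backbone:5s} {value:5.1f} images/s")
```

Under that assumption the DC5 variants trade roughly half of the throughput for about one extra point of box AP.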

The models are also available via torch hub. To load DETR R50 with pretrained weights, simply run:

model = torch.hub.load('facebookresearch/detr:main', 'detr_resnet50', pretrained=True)

COCO panoptic val5k models:

    name      backbone  box AP  segm AP  PQ    url       size
0   DETR      R50       38.8    31.1     43.4  download  165Mb
1   DETR-DC5  R50       40.2    31.9     44.6  download  165Mb
2   DETR      R101      40.1    33       45.1  download  237Mb

Check out our panoptic colab to see how to use and visualize DETR's panoptic segmentation prediction.

Notebooks

We provide a few notebooks in colab to help you get a grasp on DETR:

  • DETR's hands on Colab Notebook: Shows how to load a model from hub, generate predictions, then visualize the attention of the model (similar to the figures of the paper)
  • Standalone Colab Notebook: In this notebook, we demonstrate how to implement a simplified version of DETR from the ground up in 50 lines of Python, then visualize the predictions. It is a good starting point if you want to gain a better understanding of the architecture and poke around before diving into the codebase.
  • Panoptic Colab Notebook: Demonstrates how to use DETR for panoptic segmentation and plot the predictions.

Usage - Object detection

There are no extra compiled components in DETR and package dependencies are minimal, so the code is very simple to use. We provide instructions on how to install dependencies via conda. First, clone the repository locally:

git clone https://github.com/facebookresearch/detr.git

Then, install PyTorch 1.5+ and torchvision 0.6+:

conda install -c pytorch pytorch torchvision

Install pycocotools (for evaluation on COCO) and scipy (for training):

conda install cython scipy
pip install -U 'git+https://github.com/cocodataset/cocoapi.git#subdirectory=PythonAPI'

That's it; you should now be able to train and evaluate detection models.

(Optional) To work with panoptic segmentation, install panopticapi:

pip install git+https://github.com/cocodataset/panopticapi.git

Data preparation

Download and extract COCO 2017 train and val images with annotations from http://cocodataset.org. We expect the directory structure to be the following:

path/to/coco/
  annotations/  # annotation json files
  train2017/    # train images
  val2017/      # val images
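
Before launching a long training run it can help to confirm that the layout above is in place. The following is a minimal sketch using only the standard library. The function name is hypothetical, and it checks only that the three directories exist, not their contents.

```python
from pathlib import Path

# Subdirectories expected under the COCO root, as listed above.
EXPECTED_SUBDIRS = ("annotations", "train2017", "val2017")

def missing_coco_dirs(coco_root):
    """Return the expected subdirectories that are absent under coco_root."""
    root = Path(coco_root)
    return [name for name in EXPECTED_SUBDIRS if not (root / name).is_dir()]

# Example usage with the placeholder path from this README.
missing = missing_coco_dirs("/path/to/coco")
if missing:
    print("Missing COCO directories:", ", ".join(missing))
```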

Training

To train baseline DETR on a single node with 8 GPUs for 300 epochs run:

python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --coco_path /path/to/coco 

A single epoch takes 28 minutes, so 300-epoch training takes around 6 days on a single machine with 8 V100 cards. To ease reproduction of our results we provide results and training logs for a 150-epoch schedule (3 days on a single machine), achieving 39.5/60.3 AP/AP50.
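
The schedule lengths quoted above follow directly from the per-epoch time. A short worked calculation, using only the 28-minute figure from the text (the function name is illustrative):

```python
MINUTES_PER_EPOCH = 28  # figure quoted above for a machine with 8 V100 cards

def training_days(epochs, minutes_per_epoch=MINUTES_PER_EPOCH):
    """Wall-clock training time in days for a given number of epochs."""
    return epochs * minutes_per_epoch / (60 * 24)

full_schedule = training_days(300)   # 8400 minutes, a little under 6 days
short_schedule = training_days(150)  # 4200 minutes, a little under 3 days

print(f"300 epochs: {full_schedule:.1f} days")
print(f"150 epochs: {short_schedule:.1f} days")
```

This counts training epochs only. Time spent on evaluation or checkpointing between epochs would come on top.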

We train DETR with AdamW, setting the learning rate to 1e-4 in the transformer and 1e-5 in the backbone. Horizontal flips, scales and crops are used for augmentation. Images are rescaled to have min size 800 and max size 1333. The transformer is trained with dropout of 0.1, and the whole model is trained with grad clip of 0.1.
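
The "min size 800, max size 1333" rule can be read as: scale the shorter side to 800 unless that would push the longer side past 1333, in which case cap the longer side instead. The sketch below illustrates that reading in plain Python. The function name is hypothetical and the exact rounding used in the repository may differ.

```python
def detr_style_resize(width, height, min_size=800, max_size=1333):
    """Return (new_width, new_height), preserving the aspect ratio.

    The shorter side is scaled to min_size unless that would make the
    longer side exceed max_size, in which case the longer side is capped.
    """
    scale = min_size / min(width, height)
    if max(width, height) * scale > max_size:
        scale = max_size / max(width, height)
    return round(width * scale), round(height * scale)

# A typical 4:3 image: the shorter side reaches 800.
regular = detr_style_resize(640, 480)

# A very wide image: the longer side is capped at 1333 instead.
panorama = detr_style_resize(1000, 200)
```

For the 640x480 image the shorter side becomes 800 and the longer side 1067. For the 1000x200 image the longer side is capped at 1333, which leaves the shorter side at 267.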

Evaluation

To evaluate DETR R50 on COCO val5k with a single GPU run:

python main.py --batch_size 2 --no_aux_loss --eval --resume https://dl.fbaipublicfiles.com/detr/detr-r50-e632da11.pth --coco_path /path/to/coco

We provide results for all DETR detection models in this gist. Note that numbers vary depending on batch size (number of images) per GPU. Non-DC5 models were trained with batch size 2, and DC5 with 1, so DC5 models show a significant drop in AP if evaluated with more than 1 image per GPU.

Multinode training

Distributed training is available via Slurm and submitit:

pip install submitit

Train baseline DETR-6-6 model on 4 nodes for 300 epochs:

python run_with_submitit.py --timeout 3000 --coco_path /path/to/coco

Usage - Segmentation

We show that it is relatively straightforward to extend DETR to predict segmentation masks. We mainly demonstrate strong panoptic segmentation results.

Data preparation

For panoptic segmentation, you need the panoptic annotations in addition to the COCO dataset (see above for the COCO dataset). You need to download and extract the annotations. We expect the directory structure to be the following:

path/to/coco_panoptic/
  annotations/  # annotation json files
  panoptic_train2017/    # train panoptic annotations
  panoptic_val2017/      # val panoptic annotations

Training

We recommend training segmentation in two stages: first train DETR to detect all the boxes, and then train the segmentation head. For panoptic segmentation, DETR must learn to detect boxes for both stuff and things classes. You can train it on a single node with 8 GPUs for 300 epochs with:

python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --coco_path /path/to/coco  --coco_panoptic_path /path/to/coco_panoptic --dataset_file coco_panoptic --output_dir /output/path/box_model

For instance segmentation, you can simply train a normal box model (or use a pre-trained one we provide).

Once you have a box model checkpoint, you need to freeze it, and train the segmentation head in isolation. For panoptic segmentation you can train on a single node with 8 GPUs for 25 epochs:

python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --masks --epochs 25 --lr_drop 15 --coco_path /path/to/coco  --coco_panoptic_path /path/to/coco_panoptic  --dataset_file coco_panoptic --frozen_weights /output/path/box_model/checkpoint.pth --output_dir /output/path/segm_model

For instance segmentation only, simply remove the dataset_file and coco_panoptic_path arguments from the above command line.
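
The `--epochs 25 --lr_drop 15` flags in the segmentation-head command above describe a step schedule: the learning rate is held constant and then reduced once at epoch 15. The sketch below illustrates this under two assumptions that the text does not state: the drop is a multiplication by 0.1, and the base learning rate is the 1e-4 quoted in the training section. The function name is hypothetical.

```python
def lr_at_epoch(base_lr, epoch, lr_drop, gamma=0.1):
    """Learning rate under step decay: multiply by gamma every lr_drop epochs.

    gamma=0.1 is an assumption; it is not stated in the text above.
    """
    return base_lr * gamma ** (epoch // lr_drop)

# Schedule for the 25-epoch segmentation-head run (epochs numbered from 0).
schedule = [lr_at_epoch(1e-4, epoch, lr_drop=15) for epoch in range(25)]
```

With these assumptions the first 15 epochs run at 1e-4 and the remaining 10 at 1e-5.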

License

DETR is released under the Apache 2.0 license. Please see the LICENSE file for more information.

Contributing

We actively welcome your pull requests! Please see CONTRIBUTING.md and CODE_OF_CONDUCT.md for more info.