facebookresearch / deit

Official DeiT repository

Top Related Projects

  • pytorch-image-models: The largest collection of PyTorch image encoders / backbones. Including train, eval, inference, export scripts, and pretrained weights -- ResNet, ResNeXT, EfficientNet, NFNet, Vision Transformer (ViT), MobileNetV4, MobileNet-V3 & V2, RegNet, DPN, CSPNet, Swin Transformer, MaxViT, CoAtNet, ConvNeXt, and more
  • Swin-Transformer: This is an official implementation for "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows".
  • transformers: 🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
  • vit-pytorch: Implementation of Vision Transformer, a simple way to achieve SOTA in vision classification with only a single transformer encoder, in Pytorch
  • mae: PyTorch implementation of MAE (https://arxiv.org/abs/2111.06377)

Quick Overview

DeiT (Data-efficient image Transformers) is a project by Facebook Research that introduces a novel approach to training Vision Transformers (ViT) with less data and computational resources. It demonstrates that Transformers can be competitive with convolutional neural networks in image classification tasks, even when trained on mid-sized datasets like ImageNet without large-scale pre-training.

Pros

  • Achieves high accuracy on image classification tasks with less training data
  • Reduces computational requirements compared to traditional ViT models
  • Introduces a distillation token for knowledge transfer from CNN teachers (see the loss sketch after the Cons list)
  • Provides pre-trained models and easy integration with PyTorch

Cons

  • May still require significant computational resources for training from scratch
  • Performance on very small datasets or specific domains might be limited
  • Requires careful hyperparameter tuning for optimal results
  • Limited to image classification tasks, not directly applicable to other computer vision problems
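
As a concrete illustration of the distillation token mentioned under Pros, here is a minimal sketch of DeiT's hard-label distillation objective. The function name is ours; the repository's actual implementation is the DistillationLoss class in losses.py.

import torch
import torch.nn.functional as F

def hard_distillation_loss(cls_logits, dist_logits, teacher_logits, labels):
    # The classification head learns from the ground-truth labels.
    ce_loss = F.cross_entropy(cls_logits, labels)
    # The distillation head learns from the teacher's hard predictions.
    teacher_labels = teacher_logits.argmax(dim=1)
    dist_loss = F.cross_entropy(dist_logits, teacher_labels)
    # DeiT weights the two terms equally.
    return 0.5 * ce_loss + 0.5 * dist_loss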

Code Examples

  1. Loading a pre-trained DeiT model:
import torch
from timm import create_model

model = create_model('deit_base_patch16_224', pretrained=True)
model.eval()
  2. Preprocessing an image for inference:
from PIL import Image
from torchvision import transforms

transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

img = Image.open('path/to/image.jpg')
img_tensor = transform(img).unsqueeze(0)
  3. Performing inference with the model:
with torch.no_grad():
    output = model(img_tensor)

probabilities = torch.nn.functional.softmax(output[0], dim=0)
predicted_class = probabilities.argmax().item()
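
The predicted index refers to an ImageNet-1k class. To print human-readable labels, map it through a class-name list; the file name below is hypothetical, and any copy of the standard 1,000 ImageNet labels in index order works.

# Hypothetical label file: one ImageNet class name per line, in index order.
with open('imagenet_classes.txt') as f:
    labels = [line.strip() for line in f]

top5_prob, top5_idx = probabilities.topk(5)
for prob, idx in zip(top5_prob.tolist(), top5_idx.tolist()):
    print(f"{labels[idx]}: {prob:.3f}")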

Getting Started

To get started with DeiT, follow these steps:

  1. Install the required dependencies:
pip install torch torchvision timm
  2. Clone the repository:
git clone https://github.com/facebookresearch/deit.git
cd deit
  3. Load a pre-trained model and run inference:
import torch
from timm import create_model
from torchvision import transforms
from PIL import Image

# Load model
model = create_model('deit_base_patch16_224', pretrained=True)
model.eval()

# Preprocess image
transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
img = Image.open('path/to/image.jpg')
img_tensor = transform(img).unsqueeze(0)

# Run inference
with torch.no_grad():
    output = model(img_tensor)
probabilities = torch.nn.functional.softmax(output[0], dim=0)
predicted_class = probabilities.argmax().item()
print(f"Predicted class: {predicted_class}")

Competitor Comparisons

Pros of vision_transformer

  • More comprehensive implementation with additional features and variants
  • Better documentation and explanations of the architecture
  • Includes pre-trained models and evaluation scripts

Cons of vision_transformer

  • Less focus on efficiency and optimization compared to DeiT
  • May be more complex to understand and modify for beginners
  • Fewer training recipes and hyperparameter suggestions

Code Comparison

vision_transformer:

class Attention(nn.Module):
    def __init__(self, dim, num_heads=8, qkv_bias=False, attn_drop=0., proj_drop=0.):
        super().__init__()
        self.num_heads = num_heads
        head_dim = dim // num_heads
        self.scale = head_dim ** -0.5

DeiT:

class Attention(nn.Module):
    def __init__(self, dim, num_heads=8, qkv_bias=False, qk_scale=None, attn_drop=0., proj_drop=0.):
        super().__init__()
        self.num_heads = num_heads
        head_dim = dim // num_heads
        self.scale = qk_scale or head_dim ** -0.5

Both implementations are nearly identical, but DeiT adds a qk_scale argument so the default head_dim ** -0.5 attention scaling can be overridden.
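
In both modules that scale multiplies the query-key dot product before the softmax. A minimal standalone sketch of what self.scale does (shapes are our assumption for illustration):

import torch

def scaled_dot_product(q, k, v, scale):
    # q, k, v: (batch, heads, seq_len, head_dim)
    attn = (q @ k.transpose(-2, -1)) * scale  # scaled similarity scores
    attn = attn.softmax(dim=-1)               # attention weights per query
    return attn @ v                           # weighted sum of values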

The largest collection of PyTorch image encoders / backbones. Including train, eval, inference, export scripts, and pretrained weights -- ResNet, ResNeXT, EfficientNet, NFNet, Vision Transformer (ViT), MobileNetV4, MobileNet-V3 & V2, RegNet, DPN, CSPNet, Swin Transformer, MaxViT, CoAtNet, ConvNeXt, and more

Pros of pytorch-image-models

  • Extensive collection of pre-trained models and architectures
  • Regular updates and active community support
  • Comprehensive documentation and examples

Cons of pytorch-image-models

  • Larger repository size due to extensive model collection
  • May have a steeper learning curve for beginners

Code Comparison

DeiT:

from models import deit_small_patch16_224  # models.py in the cloned DeiT repo

model = deit_small_patch16_224(pretrained=True)
model.eval()

pytorch-image-models:

import timm
model = timm.create_model('deit_small_patch16_224', pretrained=True)
model.eval()

Summary

pytorch-image-models offers a wider range of models and architectures, making it suitable for various computer vision tasks. It benefits from regular updates and active community support. However, its extensive collection may lead to a larger repository size and potentially a steeper learning curve for beginners.

DeiT focuses specifically on Data-efficient Image Transformers, providing a more targeted approach for users interested in this particular architecture. It may be easier to navigate for those specifically working with vision transformers.

Both repositories offer pre-trained models and are built on PyTorch, making them valuable resources for computer vision researchers and practitioners.

This is an official implementation for "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows".

Pros of Swin-Transformer

  • Hierarchical structure allows for better handling of varying scales in images
  • More efficient for high-resolution images due to local attention mechanism
  • Achieves state-of-the-art performance on various vision tasks

Cons of Swin-Transformer

  • More complex architecture, potentially harder to implement and fine-tune
  • May require more computational resources for training and inference
  • Less suitable for tasks that don't benefit from hierarchical representation

Code Comparison

Swin-Transformer:

class SwinTransformer(nn.Module):
    def __init__(self, img_size=224, patch_size=4, in_chans=3, num_classes=1000,
                 embed_dim=96, depths=[2, 2, 6, 2], num_heads=[3, 6, 12, 24],
                 window_size=7, mlp_ratio=4., qkv_bias=True, qk_scale=None,
                 drop_rate=0., attn_drop_rate=0., drop_path_rate=0.1,
                 norm_layer=nn.LayerNorm, ape=False, patch_norm=True,
                 use_checkpoint=False, **kwargs):
        super().__init__()
        # ... (implementation details)

DeiT:

class DistilledVisionTransformer(VisionTransformer):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.dist_token = nn.Parameter(torch.zeros(1, 1, self.embed_dim))
        num_patches = self.patch_embed.num_patches
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 2, self.embed_dim))
        self.head_dist = nn.Linear(self.embed_dim, self.num_classes) if self.num_classes > 0 else nn.Identity()
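
The dist_token and the widened positional embedding feed a second classifier head; at inference time DeiT averages the two heads. Condensed from the forward method of the same class:

def forward(self, x):
    x, x_dist = self.forward_features(x)
    x = self.head(x)                  # prediction from the class token
    x_dist = self.head_dist(x_dist)   # prediction from the distillation token
    if self.training:
        return x, x_dist              # both outputs feed the distillation loss
    return (x + x_dist) / 2           # inference: average the two heads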

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.

Pros of transformers

  • Broader scope: Supports a wide range of NLP tasks and models
  • Extensive documentation and community support
  • Regular updates and new model implementations

Cons of transformers

  • Larger codebase, potentially more complex to navigate
  • May have higher computational requirements due to its comprehensive nature

Code comparison

transformers:

from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

DeiT:

import torch
from models import deit_tiny_patch16_224  # models.py in the cloned DeiT repo

model = deit_tiny_patch16_224(pretrained=True)
model.eval()

Key differences

  • transformers focuses on NLP tasks, while DeiT specializes in vision transformers
  • transformers offers a unified API for various models, whereas DeiT is more focused on specific vision transformer architectures
  • DeiT's codebase is smaller and more specialized, potentially easier to understand for vision-specific tasks
  • transformers provides more extensive pre-processing tools and utilities

Use cases

  • Choose transformers for general NLP tasks or when working with multiple model architectures
  • Opt for DeiT when specifically working with vision transformers or image classification tasks
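
The two are not mutually exclusive: DeiT checkpoints are also published on the Hugging Face Hub, so they can be loaded through transformers as well. A sketch, assuming the facebook/deit-base-distilled-patch16-224 checkpoint and a recent transformers version:

from transformers import AutoImageProcessor, AutoModelForImageClassification
from PIL import Image

processor = AutoImageProcessor.from_pretrained('facebook/deit-base-distilled-patch16-224')
model = AutoModelForImageClassification.from_pretrained('facebook/deit-base-distilled-patch16-224')

image = Image.open('path/to/image.jpg')
inputs = processor(images=image, return_tensors='pt')  # resize, crop, normalize
logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(-1).item()])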

Implementation of Vision Transformer, a simple way to achieve SOTA in vision classification with only a single transformer encoder, in Pytorch

Pros of vit-pytorch

  • Simpler implementation, making it easier to understand and modify
  • More flexible architecture, allowing for easier experimentation
  • Includes additional variants and improvements beyond the basic ViT

Cons of vit-pytorch

  • Less optimized for performance compared to DeiT
  • May require more manual setup and configuration
  • Lacks some of the advanced training techniques used in DeiT

Code Comparison

vit-pytorch:

v = ViT(
    image_size = 256,
    patch_size = 32,
    num_classes = 1000,
    dim = 1024,
    depth = 6,
    heads = 16,
    mlp_dim = 2048,
    dropout = 0.1,
    emb_dropout = 0.1
)

DeiT:

import torch.nn as nn
from models import deit_small_patch16_224  # models.py in the cloned DeiT repo

model = deit_small_patch16_224(pretrained=True)
model.head = nn.Linear(model.head.in_features, num_classes)  # num_classes: your target class count

The vit-pytorch example shows a more customizable initialization, while DeiT demonstrates a simpler approach using pre-trained models. vit-pytorch allows for easy modification of various parameters, whereas DeiT focuses on providing optimized, pre-configured models that can be fine-tuned for specific tasks.
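
For reference, a forward pass with the vit-pytorch model constructed above follows the usage from that repository's README:

import torch
from vit_pytorch import ViT

v = ViT(image_size=256, patch_size=32, num_classes=1000, dim=1024,
        depth=6, heads=16, mlp_dim=2048, dropout=0.1, emb_dropout=0.1)

img = torch.randn(1, 3, 256, 256)  # dummy batch: one 256x256 RGB image
preds = v(img)                     # (1, 1000) class logits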

PyTorch implementation of MAE (https://arxiv.org/abs/2111.06377)

Pros of MAE

  • Utilizes self-supervised learning, reducing reliance on labeled data
  • Achieves higher accuracy on ImageNet benchmarks
  • More efficient training process, especially for large datasets

Cons of MAE

  • Requires more computational resources for pre-training
  • May be more complex to implement and fine-tune for specific tasks
  • Less suitable for smaller datasets or limited computing environments

Code Comparison

MAE:

def forward_encoder(self, x, mask_ratio):
    # embed patches and add positional embeddings (no cls token yet)
    x = self.patch_embed(x)
    x = x + self.pos_embed[:, 1:, :]
    # drop a random subset of patches: length -> length * (1 - mask_ratio)
    x, mask, ids_restore = self.random_masking(x, mask_ratio)

DeiT:

def forward_features(self, x):
    B = x.shape[0]
    x = self.patch_embed(x)

    cls_tokens = self.cls_token.expand(B, -1, -1)
    x = torch.cat((cls_tokens, x), dim=1)
    x = x + self.pos_embed

MAE focuses on masking and reconstructing image patches, while DeiT emphasizes distillation and transformer-based architectures. MAE's approach allows for more efficient pre-training on large datasets, potentially leading to better performance on downstream tasks. However, DeiT may be more suitable for scenarios with limited computational resources or smaller datasets.
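
To make the masking half concrete, here is a condensed version of MAE's per-sample random masking, adapted from random_masking in models_mae.py; the returned binary mask is omitted for brevity.

import torch

def random_masking(x, mask_ratio):
    # x: (batch, num_patches, dim) patch embeddings
    N, L, D = x.shape
    len_keep = int(L * (1 - mask_ratio))
    noise = torch.rand(N, L, device=x.device)        # random score per patch
    ids_shuffle = torch.argsort(noise, dim=1)        # random permutation of patches
    ids_restore = torch.argsort(ids_shuffle, dim=1)  # inverse, used by the decoder
    ids_keep = ids_shuffle[:, :len_keep]             # patches the encoder actually sees
    x_masked = torch.gather(x, 1, ids_keep.unsqueeze(-1).repeat(1, 1, D))
    return x_masked, ids_restore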

README

Data-Efficient architectures and training for Image classification

This repository contains PyTorch evaluation code, training code and pretrained models for the following papers:

DeiT (Data-Efficient Image Transformers), ICML 2021 [bib]
@InProceedings{pmlr-v139-touvron21a,
  title =     {Training data-efficient image transformers & distillation through attention},
  author =    {Touvron, Hugo and Cord, Matthieu and Douze, Matthijs and Massa, Francisco and Sablayrolles, Alexandre and Jegou, Herve},
  booktitle = {International Conference on Machine Learning},
  pages =     {10347--10357},
  year =      {2021},
  volume =    {139},
  month =     {July}
}
CaiT (Going deeper with Image Transformers), ICCV 2021 [bib]
@InProceedings{Touvron_2021_ICCV,
    author    = {Touvron, Hugo and Cord, Matthieu and Sablayrolles, Alexandre and Synnaeve, Gabriel and J\'egou, Herv\'e},
    title     = {Going Deeper With Image Transformers},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2021},
    pages     = {32-42}
}
ResMLP (ResMLP: Feedforward networks for image classification with data-efficient training), TPAMI 2022 [bib]
@article{touvron2021resmlp,
  title={ResMLP: Feedforward networks for image classification with data-efficient training},
  author={Hugo Touvron and Piotr Bojanowski and Mathilde Caron and Matthieu Cord and Alaaeldin El-Nouby and Edouard Grave and Gautier Izacard and Armand Joulin and Gabriel Synnaeve and Jakob Verbeek and Herv\'e J\'egou},
  journal={arXiv preprint arXiv:2105.03404},
  year={2021},
}
PatchConvnet (Augmenting Convolutional networks with attention-based aggregation) [bib]
@article{touvron2021patchconvnet,
  title={Augmenting Convolutional networks with attention-based aggregation},
  author={Hugo Touvron and Matthieu Cord and Alaaeldin El-Nouby and Piotr Bojanowski and Armand Joulin and Gabriel Synnaeve and Jakob Verbeek and Herve Jegou},
  journal={arXiv preprint arXiv:2112.13692},
  year={2021},
}
3Things (Three things everyone should know about Vision Transformers), ECCV 2022 [bib]
@article{Touvron2022ThreeTE,
  title={Three things everyone should know about Vision Transformers},
  author={Hugo Touvron and Matthieu Cord and Alaaeldin El-Nouby and Jakob Verbeek and Herve Jegou},
  journal={arXiv preprint arXiv:2203.09795},
  year={2022},
}
DeiT III (DeiT III: Revenge of the ViT), ECCV 2022 [bib]
@article{Touvron2022DeiTIR,
  title={DeiT III: Revenge of the ViT},
  author={Hugo Touvron and Matthieu Cord and Herve Jegou},
  journal={arXiv preprint arXiv:2204.07118},
  year={2022},
}
Cosub (Co-training 2L Submodels for Visual Recognition), CVPR 2023 [bib]
@article{Touvron2022Cotraining2S,
  title={Co-training 2L Submodels for Visual Recognition},
  author={Hugo Touvron and Matthieu Cord and Maxime Oquab and Piotr Bojanowski and Jakob Verbeek and Herv\'e J\'egou},
  journal={arXiv preprint arXiv:2212.04884},
  year={2022},
}
If you find this repository useful, please consider giving it a star ⭐ and citing the relevant papers.

License

This repository is released under the Apache 2.0 license as found in the LICENSE file.

Contributing

We actively welcome your pull requests! Please see CONTRIBUTING.md and CODE_OF_CONDUCT.md for more info.