facebookresearch / deit

Official DeiT repository

Top Related Projects

  • pytorch-image-models: The largest collection of PyTorch image encoders / backbones. Including train, eval, inference, export scripts, and pretrained weights -- ResNet, ResNeXT, EfficientNet, NFNet, Vision Transformer (ViT), MobileNetV4, MobileNet-V3 & V2, RegNet, DPN, CSPNet, Swin Transformer, MaxViT, CoAtNet, ConvNeXt, and more
  • Swin-Transformer: This is an official implementation for "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows".
  • transformers: 🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
  • vit-pytorch: Implementation of Vision Transformer, a simple way to achieve SOTA in vision classification with only a single transformer encoder, in Pytorch
  • mae: PyTorch implementation of MAE (https://arxiv.org/abs/2111.06377)

Quick Overview

DeiT (Data-efficient image Transformers) is a project by Facebook Research that introduces a novel approach to training Vision Transformers (ViT) with less data and computational resources. It demonstrates that Transformers can be competitive with convolutional neural networks in image classification tasks, even when trained on mid-sized datasets like ImageNet without large-scale pre-training.

Pros

  • Achieves high accuracy on image classification tasks with less training data
  • Reduces computational requirements compared to traditional ViT models
  • Introduces a distillation token for knowledge transfer from CNN teachers (see the loss sketch after the Cons list)
  • Provides pre-trained models and easy integration with PyTorch

Cons

  • May still require significant computational resources for training from scratch
  • Performance on very small datasets or specific domains might be limited
  • Requires careful hyperparameter tuning for optimal results
  • Limited to image classification tasks, not directly applicable to other computer vision problems
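
As a concrete illustration of the distillation token mentioned under Pros, here is a minimal sketch of DeiT's hard-label distillation objective. The function name is ours; the repository's actual implementation is the DistillationLoss class in losses.py.

import torch
import torch.nn.functional as F

def hard_distillation_loss(cls_logits, dist_logits, teacher_logits, labels):
    # The classification head learns from the ground-truth labels.
    ce_loss = F.cross_entropy(cls_logits, labels)
    # The distillation head learns from the teacher's hard predictions.
    teacher_labels = teacher_logits.argmax(dim=1)
    dist_loss = F.cross_entropy(dist_logits, teacher_labels)
    # DeiT weights the two terms equally.
    return 0.5 * ce_loss + 0.5 * dist_loss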

Code Examples

  1. Loading a pre-trained DeiT model:
import torch
from timm import create_model

model = create_model('deit_base_patch16_224', pretrained=True)
model.eval()
  2. Preprocessing an image for inference:
from PIL import Image
from torchvision import transforms

transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

img = Image.open('path/to/image.jpg')
img_tensor = transform(img).unsqueeze(0)
  3. Performing inference with the model:
with torch.no_grad():
    output = model(img_tensor)

probabilities = torch.nn.functional.softmax(output[0], dim=0)
predicted_class = probabilities.argmax().item()
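
The predicted index refers to an ImageNet-1k class. To print human-readable labels, map it through a class-name list; the file name below is hypothetical, and any copy of the standard 1,000 ImageNet labels in index order works.

# Hypothetical label file: one ImageNet class name per line, in index order.
with open('imagenet_classes.txt') as f:
    labels = [line.strip() for line in f]

top5_prob, top5_idx = probabilities.topk(5)
for prob, idx in zip(top5_prob.tolist(), top5_idx.tolist()):
    print(f"{labels[idx]}: {prob:.3f}")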

Getting Started

To get started with DeiT, follow these steps:

  1. Install the required dependencies:
pip install torch torchvision timm
  2. Clone the repository:
git clone https://github.com/facebookresearch/deit.git
cd deit
  3. Load a pre-trained model and run inference:
import torch
from timm import create_model
from torchvision import transforms
from PIL import Image

# Load model
model = create_model('deit_base_patch16_224', pretrained=True)
model.eval()

# Preprocess image
transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
img = Image.open('path/to/image.jpg')
img_tensor = transform(img).unsqueeze(0)

# Run inference
with torch.no_grad():
    output = model(img_tensor)
probabilities = torch.nn.functional.softmax(output[0], dim=0)
predicted_class = probabilities.argmax().item()
print(f"Predicted class: {predicted_class}")

Competitor Comparisons

Pros of vision_transformer

  • More comprehensive implementation with additional features and variants
  • Better documentation and explanations of the architecture
  • Includes pre-trained models and evaluation scripts

Cons of vision_transformer

  • Less focus on efficiency and optimization compared to DeiT
  • May be more complex to understand and modify for beginners
  • Fewer training recipes and hyperparameter suggestions

Code Comparison

vision_transformer:

class Attention(nn.Module):
    def __init__(self, dim, num_heads=8, qkv_bias=False, attn_drop=0., proj_drop=0.):
        super().__init__()
        self.num_heads = num_heads
        head_dim = dim // num_heads
        self.scale = head_dim ** -0.5

DeiT:

class Attention(nn.Module):
    def __init__(self, dim, num_heads=8, qkv_bias=False, qk_scale=None, attn_drop=0., proj_drop=0.):
        super().__init__()
        self.num_heads = num_heads
        head_dim = dim // num_heads
        self.scale = qk_scale or head_dim ** -0.5

Both implementations are nearly identical, but DeiT adds a qk_scale argument so the default head_dim ** -0.5 attention scaling can be overridden.
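
In both modules that scale multiplies the query-key dot product before the softmax. A minimal standalone sketch of what self.scale does (shapes are our assumption for illustration):

import torch

def scaled_dot_product(q, k, v, scale):
    # q, k, v: (batch, heads, seq_len, head_dim)
    attn = (q @ k.transpose(-2, -1)) * scale  # scaled similarity scores
    attn = attn.softmax(dim=-1)               # attention weights per query
    return attn @ v                           # weighted sum of values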

The largest collection of PyTorch image encoders / backbones. Including train, eval, inference, export scripts, and pretrained weights -- ResNet, ResNeXT, EfficientNet, NFNet, Vision Transformer (ViT), MobileNetV4, MobileNet-V3 & V2, RegNet, DPN, CSPNet, Swin Transformer, MaxViT, CoAtNet, ConvNeXt, and more

Pros of pytorch-image-models

  • Extensive collection of pre-trained models and architectures
  • Regular updates and active community support
  • Comprehensive documentation and examples

Cons of pytorch-image-models

  • Larger repository size due to extensive model collection
  • May have a steeper learning curve for beginners

Code Comparison

DeiT:

from models import deit_small_patch16_224  # models.py in the cloned DeiT repo

model = deit_small_patch16_224(pretrained=True)
model.eval()

pytorch-image-models:

import timm
model = timm.create_model('deit_small_patch16_224', pretrained=True)
model.eval()

Summary

pytorch-image-models offers a wider range of models and architectures, making it suitable for various computer vision tasks. It benefits from regular updates and active community support. However, its extensive collection may lead to a larger repository size and potentially a steeper learning curve for beginners.

DeiT focuses specifically on Data-efficient Image Transformers, providing a more targeted approach for users interested in this particular architecture. It may be easier to navigate for those specifically working with vision transformers.

Both repositories offer pre-trained models and are built on PyTorch, making them valuable resources for computer vision researchers and practitioners.

This is an official implementation for "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows".

Pros of Swin-Transformer

  • Hierarchical structure allows for better handling of varying scales in images
  • More efficient for high-resolution images due to local attention mechanism
  • Achieves state-of-the-art performance on various vision tasks

Cons of Swin-Transformer

  • More complex architecture, potentially harder to implement and fine-tune
  • May require more computational resources for training and inference
  • Less suitable for tasks that don't benefit from hierarchical representation

Code Comparison

Swin-Transformer:

class SwinTransformer(nn.Module):
    def __init__(self, img_size=224, patch_size=4, in_chans=3, num_classes=1000,
                 embed_dim=96, depths=[2, 2, 6, 2], num_heads=[3, 6, 12, 24],
                 window_size=7, mlp_ratio=4., qkv_bias=True, qk_scale=None,
                 drop_rate=0., attn_drop_rate=0., drop_path_rate=0.1,
                 norm_layer=nn.LayerNorm, ape=False, patch_norm=True,
                 use_checkpoint=False, **kwargs):
        super().__init__()
        # ... (implementation details)

DeiT:

class DistilledVisionTransformer(VisionTransformer):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.dist_token = nn.Parameter(torch.zeros(1, 1, self.embed_dim))
        num_patches = self.patch_embed.num_patches
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 2, self.embed_dim))
        self.head_dist = nn.Linear(self.embed_dim, self.num_classes) if self.num_classes > 0 else nn.Identity()
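
The dist_token and the widened positional embedding feed a second classifier head; at inference time DeiT averages the two heads. Condensed from the forward method of the same class:

def forward(self, x):
    x, x_dist = self.forward_features(x)
    x = self.head(x)                  # prediction from the class token
    x_dist = self.head_dist(x_dist)   # prediction from the distillation token
    if self.training:
        return x, x_dist              # both outputs feed the distillation loss
    return (x + x_dist) / 2           # inference: average the two heads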

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.

Pros of transformers

  • Broader scope: Supports a wide range of NLP tasks and models
  • Extensive documentation and community support
  • Regular updates and new model implementations

Cons of transformers

  • Larger codebase, potentially more complex to navigate
  • May have higher computational requirements due to its comprehensive nature

Code comparison

transformers:

from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

DeiT:

import torch
from models import deit_tiny_patch16_224  # models.py in the cloned DeiT repo

model = deit_tiny_patch16_224(pretrained=True)
model.eval()

Key differences

  • transformers focuses on NLP tasks, while DeiT specializes in vision transformers
  • transformers offers a unified API for various models, whereas DeiT is more focused on specific vision transformer architectures
  • DeiT's codebase is smaller and more specialized, potentially easier to understand for vision-specific tasks
  • transformers provides more extensive pre-processing tools and utilities

Use cases

  • Choose transformers for general NLP tasks or when working with multiple model architectures
  • Opt for DeiT when specifically working with vision transformers or image classification tasks
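
The two are not mutually exclusive: DeiT checkpoints are also published on the Hugging Face Hub, so they can be loaded through transformers as well. A sketch, assuming the facebook/deit-base-distilled-patch16-224 checkpoint and a recent transformers version:

from transformers import AutoImageProcessor, AutoModelForImageClassification
from PIL import Image

processor = AutoImageProcessor.from_pretrained('facebook/deit-base-distilled-patch16-224')
model = AutoModelForImageClassification.from_pretrained('facebook/deit-base-distilled-patch16-224')

image = Image.open('path/to/image.jpg')
inputs = processor(images=image, return_tensors='pt')  # resize, crop, normalize
logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(-1).item()])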

Implementation of Vision Transformer, a simple way to achieve SOTA in vision classification with only a single transformer encoder, in Pytorch

Pros of vit-pytorch

  • Simpler implementation, making it easier to understand and modify
  • More flexible architecture, allowing for easier experimentation
  • Includes additional variants and improvements beyond the basic ViT

Cons of vit-pytorch

  • Less optimized for performance compared to DeiT
  • May require more manual setup and configuration
  • Lacks some of the advanced training techniques used in DeiT

Code Comparison

vit-pytorch:

v = ViT(
    image_size = 256,
    patch_size = 32,
    num_classes = 1000,
    dim = 1024,
    depth = 6,
    heads = 16,
    mlp_dim = 2048,
    dropout = 0.1,
    emb_dropout = 0.1
)

DeiT:

import torch.nn as nn
from models import deit_small_patch16_224  # models.py in the cloned DeiT repo

model = deit_small_patch16_224(pretrained=True)
model.head = nn.Linear(model.head.in_features, num_classes)  # num_classes: your target class count

The vit-pytorch example shows a more customizable initialization, while DeiT demonstrates a simpler approach using pre-trained models. vit-pytorch allows for easy modification of various parameters, whereas DeiT focuses on providing optimized, pre-configured models that can be fine-tuned for specific tasks.
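
For reference, a forward pass with the vit-pytorch model constructed above follows the usage from that repository's README:

import torch
from vit_pytorch import ViT

v = ViT(image_size=256, patch_size=32, num_classes=1000, dim=1024,
        depth=6, heads=16, mlp_dim=2048, dropout=0.1, emb_dropout=0.1)

img = torch.randn(1, 3, 256, 256)  # dummy batch: one 256x256 RGB image
preds = v(img)                     # (1, 1000) class logits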

PyTorch implementation of MAE (https://arxiv.org/abs/2111.06377)

Pros of MAE

  • Utilizes self-supervised learning, reducing reliance on labeled data
  • Achieves higher accuracy on ImageNet benchmarks
  • More efficient training process, especially for large datasets

Cons of MAE

  • Requires more computational resources for pre-training
  • May be more complex to implement and fine-tune for specific tasks
  • Less suitable for smaller datasets or limited computing environments

Code Comparison

MAE:

def forward_encoder(self, x, mask_ratio):
    # embed patches and add positional embeddings (no cls token yet)
    x = self.patch_embed(x)
    x = x + self.pos_embed[:, 1:, :]
    # drop a random subset of patches: length -> length * (1 - mask_ratio)
    x, mask, ids_restore = self.random_masking(x, mask_ratio)

DeiT:

def forward_features(self, x):
    B = x.shape[0]
    x = self.patch_embed(x)

    cls_tokens = self.cls_token.expand(B, -1, -1)
    x = torch.cat((cls_tokens, x), dim=1)
    x = x + self.pos_embed

MAE focuses on masking and reconstructing image patches, while DeiT emphasizes distillation and transformer-based architectures. MAE's approach allows for more efficient pre-training on large datasets, potentially leading to better performance on downstream tasks. However, DeiT may be more suitable for scenarios with limited computational resources or smaller datasets.
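
To make the masking half concrete, here is a condensed version of MAE's per-sample random masking, adapted from random_masking in models_mae.py; the returned binary mask is omitted for brevity.

import torch

def random_masking(x, mask_ratio):
    # x: (batch, num_patches, dim) patch embeddings
    N, L, D = x.shape
    len_keep = int(L * (1 - mask_ratio))
    noise = torch.rand(N, L, device=x.device)        # random score per patch
    ids_shuffle = torch.argsort(noise, dim=1)        # random permutation of patches
    ids_restore = torch.argsort(ids_shuffle, dim=1)  # inverse, used by the decoder
    ids_keep = ids_shuffle[:, :len_keep]             # patches the encoder actually sees
    x_masked = torch.gather(x, 1, ids_keep.unsqueeze(-1).repeat(1, 1, D))
    return x_masked, ids_restore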

README

Data-Efficient architectures and training for Image classification

This repository contains PyTorch evaluation code, training code and pretrained models for the following papers:

DeiT (Data-Efficient Image Transformers), ICML 2021 [bib]
@InProceedings{pmlr-v139-touvron21a,
  title =     {Training data-efficient image transformers & distillation through attention},
  author =    {Touvron, Hugo and Cord, Matthieu and Douze, Matthijs and Massa, Francisco and Sablayrolles, Alexandre and Jegou, Herve},
  booktitle = {International Conference on Machine Learning},
  pages =     {10347--10357},
  year =      {2021},
  volume =    {139},
  month =     {July}
}
CaiT (Going deeper with Image Transformers), ICCV 2021 [bib]
@InProceedings{Touvron_2021_ICCV,
    author    = {Touvron, Hugo and Cord, Matthieu and Sablayrolles, Alexandre and Synnaeve, Gabriel and J\'egou, Herv\'e},
    title     = {Going Deeper With Image Transformers},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2021},
    pages     = {32-42}
}
ResMLP (ResMLP: Feedforward networks for image classification with data-efficient training), TPAMI 2022 [bib]
@article{touvron2021resmlp,
  title={ResMLP: Feedforward networks for image classification with data-efficient training},
  author={Hugo Touvron and Piotr Bojanowski and Mathilde Caron and Matthieu Cord and Alaaeldin El-Nouby and Edouard Grave and Gautier Izacard and Armand Joulin and Gabriel Synnaeve and Jakob Verbeek and Herv\'e J\'egou},
  journal={arXiv preprint arXiv:2105.03404},
  year={2021},
}
PatchConvnet (Augmenting Convolutional networks with attention-based aggregation) [bib]
@article{touvron2021patchconvnet,
  title={Augmenting Convolutional networks with attention-based aggregation},
  author={Hugo Touvron and Matthieu Cord and Alaaeldin El-Nouby and Piotr Bojanowski and Armand Joulin and Gabriel Synnaeve and Jakob Verbeek and Herve Jegou},
  journal={arXiv preprint arXiv:2112.13692},
  year={2021},
}
3Things (Three things everyone should know about Vision Transformers), ECCV 2022 [bib]
@article{Touvron2022ThreeTE,
  title={Three things everyone should know about Vision Transformers},
  author={Hugo Touvron and Matthieu Cord and Alaaeldin El-Nouby and Jakob Verbeek and Herve Jegou},
  journal={arXiv preprint arXiv:2203.09795},
  year={2022},
}
DeiT III (DeiT III: Revenge of the ViT), ECCV 2022 [bib]
@article{Touvron2022DeiTIR,
  title={DeiT III: Revenge of the ViT},
  author={Hugo Touvron and Matthieu Cord and Herve Jegou},
  journal={arXiv preprint arXiv:2204.07118},
  year={2022},
}
Cosub (Co-training 2L Submodels for Visual Recognition), CVPR 2023 [bib]
@article{Touvron2022Cotraining2S,
  title={Co-training 2L Submodels for Visual Recognition},
  author={Hugo Touvron and Matthieu Cord and Maxime Oquab and Piotr Bojanowski and Jakob Verbeek and Herv\'e J\'egou},
  journal={arXiv preprint arXiv:2212.04884},
  year={2022},
}
If you find this repository useful, please consider giving it a star ⭐ and citing the relevant papers.

License

This repository is released under the Apache 2.0 license as found in the LICENSE file.

Contributing

We actively welcome your pull requests! Please see CONTRIBUTING.md and CODE_OF_CONDUCT.md for more info.