facebookresearch/mae

PyTorch implementation of MAE: https://arxiv.org/abs/2111.06377

Top Related Projects

19,863

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.

26,479

CLIP (Contrastive Language-Image Pretraining), Predict the most relevant text snippet given an image

38,368

TensorFlow code and pre-trained models for BERT

16,111

Datasets, Transforms and Models specific to Computer Vision

Quick Overview

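MAE (Masked Autoencoders) is a self-supervised learning framework for computer vision, developed by Facebook Research. It pre-trains Vision Transformers by hiding a large fraction of image patches (the snippets on this page use a mask ratio of 0.75) and training the model to reconstruct the missing pixels. The pre-trained encoder is then fine-tuned for downstream tasks such as image classification and object detection; the README section at the end of this page lists the reported accuracies.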
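As a rough illustration of the masking step, the sketch below picks a random subset of patch indices to keep and marks the rest as masked. It uses only the Python standard library; the function name `random_patch_mask` and its `seed` parameter are illustrative assumptions, not part of the repository, which works on batched tensors.

```python
import random

def random_patch_mask(num_patches, mask_ratio, seed=0):
    """Return (kept_indices, mask) for one image.

    mask[i] == 1 means patch i is hidden from the encoder, 0 means visible.
    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    num_keep = int(num_patches * (1 - mask_ratio))
    shuffled = list(range(num_patches))
    rng.shuffle(shuffled)
    kept = sorted(shuffled[:num_keep])
    kept_set = set(kept)
    mask = [0 if i in kept_set else 1 for i in range(num_patches)]
    return kept, mask

# A 224x224 image cut into 16x16 patches gives a 14x14 grid of 196 patches.
num_patches = (224 // 16) ** 2
kept, mask = random_patch_mask(num_patches, mask_ratio=0.75)
print(len(kept), sum(mask))  # 49 147
```

With a mask ratio of 0.75 the encoder only processes a quarter of the patches, which is where much of the pre-training efficiency comes from.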

Pros

  • Achieves high performance on various computer vision tasks
  • Requires less labeled data for training compared to traditional supervised methods
  • Scalable to large datasets and model sizes
  • Demonstrates good transfer learning capabilities

Cons

  • Computationally intensive, requiring significant resources for training
  • May not be as effective for small-scale datasets or limited computing environments
  • Requires fine-tuning for specific downstream tasks
  • Complex architecture that may be challenging to understand and implement for beginners

Code Examples

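  1. Loading a pre-trained MAE model:
import torch
from models_mae import mae_vit_base_patch16  # run from the repository root

# Build the model, then load pre-trained weights
model = mae_vit_base_patch16()
checkpoint = torch.load('path/to/pretrained/weights.pth', map_location='cpu')
# The weights are stored under the 'model' key; strict=False tolerates keys
# that are missing from the checkpoint
model.load_state_dict(checkpoint['model'], strict=False)
  2. Preparing input data for MAE:
from PIL import Image
from torchvision import transforms

# Define image transformations
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

# Apply transformations to an image
image = Image.open('path/to/image.jpg').convert('RGB')
input_tensor = transform(image).unsqueeze(0)  # add a batch dimension: (1, 3, 224, 224)
  3. Running a forward pass with MAE:
# Set model to evaluation mode
model.eval()

# The pre-training model returns a reconstruction loss, per-patch pixel
# predictions, and the patch mask, not class logits
with torch.no_grad():
    loss, pred, mask = model(input_tensor, mask_ratio=0.75)

# For image classification, fine-tune the encoder first (see FINETUNE.md)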
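MAE's pre-training objective, as described in the paper, is a mean squared error between reconstructed and original pixels, averaged over the masked patches only. The following is a minimal pure-Python sketch of that computation on toy data; `masked_mse` is a hypothetical helper name, and the repository computes the same quantity on tensors.

```python
def masked_mse(pred, target, mask):
    """Mean squared error averaged over masked patches only.

    pred, target: lists of patches, each a list of pixel values.
    mask: list with 1 for a masked (hidden) patch and 0 for a visible one.
    Hypothetical helper for illustration only.
    """
    per_patch = [
        sum((p - t) ** 2 for p, t in zip(pred_patch, target_patch)) / len(target_patch)
        for pred_patch, target_patch in zip(pred, target)
    ]
    masked_total = sum(loss * m for loss, m in zip(per_patch, mask))
    return masked_total / sum(mask)

# Three toy patches of two pixels each; the first patch is visible (mask 0),
# so its large error does not contribute to the loss.
target = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
pred = [[9.0, 9.0], [1.0, 3.0], [2.0, 2.0]]
mask = [0, 1, 1]
print(masked_mse(pred, target, mask))  # 1.0
```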

Getting Started

To get started with MAE:

  1. Clone the repository:

    git clone https://github.com/facebookresearch/mae.git
    cd mae
    
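  2. Install dependencies. Per the README, installation follows the DeiT repo and the code is based on timm==0.3.2, so install PyTorch and torchvision first, then:

    pip install timm==0.3.2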
    
  3. Download pre-trained weights or train your own model:

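    import torch
    from models_mae import mae_vit_base_patch16
    
    # Load pre-trained weights (stored under the 'model' key of the checkpoint)
    model = mae_vit_base_patch16()
    checkpoint = torch.load('path/to/pretrained/weights.pth', map_location='cpu')
    model.load_state_dict(checkpoint['model'], strict=False)
    
    # Or train your own model (simplified example: num_epochs, data_loader,
    # optimizer, and device must be defined first; the repository's
    # train_one_epoch also expects a loss scaler and an args namespace,
    # see PRETRAIN.md and main_pretrain.py for the full call)
    from engine_pretrain import train_one_epoch
    
    for epoch in range(num_epochs):
        train_one_epoch(model, data_loader, optimizer, device, epoch)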
    
  4. Use the model for downstream tasks or fine-tuning as needed.

Competitor Comparisons

19,863

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities

Pros of UniLM

  • Broader scope: Supports multiple NLP tasks including text generation, summarization, and question answering
  • More versatile: Can be fine-tuned for various downstream tasks with minimal modifications
  • Active development: Regularly updated with new models and features

Cons of UniLM

  • More complex: Requires deeper understanding of NLP concepts to utilize effectively
  • Larger model size: Generally requires more computational resources for training and inference
  • Less focused: As a broad collection of models, it may not match MAE's results on specific vision tasks

Code Comparison

MAE (PyTorch):

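import torch
from models_mae import mae_vit_base_patch16

model = mae_vit_base_patch16()
x = torch.randn(1, 3, 224, 224)
# forward() returns the reconstruction loss, the predicted patches, and the mask
loss, pred, mask = model(x)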

UniLM (PyTorch, BEiT loaded through Hugging Face transformers):

# The transformers library has no UniLMForConditionalGeneration or
# UniLMTokenizer class. BEiT, one of the models released in the UniLM
# repository, is available there and is used here instead.
import torch
from transformers import BeitModel

model = BeitModel.from_pretrained("microsoft/beit-base-patch16-224")
pixel_values = torch.randn(1, 3, 224, 224)
outputs = model(pixel_values=pixel_values)

Pros of vision_transformer

  • More established and widely recognized implementation of Vision Transformers
  • Extensive documentation and examples for various use cases
  • Supports a broader range of Vision Transformer variants and architectures

Cons of vision_transformer

  • Less focus on self-supervised learning techniques
  • May require more computational resources for training and inference
  • Limited integration with other advanced vision techniques like masked autoencoders

Code Comparison

vision_transformer:

class VisionTransformer(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_chans=3, num_classes=1000,
                 embed_dim=768, depth=12, num_heads=12, mlp_ratio=4., qkv_bias=True,
                 representation_size=None, distilled=False, drop_rate=0.):
        super().__init__()
        # ... (implementation details)

mae:

class MaskedAutoencoderViT(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_chans=3,
                 embed_dim=1024, depth=24, num_heads=16,
                 decoder_embed_dim=512, decoder_depth=8, decoder_num_heads=16,
                 mlp_ratio=4., norm_layer=nn.LayerNorm, norm_pix_loss=False):
        super().__init__()
        # ... (implementation details)
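To make the constructor defaults above more tangible, the short calculation below derives the sequence lengths they imply. It is a back-of-the-envelope sketch: the mask ratio of 0.75 is taken from the snippets elsewhere on this page, and the extra class token per sequence is an assumption about the architecture rather than something the signature states.

```python
# Defaults from the MaskedAutoencoderViT signature above
img_size, patch_size, in_chans = 224, 16, 3
mask_ratio = 0.75  # value used in the other snippets on this page

num_patches = (img_size // patch_size) ** 2        # patches per image
num_visible = int(num_patches * (1 - mask_ratio))  # patches the encoder sees
encoder_tokens = num_visible + 1                   # assumption: one class token is prepended
decoder_tokens = num_patches + 1                   # decoder sees all positions plus the class token
pixels_per_patch = patch_size ** 2 * in_chans      # size of each predicted patch vector

print(num_patches, encoder_tokens, decoder_tokens, pixels_per_patch)  # 196 50 197 768
```

The asymmetry is the point of the design: the deep encoder runs on roughly a quarter of the tokens, while only the lighter decoder (decoder_depth=8, decoder_embed_dim=512 in the signature) handles the full sequence.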

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.

Pros of transformers

  • Broader scope: Supports a wide range of tasks and models across NLP, vision, and audio
  • Extensive documentation and community support
  • Regular updates and new model implementations

Cons of transformers

  • Larger codebase, potentially more complex to navigate
  • May have higher computational requirements for some models

Code comparison

mae:

import torch
from models_mae import mae_vit_base_patch16

model = mae_vit_base_patch16()
model.eval()
x = torch.randn(1, 3, 224, 224)
loss, y, mask = model(x, mask_ratio=0.75)

transformers:

from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
outputs = model(**inputs)

Key differences

  • mae focuses on masked autoencoders for vision tasks
  • transformers provides a wide range of models and tasks, with NLP as its original focus
  • mae's codebase is more specialized and compact
  • transformers offers more flexibility and pre-trained models
26,479

CLIP (Contrastive Language-Image Pretraining), Predict the most relevant text snippet given an image

Pros of CLIP

  • Multimodal learning: CLIP can understand both images and text, enabling versatile applications
  • Zero-shot learning capabilities: Can classify images into arbitrary categories without fine-tuning
  • Robust performance across diverse datasets and tasks

Cons of CLIP

  • Computationally intensive: Requires significant resources for training and inference
  • Limited to image-text pairs: May not capture more complex relationships between modalities
  • Potential biases: Can inherit biases present in the large-scale training data

Code Comparison

CLIP example:

import torch
import clip
from PIL import Image

# Fall back to CPU when no GPU is available
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
image = preprocess(Image.open("image.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["a dog", "a cat"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

MAE example:

import torch
from models_mae import mae_vit_base_patch16_dec512d8b

model = mae_vit_base_patch16_dec512d8b()
checkpoint = torch.load('mae_pretrain_vit_base.pth', map_location='cpu')
# strict=False reports missing or unexpected keys in msg instead of raising
msg = model.load_state_dict(checkpoint['model'], strict=False)
print(msg)
38,368

TensorFlow code and pre-trained models for BERT

Pros of BERT

  • Widely adopted and extensively used in NLP tasks
  • Comprehensive documentation and extensive community support
  • Pre-trained models available for various languages and domains

Cons of BERT

  • Computationally intensive, requiring significant resources for training
  • Limited to text-based tasks, not suitable for image processing
  • Potential for biases in pre-trained models based on training data

Code Comparison

BERT (Python):

import tensorflow as tf
from transformers import BertTokenizer, TFBertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertModel.from_pretrained('bert-base-uncased')

MAE (Python):

import torch
from models_mae import mae_vit_base_patch16

model = mae_vit_base_patch16()
checkpoint = torch.load('mae_pretrain_vit_base.pth', map_location='cpu')
model.load_state_dict(checkpoint['model'], strict=False)

Key Differences

  • BERT focuses on natural language processing tasks, while MAE is designed for computer vision applications
  • BERT applies the transformer architecture to text, while MAE applies Vision Transformers to images
  • BERT requires tokenization of text input, while MAE splits images into patches and works directly with pixel data
  • BERT has a larger ecosystem of pre-trained models and fine-tuning scripts, while MAE is more specialized for self-supervised learning in computer vision
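The contrast with tokenization can be made concrete with a small sketch of how an image becomes a sequence of patches, the visual counterpart of text tokens. This is a pure-Python toy on a single-channel 4x4 grid; `patchify` here is a hypothetical helper written for illustration, not the repository's tensor implementation.

```python
def patchify(image, patch_size):
    """Split an H x W grid (list of rows) into flattened, non-overlapping
    patches, ordered left to right and top to bottom.

    Hypothetical helper for illustration only.
    """
    height, width = len(image), len(image[0])
    assert height % patch_size == 0 and width % patch_size == 0
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [image[top + dy][left + dx]
                     for dy in range(patch_size)
                     for dx in range(patch_size)]
            patches.append(patch)
    return patches

# 4x4 toy image whose pixel values are 0..15 in row-major order
image = [[row * 4 + col for col in range(4)] for row in range(4)]
patches = patchify(image, patch_size=2)
print(patches)
# [[0, 1, 4, 5], [2, 3, 6, 7], [8, 9, 12, 13], [10, 11, 14, 15]]
```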
16,111

Datasets, Transforms and Models specific to Computer Vision

Pros of vision

  • Broader scope: Covers a wide range of computer vision tasks and models
  • More established: Longer development history and larger community support
  • Extensive documentation: Comprehensive guides and examples for various use cases

Cons of vision

  • Less focused: Not specialized in self-supervised learning like MAE
  • Potentially slower adoption of cutting-edge techniques: May take longer to implement the latest research advancements

Code comparison

MAE:

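model = mae_vit_base_patch16()
# nn.Module has no load_checkpoint(); load the state dict from the checkpoint file
checkpoint = torch.load(args.checkpoint, map_location='cpu')
model.load_state_dict(checkpoint['model'], strict=False)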

vision:

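import torchvision
# pretrained=True is deprecated in recent torchvision releases in favor of weights=
model = torchvision.models.resnet50(weights=torchvision.models.ResNet50_Weights.DEFAULT)
model.eval()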

MAE focuses on masked autoencoders for self-supervised learning, while vision provides a broader set of pre-trained models and utilities for various computer vision tasks. MAE's code is more specialized for its specific approach, whereas vision offers a more general-purpose interface for accessing and using different models.

Both repositories are valuable for different use cases. MAE is ideal for researchers and practitioners interested in self-supervised learning and masked autoencoders, while vision is better suited for those working on a wide range of computer vision applications and requiring access to various pre-trained models and utilities.

README

Masked Autoencoders: A PyTorch Implementation

This is a PyTorch/GPU re-implementation of the paper Masked Autoencoders Are Scalable Vision Learners:

@Article{MaskedAutoencoders2021,
  author  = {Kaiming He and Xinlei Chen and Saining Xie and Yanghao Li and Piotr Doll{\'a}r and Ross Girshick},
  journal = {arXiv:2111.06377},
  title   = {Masked Autoencoders Are Scalable Vision Learners},
  year    = {2021},
}
  • The original implementation was in TensorFlow+TPU. This re-implementation is in PyTorch+GPU.

  • This repo is a modification on the DeiT repo. Installation and preparation follow that repo.

  • This repo is based on timm==0.3.2, for which a fix is needed to work with PyTorch 1.8.1+.

Catalog

  • Visualization demo
  • Pre-trained checkpoints + fine-tuning code
  • Pre-training code

Visualization demo

Run our interactive visualization demo using the Colab notebook (no GPU needed).

Fine-tuning with pre-trained checkpoints

The following table provides the pre-trained checkpoints used in the paper, converted from TF/TPU to PT/GPU:

                        ViT-Base  ViT-Large  ViT-Huge
pre-trained checkpoint  download  download   download
md5                     8cad7c    b8b06e     9bdbb0
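A downloaded checkpoint can be checked against the md5 values above with a few lines of standard-library Python. The sketch assumes the six-character values in the table are the leading characters of each file's MD5 hex digest, and the checkpoint filename is taken from the snippets earlier on this page; `md5_prefix_matches` is a hypothetical helper name.

```python
import hashlib
from pathlib import Path

def md5_prefix_matches(path, expected_prefix, chunk_size=1 << 20):
    """Return True if the file's MD5 hex digest starts with expected_prefix.

    Reads the file in chunks so large checkpoints do not need to fit in memory.
    Hypothetical helper for illustration only.
    """
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest().startswith(expected_prefix.lower())

# Assumed local filename for the ViT-Base checkpoint
checkpoint_path = Path("mae_pretrain_vit_base.pth")
if checkpoint_path.exists():
    print(md5_prefix_matches(checkpoint_path, "8cad7c"))
```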

The fine-tuning instruction is in FINETUNE.md.

By fine-tuning these pre-trained models, we rank #1 in these classification tasks (detailed in the paper):

                                  ViT-B  ViT-L  ViT-H  ViT-H448  prev best
ImageNet-1K (no external data)    83.6   85.9   86.9   87.8      87.1
following are evaluation of the same model weights (fine-tuned in original ImageNet-1K):
ImageNet-Corruption (error rate)  51.7   41.8   33.8   36.8      42.5
ImageNet-Adversarial              35.9   57.1   68.2   76.7      35.8
ImageNet-Rendition                48.3   59.9   64.4   66.5      48.7
ImageNet-Sketch                   34.5   45.3   49.6   50.9      36.0
following are transfer learning by fine-tuning the pre-trained MAE on the target dataset:
iNaturalists 2017                 70.5   75.7   79.3   83.4      75.4
iNaturalists 2018                 75.4   80.1   83.0   86.8      81.2
iNaturalists 2019                 80.5   83.4   85.7   88.3      84.1
Places205                         63.9   65.8   65.9   66.8      66.0
Places365                         57.9   59.4   59.8   60.3      58.0
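To read the table at a glance, the sketch below computes how far the best MAE variant in a row sits from the previous best, using a few rows copied from the table. It treats the ImageNet-Corruption row as lower-is-better because it is labeled as an error rate; the helper name and the choice of rows are illustrative.

```python
# row name -> (MAE scores for ViT-B, ViT-L, ViT-H, ViT-H448; previous best; higher_is_better)
rows = {
    "ImageNet-1K": ([83.6, 85.9, 86.9, 87.8], 87.1, True),
    "ImageNet-Corruption (error rate)": ([51.7, 41.8, 33.8, 36.8], 42.5, False),
    "iNaturalists 2017": ([70.5, 75.7, 79.3, 83.4], 75.4, True),
    "Places365": ([57.9, 59.4, 59.8, 60.3], 58.0, True),
}

def margin_over_previous_best(scores, prev_best, higher_is_better):
    """Improvement of the best MAE variant over the previous best (positive = better)."""
    if higher_is_better:
        return round(max(scores) - prev_best, 1)
    return round(prev_best - min(scores), 1)

margins = {name: margin_over_previous_best(*row) for name, row in rows.items()}
for name, margin in margins.items():
    print(f"{name}: {margin:+.1f}")
# ImageNet-1K: +0.7
# ImageNet-Corruption (error rate): +8.7
# iNaturalists 2017: +8.0
# Places365: +2.3
```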

Pre-training

The pre-training instruction is in PRETRAIN.md.

License

This project is under the CC-BY-NC 4.0 license. See LICENSE for details.