facebookresearch/mae

PyTorch implementation of MAE: https://arxiv.org/abs/2111.06377

Top Related Projects

  • microsoft/unilm: Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
  • huggingface/transformers: 🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX
  • openai/CLIP: Contrastive Language-Image Pretraining; predict the most relevant text snippet given an image
  • google-research/bert: TensorFlow code and pre-trained models for BERT
  • pytorch/vision: Datasets, Transforms and Models specific to Computer Vision

Quick Overview

MAE (Masked Autoencoders) is a self-supervised learning framework for computer vision tasks, developed by Facebook Research. It uses a masked image modeling approach to pre-train vision transformers, achieving state-of-the-art results on various downstream tasks such as image classification and object detection.
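The idea behind the pre-training is simple: split each image into patches, hide a large random subset of them (75% in the paper), encode only the visible patches, and train a lightweight decoder to reconstruct the missing pixels. The snippet below is a minimal sketch of the random-masking step to illustrate the idea; tensor names and shapes are illustrative rather than copied from the repository.

import torch

def random_masking(x, mask_ratio=0.75):
    """Randomly keep a subset of patch tokens.

    x: patch embeddings of shape [batch, num_patches, dim]
    Returns the kept tokens, a binary mask (1 = removed), and the indices
    needed to restore the original patch order.
    """
    B, L, D = x.shape
    len_keep = int(L * (1 - mask_ratio))

    noise = torch.rand(B, L)                      # one uniform sample per patch
    ids_shuffle = torch.argsort(noise, dim=1)     # ascending: smallest noise is kept
    ids_restore = torch.argsort(ids_shuffle, dim=1)

    ids_keep = ids_shuffle[:, :len_keep]
    x_kept = torch.gather(x, 1, ids_keep.unsqueeze(-1).repeat(1, 1, D))

    mask = torch.ones(B, L)
    mask[:, :len_keep] = 0
    mask = torch.gather(mask, 1, ids_restore)     # back in original patch order

    return x_kept, mask, ids_restore

# Example: 196 patches (14 x 14) of dimension 768; keep 25% of them
tokens = torch.randn(2, 196, 768)
kept, mask, ids_restore = random_masking(tokens)
print(kept.shape)  # torch.Size([2, 49, 768])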

Pros

  • Achieves high performance on various computer vision tasks
  • Requires less labeled data for training compared to traditional supervised methods
  • Scalable to large datasets and model sizes
  • Demonstrates good transfer learning capabilities

Cons

  • Computationally intensive, requiring significant resources for training
  • May not be as effective for small-scale datasets or limited computing environments
  • Requires fine-tuning for specific downstream tasks
  • Complex architecture that may be challenging to understand and implement for beginners

Code Examples

  1. Loading a pre-trained MAE model:
import torch
from models_mae import mae_vit_base_patch16

# Load pre-trained MAE weights (the official checkpoints store them under the 'model' key)
model = mae_vit_base_patch16()
checkpoint = torch.load('path/to/pretrained/weights.pth', map_location='cpu')
model.load_state_dict(checkpoint['model'], strict=False)

  2. Preparing input data for MAE:
from PIL import Image
from torchvision import transforms

# Define image transformations
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

# Apply transformations to an image
image = Image.open('path/to/image.jpg')
input_tensor = transform(image).unsqueeze(0)

  3. Running the MAE forward pass (reconstruction):
# Set model to evaluation mode
model.eval()

# The forward pass masks 75% of the patches by default and returns the
# reconstruction loss, the predicted patches, and the binary mask
with torch.no_grad():
    loss, pred, mask = model(input_tensor, mask_ratio=0.75)

# For image classification, fine-tune the encoder on labeled data (see FINETUNE.md);
# the pre-training model itself does not output class logits

Getting Started

To get started with MAE:

  1. Clone the repository:

    git clone https://github.com/facebookresearch/mae.git
    cd mae
    
  2. Install dependencies (the repo is based on timm==0.3.2; installation and data preparation otherwise follow the DeiT repo):

    pip install timm==0.3.2
    
  3. Download pre-trained weights or train your own model:

    import torch
    from models_mae import mae_vit_base_patch16

    # Load pre-trained weights (stored under the 'model' key of the checkpoint)
    model = mae_vit_base_patch16()
    checkpoint = torch.load('path/to/pretrained/weights.pth', map_location='cpu')
    model.load_state_dict(checkpoint['model'], strict=False)

    # Or train your own model (simplified; the actual loop in main_pretrain.py also
    # passes a loss scaler and the parsed command-line args)
    from engine_pretrain import train_one_epoch

    for epoch in range(num_epochs):
        train_one_epoch(model, data_loader, optimizer, device, epoch, loss_scaler, args=args)
    
  4. Use the model for downstream tasks, fine-tuning it as needed (see the sketch below).
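
For classification, the usual route is to initialize a standard ViT classifier from the MAE-pretrained encoder and then fine-tune it on labeled data. The sketch below assumes the repository's models_vit.py and util/pos_embed.py helpers (as used by the fine-tuning script) and is an outline rather than a replacement for main_finetune.py:

import torch
from models_vit import vit_base_patch16           # ViT encoder with a classification head
from util.pos_embed import interpolate_pos_embed  # adapts positional embeddings if needed

# Build the classifier and initialize its encoder from the MAE checkpoint
model = vit_base_patch16(num_classes=1000, global_pool=True)
checkpoint = torch.load('mae_pretrain_vit_base.pth', map_location='cpu')
checkpoint_model = checkpoint['model']

# Drop any head weights from the checkpoint; the classification head is trained from scratch
for k in ('head.weight', 'head.bias'):
    checkpoint_model.pop(k, None)

interpolate_pos_embed(model, checkpoint_model)
msg = model.load_state_dict(checkpoint_model, strict=False)
print(msg.missing_keys)  # typically the classification head, trained during fine-tuning

# ...then run your usual supervised training loop; see FINETUNE.md for the full recipe.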

Competitor Comparisons

UniLM (microsoft/unilm)

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities

Pros of UniLM

  • Broader scope: Supports multiple NLP tasks including text generation, summarization, and question answering
  • More versatile: Can be fine-tuned for various downstream tasks with minimal modifications
  • Active development: Regularly updated with new models and features

Cons of UniLM

  • More complex: Requires deeper understanding of NLP concepts to utilize effectively
  • Larger model size: Generally requires more computational resources for training and inference
  • Less focused: May not achieve state-of-the-art performance on specific vision tasks like MAE

Code Comparison

MAE (PyTorch):

import torch
from models_mae import mae_vit_base_patch16

model = mae_vit_base_patch16()
x = torch.randn(1, 3, 224, 224)
loss, pred, mask = model(x, mask_ratio=0.75)

UniLM (PyTorch):

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("microsoft/unilm-base-cased")
model = AutoModel.from_pretrained("microsoft/unilm-base-cased")
inputs = tokenizer("Hello, how are you?", return_tensors="pt")
outputs = model(**inputs)

vision_transformer (google-research/vision_transformer)

Pros of vision_transformer

  • More established and widely recognized implementation of Vision Transformers
  • Extensive documentation and examples for various use cases
  • Supports a broader range of Vision Transformer variants and architectures

Cons of vision_transformer

  • Less focus on self-supervised learning techniques
  • May require more computational resources for training and inference
  • Limited integration with other advanced vision techniques like masked autoencoders

Code Comparison

vision_transformer:

class VisionTransformer(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_chans=3, num_classes=1000,
                 embed_dim=768, depth=12, num_heads=12, mlp_ratio=4., qkv_bias=True,
                 representation_size=None, distilled=False, drop_rate=0.):
        super().__init__()
        # ... (implementation details)

mae:

class MaskedAutoencoderViT(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_chans=3,
                 embed_dim=1024, depth=24, num_heads=16,
                 decoder_embed_dim=512, decoder_depth=8, decoder_num_heads=16,
                 mlp_ratio=4., norm_layer=nn.LayerNorm, norm_pix_loss=False):
        super().__init__()
        # ... (implementation details)

transformers (huggingface/transformers)

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX

Pros of transformers

  • Broader scope: Supports a wide range of NLP tasks and models
  • Extensive documentation and community support
  • Regular updates and new model implementations

Cons of transformers

  • Larger codebase, potentially more complex to navigate
  • May have higher computational requirements for some models

Code comparison

mae:

import torch
from models_mae import mae_vit_base_patch16

model = mae_vit_base_patch16()
model.eval()
x = torch.randn(1, 3, 224, 224)
loss, y, mask = model(x, mask_ratio=0.75)

transformers:

from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
outputs = model(**inputs)

Key differences

  • mae focuses on masked autoencoders for vision tasks
  • transformers provides a wide range of NLP models and tasks
  • mae's codebase is more specialized and compact
  • transformers offers more flexibility and pre-trained models

CLIP (openai/CLIP)

CLIP (Contrastive Language-Image Pretraining): predict the most relevant text snippet given an image

Pros of CLIP

  • Multimodal learning: CLIP can understand both images and text, enabling versatile applications
  • Zero-shot learning capabilities: Can classify images into arbitrary categories without fine-tuning
  • Robust performance across diverse datasets and tasks

Cons of CLIP

  • Computationally intensive: Requires significant resources for training and inference
  • Limited to image-text pairs: May not capture more complex relationships between modalities
  • Potential biases: Can inherit biases present in the large-scale training data

Code Comparison

CLIP example:

import torch
from PIL import Image
import clip

model, preprocess = clip.load("ViT-B/32", device="cuda")
image = preprocess(Image.open("image.jpg")).unsqueeze(0).to("cuda")
text = clip.tokenize(["a dog", "a cat"]).to("cuda")

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
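
To illustrate the zero-shot classification mentioned above, CLIP compares the image against each candidate text prompt and takes a softmax over the similarities; the two labels are just the ones tokenized in the snippet:

with torch.no_grad():
    # logits_per_image scores the image against every text prompt
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1)

print(probs)  # the higher probability indicates the better-matching caption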

MAE example:

import torch
from models_mae import mae_vit_base_patch16_dec512d8b
from util.pos_embed import interpolate_pos_embed

model = mae_vit_base_patch16_dec512d8b()
checkpoint = torch.load('mae_pretrain_vit_base.pth', map_location='cpu')
msg = model.load_state_dict(checkpoint['model'], strict=False)

BERT (google-research/bert)

TensorFlow code and pre-trained models for BERT

Pros of BERT

  • Widely adopted and extensively used in NLP tasks
  • Comprehensive documentation and extensive community support
  • Pre-trained models available for various languages and domains

Cons of BERT

  • Computationally intensive, requiring significant resources for training
  • Limited to text-based tasks, not suitable for image processing
  • Potential for biases in pre-trained models based on training data

Code Comparison

BERT (Python):

import tensorflow as tf
from transformers import BertTokenizer, TFBertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertModel.from_pretrained('bert-base-uncased')

MAE (Python):

import torch
from models_mae import mae_vit_base_patch16

model = mae_vit_base_patch16()
checkpoint = torch.load('mae_pretrain_vit_base.pth', map_location='cpu')
model.load_state_dict(checkpoint['model'], strict=False)

Key Differences

  • BERT focuses on natural language processing tasks, while MAE is designed for computer vision applications
  • BERT uses a transformer architecture for text, while MAE uses Vision Transformers for images
  • BERT requires tokenization of text input; MAE works directly with image pixel data, split into fixed-size patches (see the sketch after this list)
  • BERT has a larger ecosystem of pre-trained models and fine-tuning scripts; MAE is more specialized for self-supervised learning in computer vision
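
To make the pixel-data point concrete, the sketch below shows the patch "tokenization" MAE applies to an image: it is reshaped into a sequence of non-overlapping patches, each flattened into a vector (this mirrors the patchify step described in the paper; the shapes are illustrative):

import torch

def patchify(imgs, patch_size=16):
    """Turn images [B, 3, H, W] into patch vectors [B, num_patches, patch_size**2 * 3]."""
    B, C, H, W = imgs.shape
    assert H % patch_size == 0 and W % patch_size == 0
    h, w = H // patch_size, W // patch_size
    x = imgs.reshape(B, C, h, patch_size, w, patch_size)
    x = x.permute(0, 2, 4, 3, 5, 1)   # [B, h, w, patch, patch, C]
    return x.reshape(B, h * w, patch_size * patch_size * C)

imgs = torch.randn(1, 3, 224, 224)
print(patchify(imgs).shape)  # torch.Size([1, 196, 768]) -- 196 patch "tokens" of length 768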

vision (pytorch/vision)

Datasets, Transforms and Models specific to Computer Vision

Pros of vision

  • Broader scope: Covers a wide range of computer vision tasks and models
  • More established: Longer development history and larger community support
  • Extensive documentation: Comprehensive guides and examples for various use cases

Cons of vision

  • Less focused: Not specialized in self-supervised learning like MAE
  • Potentially slower adoption of cutting-edge techniques: May take longer to implement the latest research advancements

Code comparison

MAE:

model = mae_vit_base_patch16()
checkpoint = torch.load(args.checkpoint, map_location='cpu')
model.load_state_dict(checkpoint['model'], strict=False)

vision:

model = torchvision.models.resnet50(pretrained=True)
model.eval()

MAE focuses on masked autoencoders for self-supervised learning, while vision provides a broader set of pre-trained models and utilities for various computer vision tasks. MAE's code is more specialized for its specific approach, whereas vision offers a more general-purpose interface for accessing and using different models.

Both repositories are valuable for different use cases. MAE is ideal for researchers and practitioners interested in self-supervised learning and masked autoencoders, while vision is better suited for those working on a wide range of computer vision applications and requiring access to various pre-trained models and utilities.

README

Masked Autoencoders: A PyTorch Implementation

This is a PyTorch/GPU re-implementation of the paper Masked Autoencoders Are Scalable Vision Learners:

@Article{MaskedAutoencoders2021,
  author  = {Kaiming He and Xinlei Chen and Saining Xie and Yanghao Li and Piotr Doll{\'a}r and Ross Girshick},
  journal = {arXiv:2111.06377},
  title   = {Masked Autoencoders Are Scalable Vision Learners},
  year    = {2021},
}
  • The original implementation was in TensorFlow+TPU. This re-implementation is in PyTorch+GPU.

  • This repo is a modification on the DeiT repo. Installation and preparation follow that repo.

  • This repo is based on timm==0.3.2, for which a fix is needed to work with PyTorch 1.8.1+.
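
The fix itself is not spelled out here; a commonly applied workaround (an assumption on our part, based on how timm 0.3.2 breaks on newer PyTorch) is to patch the removed torch._six import in timm's helpers module:

# timm/models/layers/helpers.py (timm 0.3.2)
# Replace the line that fails on newer PyTorch:
#     from torch._six import container_abcs
# with a version check such as:
import torch
if tuple(int(v) for v in torch.__version__.split('.')[:2]) >= (1, 8):
    import collections.abc as container_abcs
else:
    from torch._six import container_abcs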

Catalog

  • Visualization demo
  • Pre-trained checkpoints + fine-tuning code
  • Pre-training code

Visualization demo

Run our interactive visualization demo using the Colab notebook (no GPU is needed).
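
For a rough picture of what the demo does, the sketch below runs one image through a loaded models_mae model and maps the predicted patches back to image space (a simplified outline; the notebook additionally handles un-normalization and the norm_pix_loss case):

import torch

# model: a loaded MaskedAutoencoderViT; img: an image tensor of shape [1, 3, 224, 224]
model.eval()
with torch.no_grad():
    loss, pred, mask = model(img, mask_ratio=0.75)

    # patchify/unpatchify are methods provided by the model
    reconstruction = model.unpatchify(pred)                                   # [1, 3, 224, 224]
    masked_input = model.unpatchify(model.patchify(img) * (1 - mask).unsqueeze(-1))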

Fine-tuning with pre-trained checkpoints

The following table provides the pre-trained checkpoints used in the paper, converted from TF/TPU to PT/GPU:

                          ViT-Base    ViT-Large   ViT-Huge
pre-trained checkpoint    download    download    download
md5                       8cad7c      b8b06e      9bdbb0

The fine-tuning instruction is in FINETUNE.md.

By fine-tuning these pre-trained models, we rank #1 in these classification tasks (detailed in the paper):

                                    ViT-B   ViT-L   ViT-H   ViT-H448   prev best
ImageNet-1K (no external data)      83.6    85.9    86.9    87.8       87.1

The following are evaluations of the same model weights (fine-tuned on the original ImageNet-1K):

ImageNet-Corruption (error rate)    51.7    41.8    33.8    36.8       42.5
ImageNet-Adversarial                35.9    57.1    68.2    76.7       35.8
ImageNet-Rendition                  48.3    59.9    64.4    66.5       48.7
ImageNet-Sketch                     34.5    45.3    49.6    50.9       36.0

The following are transfer learning results, obtained by fine-tuning the pre-trained MAE on the target dataset:

iNaturalists 2017                   70.5    75.7    79.3    83.4       75.4
iNaturalists 2018                   75.4    80.1    83.0    86.8       81.2
iNaturalists 2019                   80.5    83.4    85.7    88.3       84.1
Places205                           63.9    65.8    65.9    66.8       66.0
Places365                           57.9    59.4    59.8    60.3       58.0

Pre-training

The pre-training instruction is in PRETRAIN.md.
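
PRETRAIN.md covers the full recipe (distributed launch, learning-rate schedule, augmentation). At its heart is the reconstruction objective: a mean-squared error computed only on the masked patches, optionally on per-patch normalized pixels (the norm_pix_loss setting). Below is a minimal sketch of that loss, with illustrative argument names:

import torch

def mae_reconstruction_loss(target_patches, pred, mask, norm_pix_loss=True):
    """MSE over masked patches only.

    target_patches: [B, L, p*p*3] patchified input images
    pred:           [B, L, p*p*3] patches predicted by the decoder
    mask:           [B, L] binary mask, 1 = patch was masked and must be reconstructed
    """
    target = target_patches
    if norm_pix_loss:
        # normalize each target patch to zero mean / unit variance
        mean = target.mean(dim=-1, keepdim=True)
        var = target.var(dim=-1, keepdim=True)
        target = (target - mean) / (var + 1e-6) ** 0.5

    loss = (pred - target) ** 2
    loss = loss.mean(dim=-1)                 # per-patch loss, shape [B, L]
    return (loss * mask).sum() / mask.sum()  # average over masked patches only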

License

This project is under the CC-BY-NC 4.0 license. See LICENSE for details.