Top Related Projects
- Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
- 🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
- CLIP (Contrastive Language-Image Pretraining), Predict the most relevant text snippet given an image
- TensorFlow code and pre-trained models for BERT
- Datasets, Transforms and Models specific to Computer Vision
Quick Overview
MAE (Masked Autoencoders) is a self-supervised learning framework for computer vision tasks, developed by Facebook Research. It uses a masked image modeling approach to pre-train vision transformers, achieving state-of-the-art results on various downstream tasks such as image classification and object detection.
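To make the core idea concrete, here is a minimal, hedged sketch of the masking step (an illustration of the concept, not the repository's actual implementation): split the image into patches, keep a random 25% of them for the encoder, and train the model to reconstruct the pixels of the hidden patches.
import torch

# Illustration only: patchify a batch of images and keep a random 25% subset,
# mirroring the random masking MAE applies before the encoder sees the data.
imgs = torch.randn(8, 3, 224, 224)                  # batch of images
patches = imgs.unfold(2, 16, 16).unfold(3, 16, 16)  # 14x14 grid of 16x16 patches
patches = patches.contiguous().view(8, 3, 196, 256).permute(0, 2, 1, 3).reshape(8, 196, -1)

mask_ratio = 0.75
num_keep = int(196 * (1 - mask_ratio))              # 49 visible patches
noise = torch.rand(8, 196)                          # random score per patch
keep_idx = noise.argsort(dim=1)[:, :num_keep]       # indices of the visible patches
visible = torch.gather(patches, 1, keep_idx.unsqueeze(-1).expand(-1, -1, patches.size(-1)))
print(visible.shape)  # torch.Size([8, 49, 768]): only 25% of the patches go to the encoder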
Pros
- Achieves high performance on various computer vision tasks
- Requires less labeled data for training compared to traditional supervised methods
- Scalable to large datasets and model sizes
- Demonstrates good transfer learning capabilities
Cons
- Computationally intensive, requiring significant resources for training
- May not be as effective for small-scale datasets or limited computing environments
- Requires fine-tuning for specific downstream tasks
- Complex architecture that may be challenging to understand and implement for beginners
Code Examples
- Loading a pre-trained MAE model:
import torch
from mae import mae_vit_base_patch16  # in the repository itself, the constructors live in models_mae.py

# Load a pre-trained MAE model (released checkpoints store the weights under the 'model' key)
model = mae_vit_base_patch16()
checkpoint = torch.load('path/to/pretrained/weights.pth', map_location='cpu')
model.load_state_dict(checkpoint.get('model', checkpoint), strict=False)
- Preparing input data for MAE:
from PIL import Image
from torchvision import transforms

# Define ImageNet-style image transformations
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

# Apply the transformations to an image and add a batch dimension
image = Image.open('path/to/image.jpg').convert('RGB')
input_tensor = transform(image).unsqueeze(0)
- Performing inference with MAE (the pre-training model returns a reconstruction loss, per-patch predictions, and the binary mask; class labels require a fine-tuned classifier):
# Set model to evaluation mode
model.eval()

# Run the pre-trained model
with torch.no_grad():
    loss, pred, mask = model(input_tensor, mask_ratio=0.75)

# For image classification, run a fine-tuned ViT instead and take the argmax over its logits:
# predicted_class = torch.argmax(logits, dim=1)
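To see what the pre-trained model reconstructs, the per-patch predictions from the example above can be folded back into an image. A one-line sketch, assuming the unpatchify helper defined on the repository's MaskedAutoencoderViT class:
reconstruction = model.unpatchify(pred)  # (1, 3, 224, 224) image predicted from 25% of the patches
# The repo's visualization notebook additionally pastes the visible input patches back
# over the reconstruction; see the Colab demo referenced in the README below.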
Getting Started
To get started with MAE:
- Clone the repository:
git clone https://github.com/facebookresearch/mae.git
cd mae
- Install dependencies:
pip install -r requirements.txt
- Download pre-trained weights or train your own model:
import torch
from mae import mae_vit_base_patch16

# Load pre-trained weights (released checkpoints store them under the 'model' key)
model = mae_vit_base_patch16()
checkpoint = torch.load('path/to/pretrained/weights.pth', map_location='cpu')
model.load_state_dict(checkpoint.get('model', checkpoint), strict=False)

# Or train your own model (simplified sketch; the repo's engine_pretrain.train_one_epoch
# also expects a loss scaler and an args namespace -- see PRETRAIN.md)
from engine_pretrain import train_one_epoch
for epoch in range(num_epochs):
    train_one_epoch(model, data_loader, optimizer, device, epoch)
- Use the model for downstream tasks or fine-tune it as needed (a minimal linear-probing sketch follows below).
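As a rough illustration of the last step, here is a hedged sketch of linear probing on top of the frozen pre-trained encoder. It reuses the simplified mae_vit_base_patch16 import from the examples above and assumes the forward_encoder(imgs, mask_ratio) method defined on the repository's MaskedAutoencoderViT; the officially supported path is the fine-tuning recipe in FINETUNE.md.
import torch
import torch.nn as nn
from mae import mae_vit_base_patch16  # simplified import, as in the examples above

# Freeze the pre-trained encoder
backbone = mae_vit_base_patch16()
ckpt = torch.load('path/to/pretrained/weights.pth', map_location='cpu')
backbone.load_state_dict(ckpt.get('model', ckpt), strict=False)
for p in backbone.parameters():
    p.requires_grad = False

# Linear head on top of the ViT-Base embedding (768-d), e.g. for 1000 classes
head = nn.Linear(768, 1000)
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def probe_step(images, labels):
    """One linear-probe step: frozen features in, classification loss out."""
    with torch.no_grad():
        # Assumption: encode without masking and use the class token as the feature
        latent, _, _ = backbone.forward_encoder(images, mask_ratio=0.0)
        features = latent[:, 0]
    logits = head(features)
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()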
Competitor Comparisons
Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
Pros of UniLM
- Broader scope: Supports multiple NLP tasks including text generation, summarization, and question answering
- More versatile: Can be fine-tuned for various downstream tasks with minimal modifications
- Active development: Regularly updated with new models and features
Cons of UniLM
- More complex: Requires deeper understanding of NLP concepts to utilize effectively
- Larger model size: Generally requires more computational resources for training and inference
- Less focused: May not match MAE's state-of-the-art performance on specific vision tasks
Code Comparison
MAE (PyTorch):
import torch
from mae import mae_vit_base_patch16
model = mae_vit_base_patch16()
x = torch.randn(1, 3, 224, 224)
loss, pred, mask = model(x, mask_ratio=0.75)
UniLM (PyTorch):
# Illustrative only: UniLM models ship with the microsoft/unilm repo's own fine-tuning
# code, and transformers does not expose `UniLM*` classes under these exact names.
from transformers import UniLMForConditionalGeneration, UniLMTokenizer
model = UniLMForConditionalGeneration.from_pretrained("microsoft/unilm-base-cased")
tokenizer = UniLMTokenizer.from_pretrained("microsoft/unilm-base-cased")
inputs = tokenizer("Hello, how are you?", return_tensors="pt")
outputs = model.generate(**inputs)
Pros of vision_transformer
- More established and widely recognized implementation of Vision Transformers
- Extensive documentation and examples for various use cases
- Supports a broader range of Vision Transformer variants and architectures
Cons of vision_transformer
- Less focus on self-supervised learning techniques
- May require more computational resources for training and inference
- Limited integration with other advanced vision techniques like masked autoencoders
Code Comparison
vision_transformer:
class VisionTransformer(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_chans=3, num_classes=1000,
                 embed_dim=768, depth=12, num_heads=12, mlp_ratio=4., qkv_bias=True,
                 representation_size=None, distilled=False, drop_rate=0.):
        super().__init__()
        # ... (implementation details)
mae:
class MaskedAutoencoderViT(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_chans=3,
                 embed_dim=1024, depth=24, num_heads=16,
                 decoder_embed_dim=512, decoder_depth=8, decoder_num_heads=16,
                 mlp_ratio=4., norm_layer=nn.LayerNorm, norm_pix_loss=False):
        super().__init__()
        # ... (implementation details)
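The defaults above show MAE's asymmetric design: a large encoder (1024-dim, 24 blocks) paired with a much lighter decoder (512-dim, 8 blocks) that is discarded after pre-training. A quick shape check, assuming the repository's models_mae module (the same one used in the CLIP comparison below):
import torch
from models_mae import mae_vit_large_patch16_dec512d8b  # matches the defaults shown above

model = mae_vit_large_patch16_dec512d8b()
x = torch.randn(2, 3, 224, 224)
loss, pred, mask = model(x, mask_ratio=0.75)
print(pred.shape)  # torch.Size([2, 196, 768]): per-patch pixel predictions (16*16*3 values each)
print(mask.shape)  # torch.Size([2, 196]): 1 marks the patches hidden from the encoder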
🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
Pros of transformers
- Broader scope: Supports a wide range of NLP tasks and models
- Extensive documentation and community support
- Regular updates and new model implementations
Cons of transformers
- Larger codebase, potentially more complex to navigate
- May have higher computational requirements for some models
Code comparison
mae:
model = mae_vit_base_patch16()
model.eval()
x = torch.randn(1, 3, 224, 224)
loss, y, mask = model(x, mask_ratio=0.75)
transformers:
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
outputs = model(**inputs)
Key differences
- mae focuses on masked autoencoders for vision tasks
- transformers provides a wide range of NLP models and tasks
- mae's codebase is more specialized and compact
- transformers offers more flexibility and pre-trained models
CLIP (Contrastive Language-Image Pretraining), Predict the most relevant text snippet given an image
Pros of CLIP
- Multimodal learning: CLIP can understand both images and text, enabling versatile applications
- Zero-shot learning capabilities: Can classify images into arbitrary categories without fine-tuning
- Robust performance across diverse datasets and tasks
Cons of CLIP
- Computationally intensive: Requires significant resources for training and inference
- Limited to image-text pairs: May not capture more complex relationships between modalities
- Potential biases: Can inherit biases present in the large-scale training data
Code Comparison
CLIP example:
import torch
from PIL import Image
import clip

model, preprocess = clip.load("ViT-B/32", device="cuda")
image = preprocess(Image.open("image.jpg")).unsqueeze(0).to("cuda")
text = clip.tokenize(["a dog", "a cat"]).to("cuda")
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
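The zero-shot classification noted above comes from comparing these two embeddings; the standard continuation with the openai/CLIP package is:
with torch.no_grad():
    logits_per_image, logits_per_text = model(image, text)
probs = logits_per_image.softmax(dim=-1)  # probabilities over ["a dog", "a cat"]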
MAE example:
import torch
from models_mae import mae_vit_base_patch16_dec512d8b
from util.pos_embed import interpolate_pos_embed
model = mae_vit_base_patch16_dec512d8b()
checkpoint = torch.load('mae_pretrain_vit_base.pth', map_location='cpu')
msg = model.load_state_dict(checkpoint['model'], strict=False)
TensorFlow code and pre-trained models for BERT
Pros of BERT
- Widely adopted and extensively used in NLP tasks
- Comprehensive documentation and extensive community support
- Pre-trained models available for various languages and domains
Cons of BERT
- Computationally intensive, requiring significant resources for training
- Limited to text-based tasks, not suitable for image processing
- Potential for biases in pre-trained models based on training data
Code Comparison
BERT (Python):
import tensorflow as tf
from transformers import BertTokenizer, TFBertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertModel.from_pretrained('bert-base-uncased')
MAE (Python):
import torch
from mae import mae_vit_base_patch16
model = mae_vit_base_patch16()
checkpoint = torch.load('mae_pretrain_vit_base.pth', map_location='cpu')
model.load_state_dict(checkpoint['model'], strict=False)
Key Differences
- BERT focuses on natural language processing tasks, while MAE is designed for computer vision applications
- BERT uses the Transformer architecture for text, while MAE uses Vision Transformers for images
- BERT requires tokenization of text input, while MAE works directly with image pixel data (see the sketch below)
- BERT has a larger ecosystem of pre-trained models and fine-tuning scripts, while MAE is more specialized for self-supervised learning in computer vision
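To make the input-format contrast concrete, a small sketch reusing the objects from the snippets above (the token count is illustrative):
# BERT input: integer token IDs produced by a tokenizer
inputs = tokenizer("a photo of a dog", return_tensors="pt")
print(inputs["input_ids"].shape)  # e.g. torch.Size([1, 7]) including [CLS] and [SEP]

# MAE input: a raw pixel tensor, with no tokenization step
x = torch.randn(1, 3, 224, 224)   # (batch, channels, height, width)
loss, pred, mask = model(x, mask_ratio=0.75)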
Datasets, Transforms and Models specific to Computer Vision
Pros of vision
- Broader scope: Covers a wide range of computer vision tasks and models
- More established: Longer development history and larger community support
- Extensive documentation: Comprehensive guides and examples for various use cases
Cons of vision
- Less focused: Not specialized in self-supervised learning like MAE
- Potentially slower adoption of cutting-edge techniques: May take longer to implement the latest research advancements
Code comparison
MAE:
model = mae_vit_base_patch16()
checkpoint = torch.load(args.checkpoint, map_location='cpu')
model.load_state_dict(checkpoint['model'], strict=False)
vision:
model = torchvision.models.resnet50(pretrained=True)
model.eval()
MAE focuses on masked autoencoders for self-supervised learning, while vision provides a broader set of pre-trained models and utilities for various computer vision tasks. MAE's code is more specialized for its specific approach, whereas vision offers a more general-purpose interface for accessing and using different models.
Both repositories are valuable for different use cases. MAE is ideal for researchers and practitioners interested in self-supervised learning and masked autoencoders, while vision is better suited for those working on a wide range of computer vision applications and requiring access to various pre-trained models and utilities.
README
Masked Autoencoders: A PyTorch Implementation
This is a PyTorch/GPU re-implementation of the paper Masked Autoencoders Are Scalable Vision Learners:
@Article{MaskedAutoencoders2021,
author = {Kaiming He and Xinlei Chen and Saining Xie and Yanghao Li and Piotr Doll{\'a}r and Ross Girshick},
journal = {arXiv:2111.06377},
title = {Masked Autoencoders Are Scalable Vision Learners},
year = {2021},
}
- The original implementation was in TensorFlow+TPU. This re-implementation is in PyTorch+GPU.
- This repo is a modification on the DeiT repo. Installation and preparation follow that repo.
- This repo is based on timm==0.3.2, for which a fix is needed to work with PyTorch 1.8.1+.
Catalog
- Visualization demo
- Pre-trained checkpoints + fine-tuning code
- Pre-training code
Visualization demo
Run our interactive visualization demo using the Colab notebook linked in the repository (no GPU needed).
Fine-tuning with pre-trained checkpoints
The following table provides the pre-trained checkpoints used in the paper, converted from TF/TPU to PT/GPU:
|  | ViT-Base | ViT-Large | ViT-Huge |
|---|---|---|---|
| pre-trained checkpoint | download | download | download |
| md5 | 8cad7c | b8b06e | 9bdbb0 |
The fine-tuning instruction is in FINETUNE.md.
By fine-tuning these pre-trained models, we rank #1 in these classification tasks (detailed in the paper):
|  | ViT-B | ViT-L | ViT-H | ViT-H448 | prev best |
|---|---|---|---|---|---|
| ImageNet-1K (no external data) | 83.6 | 85.9 | 86.9 | 87.8 | 87.1 |
| the following are evaluations of the same model weights (fine-tuned on the original ImageNet-1K): | | | | | |
| ImageNet-Corruption (error rate) | 51.7 | 41.8 | 33.8 | 36.8 | 42.5 |
| ImageNet-Adversarial | 35.9 | 57.1 | 68.2 | 76.7 | 35.8 |
| ImageNet-Rendition | 48.3 | 59.9 | 64.4 | 66.5 | 48.7 |
| ImageNet-Sketch | 34.5 | 45.3 | 49.6 | 50.9 | 36.0 |
| the following are transfer learning results, obtained by fine-tuning the pre-trained MAE on the target dataset: | | | | | |
| iNaturalists 2017 | 70.5 | 75.7 | 79.3 | 83.4 | 75.4 |
| iNaturalists 2018 | 75.4 | 80.1 | 83.0 | 86.8 | 81.2 |
| iNaturalists 2019 | 80.5 | 83.4 | 85.7 | 88.3 | 84.1 |
| Places205 | 63.9 | 65.8 | 65.9 | 66.8 | 66.0 |
| Places365 | 57.9 | 59.4 | 59.8 | 60.3 | 58.0 |
Pre-training
The pre-training instruction is in PRETRAIN.md.
License
This project is under the CC-BY-NC 4.0 license. See LICENSE for details.