
facebookresearch/moco-v3

PyTorch implementation of MoCo v3 (https://arxiv.org/abs/2104.02057)


Top Related Projects

  • SwAV: PyTorch implementation of SwAV (https://arxiv.org/abs/2006.09882)
  • SimCLRv2: Big Self-Supervised Models are Strong Semi-Supervised Learners
  • SimSiam: PyTorch implementation of SimSiam (https://arxiv.org/abs/2011.10566)

Quick Overview

MoCo v3 is a self-supervised learning framework for computer vision tasks, developed by Facebook AI Research. It builds upon the previous versions of MoCo (Momentum Contrast) and incorporates the latest advancements in contrastive learning and vision transformers to achieve state-of-the-art performance on various downstream tasks.
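
For concreteness, below is a minimal, self-contained sketch of the core MoCo v3 training step as described in the paper: two augmented views, a momentum (EMA) key encoder, and a symmetrized InfoNCE loss computed over in-batch keys (no memory queue). The tiny encoder, temperature, and momentum values are placeholders, not the repository's defaults.

import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in backbone + projector; the real repo uses a ResNet-50 or ViT plus MLP heads.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128))
predictor = nn.Linear(128, 128)            # prediction head, query side only
momentum_encoder = copy.deepcopy(encoder)  # updated by EMA, not by gradients
for p in momentum_encoder.parameters():
    p.requires_grad = False

tau, m = 0.2, 0.99
optimizer = torch.optim.SGD(
    list(encoder.parameters()) + list(predictor.parameters()), lr=0.05)

def ctr(q, k):
    # InfoNCE over in-batch keys: the positive for sample i is key i.
    q, k = F.normalize(q, dim=1), F.normalize(k, dim=1)
    logits = q @ k.t() / tau
    labels = torch.arange(q.size(0))
    return F.cross_entropy(logits, labels) * 2 * tau  # 2*tau scaling as in the paper's pseudocode

# One training step on two augmented views of a batch (random tensors stand in for images).
x1, x2 = torch.randn(8, 3, 32, 32), torch.randn(8, 3, 32, 32)
q1, q2 = predictor(encoder(x1)), predictor(encoder(x2))
with torch.no_grad():
    k1, k2 = momentum_encoder(x1), momentum_encoder(x2)
loss = ctr(q1, k2) + ctr(q2, k1)           # symmetrized loss; no memory queue in v3
optimizer.zero_grad()
loss.backward()
optimizer.step()
# Momentum (EMA) update of the key encoder.
for p, p_m in zip(encoder.parameters(), momentum_encoder.parameters()):
    p_m.data.mul_(m).add_(p.data, alpha=1 - m)

In the actual repo, the encoders are ResNet-50 or ViT backbones with MLP projection heads, and the training script handles the momentum and temperature schedules.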

Pros

  • Achieves high performance on various computer vision tasks with minimal fine-tuning
  • Utilizes vision transformers, which have shown great potential in image recognition
  • Provides a more efficient and effective alternative to supervised pre-training
  • Offers flexibility in terms of model architecture and training strategies

Cons

  • Requires significant computational resources for training, especially with large datasets
  • May be complex to implement and fine-tune for specific use cases
  • Limited documentation and examples for beginners
  • Potential overfitting on certain datasets or tasks if not properly configured

Code Examples

  1. Loading a pre-trained MoCo v3 model. This is a hedged sketch rather than a documented API: the class, constructor arguments, and checkpoint layout below are assumed from the repo's moco/builder.py and main_moco.py, and the state-dict keys are assumed to carry a 'module.' DDP prefix.
import torch
from functools import partial

import moco.builder
import vits  # ViT backbones shipped with the repo

# Build a MoCo v3 ViT-Base model (arguments mirror the ViT settings in main_moco.py)
model = moco.builder.MoCo_ViT(
    partial(vits.vit_base, stop_grad_conv1=True),
    256, 4096, 0.2)

# Load a pre-trained checkpoint, stripping the 'module.' prefix added by DistributedDataParallel
checkpoint = torch.load('vit-b-300ep.pth.tar', map_location="cpu")
state_dict = {k.replace('module.', '', 1): v for k, v in checkpoint['state_dict'].items()}
model.load_state_dict(state_dict)
model.eval()
  2. Extracting features from an image (the encoder attribute below is assumed to be base_encoder, as in moco/builder.py; MoCo v1/v2 used encoder_q):
from torchvision import transforms
from PIL import Image

# Prepare image transformation
transform = transforms.Compose([
    transforms.Resize(224),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Load and transform image
image = Image.open('example.jpg').convert('RGB')
img_tensor = transform(image).unsqueeze(0)

# Extract features with the base (query) encoder; attribute name assumed from moco/builder.py
with torch.no_grad():
    features = model.base_encoder(img_tensor)
  3. Training a linear classifier on frozen MoCo v3 features (a simplified sketch; the repo's main_lincls.py is the reference implementation of linear evaluation):
import torch.nn as nn

# Use the base (query) encoder as a frozen feature extractor
# (attribute name assumed from moco/builder.py)
encoder = model.base_encoder
for param in encoder.parameters():
    param.requires_grad = False

# Infer the encoder's output dimension with a dummy forward pass and attach a linear classifier
with torch.no_grad():
    feat_dim = encoder(torch.zeros(1, 3, 224, 224)).shape[1]
classifier = nn.Linear(feat_dim, num_classes)  # num_classes assumed to be defined

# Train only the classifier
optimizer = torch.optim.SGD(classifier.parameters(), lr=0.01, momentum=0.9)
criterion = nn.CrossEntropyLoss()

# Training loop (simplified; dataloader assumed to be defined)
for inputs, labels in dataloader:
    with torch.no_grad():
        features = encoder(inputs)
    outputs = classifier(features)
    loss = criterion(outputs, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Getting Started

To get started with MoCo v3:

  1. Clone the repository:

    git clone https://github.com/facebookresearch/moco-v3.git
    cd moco-v3
    
  2. Install dependencies (PyTorch, torchvision, and, for ViT models, timm==0.4.9; see Usage: Preparation below):

    pip install timm==0.4.9
    
  3. Download pre-trained models:

    wget https://dl.fbaipublicfiles.com/moco-v3/vit-b-300ep/vit-b-300ep.pth.tar
    
  4. Use the pre-trained model in your project as shown in the code examples above.

Competitor Comparisons

PyTorch implementation of SwAV (https://arxiv.org/abs/2006.09882)

Pros of SwAV

  • Utilizes a clustering-based approach, which can be more effective for certain types of data
  • Supports multi-crop augmentation, potentially leading to better performance on smaller objects
  • Offers a more memory-efficient implementation, beneficial for large-scale training

Cons of SwAV

  • May require more careful hyperparameter tuning compared to MoCo v3
  • Could be less effective on datasets with highly imbalanced classes
  • Potentially more complex to implement and understand for beginners

Code Comparison

SwAV:

loss = swav_loss(output, queue, epoch)
loss.backward()
optimizer.step()

MoCo v3:

q1, q2 = predictor(encoder(x1)), predictor(encoder(x2))
k1, k2 = momentum_encoder(x1), momentum_encoder(x2)
loss = ctr(q1, k2) + ctr(q2, k1)  # symmetrized loss over in-batch keys; v3 drops the memory queue
loss.backward()
optimizer.step()
update_momentum_encoder(m)  # EMA update of the key encoder

Both repositories implement self-supervised learning methods for visual representation learning. SwAV focuses on clustering-based approaches, while MoCo v3 uses contrastive learning techniques. SwAV may offer better performance in certain scenarios, especially with multi-crop augmentation, but might require more careful tuning. MoCo v3, on the other hand, could be easier to implement and more robust across different datasets. The choice between the two depends on the specific use case and available computational resources.

SimCLRv2 - Big Self-Supervised Models are Strong Semi-Supervised Learners

Pros of SimCLR

  • Simpler architecture that does not require a momentum encoder
  • Achieves strong performance with larger batch sizes and longer training
  • More flexible in terms of data augmentation strategies

Cons of SimCLR

  • Generally requires more computational resources for training
  • May not perform as well with smaller batch sizes or shorter training durations
  • Less memory-efficient due to the need for large batch sizes

Code Comparison

SimCLR:

# Data augmentation
image = random_crop_and_resize(image)
image = random_color_distortion(image)
image = random_gaussian_blur(image)

# Projection head
representation = resnet50(image)
projection = mlp_head(representation)

MoCo v3:

# Two augmented views of the same image
x1, x2 = augment(image), augment(image)

# Query encoder (with prediction head) and momentum key encoder; no memory queue in v3
q = predictor(encoder(x1))
k = momentum_encoder(x2)

Both SimCLR and MoCo v3 are self-supervised learning frameworks for visual representation learning. SimCLR relies on contrastive learning with very large batches, while MoCo v3 pairs a momentum encoder with in-batch contrastive learning (unlike MoCo v1/v2, it no longer uses a memory queue). SimCLR's simplicity and flexibility come at the cost of higher computational requirements, whereas MoCo v3 is more efficient but slightly more involved to implement.

PyTorch implementation of SimSiam (https://arxiv.org/abs/2011.10566)

Pros of SimSiam

  • Simpler architecture without requiring a momentum encoder or large batches
  • Potentially faster training due to fewer components
  • Demonstrates that neither negative pairs nor a momentum encoder is necessary for learning useful representations with Siamese networks

Cons of SimSiam

  • May require more careful hyperparameter tuning to prevent collapsing
  • Potentially less robust to different datasets or architectures compared to MoCo v3
  • Slightly lower performance on some benchmarks compared to MoCo v3

Code Comparison

SimSiam:

# SimSiam prediction MLP
self.predictor = nn.Sequential(nn.Linear(dim, pred_dim, bias=False),
                               nn.BatchNorm1d(pred_dim),
                               nn.ReLU(inplace=True),
                               nn.Linear(pred_dim, dim))

MoCo v3:

# MoCo v3 projection MLP
self.projector = nn.Sequential(nn.Linear(dim_in, hidden_dim),
                               nn.BatchNorm1d(hidden_dim),
                               nn.ReLU(inplace=True),
                               nn.Linear(hidden_dim, dim_out))

Both repositories implement self-supervised learning methods for visual representation learning. SimSiam offers a simpler approach without momentum encoders, while MoCo v3 builds upon previous versions with improved performance. The code snippets show the different network architectures used in each method.


README

MoCo v3 for Self-supervised ResNet and ViT

Introduction

This is a PyTorch implementation of MoCo v3 for self-supervised ResNet and ViT.

The original MoCo v3 was implemented in TensorFlow and run on TPUs. This repo re-implements it in PyTorch on GPUs. Despite the library and numerical differences, it reproduces the results and observations reported in the paper.

Main Results

The following results are based on ImageNet-1k self-supervised pre-training, followed by ImageNet-1k supervised training for linear evaluation or end-to-end fine-tuning. All results in these tables are based on a batch size of 4096.

Pre-trained models and configs can be found at CONFIG.md.

ResNet-50, linear classification

pretrain epochs   pretrain crops   linear acc
100               2x224            68.9
300               2x224            72.8
1000              2x224            74.6

ViT, linear classification

model       pretrain epochs   pretrain crops   linear acc
ViT-Small   300               2x224            73.2
ViT-Base    300               2x224            76.7

ViT, end-to-end fine-tuning

model       pretrain epochs   pretrain crops   e2e acc
ViT-Small   300               2x224            81.4
ViT-Base    300               2x224            83.2

The end-to-end fine-tuning results are obtained using the DeiT repo, using all the default DeiT configs. ViT-B is fine-tuned for 150 epochs (vs DeiT-B's 300ep, which has 81.8% accuracy).

Usage: Preparation

Install PyTorch and download the ImageNet dataset following the official PyTorch ImageNet training code. Similar to MoCo v1/2, this repo contains minimal modifications to the official PyTorch ImageNet code; we assume the user can successfully run it. For ViT models, install timm (timm==0.4.9).

The code has been tested with CUDA 10.2 / cuDNN 7.6.5, PyTorch 1.9.0, and timm 0.4.9.

Usage: Self-supervised Pre-Training

Below are three examples for MoCo v3 pre-training.

ResNet-50 with 2-node (16-GPU) training, batch 4096

On the first node, run:

python main_moco.py \
  --moco-m-cos --crop-min=.2 \
  --dist-url 'tcp://[your first node address]:[specified port]' \
  --multiprocessing-distributed --world-size 2 --rank 0 \
  [your imagenet-folder with train and val folders]

On the second node, run the same command with --rank 1. With a batch size of 4096, the training can fit into 2 nodes with a total of 16 Volta 32G GPUs.

ViT-Small with 1-node (8-GPU) training, batch 1024

python main_moco.py \
  -a vit_small -b 1024 \
  --optimizer=adamw --lr=1.5e-4 --weight-decay=.1 \
  --epochs=300 --warmup-epochs=40 \
  --stop-grad-conv1 --moco-m-cos --moco-t=.2 \
  --dist-url 'tcp://localhost:10001' \
  --multiprocessing-distributed --world-size 1 --rank 0 \
  [your imagenet-folder with train and val folders]

ViT-Base with 8-node training, batch 4096

With a batch size of 4096, ViT-Base is trained with 8 nodes:

python main_moco.py \
  -a vit_base \
  --optimizer=adamw --lr=1.5e-4 --weight-decay=.1 \
  --epochs=300 --warmup-epochs=40 \
  --stop-grad-conv1 --moco-m-cos --moco-t=.2 \
  --dist-url 'tcp://[your first node address]:[specified port]' \
  --multiprocessing-distributed --world-size 8 --rank 0 \
  [your imagenet-folder with train and val folders]

On other nodes, run the same command with --rank 1, ..., --rank 7 respectively.

Notes:

  1. The batch size specified by -b is the total batch size across all GPUs.
  2. The learning rate specified by --lr is the base lr; it is adjusted by the linear lr scaling rule in the training script (see the sketch after this list).
  3. Using a smaller batch size gives more stable results (see paper) but trains more slowly. A large batch size is critical for good speed on TPUs (as used in the paper).
  4. Only multi-GPU, DistributedDataParallel training is supported in this repo; single-GPU and DataParallel training are not. The code has been adapted to the multi-node setting and uses automatic mixed precision for pre-training by default.
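
As a rough illustration of note 2, assuming the usual MoCo convention of scaling the base lr by total batch size / 256 (check the training script for the exact rule):

# Linear lr scaling rule (assumed form: base lr scaled by total batch size / 256)
base_lr = 1.5e-4          # value passed via --lr
total_batch_size = 4096   # value passed via -b
lr = base_lr * total_batch_size / 256
print(lr)                 # 0.0024: the effective lr handed to the optimizer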

Usage: Linear Classification

By default, we use momentum-SGD and a batch size of 1024 for linear classification on frozen features/weights. This can be done with a single 8-GPU node.

python main_lincls.py \
  -a [architecture] --lr [learning rate] \
  --dist-url 'tcp://localhost:10001' \
  --multiprocessing-distributed --world-size 1 --rank 0 \
  --pretrained [your checkpoint path]/[your checkpoint file].pth.tar \
  [your imagenet-folder with train and val folders]

Usage: End-to-End Fine-tuning ViT

To perform end-to-end fine-tuning for ViT, use our script to convert the pre-trained ViT checkpoint to DeiT format:

python convert_to_deit.py \
  --input [your checkpoint path]/[your checkpoint file].pth.tar \
  --output [target checkpoint file].pth

Then run the training (in the DeiT repo) with the converted checkpoint:

python $DEIT_DIR/main.py \
  --resume [target checkpoint file].pth \
  --epochs 150

This gives us 83.2% accuracy for ViT-Base with 150-epoch fine-tuning.

Note:

  1. We use --resume rather than --finetune in the DeiT repo, as its --finetune option runs training under eval mode. When loading the pre-trained model, revise model_without_ddp.load_state_dict(checkpoint['model']) to pass strict=False (see the snippet after this list).
  2. Our ViT-Small uses heads=12 in the Transformer block, while the DeiT default is heads=6. Please modify the DeiT code accordingly when fine-tuning our ViT-Small model.
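
The change described in note 1, shown as a hedged sketch of the checkpoint-loading line inside the DeiT repo's main.py (model_without_ddp and checkpoint are DeiT's own variables; the surrounding code may differ):

# Load the converted MoCo v3 checkpoint non-strictly so mismatched keys are ignored
msg = model_without_ddp.load_state_dict(checkpoint['model'], strict=False)
print(msg.missing_keys, msg.unexpected_keys)  # inspect which keys were skipped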

Model Configs

See the commands listed in CONFIG.md for specific model configs, including our recommended hyper-parameters and pre-trained reference models.

Transfer Learning

See the instructions in the transfer dir.

License

This project is under the CC-BY-NC 4.0 license. See LICENSE for details.

Citation

@Article{chen2021mocov3,
  author  = {Xinlei Chen* and Saining Xie* and Kaiming He},
  title   = {An Empirical Study of Training Self-Supervised Vision Transformers},
  journal = {arXiv preprint arXiv:2104.02057},
  year    = {2021},
}