vissl
VISSL is FAIR's library of extensible, modular and scalable components for SOTA Self-Supervised Learning with images.
Top Related Projects
Tensors and Dynamic neural networks in Python with strong GPU acceleration
CLIP (Contrastive Language-Image Pretraining), Predict the most relevant text snippet given an image
Composable transformations of Python+NumPy programs: differentiate, vectorize, JIT to GPU/TPU, and more
🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
PyTorch package for the discrete VAE used for DALL·E.
Quick Overview
VISSL (VIsion library for Self-Supervised Learning) is a PyTorch-based library for self-supervised learning of visual representations. It provides a flexible and modular framework for training and evaluating self-supervised vision models, covering convolutional backbones such as ResNets and RegNets as well as Vision Transformers (ViTs).
Pros
- Flexible and Modular Design: VISSL offers a highly configurable and extensible framework, allowing researchers and developers to easily experiment with different self-supervised learning approaches, model architectures, and training strategies.
- State-of-the-Art Performance: The library includes implementations of various state-of-the-art self-supervised learning methods, such as DINO, MoCo, and SwAV, which have demonstrated impressive results on a wide range of computer vision tasks.
- Extensive Documentation and Examples: VISSL comes with detailed documentation, tutorials, and example scripts, making it easier for users to get started and understand the library's capabilities.
- Active Development and Community: The project is actively maintained by the Facebook AI Research team and has a growing community of contributors, ensuring ongoing improvements and support.
Cons
- Steep Learning Curve: While the library is well-documented, the complexity of self-supervised learning and the flexibility of VISSL's design can make it challenging for newcomers to get started.
- Resource-Intensive Training: Training self-supervised models, especially on large-scale datasets, can be computationally expensive and require significant hardware resources, such as high-end GPUs.
- Limited Support for Non-Vision Tasks: VISSL is primarily focused on self-supervised learning for computer vision tasks, and its applicability to other domains, such as natural language processing or speech recognition, may be limited.
- Potential Bias in Pretrained Models: As with any machine learning model, the pretrained weights provided by VISSL may inherit biases present in the training data, which can impact downstream applications.
Code Examples
# Example: training a DINO model from Python (the config path is illustrative;
# VISSL ships DINO YAMLs under configs/config/pretrain/dino/)
from vissl.utils.hydra_config import compose_hydra_configuration, convert_to_attrdict
from vissl.utils.distributed_launcher import launch_distributed
from vissl.hooks import default_hook_generator
cfg = compose_hydra_configuration(
    [
        "config=pretrain/dino/dino_16gpus_deits16",
        "config.DISTRIBUTED.NUM_NODES=1",
        "config.DISTRIBUTED.NUM_PROC_PER_NODE=8",
    ]
)
args, config = convert_to_attrdict(cfg)
launch_distributed(
    cfg=config,
    node_id=0,
    engine_name="train",
    hook_generator=default_hook_generator,
)
This sketch composes a DINO (self-DIstillation with NO labels) pretraining config with Hydra overrides and hands it to VISSL's distributed launcher, mirroring what the tools/run_distributed_engines.py entry point does.
# Example: evaluating a pretrained model on a downstream benchmark
# (config and checkpoint paths are illustrative)
from vissl.utils.hydra_config import compose_hydra_configuration, convert_to_attrdict
cfg = compose_hydra_configuration(
    [
        "config=benchmark/linear_image_classification/imagenet1k/eval_resnet_8gpu_transfer_in1k_linear",
        "config.MODEL.WEIGHTS_INIT.PARAMS_FILE=path/to/checkpoint.torch",
    ]
)
args, config = convert_to_attrdict(cfg)
# Launch exactly like a pretraining run (see the example above)
Benchmarks in VISSL are ordinary configs: this sketch plugs pretrained weights into a linear image classification benchmark on ImageNet-1K via MODEL.WEIGHTS_INIT and is launched like any other run; top-1/top-5 accuracies are reported by the benchmark itself.
# Example: extracting features with a pretrained trunk
# (assumes `config` is the AttrDict produced by convert_to_attrdict above;
# the checkpoint path is illustrative)
import torch
from vissl.models import build_model
from classy_vision.generic.util import load_checkpoint
from vissl.utils.checkpoint import init_model_from_consolidated_weights
model = build_model(config.MODEL, config.OPTIMIZER)
weights = load_checkpoint(checkpoint_path="path/to/checkpoint.torch")
init_model_from_consolidated_weights(
    config=config, model=model, state_dict=weights,
    state_dict_key_name="classy_state_dict", skip_layers=[],
)
model.eval()
with torch.no_grad():
    features = model(torch.randn(1, 3, 224, 224))
This sketch follows VISSL's inference tutorial: build the model from the config, restore consolidated checkpoint weights, and forward images through the trunk to obtain features.
Competitor Comparisons
Tensors and Dynamic neural networks in Python with strong GPU acceleration
Pros of PyTorch
- PyTorch is a widely-used and well-established deep learning framework, with a large and active community.
- PyTorch provides a flexible and intuitive API, making it easy for developers to build and experiment with complex models.
- PyTorch has extensive documentation and a wealth of pre-built models and utilities, which can save developers a significant amount of time.
Cons of PyTorch
- PyTorch can be more resource-intensive than some other deep learning frameworks, particularly for large-scale deployments.
- PyTorch's dynamic computational graph can make it more difficult to optimize for production environments, where a static graph may be preferred.
Code Comparison
PyTorch:
import torch
import torch.nn as nn
class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(100, 50)
        self.fc2 = nn.Linear(50, 10)
    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))
VISSL:
from vissl.config.attr_dict import AttrDict
from vissl.models import build_model
# Sketch: VISSL builds models from config sub-trees; full configs come from
# YAML and carry many more fields than shown here
config = AttrDict({
    "MODEL": {
        "TRUNK": {"NAME": "resnet50"},
    },
    "OPTIMIZER": {},
})
model = build_model(config.MODEL, config.OPTIMIZER)
CLIP (Contrastive Language-Image Pretraining), Predict the most relevant text snippet given an image
Pros of CLIP
- CLIP is a state-of-the-art multimodal model that can perform a wide range of zero-shot tasks, making it highly versatile.
- The model is pre-trained on a large and diverse dataset, which gives it strong performance on a variety of tasks.
- CLIP is open-sourced and available for use by the research community, which can lead to further advancements and applications.
Cons of CLIP
- CLIP is a large and complex model, which can make it computationally expensive to use, especially on resource-constrained devices.
- The model's performance can be sensitive to the specific task and dataset, and may not always outperform specialized models.
- The training process for CLIP is not as well-documented as some other models, which can make it more challenging to understand and extend.
Code Comparison
VISSL:
from vissl.utils.hydra_config import compose_hydra_configuration, convert_to_attrdict
cfg = compose_hydra_configuration(["config=pretrain/simclr/simclr_8node_resnet"])
args, config = convert_to_attrdict(cfg)
CLIP:
import torch
import clip
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
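To ground the zero-shot claim above, a brief sketch using CLIP's published API; the image path and candidate captions are illustrative:
from PIL import Image
image = preprocess(Image.open("cat.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["a photo of a cat", "a photo of a dog"]).to(device)
with torch.no_grad():
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1)  # relevance of each caption to the image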
Composable transformations of Python+NumPy programs: differentiate, vectorize, JIT to GPU/TPU, and more
Pros of JAX
- More flexible and general-purpose ML framework
- Better performance on TPUs and multi-GPU setups
- Simpler API with functional programming paradigm
Cons of JAX
- Less focused on computer vision tasks specifically
- Fewer pre-built models and datasets for vision
- Steeper learning curve for those new to functional programming
Code Comparison
VISSL (PyTorch-based):
from vissl.models import build_model
from vissl.data import build_dataset
# Assumes `cfg` is a composed VISSL config (see earlier examples)
model = build_model(cfg.MODEL, cfg.OPTIMIZER)
dataset = build_dataset(cfg, split="TRAIN")  # split name illustrative
JAX:
import jax
import jax.numpy as jnp
# A minimal linear model in JAX's functional style
def model(params, inputs):
    return inputs @ params["w"] + params["b"]
params = {"w": jnp.zeros((100, 10)), "b": jnp.zeros(10)}
loss = lambda p, x, y: jnp.mean((model(p, x) - y) ** 2)
grad_fn = jax.jit(jax.grad(loss))  # composable transforms: autodiff + JIT
grads = grad_fn(params, jnp.ones((4, 100)), jnp.zeros((4, 10)))
# Data loading is left to the user (e.g., a custom iterator)
VISSL is more specialized for computer vision tasks, providing ready-to-use components for self-supervised learning. JAX offers a more flexible approach, allowing for custom implementations of models and datasets using its functional programming paradigm.
VISSL includes more vision-specific features out-of-the-box, while JAX requires more manual implementation but offers greater flexibility across various ML domains.
🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
Pros of Transformers
- Transformers provides a wide range of pre-trained models for various NLP tasks, making it easy to fine-tune and use in your own projects.
- The library has extensive documentation and a large community, providing ample support and resources for users.
- Transformers integrates well with popular deep learning frameworks like PyTorch and TensorFlow, allowing for seamless integration into your existing workflows.
Cons of Transformers
- Transformers is primarily focused on NLP tasks, while VISSL covers a broad range of self-supervised computer vision applications.
- The Transformers library can be more complex to set up and configure, especially for users new to the field of NLP.
Code Comparison
Transformers (Hugging Face):
from transformers import pipeline
# Load a pre-trained model for sentiment analysis
sentiment_analyzer = pipeline('sentiment-analysis')
# Classify the sentiment of a given text
result = sentiment_analyzer('This movie was amazing!')
print(result)
VISSL (Facebook Research):
import torch
from vissl.config.attr_dict import AttrDict
from vissl.models import build_model
# Sketch: build a trunk-only model from a minimal config
# (a real VISSL config carries many more fields; values here are illustrative)
cfg = AttrDict({"MODEL": {"TRUNK": {"NAME": "resnet50"}}, "OPTIMIZER": {}})
model = build_model(cfg.MODEL, cfg.OPTIMIZER)
# Forward pass through the model (VISSL models return a list of outputs)
input_image = torch.randn(1, 3, 224, 224)
output = model(input_image)
print(output[0].shape)
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
Pros of DeepSpeed
- DeepSpeed provides efficient memory management and gradient accumulation, allowing for training of larger models with limited GPU memory.
- DeepSpeed supports mixed precision training, which can significantly improve training speed and reduce memory usage.
- DeepSpeed offers advanced features like the Zero Redundancy Optimizer (ZeRO), which partitions optimizer state, gradients, and parameters across devices to further reduce memory use.
Cons of DeepSpeed
- DeepSpeed may have a steeper learning curve compared to VISSL, as it requires more configuration and setup.
- The documentation for DeepSpeed, while comprehensive, may not be as user-friendly as the VISSL documentation.
- DeepSpeed is primarily focused on training large language models, while VISSL is more geared towards computer vision tasks.
Code Comparison
VISSL (PyTorch-based):
from vissl.utils.hydra_config import compose_hydra_configuration, convert_to_attrdict
from vissl.models import build_model
cfg = compose_hydra_configuration(["config=pretrain/swav/swav_8node_resnet"])
args, config = convert_to_attrdict(cfg)
model = build_model(config.MODEL, config.OPTIMIZER)
DeepSpeed (PyTorch-based):
import deepspeed
# deepspeed.initialize wraps an existing torch model and optimizer
# according to a JSON/dict config (see the sketch below)
model = ...
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config)
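The ds_config passed above is a plain dict (or JSON file); a minimal sketch enabling the mixed precision and ZeRO features mentioned in the pros, with illustrative values:
ds_config = {
    "train_batch_size": 32,             # global batch size (illustrative)
    "fp16": {"enabled": True},          # mixed precision training
    "zero_optimization": {"stage": 3},  # ZeRO stage 3 partitioning
}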
PyTorch package for the discrete VAE used for DALL·E.
Pros of DALL-E
- DALL-E is a state-of-the-art text-to-image generation model, capable of producing highly realistic and creative images from textual descriptions.
- The model has been trained on a vast dataset, allowing it to generate a wide variety of images across different domains.
- DALL-E has demonstrated impressive generative capabilities, such as combining unrelated concepts and performing zero-shot image-to-image transformations.
Cons of DALL-E
- Only the discrete VAE component of DALL·E has been open-sourced; the full text-to-image model and its training data are not publicly available, limiting the ability to reproduce or extend the research.
- The model's training process and architectural details are not fully transparent, making it difficult to understand the inner workings and potential biases.
- DALL-E's deployment and usage are subject to OpenAI's policies and restrictions, which may limit its accessibility and flexibility for certain applications.
Code Comparison
VISSL (PyTorch):
# Sketch: build a model from a (much-abbreviated) config and restore
# consolidated checkpoint weights; paths and config values are illustrative
from vissl.config.attr_dict import AttrDict
from vissl.models import build_model
from classy_vision.generic.util import load_checkpoint
from vissl.utils.checkpoint import init_model_from_consolidated_weights
cfg = AttrDict({"MODEL": {"TRUNK": {"NAME": "resnet50"}}, "OPTIMIZER": {}})
model = build_model(cfg.MODEL, cfg.OPTIMIZER)
weights = load_checkpoint(checkpoint_path="path/to/checkpoint.torch")
init_model_from_consolidated_weights(
    config=cfg, model=model, state_dict=weights,
    state_dict_key_name="classy_state_dict", skip_layers=[])
DALL-E (discrete VAE only):
# Only the dVAE used by DALL·E is open-sourced (openai/DALL-E);
# the full text-to-image model remains unreleased
import torch
from dall_e import load_model
dev = torch.device("cpu")
enc = load_model("https://cdn.openai.com/dall-e/encoder.pkl", dev)
dec = load_model("https://cdn.openai.com/dall-e/decoder.pkl", dev)
README
What's New
Below we share, in reverse chronological order, the updates and new releases in VISSL. All VISSL releases are available here.
- [Feb 2022]: Releasing SEER 10B parameters model implementation and model weights.
- [Feb 2022]: Releasing implementation of Fairness Benchmarks for computer vision models proposed in the paper.
- [Jan 2022]: Implementation for Geolocalization test (gps prediction for an image) released in VISSL.
- [Jan 2022]: Added BEiT transformer implementation and ClassyVision ViT.
- [Nov 2021]: VISSL release 0.1.6. Please see our release notes for more information.
- [Oct 2021]: AugLy data augmentations support introduced in this commit.
- [Oct 2021]: XCiT: Cross-Covariance Image Transformers code released in this commit.
- [Sept 2021]: VISSL master branch renamed to main in this PR.
- [August 2021]: Instance Retrieval benchmark implemented and available in VISSL.
- [July 2021]: Fully Sharded Data Parallel integrated in VISSL and announced in blog.
- [May 2021]: DINO: Emerging Properties in Self-Supervised Vision Transformers code released.
- [May 2021]: VISSL relicensed under MIT License.
- [May 2021]: Barlow Twins: Self-Supervised Learning via Redundancy Reduction code released.
- [April 2021]: ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases code released.
- [March 2021]: Added most benchmark datasets used in VTAB and CLIP benchmark tasks.
- [February 2021]: Added Vision Transformers (ViT) backbone and training self-supervision with ViT.
- [January 2021]: VISSL v0.1.5 released.
Introduction
VISSL is a computer VIsion library for state-of-the-art Self-Supervised Learning research with PyTorch. VISSL aims to accelerate the research cycle in self-supervised learning: from designing a new self-supervised task to evaluating the learned representations. Key features include:
- Reproducible implementation of SOTA in Self-Supervision: All existing SOTA in Self-Supervision are implemented - SwAV, SimCLR, MoCo(v2), PIRL, NPID, NPID++, DeepClusterV2, ClusterFit, RotNet, Jigsaw. Also supports supervised trainings.
- Benchmark suite: Variety of benchmark tasks including linear image classification (places205, imagenet1k, voc07, food, CLEVR, dsprites, UCF101, stanford cars and many more), full finetuning, semi-supervised benchmark, nearest neighbor benchmark, object detection (Pascal VOC and COCO).
- Ease of Usability: Easy to use via a Hydra-based yaml configuration system (see the sketch after this list).
- Modular: Easy to design new tasks and reuse the existing components from other tasks (objective functions, model trunk and heads, data transforms, etc.). The modular components are simple drop-in replacements in yaml config files.
- Scalability: Easy to train models on 1 GPU, multi-GPU and multi-node. Several components for large-scale training are provided as simple config-file plugs: activation checkpointing, ZeRO, FP16, LARC, stateful data sampler, a data class to handle invalid images, large model backbones like RegNets, etc.
- Model Zoo: Over 60 pre-trained self-supervised model weights.
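As a taste of this config-driven design, here is a minimal sketch that composes a shipped SwAV recipe and flips large-scale features purely through Hydra overrides; the config path and keys follow VISSL's config schema but should be checked against the YAMLs shipped under configs/config/:
from vissl.utils.hydra_config import compose_hydra_configuration, convert_to_attrdict
cfg = compose_hydra_configuration([
    "config=pretrain/swav/swav_8node_resnet",
    "config.DISTRIBUTED.NUM_NODES=1",
    "config.MODEL.AMP_PARAMS.USE_AMP=true",  # FP16 as a config plug
    "config.MODEL.ACTIVATION_CHECKPOINTING.USE_ACTIVATION_CHECKPOINTING=true",
])
args, config = convert_to_attrdict(cfg)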
Installation
See INSTALL.md.
Getting Started
Install VISSL by following the installation instructions. After installation, please see Getting Started with VISSL and the Colab Notebook to learn about basic usage.
Documentation
Learn more about VISSL at our documentation. And see the projects/ for some projects built on top of VISSL.
Tutorials
Get started with VISSL by trying one of the Colab tutorial notebooks.
- Train SimCLR on 1-gpu
- Extracting Features from a pretrained model
- Benchmark task: Full finetuning on ImageNet-1K
- Benchmark task: Linear image classification on ImageNet-1K
- Large scale training (fp16, LARC, ZeRO)
- Using a pre-trained model in inference mode
Model Zoo and Baselines
We provide a large set of baseline results and trained models available for download in the VISSL Model Zoo.
Contributors
VISSL is written and maintained by Facebook AI Research.
Development
We welcome new contributions to VISSL and we will be actively maintaining this library! Please refer to CONTRIBUTING.md for full instructions on how to run the code, tests and linter, and submit your pull requests.
License
VISSL is released under MIT license.
Citing VISSL
If you find VISSL useful in your research or wish to refer to the baseline results published in the Model Zoo, please use the following BibTeX entry.
@misc{goyal2021vissl,
author = {Priya Goyal and Quentin Duval and Jeremy Reizenstein and Matthew Leavitt and Min Xu and
Benjamin Lefaudeux and Mannat Singh and Vinicius Reis and Mathilde Caron and Piotr Bojanowski and
Armand Joulin and Ishan Misra},
title = {VISSL},
howpublished = {\url{https://github.com/facebookresearch/vissl}},
year = {2021}
}