salesforce / LAVIS

LAVIS - A One-stop Library for Language-Vision Intelligence

Quick Overview

LAVIS (Language-Vision Intelligence Suite) is an open-source deep learning library for language-vision research and applications. It provides a comprehensive set of tools and pre-trained models for various vision-language tasks, including image captioning, visual question answering, and image-text retrieval.

Pros

  • Comprehensive collection of vision-language models and datasets
  • Easy-to-use API for both training and inference
  • Supports multiple vision-language tasks in a unified framework
  • Provides pre-trained models for quick deployment and fine-tuning

Cons

  • Steep learning curve for beginners in vision-language tasks
  • Limited documentation for some advanced features
  • Requires significant computational resources for training large models
  • Dependency on specific versions of PyTorch and other libraries

Code Examples

  1. Loading a pre-trained model for image captioning:
from lavis.models import load_model_and_preprocess

model, vis_processors, _ = load_model_and_preprocess(
    name="blip_caption", model_type="base_coco", is_eval=True, device="cuda"
)
  2. Performing image captioning:
from PIL import Image

raw_image = Image.open("path/to/image.jpg").convert("RGB")
image = vis_processors["eval"](raw_image).unsqueeze(0).to("cuda")
caption = model.generate({"image": image})
print(caption[0])
  3. Visual question answering:
model, vis_processors, txt_processors = load_model_and_preprocess(
    name="blip_vqa", model_type="vqav2", is_eval=True, device="cuda"
)

image = vis_processors["eval"](raw_image).unsqueeze(0).to("cuda")
question = "What is the color of the car?"
question = txt_processors["eval"](question)

answer = model.predict_answers(
    samples={"image": image, "text_input": question},
    inference_method="generate",
)[0]
print(answer)

Getting Started

  1. Install LAVIS:
pip install salesforce-lavis
  2. Import and use LAVIS in your Python script:
from lavis.models import load_model_and_preprocess

# Load a pre-trained model
model, vis_processors, txt_processors = load_model_and_preprocess(
    name="blip_caption", model_type="base_coco", is_eval=True, device="cuda"
)

# Use the model for inference
# (See code examples above for specific tasks)
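
The snippets above hard-code device="cuda", which fails on a machine without a GPU. A minimal sketch of a fallback follows. It assumes only that PyTorch may or may not be installed; `pick_device` is a hypothetical helper name and not part of LAVIS.

```python
import importlib.util


def pick_device() -> str:
    """Return "cuda" when PyTorch is installed and can see a GPU, else "cpu"."""
    # If PyTorch is missing entirely, fall back to the CPU label.
    if importlib.util.find_spec("torch") is None:
        return "cpu"
    import torch

    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(device)
```

The returned string can then be passed as the device argument of load_model_and_preprocess() and to .to(device) on the image tensor, in place of the literal "cuda".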

Competitor Comparisons

26,479

CLIP (Contrastive Language-Image Pretraining), Predict the most relevant text snippet given an image

Pros of CLIP

  • Pioneered the concept of contrastive language-image pre-training
  • Highly versatile for various vision-language tasks without fine-tuning
  • Robust zero-shot capabilities for image classification

Cons of CLIP

  • Limited to image-text matching and classification tasks
  • Less flexible for complex vision-language tasks like visual question answering
  • Requires significant computational resources for training and inference

Code Comparison

CLIP usage:

import torch
from PIL import Image
import clip

model, preprocess = clip.load("ViT-B/32", device="cuda")
image = preprocess(Image.open("image.jpg")).unsqueeze(0).to("cuda")
text = clip.tokenize(["a dog", "a cat"]).to("cuda")

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

LAVIS usage:

from lavis.models import load_model_and_preprocess

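# "vqav2" is a blip_vqa type listed in the model zoo; "base" is not
model, vis_processors, txt_processors = load_model_and_preprocess("blip_vqa", "vqav2", is_eval=True)
image = vis_processors["eval"](raw_image).unsqueeze(0)  # raw_image: an RGB PIL image
question = txt_processors["eval"](question)  # question: a plain string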

answer = model.predict_answers(samples={"image": image, "text_input": question}, inference_method="generate")
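
The CLIP snippet above stops at the raw features. Zero-shot classification then comes down to scoring each text prompt against the image and normalizing the scores. The following is a minimal pure-Python sketch of that last step only. The scores are made-up numbers rather than CLIP output, and `softmax` is an illustrative helper.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Convert raw similarity scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


labels = ["a dog", "a cat"]
scores = [24.5, 19.0]  # hypothetical image-text similarity logits
probs = softmax(scores)
best = labels[probs.index(max(probs))]
print(best)
```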

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.

Pros of transformers

  • Broader scope, covering a wide range of NLP tasks and models
  • Larger community and more frequent updates
  • Extensive documentation and tutorials

Cons of transformers

  • Steeper learning curve due to its extensive features
  • Can be overwhelming for users focused solely on vision-language tasks
  • Larger package size and potentially higher resource requirements

Code comparison

LAVIS:

from lavis.models import load_model_and_preprocess

model, vis_processors, txt_processors = load_model_and_preprocess("blip_caption", "large_coco")
image = vis_processors["eval"](raw_image).unsqueeze(0)
caption = model.generate({"image": image})

transformers:

from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large")
inputs = processor(images=image, return_tensors="pt")
output = model.generate(**inputs)
caption = processor.decode(output[0], skip_special_tokens=True)

Both repositories offer powerful tools for working with vision-language models. LAVIS focuses specifically on vision-language tasks, providing a more streamlined experience for these applications. transformers, on the other hand, offers a broader range of NLP capabilities but may require more setup for vision-language tasks.

30,331

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

Pros of fairseq

  • More established and mature project with a larger community
  • Supports a wider range of sequence-to-sequence tasks beyond vision-language models
  • Offers more extensive documentation and examples

Cons of fairseq

  • Less focused on vision-language tasks compared to LAVIS
  • May have a steeper learning curve for newcomers to the field
  • Potentially more complex setup and configuration process

Code Comparison

LAVIS example:

from PIL import Image
from lavis.models import load_model_and_preprocess

model, vis_processors, txt_processors = load_model_and_preprocess("blip_caption", "large_coco")
raw_image = Image.open("path/to/image.jpg").convert("RGB")
image = vis_processors["eval"](raw_image).unsqueeze(0)
caption = model.generate({"image": image})

fairseq example:

from fairseq.models.roberta import RobertaModel

roberta = RobertaModel.from_pretrained('path/to/roberta.large', checkpoint_file='model.pt')
tokens = roberta.encode('Hello world!')
features = roberta.extract_features(tokens)
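# predict() needs a registered classification head (for example 'mnli' on the
# roberta.large.mnli checkpoint), so it is omitted for the plain checkpoint here.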

Both repositories provide powerful tools for natural language processing and vision-language tasks. LAVIS is more specialized in vision-language models, while fairseq offers a broader range of sequence-to-sequence capabilities. The choice between them depends on the specific requirements of your project and your familiarity with the respective ecosystems.

19,863

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities

Pros of UniLM

  • Broader scope: Covers a wider range of tasks including natural language understanding, generation, and vision-language tasks
  • More extensive pre-training: Utilizes larger datasets and more diverse pre-training objectives
  • Active development: Frequent updates and new model releases

Cons of UniLM

  • Higher computational requirements: Generally requires more resources for training and inference
  • Steeper learning curve: May be more complex to implement and fine-tune for specific tasks
  • Less focused on vision-language tasks: LAVIS offers more specialized models for multimodal applications

Code Comparison

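UniLM example (illustrative pseudocode only; the UniLM repository is a collection of per-project codebases, and a single `unilm` package exposing these classes is an assumption of this comparison):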

from unilm import UniLMTokenizer, UniLMForConditionalGeneration

tokenizer = UniLMTokenizer.from_pretrained("microsoft/unilm-base-cased")
model = UniLMForConditionalGeneration.from_pretrained("microsoft/unilm-base-cased")

LAVIS example:

from lavis.models import load_model_and_preprocess

model, vis_processors, txt_processors = load_model_and_preprocess(
    name="blip_caption", model_type="base_coco"
)

Both repositories offer powerful language models, but UniLM provides a more general-purpose toolkit, while LAVIS focuses on vision-language tasks with specialized models and easier integration for multimodal applications.

Pros of Vision Transformer

  • Focused specifically on vision transformers, providing a deep dive into this architecture
  • Includes implementations of various ViT variants and improvements
  • Backed by Google Research, potentially offering cutting-edge advancements

Cons of Vision Transformer

  • Limited to vision tasks, lacking multi-modal capabilities
  • Less comprehensive library of pre-trained models compared to LAVIS
  • May require more domain expertise to use effectively

Code Comparison

LAVIS example:

from lavis.models import load_model_and_preprocess

model, vis_processors, _ = load_model_and_preprocess("blip_caption", "large_coco")
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
caption = model.generate({"image": image})

Vision Transformer example:

import tensorflow as tf
from vit_keras import vit

model = vit.vit_b16(
    image_size=224,
    activation='softmax',
    pretrained=True,
    include_top=True,
    pretrained_top=True
)

5,489

A modular framework for vision & language multimodal research from Facebook AI Research (FAIR)

Pros of MMF

  • More extensive dataset support, including VQA, Visual Dialog, and TextVQA
  • Modular architecture allowing easier integration of new models and tasks
  • Stronger focus on multi-modal pretraining and transfer learning

Cons of MMF

  • Less user-friendly documentation compared to LAVIS
  • Steeper learning curve for beginners
  • Fewer pre-trained models available out-of-the-box

Code Comparison

MMF:

from mmf.models.mmbt import MMBT
from mmf.common.sample import Sample
from mmf.utils.build import build_model

config = {"model": "mmbt", "model_config": {}}
model = build_model(config)

LAVIS:

from lavis.models import load_model_and_preprocess

model, vis_processors, txt_processors = load_model_and_preprocess("blip_caption", "large_coco")

MMF offers a more modular approach, allowing for detailed configuration, while LAVIS provides a simpler, more streamlined API for loading pre-trained models and processors.

README



LAVIS - A Library for Language-Vision Intelligence

What's New: 🎉

A simple, yet effective, cross-modality framework built atop frozen LLMs that allows the integration of various modalities (image, video, audio, 3D) without extensive modality-specific customization.

A text-to-image generation model that trains 20x faster than DreamBooth. It also facilitates zero-shot subject-driven generation and editing.

  • [Model Release] May 2023, released implementation of InstructBLIP
    Paper, Project Page

A new vision-language instruction-tuning framework using BLIP-2 models, achieving state-of-the-art zero-shot generalization performance on a wide range of vision-language tasks.

A generic and efficient pre-training strategy that easily harvests development of pretrained vision models and large language models (LLMs) for vision-language pretraining. BLIP-2 beats Flamingo on zero-shot VQAv2 (65.0 vs 56.3), establishing new state-of-the-art on zero-shot captioning (on NoCaps 121.6 CIDEr score vs previous best 113.2). In addition, equipped with powerful LLMs (e.g. OPT, FlanT5), BLIP-2 also unlocks the new zero-shot instructed vision-to-language generation capabilities for various interesting applications!

  • Jan 2023, LAVIS is now available on PyPI for installation!
  • [Model Release] Dec 2022, released implementation of Img2LLM-VQA (CVPR 2023, "From Images to Textual Prompts: Zero-shot VQA with Frozen Large Language Models", by Jiaxian Guo et al)
    Paper, Project Page, Open In Colab

A plug-and-play module that enables off-the-shelf use of Large Language Models (LLMs) for visual question answering (VQA). Img2LLM-VQA surpasses Flamingo on zero-shot VQA on VQAv2 (61.9 vs 56.3), while in contrast requiring no end-to-end training!

  • [Model Release] Oct 2022, released implementation of PNP-VQA (EMNLP Findings 2022, "Plug-and-Play VQA: Zero-shot VQA by Conjoining Large Pretrained Models with Zero Training", by Anthony T.M.H. et al)
    Paper, Project Page, Open In Colab

A modular zero-shot VQA framework that requires no training of pre-trained language models (PLMs), achieving SoTA zero-shot VQA performance.

Technical Report and Citing LAVIS

You can find more details in our technical report.

If you're using LAVIS in your research or applications, please cite it using this BibTeX:

@inproceedings{li-etal-2023-lavis,
    title = "{LAVIS}: A One-stop Library for Language-Vision Intelligence",
    author = "Li, Dongxu  and
      Li, Junnan  and
      Le, Hung  and
      Wang, Guangsen  and
      Savarese, Silvio  and
      Hoi, Steven C.H.",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.acl-demo.3",
    pages = "31--41",
    abstract = "We introduce LAVIS, an open-source deep learning library for LAnguage-VISion research and applications. LAVIS aims to serve as a one-stop comprehensive library that brings recent advancements in the language-vision field accessible for researchers and practitioners, as well as fertilizing future research and development. It features a unified interface to easily access state-of-the-art image-language, video-language models and common datasets. LAVIS supports training, evaluation and benchmarking on a rich variety of tasks, including multimodal classification, retrieval, captioning, visual question answering, dialogue and pre-training. In the meantime, the library is also highly extensible and configurable, facilitating future development and customization. In this technical report, we describe design principles, key components and functionalities of the library, and also present benchmarking results across common language-vision tasks.",
}

Introduction

LAVIS is a Python deep learning library for LAnguage-and-VISion intelligence research and applications. This library aims to provide engineers and researchers with a one-stop solution to rapidly develop models for their specific multimodal scenarios, and benchmark them across standard and customized datasets. It features a unified interface design to access:

  • 10+ tasks (retrieval, captioning, visual question answering, multimodal classification, etc.);
  • 20+ datasets (COCO, Flickr, NoCaps, Conceptual Captions, SBU, etc.);
  • 30+ pretrained weights of state-of-the-art foundation language-vision models and their task-specific adaptations, including ALBEF, BLIP, ALPRO, CLIP.



Key features of LAVIS include:

  • Unified and Modular Interface: makes it easy to leverage and repurpose existing modules (datasets, models, preprocessors) and to add new ones.

  • Easy Off-the-shelf Inference and Feature Extraction: readily available pre-trained models let you take advantage of state-of-the-art multimodal understanding and generation capabilities on your own data.

  • Reproducible Model Zoo and Training Recipes: easily replicate and extend state-of-the-art models on existing and new tasks.

  • Dataset Zoo and Automatic Downloading Tools: it can be a hassle to prepare the many language-vision datasets. LAVIS provides automatic downloading scripts to help prepare a large variety of datasets and their annotations.

The following table shows the supported tasks, datasets and models in our library. This is a continuing effort and we are working on further growing the list.

| Tasks | Supported Models | Supported Datasets |
| --- | --- | --- |
| Image-text Pre-training | ALBEF, BLIP | COCO, VisualGenome, SBU, ConceptualCaptions |
| Image-text Retrieval | ALBEF, BLIP, CLIP | COCO, Flickr30k |
| Text-image Retrieval | ALBEF, BLIP, CLIP | COCO, Flickr30k |
| Visual Question Answering | ALBEF, BLIP | VQAv2, OKVQA, A-OKVQA |
| Image Captioning | BLIP | COCO, NoCaps |
| Image Classification | CLIP | ImageNet |
| Natural Language Visual Reasoning (NLVR) | ALBEF, BLIP | NLVR2 |
| Visual Entailment (VE) | ALBEF | SNLI-VE |
| Visual Dialogue | BLIP | VisDial |
| Video-text Retrieval | BLIP, ALPRO | MSRVTT, DiDeMo |
| Text-video Retrieval | BLIP, ALPRO | MSRVTT, DiDeMo |
| Video Question Answering (VideoQA) | BLIP, ALPRO | MSRVTT, MSVD |
| Video Dialogue | VGD-GPT | AVSD |
| Multimodal Feature Extraction | ALBEF, CLIP, BLIP, ALPRO | customized |
| Text-to-image Generation | [COMING SOON] | |

Installation

  1. (Optional) Create a conda environment:
conda create -n lavis python=3.8
conda activate lavis
  2. Install from PyPI:
pip install salesforce-lavis
  3. Or, for development, build from source:
git clone https://github.com/salesforce/LAVIS.git
cd LAVIS
pip install -e .
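
After either install route, it can be useful to confirm which version ended up in the environment. Below is a minimal sketch using only the Python standard library. `installed_version` is a hypothetical helper name, and the distribution name is the one from the pip command above.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string when the package is installed, otherwise None.
print(installed_version("salesforce-lavis"))
```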

Getting Started

Model Zoo

The model zoo summarizes the models supported in LAVIS. To view it:

from lavis.models import model_zoo
print(model_zoo)
# ==================================================
# Architectures                  Types
# ==================================================
# albef_classification           ve
# albef_feature_extractor        base
# albef_nlvr                     nlvr
# albef_pretrain                 base
# albef_retrieval                coco, flickr
# albef_vqa                      vqav2
# alpro_qa                       msrvtt, msvd
# alpro_retrieval                msrvtt, didemo
# blip_caption                   base_coco, large_coco
# blip_classification            base
# blip_feature_extractor         base
# blip_nlvr                      nlvr
# blip_pretrain                  base
# blip_retrieval                 coco, flickr
# blip_vqa                       vqav2, okvqa, aokvqa
# clip_feature_extractor         ViT-B-32, ViT-B-16, ViT-L-14, ViT-L-14-336, RN50
# clip                           ViT-B-32, ViT-B-16, ViT-L-14, ViT-L-14-336, RN50
# gpt_dialogue                   base

Let’s see how to use models in LAVIS to perform inference on example data. We first load a sample image from local disk.

import torch
from PIL import Image
# setup device to use
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# load sample image
raw_image = Image.open("docs/_static/merlion.png").convert("RGB")

This example image shows Merlion Park, a landmark in Singapore.

Image Captioning

In this example, we use the BLIP model to generate a caption for the image. To make inference even easier, we also associate each pre-trained model with its preprocessors (transforms), accessed via load_model_and_preprocess().

import torch
from lavis.models import load_model_and_preprocess
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# loads BLIP caption base model, with finetuned checkpoints on MSCOCO captioning dataset.
# this also loads the associated image processors
model, vis_processors, _ = load_model_and_preprocess(name="blip_caption", model_type="base_coco", is_eval=True, device=device)
# preprocess the image
# vis_processors stores image transforms for "train" and "eval" (validation / testing / inference)
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
# generate caption
model.generate({"image": image})
# ['a large fountain spewing water into the air']
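
The call above returns a single caption. A possible way to get several varied captions is nucleus sampling. The sketch below assumes that the caption model's generate() accepts the keyword arguments use_nucleus_sampling and num_captions; treat both as assumptions to verify against your installed version. `sample_captions` is a hypothetical wrapper.

```python
from typing import Any, List


def sample_captions(model: Any, image: Any, num_captions: int = 3) -> List[str]:
    """Draw several captions via nucleus sampling and drop exact duplicates."""
    captions = model.generate(
        {"image": image},
        use_nucleus_sampling=True,  # assumed keyword of the caption model
        num_captions=num_captions,  # assumed keyword of the caption model
    )
    # dict.fromkeys keeps the first occurrence of each caption, in order.
    return list(dict.fromkeys(captions))


# Usage with the objects created above:
# print(sample_captions(model, image))
```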

Visual question answering (VQA)

The BLIP model can answer free-form questions about images in natural language. To access the VQA model, simply replace the name and model_type arguments passed to load_model_and_preprocess().

from lavis.models import load_model_and_preprocess
model, vis_processors, txt_processors = load_model_and_preprocess(name="blip_vqa", model_type="vqav2", is_eval=True, device=device)
# ask a random question.
question = "Which city is this photo taken?"
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
question = txt_processors["eval"](question)
model.predict_answers(samples={"image": image, "text_input": question}, inference_method="generate")
# ['singapore']
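
To ask several questions about the same image, the call above can be wrapped in a small loop. This is a minimal sketch that reuses only the predict_answers() call shown in the example. `answer_questions` is a hypothetical helper, and asking one question at a time is a simplification rather than the most efficient approach.

```python
from typing import Any, Callable, Dict, List


def answer_questions(
    model: Any,
    image: Any,
    questions: List[str],
    txt_processor: Callable[[str], str],
) -> Dict[str, str]:
    """Ask each question about the same image and map question -> answer."""
    answers: Dict[str, str] = {}
    for raw_question in questions:
        processed = txt_processor(raw_question)
        # predict_answers returns a list with one answer per sample.
        answers[raw_question] = model.predict_answers(
            samples={"image": image, "text_input": processed},
            inference_method="generate",
        )[0]
    return answers


# Usage with the objects created above:
# answer_questions(model, image, ["Which city is this photo taken?"], txt_processors["eval"])
```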

Unified Feature Extraction Interface

LAVIS provides a unified interface to extract features from each architecture. To extract features, we load the feature extractor variants of each model. The multimodal feature can be used for multimodal classification. The low-dimensional unimodal features can be used to compute cross-modal similarity.

from lavis.models import load_model_and_preprocess
model, vis_processors, txt_processors = load_model_and_preprocess(name="blip_feature_extractor", model_type="base", is_eval=True, device=device)
caption = "a large fountain spewing water into the air"
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
text_input = txt_processors["eval"](caption)
sample = {"image": image, "text_input": [text_input]}

features_multimodal = model.extract_features(sample)
print(features_multimodal.multimodal_embeds.shape)
# torch.Size([1, 12, 768]); use features_multimodal.multimodal_embeds[:,0,:] for multimodal classification tasks

features_image = model.extract_features(sample, mode="image")
features_text = model.extract_features(sample, mode="text")
print(features_image.image_embeds.shape)
# torch.Size([1, 197, 768])
print(features_text.text_embeds.shape)
# torch.Size([1, 12, 768])

# low-dimensional projected features
print(features_image.image_embeds_proj.shape)
# torch.Size([1, 197, 256])
print(features_text.text_embeds_proj.shape)
# torch.Size([1, 12, 256])
similarity = features_image.image_embeds_proj[:,0,:] @ features_text.text_embeds_proj[:,0,:].t()
print(similarity)
# tensor([[0.2622]])
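
The similarity line above is a dot product between the first-token projected features of the image and the text. If those projected features are unit-normalized (an assumption worth checking on your model), that dot product equals the cosine similarity. A minimal pure-Python sketch of the computation, with toy vectors standing in for real features:

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of the two vectors after L2 normalization."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    a_unit = l2_normalize(a)
    b_unit = l2_normalize(b)
    return sum(x * y for x, y in zip(a_unit, b_unit))


image_vec = [0.2, 0.9, 0.1]  # toy stand-in for an image feature
text_vec = [0.1, 0.8, 0.3]   # toy stand-in for a text feature
print(round(cosine_similarity(image_vec, text_vec), 4))
```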

Load Datasets

LAVIS supports a wide variety of common language-vision datasets and provides automatic download tools to fetch and organize them. To list the dataset names it knows about:

from lavis.datasets.builders import dataset_zoo
dataset_names = dataset_zoo.get_names()
print(dataset_names)
# ['aok_vqa', 'coco_caption', 'coco_retrieval', 'coco_vqa', 'conceptual_caption_12m',
#  'conceptual_caption_3m', 'didemo_retrieval', 'flickr30k', 'imagenet', 'laion2B_multi',
#  'msrvtt_caption', 'msrvtt_qa', 'msrvtt_retrieval', 'msvd_caption', 'msvd_qa', 'nlvr',
#  'nocaps', 'ok_vqa', 'sbu_caption', 'snli_ve', 'vatex_caption', 'vg_caption', 'vg_vqa']

After downloading the images, we can use load_dataset() to obtain the dataset.

from lavis.datasets.builders import load_dataset
coco_dataset = load_dataset("coco_caption")
print(coco_dataset.keys())
# dict_keys(['train', 'val', 'test'])
print(len(coco_dataset["train"]))
# 566747
print(coco_dataset["train"][0])
# {'image': <PIL.Image.Image image mode=RGB size=640x480>,
#  'text_input': 'A woman wearing a net on her head cutting a cake. ',
#  'image_id': 0}

If you already host a local copy of the dataset, you can pass in the vis_path argument to change the default location to load images.

coco_dataset = load_dataset("coco_caption", vis_path=YOUR_LOCAL_PATH)
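
Each sample shown above is a dictionary with image, text_input and image_id keys. A common next step is collecting all captions that belong to the same image. The sketch below does this over plain dictionaries of that shape. The toy records stand in for real dataset entries, and `captions_by_image` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List


def captions_by_image(samples: Iterable[Dict[str, Any]]) -> Dict[Any, List[str]]:
    """Group caption strings by image_id, trimming surrounding whitespace."""
    grouped: Dict[Any, List[str]] = defaultdict(list)
    for sample in samples:
        grouped[sample["image_id"]].append(sample["text_input"].strip())
    return dict(grouped)


# Toy records with the same keys as the printed sample (images omitted).
toy_samples = [
    {"image_id": 0, "text_input": "A woman wearing a net on her head cutting a cake. "},
    {"image_id": 0, "text_input": "A woman cutting a large white sheet cake."},
    {"image_id": 1, "text_input": "A child holding a flowered umbrella."},
]
print(captions_by_image(toy_samples))
```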

Jupyter Notebook Examples

See examples for more inference examples, e.g. captioning, feature extraction, VQA, GradCam, zero-shot classification.

Resources and Tools

  • Benchmarks: see Benchmark for instructions to evaluate and train supported models.
  • Dataset Download and Browsing: see Dataset Download for instructions and automatic tools for downloading common language-vision datasets.
  • GUI Demo: to run the demo locally, run bash run_scripts/run_demo.sh and then follow the instructions in the prompt to view it in a browser. A web demo is coming soon.

Documentation

For more details and advanced usage, please refer to the documentation.

Ethical and Responsible Use

We note that models in LAVIS provide no guarantees on their multimodal abilities; incorrect or biased predictions may be observed. In particular, the datasets and pretrained models utilized in LAVIS may contain socioeconomic biases which could result in misclassification and other unwanted behaviors such as offensive or inappropriate speech. We strongly recommend that users review the pre-trained models and overall system in LAVIS before practical adoption. We plan to improve the library by investigating and mitigating these potential biases and inappropriate behaviors in the future.

Contact us

If you have any questions, comments or suggestions, please do not hesitate to contact us at lavis@salesforce.com.

License

BSD 3-Clause License