BLIP
PyTorch code for BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
Top Related Projects
CLIP (Contrastive Language-Image Pretraining), Predict the most relevant text snippet given an image
ImageBind One Embedding Space to Bind Them All
An open source implementation of CLIP.
Quick Overview
BLIP (Bootstrapping Language-Image Pre-training) is a Salesforce-developed vision-language model that unifies understanding and generation tasks such as image captioning, visual question answering, image-text retrieval, and image-text matching. It bootstraps its training data by generating synthetic captions for web images and filtering noisy captions (CapFilt), and ships pre-trained and finetuned checkpoints for a range of downstream tasks.
Pros
- Versatile: BLIP handles a variety of vision-language tasks, including image captioning, visual question answering, image-text retrieval, and image-text matching.
- Efficient: The ViT-B variants are comparatively compact, and gradient checkpointing can be enabled in the configs to reduce GPU memory usage during finetuning.
- Pretrained Weights: BLIP comes with pretrained weights, allowing users to fine-tune the model for their specific use cases without the need for extensive training.
- Open-Source: BLIP is an open-source project, allowing developers to contribute to the codebase and customize the model as needed.
Cons
- Limited Documentation: The project's documentation could be more comprehensive, making it challenging for new users to get started.
- Narrow Scope: BLIP targets image-level vision-language tasks (plus zero-shot video-text retrieval) and does not cover additional modalities such as audio, depth, or thermal data.
- Potential Performance Limitations: The ViT-B variants may trail larger or more recent vision-language models on some benchmarks.
- Dependency on Specific Libraries: BLIP relies on specific libraries, such as PyTorch and Transformers, which may require additional setup and configuration.
Code Examples
Here are a few code examples demonstrating how to use BLIP for different vision-language tasks, using the Hugging Face Transformers integration of BLIP (the repository's own models.blip API is shown further down this page):
Image Captioning
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration
# Load the pre-trained BLIP captioning model
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
# Generate a caption for an image
image = Image.open('example_image.jpg').convert('RGB')
inputs = processor(image, return_tensors="pt")
out = model.generate(**inputs)
print(f'Caption: {processor.decode(out[0], skip_special_tokens=True)}')
Visual Question Answering
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering
# Load the pre-trained BLIP VQA model
processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")
# Ask a question about an image
image = Image.open('example_image.jpg').convert('RGB')
inputs = processor(image, 'how many dogs are in the picture?', return_tensors="pt")
out = model.generate(**inputs)
print(f'Answer: {processor.decode(out[0], skip_special_tokens=True)}')
Image-Text Matching
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForImageTextRetrieval
# Load the pre-trained BLIP image-text matching model
processor = BlipProcessor.from_pretrained("Salesforce/blip-itm-base-coco")
model = BlipForImageTextRetrieval.from_pretrained("Salesforce/blip-itm-base-coco")
# Score how well a caption matches an image
image = Image.open('example_image.jpg').convert('RGB')
inputs = processor(image, 'a dog playing in the park', return_tensors="pt")
itm_logits = model(**inputs)[0]  # logits over (no match, match)
match_prob = torch.softmax(itm_logits, dim=1)[:, 1]
print(f'Match probability: {match_prob.item():.3f}')
Getting Started
To get started with BLIP, follow these steps:
- Clone the BLIP repository:
git clone https://github.com/salesforce/BLIP.git
- Navigate to the project directory:
cd BLIP
- Create a virtual environment and activate it:
python -m venv venv
source venv/bin/activate
- Install the required dependencies:
pip install -r requirements.txt
- Explore the available models and tasks:
from models.blip import blip_decoder
from models.blip_vqa import blip_vqa
from models.blip_itm import blip_itm
# Load the pre-trained models (the 'pretrained' argument accepts a checkpoint path or URL)
captioner = blip_decoder(pretrained='model_base', image_size=384, vit='base')
vqa_model = blip_vqa(pretrained='model_base', image_size=480, vit='base')
itm_model = blip_itm(pretrained='model_base', image_size=384, vit='base')
- Fine-tune the models for your specific use case, if needed.
- Integrate the BLIP models into your application and start using them for vision-language tasks such as captioning, visual question answering, and image-text retrieval.
Competitor Comparisons
CLIP (Contrastive Language-Image Pretraining), Predict the most relevant text snippet given an image
Pros of CLIP
- Broader applicability across various vision-language tasks
- Larger pre-training dataset (400M image-text pairs)
- Zero-shot capabilities for image classification
Cons of CLIP
- Less specialized for image captioning and visual question answering
- May require more computational resources for fine-tuning
- Limited support for text generation tasks
Code Comparison
CLIP usage:
import torch
from PIL import Image
import clip
model, preprocess = clip.load("ViT-B/32", device="cuda")
image = preprocess(Image.open("image.jpg")).unsqueeze(0).to("cuda")
text = clip.tokenize(["a dog", "a cat"]).to("cuda")
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
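The snippet above stops at the raw embeddings; the zero-shot classification mentioned in the pros list can be completed by comparing the image against each prompt and taking a softmax (a small continuation of the snippet, following the usage shown in the CLIP README):
# Continues the CLIP snippet above: score the image against each text prompt.
with torch.no_grad():
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1)
print(probs)  # higher probability for the better-matching prompt ("a dog" vs. "a cat")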
BLIP usage:
from PIL import Image
import requests
from transformers import BlipProcessor, BlipForConditionalGeneration
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')
inputs = processor(raw_image, return_tensors="pt")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
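The same processor and model also support prompt-conditioned captioning, where a text prefix steers the generated caption (a short continuation of the snippet above, following the Hugging Face model card; the prefix string is only an example):
# Continues the BLIP snippet above: condition the caption on a text prefix.
text = "a photo of"
inputs = processor(raw_image, text, return_tensors="pt")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))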
Pros of Vision Transformer
- Focused specifically on vision tasks, potentially offering better performance for image-related applications
- Implements the original Vision Transformer (ViT) architecture, providing a foundation for further research and development
- Backed by Google Research, potentially benefiting from extensive resources and expertise
Cons of Vision Transformer
- Limited to vision tasks, whereas BLIP offers multimodal capabilities (vision and language)
- May require more computational resources for training and inference compared to BLIP's efficient design
- Less flexibility in terms of downstream tasks and fine-tuning options
Code Comparison
BLIP example:
from models.blip import blip_decoder
model = blip_decoder(pretrained='model_base', image_size=384, vit='base')
model.eval()
image = load_image(image_path)  # helper that resizes, normalizes, and batches the image as a tensor
captions = model.generate(image, sample=False, num_beams=3, max_length=20, min_length=5)
print(captions[0])
Vision Transformer example:
import tensorflow as tf
from vit_keras import vit
model = vit.vit_b16(
    image_size=224,
    activation='softmax',
    pretrained=True,
    include_top=True,
    pretrained_top=True,
)
prediction = model.predict(image)
ImageBind One Embedding Space to Bind Them All
Pros of ImageBind
- Supports multimodal learning across six modalities (text, image, audio, video, depth, thermal)
- Enables zero-shot transfer between modalities without fine-tuning
- Offers a more versatile and flexible approach to multimodal tasks
Cons of ImageBind
- More complex architecture and potentially higher computational requirements
- May require more extensive training data to achieve optimal performance
- Less focused on specific vision-language tasks compared to BLIP
Code Comparison
BLIP (image-text matching):
image = transform(image).unsqueeze(0).to(device)
caption = "a dog"
itm_output = model(image, caption, match_head='itm')
itm_score = torch.nn.functional.softmax(itm_output, dim=1)[:, 1]  # probability that the caption matches the image
ImageBind (multimodal embedding):
image_data = load_and_transform_vision_data(image_paths, device)
text_data = load_and_transform_text(text_list, device)
embeddings = model({"vision": image_data, "text": text_data})
Both repositories focus on multimodal learning, but ImageBind offers a broader range of modalities and zero-shot transfer capabilities. BLIP is more specialized for vision-language tasks, while ImageBind provides a more flexible framework for various multimodal applications. The code examples illustrate the different approaches: BLIP focuses on image-text matching, while ImageBind generates embeddings for multiple modalities simultaneously.
An open source implementation of CLIP.
Pros of open_clip
- Broader range of pre-trained models, including support for various architectures like ViT, ResNet, and EfficientNet
- More flexible and customizable training options, allowing users to fine-tune models on custom datasets
- Active community development with frequent updates and improvements
Cons of open_clip
- Less focus on multimodal tasks compared to BLIP, which excels in image-text understanding
- May require more expertise to use effectively due to its flexibility and extensive options
- Potentially higher computational requirements for training and fine-tuning
Code Comparison
BLIP example:
from PIL import Image
import requests
from torchvision import transforms
from models.blip import blip_decoder
raw_image = Image.open(requests.get(url, stream=True).raw).convert('RGB')  # url points at the input image
transform = transforms.Compose([transforms.Resize((384, 384)), transforms.ToTensor(),
                                transforms.Normalize((0.48145466, 0.4578275, 0.40821073),
                                                     (0.26862954, 0.26130258, 0.27577711))])
image = transform(raw_image).unsqueeze(0)
model = blip_decoder(pretrained='model_base', image_size=384, vit='base')
model.eval()
captions = model.generate(image, sample=False, num_beams=3, max_length=20, min_length=5)
open_clip example:
import open_clip
import torch
from PIL import Image
model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32', pretrained='laion2b_s34b_b79k')
image = preprocess(Image.open("CLIP.png")).unsqueeze(0)
text = open_clip.tokenize(["a diagram", "a dog", "a cat"])
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
README
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
Announcement: BLIP is now officially integrated into LAVIS - a one-stop library for language-and-vision research and applications!
This is the PyTorch code of the BLIP paper [blog]. The code has been tested on PyTorch 1.10. To install the dependencies, run
pip install -r requirements.txt
Catalog:
- Inference demo
- Pre-trained and finetuned checkpoints
- Finetuning code for Image-Text Retrieval, Image Captioning, VQA, and NLVR2
- Pre-training code
- Zero-shot video-text retrieval
- Download of bootstrapped pre-training datasets
Inference demo:
Run our interactive demo using Colab notebook (no GPU needed). The demo includes code for:
- Image captioning
- Open-ended visual question answering (a rough usage sketch follows below)
- Multimodal / unimodal feature extraction
- Image-text matching
Try out the Web demo, integrated into Huggingface Spaces 🤗 using Gradio.
A Replicate web demo and Docker image are also available.
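Outside the notebook, the demo's open-ended VQA flow can be sketched roughly as follows. This is an illustrative sketch based on the Colab demo rather than verbatim repository code: the checkpoint path and image file are placeholders, and the forward-call arguments should be checked against models/blip_vqa.py.
import torch
from PIL import Image
from torchvision import transforms
from models.blip_vqa import blip_vqa

# Placeholder checkpoint path; use one of the VQA checkpoints listed below.
model = blip_vqa(pretrained='checkpoints/model_base_vqa_capfilt_large.pth', image_size=480, vit='base')
model.eval()

transform = transforms.Compose([transforms.Resize((480, 480)), transforms.ToTensor(),
                                transforms.Normalize((0.48145466, 0.4578275, 0.40821073),
                                                     (0.26862954, 0.26130258, 0.27577711))])
image = transform(Image.open('demo.jpg').convert('RGB')).unsqueeze(0)

with torch.no_grad():
    answers = model(image, 'where is the dog sitting?', train=False, inference='generate')
print(answers[0])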
Pre-trained checkpoints:
Num. pre-train images | BLIP w/ ViT-B | BLIP w/ ViT-B and CapFilt-L | BLIP w/ ViT-L |
---|---|---|---|
14M | Download | - | - |
129M | Download | Download | Download |
Finetuned checkpoints:
Task | BLIP w/ ViT-B | BLIP w/ ViT-B and CapFilt-L | BLIP w/ ViT-L |
---|---|---|---|
Image-Text Retrieval (COCO) | Download | - | Download |
Image-Text Retrieval (Flickr30k) | Download | - | Download |
Image Captioning (COCO) | - | Download | Download |
VQA | Download | Download | - |
NLVR2 | Download | - | - |
Image-Text Retrieval:
- Download COCO and Flickr30k datasets from the original websites, and set 'image_root' in configs/retrieval_{dataset}.yaml accordingly.
- To evaluate the finetuned BLIP model on COCO, run:
python -m torch.distributed.run --nproc_per_node=8 train_retrieval.py --config ./configs/retrieval_coco.yaml --output_dir output/retrieval_coco --evaluate
- To finetune the pre-trained checkpoint using 8 A100 GPUs, first set 'pretrained' in configs/retrieval_coco.yaml as "https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base.pth". Then run:
python -m torch.distributed.run --nproc_per_node=8 train_retrieval.py --config ./configs/retrieval_coco.yaml --output_dir output/retrieval_coco
Image-Text Captioning:
- Download COCO and NoCaps datasets from the original websites, and set 'image_root' in configs/caption_coco.yaml and configs/nocaps.yaml accordingly.
- To evaluate the finetuned BLIP model on COCO, run:
python -m torch.distributed.run --nproc_per_node=8 train_caption.py --evaluate
- To evaluate the finetuned BLIP model on NoCaps, generate results with: (evaluation needs to be performed on official server)
python -m torch.distributed.run --nproc_per_node=8 eval_nocaps.py
- To finetune the pre-trained checkpoint using 8 A100 GPUs, first set 'pretrained' in configs/caption_coco.yaml as "https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base_capfilt_large.pth". Then run:
python -m torch.distributed.run --nproc_per_node=8 train_caption.py
VQA:
- Download VQA v2 dataset and Visual Genome dataset from the original websites, and set 'vqa_root' and 'vg_root' in configs/vqa.yaml.
- To evaluate the finetuned BLIP model, generate results with: (evaluation needs to be performed on official server)
python -m torch.distributed.run --nproc_per_node=8 train_vqa.py --evaluate
- To finetune the pre-trained checkpoint using 16 A100 GPUs, first set 'pretrained' in configs/vqa.yaml as "https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base_capfilt_large.pth". Then run:
python -m torch.distributed.run --nproc_per_node=16 train_vqa.py
NLVR2:
- Download NLVR2 dataset from the original websites, and set 'image_root' in configs/nlvr.yaml.
- To evaluate the finetuned BLIP model, run
python -m torch.distributed.run --nproc_per_node=8 train_nlvr.py --evaluate
- To finetune the pre-trained checkpoint using 16 A100 GPUs, first set 'pretrained' in configs/nlvr.yaml as "https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base.pth". Then run:
python -m torch.distributed.run --nproc_per_node=16 train_nlvr.py
Finetune with ViT-L:
In order to finetune a model with ViT-L, simply change the config file to set 'vit' to 'large'. Batch size and learning rate may also need to be adjusted accordingly (please see the paper's appendix for hyper-parameter details). Gradient checkpointing can also be activated in the config file to reduce GPU memory usage.
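As a rough illustration of this config change (only 'vit' is confirmed by this README; the other key names used below, such as 'batch_size', 'init_lr', and 'vit_grad_ckpt', are assumptions and should be checked against the yaml files in configs/):
import yaml

# Load an existing finetuning config and switch the vision backbone to ViT-L.
with open('configs/caption_coco.yaml') as f:
    cfg = yaml.safe_load(f)

cfg['vit'] = 'large'          # confirmed key: use the ViT-L backbone
cfg['batch_size'] = 16        # assumed key name: smaller batch for the larger backbone
cfg['init_lr'] = 1e-5         # assumed key name: lower learning rate (see the paper's appendix)
cfg['vit_grad_ckpt'] = True   # assumed key name: enable gradient checkpointing to save GPU memory

with open('configs/caption_coco_vitl.yaml', 'w') as f:
    yaml.safe_dump(cfg, f)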
Pre-train:
- Prepare training json files where each json file contains a list. Each item in the list is a dictionary with two key-value pairs: {'image': path_of_image, 'caption': text_of_image} (see the sketch after these steps).
- In configs/pretrain.yaml, set 'train_file' as the paths for the json files.
- Pre-train the model using 8 A100 GPUs:
python -m torch.distributed.run --nproc_per_node=8 pretrain.py --config ./configs/Pretrain.yaml --output_dir output/Pretrain
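For reference, a minimal script (file names and captions are illustrative) that assembles a training json in the format described above:
import json

# Each entry pairs a local image path with its caption text.
samples = [
    {'image': '/data/images/0000001.jpg', 'caption': 'a dog catching a frisbee in a park'},
    {'image': '/data/images/0000002.jpg', 'caption': 'a bowl of fruit on a wooden table'},
]

with open('pretrain_chunk_0.json', 'w') as f:
    json.dump(samples, f)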
Zero-shot video-text retrieval:
- Download MSRVTT dataset following the instructions from https://github.com/salesforce/ALPRO, and set 'video_root' accordingly in configs/retrieval_msrvtt.yaml.
- Install decord (used to decode and sample video frames; a short sketch follows these steps) with
pip install decord
- To perform zero-shot evaluation, run
python -m torch.distributed.run --nproc_per_node=8 eval_retrieval_video.py
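A minimal frame-sampling sketch with decord (illustrative only; the actual sampling happens inside the repository's video dataset loader, and the video file name is a placeholder):
import numpy as np
from decord import VideoReader, cpu

vr = VideoReader('example_video.mp4', ctx=cpu(0))          # decode the video on CPU
indices = np.linspace(0, len(vr) - 1, num=8).astype(int)   # uniformly sample 8 frames
frames = vr.get_batch(indices).asnumpy()                   # (8, H, W, 3) uint8 array
print(frames.shape)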
Pre-training datasets download:
We provide bootstrapped pre-training datasets as json files. Each json file contains a list. Each item in the list is a dictionary with two key-value pairs: {'url': url_of_image, 'caption': text_of_image}. A small consumption sketch follows the table below.
Image source | Filtered web caption | Filtered synthetic caption by ViT-B | Filtered synthetic caption by ViT-L |
---|---|---|---|
CC3M+CC12M+SBU | Download | Download | Download |
LAION115M | Download | Download | Download |
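As an illustration of how these files can be consumed (a minimal sketch; the json filename is a placeholder for whichever file you download, and unreachable URLs are simply skipped):
import json
import os
import requests

with open('downloaded_captions.json') as f:   # placeholder name for a downloaded json file
    samples = json.load(f)

os.makedirs('images', exist_ok=True)
for i, item in enumerate(samples[:100]):      # smoke-test on the first 100 pairs
    try:
        resp = requests.get(item['url'], timeout=10)
        resp.raise_for_status()
    except requests.RequestException:
        continue
    with open(os.path.join('images', f'{i:08d}.jpg'), 'wb') as img_file:
        img_file.write(resp.content)
    print(item['caption'])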
Citation
If you find this code to be useful for your research, please consider citing.
@inproceedings{li2022blip,
  title={BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation},
  author={Junnan Li and Dongxu Li and Caiming Xiong and Steven Hoi},
  year={2022},
  booktitle={ICML},
}
Acknowledgement
The implementation of BLIP relies on resources from ALBEF, Huggingface Transformers, and timm. We thank the original authors for open-sourcing their work.