BLIP
PyTorch code for BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
Top Related Projects
CLIP (Contrastive Language-Image Pretraining), Predict the most relevant text snippet given an image
ImageBind One Embedding Space to Bind Them All
An open source implementation of CLIP.
Quick Overview
BLIP (Bootstrapping Language-Image Pre-training) is a Salesforce-developed vision-language model that unifies understanding and generation tasks such as image captioning, visual question answering, image-text retrieval, and image-text matching. It bootstraps its training data by generating synthetic captions for web images and filtering noisy captions (CapFilt), and ships pre-trained and finetuned checkpoints for a range of downstream tasks.
Pros
- Versatile: BLIP handles a variety of vision-language tasks, including image captioning, visual question answering, image-text retrieval, and image-text matching.
- Efficient: The ViT-B variants are comparatively compact, and gradient checkpointing can be enabled in the configs to reduce GPU memory usage during finetuning.
- Pretrained Weights: BLIP comes with pretrained weights, allowing users to fine-tune the model for their specific use cases without the need for extensive training.
- Open-Source: BLIP is an open-source project, allowing developers to contribute to the codebase and customize the model as needed.
Cons
- Limited Documentation: The project's documentation could be more comprehensive, making it challenging for new users to get started.
- Narrow Scope: BLIP targets image-level vision-language tasks (plus zero-shot video-text retrieval) and does not cover additional modalities such as audio, depth, or thermal data.
- Potential Performance Limitations: The ViT-B variants may trail larger or more recent vision-language models on some benchmarks.
- Dependency on Specific Libraries: BLIP relies on specific libraries, such as PyTorch and Transformers, which may require additional setup and configuration.
Code Examples
Here are a few code examples demonstrating how to use BLIP for different vision-language tasks, using the Hugging Face Transformers integration of BLIP (the repository's own models.blip API is shown further down this page):
Image Captioning
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration
# Load the pre-trained BLIP captioning model
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
# Generate a caption for an image
image = Image.open('example_image.jpg').convert('RGB')
inputs = processor(image, return_tensors="pt")
out = model.generate(**inputs)
print(f'Caption: {processor.decode(out[0], skip_special_tokens=True)}')
Visual Question Answering
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering
# Load the pre-trained BLIP VQA model
processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")
# Ask a question about an image
image = Image.open('example_image.jpg').convert('RGB')
inputs = processor(image, 'how many dogs are in the picture?', return_tensors="pt")
out = model.generate(**inputs)
print(f'Answer: {processor.decode(out[0], skip_special_tokens=True)}')
Image-Text Matching
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForImageTextRetrieval
# Load the pre-trained BLIP image-text matching model
processor = BlipProcessor.from_pretrained("Salesforce/blip-itm-base-coco")
model = BlipForImageTextRetrieval.from_pretrained("Salesforce/blip-itm-base-coco")
# Score how well a caption matches an image
image = Image.open('example_image.jpg').convert('RGB')
inputs = processor(image, 'a dog playing in the park', return_tensors="pt")
itm_logits = model(**inputs)[0]  # logits over (no match, match)
match_prob = torch.softmax(itm_logits, dim=1)[:, 1]
print(f'Match probability: {match_prob.item():.3f}')
Getting Started
To get started with BLIP, follow these steps:
- Clone the BLIP repository:
git clone https://github.com/salesforce/BLIP.git
- Navigate to the project directory:
cd BLIP
- Create a virtual environment and activate it:
python -m venv venv
source venv/bin/activate
- Install the required dependencies:
pip install -r requirements.txt
- Explore the available models and tasks:
from models.blip import blip_decoder
from models.blip_vqa import blip_vqa
from models.blip_itm import blip_itm
# Load the pre-trained models (the 'pretrained' argument accepts a checkpoint path or URL)
captioner = blip_decoder(pretrained='model_base', image_size=384, vit='base')
vqa_model = blip_vqa(pretrained='model_base', image_size=480, vit='base')
itm_model = blip_itm(pretrained='model_base', image_size=384, vit='base')
- Fine-tune the models for your specific use case, if needed.
- Integrate the BLIP models into your application and start using them for vision-language tasks such as captioning, visual question answering, and image-text retrieval.
Competitor Comparisons
CLIP (Contrastive Language-Image Pretraining), Predict the most relevant text snippet given an image
Pros of CLIP
- Broader applicability across various vision-language tasks
- Larger pre-training dataset (400M image-text pairs)
- Zero-shot capabilities for image classification
Cons of CLIP
- Less specialized for image captioning and visual question answering
- May require more computational resources for fine-tuning
- Limited support for text generation tasks
Code Comparison
CLIP usage:
import torch
from PIL import Image
import clip
model, preprocess = clip.load("ViT-B/32", device="cuda")
image = preprocess(Image.open("image.jpg")).unsqueeze(0).to("cuda")
text = clip.tokenize(["a dog", "a cat"]).to("cuda")
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
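The snippet above stops at the raw embeddings; the zero-shot classification mentioned in the pros list can be completed by comparing the image against each prompt and taking a softmax (a small continuation of the snippet, following the usage shown in the CLIP README):
# Continues the CLIP snippet above: score the image against each text prompt.
with torch.no_grad():
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1)
print(probs)  # higher probability for the better-matching prompt ("a dog" vs. "a cat")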
BLIP usage:
from PIL import Image
import requests
from transformers import BlipProcessor, BlipForConditionalGeneration
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')
inputs = processor(raw_image, return_tensors="pt")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
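The same processor and model also support prompt-conditioned captioning, where a text prefix steers the generated caption (a short continuation of the snippet above, following the Hugging Face model card; the prefix string is only an example):
# Continues the BLIP snippet above: condition the caption on a text prefix.
text = "a photo of"
inputs = processor(raw_image, text, return_tensors="pt")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))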
Pros of Vision Transformer
- Focused specifically on vision tasks, potentially offering better performance for image-related applications
- Implements the original Vision Transformer (ViT) architecture, providing a foundation for further research and development
- Backed by Google Research, potentially benefiting from extensive resources and expertise
Cons of Vision Transformer
- Limited to vision tasks, whereas BLIP offers multimodal capabilities (vision and language)
- May require more computational resources for training and inference compared to BLIP's efficient design
- Less flexibility in terms of downstream tasks and fine-tuning options
Code Comparison
BLIP example:
from models.blip import blip_decoder
model = blip_decoder(pretrained='model_base', image_size=384, vit='base')
model.eval()
image = load_image(image_path)  # helper that resizes, normalizes, and batches the image as a tensor
captions = model.generate(image, sample=False, num_beams=3, max_length=20, min_length=5)
print(captions[0])
Vision Transformer example:
import tensorflow as tf
from vit_keras import vit
model = vit.vit_b16(
    image_size=224,
    activation='softmax',
    pretrained=True,
    include_top=True,
    pretrained_top=True,
)
prediction = model.predict(image)
ImageBind One Embedding Space to Bind Them All
Pros of ImageBind
- Supports multimodal learning across six modalities (text, image, audio, video, depth, thermal)
- Enables zero-shot transfer between modalities without fine-tuning
- Offers a more versatile and flexible approach to multimodal tasks
Cons of ImageBind
- More complex architecture and potentially higher computational requirements
- May require more extensive training data to achieve optimal performance
- Less focused on specific vision-language tasks compared to BLIP
Code Comparison
BLIP (image-text matching):
image = transform(image).unsqueeze(0).to(device)
caption = "a dog"
itm_output = model(image, caption, match_head='itm')
itm_score = torch.nn.functional.softmax(itm_output, dim=1)[:, 1]  # probability that the caption matches the image
ImageBind (multimodal embedding):
image_data = load_and_transform_vision_data(image_paths, device)
text_data = load_and_transform_text(text_list, device)
embeddings = model({"vision": image_data, "text": text_data})
Both repositories focus on multimodal learning, but ImageBind offers a broader range of modalities and zero-shot transfer capabilities. BLIP is more specialized for vision-language tasks, while ImageBind provides a more flexible framework for various multimodal applications. The code examples illustrate the different approaches: BLIP focuses on image-text matching, while ImageBind generates embeddings for multiple modalities simultaneously.
An open source implementation of CLIP.
Pros of open_clip
- Broader range of pre-trained models, including support for various architectures like ViT, ResNet, and EfficientNet
- More flexible and customizable training options, allowing users to fine-tune models on custom datasets
- Active community development with frequent updates and improvements
Cons of open_clip
- Less focus on multimodal tasks compared to BLIP, which excels in image-text understanding
- May require more expertise to use effectively due to its flexibility and extensive options
- Potentially higher computational requirements for training and fine-tuning
Code Comparison
BLIP example:
from PIL import Image
import requests
from torchvision import transforms
from models.blip import blip_decoder
raw_image = Image.open(requests.get(url, stream=True).raw).convert('RGB')  # url points at the input image
transform = transforms.Compose([transforms.Resize((384, 384)), transforms.ToTensor(),
                                transforms.Normalize((0.48145466, 0.4578275, 0.40821073),
                                                     (0.26862954, 0.26130258, 0.27577711))])
image = transform(raw_image).unsqueeze(0)
model = blip_decoder(pretrained='model_base', image_size=384, vit='base')
model.eval()
captions = model.generate(image, sample=False, num_beams=3, max_length=20, min_length=5)
open_clip example:
import open_clip
import torch
from PIL import Image
model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32', pretrained='laion2b_s34b_b79k')
image = preprocess(Image.open("CLIP.png")).unsqueeze(0)
text = open_clip.tokenize(["a diagram", "a dog", "a cat"])
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
README
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
Announcement: BLIP is now officially integrated into LAVIS - a one-stop library for language-and-vision research and applications!
This is the PyTorch code of the BLIP paper [blog]. The code has been tested on PyTorch 1.10. To install the dependencies, run
pip install -r requirements.txt
Catalog:
- Inference demo
- Pre-trained and finetuned checkpoints
- Finetuning code for Image-Text Retrieval, Image Captioning, VQA, and NLVR2
- Pre-training code
- Zero-shot video-text retrieval
- Download of bootstrapped pre-training datasets
Inference demo:
Run our interactive demo using Colab notebook (no GPU needed). The demo includes code for:
- Image captioning
- Open-ended visual question answering (a rough usage sketch follows below)
- Multimodal / unimodal feature extraction
- Image-text matching
Try out the Web demo, integrated into Huggingface Spaces 🤗 using Gradio.
A Replicate web demo and Docker image are also available.
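Outside the notebook, the demo's open-ended VQA flow can be sketched roughly as follows. This is an illustrative sketch based on the Colab demo rather than verbatim repository code: the checkpoint path and image file are placeholders, and the forward-call arguments should be checked against models/blip_vqa.py.
import torch
from PIL import Image
from torchvision import transforms
from models.blip_vqa import blip_vqa

# Placeholder checkpoint path; use one of the VQA checkpoints listed below.
model = blip_vqa(pretrained='checkpoints/model_base_vqa_capfilt_large.pth', image_size=480, vit='base')
model.eval()

transform = transforms.Compose([transforms.Resize((480, 480)), transforms.ToTensor(),
                                transforms.Normalize((0.48145466, 0.4578275, 0.40821073),
                                                     (0.26862954, 0.26130258, 0.27577711))])
image = transform(Image.open('demo.jpg').convert('RGB')).unsqueeze(0)

with torch.no_grad():
    answers = model(image, 'where is the dog sitting?', train=False, inference='generate')
print(answers[0])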
Pre-trained checkpoints:
Num. pre-train images | BLIP w/ ViT-B | BLIP w/ ViT-B and CapFilt-L | BLIP w/ ViT-L |
---|---|---|---|
14M | Download | - | - |
129M | Download | Download | Download |
Finetuned checkpoints:
Task | BLIP w/ ViT-B | BLIP w/ ViT-B and CapFilt-L | BLIP w/ ViT-L |
---|---|---|---|
Image-Text Retrieval (COCO) | Download | - | Download |
Image-Text Retrieval (Flickr30k) | Download | - | Download |
Image Captioning (COCO) | - | Download | Download |
VQA | Download | Download | - |
NLVR2 | Download | - | - |
Image-Text Retrieval:
- Download COCO and Flickr30k datasets from the original websites, and set 'image_root' in configs/retrieval_{dataset}.yaml accordingly.
- To evaluate the finetuned BLIP model on COCO, run:
python -m torch.distributed.run --nproc_per_node=8 train_retrieval.py --config ./configs/retrieval_coco.yaml --output_dir output/retrieval_coco --evaluate
- To finetune the pre-trained checkpoint using 8 A100 GPUs, first set 'pretrained' in configs/retrieval_coco.yaml as "https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base.pth". Then run:
python -m torch.distributed.run --nproc_per_node=8 train_retrieval.py --config ./configs/retrieval_coco.yaml --output_dir output/retrieval_coco
Image-Text Captioning:
- Download COCO and NoCaps datasets from the original websites, and set 'image_root' in configs/caption_coco.yaml and configs/nocaps.yaml accordingly.
- To evaluate the finetuned BLIP model on COCO, run:
python -m torch.distributed.run --nproc_per_node=8 train_caption.py --evaluate
- To evaluate the finetuned BLIP model on NoCaps, generate results with: (evaluation needs to be performed on official server)
python -m torch.distributed.run --nproc_per_node=8 eval_nocaps.py
- To finetune the pre-trained checkpoint using 8 A100 GPUs, first set 'pretrained' in configs/caption_coco.yaml as "https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base_capfilt_large.pth". Then run:
python -m torch.distributed.run --nproc_per_node=8 train_caption.py
VQA:
- Download VQA v2 dataset and Visual Genome dataset from the original websites, and set 'vqa_root' and 'vg_root' in configs/vqa.yaml.
- To evaluate the finetuned BLIP model, generate results with: (evaluation needs to be performed on official server)
python -m torch.distributed.run --nproc_per_node=8 train_vqa.py --evaluate
- To finetune the pre-trained checkpoint using 16 A100 GPUs, first set 'pretrained' in configs/vqa.yaml as "https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base_capfilt_large.pth". Then run:
python -m torch.distributed.run --nproc_per_node=16 train_vqa.py
NLVR2:
- Download NLVR2 dataset from the original websites, and set 'image_root' in configs/nlvr.yaml.
- To evaluate the finetuned BLIP model, run
python -m torch.distributed.run --nproc_per_node=8 train_nlvr.py --evaluate
- To finetune the pre-trained checkpoint using 16 A100 GPUs, first set 'pretrained' in configs/nlvr.yaml as "https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base.pth". Then run:
python -m torch.distributed.run --nproc_per_node=16 train_nlvr.py
Finetune with ViT-L:
In order to finetune a model with ViT-L, simply change the config file to set 'vit' to 'large'. Batch size and learning rate may also need to be adjusted accordingly (please see the paper's appendix for hyper-parameter details). Gradient checkpointing can also be activated in the config file to reduce GPU memory usage.
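As a rough illustration of this config change (only 'vit' is confirmed by this README; the other key names used below, such as 'batch_size', 'init_lr', and 'vit_grad_ckpt', are assumptions and should be checked against the yaml files in configs/):
import yaml

# Load an existing finetuning config and switch the vision backbone to ViT-L.
with open('configs/caption_coco.yaml') as f:
    cfg = yaml.safe_load(f)

cfg['vit'] = 'large'          # confirmed key: use the ViT-L backbone
cfg['batch_size'] = 16        # assumed key name: smaller batch for the larger backbone
cfg['init_lr'] = 1e-5         # assumed key name: lower learning rate (see the paper's appendix)
cfg['vit_grad_ckpt'] = True   # assumed key name: enable gradient checkpointing to save GPU memory

with open('configs/caption_coco_vitl.yaml', 'w') as f:
    yaml.safe_dump(cfg, f)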
Pre-train:
- Prepare training json files where each json file contains a list. Each item in the list is a dictionary with two key-value pairs: {'image': path_of_image, 'caption': text_of_image} (see the sketch after these steps).
- In configs/pretrain.yaml, set 'train_file' as the paths for the json files.
- Pre-train the model using 8 A100 GPUs:
python -m torch.distributed.run --nproc_per_node=8 pretrain.py --config ./configs/Pretrain.yaml --output_dir output/Pretrain
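For reference, a minimal script (file names and captions are illustrative) that assembles a training json in the format described above:
import json

# Each entry pairs a local image path with its caption text.
samples = [
    {'image': '/data/images/0000001.jpg', 'caption': 'a dog catching a frisbee in a park'},
    {'image': '/data/images/0000002.jpg', 'caption': 'a bowl of fruit on a wooden table'},
]

with open('pretrain_chunk_0.json', 'w') as f:
    json.dump(samples, f)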
Zero-shot video-text retrieval:
- Download MSRVTT dataset following the instructions from https://github.com/salesforce/ALPRO, and set 'video_root' accordingly in configs/retrieval_msrvtt.yaml.
- Install decord (used to decode and sample video frames; a short sketch follows these steps) with
pip install decord
- To perform zero-shot evaluation, run
python -m torch.distributed.run --nproc_per_node=8 eval_retrieval_video.py
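A minimal frame-sampling sketch with decord (illustrative only; the actual sampling happens inside the repository's video dataset loader, and the video file name is a placeholder):
import numpy as np
from decord import VideoReader, cpu

vr = VideoReader('example_video.mp4', ctx=cpu(0))          # decode the video on CPU
indices = np.linspace(0, len(vr) - 1, num=8).astype(int)   # uniformly sample 8 frames
frames = vr.get_batch(indices).asnumpy()                   # (8, H, W, 3) uint8 array
print(frames.shape)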
Pre-training datasets download:
We provide bootstrapped pre-training datasets as json files. Each json file contains a list. Each item in the list is a dictionary with two key-value pairs: {'url': url_of_image, 'caption': text_of_image}. A small consumption sketch follows the table below.
Image source | Filtered web caption | Filtered synthetic caption by ViT-B | Filtered synthetic caption by ViT-L |
---|---|---|---|
CC3M+CC12M+SBU | Download | Download | Download |
LAION115M | Download | Download | Download |
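As an illustration of how these files can be consumed (a minimal sketch; the json filename is a placeholder for whichever file you download, and unreachable URLs are simply skipped):
import json
import os
import requests

with open('downloaded_captions.json') as f:   # placeholder name for a downloaded json file
    samples = json.load(f)

os.makedirs('images', exist_ok=True)
for i, item in enumerate(samples[:100]):      # smoke-test on the first 100 pairs
    try:
        resp = requests.get(item['url'], timeout=10)
        resp.raise_for_status()
    except requests.RequestException:
        continue
    with open(os.path.join('images', f'{i:08d}.jpg'), 'wb') as img_file:
        img_file.write(resp.content)
    print(item['caption'])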
Citation
If you find this code to be useful for your research, please consider citing.
@inproceedings{li2022blip,
  title={BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation},
  author={Junnan Li and Dongxu Li and Caiming Xiong and Steven Hoi},
  year={2022},
  booktitle={ICML},
}
Acknowledgement
The implementation of BLIP relies on resources from ALBEF, Huggingface Transformers, and timm. We thank the original authors for open-sourcing their work.