
salesforce/BLIP

PyTorch code for BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation


Top Related Projects


  • openai/CLIP: CLIP (Contrastive Language-Image Pretraining), predict the most relevant text snippet given an image
  • facebookresearch/ImageBind: one embedding space to bind them all
  • mlfoundations/open_clip: an open source implementation of CLIP

Quick Overview

BLIP (Bootstrapping Language-Image Pre-training) is a vision-language model from Salesforce Research that unifies vision-language understanding and generation. It supports tasks such as image captioning, visual question answering (VQA), image-text retrieval, and multimodal feature extraction, and it bootstraps its training data by generating synthetic captions for noisy web images and filtering out the noisy ones.

Pros

  • Versatile: a single pre-trained model can be fine-tuned for image captioning, visual question answering, image-text retrieval, and NLVR2.
  • Easy to try: an interactive Colab notebook and a Hugging Face Spaces web demo are provided, so no GPU is needed to experiment.
  • Pretrained Weights: pre-trained and fine-tuned checkpoints are released, allowing users to run inference or adapt the model to their own use case without training from scratch.
  • Open-Source: the code and weights are openly released, and BLIP is also integrated into the LAVIS library and Hugging Face Transformers.

Cons

  • Limited Documentation: the README focuses on training and evaluation commands, and API-level documentation is sparse, which can make it harder for new users to get started.
  • Focused Scope: BLIP targets vision-language tasks (captioning, VQA, retrieval, NLVR2); it is not a general-purpose detector or segmenter.
  • Heavy Fine-tuning Requirements: the published recipes assume 8-16 A100 GPUs, which puts full fine-tuning and pre-training out of reach for many users.
  • Dependency on Specific Libraries: the code depends on PyTorch (tested on 1.10), timm, and Hugging Face Transformers, which may require careful environment setup.

Code Examples

Here are a few code examples demonstrating how to use BLIP for common vision-language tasks. The sketches below use the Hugging Face Transformers port of BLIP; the VQA and image-text matching checkpoint names are assumptions and should be verified on the Hugging Face Hub:

Image Captioning

from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Load the captioning processor and model from the Hugging Face Hub
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

# Generate a caption for a local image
image = Image.open('example_image.jpg').convert('RGB')
inputs = processor(image, return_tensors="pt")
out = model.generate(**inputs)
print(f"Caption: {processor.decode(out[0], skip_special_tokens=True)}")

Visual Question Answering

from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

# Load the VQA processor and model (checkpoint name assumed; verify on the Hub)
processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

# Ask a free-form question about an image
image = Image.open('example_image.jpg').convert('RGB')
inputs = processor(image, "how many dogs are in the picture?", return_tensors="pt")
out = model.generate(**inputs)
print(f"Answer: {processor.decode(out[0], skip_special_tokens=True)}")

Image-Text Matching

import torch
from PIL import Image
from transformers import BlipProcessor, BlipForImageTextRetrieval

# Load the image-text matching processor and model (checkpoint name assumed; verify on the Hub)
processor = BlipProcessor.from_pretrained("Salesforce/blip-itm-base-coco")
model = BlipForImageTextRetrieval.from_pretrained("Salesforce/blip-itm-base-coco")

# Score how well a caption describes the image
image = Image.open('example_image.jpg').convert('RGB')
inputs = processor(image, "a dog playing on the beach", return_tensors="pt")
with torch.no_grad():
    itm_logits = model(**inputs)[0]  # 2-way (no-match, match) logits from the ITM head
match_prob = torch.softmax(itm_logits, dim=1)[:, 1]
print(f"Match probability: {match_prob.item():.3f}")

Getting Started

To get started with BLIP, follow these steps:

  1. Clone the BLIP repository:

    git clone https://github.com/salesforce/BLIP.git
    
  2. Navigate to the project directory:

    cd BLIP
    
  3. Create a virtual environment and activate it:

    python -m venv venv
    source venv/bin/activate
    
  4. Install the required dependencies:

    pip install -r requirements.txt
    
  5. Explore the model builders provided by the repository; each takes a pretrained checkpoint path or URL plus model arguments such as image_size and vit ('base' or 'large'):

    from models.blip import blip_decoder, blip_feature_extractor
    from models.blip_vqa import blip_vqa
    from models.blip_itm import blip_itm

    # Builders for image captioning, feature extraction, VQA, and image-text matching, respectively
  6. Fine-tune the models for your specific use case, if needed.

  7. Integrate the BLIP models into your application and start using them for vision-language tasks such as captioning, visual question answering, and image-text retrieval. A minimal end-to-end captioning sketch is shown below.
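
As a quick sanity check, here is a minimal captioning sketch that uses the repository's own blip_decoder builder rather than the Transformers port. It closely follows the repo's Colab demo; the checkpoint URL comes from the README's captioning instructions, while the preprocessing constants (384x384 bicubic resize and CLIP-style normalization) are assumptions based on the demo notebook and should be checked against it.

import torch
from PIL import Image
from torchvision import transforms
from torchvision.transforms.functional import InterpolationMode
from models.blip import blip_decoder

device = 'cuda' if torch.cuda.is_available() else 'cpu'
image_size = 384

# Preprocessing assumed from the repo's demo notebook: bicubic resize + CLIP-style normalization
transform = transforms.Compose([
    transforms.Resize((image_size, image_size), interpolation=InterpolationMode.BICUBIC),
    transforms.ToTensor(),
    transforms.Normalize((0.48145466, 0.4578275, 0.40821073),
                         (0.26862954, 0.26130258, 0.27577711)),
])
image = transform(Image.open('example_image.jpg').convert('RGB')).unsqueeze(0).to(device)

# Pre-trained checkpoint URL taken from the README's captioning instructions below
model_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base_capfilt_large.pth'
model = blip_decoder(pretrained=model_url, image_size=image_size, vit='base').to(device)
model.eval()

with torch.no_grad():
    caption = model.generate(image, sample=False, num_beams=3, max_length=20, min_length=5)
print(f'Caption: {caption[0]}')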

Competitor Comparisons


CLIP (Contrastive Language-Image Pretraining), Predict the most relevant text snippet given an image

Pros of CLIP

  • Broader applicability across various vision-language tasks
  • Larger pre-training dataset (400M image-text pairs)
  • Zero-shot capabilities for image classification

Cons of CLIP

  • Less specialized for image captioning and visual question answering
  • May require more computational resources for fine-tuning
  • Limited support for text generation tasks

Code Comparison

CLIP usage:

import torch
from PIL import Image
import clip

model, preprocess = clip.load("ViT-B/32", device="cuda")
image = preprocess(Image.open("image.jpg")).unsqueeze(0).to("cuda")
text = clip.tokenize(["a dog", "a cat"]).to("cuda")

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

BLIP usage:

from PIL import Image
import requests
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg' 
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

inputs = processor(raw_image, return_tensors="pt")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))

Pros of Vision Transformer

  • Focused specifically on vision tasks, potentially offering better performance for image-related applications
  • Implements the original Vision Transformer (ViT) architecture, providing a foundation for further research and development
  • Backed by Google Research, potentially benefiting from extensive resources and expertise

Cons of Vision Transformer

  • Limited to vision tasks, whereas BLIP offers multimodal capabilities (vision and language)
  • May require more computational resources for training and inference compared to BLIP's efficient design
  • Less flexibility in terms of downstream tasks and fine-tuning options

Code Comparison

BLIP example:

from models.blip import blip_decoder

# 'pretrained' expects a checkpoint path or URL; load_image stands in for loading
# and preprocessing the image into a normalized tensor batch
model = blip_decoder(pretrained='model_base', image_size=384, vit='base')
model.eval()
image = load_image(image_path)
caption = model.generate(image, sample=False, num_beams=3, max_length=20, min_length=5)

Vision Transformer example:

from vit_keras import vit

# ViT-B/16 image classifier with pretrained weights and classification head
model = vit.vit_b16(
    image_size=224,
    activation='softmax',
    pretrained=True,
    include_top=True,
    pretrained_top=True
)
# `image` must be a preprocessed batch of 224x224 RGB images
prediction = model.predict(image)

ImageBind One Embedding Space to Bind Them All

Pros of ImageBind

  • Supports multimodal learning across six modalities (text, image, audio, video, depth, thermal)
  • Enables zero-shot transfer between modalities without fine-tuning
  • Offers a more versatile and flexible approach to multimodal tasks

Cons of ImageBind

  • More complex architecture and potentially higher computational requirements
  • May require more extensive training data to achieve optimal performance
  • Less focused on specific vision-language tasks compared to BLIP

Code Comparison

BLIP (image-text matching):

import torch.nn.functional as F

image = transform(image).unsqueeze(0).to(device)
caption = "a dog"
itm_output = model(image, caption, match_head='itm')  # 2-way (no-match, match) logits
itm_score = F.softmax(itm_output, dim=1)[:, 1]        # probability that the caption matches the image

ImageBind (multimodal embedding):

from imagebind import data
from imagebind.models.imagebind_model import ModalityType
image_data = data.load_and_transform_vision_data(image_paths, device)
text_data = data.load_and_transform_text(text_list, device)
embeddings = model({ModalityType.VISION: image_data, ModalityType.TEXT: text_data})

Both repositories focus on multimodal learning, but ImageBind offers a broader range of modalities and zero-shot transfer capabilities. BLIP is more specialized for vision-language tasks, while ImageBind provides a more flexible framework for various multimodal applications. The code examples illustrate the different approaches: BLIP focuses on image-text matching, while ImageBind generates embeddings for multiple modalities simultaneously.

An open source implementation of CLIP.

Pros of open_clip

  • Broader range of pre-trained models, including support for various architectures like ViT, ResNet, and EfficientNet
  • More flexible and customizable training options, allowing users to fine-tune models on custom datasets
  • Active community development with frequent updates and improvements

Cons of open_clip

  • Less focus on multimodal tasks compared to BLIP, which excels in image-text understanding
  • May require more expertise to use effectively due to its flexibility and extensive options
  • Potentially higher computational requirements for training and fine-tuning

Code Comparison

BLIP example:

from PIL import Image
import requests
from torchvision import transforms
from models.blip import blip_decoder

# The PIL image must be resized and normalized into a tensor batch before captioning;
# `transform` stands in for a torchvision preprocessing pipeline matching image_size=384
image = Image.open(requests.get(url, stream=True).raw).convert('RGB')
image = transform(image).unsqueeze(0)
model = blip_decoder(pretrained='model_base', image_size=384, vit='base')
caption = model.generate(image, sample=False, num_beams=3, max_length=20, min_length=5)

open_clip example:

import open_clip
import torch
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32', pretrained='laion2b_s34b_b79k')
image = preprocess(Image.open("CLIP.png")).unsqueeze(0)
text = open_clip.tokenize(["a diagram", "a dog", "a cat"])
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    logits_per_image, logits_per_text = model(image, text)


README

BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

Announcement: BLIP is now officially integrated into LAVIS - a one-stop library for language-and-vision research and applications!

This is the PyTorch code of the BLIP paper [blog]. The code has been tested on PyTorch 1.10. To install the dependencies, run

pip install -r requirements.txt

Catalog:

  • Inference demo
  • Pre-trained and finetuned checkpoints
  • Finetuning code for Image-Text Retrieval, Image Captioning, VQA, and NLVR2
  • Pre-training code
  • Zero-shot video-text retrieval
  • Download of bootstrapped pre-training datasets

Inference demo:

Run our interactive demo using Colab notebook (no GPU needed). The demo includes code for:

  1. Image captioning
  2. Open-ended visual question answering
  3. Multimodal / unimodal feature extraction (see the sketch below)
  4. Image-text matching

Try out the Web demo, integrated into Huggingface Spaces 🤗 using Gradio.

A web demo and Docker image are also available at Replicate.
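
For the feature-extraction use case above, here is a minimal sketch based on the repository's blip_feature_extractor builder. The checkpoint URL, 224x224 image size, normalization constants, and the mode argument values ('multimodal', 'image', 'text') follow the demo notebook as I read it and should be verified against it.

import torch
from PIL import Image
from torchvision import transforms
from torchvision.transforms.functional import InterpolationMode
from models.blip import blip_feature_extractor

device = 'cuda' if torch.cuda.is_available() else 'cpu'
image_size = 224  # the demo uses a smaller resolution for feature extraction

transform = transforms.Compose([
    transforms.Resize((image_size, image_size), interpolation=InterpolationMode.BICUBIC),
    transforms.ToTensor(),
    transforms.Normalize((0.48145466, 0.4578275, 0.40821073),
                         (0.26862954, 0.26130258, 0.27577711)),
])
image = transform(Image.open('example_image.jpg').convert('RGB')).unsqueeze(0).to(device)

model_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base.pth'
model = blip_feature_extractor(pretrained=model_url, image_size=image_size, vit='base').to(device)
model.eval()

caption = 'a woman sitting on the beach with a dog'
with torch.no_grad():
    multimodal_feature = model(image, caption, mode='multimodal')[0, 0]  # fused image-text embedding
    image_feature = model(image, caption, mode='image')[0, 0]            # image-only embedding
    text_feature = model(image, caption, mode='text')[0, 0]              # text-only embedding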

Pre-trained checkpoints:

Num. pre-train images | BLIP w/ ViT-B | BLIP w/ ViT-B and CapFilt-L | BLIP w/ ViT-L
14M                   | Download      | -                           | -
129M                  | Download      | Download                    | Download

Finetuned checkpoints:

Task                             | BLIP w/ ViT-B | BLIP w/ ViT-B and CapFilt-L | BLIP w/ ViT-L
Image-Text Retrieval (COCO)      | Download      | -                           | Download
Image-Text Retrieval (Flickr30k) | Download      | -                           | Download
Image Captioning (COCO)          | -             | Download                    | Download
VQA                              | Download      | Download                    | -
NLVR2                            | Download      | -                           | -

Image-Text Retrieval:

  1. Download COCO and Flickr30k datasets from the original websites, and set 'image_root' in configs/retrieval_{dataset}.yaml accordingly.
  2. To evaluate the finetuned BLIP model on COCO, run:
python -m torch.distributed.run --nproc_per_node=8 train_retrieval.py \
--config ./configs/retrieval_coco.yaml \
--output_dir output/retrieval_coco \
--evaluate
  3. To finetune the pre-trained checkpoint using 8 A100 GPUs, first set 'pretrained' in configs/retrieval_coco.yaml as "https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base.pth". Then run:
python -m torch.distributed.run --nproc_per_node=8 train_retrieval.py \
--config ./configs/retrieval_coco.yaml \
--output_dir output/retrieval_coco 

Image-Text Captioning:

  1. Download COCO and NoCaps datasets from the original websites, and set 'image_root' in configs/caption_coco.yaml and configs/nocaps.yaml accordingly.
  2. To evaluate the finetuned BLIP model on COCO, run:
python -m torch.distributed.run --nproc_per_node=8 train_caption.py --evaluate
  3. To evaluate the finetuned BLIP model on NoCaps, generate results with the following command (evaluation needs to be performed on the official server):
python -m torch.distributed.run --nproc_per_node=8 eval_nocaps.py 
  4. To finetune the pre-trained checkpoint using 8 A100 GPUs, first set 'pretrained' in configs/caption_coco.yaml as "https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base_capfilt_large.pth". Then run:
python -m torch.distributed.run --nproc_per_node=8 train_caption.py 

VQA:

  1. Download VQA v2 dataset and Visual Genome dataset from the original websites, and set 'vqa_root' and 'vg_root' in configs/vqa.yaml.
  2. To evaluate the finetuned BLIP model, generate results with the following command (evaluation needs to be performed on the official server):
python -m torch.distributed.run --nproc_per_node=8 train_vqa.py --evaluate
  3. To finetune the pre-trained checkpoint using 16 A100 GPUs, first set 'pretrained' in configs/vqa.yaml as "https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base_capfilt_large.pth". Then run:
python -m torch.distributed.run --nproc_per_node=16 train_vqa.py 

NLVR2:

  1. Download NLVR2 dataset from the original websites, and set 'image_root' in configs/nlvr.yaml.
  2. To evaluate the finetuned BLIP model, run:
python -m torch.distributed.run --nproc_per_node=8 train_nlvr.py --evaluate
  3. To finetune the pre-trained checkpoint using 16 A100 GPUs, first set 'pretrained' in configs/nlvr.yaml as "https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base.pth". Then run:
python -m torch.distributed.run --nproc_per_node=16 train_nlvr.py 

Finetune with ViT-L:

In order to finetune a model with ViT-L, simply set 'vit' to 'large' in the config file. Batch size and learning rate may also need to be adjusted accordingly (please see the paper's appendix for hyper-parameter details). Gradient checkpointing can also be activated in the config file to reduce GPU memory usage.

Pre-train:

  1. Prepare training json files where each json file contains a list. Each item in the list is a dictionary with two key-value pairs: {'image': path_of_image, 'caption': text_of_image}. (See the sketch after this list for one way to build such a file.)
  2. In configs/pretrain.yaml, set 'train_file' to the paths of the json files.
  3. Pre-train the model using 8 A100 GPUs:
python -m torch.distributed.run --nproc_per_node=8 pretrain.py --config ./configs/Pretrain.yaml --output_dir output/Pretrain 
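
As a sketch of step 1, here is a small hypothetical helper (not part of the repository) that writes a training json file in the {'image', 'caption'} format described above:

import json

def write_train_file(pairs, out_path):
    """pairs: iterable of (image_path, caption) tuples; writes the list-of-dicts format used for pre-training."""
    annotations = [{'image': image_path, 'caption': caption} for image_path, caption in pairs]
    with open(out_path, 'w') as f:
        json.dump(annotations, f)

# Example with made-up paths and captions
write_train_file(
    [('images/0000001.jpg', 'a dog running on the beach'),
     ('images/0000002.jpg', 'two people riding bicycles down a street')],
    'annotations/pretrain_part0.json',
)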

Zero-shot video-text retrieval:

  1. Download MSRVTT dataset following the instructions from https://github.com/salesforce/ALPRO, and set 'video_root' accordingly in configs/retrieval_msrvtt.yaml.
  2. Install decord with
    pip install decord
  3. To perform zero-shot evaluation, run
python -m torch.distributed.run --nproc_per_node=8 eval_retrieval_video.py

Pre-training datasets download:

We provide bootstrapped pre-training datasets as json files. Each json file contains a list. Each item in the list is a dictionary with two key-value pairs: {'url': url_of_image, 'caption': text_of_image}. A sketch for downloading the images referenced by such a file is shown after the table below.

Image source   | Filtered web caption | Filtered synthetic caption by ViT-B | Filtered synthetic caption by ViT-L
CC3M+CC12M+SBU | Download             | Download                            | Download
LAION115M      | Download             | Download                            | Download
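
Since the json files only reference image URLs, the images themselves have to be downloaded before pre-training. Here is a minimal, hypothetical download sketch (file names, output layout, and error handling are illustrative, not part of the repository); for datasets of this size a parallel downloader is preferable in practice:

import json
import os
import requests

def download_split(json_path, out_dir):
    """Download images referenced by a {'url', 'caption'} json file and record local paths."""
    os.makedirs(out_dir, exist_ok=True)
    with open(json_path) as f:
        annotations = json.load(f)  # list of {'url': ..., 'caption': ...}
    for i, ann in enumerate(annotations):
        local_path = os.path.join(out_dir, f'{i:09d}.jpg')
        try:
            response = requests.get(ann['url'], timeout=10)
            response.raise_for_status()
            with open(local_path, 'wb') as img_file:
                img_file.write(response.content)
            ann['image'] = local_path  # converts to the {'image', 'caption'} format used for pre-training
        except requests.RequestException:
            ann['image'] = None        # some web images are no longer available
    return annotations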

Citation

If you find this code to be useful for your research, please consider citing.

@inproceedings{li2022blip,
      title={BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation}, 
      author={Junnan Li and Dongxu Li and Caiming Xiong and Steven Hoi},
      year={2022},
      booktitle={ICML},
}

Acknowledgement

The implementation of BLIP relies on resources from ALBEF, Huggingface Transformers, and timm. We thank the original authors for open-sourcing their work.