facebookresearch / ImageBind

ImageBind: One Embedding Space to Bind Them All


Top Related Projects

  • CLIP (Contrastive Language-Image Pretraining): predict the most relevant text snippet given an image
  • 🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX
  • PyTorch code for BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
  • TensorFlow code and pre-trained models for BERT

Quick Overview

ImageBind is a project by Facebook Research that introduces a novel approach to joint embedding across six modalities: images, text, audio, depth, thermal, and IMU data. It enables zero-shot classification and retrieval across these modalities, allowing for tasks like audio-to-image retrieval or text-to-thermal generation without explicit training on these tasks.

Pros

  • Enables cross-modal understanding and generation without task-specific training
  • Supports a wide range of modalities (six in total)
  • Demonstrates strong performance in zero-shot classification and retrieval tasks
  • Open-source implementation available for research and development

Cons

  • Requires significant computational resources for training and inference
  • May have limitations in handling complex or highly specialized domain-specific data
  • Potential privacy concerns when dealing with multi-modal personal data
  • Limited documentation and examples for some advanced use cases

Code Examples

  1. Loading the pre-trained ImageBind model:
import torch
from imagebind import data
from imagebind.models import imagebind_model
from imagebind.models.imagebind_model import ModalityType

device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Load the pre-trained model and move it to the available device
model = imagebind_model.imagebind_huge(pretrained=True)
model.eval()
model.to(device)
  2. Preparing inputs for different modalities:
# Prepare inputs for image, text, and audio using ModalityType keys
inputs = {
    ModalityType.VISION: data.load_and_transform_vision_data(["path/to/image.jpg"], device),
    ModalityType.TEXT: data.load_and_transform_text(["A description of the image"], device),
    ModalityType.AUDIO: data.load_and_transform_audio_data(["path/to/audio.wav"], device),
}
  3. Generating embeddings for multiple modalities:
# Generate embeddings in a single forward pass
with torch.no_grad():
    embeddings = model(inputs)

# Access embeddings for each modality
image_embeddings = embeddings[ModalityType.VISION]
text_embeddings = embeddings[ModalityType.TEXT]
audio_embeddings = embeddings[ModalityType.AUDIO]
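Because all modalities share one embedding space, comparing them reduces to a dot product. A minimal follow-up sketch, continuing from the variables above (the softmax scoring mirrors the usage example in the README below):

# Score each image against each text description; higher means a better match
vision_text_scores = torch.softmax(
    embeddings[ModalityType.VISION] @ embeddings[ModalityType.TEXT].T, dim=-1
)
print("Vision x Text:", vision_text_scores)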

Getting Started

To get started with ImageBind:

  1. Install the required dependencies:
pip install torch torchvision torchaudio
pip install git+https://github.com/facebookresearch/ImageBind.git
  2. Import the necessary modules and load the pre-trained model:
import torch
from imagebind import data
from imagebind.models import imagebind_model
from imagebind.models.imagebind_model import ModalityType

device = "cuda:0" if torch.cuda.is_available() else "cpu"

model = imagebind_model.imagebind_huge(pretrained=True)
model.eval()
model.to(device)
  3. Prepare your inputs and generate embeddings:
inputs = {
    ModalityType.VISION: data.load_and_transform_vision_data(["path/to/image.jpg"], device),
    ModalityType.TEXT: data.load_and_transform_text(["A description of the image"], device),
}

with torch.no_grad():
    embeddings = model(inputs)
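From here, zero-shot classification is a similarity lookup: encode a set of candidate text labels and pick the one whose embedding is closest to the image embedding. A minimal sketch under the setup above (the label list is purely illustrative):

# Zero-shot classification: score the image against candidate text labels
labels = ["a dog", "a cat", "a car"]  # illustrative label set
label_inputs = {ModalityType.TEXT: data.load_and_transform_text(labels, device)}

with torch.no_grad():
    label_embeddings = model(label_inputs)[ModalityType.TEXT]

scores = torch.softmax(embeddings[ModalityType.VISION] @ label_embeddings.T, dim=-1)
predicted_label = labels[scores.argmax(dim=-1).item()]
print(predicted_label)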

Competitor Comparisons


CLIP (Contrastive Language-Image Pretraining), Predict the most relevant text snippet given an image

Pros of CLIP

  • More established and widely adopted in the research community
  • Extensive documentation and examples available
  • Supports a broader range of pre-trained models

Cons of CLIP

  • Limited to image-text pairs, while ImageBind supports multiple modalities
  • May require more computational resources for training and inference
  • Less flexible for custom modality combinations

Code Comparison

CLIP:

import torch
from PIL import Image
import clip

model, preprocess = clip.load("ViT-B/32", device="cuda")
image = preprocess(Image.open("image.jpg")).unsqueeze(0).to("cuda")
text = clip.tokenize(["a photo of a cat"]).to("cuda")

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

ImageBind:

import torch
from imagebind import data
from imagebind.models import imagebind_model
from imagebind.models.imagebind_model import ModalityType

device = "cuda" if torch.cuda.is_available() else "cpu"

inputs = {
    ModalityType.VISION: data.load_and_transform_vision_data(["image.jpg"], device),
    ModalityType.TEXT: data.load_and_transform_text(["a photo of a cat"], device),
}

model = imagebind_model.imagebind_huge(pretrained=True)
model.eval()
model.to(device)

with torch.no_grad():
    embeddings = model(inputs)

Pros of Vision Transformer

  • Focused specifically on image classification tasks
  • Implements the original Vision Transformer (ViT) architecture
  • Provides pre-trained models for various ViT configurations

Cons of Vision Transformer

  • Limited to visual data processing
  • Less versatile for multi-modal applications
  • Requires more computational resources for training large models

Code Comparison

Vision Transformer:

from vit_jax import models

# Initialize a ViT-B/16 model for ImageNet-1k classification
model = models.vit_b16(
    num_classes=1000,
    representation_size=None,
)

ImageBind:

import torch
from imagebind import data
from imagebind.models import imagebind_model
from imagebind.models.imagebind_model import ModalityType

device = "cuda" if torch.cuda.is_available() else "cpu"
text = ["A dog", "A cat"]
image_paths = ["dog.jpg", "cat.jpg"]

model = imagebind_model.imagebind_huge(pretrained=True)
model.eval()
inputs = {
    ModalityType.TEXT: data.load_and_transform_text(text, device),
    ModalityType.VISION: data.load_and_transform_vision_data(image_paths, device),
}

The Vision Transformer code focuses on initializing a ViT model for image classification, while ImageBind's code demonstrates its multi-modal capabilities by loading both text and vision data for processing.


Pros of BioGPT

  • Specialized for biomedical text processing and generation
  • Trained on large-scale biomedical literature datasets
  • Supports domain-specific tasks like named entity recognition and relation extraction

Cons of BioGPT

  • Limited to text-based inputs and outputs
  • Focused on a specific domain, potentially less versatile for general applications
  • May require more domain expertise to utilize effectively

Code Comparison

BioGPT:

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("microsoft/biogpt")
model = AutoModelForCausalLM.from_pretrained("microsoft/biogpt")

input_text = "The BRCA1 gene is associated with"
input_ids = tokenizer.encode(input_text, return_tensors="pt")
output = model.generate(input_ids, max_length=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))

ImageBind:

import torch
from imagebind import data
from imagebind.models import imagebind_model
from imagebind.models.imagebind_model import ModalityType

device = "cuda" if torch.cuda.is_available() else "cpu"

model = imagebind_model.imagebind_huge(pretrained=True)
model.eval()

inputs = {
    ModalityType.TEXT: data.load_and_transform_text(["A dog", "A cat"], device),
    ModalityType.VISION: data.load_and_transform_vision_data(["dog.jpg", "cat.jpg"], device),
}
with torch.no_grad():
    embeddings = model(inputs)

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.

Pros of transformers

  • Broader scope: Supports a wide range of NLP tasks and models
  • Extensive documentation and community support
  • Regular updates and new model implementations

Cons of transformers

  • Steeper learning curve due to its extensive features
  • Potentially higher computational requirements for some models
  • Less focused on multimodal learning compared to ImageBind

Code Comparison

transformers:

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
inputs = tokenizer("Hello world!", return_tensors="pt")
outputs = model(**inputs)

ImageBind:

import torch
from imagebind import data
from imagebind.models import imagebind_model
from imagebind.models.imagebind_model import ModalityType

device = "cuda" if torch.cuda.is_available() else "cpu"

model = imagebind_model.imagebind_huge(pretrained=True)
model.eval()

inputs = {
    ModalityType.TEXT: data.load_and_transform_text(["A dog", "A cat"], device),
    ModalityType.VISION: data.load_and_transform_vision_data(["dog.jpg", "cat.jpg"], device),
}
with torch.no_grad():
    embeddings = model(inputs)

Both repositories offer powerful tools for working with machine learning models, but they focus on different aspects: transformers provides a comprehensive suite for various NLP tasks, while ImageBind specializes in multimodal learning across different types of data.


PyTorch code for BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

Pros of BLIP

  • Focuses specifically on vision-language tasks like image captioning and visual question answering
  • Provides pre-trained models for immediate use in various vision-language applications
  • Offers a more streamlined approach for developers working primarily with image-text interactions

Cons of BLIP

  • Limited to image and text modalities, lacking support for audio or other sensory inputs
  • May require more domain-specific knowledge for optimal use in vision-language tasks
  • Less versatile for projects requiring multi-modal integration beyond image and text

Code Comparison

BLIP example:

from PIL import Image
import requests
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg' 
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

inputs = processor(raw_image, return_tensors="pt")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))

ImageBind example:

import torch
from imagebind import data
from imagebind.models import imagebind_model
from imagebind.models.imagebind_model import ModalityType

device = "cuda" if torch.cuda.is_available() else "cpu"

text = data.load_and_transform_text(["A dog", "A cat"], device)
image = data.load_and_transform_vision_data(["dog.jpg", "cat.jpg"], device)
audio = data.load_and_transform_audio_data(["dog.wav", "cat.wav"], device)

model = imagebind_model.imagebind_huge(pretrained=True)
model.eval()
model.to(device)

with torch.no_grad():
    embeddings = model({
        ModalityType.TEXT: text,
        ModalityType.VISION: image,
        ModalityType.AUDIO: audio,
    })

TensorFlow code and pre-trained models for BERT

Pros of BERT

  • Widely adopted and well-established in NLP tasks
  • Extensive documentation and community support
  • Pre-trained models available for various languages

Cons of BERT

  • Limited to text modalities only
  • Requires fine-tuning for specific tasks
  • Computationally intensive for large-scale applications

Code Comparison

BERT example:

from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

ImageBind example:

from imagebind import data
from imagebind.models import imagebind_model
model = imagebind_model.imagebind_huge(pretrained=True)

Key Differences

  • ImageBind focuses on multi-modal learning, while BERT specializes in text processing
  • BERT has a larger ecosystem of pre-trained models and fine-tuning scripts
  • ImageBind offers a more unified approach to handling different data types (text, image, audio)
  • BERT is more suitable for pure NLP tasks, while ImageBind excels in cross-modal applications

Use Cases

  • BERT: Text classification, named entity recognition, question answering
  • ImageBind: Multi-modal retrieval, cross-modal generation, audio-visual understanding (see the retrieval sketch after this list)
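As a concrete illustration of the retrieval use case, here is a minimal sketch that ranks a small image gallery against an audio query by comparing ImageBind embeddings. The file paths are placeholders, and the softmax scoring mirrors the usage example in the README below:

import torch
from imagebind import data
from imagebind.models import imagebind_model
from imagebind.models.imagebind_model import ModalityType

device = "cuda" if torch.cuda.is_available() else "cpu"

model = imagebind_model.imagebind_huge(pretrained=True)
model.eval()
model.to(device)

# Placeholder paths: one audio query and several candidate images
inputs = {
    ModalityType.AUDIO: data.load_and_transform_audio_data(["query.wav"], device),
    ModalityType.VISION: data.load_and_transform_vision_data(
        ["img_0.jpg", "img_1.jpg", "img_2.jpg"], device
    ),
}

with torch.no_grad():
    embeddings = model(inputs)

# Rank the gallery images by similarity to the audio query
scores = torch.softmax(
    embeddings[ModalityType.AUDIO] @ embeddings[ModalityType.VISION].T, dim=-1
)
ranking = scores.argsort(dim=-1, descending=True)
print(ranking)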


README

ImageBind: One Embedding Space To Bind Them All

FAIR, Meta AI

Rohit Girdhar*, Alaaeldin El-Nouby*, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra*

To appear at CVPR 2023 (Highlighted paper)

[Paper] [Blog] [Demo] [Supplementary Video] [BibTex]

PyTorch implementation and pretrained models for ImageBind. For details, see the paper: ImageBind: One Embedding Space To Bind Them All.

ImageBind learns a joint embedding across six different modalities - images, text, audio, depth, thermal, and IMU data. It enables novel emergent applications ‘out-of-the-box’ including cross-modal retrieval, composing modalities with arithmetic, cross-modal detection and generation.
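One of the emergent applications above is composing modalities with arithmetic: because every modality is embedded into the same space, embeddings can be summed to form a combined query. A rough sketch of the idea, using random tensors as stand-ins for real ImageBind embeddings (imagebind_huge produces 1024-dimensional embeddings; in practice these would come from model(inputs) as in the usage example below):

import torch
import torch.nn.functional as F

# Stand-ins for L2-normalized ImageBind embeddings (illustrative only)
image_emb = F.normalize(torch.randn(1, 1024), dim=-1)    # e.g. a photo of a beach
audio_emb = F.normalize(torch.randn(1, 1024), dim=-1)    # e.g. the sound of a dog barking
gallery = F.normalize(torch.randn(100, 1024), dim=-1)    # embeddings of candidate images

# Compose modalities with arithmetic: add the embeddings to form a joint query
query = F.normalize(image_emb + audio_emb, dim=-1)

# Retrieve the gallery item most similar to the composed query
scores = query @ gallery.T                # cosine similarities, shape (1, 100)
best_match = scores.argmax(dim=-1).item()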

[Figure: ImageBind model, binding six modalities into a single embedding space]

Emergent zero-shot classification performance.

Model            IN1k   K400   NYU-D   ESC    LLVIP   Ego4D   Download
imagebind_huge   77.7   50.0   54.0    66.9   63.4    25.0    checkpoint

Usage

Install PyTorch 1.13+ and other third-party dependencies.

conda create --name imagebind python=3.10 -y
conda activate imagebind

pip install .

For Windows users, you might need to install soundfile for reading/writing audio files. (Thanks @congyue1977)

pip install soundfile

Extract and compare features across modalities (e.g. Image, Text and Audio).

from imagebind import data
import torch
from imagebind.models import imagebind_model
from imagebind.models.imagebind_model import ModalityType

text_list=["A dog.", "A car", "A bird"]
image_paths=[".assets/dog_image.jpg", ".assets/car_image.jpg", ".assets/bird_image.jpg"]
audio_paths=[".assets/dog_audio.wav", ".assets/car_audio.wav", ".assets/bird_audio.wav"]

device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Instantiate model
model = imagebind_model.imagebind_huge(pretrained=True)
model.eval()
model.to(device)

# Load data
inputs = {
    ModalityType.TEXT: data.load_and_transform_text(text_list, device),
    ModalityType.VISION: data.load_and_transform_vision_data(image_paths, device),
    ModalityType.AUDIO: data.load_and_transform_audio_data(audio_paths, device),
}

with torch.no_grad():
    embeddings = model(inputs)

print(
    "Vision x Text: ",
    torch.softmax(embeddings[ModalityType.VISION] @ embeddings[ModalityType.TEXT].T, dim=-1),
)
print(
    "Audio x Text: ",
    torch.softmax(embeddings[ModalityType.AUDIO] @ embeddings[ModalityType.TEXT].T, dim=-1),
)
print(
    "Vision x Audio: ",
    torch.softmax(embeddings[ModalityType.VISION] @ embeddings[ModalityType.AUDIO].T, dim=-1),
)

# Expected output:
#
# Vision x Text:
# tensor([[9.9761e-01, 2.3694e-03, 1.8612e-05],
#         [3.3836e-05, 9.9994e-01, 2.4118e-05],
#         [4.7997e-05, 1.3496e-02, 9.8646e-01]])
#
# Audio x Text:
# tensor([[1., 0., 0.],
#         [0., 1., 0.],
#         [0., 0., 1.]])
#
# Vision x Audio:
# tensor([[0.8070, 0.1088, 0.0842],
#         [0.1036, 0.7884, 0.1079],
#         [0.0018, 0.0022, 0.9960]])

Model card

Please see the model card for details.

License

ImageBind code and model weights are released under the CC-BY-NC 4.0 license. See LICENSE for additional details.

Contributing

See contributing and the code of conduct.

Citing ImageBind

If you find this repository useful, please consider giving it a star :star: and citing it:

@inproceedings{girdhar2023imagebind,
  title={ImageBind: One Embedding Space To Bind Them All},
  author={Girdhar, Rohit and El-Nouby, Alaaeldin and Liu, Zhuang and Singh, Mannat and Alwala, Kalyan Vasudev and Joulin, Armand and Misra, Ishan},
  booktitle={CVPR},
  year={2023}
}