Top Related Projects
PyTorch code for BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
LAVIS - A One-stop Library for Language-Vision Intelligence
🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
CLIP (Contrastive Language-Image Pretraining), Predict the most relevant text snippet given an image
Quick Overview
Oscar is an open-source research codebase from Microsoft for pre-training and fine-tuning large-scale vision-language models. It covers tasks such as image captioning, visual question answering, and image-text retrieval, and its core idea is to use object tags detected in images as anchor points for aligning image regions with text, bridging the gap between vision and language understanding.
Pros
- Supports a wide range of vision-language tasks
- Provides pre-trained models for quick implementation
- Offers flexibility in model architecture and training strategies
- Integrates well with popular deep learning frameworks like PyTorch
Cons
- Requires significant computational resources for training large models
- May have a steep learning curve for beginners in vision-language tasks
- Documentation could be more comprehensive for some advanced features
- Limited support for real-time inference in production environments
Code Examples
- Loading a pre-trained Oscar model:
from oscar.modeling.modeling_bert import BertForImageCaptioning
# Oscar checkpoints are distributed as local directories (see DOWNLOAD.md);
# the path below is a placeholder, not a hub-style model name.
model = BertForImageCaptioning.from_pretrained('path/to/oscar-base-coco')
- Preparing input for image captioning:
import torch
from transformers import BertTokenizer
from oscar.utils.misc import load_from_yaml_file
config = load_from_yaml_file('path/to/config.yaml')  # task/runtime settings
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# Oscar consumes region features from an external detector (e.g. Faster R-CNN / VinVL)
# rather than raw pixels; this random tensor stands in for such features.
img_feats = torch.rand(1, 50, 2054)  # (batch, num_regions, feature_dim) -- illustrative shape
# For captioning, the text side of the input is the detected object tags
input_ids = tokenizer.encode('dog frisbee grass', add_special_tokens=True)
inputs = {
    'input_ids': torch.tensor([input_ids]),
    'attention_mask': torch.tensor([[1] * len(input_ids)]),
    'token_type_ids': torch.tensor([[0] * len(input_ids)]),
    'img_feats': img_feats
}
- Generating a caption:
# Beam-search decoding; the exact generation interface depends on the checkpoint and config
outputs = model.generate(**inputs, max_length=20, num_beams=5)
caption = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Generated caption: {caption}")
Getting Started
To get started with Oscar:
- Install the required dependencies (see INSTALL.md for the authoritative, pinned setup):
pip install torch torchvision transformers
git clone https://github.com/microsoft/Oscar.git
cd Oscar
pip install -e .
- Download pre-trained models and prepare your dataset: checkpoints and pre-extracted image features are linked from DOWNLOAD.md (and VinVL_DOWNLOAD.md for VinVL features); download them to a local directory rather than relying on a hub-style model name.
- Fine-tune the model on your task using the task-specific entry scripts in the repository (for example, oscar/run_captioning.py or oscar/run_vqa.py); MODEL_ZOO.md lists the exact commands and hyperparameters for each downstream task.
Competitor Comparisons
PyTorch code for BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
Pros of BLIP
- More versatile, supporting a wider range of vision-language tasks
- Better performance on image captioning and visual question answering
- More recent and actively maintained repository
Cons of BLIP
- Requires more computational resources due to its larger model size
- Does not use detected object tags as alignment anchors, which is central to Oscar's approach
- Potentially more complex to implement for specific use cases
Code Comparison
BLIP example:
from PIL import Image
import requests
from transformers import BlipProcessor, BlipForConditionalGeneration
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')
inputs = processor(raw_image, return_tensors="pt")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
Oscar example:
# Oscar-style VQA treats the question, object tags, and region features as one input;
# the class name below exists in oscar.modeling.modeling_bert, but the checkpoint path
# and input tensors are placeholders.
from oscar.modeling.modeling_bert import ImageBertForSequenceClassification
model = ImageBertForSequenceClassification.from_pretrained('path/to/vqa_checkpoint')
outputs = model(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids,
                img_feats=img_feats, labels=labels)
loss, logits = outputs[:2]  # Oscar returns tuples (older transformers API), not dicts
LAVIS - A One-stop Library for Language-Vision Intelligence
Pros of LAVIS
- More comprehensive and versatile, supporting a wider range of vision-language tasks
- Actively maintained with frequent updates and new model implementations
- Extensive documentation and examples for easier integration
Cons of LAVIS
- Higher computational requirements due to its broader scope
- Steeper learning curve for beginners due to its extensive feature set
Code Comparison
LAVIS:
from PIL import Image
from lavis.models import load_model_and_preprocess
model, vis_processors, txt_processors = load_model_and_preprocess(
    name="blip_caption", model_type="large_coco", is_eval=True)
raw_image = Image.open("path/to/image.jpg").convert("RGB")
image = vis_processors["eval"](raw_image).unsqueeze(0)
caption = model.generate({"image": image})  # returns a list of caption strings
Oscar:
from oscar.modeling.modeling_bert import BertForImageCaptioning
from oscar.utils.misc import load_from_yaml_file
config = load_from_yaml_file("path/to/config.yaml")  # task/runtime settings
model = BertForImageCaptioning.from_pretrained("path/to/checkpoint")
outputs = model(input_ids, img_feats=img_feats)  # img_feats: precomputed region features
🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
Pros of Transformers
- Broader scope, supporting a wide range of NLP tasks and models
- Larger community and more frequent updates
- Extensive documentation and tutorials
Cons of Transformers
- Can be overwhelming for beginners due to its extensive features
- May have higher computational requirements for some models
Code Comparison
Oscar:
# Class name as in oscar/modeling/modeling_bert.py; the checkpoint path is a placeholder
from oscar.modeling.modeling_bert import ImageBertForSequenceClassification
model = ImageBertForSequenceClassification.from_pretrained("path/to/oscar-base-checkpoint")
Transformers:
from transformers import BertForSequenceClassification
model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
Key Differences
- Oscar focuses on vision-language tasks, while Transformers covers a broader range of NLP tasks
- Oscar provides specialized models for multimodal learning, whereas Transformers offers a more general-purpose toolkit
- Transformers has a larger ecosystem of pre-trained models and tools
Use Cases
- Oscar: Ideal for tasks involving both visual and textual data, such as image captioning or visual question answering
- Transformers: Suitable for a wide range of NLP tasks, including text classification, named entity recognition, and machine translation (see the short example below)
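To make the Transformers side of this comparison concrete, here is a minimal text-classification example using the library's pipeline API; the default checkpoint is chosen and downloaded automatically on first use, and the input sentence is just an illustration:
from transformers import pipeline
# A default sentiment-analysis checkpoint is downloaded on first call
classifier = pipeline("sentiment-analysis")
print(classifier("Oscar makes multimodal fine-tuning straightforward."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99}]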
CLIP (Contrastive Language-Image Pretraining), Predict the most relevant text snippet given an image
Pros of CLIP
- More versatile for general image-text understanding tasks
- Trained on a larger and more diverse dataset
- Better zero-shot performance on various vision tasks
Cons of CLIP
- Less specialized for object-centric tasks
- May require more computational resources for inference
- Limited support for object detection and localization
Code Comparison
CLIP usage example:
import torch
import clip  # OpenAI's CLIP package is imported directly, not via `from clip import clip`
from PIL import Image
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
image = preprocess(Image.open("image.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["a dog", "a cat"]).to(device)
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
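CLIP's zero-shot classification, noted in the pros above, amounts to a softmax over these image-text similarities; continuing the script with the standard OpenAI CLIP forward call:
with torch.no_grad():
    # model(image, text) returns image-to-text and text-to-image similarity logits
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1)
print("Label probabilities:", probs.cpu().numpy())  # probabilities for "a dog" vs "a cat"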
Oscar usage example:
from oscar.modeling.modeling_bert import BertForImageCaptioning
# Placeholder checkpoint path; Oscar consumes token ids plus precomputed region features,
# and COCO caption training/evaluation is driven by oscar/run_captioning.py.
model = BertForImageCaptioning.from_pretrained("path/to/oscar-base-coco")
outputs = model(input_ids, img_feats=img_feats)
Note: The code examples are simplified and may require additional setup and imports to run properly.
Vision Transformer (google-research/vision_transformer)
Pros of Vision Transformer
- Focuses specifically on vision transformers, providing a more specialized implementation for image-related tasks
- Includes pre-trained models and evaluation scripts, making it easier to get started with vision transformers
- Actively maintained by Google Research, ensuring up-to-date implementations and best practices
Cons of Vision Transformer
- Limited to vision-related tasks, whereas Oscar supports multi-modal learning (vision and language)
- Less comprehensive documentation compared to Oscar, which may make it harder for newcomers to understand and use
Code Comparison
Oscar:
from oscar.modeling.modeling_bert import BertImgModel
# Initializes the text side from a BERT checkpoint; image-feature projection weights
# are added on top, and img_feats comes from an external object detector.
model = BertImgModel.from_pretrained('bert-base-uncased')
outputs = model(input_ids, img_feats=img_feats)
Vision Transformer:
# The google-research repo is JAX/Flax-based; an equivalent PyTorch port of the
# ViT-B/16 weights can be loaded through the timm library, as sketched here.
import timm
model = timm.create_model('vit_base_patch16_224', pretrained=True)
logits = model(images)  # images: (batch, 3, 224, 224) tensor
Summary
Vision Transformer is more specialized for vision tasks and offers pre-trained models, while Oscar provides a more versatile multi-modal approach. The choice between them depends on the specific requirements of your project and whether you need to integrate both vision and language processing.
README
Oscar: Object-Semantics Aligned Pre-training for Vision-and-Language Tasks
VinVL: Revisiting Visual Representations in Vision-Language Models
Updates
04/17/2023: Visual instruction tuning with GPT-4 is released! Please check out the multimodal model LLaVA: [Project Page] [Paper] [Demo] [Data] [Model]
05/28/2020: Released finetuned models on downstream tasks, please check MODEL_ZOO.md.
05/15/2020: Released pretrained models, datasets, and code for downstream tasks finetuning.
01/13/2021: Our new work VinVL proposed Oscar+, an improved version of Oscar, and provided a better object-attribute detection model to extract features for V+L tasks. The VinVL work achieved SOTA performance on all seven V+L tasks here. Please stay tuned for the model and code release.
03/08/2021: Oscar+ pretraining code released, please check the last section in VinVL_MODEL_ZOO.md. All image features and model checkpoints in VinVL are also released. Please check VinVL for details.
04/13/2021: Our Scene Graph Benchmark Repo has been released. Welcome to use the code there to extract image features with VinVL pretrained models.
Introduction
This repository contains the source code needed to reproduce the results presented in the paper Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks. We propose a new cross-modal pre-training method, Oscar (Object-Semantics Aligned Pre-training), which leverages object tags detected in images as anchor points to significantly ease the learning of image-text alignments. We pre-train Oscar on a public corpus of 6.5 million text-image pairs and fine-tune it on downstream tasks, setting new state of the art on six well-established vision-language understanding and generation tasks. For more on this project, see the Microsoft Research Blog post.
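To make the anchor-point idea concrete, here is a minimal sketch of how an Oscar-style input triple of word tokens, object tags, and region features fits together. It uses only PyTorch and the Hugging Face tokenizer; the tags, shapes, and feature dimension are illustrative stand-ins, not the exact values used in the released models.
import torch
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
caption = "a dog catches a frisbee in the park"
object_tags = "dog frisbee grass person"  # tags produced by an off-the-shelf object detector

# Text side: [CLS] caption tokens [SEP] object-tag tokens [SEP]
encoding = tokenizer(caption, object_tags, return_tensors='pt')

# Vision side: one feature vector per detected region (appearance features plus box geometry)
num_regions, feat_dim = 4, 2054
img_feats = torch.rand(1, num_regions, feat_dim)

# An Oscar-style model attends jointly over the token embeddings and img_feats,
# with the shared object tags acting as anchors between the two modalities.
print(encoding['input_ids'].shape, img_feats.shape)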
Performance
Task | t2i | t2i | i2t | i2t | IC | IC | IC | IC | NoCaps | NoCaps | VQA | NLVR2 | GQA |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Metric | R@1 | R@5 | R@1 | R@5 | B@4 | M | C | S | C | S | test-std | test-P | test-std |
SoTA_S | 39.2 | 68.0 | 56.6 | 84.5 | 38.9 | 29.2 | 129.8 | 22.4 | 61.5 | 9.2 | 70.92 | 58.80 | 63.17 |
SoTA_B | 54.0 | 80.8 | 70.0 | 91.1 | 40.5 | 29.7 | 137.6 | 22.8 | 86.58 | 12.38 | 73.67 | 79.30 | - |
SoTA_L | 57.5 | 82.8 | 73.5 | 92.2 | 41.7 | 30.6 | 140.0 | 24.5 | - | - | 74.93 | 81.47 | - |
----- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
Oscar_B | 54.0 | 80.8 | 70.0 | 91.1 | 40.5 | 29.7 | 137.6 | 22.8 | 78.8 | 11.7 | 73.44 | 78.36 | 61.62 |
Oscar_L | 57.5 | 82.8 | 73.5 | 92.2 | 41.7 | 30.6 | 140.0 | 24.5 | 80.9 | 11.3 | 73.82 | 80.05 | - |
----- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
VinVL_B | 58.1 | 83.2 | 74.6 | 92.6 | 40.9 | 30.9 | 140.6 | 25.1 | 92.46 | 13.07 | 76.12 | 83.08 | 64.65 |
VinVL_L | 58.8 | 83.5 | 75.4 | 92.9 | 41.0 | 31.1 | 140.9 | 25.2 | - | - | 76.62 | 83.98 | - |
gain | 1.3 | 0.7 | 1.9 | 0.6 | -0.7 | 0.5 | 0.9 | 0.7 | 5.9 | 0.7 | 1.69 | 2.51 | 1.48 |
t2i: text-to-image retrieval; i2t: image-to-text retrieval; IC: image captioning on COCO; B@4: BLEU@4; M: METEOR; C: CIDEr; S: SPICE.
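As a reminder of what the retrieval columns mean, R@K is the fraction of queries whose correct match appears among the top-K retrieved candidates. A minimal sketch, with a random similarity matrix standing in for real image-text scores:
import torch

def recall_at_k(similarity: torch.Tensor, k: int) -> float:
    # similarity[i, j]: score between query i and candidate j; the correct
    # candidate for query i is assumed to sit at index i.
    topk = similarity.topk(k, dim=1).indices                  # (num_queries, k)
    targets = torch.arange(similarity.size(0)).unsqueeze(1)   # (num_queries, 1)
    return (topk == targets).any(dim=1).float().mean().item()

sim = torch.rand(100, 100)  # placeholder image-text similarity scores
print(f"R@1={recall_at_k(sim, 1):.3f}  R@5={recall_at_k(sim, 5):.3f}")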
Download
We released pre-trained models, datasets, VinVL image features, and Oscar+ pretraining corpus for downstream tasks. Please check VinVL_DOWNLOAD.md for details.
To download checkpoints for the Vanilla OSCAR, please check DOWNLOAD.md for details.
Installation
Check INSTALL.md for installation instructions.
Model Zoo
Check MODEL_ZOO.md for scripts to run oscar downstream finetuning.
Check VinVL_MODEL_ZOO.md for scripts to run oscar+ pretraining and downstream finetuning.
Citations
Please consider citing this paper if you use the code:
@article{li2020oscar,
title={Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks},
author={Li, Xiujun and Yin, Xi and Li, Chunyuan and Hu, Xiaowei and Zhang, Pengchuan and Zhang, Lei and Wang, Lijuan and Hu, Houdong and Dong, Li and Wei, Furu and Choi, Yejin and Gao, Jianfeng},
journal={ECCV 2020},
year={2020}
}
@article{zhang2021vinvl,
title={VinVL: Making Visual Representations Matter in Vision-Language Models},
author={Zhang, Pengchuan and Li, Xiujun and Hu, Xiaowei and Yang, Jianwei and Zhang, Lei and Wang, Lijuan and Choi, Yejin and Gao, Jianfeng},
journal={CVPR 2021},
year={2021}
}
License
Oscar is released under the MIT license. See LICENSE for details.