Top Related Projects
PyTorch code for BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
LAVIS - A One-stop Library for Language-Vision Intelligence
🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
CLIP (Contrastive Language-Image Pretraining), Predict the most relevant text snippet given an image
Quick Overview
Oscar is an open-source research codebase from Microsoft for pre-training and fine-tuning large-scale vision-language models. It covers tasks such as image captioning, visual question answering, and image-text retrieval, and its core idea is to use object tags detected in images as anchor points for aligning image regions with text, bridging the gap between vision and language understanding.
Pros
- Supports a wide range of vision-language tasks
- Provides pre-trained models for quick implementation
- Offers flexibility in model architecture and training strategies
- Integrates well with popular deep learning frameworks like PyTorch
Cons
- Requires significant computational resources for training large models
- May have a steep learning curve for beginners in vision-language tasks
- Documentation could be more comprehensive for some advanced features
- Limited support for real-time inference in production environments
Code Examples
- Loading a pre-trained Oscar model:
from oscar.modeling.modeling_bert import BertForImageCaptioning
# Oscar checkpoints are distributed as local directories (see DOWNLOAD.md);
# the path below is a placeholder, not a hub-style model name.
model = BertForImageCaptioning.from_pretrained('path/to/oscar-base-coco')
- Preparing input for image captioning:
import torch
from transformers import BertTokenizer
from oscar.utils.misc import load_from_yaml_file
config = load_from_yaml_file('path/to/config.yaml')  # task/runtime settings
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# Oscar consumes region features from an external detector (e.g. Faster R-CNN / VinVL)
# rather than raw pixels; this random tensor stands in for such features.
img_feats = torch.rand(1, 50, 2054)  # (batch, num_regions, feature_dim) -- illustrative shape
# For captioning, the text side of the input is the detected object tags
input_ids = tokenizer.encode('dog frisbee grass', add_special_tokens=True)
inputs = {
    'input_ids': torch.tensor([input_ids]),
    'attention_mask': torch.tensor([[1] * len(input_ids)]),
    'token_type_ids': torch.tensor([[0] * len(input_ids)]),
    'img_feats': img_feats
}
- Generating a caption:
# Beam-search decoding; the exact generation interface depends on the checkpoint and config
outputs = model.generate(**inputs, max_length=20, num_beams=5)
caption = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Generated caption: {caption}")
Getting Started
To get started with Oscar:
- Install the required dependencies (see INSTALL.md for the authoritative, pinned setup):
pip install torch torchvision transformers
git clone https://github.com/microsoft/Oscar.git
cd Oscar
pip install -e .
- Download pre-trained models and prepare your dataset: checkpoints and pre-extracted image features are linked from DOWNLOAD.md (and VinVL_DOWNLOAD.md for VinVL features); download them to a local directory rather than relying on a hub-style model name.
- Fine-tune the model on your task using the task-specific entry scripts in the repository (for example, oscar/run_captioning.py or oscar/run_vqa.py); MODEL_ZOO.md lists the exact commands and hyperparameters for each downstream task.
Competitor Comparisons
PyTorch code for BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
Pros of BLIP
- More versatile, supporting a wider range of vision-language tasks
- Better performance on image captioning and visual question answering
- More recent and actively maintained repository
Cons of BLIP
- Requires more computational resources due to its larger model size
- Does not use detected object tags as alignment anchors, which is central to Oscar's approach
- Potentially more complex to implement for specific use cases
Code Comparison
BLIP example:
from PIL import Image
import requests
from transformers import BlipProcessor, BlipForConditionalGeneration
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')
inputs = processor(raw_image, return_tensors="pt")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
Oscar example:
# Oscar-style VQA treats the question, object tags, and region features as one input;
# the class name below exists in oscar.modeling.modeling_bert, but the checkpoint path
# and input tensors are placeholders.
from oscar.modeling.modeling_bert import ImageBertForSequenceClassification
model = ImageBertForSequenceClassification.from_pretrained('path/to/vqa_checkpoint')
outputs = model(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids,
                img_feats=img_feats, labels=labels)
loss, logits = outputs[:2]  # Oscar returns tuples (older transformers API), not dicts
LAVIS - A One-stop Library for Language-Vision Intelligence
Pros of LAVIS
- More comprehensive and versatile, supporting a wider range of vision-language tasks
- Actively maintained with frequent updates and new model implementations
- Extensive documentation and examples for easier integration
Cons of LAVIS
- Higher computational requirements due to its broader scope
- Steeper learning curve for beginners due to its extensive feature set
Code Comparison
LAVIS:
from PIL import Image
from lavis.models import load_model_and_preprocess
model, vis_processors, txt_processors = load_model_and_preprocess(
    name="blip_caption", model_type="large_coco", is_eval=True)
raw_image = Image.open("path/to/image.jpg").convert("RGB")
image = vis_processors["eval"](raw_image).unsqueeze(0)
caption = model.generate({"image": image})  # returns a list of caption strings
Oscar:
from oscar.modeling.modeling_bert import BertForImageCaptioning
from oscar.utils.misc import load_from_yaml_file
config = load_from_yaml_file("path/to/config.yaml")  # task/runtime settings
model = BertForImageCaptioning.from_pretrained("path/to/checkpoint")
outputs = model(input_ids, img_feats=img_feats)  # img_feats: precomputed region features
🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
Pros of Transformers
- Broader scope, supporting a wide range of NLP tasks and models
- Larger community and more frequent updates
- Extensive documentation and tutorials
Cons of Transformers
- Can be overwhelming for beginners due to its extensive features
- May have higher computational requirements for some models
Code Comparison
Oscar:
# Class name as in oscar/modeling/modeling_bert.py; the checkpoint path is a placeholder
from oscar.modeling.modeling_bert import ImageBertForSequenceClassification
model = ImageBertForSequenceClassification.from_pretrained("path/to/oscar-base-checkpoint")
Transformers:
from transformers import BertForSequenceClassification
model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
Key Differences
- Oscar focuses on vision-language tasks, while Transformers covers a broader range of NLP tasks
- Oscar provides specialized models for multimodal learning, whereas Transformers offers a more general-purpose toolkit
- Transformers has a larger ecosystem of pre-trained models and tools
Use Cases
- Oscar: Ideal for tasks involving both visual and textual data, such as image captioning or visual question answering
- Transformers: Suitable for a wide range of NLP tasks, including text classification, named entity recognition, and machine translation (see the short example below)
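To make the Transformers side of this comparison concrete, here is a minimal text-classification example using the library's pipeline API; the default checkpoint is chosen and downloaded automatically on first use, and the input sentence is just an illustration:
from transformers import pipeline
# A default sentiment-analysis checkpoint is downloaded on first call
classifier = pipeline("sentiment-analysis")
print(classifier("Oscar makes multimodal fine-tuning straightforward."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99}]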
CLIP (Contrastive Language-Image Pretraining), Predict the most relevant text snippet given an image
Pros of CLIP
- More versatile for general image-text understanding tasks
- Trained on a larger and more diverse dataset
- Better zero-shot performance on various vision tasks
Cons of CLIP
- Less specialized for object-centric tasks
- May require more computational resources for inference
- Limited support for object detection and localization
Code Comparison
CLIP usage example:
import torch
import clip  # OpenAI's CLIP package is imported directly, not via `from clip import clip`
from PIL import Image
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
image = preprocess(Image.open("image.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["a dog", "a cat"]).to(device)
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
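CLIP's zero-shot classification, noted in the pros above, amounts to a softmax over these image-text similarities; continuing the script with the standard OpenAI CLIP forward call:
with torch.no_grad():
    # model(image, text) returns image-to-text and text-to-image similarity logits
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1)
print("Label probabilities:", probs.cpu().numpy())  # probabilities for "a dog" vs "a cat"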
Oscar usage example:
from oscar.modeling.modeling_bert import BertForImageCaptioning
# Placeholder checkpoint path; Oscar consumes token ids plus precomputed region features,
# and COCO caption training/evaluation is driven by oscar/run_captioning.py.
model = BertForImageCaptioning.from_pretrained("path/to/oscar-base-coco")
outputs = model(input_ids, img_feats=img_feats)
Note: The code examples are simplified and may require additional setup and imports to run properly.
Vision Transformer (google-research/vision_transformer)
Pros of Vision Transformer
- Focuses specifically on vision transformers, providing a more specialized implementation for image-related tasks
- Includes pre-trained models and evaluation scripts, making it easier to get started with vision transformers
- Actively maintained by Google Research, ensuring up-to-date implementations and best practices
Cons of Vision Transformer
- Limited to vision-related tasks, whereas Oscar supports multi-modal learning (vision and language)
- Less comprehensive documentation compared to Oscar, which may make it harder for newcomers to understand and use
Code Comparison
Oscar:
from oscar.modeling.modeling_bert import BertImgModel
# Initializes the text side from a BERT checkpoint; image-feature projection weights
# are added on top, and img_feats comes from an external object detector.
model = BertImgModel.from_pretrained('bert-base-uncased')
outputs = model(input_ids, img_feats=img_feats)
Vision Transformer:
# The google-research repo is JAX/Flax-based; an equivalent PyTorch port of the
# ViT-B/16 weights can be loaded through the timm library, as sketched here.
import timm
model = timm.create_model('vit_base_patch16_224', pretrained=True)
logits = model(images)  # images: (batch, 3, 224, 224) tensor
Summary
Vision Transformer is more specialized for vision tasks and offers pre-trained models, while Oscar provides a more versatile multi-modal approach. The choice between them depends on the specific requirements of your project and whether you need to integrate both vision and language processing.
README
Oscar: Object-Semantics Aligned Pre-training for Vision-and-Language Tasks
VinVL: Revisiting Visual Representations in Vision-Language Models
Updates
04/17/2023: Visual instruction tuning with GPT-4 is released! Please check out the multimodal model LLaVA: [Project Page] [Paper] [Demo] [Data] [Model]
05/28/2020: Released finetuned models on downstream tasks, please check MODEL_ZOO.md.
05/15/2020: Released pretrained models, datasets, and code for downstream tasks finetuning.
01/13/2021: Our new work VinVL proposed Oscar+, an improved version of Oscar, and provided a better object-attribute detection model to extract features for V+L tasks. The VinVL work achieved SOTA performance on all seven V+L tasks here. Please stay tuned for the model and code release.
03/08/2021: Oscar+ pretraining code released, please check the last section in VinVL_MODEL_ZOO.md. All image features and model checkpoints in VinVL are also released. Please check VinVL for details.
04/13/2021: Our Scene Graph Benchmark Repo has been released. Welcome to use the code there to extract image features with VinVL pretrained models.
Introduction
This repository contains the source code needed to reproduce the results presented in the paper Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks. We propose a new cross-modal pre-training method, Oscar (Object-Semantics Aligned Pre-training), which leverages object tags detected in images as anchor points to significantly ease the learning of image-text alignments. We pre-train Oscar on a public corpus of 6.5 million text-image pairs and fine-tune it on downstream tasks, setting new state of the art on six well-established vision-language understanding and generation tasks. For more on this project, see the Microsoft Research Blog post.
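To make the anchor-point idea concrete, here is a minimal sketch of how an Oscar-style input triple of word tokens, object tags, and region features fits together. It uses only PyTorch and the Hugging Face tokenizer; the tags, shapes, and feature dimension are illustrative stand-ins, not the exact values used in the released models.
import torch
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
caption = "a dog catches a frisbee in the park"
object_tags = "dog frisbee grass person"  # tags produced by an off-the-shelf object detector

# Text side: [CLS] caption tokens [SEP] object-tag tokens [SEP]
encoding = tokenizer(caption, object_tags, return_tensors='pt')

# Vision side: one feature vector per detected region (appearance features plus box geometry)
num_regions, feat_dim = 4, 2054
img_feats = torch.rand(1, num_regions, feat_dim)

# An Oscar-style model attends jointly over the token embeddings and img_feats,
# with the shared object tags acting as anchors between the two modalities.
print(encoding['input_ids'].shape, img_feats.shape)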
Performance
Task | t2i | t2i | i2t | i2t | IC | IC | IC | IC | NoCaps | NoCaps | VQA | NLVR2 | GQA |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Metric | R@1 | R@5 | R@1 | R@5 | B@4 | M | C | S | C | S | test-std | test-P | test-std |
SoTA_S | 39.2 | 68.0 | 56.6 | 84.5 | 38.9 | 29.2 | 129.8 | 22.4 | 61.5 | 9.2 | 70.92 | 58.80 | 63.17 |
SoTA_B | 54.0 | 80.8 | 70.0 | 91.1 | 40.5 | 29.7 | 137.6 | 22.8 | 86.58 | 12.38 | 73.67 | 79.30 | - |
SoTA_L | 57.5 | 82.8 | 73.5 | 92.2 | 41.7 | 30.6 | 140.0 | 24.5 | - | - | 74.93 | 81.47 | - |
----- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
Oscar_B | 54.0 | 80.8 | 70.0 | 91.1 | 40.5 | 29.7 | 137.6 | 22.8 | 78.8 | 11.7 | 73.44 | 78.36 | 61.62 |
Oscar_L | 57.5 | 82.8 | 73.5 | 92.2 | 41.7 | 30.6 | 140.0 | 24.5 | 80.9 | 11.3 | 73.82 | 80.05 | - |
----- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
VinVL_B | 58.1 | 83.2 | 74.6 | 92.6 | 40.9 | 30.9 | 140.6 | 25.1 | 92.46 | 13.07 | 76.12 | 83.08 | 64.65 |
VinVL_L | 58.8 | 83.5 | 75.4 | 92.9 | 41.0 | 31.1 | 140.9 | 25.2 | - | - | 76.62 | 83.98 | - |
gain | 1.3 | 0.7 | 1.9 | 0.6 | -0.7 | 0.5 | 0.9 | 0.7 | 5.9 | 0.7 | 1.69 | 2.51 | 1.48 |
t2i: text-to-image retrieval; i2t: image-to-text retrieval; IC: image captioning on COCO; B@4: BLEU@4; M: METEOR; C: CIDEr; S: SPICE.
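As a reminder of what the retrieval columns mean, R@K is the fraction of queries whose correct match appears among the top-K retrieved candidates. A minimal sketch, with a random similarity matrix standing in for real image-text scores:
import torch

def recall_at_k(similarity: torch.Tensor, k: int) -> float:
    # similarity[i, j]: score between query i and candidate j; the correct
    # candidate for query i is assumed to sit at index i.
    topk = similarity.topk(k, dim=1).indices                  # (num_queries, k)
    targets = torch.arange(similarity.size(0)).unsqueeze(1)   # (num_queries, 1)
    return (topk == targets).any(dim=1).float().mean().item()

sim = torch.rand(100, 100)  # placeholder image-text similarity scores
print(f"R@1={recall_at_k(sim, 1):.3f}  R@5={recall_at_k(sim, 5):.3f}")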
Download
We released pre-trained models, datasets, VinVL image features, and Oscar+ pretraining corpus for downstream tasks. Please check VinVL_DOWNLOAD.md for details.
To download checkpoints for the Vanilla OSCAR, please check DOWNLOAD.md for details.
Installation
Check INSTALL.md for installation instructions.
Model Zoo
Check MODEL_ZOO.md for scripts to run oscar downstream finetuning.
Check VinVL_MODEL_ZOO.md for scripts to run oscar+ pretraining and downstream finetuning.
Citations
Please consider citing this paper if you use the code:
@article{li2020oscar,
title={Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks},
author={Li, Xiujun and Yin, Xi and Li, Chunyuan and Hu, Xiaowei and Zhang, Pengchuan and Zhang, Lei and Wang, Lijuan and Hu, Houdong and Dong, Li and Wei, Furu and Choi, Yejin and Gao, Jianfeng},
journal={ECCV 2020},
year={2020}
}
@article{zhang2021vinvl,
title={VinVL: Making Visual Representations Matter in Vision-Language Models},
author={Zhang, Pengchuan and Li, Xiujun and Hu, Xiaowei and Yang, Jianwei and Zhang, Lei and Wang, Lijuan and Choi, Yejin and Gao, Jianfeng},
journal={CVPR 2021},
year={2021}
}
License
Oscar is released under the MIT license. See LICENSE for details.