microsoft / Oscar

Oscar and VinVL

Top Related Projects

4,727

PyTorch code for BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

10,062

LAVIS - A One-stop Library for Language-Vision Intelligence

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.

26,479

CLIP (Contrastive Language-Image Pretraining), Predict the most relevant text snippet given an image

Quick Overview

Oscar (Object-Semantics Aligned Pre-training) is Microsoft's open-source research code for vision-language pre-training and fine-tuning. It uses object tags detected in images as anchor points for aligning image content with text, and targets tasks such as image captioning, visual question answering, and image-text retrieval. The same repository hosts VinVL (Oscar+), which pairs the approach with an improved object-attribute detection model.

Pros

  • Supports a wide range of vision-language tasks
  • Provides pre-trained models for quick implementation
  • Offers flexibility in model architecture and training strategies
  • Integrates well with popular deep learning frameworks like PyTorch

Cons

  • Requires significant computational resources for training large models
  • May have a steep learning curve for beginners in vision-language tasks
  • Documentation could be more comprehensive for some advanced features
  • Limited support for real-time inference in production environments

Code Examples

  1. Loading a pre-trained Oscar model:
from oscar.modeling.modeling_bert import BertForImageCaptioning

# 'oscar-base-coco' is a placeholder name; point this at a checkpoint
# directory downloaded as described in DOWNLOAD.md.
model = BertForImageCaptioning.from_pretrained('oscar-base-coco')
  2. Preparing input for image captioning:
import torch

from oscar.utils.misc import load_from_yaml_file
from oscar.utils.task_utils import load_and_process_image

config = load_from_yaml_file('path/to/config.yaml')
# Assumed to return a (num_regions, feature_dim) tensor of region features.
image = load_and_process_image('path/to/image.jpg', config)
# `tokenizer` is assumed to be the BERT tokenizer that matches the checkpoint.
input_ids = tokenizer.encode('What is in this image?', add_special_tokens=True)

inputs = {
    'input_ids': torch.tensor([input_ids]),
    'attention_mask': torch.tensor([[1] * len(input_ids)]),
    'token_type_ids': torch.tensor([[0] * len(input_ids)]),
    'img_feats': image.unsqueeze(0)  # add a batch dimension
}
  3. Generating a caption:
# Beam search with 5 beams, capped at 20 tokens.
outputs = model.generate(**inputs, max_length=20, num_beams=5)
caption = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Generated caption: {caption}")

Getting Started

To get started with Oscar:

  1. Install the required dependencies (INSTALL.md in the repository has the full instructions):
pip install torch torchvision transformers
git clone https://github.com/microsoft/Oscar.git
cd Oscar
pip install -e .
  2. Download pre-trained models and prepare your dataset (DOWNLOAD.md and VinVL_DOWNLOAD.md list the released checkpoints and datasets):
from oscar.utils.misc import download_pretrained

download_pretrained('oscar-base-coco')
  3. Fine-tune the model on your task (MODEL_ZOO.md and VinVL_MODEL_ZOO.md contain the fine-tuning scripts):
from oscar.run_task import main

main(task_name='image_captioning', config_file='configs/captioning_coco.yaml')

Competitor Comparisons

4,727

PyTorch code for BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

Pros of BLIP

  • More versatile, supporting a wider range of vision-language tasks
  • Better performance on image captioning and visual question answering
  • More recent and actively maintained repository

Cons of BLIP

  • Requires more computational resources due to its larger model size
  • Does not use detected object tags as alignment anchors, which is the core of Oscar's approach
  • Potentially more complex to implement for specific use cases

Code Comparison

BLIP example:

from PIL import Image
import requests
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg' 
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

inputs = processor(raw_image, return_tensors="pt")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))

Oscar example:

from oscar.modeling.modeling_bert import OscarForObjectDetection
from oscar.utils.task_utils import load_od_labels

model = OscarForObjectDetection.from_pretrained('oscar-base-finetuned-vqa')
labels = load_od_labels('vqa')

outputs = model(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids, 
                labels=labels, return_dict=True)
loss, logits = outputs.loss, outputs.logits

10,062

LAVIS - A One-stop Library for Language-Vision Intelligence

Pros of LAVIS

  • More comprehensive and versatile, supporting a wider range of vision-language tasks
  • Actively maintained with frequent updates and new model implementations
  • Extensive documentation and examples for easier integration

Cons of LAVIS

  • Higher computational requirements due to its broader scope
  • Steeper learning curve for beginners due to its extensive feature set

Code Comparison

LAVIS:

from PIL import Image
from lavis.models import load_model_and_preprocess

# Load the BLIP captioning model in evaluation mode with its preprocessors.
model, vis_processors, txt_processors = load_model_and_preprocess(
    "blip_caption", "large_coco", is_eval=True
)
raw_image = Image.open("path/to/image.jpg").convert("RGB")
image = vis_processors["eval"](raw_image).unsqueeze(0)  # add a batch dimension
caption = model.generate({"image": image})

Oscar:

from oscar.modeling.modeling_oscar import OscarForImageCaptioning
from oscar.utils.misc import load_from_yaml_file

config = load_from_yaml_file("path/to/config.yaml")
model = OscarForImageCaptioning.from_pretrained("path/to/checkpoint", config=config)
outputs = model(input_ids, img_feats=img_feats)

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.

Pros of Transformers

  • Broader scope, supporting a wide range of NLP tasks and models
  • Larger community and more frequent updates
  • Extensive documentation and tutorials

Cons of Transformers

  • Can be overwhelming for beginners due to its extensive features
  • May have higher computational requirements for some models

Code Comparison

Oscar:

from oscar.modeling.modeling_bert import OscarForSequenceClassification
model = OscarForSequenceClassification.from_pretrained("oscar-base")

Transformers:

from transformers import BertForSequenceClassification
model = BertForSequenceClassification.from_pretrained("bert-base-uncased")

Key Differences

  • Oscar focuses on vision-language tasks, while Transformers covers a broader range of NLP tasks
  • Oscar provides specialized models for multimodal learning, whereas Transformers offers a more general-purpose toolkit
  • Transformers has a larger ecosystem of pre-trained models and tools

Use Cases

  • Oscar: Ideal for tasks involving both visual and textual data, such as image captioning or visual question answering
  • Transformers: Suitable for a wide range of NLP tasks, including text classification, named entity recognition, and machine translation

26,479

CLIP (Contrastive Language-Image Pretraining), Predict the most relevant text snippet given an image

Pros of CLIP

  • More versatile for general image-text understanding tasks
  • Trained on a larger and more diverse dataset
  • Better zero-shot performance on various vision tasks

Cons of CLIP

  • Less specialized for object-centric tasks
  • May require more computational resources for inference
  • Limited support for object detection and localization

Code Comparison

CLIP usage example:

import torch
import clip
from PIL import Image

# Fall back to CPU when no GPU is available.
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
image = preprocess(Image.open("image.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["a dog", "a cat"]).to(device)

# Encode both modalities without tracking gradients.
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

Oscar usage example:

from oscar.modeling.modeling_bert import BertForImageCaptioning
from oscar.utils.caption_eval import evaluate_on_coco_caption

model = BertForImageCaptioning.from_pretrained("oscar-base-coco")
results = evaluate_on_coco_caption(model, args)

Note: The code examples are simplified and may require additional setup and imports to run properly.

Pros of Vision Transformer

  • Focuses specifically on vision transformers, providing a more specialized implementation for image-related tasks
  • Includes pre-trained models and evaluation scripts, making it easier to get started with vision transformers
  • Actively maintained by Google Research, ensuring up-to-date implementations and best practices

Cons of Vision Transformer

  • Limited to vision-related tasks, whereas Oscar supports multi-modal learning (vision and language)
  • Less comprehensive documentation compared to Oscar, which may make it harder for newcomers to understand and use

Code Comparison

Oscar:

from oscar.modeling.modeling_bert import BertImgModel

model = BertImgModel.from_pretrained('bert-base-uncased')
outputs = model(input_ids, img_feats=img_feats)

Vision Transformer:

import vision_transformer as vit

model = vit.vit_b16(pretrained=True)
logits = model(images)

Summary

Vision Transformer is more specialized for vision tasks and offers pre-trained models, while Oscar provides a more versatile multi-modal approach. The choice between them depends on the specific requirements of your project and whether you need to integrate both vision and language processing.

README

Oscar: Object-Semantics Aligned Pre-training for Vision-and-Language Tasks

VinVL: Revisiting Visual Representations in Vision-Language Models

Updates

04/17/2023: Visual instruction tuning with GPT-4 is released! Please check out the multimodal model LLaVA: [Project Page] [Paper] [Demo] [Data] [Model]

05/28/2020: Released finetuned models on downstream tasks, please check MODEL_ZOO.md.
05/15/2020: Released pretrained models, datasets, and code for downstream tasks finetuning.
01/13/2021: Our new work, VinVL, proposes OSCAR+, an improved version of OSCAR, and provides a better object-attribute detection model to extract features for V+L tasks. VinVL achieves SOTA performance on all seven V+L tasks here. Please stay tuned for the model and code release.
03/08/2021: Oscar+ pretraining code released, please check the last section in VinVL_MODEL_ZOO.md. All image features and model checkpoints in VinVL are also released. Please check VinVL for details.
04/13/2021: Our Scene Graph Benchmark Repo has been released. Welcome to use the code there to extract image features with VinVL pretrained models.

Introduction

This repository contains the source code necessary to reproduce the results presented in the paper Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks. We propose a new cross-modal pre-training method, Oscar (Object-Semantics Aligned Pre-training). It leverages object tags detected in images as anchor points to significantly ease the learning of image-text alignments. We pre-train Oscar on a public corpus of 6.5 million text-image pairs and fine-tune it on downstream tasks, setting a new state of the art on six well-established vision-language understanding and generation tasks. For more on this project, see the Microsoft Research Blog post.
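The role of object tags as anchor points can be pictured with a small sketch. The layout below (caption tokens, then object tags, then image regions) follows the paper's description of the word-tag-region input; the `build_input` function, the whitespace tokenizer, and the length limits are illustrative assumptions, not this repository's API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class OscarStyleInput:
    tokens: List[str]          # [CLS] caption [SEP] object tags [SEP], then padding
    segment_ids: List[int]     # 0 for caption tokens, 1 for tag tokens
    attention_mask: List[int]  # text positions followed by region positions
    num_regions: int           # number of real (non-padded) image regions


def build_input(caption: str, object_tags: List[str], num_regions: int,
                max_text_len: int = 16, max_regions: int = 4) -> OscarStyleInput:
    """Hypothetical helper: lay out caption, tags, and regions as one sequence."""
    # Whitespace splitting stands in for a real WordPiece tokenizer.
    caption_tokens = caption.lower().split()
    tag_tokens = [tag.lower() for tag in object_tags]

    tokens = ["[CLS]"] + caption_tokens + ["[SEP]"]
    segment_ids = [0] * len(tokens)
    tokens += tag_tokens + ["[SEP]"]
    segment_ids += [1] * (len(tag_tokens) + 1)

    # Simplification: plain truncation, which may drop the final [SEP].
    tokens = tokens[:max_text_len]
    segment_ids = segment_ids[:max_text_len]

    # Pad the text part and mark padding with 0 in the mask.
    pad = max_text_len - len(tokens)
    text_mask = [1] * len(tokens) + [0] * pad
    tokens = tokens + ["[PAD]"] * pad
    segment_ids = segment_ids + [0] * pad

    # Region features are appended after the text, so the mask is extended.
    kept = min(num_regions, max_regions)
    region_mask = [1] * kept + [0] * (max_regions - kept)

    return OscarStyleInput(tokens, segment_ids, text_mask + region_mask, kept)


example = build_input("a dog sits on a couch", ["dog", "couch"], num_regions=3)
print(example.tokens)
print(example.attention_mask)
```

The tags "dog" and "couch" appear both as words in the text sequence and as detected regions in the image, which is what lets them act as anchors between the two modalities.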

Performance

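| Task    | t2i  | t2i  | i2t  | i2t  | IC   | IC   | IC    | IC   | NoCaps | NoCaps | VQA      | NLVR2  | GQA      |
|---------|------|------|------|------|------|------|-------|------|--------|--------|----------|--------|----------|
| Metric  | R@1  | R@5  | R@1  | R@5  | B@4  | M    | C     | S    | C      | S      | test-std | test-P | test-std |
| SoTA_S  | 39.2 | 68.0 | 56.6 | 84.5 | 38.9 | 29.2 | 129.8 | 22.4 | 61.5   | 9.2    | 70.92    | 58.80  | 63.17    |
| SoTA_B  | 54.0 | 80.8 | 70.0 | 91.1 | 40.5 | 29.7 | 137.6 | 22.8 | 86.58  | 12.38  | 73.67    | 79.30  | -        |
| SoTA_L  | 57.5 | 82.8 | 73.5 | 92.2 | 41.7 | 30.6 | 140.0 | 24.5 | -      | -      | 74.93    | 81.47  | -        |
| -----   | ---  | ---  | ---  | ---  | ---  | ---  | ---   | ---  | ---    | ---    | ---      | ---    | ---      |
| Oscar_B | 54.0 | 80.8 | 70.0 | 91.1 | 40.5 | 29.7 | 137.6 | 22.8 | 78.8   | 11.7   | 73.44    | 78.36  | 61.62    |
| Oscar_L | 57.5 | 82.8 | 73.5 | 92.2 | 41.7 | 30.6 | 140.0 | 24.5 | 80.9   | 11.3   | 73.82    | 80.05  | -        |
| -----   | ---  | ---  | ---  | ---  | ---  | ---  | ---   | ---  | ---    | ---    | ---      | ---    | ---      |
| VinVL_B | 58.1 | 83.2 | 74.6 | 92.6 | 40.9 | 30.9 | 140.6 | 25.1 | 92.46  | 13.07  | 76.12    | 83.08  | 64.65    |
| VinVL_L | 58.8 | 83.5 | 75.4 | 92.9 | 41.0 | 31.1 | 140.9 | 25.2 | -      | -      | 76.62    | 83.98  | -        |
| gain    | 1.3  | 0.7  | 1.9  | 0.6  | -0.7 | 0.5  | 0.9   | 0.7  | 5.9    | 0.7    | 1.69     | 2.51   | 1.48     |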

t2i: text-to-image retrieval; i2t: image-to-text retrieval; IC: image captioning on COCO. R@K: recall at K; B@4: BLEU-4; M: METEOR; C: CIDEr; S: SPICE.
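For readers unfamiliar with the retrieval columns, R@K is the percentage of queries whose correct match is ranked within the top K candidates. A minimal sketch follows, assuming one ground-truth candidate per query (COCO evaluation has several captions per image, which this does not model). The `recall_at_k` helper and the toy similarity scores are hypothetical and not part of this repository.

```python
from typing import Sequence


def recall_at_k(scores: Sequence[Sequence[float]],
                gt_index: Sequence[int], k: int) -> float:
    """Percentage of queries whose ground-truth candidate is in the top k.

    scores[q][c] is the similarity between query q and candidate c;
    gt_index[q] is the index of the correct candidate for query q.
    """
    hits = 0
    for row, gt in zip(scores, gt_index):
        # Rank candidate indices from most to least similar.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if gt in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(scores)


# Toy example: 3 queries scored against 3 candidates.
toy_scores = [
    [0.9, 0.1, 0.3],  # correct candidate 0 is ranked first
    [0.2, 0.4, 0.8],  # correct candidate 1 is ranked second
    [0.5, 0.6, 0.1],  # correct candidate 2 is ranked third
]
toy_gt = [0, 1, 2]

for k in (1, 2, 3):
    print(f"R@{k} = {recall_at_k(toy_scores, toy_gt, k):.1f}")
```

In the table, t2i uses a caption as the query and images as candidates, while i2t does the reverse.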

Download

We released pre-trained models, datasets, VinVL image features, and Oscar+ pretraining corpus for downstream tasks. Please check VinVL_DOWNLOAD.md for details.

To download checkpoints for the Vanilla OSCAR, please check DOWNLOAD.md for details.

Installation

Check INSTALL.md for installation instructions.

Model Zoo

Check MODEL_ZOO.md for scripts to run Oscar downstream finetuning.

Check VinVL_MODEL_ZOO.md for scripts to run Oscar+ pretraining and downstream finetuning.

Citations

Please consider citing this paper if you use the code:

@article{li2020oscar,
  title={Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks},
  author={Li, Xiujun and Yin, Xi and Li, Chunyuan and Hu, Xiaowei and Zhang, Pengchuan and Zhang, Lei and Wang, Lijuan and Hu, Houdong and Dong, Li and Wei, Furu and Choi, Yejin and Gao, Jianfeng},
  journal={ECCV 2020},
  year={2020}
}

@article{zhang2021vinvl,
  title={VinVL: Making Visual Representations Matter in Vision-Language Models},
  author={Zhang, Pengchuan and Li, Xiujun and Hu, Xiaowei and Yang, Jianwei and Zhang, Lei and Wang, Lijuan and Choi, Yejin and Gao, Jianfeng},
  journal={CVPR 2021},
  year={2021}
}

License

Oscar is released under the MIT license. See LICENSE for details.