mmf
A modular framework for vision & language multimodal research from Facebook AI Research (FAIR)
Top Related Projects
🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
An open-source NLP research library, built on PyTorch.
Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
CLIP (Contrastive Language-Image Pretraining), Predict the most relevant text snippet given an image
Quick Overview
MMF (Multimodal Framework) is an open-source modular framework for vision and language multimodal research. Developed by Facebook AI Research, it provides a platform for training and evaluating AI models on various multimodal tasks, including visual question answering, image captioning, and visual reasoning.
Pros
- Highly modular and extensible architecture
- Supports a wide range of multimodal tasks and datasets
- Includes pre-trained models and easy-to-use evaluation tools
- Active development and community support
Cons
- Steep learning curve for beginners
- Documentation can be overwhelming due to the framework's complexity
- Requires significant computational resources for training large models
- Some features may be experimental or not fully stable
Code Examples
- Loading a pre-trained model and making predictions:
from mmf.models.mmbt import MMBT
# Load MMBT pre-trained on the Hateful Memes task from the MMF model zoo
model = MMBT.from_pretrained("mmbt.hateful_memes.images")
# classify() takes an image path or URL plus the accompanying text and
# returns a dict with the predicted "label" and its "confidence"
output = model.classify("path/to/image.jpg", "This is a test sentence")
print(output)
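Internally, MMF models consume batched Sample/SampleList objects rather than raw file paths; the classify() convenience call above builds these for you. A minimal sketch of how Sample and SampleList behave (the field name and tensor shapes below are purely illustrative):
import torch
from mmf.common.sample import Sample, SampleList
# Each Sample is a dict-like container of per-example fields
s1 = Sample()
s1.input_ids = torch.zeros(8, dtype=torch.long)
s2 = Sample()
s2.input_ids = torch.ones(8, dtype=torch.long)
# SampleList collates the samples, stacking tensor fields along a batch dimension
batch = SampleList([s1, s2])
print(batch.input_ids.shape)  # torch.Size([2, 8])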
- Creating a custom dataset builder (MyCustomDataset below is assumed to be a dataset class defined in your own project):
from mmf.common.registry import registry
from mmf.datasets.builders.vqa2 import VQA2Builder
# MyCustomDataset is assumed to be your dataset class, defined elsewhere in your project
from my_project.dataset import MyCustomDataset
@registry.register_builder("my_custom_dataset")
class MyCustomDatasetBuilder(VQA2Builder):
    def __init__(self):
        super().__init__()
        self.dataset_name = "my_custom_dataset"
        self.dataset_class = MyCustomDataset
    def build(self, config, dataset_type):
        # Custom download/build logic for the dataset goes here
        pass
- Training a model programmatically (this roughly mirrors what the mmf_run CLI shown below does internally):
from mmf.utils.build import build_config, build_trainer
from mmf.utils.configuration import Configuration
from mmf.utils.flags import flags
# Parse MMF's standard command-line overrides (config=..., model=..., dataset=...)
configuration = Configuration(flags.get_parser().parse_args())
config = build_config(configuration)
# Build the trainer, load datasets/model/optimizer, and run the training loop
trainer = build_trainer(config)
trainer.load()
trainer.train()
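The registry pattern used for the custom dataset above also applies to models. Below is a minimal, illustrative sketch of registering a custom model (the class name, config path, feature field, and layer sizes are assumptions for illustration, not part of MMF):
import torch
from mmf.common.registry import registry
from mmf.models.base_model import BaseModel
@registry.register_model("my_simple_model")
class MySimpleModel(BaseModel):
    def __init__(self, config):
        super().__init__(config)
    @classmethod
    def config_path(cls):
        # Default config shipped with your project (illustrative path)
        return "configs/models/my_simple_model/defaults.yaml"
    def build(self):
        # Build sub-modules here; sizes are illustrative
        self.classifier = torch.nn.Linear(2048, 2)
    def forward(self, sample_list):
        # MMF models return a dict; "scores" is consumed by the losses and metrics
        pooled = sample_list["image_feature_0"].mean(dim=1)
        return {"scores": self.classifier(pooled)}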
Getting Started
To get started with MMF:
- Install MMF:
pip install --upgrade pip
pip install mmf
- Download data: most datasets and features in MMF's zoo are downloaded and cached automatically the first time you run a job that needs them, so a separate download step is usually not required.
- Train a model:
mmf_run config=projects/visual_bert/configs/vqa2/defaults.yaml \
model=visual_bert dataset=vqa2 run_type=train_val
- Evaluate a pre-trained model:
mmf_run config=projects/visual_bert/configs/vqa2/defaults.yaml \
model=visual_bert dataset=vqa2 run_type=val \
checkpoint.resume_file=<path_to_pretrained_model>
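To generate a predictions file (for example, for a challenge submission), MMF also provides an mmf_predict command that accepts the same style of configuration overrides. A sketch reusing the VisualBERT/VQA2 settings from the steps above, with the same checkpoint placeholder:
mmf_predict config=projects/visual_bert/configs/vqa2/defaults.yaml \
model=visual_bert dataset=vqa2 run_type=test \
checkpoint.resume_file=<path_to_pretrained_model>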
Competitor Comparisons
🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
Pros of Transformers
- Broader scope, covering a wide range of NLP tasks and models
- Larger community and more frequent updates
- Extensive documentation and tutorials
Cons of Transformers
- Steeper learning curve for beginners
- Less focus on multimodal tasks compared to MMF
Code Comparison
MMF:
from mmf.models.mmbt import MMBT
model = MMBT.from_pretrained("mmbt.hateful_memes.images")
output = model.classify(image, text)  # image: path or URL to an image, text: str
Transformers:
from transformers import BertTokenizer, BertForSequenceClassification
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
inputs = tokenizer("Hello, world!", return_tensors="pt")
outputs = model(**inputs)
Summary
Transformers offers a more comprehensive NLP toolkit with broader community support, while MMF specializes in multimodal tasks. Transformers may be more challenging for beginners but provides extensive resources. MMF offers a more streamlined approach for specific multimodal applications. The code examples demonstrate the different focus areas, with MMF handling image-text inputs and Transformers processing text-only data.
Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
Pros of UniLM
- Broader scope, covering multiple NLP tasks beyond multimodal
- More active development with frequent updates
- Larger community and more extensive documentation
Cons of UniLM
- Steeper learning curve due to its broader scope
- Less focused on multimodal tasks compared to MMF
- May require more computational resources for some tasks
Code Comparison
MMF example:
from mmf.models.mmbt import MMBT
model = MMBT.from_pretrained("mmbt.hateful_memes.images")
output = model.classify(image, text)  # image: path or URL to an image, text: str
UniLM example (illustrative pseudocode; in practice UniLM models are fine-tuned and run through the task-specific scripts and packages in the unilm repository, such as s2s-ft, rather than a single top-level API):
from unilm import UniLMForConditionalGeneration
model = UniLMForConditionalGeneration.from_pretrained("unilm-base-cased")
output = model(input_ids, attention_mask=attention_mask)
Key Differences
- MMF is specifically designed for multimodal tasks, while UniLM is a more general-purpose NLP toolkit
- UniLM offers pre-training and fine-tuning for various NLP tasks, whereas MMF focuses on multimodal fusion and reasoning
- MMF provides more out-of-the-box solutions for multimodal problems, while UniLM requires more customization for such tasks
Use Cases
- Choose MMF for dedicated multimodal projects, especially those involving vision and language
- Opt for UniLM when working on a wider range of NLP tasks or when flexibility across different language models is required
An open-source NLP research library, built on PyTorch.
Pros of AllenNLP
- More extensive documentation and tutorials
- Broader focus on general NLP tasks
- Larger community and more frequent updates
Cons of AllenNLP
- Steeper learning curve for beginners
- Less focus on multimodal tasks
Code Comparison
MMF example:
from mmf.models.mmbt import MMBT
model = MMBT.from_pretrained("mmbt.hateful_memes.images")
output = model.classify(image, text)  # image: path or URL to an image, text: str
AllenNLP example:
from allennlp.predictors import Predictor
predictor = Predictor.from_path("https://storage.googleapis.com/allennlp-public-models/snli-roberta-large-2020.06.09.tar.gz")
result = predictor.predict(
    premise="A person on a horse jumps over a broken down airplane.",
    hypothesis="A person is outdoors, on a horse.",
)
MMF is more focused on multimodal tasks, providing a simpler API for combining image and text inputs. AllenNLP offers a more general-purpose NLP toolkit with a wider range of pre-trained models and tasks.
Both libraries are built on PyTorch and provide high-level APIs for working with complex NLP models. AllenNLP's architecture is more modular, allowing for greater customization, while MMF offers more out-of-the-box solutions for multimodal problems.
Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
Pros of fairseq
- More extensive support for sequence-to-sequence tasks like machine translation
- Larger community and more frequent updates
- Better documentation and examples for various NLP tasks
Cons of fairseq
- Steeper learning curve for beginners
- Less focus on multimodal tasks compared to MMF
- May require more computational resources for some models
Code Comparison
MMF example:
from mmf.models.mmbt import MMBT
model = MMBT.from_pretrained("mmbt.hateful_memes.images")
output = model.classify(image, text)  # image: path or URL to an image, text: str
fairseq example:
import torch
# Load a pre-trained WMT'19 En-De transformer via fairseq's torch.hub interface
model = torch.hub.load("pytorch/fairseq", "transformer.wmt19.en-de.single_model",
                       tokenizer="moses", bpe="fastbpe")
translated = model.translate("Hello world!")
Both repositories offer powerful tools for natural language processing and machine learning tasks. MMF specializes in multimodal learning and provides a more accessible entry point for beginners working with vision and language tasks. fairseq, on the other hand, excels in sequence-to-sequence tasks and offers a wider range of models and techniques for advanced NLP research and applications.
CLIP (Contrastive Language-Image Pretraining), Predict the most relevant text snippet given an image
Pros of CLIP
- More versatile for zero-shot image classification and retrieval tasks
- Simpler architecture, making it easier to understand and implement
- Better performance on a wide range of visual tasks without fine-tuning
Cons of CLIP
- Limited to image-text tasks, while MMF supports multiple modalities
- Less flexibility for customizing model architecture and training pipeline
- Fewer pre-trained models and datasets available out-of-the-box
Code Comparison
MMF:
from mmf.models.mmbt import MMBT
model = MMBT.from_pretrained("mmbt.hateful_memes.images")
output = model.classify(image, text)  # image: path or URL to an image, text: str
CLIP:
import clip
import torch
from PIL import Image
# Load the CLIP ViT-B/32 model and its preprocessing transform
model, preprocess = clip.load("ViT-B/32")
image = preprocess(Image.open("image.jpg")).unsqueeze(0)
text = clip.tokenize(["a photo of a cat", "a photo of a dog"])
with torch.no_grad():
    logits_per_image, logits_per_text = model(image, text)
MMF offers a more comprehensive framework for multimodal tasks, while CLIP provides a simpler and more efficient approach for image-text tasks. MMF's code is more modular and customizable, whereas CLIP's implementation is more straightforward and focused on specific use cases.
README
MMF is a modular framework for vision and language multimodal research from Facebook AI Research. MMF contains reference implementations of state-of-the-art vision and language models and has powered multiple research projects at Facebook AI Research. See the full list of projects inside or built on MMF here.
MMF is powered by PyTorch, allows distributed training, and is un-opinionated, scalable, and fast. Use MMF to bootstrap your next vision and language multimodal research project by following the installation instructions. Take a look at the list of MMF features here.
MMF also acts as a starter codebase for challenges around vision and language datasets (the Hateful Memes, TextVQA, TextCaps, and VQA challenges). MMF was formerly known as Pythia. For an overview of how datasets and models work inside MMF, check out MMF's video overview.
Installation
Follow installation instructions in the documentation.
Documentation
Learn more about MMF here.
Citation
If you use MMF in your work or use any models published in MMF, please cite:
@misc{singh2020mmf,
author = {Singh, Amanpreet and Goswami, Vedanuj and Natarajan, Vivek and Jiang, Yu and Chen, Xinlei and Shah, Meet and
Rohrbach, Marcus and Batra, Dhruv and Parikh, Devi},
title = {MMF: A multimodal framework for vision and language research},
howpublished = {\url{https://github.com/facebookresearch/mmf}},
year = {2020}
}
License
MMF is licensed under the BSD license, available in the LICENSE file.