mmf
A modular framework for vision & language multimodal research from Facebook AI Research (FAIR)
Top Related Projects
🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
An open-source NLP research library, built on PyTorch.
Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
CLIP (Contrastive Language-Image Pretraining), Predict the most relevant text snippet given an image
Quick Overview
MMF (Multimodal Framework) is an open-source modular framework for vision and language multimodal research. Developed by Facebook AI Research, it provides a platform for training and evaluating AI models on various multimodal tasks, including visual question answering, image captioning, and visual reasoning.
Pros
- Highly modular and extensible architecture
- Supports a wide range of multimodal tasks and datasets
- Includes pre-trained models and easy-to-use evaluation tools
- Active development and community support
Cons
- Steep learning curve for beginners
- Documentation can be overwhelming due to the framework's complexity
- Requires significant computational resources for training large models
- Some features may be experimental or not fully stable
Code Examples
- Loading a pre-trained model and making predictions:
from mmf.models.mmbt import MMBT
# Load MMBT pre-trained on the Hateful Memes task from the MMF model zoo
model = MMBT.from_pretrained("mmbt.hateful_memes.images")
# classify() takes an image path or URL plus the accompanying text and
# returns a dict with the predicted "label" and its "confidence"
output = model.classify("path/to/image.jpg", "This is a test sentence")
print(output)
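Internally, MMF models consume batched Sample/SampleList objects rather than raw file paths; the classify() convenience call above builds these for you. A minimal sketch of how Sample and SampleList behave (the field name and tensor shapes below are purely illustrative):
import torch
from mmf.common.sample import Sample, SampleList
# Each Sample is a dict-like container of per-example fields
s1 = Sample()
s1.input_ids = torch.zeros(8, dtype=torch.long)
s2 = Sample()
s2.input_ids = torch.ones(8, dtype=torch.long)
# SampleList collates the samples, stacking tensor fields along a batch dimension
batch = SampleList([s1, s2])
print(batch.input_ids.shape)  # torch.Size([2, 8])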
- Creating a custom dataset builder (MyCustomDataset below is assumed to be a dataset class defined in your own project):
from mmf.common.registry import registry
from mmf.datasets.builders.vqa2 import VQA2Builder
# MyCustomDataset is assumed to be your dataset class, defined elsewhere in your project
from my_project.dataset import MyCustomDataset
@registry.register_builder("my_custom_dataset")
class MyCustomDatasetBuilder(VQA2Builder):
    def __init__(self):
        super().__init__()
        self.dataset_name = "my_custom_dataset"
        self.dataset_class = MyCustomDataset
    def build(self, config, dataset_type):
        # Custom download/build logic for the dataset goes here
        pass
- Training a model programmatically (this roughly mirrors what the mmf_run CLI shown below does internally):
from mmf.utils.build import build_config, build_trainer
from mmf.utils.configuration import Configuration
from mmf.utils.flags import flags
# Parse MMF's standard command-line overrides (config=..., model=..., dataset=...)
configuration = Configuration(flags.get_parser().parse_args())
config = build_config(configuration)
# Build the trainer, load datasets/model/optimizer, and run the training loop
trainer = build_trainer(config)
trainer.load()
trainer.train()
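The registry pattern used for the custom dataset above also applies to models. Below is a minimal, illustrative sketch of registering a custom model (the class name, config path, feature field, and layer sizes are assumptions for illustration, not part of MMF):
import torch
from mmf.common.registry import registry
from mmf.models.base_model import BaseModel
@registry.register_model("my_simple_model")
class MySimpleModel(BaseModel):
    def __init__(self, config):
        super().__init__(config)
    @classmethod
    def config_path(cls):
        # Default config shipped with your project (illustrative path)
        return "configs/models/my_simple_model/defaults.yaml"
    def build(self):
        # Build sub-modules here; sizes are illustrative
        self.classifier = torch.nn.Linear(2048, 2)
    def forward(self, sample_list):
        # MMF models return a dict; "scores" is consumed by the losses and metrics
        pooled = sample_list["image_feature_0"].mean(dim=1)
        return {"scores": self.classifier(pooled)}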
Getting Started
To get started with MMF:
- Install MMF:
pip install --upgrade pip
pip install mmf
- Download data: most datasets and features in MMF's zoo are downloaded and cached automatically the first time you run a job that needs them, so a separate download step is usually not required.
- Train a model:
mmf_run config=projects/visual_bert/configs/vqa2/defaults.yaml \
model=visual_bert dataset=vqa2 run_type=train_val
- Evaluate a pre-trained model:
mmf_run config=projects/visual_bert/configs/vqa2/defaults.yaml \
model=visual_bert dataset=vqa2 run_type=val \
checkpoint.resume_file=<path_to_pretrained_model>
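To generate a predictions file (for example, for a challenge submission), MMF also provides an mmf_predict command that accepts the same style of configuration overrides. A sketch reusing the VisualBERT/VQA2 settings from the steps above, with the same checkpoint placeholder:
mmf_predict config=projects/visual_bert/configs/vqa2/defaults.yaml \
model=visual_bert dataset=vqa2 run_type=test \
checkpoint.resume_file=<path_to_pretrained_model>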
Competitor Comparisons
🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
Pros of Transformers
- Broader scope, covering a wide range of NLP tasks and models
- Larger community and more frequent updates
- Extensive documentation and tutorials
Cons of Transformers
- Steeper learning curve for beginners
- Less focus on multimodal tasks compared to MMF
Code Comparison
MMF:
from mmf.models.mmbt import MMBT
model = MMBT.from_pretrained("mmbt.hateful_memes.images")
output = model.classify(image, text)  # image: path or URL to an image, text: str
Transformers:
from transformers import BertTokenizer, BertForSequenceClassification
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
inputs = tokenizer("Hello, world!", return_tensors="pt")
outputs = model(**inputs)
Summary
Transformers offers a more comprehensive NLP toolkit with broader community support, while MMF specializes in multimodal tasks. Transformers may be more challenging for beginners but provides extensive resources. MMF offers a more streamlined approach for specific multimodal applications. The code examples demonstrate the different focus areas, with MMF handling image-text inputs and Transformers processing text-only data.
Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
Pros of UniLM
- Broader scope, covering multiple NLP tasks beyond multimodal
- More active development with frequent updates
- Larger community and more extensive documentation
Cons of UniLM
- Steeper learning curve due to its broader scope
- Less focused on multimodal tasks compared to MMF
- May require more computational resources for some tasks
Code Comparison
MMF example:
from mmf.models.mmbt import MMBT
model = MMBT.from_pretrained("mmbt.hateful_memes.images")
output = model.classify(image, text)  # image: path or URL to an image, text: str
UniLM example (illustrative pseudocode; in practice UniLM models are fine-tuned and run through the task-specific scripts and packages in the unilm repository, such as s2s-ft, rather than a single top-level API):
from unilm import UniLMForConditionalGeneration
model = UniLMForConditionalGeneration.from_pretrained("unilm-base-cased")
output = model(input_ids, attention_mask=attention_mask)
Key Differences
- MMF is specifically designed for multimodal tasks, while UniLM is a more general-purpose NLP toolkit
- UniLM offers pre-training and fine-tuning for various NLP tasks, whereas MMF focuses on multimodal fusion and reasoning
- MMF provides more out-of-the-box solutions for multimodal problems, while UniLM requires more customization for such tasks
Use Cases
- Choose MMF for dedicated multimodal projects, especially those involving vision and language
- Opt for UniLM when working on a wider range of NLP tasks or when flexibility across different language models is required
An open-source NLP research library, built on PyTorch.
Pros of AllenNLP
- More extensive documentation and tutorials
- Broader focus on general NLP tasks
- Larger community and more frequent updates
Cons of AllenNLP
- Steeper learning curve for beginners
- Less focus on multimodal tasks
Code Comparison
MMF example:
from mmf.models.mmbt import MMBT
model = MMBT.from_pretrained("mmbt.hateful_memes.images")
output = model.classify(image, text)  # image: path or URL to an image, text: str
AllenNLP example:
from allennlp.predictors import Predictor
predictor = Predictor.from_path("https://storage.googleapis.com/allennlp-public-models/snli-roberta-large-2020.06.09.tar.gz")
result = predictor.predict(
    premise="A person on a horse jumps over a broken down airplane.",
    hypothesis="A person is outdoors, on a horse.",
)
MMF is more focused on multimodal tasks, providing a simpler API for combining image and text inputs. AllenNLP offers a more general-purpose NLP toolkit with a wider range of pre-trained models and tasks.
Both libraries are built on PyTorch and provide high-level APIs for working with complex NLP models. AllenNLP's architecture is more modular, allowing for greater customization, while MMF offers more out-of-the-box solutions for multimodal problems.
Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
Pros of fairseq
- More extensive support for sequence-to-sequence tasks like machine translation
- Larger community and more frequent updates
- Better documentation and examples for various NLP tasks
Cons of fairseq
- Steeper learning curve for beginners
- Less focus on multimodal tasks compared to MMF
- May require more computational resources for some models
Code Comparison
MMF example:
from mmf.models.mmbt import MMBT
model = MMBT.from_pretrained("mmbt.hateful_memes.images")
output = model.classify(image, text)  # image: path or URL to an image, text: str
fairseq example:
import torch
# Load a pre-trained WMT'19 En-De transformer via fairseq's torch.hub interface
model = torch.hub.load("pytorch/fairseq", "transformer.wmt19.en-de.single_model",
                       tokenizer="moses", bpe="fastbpe")
translated = model.translate("Hello world!")
Both repositories offer powerful tools for natural language processing and machine learning tasks. MMF specializes in multimodal learning and provides a more accessible entry point for beginners working with vision and language tasks. fairseq, on the other hand, excels in sequence-to-sequence tasks and offers a wider range of models and techniques for advanced NLP research and applications.
CLIP (Contrastive Language-Image Pretraining), Predict the most relevant text snippet given an image
Pros of CLIP
- More versatile for zero-shot image classification and retrieval tasks
- Simpler architecture, making it easier to understand and implement
- Better performance on a wide range of visual tasks without fine-tuning
Cons of CLIP
- Limited to image-text tasks, while MMF supports multiple modalities
- Less flexibility for customizing model architecture and training pipeline
- Fewer pre-trained models and datasets available out-of-the-box
Code Comparison
MMF:
from mmf.models.mmbt import MMBT
model = MMBT.from_pretrained("mmbt.hateful_memes.images")
output = model.classify(image, text)  # image: path or URL to an image, text: str
CLIP:
import clip
import torch
from PIL import Image
# Load the CLIP ViT-B/32 model and its preprocessing transform
model, preprocess = clip.load("ViT-B/32")
image = preprocess(Image.open("image.jpg")).unsqueeze(0)
text = clip.tokenize(["a photo of a cat", "a photo of a dog"])
with torch.no_grad():
    logits_per_image, logits_per_text = model(image, text)
MMF offers a more comprehensive framework for multimodal tasks, while CLIP provides a simpler and more efficient approach for image-text tasks. MMF's code is more modular and customizable, whereas CLIP's implementation is more straightforward and focused on specific use cases.
README
MMF is a modular framework for vision and language multimodal research from Facebook AI Research. MMF contains reference implementations of state-of-the-art vision and language models and has powered multiple research projects at Facebook AI Research. See the full list of projects inside or built on MMF here.
MMF is powered by PyTorch, allows distributed training, and is un-opinionated, scalable, and fast. Use MMF to bootstrap your next vision and language multimodal research project by following the installation instructions. Take a look at the list of MMF features here.
MMF also acts as a starter codebase for challenges around vision and language datasets (the Hateful Memes, TextVQA, TextCaps, and VQA challenges). MMF was formerly known as Pythia. For an overview of how datasets and models work inside MMF, check out MMF's video overview.
Installation
Follow installation instructions in the documentation.
Documentation
Learn more about MMF here.
Citation
If you use MMF in your work or use any models published in MMF, please cite:
@misc{singh2020mmf,
author = {Singh, Amanpreet and Goswami, Vedanuj and Natarajan, Vivek and Jiang, Yu and Chen, Xinlei and Shah, Meet and
Rohrbach, Marcus and Batra, Dhruv and Parikh, Devi},
title = {MMF: A multimodal framework for vision and language research},
howpublished = {\url{https://github.com/facebookresearch/mmf}},
year = {2020}
}
License
MMF is licensed under the BSD license, available in the LICENSE file.