facebookresearch/mmf

A modular framework for vision & language multimodal research from Facebook AI Research (FAIR)

Top Related Projects

  • 🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
  • UniLM: Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
  • AllenNLP: An open-source NLP research library, built on PyTorch.
  • fairseq: Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
  • CLIP (Contrastive Language-Image Pretraining): Predict the most relevant text snippet given an image

Quick Overview

MMF (Multimodal Framework) is an open-source modular framework for vision and language multimodal research. Developed by Facebook AI Research, it provides a platform for training and evaluating AI models on various multimodal tasks, including visual question answering, image captioning, and visual reasoning.

Pros

  • Highly modular and extensible architecture
  • Supports a wide range of multimodal tasks and datasets
  • Includes pre-trained models and easy-to-use evaluation tools
  • Active development and community support

Cons

  • Steep learning curve for beginners
  • Documentation can be overwhelming due to the framework's complexity
  • Requires significant computational resources for training large models
  • Some features may be experimental or not fully stable

Code Examples

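  1. Loading a pre-trained model and making a prediction:
from mmf.models.mmbt import MMBT

# For this checkpoint, from_pretrained returns an inference wrapper
# that exposes classify(image, text).
model = MMBT.from_pretrained("mmbt.hateful_memes.images")
# The image argument can be a local path or a URL.
output = model.classify("path/to/image.jpg", "This is a test image")
print(output)  # a dict with the predicted label and its confidence
  2. Creating a custom dataset builder:
from mmf.common.registry import registry
from mmf.datasets.builders.vqa2.builder import VQA2Builder
from mmf.datasets.builders.vqa2.dataset import VQA2Dataset

class MyCustomDataset(VQA2Dataset):
    # Placeholder: override data loading and __getitem__ as needed.
    pass

@registry.register_builder("my_custom_dataset")
class MyCustomDatasetBuilder(VQA2Builder):
    def __init__(self):
        super().__init__()
        self.dataset_name = "my_custom_dataset"
        self.set_dataset_class(MyCustomDataset)

    def build(self, config, dataset_type, *args, **kwargs):
        # Custom build logic (for example, extra downloads) goes here.
        super().build(config, dataset_type, *args, **kwargs)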
  3. Training a model from Python:
from mmf_cli.run import run

# run() accepts the same key=value overrides as the mmf_run command.
run(opts=[
    "config=projects/visual_bert/configs/vqa2/defaults.yaml",
    "model=visual_bert",
    "dataset=vqa2",
    "run_type=train_val",
])
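
The custom dataset example relies on a decorator, registry.register_builder, to make a class reachable by a string key. The sketch below illustrates that general pattern in plain Python. It is a simplified stand-in written for this page, not MMF's actual registry code; the Registry class, toy_registry, and the builder's build() return value are all hypothetical.

```python
from typing import Callable, Dict, Type


class Registry:
    """Toy name-to-class registry; a simplified stand-in, not MMF's implementation."""

    def __init__(self) -> None:
        self._builders: Dict[str, Type] = {}

    def register_builder(self, name: str) -> Callable[[Type], Type]:
        # Return a decorator that records the class under `name`
        # and hands the class back unchanged.
        def wrap(cls: Type) -> Type:
            if name in self._builders:
                raise KeyError(f"builder '{name}' is already registered")
            self._builders[name] = cls
            return cls

        return wrap

    def get_builder_class(self, name: str) -> Type:
        # Look up a previously registered class by its string key.
        return self._builders[name]


toy_registry = Registry()


@toy_registry.register_builder("my_custom_dataset")
class MyCustomDatasetBuilder:
    """Hypothetical builder used only to demonstrate the lookup."""

    def build(self) -> str:
        return "built my_custom_dataset"


# A config file only needs the string key to reach the class.
builder_cls = toy_registry.get_builder_class("my_custom_dataset")
message = builder_cls().build()
```

The point of the pattern is that configuration files can name a dataset or model by string, and the framework resolves the class at run time without importing it explicitly at the call site.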

Getting Started

To get started with MMF:

  1. Install MMF:
pip install --upgrade pip
pip install mmf
  2. Prepare the data. This example should not need a separate download command: MMF fetches the data for supported datasets such as vqa2 automatically the first time they are used.
  3. Train a model:
mmf_run config=projects/visual_bert/configs/vqa2/defaults.yaml \
    model=visual_bert dataset=vqa2 run_type=train_val
  4. Evaluate a pre-trained model:
mmf_run config=projects/visual_bert/configs/vqa2/defaults.yaml \
    model=visual_bert dataset=vqa2 run_type=val \
    checkpoint.resume_file=<path_to_pretrained_model>
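
The mmf_run commands above pass settings as key=value pairs, with dots in the key (as in checkpoint.resume_file) addressing nested sections of the configuration. A minimal sketch of that idea in plain Python follows. It is not MMF's own parsing code; apply_overrides and the checkpoint path are hypothetical, and every value is kept as a string for simplicity.

```python
from typing import Any, Dict, List


def apply_overrides(config: Dict[str, Any], overrides: List[str]) -> Dict[str, Any]:
    """Merge `dotted.key=value` strings into a nested dict (values stay strings)."""
    for item in overrides:
        key, sep, value = item.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {item!r}")
        *parents, leaf = key.split(".")
        node = config
        for part in parents:
            # Create intermediate sections on demand.
            node = node.setdefault(part, {})
        node[leaf] = value
    return config


config = apply_overrides(
    {"run_type": "train_val"},
    [
        "model=visual_bert",
        "dataset=vqa2",
        "run_type=val",
        # Hypothetical path, standing in for <path_to_pretrained_model>.
        "checkpoint.resume_file=/hypothetical/path/model.ckpt",
    ],
)
```

Later overrides win over earlier values, which is why the same YAML file can serve both training and evaluation with only run_type changed on the command line.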

Competitor Comparisons

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.

Pros of Transformers

  • Broader scope, covering a wide range of NLP tasks and models
  • Larger community and more frequent updates
  • Extensive documentation and tutorials

Cons of Transformers

  • Steeper learning curve for beginners
  • Less focus on multimodal tasks compared to MMF

Code Comparison

MMF:

from mmf.models.mmbt import MMBT
model = MMBT.from_pretrained("mmbt.hateful_memes.images")
# image: path or URL of the image; text: the accompanying text
output = model.classify(image, text)

Transformers:

from transformers import BertTokenizer, BertForSequenceClassification
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
inputs = tokenizer("Hello, world!", return_tensors="pt")
outputs = model(**inputs)

Summary

Transformers offers a more comprehensive NLP toolkit with broader community support, while MMF specializes in multimodal tasks. Transformers may be more challenging for beginners but provides extensive resources. MMF offers a more streamlined approach for specific multimodal applications. The code examples demonstrate the different focus areas, with MMF handling image-text inputs and Transformers processing text-only data.

UniLM: Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities

Pros of UniLM

  • Broader scope, covering multiple NLP tasks beyond multimodal
  • More active development with frequent updates
  • Larger community and more extensive documentation

Cons of UniLM

  • Steeper learning curve due to its broader scope
  • Less focused on multimodal tasks compared to MMF
  • May require more computational resources for some tasks

Code Comparison

MMF example:

from mmf.models.mmbt import MMBT
model = MMBT.from_pretrained("mmbt.hateful_memes.images")
# image: path or URL of the image; text: the accompanying text
output = model.classify(image, text)

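UniLM example (illustrative pseudocode only: the UniLM repository hosts separate per-model projects, and the unilm import and class name below are not a published package API):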

from unilm import UniLMForConditionalGeneration
model = UniLMForConditionalGeneration.from_pretrained("unilm-base-cased")
output = model(input_ids, attention_mask=attention_mask)

Key Differences

  • MMF is specifically designed for multimodal tasks, while UniLM is a more general-purpose NLP toolkit
  • UniLM offers pre-training and fine-tuning for various NLP tasks, whereas MMF focuses on multimodal fusion and reasoning
  • MMF provides more out-of-the-box solutions for multimodal problems, while UniLM requires more customization for such tasks

Use Cases

  • Choose MMF for dedicated multimodal projects, especially those involving vision and language
  • Opt for UniLM when working on a wider range of NLP tasks or when flexibility across different language models is required

AllenNLP: An open-source NLP research library, built on PyTorch.

Pros of AllenNLP

  • More extensive documentation and tutorials
  • Broader focus on general NLP tasks
  • Larger community and more frequent updates

Cons of AllenNLP

  • Steeper learning curve for beginners
  • Less focus on multimodal tasks

Code Comparison

MMF example:

from mmf.models.mmbt import MMBT
model = MMBT.from_pretrained("mmbt.hateful_memes.images")
# image: path or URL of the image; text: the accompanying text
output = model.classify(image, text)

AllenNLP example:

from allennlp.predictors import Predictor
predictor = Predictor.from_path("https://storage.googleapis.com/allennlp-public-models/snli-roberta-large-2020.06.09.tar.gz")
result = predictor.predict(premise="A person on a horse jumps over a broken down airplane.", hypothesis="A person is outdoors, on a horse.")

MMF is more focused on multimodal tasks, providing a simpler API for combining image and text inputs. AllenNLP offers a more general-purpose NLP toolkit with a wider range of pre-trained models and tasks.

Both libraries are built on PyTorch and provide high-level APIs for working with complex NLP models. AllenNLP's architecture is more modular, allowing for greater customization, while MMF offers more out-of-the-box solutions for multimodal problems.

fairseq: Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

Pros of fairseq

  • More extensive support for sequence-to-sequence tasks like machine translation
  • Larger community and more frequent updates
  • Better documentation and examples for various NLP tasks

Cons of fairseq

  • Steeper learning curve for beginners
  • Less focus on multimodal tasks compared to MMF
  • May require more computational resources for some models

Code Comparison

MMF example:

from mmf.models.mmbt import MMBT
model = MMBT.from_pretrained("mmbt.hateful_memes.images")
# image: path or URL of the image; text: the accompanying text
output = model.classify(image, text)

fairseq example:

from fairseq.models.transformer import TransformerModel
model = TransformerModel.from_pretrained('transformer.wmt19.en-de')
translated = model.translate('Hello world!')

Both repositories offer powerful tools for natural language processing and machine learning tasks. MMF specializes in multimodal learning and provides a more accessible entry point for beginners working with vision and language tasks. fairseq, on the other hand, excels in sequence-to-sequence tasks and offers a wider range of models and techniques for advanced NLP research and applications.

CLIP (Contrastive Language-Image Pretraining): Predict the most relevant text snippet given an image

Pros of CLIP

  • More versatile for zero-shot image classification and retrieval tasks
  • Simpler architecture, making it easier to understand and implement
  • Better performance on a wide range of visual tasks without fine-tuning

Cons of CLIP

  • Limited to image-text tasks, while MMF supports multiple modalities
  • Less flexibility for customizing model architecture and training pipeline
  • Fewer pre-trained models and datasets available out-of-the-box

Code Comparison

MMF:

from mmf.models.mmbt import MMBT
model = MMBT.from_pretrained("mmbt.hateful_memes.images")
# image: path or URL of the image; text: the accompanying text
output = model.classify(image, text)

CLIP:

import clip
model, preprocess = clip.load("ViT-B/32")
image = preprocess(image).unsqueeze(0)
text = clip.tokenize(["a photo of a cat", "a photo of a dog"])
logits_per_image, logits_per_text = model(image, text)

MMF offers a more comprehensive framework for multimodal tasks, while CLIP provides a simpler and more efficient approach for image-text tasks. MMF's code is more modular and customizable, whereas CLIP's implementation is more straightforward and focused on specific use cases.
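
The CLIP snippet stops at logits_per_image, which holds one similarity score per candidate caption. Turning those scores into probabilities is a softmax over the captions; with PyTorch tensors this is usually written logits_per_image.softmax(dim=-1). The sketch below shows the same step on plain Python lists so it runs without a model. The score values are made up for illustration and are not real CLIP output.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    peak = max(logits)
    # Subtracting the maximum avoids overflow in exp() for large scores.
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical similarity scores for one image against two captions.
captions = ["a photo of a cat", "a photo of a dog"]
logits_for_image = [24.5, 19.0]

probs = softmax(logits_for_image)
best_caption = captions[probs.index(max(probs))]
```

The caption with the highest probability is the zero-shot prediction, which is why CLIP needs no task-specific classification head.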

README


MMF is a modular framework for vision and language multimodal research from Facebook AI Research. MMF contains reference implementations of state-of-the-art vision and language models and has powered multiple research projects at Facebook AI Research. See the full list of projects inside or built on MMF here.

MMF is powered by PyTorch, allows distributed training, and is un-opinionated, scalable, and fast. Use MMF to bootstrap your next vision and language multimodal research project by following the installation instructions. Take a look at the list of MMF features here.

MMF also acts as a starter codebase for challenges around vision and language datasets (the Hateful Memes, TextVQA, TextCaps, and VQA challenges). MMF was formerly known as Pythia. The following video gives an overview of how datasets and models work inside MMF. Check out MMF's video overview.

Installation

Follow the installation instructions in the documentation.

Documentation

Learn more about MMF here.

Citation

If you use MMF in your work or use any models published in MMF, please cite:

@misc{singh2020mmf,
  author =       {Singh, Amanpreet and Goswami, Vedanuj and Natarajan, Vivek and Jiang, Yu and Chen, Xinlei and Shah, Meet and
                 Rohrbach, Marcus and Batra, Dhruv and Parikh, Devi},
  title =        {MMF: A multimodal framework for vision and language research},
  howpublished = {\url{https://github.com/facebookresearch/mmf}},
  year =         {2020}
}

License

MMF is licensed under the BSD license, available in the LICENSE file.