facebookresearch / Mask2Former

Code release for "Masked-attention Mask Transformer for Universal Image Segmentation"

Top Related Projects

Detectron2 is a platform for object detection, segmentation and other visual recognition tasks.

Mask R-CNN for object detection and instance segmentation on Keras and TensorFlow

OpenMMLab Detection Toolbox and Benchmark

End-to-End Object Detection with Transformers

Per-Pixel Classification is Not All You Need for Semantic Segmentation (NeurIPS 2021, spotlight)

Models and examples built with TensorFlow

Quick Overview

Mask2Former is a universal image segmentation framework from Facebook Research, published at CVPR 2022. It handles panoptic, instance, and semantic segmentation with a single architecture and supports the major segmentation datasets: ADE20K, Cityscapes, COCO, and Mapillary Vistas.
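To make the "single architecture" idea more concrete, here is a minimal pure-Python sketch of how a set of per-query predictions (a class distribution plus a mask) can be turned into a semantic map. It assumes the mask-classification formulation, in which each query's mask is weighted by its class probabilities and the "no object" class is dropped. The function names and toy numbers are illustrative and are not taken from the repository.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def semantic_map(class_logits, mask_logits):
    """class_logits: [Q][C+1] with the last entry meaning "no object";
    mask_logits: [Q][H][W]. Returns an [H][W] grid of class ids."""
    num_classes = len(class_logits[0]) - 1
    height, width = len(mask_logits[0]), len(mask_logits[0][0])
    # Per-query class probabilities, with the "no object" column dropped
    probs = [softmax(row)[:num_classes] for row in class_logits]
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            # Score of class c at this pixel: sum over queries of
            # P(class c | query) * P(pixel belongs to the query's mask)
            scores = [
                sum(probs[q][c] * sigmoid(mask_logits[q][y][x])
                    for q in range(len(probs)))
                for c in range(num_classes)
            ]
            row.append(max(range(num_classes), key=scores.__getitem__))
        result.append(row)
    return result

# Toy input: 2 queries, 2 classes plus "no object", and a 1x2 image
class_logits = [[5.0, 0.0, 0.0], [0.0, 5.0, 0.0]]
mask_logits = [[[4.0, -4.0]], [[-4.0, 4.0]]]
print(semantic_map(class_logits, mask_logits))  # [[0, 1]]
```

Instance and panoptic outputs can be read from the same set of per-query predictions with different post-processing, which is why one model can serve all three tasks.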

Pros

  • Versatile: Handles multiple segmentation tasks with a single model
  • High performance: Achieves state-of-the-art results on various benchmarks
  • Efficient: Uses a unified architecture, reducing the need for task-specific models
  • Extensible: Can be easily adapted to new segmentation tasks

Cons

  • Complex architecture: May be challenging to understand and implement for beginners
  • Resource-intensive: Requires significant computational resources for training and inference
  • Limited documentation: Some aspects of the project may lack detailed explanations
  • Dependency on specific libraries: Requires particular versions of PyTorch and other dependencies

Code Examples

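  1. Installing Mask2Former (the repository is cloned and set up in place; see its installation instructions for the supported PyTorch and CUDA versions):
python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'
git clone https://github.com/facebookresearch/Mask2Former.git
cd Mask2Former && python -m pip install -r requirements.txt
(cd mask2former/modeling/pixel_decoder/ops && sh make.sh)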
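  2. Importing and using a pre-trained model:
import cv2
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor
from detectron2.projects.deeplab import add_deeplab_config
from mask2former import add_maskformer2_config

cfg = get_cfg()
add_deeplab_config(cfg)       # the repository's demo registers these keys first
add_maskformer2_config(cfg)   # adds the Mask2Former-specific config keys
cfg.merge_from_file("configs/coco/panoptic-segmentation/swin/maskformer2_swin_large_IN21k_384_bs16_100ep.yaml")
cfg.MODEL.WEIGHTS = "model_final_f07440.pkl"  # checkpoint downloaded from the model zoo
predictor = DefaultPredictor(cfg)

image = cv2.imread("input.jpg")  # placeholder path; DefaultPredictor takes a BGR array
outputs = predictor(image)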
  3. Visualizing the results:
import cv2
from detectron2.utils.visualizer import Visualizer
from detectron2.data import MetadataCatalog

# `image` is in BGR (OpenCV) order; Visualizer expects RGB, hence the channel flip.
v = Visualizer(image[:, :, ::-1], MetadataCatalog.get(cfg.DATASETS.TEST[0]), scale=1.2)
out = v.draw_instance_predictions(outputs["instances"].to("cpu"))
cv2.imshow("Result", out.get_image()[:, :, ::-1])  # flip back to BGR for OpenCV
cv2.waitKey(0)
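
With a panoptic configuration, the Detectron2 output dictionary also carries a "panoptic_seg" entry: a pair of a segment-ID map and a list of per-segment dictionaries with "id", "category_id", and "isthing" keys. The sketch below summarizes such a result by pixel area. It uses a nested list in place of the tensor so that it runs without PyTorch, and the function name summarize_segments is hypothetical.

```python
from collections import Counter

def summarize_segments(id_map, segments_info):
    """id_map: [H][W] grid of segment ids; segments_info: list of dicts
    with "id", "category_id" and "isthing". Returns one row per segment."""
    # Count how many pixels carry each segment id
    areas = Counter(pixel for row in id_map for pixel in row)
    return [
        {
            "id": seg["id"],
            "category_id": seg["category_id"],
            "isthing": seg["isthing"],
            "area": areas.get(seg["id"], 0),
        }
        for seg in segments_info
    ]

# Toy stand-in for outputs["panoptic_seg"]; real values come from the predictor
id_map = [[1, 1, 2],
          [1, 1, 2]]
segments_info = [
    {"id": 1, "category_id": 0, "isthing": True},
    {"id": 2, "category_id": 21, "isthing": False},
]
for row in summarize_segments(id_map, segments_info):
    print(row)
```

With a real prediction, the ID map would first be moved to the CPU and converted, for example with `.cpu().tolist()` on the tensor.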

Getting Started

  1. Install the required dependencies:

    pip install torch torchvision
    pip install 'git+https://github.com/facebookresearch/detectron2.git'
    
  2. Clone the repository, install its requirements, and build the deformable-attention ops:

    git clone https://github.com/facebookresearch/Mask2Former.git
    cd Mask2Former
    pip install -r requirements.txt
    (cd mask2former/modeling/pixel_decoder/ops && sh make.sh)
    
  3. Download a pre-trained model and configuration file from the project's model zoo.

  4. Use the provided scripts for inference or training on your own dataset.
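
After the steps above, a quick sanity check can confirm that the main packages are importable before running any script. This is a minimal sketch using only the standard library; the helper name check_dependencies is hypothetical, and it assumes the mask2former package is found only when Python is started from the cloned repository root or the repository is on PYTHONPATH.

```python
import importlib.util

def check_dependencies(names=("torch", "torchvision", "detectron2", "mask2former")):
    """Return {package_name: True/False} depending on whether it can be located."""
    status = {}
    for name in names:
        try:
            status[name] = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # find_spec can raise for broken or partially initialised packages
            status[name] = False
    return status

for name, found in check_dependencies().items():
    print(f"{name:12s} {'found' if found else 'MISSING'}")
```

A package reported as missing usually points to an incomplete install step or to running the check from outside the repository directory.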

Competitor Comparisons

Detectron2 is a platform for object detection, segmentation and other visual recognition tasks.

Pros of Detectron2

  • More comprehensive and versatile, supporting a wider range of computer vision tasks
  • Longer development history, resulting in a more mature and stable codebase
  • Larger community and more extensive documentation

Cons of Detectron2

  • Higher complexity and steeper learning curve for beginners
  • Potentially slower inference time for specific tasks compared to Mask2Former

Code Comparison

Detectron2:

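from detectron2 import model_zoo
from detectron2.engine import DefaultPredictor
from detectron2.config import get_cfg

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
# Load the matching pre-trained weights from the Detectron2 model zoo
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
predictor = DefaultPredictor(cfg)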

Mask2Former:

from mask2former import add_maskformer2_config
from detectron2.config import get_cfg

cfg = get_cfg()
add_maskformer2_config(cfg)  # register Mask2Former keys before merging the YAML file
cfg.merge_from_file("configs/coco/panoptic-segmentation/swin/maskformer2_swin_large_IN21k_384_bs16_100ep.yaml")

Mask2Former is built on top of the Detectron2 framework and focuses on universal (panoptic, instance, and semantic) segmentation, while Detectron2 provides a broader toolkit for many computer vision tasks. The code comparison shows the shared configuration flow: both start from the Detectron2 base config, and Mask2Former registers its own configuration keys on that base before merging a YAML file.
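
The ordering in the Mask2Former snippet matters because a config merge typically rejects keys that were never registered. The toy below imitates that behaviour with a plain dictionary. It is a stand-in, not the Detectron2 config class; ToyConfig, get_base_cfg, and add_project_config are hypothetical names, and the key shown is used for illustration only.

```python
class ToyConfig(dict):
    """Dictionary that only accepts overrides for keys it already knows."""

    def merge(self, overrides):
        for key, value in overrides.items():
            if key not in self:
                raise KeyError(f"Non-existent config key: {key}")
            self[key] = value

def get_base_cfg():
    # Keys every project shares (the role played by the base config)
    return ToyConfig({"MODEL.WEIGHTS": "", "MODEL.DEVICE": "cuda"})

def add_project_config(cfg):
    # Project-specific defaults (the role played by the add-config call)
    cfg["MODEL.MASK_FORMER.NUM_OBJECT_QUERIES"] = 100

# Values that would normally come from a YAML file
yaml_values = {"MODEL.MASK_FORMER.NUM_OBJECT_QUERIES": 200}

cfg = get_base_cfg()
try:
    cfg.merge(yaml_values)           # project keys not registered yet
except KeyError as err:
    print("merge before registration fails:", err)

add_project_config(cfg)              # register the project keys first
cfg.merge(yaml_values)               # now the override is accepted
print(cfg["MODEL.MASK_FORMER.NUM_OBJECT_QUERIES"])  # 200
```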

Mask R-CNN for object detection and instance segmentation on Keras and TensorFlow

Pros of Mask_RCNN

  • Simpler architecture, making it easier to understand and implement
  • Well-established and widely used in various applications
  • Extensive documentation and community support

Cons of Mask_RCNN

  • Lower performance on complex segmentation tasks compared to Mask2Former
  • Less flexible in handling different types of segmentation problems
  • May require more manual tuning for optimal results

Code Comparison

Mask_RCNN:

from mrcnn import utils
import mrcnn.model as modellib

model = modellib.MaskRCNN(mode="inference", config=config, model_dir=MODEL_DIR)
model.load_weights(WEIGHTS_PATH, by_name=True)
results = model.detect([image], verbose=1)

Mask2Former:

from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor
from mask2former import add_maskformer2_config

cfg = get_cfg()
add_maskformer2_config(cfg)
cfg.merge_from_file("configs/coco/panoptic-segmentation/swin/maskformer2_swin_large_IN21k_384_bs16_100ep.yaml")
predictor = DefaultPredictor(cfg)
outputs = predictor(image)  # image: BGR array, e.g. from cv2.imread

OpenMMLab Detection Toolbox and Benchmark

Pros of mmdetection

  • Broader scope: Supports a wide range of object detection, instance segmentation, and panoptic segmentation algorithms
  • Extensive documentation and tutorials: Easier for beginners to get started and understand the codebase
  • Modular design: Allows for easy customization and extension of components

Cons of mmdetection

  • Larger codebase: May be more complex to navigate and understand for specific tasks
  • Potentially slower inference: Due to its generalized nature, it might not be as optimized for specific architectures

Code Comparison

mmdetection:

from mmdet.apis import init_detector, inference_detector

config_file = 'configs/faster_rcnn/faster_rcnn_r50_fpn_1x_coco.py'
checkpoint_file = 'checkpoints/faster_rcnn_r50_fpn_1x_coco_20200130-047c8118.pth'
model = init_detector(config_file, checkpoint_file, device='cuda:0')
result = inference_detector(model, 'test.jpg')

Mask2Former:

from detectron2.config import get_cfg
from mask2former import add_maskformer2_config
from detectron2.engine import DefaultPredictor

cfg = get_cfg()
add_maskformer2_config(cfg)
cfg.merge_from_file("configs/coco/panoptic-segmentation/maskformer2_R50_bs16_50ep.yaml")
predictor = DefaultPredictor(cfg)
outputs = predictor(image)

End-to-End Object Detection with Transformers

Pros of DETR

  • Pioneered end-to-end object detection using transformers
  • Simpler architecture with no need for anchor boxes or non-maximum suppression
  • Flexible and easily extendable to other tasks like panoptic segmentation

Cons of DETR

  • Slower convergence and longer training times compared to Mask2Former
  • Lower performance on small objects and dense scenes
  • Less efficient in terms of computational resources

Code Comparison

DETR:

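from torch import nn

class DETR(nn.Module):
    def __init__(self, backbone, transformer, num_classes, num_queries):
        super().__init__()
        self.backbone = backbone
        self.transformer = transformer
        hidden_dim = transformer.d_model  # embedding width of the transformer
        # One extra output for the "no object" class
        self.class_embed = nn.Linear(hidden_dim, num_classes + 1)
        # 3-layer MLP (defined in the DETR code) predicting a 4-value box
        self.bbox_embed = MLP(hidden_dim, hidden_dim, 4, 3)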

Mask2Former:

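# Simplified sketch for comparison; PixelDecoder and TransformerDecoder are placeholder names
class Mask2Former(nn.Module):
    def __init__(self, backbone, feature_channels, num_layers, hidden_dim, mask_dim, num_classes):
        super().__init__()
        self.backbone = backbone
        self.pixel_decoder = PixelDecoder(feature_channels)
        self.transformer_decoder = TransformerDecoder(num_layers, hidden_dim)
        # One extra output for the "no object" class
        self.class_embed = nn.Linear(hidden_dim, num_classes + 1)
        # Mask embeddings take the place of DETR's box head
        self.mask_embed = MLP(hidden_dim, hidden_dim, mask_dim, 3)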

Both repositories implement transformer-based architectures for object detection and segmentation tasks. While DETR introduced the concept, Mask2Former builds upon it with improved efficiency and performance, especially for instance and panoptic segmentation.
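
One ingredient behind that difference is the masked attention named in the paper's title: a query attends only to locations inside the mask it predicted in the previous layer, rather than to the whole image. The following is a minimal single-query sketch in pure Python, assuming a 0.5 foreground threshold and a fallback to full attention when the predicted mask is empty. The function name and the numbers are illustrative, not the repository's code.

```python
import math

def masked_attention_weights(scores, mask_probs, threshold=0.5):
    """scores: raw attention scores of one query over N locations;
    mask_probs: the query's predicted mask probability at each location.
    Returns attention weights that are zero outside the predicted mask."""
    keep = [p >= threshold for p in mask_probs]
    if not any(keep):
        # Empty mask: fall back to ordinary attention over all locations
        keep = [True] * len(scores)
    top = max(s for s, k in zip(scores, keep) if k)
    # Softmax restricted to the kept locations
    exps = [math.exp(s - top) if k else 0.0 for s, k in zip(scores, keep)]
    total = sum(exps)
    return [e / total for e in exps]

weights = masked_attention_weights(
    scores=[1.0, 2.0, 3.0, 4.0],
    mask_probs=[0.9, 0.8, 0.1, 0.2],
)
print([round(w, 3) for w in weights])
```

In this toy case the two highest raw scores sit outside the predicted mask, so they receive zero weight and the attention is renormalized over the first two locations.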

Per-Pixel Classification is Not All You Need for Semantic Segmentation (NeurIPS 2021, spotlight)

Pros of MaskFormer

  • Introduced a unified framework for instance and semantic segmentation
  • Simpler architecture compared to Mask2Former
  • Established a strong baseline for panoptic segmentation tasks

Cons of MaskFormer

  • Lower performance on certain benchmarks compared to Mask2Former
  • Less flexible in handling different segmentation tasks
  • May require more computational resources for training

Code Comparison

MaskFormer:

# Simplified sketch, not the repository's full class
class MaskFormer(nn.Module):
    def __init__(self, backbone, transformer, num_queries, hidden_dim):
        super().__init__()
        self.backbone = backbone
        self.transformer = transformer
        # One learned embedding per object query
        self.query_embed = nn.Embedding(num_queries, hidden_dim)

Mask2Former:

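# Simplified sketch, not the repository's full class
class Mask2Former(nn.Module):
    def __init__(self, backbone, transformer, num_queries, num_feature_levels, hidden_dim):
        super().__init__()
        self.backbone = backbone
        self.transformer = transformer
        # One learned embedding per object query
        self.query_embed = nn.Embedding(num_queries, hidden_dim)
        # One learned embedding per feature-map scale
        self.level_embed = nn.Embedding(num_feature_levels, hidden_dim)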

The main difference in the code is that Mask2Former introduces an additional level_embed parameter, allowing it to handle multi-scale feature maps more effectively. This contributes to its improved performance across various segmentation tasks.
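
As a rough illustration of how a per-level embedding can be used, the sketch below assigns feature levels to decoder layers in round-robin order and adds a level vector to every feature of that level, so the decoder can tell the scales apart. It is a pure-Python sketch: the function names are hypothetical, and the layer and level counts are chosen only for the example.

```python
def level_for_layer(layer_index, num_feature_levels):
    # Round-robin: layer 0 uses level 0, layer 1 uses level 1, and so on
    return layer_index % num_feature_levels

def add_level_embedding(features, level_embed):
    """features: list of [hidden_dim] vectors from one feature level;
    level_embed: the [hidden_dim] vector learned for that level."""
    return [[f + e for f, e in zip(vec, level_embed)] for vec in features]

# Which feature level each of 9 decoder layers would attend to, with 3 levels
schedule = [level_for_layer(i, 3) for i in range(9)]
print(schedule)  # [0, 1, 2, 0, 1, 2, 0, 1, 2]

# Tag two 2-dimensional feature vectors with a toy level embedding
tagged = add_level_embedding([[1.0, 2.0], [3.0, 4.0]], [0.5, -0.5])
print(tagged)  # [[1.5, 1.5], [3.5, 3.5]]
```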

Models and examples built with TensorFlow

Pros of models

  • Broader scope: Covers a wide range of machine learning tasks and models
  • Official TensorFlow implementation: Ensures compatibility and optimized performance
  • Extensive documentation and community support

Cons of models

  • Larger and more complex repository structure
  • May require more setup and configuration for specific tasks
  • Less focused on a single task or model architecture

Code comparison

Mask2Former (PyTorch):

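from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor
from mask2former import add_maskformer2_config

cfg = get_cfg()
add_maskformer2_config(cfg)  # the model is built from a config, not from a Mask2Former(...) constructor
cfg.merge_from_file("configs/coco/panoptic-segmentation/maskformer2_R50_bs16_50ep.yaml")
outputs = DefaultPredictor(cfg)(image)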

models (TensorFlow):

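import tensorflow as tf
from official.vision.modeling import maskrcnn_model

# Schematic call: the actual constructor also expects pre-built components (backbone, decoder, heads)
model = maskrcnn_model.MaskRCNNModel(num_classes=80)
outputs = model(images, training=False)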

Summary

Mask2Former is a specialized repository focused on panoptic, instance, and semantic segmentation using the Mask2Former architecture. It offers a more streamlined experience for these tasks.

models is a comprehensive collection of TensorFlow implementations for various machine learning tasks. It provides a wider range of models and applications but may require more setup for specific use cases.

The choice between the two depends on the specific requirements of your project and your preferred deep learning framework (PyTorch vs. TensorFlow).

README

Mask2Former: Masked-attention Mask Transformer for Universal Image Segmentation (CVPR 2022)

Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexander Kirillov, Rohit Girdhar

[arXiv] [Project] [BibTeX]


Features

  • A single architecture for panoptic, instance and semantic segmentation.
  • Support major segmentation datasets: ADE20K, Cityscapes, COCO, Mapillary Vistas.

Updates

  • Add Google Colab demo.
  • Video instance segmentation is now supported! Please check our tech report for more details.

Installation

See installation instructions.

Getting Started

See Preparing Datasets for Mask2Former.

See Getting Started with Mask2Former.

Run our demo using Colab.

Integrated into Hugging Face Spaces 🤗 using Gradio. Try out the web demo there.

A Replicate web demo and Docker image are also available.

Advanced usage

See Advanced Usage of Mask2Former.

Model Zoo and Baselines

We provide a large set of baseline results and trained models available for download in the Mask2Former Model Zoo.

License

The majority of Mask2Former is licensed under the MIT License.

However, portions of the project are available under separate license terms: Swin-Transformer-Semantic-Segmentation is licensed under the MIT License, and Deformable-DETR is licensed under the Apache-2.0 License.

Citing Mask2Former

If you use Mask2Former in your research or wish to refer to the baseline results published in the Model Zoo, please use the following BibTeX entry.

@inproceedings{cheng2021mask2former,
  title={Masked-attention Mask Transformer for Universal Image Segmentation},
  author={Bowen Cheng and Ishan Misra and Alexander G. Schwing and Alexander Kirillov and Rohit Girdhar},
  booktitle={CVPR},
  year={2022}
}

If you find the code useful, please also consider the following BibTeX entry.

@inproceedings{cheng2021maskformer,
  title={Per-Pixel Classification is Not All You Need for Semantic Segmentation},
  author={Bowen Cheng and Alexander G. Schwing and Alexander Kirillov},
  booktitle={NeurIPS},
  year={2021}
}

Acknowledgement

Code is largely based on MaskFormer (https://github.com/facebookresearch/MaskFormer).