MaskFormer
Per-Pixel Classification is Not All You Need for Semantic Segmentation (NeurIPS 2021, spotlight)
Top Related Projects
Code release for "Masked-attention Mask Transformer for Universal Image Segmentation"
Detectron2 is a platform for object detection, segmentation and other visual recognition tasks.
Mask R-CNN for object detection and instance segmentation on Keras and TensorFlow
OpenMMLab Detection Toolbox and Benchmark
End-to-End Object Detection with Transformers
FAIR's research platform for object detection research, implementing popular algorithms like Mask R-CNN and RetinaNet.
Quick Overview
MaskFormer is a unified framework for image segmentation tasks developed by Facebook Research. It combines the strengths of mask classification and pixel classification approaches, offering a flexible and efficient solution for various segmentation problems, including panoptic, instance, and semantic segmentation.
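The core idea translates into a few lines of code: the model predicts a fixed set of query masks, each paired with a class distribution, and a per-pixel semantic map is recovered by marginalizing over the queries. Below is a minimal sketch of that combination step; the shapes and names are illustrative, not the repository's API.

import torch

# Illustrative sizes: N queries, C classes (plus one "no object" class), H x W masks.
N, C, H, W = 100, 150, 64, 64
class_logits = torch.randn(N, C + 1)   # per-query class predictions
mask_logits = torch.randn(N, H, W)     # per-query mask predictions

# Drop the "no object" class, then marginalize over queries: a pixel's score
# for a class is the sum of mask probabilities weighted by class probabilities.
class_probs = class_logits.softmax(-1)[:, :-1]    # (N, C)
mask_probs = mask_logits.sigmoid()                # (N, H, W)
semseg = torch.einsum("nc,nhw->chw", class_probs, mask_probs)
prediction = semseg.argmax(0)                     # (H, W) semantic label map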
Pros
- Versatile: Handles multiple segmentation tasks with a single architecture
- State-of-the-art performance: Achieves competitive results across various benchmarks
- Efficient: transformer-based mask classification uses a small, fixed set of queries, so cost does not balloon with the number of classes
- Easy integration: Can be incorporated into existing segmentation pipelines
Cons
- Computational complexity: May require significant computational resources for training and inference
- Learning curve: Understanding and implementing the architecture might be challenging for beginners
- Limited pre-trained models: Fewer pre-trained models available compared to some other segmentation frameworks
- Dependency on specific libraries: Requires specific versions of PyTorch and other dependencies
Code Examples
- Loading a pre-trained MaskFormer model:
from detectron2.checkpoint import DetectionCheckpointer
from detectron2.config import get_cfg
from detectron2.modeling import build_model
from mask_former import add_mask_former_config

cfg = get_cfg()
add_mask_former_config(cfg)
cfg.merge_from_file("path/to/config.yaml")
# MaskFormer registers its architecture with Detectron2, so build_model
# constructs it from the config; the checkpointer loads pre-trained weights.
model = build_model(cfg)
DetectionCheckpointer(model).load(cfg.MODEL.WEIGHTS)
model.eval()
- Performing inference on an image:
import numpy as np
import torch
from PIL import Image

image = Image.open("path/to/image.jpg").convert("RGB")
# Detectron2 models take a list of dicts, each with a CHW float tensor;
# check cfg.INPUT.FORMAT, as many models expect BGR channel order.
inputs = torch.as_tensor(np.asarray(image).astype("float32").transpose(2, 0, 1))
with torch.no_grad():
    outputs = model([{"image": inputs}])
- Visualizing segmentation results:
from detectron2.data import MetadataCatalog
from detectron2.utils.visualizer import Visualizer

# Metadata (class names, colors) comes from the dataset the model was trained on.
metadata = MetadataCatalog.get(cfg.DATASETS.TRAIN[0])
v = Visualizer(np.asarray(image), metadata=metadata, scale=1.2)
out = v.draw_instance_predictions(outputs[0]["instances"].to("cpu"))
out.save("output.png")
Getting Started
- Install dependencies:
pip install torch torchvision
pip install 'git+https://github.com/facebookresearch/detectron2.git'
pip install 'git+https://github.com/facebookresearch/MaskFormer.git'
- Download a pre-trained model:
wget https://dl.fbaipublicfiles.com/maskformer/mask_former/coco/panoptic/maskformer_panoptic_R50_bs16_50ep/model_final_94dc52.pkl
- Run inference:
import cv2
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor
from mask_former import add_mask_former_config

cfg = get_cfg()
add_mask_former_config(cfg)
cfg.merge_from_file("configs/coco/panoptic-segmentation/maskformer_R50_bs16_50ep.yaml")
cfg.MODEL.WEIGHTS = "model_final_94dc52.pkl"
predictor = DefaultPredictor(cfg)

# DefaultPredictor expects a BGR NumPy image, e.g. from cv2.imread
image = cv2.imread("path/to/image.jpg")
outputs = predictor(image)
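For a panoptic checkpoint like the one above, the predictor returns Detectron2's standard panoptic output: a "panoptic_seg" entry holding a segment-id map and a list of per-segment records. A short sketch of reading it, assuming the model ran in panoptic mode:

# (H, W) tensor of segment ids, plus one metadata dict per segment.
panoptic_seg, segments_info = outputs["panoptic_seg"]
for segment in segments_info:
    print(segment["id"], segment["category_id"], segment["isthing"])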
Competitor Comparisons
Code release for "Masked-attention Mask Transformer for Universal Image Segmentation"
Pros of Mask2Former
- Improved performance on various segmentation tasks
- More efficient architecture with reduced computational complexity
- Better handling of multi-scale features and long-range dependencies
Cons of Mask2Former
- Potentially more complex to implement and fine-tune
- May require more training data to achieve optimal results
Code Comparison
MaskFormer:
class MaskFormer(nn.Module):
    def __init__(self, backbone, transformer, num_classes):
        super().__init__()
        self.backbone = backbone
        self.transformer = transformer
        self.class_embed = nn.Linear(hidden_dim, num_classes + 1)
Mask2Former:
class Mask2Former(nn.Module):
    def __init__(self, backbone, pixel_decoder, transformer_decoder, num_classes):
        super().__init__()
        self.backbone = backbone
        self.pixel_decoder = pixel_decoder
        self.transformer_decoder = transformer_decoder
        self.class_embed = nn.Linear(hidden_dim, num_classes + 1)
The main difference in the code is the introduction of a pixel_decoder and transformer_decoder in Mask2Former, which contributes to its improved performance and efficiency in handling multi-scale features and long-range dependencies.
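For intuition, the central change in Mask2Former is masked attention: each decoder layer's cross-attention only attends to the foreground of the mask predicted by the previous layer. A simplified sketch of that masking step, not the repository's code:

import torch

def masked_attention_logits(attn_logits, mask_probs, threshold=0.5):
    # attn_logits: (Q, HW) cross-attention scores for Q queries over HW pixels.
    # mask_probs: (Q, HW) mask probabilities from the previous decoder layer.
    blocked = mask_probs < threshold
    # Fall back to full attention for queries whose predicted mask is empty;
    # a softmax over an all -inf row would otherwise produce NaNs.
    blocked[blocked.all(dim=-1)] = False
    return attn_logits.masked_fill(blocked, float("-inf"))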
Detectron2 is a platform for object detection, segmentation and other visual recognition tasks.
Pros of Detectron2
- More comprehensive and versatile, supporting a wider range of computer vision tasks
- Larger community and more extensive documentation
- Modular design allows for easier customization and extension
Cons of Detectron2
- Steeper learning curve due to its complexity and extensive features
- Potentially slower inference time for specific tasks compared to specialized models
Code Comparison
MaskFormer:
# Illustrative, simplified API (not the repository's actual entry point)
from maskformer import MaskFormer
model = MaskFormer(num_classes=150)
outputs = model(images)
Detectron2:
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor
cfg = get_cfg()
cfg.merge_from_file("path/to/config.yaml")
predictor = DefaultPredictor(cfg)
outputs = predictor(image)
Summary
Detectron2 is a more comprehensive computer vision library, offering a wide range of models and tasks. It has a larger community and more extensive documentation, making it suitable for various projects. However, its complexity can lead to a steeper learning curve.
MaskFormer, on the other hand, is more specialized, focusing on segmentation through a single mask-classification architecture that covers semantic, instance, and panoptic tasks. It may offer a simpler implementation for specific use cases and potentially faster inference for certain tasks.
The choice between the two depends on the project requirements, desired flexibility, and the specific computer vision tasks at hand.
Mask R-CNN for object detection and instance segmentation on Keras and TensorFlow
Pros of Mask_RCNN
- Well-established and widely adopted in the computer vision community
- Extensive documentation and community support
- Suitable for a wide range of instance segmentation tasks
Cons of Mask_RCNN
- Generally slower inference time compared to MaskFormer
- May struggle with complex scenes or overlapping objects
- Limited flexibility in handling different segmentation paradigms
Code Comparison
MaskFormer:
# Hugging Face Transformers-style MaskFormer API
outputs = model(pixel_values=pixel_values)
class_queries_logits = outputs.class_queries_logits
masks_queries_logits = outputs.masks_queries_logits
Mask_RCNN:
results = model.detect([image], verbose=0)
r = results[0]
masks = r['masks']
class_ids = r['class_ids']
MaskFormer uses a unified approach for various segmentation tasks, while Mask_RCNN is specifically designed for instance segmentation. MaskFormer's code is more concise and flexible, allowing for easier adaptation to different segmentation paradigms. Mask_RCNN's code is more explicit in separating masks and class IDs, which can be beneficial for certain applications but may require more post-processing for others.
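As a concrete example of that post-processing, a hypothetical helper to flatten matterport-style outputs, an (H, W, N) boolean masks array and an (N,) class_ids array, into a single per-pixel label map could look like this:

import numpy as np

def instances_to_label_map(masks, class_ids):
    # masks: (H, W, N) boolean array; class_ids: (N,) integer array.
    # Returns an (H, W) map of class ids, with 0 left as background.
    label_map = np.zeros(masks.shape[:2], dtype=np.int32)
    for i in range(masks.shape[-1]):   # later instances overwrite earlier ones
        label_map[masks[:, :, i]] = class_ids[i]
    return label_map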
OpenMMLab Detection Toolbox and Benchmark
Pros of mmdetection
- Broader scope: Supports a wide range of object detection, instance segmentation, and panoptic segmentation algorithms
- Extensive documentation and tutorials: Offers comprehensive guides for various tasks and model configurations
- Active community: Regular updates, bug fixes, and contributions from a large user base
Cons of mmdetection
- Steeper learning curve: More complex codebase due to its extensive feature set
- Potentially slower inference: May have higher overhead for simple tasks compared to MaskFormer's focused approach
Code Comparison
MaskFormer:
outputs = model(images)
mask_cls_results = outputs["pred_logits"]
mask_pred_results = outputs["pred_masks"]
mmdetection:
results = model(return_loss=False, rescale=True, **data)
bbox_results, mask_results = results[0]  # per-image (bbox, mask) result pair
MaskFormer focuses on a unified approach for segmentation tasks, while mmdetection provides a more comprehensive toolkit for various detection and segmentation algorithms. MaskFormer's code is generally more straightforward for its specific use case, while mmdetection offers more flexibility at the cost of increased complexity.
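In practice, most mmdetection users go through its high-level API rather than calling the model directly; a typical mmdetection 2.x pattern looks like the following, with illustrative config and checkpoint paths:

from mmdet.apis import init_detector, inference_detector

model = init_detector("configs/mask_rcnn/mask_rcnn_r50_fpn_1x_coco.py",
                      "checkpoints/mask_rcnn_r50_fpn_1x_coco.pth",
                      device="cuda:0")
result = inference_detector(model, "demo.jpg")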
End-to-End Object Detection with Transformers
Pros of DETR
- Pioneered end-to-end object detection using transformers
- Simpler architecture with no need for anchor boxes or non-maximum suppression
- Versatile framework adaptable to various tasks beyond object detection
Cons of DETR
- Generally slower convergence compared to MaskFormer
- May struggle with small object detection
- Higher computational complexity, especially for large images
Code Comparison
DETR:
class DETR(nn.Module):
def __init__(self, backbone, transformer, num_classes, num_queries):
super().__init__()
self.backbone = backbone
self.transformer = transformer
self.class_embed = nn.Linear(hidden_dim, num_classes + 1)
self.bbox_embed = MLP(hidden_dim, hidden_dim, 4, 3)
MaskFormer:
class MaskFormer(nn.Module):
    def __init__(self, backbone, transformer, num_classes, num_queries):
        super().__init__()
        self.backbone = backbone
        self.transformer = transformer
        self.class_embed = nn.Linear(hidden_dim, num_classes + 1)
        self.mask_embed = MLP(hidden_dim, hidden_dim, hidden_dim, 3)
Both repositories share a similar high-level structure, combining a backbone with a transformer. The main difference lies in their output heads: DETR regresses bounding boxes per query, while MaskFormer produces a mask embedding per query for segmentation, as the sketch below illustrates.
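The mask_embed output is dotted with per-pixel features to turn each query embedding into a full-resolution mask. A minimal sketch of that final step, with illustrative shapes and names:

import torch

Q, D, H, W = 100, 256, 64, 64            # queries, embedding dim, mask size
mask_embed = torch.randn(Q, D)           # one embedding per query
pixel_features = torch.randn(D, H, W)    # per-pixel embeddings from the pixel decoder

# Each query's mask logits are the dot product of its embedding with
# the feature vector at every pixel location.
mask_logits = torch.einsum("qd,dhw->qhw", mask_embed, pixel_features)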
FAIR's research platform for object detection research, implementing popular algorithms like Mask R-CNN and RetinaNet.
Pros of Detectron
- More established and mature project with a longer history
- Broader range of object detection and segmentation models
- Extensive documentation and community support
Cons of Detectron
- Less focused on specific tasks like panoptic segmentation
- May have more complexity for users only interested in certain tasks
- Slower development cycle compared to newer projects
Code Comparison
MaskFormer:
# Illustrative, simplified API (not the repository's actual entry point)
from maskformer import MaskFormer
model = MaskFormer(num_classes=150)
outputs = model(images)
Detectron:
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
predictor = DefaultPredictor(cfg)
outputs = predictor(image)
MaskFormer offers a simpler API for specific tasks, while Detectron provides a more comprehensive framework with additional configuration options. MaskFormer focuses on unified segmentation tasks, whereas Detectron covers a broader range of computer vision tasks. Both projects come from Facebook Research, but Detectron has since been superseded by Detectron2, and MaskFormer represents a more recent approach to segmentation problems.
README
MaskFormer: Per-Pixel Classification is Not All You Need for Semantic Segmentation
Bowen Cheng, Alexander G. Schwing, Alexander Kirillov
Mask2Former
Check out Mask2Former, a universal architecture based on the MaskFormer meta-architecture that achieves state-of-the-art results on panoptic, instance, and semantic segmentation across four popular datasets (ADE20K, Cityscapes, COCO, Mapillary Vistas).
Features
- Better results while being more efficient.
- Unified view of semantic- and instance-level segmentation tasks.
- Supports major semantic segmentation datasets: ADE20K, Cityscapes, COCO-Stuff, Mapillary Vistas.
- Supports ALL Detectron2 models.
Installation
See installation instructions.
Getting Started
See Preparing Datasets for MaskFormer.
See Getting Started with MaskFormer.
Model Zoo and Baselines
We provide a large set of baseline results and trained models available for download in the MaskFormer Model Zoo.
License
The majority of MaskFormer is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
However, portions of the project are available under separate license terms: Swin-Transformer-Semantic-Segmentation is licensed under the MIT license.
Citing MaskFormer
If you use MaskFormer in your research or wish to refer to the baseline results published in the Model Zoo, please use the following BibTeX entry.
@inproceedings{cheng2021maskformer,
  title={Per-Pixel Classification is Not All You Need for Semantic Segmentation},
  author={Bowen Cheng and Alexander G. Schwing and Alexander Kirillov},
  booktitle={NeurIPS},
  year={2021}
}