facebookresearch / MaskFormer

Per-Pixel Classification is Not All You Need for Semantic Segmentation (NeurIPS 2021, spotlight)


Top Related Projects

Code release for "Masked-attention Mask Transformer for Universal Image Segmentation"

Detectron2 is a platform for object detection, segmentation and other visual recognition tasks.

Mask R-CNN for object detection and instance segmentation on Keras and TensorFlow

OpenMMLab Detection Toolbox and Benchmark

End-to-End Object Detection with Transformers

FAIR's research platform for object detection research, implementing popular algorithms like Mask R-CNN and RetinaNet.

Quick Overview

MaskFormer is a unified framework for image segmentation tasks developed by Facebook Research. It combines the strengths of mask classification and pixel classification approaches, offering a flexible and efficient solution for various segmentation problems, including panoptic, instance, and semantic segmentation.
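Under mask classification, the model predicts a set of (mask, class) pairs, and task-specific outputs fall out of a simple decoding step. A sketch of the semantic-segmentation decoding from the paper, with illustrative tensor shapes standing in for real model outputs:

import torch

# Q = 100 queries, 150 classes plus one trailing "no object" class, 256x256 masks
mask_cls = torch.randn(100, 151)        # per-query class logits
mask_pred = torch.randn(100, 256, 256)  # per-query binary mask logits

# Weight each query's mask by its class probabilities (dropping "no object"),
# then sum over queries to obtain per-pixel class scores
semseg = torch.einsum("qc,qhw->chw",
                      mask_cls.softmax(-1)[..., :-1],
                      mask_pred.sigmoid())
labels = semseg.argmax(0)  # (H, W) semantic prediction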

Pros

  • Versatile: Handles multiple segmentation tasks with a single architecture
  • State-of-the-art performance: Achieves competitive results across various benchmarks
  • Efficient: mask classification with a small, fixed set of queries scales better to large class vocabularies than dense per-pixel prediction
  • Easy integration: Can be incorporated into existing segmentation pipelines

Cons

  • Computational complexity: May require significant computational resources for training and inference
  • Learning curve: Understanding and implementing the architecture might be challenging for beginners
  • Limited pre-trained models: Fewer pre-trained models available compared to some other segmentation frameworks
  • Dependency on specific libraries: Requires specific versions of PyTorch and other dependencies

Code Examples

  1. Loading a pre-trained MaskFormer model:
from detectron2.config import get_cfg
from detectron2.modeling import build_model
from detectron2.projects.deeplab import add_deeplab_config
from mask_former import add_mask_former_config

cfg = get_cfg()
add_deeplab_config(cfg)  # MaskFormer configs build on the DeepLab config schema
add_mask_former_config(cfg)
cfg.merge_from_file("path/to/config.yaml")
model = build_model(cfg)  # MaskFormer registers itself as a detectron2 meta-architecture
model.eval()
  2. Performing inference on an image:
import numpy as np
import torch
from PIL import Image

image = Image.open("path/to/image.jpg").convert("RGB")
# detectron2 models take a list of dicts holding CHW float tensors
inputs = torch.as_tensor(np.asarray(image).astype("float32").transpose(2, 0, 1))
with torch.no_grad():
    outputs = model([{"image": inputs, "height": image.height, "width": image.width}])
  3. Visualizing segmentation results:
from detectron2.data import MetadataCatalog
from detectron2.utils.visualizer import Visualizer

metadata = MetadataCatalog.get(cfg.DATASETS.TEST[0])
v = Visualizer(np.asarray(image), metadata=metadata, scale=1.2)
out = v.draw_instance_predictions(outputs[0]["instances"].to("cpu"))
out.save("output.png")

Getting Started

  1. Install dependencies (MaskFormer itself is not pip-installable; clone the repo):
pip install torch torchvision
pip install 'git+https://github.com/facebookresearch/detectron2.git'
git clone https://github.com/facebookresearch/MaskFormer.git
cd MaskFormer && pip install -r requirements.txt
  2. Download a pre-trained model:
wget https://dl.fbaipublicfiles.com/maskformer/mask_former/coco/panoptic/maskformer_panoptic_R50_bs16_50ep/model_final_94dc52.pkl
  3. Run inference (from the MaskFormer repo root, so mask_former is importable):
import cv2
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor
from detectron2.projects.deeplab import add_deeplab_config
from mask_former import add_mask_former_config

cfg = get_cfg()
add_deeplab_config(cfg)
add_mask_former_config(cfg)
cfg.merge_from_file("configs/coco/panoptic-segmentation/maskformer_R50_bs16_50ep.yaml")
cfg.MODEL.WEIGHTS = "model_final_94dc52.pkl"
predictor = DefaultPredictor(cfg)

# DefaultPredictor expects a BGR NumPy image, e.g. from cv2.imread
image = cv2.imread("path/to/image.jpg")
outputs = predictor(image)
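For a panoptic config like this one, predictions follow detectron2's standard output format, so the result can be unpacked as:

panoptic_seg, segments_info = outputs["panoptic_seg"]  # segment id map + per-segment info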

Competitor Comparisons

Code release for "Masked-attention Mask Transformer for Universal Image Segmentation"

Pros of Mask2Former

  • Improved performance on various segmentation tasks
  • More efficient architecture with reduced computational complexity
  • Better handling of multi-scale features and long-range dependencies

Cons of Mask2Former

  • Potentially more complex to implement and fine-tune
  • May require more training data to achieve optimal results

Code Comparison

MaskFormer:

class MaskFormer(nn.Module):
    def __init__(self, backbone, transformer, num_classes):
        super().__init__()
        self.backbone = backbone
        self.transformer = transformer
        hidden_dim = transformer.d_model
        self.class_embed = nn.Linear(hidden_dim, num_classes + 1)

Mask2Former:

class Mask2Former(nn.Module):
    def __init__(self, backbone, pixel_decoder, transformer_decoder, num_classes):
        super().__init__()
        self.backbone = backbone
        self.pixel_decoder = pixel_decoder
        self.transformer_decoder = transformer_decoder
        hidden_dim = transformer_decoder.d_model
        self.class_embed = nn.Linear(hidden_dim, num_classes + 1)

The main difference in the code is the introduction of a pixel_decoder and transformer_decoder in Mask2Former, which contributes to its improved performance and efficiency in handling multi-scale features and long-range dependencies.
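The transformer_decoder is also where Mask2Former applies its signature masked attention: each query cross-attends only to the pixels inside its own mask prediction from the previous layer. A simplified single-head sketch (function and tensor names are illustrative, not the repo's API):

import torch
import torch.nn.functional as F

def masked_cross_attention(q, k, v, mask_logits):
    # q: (Q, d) queries; k, v: (HW, d) flattened pixel features;
    # mask_logits: (Q, HW) mask prediction from the previous decoder layer
    attn = q @ k.t() / q.shape[-1] ** 0.5
    fg = mask_logits.sigmoid() >= 0.5
    # A query whose mask is entirely background attends everywhere instead,
    # avoiding a softmax over an all-"-inf" row
    fg = fg | ~fg.any(dim=-1, keepdim=True)
    attn = attn.masked_fill(~fg, float("-inf"))
    return F.softmax(attn, dim=-1) @ v  # (Q, d) updated query features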

Detectron2 is a platform for object detection, segmentation and other visual recognition tasks.

Pros of Detectron2

  • More comprehensive and versatile, supporting a wider range of computer vision tasks
  • Larger community and more extensive documentation
  • Modular design allows for easier customization and extension

Cons of Detectron2

  • Steeper learning curve due to its complexity and extensive features
  • Potentially slower inference time for specific tasks compared to specialized models

Code Comparison

MaskFormer:

from detectron2.config import get_cfg
from detectron2.modeling import build_model
from mask_former import add_mask_former_config

cfg = get_cfg()
add_mask_former_config(cfg)
cfg.merge_from_file("path/to/maskformer_config.yaml")
model = build_model(cfg)

Detectron2:

from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

cfg = get_cfg()
cfg.merge_from_file("path/to/config.yaml")
predictor = DefaultPredictor(cfg)
outputs = predictor(image)

Summary

Detectron2 is a more comprehensive computer vision library, offering a wide range of models and tasks. It has a larger community and more extensive documentation, making it suitable for various projects. However, its complexity can lead to a steeper learning curve.

MaskFormer, on the other hand, is more specialized, focusing on mask-classification-based segmentation (semantic, panoptic, and instance). It may offer a simpler setup for those specific use cases and potentially faster inference for certain tasks.

The choice between the two depends on the project requirements, desired flexibility, and the specific computer vision tasks at hand.

24,600

Mask R-CNN for object detection and instance segmentation on Keras and TensorFlow

Pros of Mask_RCNN

  • Well-established and widely adopted in the computer vision community
  • Extensive documentation and community support
  • Suitable for a wide range of instance segmentation tasks

Cons of Mask_RCNN

  • Generally slower inference time compared to MaskFormer
  • May struggle with complex scenes or overlapping objects
  • Limited flexibility in handling different segmentation paradigms

Code Comparison

MaskFormer:

# Via the Hugging Face transformers port of MaskFormer
outputs = model(pixel_values=pixel_values)
class_queries_logits = outputs.class_queries_logits
masks_queries_logits = outputs.masks_queries_logits

Mask_RCNN:

results = model.detect([image], verbose=0)
r = results[0]
masks = r['masks']
class_ids = r['class_ids']

MaskFormer uses a unified approach for various segmentation tasks, while Mask_RCNN is specifically designed for instance segmentation. MaskFormer's code is more concise and flexible, allowing for easier adaptation to different segmentation paradigms. Mask_RCNN's code is more explicit in separating masks and class IDs, which can be beneficial for certain applications but may require more post-processing for others.
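For instance, collapsing Mask_RCNN's per-instance output into a single semantic label map takes an explicit loop. A hypothetical helper, assuming matterport-style (H, W, N) boolean masks:

import numpy as np

def instances_to_semantic(masks, class_ids):
    # masks: (H, W, N) boolean array; class_ids: (N,) integer labels
    semantic = np.zeros(masks.shape[:2], dtype=np.int64)  # 0 = background
    for i in range(masks.shape[-1]):
        semantic[masks[..., i]] = class_ids[i]  # later instances overwrite earlier
    return semantic

semantic_map = instances_to_semantic(r['masks'], r['class_ids'])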

OpenMMLab Detection Toolbox and Benchmark

Pros of mmdetection

  • Broader scope: Supports a wide range of object detection, instance segmentation, and panoptic segmentation algorithms
  • Extensive documentation and tutorials: Offers comprehensive guides for various tasks and model configurations
  • Active community: Regular updates, bug fixes, and contributions from a large user base

Cons of mmdetection

  • Steeper learning curve: More complex codebase due to its extensive feature set
  • Potentially slower inference: May have higher overhead for simple tasks compared to MaskFormer's focused approach

Code Comparison

MaskFormer:

outputs = model(images)
mask_cls_results = outputs["pred_logits"]
mask_pred_results = outputs["pred_masks"]

mmdetection:

results = model(return_loss=False, rescale=True, **data)
bbox_results, mask_results = results[:2]

MaskFormer focuses on a unified approach for segmentation tasks, while mmdetection provides a more comprehensive toolkit for various detection and segmentation algorithms. MaskFormer's code is generally more straightforward for its specific use case, while mmdetection offers more flexibility at the cost of increased complexity.
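mmdetection also provides a higher-level API that wraps the raw model call shown above; a minimal sketch against the mmdet 2.x API (config and checkpoint paths are placeholders):

from mmdet.apis import inference_detector, init_detector

# Build a Mask R-CNN from a repo config plus a downloaded checkpoint
model = init_detector("configs/mask_rcnn/mask_rcnn_r50_fpn_1x_coco.py",
                      "mask_rcnn_checkpoint.pth", device="cuda:0")
result = inference_detector(model, "demo.jpg")  # (bbox_results, mask_results)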

13,458

End-to-End Object Detection with Transformers

Pros of DETR

  • Pioneered end-to-end object detection using transformers
  • Simpler architecture with no need for anchor boxes or non-maximum suppression
  • Versatile framework adaptable to various tasks beyond object detection

Cons of DETR

  • Generally slower convergence compared to MaskFormer
  • May struggle with small object detection
  • Higher computational complexity, especially for large images

Code Comparison

DETR:

class DETR(nn.Module):
    def __init__(self, backbone, transformer, num_classes, num_queries):
        super().__init__()
        self.backbone = backbone
        self.transformer = transformer
        hidden_dim = transformer.d_model
        self.class_embed = nn.Linear(hidden_dim, num_classes + 1)
        self.bbox_embed = MLP(hidden_dim, hidden_dim, 4, 3)

MaskFormer:

class MaskFormer(nn.Module):
    def __init__(self, backbone, transformer, num_classes, num_queries):
        super().__init__()
        self.backbone = backbone
        self.transformer = transformer
        hidden_dim = transformer.d_model
        self.class_embed = nn.Linear(hidden_dim, num_classes + 1)
        self.mask_embed = MLP(hidden_dim, hidden_dim, hidden_dim, 3)

Both repositories share similar high-level structures, utilizing a backbone and transformer. The main difference lies in their output heads: DETR focuses on bounding box prediction, while MaskFormer emphasizes mask generation for instance segmentation.
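Concretely, bbox_embed regresses four box coordinates per query, while mask_embed is dotted against per-pixel embeddings to produce one mask logit map per query (the einsum below mirrors MaskFormer's forward pass); all tensors here are random stand-ins for real features:

import torch

B, Q, d, H, W = 2, 100, 256, 64, 64
mask_embed = torch.randn(B, Q, d)      # per-query mask embeddings from the MLP
pixel_feats = torch.randn(B, d, H, W)  # per-pixel embeddings from the pixel decoder

# One mask logit map per query, via a dot product at every pixel location
pred_masks = torch.einsum("bqc,bchw->bqhw", mask_embed, pixel_feats)
print(pred_masks.shape)  # torch.Size([2, 100, 64, 64])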

26,250

FAIR's research platform for object detection research, implementing popular algorithms like Mask R-CNN and RetinaNet.

Pros of Detectron

  • More established and mature project with a longer history
  • Broader range of object detection and segmentation models
  • Extensive documentation and community support

Cons of Detectron

  • Less focused on specific tasks like panoptic segmentation
  • May have more complexity for users only interested in certain tasks
  • Slower development cycle compared to newer projects

Code Comparison

MaskFormer:

from detectron2.config import get_cfg
from detectron2.modeling import build_model
from mask_former import add_mask_former_config

cfg = get_cfg()
add_mask_former_config(cfg)
cfg.merge_from_file("path/to/maskformer_config.yaml")
model = build_model(cfg)

Detectron:

# The original Detectron (Caffe2) is driven by CLI scripts; the equivalent
# Python API of its successor, detectron2, is shown here
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
predictor = DefaultPredictor(cfg)
outputs = predictor(image)

MaskFormer builds on detectron2 with a focused set of configs for mask-classification segmentation, while Detectron provides a more comprehensive framework with additional configuration options. MaskFormer focuses on unified segmentation tasks, whereas Detectron covers a broader range of computer vision tasks. Both projects are maintained by Facebook Research, but MaskFormer represents a more recent approach to segmentation problems.

README

MaskFormer: Per-Pixel Classification is Not All You Need for Semantic Segmentation

Bowen Cheng, Alexander G. Schwing, Alexander Kirillov

[arXiv] [Project] [BibTeX]


Mask2Former

Check out Mask2Former, a universal architecture based on the MaskFormer meta-architecture that achieves SOTA on panoptic, instance and semantic segmentation across four popular datasets (ADE20K, Cityscapes, COCO, Mapillary Vistas).

Features

  • Better results while being more efficient.
  • Unified view of semantic- and instance-level segmentation tasks.
  • Supports major semantic segmentation datasets: ADE20K, Cityscapes, COCO-Stuff, Mapillary Vistas.
  • Supports ALL Detectron2 models.

Installation

See installation instructions.

Getting Started

See Preparing Datasets for MaskFormer.

See Getting Started with MaskFormer.

Model Zoo and Baselines

We provide a large set of baseline results and trained models available for download in the MaskFormer Model Zoo.
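A downloaded checkpoint can be loaded into a model built from the matching config via detectron2's checkpointer; a minimal sketch (filename taken from the Getting Started section above):

from detectron2.checkpoint import DetectionCheckpointer

# `model` must be built from the same config the weights were trained with
DetectionCheckpointer(model).load("model_final_94dc52.pkl")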

License

The majority of MaskFormer is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

However, portions of the project are available under separate license terms: Swin-Transformer-Semantic-Segmentation is licensed under the MIT license.

Citing MaskFormer

If you use MaskFormer in your research or wish to refer to the baseline results published in the Model Zoo, please use the following BibTeX entry.

@inproceedings{cheng2021maskformer,
  title={Per-Pixel Classification is Not All You Need for Semantic Segmentation},
  author={Bowen Cheng and Alexander G. Schwing and Alexander Kirillov},
  booktitle={NeurIPS},
  year={2021}
}