MobileSAM
This is the official code for the MobileSAM project, which makes SAM lightweight for mobile applications and beyond!
Top Related Projects
The repository provides code for running inference with the SegmentAnything Model (SAM), links for downloading the trained model checkpoints, and example notebooks that show how to use the model.
Segment Anything in High Quality [NeurIPS 2023]
Grounded SAM: Marrying Grounding DINO with Segment Anything & Stable Diffusion & Recognize Anything - Automatically Detect, Segment, and Generate Anything
EfficientSAM: Leveraged Masked Image Pretraining for Efficient Segment Anything
Track-Anything is a flexible and interactive tool for video object tracking and segmentation, based on Segment Anything, XMem, and E2FGVI.
Quick Overview
MobileSAM is a lightweight and efficient version of the Segment Anything Model (SAM), designed for mobile and edge devices. It aims to provide fast and accurate image segmentation capabilities while significantly reducing model size and computational requirements.
Pros
- Dramatically reduced model size (from 2.7GB to 39MB) without significant loss in performance
- Faster inference speed, making it suitable for real-time applications on mobile devices
- Maintains high-quality segmentation results comparable to the original SAM
- Easy integration with existing SAM-based projects
Cons
- May have slightly lower accuracy compared to the full SAM model in some complex scenarios
- Limited to image segmentation tasks, not a general-purpose vision model
- Requires careful fine-tuning for specific use cases to achieve optimal performance
- Relatively new project, which may lead to potential stability issues or lack of extensive community support
Code Examples
- Loading the MobileSAM model:
from mobile_sam import sam_model_registry, SamPredictor
model_type = "vit_t"
sam_checkpoint = "mobile_sam.pt"
mobile_sam = sam_model_registry[model_type](checkpoint=sam_checkpoint)
predictor = SamPredictor(mobile_sam)
- Generating masks from an image:
import numpy as np
from PIL import Image
image = np.array(Image.open("example_image.jpg"))
predictor.set_image(image)
input_point = np.array([[500, 375]])
input_label = np.array([1])
masks, _, _ = predictor.predict(
    point_coords=input_point,
    point_labels=input_label,
    multimask_output=True,
)
- Visualizing the segmentation results:
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 10))
plt.imshow(image)
for mask in masks:
    show_mask(mask, plt.gca(), random_color=True)
show_points(input_point, input_label, plt.gca())
plt.axis('off')
plt.show()
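Note that show_mask and show_points are not part of the mobile_sam package; they are display helpers in the style of the official SAM example notebooks. A minimal sketch (names and styling are illustrative):
import numpy as np
import matplotlib.pyplot as plt

def show_mask(mask, ax, random_color=False):
    # Overlay a binary mask on the given matplotlib axes.
    if random_color:
        color = np.concatenate([np.random.random(3), np.array([0.6])])
    else:
        color = np.array([30 / 255, 144 / 255, 255 / 255, 0.6])
    h, w = mask.shape[-2:]
    ax.imshow(mask.reshape(h, w, 1) * color.reshape(1, 1, -1))

def show_points(coords, labels, ax, marker_size=375):
    # Foreground points (label 1) in green, background points (label 0) in red.
    pos = coords[labels == 1]
    neg = coords[labels == 0]
    ax.scatter(pos[:, 0], pos[:, 1], color='green', marker='*', s=marker_size, edgecolor='white')
    ax.scatter(neg[:, 0], neg[:, 1], color='red', marker='*', s=marker_size, edgecolor='white')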
Getting Started
- Clone the repository:
git clone https://github.com/ChaoningZhang/MobileSAM.git
cd MobileSAM
- Install dependencies:
pip install -r requirements.txt
- Download the pre-trained model:
wget https://github.com/ChaoningZhang/MobileSAM/releases/download/v1.0/mobile_sam.pt
- Run the demo script:
python demo/demo.py
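To confirm the setup, a quick sanity check can load the checkpoint and count parameters (a sketch, assuming the package is importable and the checkpoint sits in the current directory):
from mobile_sam import sam_model_registry

sam = sam_model_registry["vit_t"](checkpoint="mobile_sam.pt")
n_params = sum(p.numel() for p in sam.parameters())
print(f"Loaded MobileSAM with {n_params / 1e6:.1f}M parameters")  # roughly 10M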
Competitor Comparisons
The repository provides code for running inference with the SegmentAnything Model (SAM), links for downloading the trained model checkpoints, and example notebooks that show how to use the model.
Pros of segment-anything
- More comprehensive and feature-rich, offering a wider range of segmentation capabilities
- Backed by Facebook/Meta, potentially leading to better long-term support and updates
- Includes pre-trained models and extensive documentation
Cons of segment-anything
- Larger model size, requiring more computational resources
- Slower inference time, which may not be suitable for real-time applications
- More complex to integrate and use, especially for simpler projects
Code Comparison
segment-anything:
from segment_anything import sam_model_registry, SamPredictor
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)
predictor.set_image(image)
masks, _, _ = predictor.predict(point_coords=input_point, point_labels=input_label)
MobileSAM:
from mobile_sam import sam_model_registry, SamPredictor
mobile_sam = sam_model_registry["vit_t"](checkpoint="mobile_sam.pt")
predictor = SamPredictor(mobile_sam)
predictor.set_image(image)
masks, _, _ = predictor.predict(point_coords=input_point, point_labels=input_label)
The code structure is similar, but MobileSAM uses a smaller model ("vit_t") and a different checkpoint file, resulting in faster inference and smaller model size.
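Because the interfaces match, a project can switch between the two behind a single flag. A hedged sketch (the loader function and checkpoint paths are assumptions, not part of either library):
def load_predictor(use_mobile: bool):
    # Both packages expose the same registry/predictor interface,
    # so the calling code does not need to change.
    if use_mobile:
        from mobile_sam import sam_model_registry, SamPredictor
        model = sam_model_registry["vit_t"](checkpoint="mobile_sam.pt")
    else:
        from segment_anything import sam_model_registry, SamPredictor
        model = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
    return SamPredictor(model)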
Segment Anything in High Quality [NeurIPS 2023]
Pros of sam-hq
- Higher segmentation quality, especially for fine details and complex objects
- Supports both automatic and interactive segmentation modes
- Includes pre-trained models for various resolutions (512x512, 1024x1024)
Cons of sam-hq
- Larger model size, requiring more computational resources
- Potentially slower inference time compared to MobileSAM
- May have higher memory requirements for processing high-resolution images
Code Comparison
sam-hq:
predictor = SamPredictor(sam)
predictor.set_image(image)
masks, _, _ = predictor.predict(point_coords=input_point, point_labels=input_label)
MobileSAM:
predictor = SamPredictor(mobile_sam)
predictor.set_image(image)
masks, _, _ = predictor.predict(point_coords=input_point, point_labels=input_label)
The code usage for both repositories is very similar, with the main difference being the initialization of the predictor with either the standard SAM model or the MobileSAM model. This allows for easy integration and switching between the two implementations in existing projects.
Grounded SAM: Marrying Grounding DINO with Segment Anything & Stable Diffusion & Recognize Anything - Automatically Detect, Segment, and Generate Anything
Pros of Grounded-Segment-Anything
- Integrates grounding DINO for object detection, enhancing segmentation accuracy
- Supports text-to-box and text-to-mask functionalities
- Offers more versatile applications in various computer vision tasks
Cons of Grounded-Segment-Anything
- Requires more computational resources due to additional components
- May have a steeper learning curve for implementation and customization
- Potentially slower inference time compared to MobileSAM
Code Comparison
MobileSAM:
from mobile_sam import sam_model_registry, SamPredictor
model = sam_model_registry["vit_t"](checkpoint="mobile_sam.pt")
predictor = SamPredictor(model)
predictor.set_image(image)
masks, _, _ = predictor.predict(point_coords=point_coords, point_labels=point_labels)
Grounded-Segment-Anything:
from groundingdino.util.inference import load_model, load_image, predict
from segment_anything import sam_model_registry, SamPredictor
# load_model takes a model config path and a checkpoint path
grounding_dino_model = load_model("GroundingDINO_SwinT_OGC.py", "groundingdino_swint_ogc.pth")
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)
boxes, logits, phrases = predict(grounding_dino_model, image, text_prompt, box_threshold=0.35, text_threshold=0.25)
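The detected boxes can then be handed to the SAM predictor to complete the text-to-mask pipeline. A sketch, assuming the box has already been converted from GroundingDINO's normalized cxcywh format to absolute xyxy pixel coordinates:
# Hedged continuation of the snippet above. Assumes `image` is an HxWx3 RGB
# numpy array and `box_xyxy` is one detected box in absolute xyxy pixel
# coordinates (GroundingDINO returns normalized cxcywh boxes).
predictor.set_image(image)
masks, scores, _ = predictor.predict(box=box_xyxy, multimask_output=False)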
EfficientSAM: Leveraged Masked Image Pretraining for Efficient Segment Anything
Pros of EfficientSAM
- Achieves faster inference speed on mobile devices
- Utilizes a more lightweight architecture, reducing model size
- Implements efficient attention mechanisms for improved performance
Cons of EfficientSAM
- May have slightly lower accuracy compared to MobileSAM in some scenarios
- Less extensive documentation and community support
- Fewer pre-trained models available for different use cases
Code Comparison
MobileSAM:
from mobile_sam import sam_model_registry, SamPredictor
model = sam_model_registry["vit_t"](checkpoint="mobile_sam.pt")
predictor = SamPredictor(model)
predictor.set_image(image)
masks, _, _ = predictor.predict(point_coords=input_point, point_labels=input_label)
EfficientSAM:
from efficient_sam import EfficientSamPredictor
predictor = EfficientSamPredictor(model)
masks = predictor.generate(image)
Both repositories aim to provide efficient implementations of the Segment Anything Model (SAM) for mobile devices. MobileSAM focuses on optimizing the original SAM architecture, while EfficientSAM introduces novel techniques to further reduce computational requirements. The code comparison shows similar usage patterns, with slight differences in method names and initialization. EfficientSAM's approach may offer better performance on resource-constrained devices, but MobileSAM might provide a more balanced trade-off between accuracy and efficiency.
Track-Anything is a flexible and interactive tool for video object tracking and segmentation, based on Segment Anything, XMem, and E2FGVI.
Pros of Track-Anything
- Offers video object tracking and segmentation capabilities
- Provides an interactive web interface for easy use
- Supports multiple object tracking in a single video
Cons of Track-Anything
- May require more computational resources due to video processing
- Potentially slower processing times for longer videos
- Limited to tracking objects in video content only
Code Comparison
Track-Anything:
from track_anything import TrackingAnything
tracker = TrackingAnything()
tracker.track(video_path, initial_mask)
MobileSAM:
from mobile_sam import sam_model_registry, SamPredictor
mobile_sam = sam_model_registry["vit_t"](checkpoint="mobile_sam.pt")
predictor = SamPredictor(mobile_sam)
predictor.set_image(image)
masks, _, _ = predictor.predict(point_coords=input_point, point_labels=input_label)
Summary
Track-Anything focuses on video object tracking and segmentation with an interactive interface, while MobileSAM is designed for efficient image segmentation on mobile devices. Track-Anything offers more comprehensive video analysis features but may require more resources. MobileSAM provides faster, lightweight image segmentation suitable for mobile applications but lacks video tracking capabilities. The choice between the two depends on the specific use case, whether it involves video analysis or quick image segmentation on resource-constrained devices.
README
Faster Segment Anything (MobileSAM) and Everything (MobileSAMv2)
:pushpin: MobileSAMv2, available at ResearchGate and arXiv, replaces the grid-search prompt sampling in SAM with object-aware prompt sampling for faster segment everything (SegEvery).
:pushpin: MobileSAM, available at ResearchGate and arXiv, replaces the heavyweight image encoder in SAM with a lightweight image encoder for faster segment anything (SegAny).
Support for ONNX model export. Feel free to test it on your devices and share your results with us.
A demo of MobileSAM running on CPU is open at the Hugging Face demo. On our own Mac i5 CPU, it takes around 3s. On the Hugging Face demo, the interface and weaker CPUs make it slower, but it still works well. Stay tuned for a new version with more features! You can also run a demo of MobileSAM on your local PC.
:grapes: Media coverage and Projects that adapt from SAM to MobileSAM (Thank you all!)
- 2023/07/03: joliGEN supports MobileSAM for faster and lightweight mask refinement for image inpainting with Diffusion and GAN.
- 2023/07/03: MobileSAM-in-the-Browser shows a demo of running MobileSAM in the browser on your local PC or mobile phone.
- 2023/07/02: Inpaint-Anything supports MobileSAM for faster and lightweight Inpaint Anything.
- 2023/07/02: Personalize-SAM supports MobileSAM for faster and lightweight Personalize Segment Anything with 1 Shot.
- 2023/07/01: MobileSAM-in-the-Browser makes an example implementation of MobileSAM in the browser.
- 2023/06/30: SegmentAnythingin3D supports MobileSAM to segment anything in 3D efficiently.
- 2023/06/30: MobileSAM has been featured by AK for the second time; see AK's MobileSAM tweet. You are welcome to retweet.
- 2023/06/29: AnyLabeling supports MobileSAM for auto-labeling.
- 2023/06/29: SonarSAM supports MobileSAM for full fine-tuning of the image encoder.
- 2023/06/29: Stable Diffusion WebUI supports MobileSAM.
- 2023/06/28: Grounding-SAM supports MobileSAM with Grounded-MobileSAM.
- 2023/06/27: MobileSAM has been featured by AK; see AK's MobileSAM tweet. You are welcome to retweet.
:star: How is MobileSAM trained? MobileSAM is trained on a single GPU with 100k images (1% of the original dataset) in less than a day. The training code will be available soon.
:star: How to Adapt from SAM to MobileSAM? Since MobileSAM keeps exactly the same pipeline as the original SAM, we inherit pre-processing, post-processing, and all other interfaces from the original SAM. Therefore, by assuming everything is exactly the same except for a smaller image encoder, those who use the original SAM for their projects can adapt to MobileSAM with almost zero effort.
:star: MobileSAM performs on par with the original SAM (at least visually) and keeps exactly the same pipeline as the original SAM except for a change on the image encoder. Specifically, we replace the original heavyweight ViT-H encoder (632M) with a much smaller Tiny-ViT (5M). On a single GPU, MobileSAM runs around 12ms per image: 8ms on the image encoder and 4ms on the mask decoder.
- The comparison of ViT-based image encoders is summarized as follows:

Image Encoder | Original SAM | MobileSAM |
---|---|---|
Parameters | 611M | 5M |
Speed | 452ms | 8ms |

- Original SAM and MobileSAM have exactly the same prompt-guided mask decoder:

Mask Decoder | Original SAM | MobileSAM |
---|---|---|
Parameters | 3.876M | 3.876M |
Speed | 4ms | 4ms |

- The comparison of the whole pipeline is summarized as follows:

Whole Pipeline (Enc+Dec) | Original SAM | MobileSAM |
---|---|---|
Parameters | 615M | 9.66M |
Speed | 456ms | 12ms |
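These numbers can be reproduced roughly with a simple timing sketch (assuming a CUDA device and the checkpoint under ./weights/; exact timings vary by hardware, and the first call includes warm-up overhead, so run it once before measuring):
import time
import numpy as np
import torch
from mobile_sam import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_t"](checkpoint="./weights/mobile_sam.pt")
sam.to(device="cuda")
sam.eval()
predictor = SamPredictor(sam)
image = np.zeros((1024, 1024, 3), dtype=np.uint8)  # dummy RGB input

torch.cuda.synchronize()
t0 = time.perf_counter()
predictor.set_image(image)  # runs the image encoder
torch.cuda.synchronize()
t1 = time.perf_counter()
masks, _, _ = predictor.predict(  # runs the prompt encoder + mask decoder
    point_coords=np.array([[512, 512]]),
    point_labels=np.array([1]),
)
torch.cuda.synchronize()
t2 = time.perf_counter()
print(f"encoder: {(t1 - t0) * 1e3:.1f} ms, decoder: {(t2 - t1) * 1e3:.1f} ms")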
:star: Original SAM and MobileSAM with a point as the prompt.
:star: Original SAM and MobileSAM with a box as the prompt.
:muscle: Is MobileSAM faster and smaller than FastSAM? Yes! MobileSAM is around 7 times smaller and around 5 times faster than the concurrent FastSAM. The comparison of the whole pipeline is summarized as follows:
Whole Pipeline (Enc+Dec) | FastSAM | MobileSAM |
---|---|---|
Parameters | 68M | 9.66M |
Speed | 64ms | 12ms |
:muscle: Does MobileSAM align better with the original SAM than FastSAM? Yes! FastSAM is suggested to work with multiple points, so we compare the mIoU with two prompt points (at different pixel distances) and show the results below. A higher mIoU indicates better alignment.
Point distance (px) | FastSAM (mIoU) | MobileSAM (mIoU) |
---|---|---|
100 | 0.27 | 0.73 |
200 | 0.33 | 0.71 |
300 | 0.37 | 0.74 |
400 | 0.41 | 0.73 |
500 | 0.41 | 0.73 |
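For reference, mask agreement between two models can be measured with a simple IoU over the binary masks. A minimal sketch (the helper name is illustrative):
import numpy as np

def mask_iou(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    # IoU of two binary masks; defined as 1.0 when both masks are empty.
    a = mask_a.astype(bool)
    b = mask_b.astype(bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0
    return np.logical_and(a, b).sum() / union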
Installation
The code requires python>=3.8, as well as pytorch>=1.7 and torchvision>=0.8. Please follow the instructions here to install both PyTorch and TorchVision dependencies. Installing both PyTorch and TorchVision with CUDA support is strongly recommended.
Install Mobile Segment Anything:
pip install git+https://github.com/ChaoningZhang/MobileSAM.git
or clone the repository locally and install with
git clone git@github.com:ChaoningZhang/MobileSAM.git
cd MobileSAM; pip install -e .
Demo
Once MobileSAM is installed, you can run the demo on your local PC or check out our HuggingFace Demo.
It requires the latest version of gradio.
cd app
python app.py
Getting Started
MobileSAM can be loaded in the following ways:
import torch
from mobile_sam import sam_model_registry, SamAutomaticMaskGenerator, SamPredictor
model_type = "vit_t"
sam_checkpoint = "./weights/mobile_sam.pt"
device = "cuda" if torch.cuda.is_available() else "cpu"
mobile_sam = sam_model_registry[model_type](checkpoint=sam_checkpoint)
mobile_sam.to(device=device)
mobile_sam.eval()
predictor = SamPredictor(mobile_sam)
predictor.set_image(<your_image>)
masks, _, _ = predictor.predict(<input_prompts>)
or generate masks for an entire image:
from mobile_sam import SamAutomaticMaskGenerator
mask_generator = SamAutomaticMaskGenerator(mobile_sam)
masks = mask_generator.generate(<your_image>)
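Each entry returned by generate is a dict mirroring the original segment_anything output format, with keys such as 'segmentation' (an HxW boolean array), 'area', 'bbox', and 'predicted_iou'. For example:
# Inspect the three largest masks.
for m in sorted(masks, key=lambda m: m["area"], reverse=True)[:3]:
    print(m["bbox"], m["area"], m["predicted_iou"])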
Getting Started (MobileSAMv2)
Download the model weights from the checkpoints.
After downloading the model weights, faster SegEvery with MobileSAMv2 can be run as follows:
cd MobileSAMv2
bash ./experiments/mobilesamv2.sh
ONNX Export
MobileSAM now supports ONNX export. Export the model with
python scripts/export_onnx_model.py --checkpoint ./weights/mobile_sam.pt --model-type vit_t --output ./mobile_sam.onnx
Also check the example notebook to follow detailed steps.
We recommend using onnx==1.12.0 and onnxruntime==1.13.1, which have been tested.
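Once exported, the model can be loaded with onnxruntime for a quick check. As with the original SAM export, the ONNX model covers the prompt encoder and mask decoder, while image embeddings still come from the PyTorch image encoder. A minimal sketch:
import onnxruntime as ort

session = ort.InferenceSession("mobile_sam.onnx")
for inp in session.get_inputs():
    # Expect inputs such as image_embeddings, point_coords, point_labels, ...
    print(inp.name, inp.shape)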
BibTex of our MobileSAM
If you use MobileSAM in your research, please use the following BibTeX entry. :mega: Thank you!
@article{mobile_sam,
title={Faster Segment Anything: Towards Lightweight SAM for Mobile Applications},
author={Zhang, Chaoning and Han, Dongshen and Qiao, Yu and Kim, Jung Uk and Bae, Sung-Ho and Lee, Seungkyu and Hong, Choong Seon},
journal={arXiv preprint arXiv:2306.14289},
year={2023}
}
Acknowledgement
This work was supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government(MSIT) (No.RS-2022-00155911, Artificial Intelligence Convergence Innovation Human Resources Development (Kyung Hee University))
SAM (Segment Anything) [bib]
@article{kirillov2023segany,
title={Segment Anything},
author={Kirillov, Alexander and Mintun, Eric and Ravi, Nikhila and Mao, Hanzi and Rolland, Chloe and Gustafson, Laura and Xiao, Tete and Whitehead, Spencer and Berg, Alexander C. and Lo, Wan-Yen and Doll{\'a}r, Piotr and Girshick, Ross},
journal={arXiv:2304.02643},
year={2023}
}
TinyViT (TinyViT: Fast Pretraining Distillation for Small Vision Transformers) [bib]
@InProceedings{tiny_vit,
title={TinyViT: Fast Pretraining Distillation for Small Vision Transformers},
author={Wu, Kan and Zhang, Jinnian and Peng, Houwen and Liu, Mengchen and Xiao, Bin and Fu, Jianlong and Yuan, Lu},
booktitle={European conference on computer vision (ECCV)},
year={2022}
}