BentoML

The easiest way to serve AI apps and models - Build Model Inference APIs, Job queues, LLM apps, Multi-model pipelines, and more!

7,657

836

127


Top Related Projects

onnxruntime

16,412

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator

mlflow

20,329

Open source platform for the machine learning lifecycle

serve

4,313

Serve, optimize and scale PyTorch models in production

serving

6,289

A flexible, high-performance serving system for machine learning models

cortex

8,033

Production infrastructure for machine learning at scale

Quick Overview

BentoML is an open-source platform for machine learning model serving and deployment. It simplifies the process of packaging machine learning models and deploying them as production-ready API endpoints, making it easier for data scientists and ML engineers to bring their models into production environments.

Pros

Seamless integration with popular ML frameworks (e.g., TensorFlow, PyTorch, scikit-learn)
Supports various deployment options, including Docker containers and cloud platforms
Provides built-in model versioning and management
Offers automatic API generation and documentation

Cons

Learning curve for users new to ML deployment concepts
Limited support for some less common ML frameworks
May require additional configuration for complex deployment scenarios
Documentation could be more comprehensive for advanced use cases

Code Examples

Creating a BentoML service:

import bentoml
import numpy as np

@bentoml.service(name="iris_classifier")
class IrisClassifier:
    def __init__(self) -> None:
        # Load the saved scikit-learn model from the local model store
        self.model = bentoml.sklearn.load_model("iris_clf:latest")

    @bentoml.api
    def predict(self, input_series: np.ndarray) -> np.ndarray:
        return self.model.predict(input_series)

Saving a model with BentoML:

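import bentoml
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Train a small classifier on the Iris dataset
X, y = load_iris(return_X_y=True)
model = RandomForestClassifier()
model.fit(X, y)

# Store the fitted model in the local BentoML model store
bentoml.sklearn.save_model("iris_clf", model)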

Loading and using a saved model:

import bentoml

# Load the most recent "iris_clf" version from the local model store
iris_clf = bentoml.sklearn.load_model("iris_clf:latest")

# One sample with the four Iris features
input_data = [[5.9, 3.0, 5.1, 1.8]]
result = iris_clf.predict(input_data)

Getting Started

To get started with BentoML, follow these steps:

Install BentoML:

pip install bentoml

Save your trained model under the name the service loads:

bentoml.sklearn.save_model("iris_clf", trained_model)

Create a BentoML service in service.py (as shown in the first code example above)

Build and serve your BentoML service:

bentoml build
bentoml serve service:IrisClassifier

Your model is now serving predictions via a REST API!

Competitor Comparisons

onnxruntime

16,412

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator

Pros of ONNX Runtime

Highly optimized for performance across various hardware platforms
Supports a wide range of machine learning frameworks and models
Extensive ecosystem and backing from Microsoft

Cons of ONNX Runtime

Steeper learning curve for beginners
Primarily an execution engine; it does not package models or expose them as API endpoints on its own

Code Comparison

BentoML:

import bentoml

@bentoml.service
class SklearnService:
    def __init__(self) -> None:
        # Load the saved scikit-learn model from the local model store
        self.model = bentoml.sklearn.load_model("iris_clf:latest")

    @bentoml.api
    def predict(self, input_data: list[list[float]]) -> list[int]:
        return self.model.predict(input_data).tolist()

ONNX Runtime:

import onnxruntime as ort

session = ort.InferenceSession("model.onnx")
input_name = session.get_inputs()[0].name
output_name = session.get_outputs()[0].name
result = session.run([output_name], {input_name: input_data})

BentoML focuses on model serving and deployment, providing a high-level API for creating prediction services. ONNX Runtime, on the other hand, is more low-level and focused on optimized inference across different hardware platforms. While BentoML offers a more user-friendly approach to deploying models, ONNX Runtime provides greater flexibility and performance optimization for inference tasks.

mlflow

20,329

Open source platform for the machine learning lifecycle

Pros of MLflow

More comprehensive ecosystem with experiment tracking, model registry, and project management
Better integration with popular ML frameworks and cloud platforms
Larger community and more extensive documentation

Cons of MLflow

Can be complex to set up and configure for smaller projects
Less focus on model serving and deployment compared to BentoML

Code Comparison

MLflow:

import mlflow

mlflow.start_run()
mlflow.log_param("param1", value1)
mlflow.log_metric("metric1", value2)
mlflow.end_run()

BentoML:

import bentoml

@bentoml.service
class MyService:
    def __init__(self) -> None:
        # Load the saved scikit-learn model from the local model store
        self.model = bentoml.sklearn.load_model("iris_clf:latest")

    @bentoml.api
    def predict(self, input_data: list[list[float]]) -> list[int]:
        return self.model.predict(input_data).tolist()

MLflow focuses on experiment tracking and model management, while BentoML emphasizes model serving and deployment. MLflow's code example shows logging parameters and metrics, whereas BentoML's code demonstrates creating a deployable service with API endpoints. Both tools have their strengths, and the choice depends on specific project requirements and workflow preferences.

serve

4,313

Serve, optimize and scale PyTorch models in production

Pros of TorchServe

Specifically designed for PyTorch models, offering optimized performance
Supports model versioning and A/B testing out of the box
Integrates well with other PyTorch ecosystem tools

Cons of TorchServe

Limited to PyTorch models, less flexible for other frameworks
Steeper learning curve for users not familiar with PyTorch ecosystem
Less extensive documentation and community support compared to BentoML

Code Comparison

TorchServe example:

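import torch
from torchvision import models

# `pretrained=True` is deprecated in recent torchvision; request weights explicitly
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
# Save the weights for later packaging into a TorchServe model archive
torch.save(model.state_dict(), "resnet18.pth")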

BentoML example:

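import bentoml
from torchvision import models

# `pretrained=True` is deprecated in recent torchvision; request weights explicitly
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
# Store the model in the local BentoML model store
bentoml.pytorch.save_model("resnet18", model)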

Both examples demonstrate saving a pre-trained ResNet18 model, but BentoML provides a more streamlined API for model management and deployment.

serving

6,289

A flexible, high-performance serving system for machine learning models

Pros of TensorFlow Serving

Highly optimized for TensorFlow models, offering excellent performance
Supports model versioning and hot-swapping for seamless updates
Integrates well with other TensorFlow ecosystem tools

Cons of TensorFlow Serving

Limited to TensorFlow models, lacking support for other ML frameworks
Steeper learning curve and more complex setup compared to BentoML
Less flexibility in customizing serving logic and pipelines

Code Comparison

BentoML:

import bentoml

@bentoml.service
class SklearnService:
    def __init__(self) -> None:
        # Load the saved scikit-learn model from the local model store
        self.model = bentoml.sklearn.load_model("iris_clf:latest")

    @bentoml.api
    def predict(self, input_data: list[list[float]]) -> list[int]:
        return self.model.predict(input_data).tolist()

TensorFlow Serving:

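import tensorflow as tf

# Load the SavedModel that TensorFlow Serving would serve
model = tf.saved_model.load('path/to/model')
serving_fn = model.signatures['serving_default']

def predict(input_data):
    # 'output' must match an output name in the model's serving signature
    return serving_fn(tf.constant(input_data))['output']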

cortex

8,033

Production infrastructure for machine learning at scale

Pros of Cortex

Supports automatic scaling and GPU acceleration out-of-the-box
Provides a managed Kubernetes infrastructure, simplifying deployment
Offers real-time monitoring and logging capabilities

Cons of Cortex

Less flexible in terms of supported ML frameworks compared to BentoML
Steeper learning curve for users not familiar with Kubernetes concepts
Limited customization options for advanced deployment scenarios

Code Comparison

Cortex

# cortex.yaml
- name: iris-classifier
  kind: RealtimeAPI
  predictor:
    type: python
    path: predictor.py

BentoML

# service.py
import bentoml

@bentoml.service
class IrisClassifier:
    def __init__(self) -> None:
        # Load the saved scikit-learn model from the local model store
        self.model = bentoml.sklearn.load_model("iris_clf:latest")

    @bentoml.api
    def predict(self, input_data: list[list[float]]) -> list[int]:
        return self.model.predict(input_data).tolist()

Both Cortex and BentoML aim to simplify ML model deployment, but they take different approaches. Cortex focuses on providing a managed Kubernetes infrastructure with automatic scaling, while BentoML offers more flexibility in terms of supported frameworks and deployment options. The choice between the two depends on specific project requirements and team expertise.

README

Unified Model Serving Framework

🍱 Build model inference APIs and multi-model serving systems with any open-source or custom AI models. 👉 Join our Slack community!

What is BentoML?

BentoML is a Python library for building online serving systems optimized for AI apps and model inference.

🍱 Easily build APIs for Any AI/ML Model. Turn any model inference script into a REST API server with just a few lines of code and standard Python type hints.
🐳 Docker Containers made simple. No more dependency hell! Manage your environments, dependencies and model versions with a simple config file. BentoML automatically generates Docker images, ensures reproducibility, and simplifies how you deploy to different environments.
🧭 Maximize CPU/GPU utilization. Build high performance inference APIs leveraging built-in serving optimization features like dynamic batching, model parallelism, multi-stage pipeline and multi-model inference-graph orchestration.
👩‍💻 Fully customizable. Easily implement your own APIs or task queues, with custom business logic, model inference and multi-model composition. Supports any ML framework, modality, and inference runtime.
🚀 Ready for Production. Develop, run and debug locally. Seamlessly deploy to production with Docker containers or BentoCloud.
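To make the batching feature mentioned above more concrete, here is a minimal pure-Python sketch of the grouping step behind request batching. It is not BentoML's implementation: `run_in_batches` and `fake_model` are hypothetical names, and a real dynamic batcher also waits up to a latency budget for concurrently arriving requests, which this sketch leaves out.

```python
from typing import Callable, List, Sequence


def run_in_batches(
    requests: Sequence[str],
    batch_fn: Callable[[List[str]], List[str]],
    max_batch_size: int = 4,
) -> List[str]:
    """Group pending requests into batches and fan the results back out in order."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    results: List[str] = []
    for start in range(0, len(requests), max_batch_size):
        batch = list(requests[start:start + max_batch_size])
        outputs = batch_fn(batch)  # one model call per batch instead of per request
        if len(outputs) != len(batch):
            raise ValueError("batch_fn must return one output per input")
        results.extend(outputs)
    return results


# Stand-in for a model: records the size of each batch it receives.
batch_sizes: List[int] = []


def fake_model(batch: List[str]) -> List[str]:
    batch_sizes.append(len(batch))
    return [text.upper() for text in batch]


outputs = run_in_batches(["a", "b", "c", "d", "e"], fake_model, max_batch_size=2)
```

Five requests with a batch size of two lead to three model calls rather than five, which is where the utilization gain comes from when a model processes a batch almost as quickly as a single input.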

Getting started

Install BentoML:

# Requires Python≥3.9
pip install -U bentoml

Define APIs in a service.py file.

import bentoml

@bentoml.service(
    image=bentoml.images.Image(python_version="3.11").python_packages("torch", "transformers"),
)
class Summarization:
    def __init__(self) -> None:
        import torch
        from transformers import pipeline

        device = "cuda" if torch.cuda.is_available() else "cpu"
        self.pipeline = pipeline('summarization', device=device)

    @bentoml.api(batchable=True)
    def summarize(self, texts: list[str]) -> list[str]:
        results = self.pipeline(texts)
        return [item['summary_text'] for item in results]

💻 Run locally

Install PyTorch and Transformers packages to your Python virtual environment.

pip install torch transformers  # additional dependencies for local run

Run the service code locally (serving at http://localhost:3000 by default):

bentoml serve

You should expect to see the following output.

[INFO] [cli] Starting production HTTP BentoServer from "service:Summarization" listening on http://localhost:3000 (Press CTRL+C to quit)
[INFO] [entry_service:Summarization:1] Service Summarization initialized

Now you can run inference from your browser at http://localhost:3000 or with a Python script:

import bentoml

with bentoml.SyncHTTPClient('http://localhost:3000') as client:
    summarized_text: str = client.summarize([bentoml.__doc__])[0]
    print(f"Result: {summarized_text}")
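The same call can also be made without the BentoML client. The following is a minimal sketch using only the standard library, assuming the service exposes the method at the route `/summarize` and accepts a JSON object keyed by parameter name (`texts`); `build_summarize_request` and `summarize` are hypothetical helper names, not part of BentoML.

```python
import json
import urllib.request


def build_summarize_request(base_url: str, texts: list[str]) -> urllib.request.Request:
    """Build a POST request for the summarize endpoint without sending it."""
    body = json.dumps({"texts": texts}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/summarize",  # assumed route: the API method name
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def summarize(base_url: str, texts: list[str]) -> list[str]:
    """Send the request to a running server and decode the JSON list of summaries."""
    request = build_summarize_request(base_url, texts)
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read().decode("utf-8"))


sample = ["BentoML is a Python library for building online serving systems."]
request = build_summarize_request("http://localhost:3000", sample)
# With the server running: print(summarize("http://localhost:3000", sample)[0])
```

Building the request separately from sending it makes the payload easy to inspect before any server is started.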

🐳 Deploy using Docker

Run bentoml build to package necessary code, models, dependency configs into a Bento - the standardized deployable artifact in BentoML:

bentoml build

Ensure Docker is running. Generate a Docker container image for deployment:

bentoml containerize summarization:latest

Run the generated image:

docker run --rm -p 3000:3000 summarization:latest
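Once the container is started, it can take a while before the model is loaded and the server accepts traffic. A minimal readiness-polling sketch follows, assuming the server answers health checks at `/readyz` on the published port; `http_probe` and `wait_until_ready` are hypothetical helper names.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str) -> bool:
    """Return True when the URL answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(
    probe: Callable[[], bool], attempts: int = 30, delay_seconds: float = 1.0
) -> bool:
    """Call probe until it succeeds or the attempts run out."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            time.sleep(delay_seconds)
    return False


# Against the container (endpoint path is an assumption):
# wait_until_ready(lambda: http_probe("http://localhost:3000/readyz"))

# Offline demonstration with a stand-in probe that succeeds on the third call.
answers = iter([False, False, True])
ready = wait_until_ready(lambda: next(answers), attempts=5, delay_seconds=0)
never_ready = wait_until_ready(lambda: False, attempts=3, delay_seconds=0)
```

Passing the probe in as a callable keeps the retry loop testable without a running container.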

☁️ Deploy on BentoCloud

BentoCloud provides compute infrastructure for rapid and reliable GenAI adoption. It helps speed up your BentoML development process leveraging cloud compute resources, and simplify how you deploy, scale and operate BentoML in production.

# After signup, run the following command to create an API token:
bentoml cloud login

# Deploy from current directory:
bentoml deploy

For detailed explanations, read the Hello World example.

Examples

LLMs: Llama 3.2, Mistral, DeepSeek Distil, and more.
Image Generation: Stable Diffusion 3 Medium, Stable Video Diffusion, Stable Diffusion XL Turbo, ControlNet, and LCM LoRAs.
Embeddings: SentenceTransformers and ColPali
Audio: ChatTTS, XTTS, WhisperX, Bark
Computer Vision: YOLO and ResNet
Advanced examples: Function calling, LangGraph, CrewAI

Check out the full list for more sample code and usage.

Advanced topics

See Documentation for more tutorials and guides.

Community

Get involved and join our Community Slack 💬, where thousands of AI/ML engineers help each other, contribute to the project, and talk about building AI products.

To report a bug or suggest a feature request, use GitHub Issues.

Contributing

There are many ways to contribute to the project:

Report bugs and "Thumbs up" on issues that are relevant to you.
Investigate issues and review other developers' pull requests.
Contribute code or documentation to the project by submitting a GitHub pull request.
Check out the Contributing Guide and Development Guide to learn more.
Share your feedback and discuss roadmap plans in the #bentoml-contributors channel here.

Thanks to all of our amazing contributors!

Usage tracking and feedback

The BentoML framework collects anonymous usage data that helps our community improve the product. Only BentoML's internal API calls are reported. This excludes any sensitive information, such as user code, model data, model names, or stack traces. Here's the code used for usage tracking. You can opt out of usage tracking with the --do-not-track CLI option:

bentoml [command] --do-not-track

Or by setting the environment variable:

export BENTOML_DO_NOT_TRACK=True
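When BentoML commands are launched from a Python script rather than a shell, the same variable can be passed through the child process environment. A minimal sketch, where `env_without_tracking` is a hypothetical helper name:

```python
import os
from typing import Mapping, Optional


def env_without_tracking(base: Optional[Mapping[str, str]] = None) -> dict:
    """Return a copy of the environment with usage tracking switched off."""
    env = dict(os.environ if base is None else base)
    env["BENTOML_DO_NOT_TRACK"] = "True"
    return env


child_env = env_without_tracking({"PATH": "/usr/bin"})
# e.g. subprocess.run(["bentoml", "serve"], env=env_without_tracking())
```

Copying the environment instead of mutating `os.environ` keeps the opt-out scoped to the launched command.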

License

Apache License 2.0
