serve

Serve, optimize and scale PyTorch models in production

4,346

886

4,346

447

View on GitHub

Top Related Projects

serving

6,289

A flexible, high-performance serving system for machine learning models

BentoML

7,950

The easiest way to serve AI apps and models - Build Model Inference APIs, Job queues, LLM apps, Multi-model pipelines, and more!

cortex

8,036

Production infrastructure for machine learning at scale

onnxruntime

17,390

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator

server

9,394

The Triton Inference Server provides an optimized cloud and edge inferencing solution.

The open source developer platform to build AI/LLM applications and models with confidence. Enhance your AI applications with end-to-end tracking, observability, and evaluations, all in one integrated platform.

Quick Overview

TorchServe is an open-source model serving framework for PyTorch. It provides a flexible and easy-to-use solution for deploying and serving PyTorch models in production environments. TorchServe offers features like model versioning, metrics, and multi-model serving, making it suitable for various deployment scenarios.

Pros

Easy to use and deploy, with minimal configuration required
Supports both REST and gRPC APIs for model inference
Provides built-in model management and versioning capabilities
Offers customizable handlers for pre- and post-processing of inputs and outputs

Cons

Limited to PyTorch models, not suitable for other deep learning frameworks
May have performance overhead compared to more specialized serving solutions
Documentation can be sparse or outdated in some areas
Relatively new project, which may lead to potential stability issues or frequent changes

Code Examples

Creating a custom handler:

from ts.torch_handler.base_handler import BaseHandler

class MyCustomHandler(BaseHandler):
    def preprocess(self, data):
        # Custom preprocessing logic
        return processed_data

    def inference(self, data):
        # Custom inference logic
        return predictions

    def postprocess(self, data):
        # Custom postprocessing logic
        return final_output

Registering a model:

import torch
from torchserve.model_archiver.model_packaging import package_model

model = torch.load('my_model.pth')
package_model(model_name='my_model',
              version='1.0',
              model_file='my_model.py',
              serialized_file='my_model.pth',
              handler='my_custom_handler.py')

Starting TorchServe:

torchserve --start --ncs --model-store model_store --models my_model.mar

Getting Started

Install TorchServe:

pip install torchserve torch-model-archiver torch-workflow-archiver

Create a model archive:

torch-model-archiver --model-name my_model --version 1.0 --model-file path/to/model.py --serialized-file path/to/model.pth --handler image_classifier

Start TorchServe:

mkdir model_store
mv my_model.mar model_store/
torchserve --start --ncs --model-store model_store --models my_model.mar

Make an inference request:

curl http://localhost:8080/predictions/my_model -T examples/image_classifier/kitten.jpg

Competitor Comparisons

serving

6,289

A flexible, high-performance serving system for machine learning models

Pros of TensorFlow Serving

More mature and battle-tested in production environments
Supports model versioning and A/B testing out of the box
Offers high-performance serving with optimized C++ runtime

Cons of TensorFlow Serving

Limited to TensorFlow models only
Steeper learning curve and more complex setup
Less flexibility for custom preprocessing and postprocessing

Code Comparison

TensorFlow Serving (using gRPC):

import tensorflow as tf
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2_grpc

stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)
request = predict_pb2.PredictRequest()
request.model_spec.name = 'model'
request.inputs['input'].CopyFrom(tf.make_tensor_proto(data))
result = stub.Predict(request, 10.0)

PyTorch Serve:

import requests
import json

data = {'input': input_data.tolist()}
response = requests.post("http://localhost:8080/predictions/model", data=json.dumps(data))
result = response.json()

PyTorch Serve offers a simpler API and supports both REST and gRPC, while TensorFlow Serving focuses on high-performance gRPC communication. PyTorch Serve is more flexible and easier to use for custom models, but TensorFlow Serving excels in production environments with its optimized performance and built-in model management features.

BentoML

7,950

The easiest way to serve AI apps and models - Build Model Inference APIs, Job queues, LLM apps, Multi-model pipelines, and more!

Pros of BentoML

More flexible model serving framework, supporting multiple ML frameworks beyond PyTorch
Provides a unified API for model packaging, deployment, and management
Offers built-in model versioning and experiment tracking capabilities

Cons of BentoML

Steeper learning curve due to its more comprehensive feature set
May have higher resource overhead for simpler serving scenarios
Less tightly integrated with PyTorch ecosystem

Code Comparison

BentoML:

import bentoml

@bentoml.env(pip_packages=["torch"])
@bentoml.artifacts([bentoml.PyTorchModelArtifact("model")])
class MyService(bentoml.BentoService):
    @bentoml.api(input=bentoml.Image(), output=bentoml.JsonOutput())
    def predict(self, image):
        return self.artifacts.model(image)

TorchServe:

import torch
from ts.torch_handler.base_handler import BaseHandler

class MyHandler(BaseHandler):
    def preprocess(self, data):
        return torch.tensor(data)

    def inference(self, data):
        return self.model.forward(data)

    def postprocess(self, data):
        return data.tolist()

cortex

8,036

Production infrastructure for machine learning at scale

Pros of Cortex

Supports multiple machine learning frameworks (TensorFlow, PyTorch, scikit-learn, etc.)
Provides automatic scaling and infrastructure management
Offers a more comprehensive end-to-end ML deployment solution

Cons of Cortex

Steeper learning curve due to more complex architecture
Less tightly integrated with PyTorch ecosystem
May have higher operational costs for smaller projects

Code Comparison

Cortex deployment:

- name: iris-classifier
  predictor:
    type: python
    path: predictor.py
  compute:
    cpu: 1

TorchServe deployment:

torch-model-archiver --model-name densenet161 --version 1.0 --model-file model.py --serialized-file densenet161-8d451a50.pth --export-path model_store --extra-files index_to_name.json --handler image_classifier

Summary

Cortex offers a more comprehensive solution for deploying machine learning models across various frameworks, with built-in scaling and infrastructure management. However, it may be more complex to set up and potentially costlier for smaller projects. TorchServe, on the other hand, provides a simpler, PyTorch-focused deployment option that might be more suitable for projects primarily using PyTorch models.

onnxruntime

17,390

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator

Pros of ONNX Runtime

Broader model support: Works with models from various frameworks, not just PyTorch
Performance optimizations: Offers advanced optimizations for inference across different hardware
Cross-platform compatibility: Supports a wide range of operating systems and devices

Cons of ONNX Runtime

Steeper learning curve: Requires more setup and configuration compared to TorchServe
Less integrated with PyTorch ecosystem: May require additional steps for PyTorch model deployment

Code Comparison

ONNX Runtime:

import onnxruntime as ort
session = ort.InferenceSession("model.onnx")
input_name = session.get_inputs()[0].name
output = session.run(None, {input_name: input_data})

TorchServe:

import torch
from torchserve.torch_handler.base_handler import BaseHandler

class ModelHandler(BaseHandler):
    def preprocess(self, data):
        # Preprocess input data
    def inference(self, data):
        # Perform inference
    def postprocess(self, data):
        # Postprocess output data

Both repositories offer robust solutions for model serving, with ONNX Runtime providing broader framework support and optimizations, while TorchServe offers tighter integration with the PyTorch ecosystem and simpler deployment for PyTorch models.

server

9,394

The Triton Inference Server provides an optimized cloud and edge inferencing solution.

Pros of Triton Inference Server

Supports multiple deep learning frameworks (TensorFlow, PyTorch, ONNX, etc.)
Offers advanced features like dynamic batching and model ensembling
Provides optimized performance for GPU inference

Cons of Triton Inference Server

Steeper learning curve and more complex setup
Less integrated with PyTorch ecosystem
May be overkill for simpler deployment scenarios

Code Comparison

Triton Inference Server (model configuration):

{
  "name": "mymodel",
  "backend": "pytorch",
  "max_batch_size": 8,
  "input": [{"name": "INPUT0", "data_type": "TYPE_FP32", "dims": [3, 224, 224]}],
  "output": [{"name": "OUTPUT0", "data_type": "TYPE_FP32", "dims": [1000]}]
}

TorchServe (model handler):

class MyHandler(BaseHandler):
    def preprocess(self, data):
        # Preprocess input data
    def inference(self, data):
        # Perform inference
    def postprocess(self, data):
        # Postprocess output data

Both servers offer powerful inference capabilities, but Triton Inference Server provides more flexibility across frameworks and advanced features, while TorchServe offers a simpler, more PyTorch-centric approach.

mlflow

21,434

Pros of MLflow

Broader scope: Supports the entire machine learning lifecycle, including experiment tracking, model packaging, and deployment
Language-agnostic: Works with various ML frameworks and languages, not limited to PyTorch
Extensive integrations: Offers integrations with popular data science tools and platforms

Cons of MLflow

Less specialized for serving: Not as focused on model serving capabilities as TorchServe
Steeper learning curve: May require more time to set up and configure due to its comprehensive feature set

Code Comparison

MLflow example:

import mlflow

mlflow.start_run()
mlflow.log_param("param1", 5)
mlflow.log_metric("accuracy", 0.95)
mlflow.pytorch.log_model(model, "model")
mlflow.end_run()

TorchServe example:

import torch
from torch import nn

class MyModel(nn.Module):
    def forward(self, x):
        return x * 2

model = MyModel()
torch.save(model.state_dict(), "mymodel.pth")

Both repositories offer valuable tools for machine learning workflows, with MLflow providing a more comprehensive solution for the entire ML lifecycle, while TorchServe focuses specifically on serving PyTorch models efficiently.

Convert designs to code with AI

Introducing Visual Copilot: A new AI model to turn Figma designs to high quality code using your components.

Try Visual Copilot

README

â ï¸ Notice: Limited Maintenance

This project is no longer actively maintained. While existing releases remain available, there are no planned updates, bug fixes, new features, or security patches. Users should be aware that vulnerabilities may not be addressed.

âANNOUNCEMENT: Security Changesâ

TorchServe now enforces token authorization enabled and model API control disabled by default. These security features are intended to address the concern of unauthorized API calls and to prevent potential malicious code from being introduced to the model server. Refer the following documentation for more information: Token Authorization, Model API control

TorchServe

TorchServe is a flexible and easy-to-use tool for serving and scaling PyTorch models in production.

Requires python >= 3.8

curl http://127.0.0.1:8080/predictions/bert -T input.txt

ð Quick start with TorchServe

# Install dependencies
python ./ts_scripts/install_dependencies.py

# Include dependencies for accelerator support with the relevant optional flags
python ./ts_scripts/install_dependencies.py --rocm=rocm61
python ./ts_scripts/install_dependencies.py --cuda=cu121

# Latest release
pip install torchserve torch-model-archiver torch-workflow-archiver

# Nightly build
pip install torchserve-nightly torch-model-archiver-nightly torch-workflow-archiver-nightly

ð Quick start with TorchServe (conda)

# Install dependencies
python ./ts_scripts/install_dependencies.py

# Include depeendencies for accelerator support with the relevant optional flags
python ./ts_scripts/install_dependencies.py --rocm=rocm61
python ./ts_scripts/install_dependencies.py --cuda=cu121

# Latest release
conda install -c pytorch torchserve torch-model-archiver torch-workflow-archiver

# Nightly build
conda install -c pytorch-nightly torchserve torch-model-archiver torch-workflow-archiver

Getting started guide

ð³ Quick Start with Docker

# Latest release
docker pull pytorch/torchserve

# Nightly build
docker pull pytorch/torchserve-nightly

Refer to torchserve docker for details.

ð¤ Quick Start LLM Deployment

VLLM Engine

# Make sure to install torchserve with pip or conda as described above and login with `huggingface-cli login`
python -m ts.llm_launcher --model_id meta-llama/Llama-3.2-3B-Instruct --disable_token_auth

# Try it out
curl -X POST -d '{"model":"meta-llama/Llama-3.2-3B-Instruct", "prompt":"Hello, my name is", "max_tokens": 200}' --header "Content-Type: application/json" "http://localhost:8080/predictions/model/1.0/v1/completions"

TRT-LLM Engine

# Make sure to install torchserve with python venv as described above and login with `huggingface-cli login`
# pip install -U --use-deprecated=legacy-resolver -r requirements/trt_llm.txt
python -m ts.llm_launcher --model_id meta-llama/Meta-Llama-3.1-8B-Instruct --engine trt_llm --disable_token_auth

# Try it out
curl -X POST -d '{"prompt":"count from 1 to 9 in french ", "max_tokens": 100}' --header "Content-Type: application/json" "http://localhost:8080/predictions/model"

ð¢ Quick Start LLM Deployment with Docker

#export token=<HUGGINGFACE_HUB_TOKEN>
docker build --pull . -f docker/Dockerfile.vllm -t ts/vllm

docker run --rm -ti --shm-size 10g --gpus all -e HUGGING_FACE_HUB_TOKEN=$token -p 8080:8080 -v data:/data ts/vllm --model_id meta-llama/Meta-Llama-3-8B-Instruct --disable_token_auth

# Try it out
curl -X POST -d '{"model":"meta-llama/Meta-Llama-3-8B-Instruct", "prompt":"Hello, my name is", "max_tokens": 200}' --header "Content-Type: application/json" "http://localhost:8080/predictions/model/1.0/v1/completions"

Refer to LLM deployment for details and other methods.

â¡ Why TorchServe

Write once, run anywhere, on-prem, on-cloud, supports inference on CPUs, GPUs, AWS Inf1/Inf2/Trn1, Google Cloud TPUs, Nvidia MPS
Model Management API: multi model management with optimized worker to model allocation
Inference API: REST and gRPC support for batched inference
TorchServe Workflows: deploy complex DAGs with multiple interdependent models
Default way to serve PyTorch models in
- Sagemaker
- Vertex AI
- Kubernetes with support for autoscaling, session-affinity, monitoring using Grafana works on-prem, AWS EKS, Google GKE, Azure AKS
- Kserve: Supports both v1 and v2 API, autoscaling and canary deployments for A/B testing
- Kubeflow
- MLflow
Export your model for optimized inference. Torchscript out of the box, PyTorch Compiler preview, ORT and ONNX, IPEX, TensorRT, FasterTransformer, FlashAttention (Better Transformers)
Performance Guide: builtin support to optimize, benchmark, and profile PyTorch and TorchServe performance
Expressive handlers: An expressive handler architecture that makes it trivial to support inferencing for your use case with many supported out of the box
Metrics API: out-of-the-box support for system-level metrics with Prometheus exports, custom metrics,
Large Model Inference Guide: With support for GenAI, LLMs including
- SOTA GenAI performance using torch.compile
- Fast Kernels with FlashAttention v2, continuous batching and streaming response
- PyTorch Tensor Parallel preview, Pipeline Parallel
- Microsoft DeepSpeed, DeepSpeed-Mii
- Hugging Face Accelerate, Diffusers
- Running large models on AWS Sagemaker and Inferentia2
- Running Meta Llama Chatbot locally on Mac
Monitoring using Grafana and Datadog

ð¤ How does TorchServe work

Model Server for PyTorch Documentation: Full documentation
TorchServe internals: How TorchServe was built
Contributing guide: How to contribute to TorchServe

ð Highlighted Examples

Serving Meta Llama with TorchServe
Chatbot with Meta Llama on Mac ð¦ð¬
ð¤ HuggingFace Transformers with a Better Transformer Integration/ Flash Attention & Xformer Memory Efficient
Stable Diffusion
Model parallel inference
MultiModal models with MMF combining text, audio and video
Dual Neural Machine Translation for a complex workflow DAG
TorchServe Integrations
TorchServe Internals
TorchServe UseCases

For more examples

ð¡ï¸ TorchServe Security Policy

SECURITY.md

ð¤ Learn More

https://pytorch.org/serve

ð« Contributing

We welcome all contributions!

To learn more about how to contribute, see the contributor guide here.

ð° News

ð All Contributors

Made with contrib.rocks.

âï¸ Disclaimer

This repository is jointly operated and maintained by Amazon, Meta and a number of individual contributors listed in the CONTRIBUTORS file. For questions directed at Meta, please send an email to opensource@fb.com. For questions directed at Amazon, please send an email to torchserve@amazon.com. For all other questions, please open up an issue in this repository here.

TorchServe acknowledges the Multi Model Server (MMS) project from which it was derived

Top Related Projects

Convert designs to code with AI

Introducing Visual Copilot: A new AI model to turn Figma designs to high quality code using your components.

Try Visual Copilot

Top Related Projects

Quick Overview

Pros

Cons

Code Examples

Getting Started

Competitor Comparisons

Pros of TensorFlow Serving

Cons of TensorFlow Serving

Code Comparison

Pros of BentoML

Cons of BentoML

Code Comparison

Pros of Cortex

Cons of Cortex

Code Comparison

Summary

Pros of ONNX Runtime

Cons of ONNX Runtime

Code Comparison

Pros of Triton Inference Server

Cons of Triton Inference Server

Code Comparison

Pros of MLflow

Cons of MLflow

Code Comparison

Convert designs to code with AI

README

âANNOUNCEMENT: Security Changesâ

TorchServe

ð Quick start with TorchServe

ð Quick start with TorchServe (conda)

ð³ Quick Start with Docker

ð¤ Quick Start LLM Deployment

VLLM Engine

TRT-LLM Engine

ð¢ Quick Start LLM Deployment with Docker

â¡ Why TorchServe

ð¤ How does TorchServe work

ð Highlighted Examples

ð¡ï¸ TorchServe Security Policy

ð¤ Learn More

ð« Contributing

ð° News

ð All Contributors

âï¸ Disclaimer

Top Related Projects

Convert designs to code with AI

âANNOUNCEMENT: Security Changesâ

ð Quick start with TorchServe

ð Quick start with TorchServe (conda)

ð³ Quick Start with Docker

ð¤ Quick Start LLM Deployment

ð¢ Quick Start LLM Deployment with Docker

â¡ Why TorchServe

ð¤ How does TorchServe work

ð Highlighted Examples

ð¡ï¸ TorchServe Security Policy

ð¤ Learn More

ð« Contributing

ð° News

ð All Contributors

âï¸ Disclaimer