Top Related Projects
A flexible, high-performance serving system for machine learning models
The easiest way to serve AI apps and models - Build reliable Inference APIs, LLM apps, Multi-model chains, RAG service, and much more!
Production infrastructure for machine learning at scale
ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
The Triton Inference Server provides an optimized cloud and edge inferencing solution.
Open source platform for the machine learning lifecycle
Quick Overview
TorchServe is an open-source model serving framework for PyTorch. It provides a flexible and easy-to-use solution for deploying and serving PyTorch models in production environments. TorchServe offers features like model versioning, metrics, and multi-model serving, making it suitable for various deployment scenarios.
Pros
- Easy to use and deploy, with minimal configuration required
- Supports both REST and gRPC APIs for model inference
- Provides built-in model management and versioning capabilities
- Offers customizable handlers for pre- and post-processing of inputs and outputs
Cons
- Limited to PyTorch models, not suitable for other deep learning frameworks
- May have performance overhead compared to more specialized serving solutions
- Documentation can be sparse or outdated in some areas
- Relatively new project, which may lead to potential stability issues or frequent changes
Code Examples
- Creating a custom handler:
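from ts.torch_handler.base_handler import BaseHandler

class MyCustomHandler(BaseHandler):
    def preprocess(self, data):
        # Custom preprocessing logic: turn the raw requests into model input
        processed_data = data
        return processed_data

    def inference(self, data):
        # Custom inference logic: self.model is loaded by BaseHandler.initialize
        predictions = self.model(data)
        return predictions

    def postprocess(self, data):
        # Custom postprocessing logic: return one entry per request
        final_output = data.tolist()
        return final_output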
- Packaging a model for registration:
mkdir -p model_store
torch-model-archiver --model-name my_model --version 1.0 --model-file my_model.py --serialized-file my_model.pth --handler my_custom_handler.py --export-path model_store
- Starting TorchServe:
torchserve --start --ncs --model-store model_store --models my_model.mar
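The custom handler above overrides three hooks that TorchServe's base handler calls in order: preprocess, inference, postprocess. The sketch below mimics that flow in plain Python so it can be run without TorchServe installed. `SketchHandler` and `double` are hypothetical stand-ins for illustration, not part of the TorchServe API.

```python
from typing import Callable, Dict, List


class SketchHandler:
    """Hypothetical stand-in for a TorchServe handler (not the real BaseHandler)."""

    def __init__(self, model: Callable[[List[float]], List[float]]):
        self.model = model

    def preprocess(self, data: List[Dict]) -> List[float]:
        # Each request is a dict; the payload sits under "body" or "data".
        return [float(row.get("body", row.get("data"))) for row in data]

    def inference(self, data: List[float]) -> List[float]:
        # Run the model on the whole batch at once.
        return self.model(data)

    def postprocess(self, data: List[float]) -> List[Dict]:
        # Return exactly one response entry per incoming request.
        return [{"prediction": value} for value in data]

    def handle(self, data: List[Dict]) -> List[Dict]:
        # Chain the three hooks in the same order as the handler above.
        return self.postprocess(self.inference(self.preprocess(data)))


def double(batch: List[float]) -> List[float]:
    """Toy model used only for this sketch."""
    return [2 * x for x in batch]


handler = SketchHandler(double)
result = handler.handle([{"body": "1.5"}, {"data": 4}])
print(result)
```

Keeping the output list the same length as the input batch matters because each entry is matched back to the request that produced it.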
Getting Started
- Install TorchServe:
pip install torchserve torch-model-archiver torch-workflow-archiver
- Create a model archive:
torch-model-archiver --model-name my_model --version 1.0 --model-file path/to/model.py --serialized-file path/to/model.pth --handler image_classifier
- Start TorchServe:
mkdir model_store
mv my_model.mar model_store/
torchserve --start --ncs --model-store model_store --models my_model.mar
- Make an inference request:
curl http://localhost:8080/predictions/my_model -T examples/image_classifier/kitten.jpg
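The same inference call can be prepared from Python. This is a minimal sketch using only the standard library; `build_prediction_request` is a hypothetical helper name, and the host and model name simply mirror the curl example above. It only builds the request, so it runs without a server.

```python
import urllib.request
from typing import Optional


def build_prediction_request(
    model_name: str,
    payload: bytes,
    host: str = "http://localhost:8080",
    version: Optional[str] = None,
) -> urllib.request.Request:
    """Build a POST request for /predictions/<model_name>[/<version>]."""
    path = f"/predictions/{model_name}"
    if version:
        # An explicit model version is appended as an extra path segment.
        path += f"/{version}"
    return urllib.request.Request(host + path, data=payload, method="POST")


req = build_prediction_request("my_model", b"raw image bytes")
print(req.get_method(), req.full_url)
```

With a server running, `urllib.request.urlopen(req).read()` would send it. If token authorization is enabled, the request is expected to need an authorization header as well.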
Competitor Comparisons
A flexible, high-performance serving system for machine learning models
Pros of TensorFlow Serving
- More mature and battle-tested in production environments
- Supports model versioning and A/B testing out of the box
- Offers high-performance serving with optimized C++ runtime
Cons of TensorFlow Serving
- Limited to TensorFlow models only
- Steeper learning curve and more complex setup
- Less flexibility for custom preprocessing and postprocessing
Code Comparison
TensorFlow Serving (using gRPC):
import grpc
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2_grpc

# 8500 is TensorFlow Serving's default gRPC port
channel = grpc.insecure_channel('localhost:8500')
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)
request = predict_pb2.PredictRequest()
request.model_spec.name = 'model'
request.inputs['input'].CopyFrom(tf.make_tensor_proto(data))  # data: the input array
result = stub.Predict(request, 10.0)  # 10-second timeout
TorchServe:
import requests
import json
data = {'input': input_data.tolist()}
response = requests.post("http://localhost:8080/predictions/model", data=json.dumps(data))
result = response.json()
Both servers expose REST and gRPC endpoints. TorchServe's REST call is shorter to write, and its Python handlers make custom pre- and post-processing easier. TensorFlow Serving's strengths are its optimized C++ runtime and its longer track record in production.
The easiest way to serve AI apps and models - Build reliable Inference APIs, LLM apps, Multi-model chains, RAG service, and much more!
Pros of BentoML
- More flexible model serving framework, supporting multiple ML frameworks beyond PyTorch
- Provides a unified API for model packaging, deployment, and management
- Offers built-in model versioning and experiment tracking capabilities
Cons of BentoML
- Steeper learning curve due to its more comprehensive feature set
- May have higher resource overhead for simpler serving scenarios
- Less tightly integrated with PyTorch ecosystem
Code Comparison
BentoML:
import bentoml

# Class-based service definition in the style of BentoML's pre-1.0 API
@bentoml.env(pip_packages=["torch"])
@bentoml.artifacts([bentoml.PyTorchModelArtifact("model")])
class MyService(bentoml.BentoService):
    @bentoml.api(input=bentoml.Image(), output=bentoml.JsonOutput())
    def predict(self, image):
        return self.artifacts.model(image)
TorchServe:
import torch
from ts.torch_handler.base_handler import BaseHandler

class MyHandler(BaseHandler):
    def preprocess(self, data):
        return torch.tensor(data)

    def inference(self, data):
        return self.model(data)

    def postprocess(self, data):
        return data.tolist()
Production infrastructure for machine learning at scale
Pros of Cortex
- Supports multiple machine learning frameworks (TensorFlow, PyTorch, scikit-learn, etc.)
- Provides automatic scaling and infrastructure management
- Offers a more comprehensive end-to-end ML deployment solution
Cons of Cortex
- Steeper learning curve due to more complex architecture
- Less tightly integrated with PyTorch ecosystem
- May have higher operational costs for smaller projects
Code Comparison
Cortex deployment:
- name: iris-classifier
  predictor:
    type: python
    path: predictor.py
  compute:
    cpu: 1
TorchServe deployment:
torch-model-archiver --model-name densenet161 --version 1.0 --model-file model.py --serialized-file densenet161-8d451a50.pth --export-path model_store --extra-files index_to_name.json --handler image_classifier
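When the archiving step is scripted, it can help to assemble the command as an argument list rather than a shell string. The sketch below does that with the flags used in the command above; `archiver_command` is a hypothetical helper, and joining several extra files with commas is an assumption about how the flag is normally used.

```python
import shlex
from typing import List, Optional, Sequence


def archiver_command(
    model_name: str,
    version: str,
    serialized_file: str,
    handler: str,
    model_file: Optional[str] = None,
    export_path: Optional[str] = None,
    extra_files: Sequence[str] = (),
) -> List[str]:
    """Return the torch-model-archiver invocation as an argument list."""
    cmd = [
        "torch-model-archiver",
        "--model-name", model_name,
        "--version", version,
        "--serialized-file", serialized_file,
        "--handler", handler,
    ]
    if model_file:
        cmd += ["--model-file", model_file]
    if export_path:
        cmd += ["--export-path", export_path]
    if extra_files:
        # Assumption: multiple extra files are passed as one comma-separated value.
        cmd += ["--extra-files", ",".join(extra_files)]
    return cmd


cmd = archiver_command(
    "densenet161", "1.0", "densenet161-8d451a50.pth", "image_classifier",
    model_file="model.py", export_path="model_store",
    extra_files=["index_to_name.json"],
)
print(shlex.join(cmd))
```

The list form can be passed directly to `subprocess.run(cmd, check=True)` once the archiver is installed, which avoids shell quoting issues with file paths.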
Summary
Cortex offers a more comprehensive solution for deploying machine learning models across various frameworks, with built-in scaling and infrastructure management. However, it may be more complex to set up and potentially costlier for smaller projects. TorchServe, on the other hand, provides a simpler, PyTorch-focused deployment option that might be more suitable for projects primarily using PyTorch models.
ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
Pros of ONNX Runtime
- Broader model support: Works with models from various frameworks, not just PyTorch
- Performance optimizations: Offers advanced optimizations for inference across different hardware
- Cross-platform compatibility: Supports a wide range of operating systems and devices
Cons of ONNX Runtime
- Steeper learning curve: Requires more setup and configuration compared to TorchServe
- Less integrated with PyTorch ecosystem: May require additional steps for PyTorch model deployment
Code Comparison
ONNX Runtime:
import onnxruntime as ort
session = ort.InferenceSession("model.onnx")
input_name = session.get_inputs()[0].name
output = session.run(None, {input_name: input_data})
TorchServe:
import torch
from ts.torch_handler.base_handler import BaseHandler

class ModelHandler(BaseHandler):
    def preprocess(self, data):
        # Preprocess input data
        return super().preprocess(data)

    def inference(self, data):
        # Perform inference
        return super().inference(data)

    def postprocess(self, data):
        # Postprocess output data
        return super().postprocess(data)
Both repositories offer robust solutions for model serving, with ONNX Runtime providing broader framework support and optimizations, while TorchServe offers tighter integration with the PyTorch ecosystem and simpler deployment for PyTorch models.
The Triton Inference Server provides an optimized cloud and edge inferencing solution.
Pros of Triton Inference Server
- Supports multiple deep learning frameworks (TensorFlow, PyTorch, ONNX, etc.)
- Offers advanced features like dynamic batching and model ensembling
- Provides optimized performance for GPU inference
Cons of Triton Inference Server
- Steeper learning curve and more complex setup
- Less integrated with PyTorch ecosystem
- May be overkill for simpler deployment scenarios
Code Comparison
Triton Inference Server (model configuration, config.pbtxt):
name: "mymodel"
backend: "pytorch"
max_batch_size: 8
input [
  { name: "INPUT__0" data_type: TYPE_FP32 dims: [ 3, 224, 224 ] }
]
output [
  { name: "OUTPUT__0" data_type: TYPE_FP32 dims: [ 1000 ] }
]
TorchServe (model handler):
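from ts.torch_handler.base_handler import BaseHandler

class MyHandler(BaseHandler):
    def preprocess(self, data):
        # Preprocess input data
        return super().preprocess(data)

    def inference(self, data):
        # Perform inference
        return super().inference(data)

    def postprocess(self, data):
        # Postprocess output data
        return super().postprocess(data)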
Both servers offer powerful inference capabilities, but Triton Inference Server provides more flexibility across frameworks and advanced features, while TorchServe offers a simpler, more PyTorch-centric approach.
Open source platform for the machine learning lifecycle
Pros of MLflow
- Broader scope: Supports the entire machine learning lifecycle, including experiment tracking, model packaging, and deployment
- Language-agnostic: Works with various ML frameworks and languages, not limited to PyTorch
- Extensive integrations: Offers integrations with popular data science tools and platforms
Cons of MLflow
- Less specialized for serving: Not as focused on model serving capabilities as TorchServe
- Steeper learning curve: May require more time to set up and configure due to its comprehensive feature set
Code Comparison
MLflow example:
import mlflow
import mlflow.pytorch

# `model` is a trained torch.nn.Module
with mlflow.start_run():
    mlflow.log_param("param1", 5)
    mlflow.log_metric("accuracy", 0.95)
    mlflow.pytorch.log_model(model, "model")
TorchServe example:
import torch
from torch import nn

class MyModel(nn.Module):
    def forward(self, x):
        return x * 2

model = MyModel()
torch.save(model.state_dict(), "mymodel.pth")
Both repositories offer valuable tools for machine learning workflows, with MLflow providing a more comprehensive solution for the entire ML lifecycle, while TorchServe focuses specifically on serving PyTorch models efficiently.
ANNOUNCEMENT: Security Changes
TorchServe now enables token authorization and disables model API control by default. These security features are intended to address the concern of unauthorized API calls and to prevent potentially malicious code from being introduced to the model server. Refer to the following documentation for more information: Token Authorization, Model API control
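With token authorization enabled, clients have to present a key with each call. The sketch below shows one way a client might attach it. It assumes the key is sent as a bearer token in the `Authorization` header and that the server's key file is JSON with an `inference` entry holding a `key` field. Both are assumptions to check against the Token Authorization documentation, and the helper names and sample key are hypothetical.

```python
import json
import urllib.request


def parse_inference_key(key_file_text: str) -> str:
    """Extract the inference key from key-file JSON (layout is an assumption)."""
    keys = json.loads(key_file_text)
    return keys["inference"]["key"]


def with_bearer_token(request: urllib.request.Request, token: str) -> urllib.request.Request:
    """Attach the token as an Authorization header (assumed bearer scheme)."""
    request.add_header("Authorization", f"Bearer {token}")
    return request


# Sample key-file content, made up for this sketch.
sample_key_file = '{"inference": {"key": "example-key"}}'
token = parse_inference_key(sample_key_file)
req = with_bearer_token(
    urllib.request.Request("http://localhost:8080/predictions/my_model", data=b"", method="POST"),
    token,
)
print(req.get_header("Authorization"))
```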
TorchServe
TorchServe is a flexible and easy-to-use tool for serving and scaling PyTorch models in production.
Requires Python >= 3.8
curl http://127.0.0.1:8080/predictions/bert -T input.txt
Quick start with TorchServe
# Install dependencies
# cuda is optional
python ./ts_scripts/install_dependencies.py --cuda=cu121
# Latest release
pip install torchserve torch-model-archiver torch-workflow-archiver
# Nightly build
pip install torchserve-nightly torch-model-archiver-nightly torch-workflow-archiver-nightly
Quick start with TorchServe (conda)
# Install dependencies
# cuda is optional
python ./ts_scripts/install_dependencies.py --cuda=cu121
# Latest release
conda install -c pytorch torchserve torch-model-archiver torch-workflow-archiver
# Nightly build
conda install -c pytorch-nightly torchserve torch-model-archiver torch-workflow-archiver
Quick Start with Docker
# Latest release
docker pull pytorch/torchserve
# Nightly build
docker pull pytorch/torchserve-nightly
Refer to torchserve docker for details.
Quick Start LLM Deployment
vLLM Engine
# Make sure to install torchserve with pip or conda as described above and login with `huggingface-cli login`
python -m ts.llm_launcher --model_id meta-llama/Llama-3.2-3B-Instruct --disable_token_auth
# Try it out
curl -X POST -d '{"model":"meta-llama/Llama-3.2-3B-Instruct", "prompt":"Hello, my name is", "max_tokens": 200}' --header "Content-Type: application/json" "http://localhost:8080/predictions/model/1.0/v1/completions"
TRT-LLM Engine
# Make sure to install torchserve with python venv as described above and login with `huggingface-cli login`
# pip install -U --use-deprecated=legacy-resolver -r requirements/trt_llm.txt
python -m ts.llm_launcher --model_id meta-llama/Meta-Llama-3.1-8B-Instruct --engine trt_llm --disable_token_auth
# Try it out
curl -X POST -d '{"prompt":"count from 1 to 9 in french ", "max_tokens": 100}' --header "Content-Type: application/json" "http://localhost:8080/predictions/model"
Quick Start LLM Deployment with Docker
#export token=<HUGGINGFACE_HUB_TOKEN>
docker build --pull . -f docker/Dockerfile.vllm -t ts/vllm
docker run --rm -ti --shm-size 10g --gpus all -e HUGGING_FACE_HUB_TOKEN=$token -p 8080:8080 -v data:/data ts/vllm --model_id meta-llama/Meta-Llama-3-8B-Instruct --disable_token_auth
# Try it out
curl -X POST -d '{"model":"meta-llama/Meta-Llama-3-8B-Instruct", "prompt":"Hello, my name is", "max_tokens": 200}' --header "Content-Type: application/json" "http://localhost:8080/predictions/model/1.0/v1/completions"
Refer to LLM deployment for details and other methods.
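The completion calls above send a small JSON body to the server. A minimal Python sketch of the same request follows. `build_completion_request` is a hypothetical helper, and the URL and field names are copied from the vLLM curl example. Only the request object is built, so no server is needed to run it.

```python
import json
import urllib.request

# Endpoint taken from the vLLM quick-start curl command above.
COMPLETIONS_URL = "http://localhost:8080/predictions/model/1.0/v1/completions"


def build_completion_request(
    model_id: str,
    prompt: str,
    max_tokens: int = 200,
    url: str = COMPLETIONS_URL,
) -> urllib.request.Request:
    """Build a JSON POST request mirroring the curl example."""
    body = json.dumps(
        {"model": model_id, "prompt": prompt, "max_tokens": max_tokens}
    ).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_completion_request("meta-llama/Llama-3.2-3B-Instruct", "Hello, my name is")
print(req.full_url)
print(req.data.decode("utf-8"))
```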
Why TorchServe
- Write once, run anywhere: on-prem or in the cloud, with inference on CPUs, GPUs, AWS Inf1/Inf2/Trn1, Google Cloud TPUs, and Nvidia MPS
- Model Management API: multi-model management with optimized worker-to-model allocation
- Inference API: REST and gRPC support for batched inference
- TorchServe Workflows: deploy complex DAGs with multiple interdependent models
- Default way to serve PyTorch models in
  - Sagemaker
  - Vertex AI
  - Kubernetes, with support for autoscaling, session affinity, and monitoring using Grafana; works on-prem and on AWS EKS, Google GKE, and Azure AKS
  - Kserve: supports both v1 and v2 API, autoscaling, and canary deployments for A/B testing
  - Kubeflow
  - MLflow
- Export your model for optimized inference: TorchScript out of the box, PyTorch Compiler preview, ORT and ONNX, IPEX, TensorRT, FasterTransformer, FlashAttention (Better Transformers)
- Performance Guide: built-in support to optimize, benchmark, and profile PyTorch and TorchServe performance
- Expressive handlers: a handler architecture that makes it straightforward to support inference for your use case, with many handlers supported out of the box
- Metrics API: out-of-the-box support for system-level metrics with Prometheus exports and custom metrics
- Large Model Inference Guide: support for GenAI and LLMs, including
  - SOTA GenAI performance using torch.compile
  - Fast kernels with FlashAttention v2, continuous batching and streaming response
  - PyTorch Tensor Parallel preview, Pipeline Parallel
  - Microsoft DeepSpeed, DeepSpeed-Mii
  - Hugging Face Accelerate, Diffusers
  - Running large models on AWS Sagemaker and Inferentia2
  - Running Meta Llama Chatbot locally on Mac
- Monitoring using Grafana and Datadog
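The Model Management API and batched inference mentioned in the list can be tied together with a short sketch. It builds the URL for registering an archive with batching settings. Port 8081 as the management port and the `batch_size`, `max_batch_delay`, and `initial_workers` query parameters are assumptions here, and `register_model_url` is a hypothetical helper.

```python
from urllib.parse import urlencode


def register_model_url(
    archive: str,
    batch_size: int = 1,
    max_batch_delay: int = 100,
    initial_workers: int = 1,
    host: str = "http://localhost:8081",
) -> str:
    """Build the management URL that registers an archive with batching settings."""
    query = urlencode(
        {
            "url": archive,                      # archive name inside the model store
            "batch_size": batch_size,            # max requests grouped into one batch
            "max_batch_delay": max_batch_delay,  # ms to wait for a batch to fill
            "initial_workers": initial_workers,  # workers started at registration
        }
    )
    return f"{host}/models?{query}"


url = register_model_url("my_model.mar", batch_size=8, max_batch_delay=50)
print(url)
```

A POST to this URL would register the model on a running server. Since model API control is disabled by default, such a call may be rejected unless that control is enabled.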
How does TorchServe work
- Model Server for PyTorch Documentation: Full documentation
- TorchServe internals: How TorchServe was built
- Contributing guide: How to contribute to TorchServe
Highlighted Examples
- Serving Meta Llama with TorchServe
- Chatbot with Meta Llama on Mac
- HuggingFace Transformers with a Better Transformer Integration/ Flash Attention & Xformer Memory Efficient
- Stable Diffusion
- Model parallel inference
- MultiModal models with MMF combining text, audio and video
- Dual Neural Machine Translation for a complex workflow DAG
- TorchServe Integrations
- TorchServe Internals
- TorchServe UseCases
For more examples
TorchServe Security Policy
Learn More
Contributing
We welcome all contributions!
To learn more about how to contribute, see the contributor guide here.
News
- High performance Llama 2 deployments with AWS Inferentia2 using TorchServe
- Naver Case Study: Transition From High-Cost GPUs to Intel CPUs and oneAPI powered Software with performance
- Run multiple generative AI models on GPU using Amazon SageMaker multi-model endpoints with TorchServe and save up to 75% in inference costs
- Deploying your Generative AI model in only four steps with Vertex AI and PyTorch
- PyTorch Model Serving on Google Cloud TPU v5
- Monitoring using Datadog
- Torchserve Performance Tuning, Animated Drawings Case-Study
- Walmart Search: Serving Models at a Scale on TorchServe
- Scaling inference on CPU with TorchServe
- TorchServe C++ backend
- Grokking Intel CPU PyTorch performance from first principles: a TorchServe case study
- Grokking Intel CPU PyTorch performance from first principles (Part 2): a TorchServe case study
- Case Study: Amazon Ads Uses PyTorch and AWS Inferentia to Scale Models for Ads Processing
- Optimize your inference jobs using dynamic batch inference with TorchServe on Amazon SageMaker
- Using AI to bring children's drawings to life
- Model Serving in PyTorch
- Evolution of Cresta's machine learning architecture: Migration to AWS and PyTorch
- Explain Like I'm 5: TorchServe
- How to Serve PyTorch Models with TorchServe
- How to deploy PyTorch models on Vertex AI
- Quantitative Comparison of Serving Platforms
- Efficient Serverless deployment of PyTorch models on Azure
- Deploy PyTorch models with TorchServe in Azure Machine Learning online endpoints
- Dynaboard moving beyond accuracy to holistic model evaluation in NLP
- A MLOps Tale about operationalising MLFlow and PyTorch
- Operationalize, Scale and Infuse Trust in AI Models using KFServing
- How Wadhwani AI Uses PyTorch To Empower Cotton Farmers
- TorchServe Streamlit Integration
- Dynabench aims to make AI models more robust through distributed human workers
- Announcing TorchServe
Disclaimer
This repository is jointly operated and maintained by Amazon, Meta and a number of individual contributors listed in the CONTRIBUTORS file. For questions directed at Meta, please send an email to opensource@fb.com. For questions directed at Amazon, please send an email to torchserve@amazon.com. For all other questions, please open up an issue in this repository here.
TorchServe acknowledges the Multi Model Server (MMS) project, from which it was derived.