Top Related Projects
MediaPipe: Cross-platform, customizable ML solutions for live and streaming media.
ONNX: Open standard for machine learning interoperability
ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
Serve (TorchServe): Serve, optimize and scale PyTorch models in production
The Triton Inference Server provides an optimized cloud and edge inferencing solution.
BentoML: The easiest way to serve AI apps and models - Build reliable Inference APIs, LLM apps, Multi-model chains, RAG service, and much more!
Quick Overview
TensorFlow Serving is an open-source serving system for machine learning models, designed to take models from experimentation to production environments. It allows for easy deployment of algorithms and experiments while maintaining the same server architecture and APIs. TensorFlow Serving is particularly well-suited for TensorFlow models but can be extended to serve other types of models and data.
Pros
- Flexible architecture that supports multiple machine learning frameworks
- Efficient model versioning and concurrent model serving (see the versioned-request sketch after this list)
- High performance, supporting batching and GPU acceleration
- Easy integration with TensorFlow and other ML ecosystems
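To make the versioning point concrete, here is a minimal, hypothetical sketch: it assumes a server is already running with its REST port on 8501 and that a model named my_model has a version 2 loaded. Alongside the default latest-version endpoint, TensorFlow Serving exposes a versions/<N>:predict endpoint so clients can pin a specific version:
import requests
# Hypothetical names: server on localhost:8501, model "my_model", version 2 loaded
url = "http://localhost:8501/v1/models/my_model/versions/2:predict"
response = requests.post(url, json={"instances": [[1.0, 2.0, 3.0]]})
print(response.json()["predictions"])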
Cons
- Steep learning curve for beginners
- Limited support for non-TensorFlow models out of the box
- Can be resource-intensive for large-scale deployments
- Documentation can be sparse for advanced use cases
Code Examples
- Querying a served model over gRPC:
import grpc
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2_grpc
# Connect to a running model server (the gRPC port defaults to 8500)
channel = grpc.insecure_channel('localhost:8500')
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)
request = predict_pb2.PredictRequest()
request.model_spec.name = 'my_model'
request.model_spec.signature_name = 'serving_default'
# 'input' must match the input name in the model's serving signature
request.inputs['input'].CopyFrom(tf.make_tensor_proto(input_data))
result = stub.Predict(request, 10.0)  # 10-second timeout
- Exporting a trained model and starting a model server:
import tensorflow as tf
model = tf.keras.Sequential([...])  # Define your model
model.compile(...)
model.fit(...)
# Export as a SavedModel under a numbered version subdirectory
tf.saved_model.save(model, '/models/my_model/1')
# Then serve it with the standalone model server binary:
# tensorflow_model_server --rest_api_port=8501 --model_name=my_model --model_base_path=/models/my_model
- Making predictions using REST API:
import requests
import json
data = json.dumps({"signature_name": "serving_default", "instances": [1.0, 2.0, 3.0]})
headers = {"content-type": "application/json"}
json_response = requests.post('http://localhost:8501/v1/models/my_model:predict', data=data, headers=headers)
predictions = json.loads(json_response.text)['predictions']
Getting Started
- Install TensorFlow Serving:
echo "deb [arch=amd64] http://storage.googleapis.com/tensorflow-serving-apt stable tensorflow-model-server tensorflow-model-server-universal" | sudo tee /etc/apt/sources.list.d/tensorflow-serving.list
curl https://storage.googleapis.com/tensorflow-serving-apt/tensorflow-serving.release.pub.gpg | sudo apt-key add -
sudo apt-get update && sudo apt-get install tensorflow-model-server
- Serve a model:
tensorflow_model_server --port=8500 --rest_api_port=8501 --model_name=my_model --model_base_path=/path/to/my_model
- Make predictions using the gRPC API:
import grpc
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2_grpc
channel = grpc.insecure_channel('localhost:8500')
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)
request = predict_pb2.PredictRequest()
request.model_spec.name = 'my_model'
request.model_spec.signature_name = 'serving_default'
request.inputs['input'].CopyFrom(tf.make_tensor_proto(input_data))  # input_data: your input array
result = stub.Predict(request, 10.0)  # 10-second timeout
Competitor Comparisons
Cross-platform, customizable ML solutions for live and streaming media.
Pros of MediaPipe
- More versatile, supporting various tasks beyond model serving (e.g., audio, video processing)
- Cross-platform support (mobile, web, desktop)
- Easier to integrate into end-user applications
Cons of MediaPipe
- Less focused on high-performance model serving
- May require more setup and configuration for specific use cases
- Potentially steeper learning curve due to broader feature set
Code Comparison
MediaPipe (graph-based pipeline):
import mediapipe as mp
mp_hands = mp.solutions.hands
hands = mp_hands.Hands()
results = hands.process(image)
TensorFlow Serving (REST API request):
import requests
data = {"instances": [image.tolist()]}
response = requests.post(url, json=data)
predictions = response.json()["predictions"]
MediaPipe offers a more integrated approach for specific tasks, while TensorFlow Serving focuses on efficient model deployment and scaling. MediaPipe is better suited for end-to-end applications, especially on mobile and web platforms, while TensorFlow Serving excels in high-performance model serving for production environments.
Open standard for machine learning interoperability
Pros of ONNX
- Framework-agnostic: Supports multiple ML frameworks, not limited to TensorFlow
- Broader ecosystem: Wide range of tools and libraries for model conversion and optimization
- Lightweight: Focused on model representation, not tied to a specific serving infrastructure
Cons of ONNX
- Less integrated: Requires additional components for deployment and serving
- Limited built-in serving capabilities: Primarily a model format, not a complete serving solution
- Steeper learning curve: May require more setup and configuration for deployment
Code Comparison
ONNX model definition:
import onnx
from onnx import TensorProto
node = onnx.helper.make_node("Relu", inputs=["x"], outputs=["y"])
x = onnx.helper.make_tensor_value_info("x", TensorProto.FLOAT, [1, 4])
y = onnx.helper.make_tensor_value_info("y", TensorProto.FLOAT, [1, 4])
graph = onnx.helper.make_graph([node], "test-model", [x], [y])
model = onnx.helper.make_model(graph)
TensorFlow Serving model definition:
import tensorflow as tf
model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, activation='relu', input_shape=(5,)),
    tf.keras.layers.Dense(1)
])
tf.saved_model.save(model, "saved_model_dir")
ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
Pros of ONNX Runtime
- Supports multiple frameworks (TensorFlow, PyTorch, etc.) through ONNX format
- Offers better performance optimization across different hardware
- Provides a more flexible deployment option for various platforms
Cons of ONNX Runtime
- May require additional steps to convert models to ONNX format
- Less mature ecosystem compared to TensorFlow Serving
- Potentially more complex setup for certain use cases
Code Comparison
ONNX Runtime:
import onnxruntime as ort
session = ort.InferenceSession("model.onnx")
input_name = session.get_inputs()[0].name
output = session.run(None, {input_name: input_data})
TensorFlow Serving:
import tensorflow as tf
# Load the exported SavedModel directly (the same format TensorFlow Serving serves over gRPC/REST)
model = tf.saved_model.load("saved_model_dir")
output = model.signatures["serving_default"](tf.constant(input_data))
Both ONNX Runtime and TensorFlow Serving aim to provide efficient model serving solutions. ONNX Runtime offers broader framework support and potentially better cross-platform optimization, while TensorFlow Serving provides a more streamlined experience for TensorFlow models with a mature ecosystem. The choice between them depends on specific project requirements and the frameworks used in model development.
Serve, optimize and scale PyTorch models in production
Pros of Serve
- More flexible and customizable serving architecture
- Easier integration with PyTorch ecosystem and models
- Supports multi-model serving out of the box
Cons of Serve
- Less mature and battle-tested in production environments
- Smaller community and ecosystem compared to TensorFlow Serving
- Limited support for non-PyTorch models
Code Comparison
Serve:
import torch
from torchvision import models
model = models.resnet18(pretrained=True)
# Save the weights; TorchServe packages them into a .mar archive via torch-model-archiver
torch.save(model.state_dict(), "resnet18.pth")
TensorFlow Serving:
import tensorflow as tf
model = tf.keras.applications.ResNet50(weights='imagenet')
tf.saved_model.save(model, "resnet50/1/")
Both frameworks offer straightforward ways to save models for serving, but Serve uses PyTorch's native format while TensorFlow Serving uses the SavedModel format. Serve's approach is more aligned with PyTorch's ecosystem, making it easier for PyTorch users to deploy their models. However, TensorFlow Serving's SavedModel format is more widely supported and offers better versioning capabilities.
The Triton Inference Server provides an optimized cloud and edge inferencing solution.
Pros of Triton Inference Server
- Supports multiple deep learning frameworks (TensorFlow, PyTorch, ONNX, etc.)
- Offers dynamic batching and concurrent model execution
- Provides GPU acceleration and multi-GPU support
Cons of Triton Inference Server
- Steeper learning curve due to more complex configuration
- May have higher resource overhead for simpler deployment scenarios
Code Comparison
TensorFlow Serving:
import grpc
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2_grpc
channel = grpc.insecure_channel('localhost:8500')
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)
request = predict_pb2.PredictRequest()
request.model_spec.name = 'my_model'
request.inputs['input'].CopyFrom(tf.make_tensor_proto(input_data))
Triton Inference Server:
import tritonclient.grpc as grpcclient
client = grpcclient.InferenceServerClient(url='localhost:8001')
inputs = [grpcclient.InferInput('input', input_shape, 'FP32')]
inputs[0].set_data_from_numpy(input_data)
result = client.infer(model_name='my_model', inputs=inputs)
Both repositories provide powerful inference serving capabilities, but Triton Inference Server offers more flexibility in terms of supported frameworks and deployment options, while TensorFlow Serving is more tightly integrated with the TensorFlow ecosystem.
The easiest way to serve AI apps and models - Build reliable Inference APIs, LLM apps, Multi-model chains, RAG service, and much more!
Pros of BentoML
- Supports multiple ML frameworks (TensorFlow, PyTorch, scikit-learn, etc.)
- Easier to use and more flexible for various deployment scenarios
- Built-in model versioning and management
Cons of BentoML
- Less optimized for TensorFlow-specific deployments
- Smaller community and ecosystem compared to TensorFlow Serving
Code Comparison
BentoML:
import bentoml
@bentoml.env(pip_packages=["tensorflow"])
@bentoml.artifacts([TensorflowSavedModelArtifact('model')])
class TensorflowModelService(bentoml.BentoService):
    @bentoml.api(input=TensorflowTensorInput())
    def predict(self, input_data):
        return self.artifacts.model(input_data)
TensorFlow Serving:
import tensorflow as tf
model = tf.saved_model.load("path/to/model")
serving_fn = model.signatures["serving_default"]
def predict(input_data):
    return serving_fn(tf.constant(input_data))
BentoML offers a more declarative approach with built-in service definition, while TensorFlow Serving requires separate model loading and serving setup. BentoML's code is more self-contained and easier to package for deployment, whereas TensorFlow Serving typically requires additional configuration files and deployment scripts.
README
TensorFlow Serving
TensorFlow Serving is a flexible, high-performance serving system for machine learning models, designed for production environments. It deals with the inference aspect of machine learning, taking models after training and managing their lifetimes, providing clients with versioned access via a high-performance, reference-counted lookup table. TensorFlow Serving provides out-of-the-box integration with TensorFlow models, but can be easily extended to serve other types of models and data.
To note a few features:
- Can serve multiple models, or multiple versions of the same model simultaneously
- Exposes both gRPC as well as HTTP inference endpoints (see the status-check sketch after this list)
- Allows deployment of new model versions without changing any client code
- Supports canarying new versions and A/B testing experimental models
- Adds minimal latency to inference time due to efficient, low-overhead implementation
- Features a scheduler that groups individual inference requests into batches for joint execution on GPU, with configurable latency controls
- Supports many servables: Tensorflow models, embeddings, vocabularies, feature transformations and even non-Tensorflow-based machine learning models
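As a small, hedged illustration of the HTTP endpoint mentioned above (it assumes a server already running on localhost:8501 and serving a model named half_plus_two, as in the Docker quick-start that follows), the model status API can be used to check which versions are loaded before sending a predict request:
import requests
# Assumes a running server at localhost:8501 serving "half_plus_two" (see the Docker example below)
status = requests.get("http://localhost:8501/v1/models/half_plus_two").json()
print(status["model_version_status"])  # loaded versions and their states
result = requests.post(
    "http://localhost:8501/v1/models/half_plus_two:predict",
    json={"instances": [1.0, 2.0, 5.0]},
).json()
print(result["predictions"])  # => [2.5, 3.0, 4.5]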
Serve a Tensorflow model in 60 seconds
# Download the TensorFlow Serving Docker image and repo
docker pull tensorflow/serving
git clone https://github.com/tensorflow/serving
# Location of demo models
TESTDATA="$(pwd)/serving/tensorflow_serving/servables/tensorflow/testdata"
# Start TensorFlow Serving container and open the REST API port
docker run -t --rm -p 8501:8501 \
-v "$TESTDATA/saved_model_half_plus_two_cpu:/models/half_plus_two" \
-e MODEL_NAME=half_plus_two \
tensorflow/serving &
# Query the model using the predict API
curl -d '{"instances": [1.0, 2.0, 5.0]}' \
-X POST http://localhost:8501/v1/models/half_plus_two:predict
# Returns => { "predictions": [2.5, 3.0, 4.5] }
End-to-End Training & Serving Tutorial
Refer to the official TensorFlow documentation site for a complete tutorial to train and serve a TensorFlow model.
Documentation
Set up
The easiest and most straight-forward way of using TensorFlow Serving is with Docker images. We highly recommend this route unless you have specific needs that are not addressed by running in a container.
- Install Tensorflow Serving using Docker (Recommended)
- Install Tensorflow Serving without Docker (Not Recommended)
- Build Tensorflow Serving from Source with Docker
- Deploy Tensorflow Serving on Kubernetes
Use
Export your Tensorflow model
In order to serve a Tensorflow model, simply export a SavedModel from your Tensorflow program. SavedModel is a language-neutral, recoverable, hermetic serialization format that enables higher-level systems and tools to produce, consume, and transform TensorFlow models.
Please refer to Tensorflow documentation for detailed instructions on how to export SavedModels.
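As a minimal sketch (the architecture, path, and version number here are placeholders), a Keras model can be exported into a numbered version subdirectory, which is the layout tensorflow_model_server expects under --model_base_path:
import tensorflow as tf
# Placeholder model; substitute your own architecture and training
model = tf.keras.Sequential([
    tf.keras.Input(shape=(5,)),
    tf.keras.layers.Dense(10, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
# ... model.fit(...) ...
# Export as a SavedModel; the trailing "1" is the version directory the server watches
tf.saved_model.save(model, "/models/my_model/1")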
Configure and Use Tensorflow Serving
- Follow a tutorial on Serving Tensorflow models
- Configure Tensorflow Serving to make it fit your serving use case
- Read the Performance Guide and learn how to use TensorBoard to profile and optimize inference requests
- Read the REST API Guide or gRPC API definition
- Use SavedModel Warmup if initial inference requests are slow due to lazy initialization of graph (a warmup-file sketch follows this list)
- If encountering issues regarding model signatures, please read the SignatureDef documentation
- If using a model with custom ops, learn how to serve models with custom ops
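For the SavedModel Warmup item above, the following is a hedged sketch of generating a warmup record file; the model name, signature name, input key, and example values are assumptions, and the resulting file belongs at <model_base_path>/<version>/assets.extra/tf_serving_warmup_requests:
import tensorflow as tf
from tensorflow_serving.apis import model_pb2
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_log_pb2
# Assumed names: model "my_model", signature "serving_default", input key "input"
request = predict_pb2.PredictRequest(
    model_spec=model_pb2.ModelSpec(name="my_model", signature_name="serving_default"),
    inputs={"input": tf.make_tensor_proto([[1.0, 2.0, 3.0]])},
)
log = prediction_log_pb2.PredictionLog(
    predict_log=prediction_log_pb2.PredictLog(request=request))
# Write the record, then place the file under <version dir>/assets.extra/
with tf.io.TFRecordWriter("tf_serving_warmup_requests") as writer:
    writer.write(log.SerializeToString())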
Extend
Tensorflow Serving's architecture is highly modular. You can use some parts individually (e.g. batch scheduling) and/or extend it to serve new use cases.
- Ensure you are familiar with building Tensorflow Serving
- Learn about Tensorflow Serving's architecture
- Explore the Tensorflow Serving C++ API reference
- Create a new type of Servable
- Create a custom Source of Servable versions
Contribute
If you'd like to contribute to TensorFlow Serving, be sure to review the contribution guidelines.
For more information
Please refer to the official TensorFlow website for more information.