Top Related Projects
MediaPipe: Cross-platform, customizable ML solutions for live and streaming media.
ONNX: Open standard for machine learning interoperability
ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
Serve (TorchServe): Serve, optimize and scale PyTorch models in production
The Triton Inference Server provides an optimized cloud and edge inferencing solution.
BentoML: The easiest way to serve AI apps and models - Build reliable Inference APIs, LLM apps, Multi-model chains, RAG service, and much more!
Quick Overview
TensorFlow Serving is an open-source serving system for machine learning models, designed to take models from experimentation to production environments. It allows for easy deployment of algorithms and experiments while maintaining the same server architecture and APIs. TensorFlow Serving is particularly well-suited for TensorFlow models but can be extended to serve other types of models and data.
Pros
- Flexible architecture that supports multiple machine learning frameworks
- Efficient model versioning and concurrent model serving (see the versioned-request sketch after this list)
- High performance, supporting batching and GPU acceleration
- Easy integration with TensorFlow and other ML ecosystems
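To make the versioning point concrete, here is a minimal, hypothetical sketch: it assumes a server is already running with its REST port on 8501 and that a model named my_model has a version 2 loaded. Alongside the default latest-version endpoint, TensorFlow Serving exposes a versions/<N>:predict endpoint so clients can pin a specific version:
import requests
# Hypothetical names: server on localhost:8501, model "my_model", version 2 loaded
url = "http://localhost:8501/v1/models/my_model/versions/2:predict"
response = requests.post(url, json={"instances": [[1.0, 2.0, 3.0]]})
print(response.json()["predictions"])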
Cons
- Steep learning curve for beginners
- Limited support for non-TensorFlow models out of the box
- Can be resource-intensive for large-scale deployments
- Documentation can be sparse for advanced use cases
Code Examples
- Querying a served model over gRPC:
import grpc
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2_grpc
# Connect to a running model server (the gRPC port defaults to 8500)
channel = grpc.insecure_channel('localhost:8500')
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)
request = predict_pb2.PredictRequest()
request.model_spec.name = 'my_model'
request.model_spec.signature_name = 'serving_default'
# 'input' must match the input name in the model's serving signature
request.inputs['input'].CopyFrom(tf.make_tensor_proto(input_data))
result = stub.Predict(request, 10.0)  # 10-second timeout
- Exporting a trained model and starting a model server:
import tensorflow as tf
model = tf.keras.Sequential([...])  # Define your model
model.compile(...)
model.fit(...)
# Export as a SavedModel under a numbered version subdirectory
tf.saved_model.save(model, '/models/my_model/1')
# Then serve it with the standalone model server binary:
# tensorflow_model_server --rest_api_port=8501 --model_name=my_model --model_base_path=/models/my_model
- Making predictions using REST API:
import requests
import json
data = json.dumps({"signature_name": "serving_default", "instances": [1.0, 2.0, 3.0]})
headers = {"content-type": "application/json"}
json_response = requests.post('http://localhost:8501/v1/models/my_model:predict', data=data, headers=headers)
predictions = json.loads(json_response.text)['predictions']
Getting Started
- Install TensorFlow Serving:
echo "deb [arch=amd64] http://storage.googleapis.com/tensorflow-serving-apt stable tensorflow-model-server tensorflow-model-server-universal" | sudo tee /etc/apt/sources.list.d/tensorflow-serving.list
curl https://storage.googleapis.com/tensorflow-serving-apt/tensorflow-serving.release.pub.gpg | sudo apt-key add -
sudo apt-get update && sudo apt-get install tensorflow-model-server
- Serve a model:
tensorflow_model_server --port=8500 --rest_api_port=8501 --model_name=my_model --model_base_path=/path/to/my_model
- Make predictions using the gRPC API:
import grpc
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2_grpc
channel = grpc.insecure_channel('localhost:8500')
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)
request = predict_pb2.PredictRequest()
request.model_spec.name = 'my_model'
request.model_spec.signature_name = 'serving_default'
request.inputs['input'].CopyFrom(tf.make_tensor_proto(input_data))  # input_data: your input array
result = stub.Predict(request, 10.0)  # 10-second timeout
Competitor Comparisons
Cross-platform, customizable ML solutions for live and streaming media.
Pros of MediaPipe
- More versatile, supporting various tasks beyond model serving (e.g., audio, video processing)
- Cross-platform support (mobile, web, desktop)
- Easier to integrate into end-user applications
Cons of MediaPipe
- Less focused on high-performance model serving
- May require more setup and configuration for specific use cases
- Potentially steeper learning curve due to broader feature set
Code Comparison
MediaPipe (graph-based pipeline):
import mediapipe as mp
mp_hands = mp.solutions.hands
hands = mp_hands.Hands()
results = hands.process(image)
TensorFlow Serving (REST API request):
import requests
data = {"instances": [image.tolist()]}
response = requests.post(url, json=data)
predictions = response.json()["predictions"]
MediaPipe offers a more integrated approach for specific tasks, while TensorFlow Serving focuses on efficient model deployment and scaling. MediaPipe is better suited for end-to-end applications, especially on mobile and web platforms, while TensorFlow Serving excels in high-performance model serving for production environments.
Open standard for machine learning interoperability
Pros of ONNX
- Framework-agnostic: Supports multiple ML frameworks, not limited to TensorFlow
- Broader ecosystem: Wide range of tools and libraries for model conversion and optimization
- Lightweight: Focused on model representation, not tied to a specific serving infrastructure
Cons of ONNX
- Less integrated: Requires additional components for deployment and serving
- Limited built-in serving capabilities: Primarily a model format, not a complete serving solution
- Steeper learning curve: May require more setup and configuration for deployment
Code Comparison
ONNX model definition:
import onnx
from onnx import TensorProto
node = onnx.helper.make_node("Relu", inputs=["x"], outputs=["y"])
x = onnx.helper.make_tensor_value_info("x", TensorProto.FLOAT, [1, 4])
y = onnx.helper.make_tensor_value_info("y", TensorProto.FLOAT, [1, 4])
graph = onnx.helper.make_graph([node], "test-model", [x], [y])
model = onnx.helper.make_model(graph)
TensorFlow Serving model definition:
import tensorflow as tf
model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, activation='relu', input_shape=(5,)),
    tf.keras.layers.Dense(1)
])
tf.saved_model.save(model, "saved_model_dir")
ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
Pros of ONNX Runtime
- Supports multiple frameworks (TensorFlow, PyTorch, etc.) through ONNX format
- Offers better performance optimization across different hardware
- Provides a more flexible deployment option for various platforms
Cons of ONNX Runtime
- May require additional steps to convert models to ONNX format
- Less mature ecosystem compared to TensorFlow Serving
- Potentially more complex setup for certain use cases
Code Comparison
ONNX Runtime:
import onnxruntime as ort
session = ort.InferenceSession("model.onnx")
input_name = session.get_inputs()[0].name
output = session.run(None, {input_name: input_data})
TensorFlow Serving:
import tensorflow as tf
# Load the exported SavedModel directly (the same format TensorFlow Serving serves over gRPC/REST)
model = tf.saved_model.load("saved_model_dir")
output = model.signatures["serving_default"](tf.constant(input_data))
Both ONNX Runtime and TensorFlow Serving aim to provide efficient model serving solutions. ONNX Runtime offers broader framework support and potentially better cross-platform optimization, while TensorFlow Serving provides a more streamlined experience for TensorFlow models with a mature ecosystem. The choice between them depends on specific project requirements and the frameworks used in model development.
Serve, optimize and scale PyTorch models in production
Pros of Serve
- More flexible and customizable serving architecture
- Easier integration with PyTorch ecosystem and models
- Supports multi-model serving out of the box
Cons of Serve
- Less mature and battle-tested in production environments
- Smaller community and ecosystem compared to TensorFlow Serving
- Limited support for non-PyTorch models
Code Comparison
Serve:
import torch
from torchvision import models
model = models.resnet18(pretrained=True)
# Save the weights; TorchServe packages them into a .mar archive via torch-model-archiver
torch.save(model.state_dict(), "resnet18.pth")
TensorFlow Serving:
import tensorflow as tf
model = tf.keras.applications.ResNet50(weights='imagenet')
tf.saved_model.save(model, "resnet50/1/")
Both frameworks offer straightforward ways to save models for serving, but Serve uses PyTorch's native format while TensorFlow Serving uses the SavedModel format. Serve's approach is more aligned with PyTorch's ecosystem, making it easier for PyTorch users to deploy their models. However, TensorFlow Serving's SavedModel format is more widely supported and offers better versioning capabilities.
The Triton Inference Server provides an optimized cloud and edge inferencing solution.
Pros of Triton Inference Server
- Supports multiple deep learning frameworks (TensorFlow, PyTorch, ONNX, etc.)
- Offers dynamic batching and concurrent model execution
- Provides GPU acceleration and multi-GPU support
Cons of Triton Inference Server
- Steeper learning curve due to more complex configuration
- May have higher resource overhead for simpler deployment scenarios
Code Comparison
TensorFlow Serving:
import grpc
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2_grpc
channel = grpc.insecure_channel('localhost:8500')
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)
request = predict_pb2.PredictRequest()
request.model_spec.name = 'my_model'
request.inputs['input'].CopyFrom(tf.make_tensor_proto(input_data))
Triton Inference Server:
import tritonclient.grpc as grpcclient
client = grpcclient.InferenceServerClient(url='localhost:8001')
inputs = [grpcclient.InferInput('input', input_shape, 'FP32')]
inputs[0].set_data_from_numpy(input_data)
result = client.infer(model_name='my_model', inputs=inputs)
Both repositories provide powerful inference serving capabilities, but Triton Inference Server offers more flexibility in terms of supported frameworks and deployment options, while TensorFlow Serving is more tightly integrated with the TensorFlow ecosystem.
The easiest way to serve AI apps and models - Build reliable Inference APIs, LLM apps, Multi-model chains, RAG service, and much more!
Pros of BentoML
- Supports multiple ML frameworks (TensorFlow, PyTorch, scikit-learn, etc.)
- Easier to use and more flexible for various deployment scenarios
- Built-in model versioning and management
Cons of BentoML
- Less optimized for TensorFlow-specific deployments
- Smaller community and ecosystem compared to TensorFlow Serving
Code Comparison
BentoML:
import bentoml
@bentoml.env(pip_packages=["tensorflow"])
@bentoml.artifacts([TensorflowSavedModelArtifact('model')])
class TensorflowModelService(bentoml.BentoService):
    @bentoml.api(input=TensorflowTensorInput())
    def predict(self, input_data):
        return self.artifacts.model(input_data)
TensorFlow Serving:
import tensorflow as tf
model = tf.saved_model.load("path/to/model")
serving_fn = model.signatures["serving_default"]
def predict(input_data):
    return serving_fn(tf.constant(input_data))
BentoML offers a more declarative approach with built-in service definition, while TensorFlow Serving requires separate model loading and serving setup. BentoML's code is more self-contained and easier to package for deployment, whereas TensorFlow Serving typically requires additional configuration files and deployment scripts.
README
TensorFlow Serving
TensorFlow Serving is a flexible, high-performance serving system for machine learning models, designed for production environments. It deals with the inference aspect of machine learning, taking models after training and managing their lifetimes, providing clients with versioned access via a high-performance, reference-counted lookup table. TensorFlow Serving provides out-of-the-box integration with TensorFlow models, but can be easily extended to serve other types of models and data.
To note a few features:
- Can serve multiple models, or multiple versions of the same model simultaneously
- Exposes both gRPC as well as HTTP inference endpoints (see the status-check sketch after this list)
- Allows deployment of new model versions without changing any client code
- Supports canarying new versions and A/B testing experimental models
- Adds minimal latency to inference time due to efficient, low-overhead implementation
- Features a scheduler that groups individual inference requests into batches for joint execution on GPU, with configurable latency controls
- Supports many servables: Tensorflow models, embeddings, vocabularies, feature transformations and even non-Tensorflow-based machine learning models
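As a small, hedged illustration of the HTTP endpoint mentioned above (it assumes a server already running on localhost:8501 and serving a model named half_plus_two, as in the Docker quick-start that follows), the model status API can be used to check which versions are loaded before sending a predict request:
import requests
# Assumes a running server at localhost:8501 serving "half_plus_two" (see the Docker example below)
status = requests.get("http://localhost:8501/v1/models/half_plus_two").json()
print(status["model_version_status"])  # loaded versions and their states
result = requests.post(
    "http://localhost:8501/v1/models/half_plus_two:predict",
    json={"instances": [1.0, 2.0, 5.0]},
).json()
print(result["predictions"])  # => [2.5, 3.0, 4.5]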
Serve a Tensorflow model in 60 seconds
# Download the TensorFlow Serving Docker image and repo
docker pull tensorflow/serving
git clone https://github.com/tensorflow/serving
# Location of demo models
TESTDATA="$(pwd)/serving/tensorflow_serving/servables/tensorflow/testdata"
# Start TensorFlow Serving container and open the REST API port
docker run -t --rm -p 8501:8501 \
-v "$TESTDATA/saved_model_half_plus_two_cpu:/models/half_plus_two" \
-e MODEL_NAME=half_plus_two \
tensorflow/serving &
# Query the model using the predict API
curl -d '{"instances": [1.0, 2.0, 5.0]}' \
-X POST http://localhost:8501/v1/models/half_plus_two:predict
# Returns => { "predictions": [2.5, 3.0, 4.5] }
End-to-End Training & Serving Tutorial
Refer to the official TensorFlow documentation site for a complete tutorial to train and serve a TensorFlow model.
Documentation
Set up
The easiest and most straight-forward way of using TensorFlow Serving is with Docker images. We highly recommend this route unless you have specific needs that are not addressed by running in a container.
- Install Tensorflow Serving using Docker (Recommended)
- Install Tensorflow Serving without Docker (Not Recommended)
- Build Tensorflow Serving from Source with Docker
- Deploy Tensorflow Serving on Kubernetes
Use
Export your Tensorflow model
In order to serve a Tensorflow model, simply export a SavedModel from your Tensorflow program. SavedModel is a language-neutral, recoverable, hermetic serialization format that enables higher-level systems and tools to produce, consume, and transform TensorFlow models.
Please refer to Tensorflow documentation for detailed instructions on how to export SavedModels.
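As a minimal sketch (the architecture, path, and version number here are placeholders), a Keras model can be exported into a numbered version subdirectory, which is the layout tensorflow_model_server expects under --model_base_path:
import tensorflow as tf
# Placeholder model; substitute your own architecture and training
model = tf.keras.Sequential([
    tf.keras.Input(shape=(5,)),
    tf.keras.layers.Dense(10, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
# ... model.fit(...) ...
# Export as a SavedModel; the trailing "1" is the version directory the server watches
tf.saved_model.save(model, "/models/my_model/1")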
Configure and Use Tensorflow Serving
- Follow a tutorial on Serving Tensorflow models
- Configure Tensorflow Serving to make it fit your serving use case
- Read the Performance Guide and learn how to use TensorBoard to profile and optimize inference requests
- Read the REST API Guide or gRPC API definition
- Use SavedModel Warmup if initial inference requests are slow due to lazy initialization of graph (a warmup-file sketch follows this list)
- If encountering issues regarding model signatures, please read the SignatureDef documentation
- If using a model with custom ops, learn how to serve models with custom ops
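For the SavedModel Warmup item above, the following is a hedged sketch of generating a warmup record file; the model name, signature name, input key, and example values are assumptions, and the resulting file belongs at <model_base_path>/<version>/assets.extra/tf_serving_warmup_requests:
import tensorflow as tf
from tensorflow_serving.apis import model_pb2
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_log_pb2
# Assumed names: model "my_model", signature "serving_default", input key "input"
request = predict_pb2.PredictRequest(
    model_spec=model_pb2.ModelSpec(name="my_model", signature_name="serving_default"),
    inputs={"input": tf.make_tensor_proto([[1.0, 2.0, 3.0]])},
)
log = prediction_log_pb2.PredictionLog(
    predict_log=prediction_log_pb2.PredictLog(request=request))
# Write the record, then place the file under <version dir>/assets.extra/
with tf.io.TFRecordWriter("tf_serving_warmup_requests") as writer:
    writer.write(log.SerializeToString())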
Extend
Tensorflow Serving's architecture is highly modular. You can use some parts individually (e.g. batch scheduling) and/or extend it to serve new use cases.
- Ensure you are familiar with building Tensorflow Serving
- Learn about Tensorflow Serving's architecture
- Explore the Tensorflow Serving C++ API reference
- Create a new type of Servable
- Create a custom Source of Servable versions
Contribute
If you'd like to contribute to TensorFlow Serving, be sure to review the contribution guidelines.
For more information
Please refer to the official TensorFlow website for more information.