ray
Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
Top Related Projects
Dask - Parallel computing with task scheduling
Apache Spark - A unified analytics engine for large-scale data processing
Horovod - Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
PyTorch - Tensors and Dynamic neural networks in Python with strong GPU acceleration
MLflow - Open source platform for the machine learning lifecycle
Quick Overview
Ray is an open-source distributed computing framework that enables scaling Python applications from a laptop to a cluster. It provides a simple, universal API for building distributed applications and includes libraries for machine learning, reinforcement learning, and other compute-intensive tasks.
Pros
- Seamless scaling from local development to large clusters
- Rich ecosystem of libraries for AI/ML tasks (e.g., Ray Tune, RLlib)
- Easy-to-use API for distributed computing
- Active community and regular updates
Cons
- Steeper learning curve for complex distributed systems
- Documentation can be overwhelming for beginners
- Some features may be overkill for smaller projects
- Occasional stability issues in cutting-edge features
Code Examples
- Simple parallel task execution:
import ray

@ray.remote
def f(x):
    return x * x

ray.init()
futures = [f.remote(i) for i in range(4)]
print(ray.get(futures))  # [0, 1, 4, 9]
- Hyperparameter search with Ray Tune:
from ray import tune

def objective(config):
    # Toy objective: squared distance from the origin, to be minimized.
    score = config["x"] ** 2 + config["y"] ** 2
    return {"score": score}

analysis = tune.run(
    objective,
    config={
        "x": tune.uniform(-10, 10),
        "y": tune.uniform(-10, 10),
    },
    num_samples=100,
)
print("Best config:", analysis.get_best_config(metric="score", mode="min"))
- Actor-based stateful computation:
import ray

@ray.remote
class Counter:
    def __init__(self):
        self.value = 0

    def increment(self):
        self.value += 1
        return self.value

ray.init()
counter = Counter.remote()
print(ray.get([counter.increment.remote() for _ in range(5)]))  # [1, 2, 3, 4, 5]
Getting Started
To get started with Ray:
- Install Ray:
pip install ray
- Import Ray and initialize:
import ray
ray.init()
- Define remote functions or actors:
@ray.remote
def my_function(x):
    return x * 2

result = ray.get(my_function.remote(4))
print(result)  # 8
For more advanced usage, refer to the official Ray documentation for specific libraries and use cases.
Competitor Comparisons
Dask - Parallel computing with task scheduling
Pros of Dask
- Seamless integration with NumPy and Pandas, making it easier for data scientists familiar with these libraries
- More mature and established ecosystem for data analysis and scientific computing
- Better support for out-of-core computations and handling larger-than-memory datasets
Cons of Dask
- Less suitable for general-purpose distributed computing tasks
- Limited support for machine learning workloads compared to Ray
- Slower performance for certain types of parallel computations
Code Comparison
Dask:
import dask.dataframe as dd
df = dd.read_csv('large_file.csv')
result = df.groupby('column').mean().compute()
Ray:
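import numpy as np
import pandas as pd
import ray

@ray.remote
def process_chunk(chunk):
    # Group means for this chunk only; partial results still need to be merged.
    return chunk.groupby('column').mean()

df = pd.read_csv('large_file.csv')
# Split the row positions into 100 roughly equal chunks.
chunks = [df.iloc[idx] for idx in np.array_split(np.arange(len(df)), 100)]
result = ray.get([process_chunk.remote(chunk) for chunk in chunks])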
This comparison shows how Dask provides a more familiar interface for data processing tasks, while Ray offers more flexibility for custom parallel computations.
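One detail the Ray snippet leaves open is how the per-chunk results are merged: averaging per-chunk means is generally not the same as the global group mean when chunks hold different numbers of rows per group. The following is a minimal sketch of one way to handle this, assuming each worker returns per-group sums and counts instead of means. The `value` column, the sample data, and the helper names `partial_stats` and `combine` are hypothetical and used only for illustration; the helpers are plain functions here, and `partial_stats` is the piece that could be wrapped in `@ray.remote`.

```python
import pandas as pd

def partial_stats(chunk: pd.DataFrame) -> pd.DataFrame:
    """Per-group sum and row count for one chunk (the work a remote task would do)."""
    return chunk.groupby("column")["value"].agg(["sum", "count"])

def combine(partials: list) -> pd.Series:
    """Merge per-chunk sums and counts into the exact global group mean."""
    total = pd.concat(partials).groupby(level=0).sum()
    return total["sum"] / total["count"]

# Hypothetical data: group "a" holds 1, 2, 3 and group "b" holds 10, 20.
df = pd.DataFrame({
    "column": ["a", "a", "b", "a", "b"],
    "value": [1.0, 2.0, 10.0, 3.0, 20.0],
})
chunks = [df.iloc[:2], df.iloc[2:]]  # uneven split on purpose

global_mean = combine([partial_stats(c) for c in chunks])
print(global_mean)
```

With this split, a naive mean of the per-chunk means for group "a" would be 2.25, whereas combining sums and counts recovers the true mean of 2.0.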
Apache Spark - A unified analytics engine for large-scale data processing
Pros of Spark
- Mature ecosystem with extensive libraries and integrations
- Robust support for SQL and structured data processing
- Strong community and enterprise adoption
Cons of Spark
- Steeper learning curve for beginners
- Higher resource consumption for small-scale tasks
- Less flexible for general-purpose distributed computing
Code Comparison
Spark:
import org.apache.spark.sql.functions.avg

// Assumes an existing SparkSession named spark.
val df = spark.read.json("data.json")
df.groupBy("category").agg(avg("price")).show()
Ray:
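import ray

@ray.remote
def process_data(data):
    # Custom data processing logic goes here; summing is only a placeholder.
    return sum(data)

# data_chunks stands in for your own pre-split input.
data_chunks = [[1, 2], [3, 4]]
results = ray.get([process_data.remote(chunk) for chunk in data_chunks])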
Key Differences
- Spark focuses on big data processing and analytics, while Ray is more versatile for distributed computing tasks
- Ray offers easier scaling for machine learning and AI workloads
- Spark has better built-in support for SQL and DataFrames, while Ray provides more flexibility for custom distributed algorithms
Use Cases
- Choose Spark for large-scale data processing, ETL, and SQL-based analytics
- Opt for Ray when building distributed applications, reinforcement learning systems, or scaling Python code
Performance Considerations
- Spark excels at batch processing of large datasets
- Ray offers lower latency for fine-grained tasks and better support for GPU acceleration
Horovod - Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.
Pros of Horovod
- Specialized for distributed deep learning, particularly with TensorFlow, PyTorch, and MXNet
- Efficient communication using ring-allreduce algorithm
- Easier to scale existing single-GPU training scripts to multiple GPUs or nodes
Cons of Horovod
- Limited to deep learning workloads, less versatile for general distributed computing
- Requires more manual configuration for complex distributed setups
- Less extensive ecosystem and fewer built-in features compared to Ray
Code Comparison
Horovod:
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()
# Scale the learning rate by the number of workers.
optimizer = tf.optimizers.Adam(0.001 * hvd.size())
optimizer = hvd.DistributedOptimizer(optimizer)
Ray:
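import ray

@ray.remote
def train_func():
    # model and train_data are placeholders for your own model and dataset.
    # Each task trains an independent replica; nothing here synchronizes gradients.
    model.fit(train_data)

# num_workers is a placeholder for the number of parallel tasks.
results = ray.get([train_func.remote() for _ in range(num_workers)])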
Summary
Horovod excels in distributed deep learning scenarios, offering efficient communication and easy scaling of existing scripts. Ray, on the other hand, provides a more versatile distributed computing framework suitable for various tasks beyond deep learning. While Horovod focuses on optimizing specific deep learning workflows, Ray offers a broader ecosystem with more built-in features for general distributed computing needs.
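To make the ring-allreduce idea mentioned above more concrete, here is a single-process NumPy simulation of the two phases (reduce-scatter, then allgather). It is only an illustrative sketch under simplifying assumptions: workers are list entries rather than processes, "sending" is an array copy, and the function name `ring_allreduce` is hypothetical. It is not Horovod's implementation.

```python
import numpy as np

def ring_allreduce(worker_data):
    """Simulate a sum ring-allreduce over equally sized 1-D arrays, one per worker."""
    n = len(worker_data)
    # Each worker copies its vector and splits it into n chunks.
    chunks = [np.array_split(np.array(d, dtype=float), n) for d in worker_data]

    # Phase 1, reduce-scatter: after n - 1 steps, worker i holds the
    # fully summed chunk with index (i + 1) % n.
    for step in range(n - 1):
        # Snapshot outgoing chunks so all sends in a step happen "at once".
        sends = [(i, (i - step) % n) for i in range(n)]
        payloads = [chunks[i][idx].copy() for i, idx in sends]
        for (src, idx), payload in zip(sends, payloads):
            chunks[(src + 1) % n][idx] += payload

    # Phase 2, allgather: pass the completed chunks around the ring.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n) for i in range(n)]
        payloads = [chunks[i][idx].copy() for i, idx in sends]
        for (src, idx), payload in zip(sends, payloads):
            chunks[(src + 1) % n][idx] = payload

    return [np.concatenate(c) for c in chunks]

# Three simulated workers, each with its own "gradient" vector.
grads = [np.arange(6) * (w + 1) for w in range(3)]
reduced = ring_allreduce(grads)
print(reduced[0])
```

Each worker only ever exchanges one chunk with its ring neighbor per step, which is why the per-worker traffic stays roughly constant as the number of workers grows.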
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
Pros of DeepSpeed
- Specialized for deep learning model training and inference optimization
- Offers advanced memory optimization techniques like ZeRO (Zero Redundancy Optimizer)
- Tightly integrated with PyTorch for seamless usage
Cons of DeepSpeed
- More focused on deep learning, less versatile for general distributed computing
- Steeper learning curve for users not familiar with PyTorch ecosystem
- Limited support for other ML frameworks compared to Ray's broader ecosystem
Code Comparison
DeepSpeed:
import deepspeed

# args, model, params, and data_loader are assumed to be defined elsewhere.
model_engine, optimizer, _, _ = deepspeed.initialize(args=args, model=model, model_parameters=params)

for step, batch in enumerate(data_loader):
    loss = model_engine(batch)
    model_engine.backward(loss)
    model_engine.step()
Ray:
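import ray

@ray.remote
def train_step(model, batch):
    # Forward pass only: computes the loss for one batch on a copy of the model.
    loss = model(batch)
    return loss

# model and data_loader are assumed to be defined elsewhere.
results = ray.get([train_step.remote(model, batch) for batch in data_loader])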
Both frameworks aim to simplify distributed computing, but DeepSpeed focuses on optimizing deep learning workloads, while Ray offers a more general-purpose distributed computing solution. DeepSpeed provides deeper integration with PyTorch and specialized optimizations for large models, whereas Ray offers greater flexibility across various computing tasks and ML frameworks.
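As a rough illustration of why ZeRO-style partitioning matters for large models, the sketch below estimates model-state memory per GPU for mixed-precision Adam training. The byte counts (2 bytes per parameter for fp16 weights, 2 for fp16 gradients, 12 for fp32 optimizer state) are assumptions based on the commonly cited ZeRO accounting; activations and temporary buffers are ignored. The function `zero_bytes_per_gpu` is hypothetical and is not part of the DeepSpeed API.

```python
def zero_bytes_per_gpu(num_params: float, num_gpus: int, stage: int) -> float:
    """Approximate model-state bytes per GPU for mixed-precision Adam.

    stage 0 is plain data parallelism; stages 1 to 3 partition progressively
    more state across the data-parallel GPUs.
    """
    weights, grads, optim = 2.0, 2.0, 12.0  # assumed bytes per parameter
    if stage >= 1:
        optim /= num_gpus    # stage 1 partitions optimizer state
    if stage >= 2:
        grads /= num_gpus    # stage 2 also partitions gradients
    if stage >= 3:
        weights /= num_gpus  # stage 3 also partitions the weights
    return num_params * (weights + grads + optim)

# Hypothetical setup: a 7.5B-parameter model on 64 GPUs.
for stage in range(4):
    gb = zero_bytes_per_gpu(7.5e9, 64, stage) / 1e9
    print(f"stage {stage}: {gb:.1f} GB per GPU")
```

Under these assumptions the estimate drops from about 120 GB per GPU without partitioning to under 2 GB with all three kinds of state partitioned.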
PyTorch - Tensors and Dynamic neural networks in Python with strong GPU acceleration
Pros of PyTorch
- More mature and widely adopted in the deep learning community
- Extensive ecosystem of pre-trained models and libraries
- Dynamic computational graphs for easier debugging and flexibility
Cons of PyTorch
- Less suitable for distributed computing and large-scale data processing
- Limited support for non-deep learning machine learning tasks
- Steeper learning curve for beginners compared to Ray
Code Comparison
PyTorch example:
import torch
x = torch.tensor([1, 2, 3])
y = torch.tensor([4, 5, 6])
z = torch.add(x, y)
Ray example:
import ray

@ray.remote
def add(x, y):
    return x + y

result = ray.get(add.remote(1, 2))
PyTorch focuses on tensor operations and neural network building blocks, while Ray emphasizes distributed computing and parallel task execution. PyTorch is primarily used for deep learning tasks, whereas Ray is a more general-purpose framework for scaling Python applications and machine learning workloads.
MLflow - Open source platform for the machine learning lifecycle
Pros of MLflow
- Focused on experiment tracking and model management
- Easier to integrate with existing ML workflows
- More comprehensive model versioning and deployment features
Cons of MLflow
- Less scalable for distributed computing tasks
- Limited support for parallel and distributed training
Code Comparison
MLflow:
import mlflow

# value1 and value2 stand in for your own parameter and metric values.
with mlflow.start_run():
    mlflow.log_param("param1", value1)
    mlflow.log_metric("metric1", value2)
Ray:
import ray

@ray.remote
def task(x):
    return x * x

futures = [task.remote(i) for i in range(4)]
print(ray.get(futures))
MLflow focuses on tracking experiments and managing models, making it easier to log parameters, metrics, and artifacts. Ray, on the other hand, excels at distributed computing and parallel task execution, providing a more scalable solution for large-scale machine learning workloads.
While MLflow offers better model versioning and deployment features, Ray provides a more robust framework for distributed computing and scaling machine learning pipelines. MLflow is generally easier to integrate into existing workflows, but Ray offers more flexibility for complex, distributed tasks.
README
Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI libraries for simplifying ML compute:
.. image:: https://github.com/ray-project/ray/raw/master/doc/source/images/what-is-ray-padded.svg
.. https://docs.google.com/drawings/d/1Pl8aCYOsZCo61cmp57c7Sja6HhIygGCvSZLi_AuBuqo/edit
Learn more about `Ray AI Libraries`_:

- `Data`_: Scalable Datasets for ML
- `Train`_: Distributed Training
- `Tune`_: Scalable Hyperparameter Tuning
- `RLlib`_: Scalable Reinforcement Learning
- `Serve`_: Scalable and Programmable Serving

Or more about `Ray Core`_ and its key abstractions:

- `Tasks`_: Stateless functions executed in the cluster.
- `Actors`_: Stateful worker processes created in the cluster.
- `Objects`_: Immutable values accessible across the cluster.

Learn more about Monitoring and Debugging:

- Monitor Ray apps and clusters with the `Ray Dashboard <https://docs.ray.io/en/latest/ray-core/ray-dashboard.html>`__.
- Debug Ray apps with the `Ray Distributed Debugger <https://docs.ray.io/en/latest/ray-observability/ray-distributed-debugger.html>`__.

Ray runs on any machine, cluster, cloud provider, and Kubernetes, and features a growing
`ecosystem of community integrations`_.

Install Ray with: ``pip install ray``. For nightly wheels, see the
`Installation page <https://docs.ray.io/en/latest/ray-overview/installation.html>`__.

.. _`Serve`: https://docs.ray.io/en/latest/serve/index.html
.. _`Data`: https://docs.ray.io/en/latest/data/dataset.html
.. _`Workflow`: https://docs.ray.io/en/latest/workflows/concepts.html
.. _`Train`: https://docs.ray.io/en/latest/train/train.html
.. _`Tune`: https://docs.ray.io/en/latest/tune/index.html
.. _`RLlib`: https://docs.ray.io/en/latest/rllib/index.html
.. _`ecosystem of community integrations`: https://docs.ray.io/en/latest/ray-overview/ray-libraries.html
Why Ray?
Today's ML workloads are increasingly compute-intensive. As convenient as they are, single-node development environments such as your laptop cannot scale to meet these demands.
Ray is a unified way to scale Python and AI applications from a laptop to a cluster.
With Ray, you can seamlessly scale the same code from a laptop to a cluster. Ray is designed to be general-purpose, meaning that it can performantly run any kind of workload. If your application is written in Python, you can scale it with Ray, no other infrastructure required.
More Information

- `Documentation`_
- `Ray Architecture whitepaper`_
- `Exoshuffle: large-scale data shuffle in Ray`_
- `Ownership: a distributed futures system for fine-grained tasks`_
- `RLlib paper`_
- `Tune paper`_

Older documents:

- `Ray paper`_
- `Ray HotOS paper`_
- `Ray Architecture v1 whitepaper`_

.. _`Ray AI Libraries`: https://docs.ray.io/en/latest/ray-air/getting-started.html
.. _`Ray Core`: https://docs.ray.io/en/latest/ray-core/walkthrough.html
.. _`Tasks`: https://docs.ray.io/en/latest/ray-core/tasks.html
.. _`Actors`: https://docs.ray.io/en/latest/ray-core/actors.html
.. _`Objects`: https://docs.ray.io/en/latest/ray-core/objects.html
.. _`Documentation`: http://docs.ray.io/en/latest/index.html
.. _`Ray Architecture v1 whitepaper`: https://docs.google.com/document/d/1lAy0Owi-vPz2jEqBSaHNQcy2IBSDEHyXNOQZlGuj93c/preview
.. _`Ray Architecture whitepaper`: https://docs.google.com/document/d/1tBw9A4j62ruI5omIJbMxly-la5w4q_TjyJgJL_jN2fI/preview
.. _`Exoshuffle: large-scale data shuffle in Ray`: https://arxiv.org/abs/2203.05072
.. _`Ownership: a distributed futures system for fine-grained tasks`: https://www.usenix.org/system/files/nsdi21-wang.pdf
.. _`Ray paper`: https://arxiv.org/abs/1712.05889
.. _`Ray HotOS paper`: https://arxiv.org/abs/1703.03924
.. _`RLlib paper`: https://arxiv.org/abs/1712.09381
.. _`Tune paper`: https://arxiv.org/abs/1807.05118
Getting Involved

.. list-table::
   :widths: 25 50 25 25
   :header-rows: 1

   * - Platform
     - Purpose
     - Estimated Response Time
     - Support Level
   * - `Discourse Forum`_
     - For discussions about development and questions about usage.
     - < 1 day
     - Community
   * - `GitHub Issues`_
     - For reporting bugs and filing feature requests.
     - < 2 days
     - Ray OSS Team
   * - `Slack`_
     - For collaborating with other Ray users.
     - < 2 days
     - Community
   * - `StackOverflow`_
     - For asking questions about how to use Ray.
     - 3-5 days
     - Community
   * - `Meetup Group`_
     - For learning about Ray projects and best practices.
     - Monthly
     - Ray DevRel
   * - `Twitter`_
     - For staying up-to-date on new features.
     - Daily
     - Ray DevRel

.. _`Discourse Forum`: https://discuss.ray.io/
.. _`GitHub Issues`: https://github.com/ray-project/ray/issues
.. _`StackOverflow`: https://stackoverflow.com/questions/tagged/ray
.. _`Meetup Group`: https://www.meetup.com/Bay-Area-Ray-Meetup/
.. _`Twitter`: https://twitter.com/raydistributed
.. _`Slack`: https://www.ray.io/join-slack?utm_source=github&utm_medium=ray_readme&utm_campaign=getting_involved