rl

A modular, primitive-first, python-first PyTorch library for Reinforcement Learning.

2,984

399

276

Top Related Projects

gym

36,310

A toolkit for developing and comparing reinforcement learning algorithms.

ray

38,187

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.

baselines

16,374

OpenAI Baselines: high-quality implementations of reinforcement learning algorithms

dopamine

10,785

Dopamine is a research framework for fast prototyping of reinforcement learning algorithms.

Quick Overview

The pytorch/rl repository is a collection of reinforcement learning (RL) algorithms and environments implemented using the PyTorch deep learning framework. It provides a set of tools and examples for researchers and practitioners to explore and experiment with various RL techniques.

Pros

Diverse Algorithms: The repository includes implementations of several popular RL algorithms, such as DQN, PPO, and A2C, allowing users to compare and experiment with different approaches.
PyTorch Integration: By using PyTorch, the project benefits from the library's powerful and flexible deep learning capabilities, making it easier to integrate RL models with other deep learning components.
Customizable Environments: The project includes a variety of OpenAI Gym-compatible environments, which can be used to test and evaluate RL algorithms.
Active Development: The project is actively maintained, with regular updates and contributions from the PyTorch community.

Cons

Limited Documentation: The project's documentation could be more comprehensive, making it challenging for newcomers to get started and understand the codebase.
Steep Learning Curve: Reinforcement learning, in general, can have a steep learning curve, and this project may not be the most beginner-friendly for those new to the field.
Potential Compatibility Issues: As the project is built on PyTorch, users may encounter compatibility issues when using different versions of PyTorch or other dependencies.
Narrow Focus: The project is primarily focused on RL algorithms and environments, and may not provide a comprehensive set of tools for other aspects of the RL workflow, such as hyperparameter tuning or visualization.

Code Examples

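The snippets below are illustrative sketches of a typical agent-environment loop. The rl.agents and rl.environments modules they import are placeholders rather than the library's actual API: the package is imported as torchrl, and the README section further down shows real usage.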

Training a DQN Agent:

import gym
from rl.agents.dqn import DQNAgent
from rl.environments import make_env

env = make_env('CartPole-v0')
agent = DQNAgent(env.observation_space.shape, env.action_space.n)

for episode in range(1000):
    state = env.reset()
    done = False
    while not done:
        action = agent.act(state)
        next_state, reward, done, _ = env.step(action)
        agent.step(state, action, reward, next_state, done)
        state = next_state

This code sets up a DQN agent to train on the CartPole-v0 environment.

Training a PPO Agent:

import gym
from rl.agents.ppo import PPOAgent
from rl.environments import make_env

env = make_env('LunarLander-v2')
agent = PPOAgent(env.observation_space.shape, env.action_space.n)

for episode in range(1000):
    state = env.reset()
    done = False
    while not done:
        action, log_prob = agent.act(state)
        next_state, reward, done, _ = env.step(action)
        agent.step(state, action, reward, next_state, done, log_prob)
        state = next_state

    agent.update()

This code sets up a PPO agent to train on the LunarLander-v2 environment.

Evaluating an A2C Agent:

import gym
from rl.agents.a2c import A2CAgent
from rl.environments import make_env

env = make_env('Pendulum-v0')
agent = A2CAgent(env.observation_space.shape, env.action_space.shape)

state = env.reset()
done = False
while not done:
    action = agent.act(state)
    next_state, reward, done, _ = env.step(action)
    state = next_state

This code sets up an A2C agent to evaluate on the Pendulum-v0 environment.

Getting Started

To get started with the pytorch/rl repository, follow these steps:

Clone the repository:

git clone https://github.com/pytorch/rl.git

Install the required dependencies:

cd rl
pip

Competitor Comparisons

gym

36,310

A toolkit for developing and comparing reinforcement learning algorithms.

Pros of OpenAI Gym

Provides a wide range of environments for reinforcement learning, including classic control problems, Atari games, and robotics simulations.
Offers a standardized interface for interacting with these environments, making it easier to develop and test reinforcement learning algorithms.
Exposes a small Python API (reset, step, render) that many third-party environment packages implement, so existing environments can be reused with little glue code.

Cons of OpenAI Gym

The environment setup and configuration can be complex, especially for more advanced simulations.
The documentation, while generally good, may not always be comprehensive or up-to-date, making it challenging for beginners to get started.
The performance of the simulations can be limited, especially for more complex environments, which may impact the training process.

Code Comparison

Here's a brief code comparison between PyTorch RL and OpenAI Gym:

PyTorch RL (creating a simple policy network):

import torch.nn as nn

class PolicyNetwork(nn.Module):
    def __init__(self, state_size, action_size):
        super(PolicyNetwork, self).__init__()
        self.fc1 = nn.Linear(state_size, 64)
        self.fc2 = nn.Linear(64, action_size)

    def forward(self, state):
        x = self.fc1(state)
        x = nn.ReLU()(x)
        x = self.fc2(x)
        return x

OpenAI Gym (creating a simple environment and running an agent):

import gym

env = gym.make('CartPole-v0')
observation = env.reset()

for _ in range(1000):
    env.render()
    action = env.action_space.sample()
    observation, reward, done, info = env.step(action)
    if done:
        observation = env.reset()

trfl

3,133

TensorFlow Reinforcement Learning

Pros of TRFL

TRFL provides a set of building blocks for constructing reinforcement learning agents, making it easier to experiment with different algorithms and architectures.
The library includes a wide range of reinforcement learning algorithms, including DQN, A3C, and PPO, among others.
TRFL is well-documented and includes examples and tutorials to help users get started.

Cons of TRFL

TRFL is primarily focused on TensorFlow, which may not be the preferred framework for some users.
The library can be more complex to set up and use compared to PyTorch/RL, which has a more user-friendly interface.
TRFL may not have the same level of community support and contributions as PyTorch/RL.

Code Comparison

PyTorch/RL:

import torch.nn as nn
import torch.nn.functional as F  # provides F.relu used in forward()

class DQN(nn.Module):
    def __init__(self, state_size, action_size):
        super(DQN, self).__init__()
        self.fc1 = nn.Linear(state_size, 64)
        self.fc2 = nn.Linear(64, 64)
        self.fc3 = nn.Linear(64, action_size)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)

TRFL:

import tensorflow as tf
import trfl

def dqn_loss(q_values, actions, rewards, next_q_values, terminals, discount):
    # trfl.qlearning expects a per-step continuation discount (pcont_t),
    # which is zero on terminal transitions.
    pcont = (1.0 - terminals) * discount
    loss = trfl.qlearning(q_values, actions, rewards, pcont, next_q_values).loss
    return tf.reduce_mean(loss)

ray

38,187

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.

Pros of Ray

Ray provides a more comprehensive set of tools for distributed computing, including support for distributed training, hyperparameter tuning, and serving of machine learning models.
Ray's ecosystem includes a wide range of libraries and frameworks, such as RLlib for reinforcement learning, Tune for hyperparameter optimization, and Serve for model serving.
Ray's architecture is designed to be scalable and fault-tolerant, making it suitable for large-scale distributed applications.

Cons of Ray

The learning curve for Ray can be steeper than PyTorch's, as it requires understanding of distributed systems concepts and Ray's specific APIs.
Ray's focus on distributed computing may make it overkill for smaller-scale projects that don't require the advanced features it provides.

Code Comparison

PyTorch RL:

import torch
import torch.nn as nn
import torch.optim as optim
from torch.distributions import Categorical

class PolicyNetwork(nn.Module):
    def __init__(self, state_size, action_size):
        super(PolicyNetwork, self).__init__()
        self.fc1 = nn.Linear(state_size, 64)
        self.fc2 = nn.Linear(64, action_size)

    def forward(self, state):
        x = torch.relu(self.fc1(state))
        x = self.fc2(x)
        return Categorical(logits=x)

Ray:

import ray
from ray.rllib.agents.ppo import PPOTrainer
from ray.tune.registry import register_env

def env_creator(config):
    return MyEnv()

register_env("my_env", env_creator)

config = {
    "env": "my_env",
    "num_workers": 4,
    "gamma": 0.99,
    "lambda": 0.95,
    "clip_param": 0.2,
    "kl_target": 0.01,
}

trainer = PPOTrainer(config=config)

baselines

16,374

OpenAI Baselines: high-quality implementations of reinforcement learning algorithms

Pros of Baselines

Baselines provides a wide range of pre-implemented reinforcement learning algorithms, including A2C, ACER, ACKTR, DDPG, DQN, GAIL, PPO, and TRPO.
The codebase is well-documented and easy to understand, making it a great resource for learning and experimenting with different RL algorithms.
Baselines includes a variety of environments, such as Atari, MuJoCo, and Roboschool, which can be used for benchmarking and testing RL models.

Cons of Baselines

Baselines is primarily focused on classic RL algorithms and may not have the latest advancements in the field.
The codebase is not as actively maintained as some other RL libraries, and there may be compatibility issues with newer versions of dependencies.
Baselines does not provide the same level of flexibility and customization as PyTorch RL, which is built on top of the PyTorch framework.

Code Comparison

PyTorch RL:

import torch.nn as nn
import torch.nn.functional as F  # provides F.relu and F.softmax used in forward()

class PolicyNetwork(nn.Module):
    def __init__(self, state_size, action_size, hidden_size=64):
        super(PolicyNetwork, self).__init__()
        self.fc1 = nn.Linear(state_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, action_size)

    def forward(self, state):
        x = F.relu(self.fc1(state))
        return F.softmax(self.fc2(x), dim=1)

Baselines:

import tensorflow as tf

class PolicyNetwork(tf.keras.Model):
    def __init__(self, state_size, action_size, hidden_size=64):
        super(PolicyNetwork, self).__init__()
        self.fc1 = tf.keras.layers.Dense(hidden_size, activation='relu')
        self.fc2 = tf.keras.layers.Dense(action_size, activation='softmax')

    def call(self, state):
        x = self.fc1(state)
        return self.fc2(x)

dopamine

10,785

Dopamine is a research framework for fast prototyping of reinforcement learning algorithms.

Pros of Dopamine

Dopamine provides a collection of reinforcement learning agents and environments, making it easier to experiment with different algorithms and setups.
The library includes a variety of common benchmark tasks, such as Atari games, which can be used to evaluate the performance of different agents.
Dopamine is well-documented and includes detailed tutorials and examples, making it more accessible for researchers and developers new to reinforcement learning.

Cons of Dopamine

Dopamine is primarily focused on classic reinforcement learning tasks and may not be as well-suited for more complex or specialized applications.
The library is primarily developed and maintained by Google, which may limit the diversity of contributions and perspectives compared to a more open-source project.
Dopamine may have a steeper learning curve for developers who are more familiar with PyTorch's ecosystem and tooling.

Code Comparison

PyTorch RL:

import torch.nn as nn
import torch.nn.functional as F  # provides F.relu and F.softmax used in forward()

class PolicyNetwork(nn.Module):
    def __init__(self, state_size, action_size):
        super(PolicyNetwork, self).__init__()
        self.fc1 = nn.Linear(state_size, 64)
        self.fc2 = nn.Linear(64, action_size)

    def forward(self, state):
        x = F.relu(self.fc1(state))
        return F.softmax(self.fc2(x), dim=1)

Dopamine:

import tensorflow as tf

class DQNNetwork(tf.keras.Model):
    def __init__(self, num_actions):
        super(DQNNetwork, self).__init__()
        self.conv1 = tf.keras.layers.Conv2D(32, (8, 8), strides=(4, 4), activation='relu')
        self.conv2 = tf.keras.layers.Conv2D(64, (4, 4), strides=(2, 2), activation='relu')
        self.conv3 = tf.keras.layers.Conv2D(64, (3, 3), strides=(1, 1), activation='relu')
        self.flatten = tf.keras.layers.Flatten()
        self.fc1 = tf.keras.layers.Dense(512, activation='relu')
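        self.fc2 = tf.keras.layers.Dense(num_actions)

    def call(self, state):
        # Convolutional stack first, then the dense head that outputs one Q-value per action.
        x = self.conv3(self.conv2(self.conv1(state)))
        return self.fc2(self.fc1(self.flatten(x)))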


README

TorchRL

TorchRL is an open-source Reinforcement Learning (RL) library for PyTorch.

Key features

Python-first: Designed with Python as the primary language for ease of use and flexibility
Efficient: Optimized for performance to support demanding RL research applications
Modular, customizable, extensible: Highly modular architecture allows for easy swapping, transformation, or creation of new components
Documented: Thorough documentation ensures that users can quickly understand and utilize the library
Tested: Rigorously tested to ensure reliability and stability
Reusable functionals: Provides a set of highly reusable functions for cost functions, returns, and data processing

Design Principles

Aligns with PyTorch ecosystem: Follows the structure and conventions of popular PyTorch libraries (e.g., dataset pillar, transforms, models, data utilities)
Minimal dependencies: Only requires Python standard library, NumPy, and PyTorch; optional dependencies for common environment libraries (e.g., OpenAI Gym) and datasets (D4RL, OpenX...)

Read the full paper for a more curated description of the library.

Getting started

Check our Getting Started tutorials to quickly ramp up with the basic features of the library!

Documentation and knowledge base

The TorchRL documentation can be found here. It contains tutorials and the API reference.

TorchRL also provides a RL knowledge base to help you debug your code, or simply learn the basics of RL. Check it out here.

We have some introductory videos for you to get to know the library better, check them out:

Spotlight publications

TorchRL being domain-agnostic, you can use it across many different fields. Here are a few examples:

ACEGEN: Reinforcement Learning of Generative Chemical Agents for Drug Discovery
BenchMARL: Benchmarking Multi-Agent Reinforcement Learning
BricksRL: A Platform for Democratizing Robotics and Reinforcement Learning Research and Education with LEGO
OmniDrones: An Efficient and Flexible Platform for Reinforcement Learning in Drone Control
RL4CO: an Extensive Reinforcement Learning for Combinatorial Optimization Benchmark
Robohive: A unified framework for robot learning

Writing simplified and portable RL codebase with `TensorDict`

RL algorithms are very heterogeneous, and it can be hard to recycle a codebase across settings (e.g. from online to offline, from state-based to pixel-based learning). TorchRL solves this problem through TensorDict, a convenient data structure⁽¹⁾ that can be used to streamline one's RL codebase. With this tool, one can write a complete PPO training script in less than 100 lines of code!


import torch
from tensordict.nn import TensorDictModule
from tensordict.nn.distributions import NormalParamExtractor
from torch import nn

from torchrl.collectors import SyncDataCollector
from torchrl.data.replay_buffers import TensorDictReplayBuffer, \
  LazyTensorStorage, SamplerWithoutReplacement
from torchrl.envs.libs.gym import GymEnv
from torchrl.modules import ProbabilisticActor, ValueOperator, TanhNormal
from torchrl.objectives import ClipPPOLoss
from torchrl.objectives.value import GAE

env = GymEnv("Pendulum-v1") 
model = TensorDictModule(
  nn.Sequential(
      nn.Linear(3, 128), nn.Tanh(),
      nn.Linear(128, 128), nn.Tanh(),
      nn.Linear(128, 128), nn.Tanh(),
      nn.Linear(128, 2),
      NormalParamExtractor()
  ),
  in_keys=["observation"],
  out_keys=["loc", "scale"]
)
critic = ValueOperator(
  nn.Sequential(
      nn.Linear(3, 128), nn.Tanh(),
      nn.Linear(128, 128), nn.Tanh(),
      nn.Linear(128, 128), nn.Tanh(),
      nn.Linear(128, 1),
  ),
  in_keys=["observation"],
)
actor = ProbabilisticActor(
  model,
  in_keys=["loc", "scale"],
  distribution_class=TanhNormal,
  distribution_kwargs={"low": -1.0, "high": 1.0},
  return_log_prob=True
  )
buffer = TensorDictReplayBuffer(
  storage=LazyTensorStorage(1000),
  sampler=SamplerWithoutReplacement(),
  batch_size=50,
  )
collector = SyncDataCollector(
  env,
  actor,
  frames_per_batch=1000,
  total_frames=1_000_000,
)
loss_fn = ClipPPOLoss(actor, critic)
adv_fn = GAE(value_network=critic, average_gae=True, gamma=0.99, lmbda=0.95)
optim = torch.optim.Adam(loss_fn.parameters(), lr=2e-4)

for data in collector:  # collect data
  for epoch in range(10):
      adv_fn(data)  # compute advantage
      buffer.extend(data)
      for sample in buffer:  # consume data
          loss_vals = loss_fn(sample)
          loss_val = sum(
              value for key, value in loss_vals.items() if
              key.startswith("loss")
              )
          loss_val.backward()
          optim.step()
          optim.zero_grad()
  print(f"avg reward: {data['next', 'reward'].mean().item(): 4.4f}")
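
The advantage step in the script above (GAE with gamma=0.99 and lmbda=0.95) follows the generalized advantage estimation recursion. The sketch below illustrates that recursion in plain Python for a single trajectory. It is not TorchRL's implementation: the function name gae_sketch and its list-based signature are assumptions made for this example, and it leaves out the advantage standardization requested by average_gae=True.

```python
def gae_sketch(rewards, values, next_values, dones, gamma=0.99, lmbda=0.95):
    """Generalized advantage estimation for one trajectory (plain lists)."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk the trajectory backwards so each step can reuse the next advantage.
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        # One-step TD error: r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_values[t] * not_done - values[t]
        # Discounted sum of TD errors, cut at episode boundaries.
        running = delta + gamma * lmbda * not_done * running
        advantages[t] = running
    return advantages


adv = gae_sketch(
    rewards=[1.0, 1.0],
    values=[0.0, 0.0],
    next_values=[0.0, 0.0],
    dones=[False, True],
    gamma=0.5,
    lmbda=0.5,
)
print(adv)  # [1.25, 1.0]
```

The last step is terminal, so its advantage is just the TD error (1.0); the first step adds the discounted contribution of the step that follows it (1.0 + 0.5 * 0.5 * 1.0).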

The environment API relies on tensordict to carry data from one function to another during a rollout execution.

TensorDict makes it easy to re-use pieces of code across environments, models and algorithms.


For instance, here's how to code a rollout in TorchRL:

- obs, done = env.reset()
+ tensordict = env.reset()
policy = SafeModule(
    model,
    in_keys=["observation_pixels", "observation_vector"],
    out_keys=["action"],
)
out = []
for i in range(n_steps):
-     action, log_prob = policy(obs)
-     next_obs, reward, done, info = env.step(action)
-     out.append((obs, next_obs, action, log_prob, reward, done))
-     obs = next_obs
+     tensordict = policy(tensordict)
+     tensordict = env.step(tensordict)
+     out.append(tensordict)
+     tensordict = step_mdp(tensordict)  # renames next_observation_* keys to observation_*
- obs, next_obs, action, log_prob, reward, done = [torch.stack(vals, 0) for vals in zip(*out)]
+ out = torch.stack(out, 0)  # TensorDict supports multiple tensor operations
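
To make the role of step_mdp in the rollout above more concrete, here is a minimal sketch using plain dictionaries. It follows the behaviour described in the inline comment (keys prefixed with next_ become the inputs of the following step) and nothing more. The helper name step_mdp_sketch and the "next_" prefix argument are hypothetical; the real function operates on TensorDict objects and has additional options.

```python
def step_mdp_sketch(step_result, prefix="next_"):
    """Build the input of the next step from the output of the current one."""
    carried = {}
    for key, value in step_result.items():
        if key.startswith(prefix):
            # "next_observation_vector" becomes "observation_vector"
            carried[key[len(prefix):]] = value
    return carried


step_result = {
    "observation_vector": [0.0, 1.0],
    "action": 1,
    "reward": 1.0,
    "next_observation_vector": [0.5, 1.5],
}
next_input = step_mdp_sketch(step_result)
print(next_input)  # {'observation_vector': [0.5, 1.5]}
```

Everything that belonged only to the finished step (the old observation, the action, the reward) is dropped, which is why the full step output is appended to out before this call.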

Using this, TorchRL abstracts away the input / output signatures of the modules, env, collectors, replay buffers and losses of the library, allowing all primitives to be easily recycled across settings.


Here's another example of an off-policy training loop in TorchRL (assuming that a data collector, a replay buffer, a loss and an optimizer have been instantiated):

- for i, (obs, next_obs, action, hidden_state, reward, done) in enumerate(collector):
+ for i, tensordict in enumerate(collector):
-     replay_buffer.add((obs, next_obs, action, log_prob, reward, done))
+     replay_buffer.add(tensordict)
    for j in range(num_optim_steps):
-         obs, next_obs, action, hidden_state, reward, done = replay_buffer.sample(batch_size)
-         loss = loss_fn(obs, next_obs, action, hidden_state, reward, done)
+         tensordict = replay_buffer.sample(batch_size)
+         loss = loss_fn(tensordict)
        loss.backward()
        optim.step()
        optim.zero_grad()

This training loop can be re-used across algorithms as it makes a minimal number of assumptions about the structure of the data.

TensorDict supports multiple tensor operations on its device and shape (the shape of TensorDict, or its batch size, is the common arbitrary N first dimensions of all its contained tensors):


# stack and cat
tensordict = torch.stack(list_of_tensordicts, 0)
tensordict = torch.cat(list_of_tensordicts, 0)
# reshape
tensordict = tensordict.view(-1)
tensordict = tensordict.permute(0, 2, 1)
tensordict = tensordict.unsqueeze(-1)
tensordict = tensordict.squeeze(-1)
# indexing
tensordict = tensordict[:2]
tensordict[:, 2] = sub_tensordict
# device and memory location
tensordict.cuda()
tensordict.to("cuda:1")
tensordict.share_memory_()
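
The notion of batch size as "the common first dimensions of all contained tensors" can be illustrated without any tensor library. The sketch below computes the longest shared prefix of a set of shapes. The helper name common_batch_size is an assumption for this example; note that a TensorDict's batch size is chosen by the user and may be any shared prefix, not necessarily the longest one.

```python
def common_batch_size(shapes):
    """Return the longest prefix of dimensions shared by every shape."""
    shapes = [tuple(shape) for shape in shapes]
    if not shapes:
        return ()
    prefix = []
    # zip stops at the shortest shape, so the prefix never exceeds it.
    for dims in zip(*shapes):
        if len(set(dims)) != 1:
            break
        prefix.append(dims[0])
    return tuple(prefix)


shapes = {"src": (20, 32, 512), "mask": (20, 32), "reward": (20, 32, 1)}
batch_size = common_batch_size(shapes.values())
print(batch_size)  # (20, 32)
```

Operations such as indexing or stacking act on these shared leading dimensions only, which is why they can be applied to every entry at once.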

TensorDict comes with a dedicated tensordict.nn module that contains everything you might need to write your model with it. And it is functorch and torch.compile compatible!


transformer_model = nn.Transformer(nhead=16, num_encoder_layers=12)
+ td_module = SafeModule(transformer_model, in_keys=["src", "tgt"], out_keys=["out"])
src = torch.rand((10, 32, 512))
tgt = torch.rand((20, 32, 512))
+ tensordict = TensorDict({"src": src, "tgt": tgt}, batch_size=[20, 32])
- out = transformer_model(src, tgt)
+ td_module(tensordict)
+ out = tensordict["out"]

The TensorDictSequential class makes it possible to chain sequences of nn.Module instances in a highly modular way. For instance, here is an implementation of a transformer using the encoder and decoder blocks:

encoder_module = TransformerEncoder(...)
encoder = TensorDictModule(encoder_module, in_keys=["src", "src_mask"], out_keys=["memory"])
decoder_module = TransformerDecoder(...)
decoder = TensorDictModule(decoder_module, in_keys=["tgt", "memory"], out_keys=["output"])
transformer = TensorDictSequential(encoder, decoder)
assert transformer.in_keys == ["src", "src_mask", "tgt"]
assert transformer.out_keys == ["memory", "output"]

TensorDictSequential makes it possible to isolate subgraphs by querying a set of desired input / output keys:

transformer.select_subsequence(out_keys=["memory"])  # returns the encoder
transformer.select_subsequence(in_keys=["tgt", "memory"])  # returns the decoder
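
The in_keys and out_keys asserted for the transformer above can be derived by a simple rule: a key is an external input of the sequence only if no earlier module has produced it. The plain-Python sketch below applies that rule to (in_keys, out_keys) pairs. The function name sequence_keys is hypothetical, and this is an illustration of the bookkeeping rather than the tensordict implementation.

```python
def sequence_keys(modules):
    """modules: list of (in_keys, out_keys) pairs, in execution order."""
    in_keys, out_keys = [], []
    for module_in, module_out in modules:
        for key in module_in:
            # External input only if no earlier module wrote this key.
            if key not in out_keys and key not in in_keys:
                in_keys.append(key)
        for key in module_out:
            if key not in out_keys:
                out_keys.append(key)
    return in_keys, out_keys


encoder = (["src", "src_mask"], ["memory"])
decoder = (["tgt", "memory"], ["output"])
keys = sequence_keys([encoder, decoder])
print(keys)  # (['src', 'src_mask', 'tgt'], ['memory', 'output'])
```

Here "memory" is read by the decoder but written by the encoder, so it appears among the outputs and not among the inputs of the combined module.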

Check TensorDict tutorials to learn more!

Features

A common interface for environments which supports common libraries (OpenAI Gym, DeepMind control suite, etc.)⁽¹⁾ and state-less execution (e.g. model-based environments). Batched environment containers allow parallel execution⁽²⁾. A common PyTorch-first tensor-specification class is also provided. TorchRL's environments API is simple but stringent and specific. Check the documentation and tutorial to learn more!
```
env_make = lambda: GymEnv("Pendulum-v1", from_pixels=True)
env_parallel = ParallelEnv(4, env_make)  # creates 4 envs in parallel
tensordict = env_parallel.rollout(max_steps=20, policy=None)  # random rollout (no policy given)
assert tensordict.shape == [4, 20]  # 4 envs, 20 steps rollout
env_parallel.action_spec.is_in(tensordict["action"])  # spec check returns True
```

multiprocess and distributed data collectors⁽²⁾ that work synchronously or asynchronously. Through the use of TensorDict, TorchRL's training loops are made very similar to regular training loops in supervised learning (although the "dataloader" -- read data collector -- is modified on-the-fly):


env_make = lambda: GymEnv("Pendulum-v1", from_pixels=True)
collector = MultiaSyncDataCollector(
    [env_make, env_make],
    policy=policy,
    devices=["cuda:0", "cuda:0"],
    total_frames=10000,
    frames_per_batch=50,
    ...
)
for i, tensordict_data in enumerate(collector):
    loss = loss_module(tensordict_data)
    loss.backward()
    optim.step()
    optim.zero_grad()
    collector.update_policy_weights_()

Check our distributed collector examples to learn more about ultra-fast data collection with TorchRL.

efficient⁽²⁾ and generic⁽¹⁾ replay buffers with modularized storage:


storage = LazyMemmapStorage(  # memory-mapped (physical) storage
    cfg.buffer_size,
    scratch_dir="/tmp/"
)
buffer = TensorDictPrioritizedReplayBuffer(
    alpha=0.7,
    beta=0.5,
    collate_fn=lambda x: x,
    pin_memory=device != torch.device("cpu"),
    prefetch=10,  # multi-threaded sampling
    storage=storage
)
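
The alpha and beta arguments above correspond to the two exponents of the usual proportional prioritization scheme: alpha controls how strongly priorities skew the sampling distribution, and beta controls how strongly the resulting bias is corrected through importance weights. The sketch below shows that arithmetic in plain Python. The function names are hypothetical, and normalizing the weights by their maximum is an assumption of this example rather than a statement about TorchRL's internals.

```python
def priority_probabilities(priorities, alpha=0.7):
    """Sampling probability of each item: p_i**alpha / sum_j p_j**alpha."""
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]


def importance_weights(probabilities, beta=0.5):
    """Weights (N * P(i))**(-beta), rescaled so the largest weight is 1."""
    n = len(probabilities)
    weights = [(n * p) ** (-beta) for p in probabilities]
    largest = max(weights)
    return [w / largest for w in weights]


probs = priority_probabilities([1.0, 3.0], alpha=1.0)
weights = importance_weights(probs, beta=1.0)
print(probs)    # the second item is sampled three times as often
print(weights)  # and its loss is down-weighted accordingly
```

With alpha=0 the sampling becomes uniform, and with beta=0 all weights are equal to 1, which recovers an ordinary replay buffer.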

Replay buffers are also offered as wrappers around common datasets for offline RL:


from torchrl.data.replay_buffers import SamplerWithoutReplacement
from torchrl.data.datasets.d4rl import D4RLExperienceReplay
data = D4RLExperienceReplay(
    "maze2d-open-v0",
    split_trajs=True,
    batch_size=128,
    sampler=SamplerWithoutReplacement(drop_last=True),
)
for sample in data:  # or alternatively sample = data.sample()
    fun(sample)

cross-library environment transforms⁽¹⁾, executed on device and in a vectorized fashion⁽²⁾, which process and prepare the data coming out of the environments to be used by the agent:


env_make = lambda: GymEnv("Pendulum-v1", from_pixels=True)
env_base = ParallelEnv(4, env_make, device="cuda:0")  # creates 4 envs in parallel
env = TransformedEnv(
    env_base,
    Compose(
        ToTensorImage(),
        ObservationNorm(loc=0.5, scale=1.0)),  # executes the transforms once and on device
)
tensordict = env.reset()
assert tensordict.device == torch.device("cuda:0")

Other transforms include: reward scaling (RewardScaling), shape operations (concatenation of tensors, unsqueezing etc.), concatenation of successive frames (CatFrames), resizing (Resize) and many more.

Unlike other libraries, the transforms are stacked as a list (and not wrapped in each other), which makes it easy to add and remove them at will:

env.insert_transform(0, NoopResetEnv())  # inserts the NoopResetEnv transform at the index 0

Nevertheless, transforms can access and execute operations on the parent environment:

transform = env.transform[1]  # gathers the second transform of the list
parent_env = transform.parent  # returns the base environment of the second transform, i.e. the base env + the first transform
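
The difference between stacking transforms in a list and wrapping them inside each other can be sketched in a few lines of plain Python. The class name TransformList and the toy transforms below are hypothetical and only illustrate the design idea; they are not part of TorchRL.

```python
class TransformList:
    """Applies a flat, editable list of callables in order."""

    def __init__(self, *transforms):
        self.transforms = list(transforms)

    def insert(self, index, transform):
        # Editing the pipeline is a list operation; nothing needs re-wrapping.
        self.transforms.insert(index, transform)

    def __call__(self, observation):
        for transform in self.transforms:
            observation = transform(observation)
        return observation


def to_unit_range(obs):
    return [x / 255.0 for x in obs]


def center(obs):
    return [x - 0.5 for x in obs]


pipeline = TransformList(to_unit_range, center)
before = pipeline([0.0, 255.0])
pipeline.insert(0, lambda obs: obs[:1])  # new first step: keep one element
after = pipeline([0.0, 255.0])
print(before, after)  # [-0.5, 0.5] [-0.5]
```

With nested wrappers, inserting a step at the front would require unwrapping and rebuilding every outer layer; with a list, it is a single insert.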

various tools for distributed learning (e.g. memory mapped tensors)⁽²⁾;

various architectures and models (e.g. actor-critic)⁽¹⁾:


# create an nn.Module
common_module = ConvNet(
    bias_last_layer=True,
    depth=None,
    num_cells=[32, 64, 64],
    kernel_sizes=[8, 4, 3],
    strides=[4, 2, 1],
)
# Wrap it in a SafeModule, indicating what key to read in and where to
# write out the output
common_module = SafeModule(
    common_module,
    in_keys=["pixels"],
    out_keys=["hidden"],
)
# Wrap the policy module in NormalParamsWrapper, such that the output
# tensor is split in loc and scale, and scale is mapped onto a positive space
policy_module = SafeModule(
    NormalParamsWrapper(
        MLP(num_cells=[64, 64], out_features=32, activation=nn.ELU)
    ),
    in_keys=["hidden"],
    out_keys=["loc", "scale"],
)
# Use a SafeProbabilisticTensorDictSequential to combine the SafeModule with a
# SafeProbabilisticModule, indicating how to build the
# torch.distribution.Distribution object and what to do with it
policy_module = SafeProbabilisticTensorDictSequential(  # stochastic policy
    policy_module,
    SafeProbabilisticModule(
        in_keys=["loc", "scale"],
        out_keys="action",
        distribution_class=TanhNormal,
    ),
)
value_module = MLP(
    num_cells=[64, 64],
    out_features=1,
    activation=nn.ELU,
)
# Wrap the policy and value function in a common module
actor_value = ActorValueOperator(common_module, policy_module, value_module)
# standalone policy from this
standalone_policy = actor_value.get_policy_operator()

exploration wrappers and modules to easily swap between exploration and exploitation⁽¹⁾:


policy_explore = EGreedyWrapper(policy)
with set_exploration_type(ExplorationType.RANDOM):
    tensordict = policy_explore(tensordict)  # will use eps-greedy
with set_exploration_type(ExplorationType.DETERMINISTIC):
    tensordict = policy_explore(tensordict)  # will not use eps-greedy
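
The switch between exploration and exploitation shown above boils down to a small rule: with probability eps pick a random action, otherwise pick the greedy one, and skip the random branch entirely in deterministic mode. The sketch below expresses this in plain Python. The function egreedy_action and its explore flag are assumptions for illustration; in TorchRL the mode is selected by the set_exploration_type context manager rather than by an argument.

```python
import random


def egreedy_action(action_values, eps, rng, explore=True):
    """Return an action index chosen epsilon-greedily from action_values."""
    if explore and rng.random() < eps:
        # Exploration branch: uniform random action.
        return rng.randrange(len(action_values))
    # Exploitation branch: index of the highest value.
    return max(range(len(action_values)), key=action_values.__getitem__)


rng = random.Random(0)
q_values = [0.1, 0.9, 0.3]
greedy = egreedy_action(q_values, eps=1.0, rng=rng, explore=False)
print(greedy)  # 1, the greedy action, because exploration is switched off
```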

A series of efficient loss modules and highly vectorized functional return and advantage computation.


Loss modules

from torchrl.objectives import DQNLoss
loss_module = DQNLoss(value_network=value_network, gamma=0.99)
tensordict = replay_buffer.sample(batch_size)
loss = loss_module(tensordict)

Advantage computation

from torchrl.objectives.value.functional import vec_td_lambda_return_estimate
advantage = vec_td_lambda_return_estimate(gamma, lmbda, next_state_value, reward, done, terminated)

a generic trainer class⁽¹⁾ that executes the aforementioned training loop. Through a hooking mechanism, it also supports any logging or data transformation operation at any given time.
various recipes to build models that correspond to the environment being deployed.

If you feel a feature is missing from the library, please submit an issue! If you would like to contribute to new features, check our call for contributions and our contribution page.

Examples, tutorials and demos

A series of state-of-the-art implementations are provided for illustrative purposes:

Algorithm	Compile Support**	Tensordict-free API	Modular Losses	Continuous and Discrete
DQN	1.9x	+	NA	+ (through ActionDiscretizer transform)
DDPG	1.87x	+	+	- (continuous only)
IQL	3.22x	+	+	+
CQL	2.68x	+	+	+
TD3	2.27x	+	+	- (continuous only)
TD3+BC	untested	+	+	- (continuous only)
A2C	2.67x	+	-	+
PPO	2.42x	+	-	+
SAC	2.62x	+	-	+
REDQ	2.28x	+	-	- (continuous only)
Dreamer v1	untested	+	+ (different classes)	- (continuous only)
Decision Transformers	untested	+	NA	- (continuous only)
CrossQ	untested	+	+	- (continuous only)
Gail	untested	+	NA	+
Impala	untested	+	-	+
IQL (MARL)	untested	+	+	+
DDPG (MARL)	untested	+	+	- (continuous only)
PPO (MARL)	untested	+	-	+
QMIX-VDN (MARL)	untested	+	NA	+
SAC (MARL)	untested	+	-	+
RLHF	NA	+	NA	NA

** The number indicates expected speed-up compared to eager mode when executed on CPU. Numbers may vary depending on architecture and device.

and many more to come!

Code examples displaying toy code snippets and training scripts are also available.

Check the examples directory for more details about handling the various configuration settings.

We also provide tutorials and demos that give a sense of what the library can do.

Citation

If you're using TorchRL, please refer to this BibTeX entry to cite this work:

@misc{bou2023torchrl,
      title={TorchRL: A data-driven decision-making library for PyTorch}, 
      author={Albert Bou and Matteo Bettini and Sebastian Dittert and Vikash Kumar and Shagun Sodhani and Xiaomeng Yang and Gianni De Fabritiis and Vincent Moens},
      year={2023},
      eprint={2306.00577},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

Installation

Create a new virtual environment:

python -m venv torchrl
source torchrl/bin/activate  # On Windows use: torchrl\Scripts\activate

Or create a conda environment where the packages will be installed.

conda create --name torchrl python=3.9
conda activate torchrl

Install dependencies:

PyTorch

Depending on how you intend to use torchrl, you may want to install the latest (nightly) PyTorch release or the latest stable version of PyTorch. See here for a detailed list of commands, including pip3 or other special installation instructions.

TorchRL offers a few pre-defined dependencies such as "torchrl[tests]", "torchrl[atari]" etc.

Torchrl

You can install the latest stable release by using

pip3 install torchrl

This should work on Linux (including AArch64 machines), Windows 10 and macOS (Metal chips only). On certain Windows machines (Windows 11), one should build the library locally. This can be done in two ways:

# Install and build locally v0.8.1 of the library without cloning
pip3 install git+https://github.com/pytorch/rl@v0.8.1
# Clone the library and build it locally
git clone https://github.com/pytorch/tensordict
git clone https://github.com/pytorch/rl
pip install -e tensordict
pip install -e rl

Note that tensordict local build requires cmake to be installed via homebrew (MacOS) or another package manager such as apt, apt-get, conda or yum but NOT pip, as well as pip install "pybind11[global]".

One can also build the wheels to distribute to co-workers using

python setup.py bdist_wheel

Your wheels will be stored in ./dist/torchrl<name>.whl and can be installed via

pip install torchrl<name>.whl

The nightly build can be installed via

pip3 install tensordict-nightly torchrl-nightly

which we currently only ship for Linux machines. Importantly, the nightly builds require the nightly builds of PyTorch too. Also, a local build of torchrl with the nightly build of tensordict may fail - install both nightlies or both local builds but do not mix them.
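
The "do not mix" rule can be checked against the distributions pip reports as installed. The sketch below makes a labeled assumption: that nightly builds register under the names tensordict-nightly and torchrl-nightly (the names used in the pip command above), while stable and local builds register as tensordict and torchrl. The helper names are hypothetical.

```python
from importlib import metadata

# Assumed distribution names; adjust if your builds register differently.
CANDIDATES = ("tensordict", "tensordict-nightly", "torchrl", "torchrl-nightly")


def installed_candidates() -> set[str]:
    """Return the subset of CANDIDATES that is currently installed."""
    found = set()
    for name in CANDIDATES:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            continue
        found.add(name)
    return found


def mixes_nightly_and_regular(installed: set[str]) -> bool:
    """True if one library is a nightly while the other is a stable/local build."""
    td_nightly = "tensordict-nightly" in installed
    rl_nightly = "torchrl-nightly" in installed
    td_regular = "tensordict" in installed
    rl_regular = "torchrl" in installed
    return (td_nightly and rl_regular) or (rl_nightly and td_regular)


if mixes_nightly_and_regular(installed_candidates()):
    print("Warning: nightly and non-nightly builds of tensordict/torchrl are mixed.")
```

This only inspects distribution names. It cannot tell a local editable build from a stable release, so treat a clean result as a hint rather than a guarantee.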

Disclaimer: As of today, TorchRL is roughly compatible with any PyTorch version >= 2.1, and installing it will not directly require a newer version of PyTorch. Indirectly though, tensordict still requires the latest PyTorch to be installed, and we are working hard to loosen that requirement. The C++ binaries of TorchRL (mainly for prioritized replay buffers) will only work with PyTorch 2.7.0 and above. Some features (e.g., working with nested jagged tensors) may also be limited with older versions of PyTorch. It is recommended to use the latest TorchRL with the latest PyTorch version unless there is a strong reason not to do so.
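
The two thresholds in the disclaimer (2.1 for rough compatibility, 2.7.0 for the C++ binaries) can be turned into a quick self-check. This is a minimal sketch, not an official compatibility test. The function names and the three labels are hypothetical, and the parser deliberately ignores local and pre-release suffixes such as +cu126 or .dev20250601.

```python
import re

MIN_TORCH = (2, 1, 0)      # "roughly compatible with any PyTorch version >= 2.1"
MIN_TORCH_CPP = (2, 7, 0)  # C++ binaries "only work with PyTorch 2.7.0 and above"


def parse_release(version: str) -> tuple[int, int, int]:
    """Extract (major, minor, patch) from a version string, ignoring any suffix."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def describe_support(torch_version: str) -> str:
    """Classify a PyTorch version against the thresholds in the disclaimer."""
    release = parse_release(torch_version)
    if release < MIN_TORCH:
        return "unsupported"   # below the stated minimum
    if release < MIN_TORCH_CPP:
        return "python-only"   # usable, but the C++ binaries will not work
    return "full"


for candidate in ("2.0.1", "2.6.0+cu124", "2.7.1"):
    print(candidate, "->", describe_support(candidate))
```

With PyTorch installed, calling describe_support(torch.__version__) applies the same check to the active environment.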

Optional dependencies

The following libraries can be installed depending on how you intend to use torchrl:

# diverse
pip3 install tqdm tensorboard "hydra-core>=1.1" hydra-submitit-launcher

# rendering
pip3 install "moviepy<2.0.0"

# deepmind control suite
pip3 install dm_control

# gym, atari games
pip3 install "gym[atari]" "gym[accept-rom-license]" pygame

# tests
pip3 install pytest pyyaml pytest-instafail

# tensorboard
pip3 install tensorboard

# wandb
pip3 install wandb

Versioning issues can cause error messages such as undefined symbol. For these, refer to the versioning issues document for a complete explanation and proposed workarounds.

Asking a question

If you spot a bug in the library, please raise an issue in this repo.

If you have a more generic question regarding RL in PyTorch, post it on the PyTorch forum.

Contributing

Internal collaborations to torchrl are welcome! Feel free to fork, submit issues and PRs. You can check out the detailed contribution guide here. As mentioned above, a list of open contributions can be found here.

Contributors are encouraged to install pre-commit hooks (using pre-commit install). pre-commit will check for linting-related issues when the code is committed locally. You can disable the check by appending -n to your commit command: git commit -m <commit message> -n

Disclaimer

This library is released as a PyTorch beta feature. BC-breaking changes are likely to happen, but they will be introduced with a deprecation warning after a few release cycles.

License

TorchRL is licensed under the MIT License. See LICENSE for details.


Writing simplified and portable RL codebase with `TensorDict`