
vwxyzjn/cleanrl

High-quality single file implementation of Deep Reinforcement Learning algorithms with research-friendly features (PPO, DQN, C51, DDPG, TD3, SAC, PPG)


Top Related Projects

  • openai/baselines: OpenAI Baselines, high-quality implementations of reinforcement learning algorithms
  • hill-a/stable-baselines: a fork of OpenAI Baselines, implementations of reinforcement learning algorithms
  • DLR-RM/stable-baselines3: PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms
  • Farama-Foundation/Gymnasium: an API standard for single-agent reinforcement learning environments, with popular reference environments and related utilities (formerly Gym)
  • ikostrikov/pytorch-a2c-ppo-acktr-gail: PyTorch implementation of Advantage Actor Critic (A2C), Proximal Policy Optimization (PPO), Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation (ACKTR), and Generative Adversarial Imitation Learning (GAIL)
  • openai/spinningup: an educational resource to help anyone learn deep reinforcement learning

Quick Overview

CleanRL is a deep reinforcement learning library that provides high-quality single-file implementations of popular RL algorithms. It focuses on simplicity, readability, and reproducibility, making it an excellent resource for both learning and research in the field of reinforcement learning.

Pros

  • Single-file implementations for easy understanding and modification
  • Supports various popular RL algorithms and environments
  • Well-documented and actively maintained
  • Integrates with Weights & Biases for experiment tracking

Cons

  • May not be as feature-rich as larger, more established RL libraries
  • Limited to a subset of RL algorithms compared to some other libraries
  • Single-file approach may not scale well for very complex algorithms
  • Might require additional dependencies for certain environments

Code Examples

  1. Training a PPO agent on the CartPole environment. CleanRL's algorithms are standalone scripts run from the command line rather than modules to import:

python cleanrl/ppo.py --env-id CartPole-v1 --total-timesteps 500000 --learning-rate 2.5e-4

  2. Using a custom neural network architecture. Because each algorithm lives in a single file, you change the architecture by editing the network class directly in the script (for example, the Q-network in cleanrl/dqn.py) and re-running it:

import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, env):
        super().__init__()
        # A simple MLP mapping observations to one Q-value per discrete action.
        # Note: CleanRL's scripts use vectorized envs, so inside dqn.py you would
        # read env.single_observation_space / env.single_action_space instead.
        self.network = nn.Sequential(
            nn.Linear(env.observation_space.shape[0], 64),
            nn.ReLU(),
            nn.Linear(64, 64),
            nn.ReLU(),
            nn.Linear(64, env.action_space.n),
        )

    def forward(self, x):
        return self.network(x)

Then run the modified script as usual:

python cleanrl/dqn.py --env-id LunarLander-v2 --total-timesteps 1000000

  3. Enabling Weights & Biases logging:

python cleanrl/sac_continuous_action.py --env-id HalfCheetah-v4 --track --wandb-project-name cleanrl --wandb-entity your_username
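
The scripts can also capture videos of the agent's gameplay (one of the repository's highlighted features); the flag below follows the naming convention used across CleanRL's scripts:

python cleanrl/ppo.py --env-id CartPole-v1 --capture-video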

Getting Started

To get started with CleanRL, follow these steps:

  1. Clone the repository and install the dependencies (the project uses Poetry):

git clone https://github.com/vwxyzjn/cleanrl.git && cd cleanrl
poetry install

  2. Train an agent with one of the provided single-file scripts:

poetry run python cleanrl/ppo.py --env-id CartPole-v1 --total-timesteps 500000

  3. Customize hyperparameters through command-line flags as needed:

poetry run python cleanrl/dqn.py \
    --env-id LunarLander-v2 \
    --total-timesteps 1000000 \
    --learning-rate 1e-4 \
    --buffer-size 100000 \
    --gamma 0.99

For more detailed information and advanced usage, refer to the official documentation and examples in the GitHub repository.

Competitor Comparisons


OpenAI Baselines: high-quality implementations of reinforcement learning algorithms

Pros of Baselines

  • Comprehensive collection of RL algorithms
  • Well-established and widely used in research
  • Extensive documentation and community support

Cons of Baselines

  • Complex codebase with interdependencies
  • Less focus on readability and simplicity
  • Slower development and updates

Code Comparison

Baselines (PPO implementation):

def learn(*, network, env, total_timesteps, eval_env = None, seed=None, nsteps=2048, ent_coef=0.0, lr=3e-4,
            vf_coef=0.5,  max_grad_norm=0.5, gamma=0.99, lam=0.95,
            log_interval=10, nminibatches=4, noptepochs=4, cliprange=0.2,
            save_interval=0, load_path=None, model_fn=None, update_fn=None, init_fn=None, mpi_rank_weight=1, comm=None, **network_kwargs):

CleanRL (the equivalent configuration is passed as command-line flags to the standalone ppo.py script; the values shown are illustrative):

python cleanrl/ppo.py --env-id CartPole-v1 --num-envs 4 --num-steps 128 --learning-rate 2.5e-4 --total-timesteps 500000 --seed 1

CleanRL offers a more streamlined and readable implementation, focusing on simplicity and ease of understanding, while Baselines provides a more feature-rich but complex implementation with numerous parameters and options.

A fork of OpenAI Baselines, implementations of reinforcement learning algorithms

Pros of Stable-Baselines

  • More comprehensive library with a wider range of implemented algorithms
  • Better documentation and tutorials for beginners
  • Integrated with OpenAI Gym environments out of the box

Cons of Stable-Baselines

  • Larger codebase, potentially harder to understand and modify
  • Less focus on simplicity and readability compared to CleanRL
  • May have more dependencies and be harder to set up in some environments

Code Comparison

CleanRL-style GAE computation (illustrative; in the actual ppo.py this loop is written inline in the training script rather than as a separate method):

def compute_gae(next_value, rewards, masks, values, gamma=0.99, lam=0.95):
    values = values + [next_value]
    gae = 0
    returns = []
    for step in reversed(range(len(rewards))):
        delta = rewards[step] + gamma * values[step + 1] * masks[step] - values[step]
        gae = delta + gamma * lam * masks[step] * gae
        returns.insert(0, gae + values[step])
    return returns

Stable-Baselines-style GAE computation (illustrative):

def compute_gae(self, rewards, values, dones, last_values, last_dones):
    advantages = np.zeros_like(rewards)
    last_gae_lam = 0
    for step in reversed(range(self.n_steps)):
        if step == self.n_steps - 1:
            next_non_terminal = 1.0 - last_dones
            next_values = last_values
        else:
            next_non_terminal = 1.0 - dones[step + 1]
            next_values = values[step + 1]
        delta = rewards[step] + self.gamma * next_values * next_non_terminal - values[step]
        advantages[step] = last_gae_lam = delta + self.gamma * self.lam * next_non_terminal * last_gae_lam
    return advantages + values
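
Both snippets implement the same generalized advantage estimation recursion; only the bookkeeping differs. As a quick sanity check, here is a self-contained toy example of that recursion (the reward and value numbers are made up purely for illustration):

import numpy as np

# Toy 3-step rollout; masks[t] is 0 where the episode terminated after step t
rewards = np.array([1.0, 0.0, 1.0])
values = np.array([0.5, 0.4, 0.6])   # V(s_0), V(s_1), V(s_2)
next_value = 0.3                     # bootstrap value V(s_3)
masks = np.array([1.0, 1.0, 0.0])
gamma, lam = 0.99, 0.95

values_ext = np.append(values, next_value)
advantages = np.zeros_like(rewards)
gae = 0.0
for t in reversed(range(len(rewards))):
    delta = rewards[t] + gamma * values_ext[t + 1] * masks[t] - values_ext[t]
    gae = delta + gamma * lam * masks[t] * gae
    advantages[t] = gae
returns = advantages + values
print(advantages, returns)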

PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.

Pros of stable-baselines3

  • More comprehensive documentation and tutorials
  • Wider range of implemented algorithms
  • Better integration with OpenAI Gym environments

Cons of stable-baselines3

  • More complex codebase, potentially harder to understand and modify
  • Slower training speed due to additional features and abstractions

Code Comparison

stable-baselines3:

from stable_baselines3 import PPO

model = PPO("MlpPolicy", "CartPole-v1", verbose=1)
model.learn(total_timesteps=10000)

CleanRL (run as a standalone script from the command line):

python cleanrl/ppo.py --env-id CartPole-v1 --total-timesteps 10000

Both libraries aim to provide implementations of reinforcement learning algorithms, but they differ in their approach. stable-baselines3 offers a feature-rich, abstracted library interface, while CleanRL ships each algorithm as a self-contained script. The comparison shows that stable-baselines3 can train an agent in a few lines of library code at the cost of some flexibility, while CleanRL gives full visibility into, and control over, the implementation details at the cost of editing the script itself to change its behavior.

An API standard for single-agent reinforcement learning environments, with popular reference environments and related utilities (formerly Gym)

Pros of Gymnasium

  • Broader scope and more comprehensive environment suite
  • Better maintained and more actively developed
  • Stronger community support and wider adoption in the RL field

Cons of Gymnasium

  • More complex codebase, potentially harder for beginners
  • Heavier dependencies and larger installation footprint
  • Provides environments and an API standard rather than algorithm implementations, so it is not a direct alternative to CleanRL

Code Comparison

Gymnasium example:

import gymnasium as gym
env = gym.make("CartPole-v1")
observation, info = env.reset(seed=42)
for _ in range(1000):
    action = env.action_space.sample()
    observation, reward, terminated, truncated, info = env.step(action)

CleanRL example (legacy Gym API, used prior to the Gymnasium migration):

import gym
env = gym.make("CartPole-v1")
obs = env.reset()
for _ in range(1000):
    action = env.action_space.sample()
    obs, reward, done, info = env.step(action)

The main difference is the API: Gymnasium's reset() returns both an observation and an info dict, and step() splits the old done flag into separate terminated and truncated flags.
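
Porting a training loop from the old Gym API to Gymnasium mostly amounts to unpacking the extra return values; a minimal sketch (assuming the gymnasium package is installed):

import gymnasium as gym

# reset() now returns (observation, info); step() splits the old `done`
# into `terminated` (MDP termination) and `truncated` (time-limit cutoff).
env = gym.make("CartPole-v1")
obs, info = env.reset(seed=42)
for _ in range(1000):
    action = env.action_space.sample()
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:   # old-API equivalent of `done`
        obs, info = env.reset()
env.close()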

PyTorch implementation of Advantage Actor Critic (A2C), Proximal Policy Optimization (PPO), Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation (ACKTR) and Generative Adversarial Imitation Learning (GAIL).

Pros of pytorch-a2c-ppo-acktr-gail

  • Implements a wider range of algorithms (A2C, PPO, ACKTR, GAIL)
  • More established project with a larger community and longer history
  • Provides pre-trained models for some environments

Cons of pytorch-a2c-ppo-acktr-gail

  • Less focus on code readability and simplicity
  • Fewer comments and explanations in the codebase
  • Less frequent updates and maintenance

Code Comparison

pytorch-a2c-ppo-acktr-gail:

class Policy(nn.Module):
    def __init__(self, obs_shape, action_space, base=None, base_kwargs=None):
        super(Policy, self).__init__()
        if base_kwargs is None:
            base_kwargs = {}
        if base is None:
            base = MLPBase
        self.base = base(obs_shape[0], **base_kwargs)

cleanrl:

class Agent(nn.Module):
    def __init__(self, envs):
        super().__init__()
        self.critic = nn.Sequential(
            layer_init(nn.Linear(np.array(envs.single_observation_space.shape).prod(), 64)),
            nn.Tanh(),
            layer_init(nn.Linear(64, 64)),
            nn.Tanh(),
            layer_init(nn.Linear(64, 1), std=1.0),
        )

The code comparison shows that cleanrl uses a more straightforward and readable approach to defining the neural network architecture, while pytorch-a2c-ppo-acktr-gail uses a more modular and flexible design.
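
For context, the layer_init helper referenced in the CleanRL snippet applies orthogonal weight initialization with a configurable gain and a constant bias; a minimal sketch in that spirit (not a verbatim copy of the repository's code):

import numpy as np
import torch
import torch.nn as nn

def layer_init(layer, std=np.sqrt(2), bias_const=0.0):
    # Orthogonal weights (gain `std`) and constant bias, a common choice in PPO implementations
    torch.nn.init.orthogonal_(layer.weight, std)
    torch.nn.init.constant_(layer.bias, bias_const)
    return layer

value_head = layer_init(nn.Linear(64, 1), std=1.0)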

An educational resource to help anyone learn deep reinforcement learning.

Pros of SpinningUp

  • Comprehensive educational resources and tutorials
  • Covers classic on-policy algorithms such as VPG and TRPO that CleanRL does not implement
  • Backed by OpenAI, ensuring high-quality implementations

Cons of SpinningUp

  • Less frequently updated compared to CleanRL
  • More complex codebase, potentially harder for beginners to understand
  • Provides no JAX implementations, while CleanRL offers JAX variants of several algorithms

Code Comparison

SpinningUp (PPO implementation):

def ppo(env_fn, actor_critic=core.mlp_actor_critic, ac_kwargs=dict(), seed=0,
        steps_per_epoch=4000, epochs=50, gamma=0.99, clip_ratio=0.2, pi_lr=3e-4,
        vf_lr=1e-3, train_pi_iters=80, train_v_iters=80, lam=0.97, max_ep_len=1000,
        target_kl=0.01, logger_kwargs=dict(), save_freq=10):

CleanRL (the equivalent configuration is passed as command-line flags to the standalone ppo.py script):

python cleanrl/ppo.py --env-id CartPole-v1 --total-timesteps 500000 --learning-rate 2.5e-4 --seed 1

Both repositories provide implementations of popular RL algorithms, but CleanRL focuses on simplicity and readability, while SpinningUp pairs its implementations with extensive educational material.


README

CleanRL (Clean Implementation of RL Algorithms)


CleanRL is a Deep Reinforcement Learning library that provides high-quality single-file implementation with research-friendly features. The implementation is clean and simple, yet we can scale it to run thousands of experiments using AWS Batch. The highlight features of CleanRL are:

  • 📜 Single-file implementation
    • Every detail about an algorithm variant is put into a single standalone file.
    • For example, our ppo_atari.py only has 340 lines of code but contains all implementation details on how PPO works with Atari games, so it is a great reference implementation to read for folks who do not wish to read an entire modular library.
  • 📊 Benchmarked Implementation (7+ algorithms and 34+ games at https://benchmark.cleanrl.dev)
  • 📈 Tensorboard Logging
  • 🪛 Local Reproducibility via Seeding (see the seeding sketch after this list)
  • 🎮 Videos of Gameplay Capturing
  • 🧫 Experiment Management with Weights and Biases
  • 💸 Cloud Integration with docker and AWS
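
For example, the local reproducibility feature above boils down to seeding every source of randomness at the top of each script; a minimal sketch of that pattern (the helper name and arguments are illustrative, not CleanRL's exact code):

import random
import numpy as np
import torch

def seed_everything(seed: int, torch_deterministic: bool = True):
    # Seed Python, NumPy, and PyTorch so repeated runs produce the same rollouts
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    # Trade a bit of speed for deterministic cuDNN kernels
    torch.backends.cudnn.deterministic = torch_deterministic

seed_everything(1)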

You can read more about CleanRL in our JMLR paper and documentation.

Notable CleanRL-related projects:

  • corl-team/CORL: Offline RL algorithms implemented in CleanRL style
  • pytorch-labs/LeanRL: Fast optimized PyTorch implementation of CleanRL RL algorithms using CUDAGraphs.

ℹ️ Support for Gymnasium: Farama-Foundation/Gymnasium is the next generation of openai/gym that will continue to be maintained and introduce new features. Please see their announcement for further detail. We are migrating to gymnasium and the progress can be tracked in vwxyzjn/cleanrl#277.

⚠️ NOTE: CleanRL is not a modular library and therefore it is not meant to be imported. At the cost of duplicate code, we make all implementation details of a DRL algorithm variant easy to understand, so CleanRL comes with its own pros and cons. You should consider using CleanRL if you want to 1) understand all implementation details of an algorithm's variant or 2) prototype advanced features that other modular DRL libraries do not support (CleanRL has minimal lines of code, so it gives you a great debugging experience, and you don't have to do a lot of subclassing as is sometimes required in modular DRL libraries).

Get started

Prerequisites: a recent version of Python and Poetry (see the repository for the exact supported versions).

To run experiments locally, give the following a try:

git clone https://github.com/vwxyzjn/cleanrl.git && cd cleanrl
poetry install

# alternatively, you could use `poetry shell` and then
# run `python cleanrl/ppo.py` directly
poetry run python cleanrl/ppo.py \
    --seed 1 \
    --env-id CartPole-v0 \
    --total-timesteps 50000

# open another terminal and enter `cd cleanrl/cleanrl`
tensorboard --logdir runs
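
The runs/ directory that TensorBoard reads is populated by each script via torch.utils.tensorboard's SummaryWriter; a minimal sketch of that logging pattern (the run name and scalar tags here are illustrative):

import time
from torch.utils.tensorboard import SummaryWriter

run_name = f"CartPole-v0__ppo__1__{int(time.time())}"
writer = SummaryWriter(f"runs/{run_name}")
for global_step in range(3):
    # CleanRL-style scripts log training charts and losses against the global step
    writer.add_scalar("charts/episodic_return", 100.0 + global_step, global_step)
writer.close()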

To use experiment tracking with wandb, run

wandb login # only required for the first time
poetry run python cleanrl/ppo.py \
    --seed 1 \
    --env-id CartPole-v0 \
    --total-timesteps 50000 \
    --track \
    --wandb-project-name cleanrltest

If you are not using poetry, you can install CleanRL with requirements.txt:

# core dependencies
pip install -r requirements/requirements.txt

# optional dependencies
pip install -r requirements/requirements-atari.txt
pip install -r requirements/requirements-mujoco.txt
pip install -r requirements/requirements-mujoco_py.txt
pip install -r requirements/requirements-procgen.txt
pip install -r requirements/requirements-envpool.txt
pip install -r requirements/requirements-pettingzoo.txt
pip install -r requirements/requirements-jax.txt
pip install -r requirements/requirements-docs.txt
pip install -r requirements/requirements-cloud.txt
pip install -r requirements/requirements-memory_gym.txt

To run training scripts in other games:

poetry shell

# classic control
python cleanrl/dqn.py --env-id CartPole-v1
python cleanrl/ppo.py --env-id CartPole-v1
python cleanrl/c51.py --env-id CartPole-v1

# atari
poetry install -E atari
python cleanrl/dqn_atari.py --env-id BreakoutNoFrameskip-v4
python cleanrl/c51_atari.py --env-id BreakoutNoFrameskip-v4
python cleanrl/ppo_atari.py --env-id BreakoutNoFrameskip-v4
python cleanrl/sac_atari.py --env-id BreakoutNoFrameskip-v4

# NEW: 3-4x speed up, free of side effects, with envpool's Atari environments (Linux only)
poetry install -E envpool
python cleanrl/ppo_atari_envpool.py --env-id BreakoutNoFrameskip-v4
# Learn Pong-v5 in ~5-10 mins
# Side effects such as lower sample efficiency might occur
poetry run python cleanrl/ppo_atari_envpool.py --clip-coef=0.2 --num-envs=16 --num-minibatches=8 --num-steps=128 --update-epochs=3

# procgen
poetry install -E procgen
python cleanrl/ppo_procgen.py --env-id starpilot
python cleanrl/ppg_procgen.py --env-id starpilot

# ppo + lstm
poetry install -E atari
python cleanrl/ppo_atari_lstm.py --env-id BreakoutNoFrameskip-v4

You may also use a prebuilt development environment hosted in Gitpod:

Open in Gitpod

Algorithms Implemented

Each variant listed below has a corresponding page in the documentation.

  • ✅ Proximal Policy Gradient (PPO): ppo.py, ppo_atari.py, ppo_continuous_action.py, ppo_atari_lstm.py, ppo_atari_envpool.py, ppo_atari_envpool_xla_jax.py, ppo_atari_envpool_xla_jax_scan.py, ppo_procgen.py, ppo_atari_multigpu.py, ppo_pettingzoo_ma_atari.py, ppo_continuous_action_isaacgym.py, ppo_trxl.py
  • ✅ Deep Q-Learning (DQN): dqn.py, dqn_atari.py, dqn_jax.py, dqn_atari_jax.py
  • ✅ Categorical DQN (C51): c51.py, c51_atari.py, c51_jax.py, c51_atari_jax.py
  • ✅ Soft Actor-Critic (SAC): sac_continuous_action.py, sac_atari.py
  • ✅ Deep Deterministic Policy Gradient (DDPG): ddpg_continuous_action.py, ddpg_continuous_action_jax.py
  • ✅ Twin Delayed Deep Deterministic Policy Gradient (TD3): td3_continuous_action.py, td3_continuous_action_jax.py
  • ✅ Phasic Policy Gradient (PPG): ppg_procgen.py
  • ✅ Random Network Distillation (RND): ppo_rnd_envpool.py
  • ✅ Qdagger: qdagger_dqn_atari_impalacnn.py, qdagger_dqn_atari_jax_impalacnn.py

Open RL Benchmark

To make our experimental data transparent, CleanRL participates in a related project called Open RL Benchmark, which contains tracked experiments from popular DRL libraries such as ours, Stable-baselines3, openai/baselines, jaxrl, and others.

Check out https://benchmark.cleanrl.dev/ for a collection of Weights and Biases reports showcasing tracked DRL experiments. The reports are interactive, and researchers can easily query information such as GPU utilization and videos of an agent's gameplay that are normally hard to acquire in other RL benchmarks. In the future, Open RL Benchmark will likely provide a dataset API for researchers to easily access the data (see repo).

Support and get involved

We have a Discord Community for support. Feel free to ask questions. Posting in GitHub Issues and PRs is also welcome. Our past video recordings are available on YouTube.

Citing CleanRL

If you use CleanRL in your work, please cite our technical paper:

@article{huang2022cleanrl,
  author  = {Shengyi Huang and Rousslan Fernand Julien Dossa and Chang Ye and Jeff Braga and Dipam Chakraborty and Kinal Mehta and João G.M. Araújo},
  title   = {CleanRL: High-quality Single-file Implementations of Deep Reinforcement Learning Algorithms},
  journal = {Journal of Machine Learning Research},
  year    = {2022},
  volume  = {23},
  number  = {274},
  pages   = {1--18},
  url     = {http://jmlr.org/papers/v23/21-1342.html}
}

Acknowledgement

CleanRL is a community-powered project, and our contributors run experiments on a variety of hardware.

  • We thank many contributors for using their own computers to run experiments
  • We thank Google's TPU research cloud for providing TPU resources.
  • We thank Hugging Face's cluster for providing GPU resources.