
vwxyzjn/cleanrl

High-quality single file implementation of Deep Reinforcement Learning algorithms with research-friendly features (PPO, DQN, C51, DDPG, TD3, SAC, PPG)


Top Related Projects

  • openai/baselines: OpenAI Baselines, high-quality implementations of reinforcement learning algorithms
  • hill-a/stable-baselines: a fork of OpenAI Baselines, implementations of reinforcement learning algorithms
  • DLR-RM/stable-baselines3: PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms
  • Farama-Foundation/Gymnasium: an API standard for single-agent reinforcement learning environments, with popular reference environments and related utilities (formerly Gym)
  • ikostrikov/pytorch-a2c-ppo-acktr-gail: PyTorch implementation of Advantage Actor Critic (A2C), Proximal Policy Optimization (PPO), Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation (ACKTR), and Generative Adversarial Imitation Learning (GAIL)
  • openai/spinningup: an educational resource to help anyone learn deep reinforcement learning

Quick Overview

CleanRL is a deep reinforcement learning library that provides high-quality single-file implementations of popular RL algorithms. It focuses on simplicity, readability, and reproducibility, making it an excellent resource for both learning and research in the field of reinforcement learning.

Pros

  • Single-file implementations for easy understanding and modification
  • Supports various popular RL algorithms and environments
  • Well-documented and actively maintained
  • Integrates with Weights & Biases for experiment tracking

Cons

  • May not be as feature-rich as larger, more established RL libraries
  • Limited to a subset of RL algorithms compared to some other libraries
  • Single-file approach may not scale well for very complex algorithms
  • Might require additional dependencies for certain environments

Code Examples

  1. Training a PPO agent on the CartPole environment. CleanRL's algorithms are standalone scripts run from the command line rather than modules to import:

python cleanrl/ppo.py --env-id CartPole-v1 --total-timesteps 500000 --learning-rate 2.5e-4

  2. Using a custom neural network architecture. Because each algorithm lives in a single file, you change the architecture by editing the network class directly in the script (for example, the Q-network in cleanrl/dqn.py) and re-running it:

import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, env):
        super().__init__()
        # A simple MLP mapping observations to one Q-value per discrete action.
        # Note: CleanRL's scripts use vectorized envs, so inside dqn.py you would
        # read env.single_observation_space / env.single_action_space instead.
        self.network = nn.Sequential(
            nn.Linear(env.observation_space.shape[0], 64),
            nn.ReLU(),
            nn.Linear(64, 64),
            nn.ReLU(),
            nn.Linear(64, env.action_space.n),
        )

    def forward(self, x):
        return self.network(x)

Then run the modified script as usual:

python cleanrl/dqn.py --env-id LunarLander-v2 --total-timesteps 1000000

  3. Enabling Weights & Biases logging:

python cleanrl/sac_continuous_action.py --env-id HalfCheetah-v4 --track --wandb-project-name cleanrl --wandb-entity your_username
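
The scripts can also capture videos of the agent's gameplay (one of the repository's highlighted features); the flag below follows the naming convention used across CleanRL's scripts:

python cleanrl/ppo.py --env-id CartPole-v1 --capture-video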

Getting Started

To get started with CleanRL, follow these steps:

  1. Clone the repository and install the dependencies (the project uses Poetry):

git clone https://github.com/vwxyzjn/cleanrl.git && cd cleanrl
poetry install

  2. Train an agent with one of the provided single-file scripts:

poetry run python cleanrl/ppo.py --env-id CartPole-v1 --total-timesteps 500000

  3. Customize hyperparameters through command-line flags as needed:

poetry run python cleanrl/dqn.py \
    --env-id LunarLander-v2 \
    --total-timesteps 1000000 \
    --learning-rate 1e-4 \
    --buffer-size 100000 \
    --gamma 0.99

For more detailed information and advanced usage, refer to the official documentation and examples in the GitHub repository.

Competitor Comparisons


OpenAI Baselines: high-quality implementations of reinforcement learning algorithms

Pros of Baselines

  • Comprehensive collection of RL algorithms
  • Well-established and widely used in research
  • Extensive documentation and community support

Cons of Baselines

  • Complex codebase with interdependencies
  • Less focus on readability and simplicity
  • Slower development and updates

Code Comparison

Baselines (PPO implementation):

def learn(*, network, env, total_timesteps, eval_env = None, seed=None, nsteps=2048, ent_coef=0.0, lr=3e-4,
            vf_coef=0.5,  max_grad_norm=0.5, gamma=0.99, lam=0.95,
            log_interval=10, nminibatches=4, noptepochs=4, cliprange=0.2,
            save_interval=0, load_path=None, model_fn=None, update_fn=None, init_fn=None, mpi_rank_weight=1, comm=None, **network_kwargs):

CleanRL (the equivalent configuration is passed as command-line flags to the standalone ppo.py script; the values shown are illustrative):

python cleanrl/ppo.py --env-id CartPole-v1 --num-envs 4 --num-steps 128 --learning-rate 2.5e-4 --total-timesteps 500000 --seed 1

CleanRL offers a more streamlined and readable implementation, focusing on simplicity and ease of understanding, while Baselines provides a more feature-rich but complex implementation with numerous parameters and options.

A fork of OpenAI Baselines, implementations of reinforcement learning algorithms

Pros of Stable-Baselines

  • More comprehensive library with a wider range of implemented algorithms
  • Better documentation and tutorials for beginners
  • Integrated with OpenAI Gym environments out of the box

Cons of Stable-Baselines

  • Larger codebase, potentially harder to understand and modify
  • Less focus on simplicity and readability compared to CleanRL
  • May have more dependencies and be harder to set up in some environments

Code Comparison

CleanRL-style GAE computation (illustrative; in the actual ppo.py this loop is written inline in the training script rather than as a separate method):

def compute_gae(next_value, rewards, masks, values, gamma=0.99, lam=0.95):
    values = values + [next_value]
    gae = 0
    returns = []
    for step in reversed(range(len(rewards))):
        delta = rewards[step] + gamma * values[step + 1] * masks[step] - values[step]
        gae = delta + gamma * lam * masks[step] * gae
        returns.insert(0, gae + values[step])
    return returns

Stable-Baselines-style GAE computation (illustrative):

def compute_gae(self, rewards, values, dones, last_values, last_dones):
    advantages = np.zeros_like(rewards)
    last_gae_lam = 0
    for step in reversed(range(self.n_steps)):
        if step == self.n_steps - 1:
            next_non_terminal = 1.0 - last_dones
            next_values = last_values
        else:
            next_non_terminal = 1.0 - dones[step + 1]
            next_values = values[step + 1]
        delta = rewards[step] + self.gamma * next_values * next_non_terminal - values[step]
        advantages[step] = last_gae_lam = delta + self.gamma * self.lam * next_non_terminal * last_gae_lam
    return advantages + values
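
Both snippets implement the same generalized advantage estimation recursion; only the bookkeeping differs. As a quick sanity check, here is a self-contained toy example of that recursion (the reward and value numbers are made up purely for illustration):

import numpy as np

# Toy 3-step rollout; masks[t] is 0 where the episode terminated after step t
rewards = np.array([1.0, 0.0, 1.0])
values = np.array([0.5, 0.4, 0.6])   # V(s_0), V(s_1), V(s_2)
next_value = 0.3                     # bootstrap value V(s_3)
masks = np.array([1.0, 1.0, 0.0])
gamma, lam = 0.99, 0.95

values_ext = np.append(values, next_value)
advantages = np.zeros_like(rewards)
gae = 0.0
for t in reversed(range(len(rewards))):
    delta = rewards[t] + gamma * values_ext[t + 1] * masks[t] - values_ext[t]
    gae = delta + gamma * lam * masks[t] * gae
    advantages[t] = gae
returns = advantages + values
print(advantages, returns)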

PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.

Pros of stable-baselines3

  • More comprehensive documentation and tutorials
  • Wider range of implemented algorithms
  • Better integration with OpenAI Gym environments

Cons of stable-baselines3

  • More complex codebase, potentially harder to understand and modify
  • Slower training speed due to additional features and abstractions

Code Comparison

stable-baselines3:

from stable_baselines3 import PPO

model = PPO("MlpPolicy", "CartPole-v1", verbose=1)
model.learn(total_timesteps=10000)

CleanRL (run as a standalone script from the command line):

python cleanrl/ppo.py --env-id CartPole-v1 --total-timesteps 10000

Both libraries aim to provide implementations of reinforcement learning algorithms, but they differ in their approach. stable-baselines3 offers a feature-rich, abstracted library interface, while CleanRL ships each algorithm as a self-contained script. The comparison shows that stable-baselines3 can train an agent in a few lines of library code at the cost of some flexibility, while CleanRL gives full visibility into, and control over, the implementation details at the cost of editing the script itself to change its behavior.

An API standard for single-agent reinforcement learning environments, with popular reference environments and related utilities (formerly Gym)

Pros of Gymnasium

  • Broader scope and more comprehensive environment suite
  • Better maintained and more actively developed
  • Stronger community support and wider adoption in the RL field

Cons of Gymnasium

  • More complex codebase, potentially harder for beginners
  • Heavier dependencies and larger installation footprint
  • Provides environments and an API standard rather than algorithm implementations, so it is not a direct alternative to CleanRL

Code Comparison

Gymnasium example:

import gymnasium as gym
env = gym.make("CartPole-v1")
observation, info = env.reset(seed=42)
for _ in range(1000):
    action = env.action_space.sample()
    observation, reward, terminated, truncated, info = env.step(action)

CleanRL example (legacy Gym API, used prior to the Gymnasium migration):

import gym
env = gym.make("CartPole-v1")
obs = env.reset()
for _ in range(1000):
    action = env.action_space.sample()
    obs, reward, done, info = env.step(action)

The main difference is the API: Gymnasium's reset() returns both an observation and an info dict, and step() splits the old done flag into separate terminated and truncated flags.
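
Porting a training loop from the old Gym API to Gymnasium mostly amounts to unpacking the extra return values; a minimal sketch (assuming the gymnasium package is installed):

import gymnasium as gym

# reset() now returns (observation, info); step() splits the old `done`
# into `terminated` (MDP termination) and `truncated` (time-limit cutoff).
env = gym.make("CartPole-v1")
obs, info = env.reset(seed=42)
for _ in range(1000):
    action = env.action_space.sample()
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:   # old-API equivalent of `done`
        obs, info = env.reset()
env.close()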

PyTorch implementation of Advantage Actor Critic (A2C), Proximal Policy Optimization (PPO), Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation (ACKTR) and Generative Adversarial Imitation Learning (GAIL).

Pros of pytorch-a2c-ppo-acktr-gail

  • Implements a wider range of algorithms (A2C, PPO, ACKTR, GAIL)
  • More established project with a larger community and longer history
  • Provides pre-trained models for some environments

Cons of pytorch-a2c-ppo-acktr-gail

  • Less focus on code readability and simplicity
  • Fewer comments and explanations in the codebase
  • Less frequent updates and maintenance

Code Comparison

pytorch-a2c-ppo-acktr-gail:

class Policy(nn.Module):
    def __init__(self, obs_shape, action_space, base=None, base_kwargs=None):
        super(Policy, self).__init__()
        if base_kwargs is None:
            base_kwargs = {}
        if base is None:
            base = MLPBase
        self.base = base(obs_shape[0], **base_kwargs)

cleanrl:

class Agent(nn.Module):
    def __init__(self, envs):
        super().__init__()
        self.critic = nn.Sequential(
            layer_init(nn.Linear(np.array(envs.single_observation_space.shape).prod(), 64)),
            nn.Tanh(),
            layer_init(nn.Linear(64, 64)),
            nn.Tanh(),
            layer_init(nn.Linear(64, 1), std=1.0),
        )

The code comparison shows that cleanrl uses a more straightforward and readable approach to defining the neural network architecture, while pytorch-a2c-ppo-acktr-gail uses a more modular and flexible design.
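
For context, the layer_init helper referenced in the CleanRL snippet applies orthogonal weight initialization with a configurable gain and a constant bias; a minimal sketch in that spirit (not a verbatim copy of the repository's code):

import numpy as np
import torch
import torch.nn as nn

def layer_init(layer, std=np.sqrt(2), bias_const=0.0):
    # Orthogonal weights (gain `std`) and constant bias, a common choice in PPO implementations
    torch.nn.init.orthogonal_(layer.weight, std)
    torch.nn.init.constant_(layer.bias, bias_const)
    return layer

value_head = layer_init(nn.Linear(64, 1), std=1.0)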

An educational resource to help anyone learn deep reinforcement learning.

Pros of SpinningUp

  • Comprehensive educational resources and tutorials
  • Covers classic on-policy algorithms such as VPG and TRPO that CleanRL does not implement
  • Backed by OpenAI, ensuring high-quality implementations

Cons of SpinningUp

  • Less frequently updated compared to CleanRL
  • More complex codebase, potentially harder for beginners to understand
  • Provides no JAX implementations, while CleanRL offers JAX variants of several algorithms

Code Comparison

SpinningUp (PPO implementation):

def ppo(env_fn, actor_critic=core.mlp_actor_critic, ac_kwargs=dict(), seed=0,
        steps_per_epoch=4000, epochs=50, gamma=0.99, clip_ratio=0.2, pi_lr=3e-4,
        vf_lr=1e-3, train_pi_iters=80, train_v_iters=80, lam=0.97, max_ep_len=1000,
        target_kl=0.01, logger_kwargs=dict(), save_freq=10):

CleanRL (the equivalent configuration is passed as command-line flags to the standalone ppo.py script):

python cleanrl/ppo.py --env-id CartPole-v1 --total-timesteps 500000 --learning-rate 2.5e-4 --seed 1

Both repositories provide implementations of popular RL algorithms, but CleanRL focuses on simplicity and readability, while SpinningUp pairs its implementations with extensive educational material.


README

CleanRL (Clean Implementation of RL Algorithms)


CleanRL is a Deep Reinforcement Learning library that provides high-quality single-file implementation with research-friendly features. The implementation is clean and simple, yet we can scale it to run thousands of experiments using AWS Batch. The highlight features of CleanRL are:

  • 📜 Single-file implementation
    • Every detail about an algorithm variant is put into a single standalone file.
    • For example, our ppo_atari.py only has 340 lines of code but contains all implementation details on how PPO works with Atari games, so it is a great reference implementation to read for folks who do not wish to read an entire modular library.
  • 📊 Benchmarked Implementation (7+ algorithms and 34+ games at https://benchmark.cleanrl.dev)
  • 📈 Tensorboard Logging
  • 🪛 Local Reproducibility via Seeding (see the seeding sketch after this list)
  • 🎮 Videos of Gameplay Capturing
  • 🧫 Experiment Management with Weights and Biases
  • 💸 Cloud Integration with docker and AWS
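
For example, the local reproducibility feature above boils down to seeding every source of randomness at the top of each script; a minimal sketch of that pattern (the helper name and arguments are illustrative, not CleanRL's exact code):

import random
import numpy as np
import torch

def seed_everything(seed: int, torch_deterministic: bool = True):
    # Seed Python, NumPy, and PyTorch so repeated runs produce the same rollouts
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    # Trade a bit of speed for deterministic cuDNN kernels
    torch.backends.cudnn.deterministic = torch_deterministic

seed_everything(1)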

You can read more about CleanRL in our JMLR paper and documentation.

Notable CleanRL-related projects:

  • corl-team/CORL: Offline RL algorithms implemented in CleanRL style
  • pytorch-labs/LeanRL: Fast optimized PyTorch implementation of CleanRL RL algorithms using CUDAGraphs.

ℹ️ Support for Gymnasium: Farama-Foundation/Gymnasium is the next generation of openai/gym that will continue to be maintained and introduce new features. Please see their announcement for further detail. We are migrating to gymnasium and the progress can be tracked in vwxyzjn/cleanrl#277.

⚠️ NOTE: CleanRL is not a modular library and therefore it is not meant to be imported. At the cost of duplicate code, we make all implementation details of a DRL algorithm variant easy to understand, so CleanRL comes with its own pros and cons. You should consider using CleanRL if you want to 1) understand all implementation details of an algorithm's variant or 2) prototype advanced features that other modular DRL libraries do not support (CleanRL has minimal lines of code, so it gives you a great debugging experience, and you don't have to do a lot of subclassing as is sometimes required in modular DRL libraries).

Get started

Prerequisites: a recent version of Python and Poetry (see the repository for the exact supported versions).

To run experiments locally, give the following a try:

git clone https://github.com/vwxyzjn/cleanrl.git && cd cleanrl
poetry install

# alternatively, you could use `poetry shell` and then
# run `python cleanrl/ppo.py` directly
poetry run python cleanrl/ppo.py \
    --seed 1 \
    --env-id CartPole-v0 \
    --total-timesteps 50000

# open another terminal and enter `cd cleanrl/cleanrl`
tensorboard --logdir runs
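
The runs/ directory that TensorBoard reads is populated by each script via torch.utils.tensorboard's SummaryWriter; a minimal sketch of that logging pattern (the run name and scalar tags here are illustrative):

import time
from torch.utils.tensorboard import SummaryWriter

run_name = f"CartPole-v0__ppo__1__{int(time.time())}"
writer = SummaryWriter(f"runs/{run_name}")
for global_step in range(3):
    # CleanRL-style scripts log training charts and losses against the global step
    writer.add_scalar("charts/episodic_return", 100.0 + global_step, global_step)
writer.close()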

To use experiment tracking with wandb, run

wandb login # only required for the first time
poetry run python cleanrl/ppo.py \
    --seed 1 \
    --env-id CartPole-v0 \
    --total-timesteps 50000 \
    --track \
    --wandb-project-name cleanrltest

If you are not using poetry, you can install CleanRL with requirements.txt:

# core dependencies
pip install -r requirements/requirements.txt

# optional dependencies
pip install -r requirements/requirements-atari.txt
pip install -r requirements/requirements-mujoco.txt
pip install -r requirements/requirements-mujoco_py.txt
pip install -r requirements/requirements-procgen.txt
pip install -r requirements/requirements-envpool.txt
pip install -r requirements/requirements-pettingzoo.txt
pip install -r requirements/requirements-jax.txt
pip install -r requirements/requirements-docs.txt
pip install -r requirements/requirements-cloud.txt
pip install -r requirements/requirements-memory_gym.txt

To run training scripts in other games:

poetry shell

# classic control
python cleanrl/dqn.py --env-id CartPole-v1
python cleanrl/ppo.py --env-id CartPole-v1
python cleanrl/c51.py --env-id CartPole-v1

# atari
poetry install -E atari
python cleanrl/dqn_atari.py --env-id BreakoutNoFrameskip-v4
python cleanrl/c51_atari.py --env-id BreakoutNoFrameskip-v4
python cleanrl/ppo_atari.py --env-id BreakoutNoFrameskip-v4
python cleanrl/sac_atari.py --env-id BreakoutNoFrameskip-v4

# NEW: 3-4x speed up, free of side effects, with envpool's Atari environments (Linux only)
poetry install -E envpool
python cleanrl/ppo_atari_envpool.py --env-id BreakoutNoFrameskip-v4
# Learn Pong-v5 in ~5-10 mins
# Side effects such as lower sample efficiency might occur
poetry run python cleanrl/ppo_atari_envpool.py --clip-coef=0.2 --num-envs=16 --num-minibatches=8 --num-steps=128 --update-epochs=3

# procgen
poetry install -E procgen
python cleanrl/ppo_procgen.py --env-id starpilot
python cleanrl/ppg_procgen.py --env-id starpilot

# ppo + lstm
poetry install -E atari
python cleanrl/ppo_atari_lstm.py --env-id BreakoutNoFrameskip-v4

You may also use a prebuilt development environment hosted in Gitpod:

Open in Gitpod

Algorithms Implemented

Each variant listed below has a corresponding page in the documentation.

  • ✅ Proximal Policy Gradient (PPO): ppo.py, ppo_atari.py, ppo_continuous_action.py, ppo_atari_lstm.py, ppo_atari_envpool.py, ppo_atari_envpool_xla_jax.py, ppo_atari_envpool_xla_jax_scan.py, ppo_procgen.py, ppo_atari_multigpu.py, ppo_pettingzoo_ma_atari.py, ppo_continuous_action_isaacgym.py, ppo_trxl.py
  • ✅ Deep Q-Learning (DQN): dqn.py, dqn_atari.py, dqn_jax.py, dqn_atari_jax.py
  • ✅ Categorical DQN (C51): c51.py, c51_atari.py, c51_jax.py, c51_atari_jax.py
  • ✅ Soft Actor-Critic (SAC): sac_continuous_action.py, sac_atari.py
  • ✅ Deep Deterministic Policy Gradient (DDPG): ddpg_continuous_action.py, ddpg_continuous_action_jax.py
  • ✅ Twin Delayed Deep Deterministic Policy Gradient (TD3): td3_continuous_action.py, td3_continuous_action_jax.py
  • ✅ Phasic Policy Gradient (PPG): ppg_procgen.py
  • ✅ Random Network Distillation (RND): ppo_rnd_envpool.py
  • ✅ Qdagger: qdagger_dqn_atari_impalacnn.py, qdagger_dqn_atari_jax_impalacnn.py

Open RL Benchmark

To make our experimental data transparent, CleanRL participates in a related project called Open RL Benchmark, which contains tracked experiments from popular DRL libraries such as ours, Stable-baselines3, openai/baselines, jaxrl, and others.

Check out https://benchmark.cleanrl.dev/ for a collection of Weights and Biases reports showcasing tracked DRL experiments. The reports are interactive, and researchers can easily query information such as GPU utilization and videos of an agent's gameplay that are normally hard to acquire in other RL benchmarks. In the future, Open RL Benchmark will likely provide a dataset API for researchers to easily access the data (see repo).

Support and get involved

We have a Discord Community for support. Feel free to ask questions. Posting in GitHub Issues and PRs is also welcome. Our past video recordings are available on YouTube.

Citing CleanRL

If you use CleanRL in your work, please cite our technical paper:

@article{huang2022cleanrl,
  author  = {Shengyi Huang and Rousslan Fernand Julien Dossa and Chang Ye and Jeff Braga and Dipam Chakraborty and Kinal Mehta and João G.M. Araújo},
  title   = {CleanRL: High-quality Single-file Implementations of Deep Reinforcement Learning Algorithms},
  journal = {Journal of Machine Learning Research},
  year    = {2022},
  volume  = {23},
  number  = {274},
  pages   = {1--18},
  url     = {http://jmlr.org/papers/v23/21-1342.html}
}

Acknowledgement

CleanRL is a community-powered project, and our contributors run experiments on a variety of hardware.

  • We thank many contributors for using their own computers to run experiments
  • We thank Google's TPU research cloud for providing TPU resources.
  • We thank Hugging Face's cluster for providing GPU resources.