rail-berkeley/rlkit

Collection of reinforcement learning algorithms

Top Related Projects

  • OpenAI Baselines: high-quality implementations of reinforcement learning algorithms
  • Stable Baselines: a fork of OpenAI Baselines, implementations of reinforcement learning algorithms
  • TF-Agents: a reliable, scalable and easy to use TensorFlow library for Contextual Bandits and Reinforcement Learning
  • Gymnasium: an API standard for single-agent reinforcement learning environments, with popular reference environments and related utilities (formerly Gym)
  • PyTorch Ignite: a high-level library to help with training and evaluating neural networks in PyTorch flexibly and transparently
  • Dopamine: a research framework for fast prototyping of reinforcement learning algorithms

Quick Overview

RLkit is an open-source reinforcement learning framework developed at UC Berkeley and now maintained under the rail-berkeley organization. It provides a modular and extensible toolkit for implementing and experimenting with reinforcement learning algorithms, with a focus on off-policy methods and deep reinforcement learning.

Pros

  • Modular design allows for easy customization and extension of algorithms
  • Includes implementations of popular RL algorithms like SAC, TD3, and DDPG
  • Integrates well with PyTorch for neural network implementations
  • Provides utilities for logging, visualization, and experiment management (see the logging sketch below)

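A minimal sketch of the logging utilities, assuming rlkit's rllab-derived logger API (the import paths and method names below are assumptions and may vary between versions):

from rlkit.core import logger
from rlkit.launchers.launcher_util import setup_logger

# Set up an experiment directory under LOCAL_LOG_DIR/<exp_prefix>/<foldername>
# and save the hyperparameter variant alongside the logs.
setup_logger('my-experiment', variant=dict(algorithm='SAC', layer_size=256))

# Record scalar metrics for the current epoch and flush them to disk.
logger.record_tabular('Epoch', 0)
logger.record_tabular('AverageReturn', 0.0)
logger.dump_tabular()
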
Cons

  • Documentation could be more comprehensive and up-to-date
  • Limited support for on-policy algorithms compared to off-policy methods
  • May have a steeper learning curve for beginners compared to some other RL libraries
  • Not as actively maintained as some other popular RL frameworks

Code Examples

  1. Creating and training a SAC agent:
import gym

from rlkit.envs.wrappers import NormalizedBoxEnv
from rlkit.torch.networks import FlattenMlp
from rlkit.torch.sac.policies import TanhGaussianPolicy
from rlkit.torch.sac.sac import SACTrainer

env = NormalizedBoxEnv(gym.make('HalfCheetah-v2'))
obs_dim = env.observation_space.low.size
action_dim = env.action_space.low.size

# SAC uses two Q-functions plus target copies (the separate value network was removed in v0.2).
qf1, qf2, target_qf1, target_qf2 = [
    FlattenMlp(
        input_size=obs_dim + action_dim,
        output_size=1,
        hidden_sizes=[400, 300],
    )
    for _ in range(4)
]
policy = TanhGaussianPolicy(
    obs_dim=obs_dim,
    action_dim=action_dim,
    hidden_sizes=[400, 300],
)
trainer = SACTrainer(
    env=env,
    policy=policy,
    qf1=qf1,
    qf2=qf2,
    target_qf1=target_qf1,
    target_qf2=target_qf2,
)
# The trainer only performs gradient updates on sampled batches; the full training
# loop is driven by a batch RL algorithm object (see "Getting Started" below).
  2. Implementing a custom replay buffer:
from rlkit.data_management.simple_replay_buffer import SimpleReplayBuffer

class CustomReplayBuffer(SimpleReplayBuffer):
    def __init__(self, max_replay_buffer_size, env):
        super().__init__(
            max_replay_buffer_size=max_replay_buffer_size,
            observation_dim=env.observation_space.low.size,
            action_dim=env.action_space.low.size,
            env_info_sizes=dict(),  # required by recent rlkit versions; empty means env infos are ignored
        )

    def add_sample(self, observation, action, reward, next_observation,
                   terminal, **kwargs):
        # Custom logic for adding samples (e.g. prioritization) would go here.
        super().add_sample(observation, action, reward, next_observation,
                           terminal, **kwargs)
  3. Creating a custom environment wrapper:
from rlkit.envs.wrappers import ProxyEnv
import numpy as np

class CustomEnvWrapper(ProxyEnv):
    def __init__(self, env):
        super().__init__(env)
    
    def step(self, action):
        observation, reward, done, info = self._wrapped_env.step(action)
        # Custom reward shaping
        reward = np.clip(reward, -1, 1)
        return observation, reward, done, info
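
As a usage sketch (assuming the CustomEnvWrapper defined above and a MuJoCo-backed HalfCheetah-v2 environment), the wrapper composes with rlkit's built-in wrappers:

import gym
from rlkit.envs.wrappers import NormalizedBoxEnv

# Wrap the normalized environment with the custom reward-clipping wrapper above.
env = CustomEnvWrapper(NormalizedBoxEnv(gym.make('HalfCheetah-v2')))
observation = env.reset()
observation, reward, done, info = env.step(env.action_space.sample())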

Getting Started

To get started with RLkit, follow these steps:

  1. Install RLkit and its dependencies:
git clone https://github.com/rail-berkeley/rlkit.git
cd rlkit
pip install -e .
  2. Run an example experiment:
import gym

from rlkit.envs.wrappers import NormalizedBoxEnv
from rlkit.launchers.launcher_util import setup_logger
from rlkit.torch.networks import FlattenMlp
from rlkit.torch.sac.sac import SACTrainer

def experiment(variant):
    env = NormalizedBoxEnv(gym.make('HalfCheetah-v2'))
    obs_dim = env.observation_space.low.size
    action_dim = env.action_space.low.size
    qf = FlattenMlp(
        input_size=obs_dim + action_dim,
        output_size=1,
        hidden_sizes=[400, 300],
    )
    # ... build the remaining networks, the SACTrainer, and the RL algorithm
    # as in the code examples above, then call algorithm.train() ...

if __name__ == "__main__":
    setup_logger('rlkit-getting-started', variant=dict())
    experiment(dict())

Competitor Comparisons

OpenAI Baselines: high-quality implementations of reinforcement learning algorithms

Pros of baselines

  • More comprehensive library with a wider range of RL algorithms
  • Better documentation and examples for getting started
  • Larger community and more frequent updates

Cons of baselines

  • Can be more complex to use and customize
  • Less focus on modular design, making it harder to extend

Code Comparison

rlkit example:

from rlkit.torch.sac.sac import SACTrainer
from rlkit.torch.networks import FlattenMlp
from rlkit.launchers.launcher_util import setup_logger

variant = dict(
    algorithm="SAC",
    version="normal",
    layer_size=256,
    replay_buffer_size=int(1E6),
    algorithm_kwargs=dict(
        num_epochs=3000,
        num_eval_steps_per_epoch=5000,
        num_trains_per_train_loop=1000,
        num_expl_steps_per_train_loop=1000,
        min_num_steps_before_training=1000,
        max_path_length=1000,
        batch_size=256,
    ),
    trainer_kwargs=dict(
        discount=0.99,
        soft_target_tau=5e-3,
        target_update_period=1,
        policy_lr=3E-4,
        qf_lr=3E-4,
        reward_scale=1,
        use_automatic_entropy_tuning=True,
    ),
)

baselines example:

import gym
from baselines import deepq

def callback(lcl, _glb):
    # stop training once the mean reward over the last 100 episodes exceeds 199
    is_solved = lcl['t'] > 100 and sum(lcl['episode_rewards'][-101:-1]) / 100 >= 199
    return is_solved

env = gym.make('CartPole-v0')
model = deepq.learn(
    env,
    network='mlp',
    lr=1e-3,
    total_timesteps=100000,
    buffer_size=50000,
    exploration_fraction=0.1,
    exploration_final_eps=0.02,
    print_freq=10,
    callback=callback,
)

A fork of OpenAI Baselines, implementations of reinforcement learning algorithms

Pros of stable-baselines

  • More comprehensive documentation and tutorials
  • Wider range of implemented algorithms (e.g., PPO, A2C, DDPG)
  • Active maintenance and regular updates

Cons of stable-baselines

  • Less flexibility for customization and experimentation
  • Heavier dependencies, potentially slower execution

Code Comparison

rlkit example:

from rlkit.torch.networks import FlattenMlp
from rlkit.torch.sac.sac import SACTrainer

qf1 = FlattenMlp(
    input_size=obs_dim + action_dim,
    output_size=1,
    hidden_sizes=[400, 300],
)
# ... qf2, target_qf1, target_qf2, and the policy are built the same way ...
trainer = SACTrainer(
    env=env,
    policy=policy,
    qf1=qf1,
    qf2=qf2,
    target_qf1=target_qf1,
    target_qf2=target_qf2,
)

stable-baselines example (shown here with its PyTorch successor, Stable Baselines3):

from stable_baselines3 import SAC

model = SAC("MlpPolicy", "Pendulum-v0", verbose=1)
model.learn(total_timesteps=10000)

Both libraries provide implementations of reinforcement learning algorithms, but stable-baselines offers a more user-friendly API with fewer lines of code required to get started. rlkit provides more granular control over network architectures and training parameters, making it suitable for researchers who need to customize their experiments.

TF-Agents: A reliable, scalable and easy to use TensorFlow library for Contextual Bandits and Reinforcement Learning.

Pros of agents

  • Built on TensorFlow, offering seamless integration with the broader TensorFlow ecosystem
  • More comprehensive documentation and tutorials
  • Wider range of implemented algorithms and environments

Cons of agents

  • Steeper learning curve for beginners due to TensorFlow complexity
  • Less flexible for custom environments and algorithms compared to rlkit

Code Comparison

agents:

import tensorflow as tf
from tf_agents.agents.dqn import dqn_agent
from tf_agents.environments import tf_py_environment
from tf_agents.networks import q_network

q_net = q_network.QNetwork(
    train_env.observation_spec(),
    train_env.action_spec(),
    fc_layer_params=(100,))

agent = dqn_agent.DqnAgent(
    train_env.time_step_spec(),
    train_env.action_spec(),
    q_network=q_net,
    optimizer=tf.compat.v1.train.AdamOptimizer(learning_rate=1e-3))

rlkit:

import copy

from rlkit.torch.dqn.dqn import DQNTrainer
from rlkit.torch.networks import Mlp

qf = Mlp(
    hidden_sizes=[32, 32],
    input_size=observation_dim,
    output_size=action_dim,
)
target_qf = copy.deepcopy(qf)  # target network starts as a copy of the Q-network
trainer = DQNTrainer(
    qf=qf,
    target_qf=target_qf,
)

An API standard for single-agent reinforcement learning environments, with popular reference environments and related utilities (formerly Gym)

Pros of Gymnasium

  • More active development and maintenance
  • Wider range of pre-built environments
  • Better documentation and community support

Cons of Gymnasium

  • Steeper learning curve for beginners
  • Less focus on specific RL algorithms

Code Comparison

Gymnasium:

import gymnasium as gym
env = gym.make("CartPole-v1")
observation, info = env.reset(seed=42)
for _ in range(1000):
    action = env.action_space.sample()
    observation, reward, terminated, truncated, info = env.step(action)

RLkit (classic gym API):

import gym  # rlkit builds on the classic gym API

env = gym.make("CartPole-v1")
observation = env.reset()
for _ in range(1000):
    action = env.action_space.sample()
    next_observation, reward, done, info = env.step(action)

The main differences are in the reset and step return values: Gymnasium's reset also returns an info dict, and its step separates terminated from truncated, providing more detailed information about how an episode ends.

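For illustration, a minimal adapter sketch (an assumption, not part of either library) that exposes a Gymnasium environment through the classic 4-tuple interface shown above:

import gymnasium as gym

class ClassicGymAdapter:
    """Wrap a Gymnasium env so reset/step match the classic gym API."""

    def __init__(self, env):
        self.env = env
        self.observation_space = env.observation_space
        self.action_space = env.action_space

    def reset(self):
        observation, _info = self.env.reset()
        return observation

    def step(self, action):
        observation, reward, terminated, truncated, info = self.env.step(action)
        done = terminated or truncated  # collapse the two termination flags
        return observation, reward, done, info

env = ClassicGymAdapter(gym.make("CartPole-v1"))
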

High-level library to help with training and evaluating neural networks in PyTorch flexibly and transparently.

Pros of Ignite

  • More general-purpose framework for PyTorch, suitable for various deep learning tasks
  • Larger community and more frequent updates
  • Extensive documentation and examples for different use cases

Cons of Ignite

  • Less specialized for reinforcement learning tasks
  • Steeper learning curve for RL-specific implementations
  • May require more custom code for RL experiments

Code Comparison

rlkit example:

from rlkit.torch.sac.sac import SACTrainer

trainer = SACTrainer(
    env=env,
    policy=policy,
    qf1=qf1,
    qf2=qf2,
    target_qf1=target_qf1,
    target_qf2=target_qf2,
)
# a batch RL algorithm object wraps the trainer and runs algorithm.train()

Ignite example:

from ignite.engine import create_supervised_trainer, create_supervised_evaluator
from ignite.metrics import Accuracy

trainer = create_supervised_trainer(model, optimizer, loss_fn)
evaluator = create_supervised_evaluator(model, metrics={'accuracy': Accuracy()})

trainer.run(train_loader, max_epochs=10)

Summary

rlkit is specifically designed for reinforcement learning tasks, offering pre-built algorithms and utilities. Ignite, on the other hand, is a more versatile framework for general PyTorch training, requiring additional implementation for RL-specific tasks but providing greater flexibility for various deep learning applications. The choice between the two depends on the specific requirements of the project and the user's familiarity with reinforcement learning concepts.

Dopamine is a research framework for fast prototyping of reinforcement learning algorithms.

Pros of Dopamine

  • More comprehensive documentation and tutorials
  • Better integration with TensorFlow and support for distributed training
  • Wider range of implemented algorithms, including DQN, Rainbow, and C51

Cons of Dopamine

  • Less flexible architecture, primarily focused on Atari environments
  • Steeper learning curve for customization and extending the framework
  • Limited support for continuous action spaces

Code Comparison

Dopamine (illustrative DQN network setup):

def _build_networks(self):
  self.online_convnet = networks.AtariDQNNetwork(
      self.num_actions, name='Online')
  self.target_convnet = networks.AtariDQNNetwork(
      self.num_actions, name='Target')
  self._network_template = self.online_convnet

RLkit (illustrative Q-network construction):

def _get_q_network(self):
    return FlattenMlp(
        input_size=self.obs_dim,
        output_size=self.action_dim,
        hidden_sizes=self.hidden_sizes,
    )

The code snippets illustrate how the two frameworks define networks: Dopamine builds a specialized Atari DQN convolutional network, while RLkit uses a more generic FlattenMlp, reflecting the frameworks' different focuses and levels of flexibility.

README

RLkit

Reinforcement learning framework and algorithms implemented in PyTorch.

Implemented algorithms:

  • Skew-Fit
  • Reinforcement Learning with Imagined Goals (RIG)
  • Hindsight Experience Replay (HER)
  • DQN and Double DQN
  • Soft Actor-Critic (SAC)
  • Twin Delayed DDPG (TD3)
  • DDPG
  • Temporal Difference Models (TDMs) (legacy, v0.1.2)

To get started, check out the example scripts in the examples/ directory.

What's New

Version 0.2

04/25/2019

  • Use new multiworld code that requires explicit environment registration.
  • Make installation easier by adding setup.py and using default conf.py.

04/16/2019

  • Log how many train steps were called
  • Log env_info and agent_info.

04/05/2019-04/15/2019

  • Add rendering
  • Fix SAC bug to account for future entropy (#41, #43)
  • Add online algorithm mode (#42)

04/05/2019

The initial release for 0.2 has the following major changes:

  • Remove Serializable class and use default pickle scheme.
  • Remove PyTorchModule class and use native torch.nn.Module directly.
  • Switch to batch-style training rather than online training.
    • Makes code more amenable to parallelization.
    • Implementing the online-version is straightforward.
  • Refactor training code to be its own object, rather than being integrated inside of RLAlgorithm.
  • Refactor sampling code to be its own object, rather than being integrated inside of RLAlgorithm.
  • Implement Skew-Fit: State-Covering Self-Supervised Reinforcement Learning, a method for performing goal-directed exploration to maximize the entropy of visited states.
  • Update soft actor-critic to more closely match TensorFlow implementation:
    • Rename TwinSAC to just SAC.
    • Only have Q networks.
    • Remove unnecessary policy regularization terms.
    • Use numerically stable Jacobian computation.

Overall, the refactors are intended to make the code more modular and readable than the previous versions. A sketch of how the resulting pieces fit together is shown below.
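
As a rough sketch of the resulting structure (based on the example scripts; class names such as MdpPathCollector, EnvReplayBuffer, and TorchBatchRLAlgorithm are assumed from v0.2-era code and keyword arguments may differ between versions), a trainer, two path collectors, and a replay buffer are composed into a batch RL algorithm:

from rlkit.data_management.env_replay_buffer import EnvReplayBuffer
from rlkit.samplers.data_collector import MdpPathCollector
from rlkit.torch.sac.policies import MakeDeterministic
from rlkit.torch.torch_rl_algorithm import TorchBatchRLAlgorithm

# `trainer`, `policy`, `expl_env`, and `eval_env` are assumed to be built as in
# the SAC code example earlier in this document.
expl_path_collector = MdpPathCollector(expl_env, policy)
eval_path_collector = MdpPathCollector(eval_env, MakeDeterministic(policy))
replay_buffer = EnvReplayBuffer(int(1e6), expl_env)

algorithm = TorchBatchRLAlgorithm(
    trainer=trainer,
    exploration_env=expl_env,
    evaluation_env=eval_env,
    exploration_data_collector=expl_path_collector,
    evaluation_data_collector=eval_path_collector,
    replay_buffer=replay_buffer,
    batch_size=256,
    max_path_length=1000,
    num_epochs=100,
    num_eval_steps_per_epoch=5000,
    num_expl_steps_per_train_loop=1000,
    num_trains_per_train_loop=1000,
    min_num_steps_before_training=1000,
)
algorithm.train()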

Version 0.1

12/04/2018

  • Add RIG implementation

12/03/2018

  • Add HER implementation
  • Add doodad support

10/16/2018

  • Upgraded to PyTorch v0.4
  • Added Twin Soft Actor Critic Implementation
  • Various small refactors (e.g., logger and evaluation code)

Installation

  1. Install and use the included Anaconda environment:
$ conda env create -f environment/[linux-cpu|linux-gpu|mac]-env.yml
$ source activate rlkit
(rlkit) $ python examples/ddpg.py

Choose the appropriate .yml file for your system. These Anaconda environments use MuJoCo 1.5 and gym 0.10.5. You'll need to get your own MuJoCo key if you want to use MuJoCo.

  2. Add this repo directory to your PYTHONPATH environment variable or simply run:
pip install -e .
  3. (Optional) Copy conf.py to conf_private.py and edit to override defaults:
cp rlkit/launchers/conf.py rlkit/launchers/conf_private.py
  4. (Optional) If you plan on running the Skew-Fit experiments or the HER example with the Sawyer environment, then you need to install multiworld.

DISCLAIMER: the mac environment has only been tested without a GPU.

For an even more portable solution, try using the Docker image provided in environment/docker. The Anaconda environment should be enough, but this Docker image addresses some of the rendering issues that may arise when using MuJoCo 1.5 with GPUs. The image supports GPU use but also works without one; to use a GPU with the image, you need nvidia-docker installed.

Using a GPU

You can use a GPU by calling

import rlkit.torch.pytorch_util as ptu
ptu.set_gpu_mode(True)

before launching the scripts.
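
In practice this follows the pattern used in the bundled example scripts (a sketch; `algorithm` is assumed to be an rlkit RL algorithm object such as the one built in the examples above):

import rlkit.torch.pytorch_util as ptu

ptu.set_gpu_mode(True)      # select the GPU before building networks
algorithm.to(ptu.device)    # move all networks onto the chosen device
algorithm.train()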

If you are using doodad (see below), simply use the use_gpu flag:

run_experiment(..., use_gpu=True)

Visualizing a policy and seeing results

During training, results will be saved under

LOCAL_LOG_DIR/<exp_prefix>/<foldername>
  • LOCAL_LOG_DIR is the directory set by rlkit.launchers.config.LOCAL_LOG_DIR. Default name is 'output'.
  • <exp_prefix> is the prefix given to setup_logger or run_experiment.
  • <foldername> is auto-generated and based off of exp_prefix.
  • inside this folder, you should see a file called params.pkl. To visualize a policy, run
(rlkit) $ python scripts/run_policy.py LOCAL_LOG_DIR/<exp_prefix>/<foldername>/params.pkl

or

(rlkit) $ python scripts/run_goal_conditioned_policy.py LOCAL_LOG_DIR/<exp_prefix>/<foldername>/params.pkl

depending on whether or not the policy is goal-conditioned.
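
Under the hood, these scripts load the saved snapshot with torch and roll out the stored policy. A minimal sketch (the snapshot keys below are assumptions based on the default logging conventions and may differ between versions):

import torch

data = torch.load('LOCAL_LOG_DIR/<exp_prefix>/<foldername>/params.pkl')
policy = data['evaluation/policy']  # assumed snapshot key
env = data['evaluation/env']        # assumed snapshot key

observation = env.reset()
for _ in range(1000):
    action, _ = policy.get_action(observation)
    observation, reward, done, info = env.step(action)
    env.render()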

If you have rllab installed, you can also visualize the results using rllab's viskit, described at the bottom of this page.

tl;dr run

python rllab/viskit/frontend.py LOCAL_LOG_DIR/<exp_prefix>/

to visualize all experiments with a prefix of exp_prefix. To only visualize a single run, you can do

python rllab/viskit/frontend.py LOCAL_LOG_DIR/<exp_prefix>/<folder name>

Alternatively, if you don't want to clone all of rllab, a repository containing only viskit can be found here. You can similarly visualize results with:

python viskit/viskit/frontend.py LOCAL_LOG_DIR/<exp_prefix>/

This viskit repo also has a few extra nice features, like plotting multiple Y-axis values at once, figure-splitting on multiple keys, and being able to filter hyperparameters out.

Visualizing a goal-conditioned policy

To visualize a goal-conditioned policy, run

(rlkit) $ python scripts/run_goal_conditioned_policy.py LOCAL_LOG_DIR/<exp_prefix>/<foldername>/params.pkl

Launching jobs with doodad

The run_experiment function makes it easy to run Python code on Amazon Web Services (AWS) or Google Cloud Platform (GCP) by using this fork of doodad.

It's as easy as:

from rlkit.launchers.launcher_util import run_experiment

def function_to_run(variant):
    learning_rate = variant['learning_rate']
    ...

run_experiment(
    function_to_run,
    exp_prefix="my-experiment-name",
    mode='ec2',  # or 'gcp'
    variant={'learning_rate': 1e-3},
)

You will need to set up parameters in the launcher configuration file (see step 3 of Installation). This requires some knowledge of AWS and/or GCP, which is beyond the scope of this README. To learn more about doodad, go to its repository, which is based on this original repository.

Requests for pull-requests

  • Implement policy-gradient algorithms.
  • Implement model-based algorithms.

Legacy Code (v0.1.2)

For Temporal Difference Models (TDMs) and the original implementation of Reinforcement Learning with Imagined Goals (RIG), run git checkout tags/v0.1.2.

References

The algorithms are based on the following papers:

Offline Meta-Reinforcement Learning with Online Self-Supervision. Vitchyr H. Pong, Ashvin Nair, Laura Smith, Catherine Huang, Sergey Levine. arXiv preprint, 2021.

Skew-Fit: State-Covering Self-Supervised Reinforcement Learning. Vitchyr H. Pong*, Murtaza Dalal*, Steven Lin*, Ashvin Nair, Shikhar Bahl, Sergey Levine. ICML, 2020.

Visual Reinforcement Learning with Imagined Goals. Ashvin Nair*, Vitchyr Pong*, Murtaza Dalal, Shikhar Bahl, Steven Lin, Sergey Levine. NeurIPS 2018.

Temporal Difference Models: Model-Free Deep RL for Model-Based Control. Vitchyr Pong*, Shixiang Gu*, Murtaza Dalal, Sergey Levine. ICLR 2018.

Hindsight Experience Replay. Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, Pieter Abbeel, Wojciech Zaremba. NeurIPS 2017.

Deep Reinforcement Learning with Double Q-learning. Hado van Hasselt, Arthur Guez, David Silver. AAAI 2016.

Human-level control through deep reinforcement learning. Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, Demis Hassabis. Nature 2015.

Soft Actor-Critic Algorithms and Applications. Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel, Sergey Levine. arXiv preprint, 2018.

Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. ICML, 2018.

Addressing Function Approximation Error in Actor-Critic Methods. Scott Fujimoto, Herke van Hoof, David Meger. ICML, 2018.

Credits

This repository was initially developed primarily by Vitchyr Pong until July 2021, at which point it was transferred to the RAIL Berkeley organization; it is now primarily maintained by Ashvin Nair, with contributions from many other collaborators.

A lot of the coding infrastructure is based on rllab. The serialization and logger code are basically a carbon copy of the rllab versions.

The Dockerfile is based on the OpenAI mujoco-py Dockerfile.

The SMAC code builds off of the PEARL code, which built off of an older RLKit version.