
openai/baselines

OpenAI Baselines: high-quality implementations of reinforcement learning algorithms


Top Related Projects

  • Acme (3,468 stars): A library of reinforcement learning components and agents
  • stable-baselines: A fork of OpenAI Baselines, implementations of reinforcement learning algorithms
  • TF-Agents (2,774 stars): A reliable, scalable and easy to use TensorFlow library for Contextual Bandits and Reinforcement Learning.
  • PFRL (1,178 stars): a PyTorch-based deep reinforcement learning library
  • garage (1,854 stars): A toolkit for reproducible reinforcement learning research.

Quick Overview

OpenAI Baselines is a set of high-quality implementations of reinforcement learning algorithms. It provides a collection of well-tested, modular, and extensible code for experimenting with and researching reinforcement learning. The project aims to serve as a starting point for researchers and practitioners in the field.

Pros

  • Offers implementations of popular RL algorithms like DQN, PPO, and A2C
  • Well-documented and maintained by the OpenAI team
  • Includes tools for reproducibility and benchmarking
  • Integrates seamlessly with OpenAI Gym environments

Cons

  • Some algorithms may not be up-to-date with the latest research
  • Can be complex for beginners to understand and modify
  • Limited support for custom environments outside of OpenAI Gym
  • Requires specific versions of dependencies, which may conflict with other projects

Code Examples

  1. Training a PPO agent on CartPole:
from baselines.ppo2 import ppo2
from baselines.common.vec_env import DummyVecEnv
import gym

# Vectorize a single CartPole environment and train PPO2 with an MLP policy
env = DummyVecEnv([lambda: gym.make('CartPole-v1')])
model = ppo2.learn(network='mlp', env=env, total_timesteps=10000)
  2. Evaluating a trained DQN agent:
from baselines import deepq
import gym

env = gym.make('CartPole-v0')

def callback(lcl, _glb):
    # Returning True stops training; here we stop after 100 timesteps (toy example)
    return lcl['t'] > 100

# deepq.learn returns an act function mapping observations to actions
act = deepq.learn(
    env=env,
    network='mlp',
    total_timesteps=1000,
    callback=callback
)

obs = env.reset()
for _ in range(1000):
    action = act(obs[None])[0]
    obs, reward, done, info = env.step(action)
    env.render()
    if done:
        obs = env.reset()
  3. Creating a custom policy network:
import numpy as np
import tensorflow as tf
from baselines.a2c.utils import conv, fc, conv_to_fc
from baselines.common.models import register

# Register a custom convolutional network under the name 'custom_cnn'
@register("custom_cnn")
def custom_cnn(**kwargs):
    def network_fn(unscaled_images):
        scaled_images = tf.cast(unscaled_images, tf.float32) / 255.
        activ = tf.nn.relu
        h = activ(conv(scaled_images, 'c1', nf=32, rf=8, stride=4, init_scale=np.sqrt(2)))
        h2 = activ(conv(h, 'c2', nf=64, rf=4, stride=2, init_scale=np.sqrt(2)))
        h3 = activ(conv(h2, 'c3', nf=64, rf=3, stride=1, init_scale=np.sqrt(2)))
        h3 = conv_to_fc(h3)
        return activ(fc(h3, 'fc1', nh=512, init_scale=np.sqrt(2)))
    return network_fn

# The registered network can then be selected by name, e.g. network='custom_cnn'

Getting Started

To get started with OpenAI Baselines:

  1. Install the package:
pip install git+https://github.com/openai/baselines.git
  2. Import and use the desired algorithm:
import gym
from baselines.common.vec_env import DummyVecEnv
from baselines.ppo2 import ppo2

# Train PPO2 on four CartPole environments running in a single process
env = DummyVecEnv([lambda: gym.make('CartPole-v1') for _ in range(4)])
model = ppo2.learn(network='mlp', env=env, total_timesteps=25000)

# Run the trained policy; model.step returns actions, values, states, and neglogp
obs = env.reset()
for _ in range(1000):
    actions, values, states, neglogp = model.step(obs)
    obs, rewards, dones, infos = env.step(actions)
    env.render()

This example creates four vectorized CartPole environments, trains a PPO2 agent on them, and then runs the trained agent for 1000 steps.

Competitor Comparisons

Acme (3,468 stars): A library of reinforcement learning components and agents

Pros of Acme

  • More comprehensive and modular framework for RL research
  • Better support for distributed training and multi-agent systems
  • Actively maintained with regular updates and new features

Cons of Acme

  • Steeper learning curve due to its more complex architecture
  • Less focus on classic RL algorithms, more geared towards advanced research
  • Requires more computational resources for some experiments

Code Comparison

Baselines (PPO implementation):

def learn(network, env, total_timesteps, **network_kwargs):
    policy = build_policy(env, network, **network_kwargs)
    nenvs = env.num_envs
    ob_space = env.observation_space
    ac_space = env.action_space
    # ... (implementation continues)

Acme (PPO implementation):

class PPO(acme.Actor):
    def __init__(self, network: snt.Module, optimizer: optax.GradientTransformation):
        self._network = network
        self._optimizer = optimizer
        # ... (implementation continues)

Both repositories provide implementations of popular RL algorithms, but Acme offers a more structured and modular approach. Baselines focuses on simplicity and ease of use for classic algorithms, while Acme provides a flexible framework for advanced RL research and experimentation.
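
To make the modular approach concrete, here is a hedged sketch of how Acme typically wires an environment, agent, and training loop together, in the style of its quickstart; the exact constructor arguments and the TF DQN agent used here are assumptions, not a definitive recipe:

import gym
import sonnet as snt
import acme
from acme import specs, wrappers
from acme.agents.tf import dqn

# Wrap a Gym environment so it exposes the dm_env interface Acme expects
env = wrappers.SinglePrecisionWrapper(wrappers.GymWrapper(gym.make('CartPole-v1')))
spec = specs.make_environment_spec(env)

# A simple Q-network built with Sonnet
network = snt.Sequential([
    snt.Flatten(),
    snt.nets.MLP([64, 64, spec.actions.num_values]),
])

agent = dqn.DQN(environment_spec=spec, network=network)

# The environment loop is the reusable "glue" component
loop = acme.EnvironmentLoop(env, agent)
loop.run(num_episodes=100)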

stable-baselines: A fork of OpenAI Baselines, implementations of reinforcement learning algorithms

Pros of stable-baselines

  • Better documentation and user-friendliness
  • More consistent API across algorithms
  • Actively maintained with regular updates

Cons of stable-baselines

  • Slightly slower performance in some cases
  • Less flexibility for advanced users
  • Fewer experimental algorithms

Code Comparison

baselines:

from baselines import deepq
model = deepq.learn(env, network='mlp', lr=1e-3, total_timesteps=100000)

stable-baselines:

from stable_baselines import DQN
model = DQN('MlpPolicy', env, learning_rate=1e-3, verbose=1)
model.learn(total_timesteps=100000)

Both repositories provide implementations of reinforcement learning algorithms, but stable-baselines offers a more user-friendly experience with improved documentation and a consistent API. It's actively maintained, making it a good choice for beginners and those who prioritize ease of use. However, baselines may be preferred by advanced users who need more flexibility or are working on cutting-edge research. The code comparison shows that stable-baselines has a more intuitive API, while baselines requires slightly more setup. Overall, the choice between the two depends on the user's specific needs and level of expertise in reinforcement learning.
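
As an illustration of that consistent API, the following hedged sketch (based on the stable-baselines 2.x interface) trains, saves, reloads, and runs a PPO2 agent; the same learn/save/load/predict pattern applies to the other algorithms:

import gym
from stable_baselines import PPO2  # the same pattern works for DQN, A2C, SAC, ...

env = gym.make('CartPole-v1')
model = PPO2('MlpPolicy', env, verbose=1)
model.learn(total_timesteps=100000)
model.save('ppo2_cartpole')

# Reload and run the trained policy with the same predict() interface
model = PPO2.load('ppo2_cartpole')
obs = env.reset()
for _ in range(1000):
    action, _states = model.predict(obs)
    obs, reward, done, info = env.step(action)
    if done:
        obs = env.reset()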

TF-Agents (2,774 stars): A reliable, scalable and easy to use TensorFlow library for Contextual Bandits and Reinforcement Learning.

Pros of Agents

  • More comprehensive and actively maintained library
  • Better integration with TensorFlow ecosystem
  • Supports both TF1 and TF2, offering flexibility

Cons of Agents

  • Steeper learning curve due to more complex architecture
  • Potentially slower execution compared to Baselines' optimized implementations
  • Less focus on classic RL algorithms, more on advanced techniques

Code Comparison

Baselines (DQN implementation):

def learn(env, network, seed=None, lr=1e-3, total_timesteps=100000, buffer_size=50000):
    q_func = build_q_func(network)
    act, train, update_target, debug = deepq.build_train(
        make_obs_ph=lambda name: U.BatchInput(env.observation_space.shape, name=name),
        q_func=q_func,
        num_actions=env.action_space.n,
        optimizer=tf.train.AdamOptimizer(learning_rate=lr),
    )

Agents (DQN implementation):

agent = dqn_agent.DqnAgent(
    time_step_spec,
    action_spec,
    q_network=q_net,
    optimizer=tf.compat.v1.train.AdamOptimizer(learning_rate=learning_rate),
    td_errors_loss_fn=common.element_wise_squared_loss,
    train_step_counter=tf.Variable(0)
)
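
For context, the time_step_spec, action_spec, and q_net in the Agents snippet come from a TF-Agents environment and network; a hedged sketch of that setup, following the style of the TF-Agents DQN tutorial, looks roughly like this:

import tensorflow as tf
from tf_agents.agents.dqn import dqn_agent
from tf_agents.environments import suite_gym, tf_py_environment
from tf_agents.networks import q_network
from tf_agents.utils import common

# Load a Gym environment and wrap it as a TensorFlow environment
env = tf_py_environment.TFPyEnvironment(suite_gym.load('CartPole-v0'))

# Q-network consuming the environment's observation and action specs
q_net = q_network.QNetwork(env.observation_spec(), env.action_spec())

agent = dqn_agent.DqnAgent(
    env.time_step_spec(),
    env.action_spec(),
    q_network=q_net,
    optimizer=tf.compat.v1.train.AdamOptimizer(learning_rate=1e-3),
    td_errors_loss_fn=common.element_wise_squared_loss,
    train_step_counter=tf.Variable(0),
)
agent.initialize()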

PFRL (1,178 stars): a PyTorch-based deep reinforcement learning library

Pros of PFRL

  • More comprehensive and up-to-date implementation of reinforcement learning algorithms
  • Better documentation and examples for easier usage and understanding
  • Built on PyTorch, with a clean, modular agent API offering more flexibility

Cons of PFRL

  • Smaller community and fewer contributors compared to Baselines
  • Less integration with OpenAI Gym environments out-of-the-box
  • Steeper learning curve for beginners due to more advanced features

Code Comparison

PFRL example:

import pfrl

# `model` is a torch.nn.Module producing both policy and value outputs
agent = pfrl.agents.PPO(
    model, optimizer,
    gamma=0.99, update_interval=2048, minibatch_size=64
)

Baselines example:

from baselines import ppo2
model = ppo2.learn(
    env=env, network='mlp', total_timesteps=1e6,
    lr=3e-4, nminibatches=4, noptepochs=4
)

Both repositories provide implementations of popular reinforcement learning algorithms, but PFRL offers a more modular and flexible approach, while Baselines focuses on simplicity and integration with OpenAI Gym. PFRL's code structure allows for easier customization of agents and algorithms, whereas Baselines provides a more straightforward API for quick experimentation.
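
To make PFRL's agent interface concrete, here is a hedged sketch of its manual interaction loop, mirroring the pattern in the PFRL README; the agent (for example, the PPO agent above) and a Gym env are assumed to already exist:

# Assumes `agent` (e.g. the PPO agent above) and a Gym `env` are already constructed
obs = env.reset()
for step in range(10000):
    action = agent.act(obs)
    obs, reward, done, info = env.step(action)
    # Feeding transitions back lets the agent decide when to update itself
    agent.observe(obs, reward, done, reset=False)
    if done:
        obs = env.reset()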

garage (1,854 stars): A toolkit for reproducible reinforcement learning research.

Pros of garage

  • More modular and extensible architecture, making it easier to implement new algorithms
  • Better documentation and tutorials for getting started
  • Supports a wider range of reinforcement learning algorithms

Cons of garage

  • Less actively maintained compared to baselines
  • Smaller community and fewer contributors
  • May have a steeper learning curve for beginners

Code Comparison

garage example:

import gym

from garage import wrap_experiment
from garage.envs import normalize
from garage.experiment import LocalTFRunner
from garage.tf.algos import PPO
from garage.tf.baselines import GaussianMLPBaseline
from garage.tf.envs import TfEnv
from garage.tf.policies import GaussianMLPPolicy

@wrap_experiment
def ppo_garage(ctxt=None, seed=1):
    # The experiment context provides snapshotting; the runner drives training
    with LocalTFRunner(snapshot_config=ctxt) as trainer:
        env = TfEnv(normalize(gym.make('HalfCheetah-v2')))
        policy = GaussianMLPPolicy(env.spec)
        baseline = GaussianMLPBaseline(env.spec)
        algo = PPO(env_spec=env.spec, policy=policy, baseline=baseline)
        trainer.setup(algo, env)
        trainer.train(n_epochs=100, batch_size=4000)

baselines example:

import gym

from baselines.common.vec_env import DummyVecEnv
from baselines.ppo2 import ppo2

env = DummyVecEnv([lambda: gym.make('HalfCheetah-v2')])
model = ppo2.learn(network='mlp', env=env, nsteps=2048, nminibatches=32,
                   lam=0.95, gamma=0.99, noptepochs=10, log_interval=1,
                   ent_coef=0.0, lr=3e-4, cliprange=0.2,
                   total_timesteps=int(1e6))


README

Status: Maintenance (expect bug fixes and minor updates)


Baselines

OpenAI Baselines is a set of high-quality implementations of reinforcement learning algorithms.

These algorithms will make it easier for the research community to replicate, refine, and identify new ideas, and will create good baselines to build research on top of. Our DQN implementation and its variants are roughly on par with the scores in published papers. We expect they will be used as a base around which new ideas can be added, and as a tool for comparing a new approach against existing ones.

Prerequisites

Baselines requires Python 3 (>=3.5) with the development headers. You'll also need the system packages CMake, OpenMPI and zlib. These can be installed as follows:

Ubuntu

sudo apt-get update && sudo apt-get install cmake libopenmpi-dev python3-dev zlib1g-dev

Mac OS X

Installation of system packages on Mac requires Homebrew. With Homebrew installed, run the following:

brew install cmake openmpi

Virtual environment

From the general python package sanity perspective, it is a good idea to use virtual environments (virtualenvs) to make sure packages from different projects do not interfere with each other. You can install virtualenv (which is itself a pip package) via

pip install virtualenv

Virtualenvs are essentially folders that have copies of python executable and all python packages. To create a virtualenv called venv with python3, one runs

virtualenv /path/to/venv --python=python3

To activate a virtualenv:

. /path/to/venv/bin/activate

A more thorough tutorial on virtualenvs and their options can be found here

Tensorflow versions

The master branch supports TensorFlow versions 1.4 through 1.14. For TensorFlow 2.0 support, please use the tf2 branch.

Installation

  • Clone the repo and cd into it:

    git clone https://github.com/openai/baselines.git
    cd baselines
    
  • If you don't have TensorFlow installed already, install your favourite flavor of TensorFlow. In most cases, you may use

    pip install tensorflow-gpu==1.14 # if you have a CUDA-compatible gpu and proper drivers
    

    or

    pip install tensorflow==1.14
    

    to install Tensorflow 1.14, which is the latest version of Tensorflow supported by the master branch. Refer to TensorFlow installation guide for more details.

  • Install baselines package

    pip install -e .
    

MuJoCo

Some of the baselines examples use the MuJoCo (multi-joint dynamics in contact) physics simulator, which is proprietary and requires binaries and a license (a temporary 30-day license can be obtained from www.mujoco.org). Instructions on setting up MuJoCo can be found here

Testing the installation

All unit tests in baselines can be run using the pytest runner:

pip install pytest
pytest

Training models

Most of the algorithms in the baselines repo are used as follows:

python -m baselines.run --alg=<name of the algorithm> --env=<environment_id> [additional arguments]

Example 1. PPO with MuJoCo Humanoid

For instance, to train a fully-connected network controlling the MuJoCo humanoid using PPO2 for 20M timesteps:

python -m baselines.run --alg=ppo2 --env=Humanoid-v2 --network=mlp --num_timesteps=2e7

Note that for MuJoCo environments the fully-connected network is the default, so we can omit --network=mlp. The hyperparameters for both the network and the learning algorithm can be controlled via the command line, for instance:

python -m baselines.run --alg=ppo2 --env=Humanoid-v2 --network=mlp --num_timesteps=2e7 --ent_coef=0.1 --num_hidden=32 --num_layers=3 --value_network=copy

will set the entropy coefficient to 0.1, construct a fully connected network with 3 layers of 32 hidden units each, and create a separate network for value function estimation (so that its parameters are not shared with the policy network, but the structure is the same).

See the docstrings in common/models.py for a description of the network parameters for each type of model, and the docstring of learn() in baselines/ppo2/ppo2.py for a description of the ppo2 hyperparameters.
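
If you prefer to stay in Python rather than use the command line, a hedged sketch of a roughly equivalent programmatic call is shown below; it assumes gym with MuJoCo support is installed, skips the VecNormalize wrapper that baselines.run normally applies to MuJoCo environments, and relies on ppo2.learn forwarding extra keyword arguments to the network builder:

import gym
from baselines.common.vec_env import DummyVecEnv
from baselines.ppo2 import ppo2

env = DummyVecEnv([lambda: gym.make('Humanoid-v2')])
model = ppo2.learn(
    network='mlp',
    env=env,
    total_timesteps=int(2e7),
    ent_coef=0.1,
    # extra kwargs are passed through to the mlp builder / policy constructor
    num_hidden=32,
    num_layers=3,
    value_network='copy',
)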

Example 2. DQN on Atari

DQN on Atari is at this point a classic benchmark. To run the baselines implementation of DQN on Atari Pong:

python -m baselines.run --alg=deepq --env=PongNoFrameskip-v4 --num_timesteps=1e6
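
A hedged programmatic sketch of the same experiment, assuming the standard Atari wrappers from baselines.common.atari_wrappers and the registered 'conv_only' network used by the deepq Atari defaults:

from baselines import deepq
from baselines.common.atari_wrappers import make_atari, wrap_deepmind

# Apply the usual DeepMind-style preprocessing (episode life, reward clipping, frame stacking)
env = wrap_deepmind(make_atari('PongNoFrameskip-v4'), frame_stack=True)

act = deepq.learn(
    env=env,
    network='conv_only',
    total_timesteps=int(1e6),
)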

Saving, loading and visualizing models

Saving and loading the model

The algorithms' serialization API is not properly unified yet; however, there is a simple method to save and restore trained models. The --load_path and --save_path command-line options load the TensorFlow state from a given path before training and save it after training, respectively. Let's imagine you'd like to train ppo2 on Atari Pong, save the model, and then later visualize what it has learnt.

python -m baselines.run --alg=ppo2 --env=PongNoFrameskip-v4 --num_timesteps=2e7 --save_path=~/models/pong_20M_ppo2

This should reach a mean reward per episode of about 20. To load and visualize the model, we load it, train for 0 steps, and then visualize:

python -m baselines.run --alg=ppo2 --env=PongNoFrameskip-v4 --num_timesteps=0 --load_path=~/models/pong_20M_ppo2 --play

NOTE: MuJoCo environments require normalization to work properly, so we wrap them with the VecNormalize wrapper. Currently, to ensure that models are saved with normalization (so that trained models can be restored and run without further training), the normalization coefficients are saved as TensorFlow variables. This can decrease performance somewhat, so if you require high-throughput steps with MuJoCo and do not need to save/restore models, it may make sense to use numpy normalization instead. To do that, set use_tf=False in baselines/run.py.
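
The same save-and-restore flow can also be driven from Python; a hedged sketch, assuming the Model returned by ppo2.learn exposes save() and that learn() accepts a load_path argument (both are present on the master branch):

import gym
from baselines.common.vec_env import DummyVecEnv
from baselines.ppo2 import ppo2

env = DummyVecEnv([lambda: gym.make('CartPole-v1')])

# Train and save
model = ppo2.learn(network='mlp', env=env, total_timesteps=20000)
model.save('/tmp/cartpole_ppo2')

# Rebuild the model without further training and restore the weights
restored = ppo2.learn(network='mlp', env=env, total_timesteps=0,
                      load_path='/tmp/cartpole_ppo2')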

Logging and visualizing learning curves and other training metrics

By default, all summary data, including progress and standard output, is saved to a unique directory in a temp folder, as returned by Python's tempfile.gettempdir(). The directory can be changed with the --log_path command-line option.

python -m baselines.run --alg=ppo2 --env=PongNoFrameskip-v4 --num_timesteps=2e7 --save_path=~/models/pong_20M_ppo2 --log_path=~/logs/Pong/

NOTE: Please be aware that the logger will overwrite files of the same name in an existing directory; it is therefore recommended to give folder names a unique timestamp to prevent logs from being overwritten.

Another way the temp directory can be changed is through the use of the $OPENAI_LOGDIR environment variable.

For examples on how to load and display the training data, see here.
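
Programmatically, the log directory can be set with baselines.logger, and the resulting progress files can be loaded with the plotting utilities; a hedged sketch, assuming baselines.common.plot_util provides load_results and plot_results as on the master branch:

import os
from baselines import logger
from baselines.common import plot_util

# Direct logger output to a fixed directory instead of a temp folder
logger.configure(dir=os.path.expanduser('~/logs/Pong'))

# Later: load every run found under the directory and plot the learning curves
results = plot_util.load_results(os.path.expanduser('~/logs/Pong'))
plot_util.plot_results(results)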

Subpackages

Benchmarks

Results of benchmarks on MuJoCo (1M timesteps) and Atari (10M timesteps) are available here for MuJoCo and here for Atari, respectively. Note that these results may not be from the latest version of the code; the particular commit hash with which the results were obtained is specified on the benchmarks page.

To cite this repository in publications:

@misc{baselines,
  author = {Dhariwal, Prafulla and Hesse, Christopher and Klimov, Oleg and Nichol, Alex and Plappert, Matthias and Radford, Alec and Schulman, John and Sidor, Szymon and Wu, Yuhuai and Zhokhov, Peter},
  title = {OpenAI Baselines},
  year = {2017},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/openai/baselines}},
}