ikostrikov/pytorch-a2c-ppo-acktr-gail

PyTorch implementation of Advantage Actor Critic (A2C), Proximal Policy Optimization (PPO), Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation (ACKTR) and Generative Adversarial Imitation Learning (GAIL).

Top Related Projects

  • OpenAI Baselines: high-quality implementations of reinforcement learning algorithms
  • stable-baselines: a fork of OpenAI Baselines, implementations of reinforcement learning algorithms
  • stable-baselines3: the PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms
  • TF-Agents: a reliable, scalable and easy-to-use TensorFlow library for Contextual Bandits and Reinforcement Learning
  • garage: a toolkit for reproducible reinforcement learning research

Quick Overview

The ikostrikov/pytorch-a2c-ppo-acktr-gail repository is a PyTorch implementation of several popular reinforcement learning algorithms: Advantage Actor-Critic (A2C), Proximal Policy Optimization (PPO), ACKTR (a scalable trust-region method based on Kronecker-factored approximation), and Generative Adversarial Imitation Learning (GAIL). These algorithms are widely used for training agents to solve complex sequential decision-making tasks.

Pros

  • Comprehensive Implementation: The repository provides a well-documented and comprehensive implementation of several state-of-the-art reinforcement learning algorithms, making it a valuable resource for researchers and practitioners.
  • Modular Design: The codebase is designed in a modular fashion, allowing users to easily integrate the algorithms into their own projects or experiment with different components.
  • Reproducibility: The repository includes detailed instructions and configurations to ensure the reproducibility of the results, which is crucial for research and development.
  • Active Maintenance: The project is actively maintained, with regular updates and bug fixes, ensuring the codebase remains up-to-date and reliable.

Cons

  • Limited Environments: The repository primarily focuses on classic control tasks and Atari games, which may not be representative of the full range of real-world problems that reinforcement learning can be applied to.
  • Steep Learning Curve: The codebase can be complex and may require a good understanding of reinforcement learning concepts and PyTorch to effectively use and extend the project.
  • Lack of Detailed Documentation: While the repository includes some documentation, it may not be comprehensive enough for beginners or users who are new to the field of reinforcement learning.
  • Potential Performance Issues: Depending on the hardware and the complexity of the task, the algorithms implemented in the repository may not always achieve optimal performance, which could be a limitation for certain applications.

Code Examples

Here are a few simplified code examples illustrating the style of models implemented in the ikostrikov/pytorch-a2c-ppo-acktr-gail repository:

  1. Advantage Actor-Critic (A2C) Implementation:
import torch.nn as nn
import torch.nn.functional as F

class A2CModel(nn.Module):
    def __init__(self, num_inputs, num_outputs):
        super(A2CModel, self).__init__()
        self.conv1 = nn.Conv2d(num_inputs, 32, 8, stride=4)
        self.conv2 = nn.Conv2d(32, 64, 4, stride=2)
        self.conv3 = nn.Conv2d(64, 32, 3, stride=1)
        self.linear1 = nn.Linear(32 * 7 * 7, 512)
        self.critic_linear = nn.Linear(512, 1)
        self.actor_linear = nn.Linear(512, num_outputs)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = F.relu(self.conv3(x))
        x = x.view(-1, 32 * 7 * 7)
        x = F.relu(self.linear1(x))
        return self.critic_linear(x), self.actor_linear(x)

This code defines the A2C model architecture, which consists of a series of convolutional and fully connected layers to process the input state and output the value function and policy.
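
As a quick sanity check, the model above can be exercised with a dummy batch of stacked Atari frames. The 84x84 resolution and the six-action output below are illustrative assumptions, not values fixed by the repository:

import torch

# dummy forward pass through the A2CModel defined above:
# 4 stacked 84x84 grayscale frames, 6 discrete actions (e.g. Pong)
model = A2CModel(num_inputs=4, num_outputs=6)
obs = torch.zeros(1, 4, 84, 84)
value, policy_logits = model(obs)
print(value.shape, policy_logits.shape)  # torch.Size([1, 1]) torch.Size([1, 6])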

  2. Proximal Policy Optimization (PPO) Implementation:
import torch
import torch.nn as nn
import torch.nn.functional as F

class PPOModel(nn.Module):
    def __init__(self, num_inputs, num_outputs):
        super(PPOModel, self).__init__()
        self.conv1 = nn.Conv2d(num_inputs, 32, 8, stride=4)
        self.conv2 = nn.Conv2d(32, 64, 4, stride=2)
        self.conv3 = nn.Conv2d(64, 32, 3, stride=1)
        self.linear1 = nn.Linear(32 * 7 * 7, 512)
        self.critic_linear = nn.Linear(512, 1)
        self.actor_linear = nn.Linear(512, num_outputs)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = F.relu(self.conv3(x))
        x = x.view(-1, 32 * 7 * 7)
        x = F.relu(self.linear1(x))
        return self.critic_linear(x), self.actor_linear(x)

Like the A2C model, this network shares a convolutional trunk between a value head and a policy head; PPO differs from A2C in how these outputs are used during the update (via a clipped surrogate objective), not in the network architecture itself.

Competitor Comparisons

OpenAI Baselines: high-quality implementations of reinforcement learning algorithms

Pros of baselines

  • Wider range of implemented algorithms, including DQN, DDPG, and TRPO
  • More extensive documentation and examples
  • Larger community and more frequent updates

Cons of baselines

  • Implemented in TensorFlow, which may be less preferred by some users compared to PyTorch
  • Can be more complex to use and modify due to its broader scope

Code Comparison

baselines (TRPO implementation):

def learn(env, policy_fn, *,
          timesteps_per_batch, # what to train on
          max_kl, cg_iters,
          gamma, lam, # advantage estimation
          entcoeff=0.0,
          cg_damping=1e-2,
          vf_stepsize=3e-4,
          vf_iters =3,
          max_timesteps=0, max_episodes=0, max_iters=0):

pytorch-a2c-ppo-acktr-gail (PPO implementation):

def ppo_update(actor_critic, agent, value_loss_coef, entropy_coef, max_grad_norm, ppo_epoch, num_mini_batch,
               rollouts, clip_param, use_clipped_value_loss):
    advantages = rollouts.returns[:-1] - rollouts.value_preds[:-1]
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-5)

The code snippets show different approaches to implementing reinforcement learning algorithms, with baselines using a function-based approach and pytorch-a2c-ppo-acktr-gail using a more object-oriented style.
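
For context, here is a minimal, self-contained sketch of the clipped surrogate loss that such normalized advantages typically feed into during a PPO update. The variable names and dummy values are illustrative assumptions, not the repository's exact code:

import torch

# dummy data standing in for a mini-batch of rollout statistics
clip_param = 0.2
old_log_probs = torch.randn(8)
new_log_probs = old_log_probs + 0.05 * torch.randn(8)
advantages = torch.randn(8)

ratio = torch.exp(new_log_probs - old_log_probs)               # pi_new(a|s) / pi_old(a|s)
surr1 = ratio * advantages                                     # unclipped surrogate
surr2 = torch.clamp(ratio, 1.0 - clip_param, 1.0 + clip_param) * advantages
action_loss = -torch.min(surr1, surr2).mean()                  # negate to minimize
print(action_loss.item())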

A fork of OpenAI Baselines, implementations of reinforcement learning algorithms

Pros of stable-baselines

  • Wider range of algorithms implemented, including DQN, DDPG, and SAC
  • Better documentation and tutorials for easier onboarding
  • More active maintenance and community support

Cons of stable-baselines

  • Based on TensorFlow 1.x, which is becoming outdated
  • Generally slower performance compared to PyTorch-based implementations
  • Less flexibility for customizing neural network architectures

Code Comparison

stable-baselines:

from stable_baselines import PPO2
model = PPO2('MlpPolicy', 'CartPole-v1', verbose=1)
model.learn(total_timesteps=10000)

pytorch-a2c-ppo-acktr-gail:

from a2c_ppo_acktr import algo
from a2c_ppo_acktr.model import Policy

actor_critic = Policy(envs.observation_space.shape, envs.action_space)
agent = algo.PPO(actor_critic, clip_param=0.2, ppo_epoch=4, num_mini_batch=32,
                 value_loss_coef=0.5, entropy_coef=0.01, lr=3e-4, eps=1e-5)
# rollouts are collected and agent.update(rollouts) is called in a manual training loop (see main.py)

The stable-baselines implementation is more concise and user-friendly, while pytorch-a2c-ppo-acktr-gail offers more granular control over the training process and model architecture.

PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.

Pros of stable-baselines3

  • More comprehensive library with a wider range of algorithms and features
  • Better documentation and active community support
  • Easier to use with a more consistent API across different algorithms

Cons of stable-baselines3

  • Potentially slower performance due to higher-level abstractions
  • Less flexibility for customization compared to the more barebones implementation

Code Comparison

pytorch-a2c-ppo-acktr-gail:

envs = [make_env(args.env_name, args.seed, i, args.log_dir, args.add_timestep)
        for i in range(args.num_processes)]
envs = SubprocVecEnv(envs)

stable-baselines3:

env = make_vec_env(env_id, n_envs=4, seed=0)
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=25000)

The stable-baselines3 code is more concise and easier to understand, while pytorch-a2c-ppo-acktr-gail requires more manual setup.

TF-Agents: A reliable, scalable and easy to use TensorFlow library for Contextual Bandits and Reinforcement Learning.

Pros of tensorflow/agents

  • Comprehensive library with a wide range of RL algorithms and tools
  • Seamless integration with TensorFlow ecosystem and Google's research
  • Extensive documentation and tutorials for easier adoption

Cons of tensorflow/agents

  • Steeper learning curve due to its complexity and extensive features
  • May be overkill for simpler RL projects or quick prototyping
  • Less flexibility in customization compared to pytorch-a2c-ppo-acktr-gail

Code Comparison

pytorch-a2c-ppo-acktr-gail:

envs = [make_env(args.env_name, args.seed, i, args.log_dir)
        for i in range(args.num_processes)]
envs = SubprocVecEnv(envs)

tensorflow/agents:

tf_env = tf_py_environment.TFPyEnvironment(
    suite_gym.load(env_name))
agent = dqn_agent.DqnAgent(
    tf_env.time_step_spec(),
    tf_env.action_spec(),
    q_network=q_net,
    optimizer=optimizer)

The code snippets show differences in environment setup and agent initialization between the two libraries. pytorch-a2c-ppo-acktr-gail uses a more straightforward approach, while tensorflow/agents provides a more structured and verbose implementation.

A toolkit for reproducible reinforcement learning research.

Pros of garage

  • More comprehensive framework with support for multiple RL algorithms and environments
  • Better documentation and tutorials for easier onboarding
  • Active development and maintenance with regular updates

Cons of garage

  • Steeper learning curve due to its more complex architecture
  • Potentially slower execution compared to the lightweight pytorch-a2c-ppo-acktr-gail
  • Less focused on specific algorithms, which may impact performance optimization

Code Comparison

garage:

from garage import wrap_experiment
from garage.tf.algos import PPO
from garage.tf.policies import GaussianMLPPolicy

@wrap_experiment
def ppo_experiment(ctxt=None):
    policy = GaussianMLPPolicy(env_spec=env.spec)
    algo = PPO(env_spec=env.spec, policy=policy, ...)

pytorch-a2c-ppo-acktr-gail:

from a2c_ppo_acktr import algo, utils
from a2c_ppo_acktr.model import Policy

envs = make_vec_envs(env_name, num_processes, ...)
actor_critic = Policy(obs_shape, action_space, ...)
agent = algo.PPO(actor_critic, args.clip_param, ...)

The garage example shows a more modular approach with separate policy and algorithm classes, while pytorch-a2c-ppo-acktr-gail combines the policy and value function in a single actor-critic model. garage also provides an experiment wrapper for easier setup and logging.

README

pytorch-a2c-ppo-acktr

Update (April 12th, 2021)

PPO is great, but Soft Actor-Critic can be better for many continuous control tasks. Please check out my new RL repository in JAX.

Please use the hyperparameters from this README. With other hyperparameters, things might not work (it's RL, after all)!

This is a PyTorch implementation of

  • Advantage Actor Critic (A2C), a synchronous deterministic version of A3C
  • Proximal Policy Optimization PPO
  • Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation ACKTR
  • Generative Adversarial Imitation Learning GAIL

Also see the OpenAI posts: A2C/ACKTR and PPO for more information.

This implementation is inspired by the OpenAI baselines for A2C, ACKTR and PPO. It uses the same hyperparameters and model architecture, since they were well tuned for Atari games.

Please use this bibtex if you want to cite this repository in your publications:

@misc{pytorchrl,
  author = {Kostrikov, Ilya},
  title = {PyTorch Implementations of Reinforcement Learning Algorithms},
  year = {2018},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/ikostrikov/pytorch-a2c-ppo-acktr-gail}},
}

Supported (and tested) environments (via OpenAI Gym)

I highly recommend PyBullet as a free open source alternative to MuJoCo for continuous control tasks.

All environments are operated through exactly the same Gym interface. See the Gym documentation for a comprehensive list.
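
Concretely, every supported environment follows the classic Gym interaction loop. The sketch below uses the pre-0.26 Gym API that this codebase targets; the environment name is just an example and requires the Atari dependencies listed under Requirements:

import gym

# standard Gym loop (old-style API: reset() returns obs, step() returns 4 values)
env = gym.make("PongNoFrameskip-v4")
obs = env.reset()
done = False
while not done:
    action = env.action_space.sample()  # random action, for illustration only
    obs, reward, done, info = env.step(action)
env.close()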

To use the DeepMind Control Suite environments, set the flag --env-name dm.<domain_name>.<task_name>, where domain_name and task_name are the name of a domain (e.g. hopper) and a task within that domain (e.g. stand) from the DeepMind Control Suite. Refer to their repo and their tech report for a full list of available domains and tasks. Other than setting the task, the API for interacting with the environment is exactly the same as for all the Gym environments thanks to dm_control2gym.
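
For example, a PPO run on the hopper stand task could be launched as follows; the flag format comes from the paragraph above, while the remaining hyperparameters are illustrative rather than prescribed:

python main.py --env-name "dm.hopper.stand" --algo ppo --use-gae --num-env-steps 1000000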

Requirements

To install the requirements, run:

# PyTorch
conda install pytorch torchvision -c soumith

# Other requirements
pip install -r requirements.txt

# Gym Atari
conda install -c conda-forge gym-atari
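
After installation, a quick import check helps confirm that the core dependencies are in place (torch and gym are the package names this repository imports):

# verify PyTorch and Gym are importable and print their versions
python -c "import torch, gym; print(torch.__version__, gym.__version__)"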

Contributions

Contributions are very welcome. If you know how to make this code better, please open an issue. If you want to submit a pull request, please open an issue first. Also see a todo list below.

I'm also looking for volunteers to run all experiments on Atari and MuJoCo (with multiple random seeds).

Disclaimer

It's extremely difficult to reproduce results for reinforcement learning methods; see "Deep Reinforcement Learning that Matters" for more information. I tried to reproduce the OpenAI results as closely as possible, but major differences in performance can be caused by even minor differences between the TensorFlow and PyTorch libraries.

TODO

  • Improve this README file. Rearrange images.
  • Improve performance of KFAC, see kfac.py for more information
  • Run evaluation for all games and algorithms

Visualization

To visualize the results, use visualize.ipynb.
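
If you prefer a plain script to the notebook, the following is a minimal sketch that assumes baselines-style monitor.csv files written to a ./log directory (the directory, glob pattern, and smoothing window are assumptions):

import glob

import matplotlib.pyplot as plt
import pandas as pd

# each monitor.csv starts with a one-line JSON header, hence skiprows=1;
# the remaining columns are r (episode reward), l (length), t (wall-clock time)
files = glob.glob("./log/*monitor.csv")
episodes = pd.concat(pd.read_csv(f, skiprows=1) for f in files).sort_values("t")
smoothed = episodes["r"].rolling(100, min_periods=1).mean()

plt.plot(smoothed.values)
plt.xlabel("episode")
plt.ylabel("episode reward (100-episode moving average)")
plt.show()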

Training

Atari

A2C

python main.py --env-name "PongNoFrameskip-v4"

PPO

python main.py --env-name "PongNoFrameskip-v4" --algo ppo --use-gae --lr 2.5e-4 --clip-param 0.1 --value-loss-coef 0.5 --num-processes 8 --num-steps 128 --num-mini-batch 4 --log-interval 1 --use-linear-lr-decay --entropy-coef 0.01

ACKTR

python main.py --env-name "PongNoFrameskip-v4" --algo acktr --num-processes 32 --num-steps 20

MuJoCo

Always use the --use-proper-time-limits flag when possible. It properly handles partial trajectories (see https://github.com/sfujim/TD3/blob/master/main.py#L123).

A2C

python main.py --env-name "Reacher-v2" --num-env-steps 1000000

PPO

python main.py --env-name "Reacher-v2" --algo ppo --use-gae --log-interval 1 --num-steps 2048 --num-processes 1 --lr 3e-4 --entropy-coef 0 --value-loss-coef 0.5 --ppo-epoch 10 --num-mini-batch 32 --gamma 0.99 --gae-lambda 0.95 --num-env-steps 1000000 --use-linear-lr-decay --use-proper-time-limits

ACKTR

ACKTR requires some MuJoCo-specific modifications, but for now I want to keep this code as unified as possible, so I'm still looking for a better way to integrate them into the codebase.

Enjoy

Atari

python enjoy.py --load-dir trained_models/a2c --env-name "PongNoFrameskip-v4"

MuJoCo

python enjoy.py --load-dir trained_models/ppo --env-name "Reacher-v2"

Results

A2C

Training reward curves (plots omitted): BreakoutNoFrameskip-v4, SeaquestNoFrameskip-v4, QbertNoFrameskip-v4, BeamRiderNoFrameskip-v4

PPO

Training reward curves (plots omitted): BreakoutNoFrameskip-v4, SeaquestNoFrameskip-v4, QbertNoFrameskip-v4, BeamRiderNoFrameskip-v4

ACKTR

Training reward curves (plots omitted): BreakoutNoFrameskip-v4, SeaquestNoFrameskip-v4, QbertNoFrameskip-v4, BeamRiderNoFrameskip-v4