Tutorial 3: RLlib (2) — A Gentle RLlib Tutorial
Once you’ve installed Ray and RLlib with
pip install ray[rllib]
you can train your first RL agent with a single command in the command line:
rllib train --run=A2C --env=CartPole-v0
This will tell your computer to train using the Advantage Actor-Critic (A2C) algorithm on the CartPole environment.
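The command line also accepts extra options. As a rough sketch (check rllib train --help for the exact flags your Ray version supports), you can pass hyperparameters and a stopping condition as JSON strings:
rllib train --run=PPO --env=CartPole-v0 --config='{"num_workers": 2}' --stop='{"episode_reward_mean": 195}'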
A2C and a host of other algorithms are already built into the library, meaning you don’t have to worry about the details of implementing those yourself. This is really great, particularly if you’re looking to train using a standard environment and algorithm. If you want to do more, however, you’re going to have to dig a bit deeper.
RLlib Agents
The various algorithms you can access are available through ray.rllib.agents. Here, you can find a long list of different implementations in both PyTorch and TensorFlow to begin playing with.
These are all accessed using the algorithm’s trainer class. For example, if you want to use A2C as shown above, you can run:
import ray
from ray.rllib import agents
ray.init()
trainer = agents.a3c.A2CTrainer(env='CartPole-v0')
If you want to try a DQN instead, you can call:
trainer = agents.dqn.DQNTrainer(env='CartPole-v0') # Deep Q Network
All the algorithms follow the same naming convention: the lowercase algorithm abbreviation is the module name, and the uppercase abbreviation followed by “Trainer” is the class name.
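For example, sticking with the same agents API, a couple of other trainers look like this (a sketch; it assumes ray.init() has already been called as above):
trainer = agents.ppo.PPOTrainer(env='CartPole-v0')  # Proximal Policy Optimization
trainer = agents.sac.SACTrainer(env='CartPole-v0')  # Soft Actor-Critic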
Changing hyperparameters is as easy as passing a dictionary of configurations to the config argument. A quick way to see what’s available is to print trainer.config, which lists the options for your chosen algorithm. A few examples include:
- fcnet_hiddens controls the number and size of the hidden layers (passed as a list inside a dictionary called model within config; I’ll show an example below).
- vf_share_layers determines whether you have one neural network with multiple output heads or separate value and policy networks.
- num_workers sets the number of parallel worker processes used for rollouts.
- num_gpus sets the number of GPUs to use.
There are lots of other options to set and customize, from the network (typically found in the model dictionary) to various callbacks and multi-agent settings.
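Here is a minimal sketch that overrides a few of these options and then prints the full configuration; it uses the A2C trainer from above and trains on CPU only:
import ray
from ray.rllib import agents

ray.init(ignore_reinit_error=True)
config = {
    'num_workers': 2,  # two parallel rollout workers
    'num_gpus': 0,     # CPU-only training
    'model': {
        'fcnet_hiddens': [64, 64],  # two hidden layers of 64 units each
    },
}
trainer = agents.a3c.A2CTrainer(env='CartPole-v0', config=config)
print(trainer.config)  # full set of options: defaults plus our overrides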
Example 1: Training PPO for CartPole
Let’s walk through a quick example to get you started and show how this works with a standard OpenAI Gym environment. Open your IDE or text editor of choice and try the following:
import ray
from ray.rllib import agents

ray.init()  # Skip or set to ignore if already called
config = {
    'gamma': 0.9,
    'lr': 1e-2,
    'num_workers': 4,
    'train_batch_size': 1000,
    'model': {
        'fcnet_hiddens': [128, 128],
    },
}
trainer = agents.ppo.PPOTrainer(env='CartPole-v0', config=config)
results = trainer.train()
The config dictionary changes the defaults for the values above. You can see how we can influence the number of layers and nodes in the network by nesting a dictionary called model inside the config dictionary. Once we’ve specified our configuration, calling the train() method on our trainer object will send the environment to the workers and begin collecting data. Once enough data is collected (1,000 samples according to our settings above), the model will update and send the output to a new dictionary called results.
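To inspect what train() returned, you can print the results dictionary. The pretty_print helper below is imported from ray.tune.logger, as in the RLlib examples; treat the exact import path as something to verify against your Ray version:
from ray.tune.logger import pretty_print

print(pretty_print(results))  # all metrics from the last training iteration
print('mean episode reward:', results['episode_reward_mean'])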
If you want to run multiple updates, you can set up a training loop to continuously call the train() method for a given number of iterations or until some other threshold has been reached.
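Here is a minimal sketch of such a loop; the reward threshold of 195 and the cap of 50 iterations are my own choices, not part of the original example:
for i in range(50):
    results = trainer.train()
    print(f"iteration {i}: episode_reward_mean = {results['episode_reward_mean']:.1f}")
    if results['episode_reward_mean'] >= 195:  # CartPole-v0 counts as solved at 195
        break
checkpoint_path = trainer.save()  # save the weights so training can be resumed later
print('checkpoint saved at', checkpoint_path)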
Example 2: Hyperparameter Tuning with Ray Tune
First, similar to the above, we train a PyTorch model using the PPO algorithm. The code below uses a neural network consisting of a single hidden layer of 32 neurons and a linear activation function.
import ray
from ray.rllib.agents.ppo import PPOTrainer

config = {
    "env": "CartPole-v0",
    # Change the following line to `"framework": "tf"` to use TensorFlow
    "framework": "torch",
    "model": {
        "fcnet_hiddens": [32],
        "fcnet_activation": "linear",
    },
}
stop = {"episode_reward_mean": 195}

ray.shutdown()
ray.init(
    num_cpus=3,
    include_dashboard=False,
    ignore_reinit_error=True,
    log_to_driver=False,
)

# execute training
analysis = ray.tune.run(
    "PPO",
    config=config,
    stop=stop,
    checkpoint_at_end=True,
)
RLlib provides a Trainer class which holds a policy for environment interaction. Through the trainer interface, a policy can be trained, actions can be computed, and checkpoints can be saved. While the analysis object returned from ray.tune.run earlier did not contain any trainer instances, it has all the information needed to reconstruct one from a saved checkpoint because checkpoint_at_end=True was passed as a parameter. The code below shows this.
# restore a trainer from the last checkpoint
trial = analysis.get_best_logdir("episode_reward_mean", "max")
checkpoint = analysis.get_best_checkpoint(
    trial,
    "training_iteration",
    "max",
)
trainer = PPOTrainer(config=config)
trainer.restore(checkpoint)
Let’s now create a video of the trained model. We will choose the action recommended by the trained model instead of acting randomly. If you are running this code in Google Colab, it is important to note that there is no display driver available for generating videos. However, it is possible to install a virtual display driver to get it to work.
# install dependencies needed for recording videos
!apt-get install -y xvfb x11-utils
!pip install pyvirtualdisplay==0.2.*
The next step is to start an instance of the virtual display:
from pyvirtualdisplay import Display
display = Display(visible=False, size=(1400, 900))
_ = display.start()
OpenAI Gym has a VideoRecorder wrapper that can record a video of the running environment in MP4 format.
import gym
from gym.wrappers.monitoring.video_recorder import VideoRecorder
from base64 import b64encode
from IPython.display import HTML

def render_mp4(videopath: str) -> str:
    """
    Returns a string containing an HTML video tag with a base64-encoded
    version of the MP4 video at the specified path.
    """
    mp4 = open(videopath, 'rb').read()
    base64_encoded_mp4 = b64encode(mp4).decode()
    return f'<video width=400 controls><source src="data:video/mp4;' \
           f'base64,{base64_encoded_mp4}" type="video/mp4"></video>'

# Create the environment the agent was trained on
env = gym.make("CartPole-v0")

after_training = "after_training.mp4"
after_video = VideoRecorder(env, after_training)
observation = env.reset()
done = False
while not done:
    env.render()
    after_video.capture_frame()
    # Use the trained policy instead of acting randomly
    action = trainer.compute_action(observation)
    observation, reward, done, info = env.step(action)
after_video.close()
env.close()

# You should get a video similar to the one below.
html = render_mp4(after_training)
HTML(html)
The pole balances nicely, which means the agent has solved the CartPole environment!
Let’s now try to find hyperparameters that can solve the CartPole environment in the fewest timesteps with Ray Tune, a library for experiment execution and hyperparameter tuning at any scale. While this tutorial will only use grid search, note that Ray Tune also gives you access to more efficient hyperparameter tuning algorithms such as Population Based Training, BayesOptSearch, and HyperBand/ASHA.
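As a sketch of one of those alternatives (not used in the run below, and the constructor arguments may differ between Ray versions), an ASHA scheduler can be passed to ray.tune.run to stop poorly performing trials early:
from ray.tune.schedulers import ASHAScheduler

asha = ASHAScheduler(
    time_attr="training_iteration",
    metric="episode_reward_mean",
    mode="max",
    max_t=100,        # maximum training iterations per trial
    grace_period=10,  # minimum iterations before a trial can be stopped
)
analysis = ray.tune.run(
    "PPO",
    config=config,
    scheduler=asha,
    num_samples=8,
)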
Enter the following code, and be prepared for it to take a while to run:
parameter_search_config = {
    "env": "CartPole-v0",
    "framework": "torch",
    # Hyperparameter tuning
    "model": {
        "fcnet_hiddens": ray.tune.grid_search([[32], [64]]),
        "fcnet_activation": ray.tune.grid_search(["linear", "relu"]),
    },
    "lr": ray.tune.uniform(1e-7, 1e-2),
}

# To explicitly stop or restart Ray, use the shutdown API.
ray.shutdown()
ray.init(
    num_cpus=12,
    include_dashboard=False,
    ignore_reinit_error=True,
    log_to_driver=False,
)

parameter_search_analysis = ray.tune.run(
    "PPO",
    config=parameter_search_config,
    stop=stop,
    num_samples=5,
    metric="timesteps_total",
    mode="min",
)

print(
    "Best hyperparameters found:",
    parameter_search_analysis.best_config,
)
By asking for 12 CPU cores with num_cpus=12 in ray.init, four trials run in parallel with three CPUs each. Specifying num_samples=5 means you will get five random samples for the learning rate. For each of those, there are two values for the size of the hidden layer and two values for the activation function, so there will be 5 * 2 * 2 = 20 trials, shown with their statuses in the output of the cell as the calculation runs.
Note that Ray prints the current best configuration as it goes. This includes all the default values that have been set, which is a good place to find other parameters that could be tweaked.
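You can also pull every trial’s results into a pandas DataFrame for a closer look. This is a sketch, and the exact column names depend on your Ray version:
df = parameter_search_analysis.dataframe()
print(df[["timesteps_total", "episode_reward_mean"]].head())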
After the search completes, the final output might look similar to the following:
INFO tune.py:549 -- Total run time: 3658.24 seconds (3657.45 seconds for the tuning loop).
Best hyperparameters found: {'env': 'CartPole-v0', 'framework': 'torch', 'model': {'fcnet_hiddens': [64], 'fcnet_activation': 'relu'}, 'lr': 0.006733929096170726}
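As a final sketch (the 195 threshold and the 100-iteration cap are my own assumptions), the winning configuration can be fed back into a trainer to produce a final agent:
best_config = parameter_search_analysis.best_config  # already contains the env and framework keys
final_trainer = PPOTrainer(config=best_config)
for i in range(100):
    result = final_trainer.train()
    if result["episode_reward_mean"] >= 195:  # CartPole-v0 "solved" threshold
        break
final_checkpoint = final_trainer.save()
print("final checkpoint saved at", final_checkpoint)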