Paper 31: Deep Reinforcement Learning that Matters
Abstract
-
Reproducing existing work and accurately judging the improvements offered by novel methods is vital to sustaining progress in DRL.
-
Reproducing results for state-of-the-art DRL methods is seldom straightforward. In particular, non-determinism in standard benchmark environments, combined with variance intrinsic to the methods, can make reported results tough to interpret. Without significance metrics and tighter standardization of experimental reporting, it is difficult to determine whether improvements over the prior state-of-the-art are meaningful.
-
In this paper, we investigate challenges posed by reproducibility, proper experimental techniques, and reporting procedures. We illustrate the variability in reported metrics and results when comparing against common baselines and suggest guidelines to make future results in DRL more reproducible. We aim to spur discussion about how to ensure continued progress in the field by minimizing wasted effort stemming from results that are non-reproducible and easily misinterpreted.
1 Introduction
To maintain rapid progress in RL research, it is important that existing works can be easily reproduced and compared in order to accurately judge the improvements offered by novel methods.
However, reproducing DRL results is seldom straightforward, and the literature reports a wide range of results for the same baseline algorithms. Reproducibility can be affected by extrinsic factors (e.g. hyperparameters or codebases) and intrinsic factors (e.g. effects of random seeds or environment properties). We investigate these sources of variance in reported results through a representative set of experiments. For clarity, we focus our investigation on policy gradient (PG) methods in continuous control. We note that the diversity of metrics and the lack of significance testing in the RL literature create the potential for misleading reporting of results. We demonstrate the possible benefits of significance testing using techniques common in ML and statistics.
In each section of our experimental analysis, we pose questions regarding key factors affecting reproducibility. We find that there are numerous sources of non-determinism when reproducing and comparing RL algorithms, and we show that fine details of experimental procedure can be critical. Based on our experiments, we conclude with possible recommendations, lines of investigation, and points of discussion for future work to ensure that DRL is reproducible and continues to matter.
2 Technical Background
We experiment with four policy gradient methods: Trust Region Policy Optimization (TRPO), Deep Deterministic Policy Gradients (DDPG), Proximal Policy Optimization (PPO), and Actor Critic using Kronecker-Factored Trust Region (ACKTR).
3 Experimental Analysis
We pose several questions about the factors affecting reproducibility of state-of-the-art RL methods. We perform a set of experiments designed to provide insight into the questions posed. In particular, we investigate the effects of:
-
specific hyperparameters on algorithm performance if not properly tuned;
-
random seeds and the number of averaged experiment trials;
-
specific environment characteristics;
-
differences in algorithm performance due to stochastic environments;
-
differences due to codebases with most other factors held constant.
For most of our experiments, except for those comparing codebases, we use the OpenAI Baselines implementations of the following algorithms: ACKTR, PPO, DDPG, and TRPO. We use the Hopper-v1 and HalfCheetah-v1 MuJoCo environments from OpenAI Gym. These two environments provide contrasting dynamics (the former being more unstable).
To ensure fairness, we run five experiment trials for each evaluation, each with a different preset random seed (all experiments use the same set of random seeds). In all cases, we highlight important results here, with full descriptions of experimental setups and additional learning curves included in the supplemental material. Unless otherwise mentioned, we use default settings whenever possible, while modifying only the hyperparameters of interest.
We use multilayer perceptron function approximators in all cases and denote the hidden layer sizes and activations as (N, M, activation). For default settings, we vary the hyperparameters under investigation one at a time. For DDPG, we use a network structure of (64, 64, ReLU) for both actor and critic. For TRPO and PPO, we use (64, 64, tanh) for the policy. For ACKTR, we use (64, 64, tanh) for the actor and (64, 64, ELU) for the critic.
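As a rough illustration of this (N, M, activation) convention, the sketch below builds such networks in PyTorch. It is not the OpenAI Baselines code; the `mlp` helper is our own, and the Hopper-v1 input/output dimensions are used only as an example.

```python
# Minimal sketch of the (N, M, activation) convention described above.
# Not the OpenAI Baselines implementation; for illustration only.
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=(64, 64), activation=nn.Tanh):
    """Build a simple multilayer perceptron: in_dim -> hidden layers -> out_dim."""
    layers, prev = [], in_dim
    for h in hidden:
        layers += [nn.Linear(prev, h), activation()]
        prev = h
    layers.append(nn.Linear(prev, out_dim))
    return nn.Sequential(*layers)

# (64, 64, tanh) policy as used for TRPO/PPO (Hopper-v1: 11-dim obs, 3-dim action)
policy = mlp(in_dim=11, out_dim=3, activation=nn.Tanh)
# (64, 64, ReLU) actor and critic as used for DDPG
actor = mlp(in_dim=11, out_dim=3, activation=nn.ReLU)
critic = mlp(in_dim=11 + 3, out_dim=1, activation=nn.ReLU)
```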
3.1 Hyperparameters
What is the magnitude of the effect hyperparameter settings can have on baseline performance?
Tuned hyperparameters play a large role in eliciting the best results from many algorithms. However, the choice of optimal hyperparameter configuration is often not consistent in related literature, and the range of values considered is often not reported. Furthermore, poor hyperparameter selection can be detrimental to a fair comparison against baseline algorithms. Here, we investigate the effect of several aspects of hyperparameter selection on performance.
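The one-at-a-time variation described above can be expressed as a simple sweep loop. The sketch below is a hypothetical harness, not the authors' experimental code; `train`, the hyperparameter names, and the value grids are placeholders.

```python
# Hypothetical one-at-a-time hyperparameter sweep. train() is a stand-in for
# training an algorithm with a given config/seed and returning its final return.
def train(config, seed):
    return 0.0  # placeholder for actual training

DEFAULTS = {"lr": 3e-4, "batch_size": 2048, "clip": 0.2}          # assumed names
SWEEPS = {"lr": [1e-4, 3e-4, 1e-3], "batch_size": [512, 2048, 8192]}
SEEDS = [0, 1, 2, 3, 4]  # the same preset seeds for every configuration

for name, values in SWEEPS.items():
    for value in values:
        config = dict(DEFAULTS, **{name: value})  # vary one hyperparameter at a time
        returns = [train(config, seed=s) for s in SEEDS]
        print(name, value, sum(returns) / len(returns))
```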
3.2 Network Architecture
How does the choice of network architecture for the policy and value function approximation affect performance?
In previous work, it has been shown that policy network architecture can significantly impact results in both TRPO and DDPG. Furthermore, certain activation functions such as ReLU have been shown to degrade learning performance due to the "dying ReLU" problem. As such, we examine network architecture and activation functions for both policy and value function approximators.
The results show how significantly performance can be affected by simple changes to the policy or value network activations. We find that ReLU or Leaky ReLU activations usually perform best, but the effects are not consistent across algorithms or environments. This inconsistency demonstrates how interconnected network architecture is with algorithm methodology. For example, using a large network with PPO may require tweaking other hyperparameters, such as the trust region clipping or learning rate, to compensate for the architectural change. This intricate interplay of hyperparameters is one reason reproducing current policy gradient methods is so difficult, and it is exceedingly important to choose an appropriate architecture for proper baseline results. This also suggests a possible need for hyperparameter-agnostic algorithms (that is, algorithms that incorporate hyperparameter adaptation as part of the design) such that fair comparisons can be made without concern about improper settings for the task at hand.
3.3 Reward Scale
How can the reward scale affect results? Why is reward rescaling used?
Reward rescaling has been used in several recent works to improve results for DDPG. This involves simply multiplying the rewards generated from an environment by some scalar (r̂ = r · σ̂) for training. Often, these works report using a reward scale of σ̂ = 0.1. In Atari domains, this is akin to clipping the rewards to [0, 1]. Intuitively, in gradient-based methods (as used in most deep RL), a large and sparse output scale can result in problems regarding saturation and inefficiency in learning. Therefore, clipping or rescaling rewards compresses the space of estimated expected returns in action-value-function-based methods such as DDPG. We run a set of experiments using reward rescaling in DDPG (with and without layer normalization) for insight into how this aspect affects performance.
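One way to set up such a reward-scale experiment is with a small Gym reward wrapper. This is a minimal sketch assuming the classic `gym` API, not the code used in the paper.

```python
# Minimal sketch of reward rescaling r_hat = sigma_hat * r as a Gym wrapper.
# Assumes the classic gym API; not the code used in the paper.
import gym

class ScaledReward(gym.RewardWrapper):
    def __init__(self, env, scale=0.1):
        super().__init__(env)
        self.scale = scale

    def reward(self, reward):
        return self.scale * reward  # only the training signal is rescaled

# env = ScaledReward(gym.make("HalfCheetah-v1"), scale=0.1)
# DDPG is then trained on env, while evaluation reports the unscaled returns.
```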
The results show that reward rescaling can have a large effect, but the results were inconsistent across environments and scaling values. In particular, layer normalization changes how the rescaling factor affects results, suggesting that these impacts are due to the use of deep networks and gradient-based methods. Because the value function approximator tracks a moving target distribution, this can potentially affect learning in unstable environments where a deep Q-value function approximator is used. Furthermore, some environments may have untuned reward scales. Therefore, this hyperparameter has the potential to have a large impact if considered properly. Rather than rescaling rewards by hand in some environments, a more principled approach should be taken. An initial foray into this problem is to adaptively rescale reward targets with normalized stochastic gradient, but further research is needed.
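As a very rough illustration of adaptive rescaling (not the normalized stochastic gradient method referenced above), one could keep running statistics of the value targets and normalize them before regression:

```python
# Simplified running-statistics normalizer for value targets.
# Only an illustration of adaptive rescaling, not the cited method.
class RunningNormalizer:
    def __init__(self, eps=1e-8):
        self.count, self.mean, self.m2, self.eps = 0, 0.0, 0.0, eps

    def update(self, x):
        # Welford's online algorithm for running mean and variance
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def normalize(self, x):
        var = self.m2 / max(self.count - 1, 1)
        return (x - self.mean) / ((var + self.eps) ** 0.5)
```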
3.4 Random Seeds and Trials
Can random seeds drastically alter performance? Can one distort results by averaging an improper number of trials?
A major concern with DRL is the variance in results due to environment stochasticity or stochasticity in the learning process (e.g. random weight initialization). As such, even averaging several learning results together across totally different random seeds can lead to the reporting of misleading results. We highlight this in the form of an experiment.
We perform 10 experiment trials with the same hyperparameter configuration, varying only the random seed across the 10 trials. We then split the trials into two sets of 5 and average each grouping separately. The results show that the average performance of the two groups can be drastically different: the variance between runs is enough to create statistically different distributions just from varying random seeds. Unfortunately, in recently reported results, it is not uncommon for the top-N trials to be selected from among several trials, or for results to be averaged over only a small number of trials (N < 5). Our experiment with random seeds shows that this can be potentially misleading. For HalfCheetah in particular, it is possible to get learning curves that do not fall within the same distribution at all, just by averaging different runs with the same hyperparameters but different random seeds. While no specific number of trials can be prescribed as a recommendation, power analysis methods may give a general idea of how many are needed, as we discuss later. However, more investigation is needed to answer this open problem.
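The seed-splitting experiment can be reproduced schematically as follows; the `final_returns` array is stand-in data, not results from the paper.

```python
# Sketch of the random-seed experiment: 10 runs with identical hyperparameters,
# split into two groups of 5 and compared. final_returns stands in for the
# per-seed final average returns.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
final_returns = rng.normal(loc=3000, scale=800, size=10)  # placeholder data

group_a, group_b = final_returns[:5], final_returns[5:]
t, p = stats.ttest_ind(group_a, group_b, equal_var=False)  # Welch's t-test
print(f"mean A={group_a.mean():.0f}, mean B={group_b.mean():.0f}, p={p:.3f}")
```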
3.5 Environments
How do the environment properties affect variability in reported RL algorithm performance?
To assess how the choice of evaluation environment can affect the presented results, we use our aforementioned default set of hyperparameters across our chosen testbed of algorithms and investigate how well each algorithm performs across an extended suite of continuous control tasks. For these experiments, we use the following environments from OpenAI Gym: Hopper-v1, HalfCheetah-v1, Swimmer-v1, and Walker2d-v1. The choice of environment often plays an important role in demonstrating how well a newly proposed algorithm performs against baselines. In continuous control tasks, the environments often have random stochasticity, shortened trajectories, or different dynamic properties. We demonstrate that, as a result of these differences, algorithm performance can vary across environments and that the best performing algorithm across all environments is not always clear. Thus it is increasingly important to present results for a wide range of environments and not only to pick those on which a novel method outperforms other methods.
As shown in the results, in environments with stable dynamics (e.g. HalfCheetah-v1), DDPG outperforms all other algorithms. However, as dynamics become more unstable (e.g. in Hopper-v1), these performance gains rapidly diminish. Reporting DDPG-based experiments only on the favourable and stable HalfCheetah environment would therefore be unfair.
When a policy converges to a local optimum, learning curves can suggest successful optimization of the policy over time even though the returns achieved are not qualitatively representative of the desired behaviour, as demonstrated in video replays of the learned policy. Therefore, it is important to show not only returns but also demonstrations of the learned policy in action. Without understanding what the evaluation returns indicate, it is possible to report misleading results that in reality correspond only to local optima rather than to the desired behaviour.
3.6 Codebases
Are commonly used baseline implementations comparable?
In many cases, authors implement their own versions of baseline algorithms to compare against. We investigate several implementations of TRPO and several implementations of DDPG. Our goal is to draw attention to the variance due to implementation details across algorithms. For each codebase, we run the same subset of our architecture experiments as with the OpenAI Baselines implementations, using the same hyperparameters as in those experiments.
The results show that implementation differences, which are often not reflected in publications, can have dramatic impacts on performance. This demonstrates the necessity of enumerating implementation details, packaging codebases with publications, and ensuring that the performance of baseline experiments in novel works matches that of the original baseline publication code.
4 Reporting Evaluation Metrics
In this section we analyze some of the evaluation metrics commonly used in the RL literature. In practice, RL algorithms are often evaluated by simply presenting plots or tables of average cumulative reward (average returns) and, more recently, of the maximum reward achieved over a fixed number of timesteps. Due to the unstable nature of many of these algorithms, simply reporting the maximum returns is typically inadequate for fair comparison; even reporting average returns can be misleading, as the range of performance across seeds and trials is unknown. Alone, these metrics may not provide a clear picture of an algorithm's range of performance. However, when combined with confidence intervals, they may be adequate to make an informed decision given a large enough number of trials. As such, we investigate using the bootstrap and significance testing, as in ML, to evaluate algorithm performance.
-
Online View vs. Policy Optimization
An important distinction when reporting results is the online learning view versus the policy optimization view of RL. In the online view, an agent optimizes the returns across the entire learning process and there is not necessarily an end to the agent's trajectory. In this view, evaluations can use the average cumulative rewards across the entire learning process (balancing exploration and exploitation), or can possibly use offline evaluation. The alternate view corresponds to policy optimization, where evaluation is performed using a target policy in an offline manner. In the policy optimization view, it is important to run evaluations across the entire length of the task trajectory with a single target policy to determine the average returns that the target policy can obtain. We focus on evaluation methods for the policy optimization view (with offline evaluation), but the same principles can be applied to the online view.
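Under the policy optimization view, offline evaluation amounts to rolling out a single target policy for full episodes and averaging returns. A minimal sketch, assuming the classic `gym` step API and a placeholder `policy` function:

```python
# Sketch of offline evaluation in the policy optimization view: roll out a
# single target policy for full episodes and average the undiscounted returns.
# Assumes the classic gym API (4-tuple step); `policy` maps obs -> action.
def evaluate_policy(env, policy, episodes=10):
    returns = []
    for _ in range(episodes):
        obs, done, total = env.reset(), False, 0.0
        while not done:
            obs, reward, done, _ = env.step(policy(obs))
            total += reward
        returns.append(total)
    return sum(returns) / len(returns)  # average return of the target policy
```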
-
Confidence Bounds
The sample bootstrap has been a popular method to gain insight into a population distribution from a smaller sample. Bootstrap methods are particularly popular for A/B testing, and we can borrow some ideas from that field. Generally, a bootstrap estimate is obtained by resampling with replacement many times to generate a statistically relevant mean and confidence bound. Using this technique, we can determine the 95% confidence interval of the results from our section on environments. The table shows the bootstrap mean and 95% confidence bounds on our environment experiments. Confidence intervals can vary wildly between algorithms and environments. We find that TRPO and PPO are the most stable, with small confidence bounds from the bootstrap. In cases where confidence bounds are exceedingly large, it may be necessary to run more trials (i.e. increase the sample size).
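The bootstrap mean and 95% interval can be computed directly with NumPy. The sketch below uses a simple percentile bootstrap; the example returns are placeholders, not values from the paper.

```python
# Percentile-bootstrap mean and 95% confidence interval over per-trial returns.
import numpy as np

def bootstrap_ci(returns, n_boot=10000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    returns = np.asarray(returns)
    # Resample with replacement and record the mean of each resample.
    means = np.array([rng.choice(returns, size=len(returns), replace=True).mean()
                      for _ in range(n_boot)])
    lo, hi = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return returns.mean(), lo, hi

# Example with placeholder final returns from 5 trials:
print(bootstrap_ci([3212.0, 2890.0, 3105.0, 1500.0, 2980.0]))
```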
-
Power Analysis
Another method to determine if the sample size must be increased is bootstrap power analysis. If we use our sample and give it some uniform lift (for example, scaling uniformly by 1.25), we can run many bootstrap simulations and determine what percentage of the simulations result in statistically significant values with the lift. If there is a small percentage of significant values, a larger sample size is needed (more trials must be run). We do this across all environment experiment trial runs and indeed find that, in more unstable settings, the bootstrap power percentage leans towards insignificant results in the lift experiment. Conversely, in stable trials (e.g. TRPO on Hopper-v1) with a small sample size, the lift experiment shows that no more trials are needed to generate significant comparisons. These results are provided in the supplemental material.
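A hedged sketch of such a lift-based power check is shown below. It is a simplified variant (counting significant t-tests on resampled data rather than the exact procedure used here), with placeholder inputs.

```python
# Simplified bootstrap power check: apply a uniform lift to the sample,
# resample both versions, and count how often a significance test detects
# the lift. A low fraction suggests more trials are needed.
import numpy as np
from scipy import stats

def bootstrap_power(returns, lift=1.25, n_sims=5000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    returns = np.asarray(returns)
    lifted = returns * lift
    hits = 0
    for _ in range(n_sims):
        a = rng.choice(returns, size=len(returns), replace=True)
        b = rng.choice(lifted, size=len(lifted), replace=True)
        _, p = stats.ttest_ind(a, b, equal_var=False)
        hits += (p < alpha)
    return hits / n_sims  # fraction of simulations that detect the lift

# e.g. bootstrap_power([3212.0, 2890.0, 3105.0, 1500.0, 2980.0])
```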
-
Significance
An important factor when deciding which RL algorithm to use is the significance of the reported gains based on a given metric. Several works have investigated the use of significance metrics to assess the reliability of reported evaluation metrics in ML. However, few works in RL assess the significance of reported metrics. Based on our experimental results, which indicate that algorithm performance can vary wildly based simply on perturbations of random seeds, it is clear that some metric is necessary for assessing the significance of algorithm performance gains and the confidence of reported metrics. While more research and investigation is needed to determine the best metrics for assessing RL algorithms, we investigate an initial set of metrics based on results from ML.
In supervised learning, the k-fold t-test, corrected resampled t-test, and other significance metrics have been discussed for comparing ML results. However, the assumptions pertaining to the underlying data with corrected metrics do not necessarily apply in RL, and further work is needed to investigate proper corrected significance tests for RL. Nonetheless, we explore several significance measures which give insight into whether a novel algorithm is truly performing as the state-of-the-art. We consider the simple 2-sample t-test (sorting all final evaluation returns across N random trials with different random seeds); the Kolmogorov-Smirnov test; and bootstrap percent differences with 95% confidence intervals. All calculated metrics can be found in the supplemental material. Generally, we find that the significance values match what is to be expected.
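For reference, the three measures above are available in SciPy and NumPy; the sketch below applies them to placeholder final-return arrays for two hypothetical algorithms.

```python
# Sketch of the significance measures discussed above, applied to final
# evaluation returns of two algorithms (placeholder arrays of 5 trials each).
import numpy as np
from scipy import stats

algo_a = np.array([3212.0, 2890.0, 3105.0, 1500.0, 2980.0])
algo_b = np.array([2400.0, 2550.0, 2300.0, 2700.0, 2450.0])

# 2-sample t-test on final returns across random seeds
t, p_t = stats.ttest_ind(algo_a, algo_b, equal_var=False)

# Kolmogorov-Smirnov test on the two return distributions
ks, p_ks = stats.ks_2samp(algo_a, algo_b)

# Bootstrap percent difference of means with a 95% interval
rng = np.random.default_rng(0)
pct = []
for _ in range(10000):
    a = rng.choice(algo_a, size=len(algo_a), replace=True)
    b = rng.choice(algo_b, size=len(algo_b), replace=True)
    pct.append(100.0 * (a.mean() - b.mean()) / b.mean())
lo, hi = np.percentile(pct, [2.5, 97.5])
print(p_t, p_ks, (lo, hi))
```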
5 Discussion and Conclusion
We find that both intrinsic sources of non-determinism (e.g. random seeds, environment properties) and extrinsic sources (e.g. hyperparameters, codebases) can contribute to difficulties in reproducing baseline algorithms. Moreover, we find that the highly varied results due to intrinsic sources bolster the need for proper significance analysis. We propose several such methods and show their value on a subset of our experiments.
What recommendations can we draw from our experiments?
Based on our experimental results and investigations, we can provide some general recommendations.
-
Hyperparameters can have significantly different effects across algorithms and environments. Thus it is important to find the working set which at least matches the original reported performance of baseline algorithms through standard hyperparameter searches.
-
Similarly, new baseline algorithm implementations used for comparison should match the original codebase results if available.
-
Overall, due to the high variance across trials and random seeds of RL algorithms, many trials must be run with different random seeds when comparing performance. Unless random seed selection is explicitly part of the algorithm, averaging multiple runs over different random seeds gives insight into the population distribution of the algorithm performance on an environment.
-
Similarly, due to these effects, it is important to perform proper significance testing to determine if the higher average returns are in fact representative of better performance.