Paper 18: Recurrent Experience Replay in Distributed Reinforcement Learning (R2D2)
Abstract
- In this paper we investigate the training of RNN-based RL agents from distributed prioritized experience replay.
- We study the effects of parameter lag resulting in representational drift and recurrent state staleness, and empirically derive an improved training strategy.
- Using a single network architecture and fixed set of hyper-parameters, the resulting agent, Recurrent Replay Distributed DQN, quadruples the previous state of the art.
1 Introduction
The earliest successes in RL leveraged experience replay for data efficiency and stacked a fixed number of consecutive frames to overcome the partial observability in Atari 2600 games. However, with progress towards increasingly difficult, partially observable domains, the need for more advanced memory-based representations increases, necessitating more principled solutions such as RNNs. The use of LSTMs within RL has been widely adopted to overcome partial observability.
In this paper we investigate the training of RNNs with experience replay. We have three primary contributions.
- First, we demonstrate the effect of experience replay on parameter lag, leading to representational drift and recurrent state staleness. This is potentially exacerbated in the distributed training setting, and ultimately results in diminished training stability and performance.
- Second, we perform an empirical study into the effects of several approaches to RNN training with experience replay, mitigating the aforementioned effects.
- Third, we present an agent that integrates these findings to achieve significant advances in the state of the art on Atari-57 and matches the state of the art on DMLab-30.
2 Background
2.1 Reinforcement Learning
We model the environment as a Partially Observable Markov Decision Process (POMDP) given by the tuple (S, A, T, R, Ω, O), where S is the set of states, A the set of actions, T the transition function, and R the reward function. Ω gives the set of observations potentially received by the agent, and O is the observation function mapping (unobserved) states to probability distributions over observations.
Within this framework, the agent receives an observation o ∈ Ω, which may only contain partial information about the underlying state s ∈ S. When the agent takes an action a ∈ A, the environment responds by transitioning to state s' ∼ T(·|s, a) and giving the agent a new observation o' ∼ O(·|s') and reward r ∼ R(s, a).
Although there are many approaches to RL in POMDPs, we focus on using RNNs with backpropagation through time (BPTT) to learn a representation that disambiguates the true state of the POMDP.
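As a concrete (hypothetical) illustration of how a recurrent state lets an agent act on its observation-action history, here is a minimal acting loop; rnn_cell, q_head, params, and the environment callables are stand-ins for illustration, not the paper's implementation:

```python
import numpy as np

OBS_DIM, HIDDEN_DIM, NUM_ACTIONS = 32, 64, 4  # toy sizes, chosen for illustration

def rnn_cell(obs, h, params):
    """Toy recurrent update h' = tanh(obs @ W_o + h @ W_h); stands in for an LSTM."""
    return np.tanh(obs @ params["W_o"] + h @ params["W_h"])

def q_head(h, params):
    """Linear read-out from the recurrent state to a Q-value vector."""
    return h @ params["W_q"]

def act_episode(env_reset, env_step, params, max_steps=1000, eps=0.01):
    """Epsilon-greedy acting: the recurrent state h summarizes the trajectory so far,
    which is what allows the agent to disambiguate partially observed states."""
    obs = env_reset()
    h = np.zeros(HIDDEN_DIM)                # recurrent state is reset at episode start
    trajectory = []
    for _ in range(max_steps):
        h = rnn_cell(obs, h, params)
        q = q_head(h, params)
        a = int(np.argmax(q)) if np.random.rand() > eps else np.random.randint(NUM_ACTIONS)
        next_obs, reward, done = env_step(a)
        trajectory.append((obs, a, reward))
        obs = next_obs
        if done:
            break
    return trajectory
```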
DQN trains a CNN to represent a value function with Q-learning, from data continuously collected into a replay buffer. A3C uses an LSTM and is trained directly on the online stream of experience without a replay buffer. The two can be combined, as in DRQN, which adds an LSTM on top of the DQN architecture.
2.2 Distributed Reinforcement Learning
Recent advances in reinforcement learning have achieved significantly improved performance by leveraging distributed training architectures which separate learning from acting, collecting data from many actors running in parallel on separate environment instances.
2.3 The Recurrent Replay Distributed DQN Agent
We propose a new agent, the Recurrent Replay Distributed DQN (R2D2), and use it to study the interplay between recurrent state, experience replay, and distributed training. R2D2 is most similar to Ape-X: it is built upon prioritized distributed replay and n-step double Q-learning (with n = 5), generating experience with a large number of actors (typically 256) and learning from batches of replayed experience with a single learner. Like Ape-X, we use the dueling network architecture, but provide an LSTM layer after the convolutional stack.
Instead of regular (s, a, r, s') transition tuples, we store fixed-length (m = 80) sequences of (s, a, r) in replay, with adjacent sequences overlapping each other by 40 time steps and never crossing episode boundaries. When training, we unroll both online and target networks on the same sequence of states to generate value estimates and targets. We leave the details of our exact treatment of recurrent states in replay for the next sections.
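A minimal sketch of the sequence-storage scheme described above (fixed length m = 80, overlap of 40 steps, never crossing episode boundaries); the function name and the handling of short episode tails are my own choices, not the paper's:

```python
def make_replay_sequences(episode, seq_len=80, overlap=40):
    """Cut a single episode (a list of (s, a, r) steps) into fixed-length
    sequences that overlap by `overlap` steps; sequences never cross episode
    boundaries because each episode is processed independently.
    Episodes shorter than seq_len yield one short sequence, and any leftover
    tail beyond the last full window is dropped here for simplicity."""
    stride = seq_len - overlap
    sequences = []
    for start in range(0, max(len(episode) - seq_len, 0) + 1, stride):
        sequences.append(episode[start:start + seq_len])
    return sequences
```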
We do not clip rewards, but instead use an invertible value-function rescaling of the form

h(x) = sign(x)(√(|x| + 1) − 1) + εx,

with a small constant ε, which results in the following n-step targets for the Q-value function:

ŷ_t = h( Σ_{k=0}^{n−1} γ^k r_{t+k} + γ^n h⁻¹( Q(s_{t+n}, a*; θ⁻) ) ),  where a* = argmax_a Q(s_{t+n}, a; θ).
Here, θ− denotes the target network parameters which are copied from the online network parameters θ every 2500 learner steps.
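A sketch of the rescaling and target computation, assuming the common form of the invertible rescaling with a small regularizer ε (the exact value of ε is a hyper-parameter assumption here, not confirmed by this section):

```python
import numpy as np

EPS = 1e-3  # assumed small regularizer for the rescaling; treat as a hyper-parameter

def h(x):
    """Invertible rescaling h(x) = sign(x)(sqrt(|x| + 1) - 1) + eps * x."""
    return np.sign(x) * (np.sqrt(np.abs(x) + 1.0) - 1.0) + EPS * x

def h_inv(y):
    """Closed-form inverse of h for the eps-regularized form above."""
    root = (np.sqrt(1.0 + 4.0 * EPS * (np.abs(y) + 1.0 + EPS)) - 1.0) / (2.0 * EPS)
    return np.sign(y) * (root ** 2 - 1.0)

def n_step_target(rewards, q_next_online, q_next_target, gamma=0.997, n=5):
    """n-step double-Q target in rescaled space:
    y_t = h( sum_k gamma^k r_{t+k} + gamma^n * h_inv(Q_target(s_{t+n}, a*)) ),
    with a* = argmax_a Q_online(s_{t+n}, a).
    `rewards` holds r_t ... r_{t+n-1}; q_next_* are Q-vectors at s_{t+n}."""
    a_star = int(np.argmax(q_next_online))
    bootstrap = h_inv(q_next_target[a_star])
    returns = sum((gamma ** k) * r for k, r in enumerate(rewards[:n]))
    return h(returns + (gamma ** n) * bootstrap)
```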
Our replay prioritization differs from that of Ape-X in that we use a mixture of max and mean absolute n-step TD-errors δ_i over the sequence: p = η max_i δ_i + (1 − η) δ̄. We set η and the priority exponent to 0.9. This more aggressive scheme is motivated by our observation that averaging over long sequences tends to wash out large errors, thereby compressing the range of priorities and limiting the ability of prioritization to pick out useful experience.
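The sequence priority itself is a direct transcription of the mixed max/mean formula (the function name is mine):

```python
import numpy as np

def sequence_priority(td_errors, eta=0.9):
    """p = eta * max_i |delta_i| + (1 - eta) * mean_i |delta_i| over the
    absolute n-step TD-errors of one replay sequence."""
    abs_delta = np.abs(np.asarray(td_errors, dtype=np.float64))
    return eta * abs_delta.max() + (1.0 - eta) * abs_delta.mean()
```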
Finally, we used a discount of γ = 0.997, and disabled the loss-of-life-as-episode-end heuristic that has sometimes been used in Atari agents. A full list of hyper-parameters is provided in the Appendix.
We train the R2D2 agent with a single GPU-based learner, performing approximately 5 network updates per second (each update on a mini-batch of 64 length-80 sequences), and each actor performing ∼ 260 environment steps per second on Atari (∼ 130 per second on DMLab).
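For reference, the settings quoted in this section collected in one place (this is not the full hyper-parameter list from the Appendix, and the dictionary layout is mine):

```python
R2D2_SETTINGS = {
    "num_actors": 256,                 # typical actor count
    "replay_sequence_length": 80,      # m
    "replay_sequence_overlap": 40,
    "burn_in_length": 40,              # or 20 in some experiments (Section 3)
    "n_step": 5,
    "discount_gamma": 0.997,
    "batch_size": 64,                  # length-80 sequences per learner update
    "target_update_interval": 2500,    # learner steps between target-network copies
    "priority_eta": 0.9,
    "priority_exponent": 0.9,
    "reward_clipping": None,           # rewards are rescaled via h, not clipped
}
```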
3 Training Recurrent RL Agents with Experience Replay
In order to achieve good performance in a partially observed environment, an RL agent requires a state representation that encodes information about its state-action trajectory in addition to its current observation. The most common way to achieve this is by using an RNN, typically an LSTM, as part of the agent's state encoding. To train an RNN from replay and enable it to learn meaningful long-term dependencies, whole state-action trajectories need to be stored in replay and used for training the network. DRQN compared two strategies of training an LSTM from replayed experience:
- Using a zero start state to initialize the network at the beginning of sampled sequences.
The zero start state strategy's appeal lies in its simplicity, and it allows independent, decorrelated sampling of relatively short sequences, which is important for robust optimization of a neural network. On the other hand, it forces the RNN to learn to recover meaningful predictions from an atypical initial recurrent state ('initial recurrent state mismatch'), which may limit its ability to fully rely on its recurrent state and learn to exploit long temporal correlations.
- Replaying whole episode trajectories.
The second strategy avoids the problem of finding a suitable initial state, but creates a number of practical, computational, and algorithmic issues due to varying and potentially environment-dependent sequence lengths, and higher variance of network updates because of the highly correlated nature of states in a trajectory when compared to training on randomly sampled batches of experience tuples.
In DRQN, the two strategies showed little difference in empirical agent performance on a set of Atari games, and the authors therefore opted for the simpler zero start state strategy.
In R2D2, to address these issues, we propose and evaluate two strategies for training an RNN from randomly sampled replay sequences, which can be used individually or in combination:
- Stored state: Storing the recurrent state in replay and using it to initialize the network at training time. This partially remedies the weakness of the zero start state strategy; however, it may suffer from the effect of 'representational drift' leading to 'recurrent state staleness', as the stored recurrent state generated by a sufficiently old network could differ significantly from a typical state produced by a more recent version.
- Burn-in: Allowing the network a 'burn-in period' by using a portion of the replay sequence only for unrolling the network and producing a start state, and updating the network only on the remaining part of the sequence. We hypothesize that this allows the network to partially recover from a poor start state (zero, or stored but stale) and find itself in a better initial state before being required to produce accurate outputs. A sketch of both strategies follows this list.
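A minimal sketch of how the two strategies change the training-time unroll, reusing the hypothetical rnn_cell / q_head stand-ins from the earlier sketch; gradient handling is omitted (a real implementation would stop gradients through the burn-in portion):

```python
import numpy as np

def training_unroll(obs_seq, stored_h, hidden_dim, rnn_cell, q_head, params,
                    burn_in=40, use_stored_state=True):
    """Two-phase unroll over a replayed sequence.
    Start state: either the stored recurrent state or zeros.
    Burn-in: the first `burn_in` observations only warm up the recurrent state
    (no loss would be computed on them). The TD loss is then computed on the
    Q-values returned for the remaining steps."""
    h = stored_h if use_stored_state else np.zeros(hidden_dim)
    for obs in obs_seq[:burn_in]:               # burn-in portion
        h = rnn_cell(obs, h, params)
    q_values = []
    for obs in obs_seq[burn_in:]:               # training portion
        h = rnn_cell(obs, h, params)
        q_values.append(q_head(h, params))
    return np.stack(q_values), h
```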
In all our experiments we will be using the proposed agent architecture from Section 2.3 with replay sequences of length m = 80, with an optional burn-in prefix of l = 40 or 20 steps. Our aim is to assess the negative effects of representational drift and recurrent state staleness on network training, and how they are mitigated by the different training strategies. For that, we will compare the Q-values produced by the network on sampled replay sequences when unrolled using one of these strategies with the Q-values produced when using the true stored recurrent states at each step (see the following figure, showing the different sources for the hidden state).
More formally, let o_t, . . . , o_{t+m} and h_t, . . . , h_{t+m} denote the replay sequence of observations and stored recurrent states, and denote by h_{t+1} = h(o_t, h_t; θ) and q(h_t; θ) the recurrent state and Q-value vector output by the recurrent neural network with parameter vector θ, respectively. We write ĥ_t for the hidden state used during training, initialized under one of the above strategies (either ĥ_t = 0 or ĥ_t = h_t). Then ĥ_{t+i} = h(o_{t+i−1}, ĥ_{t+i−1}; θ̂) is computed by unrolling the network with parameters θ̂ on the sequence o_t, . . . , o_{t+l+m−1}.
We estimate the impact of representational drift and recurrent state staleness by their effect on the Q-value estimates, measuring the discrepancy between the Q-values produced from the stored recurrent states and those produced by the training-time unroll.
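One plausible way to turn this comparison into a number (my own instantiation, not necessarily the paper's exact definition): compare the Q-vector obtained from the stored recurrent state with the one obtained from the training-time unroll at the same step, normalized so the measure is comparable across environments with different value scales:

```python
import numpy as np

def q_discrepancy(q_from_stored_state, q_from_training_unroll):
    """Normalized L2 distance between two Q-vectors for the same time step:
    one computed from the stored recurrent state, one from the unroll that
    starts from a zero or stored (possibly stale) initial state."""
    scale = max(np.abs(q_from_stored_state).max(), 1e-8)  # guard against all-zero Q
    return np.linalg.norm(q_from_stored_state - q_from_training_unroll) / scale
```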