Paper 19: Deep Recurrent Q-Learning for Partially Observable MDPs (DRQN)

Abstract

DRL controllers have limited memory and rely on being able to perceive the complete game screen at each decision point.
This article investigates the effects of adding recurrency to a DQN by replacing the first post-convolutional fully-connected layer with a recurrent LSTM.
The resulting DRQN, although capable of seeing only a single frame at each timestep, successfully integrates information through time.
Given the same length of history, recurrency is a viable alternative to stacking a history of frames in the DQN’s input layer and while recurrency confers no systematic advantage when learning to play the game, the recurrent net can better adapt at evaluation time if the quality of observations changes.

1 Introduction

DQN are limited in the sense that they learn a mapping from a `limited number of past states, or game screens` in the case of Atari 2600. Thus DQN will be unable to master games that require the player to remember events more distant than four screens in the past. Put differently, any game that requires a memory of more than four frames will `appear non-Markovian` because the future game states (and rewards) depend on more than just DQN’s current input. Instead of a Markov Decision Process (MDP), the game becomes a `Partially-Observable Markov Decision Process (POMDP)`.

We observe that DQN’s performance declines when given incomplete state observations and hypothesize that DQN may be modified to better deal with POMDPs by leveraging advances in RNNs. Therefore we `introduce the DRQN, a combination of a LSTM and a DQN`. Crucially, we demonstrate that DRQN is `capable of handing partial observability`, and that when trained with full observations and evaluated with partial observations, DRQN better handles the loss of information than does DQN. Thus, `recurrency confers benefits as the quality of observations degrades`.

Our experiments show that `adding recurrency to Deep Q-Learning` allows the Q-network network to `better estimate the underlying system state`, narrowing the gap between Q(o, a|θ) and Q(s, a|θ). Stated differently, recurrent deep Q-networks can better approximate actual Q-values from sequences of observations, leading to `better policies in partially observed environments`.

2 DRQN Architecture

To isolate the effects of recurrency, we minimally modify the architecture of DQN, `replacing only its first fully connected layer with a recurrent LSTM layer of the same size`. Depicted in the following figure, the architecture of DRQN takes a single 84 × 84 preprocessed image. This image is processed by three convolutional layers and the outputs are fed to the fully connected LSTM layer. Finally, a linear layer outputs a Q-Value for each action. During training, the parameters for both the convolutional and recurrent portions of the network are learned jointly from scratch.

DRQN convolves `three times over a single-channel` image of the game screen.

3 Stable Recurrent Updates

Updating a recurrent, convolutional network requires each backward pass to contain many time-steps of game screens and target values. Additionally, the `LSTM’s initial hidden state may either be zeroed or carried forward from its previous values`. We consider two types of updates:

Bootstrapped Sequential Updates

Episodes are selected randomly from the replay memory and `updates begin at the beginning of the episode and proceed forward through time to the conclusion of the episode`. The targets at each time-step are generated from the target Q-network, Qˆ. `The RNN’s hidden state is carried forward throughout the episode`.

Bootstrapped Random Updates

Episodes are selected randomly from the replay memory and `updates begin at random points in the episode and proceed for only unroll iterations time-steps` (e.g. one backward call). The targets at each time-step are generated from the target Q-network, Qˆ. `The RNN’s initial state is zeroed at the start of the update`.

Sequential updates have the advantage of carrying the LSTM’s hidden state forward from the beginning of the episode. However, by sampling experiences sequentially for a full episode, they `violate DQN’s random sampling policy`.

Random updates better adhere to the policy of randomly sampling experience, but, as a consequence, the LSTM’s hidden state must be zeroed at the start of each update. `Zeroing the hidden state makes it harder for the LSTM to learn functions that span longer time scales than the number of time-steps reached by back propagation through time`.

Experiments indicate that `both types of updates are viable` and yield convergent policies with similar performance across a set of games. Therefore, to limit complexity, all results herein `use the randomized update strategy`.

Having addressed the architecture and updating of a DRQN, we now show `how it performs on domains featuring partial observability`.

4 Atari Games: MDP or POMDP?

DQN infers the full state of an Atari game by expanding the state representation to encompass the last four game screens. Many games that were previously POMDPs now become MDPs.

Since the explored games are fully observable given four input frames, we need a way to `introduce partial observability without reducing the number of input frames` given to DQN.

5 Flickering Atari Games

To address this problem, we introduce the Flickering Pong POMDP - a modification to the classic game of Pong such that `at each time-step, the screen is either fully revealed or fully obscured with probability p = 0.5`. Obscuring frames in this manner probabilistically induces an incomplete memory of observations needed for Pong to become a POMDP.

In order to succeed at the game of Flickering Pong, it is necessary to `integrate information across frames to estimate relevant variables` such as the location and velocity of the ball and the location of the paddle. Since half of the frames are obscured in expectation, a successful player must be robust to the possibility of several potentially contiguous obscured inputs.

Perhaps the most important opportunity presented by a history of game screens is the ability to `convolutionally detect object velocity`. We have a figure to visualize the game screens maximizing the activations of different convolutional filters and confirms that the `10-frame DQN’s filters do detect object velocity`, though perhaps less reliably than normal unobscured Pong.

Remarkably, DRQN performs well at this task even when given only one input frame per time-step. `With a single frame it is impossible for DRQN’s convolutional layers to detect any type of velocity`. Instead, the higher-level recurrent layer must compensate（补偿）for both the flickering game screen and the lack of convolutional velocity detection. `Individual units in the LSTM layer are capable of integrating noisy single-frame information through time to detect high-level Pong events`.

DRQN is trained using backpropagation through time for the last ten time-steps. Thus both the non-recurrent 10-frame DQN and the recurrent 1-frame DRQN have access to the same history of game screens. Thus, when dealing with partial observability, a choice exists between using a non-recurrent deep network with a long history of observations or using a recurrent network trained with a single observation at each time-step. The results in this section show that `recurrent networks can integrate information through time and serve as a viable alternative to stacking frames in the input layer of a convoluational network`.

Abstract

DRL controllers have limited memory and rely on being able to perceive the complete game screen at each decision point.

This article investigates the effects of adding recurrency to a DQN by replacing the first post-convolutional fully-connected layer with a recurrent LSTM.

The resulting DRQN, although capable of seeing only a single frame at each timestep, successfully integrates information through time.

1 Introduction

2 DRQN Architecture

DRQN convolves three times over a single-channel image of the game screen.

3 Stable Recurrent Updates

Updating a recurrent, convolutional network requires each backward pass to contain many time-steps of game screens and target values. Additionally, the LSTM’s initial hidden state may either be zeroed or carried forward from its previous values. We consider two types of updates:

Bootstrapped Sequential Updates

Bootstrapped Random Updates

Sequential updates have the advantage of carrying the LSTM’s hidden state forward from the beginning of the episode. However, by sampling experiences sequentially for a full episode, they violate DQN’s random sampling policy.

Experiments indicate that both types of updates are viable and yield convergent policies with similar performance across a set of games. Therefore, to limit complexity, all results herein use the randomized update strategy.

Having addressed the architecture and updating of a DRQN, we now show how it performs on domains featuring partial observability.

4 Atari Games: MDP or POMDP?

DQN infers the full state of an Atari game by expanding the state representation to encompass the last four game screens. Many games that were previously POMDPs now become MDPs.

Since the explored games are fully observable given four input frames, we need a way to introduce partial observability without reducing the number of input frames given to DQN.

5 Flickering Atari Games

DRQN convolves `three times over a single-channel` image of the game screen.

Updating a recurrent, convolutional network requires each backward pass to contain many time-steps of game screens and target values. Additionally, the `LSTM’s initial hidden state may either be zeroed or carried forward from its previous values`. We consider two types of updates:

Sequential updates have the advantage of carrying the LSTM’s hidden state forward from the beginning of the episode. However, by sampling experiences sequentially for a full episode, they `violate DQN’s random sampling policy`.

Experiments indicate that `both types of updates are viable` and yield convergent policies with similar performance across a set of games. Therefore, to limit complexity, all results herein `use the randomized update strategy`.

Having addressed the architecture and updating of a DRQN, we now show `how it performs on domains featuring partial observability`.

Since the explored games are fully observable given four input frames, we need a way to `introduce partial observability without reducing the number of input frames` given to DQN.