Paper 11: Deterministic Policy Gradient Algorithms (DPG)
Abstract
- We consider deterministic policy gradient algorithms for reinforcement learning with continuous actions.
- The deterministic policy gradient is the expected gradient of the action-value function, so it can be estimated much more efficiently than the usual stochastic policy gradient.
- To ensure adequate exploration, we introduce an off-policy actor-critic algorithm that learns a deterministic target policy from an exploratory behaviour policy.
1 Introduction
We show that the deterministic policy gradient does indeed exist, and furthermore it has a simple model-free form that simply follows the gradient of the action-value function.
In the stochastic case, the policy gradient integrates over both state and action spaces, whereas in the deterministic case it only integrates over the state space. As a result, computing the stochastic policy gradient may require more samples, especially if the action space has many dimensions.
To ensure that our deterministic policy gradient algorithms continue to explore satisfactorily, we introduce an off-policy learning algorithm. The basic idea is to choose actions according to a stochastic behaviour policy (to ensure adequate exploration), but to learn about a deterministic target policy (exploiting the efficiency of the deterministic policy gradient).
We use the deterministic policy gradient to derive an off-policy actor-critic algorithm that estimates the action-value function using a differentiable function approximator, and then updates the policy parameters in the direction of the approximate action-value gradient.
We also introduce a notion of compatible function approximation for deterministic policy gradients, to ensure that the approximation does not bias the policy gradient.
Our algorithms require no more computation than prior methods: the computational cost of each update is linear in the action dimensionality and the number of policy parameters.
2 Background
- Preliminaries
We denote the density at state s' after transitioning for t time steps from state s by p(s → s', t, π). We also denote the (improper) discounted state distribution by

ρπ(s') := ∫S Σ_{t=1}^∞ γ^(t−1) p1(s) p(s → s', t, π) ds.

We can then write the performance objective as an expectation,

J(πθ) = ∫S ρπ(s) ∫A πθ(a|s) r(s, a) da ds = E_{s∼ρπ, a∼πθ}[r(s, a)],

where E_{s∼ρ}[·] denotes the (improper) expected value with respect to the discounted state distribution ρ(s).
- Stochastic Policy Gradient Theorem
The basic idea behind policy gradient algorithms is to adjust the parameters θ of the policy in the direction of the performance gradient ∇θJ(πθ). The fundamental result underlying these algorithms is the policy gradient theorem,

∇θJ(πθ) = ∫S ρπ(s) ∫A ∇θπθ(a|s) Qπ(s, a) da ds = E_{s∼ρπ, a∼πθ}[∇θ log πθ(a|s) Qπ(s, a)].
Despite the fact that the state distribution ρπ(s) depends on the policy parameters, the policy gradient does not depend on the gradient of the state distribution.
The policy gradient theorem has been used to derive a variety of policy gradient algorithms, by forming a sample-based estimate of this expectation. One issue that these algorithms must address is how to estimate the action-value function Qπ(s, a). Perhaps the simplest approach is to use a sample return to estimate the value of Qπ(st, at), which leads to a variant of the REINFORCE algorithm.
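As a concrete illustration, below is a minimal sample-based estimate of this gradient for a Gaussian policy with linear mean, using sampled returns in place of Qπ(st, at) (a REINFORCE-style estimator). The policy form, function names, and hyperparameters are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch: sample-based stochastic policy gradient (REINFORCE-style)
# for a Gaussian policy a ~ N(theta^T s, sigma^2). Illustrative only.
import numpy as np

def score(theta, s, a, sigma=0.5):
    """grad_theta log pi_theta(a|s) for the Gaussian policy above."""
    return (a - theta @ s) / sigma**2 * s

def reinforce_gradient(theta, episode, gamma=0.99, sigma=0.5):
    """Estimate grad_theta J(pi_theta) from one episode of (s, a, r) tuples,
    using the sampled return G_t as the estimate of Q^pi(s_t, a_t)."""
    grad = np.zeros_like(theta)
    G = 0.0
    # Accumulate returns backwards so each G is the discounted return from time t.
    for s, a, r in reversed(episode):
        G = r + gamma * G
        grad += score(theta, s, a, sigma) * G
    return grad
```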
- Stochastic Actor-Critic Algorithms
For the critic, in general, substituting a function approximator Qw(s, a) for the true action-value function Qπ(s, a) may introduce bias. However, if the function approximator is compatible, such that i) Qw(s, a) = ∇θ log πθ(a|s)ᵀw, and ii) the parameters w are chosen to minimize the mean-squared error ε²(w) = E_{s∼ρπ, a∼πθ}[(Qw(s, a) − Qπ(s, a))²], then there is no bias:

∇θJ(πθ) = E_{s∼ρπ, a∼πθ}[∇θ log πθ(a|s) Qw(s, a)].
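A small sketch of what condition i) looks like in practice, reusing the Gaussian policy from the previous snippet: the critic is linear in the score features ∇θ log πθ(a|s), and w is fit by least squares against sampled returns as an empirical stand-in for minimizing the mean-squared error in ii). This is an illustrative construction, not code from the paper.

```python
import numpy as np

def score(theta, s, a, sigma=0.5):
    """Compatible features phi(s, a) = grad_theta log pi_theta(a|s)
    for the Gaussian policy a ~ N(theta^T s, sigma^2)."""
    return (a - theta @ s) / sigma**2 * s

def fit_compatible_critic(theta, samples, returns, sigma=0.5):
    """Fit Q^w(s, a) = phi(s, a)^T w by least squares against sampled returns,
    an empirical stand-in for the mean-squared-error condition ii)."""
    Phi = np.stack([score(theta, s, a, sigma) for s, a in samples])
    w, *_ = np.linalg.lstsq(Phi, np.asarray(returns, dtype=float), rcond=None)
    return w   # critic value: score(theta, s, a, sigma) @ w
```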
- Off-Policy Actor-Critic
It is often useful to estimate the policy gradient off-policy, from trajectories sampled from a distinct behaviour policy β(a|s) ≠ πθ(a|s). In an off-policy setting, the performance objective is typically modified to be the value function of the target policy, averaged over the state distribution of the behaviour policy,

Jβ(πθ) = ∫S ρβ(s) Vπ(s) ds = ∫S ∫A ρβ(s) πθ(a|s) Qπ(s, a) da ds.

Differentiating this objective and applying an approximation gives the off-policy policy gradient,

∇θJβ(πθ) ≈ E_{s∼ρβ, a∼β}[(πθ(a|s) / β(a|s)) ∇θ log πθ(a|s) Qπ(s, a)].

This approximation drops a term that depends on the action-value gradient ∇θQπ(s, a).
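A per-sample sketch of this approximation for the same Gaussian target policy, with actions drawn from a Gaussian behaviour policy: the sample is reweighted by the density ratio πθ(a|s)/β(a|s) and multiplied by the score and a critic estimate of Qπ(s, a). All names and distributional choices here are illustrative assumptions.

```python
import numpy as np

def gaussian_logpdf(a, mean, sigma):
    return -0.5 * ((a - mean) / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))

def off_policy_pg_sample(theta, s, a, q_estimate, beta_mean, sigma=0.5, beta_sigma=0.5):
    """One-sample term of the off-policy gradient approximation:
    (pi_theta(a|s) / beta(a|s)) * grad_theta log pi_theta(a|s) * Q(s, a),
    where the action a was sampled from the behaviour policy beta."""
    log_ratio = (gaussian_logpdf(a, theta @ s, sigma)
                 - gaussian_logpdf(a, beta_mean, beta_sigma))
    score = (a - theta @ s) / sigma**2 * s
    return np.exp(log_ratio) * score * q_estimate
```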
3 Gradients of Deterministic Policies
We now consider how the policy gradient framework may be extended to deterministic policies. Our main result is a deterministic policy gradient theorem. We provide several ways to derive and understand this result.
- First we provide an informal intuition behind the form of the deterministic policy gradient.
- We then give a formal proof of the deterministic policy gradient theorem from first principles.
- Finally, we show that the deterministic policy gradient theorem is in fact a limiting case of the stochastic policy gradient theorem.
3.1 Action-Value Gradients
In continuous action spaces, we move the policy in the direction of the gradient of Q, rather than globally maximizing Q. Each state suggests a different direction of policy improvement; these may be averaged together by taking an expectation with respect to the state distribution ρµ(s),

θ^{k+1} = θ^k + α E_{s∼ρ^{µ^k}}[∇θ Q^{µ^k}(s, µθ(s))].
By applying the chain rule, we see that the policy improvement may be decomposed into the gradient of the action-value with respect to actions and the gradient of the policy with respect to the policy parameters:

θ^{k+1} = θ^k + α E_{s∼ρ^{µ^k}}[∇θµθ(s) ∇a Q^{µ^k}(s, a)|_{a=µθ(s)}].

By convention, ∇θµθ(s) is a Jacobian matrix such that each column is the gradient ∇θ[µθ(s)]d of the dth action dimension of the policy with respect to the policy parameters θ. However, by changing the policy, different states are visited and the state distribution ρµ will change. As a result, it is not immediately obvious that this approach guarantees improvement without taking account of the change to the distribution. However, the theory below shows that, as with the stochastic policy gradient theorem, there is no need to compute the gradient of the state distribution, and that the intuitive update outlined above follows precisely the gradient of the performance objective.
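The sketch below makes this chain-rule update concrete for a linear deterministic policy µθ(s) = Θs and a hand-picked quadratic Q whose action gradient is known in closed form; in that case ∇θµθ(s) ∇aQ(s, a) reduces to an outer product. Both the policy form and the Q function are illustrative assumptions.

```python
import numpy as np

def mu(Theta, s):
    """Linear deterministic policy mu_theta(s) = Theta @ s (Theta: action_dim x state_dim)."""
    return Theta @ s

def grad_a_Q(s, a):
    """Toy action-value gradient: Q(s, a) = -0.5 * ||a - s[:len(a)]||^2,
    so grad_a Q = -(a - s[:len(a)]). Assumes action_dim <= state_dim."""
    return -(a - s[:len(a)])

def action_value_gradient_step(Theta, states, alpha=0.1):
    """theta^{k+1} = theta^k + alpha * E_s[ grad_theta mu_theta(s) grad_a Q(s, a)|_{a=mu(s)} ].
    For mu(s) = Theta @ s, row d of Theta only affects action dimension d,
    so the Jacobian-vector product is the outer product of grad_a Q with s."""
    update = np.zeros_like(Theta)
    for s in states:
        update += np.outer(grad_a_Q(s, mu(Theta, s)), s)
    return Theta + alpha * update / len(states)
```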
3.2 Deterministic Policy Gradient Theorem
We write the performance objective as an expectation again,

J(µθ) = ∫S ρµ(s) r(s, µθ(s)) ds = E_{s∼ρµ}[r(s, µθ(s))].

We now provide the deterministic policy gradient theorem, the analogue of the stochastic policy gradient theorem:

∇θJ(µθ) = ∫S ρµ(s) ∇θµθ(s) ∇a Qµ(s, a)|_{a=µθ(s)} ds = E_{s∼ρµ}[∇θµθ(s) ∇a Qµ(s, a)|_{a=µθ(s)}].
3.3 Limit of the Stochastic Policy Gradient
The deterministic policy gradient is indeed a special (limiting) case of the stochastic policy gradient
- variance parameter σ = 0.
4 Deterministic Actor-Critic Algorithms
We now use the deterministic policy gradient theorem to derive both on-policy and off-policy actor-critic algorithms.
4.1 On-Policy Deterministic Actor-Critic
In general, behaving according to a deterministic policy will not ensure adequate exploration and may lead to sub-optimal solutions.
Like the stochastic actor-critic, the deterministic actor-critic consists of two components: an actor that adjusts the policy parameters and a critic that estimates the action-value function. In the following on-policy deterministic actor-critic algorithm, the critic uses Sarsa updates to estimate the action-value function Qw(s, a) ≈ Qµ(s, a):

δt = rt + γ Qw(st+1, at+1) − Qw(st, at)
wt+1 = wt + αw δt ∇w Qw(st, at)
θt+1 = θt + αθ ∇θµθ(st) ∇a Qw(st, at)|_{a=µθ(s)}
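Below is a minimal single-transition sketch of these updates, with the linear policy from the earlier sketch and a linear-in-features critic Qw(s, a) = wᵀφ(s, a). The feature map, step sizes, and the choice a_{t+1} = µθ(s_{t+1}) are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

def mu(Theta, s):                                   # actor: mu_theta(s) = Theta @ s
    return Theta @ s

def q_features(s, a):                               # illustrative critic features
    return np.concatenate([s, a, np.outer(a, s).ravel()])

def q_value(w, s, a):
    return w @ q_features(s, a)

def grad_a_q(w, s, a):
    """grad_a Q^w(s, a) for Q^w = w_s.s + w_a.a + a^T W s with the features above
    (independent of the value of a for this linear feature map)."""
    S, A = len(s), len(a)
    return w[S:S + A] + w[S + A:].reshape(A, S) @ s

def on_policy_dac_update(Theta, w, s, a, r, s_next,
                         gamma=0.99, alpha_w=1e-2, alpha_theta=1e-3):
    """One on-policy deterministic actor-critic step with a Sarsa critic."""
    a_next = mu(Theta, s_next)                       # next action from the current policy
    delta = r + gamma * q_value(w, s_next, a_next) - q_value(w, s, a)   # Sarsa TD error
    w = w + alpha_w * delta * q_features(s, a)                          # critic update
    a_mu = mu(Theta, s)
    Theta = Theta + alpha_theta * np.outer(grad_a_q(w, s, a_mu), s)     # actor update
    return Theta, w
```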
4.2 Off-Policy Deterministic Actor-Critic
We now consider off-policy methods that learn a deterministic target policy µθ(s) from trajectories generated by an arbitrary stochastic behaviour policy π(s, a). As before, we modify the performance objective to be the value function of the target policy, averaged over the state distribution of the behaviour policy,

Jβ(µθ) = ∫S ρβ(s) Vµ(s) ds = ∫S ρβ(s) Qµ(s, µθ(s)) ds.

Differentiating this objective, and dropping a term that depends on ∇θQµ(s, a) (as in the stochastic off-policy case), gives the off-policy deterministic policy gradient,

∇θJβ(µθ) ≈ E_{s∼ρβ}[∇θµθ(s) ∇a Qµ(s, a)|_{a=µθ(s)}].
We now develop an actor-critic algorithm that updates the policy in the direction of the off-policy deterministic policy gradient. In the following off-policy deterministic actor-critic (OPDAC) algorithm, the critic uses Q-learning updates to estimate the action-value function:

δt = rt + γ Qw(st+1, µθ(st+1)) − Qw(st, at)
wt+1 = wt + αw δt ∇w Qw(st, at)
θt+1 = θt + αθ ∇θµθ(st) ∇a Qw(st, at)|_{a=µθ(s)}
Because the deterministic policy gradient removes the integral over actions, we can avoid importance sampling in the actor; and by using Q-learning, we can avoid importance sampling in the critic.
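The OPDAC update differs from the on-policy sketch above only in where the actions come from: the transition action a may be drawn from any stochastic behaviour policy, and the critic bootstraps from Qw(s', µθ(s')) rather than from the action actually taken. The snippet below reuses mu, q_features, q_value and grad_a_q from the previous sketch; the Gaussian exploration noise is an illustrative choice of behaviour policy, not prescribed by the paper.

```python
import numpy as np

# Reuses mu, q_features, q_value, grad_a_q from the on-policy sketch above.

def opdac_update(Theta, w, s, a, r, s_next,
                 gamma=0.99, alpha_w=1e-2, alpha_theta=1e-3):
    """One off-policy deterministic actor-critic (OPDAC) step with a Q-learning critic."""
    target_a = mu(Theta, s_next)                                         # greedy w.r.t. mu
    delta = r + gamma * q_value(w, s_next, target_a) - q_value(w, s, a)  # Q-learning TD error
    w = w + alpha_w * delta * q_features(s, a)                           # critic update
    a_mu = mu(Theta, s)
    Theta = Theta + alpha_theta * np.outer(grad_a_q(w, s, a_mu), s)      # actor: DPG step
    return Theta, w

def behaviour_action(Theta, s, noise_scale=0.3):
    """Illustrative behaviour policy: target policy plus Gaussian exploration noise."""
    return mu(Theta, s) + noise_scale * np.random.standard_normal(Theta.shape[0])
```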
4.3 Compatible Function Approximation
In general, substituting an approximate Qw(s, a) into the deterministic policy gradient will not necessarily follow the true gradient (nor indeed will it necessarily be an ascent direction at all). Instead, we find a class of compatible function approximators Qw(s, a) such that the gradient ∇aQµ(s, a) can be replaced by ∇aQw(s, a) without affecting the deterministic policy gradient.
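The compatible form in the paper is Qw(s, a) = (a − µθ(s))ᵀ∇θµθ(s)ᵀw + Vv(s), so that ∇aQw(s, a) = ∇θµθ(s)ᵀw regardless of the baseline Vv(s). The sketch below instantiates this for the linear policy µθ(s) = Θs used in the earlier snippets, with a linear baseline Vv(s) = vᵀs; the linear forms are illustrative assumptions.

```python
import numpy as np

def mu(Theta, s):
    return Theta @ s

def compatible_q(Theta, w, v, s, a):
    """Compatible critic Q^w(s, a) = (a - mu_theta(s))^T grad_theta mu_theta(s)^T w + V^v(s).
    For mu_theta(s) = Theta @ s, grad_theta mu_theta(s)^T w reduces to W @ s,
    where W is w reshaped to Theta's shape; V^v(s) = v^T s is an illustrative baseline."""
    W = w.reshape(Theta.shape)
    return (a - mu(Theta, s)) @ (W @ s) + v @ s

def grad_a_compatible_q(Theta, w, s):
    """grad_a Q^w(s, a) = grad_theta mu_theta(s)^T w, independent of a and of the
    baseline, so the actor update direction depends only on w."""
    return w.reshape(Theta.shape) @ s
```

Because the baseline Vv(s) drops out of ∇aQw(s, a), it does not affect the actor update and can be fit with any state-value method.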