In Q-Learning we learn a Q-function that (approximately) satisfies the Bellman Optimality Equation. This is most often achieved by minimizing the Mean Squared Bellman Error (MSBE) as the loss function. The Q-function is then used to obtain a policy, e.g. by greedily selecting the action with the highest Q-value in each state.
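In standard notation (a sketch of the usual formulation, with $\gamma$ the discount factor and $\bar\theta$ the parameters of a frozen target network, as commonly used in practice), the Bellman Optimality Equation and the MSBE loss over transitions $(s, a, r, s')$ read:

$$ Q^*(s, a) = \mathbb{E}_{s'}\left[\, r + \gamma \max_{a'} Q^*(s', a') \,\right] $$

$$ L(\theta) = \mathbb{E}_{(s, a, r, s')}\left[\left( Q_\theta(s, a) - \big(r + \gamma \max_{a'} Q_{\bar\theta}(s', a')\big) \right)^{2}\right] $$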
Policy Gradient methods directly try to maximize the expected return by taking small steps in the direction of the policy gradient, i.e. the gradient of the expected return with respect to the policy parameters.
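In its simplest (REINFORCE) form, for a stochastic policy $\pi_\theta$ with expected return $J(\theta)$, the policy gradient can be written as

$$ \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\, \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, R(\tau) \right] $$

where $\tau$ is a trajectory generated by running $\pi_\theta$ in the environment and $R(\tau)$ is its return.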
On vs. Off-Policy
The policy gradient is derived as an expectation over trajectories (s_1, a_1, r_1, s_2, a_2, r_2, ..., s_T, a_T, r_T), which is estimated by a sample mean, as sketched below. To get an unbiased estimate of the gradient, the trajectories have to be sampled from the current policy. Thus, policy gradient methods are on-policy.
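A minimal sketch of that sample-mean estimate, assuming a hypothetical `policy` object that maps a batch of states to a `torch.distributions` distribution over actions:

```python
import torch

def reinforce_loss(policy, trajectories, gamma=0.99):
    """Sample-mean estimate of the policy gradient objective (REINFORCE).

    `trajectories` is a list of (states, actions, rewards) tensors that were
    collected with the *current* policy; reusing data from an older policy
    would bias the gradient estimate.
    """
    per_trajectory_losses = []
    for states, actions, rewards in trajectories:
        # Discounted return of the whole trajectory.
        discounts = gamma ** torch.arange(len(rewards), dtype=torch.float32)
        trajectory_return = (discounts * rewards).sum()
        # log pi_theta(a_t | s_t) for every step of the trajectory.
        log_probs = policy(states).log_prob(actions)
        # Maximizing E[R] corresponds to minimizing -sum_t log pi(a_t|s_t) * R(tau).
        per_trajectory_losses.append(-log_probs.sum() * trajectory_return)
    # Averaging over trajectories gives the sample-mean gradient estimate.
    return torch.stack(per_trajectory_losses).mean()
```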
Q-Learning only requires the Q-function to satisfy the Bellman equation, and this equation has to hold for every transition (s, a, r, s'), regardless of which policy generated it. Therefore, Q-learning can also reuse experiences collected by previous policies and is off-policy.
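This is what makes experience replay possible. A minimal replay-buffer sketch (class and method names are illustrative, not from any particular library):

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores transitions (s, a, r, s', done) from any past policy.

    Because the Bellman equation must hold for every valid transition,
    Q-learning can train on samples drawn from this buffer even though
    they were generated by older policies.
    """

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling; prioritized variants also exist.
        return random.sample(self.buffer, batch_size)
```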
Stability and Sample Efficiency
By directly optimizing the return, and thus the actual performance on a given task, Policy Gradient methods tend to converge more stably to a good behavior. However, being on-policy makes them very sample inefficient: data collected with an old policy cannot be reused after the policy is updated.
Q-learning finds a function that approximately satisfies the Bellman equation, but satisfying it does not guarantee near-optimal behavior, so training can fail or be unstable. Several tricks, such as experience replay and target networks (as in DQN), are used to improve convergence; when they work, Q-learning is substantially more sample efficient because it can reuse off-policy data.
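One of those stabilization tricks, the target network, keeps a lagged copy of the Q-network for computing Bellman targets. A sketch of the two common update schemes (function names are illustrative):

```python
import torch

def hard_update(target_net, online_net):
    """Periodically copy the online weights into the target network."""
    target_net.load_state_dict(online_net.state_dict())

def soft_update(target_net, online_net, tau=0.005):
    """Polyak averaging: the target network slowly tracks the online network."""
    with torch.no_grad():
        for target_param, online_param in zip(target_net.parameters(),
                                              online_net.parameters()):
            target_param.mul_(1.0 - tau).add_(online_param, alpha=tau)
```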