Why do regular Q-Learning and DQN overestimate the Q values?

1 Answer

Regular Q-learning and Deep Q-Networks (DQN) tend to overestimate Q-values because the bootstrapped target in their update rules takes a max over estimated action-values, and the same (noisy) estimates are used both to select the best next action and to evaluate it. Any estimation noise is therefore turned into a systematic positive bias, often called maximization bias.

In regular Q-learning, the update rule for the Q-value is:

Q(s, a) ← Q(s, a) + α [r + γ max_a' Q(s', a') - Q(s, a)]

where r is the immediate reward, γ is the discount factor, and α is the learning rate. The max_a' Q(s', a') term estimates the value of the best action in the next state s', and Q(s, a) estimates the value of the current state-action pair. Because the Q-estimates are noisy, taking the max over them tends to pick whichever action's estimate happens to be too high at that moment, so the expected value of the max exceeds the true maximum value. This positive bias is bootstrapped back into Q(s, a) and can compound over many updates.
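A quick way to see this bias is to simulate it. The following sketch is not from the original answer; it simply assumes zero-mean Gaussian estimation noise and shows that even when every action's true value is 0, the max over noisy estimates is positive on average:

import numpy as np

# Illustration of maximization bias: all true Q-values are 0, but the
# estimates carry zero-mean noise, so the max of the estimates is biased up.
rng = np.random.default_rng(0)
n_actions, n_trials, noise_std = 5, 10_000, 1.0

true_q = np.zeros(n_actions)
noisy_q = true_q + rng.normal(0.0, noise_std, size=(n_trials, n_actions))

print("max of true Q-values:            ", true_q.max())                 # 0.0
print("average max over noisy estimates:", noisy_q.max(axis=1).mean())   # roughly 1.16 > 0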

Similarly, in DQN, the update rule for the Q-value is:

Q(s, a) ← Q(s, a) + α [r + γ max_a' Q'(s', a') - Q(s, a)]

where Q' is the target network, a periodically updated copy of the online network used to compute the bootstrap target for the next state. In practice DQN minimizes a loss between Q(s, a) and the target r + γ max_a' Q'(s', a') over minibatches, but the structure of the target is the same. Because the max still selects and evaluates the next action with the same set of estimates, the maximization bias carries over, and function-approximation error and minibatch noise can amplify it.
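As a concrete illustration of that target, here is a minimal numpy sketch of how a batched DQN target is typically computed from the target network's Q-estimates; the array shapes and names are assumptions for this example, not code from the original answer:

import numpy as np

# Standard DQN target: max over the target network's own Q-estimates.
# q_target_next holds Q'(s', a) for every action, shape (batch, n_actions).
def dqn_target(rewards, q_target_next, dones, gamma=0.99):
    # The same network both selects and evaluates the next action here,
    # which is where the overestimation creeps in.
    max_next = q_target_next.max(axis=1)
    return rewards + gamma * (1.0 - dones) * max_next

rewards = np.array([1.0, 0.0])
dones = np.array([0.0, 1.0])
q_target_next = np.array([[0.2, 0.8, 0.5],
                          [0.1, 0.4, 0.3]])
print(dqn_target(rewards, q_target_next, dones))  # [1.792, 0.0]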

To address this overestimation issue, several modifications have been proposed. Double Q-learning (and its deep variant, Double DQN) keeps two sets of estimates and decouples action selection from action evaluation: one network picks the argmax action for the next state while the other evaluates that action, so a single network's upward errors are no longer self-reinforcing. Prioritized experience replay, which replays transitions with larger TD errors more often, is frequently combined with these methods, though it mainly improves sample efficiency rather than directly removing the bias. Together, such modifications have been shown to improve the stability and performance of Q-learning and DQN.
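For comparison with the DQN target above, here is a minimal sketch of the Double DQN target under the same assumed shapes and names: the online network's estimates choose the next action, and the target network's estimates evaluate it:

import numpy as np

# Double DQN target: select the action with the online network, evaluate it
# with the target network, breaking the select-and-evaluate coupling.
def double_dqn_target(rewards, q_online_next, q_target_next, dones, gamma=0.99):
    best_actions = q_online_next.argmax(axis=1)                        # selection
    evaluated = q_target_next[np.arange(len(rewards)), best_actions]   # evaluation
    return rewards + gamma * (1.0 - dones) * evaluated

rewards = np.array([1.0, 0.0])
dones = np.array([0.0, 1.0])
q_online_next = np.array([[0.3, 0.9, 0.1],
                          [0.2, 0.1, 0.6]])
q_target_next = np.array([[0.2, 0.8, 0.5],
                          [0.1, 0.4, 0.3]])
print(double_dqn_target(rewards, q_online_next, q_target_next, dones))  # [1.792, 0.0]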