Why would you use a Policy-based method instead of a Value-based method?

1 Answer


In value-based methods, we improve the Q function iteratively, and once we have the optimal Q function, we extract the optimal policy by selecting, in each state, the action with the maximum Q value.
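For instance, the extraction step can be written as a tiny sketch (the q_table array, its shape, and the function name below are purely illustrative, not from the answer):

```python
import numpy as np

# Hypothetical tabular Q function: q_table[s, a] = estimated Q value of
# taking action a in state s (sizes and values are illustrative only).
num_states, num_actions = 5, 3
q_table = np.random.rand(num_states, num_actions)

def greedy_policy(state):
    # Optimal policy extraction: pick the action with the maximum Q value.
    return int(np.argmax(q_table[state]))

print(greedy_policy(0))
```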

One disadvantage of value-based methods is that they are suitable only for discrete environments (environments with a discrete action space); we cannot apply value-based methods in continuous environments (environments with a continuous action space).

For example, say we are training an agent to drive a car, and say we have one continuous action in our action space. Let the action be the speed of the car, and let the speed range from 0 to 150 km/h.

In this case, we can discretize the continuous action into speed (0 to 10) as action 1, speed (10 to 20) as action 2, and so on. After discretization, we can compute the Q value of all possible state-action pairs. However, discretization is not always desirable: we might lose important information, and we might end up with an action space containing a huge set of actions.
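As a rough sketch of that discretization (the 10 km/h bin width is just the one from the example above; the function names are illustrative):

```python
import numpy as np

# Discretize the continuous speed range [0, 150] km/h into 15 bins of
# 10 km/h each: speed 0-10 maps to action 0, 10-20 to action 1, and so on.
bin_width = 10.0

def speed_to_action(speed_kmph):
    # Clip to the valid range, then map the continuous speed to a bin index.
    speed_kmph = np.clip(speed_kmph, 0.0, 150.0 - 1e-9)
    return int(speed_kmph // bin_width)

def action_to_speed(action):
    # Use the bin midpoint as the representative speed; precision is lost here.
    return action * bin_width + bin_width / 2

print(speed_to_action(37.5))  # -> 3
print(action_to_speed(3))     # -> 35.0 km/h
```

Note that with more continuous action dimensions (steering angle, braking force, and so on) the number of discrete combinations grows multiplicatively, which is exactly the "huge set of actions" problem mentioned above.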

Most real-world problems have continuous action spaces, say, a self-driving car or a robot learning to walk. Apart from being continuous, these action spaces are also often high-dimensional. Thus, Deep Q Networks and other value-based methods cannot deal with continuous action spaces effectively.

So, we use policy-based methods. With policy-based methods, we don't need to compute the Q function (Q values) to find the optimal policy; instead, we find the optimal policy by parameterizing the policy with some parameter θ. The basic idea is to find the best θ that produces the highest return.
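As a minimal illustration of what "parameterizing the policy with θ" can look like for a continuous action (the linear Gaussian policy and the names below are assumptions for the sketch, not a specific algorithm from the answer):

```python
import numpy as np

# Illustrative parameter vector theta: a linear mapping from state features
# to the mean of a Gaussian policy over one continuous action (e.g. speed).
state_dim = 4
theta = np.random.randn(state_dim)   # policy parameters to be optimized
log_std = 0.0                        # fixed exploration noise for the sketch

def sample_action(state):
    # pi_theta(a | s): Gaussian with mean theta^T s and std exp(log_std).
    mean = theta @ state
    return np.random.normal(mean, np.exp(log_std))

# Policy-gradient style methods then adjust theta in the direction that
# increases the expected return of the sampled trajectories.
state = np.ones(state_dim)
print(sample_action(state))
```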

In addition, most policy-based methods use a stochastic policy. With a stochastic policy, we select actions based on a probability distribution over the action space, which allows the agent to explore different actions instead of performing the same action every time. Thus, policy-based methods handle the exploration-exploitation trade-off implicitly through the stochastic policy.
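A small sketch of the difference, assuming a discrete action space with purely illustrative probabilities:

```python
import numpy as np

# A stochastic policy outputs a probability distribution over actions.
action_probs = np.array([0.7, 0.2, 0.1])   # pi(a | s) for 3 actions (illustrative)

# Sampling from the distribution: mostly action 0, but actions 1 and 2 are
# still tried occasionally, so exploration happens automatically.
sampled_action = np.random.choice(len(action_probs), p=action_probs)

# A deterministic (greedy) policy would always return the same action.
greedy_action = int(np.argmax(action_probs))

print(sampled_action, greedy_action)
```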

...