What is a Markov Decision Process?

1 Answer



A Markov decision process (MDP) is a mathematical framework for modeling decision-making in a dynamic system whose outcomes are partly random and partly controlled by a decision maker who makes sequential decisions over time. An MDP evaluates which actions the decision maker should take given the current state of the system and its environment.

MDPs rely on components such as the states of the environment, the agent’s actions, and rewards to determine the system’s next optimal action. They are commonly classified as finite or infinite and as discrete or continuous, depending on factors such as the sets of actions and available states and the frequency of decision-making.
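
To make these components concrete, here is a minimal sketch of an MDP written as plain Python data structures. The two-state, two-action model is invented purely for illustration; formally, an MDP is the tuple (S, A, P, R, gamma) of states, actions, transition probabilities, rewards, and a discount factor:

```python
# A minimal sketch of an MDP as plain Python data structures.
# The two-state, two-action model below is invented for illustration.

states = ["low", "high"]      # S: the set of states
actions = ["wait", "search"]  # A: the set of actions

# P[s][a] is a probability distribution over next states (rows sum to 1).
P = {
    "low":  {"wait":   {"low": 1.0},
             "search": {"low": 0.7, "high": 0.3}},
    "high": {"wait":   {"high": 1.0},
             "search": {"low": 0.4, "high": 0.6}},
}

# R[s][a] is the expected immediate reward for taking action a in state s.
R = {
    "low":  {"wait": 0.0, "search": -1.0},
    "high": {"wait": 1.0, "search": 5.0},
}

gamma = 0.9  # discount factor: how much future rewards count today
```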

MDPs have been around since the early 1950s. The name refers to the Russian mathematician Andrey Markov, who played a pivotal role in shaping the theory of stochastic processes. Early applications included inventory management and control, queueing optimization, and routing. Today, MDPs are used to study optimization problems via dynamic programming, as well as in robotics, automatic control, economics, and manufacturing.

In artificial intelligence, MDPs model sequential decision-making scenarios with probabilistic dynamics. They are used to design intelligent machines or agents that must operate for extended periods in environments where actions can yield uncertain results.

MDP models are especially popular in two sub-areas of AI: probabilistic planning and reinforcement learning (RL).

Probabilistic planning is the sub-area that uses a known model of the environment to accomplish an agent’s goals and objectives, guiding the agent toward the decisions that achieve them.
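
When the model is fully known, as in probabilistic planning, the optimal behavior can be computed directly rather than learned from experience. Below is a minimal value-iteration sketch that reuses the states, actions, P, R, and gamma defined in the earlier snippet; it repeatedly applies the Bellman optimality update until the state values converge, then reads off the greedy policy:

```python
def value_iteration(states, actions, P, R, gamma, tol=1e-6):
    """Compute optimal state values and a greedy policy for a known MDP."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # Bellman optimality: V(s) = max_a [ R(s,a) + gamma * E[V(s')] ]
            q_values = [
                R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())
                for a in actions
            ]
            best = max(q_values)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:  # values have stopped changing
            break
    # The optimal policy is greedy with respect to the converged values.
    policy = {
        s: max(actions, key=lambda a: R[s][a]
               + gamma * sum(p * V[s2] for s2, p in P[s][a].items()))
        for s in states
    }
    return V, policy

V, policy = value_iteration(states, actions, P, R, gamma)
print(policy)  # {'low': 'search', 'high': 'search'} for the toy numbers above
```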

Reinforcement learning, by contrast, does not assume a known model: the agent learns how to behave from the reward feedback it receives through interaction with the environment.

Let’s understand this through a real-life example:

Consider a hungry antelope in a wildlife sanctuary looking for food. It stumbles upon a spot with a mushroom on the right and a cauliflower on the left. If the antelope eats the mushroom, it receives water as a reward. However, if it opts for the cauliflower, the nearby lion’s cage opens and the lion is set free in the sanctuary. Over time, the antelope learns to choose the mushroom, as this choice yields a valuable reward.

In the above MDP example, two important elements exist: the agent and the environment. The agent here is the antelope, which acts as the decision-maker. The environment is the surroundings (the wildlife sanctuary) in which the antelope resides. As the agent performs different actions, different situations emerge; these situations are called states. For example, when the antelope performs the action of eating the mushroom, it receives the corresponding reward (water) and transitions to another state. The agent (antelope) repeats this process over time and learns the optimal action at each state.

In the context of the MDP, we can say the antelope has learned the optimal action to perform (eat the mushroom). It therefore avoids eating the cauliflower, as that action leads to an outcome that threatens its survival. The example illustrates why MDPs are essential for capturing the dynamics of RL problems.
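
Here is a sketch of how the antelope’s trial-and-error learning could look in code, using tabular Q-learning on a one-state toy version of the scenario (the action names and reward numbers are invented for illustration):

```python
import random

# Invented toy model: one decision state and two terminal outcomes.
actions = ["eat_mushroom", "eat_cauliflower"]
rewards = {"eat_mushroom": 1.0,       # water: a valuable reward
           "eat_cauliflower": -10.0}  # the lion is set free: heavy penalty

alpha, epsilon = 0.1, 0.2             # learning rate, exploration rate
Q = {a: 0.0 for a in actions}         # learned action values

for episode in range(500):
    # Epsilon-greedy: usually exploit the best-known action, sometimes explore.
    if random.random() < epsilon:
        a = random.choice(actions)
    else:
        a = max(Q, key=Q.get)
    r = rewards[a]                    # feedback from the environment
    # One-step Q-learning update; the episode ends here, so no bootstrap term.
    Q[a] += alpha * (r - Q[a])

print(Q)  # Q["eat_mushroom"] converges toward 1.0: the learned optimal action
```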

Finally, a state is Markov if and only if:

P[S_{t+1} | S_t] = P[S_{t+1} | S_1, S_2, ..., S_t]

This equation means that the next state depends only on the agent’s current state and not on any state prior to that. This Markov property is the reason the MDP formalism is used in reinforcement learning.
...