
Markov models


Markov models represent an important concept in the domain of pattern recognition and probabilistic modeling. These are statistical models that estimate the probability of transitioning from one state to another within a system. They are particularly well-suited for scenarios where the future state of a system depends only on its current state. This simple yet effective stochastic process has several applications involving sequences of events, such as speech recognition, handwriting recognition, weather forecasting, and stock price prediction.

In this topic, we will discuss the core principles and various types of Markov models, such as Markov Chains and Hidden Markov Models.

Elements of Markov models

Before we dive into an in-depth discussion of Markov models, let's first define the elements of these models. For better intuition, imagine predicting tomorrow's weather using past data like sunny, rainy, or cloudy days. In this scenario, the key components of Markov models are:

1. States: the distinct conditions or situations a system can be in.
   Example: sunny (S), rainy (R), cloudy (C).

2. Transition probability: the probability of moving from one state to another.
   Examples:
   • P(S → S) = 0.70 (probability of transitioning from a sunny day to another sunny day)
   • P(S → R) = 0.10 (probability of transitioning from a sunny day to a rainy day)
   • P(S → C) = 0.20 (probability of transitioning from a sunny day to a cloudy day)

3. Transition diagram: a graphical representation showing the states and the transitions between them.
   Example: see Fig. 1.

4. Emission probability: the probability of observing a specific outcome given a certain state.
   Example: let the current state be rainy (R). On a rainy day, people walk either with an umbrella (U) or without one (N), so the emission probabilities are:
   • P(U | R) = 0.80
   • P(N | R) = 0.20

Note:

  1. These transition probabilities are based on the analysis of historical weather patterns, such as the weather trends over the past few weeks, while taking into consideration the current season or other weather-related factors. These probabilities from a given state sum up to 1, reflecting all possible outcomes for the next day's weather.

  2. As for the emission probabilities, they're not derived from a mathematical computation in this case. Instead, they've been intuitively assigned for the purpose of illustration.
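To make these elements concrete, here is a minimal Python sketch (our own encoding, not part of the original example) that stores the sunny-day transition probabilities and rainy-day emission probabilities from the table and checks Note 1:

```python
# A minimal sketch encoding the table above as plain Python dictionaries.
# The names `transitions` and `emissions` are our own choices.
transitions = {
    "S": {"S": 0.70, "R": 0.10, "C": 0.20},  # transitions out of a sunny day
}
emissions = {
    "R": {"U": 0.80, "N": 0.20},  # what we observe on a rainy day
}

# Note 1: the transition probabilities out of a given state must sum to 1.
for state, probs in transitions.items():
    assert abs(sum(probs.values()) - 1.0) < 1e-9

print(transitions["S"]["R"])  # P(S -> R) = 0.1
```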

Functional principle of Markov models

At the heart of Markov models lies the concept of memorylessness, which posits that the future state of a process depends solely on its current state, disregarding the sequence of events that preceded it. This is known as the Markov property. This simplification allows Markov models to effectively analyze and predict patterns in complex systems, especially in situations where we have limited historical data.

To put this in layman's terms, Markov models predict what will happen next based on what is happening now. They work by assigning a probability to each possible transition between states, representing the likelihood of moving from one specific state to another.

Types of Markov models

Now, having discussed the fundamentals of Markov models, we can discuss their types. For the sake of simplicity, we will assume that our system is autonomous, i.e., it operates independently of external inputs. In this context, there can be two scenarios:

1. The system state is entirely observable, as with coin tosses or board games, where the outcome depends only on the current toss or current bet. These are modeled using Markov Chains.

2. The system state is only partially observable, as in the stock market, where observable prices are driven by underlying economic factors that cannot be observed directly. These are modeled using Hidden Markov Models.

Markov Chains

A Markov Chain is a simple Markov model that refers to a stochastic process with a discrete set of states and time steps. It is used to represent a system that transitions from one state to another in a sequential manner. It's important to note that although the terms Markov chain and Markov model are often used interchangeably, the Markov chain is more specific, referring to a simple, discrete-time stochastic process.

In more formal terms, let X be a stochastic process. X is a Markov chain if it satisfies the Markov property, which can be expressed as:

P(X_{t+1} = x | X_t = x_t, X_{t-1} = x_{t-1}, \dots, X_0 = x_0) = P(X_{t+1} = x | X_t = x_t) \qquad \text{[eq. 1]}

This means that the probability of the process being in a particular state x at time t+1 depends only on the state at time t.

State transition diagram

Now, a Markov chain can be described by one large transition matrix consisting of all transition probabilities, where each element indicates the likelihood of transitioning from one state to another. Let's extend the above example and create a state transition diagram and transition matrix. Since the states of the system in a Markov chain are directly observable and the transition probabilities between states are known (based on past weather patterns), we initialize our transition matrix as follows:

\begin{pmatrix} P(S \rightarrow S) & P(S \rightarrow C) & P(S \rightarrow R) \\ P(C \rightarrow S) & P(C \rightarrow C) & P(C \rightarrow R) \\ P(R \rightarrow S) & P(R \rightarrow C) & P(R \rightarrow R) \end{pmatrix} = \begin{pmatrix} 0.7 & 0.2 & 0.1 \\ 0.3 & 0.5 & 0.2 \\ 0.25 & 0.35 & 0.4 \end{pmatrix}

Extended state transition diagram
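To see how such a chain behaves, here is a short sketch (assuming NumPy; the variable names are ours) that propagates today's weather through the transition matrix and samples one possible week:

```python
import numpy as np

# The chain above: rows/columns ordered S, C, R, matching the matrix.
states = ["S", "C", "R"]
P = np.array([[0.70, 0.20, 0.10],   # from S
              [0.30, 0.50, 0.20],   # from C
              [0.25, 0.35, 0.40]])  # from R

# Starting from a sunny day, the distribution over tomorrow's weather is the
# S-row; the distribution n days ahead comes from repeated multiplication by P.
today = np.array([1.0, 0.0, 0.0])
print("tomorrow:", today @ P)                              # [0.7 0.2 0.1]
print("in 3 days:", today @ np.linalg.matrix_power(P, 3))

# Sampling one possible week of weather from the chain.
rng = np.random.default_rng(0)
state, week = 0, []
for _ in range(7):
    state = rng.choice(3, p=P[state])
    week.append(states[state])
print(" ".join(week))
```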

Hidden Markov Models

A Hidden Markov Model (HMM) is an extension of the basic Markov model that deals with hidden (or unobservable) states — hence the name. In many practical situations, the states cannot be observed directly. Instead, what we observe are certain outcomes that provide indirect information about these hidden states. HMMs are designed to handle such situations. Consider a scenario where we are not allowed to see the weather; instead, we're informed about whether or not people on the street are carrying umbrellas. These observations indirectly tell us about the hidden weather states.

So, how do we predict the weather then? We construct a simple HMM for it. To formally define an HMM, we must first identify its key components. They are:

1. Hidden States: These are the actual states of the model, which we cannot observe directly. A set of N states (≡ weather states):

Q = q_1 q_2 \dots q_N

2. Observations: These are the visible outputs or data influenced by the hidden states. A sequence of T observations (≡ umbrella or no umbrella):

O = o_1 o_2 \dots o_T

3. Transition Probabilities: These are the probabilities of transitioning from one hidden state to another, similar to a standard Markov model. Here, each a_{ij} represents the probability of transitioning from state q_i to state q_j.

A = a_{11} \dots a_{ij} \dots a_{NN}, \quad \text{where} \quad a_{ij} = P(X_{t+1} = q_j | X_t = q_i)

4. Initial Probabilities: These are the probabilities of the system being in each possible state at the start of the process, also known as prior probabilities.

\pi = \pi_1, \pi_2, \dots, \pi_N

5. Emission Probabilities: These are the probabilities that a particular observation will be generated from a hidden state.

B = b_i(o_t) = P(Y_t = o_t | X_t = q_i)

This represents the probability of observation o_t being generated from state q_i.

Now, with these properties defined, we can formally introduce an HMM as a 5-element tuple:

\lambda = (Q, O, A, B, \pi)

In an HMM, the hidden states and their transition and emission probabilities are learned from the observed data. Working with an HMM involves tackling three problems:

1. Likelihood (Problem 1): Calculate the probability of an observed sequence given the model parameters (A and B). This is solved using the Forward algorithm.

2. Decoding (Problem 2): Determine the best hidden state sequence given the observed data and the model parameters. This is solved using the Viterbi algorithm, a dynamic programming method for finding the most probable sequence of hidden states.

  3. Learning (Problem 3): Estimate the model parameters (A and B) from the observed data and the set of states. This is solved using the Expectation-Maximization (EM) algorithm, iteratively updating transition and emission probabilities to maximize the likelihood of the observed sequence.

HMM assumptions

We make two fundamental assumptions in an HMM because they make computation tractable and allow a simpler representation of the system. They are:

1. Markov Assumption (First-order Markov Chain): The probability of a particular state q_i depends only on its immediate preceding state q_{i-1}. This is written as:

P(q_i | q_1, q_2, \dots, q_{i-1}) = P(q_i | q_{i-1})

This assumption implies that the future state depends only on the current state, and not on the sequence of states that led to it.

2. Output Independence: The probability of an output observation o_i depends only on the state q_i that produced it, and is independent of all other states and observations. This can be written as:

P(o_i | q_1, \dots, q_i, \dots, q_T, o_1, \dots, o_i, \dots, o_T) = P(o_i | q_i)

This means the observed output at any point is influenced only by the current state.

Let us extend our earlier example of weather with three states: Sunny (S), Cloudy (C), and Rainy (R). Assume that we can't directly observe the weather but can observe whether people are carrying umbrellas.

  • Hidden States: Sunny (S), Cloudy (C), Rainy (R)

  • Observable States: Umbrella (U), No Umbrella (N)

  • Initial State Probabilities: These represent the likelihood of starting from each hidden state when the sequence begins. Since the occurrences of S, C, and R are equally likely, they share an equal probability.

I = \begin{bmatrix} P(S) \\ P(C) \\ P(R) \end{bmatrix} = \begin{bmatrix} \frac{1}{3} \\ \frac{1}{3} \\ \frac{1}{3} \end{bmatrix}
  • Transition Probabilities:

A = \begin{bmatrix} P(S|S) & P(C|S) & P(R|S) \\ P(S|C) & P(C|C) & P(R|C) \\ P(S|R) & P(C|R) & P(R|R) \end{bmatrix} = \begin{bmatrix} 0.7 & 0.2 & 0.1 \\ 0.3 & 0.5 & 0.2 \\ 0.25 & 0.35 & 0.4 \end{bmatrix}
  • Emission Probabilities:

    E = \begin{bmatrix} P(U|S) & P(N|S) \\ P(U|C) & P(N|C) \\ P(U|R) & P(N|R) \end{bmatrix} = \begin{bmatrix} 0.1 & 0.9 \\ 0.5 & 0.5 \\ 0.8 & 0.2 \end{bmatrix}
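With these numbers fixed, Problem 1 (likelihood) can already be illustrated. Below is a sketch of the Forward algorithm (assuming NumPy; the index encoding S, C, R = 0, 1, 2 and U, N = 0, 1 is our own) that computes the probability of an observation sequence under this model:

```python
import numpy as np

pi = np.array([1/3, 1/3, 1/3])        # initial probabilities for S, C, R
A = np.array([[0.70, 0.20, 0.10],     # rows: from S, C, R; columns: to S, C, R
              [0.30, 0.50, 0.20],
              [0.25, 0.35, 0.40]])
E = np.array([[0.10, 0.90],           # rows: S, C, R; columns: U, N
              [0.50, 0.50],
              [0.80, 0.20]])

def forward(obs):
    """Total probability of the observation sequence, summed over all hidden paths."""
    alpha = pi * E[:, obs[0]]         # alpha_1(i) = pi_i * b_i(o_1)
    for o in obs[1:]:
        alpha = (alpha @ A) * E[:, o] # alpha_{t+1}(j) = sum_i alpha_t(i) * a_ij * b_j(o)
    return alpha.sum()

print(forward([0, 1, 0]))             # P(U, N, U) under this model
```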

Now, a possible problem statement for this situation might be: if, over a three-day period, we observe the pattern "Umbrella, No Umbrella, Umbrella" (U, N, U), what is the most likely sequence of weather conditions? There are various possibilities, such as SSS (sunny-sunny-sunny), SSC, SSR, SCS, and so on. Let's consider one case for illustration — SCR. That is, we want to find the likelihood of the weather sequence being Sunny, Cloudy, Rainy (SCR) over these three days. To do this, we use the initial state probabilities, transition probabilities, and emission probabilities.

P(Y = \text{UNU}, X = \text{SCR}) = P(U|S) \cdot P(N|C) \cdot P(U|R) \cdot P(S) \cdot P(C|S) \cdot P(R|C)
= 0.1 \cdot 0.5 \cdot 0.8 \cdot \frac{1}{3} \cdot 0.2 \cdot 0.2
P(Y = \text{UNU}, X = \text{SCR}) \approx 0.000533
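A quick numeric check of this result, and of all 3³ = 27 candidate sequences at once, can be sketched as follows (same index encoding as above; assuming NumPy):

```python
import numpy as np
from itertools import product

pi = np.array([1/3, 1/3, 1/3])        # initial probabilities for S, C, R
A = np.array([[0.70, 0.20, 0.10],     # rows: from S, C, R; columns: to S, C, R
              [0.30, 0.50, 0.20],
              [0.25, 0.35, 0.40]])
E = np.array([[0.10, 0.90],           # rows: S, C, R; columns: U, N
              [0.50, 0.50],
              [0.80, 0.20]])
obs = [0, 1, 0]                       # the observed pattern U, N, U

def joint(x):
    """Joint probability P(Y = UNU, X = x) for a 3-step hidden sequence x."""
    p = pi[x[0]] * E[x[0], obs[0]]
    for t in (1, 2):
        p *= A[x[t - 1], x[t]] * E[x[t], obs[t]]
    return p

print(joint((0, 1, 2)))                                      # the SCR path: ~0.000533
print(max((joint(x), x) for x in product(range(3), repeat=3)))  # best of all 27
```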

By evaluating the probability of each combination, the weather sequence with the highest probability will be considered the most likely scenario given the observed events of umbrella usage (Umbrella, No Umbrella, Umbrella). Now, let's see this mathematically.

We are interested in finding the weather sequence X (over the hidden states S, C, R) that is most probable given the observed sequence Y (i.e., U, N, U). That is to say, we want to find:

\arg \max_{x_1, \dots, x_T} P(x_1, \dots, x_T | y_1, \dots, y_T)

Now, Bayes' theorem comes to our rescue: using the independence assumptions we made above, and noting that the denominator P(Y) does not depend on X and can therefore be dropped from the maximization, we deduce the following expression:

\arg \max_{x_1, \dots, x_T} P(x_1, \dots, x_T | y_1, \dots, y_T) = \arg \max_{x_1, \dots, x_T} \frac{P(Y|X) \times P(X)}{P(Y)} = \arg \max_{x_1, \dots, x_T} \prod_i P(y_i|x_i) \times P(x_i|x_{i-1})

The arg max gives us the particular sequence X for which the above expression is maximal. However, this naive approach is too expensive: the number of possible hidden state sequences grows exponentially with the length of the observation sequence. We therefore use the Viterbi algorithm, a dynamic programming approach sketched below, to find the most likely sequence of hidden states given the observations.
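Here is a minimal Viterbi sketch for this example (our own outline, assuming NumPy). It keeps, for each state, the log-probability of the best partial path ending there, plus back-pointers to recover the winning sequence:

```python
import numpy as np

states = ["S", "C", "R"]
pi = np.array([1/3, 1/3, 1/3])
A = np.array([[0.70, 0.20, 0.10],
              [0.30, 0.50, 0.20],
              [0.25, 0.35, 0.40]])
E = np.array([[0.10, 0.90],
              [0.50, 0.50],
              [0.80, 0.20]])

def viterbi(obs):
    """Most probable hidden state sequence for the observations (0 = U, 1 = N)."""
    T, N = len(obs), len(pi)
    delta = np.log(pi) + np.log(E[:, obs[0]])  # best log-prob ending in each state
    psi = np.zeros((T, N), dtype=int)          # back-pointers
    for t in range(1, T):
        scores = delta[:, None] + np.log(A)    # scores[i, j]: extend best path at i by i -> j
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + np.log(E[:, obs[t]])
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):              # follow back-pointers from the end
        path.append(int(psi[t][path[-1]]))
    return [states[i] for i in reversed(path)]

print(viterbi([0, 1, 0]))  # most likely weather sequence for U, N, U
```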

Conclusion

To summarize:

  • Markov models, such as Markov Chains and HMMs, simplify complex systems by assuming that future states depend only on the current state, making them effective for analyzing sequences of events.

  • Markov models can be categorized into Markov Chains for systems with fully observable states, and Hidden Markov Models for systems with partially observable states.

  • Working with HMMs involves three problems: computing the likelihood of an observation sequence (Forward algorithm), decoding the most likely hidden state sequence (Viterbi algorithm), and estimating transition and emission probabilities from data (EM algorithm).
