
Expectation-maximization algorithm


Real-world data science challenges diverge significantly from the scenarios encountered in training courses, Kaggle competitions, or online hackathons. We often imagine the data science process as simple: acquire the initial data, perform routine cleaning, engineer features, select a model from a library like scikit-learn or Keras, and start training.

However, the reality is far from this straightforward approach. Frequently, datasets contain missing or unobservable features, and addressing this issue plays a crucial role since identifying and inferring these missing features can significantly enhance the quality of the predictions.

This is where the Expectation-Maximization (EM) algorithm comes into play. EM is an iterative statistical method for estimating the parameters of probabilistic models, particularly when dealing with incomplete or hidden data.

How does the EM algorithm work?

The EM algorithm works in four steps (a short code sketch follows the list):

  1. Initialization: We start by initializing the parameters of the model. These parameters describe the underlying distributions or characteristics of the data. In cases where you have biased observations, such as biased coins, one of the parameters you might need to initialize is the bias. The bias represents the probability of a specific outcome, like getting heads when flipping a coin.

  2. Expectation (E-step): In the E-step, we calculate the expected values or probabilities of the hidden or unobservable variables given the current parameter estimates. If your model involves biased observations, we use the current bias estimates to compute how likely each observed outcome is.

  3. Maximization (M-step): In the M-step, we update the model parameters to maximize the expected complete-data likelihood calculated in the E-step. For models with biased observations, this step adjusts the bias parameter so that it best fits the data, given the expectations computed in the E-step.

  4. Convergence: The algorithm iteratively repeats the E-step and M-step until a convergence criterion is met. During each iteration, the bias parameter, along with other model parameters, is optimized to better fit the data. The algorithm continues until the parameters stabilize, indicating convergence.
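These four steps map naturally onto a small driver loop. Below is a minimal Python sketch of that loop, assuming the model-specific E-step and M-step are supplied as functions; the names run_em, e_step, and m_step are illustrative and not taken from any particular library.

```python
def run_em(params, e_step, m_step, tol=1e-6, max_iter=100):
    """Alternate E- and M-steps until the parameter estimates stabilize."""
    for _ in range(max_iter):
        expectations = e_step(params)       # E-step: expected values of the hidden variables
        new_params = m_step(expectations)   # M-step: parameters maximizing the expected likelihood
        if max(abs(n - p) for n, p in zip(new_params, params)) < tol:
            return new_params               # converged: estimates no longer change noticeably
        params = new_params
    return params                           # give up after max_iter iterations
```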

Here is a flowchart depicting the workings of this algorithm visually:

The EM algorithm flowchart

Understanding the EM algorithm through an example

The example in this section is adapted from https://ai.stanford.edu/~chuongdo/papers/em_tutorial.pdf

Let's explore an example using two biased coins to understand how the EM algorithm works. When we say biased, we mean that the probability of getting heads on a coin toss is $\theta_A$ for the first coin instead of 0.5, and $\theta_B$ for the second coin. Here are the results of several coin flips for these two biased coins:

Coin flips                Coin A      Coin B
H H T T H H T H T H                  6H, 4T
H H T H T H H H H H      8H, 2T
H T H H T H T T H H      6H, 4T
H T T T T T T H T T                  2H, 8T
T H T H T H T H T H      5H, 5T
Total                     19H, 11T    8H, 12T

From this, we can calculate that the probability of heads for Coin A is:

$\theta_A = 19/30 = 0.63$

and for Coin B it is:

$\theta_B = 8/20 = 0.4$
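As a quick sanity check on this arithmetic, here is a tiny Python snippet that computes both biases from the head and tail totals in the table; the variable names are my own.

```python
# Head/tail totals per coin, read off the table above
heads_a, tails_a = 19, 11   # Coin A: 30 flips in total
heads_b, tails_b = 8, 12    # Coin B: 20 flips in total

theta_a = heads_a / (heads_a + tails_a)   # 19/30 ≈ 0.63
theta_b = heads_b / (heads_b + tails_b)   # 8/20  = 0.40

print(f"theta_A = {theta_a:.2f}, theta_B = {theta_b:.2f}")
```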

But what if we had the following results for the coin flips instead?

Experiment 1:  H H T T H H T H T H
Experiment 2:  H H T H T H H H H H
Experiment 3:  H T H H T H T T H H
Experiment 4:  H T T T T T T H T T
Experiment 5:  T H T H T H T H T H

Here, we don't know which experiment resulted from Coin A and which from Coin B. So how do we calculate $\theta_A$ and $\theta_B$ then? As per the four steps we stated earlier, we will start with the first step: initialization. Let us assume that $\theta_A = 0.6$ and $\theta_B = 0.5$ in this case.

Now let us move on to the second step, expectation. For the first experiment, we have 6 heads and 4 tails. Hence the probability of such a result, if the 1st experiment belonged to Coin A, is:

$(0.6)^6 \times (0.4)^4 = 0.00119 \quad [P(\text{Heads}) = 0.6,\ P(\text{Tails}) = 0.4]$

and if it belonged to Coin B, the probability would be:

$(0.5)^6 \times (0.5)^4 \approx 0.00098 \quad [P(\text{Heads}) = 0.5,\ P(\text{Tails}) = 0.5]$

Next, we normalize these two probabilities using the following formula:

$P(\text{Coin A used for 1st experiment}) = \frac{P(\text{1st experiment belongs to Coin A})}{P(\text{1st experiment belongs to Coin A}) + P(\text{1st experiment belongs to Coin B})}$

We get the following probabilities:

$P(\text{Coin A used for 1st experiment}) = 0.55$

$P(\text{Coin B used for 1st experiment}) = 0.45$
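If you want to reproduce these numbers, the E-step for the first experiment is just a few lines of Python; this is a sketch under the initialization above, with illustrative variable names.

```python
theta_a, theta_b = 0.6, 0.5   # step 1: initial guesses for the two biases
heads, tails = 6, 4           # 1st experiment: 6 heads and 4 tails

# Likelihood of this outcome under each coin; the binomial coefficient is
# omitted because it cancels when we normalize.
like_a = theta_a ** heads * (1 - theta_a) ** tails   # ≈ 0.00119
like_b = theta_b ** heads * (1 - theta_b) ** tails   # ≈ 0.00098

# Normalize to get the probability that each coin produced this experiment.
p_a = like_a / (like_a + like_b)   # ≈ 0.55
p_b = like_b / (like_a + like_b)   # ≈ 0.45
print(round(p_a, 2), round(p_b, 2))
```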

Similarly, for the 2nd experiment, we have 8 heads and 2 tails, and so on. Once we calculate these probabilities for all five experiments and use them to weight each experiment's head and tail counts, we get the following expected counts (reproduced in the sketch after this list):

  • For Coin A, we have 15.1 H and 9.7 T

  • For Coin B, 11.9 H and 13.3 T
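These expected counts can be reproduced by repeating the calculation above for every experiment and accumulating the weighted heads and tails; here is a small sketch under the same initialization, again with illustrative names.

```python
experiments = [(6, 4), (8, 2), (6, 4), (2, 8), (5, 5)]  # (heads, tails) per experiment
theta_a, theta_b = 0.6, 0.5

heads_a = tails_a = heads_b = tails_b = 0.0
for h, t in experiments:
    like_a = theta_a ** h * (1 - theta_a) ** t
    like_b = theta_b ** h * (1 - theta_b) ** t
    w_a = like_a / (like_a + like_b)   # P(Coin A produced this experiment)
    heads_a += w_a * h
    tails_a += w_a * t
    heads_b += (1 - w_a) * h
    tails_b += (1 - w_a) * t

print(round(heads_a, 1), round(tails_a, 1))   # ≈ 15.1 and 9.7
print(round(heads_b, 1), round(tails_b, 1))   # ≈ 11.9 and 13.3
```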

Now, let us consider the 3rd step: maximization. In this step, we update $\theta_A$ and $\theta_B$ so that they best explain the expected counts computed in the E-step. Since the bias represents the probability of a head, the revised estimate for each coin is its expected number of heads divided by its expected total number of flips:

$\theta_A = \frac{\text{expected Heads from Coin A}}{\text{expected Heads from Coin A} + \text{expected Tails from Coin A}} = \frac{15.1}{15.1 + 9.7} \approx 0.61$

Similarly, for Coin B, $\theta_B = \frac{11.9}{11.9 + 13.3} \approx 0.47$. Now we switch back to the expectation step using these revised biases. After 10 such iterations, we will get $\theta_A = 0.65$ and $\theta_B = 0.41$.

As you can observe, these estimates are remarkably close to the values we calculated in the first version of the experiment, where we knew which coin produced each set of flips ($\theta_A = 0.63$ and $\theta_B = 0.4$).
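To tie the whole example together, here is a compact Python sketch of the full two-coin EM loop described above. It uses only plain Python, the variable names are my own, and the ten fixed iterations mirror the walkthrough rather than a formal convergence check.

```python
# (heads, tails) counts for the five experiments with unknown coin identities
experiments = [(6, 4), (8, 2), (6, 4), (2, 8), (5, 5)]

theta_a, theta_b = 0.6, 0.5   # step 1: initialization

for _ in range(10):
    # E-step: responsibility of each coin for each experiment, accumulated
    # into expected head/tail counts per coin.
    heads_a = tails_a = heads_b = tails_b = 0.0
    for h, t in experiments:
        like_a = theta_a ** h * (1 - theta_a) ** t
        like_b = theta_b ** h * (1 - theta_b) ** t
        w_a = like_a / (like_a + like_b)   # P(Coin A produced this experiment)
        w_b = 1.0 - w_a                    # P(Coin B produced this experiment)
        heads_a += w_a * h
        tails_a += w_a * t
        heads_b += w_b * h
        tails_b += w_b * t

    # M-step: each bias becomes expected heads / expected flips for that coin.
    theta_a = heads_a / (heads_a + tails_a)
    theta_b = heads_b / (heads_b + tails_b)

print(round(theta_a, 2), round(theta_b, 2))  # roughly 0.65 and 0.41, as quoted above
```

Printing theta_a and theta_b inside the loop would also reproduce the intermediate estimates from the walkthrough, roughly 0.61 and 0.47 after the first iteration.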

Advantages and disadvantages of the method

The Expectation-Maximization (EM) algorithm comes with its own set of strengths and limitations. Let's explore how EM can be useful while also being mindful of its potential challenges.

Advantages:

  • Handles incomplete data: EM is particularly effective when dealing with incomplete or missing data. It allows the algorithm to still learn patterns and make estimations even when parts of the data are unavailable.

  • Unsupervised learning: EM is useful for unsupervised learning tasks, such as clustering and density estimation. It identifies hidden structures and relationships in data without relying on labeled examples.

  • Flexibility with distributions: EM is flexible in accommodating various probability distributions for modeling the data. This makes it suitable for a wide range of applications, as it can handle both continuous and discrete data.

  • Probabilistic framework: The algorithm provides probabilistic estimates, giving a measure of uncertainty in its predictions. This is useful for understanding the confidence levels of the inferred hidden variables.

  • Global and local optima: While EM might converge to local optima, it can be less sensitive to initialization than some other optimization methods, and running it from different starting points can reveal different local optima, offering insights into alternative solutions.

Disadvantages:

  • Convergence to local optima: EM's iterative nature makes it susceptible to getting stuck in local optima, where the algorithm finds solutions that are not the best globally.

  • Sensitivity to initialization: Although EM is less sensitive to initialization than some other methods, the choice of initial parameters can still impact convergence and results.

  • Slow convergence: EM might converge slowly, especially when dealing with high-dimensional data. The algorithm requires multiple iterations, which can be time-consuming for large datasets.

  • Complexity of implementation: Implementing EM correctly can be complex, especially for certain applications. Debugging and fine-tuning the algorithm can require a deep understanding of its inner workings.

  • Assumption of distribution: EM assumes a certain parametric form for the data distribution. If the data doesn't fit this assumption well, the algorithm's performance might be compromised.

  • Model selection: Selecting an appropriate model or number of clusters/components can be challenging. The algorithm's effectiveness relies on making informed decisions about the underlying structure of the data.

  • Scalability: EM might not scale well to very large datasets, as each iteration involves computations for all data points.

Applications

The EM algorithm has numerous real-life applications and use cases; here are a few of them:

  1. The EM algorithm is often used in data clustering and computer vision, where it is employed to iteratively estimate parameters for image segmentation. It probabilistically assigns pixels to foreground and background regions based on a statistical model (a small clustering sketch follows this list).

  2. It is also used in natural language processing, where it is applied for tasks like topic modeling. Here, it iteratively estimates the distribution of topics in a document corpus while assigning words to topics, thus uncovering latent structures in the text data.

  3. The EM algorithm is used for parameter estimation in mixed models and quantitative genetics.

  4. It is used in psychometrics for estimating item parameters and latent abilities in item response theory models.

  5. Some other applications include medical image reconstruction, structural engineering, etc.
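As a concrete illustration of the clustering use case in the first item, here is a short sketch using scikit-learn's GaussianMixture, which fits a Gaussian mixture model with EM; it assumes NumPy and scikit-learn are installed, and the synthetic data and names are purely illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two synthetic 2-D clusters
data = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(100, 2)),
    rng.normal(loc=[3, 3], scale=0.5, size=(100, 2)),
])

gmm = GaussianMixture(n_components=2, random_state=0).fit(data)  # fitted with EM
labels = gmm.predict(data)        # hard cluster assignments
probs = gmm.predict_proba(data)   # soft responsibilities, analogous to the E-step weights
print(gmm.means_.round(1))        # should land near the two cluster centers
```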

Conclusion

Here is what we learned in this topic:

  • The Expectation-Maximization (EM) algorithm is an iterative statistical method used for estimating parameters in probabilistic models;

  • The EM algorithm works in four steps: initialization, expectation, maximization, and convergence;

  • The Expectation-Maximization algorithm is a useful tool for handling hidden variables and incomplete data.
