# EM algorithm correctness in context of pLSA.

My current research interest is topic modelling, and I’m trying to investigate concepts and proofs related to it. This evening I’ve managed to understand what is EM algorithm under the hood.

Disclaimer: the following text highly intersects with Wikipedia article about EM algorithm.

The EM algorithms emerges in following context. Let’s consider a statistical model that comprises unknown parameter , unobservable data and observable data , i.e. we are given a joint distribution . Assume we have to find maximum-likelihood estimate for given , i.e. we are to maximize . However, a formula for obtained by marginalizing joint distribution can be unsuitable to apply usual optimization methods. In pLSA is a set of all terms in all documents, while is a set of their topics and consists of topic-term matrix and document-topic matrix .

Fortunately, there’s a beautiful iterative approach to solve the problem. Since the word “iterative” was included in the name, one could guess that we need a rule to get better from worse , moreover this rule should be computationally feasible. And here a crazy idea comes: let’s find that maximizes expectation of likelihood under assumption that distribution of is defined by and : . I’d like to comment at this point: given and , what is *the best conj**ecture we can make *about distribution ? Nothing smarter that replace with is possible here. And *if we actually believe that the conjecture was right*, what is our next step? Right, maximize expected log-likelihood .

Surprisingly, it works. Let’s prove that the log-likelihood increases at each iteration. It can be rewritten in such a way:

We replace both sides by their expectation under the above mentioned assumption about distribution:

A step of EM-algorithm increases the minuend (since we maximize it) and decreases the subtrahend (since it’s minus Kullback-Leibler divergence). Therefore, likelihood increases at each step.

Finally, let’s apply at in the context of pLSA. Given one can get this probability for word in document to have topic :

Hence, an optimization target at each step will look like:

,

where is number of occurrences of word in document .

This optimization problem (including usual probability distribution constraints on , ) turns out to be very easy. Happy end 🙂

## Leave a Reply