May 25, 2013 / like2think

EM algorithm correctness in the context of pLSA.

My current research interest is topic modelling, and I'm trying to work through the concepts and proofs related to it. This evening I finally understood how the EM algorithm works under the hood.

Disclaimer: the following text overlaps heavily with the Wikipedia article on the EM algorithm.

The EM algorithm emerges in the following context. Consider a statistical model comprising an unknown parameter \Sigma, unobservable (latent) data Z and observable data X, i.e. we are given a joint distribution p(X,Z|\Sigma). Suppose we need a maximum-likelihood estimate of \Sigma given X, i.e. we are to maximize p(X|\Sigma). However, the formula for p(X|\Sigma) obtained by marginalizing the joint distribution may be unsuitable for the usual optimization methods. In pLSA, X is the set of all terms in all documents, Z is the set of their topics, and \Sigma consists of the topic-term matrix \phi_{t,w} and the document-topic matrix \theta_{d,t}.
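To make the difficulty concrete: writing n_{dw} for the number of occurrences of word w in document d (the notation used later in the post), the pLSA marginal likelihood is

p(X|\Sigma) = \prod\limits_{d,w} \left(\sum\limits_{t} \theta_{d,t} \phi_{t,w}\right)^{n_{dw}},

so \log p(X|\Sigma) contains a sum inside each logarithm, which rules out a simple closed-form maximization.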

Fortunately, there's a beautiful iterative approach to this problem. Since the word "iterative" appears in the description, one can guess that we need a rule for obtaining a better \Sigma^{t+1} from a worse \Sigma^{t}, and this rule should be computationally feasible. And here comes a crazy idea: let's find the \Sigma^{t+1} that maximizes the expectation of the log-likelihood \log p(X,Z|\Sigma^{t+1}) under the assumption that the distribution of Z is defined by X and \Sigma^t, i.e. is p(Z|X, \Sigma^{t}). A comment at this point: given X and \Sigma^{t}, what is the best conjecture we can make about the distribution p(Z|X,\Sigma)? Nothing smarter than replacing \Sigma with \Sigma^{t} is possible here. And if we actually believe this conjecture, what is our next step? Right: maximize the expected log-likelihood E_{Z|X, \Sigma^t}{\left\{\log p(X,Z|\Sigma^{t+1})\right\}}.
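To restate the iteration compactly (the standard E/M decomposition, spelled out here for reference):

E-step: compute p(Z|X, \Sigma^t)

M-step: \Sigma^{t+1} = \arg\max\limits_{\Sigma} E_{Z|X, \Sigma^t}{\left\{\log p(X,Z|\Sigma)\right\}}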

Surprisingly, it works. Let's prove that the log-likelihood increases at each iteration. Since p(X,Z|\Sigma) = p(X|\Sigma)\, p(Z|X,\Sigma), the log-likelihood can be rewritten as follows:

\log p(X|\Sigma) = \log p(X,Z|\Sigma) - \log p(Z|X,\Sigma)

Take the expectation of both sides under the above-mentioned assumption about the distribution of Z (the left-hand side does not depend on Z, so it is unchanged):

\log p(X|\Sigma) = E_{Z|X, \Sigma^t}{\left\{\log p(X,Z|\Sigma)\right\}} -    E_{Z|X, \Sigma^t}{\left\{\log p(Z|X,\Sigma)\right\}}

A step of the EM algorithm increases the minuend (since we maximize it over \Sigma) and does not increase the subtrahend (its change is minus a Kullback-Leibler divergence, as spelled out below). Therefore, the log-likelihood does not decrease at any step.
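Spelling the second claim out (a one-line application of Gibbs' inequality):

E_{Z|X, \Sigma^t}{\left\{\log p(Z|X,\Sigma^t)\right\}} - E_{Z|X, \Sigma^t}{\left\{\log p(Z|X,\Sigma^{t+1})\right\}} = KL\left(p(Z|X,\Sigma^t) \,\|\, p(Z|X,\Sigma^{t+1})\right) \ge 0,

so the subtrahend is maximized at \Sigma = \Sigma^t and cannot grow when we replace \Sigma^t with \Sigma^{t+1}.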

Finally, let's apply this in the context of pLSA. Given \Sigma^t=(\theta^t, \phi^t), one can compute the probability for word w in document d to have topic z:

p(z|w,d,\Sigma^t) = \frac{\theta^t_{d, z} \phi^t_{z,w}}{\sum\limits_{s} \theta^t_{d, s} \phi^t_{s,w}}
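This is just Bayes' rule applied to the pLSA factorization p(z,w|d,\Sigma^t) = \theta^t_{d,z}\, \phi^t_{z,w}:

p(z|w,d,\Sigma^t) = \frac{p(z,w|d,\Sigma^t)}{\sum\limits_{s} p(s,w|d,\Sigma^t)}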

Hence, the optimization target at each step looks like:

E_{Z|X, \Sigma^t}{\left\{\log p(X,Z|\Sigma^{t+1})\right\}} = \sum\limits_{dw} n_{dw}\sum\limits_{z} \frac{\theta^t_{dz} \phi^t_{zw}}{\sum\limits_{s} \theta^t_{ds} \phi^t_{sw}}\log(\theta^{t+1}_{dz} \phi^{t+1}_{zw}),

where n_{dw} is the number of occurrences of word w in document d.

This optimization problem (together with the usual probability-simplex constraints on \theta and \phi) turns out to be very easy, as shown below. Happy end 🙂
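For completeness, introducing Lagrange multipliers for the simplex constraints yields the familiar closed-form M-step updates (a standard result, restated here rather than derived in the post):

\theta^{t+1}_{dz} \propto \sum\limits_{w} n_{dw}\, p(z|w,d,\Sigma^t), \qquad \phi^{t+1}_{zw} \propto \sum\limits_{d} n_{dw}\, p(z|w,d,\Sigma^t),

normalized over z and over w respectively. Below is a minimal numpy sketch of the whole EM iteration for pLSA. It is my own illustration (the function name plsa_em and all variable names are made up), and it materializes the full document-word-topic responsibility tensor, so it is only meant for toy-sized data.

```python
import numpy as np

def plsa_em(n_dw, n_topics, n_iters=100, seed=0):
    """A toy EM loop for pLSA.

    n_dw    : (D, W) numpy array of term counts n_{dw}
    returns : theta (D, T) with rows p(t|d), phi (T, W) with rows p(w|t)
    """
    rng = np.random.default_rng(seed)
    D, W = n_dw.shape
    T = n_topics

    # Random initialization on the probability simplices.
    theta = rng.random((D, T)); theta /= theta.sum(axis=1, keepdims=True)
    phi = rng.random((T, W));   phi   /= phi.sum(axis=1, keepdims=True)

    for _ in range(n_iters):
        # E-step: responsibilities p(z|w,d,Sigma^t), shape (D, W, T).
        p_dwz = theta[:, None, :] * phi.T[None, :, :]
        p_dwz /= p_dwz.sum(axis=2, keepdims=True) + 1e-12

        # M-step: expected counts n_dw * p(z|w,d), then renormalize.
        expected = n_dw[:, :, None] * p_dwz
        theta = expected.sum(axis=1)                 # sum over words w
        theta /= theta.sum(axis=1, keepdims=True) + 1e-12
        phi = expected.sum(axis=0).T                 # sum over documents d
        phi /= phi.sum(axis=1, keepdims=True) + 1e-12

    return theta, phi
```

For example, plsa_em(n_dw, n_topics=5) on a small counts matrix returns stochastic matrices \theta and \phi whose product \theta\phi roughly reproduces the normalized term frequencies of the collection.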
