My current research interest is topic modelling, and I’m trying to investigate the concepts and proofs related to it. This evening I’ve managed to understand how the EM algorithm works under the hood.

Disclaimer: the following text overlaps heavily with the Wikipedia article about the EM algorithm.

The EM algorithm emerges in the following context. Consider a statistical model that comprises an unknown parameter $\Sigma$, unobservable data $Z$ and observable data $X$, i.e. we are given a joint distribution $p(X,Z|\Sigma)$. Assume we have to find a maximum-likelihood estimate of $\Sigma$ given $X$, i.e. we are to maximize $p(X|\Sigma)$. However, the formula for $p(X|\Sigma)$ obtained by marginalizing the joint distribution can be unsuitable for the usual optimization methods. In pLSA, $X$ is the set of all terms in all documents, $Z$ is the set of their topics, and $\Sigma$ consists of the topic-term matrix $\phi_{z,w}$ and the document-topic matrix $\theta_{d,z}$.

Fortunately, there’s a beautiful iterative approach to this problem. Since the approach is iterative, one could guess that we need a rule to get a better $\Sigma^{t+1}$ from a worse $\Sigma^{t}$; moreover, this rule should be computationally feasible. And here comes a crazy idea: let’s find the $\Sigma^{t+1}$ that maximizes the expectation of the likelihood $p(X,Z|\Sigma^{t+1})$ under the assumption that the distribution of $Z$ is defined by $X$ and $\Sigma^t$: $p(Z|X, \Sigma^{t})$. I’d like to comment at this point: given $X$ and $\Sigma^{t}$, what is the best conjecture we can make about the distribution $p(Z|X,\Sigma)$? Nothing smarter than replacing $\Sigma$ with $\Sigma^{t}$ is possible here. And if we actually believe that this conjecture is right, what is our next step? Right, maximize the expected log-likelihood $E_{Z|X, \Sigma^t}{\left\{\log p(X,Z|\Sigma^{t+1})\right\}}$.
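Schematically, the whole procedure is just this loop (a minimal Python sketch; `e_step` and `m_step` stand for the model-specific computations and are not taken from any library):

```python
def em(x, sigma, n_iters, e_step, m_step):
    # Alternate: compute p(Z | X, sigma^t), then maximize the expected
    # complete-data log-likelihood under that distribution.
    for _ in range(n_iters):
        posterior = e_step(x, sigma)     # E-step: p(Z | X, sigma^t)
        sigma = m_step(x, posterior)     # M-step: argmax over sigma of E[log p(X, Z | sigma)]
    return sigma
```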

Surprisingly, it works. Let’s prove that the log-likelihood does not decrease from one iteration to the next. It can be rewritten as follows:

$\log p(X|\Sigma) = \log p(X,Z|\Sigma) - \log p(Z|X,\Sigma)$

Now take the expectation of both sides with respect to the above-mentioned distribution of $Z$; the left side does not depend on $Z$, so it stays intact:

$\log p(X|\Sigma) = E_{Z|X, \Sigma^t}{\left\{\log p(X,Z|\Sigma)\right\}} - E_{Z|X, \Sigma^t}{\left\{\log p(Z|X,\Sigma)\right\}}$

A step of the EM algorithm does not decrease the minuend (since we maximize it over $\Sigma$) and does not increase the subtrahend (its change when we move from $\Sigma^t$ to $\Sigma^{t+1}$ is minus a Kullback–Leibler divergence, hence non-positive). Therefore, the log-likelihood does not decrease at any step.
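Spelled out, the difference between successive log-likelihoods is

$\log p(X|\Sigma^{t+1}) - \log p(X|\Sigma^{t}) = \left( E_{Z|X, \Sigma^t}{\left\{\log p(X,Z|\Sigma^{t+1})\right\}} - E_{Z|X, \Sigma^t}{\left\{\log p(X,Z|\Sigma^{t})\right\}} \right) + KL\left( p(Z|X,\Sigma^{t}) \,\big\|\, p(Z|X,\Sigma^{t+1}) \right) \ge 0,$

where the first bracket is non-negative because $\Sigma^{t+1}$ maximizes the expected log-likelihood, and the KL divergence is non-negative by Gibbs’ inequality.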

Finally, let’s apply it in the context of pLSA. Given $\Sigma^t=(\theta^t, \phi^t)$, one can compute the probability for word $w$ in document $d$ to have topic $z$:

$p(z|w,d,\Sigma^t) = \frac{\theta^t_{d, z} \phi^t_{z,w}}{\sum\limits_{s} \theta^t_{d, s} \phi^t_{s,w}}$

Hence, the optimization target at each step looks like this:

$E_{Z|X, \Sigma^t}\log p(X,Z|\Sigma^{t+1}) = \sum\limits_{d,w} n_{dw}\sum\limits_{z} \frac{\theta^t_{dz} \phi^t_{zw}}{\sum\limits_{s} \theta^t_{ds} \phi^t_{sw}}\log(\theta^{t+1}_{dz} \phi^{t+1}_{zw})$,

where $n_{dw}$ is the number of occurrences of word $w$ in document $d$.

This optimization problem (including the usual probability-distribution constraints on $\theta$ and $\phi$) turns out to be very easy: maximizing under the simplex constraints just gives the normalized expected topic counts. Happy end.
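Here is a minimal NumPy sketch of one such iteration (function and variable names are my own; `n` holds the counts $n_{dw}$, `phi` the topic-term probabilities $\phi_{zw}$, and `theta` the document-topic probabilities $\theta_{dz}$):

```python
import numpy as np

def plsa_em_step(n, phi, theta, eps=1e-12):
    """One EM iteration of pLSA: the E-step computes p(z|d,w), the M-step renormalizes counts."""
    # E-step: p(z|d,w) is proportional to theta_{dz} * phi_{zw}; shape (D, T, W)
    joint = theta[:, :, None] * phi[None, :, :]
    p_z = joint / (joint.sum(axis=1, keepdims=True) + eps)

    # Expected number of times word w in document d is generated by topic z
    expected = n[:, None, :] * p_z

    # M-step: the maximizer of the target above is the normalized expected counts
    phi_new = expected.sum(axis=0)
    phi_new /= phi_new.sum(axis=1, keepdims=True) + eps
    theta_new = expected.sum(axis=2)
    theta_new /= theta_new.sum(axis=1, keepdims=True) + eps
    return phi_new, theta_new
```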

I admire statistics both for its beauty and for its necessity in various practical applications. Right now I’m taking a course on applied statistics in machine learning, and it’s wonderful, but not enough. I want to understand the subject more deeply.

My “read-the-proof” todo list follows:

- proof of convergence for the method of moments

- proof of convergence for maximum-likelihood estimates

- proof of convergence for bootstrap estimates

- theoretical foundations of the EM algorithm

To generalize, it looks like this: statistics has plenty of pretty and easy-to-use methods, but firstly it’s not clear why they work, and secondly it’s not clear what all these various methods are needed for.

In order to clean up my knowledge of mathematical programming, I’ve decided to revisit the Lagrange multipliers rule. As always happens with calculus-related things, it required a good understanding of the prerequisites, and in this case it turned out to rely on the Implicit Function Theorem.

In traditional Soviet calculus textbooks it’s proved by induction on the number of free variables. I strongly dislike such an approach, because it doesn’t give any evidence of why the theorem is valid. I’d like to share with you a wonderful explanation of why it works: http://www.mth.uct.ac.za/~jratzkin/class-notes/implicit.pdf.

Update: either the proof is a little bit wrong or I missed something in it: what is the domain of the contracting function?
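For reference, the standard fixed-point proofs I know of consider, for each $x$ close to $x_0$, the map

$\Phi_x(y) = y - \left[D_y F(x_0, y_0)\right]^{-1} F(x, y)$

on a closed ball $\bar{B}(y_0, r)$: for $r$ and $\|x - x_0\|$ small enough, $\Phi_x$ maps this ball into itself and is a contraction, and its unique fixed point is the desired $y = g(x)$ with $F(x, g(x)) = 0$. (The notation here is mine and may differ from the linked note.)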

- mosh
- some static analyzer for Python code
- ssh-agent
- git
- valgrind
- the SVN book
- TeX basics (as D. Knuth designed it)
- learn to use gdb with template classes and other complex stuff
- the fundamental group
- the PCP theorem
- statistical tests from mathematical statistics

This concept is one of the first introduced in a university linear algebra course. However, as happens with most knowledge you were not ready to understand, only a few facts have stayed in my head.

The reason I had to return to it was the use of eigenvalues and eigenvectors in graph clustering. Their existence is guaranteed by the well-known theorem that the eigenvectors of a symmetric matrix form an orthonormal basis of the vector space.

But what does “orthonormal” mean? We need a scalar product to introduce such a concept. In turn, we need a basis to define a scalar product. That’s not a problem if we stay at the matrix level of understanding vector spaces, because having a matrix means we’ve already chosen some basis. But what if we try to take one step up, to abstract linear spaces and the linear operators defined on them?

A symmetric matrix corresponds to a self-adjoint operator $A$:
$(x, Ay) = (Ax, y)$

As we can see, we need a scalar product (and thus a basis) to introduce self-adjointness. So my question is: is self-adjointness a property of a linear operator alone, or of a linear operator with respect to some basis of the vector space?

The answer is easy. Let’s fix a basis $E$. An operator is self-adjoint in the basis $E$ if and only if it is a scaling along a set of directions that can be obtained from $E$ by an isometry (an orthogonal transformation). Thus, any operator whose system of eigenvectors $F$ is not equivalent to $E$ (there is no isometry mapping $E$ to $F$) is not self-adjoint with respect to $E$, but it is self-adjoint with respect to $F$!

So self-adjointness is a property of an operator with respect to a basis.
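A tiny numerical illustration (my own example, using the standard scalar product of the coordinate basis $E$): the matrix below scales along the directions $(1,0)$ and $(1,1)$, which cannot be obtained from $E$ by an isometry, so it is not symmetric in $E$, yet it is diagonal (hence self-adjoint) in its own eigenbasis $F$.

```python
import numpy as np

M = np.array([[1., 1.], [0., 2.]])   # eigenvectors (1, 0) and (1, 1) -- not orthogonal
F = np.array([[1., 1.], [0., 1.]])   # columns are those eigenvectors
D = np.linalg.inv(F) @ M @ F         # the same operator written in the basis F

print(np.allclose(M, M.T))           # False: not self-adjoint w.r.t. the standard basis
print(D)                             # diag(1, 2)
print(np.allclose(D, D.T))           # True: self-adjoint w.r.t. F
```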

Well, having dealt with one question, we meet another one: what about orthogonal operators? Is that property basis-independent or not?

The answer is negative again. Unfortunately, I don’t know a beautiful explanation, but I’ll show a counterexample. If the property in question didn’t depend on the basis, every matrix of the form $S^{-1}AS$ would be orthogonal whenever $A$ is orthogonal. But this doesn’t hold for:

$A = \left( \begin{array}{cc} 0 & 1 \\ 1 & 0 \end{array} \right)$

$S = \left( \begin{array}{cc} 0 & 1 \\ 1 & 1 \end{array} \right)$

$S^{-1}AS = \left( \begin{array}{cc} -1 & 0 \\ 1 & 1 \end{array} \right)$
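A quick NumPy check of this counterexample (variable names are mine):

```python
import numpy as np

A = np.array([[0., 1.], [1., 0.]])       # orthogonal: it just swaps the coordinates
S = np.array([[0., 1.], [1., 1.]])       # change-of-basis matrix
B = np.linalg.inv(S) @ A @ S             # the same operator in the new basis

print(B)                                 # [[-1.  0.], [ 1.  1.]]
print(np.allclose(A.T @ A, np.eye(2)))   # True:  A is orthogonal
print(np.allclose(B.T @ B, np.eye(2)))   # False: B is not
```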

Here I will write about concepts and problems from mathematics and IT that excite me. Leave comments, ask questions; I hope you like interesting discussions as much as I do.