My current research interest is topic modelling, and I’m trying to investigate concepts and proofs related to it. This evening I’ve managed to understand what is EM algorithm under the hood.
Disclaimer: the following text highly intersects with Wikipedia article about EM algorithm.
The EM algorithms emerges in following context. Let’s consider a statistical model that comprises unknown parameter , unobservable data
and observable data
, i.e. we are given a joint distribution
. Assume we have to find maximum-likelihood estimate for
given
, i.e. we are to maximize
. However, a formula for
obtained by marginalizing joint distribution can be unsuitable to apply usual optimization methods. In pLSA
is a set of all terms in all documents, while
is a set of their topics and
consists of topic-term matrix
and document-topic matrix
.
Fortunately, there’s a beautiful iterative approach to solve the problem. Since the word “iterative” was included in the name, one could guess that we need a rule to get better from worse
, moreover this rule should be computationally feasible. And here a crazy idea comes: let’s find
that maximizes expectation of likelihood
under assumption that distribution of
is defined by
and
:
. I’d like to comment at this point: given
and
, what is the best conjecture we can make about distribution
? Nothing smarter that replace
with
is possible here. And if we actually believe that the conjecture was right, what is our next step? Right, maximize expected log-likelihood
.
Surprisingly, it works. Let’s prove that the log-likelihood increases at each iteration. It can be rewritten in such a way:
We replace both sides by their expectation under the above mentioned assumption about distribution:
A step of EM-algorithm increases the minuend (since we maximize it) and decreases the subtrahend (since it’s minus Kullback-Leibler divergence). Therefore, likelihood increases at each step.
Finally, let’s apply at in the context of pLSA. Given one can get this probability for word
in document
to have topic
:
Hence, an optimization target at each step will look like:
,
where is number of occurrences of word
in document
.
This optimization problem (including usual probability distribution constraints on ,
) turns out to be very easy. Happy end
I admire statistics for both beauty and necessity for various practical purposes. Right now I’m taking a course of applied statistics in machine learning and it’s wonderful, but not enough. I want to understand it deeper.
My “read-the-proof” todo list follows:
- proof of moment method convergence
- proof of convergence for max-likelihood estimates
- proof of bootstrap estimates convergences
- theoretical basement of EM algorithm
To generalize, it looks like that: they have pretty and easy methods in statistics but firstly it’s not clear why they work, and secondly what are all these various methods needed for.
In order to make some clean up in my knowledge of mathematical programming, I’ve decided to reconsider Lagrange Multipliers Rule. As it always happens with calculus-related things, it required from me good understanding of prerequisites, and in this case it appeared, that it relies on Implicit Function Theorem.
In traditional courses of Calculus in soviet books it’s proved by induction on number of free variables. I strongly dislike such an approach, cause it doesn’t give any evidence why it’s valid. I’d like to share with you a wonderful explanation why it works: http://www.mth.uct.ac.za/~jratzkin/class-notes/implicit.pdf.
Up: either the proof is a little bit wrong or I missed something in it: what is the domain of contracting function?
- mosh
- some static analyzer of pythonic code
- ssh-agent
- git
- valgrind
- svn book
- tex basics (as D. Knuth invented it)
- learn to use gdb with template classes and other complex stuff
- p-adic numbers
- fundamental group
- pcp theorem
- mathematical statistic tests
This concept is one of the first introduced in linear algebra university course. However, as it happens with most knowledge you were not ready to understand, just few facts left in my head.
The reason why I had to return to it was use of eigenvalues and eigenvectors in graph clustering. Their existence was guaranteed by well-known theorem, that eigenvectors of symmetric matrices are orthonormal basis of vector space.
But what is orthonormal? We need scalar product to introduce such concept. In turn, we need basis to define scalar product. It’s not a problem if we stay at matrix level of understanding vector spaces, cause when we have a matrix it means, that we’ve already chosen some basis. But what if we try to make one step up to abstract linear spaces and linear operators defined on them?
Symmetric matrix corresponds to self-adjoint operator :
As we see, we need scalar product (and thus a basis) to introduce self-adjointness. So my question is: is self-adjointness property of linear operator or linear operator with respect to some basis of vector space?
The answer is easy. Let’s fix basis . Operator is self-adjoint in basis
if and only if it is distension in a set of directions, that can be obtained from
with isometry (orthogonal transformations). Thus, any operator with system of eigenvectors
unequivalent to
(there’s no isometry to map
to
) is not self-adjoint with respect to
but is self-adjoint with respect to
!
So self-adjointness is property of operator with respect to basis.
Well, having done with one question we meet another one: what about orthogonal operators? Is this property basis independent or not?
The answer is negative again. Unfortunately, I don’t know beautiful explanation. But I’ll show a counterexample. If discussed property doesn’t depend on the basis, all the matrices of form should be orthogonal is
is orthogonal. But it doesn’t hold for:
There I will tell about concepts and problems of mathematics and IT that excite me. Make comments, ask questions – hope that you like interesting discussions as much as I do.
