OOM stands for Observable Operator Model, yet another model of a discrete-time, discrete-valued stochastic process. While I'm studying a tutorial on this topic, I'm going to write down a few notes to highlight the ideas that look most important to me.
First of all, OOM is about predictors. Predictors are a family of functions $f_w$, one for every realization prefix $w$, each mapping a possible word $v$ to the probability $f_w(v) = P(v \mid w)$ that $v$ follows $w$. In other words, they represent all the possible states of the process, where a state is an entity that determines the future of the process uniquely.
Let's consider the simplest process ever: the constant process. For it, all the predictors coincide. The same situation is observed for the "coin-tossing" process: the future is always the same. However, it differs from the previous case in that its predictor is at least non-degenerate.
The simplest dependency between the future and the past appears in a Markov chain. For it we will have as many different predictors as there are states, which is still a finite number. Even if we consider all higher-order Markov processes (thus allowing dependency on symbols earlier than the previous one), the number of different predictors remains bounded.
In order to see a more complicated set of predictors, we should try something more powerful, for example Hidden Markov Models (HMMs). It's easy to obtain a beautiful set that contains all the predictors of a process described by an HMM. Let's consider the predictors corresponding to each state of the hidden chain. We claim that all the predictors lie in the subspace spanned by these functions. The reason is that the past of the process up to the current moment gives a posterior probability for each of the states, and thus every predictor can be expressed as a mixture of the state predictors (see the sketch below).
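To make the claim concrete, here is a minimal numpy sketch with a made-up two-state HMM (the matrices `A`, `B`, `pi` are my own toy numbers, not from the tutorial): the predictor after any prefix comes out as exactly the posterior-weighted mixture of the per-state predictors.

```python
import numpy as np

A = np.array([[0.9, 0.1],   # hidden-state transition matrix
              [0.3, 0.7]])
B = np.array([[0.8, 0.2],   # emission matrix: rows = hidden states,
              [0.1, 0.9]])  # columns = observable symbols
pi = np.array([0.5, 0.5])   # initial state distribution

def posterior_after(prefix):
    """Forward algorithm: P(hidden state now | observed prefix)."""
    alpha = pi * B[:, prefix[0]]
    for symbol in prefix[1:]:
        alpha = (alpha @ A) * B[:, symbol]
    return alpha / alpha.sum()

def state_predictor(state):
    """P(next symbol | chain is in `state` now): one transition, then emit."""
    return A[state] @ B

prefix = [0, 1, 1, 0]
posterior = posterior_after(prefix)

# The predictor after this prefix is the posterior-weighted mixture of
# the per-state predictors, hence it lies in their span.
mixture = posterior @ np.array([state_predictor(s) for s in range(2)])
print(mixture)   # next-symbol probabilities; they sum to 1
```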
And here comes OOM. The main idea is: let's consider all the processes for which all the predictors lie in an $m$-dimensional space. That means the future of the process can be encoded by $m$ real numbers. That's a nice property, but there is one more, even nicer. For each alphabet character $a$, let's consider an operator $t_a$ that works in the following way: $t_a f_w = P(a \mid w)\, f_{wa}$. It turns out that this guy is linear!
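A minimal sketch of the operator view, again with my own toy numbers: one standard way an HMM induces observable operators is $\tau_a = A^\top \mathrm{diag}(B_{:,a})$, and then the probability of any word is a product of linear maps applied to the initial state vector.

```python
import numpy as np

A = np.array([[0.9, 0.1],    # hidden-state transitions
              [0.3, 0.7]])
B = np.array([[0.8, 0.2],    # emissions: rows = states, cols = symbols
              [0.1, 0.9]])
w0 = np.array([0.5, 0.5])    # initial state vector

# One observable operator per alphabet symbol: tau_a weights the state
# vector by the probability of emitting a, then moves it one step forward.
tau = [A.T @ np.diag(B[:, a]) for a in range(2)]

def prob(word):
    """P(word) = 1^T tau_{a_n} ... tau_{a_1} w0: a chain of linear maps."""
    w = w0
    for a in word:
        w = tau[a] @ w
    return w.sum()

print(prob([0]))        # P(first symbol is 0) = 0.45 here
print(prob([0, 1, 1]))  # longer words are just more matrix products
```

Linearity is the whole point: appending a symbol to the observed past is nothing more than one matrix multiplication on the state vector.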
It's quite common for a log-likelihood function to be concave. This great property guarantees that any local maximum is global, which serves as a theoretical basis for numerous iterative algorithms. One way to learn a topic model is to maximize the log-likelihood:

$$\sum_{d \in D} \sum_{w \in d} n_{dw} \log \sum_{z} \phi_{wz}\,\theta_{zd}.$$
My current research interest is topic modelling, and I'm trying to investigate the concepts and proofs related to it. This evening I managed to understand how the EM algorithm works under the hood.
Disclaimer: the following text heavily overlaps with the Wikipedia article on the EM algorithm.
The EM algorithm emerges in the following context. Let's consider a statistical model that comprises an unknown parameter $\theta$, unobservable data $Z$ and observable data $X$, i.e. we are given a joint distribution $p(X, Z \mid \theta)$. Assume we have to find a maximum-likelihood estimate of $\theta$ given $X$, i.e. we are to maximize $p(X \mid \theta)$. However, the formula for $p(X \mid \theta)$ obtained by marginalizing the joint distribution over $Z$ can be unsuitable for the usual optimization methods. In pLSA, $X$ is the set of all terms in all documents, $Z$ is the set of their topics, and $\theta$ consists of the topic-term matrix $\Phi$ and the document-topic matrix $\Theta$.
Fortunately, there's a beautiful iterative approach to solving the problem. Since the word "iterative" is included in the name, one could guess that we need a rule to get a better $\theta_{t+1}$ from a worse $\theta_t$; moreover, this rule should be computationally feasible. And here a crazy idea comes: let's find the $\theta_{t+1}$ that maximizes the expectation of the complete-data log-likelihood under the assumption that the distribution of $Z$ is defined by $X$ and $\theta_t$: $\theta_{t+1} = \arg\max_\theta \mathbb{E}_{Z \mid X, \theta_t} \log p(X, Z \mid \theta)$. I'd like to comment at this point: given $X$ and $\theta_t$, what is the best conjecture we can make about the distribution of $Z$? Nothing smarter than $p(Z \mid X, \theta_t)$ is possible here. And if we actually believe that the conjecture is right, what is our next step? Right, maximize the expected log-likelihood $Q(\theta \mid \theta_t) = \mathbb{E}_{Z \mid X, \theta_t} \log p(X, Z \mid \theta)$.
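Before the proof, a minimal sketch of this rule in action on the classic "two coins" example (the data and initial guesses are my own illustration; `scipy` is assumed available): the hidden variable is which of two biased coins was flipped in each session, we observe only head counts, and the printed log-likelihood never decreases.

```python
import numpy as np
from scipy.stats import binom

heads = np.array([5, 9, 8, 4, 7])   # observed heads per 10-flip session
n = 10
theta = np.array([0.6, 0.5])        # initial guesses for the two biases

def log_likelihood(theta):
    # log p(X | theta): marginalize the hidden (fair) coin choice out.
    mix = 0.5 * binom.pmf(heads, n, theta[0]) + 0.5 * binom.pmf(heads, n, theta[1])
    return np.log(mix).sum()

for step in range(10):
    # E-step: posterior responsibility p(Z = coin | heads, theta_t).
    lik = np.stack([binom.pmf(heads, n, t) for t in theta])   # 2 x sessions
    resp = lik / lik.sum(axis=0)
    # M-step: maximize the expected complete-data log-likelihood; for
    # binomials this is a responsibility-weighted heads fraction.
    theta = (resp @ heads) / (resp.sum(axis=1) * n)
    print(step, theta.round(4), log_likelihood(theta).round(6))
# The log-likelihood column is non-decreasing -- exactly the claim
# proved below.
```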
Surprisingly, it works. Let's prove that the log-likelihood does not decrease at each iteration. It can be rewritten in the following way:

$$\log p(X \mid \theta) = \log p(X, Z \mid \theta) - \log p(Z \mid X, \theta).$$
We replace both sides by their expectations under the above-mentioned assumption $Z \sim p(Z \mid X, \theta_t)$ (the left-hand side does not depend on $Z$):

$$\log p(X \mid \theta) = Q(\theta \mid \theta_t) - \mathbb{E}_{Z \mid X, \theta_t} \log p(Z \mid X, \theta).$$
A step of the EM algorithm does not decrease the minuend (since we maximize it over $\theta$) and does not increase the subtrahend: its change from $\theta_t$ to $\theta_{t+1}$ equals $\mathbb{E}_{Z \mid X, \theta_t} \log \frac{p(Z \mid X, \theta_{t+1})}{p(Z \mid X, \theta_t)} = -\,\mathrm{KL}\big(p(Z \mid X, \theta_t) \,\|\, p(Z \mid X, \theta_{t+1})\big) \le 0$. Therefore, the likelihood does not decrease at each step.
Finally, let's apply all of this in the context of pLSA. Given $\theta_t = (\Phi, \Theta)$, one can get the probability for word $w$ in document $d$ to have topic $z$:

$$p(z \mid d, w) = \frac{\phi_{wz}\,\theta_{zd}}{\sum_{z'} \phi_{wz'}\,\theta_{z'd}}.$$
Hence, the optimization target at each step will look like:

$$\sum_{d \in D} \sum_{w \in d} n_{dw} \sum_{z} p(z \mid d, w) \log\big(\phi_{wz}\,\theta_{zd}\big) \to \max_{\Phi,\,\Theta},$$
where $n_{dw}$ is the number of occurrences of word $w$ in document $d$.
This optimization problem (including the usual probability-distribution constraints $\sum_w \phi_{wz} = 1$ and $\sum_z \theta_{zd} = 1$) turns out to be very easy. Happy end!
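For completeness, a minimal sketch of these pLSA updates on random toy counts (the sizes, seed, and number of iterations are my choices; only the formulas come from the derivation above):

```python
import numpy as np

rng = np.random.default_rng(0)
n_dw = rng.integers(0, 5, size=(20, 50))        # documents x words counts
D, W, Z = n_dw.shape[0], n_dw.shape[1], 3       # 3 topics

phi = rng.random((W, Z)); phi /= phi.sum(axis=0)          # p(w | z)
theta = rng.random((Z, D)); theta /= theta.sum(axis=0)    # p(z | d)

for step in range(30):
    # E-step: p(z | d, w) = phi_wz * theta_zd / sum_z' phi_wz' * theta_z'd
    joint = phi[None, :, :] * theta.T[:, None, :]         # D x W x Z
    p_zdw = joint / joint.sum(axis=2, keepdims=True)
    # M-step: maximize sum_dw n_dw sum_z p(z|d,w) log(phi_wz * theta_zd),
    # which reduces to normalized expected counts.
    weighted = n_dw[:, :, None] * p_zdw
    phi = weighted.sum(axis=0); phi /= phi.sum(axis=0)        # W x Z
    theta = weighted.sum(axis=1).T; theta /= theta.sum(axis=0)  # Z x D

log_lik = (n_dw * np.log(phi @ theta).T).sum()
print(log_lik)   # non-decreasing if printed inside the loop
```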
I admire statistics both for its beauty and for its necessity in various practical purposes. Right now I'm taking a course on applied statistics in machine learning, and it's wonderful, but not enough. I want to understand it more deeply.
My “read-the-proof” todo list follows:
- proof of convergence for the method of moments
- proof of convergence for maximum-likelihood estimates
- proof of convergence for bootstrap estimates
- theoretical foundations of the EM algorithm
To generalize, it looks like this: statistics has pretty and easy methods, but firstly it's not clear why they work, and secondly it's not clear what all these various methods are needed for.
In order to clean up my knowledge of mathematical programming, I've decided to revisit the Lagrange multipliers rule. As always happens with calculus-related things, it required a good understanding of the prerequisites, and in this case it turned out to rely on the Implicit Function Theorem.
In traditional calculus courses from Soviet books it is proved by induction on the number of free variables. I strongly dislike such an approach, because it doesn't give any insight into why the theorem is valid. I'd like to share with you a wonderful explanation of why it works (via a contraction mapping): http://www.mth.uct.ac.za/~jratzkin/class-notes/implicit.pdf.
Upd: either the proof is slightly wrong or I missed something in it: what is the domain of the contracting function? A sketch of the contraction iteration itself is below.
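For what it's worth, here is a minimal sketch of the construction as I understood it (the function $F$ and all the numbers are my own toy example, not from the linked notes): near a point where $F(x_0, y_0) = 0$ and $\partial F / \partial y \ne 0$, the map $T_x(y) = y - F(x, y)/F_y(x_0, y_0)$ is a contraction whose fixed point is the implicit function $y(x)$.

```python
import math

def F(x, y):
    return x**2 + y**2 - 1.0        # unit circle; y(x) = sqrt(1 - x^2)

x0, y0 = 0.0, 1.0                   # a point where F = 0 and dF/dy != 0
Fy0 = 2.0 * y0                      # dF/dy evaluated at (x0, y0)

def implicit_y(x, tol=1e-12):
    """Iterate the contraction T_x to its fixed point, starting from y0."""
    y = y0
    while abs(F(x, y)) > tol:
        y = y - F(x, y) / Fy0       # a contraction for (x, y) near (x0, y0)
    return y

x = 0.3
print(implicit_y(x), math.sqrt(1 - x**2))   # both ~0.95393920...
```

Note that $|T_x'(y)| = |1 - F_y(x, y)/F_y(x_0, y_0)| < 1$ only near $(x_0, y_0)$, which is exactly why the question about the domain of the contracting function matters.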
- some static analyzer for Python code
- the SVN book
- TeX basics (as D. Knuth invented it)
- learn to use gdb with template classes and other complex stuff