Being a recent expat from a country where the average command of English is poor, I find it challenging to read novels full of fancy words in the original. I’ve started compiling a list of internet resources that can help in such a situation:

1. The English-to-English dictionaries of renowned English universities (Oxford, Cambridge) are surely the most authoritative source of word explanations. All the articles there contain example sentences, polysemantic words have several meanings listed, and phrasal verbs are included. Additionally, one escapes the cost of translating into another language and enriches one’s vocabulary by working with them. I would rate them as the number one tool in my toolbox, but even such a tool is not a holy grail. Sometimes it is almost impossible to manage without any translation into your native language, and one example is often not enough. Finally, they are not comprehensive, since languages always evolve faster than official dictionaries.
2. Google Translate – I think everybody knows this one. The big G offers statistics-based machine translation. This implies pros and cons: it’s versatile, sometimes capable of translating idioms, and copes with inflected words. But the reliability of automatic methods is limited, and, more importantly, it gives you neither contexts for translation variants nor any grammatical information. I never trust it fully and use it only as an auxiliary tool: it can give you a valuable cue about what to do next, and it also serves as a good spell checker 🙂
3. vocabulary.com looks like a promising mixture of hand-crafted and automated approaches. It greets you with a verbose explanation of the word (I wonder how they got them all written…) and offers a range of synonyms and examples. I can’t just use it and forget about all the others, primarily because I don’t know how much one can trust it. There are, of course, a lot of other tools like vocabulary.com, but so far this one seems good enough to make me forget about the rest.
4. A good hand-crafted bilingual dictionary can sometimes help you when the aforementioned approaches fail. For me, as a Russian speaker, such a dictionary is Lingvo.
5. Urban Dictionary is irreplaceable if you have to deal with slang and/or obscene words.
6. Translation through Wikipedia works like this: you find an article of interest in English, say http://en.wikipedia.org/wiki/Curvature, and look for your native language in the panel that lists the corresponding articles in other languages. For example, that panel informs you that curvature in Russian is кривизна. This method is especially useful for scientific terms.
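
The same lookup can also be scripted through the MediaWiki API (`action=query`, `prop=langlinks`). Here is a minimal sketch; the `langlink_query` helper is my own invention, not an official client, and you would still need to fetch the URL and parse the JSON response yourself:

```python
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"

def langlink_query(title, lang):
    """Build a MediaWiki API query asking for the interlanguage
    link of an English article in the given target language."""
    params = {
        "action": "query",
        "titles": title,
        "prop": "langlinks",
        "lllang": lang,
        "format": "json",
    }
    return API + "?" + urlencode(params)

# The response for this URL contains the Russian title of the article,
# i.e. the translation of the term.
print(langlink_query("Curvature", "ru"))
```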

Usually I go through such a list top to bottom when I need to understand what some involved word like “stunt” or “mockery” means, skipping the options that obviously don’t suit the particular case at hand.

The first word of the day is exhaustive, that is, “including all elements or aspects” (thanks, Oxford dictionary…). An exhaustive list of parameters includes all possible parameters. Don’t confuse it with exhausted, which means “extremely tired”.

P.S. I will try to extend my vocabulary by posting one fancy English word a day. Try not to get bored 🙂

Want to avoid unrecoverable brain damage caused by a long search for that buggy modification? A Vim user? Pay attention to https://github.com/airblade/vim-gitgutter.

OOM means Observable Operator Model, yet another model of discrete-time, discrete-value stochastic processes. While I’m studying a tutorial on this topic, I’m going to write a few notes to highlight the ideas that look most important to me.

First of all, OOM is about predictors. Predictors are a family of functions $g_b$ which, for every process realization prefix $b$, map each possible word $a$ to its probability of following $b$. In other words, they represent all the possible states of a process, where a state is an entity that determines the future of the process in a unique way.

Let’s consider the most stupid process ever: a constant process. For it, all the predictors coincide. The same situation is observed for a “coin-tossing” process: the future is always the same. However, it differs from the previous case in that its predictor is at least non-degenerate.

The easiest dependency between the future and the past occurs in a Markov chain. For it, we will have as many different predictors as there are states. However, that is still a finite number. Even if we consider all Markov processes (thereby allowing dependency on earlier-than-previous symbols), the number of different predictors remains easily bounded.
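
A tiny sketch of this claim, with a hypothetical 3-state chain whose observed symbol is the state itself, so the predictor depends only on the last symbol of the prefix:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up 3-state Markov chain over the alphabet {0, 1, 2}.
P = rng.random((3, 3))
P /= P.sum(axis=1, keepdims=True)   # row-stochastic transition matrix

def predictor(prefix):
    """g_b: maps each symbol a to P(a | prefix b)."""
    return P[prefix[-1]]

# Prefixes ending in the same symbol yield the same predictor,
# so the set {g_b} contains at most 3 distinct functions.
assert np.allclose(predictor([0, 2, 1]), predictor([1]))
assert np.allclose(predictor([1, 1, 0]), predictor([2, 0]))
```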

To see a more complicated set of predictors, we should try something more powerful, for example Hidden Markov Models (HMM). It’s easy to obtain a beautiful set that contains all the predictors of a process described by an HMM. Let’s consider the predictors corresponding to each state of the hidden chain. We claim that all the predictors lie in the subspace spanned by these functions. The reason is that the past of the process up to the current moment gives a posterior probability for each of the states, and thus every predictor can be expressed as a mixture of the state predictors.
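
Here is a numerical sanity check of the mixture claim, with a made-up 2-state HMM; I assume the convention “emit a symbol from the current state, then transition”:

```python
import numpy as np

# Hypothetical 2-state HMM over a 3-letter alphabet.
pi = np.array([0.6, 0.4])                       # initial distribution
A = np.array([[0.7, 0.3], [0.2, 0.8]])          # transitions A[i, j]
B = np.array([[0.5, 0.3, 0.2], [0.1, 0.1, 0.8]])  # emissions B[i, a]

def forward(prefix):
    """Unnormalized forward vector: w_i = P(prefix, current state = i)."""
    w = pi.copy()
    for a in prefix:
        w = A.T @ (B[:, a] * w)
    return w

prefix = [0, 2, 1, 2]
w = forward(prefix)
posterior = w / w.sum()                         # p(state | prefix)

# Predictor of the whole process: g_b(a) = P(a | b) ...
g_b = np.array([forward(prefix + [a]).sum() for a in range(3)]) / w.sum()

# ... equals the posterior mixture of the per-state predictors B[i, :].
assert np.allclose(g_b, posterior @ B)
```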

And here comes OOM. The main idea is: let’s consider all the processes for which all the predictors lie in an $m$-dimensional space. That means the future of the process can be encoded by $m$ real numbers. That’s a nice property, but there is an even nicer one. For each alphabet character $a$, let’s consider an operator $t_a$ that acts as $t_a(g_b) = P(a|b) g_{ba}$. It turns out that this guy is linear!
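
The linearity is easy to believe for HMMs, where (under the made-up “emit, then transition” convention and matrices below, which are my own example, not from the tutorial) the operators are plain matrices acting on unnormalized forward vectors, and word probabilities come out as products of these matrices:

```python
import itertools
import numpy as np

# Hypothetical 2-state HMM over a 3-letter alphabet.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.2, 0.8]])
B = np.array([[0.5, 0.3, 0.2], [0.1, 0.1, 0.8]])

def T(a):
    """Observable operator for symbol a -- an ordinary (linear) matrix."""
    return A.T @ np.diag(B[:, a])

def prob(word):
    """P(word) as a product of observable operators applied to pi."""
    w = pi.copy()
    for a in word:
        w = T(a) @ w
    return w.sum()

def brute_force(word):
    """P(word) by summing over all hidden state sequences."""
    total = 0.0
    for states in itertools.product(range(len(pi)), repeat=len(word)):
        p = pi[states[0]]
        for t, (s, a) in enumerate(zip(states, word)):
            p *= B[s, a]                  # emit a from state s
            if t + 1 < len(word):
                p *= A[s, states[t + 1]]  # then transition
        total += p
    return total

word = [0, 2, 1]
assert np.isclose(prob(word), brute_force(word))
```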

It is quite common for a log-likelihood function to be concave. This great property guarantees uniqueness of the local maximum, which serves as a theoretical basis for numerous iterative algorithms. One way to learn a topic model is to maximize the log-likelihood: $\sum\limits_{d,w} n_{dw} \log \sum\limits_{t} \theta_{d,t} \phi_{t,w}$

My current research interest is topic modelling, and I’m trying to investigate the concepts and proofs related to it. This evening I managed to understand how the EM algorithm works under the hood.

Disclaimer: the following text overlaps heavily with the Wikipedia article about the EM algorithm.

The EM algorithm emerges in the following context. Let’s consider a statistical model that comprises an unknown parameter $\Sigma$, unobservable data $Z$ and observable data $X$, i.e. we are given a joint distribution $p(X,Z|\Sigma)$. Assume we have to find a maximum-likelihood estimate for $\Sigma$ given $X$, i.e. we are to maximize $p(X|\Sigma)$. However, the formula for $p(X|\Sigma)$ obtained by marginalizing the joint distribution can be unsuitable for the usual optimization methods. In pLSA, $X$ is the set of all terms in all documents, $Z$ is the set of their topics, and $\Sigma$ consists of the topic-term matrix $\phi_{t,w}$ and the document-topic matrix $\theta_{d,t}$.

Fortunately, there’s a beautiful iterative approach to solving the problem. Since the word “iterative” is included in the name, one could guess that we need a rule to get a better $\Sigma^{t+1}$ from a worse $\Sigma^{t}$; moreover, this rule should be computationally feasible. And here comes a crazy idea: let’s find the $\Sigma^{t+1}$ that maximizes the expectation of the likelihood $p(X,Z|\Sigma^{t+1})$ under the assumption that the distribution of $Z$ is defined by $X$ and $\Sigma^t$: $p(Z|X, \Sigma^{t})$. I’d like to comment at this point: given $X$ and $\Sigma^{t}$, what is the best conjecture we can make about the distribution $p(Z|X,\Sigma)$? Nothing smarter than replacing $\Sigma$ with $\Sigma^{t}$ is possible here. And if we actually believe that the conjecture is right, what is our next step? Right, maximize the expected log-likelihood $E_{Z|X, \Sigma^t}{\left\{\log p(X,Z|\Sigma^{t+1})\right\}}$.

Surprisingly, it works. Let’s prove that the log-likelihood increases at each iteration. It can be rewritten in such a way:

$\log p(X|\Sigma) = \log p(X,Z|\Sigma) - \log p(Z|X,\Sigma)$

We replace both sides by their expectations under the above-mentioned assumption about the distribution of $Z$ (the left-hand side does not depend on $Z$, so it stays intact):

$\log p(X|\Sigma) = E_{Z|X, \Sigma^t}{\left\{\log p(X,Z|\Sigma)\right\}} - E_{Z|X, \Sigma^t}{\left\{\log p(Z|X,\Sigma)\right\}}$

A step of the EM algorithm does not decrease the minuend (since we maximize it) and does not increase the subtrahend: the change of the subtrahend from $\Sigma^t$ to $\Sigma^{t+1}$ is minus a Kullback-Leibler divergence, hence non-positive. Therefore, the likelihood never decreases from step to step.

Finally, let’s apply it in the context of pLSA. Given $\Sigma^t=(\theta^t, \phi^t)$, one gets the following probability for word $w$ in document $d$ to have topic $z$:

$p(z|w,d,\Sigma^t) = \frac{\theta^t_{d, z} \phi^t_{z,w}}{\sum\limits_{s} \theta^t_{d, s} \phi^t_{s,w}}$

Hence, an optimization target at each step will look like:

$E\log p(X,Z|\Sigma^{t+1}) = \sum\limits_{dw} n_{dw}\sum\limits_{z} \frac{\theta^t_{dz} \phi^t_{zw}}{\sum\limits_{s} \theta^t_{ds} \phi^t_{sw}}\log(\theta^{t+1}_{dz} \phi^{t+1}_{zw})$,

where $n_{dw}$ is the number of occurrences of word $w$ in document $d$.

This optimization problem (including the usual probability-distribution constraints on $\theta$ and $\phi$) turns out to be very easy: the Lagrange multiplier method gives the closed-form maximizers $\theta^{t+1}_{dz} \propto \sum_w n_{dw}\, p(z|w,d,\Sigma^t)$ and $\phi^{t+1}_{zw} \propto \sum_d n_{dw}\, p(z|w,d,\Sigma^t)$. Happy end 🙂
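
For the record, here is a toy end-to-end sketch of the resulting EM iterations for pLSA, on random synthetic counts (the corpus, sizes, and variable names are all made up for illustration); the assertion inside the loop checks exactly the monotonicity we just proved:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic corpus: n[d, w] = count of word w in document d.
n = rng.integers(1, 5, size=(6, 8)).astype(float)
D, W = n.shape
K = 3  # number of topics

# Random stochastic initialization of theta (document-topic)
# and phi (topic-word).
theta = rng.random((D, K)); theta /= theta.sum(axis=1, keepdims=True)
phi = rng.random((K, W));   phi /= phi.sum(axis=1, keepdims=True)

def log_likelihood(theta, phi):
    return np.sum(n * np.log(theta @ phi))

ll0 = log_likelihood(theta, phi)
prev = -np.inf
for _ in range(50):
    # E-step: p[d, z, w] = p(z | w, d, theta, phi)
    p = theta[:, :, None] * phi[None, :, :]      # shape (D, K, W)
    p /= p.sum(axis=1, keepdims=True)
    # M-step: closed-form maximizers of the expected log-likelihood
    counts = n[:, None, :] * p                   # n_dw * p(z|w,d)
    theta = counts.sum(axis=2)
    theta /= theta.sum(axis=1, keepdims=True)
    phi = counts.sum(axis=0)
    phi /= phi.sum(axis=1, keepdims=True)
    ll = log_likelihood(theta, phi)
    assert ll >= prev - 1e-9   # likelihood never decreases
    prev = ll
```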

I admire statistics both for its beauty and for its necessity in various practical applications. Right now I’m taking a course on applied statistics in machine learning, and it’s wonderful, but not enough. I want to understand it more deeply.