Gaussian mixture model (GMM), k-means

Gaussian mixture model (GMM) A Gaussian mixture model is a probabilistic model that assumes all the data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters. Interpretation from geometry: $p(x)$ is a weighted sum of multiple Gaussian distributions. $$p(x)=\sum_{k=1}^{K} \alpha_{k} \cdot \mathcal{N}\left(x | \mu_{k}, \Sigma_{k}\right) $$ Interpretation from the mixture-model setup: $K$ is the total number of Gaussian components; $x$ is a sample (observed variable)....
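
The mixture density above can be evaluated directly in code; here is a minimal sketch, where the two-component weights, means, and covariances are made-up illustrative values rather than anything from the post:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative 2-component GMM in 2D: weights alpha_k, means mu_k, covariances Sigma_k.
alphas = [0.3, 0.7]                       # mixing weights, must sum to 1
mus = [np.zeros(2), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), 2.0 * np.eye(2)]

def gmm_pdf(x):
    """p(x) = sum_k alpha_k * N(x | mu_k, Sigma_k)."""
    return sum(a * multivariate_normal.pdf(x, mean=m, cov=S)
               for a, m, S in zip(alphas, mus, Sigmas))

print(gmm_pdf(np.array([1.0, 1.0])))
```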

May 16, 2020 · 6 min

Probabilistic Graphical Model (PGM)

Probabilistic Graphical Model (PGM) Definition: A probabilistic graphical model is a probabilistic model for which a graph expresses the conditional dependence structure between random variables. In general, PGM obeys the following rules: $$ \begin{aligned} &\text {Sum Rule : } p\left(x_{1}\right)=\int p\left(x_{1}, x_{2}\right) d x_{2}\\ &\text {Product Rule : } p\left(x_{1}, x_{2}\right)=p\left(x_{1} | x_{2}\right) p\left(x_{2}\right)\\ &\text {Chain Rule: } p\left(x_{1}, x_{2}, \cdots, x_{p}\right)=\prod_{i=1}^{p} p\left(x_{i} | x_{i+1}, x_{i+2}, \ldots, x_{p}\right)\\ &\text {Bayesian Rule: } p\left(x_{1} | x_{2}\right)=\frac{p\left(x_{2} | x_{1}\right) p\left(x_{1}\right)}{p\left(x_{2}\right)} \end{aligned} $$...
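
These rules are easy to sanity-check numerically on a small discrete joint distribution; a minimal sketch, where the $2 \times 2$ joint table is an arbitrary illustrative example:

```python
import numpy as np

# Arbitrary joint distribution p(x1, x2) over two binary variables (rows: x1, cols: x2).
p_joint = np.array([[0.1, 0.3],
                    [0.2, 0.4]])

# Sum rule: marginalize out the other variable.
p_x1 = p_joint.sum(axis=1)
p_x2 = p_joint.sum(axis=0)

# Product rule: p(x1, x2) = p(x1 | x2) p(x2).
p_x1_given_x2 = p_joint / p_x2            # divide each column by p(x2)
assert np.allclose(p_x1_given_x2 * p_x2, p_joint)

# Bayesian rule: p(x1 | x2) = p(x2 | x1) p(x1) / p(x2).
p_x2_given_x1 = p_joint / p_x1[:, None]   # divide each row by p(x1)
bayes = p_x2_given_x1 * p_x1[:, None] / p_x2
assert np.allclose(bayes, p_x1_given_x2)
```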

May 11, 2020 · 8 min

Hidden Markov Model (HMM)

Before reading this post, make sure you are familiar with the EM Algorithm and have a decent amount of knowledge of convex optimization. If not, please check out my previous posts: EM Algorithm, convex optimization primal and dual problem. Let’s get started! Conditional independence $A$ and $B$ are conditionally independent given $C$ if and only if, given knowledge that $C$ occurs, knowledge of whether $A$ occurs provides no information on the likelihood of $B$ occurring, and knowledge of whether $B$ occurs provides no information on the likelihood of $A$ occurring....
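
As a concrete illustration of that definition, conditional independence can be verified from a joint table by checking $p(A, B \mid C) = p(A \mid C)\, p(B \mid C)$ for every value of $C$. The sketch below builds an arbitrary joint over three binary variables that satisfies this by construction (all numbers are randomly generated, not from the post):

```python
import numpy as np

rng = np.random.default_rng(0)

# Build a joint p(A, B, C) in which A and B are conditionally independent given C,
# by sampling p(C), p(A|C), p(B|C) at random and multiplying them together.
p_c = rng.dirichlet(np.ones(2))                  # p(C)
p_a_given_c = rng.dirichlet(np.ones(2), size=2)  # rows: C, cols: A
p_b_given_c = rng.dirichlet(np.ones(2), size=2)  # rows: C, cols: B
p_abc = np.einsum('c,ca,cb->abc', p_c, p_a_given_c, p_b_given_c)

# Check p(A, B | C=c) == p(A | C=c) p(B | C=c) for each value of C.
for c in range(2):
    p_ab_given_c = p_abc[:, :, c] / p_abc[:, :, c].sum()
    outer = (p_ab_given_c.sum(axis=1, keepdims=True) *
             p_ab_given_c.sum(axis=0, keepdims=True))
    assert np.allclose(p_ab_given_c, outer)
```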

May 3, 2020 · 14 min

EM (Expectation–Maximization) Algorithm

Jensen’s inequality Theorem: Let $f$ be a convex function, and let $X$ be a random variable. Then: $$E[f(X)] \geq f(E[X])$$ Moreover, if $f$ is strictly convex, then $E[f(X)] = f(E[X])$ holds true if and only if $X$ is a constant. Later in the post we are going to use the following fact, which follows by applying Jensen’s inequality to the concave function $\log$: suppose $\lambda_j \geq 0$ for all $j$ and $\sum_j \lambda_j = 1$, then $$ \log \sum_j \lambda_j y_j \geq \sum_j \lambda_j \log y_j$$...
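
A quick numerical spot-check of that fact, with randomly drawn illustrative weights $\lambda_j$ and positive values $y_j$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Random convex weights lambda_j (non-negative, summing to 1) and positive values y_j.
lam = rng.dirichlet(np.ones(5))
y = rng.uniform(0.1, 10.0, size=5)

lhs = np.log(np.sum(lam * y))    # log of the weighted average
rhs = np.sum(lam * np.log(y))    # weighted average of the logs
assert lhs >= rhs                # holds because log is concave
print(lhs, rhs)
```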

April 24, 2020 · 8 min

Skip-gram

Comparison between CBOW and Skip-gram The major difference is that skip-gram works better for infrequent words than CBOW in word2vec. For simplicity, suppose there is a sentence “$w_1w_2w_3w_4$”, and the window size is $1$. CBOW learns to predict the word given a context, i.e., it maximizes the following probability $$ p(w_2|w_1,w_3) \cdot p(w_3|w_2,w_4)$$ This is an issue for infrequent words, since they don’t appear very often in a given context....
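
To make the contrast concrete, the sketch below enumerates the training examples each model would build from the toy sentence “$w_1 w_2 w_3 w_4$” with window size 1. It is a simplified illustration that ignores subsampling, negative sampling, and other details of real word2vec implementations:

```python
# Toy sentence and window size from the example above.
sentence = ["w1", "w2", "w3", "w4"]
window = 1

# CBOW: predict the center word from the surrounding context words
# (the post's simplified objective keeps only the interior centers w2 and w3).
cbow_examples = []
for i, center in enumerate(sentence):
    context = sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window]
    cbow_examples.append((context, center))

# Skip-gram: predict each context word from the center word,
# so every (center, context) pair is its own training example.
skipgram_examples = []
for i, center in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + 1 + window)):
        if j != i:
            skipgram_examples.append((center, sentence[j]))

print(cbow_examples)      # [(['w2'], 'w1'), (['w1', 'w3'], 'w2'), ...]
print(skipgram_examples)  # [('w1', 'w2'), ('w2', 'w1'), ('w2', 'w3'), ...]
```

Note how each word appears as its own prediction target in every skip-gram pair, rather than being averaged into a context as in CBOW, which is one way to see why skip-gram tends to represent infrequent words better.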

April 18, 2020 · 8 min

NLP Basics, Spell Correction with Noisy Channel

NLP = NLU + NLG NLU: Natural Language Understanding NLG: Natural Language Generation NLG may be viewed as the opposite of NLU: whereas in NLU, the system needs to disambiguate the input sentence to produce the machine representation language, in NLG the system needs to make decisions about how to put a concept into human-understandable words. Classical applications in NLP: Question Answering, Sentiment Analysis...

April 10, 2020 · 9 min