Knowledge Distillation

Currently, especially in NLP, very large-scale models are being trained, and a large portion of them cannot even fit on an average person's hardware. We can instead train a small network that runs on the limited computational resources of a mobile device, but small models cannot extract many of the complex features that are handy for generating predictions unless we devise some elegant algorithm to do so. Moreover, due to the law of diminishing returns, a large increase in model size yields only a small increase in accuracy....
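
As a rough companion to this teaser, here is a minimal sketch of a distillation loss in NumPy. The temperature T, the softmax/distillation_loss helper names, and the logit values are illustrative assumptions, not code from the post.

```python
import numpy as np

def softmax(z, T=1.0):
    """Softmax with temperature T; a higher T gives softer probabilities."""
    z = np.asarray(z, dtype=float) / T
    z = z - z.max()                      # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL divergence between softened teacher and student outputs, scaled by T^2."""
    p = softmax(teacher_logits, T)       # soft targets from the large teacher model
    q = softmax(student_logits, T)       # student predictions at the same temperature
    return (T ** 2) * np.sum(p * (np.log(p) - np.log(q)))

teacher = [8.0, 2.0, 0.5]                # hypothetical teacher logits
student = [4.0, 1.0, 0.2]                # hypothetical student logits
print(distillation_loss(student, teacher))
```

Softening both sets of logits with the same temperature lets the small model learn from the teacher's full output distribution instead of from hard labels alone.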

August 5, 2020 · 7 min

Intro to Deep Learning and Backpropagation

Deep Learning vs. Machine Learning The major difference between Deep Learning and Machine Learning techniques is the problem-solving approach. Deep Learning techniques tend to solve the problem end to end, whereas Machine Learning techniques need the problem to be broken down into parts that are solved first, with their results combined at the final stage. Forward Propagation The general procedure is the following: $$ \begin{aligned} a^{(1)}(x) &= w^{(1)^T} \cdot x + b^{(1)} \\ h^{(1)}(x) &= g_1(a^{(1)}(x)) \\ a^{(2)}(x) &= w^{(2)^T} \cdot h^{(1)}(x) + b^{(2)} \\ h^{(2)}(x) &= g_2(a^{(2)}(x)) \\ &\;\;\vdots \\ a^{(L+1)}(x) &= w^{(L+1)^T} \cdot h^{(L)}(x) + b^{(L+1)} \\ h^{(L+1)}(x) &= g_{L+1}(a^{(L+1)}(x)) \end{aligned} $$...
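
As a concrete companion to the equations above, here is a minimal NumPy forward pass. The layer sizes, the relu/sigmoid activations, and the forward helper are assumptions for illustration, not the post's code.

```python
import numpy as np

def forward(x, weights, biases, activations):
    """Forward propagation: a^(l) = W^(l)^T h^(l-1) + b^(l), h^(l) = g_l(a^(l)), with h^(0) = x."""
    h = x
    for W, b, g in zip(weights, biases, activations):
        a = W.T @ h + b                  # pre-activation a^(l)
        h = g(a)                         # activation h^(l)
    return h

relu = lambda a: np.maximum(a, 0.0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)   # hypothetical 3 -> 4 layer
W2, b2 = rng.normal(size=(4, 2)), np.zeros(2)   # hypothetical 4 -> 2 layer
x = rng.normal(size=3)
print(forward(x, [W1, W2], [b1, b2], [relu, sigmoid]))
```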

May 26, 2020 · 7 min

Gaussian mixture model (GMM), k-means

Gaussian mixture model (GMM) A Gaussian mixture model is a probabilistic model that assumes all the data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters. Interpretation from geometry: $p(x)$ is a weighted sum of multiple Gaussian distributions. $$p(x)=\sum_{k=1}^{K} \alpha_{k} \cdot \mathcal{N}\left(x | \mu_{k}, \Sigma_{k}\right) $$ Interpretation from the mixture-model setup: $K$, the total number of Gaussian components; $x$, a sample (observed variable)....
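
A small NumPy sketch that evaluates the mixture density $p(x)$ defined above; the two-component 2D mixture and the helper names are made up for illustration.

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Multivariate normal density N(x | mu, Sigma)."""
    d = len(mu)
    diff = x - mu
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm

def gmm_density(x, alphas, mus, Sigmas):
    """p(x) = sum_k alpha_k * N(x | mu_k, Sigma_k)."""
    return sum(a * gaussian_pdf(x, m, S) for a, m, S in zip(alphas, mus, Sigmas))

# Hypothetical 2-component mixture in 2D.
alphas = [0.3, 0.7]                              # mixture weights, sum to 1
mus = [np.zeros(2), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), 2.0 * np.eye(2)]
print(gmm_density(np.array([1.0, 1.0]), alphas, mus, Sigmas))
```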

May 16, 2020 · 6 min

Probabilistic Graphical Model (PGM)

Probabilistic Graphical Model (PGM) Definition: A probabilistic graphical model is a probabilistic model for which a graph expresses the conditional dependence structure between random variables. In general, a PGM obeys the following rules: $$ \begin{aligned} &\text {Sum Rule : } p\left(x_{1}\right)=\int p\left(x_{1}, x_{2}\right) d x_{2}\\ &\text {Product Rule : } p\left(x_{1}, x_{2}\right)=p\left(x_{1} | x_{2}\right) p\left(x_{2}\right)\\ &\text {Chain Rule: } p\left(x_{1}, x_{2}, \cdots, x_{p}\right)=\prod_{i=1}^{p} p\left(x_{i} | x_{i+1}, x_{i+2}, \ldots, x_{p}\right)\\ &\text {Bayesian Rule: } p\left(x_{1} | x_{2}\right)=\frac{p\left(x_{2} | x_{1}\right) p\left(x_{1}\right)}{p\left(x_{2}\right)} \end{aligned} $$...
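
A quick numerical sanity check of these rules on a hypothetical 2x2 joint distribution; the probability table below is invented for illustration only.

```python
import numpy as np

# Invented joint distribution p(x1, x2) over two binary variables.
p_joint = np.array([[0.10, 0.30],    # rows index x1, columns index x2
                    [0.20, 0.40]])

p_x1 = p_joint.sum(axis=1)           # sum rule: p(x1) = sum over x2 of p(x1, x2)
p_x2 = p_joint.sum(axis=0)

p_x1_given_x2 = p_joint / p_x2              # product rule rearranged: p(x1|x2) = p(x1,x2)/p(x2)
p_x2_given_x1 = p_joint / p_x1[:, None]     # p(x2|x1) = p(x1,x2)/p(x1)

# Bayes' rule: p(x1|x2) = p(x2|x1) p(x1) / p(x2)
bayes = p_x2_given_x1 * p_x1[:, None] / p_x2
print(np.allclose(bayes, p_x1_given_x2))    # True
```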

May 11, 2020 · 8 min

EM (Expectation–Maximization) Algorithm

Jensen's inequality Theorem: Let $f$ be a convex function, and let $X$ be a random variable. Then: $$E[f(X)] \geq f(E[X])$$ $\quad$ Moreover, if $f$ is strictly convex, then $E[f(X)] = f(E[X])$ holds if and only if $X$ is a constant. Later in the post we are going to use the following fact, which follows from Jensen's inequality because $\log$ is concave: suppose $\lambda_j \geq 0$ for all $j$ and $\sum_j \lambda_j = 1$, then $$ \log \sum_j \lambda_j y_j \geq \sum_j \lambda_j \log y_j$$...
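
A small numerical check of that fact, with randomly drawn weights $\lambda_j$ and positive values $y_j$; the specific random choices are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
lam = rng.dirichlet(np.ones(5))      # lambda_j >= 0 and sum to 1
y = rng.uniform(0.1, 10.0, size=5)   # positive y_j so the logs are defined

lhs = np.log(np.sum(lam * y))        # log of the weighted average
rhs = np.sum(lam * np.log(y))        # weighted average of the logs
print(lhs, rhs, lhs >= rhs)          # concavity of log guarantees lhs >= rhs
```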

April 24, 2020 · 8 min

Distributed representation, Hyperbolic Space, Gaussian/Graph Embedding

Overview of various word representation and embedding methods Local Representation vs. Distributed Representation One-hot encoding is a local representation and is good for local generalization; distributed representations are good for global generalization. Comparison between local generalization and global generalization: here is an example to better understand this pair of concepts. Suppose you have a bunch of ingredients and you are able to cook 100 different meals with them....
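
A toy contrast between the two kinds of representation; the vocabulary, embedding dimension, and random embedding matrix are placeholders, since real distributed representations are learned from data.

```python
import numpy as np

vocab = ["apple", "banana", "carrot"]          # hypothetical toy vocabulary
idx = {w: i for i, w in enumerate(vocab)}

# Local (one-hot) representation: one dimension per word, dimensions are independent.
one_hot = np.eye(len(vocab))

# Distributed representation: a dense embedding matrix (random here, normally learned),
# so every word is described by all dimensions jointly.
rng = np.random.default_rng(0)
embedding = rng.normal(size=(len(vocab), 4))   # 4-dimensional embeddings

print(one_hot[idx["apple"]])                   # sparse: [1. 0. 0.]
print(embedding[idx["apple"]])                 # dense 4-dimensional vector
```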

April 17, 2020 · 7 min