Kaggle: Tweet Sentiment Extraction - common methods

Note This post is the first half of an overall summary of the competition. The second half is here. Before we start: I attended two NLP competitions in June, Tweet Sentiment Extraction and Jigsaw Multilingual Toxic Comment Classification, and I’m happy to be a Kaggle Expert from now on :) Tweet Sentiment Extraction Goal: The objective of this competition is to “Extract support phrases for sentiment labels”. More precisely, the competition asks Kagglers to build a model that can figure out which word or phrase in a tweet best supports its labeled sentiment....

July 1, 2020 · 11 min

Recurrent Neural Network (RNN) and Long Short Term Memory (LSTM)

Sequence Data There is a lot of sequence data in applications. Here are some examples: Machine translation, from text sequence to text sequence. Text summarization, from text sequence to text sequence. Sentiment classification, from text sequence to categories. Music generation, from nothing or some simple input (character, integer, etc.) to a wave sequence. Named entity recognition (NER)...
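
For the sentiment-classification case (text sequence in, category out), here is a minimal PyTorch sketch of an LSTM classifier; the vocabulary size, layer widths, and number of classes are arbitrary assumptions, not values from the post:

```python
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    """Sequence of token ids -> sentiment category (toy sizes are assumptions)."""
    def __init__(self, vocab_size=1000, embed_dim=32, hidden_dim=64, num_classes=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):              # (batch, seq_len)
        x = self.embed(token_ids)              # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(x)             # h_n: (1, batch, hidden_dim)
        return self.fc(h_n[-1])                # (batch, num_classes) logits

logits = LSTMClassifier()(torch.randint(0, 1000, (2, 10)))
print(logits.shape)  # torch.Size([2, 3])
```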

June 4, 2020 · 5 min

Intro to Deep Learning and Backpropagation

Deep Learning vs. Machine Learning The major difference between Deep Learning and Machine Learning techniques is the problem-solving approach. Deep Learning techniques tend to solve the problem end to end, whereas Machine Learning techniques need the problem to be broken down into different parts that are solved first, with their results combined at a final stage. Forward Propagation The general procedure is the following: $$ \begin{aligned} a^{(1)}(x) &= w^{(1)^T} \cdot x + b^{(1)} \\ h^{(1)}(x) &= g_1(a^{(1)}(x)) \\ a^{(2)}(x) &= w^{(2)^T} \cdot h^{(1)}(x) + b^{(2)} \\ h^{(2)}(x) &= g_2(a^{(2)}(x)) \\ &\;\;\vdots \\ a^{(L+1)}(x) &= w^{(L+1)^T} \cdot h^{(L)}(x) + b^{(L+1)} \\ h^{(L+1)}(x) &= g_{L+1}(a^{(L+1)}(x)) \end{aligned} $$...
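
As a quick numerical companion to these equations, here is a minimal NumPy sketch of the forward pass; the layer sizes, activation functions, and random weights are illustrative assumptions, not taken from the post:

```python
import numpy as np

def forward(x, weights, biases, activations):
    """Forward propagation: h^{(l)} = g_l(w^{(l)T} h^{(l-1)} + b^{(l)})."""
    h = x
    for W, b, g in zip(weights, biases, activations):
        a = W.T @ h + b      # pre-activation a^{(l)}(x)
        h = g(a)             # activation h^{(l)}(x)
    return h

relu = lambda a: np.maximum(a, 0)
softmax = lambda a: np.exp(a - a.max()) / np.exp(a - a.max()).sum()

# toy 2-layer network: 3 inputs -> 4 hidden units -> 2 outputs (shapes are assumptions)
rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 4)), rng.normal(size=(4, 2))]
biases = [np.zeros(4), np.zeros(2)]
x = np.array([0.5, -1.0, 2.0])
print(forward(x, weights, biases, [relu, softmax]))
```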

May 26, 2020 · 7 min

Log-Linear Model, Conditional Random Field (CRF)

Log-Linear model Let $x$ be an example, and let $y$ be a possible label for it. A log-linear model assumes that $$ p(y | x ; w)=\frac{\exp [\sum_{j=1}^J w_{j} F_{j}(x, y)]}{Z(x, w)} $$ where the partition function $$ Z(x, w)=\sum_{y^{\prime}} \exp [\sum_{j=1}^J w_{j} F_{j}\left(x, y^{\prime}\right)] $$ Note that $\sum_{y^{\prime}}$ sums over all possible labels $y^{\prime}$. Therefore, given $x$, the label predicted by the model is $$ \hat{y}=\underset{y}{\operatorname{argmax}} p(y | x ; w)=\underset{y}{\operatorname{argmax}} \sum_{j=1}^J w_{j} F_{j}(x, y) $$...
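
A minimal NumPy sketch of this prediction rule; the label set, feature function $F$, and weights below are made-up assumptions for illustration (note that $Z(x, w)$ cancels in the argmax):

```python
import numpy as np

def log_linear_predict(x, labels, w, F):
    """argmax_y sum_j w_j F_j(x, y), with p(y | x; w) computed via the partition function."""
    scores = {y: w @ F(x, y) for y in labels}               # sum_j w_j F_j(x, y)
    Z = sum(np.exp(s) for s in scores.values())              # partition function Z(x, w)
    probs = {y: np.exp(s) / Z for y, s in scores.items()}    # p(y | x; w)
    return max(probs, key=probs.get), probs

# toy example: x is a word, F returns J = 2 hand-made indicator features (assumptions)
F = lambda x, y: np.array([float(x.endswith("ing") and y == "VERB"),
                           float(x[0].isupper() and y == "NOUN")])
w = np.array([1.5, 2.0])
print(log_linear_predict("running", ["NOUN", "VERB"], w, F))
```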

May 19, 2020 · 8 min

Gaussian mixture model (GMM), k-means

Gaussian mixture model (GMM) A Gaussian mixture model is a probabilistic model that assumes all the data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters. Interpretation from geometry: $p(x)$ is a weighted sum of multiple Gaussian distributions. $$p(x)=\sum_{k=1}^{K} \alpha_{k} \cdot \mathcal{N}\left(x | \mu_{k}, \Sigma_{k}\right) $$ Interpretation from the mixture model setup: $K$, the total number of Gaussian components. $x$, a sample (observed variable)....
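
A small Python sketch of the weighted-sum density, using SciPy for the Gaussian pdf; the mixture weights, means, and variances below are arbitrary assumptions:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_density(x, alphas, mus, sigmas):
    """p(x) = sum_k alpha_k * N(x | mu_k, Sigma_k)."""
    return sum(a * multivariate_normal.pdf(x, mean=mu, cov=S)
               for a, mu, S in zip(alphas, mus, sigmas))

# toy 1-D mixture with K = 2 components (weights, means, variances are assumptions)
alphas = [0.3, 0.7]
mus = [0.0, 5.0]
sigmas = [1.0, 2.0]
print(gmm_density(1.0, alphas, mus, sigmas))
```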

May 16, 2020 · 6 min

Probabilistic Graphical Model (PGM)

Probabilistic Graphical Model (PGM) Definition: A probabilistic graphical model is a probabilistic model for which a graph expresses the conditional dependence structure between random variables. In general, a PGM obeys the following rules: $$ \begin{aligned} &\text {Sum Rule: } p\left(x_{1}\right)=\int p\left(x_{1}, x_{2}\right) d x_{2}\\ &\text {Product Rule: } p\left(x_{1}, x_{2}\right)=p\left(x_{1} | x_{2}\right) p\left(x_{2}\right)\\ &\text {Chain Rule: } p\left(x_{1}, x_{2}, \cdots, x_{p}\right)=\prod_{i=1}^{p} p\left(x_{i} | x_{i+1}, x_{i+2}, \ldots, x_{p}\right)\\ &\text {Bayes' Rule: } p\left(x_{1} | x_{2}\right)=\frac{p\left(x_{2} | x_{1}\right) p\left(x_{1}\right)}{p\left(x_{2}\right)} \end{aligned} $$...
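
These rules can be checked numerically on a small discrete joint distribution; the 2×2 table below is a made-up example, not from the post:

```python
import numpy as np

# made-up joint p(x1, x2) over two binary variables; rows index x1, columns index x2
p = np.array([[0.1, 0.3],
              [0.2, 0.4]])

# Sum rule: p(x1) = sum_{x2} p(x1, x2), and similarly for p(x2)
p_x1 = p.sum(axis=1)
p_x2 = p.sum(axis=0)

# Product rule: p(x1, x2) = p(x1 | x2) p(x2)
p_x1_given_x2 = p / p_x2                    # each column is p(x1 | x2)
assert np.allclose(p, p_x1_given_x2 * p_x2)

# Bayes' rule: p(x1 | x2) = p(x2 | x1) p(x1) / p(x2)
p_x2_given_x1 = p / p_x1[:, None]           # each row is p(x2 | x1)
assert np.allclose(p_x1_given_x2, p_x2_given_x1 * p_x1[:, None] / p_x2)

print(p_x1, p_x2)
```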

May 11, 2020 · 8 min