Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM)

Sequence Data Sequence data arises in many applications. Here are some examples: machine translation, from text sequence to text sequence; text summarization, from text sequence to text sequence; sentiment classification, from text sequence to categories; music generation, from nothing or a simple seed (a character, an integer, etc.) to a wave sequence; named entity recognition (NER)...

June 4, 2020 · 5 min

Log-Linear Model, Conditional Random Field (CRF)

Log-Linear model Let $x$ be an example, and let $y$ be a possible label for it. A log-linear model assumes that $$ p(y | x ; w)=\frac{\exp [\sum_{j=1}^J w_{j} F_{j}(x, y)]}{Z(x, w)} $$ where the partition function $$ Z(x, w)=\sum_{y^{\prime}} \exp [\sum_{j=1}^J w_{j} F_{j}\left(x, y^{\prime}\right)] $$ Note that $\sum_{y^{\prime}}$ sums over all possible labels $y^{\prime}$. Therefore, given $x$, the label predicted by the model is $$ \hat{y}=\underset{y}{\operatorname{argmax}} p(y | x ; w)=\underset{y}{\operatorname{argmax}} \sum_{j=1}^J w_{j} F_{j}(x, y) $$...
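As a quick illustration of the prediction rule (not from the post), here is a minimal Python sketch with a hypothetical label set, hand-picked feature functions, and toy weights; note that the partition function $Z(x, w)$ cancels in the argmax.

```python
import math

# Hypothetical feature functions F_j(x, y); here x is a token and y a candidate tag.
def feature_functions(x, y):
    return [
        1.0 if y == "NOUN" and x.endswith("ness") else 0.0,
        1.0 if y == "VERB" and x.endswith("ing") else 0.0,
        1.0 if y == "NOUN" else 0.0,  # bias feature for NOUN
    ]

def score(x, y, w):
    """Linear score: sum_j w_j * F_j(x, y)."""
    return sum(w_j * f_j for w_j, f_j in zip(w, feature_functions(x, y)))

def prob(x, y, labels, w):
    """p(y | x; w), normalized by the partition function Z(x, w)."""
    z = sum(math.exp(score(x, y_prime, w)) for y_prime in labels)
    return math.exp(score(x, y, w)) / z

def predict(x, labels, w):
    """argmax_y p(y | x; w): Z(x, w) is constant in y, so the linear score suffices."""
    return max(labels, key=lambda y: score(x, y, w))

labels = ["NOUN", "VERB"]
w = [2.0, 3.0, 0.5]  # toy weights, not learned
print(predict("running", labels, w))                    # -> VERB
print(round(prob("happiness", "NOUN", labels, w), 3))   # -> 0.924
```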

May 19, 2020 · 8 min

Hidden Markov Model (HMM)

Before reading this post, make sure you are familiar with the EM algorithm and have a decent amount of knowledge of convex optimization. If not, please check out my previous posts: EM Algorithm, convex optimization, primal and dual problem. Let’s get started! Conditional independence $A$ and $B$ are conditionally independent given $C$ if and only if, given knowledge that $C$ occurs, knowledge of whether $A$ occurs provides no information on the likelihood of $B$ occurring, and knowledge of whether $B$ occurs provides no information on the likelihood of $A$ occurring....
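For reference, the standard symbolic form of this definition (stated here, not taken from the excerpt) is $$ P(A, B \mid C) = P(A \mid C)\, P(B \mid C), $$ or equivalently $P(A \mid B, C) = P(A \mid C)$ whenever $P(B \mid C) > 0$.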

May 3, 2020 · 14 min

Skip-gram

Comparison between CBOW and Skip-gram The major difference is that skip-gram works better for infrequent words than CBOW in word2vec. For simplicity, suppose there is a sentence “$w_1w_2w_3w_4$”, and the window size is $1$. CBOW learns to predict the word given a context, that is, to maximize the following probability $$ p(w_2|w_1,w_3) \cdot p(w_3|w_2,w_4)$$ This is an issue for infrequent words, since they don’t appear very often in a given context....
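To make the contrast concrete, here is a small illustrative Python sketch (not from the post) that enumerates the training pairs each model would build from the toy sentence above with window size $1$; edge words with a one-sided context are kept for simplicity.

```python
# Training pairs for CBOW vs. skip-gram on the toy sentence "w1 w2 w3 w4",
# window size 1. CBOW predicts the center word from its context; skip-gram
# predicts each context word from the center word, so every word, frequent
# or not, gets its own training examples as a center word.
sentence = ["w1", "w2", "w3", "w4"]
window = 1

cbow_pairs = []       # (context words, center word)
skipgram_pairs = []   # (center word, one context word)

for i, center in enumerate(sentence):
    context = [sentence[j]
               for j in range(max(0, i - window), min(len(sentence), i + window + 1))
               if j != i]
    cbow_pairs.append((context, center))
    for ctx in context:
        skipgram_pairs.append((center, ctx))

print(cbow_pairs)
# [(['w2'], 'w1'), (['w1', 'w3'], 'w2'), (['w2', 'w4'], 'w3'), (['w3'], 'w4')]
print(skipgram_pairs)
# [('w1', 'w2'), ('w2', 'w1'), ('w2', 'w3'), ('w3', 'w2'), ('w3', 'w4'), ('w4', 'w3')]
```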

April 18, 2020 · 8 min

NLP Basics, Spell Correction with Noisy Channel

NLP = NLU + NLG NLU: Natural Language Understanding NLG: Natural Language Generation NLG may be viewed as the opposite of NLU: whereas in NLU the system needs to disambiguate the input sentence to produce the machine representation language, in NLG the system needs to decide how to put a concept into human-understandable words. Classical applications in NLP include Question Answering, Sentiment Analysis...

April 10, 2020 · 9 min

Kaggle: Google Quest Q&A Labeling - my solution

Kaggle: Google Quest Q&A Labeling summary General Part Congratulations to all winners of this competition. Your hard work paid off! First, I have to say thanks to the authors of the following three published notebooks: https://www.kaggle.com/akensert/bert-base-tf2-0-now-huggingface-transformer, https://www.kaggle.com/abhishek/distilbert-use-features-oof, https://www.kaggle.com/codename007/start-from-here-quest-complete-eda-fe. These notebooks showed awesome ways to build models, visualize the dataset, and extract features from non-text data. Our initial plan was to feed the question title, question body, and answer all into a BERT-based model....
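As a rough sketch of that idea (hypothetical, not the team's actual pipeline), the three text fields could be packed into a single BERT input with the HuggingFace transformers tokenizer, for example by putting the title and body in the first segment and the answer in the second:

```python
# Hypothetical sketch: encode question title, question body, and answer
# together for a BERT-based model; field values below are made-up examples.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

title = "Why is the sky blue?"
body = "I have always wondered why the sky appears blue during the day."
answer = "Sunlight is scattered by air molecules; blue light scatters the most."

# Title + body as the first segment, answer as the second; the tokenizer adds
# [CLS]/[SEP] markers and truncates/pads to BERT's 512-token limit.
encoded = tokenizer(
    title + " " + body,
    answer,
    truncation=True,
    max_length=512,
    padding="max_length",
    return_tensors="np",
)
print(encoded["input_ids"].shape)  # (1, 512)
```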

April 4, 2020 · 7 min