Hidden Markov Model (HMM)

Before reading this post, make sure you are familiar with the EM algorithm and have a decent amount of knowledge of convex optimization. If not, please check out my previous posts on the EM algorithm and on convex optimization (primal and dual problems). Let’s get started! Conditional independence: $A$ and $B$ are conditionally independent given $C$ if and only if, given knowledge that $C$ occurs, knowledge of whether $A$ occurs provides no information on the likelihood of $B$ occurring, and knowledge of whether $B$ occurs provides no information on the likelihood of $A$ occurring....
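
In symbols (a standard formulation, not quoted from the post), this is the statement that

$$ P(A \cap B \mid C) = P(A \mid C)\,P(B \mid C), \quad \text{or equivalently} \quad P(A \mid B, C) = P(A \mid C). $$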

May 3, 2020 · 14 min

EM (Expectation–Maximization) Algorithm

Jensen’s inequality. Theorem: let $f$ be a convex function, and let $X$ be a random variable. Then: $$E[f(X)] \geq f(E[X])$$ Moreover, if $f$ is strictly convex, then $E[f(X)] = f(E[X])$ holds if and only if $X$ is a constant. Later in the post we are going to use the following consequence of Jensen’s inequality: suppose $\lambda_j \geq 0$ for all $j$ and $\sum_j \lambda_j = 1$; then $$ \log \sum_j \lambda_j y_j \geq \sum_j \lambda_j \log y_j$$...
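
A one-line justification of that fact, sketched here rather than quoted from the post: since $\log$ is concave, applying Jensen’s inequality to a random variable $Y$ that takes value $y_j$ with probability $\lambda_j$ gives

$$ \sum_j \lambda_j \log y_j = E[\log Y] \leq \log E[Y] = \log \sum_j \lambda_j y_j. $$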

April 24, 2020 · 8 min

Skip-gram

Comparison between CBOW and Skip-gram: the major difference is that, in word2vec, skip-gram handles infrequent words better than CBOW. For simplicity, suppose there is a sentence “$w_1w_2w_3w_4$” and the window size is $1$. CBOW learns to predict a word given its context, i.e., to maximize the probability $$ p(w_2|w_1,w_3) \cdot p(w_3|w_2,w_4)$$ This is an issue for infrequent words, since they don’t appear very often in a given context....
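
For contrast, and as my own continuation of the example rather than a quote from the post, skip-gram flips the conditioning and predicts each context word from the center word, so for the same sentence and window size it maximizes

$$ p(w_1|w_2) \cdot p(w_3|w_2) \cdot p(w_2|w_3) \cdot p(w_4|w_3), $$

which gives every word, including an infrequent one, its own turn on the conditioning side.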

April 18, 2020 · 8 min

Distributed representation, Hyperbolic Space, Gaussian/Graph Embedding

Overview of various word representation and embedding methods. Local representation vs. distributed representation: one-hot encoding is a local representation and is good for local generalization; a distributed representation is good for global generalization. Comparison between local generalization and global generalization: here is an example to better understand this pair of concepts. Suppose you have a bunch of ingredients and you’re able to cook 100 different meals with them....
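
A minimal sketch of the two kinds of representation, using NumPy and a made-up five-word vocabulary (my illustration, not code from the post):

```python
import numpy as np

vocab = ["king", "queen", "man", "woman", "apple"]

# Local (one-hot) representation: one dimension per word, every pair of words
# is equally far apart, so nothing learned about "king" transfers to "queen".
one_hot = np.eye(len(vocab))
print(one_hot[vocab.index("king")])     # [1. 0. 0. 0. 0.]

# Distributed representation: a small dense vector per word (random stand-ins
# here for learned embeddings), where related words can end up close together
# and each dimension is shared across the whole vocabulary.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(vocab), 3))
print(embeddings[vocab.index("king")])  # a 3-dimensional dense vector
```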

April 17, 2020 · 7 min

NLP Basics, Spell Correction with Noisy Channel

NLP = NLU + NLG. NLU: Natural Language Understanding. NLG: Natural Language Generation. NLG may be viewed as the opposite of NLU: whereas in NLU the system needs to disambiguate the input sentence to produce a machine representation of its meaning, in NLG the system needs to make decisions about how to put a concept into human-understandable words. Classical applications in NLP: question answering, sentiment analysis...

April 10, 2020 · 9 min

Kaggle: Google Quest Q&A Labeling - my solution

Kaggle: Google Quest Q&A Labeling summary. General part: congratulations to all winners of this competition, your hard work paid off! First, I have to say thanks to the authors of the following three published notebooks: https://www.kaggle.com/akensert/bert-base-tf2-0-now-huggingface-transformer, https://www.kaggle.com/abhishek/distilbert-use-features-oof, https://www.kaggle.com/codename007/start-from-here-quest-complete-eda-fe. These notebooks showed awesome ways to build models, visualize the dataset, and extract features from non-text data. Our initial plan was to feed the question title, question body, and answer all into a BERT-based model....
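
As a rough sketch of that kind of input encoding, using the Hugging Face transformers tokenizer (the model name, the way the fields are combined, and the max length are my assumptions, not details taken from the write-up):

```python
from transformers import AutoTokenizer

# Hypothetical example fields standing in for one row of the competition data.
title = "How do I merge two dictionaries in Python?"
body = "I have two dicts and want to combine them into a single one..."
answer = "You can use the ** unpacking syntax: {**a, **b}."

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# One simple scheme: question title + body as the first segment, the answer as
# the second, so BERT's segment embeddings separate question text from answer.
encoded = tokenizer(
    title + " " + body,
    answer,
    truncation=True,
    max_length=512,
    padding="max_length",
    return_tensors="np",
)
print(encoded["input_ids"].shape)  # (1, 512)
```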

April 4, 2020 · 7 min