SVM, Dual SVM, Non-linear SVM

Linear SVM Idea We want to find a hyper-plane $w^\top x + b = 0$ that maximizes the margin. Setup: We first show that the vector $w$ is orthogonal to this hyper-plane. Let $x_1$, $x_2$ be any two points on the hyper-plane, so $w^\top x_1 + b = 0$ and $w^\top x_2 + b = 0$. Then $w^\top (x_1 - x_2) = 0$, which implies $w$ is orthogonal to the hyper-plane....
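As a quick sketch of where this orthogonality leads (standard notation, not quoted from the truncated excerpt): projecting onto $w$ gives the point-to-hyper-plane distance, and under the usual scaling $y_i(w^\top x_i + b) \ge 1$ the margin to maximize is $2 / \lVert w \rVert$.
$$ \text{dist}\big(x_0,\ \{x : w^\top x + b = 0\}\big) = \frac{|w^\top x_0 + b|}{\lVert w \rVert}, \qquad \max_{w, b}\ \frac{2}{\lVert w \rVert} \ \text{ s.t. } \ y_i(w^\top x_i + b) \ge 1. $$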

April 1, 2020 · 7 min

Introduction to Convex Optimization - Primal problem to Dual problem

Consider an optimization problem in the standard form (we call this the primal problem); we denote its optimal value by $p^\star$. We don't assume the problem is convex. The Lagrange dual function We define the Lagrangian $L$ associated with the problem as $$ L(x,\lambda, v) = f_0(x) + \sum^m_{i=1}\lambda_i f_i(x) + \sum^p_{i=1}v_i h_i(x)$$ We call the vectors $\lambda$ and $v$ the dual variables or Lagrange multiplier vectors associated with the problem....
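For context (the standard definition, stated in the same notation but not quoted from the truncated excerpt): the Lagrange dual function is the infimum of the Lagrangian over $x$, and for any $\lambda \succeq 0$ it lower-bounds the primal optimum $p^\star$.
$$ g(\lambda, v) = \inf_{x} L(x, \lambda, v), \qquad g(\lambda, v) \le p^\star \ \text{ for all } \lambda \succeq 0. $$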

March 29, 2020 · 7 min

Introduction to Convex Optimization - Basic Concepts

Optimization problem All optimization problems can be written in a standard form (sketched below). Optimization categories: convex vs. non-convex (deep neural networks are non-convex); continuous vs. discrete (most variables are continuous; tree structures are discrete); constrained vs. unconstrained (we add a prior to make a problem constrained); smooth vs. non-smooth (most are smooth optimization). Different initializations lead to different optima (if the problem is not convex). Idea: give up on the global optimum and find a good local optimum....
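For reference, the standard form alluded to at the start of this excerpt is the usual one (my transcription, not quoted from the post): minimize an objective subject to inequality and equality constraints.
$$ \begin{aligned} \min_{x} \quad & f_0(x) \\ \text{s.t.} \quad & f_i(x) \le 0, \quad i = 1, \dots, m \\ & h_i(x) = 0, \quad i = 1, \dots, p \end{aligned} $$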

March 28, 2020 · 8 min

XGBoost

Bagging vs. Boosting: Bagging leverages unstable base learners that are weak because of overfitting. Boosting leverages stable base learners that are weak because of underfitting. XGBoost learning process: how to set an objective function; it is hard to solve directly, so how to approximate it; how to add tree structures into the objective function; still hard to optimize, so we use a greedy algorithm. Step 1. How to set an objective function Suppose we trained $K$ trees; then the prediction for the $i$-th sample is $$\hat{y}_i = \sum_{k=1}^K f_k(x_i), \quad f_k \in \mathcal{F}$$ where $K$ is the number of trees, $f_k$ is a function in the functional space $\mathcal{F}$, and $\mathcal{F}$ is the set of all possible CARTs....
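For context (this is the standard XGBoost objective, not quoted from the truncated excerpt): the prediction above is plugged into a training loss plus a per-tree complexity penalty, which is the objective the later steps approximate and optimize greedily.
$$ \text{obj} = \sum_{i=1}^n l\big(y_i, \hat{y}_i\big) + \sum_{k=1}^K \Omega(f_k) $$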

March 27, 2020 · 4 min

Maximum Likelihood Estimation (MLE) and Maximum a Posteriori (MAP)

MLE vs. MAP MLE: learn parameters from the data. MAP: add a prior (experience) to the model; more reliable when data is limited. As we get more and more data, the prior becomes less useful; as the data grow, MAP $\rightarrow$ MLE. Notation: $D = \{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}$ Framework: MLE: $\mathop{\rm arg\ max} P(D \ |\ \theta)$ MAP: $\mathop{\rm arg\ max} P(\theta \ |\ D)$ Note that a product of many numbers less than 1 approaches 0 as the number of factors grows, so computing it directly is impractical because of numerical underflow....
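A minimal numerical sketch of the underflow point (illustrative values, not from the post): multiplying many per-sample likelihoods underflows in double precision, while summing their logs does not, which is why MLE/MAP are computed on the log scale.

```python
import numpy as np

# Hypothetical per-sample likelihoods P(d_i | theta), each less than 1.
rng = np.random.default_rng(0)
likelihoods = rng.uniform(0.01, 0.5, size=10_000)

# The direct product underflows to 0.0 in double precision ...
direct_product = np.prod(likelihoods)

# ... while the sum of logs stays finite, so we maximize the
# log-likelihood (or log-posterior) instead of the raw product.
log_likelihood = np.sum(np.log(likelihoods))

print(direct_product)   # 0.0 (underflow)
print(log_likelihood)   # a large negative, but finite, number
```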

March 26, 2020 · 1 min

Logistic Regression, L1, L2 regularization, Gradient/Coordinate descent

Generative model vs. Discriminative model: Examples: Generative models: Naive Bayes, HMM, VAE, GAN. Discriminative models: Logistic Regression, CRF. Objective function: Generative model: $\max \ p(x, y)$ Discriminative model: $\max \ p(y \ |\ x)$ Difference: a generative model first assumes a distribution for the data, chosen with computational efficiency and the features of the data in mind; the model then learns the parameters of that distribution....
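A one-line contrast of the two objectives (standard factorization, not quoted from the truncated excerpt): a generative model such as Naive Bayes fits the joint $p(x, y) = p(y)\, p(x \ |\ y)$ and classifies via Bayes' rule, while a discriminative model such as logistic regression parameterizes $p(y \ |\ x)$ directly.
$$ \text{generative: } \max_\theta \prod_i p_\theta(y_i)\, p_\theta(x_i \ |\ y_i), \qquad \text{discriminative: } \max_\theta \prod_i p_\theta(y_i \ |\ x_i). $$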

March 25, 2020 · 4 min