Generative model vs. Discriminative model:
Examples:
- Generative model: Naive Bayes, HMM, VAE, GAN.
- Discriminative model: Logistic Regression, CRF.
Objective function:
- Generative model: maximize the joint likelihood $P(X, Y)$ (equivalently $P(X \mid Y)\,P(Y)$).
- Discriminative model: maximize the conditional likelihood $P(Y \mid X)$.
Difference:
- Generative model: We first assume a distribution for the data, chosen based on the data's characteristics and computational efficiency. The model then learns the parameters of that distribution, after which we can use it to generate new data (e.g. sample new points from the fitted normal distribution).
- Discriminative model: Its only purpose is to classify, that is, to tell the classes apart. As long as it finds a way to tell the difference, it does not need to learn anything else about the data.
Relation:
- Generative model: models $P(X, Y) = P(X \mid Y)\,P(Y)$, so it has a prior term $P(Y)$.
- Discriminative model: models $P(Y \mid X)$ directly.
- Both models can be used for classification, but a discriminative model can only do classification. For classification, the discriminative model usually performs better. On the other hand, with limited data the generative model might perform better, since its prior term plays the role of a regularizer (see the sketch below).
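A minimal sketch of the generative side of this relation, assuming Gaussian class-conditionals (Naive Bayes style): the classifier explicitly estimates and uses the prior $P(Y)$ together with $P(X \mid Y)$. The function names (`fit_gaussian_nb`, `predict`) and the toy data are illustrative, not from the notes.

```python
import numpy as np

# Minimal Gaussian Naive Bayes sketch: a generative classifier that models
# P(x | y) with per-class Gaussians and combines it with the prior P(y).
def fit_gaussian_nb(X, y):
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = {
            "prior": len(Xc) / len(X),      # P(y = c)
            "mean": Xc.mean(axis=0),        # per-feature mean of P(x | y = c)
            "var": Xc.var(axis=0) + 1e-9,   # per-feature variance (small jitter)
        }
    return params

def predict(params, X):
    scores = []
    for c, p in params.items():
        # log P(x | y = c) under independent Gaussians + log prior P(y = c)
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * p["var"])
                                + (X - p["mean"]) ** 2 / p["var"], axis=1)
        scores.append(log_lik + np.log(p["prior"]))
    return np.array(list(params.keys()))[np.argmax(scores, axis=0)]

# Toy usage: two Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
model = fit_gaussian_nb(X, y)
print(predict(model, np.array([[0.0, 0.0], [3.0, 3.0]])))  # expect [0 1]
```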
Logistic regression
Formula: $P(y = 1 \mid x) = \sigma(w^\top x + b) = \frac{1}{1 + e^{-(w^\top x + b)}}$
Derivative formula: for the cross-entropy loss $L = -\sum_i \big[y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\big]$ with $\hat{y}_i = \sigma(w^\top x_i + b)$, the gradient is $\nabla_w L = \sum_i (\hat{y}_i - y_i)\,x_i$ and $\partial L / \partial b = \sum_i (\hat{y}_i - y_i)$.
Logistic Regression does not have an analytic (closed-form) solution, so we need iterative optimization to find one.
Computing the gradient over the entire dataset at every step takes a lot of computational power, which is why stochastic and mini-batch variants (below) are used.
Derivation of Logistic Regression
Gradient:
True gradient: the gradient of the population (expected) loss.
Empirical gradient: the gradient computed on a sample, used to approximate the true gradient.
- SGD: a poor per-step approximation (high variance), but it asymptotically converges to the true gradient on average.
- GD: a good approximation, since it uses the whole dataset at every step.
- Mini-batch GD: in between.
For smooth optimization, we can use gradient descent.
For non-smooth optimization (e.g. objectives with an L1 penalty), we can use coordinate descent.
Mini-Batch Gradient Descent for Logistic Regression

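A minimal numpy sketch of mini-batch gradient descent for logistic regression under the cross-entropy loss above; the batch size, learning rate, epoch count, and toy data are arbitrary illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def minibatch_gd_logreg(X, y, lr=0.1, batch_size=32, epochs=100, seed=0):
    """Mini-batch gradient descent on the cross-entropy loss of logistic regression."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        idx = rng.permutation(n)
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            Xb, yb = X[batch], y[batch]
            # Mini-batch estimate of the gradient: (sigmoid(Xw + b) - y) averaged over the batch.
            err = sigmoid(Xb @ w + b) - yb
            w -= lr * (Xb.T @ err) / len(batch)
            b -= lr * err.mean()
    return w, b

# Toy usage: two Gaussian blobs in 2D.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1, 1, (100, 2)), rng.normal(1, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)
w, b = minibatch_gd_logreg(X, y)
acc = ((sigmoid(X @ w + b) > 0.5) == y).mean()
print(acc)  # expect high training accuracy on this easy toy problem
```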
Ways to prevent overfitting:
- More data.
- Regularization.
- Ensemble models.
- Less complicated (simpler) models.
- Fewer features.
- Add noise (e.g. Dropout)
L1 regularization
L1 performs feature selection (it keeps a subset of the original features), whereas PCA changes the features (it builds new ones as linear combinations).
Why prefer sparsity:
- Reduced dimensionality, hence less computation.
- Higher interpretability.
Problem of L1:
- Group Effect: If features are collinear, L1 tends to pick one feature from each correlated group essentially arbitrarily, so the best feature in a group might not be the one selected (illustrated in the sketch below).
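A small sketch of the group effect, assuming sklearn is available: with two nearly identical columns, the Lasso tends to put weight on only one of them (which one can depend on the data and the solver). The value `alpha = 0.1` and the toy data are arbitrary.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)   # nearly collinear with x1
x3 = rng.normal(size=n)               # independent feature
X = np.column_stack([x1, x2, x3])
y = x1 + x2 + x3 + 0.1 * rng.normal(size=n)

# L1 tends to keep only one of the correlated pair (x1, x2), often zeroing the other.
print(Lasso(alpha=0.1).fit(X, y).coef_)
```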
Coordinate Descent for Lasso
- Intuition: if we are at a point $\hat{\beta}$ such that the (convex) Lasso objective is minimized along each coordinate axis, then $\hat{\beta}$ is a global minimizer. This works because the non-smooth part of the objective, the L1 penalty, is separable across coordinates.
- Each step of Coordinate Descent minimizes the objective exactly over one coordinate (soft-thresholding for the Lasso, see the sketch below). Note that we don't need a learning rate here, since we are finding the optimal value directly.

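A minimal sketch of coordinate descent for the Lasso objective $\frac{1}{2n}\lVert y - X\beta\rVert_2^2 + \lambda\lVert\beta\rVert_1$: each coordinate is minimized exactly via soft-thresholding, so no learning rate appears. The function names and toy data are illustrative.

```python
import numpy as np

def soft_threshold(rho, lam):
    return np.sign(rho) * np.maximum(np.abs(rho) - lam, 0.0)

def lasso_coordinate_descent(X, y, lam=0.1, n_iters=100):
    """Coordinate descent for (1/(2n))||y - X beta||^2 + lam * ||beta||_1."""
    n, d = X.shape
    beta = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0) / n          # (1/n) * X_j^T X_j for each j
    for _ in range(n_iters):
        for j in range(d):
            # Partial residual that excludes feature j's current contribution.
            r_j = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r_j / n
            # Exact 1-D minimizer: soft-thresholding, so no learning rate is needed.
            beta[j] = soft_threshold(rho, lam) / col_sq[j]
    return beta

# Toy usage: sparse ground truth, expect roughly [2, 0, -3, 0, 0] (slightly shrunk).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
beta_true = np.array([2.0, 0.0, -3.0, 0.0, 0.0])
y = X @ beta_true + 0.1 * rng.normal(size=200)
print(np.round(lasso_coordinate_descent(X, y, lam=0.1), 2))
```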
Large eigenvalues of $(X^\top X)^{-1}$ caused by co-linear columns
If a matrix $A$ has an eigenvalue $\mu$, then $A^{-1}$ has the eigenvalue $1/\mu$.
If a matrix $X$ has (nearly) co-linear columns, then $X^\top X$ has eigenvalues close to $0$, so $(X^\top X)^{-1}$ has some very large eigenvalues.
If we plug $y = X\beta^{*} + \epsilon$ into the OLS solution $\hat{\beta} = (X^\top X)^{-1} X^\top y$, we get $\hat{\beta} = \beta^{*} + (X^\top X)^{-1} X^\top \epsilon$.
Multiplying the noise $\epsilon$ by $(X^\top X)^{-1} X^\top$ amplifies it along the directions with tiny eigenvalues, so $\hat{\beta}$ can end up far from $\beta^{*}$.
If we now add some regularization (aka weight decay), the solution becomes $\hat{\beta} = (X^\top X + \lambda I)^{-1} X^\top y$.
Adding a small multiple of the identity to $X^\top X$ lifts its smallest eigenvalue to at least $\lambda$, so the eigenvalues of $(X^\top X + \lambda I)^{-1}$ are bounded by $1/\lambda$ and the noise is no longer blown up.
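A small numpy illustration of the argument above, with a nearly duplicated column: $X^\top X$ gets a tiny eigenvalue, OLS amplifies the noise, and adding $\lambda I$ tames it. The choice $\lambda = 1.0$ and the toy data are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
X = np.column_stack([x1, x1 + 1e-3 * rng.normal(size=n)])  # nearly co-linear columns
beta_true = np.array([1.0, 1.0])
y = X @ beta_true + 0.1 * rng.normal(size=n)

XtX = X.T @ X
print("eigenvalues of X^T X:", np.linalg.eigvalsh(XtX))     # one eigenvalue is tiny

# OLS: the tiny eigenvalue of X^T X blows up the noise term (X^T X)^{-1} X^T eps.
beta_ols = np.linalg.solve(XtX, X.T @ y)
# Ridge: adding lam * I lifts the smallest eigenvalue, so the inverse stays bounded.
lam = 1.0
beta_ridge = np.linalg.solve(XtX + lam * np.eye(2), X.T @ y)
print("OLS:  ", beta_ols)     # typically far from [1, 1]
print("ridge:", beta_ridge)   # typically close to [1, 1]
```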
Building Models with prior knowledge (put relation into models via regularization):
- Model + regularization (see the sketch below).
- Constrained optimization.
- Probabilistic model (e.g. Probabilistic Graphical Model (PGM)).
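One way to read "put relation into models via regularization", as a sketch: if prior knowledge says two coefficients should be similar, add a penalty $\lambda(\beta_0 - \beta_1)^2$ to the least-squares objective. The relation matrix $D$ and the helper `fit_with_relation_penalty` below are illustrative assumptions, not from the notes.

```python
import numpy as np

# Encode the prior knowledge "beta_0 and beta_1 should be similar" as a penalty
# lam * (beta_0 - beta_1)^2 = lam * beta^T D beta with D = d d^T, d = [1, -1, 0].
def fit_with_relation_penalty(X, y, lam=10.0):
    d = np.array([1.0, -1.0, 0.0])
    D = np.outer(d, d)
    # Closed form of min_beta ||y - X beta||^2 + lam * beta^T D beta.
    return np.linalg.solve(X.T @ X + lam * D, X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 1.0, -2.0]) + 0.1 * rng.normal(size=100)
print(fit_with_relation_penalty(X, y))   # coefficients 0 and 1 are pulled toward each other
```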