Lecture 1: Introduction to Deep Learning
Not about the learning aspect of deep learning (except for the first two lectures); about the systems aspect of deep learning: faster training, efficient serving, lower memory consumption.
Lecture 3: Components Overview of Deep Learning System
Typical Deep Learning System Stack
User API: Programming API; Gradient Calculation (Differentiation API)
System Components: Computational Graph Optimization and Execution; Runtime Parallel Scheduling
Architecture: GPU Kernels, Optimizing Device Code; Accelerators and Hardwares
Lecture 4: Backprop and Automatic Differentiation
Numerical differentiation
Backpropagation
Automatic differentiation
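Numerical differentiation is mainly useful as a correctness check for backprop/autodiff code. A minimal central-difference sketch (hypothetical helper, plain Python):

```python
def numerical_grad(f, x, eps=1e-6):
    # Central difference (f(x+eps) - f(x-eps)) / (2*eps) per coordinate;
    # O(eps^2) truncation error, but costs two evaluations of f per dimension.
    grad = []
    for i in range(len(x)):
        x_hi = list(x); x_hi[i] += eps
        x_lo = list(x); x_lo[i] -= eps
        grad.append((f(x_hi) - f(x_lo)) / (2 * eps))
    return grad
```

Comparing this against an analytic gradient (they should agree to ~1e-4) is the standard "gradient check" before trusting a hand-written backward pass.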
Lecture 5: Hardware backends: GPU
Tips for high performance
Lecture 6: Optimize for hardware backends
Optimizations = too many variants of operators
Neural Networks and Deep Learning by deeplearning.ai on Coursera.
Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization by deeplearning.ai on Coursera.
Structuring Machine Learning Projects by deeplearning.ai on Coursera.
Convolutional Neural Networks by deeplearning.ai on Coursera.
Sequence Models by deeplearning.ai on Coursera.
It provides a big picture of AI, and it is interesting to learn AI from a different perspective.
Week 1 (1/30):
Week 2 (2/6):
Week 3 (2/13):
01/23/19 Introduction
Conda is a packaging tool and installer that aims to do more than pip: it handles library dependencies outside of the Python packages as well as the Python packages themselves. Conda also creates virtual environments, like virtualenv does.
01/28/19 git, GitHub and testing
Usage | Git command |
---|---|
Changes between working directory and what was last staged | git diff |
Changes between staging area and last commit | git diff --staged |
Current version (most recent commit) | HEAD |
Version before current | HEAD~1 |
Changes made in the last commit | git diff HEAD~1 |
Changes made in the last 2 commits | git diff HEAD~2 |
Compact history graph of all branches | git log --oneline --decorate --all --graph |

Command | Effect |
---|---|
git reset --soft | moves HEAD to the given commit; staging area and working directory untouched |
git reset --mixed | moves HEAD and resets the staging area; working directory untouched |
git reset --hard | moves HEAD and resets both the staging area and the working directory |
Unit tests – function does the right thing.
Integration tests – system / process does the right thing.
Non-regression tests – bug got removed (and will not be reintroduced).
01/30/19 matplotlib and visualization
import matplotlib.pyplot as plt
ax = plt.gca() # get current axes
fig = plt.gcf() # get current figure
02/04/19 Introduction to supervised learning
| Naive Nearest Neighbor | Kd-tree Nearest Neighbor |
---|---|---|
fit | no time | O(p * n log n) |
memory | O(n * p) | O(n * p) |
predict | O(n * p) | O(k * log(n)) for fixed p! |

(n = n_samples, p = n_features)
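The naive predict cost is visible in the scan itself: every query touches all n samples and all p features. A toy 1-NN sketch (hypothetical helper, not sklearn's implementation):

```python
def nn_predict(X_train, y_train, x):
    # Naive 1-nearest-neighbor: scan all n training points, O(n * p) per query.
    best_i = min(
        range(len(X_train)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(X_train[i], x)),
    )
    return y_train[best_i]
```

A kd-tree buys its O(log n) query time by spending O(p * n log n) at fit time to build the tree, and the speedup degrades as p grows.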
Parametric model: Number of “parameters” (degrees of freedom) independent of data.
Non-parametric model: Degrees of freedom increase with more data.
02/06/19 Preprocessing
est.fit_transform(X) == est.fit(X).transform(X) # mostly
est.fit_predict(X) == est.fit(X).predict(X) # mostly
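A minimal sketch of why the two sides usually coincide, using a hypothetical toy transformer that follows the sklearn convention (fit returns self, and fit_transform simply chains the two calls):

```python
class CenteringScaler:
    """Toy transformer (not part of sklearn): centers each column at zero."""

    def fit(self, X):
        n = len(X)
        self.means_ = [sum(row[j] for row in X) / n for j in range(len(X[0]))]
        return self  # returning self is what makes est.fit(X).transform(X) work

    def transform(self, X):
        return [[v - m for v, m in zip(row, self.means_)] for row in X]

    def fit_transform(self, X):
        # the default chaining; estimators override this only when a fused
        # implementation is cheaper, which is why the equality is "mostly" true
        return self.fit(X).transform(X)
```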
For high cardinality categorical features, we can use target-based encoding.
Power Transformation:
\begin{equation} bc_{\lambda}(x) = \begin{cases} \frac{x^{\lambda}-1}{\lambda}, & \mbox{if }\lambda \neq 0 \newline \log(x), & \mbox{if }\lambda = 0 \end{cases} \end{equation}
Only applicable for positive x!
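The piecewise Box-Cox formula above, translated directly (hypothetical helper; scipy and sklearn's PowerTransformer provide production versions):

```python
import math

def box_cox(x, lam):
    # bc_lambda(x) = (x^lambda - 1) / lambda if lambda != 0, else log(x)
    if x <= 0:
        raise ValueError("Box-Cox is only defined for positive x")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1) / lam
```

Note the log branch is exactly the limit of the power branch as lambda goes to 0, so the transform is continuous in lambda.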
02/11/19 Linear models for Regression
Imputation Methods:
Coefficient of determination $R^2$:
$R^2(y, y') = 1 - \sum_{i=0}^{n-1}(y_i-y'_i)^2 / \sum_{i=0}^{n-1}(y_i-\bar{y})^2$
$\bar{y} = \frac{1}{n}\sum_{i=0}^{n-1}y_i$
Can be negative for biased estimators, or on the test set!
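The R² definition above, translated directly (hypothetical helper mirroring sklearn's r2_score). Predicting the mean gives exactly 0, and a predictor worse than the mean goes negative:

```python
def r2_score(y, y_pred):
    # R^2 = 1 - sum (y_i - y'_i)^2 / sum (y_i - y_bar)^2
    y_bar = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_pred))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - ss_res / ss_tot
```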
Ridge Regression:
$\min_{w \in \mathbb{R}^p, b \in \mathbb{R}} \sum_{i=1}^n |w^Tx_{i} + b - y_i|^2 + \alpha_2|w|_2^2$
Lasso Regression:
$\min_{w \in \mathbb{R}^p, b \in \mathbb{R}} \sum_{i=1}^n |w^Tx_{i} + b - y_i|^2 + \alpha_1|w|_1$
Elastic Net:
$\min_{w \in \mathbb{R}^p, b \in \mathbb{R}} \sum_{i=1}^n |w^Tx_{i} + b - y_i|^2 + \alpha_1|w|_1 + \alpha_2|w|_2^2$
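To see how the $\alpha_2$ penalty shrinks coefficients, here is the 1-D, no-intercept special case of ridge, where the minimizer has a one-line closed form (hypothetical helper, illustration only):

```python
def ridge_1d(x, y, alpha):
    # Minimize sum (w*x_i - y_i)^2 + alpha * w^2 over scalar w.
    # Setting the derivative to zero gives w = sum(x_i*y_i) / (sum(x_i^2) + alpha).
    return sum(xi * yi for xi, yi in zip(x, y)) / (sum(xi * xi for xi in x) + alpha)
```

alpha = 0 recovers ordinary least squares; increasing alpha pulls w toward 0. Lasso's L1 penalty instead produces exact zeros (sparsity), and elastic net interpolates between the two.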
02/13/19 Linear models for Classification, SVMs
Logistic Regression:
$\min_{w \in \mathbb{R}^p, b \in \mathbb{R}} \sum_{i=1}^n \log(\exp(-y_i(w^Tx_i+b))+1)$
Penalized Logistic Regression:
$\min_{w \in \mathbb{R}^p, b \in \mathbb{R}} C\sum_{i=1}^n \log(\exp(-y_i(w^Tx_i+b))+1) + |w|_2^2$
$\min_{w \in \mathbb{R}^p, b \in \mathbb{R}} C\sum_{i=1}^n \log(\exp(-y_i(w^Tx_i+b))+1) + |w|_1$
(Soft Margin) Linear SVM
$\min_{w \in \mathbb{R}^p, b \in \mathbb{R}} C\sum_{i=1}^n \max(0, 1-y_i(w^Tx_i+b)) + |w|_2^2$
$\min_{w \in \mathbb{R}^p, b \in \mathbb{R}} C\sum_{i=1}^n \max(0, 1-y_i(w^Tx_i+b)) + |w|_1$
One vs Rest | One vs One |
---|---|
nclasses classifiers | nclasses*(nclasses-1)/2 classifiers |
trained on imbalanced datasets of original size | trained on balanced subsets |
retains some uncertainty | No uncertainty propagated |
Multinomial Logistic Regression
$\min_{w \in \mathbb{R}^p, b \in \mathbb{R}} -\sum_{i=1}^n \log(p(y=y_i \mid x_i, w, b))$
$p(y=i \mid x) = \frac{e^{w^T_ix+b_i}}{\sum_{j=1}^{c}e^{w^T_jx+b_j}}$ (sum over all $c$ classes)
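The softmax above, written with the usual max-subtraction trick so large scores don't overflow exp (hypothetical helper):

```python
import math

def softmax_prob(scores, i):
    # p(y=i|x) = exp(s_i) / sum_j exp(s_j); subtracting max(scores) leaves the
    # ratio unchanged but keeps every exponent <= 0, avoiding overflow.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    return exps[i] / sum(exps)
```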
Kernel SVM: Read the book!
02/18/19 Trees, Forests & Ensembles
Trees are similar to Nearest Neighbors.
Trees and Nearest Neighbors cannot extrapolate.
Trees are unstable: small changes in the training data can produce a very different tree.
Bagging (Bootstrap AGGregation): Generic way to build “slightly different” models.
Generalization in ensembles depends on the strength of the individual classifiers and (inversely) on their correlation; decorrelating them can help, even at the expense of individual strength.
Randomize in two ways in Random Forest:
For each tree: pick bootstrap sample of data
For each split: pick random sample of features
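The two randomization steps can be sketched as (hypothetical helpers, not sklearn internals):

```python
import random

def bootstrap_sample(X, y, rng):
    # Per tree: draw n samples *with replacement* from the n training points.
    idx = [rng.randrange(len(X)) for _ in range(len(X))]
    return [X[i] for i in idx], [y[i] for i in idx]

def candidate_features(n_features, max_features, rng):
    # Per split: only a random subset of features is considered, which is
    # what decorrelates the trees beyond plain bagging.
    return rng.sample(range(n_features), max_features)
```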
02/20/19 Gradient Boosting, Calibration
Gradient Boosting Algorithm:
$f_1(x) \approx y$
$f_2(x) \approx y - \gamma f_1(x)$
$f_3(x) \approx y - \gamma f_1(x) - \gamma f_2(x)$
Learning rate: $\gamma$
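The $f_1, f_2, f_3$ recursion above can be sketched with a deliberately weak constant learner standing in for the regression trees (hypothetical helpers, illustration only):

```python
def fit_mean(X, residual):
    # Deliberately weak learner: predicts the mean of its targets
    # (a depth-0 "stump"), standing in for a small regression tree.
    m = sum(residual) / len(residual)
    return lambda x: m

def gradient_boost(X, y, n_rounds, lr):
    # Stage-wise fitting: f_1 approximates y, f_2 approximates y - lr*f_1(x),
    # f_3 approximates y - lr*f_1(x) - lr*f_2(x), and so on.
    models, residual = [], list(y)
    for _ in range(n_rounds):
        f = fit_mean(X, residual)
        models.append(f)
        residual = [r - lr * f(x) for r, x in zip(residual, X)]
    return lambda x: sum(lr * f(x) for f in models)
```

Each round the remaining residual shrinks by a factor controlled by the learning rate, so the ensemble's prediction converges toward the target.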
Gradient Boosting Advantages:
slower to train than RF, but much faster to predict
very fast using XGBoost, LightGBM, pygbm, etc.
small model size
usually more accurate than Random Forests
When to use tree-based models:
Model non-linear relationships
Doesn’t care about scaling, no need for feature engineering
Single tree: very interpretable (if small)
Random forests very robust, good benchmark
Gradient boosting often gives the best performance with careful tuning
Calibration curve
Brier Score (for binary classification): “mean squared error of probability estimate”
$BS = \sum_{i=1}^n (p(y_i)-y_i)^2 / n$ (measure both calibration and accuracy)
Calibrating a classifier:
Platt Scaling: $f_{platt} = \frac{1}{1+\exp(-ws(x)-b)}$
Isotonic Regression: Learn monotonically increasing step function in 1d.
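Platt scaling fits the $w$ and $b$ of that sigmoid by minimizing log loss on held-out classifier scores; a minimal gradient-descent sketch (hypothetical helper, not sklearn's CalibratedClassifierCV):

```python
import math

def fit_platt(scores, labels, lr=0.5, n_iter=2000):
    # Fit p = 1 / (1 + exp(-(w*s + b))) to (score, 0/1 label) pairs by
    # gradient descent on the log loss; s are uncalibrated scores s(x).
    w, b = 1.0, 0.0
    n = len(scores)
    for _ in range(n_iter):
        gw = gb = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(w * s + b)))
            gw += (p - y) * s   # d(log loss)/dw for one sample
            gb += (p - y)       # d(log loss)/db for one sample
        w -= lr * gw / n
        b -= lr * gb / n
    return w, b
```

Isotonic regression is the non-parametric alternative: instead of one sigmoid, it learns any monotonically increasing step function, so it needs more calibration data.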
02/25/19 Model Evaluation
confusion matrix:

| | predicted negative | predicted positive |
---|---|---|
negative class | TN | FP |
positive class | FN | TP |
$Accuracy = \frac{TP+TN}{TP+TN+FP+FN}$
$Precision = \frac{TP}{TP+FP}$
$Recall = \frac{TP}{TP+FN}$
$F = 2\frac{precision \times recall}{precision+recall}$
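The four formulas above, computed from the confusion-matrix counts (hypothetical helper):

```python
def classification_metrics(tp, tn, fp, fn):
    # Accuracy, precision, recall, and F1 straight from the confusion matrix.
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1
```

On imbalanced data accuracy can look good while precision/recall are poor, which is why all four are worth reporting.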
Precision-Recall curve: (x-axis Recall; y-axis Precision)
Average Precision:
$AveP = \sum_{k=1}^n P(k)\Delta r(k)$
$FPR = \frac{FP}{FP+TN}$
$TPR = \frac{TP}{TP+FN}$
ROC curve: (x-axis FPR; y-axis TPR)
ROC AUC: Area under ROC Curve (Always 0.5 for random/constant prediction)
03/04/19 Learning with Imbalanced Data
03/06/19 Model Interpretation and Feature Selection
Advanced C Programming (C程序设计进阶) by Peking University on Coursera.
Why is NLP hard?
Complexity in representing, learning and using linguistic/situational/word/visual knowledge
Human languages are ambiguous (unlike programming and other formal languages)
Human language interpretation depends on real world, common sense, and contextual knowledge
Main idea of word2vec
Two algorithms
skip-grams (SG): predict context words given target (position independent)
Continuous Bag of Words (CBOW): predict target word from bag-of-words context
Two (moderately efficient) training methods
Hierarchical softmax
Negative sampling
$J(\theta) = -\frac{1}{T}\sum_{t=1}^T \sum_{-m \le j \le m, j \neq 0} \log p(w_{t+j} \mid w_{t})$
$p(o \mid c) = \frac{\exp (u_o^T v_c)}{\sum_{w=1}^V \exp (u_w^T v_c)}$
Negative sampling: train binary logistic regressions for a true pair (center word and word in its context window) versus a couple of noise pairs (the center word paired with a random word)
The skip-gram model with negative sampling
$J(\theta) = \frac{1}{T} \sum_{t=1}^T J_t(\theta)$
$J_t(\theta) = \log \sigma(u_o^Tv_c) + \sum_{i=1}^{k} \mathbb{E}_{j \sim P(w)} [\log \sigma(-u_j^Tv_c)]$ ($k$ negative samples drawn from the noise distribution $P(w)$): maximize the probability that the real outside word appears, minimize the probability that random words appear around the center word
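The per-window objective $J_t$, evaluated directly (hypothetical helper; real training would also need the gradients with respect to $u$ and $v$):

```python
import math

def sgns_objective(u_o, v_c, negative_us):
    # J_t = log sigma(u_o . v_c) + sum_j log sigma(-u_j . v_c):
    # reward the true outside word's vector for aligning with the center
    # vector, penalize alignment with the sampled noise words.
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    sigma = lambda z: 1.0 / (1.0 + math.exp(-z))
    return math.log(sigma(dot(u_o, v_c))) + sum(
        math.log(sigma(-dot(u, v_c))) for u in negative_us
    )
```

The objective is larger when the true outside vector aligns with the center vector than when it points away, which is exactly the direction SGD pushes the embeddings.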
GloVe: $J(\theta) = \frac{1}{2} \sum_{i,j=1}^W f(P_{ij})(u_i^Tv_j - \log P_{ij})^2$ Count-based + Direct prediction method
If you only have a small training dataset, don’t train the word vectors.
If you have a very large dataset, it may work better to train word vectors to the task.
The max-margin loss: $J = \max (0, 1-s+s_c)$
Idea for training objective: make score of true window larger and corrupt window’s score lower (until they’re good enough)
Chain rule, Nothing fancy!
TensorFlow = Tensor + Flow