# Applied Machine Learning


COMS W4995 Applied Machine Learning by Andreas C. Müller at Columbia University.

01/23/19 Introduction

Conda is a packaging tool and installer that aims to do more than pip: it handles library dependencies outside of Python packages as well as the Python packages themselves. Conda also creates virtual environments, as virtualenv does.

01/28/19 git, GitHub and testing

Usage | Git command |
---|---|
Changes between working directory and what was last staged | git diff |
Changes between staging area and last commit | git diff --staged |
Current version (most recent commit) | HEAD |
Version before current | HEAD~1 |
Changes made in the last commit | git diff HEAD~1 |
Changes made in the last 2 commits | git diff HEAD~2 |
Compact history graph of all branches | git log --oneline --decorate --all --graph |
Move HEAD only; staging area and working directory untouched | git reset --soft |
Move HEAD and reset the staging area (the default) | git reset --mixed |
Move HEAD, reset the staging area and the working directory | git reset --hard |

Unit tests – function does the right thing.

Integration tests – system / process does the right thing.

Non-regression tests – bug got removed (and will not be reintroduced).
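A minimal pytest-style sketch of a unit test and a non-regression test; the `scale` function and its old division-by-zero bug are hypothetical:

```python
# test_scale.py -- run with `pytest test_scale.py`
def scale(values):
    """Rescale a list of numbers to [0, 1] (hypothetical example function)."""
    lo, hi = min(values), max(values)
    if lo == hi:                 # guard added after a division-by-zero bug
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def test_scale_unit():
    # unit test: the function does the right thing
    assert scale([0, 5, 10]) == [0.0, 0.5, 1.0]

def test_scale_constant_non_regression():
    # non-regression test: the constant-input bug stays fixed
    assert scale([3, 3, 3]) == [0.0, 0.0, 0.0]
```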

01/30/19 matplotlib and visualization

`import matplotlib.pyplot as plt`

`ax = plt.gca() # get current axes`

`fig = plt.gcf() # get current figure`
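A small sketch of the object-oriented interface these functions give access to; creating the figure and axes explicitly is usually cleaner:

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)
fig, ax = plt.subplots()          # create figure and axes explicitly
ax.plot(x, np.sin(x), label="sin")
ax.set_xlabel("x")
ax.set_ylabel("sin(x)")
ax.legend()
fig.savefig("sine.png")           # fig is what plt.gcf() would return here
```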

02/04/19 Introduction to supervised learning

Naive Nearest Neighbor | kd-tree Nearest Neighbor |
---|---|
fit: no time | fit: O(p * n log n) |
memory: O(n * p) | memory: O(n * p) |
predict: O(n * p) | predict: O(k * log(n)) for fixed p! |

(n = n_samples, p = n_features)

Parametric model: Number of “parameters” (degrees of freedom) independent of data.

Non-parametric model: Degrees of freedom increase with more data.
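kNN is the canonical non-parametric model: it memorizes the training set. A sketch of the two search strategies above via scikit-learn's `algorithm` parameter:

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
# algorithm='brute' is the naive O(n * p) search; 'kd_tree' pays
# O(p * n log n) at fit time to speed up prediction for small p.
knn = KNeighborsClassifier(n_neighbors=5, algorithm="kd_tree")
knn.fit(X, y)
print(knn.predict(X[:3]))
```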

02/06/19 Preprocessing

`est.fit_transform(X) == est.fit(X).transform(X) # mostly`

`est.fit_predict(X) == est.fit(X).predict(X) # mostly`
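For example, with `StandardScaler` the shortcut and the two-step version agree:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
scaler = StandardScaler()
# the one-step shortcut and the two-step version give the same result here
assert np.allclose(scaler.fit_transform(X), scaler.fit(X).transform(X))
```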

For high cardinality categorical features, we can use target-based encoding.
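A minimal sketch of the idea with pandas; in practice, compute the per-category means on training folds only to avoid target leakage:

```python
import pandas as pd

df = pd.DataFrame({"city": ["NY", "NY", "SF", "SF", "SF", "LA"],
                   "y":    [1,    0,    1,    1,    0,    1]})
# replace each category by the mean of the target within that category
means = df.groupby("city")["y"].mean()
df["city_encoded"] = df["city"].map(means)
print(df)
```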

Power Transformation:

\begin{equation} bc_{\lambda}(x) = \begin{cases} \frac{x^{\lambda}-1}{\lambda}, & \mbox{if } \lambda \neq 0 \\ \log(x), & \mbox{if } \lambda = 0 \end{cases} \end{equation}

Only applicable for positive x!
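In scikit-learn this is `PowerTransformer`; `method='box-cox'` enforces the positivity requirement (`method='yeo-johnson'` also handles non-positive values):

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.RandomState(0)
X = rng.lognormal(size=(100, 1))           # skewed, strictly positive data
pt = PowerTransformer(method="box-cox")    # lambda estimated by maximum likelihood
X_bc = pt.fit_transform(X)
print(pt.lambdas_)                         # fitted lambda per feature
```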

02/11/19 Linear models for Regression

Imputation Methods (the first two are sketched below):

- Mean/Median
- kNN
- Regression models
- Matrix factorization
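A minimal sketch of mean and kNN imputation with scikit-learn (its `IterativeImputer` covers the regression-model approach):

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan], [4.0, 5.0]])
print(SimpleImputer(strategy="mean").fit_transform(X))  # fill with column means
print(KNNImputer(n_neighbors=2).fit_transform(X))       # mean of 2 nearest rows
```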

Coefficient of determination $R^2$:

$R^2(y, y') = 1 - \frac{\sum_{i=0}^{n-1}(y_i - y'_i)^2}{\sum_{i=0}^{n-1}(y_i - \bar{y})^2}$

$\bar{y} = \frac{1}{n}\sum_{i=0}^{n-1}y_i$

Can be negative for biased estimators, or on the test set!

Ridge Regression:

$\min_{w \in \mathbb{R}^p, b \in \mathbb{R}} \sum_{i=1}^n |w^Tx_{i} + b - y_i|^2 + \alpha_2|w|_2^2$

Lasso Regression:

$\min_{w \in \mathbb{R}^p, b \in \mathbb{R}} \sum_{i=1}^n |w^Tx_{i} + b - y_i|^2 + \alpha_1|w|_1$

Elastic Net:

$\min_{w \in \mathbb{R}^p, b \in \mathbb{R}} \sum_{i=1}^n |w^Tx_{i} + b - y_i|^2 + \alpha_1|w|_1 + \alpha_2|w|_2^2$
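All three penalties are available in scikit-learn; note that `ElasticNet` is parameterized by a total `alpha` and an `l1_ratio` rather than separate $\alpha_1, \alpha_2$:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso, ElasticNet

X, y = make_regression(n_samples=100, n_features=20, noise=10, random_state=0)
for model in (Ridge(alpha=1.0),                       # L2: shrinks coefficients
              Lasso(alpha=1.0),                       # L1: sparse coefficients
              ElasticNet(alpha=1.0, l1_ratio=0.5)):   # mix of both
    model.fit(X, y)
    print(type(model).__name__, model.score(X, y))    # R^2 on training data
```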

02/13/19 Linear models for Classification, SVMs

Logistic Regression:

$\min_{w \in \mathbb{R}^p, b \in \mathbb{R}} \sum_{i=1}^n \log(\exp(-y_i(w^Tx_i+b))+1)$

Penalized Logistic Regression:

$\min_{w \in \mathbb{R}^p, b \in \mathbb{R}} C\sum_{i=1}^n \log(\exp(-y_i(w^Tx_i+b))+1) + |w|_2^2$

$\min_{w \in \mathbb{R}^p, b \in \mathbb{R}} C\sum_{i=1}^n \log(\exp(-y_i(w^Tx_i+b))+1) + |w|_1$

(Soft Margin) Linear SVM:

$\min_{w \in \mathbb{R}^p, b \in \mathbb{R}} C\sum_{i=1}^n \max(0, 1-y_i(w^Tx_i+b)) + |w|_2^2$

$\min_{w \in \mathbb{R}^p, b \in \mathbb{R}} C\sum_{i=1}^n \max(0, 1-y_i(w^Tx_i+b)) + |w|_1$
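Both penalized objectives match scikit-learn's parameterization, where `C` scales the data term, so smaller `C` means stronger regularization:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=200, random_state=0)
# C multiplies the data-fit term: large C -> weak regularization
logreg = LogisticRegression(C=1.0, penalty="l2").fit(X, y)  # log loss
svm = LinearSVC(C=1.0).fit(X, y)                            # hinge loss
print(logreg.score(X, y), svm.score(X, y))
```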

One vs Rest | One vs One |
---|---|
n_classes classifiers | n_classes * (n_classes - 1) / 2 classifiers |
trained on imbalanced datasets of original size | trained on balanced subsets |
retains some uncertainty | no uncertainty propagated |

Multinomial Logistic Regression:

$\min_{w \in \mathbb{R}^p, b \in \mathbb{R}} -\sum_{i=1}^n \log(p(y=y_i \mid x_i, w, b))$

$p(y=i \mid x) = \frac{e^{w^T_ix+b_i}}{\sum_{j=1}^{k}e^{w^T_jx+b_j}}$ (sum over the $k$ classes)
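This class-probability formula is the softmax; a minimal numpy version over one sample's class scores:

```python
import numpy as np

def softmax(z):
    # shift by the max for numerical stability; probabilities are unchanged
    e = np.exp(z - z.max())
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])  # w_i^T x + b_i for each of 3 classes
print(softmax(scores))              # class probabilities, summing to 1
```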

Kernel SVM: Read the book!

02/18/19 Trees, Forests & Ensembles

Trees are similar to Nearest Neighbors.

Trees and Nearest Neighbors cannot extrapolate.

Trees are not stable.

Bagging (Bootstrap AGGregation): Generic way to build “slightly different” models.

Generalization in ensembles depends on the strength of the individual classifiers and (inversely) on their correlation; decorrelating them can help, even at the expense of individual strength.

Random Forests randomize in two ways (both map onto scikit-learn parameters, as sketched below):

- For each tree: pick a bootstrap sample of the data
- For each split: pick a random sample of the features
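A sketch of the corresponding `RandomForestClassifier` settings:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
rf = RandomForestClassifier(
    n_estimators=100,       # number of trees in the ensemble
    bootstrap=True,         # bootstrap sample of the data per tree
    max_features="sqrt",    # random feature subset at each split
    random_state=0,
).fit(X, y)
print(rf.score(X, y))
```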

02/20/19 Gradient Boosting, Calibration

Gradient Boosting Algorithm:

$f_1(x) \approx y$

$f_2(x) \approx y - \gamma f_1(x)$

$f_3(x) \approx y - \gamma f_1(x) - \gamma f_2(x)$

Learning rate: $\gamma$
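A minimal sketch of exactly this recurrence, using shallow regression trees as base learners (100 stages and $\gamma = 0.1$ are arbitrary choices here):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

gamma, trees, residual = 0.1, [], y.copy()
for _ in range(100):
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    trees.append(tree)
    residual -= gamma * tree.predict(X)   # each new tree fits what is left over

pred = gamma * sum(tree.predict(X) for tree in trees)
print(np.mean((y - pred) ** 2))           # training MSE shrinks with more stages
```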

Gradient Boosting Advantages:

- slower to train than random forests, but much faster to predict
- very fast implementations: XGBoost, LightGBM, pygbm, …
- small model size
- usually more accurate than random forests

When to use tree-based models:

- to model non-linear relationships
- they don't care about scaling, and need no feature engineering
- single tree: very interpretable (if small)
- random forests: very robust, good benchmark
- gradient boosting: often the best performance, with careful tuning

Calibration curve: bin the predicted probabilities, then plot the actual fraction of positives in each bin against the mean predicted probability.

Brier Score (for binary classification): “mean squared error of probability estimate”

$BS = \sum_{i=1}^n (p(y_i)-y_i)^2 / n$ (measures both calibration and accuracy)

Calibrating a classifier:

Platt Scaling: $f_{platt} = \frac{1}{1+\exp(-ws(x)-b)}$, where $s(x)$ is the classifier's score.

Isotonic Regression: Learn monotonically increasing step function in 1d.
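Both methods are available through scikit-learn's `CalibratedClassifierCV` (`method='sigmoid'` is Platt scaling); a sketch that also scores the result with the Brier score:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# method='sigmoid' is Platt scaling; method='isotonic' is isotonic regression
calibrated = CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=5)
calibrated.fit(X_train, y_train)
proba = calibrated.predict_proba(X_test)[:, 1]
print(brier_score_loss(y_test, proba))   # lower is better
```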

02/25/19 Model Evaluation

confusion matrix:

| | predicted negative | predicted positive |
|---|---|---|
| negative class | TN | FP |
| positive class | FN | TP |

$Accuracy = \frac{TP+TN}{TP+TN+FP+FN}$

$Precision = \frac{TP}{TP+FP}$

$Recall = \frac{TP}{TP+FN}$

$F = 2\frac{precision \times recall}{precision+recall}$

Precision-Recall curve: (x-axis Recall; y-axis Precision)

Average Precision:

$AveP = \sum_{k=1}^n P(k)\,\Delta r(k)$

$FPR = \frac{FP}{FP+TN}$

$TPR = \frac{TP}{TP+FN}$

ROC curve: (x-axis FPR; y-axis TPR)

ROC AUC: Area under ROC Curve (Always 0.5 for random/constant prediction)
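A sketch computing these metrics with scikit-learn on a made-up imbalanced problem:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (average_precision_score, confusion_matrix,
                             f1_score, roc_auc_score)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression().fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]         # probability of positive class
print(confusion_matrix(y_test, clf.predict(X_test)))  # [[TN, FP], [FN, TP]]
print(f1_score(y_test, clf.predict(X_test)))
print(average_precision_score(y_test, scores))   # summarizes the PR curve
print(roc_auc_score(y_test, scores))             # area under the ROC curve
```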

03/04/19 Learning with Imbalanced Data

03/06/19 Model Interpretation and Feature Selection