Part 3 - AI Tools | rezarezvan.com

DAT410_3

Introduction

When dealing with the machine learning life cycle, we will encounter,

Data collection (Collect user ratings).
Data representation & Processing (Preprocess files, movie_reviews.csv).
Modelling (Collaborative filtering, Content-based filtering).
Learning/Optimization (Minimizing empirical risk of predicting ratings).
Evaluation (Online A/B testing ¹).
Deployment (git add . && git commit -m "LGTM" && git push origin master -f).

At each of these steps we use different tools, but first let’s define what an AI tool is.

What is an AI tool?

When we talk about AI tools, we do not mean AI-assisted tools, but rather tools that help us implement AI systems.

The specifics depend (obviously) on the nature of the system.

We will focus on (statistical) machine learning for prototyping in Python.

Data Representation & Processing

In typical machine learning systems, our data is stored in matrix and/or vector form, e.g., a classical table of features $X$ and labels $Y$.

However, more general data usually does not have a natural tabular representation. Think of time series, which may span different time scales, have different lengths, and contain different possible events.

Image data is also not easily stored in (2D) matrices; instead we work with tensors. We usually just think of them as 3D (or higher) arrays, where each slice along the depth dimension represents a different color channel.

Graphs are used to represent relations of people, atoms, cities, etc. These can be represented as adjacency matrices.

However (again), despite the different nature of text, images, etc., we can still (usually) store them in matrix and vector form.
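As a minimal sketch of these representations, the snippet below builds an adjacency matrix for a small graph and a tiny image tensor. The names and values are made up for illustration, and plain nested lists are used only to keep the example dependency-free; in practice an array library would be the more typical choice.

```python
# Undirected friendship graph on four people (hypothetical example data).
people = ["Ann", "Bob", "Cat", "Dan"]
edges = [("Ann", "Bob"), ("Bob", "Cat"), ("Cat", "Dan")]

index = {name: i for i, name in enumerate(people)}
n = len(people)

# adjacency[i][j] == 1 means person i and person j are connected.
adjacency = [[0] * n for _ in range(n)]
for a, b in edges:
    adjacency[index[a]][index[b]] = 1
    adjacency[index[b]][index[a]] = 1  # symmetric: the graph is undirected

# A tiny 2x3 RGB image as a height x width x channel tensor.
height, width, channels = 2, 3, 3
image = [[[0] * channels for _ in range(width)] for _ in range(height)]
image[0][1] = [255, 0, 0]  # make the pixel at row 0, column 1 pure red

# Slicing out one channel gives back an ordinary 2D matrix.
red_channel = [[pixel[0] for pixel in row] for row in image]
```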

But why matrices?

Because (almost) all of machine learning relies on linear algebra,

Linear Regression $y = \beta^T x + c$.
Deep neural networks $h_{l + 1} = \sigma(W_{l + 1} h_l + b_{l + 1})$.
Linear Programming $$\begin{aligned} \min_{x} & \quad a^T x + b \newline \text{subject to } & \quad Bx + c \leq 0 \end{aligned}$$.

Matrix and vector operations are fast (enough) and well-studied.

Drawbacks & Data Frames

A major drawback of pure matrix and tensor representations is that columns and rows are anonymous. Sometimes we want to know what each row or column represents (explicitly).

Data frames add indices and names, much like a spreadsheet.

Missing Values

Say that we want to predict cardiovascular risk for patients who visit their general practitioner.

We look at electronic health records (EHR) data containing age, sex, weight and cholesterol levels $X$ and the corresponding cardiovascular risk $Y$.

When training our machine learning model, inputs are often assumed to be fully observed, but in reality, it is very common for some features to be missing.

Let $\tilde{X}$ be the values that are actually observed (a subset of the entries of the full feature matrix $X$).

We can represent missing values by a missingness mask $M$.

Thus, we can do,

$$ \tilde{X}_{ij} = \begin{cases} X_{ij} & \text{if } M_{ij} = 0 \newline \texttt{nan} & \text{otherwise} \end{cases} $$
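The case distinction above can be sketched directly in code. The feature values and the mask below are invented for illustration, and the convention that 1 means "missing" follows the definition of $M$ used here.

```python
import math

# Fully observed features X (rows: patients; columns: age, weight, cholesterol).
# The numbers are made up for illustration.
X = [
    [54.0, 81.0, 5.2],
    [61.0, 77.5, 6.1],
    [47.0, 92.0, 4.8],
]

# Missingness mask M: 1 means "missing", 0 means "observed".
M = [
    [0, 0, 1],
    [0, 1, 0],
    [0, 0, 0],
]

# Apply the case distinction entry by entry.
X_tilde = [
    [x if m == 0 else math.nan for x, m in zip(x_row, m_row)]
    for x_row, m_row in zip(X, M)
]

# The mask can be recovered from X_tilde alone, since nan marks missing entries.
M_recovered = [[int(math.isnan(v)) for v in row] for row in X_tilde]
```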

Now, suppose we want to learn to predict $Y$ from our tabular features $X$, what can we do with our missing values in $X$?

There are two common solutions: imputation and informative missingness.

Imputation

Attempt to impute missing values,

Reconstruct $\hat{X} \approx X$ from $\tilde{X}$.
Predict $Y$ from reconstructed $\hat{X}$.

Any method that works (well) for $X$ works (well) for our reconstructed $\hat{X}$.

Informative Missingness

Make use of missingness itself,

Predict $Y$ from both $\tilde{X}$ and $M$.

There are two common ways to do this,

Method sensitive to $M$ (e.g., XGBoost ²).
Simple imputation + missing indicators.

Single vs. Multiple Imputation

The basic idea of imputation is to predict the missing value from observed values of other variables. In the simplest case, we use a single imputation of each value,

$$ f(\text{Age}, \text{Sex}, \text{Weight}) = \text{Cholesterol} $$

The imputation functions can be learned by regression on complete observations or observations with less missingness,

$$ f^{\star} = \arg \min_{f} \mathbb{E} \left[ \left( f(\text{Age}, \text{Sex}, \text{Weight}) - \text{Cholesterol} \right)^2 \right] $$
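A minimal sketch of single imputation by regression on complete observations follows. To stay dependency-free it uses only one predictor (weight) rather than the three in the formula above, and both the data and the helper `fit_line` are hypothetical.

```python
import math

def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# (weight, cholesterol) pairs; nan marks a missing cholesterol value.
weight = [60.0, 70.0, 80.0, 90.0, 75.0]
cholesterol = [4.0, 4.5, 5.0, 5.5, math.nan]

# Learn f on the complete observations only.
complete = [(w, c) for w, c in zip(weight, cholesterol) if not math.isnan(c)]
slope, intercept = fit_line([w for w, _ in complete], [c for _, c in complete])

# Impute each missing value with the prediction f(weight).
imputed = [
    slope * w + intercept if math.isnan(c) else c
    for w, c in zip(weight, cholesterol)
]
```

Observed values are left untouched; only the entry marked `nan` is replaced by the regression prediction.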

This becomes tricky when all observations have some missing values.

A more popular method for imputation is Multiple Imputation by Chained Equations, or MICE.

Multiple Imputation: Create more than one sample of each missing value to account for variance.

Chained Equations: Impute one value based on imputations of other variables; this works even without any complete observations.

Give each variable a placeholder imputed value (e.g., mean)
Repeat for a number of iterations
$\quad$ For each variable $v$:
$\quad$ $\quad$ Regress observed $v$ on other variables in the dataset (including other imputed values)
$\quad$ $\quad$ Impute missing $v$ using regression.
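The loop above can be sketched for the simplest case of two variables. This is an illustration of the chained part only: it runs a single deterministic chain, whereas multiple imputation would repeat the procedure with random draws to obtain several samples. The data and all function names are hypothetical.

```python
import math

def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def observed_mean(values):
    observed = [v for v in values if not math.isnan(v)]
    return sum(observed) / len(observed)

def regress_and_impute(target, other, missing):
    """Fit target ~ other on rows where target is observed, then fill the rest."""
    slope, intercept = fit_line(
        [o for o, m in zip(other, missing) if not m],
        [t for t, m in zip(target, missing) if not m],
    )
    return [slope * o + intercept if m else t
            for t, o, m in zip(target, other, missing)]

def chained_imputation(a, b, n_iter=20):
    miss_a = [math.isnan(v) for v in a]
    miss_b = [math.isnan(v) for v in b]
    # Placeholder imputation: replace each missing value by the observed mean.
    mean_a, mean_b = observed_mean(a), observed_mean(b)
    a = [mean_a if m else v for v, m in zip(a, miss_a)]
    b = [mean_b if m else v for v, m in zip(b, miss_b)]
    for _ in range(n_iter):
        a = regress_and_impute(a, b, miss_a)  # uses current imputations of b
        b = regress_and_impute(b, a, miss_b)  # uses updated imputations of a
    return a, b

# Toy data following b = 2 * a + 1 exactly, with one gap per variable.
a_raw = [1.0, 2.0, 3.0, 4.0, math.nan, 6.0]
b_raw = [3.0, 5.0, math.nan, 9.0, 11.0, 13.0]
a_hat, b_hat = chained_imputation(a_raw, b_raw)
```

Because the toy data is noise-free and linear, the imputed entries approach the values implied by the relationship; on real data the result depends on how well the regressions fit.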

However, imputation methods may fail when the data is not missing at random, i.e., when the missing at random (MAR) assumption is violated and missing values cannot be reliably imputed from the observed ones.

A common method in this case is to concatenate a simple imputation $\hat{X}$ (e.g., filling with 0) with $M$, used as binary indicators, and predict from both.
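A minimal sketch of this construction, assuming missing entries are marked with `nan`, is shown below; the function name `impute_with_indicators` and the example rows are hypothetical.

```python
import math

def impute_with_indicators(X_tilde, fill_value=0.0):
    """Return rows of the form [imputed features..., missing indicators...]."""
    rows = []
    for row in X_tilde:
        mask = [1.0 if math.isnan(v) else 0.0 for v in row]
        filled = [fill_value if math.isnan(v) else v for v in row]
        rows.append(filled + mask)
    return rows

# Two patients with (age, cholesterol); nan marks a missing entry.
X_tilde = [
    [54.0, math.nan],
    [math.nan, 6.1],
]
features = impute_with_indicators(X_tilde)
```

A downstream model trained on `features` can then treat the indicator columns as ordinary inputs and learn from the missingness pattern itself.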

When do we use which method?

Will data be missing at test/use time?
- → If so, imputation is not necessary to minimize test error.
Are you fitting a parametric model (e.g., linear regression)?
- → If so, imputation becomes important to recover parameters.
Is missingness predictable from observed values (MAR)?
- → If not, imputation will lead to biased results.

Model Development

A major development pattern in machine learning systems is the fit, predict, score pattern.

This standardizes the common machine learning workflow,

fit(x, y): Train model to e.g., predict $y$ from $x$.
predict(x): Predict $y$ for $x$.
score(x, y): Evaluate model on data $x, y$.

`fit(x, y)`

The function fit(x, y) is responsible for training and storing model parameters that maximize (or minimize) some objective function.

For example, finding the optimal coefficients in OLS, or simply storing the training data in k-NN.

Our arguments $x, y$ may vary depending on our application. In unsupervised learning there is no $y$, but there are still parameters to fit.

`predict(x)`

The function predict(x) should take a (new) data point $x$ and predict the corresponding outcome (e.g., label/cluster) $y$ for it.

`score(x, y)`

The function score(x, y) assigns a score to the prediction made for $x$ in comparison to the label $y$.

score(x, y) can also be used for hyperparameter selection, for example using cross-validation.
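The pattern can be sketched with a small hand-written model. The class below is a hypothetical one-dimensional least-squares regressor, not any particular library's implementation, and using $R^2$ as the score is an assumption made for this example.

```python
class SimpleLinearRegression:
    """1D least-squares model exposing fit, predict and score."""

    def fit(self, x, y):
        n = len(x)
        mean_x, mean_y = sum(x) / n, sum(y) / n
        sxx = sum((xi - mean_x) ** 2 for xi in x)
        sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
        # Store the learned parameters on the object for later use.
        self.slope_ = sxy / sxx
        self.intercept_ = mean_y - self.slope_ * mean_x
        return self

    def predict(self, x):
        return [self.slope_ * xi + self.intercept_ for xi in x]

    def score(self, x, y):
        """Coefficient of determination R^2 (1.0 is a perfect fit)."""
        predictions = self.predict(x)
        mean_y = sum(y) / len(y)
        ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, predictions))
        ss_tot = sum((yi - mean_y) ** 2 for yi in y)
        return 1.0 - ss_res / ss_tot


# Toy data following y = 2x + 1.
x_train, y_train = [0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0]
x_test, y_test = [4.0, 5.0], [9.0, 11.0]

model = SimpleLinearRegression().fit(x_train, y_train)
test_predictions = model.predict(x_test)
test_score = model.score(x_test, y_test)
```

Keeping the three methods separate means the same object can be trained once and then evaluated on any held-out split, which is what makes the pattern convenient for cross-validation.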

Data Preprocessing

All algorithms are sensitive to the representation of the input data.

A (trivial) example is the coefficients of OLS, which depend on the scale of the covariates,

$$ y = \beta^T(ax) + c = (a \beta)^T x + c. $$

A common step is to standardize features, i.e., to make them have zero mean and unit variance,

$$ X^{(i)} \leftarrow \frac{X^{(i)} - \mu(X)}{\sigma(X)}. $$

Formally, $\mu$ and $\sigma$ are parameters: they are functions of our training data, so the values estimated there should be reused when transforming new data.
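A minimal sketch of standardization for a single feature column is given below. The helper names `fit_scaler` and `transform` and the data are hypothetical; the point is that $\mu$ and $\sigma$ are estimated on the training column only.

```python
import statistics

def fit_scaler(column):
    """Estimate mu and sigma from the training data only."""
    return statistics.fmean(column), statistics.pstdev(column)

def transform(column, mu, sigma):
    """Standardize a column using previously estimated parameters."""
    return [(v - mu) / sigma for v in column]

train_weight = [60.0, 70.0, 80.0, 90.0]
test_weight = [75.0, 100.0]

mu, sigma = fit_scaler(train_weight)
train_scaled = transform(train_weight, mu, sigma)
test_scaled = transform(test_weight, mu, sigma)  # reuse training mu and sigma
```

The scaled training column has zero mean and unit variance by construction; the test column generally does not, because it is transformed with the training parameters.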

Differentiable Systems

When we fit a machine learning algorithm on a training dataset, we are solving an optimization problem.

Empirical risk minimization (ERM) is the most common optimization problem in machine learning,

$$ \underset{\theta}{\min} \frac{1}{N} \sum_{i = 1}^{N} L(f_{\theta}(x^{(i)}), y^{(i)}). $$

Here, $L$ is the loss function, $f_{\theta}$ is the model and $\theta$ are the parameters (to be optimized).
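To make the objective concrete, here is a small sketch that evaluates the empirical risk for a scalar linear model under squared loss and picks the best parameter from a handful of candidates. The grid search stands in for a real optimizer purely for illustration, and all names and data are hypothetical.

```python
def empirical_risk(model, loss, xs, ys):
    """Average loss of the model over the dataset (the ERM objective)."""
    return sum(loss(model(x), y) for x, y in zip(xs, ys)) / len(xs)

def squared_loss(prediction, target):
    return (prediction - target) ** 2

def make_linear_model(theta):
    """Return f_theta(x) = theta * x for a scalar parameter theta."""
    return lambda x: theta * x

xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]

# Evaluate the risk for a few candidate parameters and keep the best one.
candidates = [0.0, 1.0, 2.0, 3.0]
risks = {
    theta: empirical_risk(make_linear_model(theta), squared_loss, xs, ys)
    for theta in candidates
}
best_theta = min(risks, key=risks.get)
```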

Deep learning and many other AI tools are based on ERM in differentiable systems, i.e., systems where the objective function is differentiable in the parameters $\theta$. This means that learning using gradient descent is possible.

Gradient Descent

Consider the linear model $f(x) = \theta^T x$ for a 1D label $y$.

We can measure our error using the mean squared error (MSE),

$$ \hat{R}(\theta) = \frac{1}{N} \sum_{i = 1}^{N} (\theta^T x^{(i)} - y^{(i)})^2. $$

How do we find the parameters $\theta^{\star}$ that minimize $\hat{R}(\theta)$?

We can use gradient descent to find a (local) minimum. Gradient descent says to move in the direction of $-\nabla \hat{R}(\theta)$,

$$ \theta_k \leftarrow \theta_{k - 1} - \eta \nabla_{\theta}(\hat{R}), $$

with step size $\eta$.

Here the gradient $\nabla_{\theta}(\hat{R})$ is,

$$ \nabla_{\theta}(\hat{R}) = \frac{2}{N} \sum_{i = 1}^{N} x^{(i)} (\theta^T x^{(i)} - y^{(i)}). $$

Note that gradient descent has only one non-trivial operation: computing the gradient itself.

As long as $\hat{R}$ is a composition of differentiable functions of $\theta$, differentiation is easy: the chain rule gives the gradient of the whole composition. This is what we exploit in modern machine learning (backpropagation).
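The update rule and the MSE gradient from this section can be sketched in plain Python as follows. The step size, the number of steps, and the toy data are assumptions chosen so that this small example converges; they are not general recommendations.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def mse(theta, X, y):
    """Empirical risk R(theta) under squared error."""
    return sum((dot(theta, xi) - yi) ** 2 for xi, yi in zip(X, y)) / len(X)

def mse_gradient(theta, X, y):
    """Gradient (2/N) * sum_i x_i * (theta^T x_i - y_i)."""
    n = len(X)
    grad = [0.0] * len(theta)
    for xi, yi in zip(X, y):
        residual = dot(theta, xi) - yi
        for j, xij in enumerate(xi):
            grad[j] += 2.0 / n * xij * residual
    return grad

def gradient_descent(X, y, eta=0.05, n_steps=2000):
    theta = [0.0] * len(X[0])
    for _ in range(n_steps):
        grad = mse_gradient(theta, X, y)
        # Move against the gradient with step size eta.
        theta = [t - eta * g for t, g in zip(theta, grad)]
    return theta

# Each x has a constant 1 appended so that theta also learns an intercept.
# The labels follow y = 2 * x + 1 exactly.
X = [[0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [3.0, 1.0]]
y = [1.0, 3.0, 5.0, 7.0]
theta_star = gradient_descent(X, y)
```

Here the gradient is written out by hand; in a differentiable-programming framework the same quantity would typically be obtained automatically.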
