Introduction
In this part weāll cover some very central and fundamental concepts and theorems from statistics that are used.
The statistic paradigm
The bread and butter of statistics is inferring the properties of the population from a random sample.
Properties of the random sample are determined by the underlying properties of the population and probability. The random sample can be described in terms of descriptive statistics, such as summary statistics e.g. count, mean, sample variance, order statistics
If the population is very large, we can treat the sample as being drawn from a probability distribution.
Fundamentals of probability
Experiments yield exactly one of the possible outcomes.
The set of all possible outcomes is called the sample space, which we denote as $S$.
An event $E \subseteq S$ is a subset of outcomes.
If the outcome $s$ of the experiment is in the event, that is $s \in E$, we say the event occurs.
The probability distribution $p : S \to [0,1]$ is a function that maps each outcome $s \in S$ to a probability, a number in the interval $[0,1]$, satisfying: $$ \sum_{s \in S} p(s) = 1 $$
Probabilities of the sample space sum up to one, every experiment has an outcome, and the outcome has to be from the sample space The probability of an event is the sum of probabilities of outcomes in the event: $$ P(E) = \sum_{s \in E} p(s) $$
The complement $\bar{E}$ of an event $E$ is the set of all outcomes that are not included in $E$, that is $\bar{E} = S \ \backslash \ E$.
The probability of the complement event satisfies $p(\bar{E}) = 1 - p(E)$.
Random variable $X$ is a function that maps outcomes to numerical values. This means that the probability of random variable receiving a value $x$: $$ P(X = x) = P(\{s \in S | X(s) = x\}) $$
The expected value of a random variable $X$ is: $$ E[X] = \sum_{s \to S} p(s)V(s) \ | \ \text{where } V(s) \text{ is the value of outcome} $$
Events are sets and there exists a universe $S$, we can apply set operations on events, such as
- Intersection $A \cap B = A\ \backslash \ (S\ \backslash\ B) \ | $ meaning the event that both $A$ and $B$ occur.
- Union $A \cup B$ meaning that either $A$ or $B$ or both occur.
If $A$ and $B$ satisfy $P(A \cap B) = P(A)P(B)$, we say they are independent. This means that knowing the event $A$ occurs gives no information about whether $B$ occurs and vice versa.
The conditional probability of $A$ given $B$, denoted $P(A | B)$ is defined as: $$ P(A | B) = \frac{P(A \cap B)}{P(B)} $$
General multiplication rule
It directly follows from the definition of conditional probability that: $$ P(A \cap B) = P(A | B)P(B) = P(B | A)P(A) $$
For $A$ and $B$ to occur simultaneously, $B$ has to occur, and then $A$ has to occur given $B$.
Bayesā theorem
Conditional probability can be reversed; since we already observed that $$ P(A \cap B) = P(A | B)P(B) = P(B | A)P(A) $$
Solving for $P(B | A)$ we get: $$ P(B | A) = \frac{P(A | B)P(B)}{P(A)} $$
Random variables and distributions
The distribution of a random variable $X$ is most importantly characterized by its expected value $E[x]$ and variance $Var[X] = E[X - E[X]^2]$ The expected value tells us about what kind of value is the most expected. Variance tells us about how the values are spread about the expectation.
If the codomain of the RV (the āvaluesā) is finite or countably infinite, we say it is discrete. If the codomain of the RV is uncountable, like $\mathbb{R}$, we say it is continuous.
Discrete random variables
Discrete random variables take each value with a well-defined probability.
The function assigning this probability is called the probability mass function (PMF) $p\ ā¶\ S \to [0,1]$ The expectation satisfies $E[X] = \sum_{x \in S} xp(x)$
Continuous random variables
Continuous random variables take each individual value with probability zero.
Probability only becomes meaningful when an interval of values is considered The probability of a single point is characterized by the probability density function (PDF) $f\ :\ S \to [0,1]$ Probability is obtained by integrating over the interval. The probability that $X$ takes a value in the interval $[a, b]$ can be obtained by computing $P(a \leq X \leq B) = \int_a^b f(x)\ dx$
In particular, the probability that $X$ takes value at most $x$ is known as the cumulative distribution function (CDF) $F ā¶ S \to [0,1]$, so $F(x) = P(X \leq x)$
PDF and CDF are related via $F(x) = \int_{-\infty}^{\infty} f(t)\ dt$ and conversely $f(x) = Fā(x)$. The expectation satisfies, $E[X] = \int_{x \in S} x\ dx$
Exponential distribution
We denote $X \sim Exp(\lambda)$ that $X$ is distributed as the exponential distribution with parameter $\lambda > 0$ if its PDF satisfies: $$ f(x) = \begin{cases} \lambda e^{-\lambda x} & \ | \ x \geq 0 \newline 0 & \ | \ \text{otherwise} \end{cases} $$
Which means: $$ F(x) = \int_0^x f(t)\ dt = \begin{cases} 1 - e^{-\lambda x} & \ | \ x \geq 0 \newline 0 & \ | \ \text{otherwise} \end{cases} $$
$$ E[X] = \int_0^{\infty} tf(t)\ dt = \int_0^{\infty} t\lambda e^{-\lambda t}\ dt = \frac{1}{\lambda} $$
$$ Var[X] = E[X - E[X]^2] = \int_0^{\infty} \left(t - \frac{1}{\lambda}\right)^2 f(t)\ dt = \int_0^{\infty} \left(t - \frac{1}{\lambda}\right)^2 \lambda e^{-\lambda t}\ dt = \frac{1}{\lambda^2} $$
Normal distribution
The normal distribution is one of the most important distributions as a lot of natural phenomena are by their very nature normally distributed.
The PDF of a normal distribution with mean $\mu$ (sometimes called location) and standard deviation $\sigma$ (sometimes called scale) is: $$ f(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2} $$ The CDF $F(x) = \int_{-\infty}^{\infty} f(t)\ dt$ cannot be expressed in closed form in terms of elementary functions.
Normal distribution with $\mu = 0$ and $\sigma = 1$ is called the standard normal distribution and its PDF and CDF are denoted $\varphi(x)$ and $\Phi(x)$, respectively.
We denote $X \sim \mathcal{N}(\mu, \sigma)$ that $X$ is normally distributed.
Z-scores
Suppose we have some observations $x_1, x_2, \ldots, x_n$.
The sample mean is: $$ \bar{x} = \frac{1}{n} \sum_{i = 1}^n x_i $$
This is an unbiased estimator for the population mean.
The sample variance is: $$ s^2 = \frac{1}{n - 1} \sum_{i = 1}^{n} (x_i - \bar{x})^2 $$
The $\frac{1}{n - 1}$ factor is known as Besselās correction and makes this an unbiased estimator for the population variance.
Sample standard deviation is computed as: $$ s = \sqrt{s^2} $$
This is not unbiased, but this is good enough in most cases.
Population mean and standard deviation are often denoted as $\mu$ and $\sigma$ respectively.
The Z-score for an observation is: $$ z = \frac{x_i - \mu}{\sigma} $$
Since these are usually unknown, we use their unbiased estimators:
$$ z = \frac{x_i - \bar{x}}{s} $$
Computing the Z-score, brings the mean within the range $[0, 1]$ and the variance is 1. This makes observations insensitive to scaling and translation.
Concentration of measure
Knowing just the expectation and the variance of a random variable tells us a lot because of a phenomenon called concentration of measure. Chebyshevās inequality shows that, for any random variable with finite variance $\sigma^2$, the random variable deviates from its expectation by more than $k\sigma$ (more than $k$ standard deviations) is at most $\frac{1}{k^2}$.
This holds for all random variables, regardless of the distribution. This means that always at least 50% of probability mass is within $\sqrt{2}$ standard deviations from the mean, 75% within $2$ standard deviations, and so on.
This is what makes statistics work!
Law of large numbers and Central limit theorem
The law of large numbers says that if we draw samples uniformly at random from an arbitrary probability distribution, then their mean converges to the expectation
Suppose $x_1, x_2, \ldots, x_n$ are independent and identically distributed samples from a probability distribution with expectation $\mu$. Then: $$ \bar{x} = \frac{1}{n} \sum_{i = 1}^n x_i \to \mu \text{ as } n \to \infty $$
So, we can estimate the expectation by computing the arithmetic mean from a sample.
The central limit theorem says that the mean of samples, sampled from an arbitrary distribution, become normally distributed when there are a sufficiently large number of samples. When we say standard error around the mean, we mean: $$ SE = \frac{\sigma}{\sqrt{n}} $$
Which is the standard deviation of the distribution as suggested by CLT.
Estimators for mean and variance
An estimator $\hat{\theta}(x)$ is a function that computes an estimate for a parameter $\theta$ of the underlying population (the distribution) from a sample.
The bias of an estimator is the systematic error of the estimate: $E[\hat{\theta}] - \theta$ If the bias is zero, we say the estimator is unbiased. We already saw that the arithmetic mean is an unbiased estimator for the expectation by the law of large numbers; for a sample $x_1, x_2, \ldots, x_n$ the sample mean is: $$ \hat{\mu} = \bar{x} = \frac{1}{n} \sum_{i = 1}^n x_i $$
For variance, the unbiased sample variance is: $$ \hat{\sigma^2} = \frac{1}{n - 1} \sum_{i = 1}^n \left(x_i - \hat{\mu}\right)^2 $$
Sample standard deviation is simply the square root of the sample variance, although this is slightly biased: $$ \hat{\sigma} = \sqrt{\frac{1}{n - 1} \sum_{i = 1}^n \left(x_i - \hat{\mu}\right)^2} $$
Confidence interval
A point estimate is often too coarse, as there is a sizable variation around the mean, so instead it is better to give an interval than an individual point.
A confidence level sets the relative proportion of sampled confidence intervals that contain the true value of the parameter in question
The most commonly used confidence level is 95%; this means that out of all possible confidence intervals, the sampled interval should contain the correct parameter value in 95% of the cases.
Higher confidence level yields wider confidence intervals If, by CLT, we assume the estimator is normally distributed, then a confidence interval at 95% confidence level corresponds to the interval $[\hat{\mu} - 1.96SE, \hat{\mu} + 1.96SE]$ As $\sigma$ is often unknown, we usually estimate $SE = \frac{\hat{\sigma}}{\sqrt{n}}$.
Bernoulli distribution
Bernoulli distribution is a fundamental distribution for binary-valued questions: YES/NO, did an event occur? Etc.
$X \sim Bernoulli(p)$ for a parameter $p \in [0,1]$ if its PMF satisfies: $$ P(X = 1) = p \newline P(X = 0) = 1 - p $$
$$ E[X] = p \newline Var[X] = p(1 ā p) $$
Binomial distribution
The Binomial distribution models the number of positive outcomes of independent Bernoulli trials. If $X_1, X_2, \ldots, X_n \sim Bernoulli(p)$ are independent and identically distributed, then $X = \sum_{i = 1}^n X_i \sim Bin(n, p)$. So, $X$ counts how many of the Bernoulli variables were 1. $$ E[X] = np \newline Var[X] = np(p - 1) \newline $$
PMF: $$ P(X = k) = {n\choose k} p^k (1 - p){n - k} $$
Multinomial distribution
What if we have more than two categories of objects? $$ X \sim Mul(n; p_1, p_2, \ldots, p_k) $$
The PMF satisfies: $$ P(X = (x_1, x_2, \ldots, x_k)) = {n \choose {x_1, x_2, \ldots, x_k}} p_1^{x_1} p_2^{x_2} \dots p_k^{x_k} $$
$$ E[X_i] = np_i \newline Var[X_i] = np_i(1 - p_i) $$
Bag of words
Consider a text document of natural language; one way to represent the content of documents that is useful for classification tasks is the bag of words model. A bag is a multiset; a bag of words maps each word of a vocabulary to the number of times it occurred in a document
Classifying bags of words
In the bag of words model, each document $x$ is a vector of length $n$ where $n$ is the size of vocabulary. The elements of a document vector $x_i$ record the multiplicity of the word in that document.
Typically the vectors are sparse: most words never occur in a given document. Suppose we try to classify documents in to classes $C_1, C_2, \ldots, C_k$. Given an observation $x$, we can try to estimate the likelihood that a document from a class $j$ would generate a vector that matches the observation, $P(x | C_j)$.
We can combine this with the prior probability that we know how probable different classes are to occur, $P(C_j)$ By Bayesā theorem, the posterior probability of the class, given evidence, is: $$ P(C_j | x) = \frac{P(x | C_j)P(C_j)}{P(x)} $$
NaĆÆve Bayesian classifier
It would thus make sense to classify the document to the class that has the largest posterior probability. Unfortunately, we often have insufficient information about the dependence between features to make full Bayesian inference feasible. Instead, we can make the NaĆÆve Bayesian assumption: we assume that all features are independent.
Zero-values
If one of the values in some of the features is not observed for some class in the training set, the estimated value $P(X | C)$ = 0, which means that the posterior probability becomes necessarily zero This might not be what we wanted: some values might be exceedingly rare but still occur in practice.
We can deal with this by Laplace Smoothing, we choose a parameter $\alpha > 0$. If we use $d$ values for the feature and our empirical proportion is $p_i = \frac{x_i}{n}$, and we have $d$ different values then we change this to: $$ \hat{p_i} = \frac{x_i + \alpha}{n + \alpha d} $$
This corresponds to adding $\alpha$ imaginary observations of each value
Bernoulli NaĆÆve Bayes
Suppose $x$ is a bag of words over a vocabulary of $n$ words.
We can then treat each $x_i$ as a Bernoulli variable: we donāt care about the count, only if the word is present in a document; that is, $x \in \{0, 1\}^n$.
The estimate $p_{i,j} = P(x_i | C_j)$ is then simply the fraction of documents of class $C_j$ that contain the word $i$.
Likelihood of the document $x$ being generated from class $C_j$ would thus be: $$ P(x | C_j) = \prod_{i = 1}^n p_{i,j}^{x_i} (1 - p_{i, j})^{1 - x_i} $$
Multinomial NaĆÆve Bayes
Suppose $x$ is a bag of words over a vocabulary of $n$ words.
We can then treat each $x_i$ as a binomial variable: this time we care about the count, so $x \in \mathbb{N}^n$. The estimate $p_{i,j} = P(x_i | C_j)$ is then simply the fraction of words in total of class $C_j$ that are equal to the word $i$.
Likelihood of the document $x$ being generated from class $C_j$ would thus be determined by a multinomial distribution: $$ P(x | C_j) = {{\sum_{i =1}^n x_i} \choose {x_1, x_2, \ldots, x_k}} \prod_{i = 1}^n p_{i,j}^{x_i} $$
Logit function
The logit function is defined as: $$ f : (-\infty, \infty) \to [0, 1] $$
$$ f(x) = \dfrac{1}{1 + e^{-cx}} \ | \ c > 0 $$
The logit function turns arbitrary values into probabilities. $$ \lim_{x \to \infty} f(x) = 1 \newline \lim_{x \to -\infty} f(x) = 0 \newline \lim_{x \to 0} f(x) = \frac{1}{2} $$