
Probability Basics

Probability theory provides the mathematical framework for reasoning under uncertainty -- the other foundational language of machine learning alongside linear algebra. Every ML model makes probabilistic assumptions (implicitly or explicitly), and understanding probability is essential for designing, training, and interpreting models.

Sample Space and Events

A **probability space** $(\Omega, \mathcal{F}, P)$ consists of:
  • Sample space $\Omega$: the set of all possible outcomes of an experiment
  • Event algebra $\mathcal{F}$: a $\sigma$-algebra -- a collection of subsets of $\Omega$ closed under complement and countable union
  • Probability measure $P: \mathcal{F} \to [0, 1]$: a function assigning probabilities to events
**Probability spaces in ML:**
  • **Classification:** $\Omega = \{1, 2, \ldots, K\}$ (class labels), $P$ given by the softmax output
  • **Language modeling:** $\Omega = \mathcal{V}^T$ (all sequences of length $T$ over vocabulary $\mathcal{V}$), $P(x_1, \ldots, x_T) = \prod_t P(x_t \mid x_{<t})$
  • **Bayesian inference:** $\Omega = \Theta$ (parameter space), $P$ is the posterior distribution

Axioms of Probability

All of probability theory follows from three axioms [@kolmogorov1933foundations]:
  1. Non-negativity: $P(A) \geq 0$ for all events $A$
  2. Normalization: $P(\Omega) = 1$
  3. Countable additivity: for mutually exclusive events $A_1, A_2, \ldots$: $P\!\left(\bigcup_i A_i\right) = \sum_i P(A_i)$

From these axioms, all other rules follow:

  • $P(\bar{A}) = 1 - P(A)$ (complement)
  • $P(A \cup B) = P(A) + P(B) - P(A \cap B)$ (inclusion-exclusion)
  • $P(\emptyset) = 0$
  • If $A \subseteq B$, then $P(A) \leq P(B)$ (monotonicity)

Conditional Probability and Bayes' Theorem

The probability of $A$ given that $B$ has occurred:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \quad P(B) > 0$$

This defines a new probability measure $P(\cdot \mid B)$ that concentrates on the event $B$.

**Bayes' theorem** inverts the direction of conditioning:

$$P(\theta \mid \mathcal{D}) = \frac{P(\mathcal{D} \mid \theta) \, P(\theta)}{P(\mathcal{D})} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}}$$

where $P(\mathcal{D}) = \sum_\theta P(\mathcal{D} \mid \theta) P(\theta)$ (discrete) or $\int P(\mathcal{D} \mid \theta) P(\theta) \, d\theta$ (continuous) is the marginal likelihood (evidence).

Bayes' theorem is the engine of Bayesian machine learning:
| Component | Interpretation in ML |
| --- | --- |
| $P(\theta)$ (prior) | Our belief about model parameters before seeing data (e.g., Gaussian $\to$ L2 regularization) |
| $P(\mathcal{D} \mid \theta)$ (likelihood) | How probable the observed data is under parameters $\theta$; maximized by MLE |
| $P(\theta \mid \mathcal{D})$ (posterior) | Updated belief about the parameters after seeing the data |
| $P(\mathcal{D})$ (evidence) | Model quality score for model comparison; intractable for neural networks |

MAP estimation maximizes $P(\theta \mid \mathcal{D})$, which equals MLE + regularization. Full Bayesian inference integrates over $P(\theta \mid \mathcal{D})$, giving calibrated uncertainty.
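The MAP--regularization correspondence can be made explicit in two lines. Assuming a Gaussian prior $\theta \sim \mathcal{N}(0, \sigma^2 I)$:

```latex
\hat{\theta}_{\text{MAP}}
  = \arg\max_\theta \left[ \log P(\mathcal{D} \mid \theta) + \log P(\theta) \right]
  = \arg\max_\theta \left[ \log P(\mathcal{D} \mid \theta) - \frac{\|\theta\|^2}{2\sigma^2} \right]
```

since $\log P(\mathcal{D})$ does not depend on $\theta$ and the Gaussian log-prior is $-\|\theta\|^2 / (2\sigma^2)$ plus a constant. The second term is exactly an L2 penalty with strength $\lambda = 1/(2\sigma^2)$: a broader prior means weaker regularization.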

**Spam classification with Bayes' theorem.** Suppose 20% of emails are spam ($P(S) = 0.2$). The word "free" appears in 80% of spam and 10% of non-spam emails: $P(\text{free}|S) = 0.8$, $P(\text{free}|\bar{S}) = 0.1$. Given an email containing "free," what is the probability it is spam?

$$P(S \mid \text{free}) = \frac{P(\text{free} \mid S) \cdot P(S)}{P(\text{free})} = \frac{0.8 \times 0.2}{0.8 \times 0.2 + 0.1 \times 0.8} = \frac{0.16}{0.24} = \frac{2}{3} \approx 0.67$$

The prior probability of spam (20%) is updated to 67% after observing "free." This is the foundation of Naive Bayes classifiers, which assume features are conditionally independent given the class: $P(x_1, \ldots, x_n \mid y) = \prod_i P(x_i \mid y)$.
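The spam calculation above is easy to check directly; a minimal sketch (variable names are illustrative):

```python
# Bayes update for the spam example: compute P(S | "free").
p_spam = 0.2          # prior P(S)
p_free_spam = 0.8     # likelihood P(free | S)
p_free_ham = 0.1      # likelihood P(free | not S)

# Evidence P(free) via the law of total probability
p_free = p_free_spam * p_spam + p_free_ham * (1 - p_spam)

# Posterior P(S | free) = likelihood * prior / evidence
posterior = p_free_spam * p_spam / p_free
print(round(posterior, 4))  # 0.6667
```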

**Law of total probability.** For a partition $\{B_1, B_2, \ldots, B_n\}$ of $\Omega$:

$$P(A) = \sum_{i=1}^n P(A \mid B_i) P(B_i)$$

This is the "denominator" in Bayes' theorem and is used in marginalization: $P(x) = \sum_z P(x \mid z) P(z)$.

Independence

Events $A$ and $B$ are **independent** ($A \perp B$) if $P(A \cap B) = P(A)P(B)$, equivalently $P(A|B) = P(A)$.

Random variables $X$ and $Y$ are independent if $P(X \in S, Y \in T) = P(X \in S) \cdot P(Y \in T)$ for all measurable sets $S, T$. Equivalently, the joint density factors: $p(x, y) = p(x)p(y)$.

Conditional independence: $X \perp Y \mid Z$ means $p(x, y \mid z) = p(x \mid z) \cdot p(y \mid z)$. This is central to graphical models: nodes in a Bayesian network are conditionally independent given their parents.

The **i.i.d. assumption** (independent and identically distributed) is the most common assumption in ML: training examples $(x_i, y_i)$ are assumed drawn independently from the same distribution $P$. This enables:
  • The law of large numbers: $\hat{R}(h) \to R(h)$ as $n \to \infty$
  • Concentration inequalities: $|\hat{R}(h) - R(h)| = O(1/\sqrt{n})$
  • The validity of cross-validation

The i.i.d. assumption breaks in time series, reinforcement learning, and distribution shift settings.
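Under the i.i.d. assumption, the law of large numbers can be watched directly. A small simulation (Bernoulli draws with illustrative parameters):

```python
import random

random.seed(0)  # reproducible draws

def empirical_mean(n, p=0.3):
    """Average of n i.i.d. Bernoulli(p) samples -- an empirical risk estimate."""
    return sum(random.random() < p for _ in range(n)) / n

# The gap |mean - p| shrinks at the O(1/sqrt(n)) rate as n grows
for n in (100, 10_000, 1_000_000):
    print(n, empirical_mean(n))
```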

Expectation and Variance

The **expectation** (mean) of a random variable $X$ with respect to distribution $p$:

$$\mathbb{E}[X] = \sum_x x \, p(x) \quad \text{(discrete)}, \qquad \mathbb{E}[X] = \int x \, p(x) \, dx \quad \text{(continuous)}$$

For a function $g(X)$: $\mathbb{E}[g(X)] = \sum_x g(x) \, p(x)$ (LOTUS -- the Law of the Unconscious Statistician).

Key properties:

  1. Linearity (always, even for dependent variables): $\mathbb{E}[aX + bY] = a\mathbb{E}[X] + b\mathbb{E}[Y]$
  2. Product rule for independent variables: if $X \perp Y$, then $\mathbb{E}[XY] = \mathbb{E}[X]\mathbb{E}[Y]$
  3. Jensen's inequality (for convex $g$): $g(\mathbb{E}[X]) \leq \mathbb{E}[g(X)]$

**Jensen's inequality** appears throughout ML:
  • **Why log-likelihood is the training objective:** $\log \mathbb{E}[X] \geq \mathbb{E}[\log X]$ (concave $\log$) gives the ELBO
  • **Why KL divergence is non-negative:** apply Jensen to get $\mathbb{E}_p[\log(q/p)] \leq \log \mathbb{E}_p[q/p] = 0$
  • **Why the average model is better than the average prediction:** ensemble methods exploit convexity of the loss

The **variance** measures the expected squared deviation from the mean:

$$\text{Var}(X) = \mathbb{E}[(X - \mathbb{E}[X])^2] = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$$

The standard deviation is $\text{Std}(X) = \sqrt{\text{Var}(X)}$. Properties:

$$\text{Var}(aX + b) = a^2 \text{Var}(X), \qquad \text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2\,\text{Cov}(X, Y)$$

For independent $X, Y$: $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)$.
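Both variance rules can be verified exactly on a small discrete distribution; a sketch using a fair die, enumerated in pure Python:

```python
from itertools import product

die = list(range(1, 7))
p = 1 / 6  # fair die: uniform probabilities

def var(values, probs):
    """Variance of a discrete distribution: E[(X - E[X])^2]."""
    mean = sum(v * q for v, q in zip(values, probs))
    return sum(q * (v - mean) ** 2 for v, q in zip(values, probs))

# Var(X) for a fair die is 35/12
vx = var(die, [p] * 6)

# Var(aX + b) = a^2 Var(X), checked with a = 2, b = 3
v_scaled = var([2 * x + 3 for x in die], [p] * 6)
assert abs(v_scaled - 4 * vx) < 1e-9

# Var(X + Y) = Var(X) + Var(Y) for two independent dice,
# enumerating the 36 equally likely outcomes of the joint distribution
sums = [x + y for x, y in product(die, die)]
v_sum = var(sums, [p * p] * 36)
assert abs(v_sum - 2 * vx) < 1e-9
```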

Covariance and Correlation

The **covariance** between random variables $X$ and $Y$ measures their linear dependence:

$$\text{Cov}(X, Y) = \mathbb{E}[(X - \mu_X)(Y - \mu_Y)] = \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y]$$

The Pearson correlation coefficient normalizes covariance to $[-1, 1]$:

$$\rho(X, Y) = \frac{\text{Cov}(X, Y)}{\text{Std}(X) \cdot \text{Std}(Y)}$$

$|\rho| = 1$ iff $X$ and $Y$ are linearly related; $\rho = 0$ means uncorrelated.

For a random vector $X \in \mathbb{R}^n$, the covariance matrix is $\Sigma \in \mathbb{R}^{n \times n}$ with $\Sigma_{ij} = \text{Cov}(X_i, X_j)$. This matrix is always symmetric and positive semi-definite ($z^\top \Sigma z = \text{Var}(z^\top X) \geq 0$).
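These two properties are easy to confirm numerically; a sketch assuming NumPy is available (the data-generating step is arbitrary, chosen only to produce correlated components):

```python
import numpy as np

rng = np.random.default_rng(0)
# 1000 samples of a 3-dimensional random vector with correlated components
X = rng.normal(size=(1000, 3)) @ rng.normal(size=(3, 3))

Sigma = np.cov(X, rowvar=False)  # sample covariance matrix, shape (3, 3)

# Symmetric: Sigma equals its transpose
assert np.allclose(Sigma, Sigma.T)

# Positive semi-definite: all eigenvalues are nonnegative
eigvals = np.linalg.eigvalsh(Sigma)
assert (eigvals >= -1e-10).all()

# z^T Sigma z is exactly the sample variance of the projection z^T X
z = rng.normal(size=3)
assert np.isclose(z @ Sigma @ z, np.var(X @ z, ddof=1))
```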

**Correlation $\neq$ independence.** Uncorrelated ($\text{Cov}(X,Y) = 0$) means no *linear* relationship. Independent means no relationship of *any kind*. For Gaussian random variables, uncorrelated *does* imply independent (a unique property of the Gaussian). For other distributions, $X$ and $Y$ can be uncorrelated yet strongly dependent (e.g., $Y = X^2$ for symmetric $X$).
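The $Y = X^2$ counterexample works out exactly for $X$ uniform on $\{-1, 0, 1\}$; a pure-Python check:

```python
# X uniform on {-1, 0, 1}, Y = X^2: uncorrelated, yet Y is a function of X.
xs = [-1, 0, 1]
p = 1 / 3

ex = sum(p * x for x in xs)            # E[X]  = 0
ey = sum(p * x**2 for x in xs)         # E[Y]  = 2/3
exy = sum(p * x * x**2 for x in xs)    # E[XY] = E[X^3] = 0
cov = exy - ex * ey
print(cov)  # 0.0 -> uncorrelated

# But clearly dependent: P(Y = 1 | X = 1) = 1, while P(Y = 1) = 2/3
```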

Concentration Inequalities

These bound the probability that a random variable deviates from its mean:
| Inequality | Statement | Conditions |
| --- | --- | --- |
| Markov | $P(X \geq a) \leq \mathbb{E}[X]/a$ | $X \geq 0$ |
| Chebyshev | $P(\lvert X - \mu \rvert \geq a) \leq \text{Var}(X)/a^2$ | finite variance |
| Hoeffding | $P(\lvert \bar{X}_n - \mu \rvert \geq t) \leq 2 e^{-2nt^2/(b-a)^2}$ | $X_i \in [a, b]$ bounded |
| Bernstein | $P(\lvert \bar{X}_n - \mu \rvert \geq t) \leq 2 \exp\!\left(-\frac{nt^2}{2\sigma^2 + 2bt/3}\right)$ | bounded, exploits variance $\sigma^2$ |
**Hoeffding's inequality and generalization.** For a finite hypothesis class $\mathcal{H}$ with $|\mathcal{H}|$ members and bounded loss $\ell \in [0, 1]$, a union bound over Hoeffding gives:

$$P\!\left(\sup_{h \in \mathcal{H}} |R(h) - \hat{R}(h)| \geq t\right) \leq 2|\mathcal{H}| e^{-2nt^2}$$

Setting the RHS $= \delta$ and solving for $t$: the generalization gap is at most $\sqrt{\frac{\log(2|\mathcal{H}|/\delta)}{2n}}$ with probability $\geq 1 - \delta$. This is the foundation of PAC learning bounds.
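Plugging numbers into that bound gives a feel for the $O(1/\sqrt{n})$ rate; the values of $|\mathcal{H}|$, $n$, and $\delta$ below are illustrative:

```python
import math

def hoeffding_gap(num_hypotheses, n, delta=0.05):
    """Generalization gap bound: sqrt(log(2|H|/delta) / (2n))."""
    return math.sqrt(math.log(2 * num_hypotheses / delta) / (2 * n))

# The bound shrinks like 1/sqrt(n): 100x more data -> 10x smaller gap
for n in (1_000, 10_000, 100_000):
    print(n, round(hoeffding_gap(num_hypotheses=1_000, n=n), 4))
```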

Notation Summary

| Symbol | Meaning |
| --- | --- |
| $\Omega$ | Sample space |
| $\mathcal{F}$ | $\sigma$-algebra (event space) |
| $P(A)$ | Probability of event $A$ |
| $P(A \mid B)$ | Conditional probability of $A$ given $B$ |
| $\mathbb{E}[X]$ | Expectation of $X$ |
| $\text{Var}(X)$ | Variance of $X$ |
| $\text{Cov}(X, Y)$ | Covariance of $X$ and $Y$ |
| $\Sigma$ | Covariance matrix |
| $\rho$ | Correlation coefficient |
| $X \perp Y$ | $X$ and $Y$ are independent |
| $X \perp Y \mid Z$ | $X$ and $Y$ are conditionally independent given $Z$ |
| i.i.d. | Independent and identically distributed |