# Probability Basics
Probability theory provides the mathematical framework for reasoning under uncertainty -- the other foundational language of machine learning alongside linear algebra. Every ML model makes probabilistic assumptions (implicitly or explicitly), and understanding probability is essential for designing, training, and interpreting models.
## Sample Space and Events
- Sample space $\Omega$: the set of all possible outcomes of an experiment
- Event algebra $\mathcal{F}$: a $\sigma$-algebra -- a collection of subsets of $\Omega$ closed under complement and countable union
- Probability measure $P: \mathcal{F} \to [0, 1]$: a function assigning probabilities to events
## Axioms of Probability
- Non-negativity: $P(A) \ge 0$ for all events $A$
- Normalization: $P(\Omega) = 1$
- Countable additivity: for mutually exclusive events $A_1, A_2, \dots$: $P\left(\bigcup_i A_i\right) = \sum_i P(A_i)$
From these axioms, all other rules follow:
- $P(A^c) = 1 - P(A)$ (complement)
- $P(A \cup B) = P(A) + P(B) - P(A \cap B)$ (inclusion-exclusion)
- If $A \subseteq B$, then $P(A) \le P(B)$ (monotonicity)
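These derived rules can be checked mechanically on a small finite sample space. A minimal sketch using a fair six-sided die (the die and events are illustrative, not from the text):

```python
from fractions import Fraction

# Uniform probability measure on a fair six-sided die: Omega = {1,...,6}.
omega = {1, 2, 3, 4, 5, 6}

def P(event):
    return Fraction(len(event & omega), len(omega))

A = {2, 4, 6}   # "roll is even"
B = {4, 5, 6}   # "roll is greater than 3"

# Complement rule: P(A^c) = 1 - P(A)
assert P(omega - A) == 1 - P(A)
# Inclusion-exclusion: P(A u B) = P(A) + P(B) - P(A n B)
assert P(A | B) == P(A) + P(B) - P(A & B)
# Monotonicity: {6} is a subset of B, so P({6}) <= P(B)
assert P({6}) <= P(B)

print(P(A | B))  # 2/3
```

Using `Fraction` keeps every probability exact, so the identities hold with `==` rather than floating-point tolerance.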
## Conditional Probability and Bayes' Theorem
For $P(B) > 0$, the conditional probability of $A$ given $B$ is $P(A \mid B) = \frac{P(A \cap B)}{P(B)}$. This defines a new probability measure that concentrates on the event $B$.
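A quick numerical check of the conditioning formula, on an assumed two-coin experiment (not from the text):

```python
from fractions import Fraction

# Two fair coin flips: Omega = {HH, HT, TH, TT}, all equally likely.
omega = {"HH", "HT", "TH", "TT"}

def P(event):
    return Fraction(len(event), len(omega))

A = {"HH", "HT"}          # first flip is heads
B = {"HH", "HT", "TH"}    # at least one heads

# P(A | B) = P(A n B) / P(B)
print(P(A & B) / P(B))  # 2/3

# The conditional measure is a genuine probability measure on B: P(B | B) = 1
assert P(B & B) / P(B) == 1
```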
$$P(\theta \mid \mathcal{D}) = \frac{P(\mathcal{D} \mid \theta)\, P(\theta)}{P(\mathcal{D})}$$

where $P(\mathcal{D}) = \sum_{\theta} P(\mathcal{D} \mid \theta)\, P(\theta)$ (discrete) or $P(\mathcal{D}) = \int P(\mathcal{D} \mid \theta)\, P(\theta)\, d\theta$ (continuous) is the marginal likelihood (evidence).
| Component | Interpretation in ML |
|---|---|
| $P(\theta)$ (prior) | Our belief about model parameters before seeing data (e.g., a Gaussian prior corresponds to L2 regularization) |
| $P(\mathcal{D} \mid \theta)$ (likelihood) | How probable the observed data is under parameters $\theta$; the quantity maximized by MLE |
| $P(\theta \mid \mathcal{D})$ (posterior) | Updated belief about the parameters after seeing the data |
| $P(\mathcal{D})$ (evidence) | Model quality score for model comparison; intractable for neural networks |
MAP estimation maximizes $P(\theta \mid \mathcal{D}) \propto P(\mathcal{D} \mid \theta)\, P(\theta)$, which in log space equals MLE + regularization. Full Bayesian inference integrates over $\theta$, giving calibrated uncertainty.
The prior probability of spam (20%) is updated to 67% after observing "free." This is the foundation of Naive Bayes classifiers, which assume features are conditionally independent given the class: $P(x_1, \dots, x_n \mid y) = \prod_{i=1}^{n} P(x_i \mid y)$.
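The 20% → 67% update can be reproduced numerically. The likelihoods below ($P(\text{free} \mid \text{spam}) = 0.8$, $P(\text{free} \mid \text{ham}) = 0.1$) are assumed for illustration; any pair with the same 8:1 ratio yields the same posterior:

```python
# Bayes update for spam given the word "free".
# The two likelihood values are illustrative assumptions, chosen to
# reproduce the 20% -> 67% update described in the text.
prior_spam = 0.20
p_free_given_spam = 0.80   # assumed
p_free_given_ham = 0.10    # assumed

# Evidence: P(free) = sum over classes of likelihood * prior
evidence = (p_free_given_spam * prior_spam
            + p_free_given_ham * (1 - prior_spam))
posterior_spam = p_free_given_spam * prior_spam / evidence

print(f"{posterior_spam:.2f}")  # 0.67
```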
The evidence $P(\mathcal{D})$ is the "denominator" in Bayes' theorem and is obtained by marginalization: $P(\mathcal{D}) = \int P(\mathcal{D} \mid \theta)\, P(\theta)\, d\theta$.
## Independence
Random variables $X$ and $Y$ are independent, written $X \perp Y$, if $P(X \in A,\, Y \in B) = P(X \in A)\, P(Y \in B)$ for all measurable sets $A, B$. Equivalently, the joint density factors: $p(x, y) = p(x)\, p(y)$.
Conditional independence: $X \perp Y \mid Z$ means $p(x, y \mid z) = p(x \mid z)\, p(y \mid z)$. This is central to graphical models: each node in a Bayesian network is conditionally independent of its non-descendants given its parents.
In ML, training examples are typically assumed i.i.d., so the likelihood factors: $P(\mathcal{D} \mid \theta) = \prod_{i=1}^{n} p(x_i \mid \theta)$. The i.i.d. assumption breaks in time series, reinforcement learning, and distribution shift settings.
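Under the i.i.d. assumption the log-likelihood is a sum over examples. A minimal sketch for a Bernoulli model, with the data and parameter value assumed for illustration:

```python
import math

# i.i.d. log-likelihood: log P(D | theta) = sum_i log p(x_i | theta)
data = [1, 0, 1, 1, 0, 1]   # illustrative coin flips
theta = 0.5                  # assumed Bernoulli parameter

log_lik = sum(math.log(theta if x == 1 else 1 - theta) for x in data)
print(round(log_lik, 4))  # 6 * log(0.5) = -4.1589
```

Working in log space turns the product of per-example probabilities into a numerically stable sum, which is why training objectives are log-likelihoods.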
## Expectation and Variance
The expectation of a random variable is $\mathbb{E}[X] = \sum_x x\, P(X = x)$ (discrete) or $\mathbb{E}[X] = \int x\, p(x)\, dx$ (continuous). For a function $g$: $\mathbb{E}[g(X)] = \sum_x g(x)\, P(X = x)$ or $\int g(x)\, p(x)\, dx$ (LOTUS -- Law of the Unconscious Statistician).
Key properties:
- Linearity (always, even for dependent variables): $\mathbb{E}[aX + bY] = a\,\mathbb{E}[X] + b\,\mathbb{E}[Y]$
- Product rule for independent variables: if $X \perp Y$, then $\mathbb{E}[XY] = \mathbb{E}[X]\,\mathbb{E}[Y]$
- Jensen's inequality (for convex $f$): $f(\mathbb{E}[X]) \le \mathbb{E}[f(X)]$
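LOTUS and Jensen's inequality can be checked exactly on a small discrete distribution (the pmf below is illustrative):

```python
from fractions import Fraction as F

# Illustrative discrete distribution: P(X=0)=1/2, P(X=1)=1/4, P(X=2)=1/4
pmf = {0: F(1, 2), 1: F(1, 4), 2: F(1, 4)}

E_X = sum(x * p for x, p in pmf.items())        # E[X]
E_X2 = sum(x**2 * p for x, p in pmf.items())    # LOTUS with g(x) = x^2

# Jensen for the convex function f(x) = x^2: f(E[X]) <= E[f(X)]
assert E_X**2 <= E_X2

print(E_X, E_X2)  # 3/4 5/4
```

The gap $\mathbb{E}[X^2] - (\mathbb{E}[X])^2$ is exactly the variance, so Jensen applied to $f(x) = x^2$ is another way of saying variance is non-negative.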
The variance is $\mathrm{Var}(X) = \mathbb{E}\left[(X - \mathbb{E}[X])^2\right] = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$, and the standard deviation is $\sigma = \sqrt{\mathrm{Var}(X)}$. Properties: $\mathrm{Var}(aX + b) = a^2\, \mathrm{Var}(X)$.
For independent $X, Y$: $\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)$.
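Both variance properties can be verified exactly with fair dice (an assumed example, computed with exact rational arithmetic):

```python
from fractions import Fraction as F
from itertools import product

def var(pmf):
    """Var(X) = E[X^2] - E[X]^2 for a discrete pmf {value: prob}."""
    E = sum(x * p for x, p in pmf.items())
    E2 = sum(x * x * p for x, p in pmf.items())
    return E2 - E * E

die = {x: F(1, 6) for x in range(1, 7)}   # fair die, Var = 35/12

# Var(aX + b) = a^2 Var(X): scale/shift the die with a = 3, b = 7
scaled = {3 * x + 7: p for x, p in die.items()}
assert var(scaled) == 9 * var(die)

# Independent X, Y: Var(X + Y) = Var(X) + Var(Y)
pair_sum = {}
for (x, px), (y, py) in product(die.items(), die.items()):
    pair_sum[x + y] = pair_sum.get(x + y, F(0)) + px * py
assert var(pair_sum) == 2 * var(die)

print(var(die))  # 35/12
```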
## Covariance and Correlation
The covariance is $\mathrm{Cov}(X, Y) = \mathbb{E}\left[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])\right] = \mathbb{E}[XY] - \mathbb{E}[X]\,\mathbb{E}[Y]$. The Pearson correlation coefficient normalizes covariance to $[-1, 1]$: $\rho_{XY} = \frac{\mathrm{Cov}(X, Y)}{\sigma_X\, \sigma_Y}$.
$|\rho| = 1$ iff $X$ and $Y$ are linearly related; $\rho = 0$ means uncorrelated, which does not imply independence.
For a random vector $X = (X_1, \dots, X_d)^\top$, the covariance matrix is $\Sigma = \mathbb{E}\left[(X - \mu)(X - \mu)^\top\right]$ with $\Sigma_{ij} = \mathrm{Cov}(X_i, X_j)$. This matrix is always symmetric and positive semi-definite ($v^\top \Sigma v \ge 0$ for all $v$).
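A sketch that estimates a covariance matrix from sampled data and spot-checks symmetry and positive semi-definiteness (the 3-dimensional correlated data is assumed for illustration):

```python
import random

random.seed(0)
# Sample a 3-dimensional dataset whose first two coordinates are correlated.
n, d = 500, 3
data = []
for _ in range(n):
    z = random.gauss(0, 1)
    data.append([z, 0.5 * z + random.gauss(0, 1), random.gauss(0, 1)])

mean = [sum(row[j] for row in data) / n for j in range(d)]
# Empirical covariance matrix: Sigma_ij = mean of (X_i - mu_i)(X_j - mu_j)
Sigma = [[sum((row[i] - mean[i]) * (row[j] - mean[j]) for row in data) / n
          for j in range(d)] for i in range(d)]

# Symmetry: Sigma_ij == Sigma_ji (up to floating-point error)
assert all(abs(Sigma[i][j] - Sigma[j][i]) < 1e-12
           for i in range(d) for j in range(d))

# PSD spot-check: v^T Sigma v >= 0 for random directions v
for _ in range(100):
    v = [random.gauss(0, 1) for _ in range(d)]
    quad = sum(v[i] * Sigma[i][j] * v[j] for i in range(d) for j in range(d))
    assert quad >= -1e-12

print("symmetric and PSD checks passed")
```

The PSD property is automatic here: any empirical covariance matrix is an average of rank-one outer products $(x - \mu)(x - \mu)^\top$, each of which is PSD.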
## Concentration Inequalities
| Inequality | Statement | Conditions |
|---|---|---|
| Markov | $P(X \ge a) \le \frac{\mathbb{E}[X]}{a}$ | $X \ge 0$, $a > 0$ |
| Chebyshev | $P(\lvert X - \mu \rvert \ge k\sigma) \le \frac{1}{k^2}$ | finite variance |
| Hoeffding | $P(\lvert \bar{X}_n - \mu \rvert \ge \epsilon) \le 2\exp\left(-\frac{2n\epsilon^2}{(b - a)^2}\right)$ | $X_i \in [a, b]$ i.i.d. |
| Bernstein | $P(\lvert \bar{X}_n - \mu \rvert \ge \epsilon) \le 2\exp\left(-\frac{n\epsilon^2 / 2}{\sigma^2 + (b - a)\epsilon / 3}\right)$ | $X_i \in [a, b]$ i.i.d.; tighter than Hoeffding when $\sigma^2$ is small |
Setting the RHS of Hoeffding's bound to $\delta$ and solving for $\epsilon$: for $[0, 1]$-bounded losses, the generalization gap is at most $\sqrt{\frac{\log(2/\delta)}{2n}}$ with probability $1 - \delta$. This is the foundation of PAC learning bounds.
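The inverted Hoeffding bound is a one-liner; the sample size and confidence level below are assumed for illustration:

```python
import math

# Hoeffding, inverted: with probability 1 - delta,
# |empirical mean - mu| <= sqrt(log(2/delta) / (2n))
# for n i.i.d. samples bounded in [0, 1].
def hoeffding_eps(n, delta):
    return math.sqrt(math.log(2 / delta) / (2 * n))

# Illustrative: 10,000 samples at 95% confidence
print(round(hoeffding_eps(10_000, 0.05), 4))  # 0.0136
```

Note the $O(1/\sqrt{n})$ rate: quadrupling the data only halves the bound on the generalization gap.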
## Notation Summary
| Symbol | Meaning |
|---|---|
| $\Omega$ | Sample space |
| $\mathcal{F}$ | $\sigma$-algebra (event space) |
| $P(A)$ | Probability of event $A$ |
| $P(A \mid B)$ | Conditional probability of $A$ given $B$ |
| $\mathbb{E}[X]$ | Expectation of $X$ |
| $\mathrm{Var}(X)$ | Variance of $X$ |
| $\mathrm{Cov}(X, Y)$ | Covariance of $X$ and $Y$ |
| $\Sigma$ | Covariance matrix |
| $\rho$ | Correlation coefficient |
| $X \perp Y$ | $X$ and $Y$ are independent |
| $X \perp Y \mid Z$ | $X$ and $Y$ are conditionally independent given $Z$ |
| i.i.d. | Independent and identically distributed |