Randomized Regularization & Training
Some of the most impactful applications of randomization in deep learning are as regularization and training techniques. Unlike the approximation algorithms discussed in previous sections (which aim to compute something cheaper), these methods use randomness to improve generalization and training dynamics.
Dropout
Srivastava et al. (2014) introduced Dropout, arguably the most influential randomized technique in deep learning. During training, each neuron is independently set to zero with probability p (typically p = 0.1 to 0.5); at test time, all neurons are active but their outputs are scaled by (1-p) so that the expected activation matches its training-time value. This simple trick provides powerful regularization, reducing overfitting by preventing neurons from co-adapting -- each neuron must learn to be useful independently of any particular other neuron. The practical impact has been enormous: dropout is used in virtually every modern deep learning architecture, and its introduction led to a cascade of architectural and theoretical developments.
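The train/test asymmetry above can be made concrete with a minimal NumPy sketch (function names are illustrative, not from the original paper): in expectation, the randomly masked training activations match the deterministically scaled test activations.

```python
import numpy as np

def dropout_train(x, p, rng):
    """Training time: independently zero each unit with probability p."""
    mask = (rng.random(x.shape) >= p).astype(x.dtype)  # 1 = keep, 0 = drop
    return x * mask

def dropout_test(x, p):
    """Test time: keep every unit but scale outputs by (1 - p)."""
    return x * (1.0 - p)

rng = np.random.default_rng(0)
x = np.ones(100_000)
p = 0.5
# In expectation the two regimes agree: E[mask * x] = (1 - p) * x.
train_mean = dropout_train(x, p, rng).mean()  # close to 1 - p
test_mean = dropout_test(x, p).mean()         # exactly 1 - p
```

Many frameworks instead use "inverted" dropout, dividing by (1-p) at training time so the test-time forward pass needs no scaling; the two conventions are equivalent in expectation.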
Dropout has several theoretical interpretations:
- Ensemble interpretation: Training with dropout is equivalent to training an exponential number of "thinned" networks (one for each dropout mask) and averaging their predictions at test time (Srivastava et al., 2014). For a network with n units, each dropout mask defines one of 2^n possible sub-networks. Training with dropout approximately trains all 2^n sub-networks simultaneously (with shared weights), and test-time scaling approximates geometric averaging over all sub-networks. This connection to ensembles explains dropout's effectiveness: ensemble methods reduce variance, and dropout achieves this implicitly.
- Bayesian interpretation: Gal and Ghahramani (2016) showed that a network trained with dropout performs approximate Bayesian inference in a deep Gaussian process, and that applying dropout at test time (Monte Carlo Dropout) yields practical uncertainty estimates. This connection enabled approximate Bayesian deep learning without the computational overhead of explicit variational inference. Concretely, running T forward passes with dropout and computing the variance of predictions gives an estimate of model uncertainty: high variance indicates that the model is uncertain, which is crucial for safety-critical applications like medical diagnosis and autonomous driving.
- Noise injection interpretation: Dropout can be viewed as a form of multiplicative noise injection, related to data augmentation and noise regularization (Wager et al., 2013). Wager et al. (2013) showed that for generalized linear models, dropout is equivalent to an adaptive L2 regularizer whose strength depends on the feature magnitudes -- features with larger magnitudes are penalized more heavily, which encourages the model to distribute its reliance across many features rather than concentrating on a few.
- Information bottleneck interpretation: Dropout forces the network to learn redundant representations -- since any neuron may be dropped, the network must spread information across multiple neurons. This creates an implicit information bottleneck where each layer must encode sufficient information for downstream computation even with random dropout, leading to more robust and transferable features.
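The Monte Carlo Dropout procedure from the Bayesian interpretation is a few lines of NumPy. This is a hedged sketch of a two-layer ReLU network with illustrative weights, not the authors' implementation; the key point is that dropout stays active at test time and the spread across T passes serves as the uncertainty estimate.

```python
import numpy as np

def mc_dropout_predict(x, W1, W2, p, T, rng):
    """Run T stochastic forward passes with dropout active; return the
    mean prediction and its standard deviation (an uncertainty proxy)."""
    preds = []
    for _ in range(T):
        h = np.maximum(x @ W1, 0.0)                      # ReLU hidden layer
        h = h * (rng.random(h.shape) >= p) / (1.0 - p)   # inverted dropout
        preds.append(h @ W2)
    preds = np.stack(preds)
    return preds.mean(axis=0), preds.std(axis=0)

rng = np.random.default_rng(1)
W1 = rng.standard_normal((4, 32)) / 2.0   # toy weights for illustration
W2 = rng.standard_normal((32, 1)) / 6.0
x = rng.standard_normal((1, 4))
mean, std = mc_dropout_predict(x, W1, W2, p=0.2, T=200, rng=rng)
```

A deterministic network would return the same output on every pass (zero spread); the nonzero `std` here comes entirely from the random dropout masks.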
Stochastic Depth
Huang et al. (2016) proposed Stochastic Depth, which randomly drops entire layers (residual blocks) during training. Each residual block is included with a survival probability that decreases linearly from 1 (at the input) to a minimum value (at the output). At test time, all blocks are active with appropriately scaled outputs. Stochastic Depth enables training of very deep networks (up to 1202 layers) that would otherwise suffer from optimization difficulties, and has been adopted in architectures like Vision Transformers (where it is called DropPath) and DeiT.
The connection between stochastic depth and ensemble methods is direct: training with stochastic depth implicitly trains an ensemble of networks of different depths, with the test-time network being the ensemble average. This perspective connects to the broader theme of randomization as implicit ensemble learning.
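A sketch of one stochastically deep residual block, with the linear survival schedule described above (the helper names are illustrative): during training the residual branch is skipped outright with probability 1 - survival_prob, and at test time it is always applied but scaled by its survival probability.

```python
import numpy as np

def stochastic_depth_block(x, residual_fn, survival_prob, training, rng):
    """Residual block with stochastic depth (Huang et al., 2016)."""
    if training:
        if rng.random() < survival_prob:
            return x + residual_fn(x)  # block survives this step
        return x                       # block dropped: identity shortcut only
    # Test time: deterministic, scaled by the survival probability.
    return x + survival_prob * residual_fn(x)

def linear_survival_prob(l, L, p_min=0.5):
    """Survival probability decays linearly from 1 (block 0) to p_min (block L)."""
    return 1.0 - (l / L) * (1.0 - p_min)

rng = np.random.default_rng(0)
x = np.ones(8)
p = linear_survival_prob(5, 10)  # middle block of a 10-block network: 0.75
out = stochastic_depth_block(x, lambda h: 0.1 * h, p, training=False, rng=rng)
```

Skipping a block at training time saves its forward and backward computation entirely, which is why stochastic depth also shortens expected training time.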
DropConnect, DropBlock, and Variants
DropConnect (Wan et al., 2013) generalizes dropout by randomly zeroing individual weights rather than entire neuron outputs. DropConnect is strictly more general than dropout (dropout is the special case where all outgoing weights of a neuron are dropped together) and provides stronger regularization at the cost of higher computational overhead.
DropBlock (Ghiasi et al., 2018) extends dropout to spatially structured data (images) by dropping contiguous regions of feature maps rather than individual elements. Standard dropout is relatively ineffective for convolutional networks because spatial correlation means that neighboring neurons carry redundant information -- dropping individual neurons does not remove the information. DropBlock addresses this by dropping entire spatial blocks, forcing the network to learn more distributed representations.
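A simplified sketch of DropBlock on a single-channel feature map, assuming the paper's approximation for the seed probability while ignoring its border correction: block centers are sampled sparsely, each is expanded to a block_size x block_size zero region, and surviving activations are rescaled.

```python
import numpy as np

def dropblock(x, block_size, drop_rate, rng):
    """Zero contiguous block_size x block_size regions of a 2-D feature map."""
    H, W = x.shape
    # Seed probability chosen so that roughly drop_rate of elements are
    # dropped (a simplification that ignores border effects).
    gamma = drop_rate / (block_size ** 2)
    seeds = rng.random((H, W)) < gamma
    mask = np.ones((H, W))
    half = block_size // 2
    for i, j in zip(*np.nonzero(seeds)):
        mask[max(0, i - half):i + half + 1,
             max(0, j - half):j + half + 1] = 0.0
    keep = mask.mean()
    # Rescale so the expected activation magnitude is preserved.
    return x * mask / max(keep, 1e-8), mask

rng = np.random.default_rng(0)
x = np.ones((16, 16))
out, mask = dropblock(x, block_size=3, drop_rate=0.2, rng=rng)
```

Because whole neighborhoods are zeroed together, the correlated information in adjacent units is actually removed, unlike with element-wise dropout.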
Stochastic Gradient Descent
Perhaps the most impactful randomized algorithm in all of machine learning is stochastic gradient descent (SGD). The idea of approximating gradients with random samples traces back to Robbins and Monro (1951) (Robbins & Monro, 1951), but its modern impact on deep learning is immense. By computing gradients on random mini-batches rather than the full dataset, SGD reduces the cost of each optimization step from O(n) to O(b) (where b is the batch size), enabling training on datasets of billions of examples. The noise in stochastic gradients also provides implicit regularization: SGD tends to find flatter minima than full-batch gradient descent, which often generalize better (Bottou & Bousquet, 2008).
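The O(b)-per-step structure is easy to see in a minimal sketch (illustrative code, not any particular library's API): each update touches only a random mini-batch, yet the batch gradient is an unbiased estimate of the full gradient.

```python
import numpy as np

def sgd_linear_regression(X, y, batch_size, lr, steps, rng):
    """Mini-batch SGD on least squares: each step costs O(batch_size)
    rather than O(n), using an unbiased stochastic gradient estimate."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        idx = rng.choice(n, size=batch_size, replace=False)
        Xb, yb = X[idx], y[idx]
        grad = 2.0 * Xb.T @ (Xb @ w - yb) / batch_size  # E[grad] = full gradient
        w -= lr * grad
    return w

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true
w = sgd_linear_regression(X, y, batch_size=32, lr=0.05, steps=500, rng=rng)
```

With 500 steps of batch size 32, this touches far fewer example-gradient evaluations than even 20 full-batch passes over the 1000 examples, yet recovers the true weights.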
The implicit regularization of SGD is one of the deepest theoretical questions in deep learning. Several complementary perspectives have emerged. First, the stochastic gradient noise is approximately Gaussian with covariance proportional to the loss gradient variation, and this noise drives the optimization trajectory away from sharp minima (where the noise is large) toward flat minima (where the noise is small). This bias toward flat minima is the "noise-induced regularization" effect. Second, the learning rate and batch size interact as an effective temperature: the ratio learning_rate / batch_size controls the amount of noise, and higher temperature (larger ratio) leads to flatter minima. This perspective underlies the linear scaling rule of Goyal et al. (2017) for large-batch ImageNet training: increasing the learning rate in proportion to the batch size keeps the effective temperature, and hence the generalization behavior, roughly unchanged. Third, for overparameterized models where many global minima exist, SGD's noise helps select minima with better generalization properties through a mechanism related to algorithmic stability (Hardt et al., 2016).
Modern optimizers like Adam (Kingma & Ba, 2015) combine stochastic gradient estimates with adaptive learning rates (using running statistics of gradient moments), further improving convergence on the highly non-convex loss landscapes of deep neural networks. Adam maintains exponential moving averages of the first moment (mean) and second moment (uncentered variance) of the gradients, using them to adaptively scale the learning rate for each parameter. The convergence theory for SGD on non-convex problems, including over-parameterized neural networks, has been the subject of intense theoretical investigation (Allen-Zhu et al., 2019). Recent work by Allen-Zhu et al. (2019) proved that SGD converges to global minima of overparameterized networks in polynomial time, providing theoretical justification for the empirical success of SGD on deep networks.
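The moment-tracking update described above fits in a few lines. The following is a sketch of a single Adam step following the update rule of Kingma and Ba (2015), applied to a toy quadratic (the driver loop and hyperparameter values are illustrative):

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: exponential moving averages of the first moment (m)
    and second raw moment (v) of the gradient, with bias correction."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)  # correct the bias from zero initialization
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # per-parameter step size
    return w, m, v

# Minimize f(w) = sum(w^2), whose gradient is 2w.
w = np.array([1.0, -3.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
for t in range(1, 2001):
    w, m, v = adam_step(w, 2.0 * w, m, v, t, lr=0.01)
```

Dividing by the square root of the second-moment estimate makes the step size per parameter roughly scale-invariant in the gradient magnitude, which is what gives Adam its robustness to poorly scaled coordinates.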
Random Data Augmentation
Data augmentation -- applying random transformations to training data -- is a form of randomized regularization that has proven essential for training modern vision models. The fundamental principle is that augmentation implicitly encodes invariances: by training on randomly transformed versions of the data, the model learns representations that are invariant to those transformations. This is equivalent to data-driven regularization, where the augmentation policy defines a prior over the transformation group that the model should be invariant to. Techniques range from classical (random crops, flips, color jitter) to learned augmentation policies:
- RandAugment (Cubuk et al., 2020) simplified augmentation search to just two hyperparameters (number of augmentations N and magnitude M), achieving competitive or superior performance to learned policies like AutoAugment (which required expensive reinforcement learning search). RandAugment applies N random transformations from a pool of 14 operations (rotation, shear, translate, brightness, contrast, etc.) at magnitude M, and its simplicity has made it the default augmentation strategy for ImageNet training.
- Mixup (Zhang et al., 2018) trains on random convex combinations of pairs of training examples and their labels, providing a form of data-dependent regularization that encourages linear behavior between training examples. Concretely, for two examples (x_i, y_i) and (x_j, y_j), Mixup trains on (lambda*x_i + (1-lambda)*x_j, lambda*y_i + (1-lambda)*y_j) where lambda is drawn from Beta(alpha, alpha). This has a Vicinal Risk Minimization interpretation: Mixup defines a vicinity distribution around each training point and minimizes risk over these vicinities rather than just at the training points themselves. The result is smoother decision boundaries and better calibrated confidence estimates.
- CutMix (Yun et al., 2019) replaces rectangular regions of one image with patches from another, combining aspects of augmentation and regularization. Unlike Mixup (which produces ghostly superimpositions), CutMix produces natural-looking composites, forcing the model to learn from partial views of objects and improving robustness to occlusion.
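The Mixup recipe in the list above amounts to one Beta draw and two convex combinations; a minimal NumPy sketch with toy "images" and one-hot labels (the helper name is illustrative):

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha, rng):
    """Mixup (Zhang et al., 2018): a random convex combination of two
    examples and their one-hot labels, with lambda ~ Beta(alpha, alpha)."""
    lam = rng.beta(alpha, alpha)
    x = lam * x1 + (1.0 - lam) * x2
    y = lam * y1 + (1.0 - lam) * y2
    return x, y, lam

rng = np.random.default_rng(0)
x1, y1 = np.full((8, 8), 1.0), np.array([1.0, 0.0])   # class-0 "image"
x2, y2 = np.full((8, 8), -1.0), np.array([0.0, 1.0])  # class-1 "image"
x, y, lam = mixup(x1, y1, x2, y2, alpha=0.2, rng=rng)
```

With small alpha (e.g. 0.2), the Beta distribution concentrates lambda near 0 or 1, so most mixed examples stay close to one of the originals with a light admixture of the other; the mixed label remains a valid probability distribution.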
Random Initialization
Neural network initialization -- the choice of random distribution from which initial weights are drawn -- has a profound effect on training dynamics. Proper initialization ensures that signals neither explode nor vanish as they propagate through the network:
- Xavier/Glorot initialization (Glorot & Bengio, 2010) sets weights from a distribution with variance 2/(fan_in + fan_out), preserving signal variance through linear layers.
- Kaiming/He initialization (He et al., 2015) adjusts the variance to account for ReLU nonlinearities (variance 2/fan_in), enabling training of very deep networks.
- Orthogonal initialization (Saxe et al., 2014) uses random orthogonal matrices, which preserve signal norms exactly and have been shown to enable faster training of deep linear networks.
The connection to random matrix theory is deep: the eigenspectrum of the weight matrix at initialization determines the signal propagation properties of the network, and the "edge of chaos" initialization (where the network is neither contracting nor expanding signals) has been shown to be optimal for deep networks (Poole et al., 2016).
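The variance-preservation argument behind these schemes can be checked empirically. Below is a sketch (with illustrative widths and depth) that initializes a 20-layer ReLU network with He initialization and verifies that the signal variance stays on the order of the input variance rather than vanishing or exploding exponentially with depth:

```python
import numpy as np

def xavier_init(fan_in, fan_out, rng):
    """Xavier/Glorot: weight variance 2 / (fan_in + fan_out)."""
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.standard_normal((fan_in, fan_out)) * std

def he_init(fan_in, fan_out, rng):
    """Kaiming/He: weight variance 2 / fan_in, compensating for the
    halving of second moments caused by ReLU."""
    std = np.sqrt(2.0 / fan_in)
    return rng.standard_normal((fan_in, fan_out)) * std

rng = np.random.default_rng(0)
x = rng.standard_normal((512, 256))  # batch of unit-variance inputs
h = x.copy()
for _ in range(20):                  # 20-layer ReLU network, He-initialized
    h = np.maximum(h @ he_init(h.shape[1], 256, rng), 0.0)
ratio = h.var() / x.var()            # stays O(1) instead of 2^{+-20}
```

Replacing `he_init` with `xavier_init` in the loop makes the variance shrink by roughly half per ReLU layer, which is exactly the effect the factor of 2 in He initialization is designed to cancel.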
References
- Zeyuan Allen-Zhu, Yuanzhi Li, Zhao Song (2019). A Convergence Theory for Deep Learning via Over-Parameterization. ICML.
- Leon Bottou, Olivier Bousquet (2008). The Tradeoffs of Large Scale Learning. NeurIPS.
- Ekin D. Cubuk, Barret Zoph, Jonathon Shlens, Quoc V. Le (2020). Randaugment: Practical Automated Data Augmentation with a Reduced Search Space. CVPR Workshops.
- Yarin Gal, Zoubin Ghahramani (2016). Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. ICML.
- Golnaz Ghiasi, Tsung-Yi Lin, Quoc V. Le (2018). DropBlock: A Regularization Technique for Convolutional Networks. NeurIPS.
- Xavier Glorot, Yoshua Bengio (2010). Understanding the Difficulty of Training Deep Feedforward Neural Networks. AISTATS.
- Moritz Hardt, Ben Recht, Yoram Singer (2016). Train faster, generalize better: Stability of stochastic gradient descent. ICML.
- Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun (2015). Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. ICCV.
- Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, Kilian Q. Weinberger (2016). Deep Networks with Stochastic Depth. ECCV.
- Diederik P. Kingma, Jimmy Ba (2015). Adam: A Method for Stochastic Optimization. ICLR.
- Ben Poole, Subhaneil Lahiri, Maithra Raghu, Jascha Sohl-Dickstein, Surya Ganguli (2016). Exponential Expressivity in Deep Neural Networks through Transient Chaos. NeurIPS.
- Herbert Robbins, Sutton Monro (1951). A Stochastic Approximation Method. Annals of Mathematical Statistics.
- Andrew M. Saxe, James L. McClelland, Surya Ganguli (2014). Exact Solutions to the Nonlinear Dynamics of Learning in Deep Linear Neural Networks. ICLR.
- Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, Ruslan Salakhutdinov (2014). Dropout: A Simple Way to Prevent Neural Networks from Overfitting. JMLR.
- Stefan Wager, Sida Wang, Percy Liang (2013). Dropout Training as Adaptive Regularization. NeurIPS.
- Li Wan, Matthew Zeiler, Sixin Zhang, Yann LeCun, Rob Fergus (2013). Regularization of Neural Networks using DropConnect. ICML.
- Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, Youngjoon Yoo (2019). CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features. ICCV.
- Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, David Lopez-Paz (2018). mixup: Beyond Empirical Risk Minimization. ICLR.