Fourier & Spectral Perspectives
Fourier Features for Neural Networks
Tancik et al. (2020) demonstrated that standard coordinate-based networks struggle to learn high-frequency functions from low-dimensional inputs because of a bias toward low frequencies, the "spectral bias" described by Rahaman et al. (2019). Their solution, passing inputs through a Fourier feature mapping (sinusoids of the input at random frequencies) before feeding them to the network, enables learning of high-frequency content. The same idea, in the form of a deterministic positional encoding, underpins Neural Radiance Fields (NeRF) (Mildenhall et al., 2020) and other coordinate-based neural representations, where fine geometric and appearance details require high-frequency function approximation.
The connection to random features (Section 5.4.2) is direct: when the frequencies are sampled from a Gaussian, the Fourier feature mapping is a random feature map for the RBF kernel, and the scale of the frequency distribution controls the bandwidth of learnable functions (Tancik et al., 2020).
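The mapping and its kernel interpretation can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name `fourier_features`, the feature count `m`, and the scale `sigma` are assumptions chosen for the example. It checks numerically that inner products of the features approximate a Gaussian (RBF) kernel whose width is set by `sigma`.

```python
import numpy as np

def fourier_features(x, B):
    """Map x of shape (n, d) to [cos(2*pi*x B^T), sin(2*pi*x B^T)], shape (n, 2m)."""
    proj = 2.0 * np.pi * x @ B.T              # (n, m) projections onto random frequencies
    return np.concatenate([np.cos(proj), np.sin(proj)], axis=-1)

rng = np.random.default_rng(0)
d, m, sigma = 2, 4096, 1.5                    # illustrative values, not from the source
B = rng.normal(0.0, sigma, size=(m, d))       # frequencies sampled from N(0, sigma^2)

x = rng.uniform(0.0, 1.0, size=(5, d))
phi = fourier_features(x, B)

# Averaged inner products of the features approximate a shift-invariant kernel:
# (1/m) * sum_j cos(2*pi*b_j.(x - y)) -> exp(-2 * pi^2 * sigma^2 * ||x - y||^2)
approx = phi @ phi.T / m
sq_dists = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1)
exact = np.exp(-2.0 * (np.pi * sigma) ** 2 * sq_dists)
max_err = np.abs(approx - exact).max()
print("max kernel approximation error:", max_err)
```

Increasing `sigma` narrows the implied kernel, which is the sense in which the frequency distribution sets the bandwidth: larger values admit higher-frequency content at the cost of noisier interpolation.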
Spectral Analysis of Attention
Self-attention can be analyzed through a spectral lens. The attention matrix, viewed as a linear operator, has a spectrum that reflects how information flows through the layer (Bhojanapalli et al., 2020). Empirical analyses commonly find that trained attention matrices are approximately low-rank (a few singular components capture most of the spectral energy), which helps explain why low-rank approximations such as Linformer (Wang et al., 2020) work well in practice. The connection between attention and kernel methods (Tsai et al., 2019) enables random feature approximations of attention, e.g., Performer (Choromanski et al., 2021), linking back to Rahimi and Recht's work.
The spectrum of the attention matrix may also reveal information about the learned representations. One intuition (offered here as a hypothesis rather than an established result) is that attention heads with concentrated spectra act as "sharp" selectors (focusing on specific tokens), while those with flat spectra act as "smooth" aggregators (mixing information broadly); we are not aware of a study that systematically verifies this spectrum-to-behavior mapping.
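The kind of measurement behind the low-rank observation can be sketched as follows. The helper names (`attention_matrix`, `rank_for_energy`) and the 90% energy threshold are assumptions made for illustration, and the weights are random placeholders, so this snippet shows only how to compute the spectrum; it is not expected to reproduce the low-rank structure reported for trained models.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_matrix(X, Wq, Wk):
    """Single-head attention weights softmax(Q K^T / sqrt(d_head)), shape (n, n)."""
    Q, K = X @ Wq, X @ Wk
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1]))

def rank_for_energy(A, energy=0.9):
    """Smallest number of singular components holding `energy` of sum(s_i^2)."""
    s = np.linalg.svd(A, compute_uv=False)
    cumulative = np.cumsum(s ** 2) / np.sum(s ** 2)
    return int(np.searchsorted(cumulative, energy) + 1)

rng = np.random.default_rng(0)
n, d_model, d_head = 64, 32, 16               # illustrative sizes
X = rng.normal(size=(n, d_model))             # stand-in for token representations
Wq = rng.normal(size=(d_model, d_head)) / np.sqrt(d_model)
Wk = rng.normal(size=(d_model, d_head)) / np.sqrt(d_model)

A = attention_matrix(X, Wq, Wk)
k90 = rank_for_energy(A, 0.9)
print(f"{k90} of {n} singular components hold 90% of the spectral energy")
```

Applied to attention matrices extracted from a trained model, a small value of `k90` relative to the sequence length would be the signature of the approximately low-rank structure discussed above.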
Frequency Bias in Neural Networks
Rahaman et al. (2019) showed that neural networks learn low-frequency components of target functions before high-frequency components, a phenomenon they called "spectral bias"; closely related work refers to it as the frequency principle (F-principle). Concretely, for a target function with Fourier decomposition f(x) = sum_k a_k exp(i k x), networks first fit the components with small |k| (low frequency) and acquire higher-frequency components only gradually during training. This has important implications for architecture design: networks struggle with high-frequency patterns unless explicitly encouraged (via positional encoding, Fourier features, or architectural modifications). The Neural Tangent Kernel (NTK) framework (Jacot et al., 2018) provides a theoretical explanation: in the regimes analyzed, the NTK's eigenvalues decay with frequency, and under gradient descent the error along each eigendirection shrinks at a rate set by its eigenvalue, so low-frequency components converge faster. Spectral bias may also help explain why certain architectures (e.g., those with skip connections or Fourier-based layers) are reported to generalize better, on the view that they capture the frequency spectrum of the target function more evenly.
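The convergence-rate argument can be made concrete with a toy model of kernel-regime gradient descent, in which each Fourier mode evolves independently. The eigenvalue decay 1/k^2, the learning rate, and the function names below are assumptions chosen purely for illustration; the actual NTK spectrum depends on the architecture and the data distribution.

```python
import numpy as np

freqs = np.array([1, 2, 4, 8, 16])
eigvals = 1.0 / freqs.astype(float) ** 2      # assumed decay of kernel eigenvalues with frequency
lr = 0.5                                      # illustrative learning rate

def residual_fraction(steps):
    """Closed form: fraction of each mode's initial error remaining after `steps`."""
    return (1.0 - lr * eigvals) ** steps

def simulate(steps, target):
    """Explicit gradient descent on the per-mode coefficients, starting from zero."""
    coeffs = np.zeros_like(target)
    for _ in range(steps):
        coeffs += lr * eigvals * (target - coeffs)
    return coeffs

def steps_to_reach(tol=0.01):
    """Smallest step count at which each mode's residual fraction drops to `tol`."""
    return np.ceil(np.log(tol) / np.log1p(-lr * eigvals)).astype(int)

target = np.ones(len(freqs))                  # unit amplitude a_k for every mode
print("fraction learned after 20 steps:", simulate(20, target))
print("steps to reach 1% residual:", steps_to_reach())
```

Because the per-step contraction factor is 1 - lr * eigenvalue, modes with small eigenvalues (here, the high frequencies) need many more steps to reach the same residual, which is the ordering the spectral-bias experiments report.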
FNet
Lee-Thorp et al. (2022) proposed FNet, which replaces the self-attention sublayer in Transformers with an unparameterized Fourier transform: a 2D DFT over the sequence and hidden dimensions, of which only the real part is kept. FNet is reported to reach 92-97% of the accuracy of its BERT counterparts on GLUE while training 80% faster on GPUs and 70% faster on TPUs. The caveats matter, though: the remaining gap (up to roughly 8% in relative terms) is substantial for tasks where accuracy is the priority, the headline speedups are reported at a sequence length of 512, and the TPU implementation uses a quadratic DFT matrix multiplication below 4,096 tokens rather than the asymptotically faster FFT. The result is therefore best read as evidence that fixed spectral mixing can approximate much of what attention does in certain regimes, connecting Transformer architecture to classical signal processing, rather than as a uniform replacement for attention.
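The mixing step itself is small enough to write out. The sketch below, with hypothetical function names, computes the real part of a 2D DFT in two ways: with the FFT, and with explicit DFT matrices (the quadratic-cost route mentioned above for short sequences). It covers only the token-mixing sublayer, not the surrounding feed-forward layers, residual connections, or normalization.

```python
import numpy as np

def fnet_mixing(x):
    """Real part of the 2D DFT of x, shape (seq_len, hidden), computed with the FFT."""
    return np.fft.fft2(x).real

def dft_matrix(n):
    """Dense n x n DFT matrix with entries exp(-2*pi*i*j*k/n)."""
    idx = np.arange(n)
    return np.exp(-2j * np.pi * np.outer(idx, idx) / n)

def fnet_mixing_matmul(x):
    """Same operation as two dense matrix products (quadratic in each dimension)."""
    seq_len, hidden = x.shape
    return (dft_matrix(seq_len) @ x @ dft_matrix(hidden)).real

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 6))                   # toy (seq_len, hidden) activations
mixed = fnet_mixing(x)
print(np.allclose(mixed, fnet_mixing_matmul(x)))
```

Every output entry depends on every input entry, so a single application mixes information across all tokens without any learned parameters.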
Adaptive Fourier Neural Operators
Guibas et al. (2022) proposed AFNO (Adaptive Fourier Neural Operator), which uses parameterized Fourier transforms as token mixing operations in Transformers. Unlike FNet's fixed Fourier transform, AFNO learns channel-mixing weights in the frequency domain, combining the efficiency of spectral methods with the expressiveness of learned parameters. The architecture applies a 2D FFT to the input tokens, applies learned complex-valued weights in the frequency domain (equivalent to a pointwise multiplication that implements global convolution), and applies an inverse FFT to return to the spatial domain.
AFNO has been applied to weather forecasting, climate modeling, and other scientific computing tasks where the natural frequency-domain structure of the data provides a strong inductive bias. FourCastNet (Pathak et al., 2022) reports a 45,000x speedup over the ECMWF Integrated Forecasting System (IFS) on a node-hour compute basis, with accuracy described as comparable to IFS only for large-scale variables at short lead times. It underperforms IFS at longer ranges, so the speedup should be read with that regime qualification rather than as a uniform accuracy match. This line of work has since been advanced by data-driven models such as GraphCast (Lam et al., 2023), Pangu-Weather (Bi et al., 2023), and spherical Fourier neural operators (Bonev et al., 2023), which report skill competitive with or exceeding IFS across a broader range of variables and lead times.
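The three-step token mixer described above (FFT, learned complex weights, inverse FFT) can be sketched as follows. This is a simplified illustration under stated assumptions, not the published architecture: the dense weight matrices, the two-layer form with a ReLU applied separately to real and imaginary parts, the soft-shrinkage threshold `lam`, and all function names are choices made for this example.

```python
import numpy as np

def softshrink(z, lam):
    """Soft-thresholding: shrink magnitudes by lam and zero out anything smaller."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def complex_relu(z):
    """ReLU applied separately to real and imaginary parts (an assumption of this sketch)."""
    return np.maximum(z.real, 0.0) + 1j * np.maximum(z.imag, 0.0)

def afno_style_mixer(x, W1, b1, W2, b2, lam=0.01):
    """x: real token grid (height, width, channels); W1, W2: complex channel-mixing weights."""
    height, width, _ = x.shape
    X = np.fft.rfft2(x, axes=(0, 1), norm="ortho")       # step 1: to the frequency domain
    Z = complex_relu(X @ W1 + b1)                        # step 2: weights shared across frequencies
    Z = Z @ W2 + b2
    Z = softshrink(Z.real, lam) + 1j * softshrink(Z.imag, lam)
    y = np.fft.irfft2(Z, s=(height, width), axes=(0, 1), norm="ortho")  # step 3: back to tokens
    return x + y                                         # residual connection

rng = np.random.default_rng(0)
height, width, c, c_hidden = 8, 8, 4, 8                  # illustrative sizes

def crandn(rows, cols):
    """Random complex matrix standing in for learned weights."""
    return (rng.normal(size=(rows, cols)) + 1j * rng.normal(size=(rows, cols))) / np.sqrt(2 * rows)

x = rng.normal(size=(height, width, c))
W1, b1 = crandn(c, c_hidden), np.zeros(c_hidden, dtype=complex)
W2, b2 = crandn(c_hidden, c), np.zeros(c, dtype=complex)
y = afno_style_mixer(x, W1, b1, W2, b2)
print(y.shape)
```

The nonlinearity between the two weight layers matters in this form: a single linear map shared across all frequencies commutes with the FFT and would reduce to a per-token linear layer, with no mixing across tokens.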
The broader family of Fourier Neural Operators (FNO) (Li et al., 2021) applies the same idea to learning solution operators for partial differential equations. FNO parameterizes an integral kernel directly in Fourier space; for linear PDEs this amounts to learning the Green's function in a data-driven manner, and for nonlinear PDEs the stacked kernel layers with pointwise nonlinearities play an analogous role. This is a deep connection to classical applied mathematics: the Green's function, which maps boundary conditions and forcing functions to solutions, is traditionally computed analytically (when possible) or numerically. FNO learns it from data, inheriting the spectral efficiency of Fourier methods (smooth kernels have rapidly decaying Fourier coefficients, so a few low-frequency modes suffice) while avoiding the need for analytical derivations. Li et al. (2021) report inference up to roughly three orders of magnitude faster than traditional numerical solvers for the Navier-Stokes equations, with the comparison made on a 256x256 grid; the exact factor depends on the baseline solver and the resolution.
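The core building block, a convolution parameterized in Fourier space and truncated to the lowest modes, can be sketched in one dimension. The function name, the weight layout `(modes, c_in, c_out)`, and the sizes are assumptions made for this illustration; a full FNO layer would also add a pointwise linear path and a nonlinearity, which are omitted here.

```python
import numpy as np

def spectral_conv1d(u, weights):
    """u: real signal (n, c_in); weights: complex (modes, c_in, c_out).

    Multiplies the lowest `modes` Fourier coefficients by learned weights and
    discards the rest, i.e. a circular convolution with a band-limited kernel.
    """
    n = u.shape[0]
    modes, _, c_out = weights.shape
    if modes > n // 2 + 1:
        raise ValueError("modes cannot exceed the number of rFFT coefficients")
    u_hat = np.fft.rfft(u, axis=0)                        # (n//2 + 1, c_in)
    out_hat = np.zeros((n // 2 + 1, c_out), dtype=complex)
    out_hat[:modes] = np.einsum("ki,kio->ko", u_hat[:modes], weights)
    return np.fft.irfft(out_hat, n=n, axis=0)             # back to physical space

rng = np.random.default_rng(0)
n, channels, modes = 64, 2, 4                             # illustrative sizes
u = rng.normal(size=(n, channels))
weights = rng.normal(size=(modes, channels, channels)) + 1j * rng.normal(size=(modes, channels, channels))
v = spectral_conv1d(u, weights)
print(v.shape)
```

Because the weights are indexed by Fourier mode rather than by grid point, the same parameters can in principle be applied to inputs sampled at a different resolution, which is one reason this parameterization suits operator learning.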
Fourier Analysis of Generalization
The spectral perspective provides a principled framework for understanding generalization in deep learning. A common observation, which we note here as a structural regularity rather than a universal law, is that target functions in many real-world tasks have most of their energy in low-frequency components. For natural images, the classic natural-image statistics literature (Field, 1987; Ruderman, 1994) reports power spectra that decay roughly as 1/f^2. For text and for physical dynamics, the corresponding claims (that long-range correlations are captured by low-frequency modes and that physical dynamics are governed by smooth, low-frequency laws) are offered here as informal generalizations rather than precisely established results. The spectral bias of neural networks (Rahaman et al., 2019) means they tend to learn these dominant low-frequency components first, which offers one explanation for why early stopping (halting training before the network has fit high-frequency noise) can improve generalization.
From a regularization perspective, one hypothesis is that the spectral bias acts as an implicit frequency-dependent regularizer: low-frequency components are learned quickly (low effective regularization), while high-frequency components are learned slowly (high effective regularization). This hypothesis is consistent with, but not strictly proven by, the NTK convergence-rate analysis above (Jacot et al., 2018) and the spectral-bias findings of Rahaman et al. (2019). It would match the statistical structure of natural data in regimes where low-frequency components are signal and high-frequency components are often noise, though we present the mechanism as a plausible account rather than a settled result.
References
- Srinadh Bhojanapalli, Chulhee Yun, Ankit Singh Rawat, Sashank Reddi, Sanjiv Kumar (2020). Low-Rank Bottleneck in Multi-head Attention Models. ICML.
- Kaifeng Bi, Lingxi Xie, Hengheng Zhang, Xin Chen, Xiaotao Gu, Qi Tian (2023). Accurate medium-range global weather forecasting with 3D neural networks. Nature.
- Boris Bonev, Thorsten Kurth, Christian Hundt, Jaideep Pathak, Maximilian Baust, Karthik Kashinath, Anima Anandkumar (2023). Spherical Fourier Neural Operators: Learning Stable Dynamics on the Sphere. ICML.
- Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, et al. (2021). Rethinking Attention with Performers. ICLR.
- David J. Field (1987). Relations between the statistics of natural images and the response properties of cortical cells. Journal of the Optical Society of America A.
- John Guibas, Morteza Mardani, Zongyi Li, Andrew Tao, Anima Anandkumar, Bryan Catanzaro (2022). Adaptive Fourier Neural Operators: Efficient Token Mixers for Transformers. ICLR.
- Arthur Jacot, Franck Gabriel, Clément Hongler (2018). Neural Tangent Kernel: Convergence and Generalization in Neural Networks. NeurIPS.
- Remi Lam, Alvaro Sanchez-Gonzalez, Matthew Willson, Peter Wirnsberger, Meire Fortunato, Ferran Alet, Suman Ravuri, Timo Ewalds, Zhen Zhang, Jackie Stott, Stephan Hoyer, Adrian Weller, Ali Eslami, Matthew Botvinick, Shakir Mohamed, Peter Battaglia (2023). Learning Skillful Medium-Range Global Weather Forecasting. Science.
- James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, Santiago Ontanon (2022). FNet: Mixing Tokens with Fourier Transforms. NAACL.
- Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, Andrew Stuart, Anima Anandkumar (2021). Fourier Neural Operator for Parametric Partial Differential Equations. ICLR.
- Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, Ren Ng (2020). NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. ECCV.
- Jaideep Pathak, Shashank Subramanian, Peter Harrington, Sanjeev Raja, Ashesh Chattopadhyay, Morteza Mardani, Thorsten Kurth, David Hall, Zongyi Li, Kamyar Azizzadenesheli, Pedram Hassanzadeh, Karthik Kashinath, Animashree Anandkumar (2022). FourCastNet: A Global Data-driven High-resolution Weather Forecasting Model using Adaptive Fourier Neural Operators. arXiv.
- Nasim Rahaman, Aristide Baratin, Devansh Arpit, et al. (2019). On the Spectral Bias of Neural Networks. ICML.
- Daniel L. Ruderman (1994). The statistics of natural images. Network: Computation in Neural Systems.
- Matthew Tancik, Pratul P. Srinivasan, Ben Mildenhall, et al. (2020). Fourier Features Let Networks Learn High Frequency Functions in Low Dimensional Domains. NeurIPS.
- Yao-Hung Hubert Tsai, Shaojie Bai, Makoto Yamada, Louis-Philippe Morency, Ruslan Salakhutdinov (2019). Transformer Dissection: An Unified Understanding for Transformer's Attention via the Lens of Kernel. EMNLP.
- Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Fang, Hao Ma (2020). Linformer: Self-Attention with Linear Complexity. arXiv.