Computing Machinery and Intelligence
This chapter addresses a foundational question: what can be computed, and what are the fundamental limits? These limits are not academic curiosities -- they directly constrain what machine learning can and cannot do.
Turing Machines
A Turing machine, introduced by Turing (1936) as a formal model of computation, consists of:
- An infinite tape divided into cells, each holding a symbol from a finite alphabet Γ (including a designated blank symbol)
- A read/write head that scans one cell at a time
- A finite state register storing the current state
- A transition function that, given the current state and symbol under the head, specifies the new state, the symbol to write, and the direction to move
- Distinguished accept and reject states
A computation begins with the input written on the tape, the head at the leftmost input symbol, and the machine in a designated start state q₀. The machine applies the transition function repeatedly until it reaches an accept or reject state (or runs forever).
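This operation can be sketched in a few lines of Python. The simulator below is a hedged illustration, not from the chapter: the `run` helper, the sparse-dictionary tape, and the bit-flip example machine are all invented for this sketch.

```python
# Minimal single-tape Turing machine simulator (illustrative sketch).
# delta maps (state, symbol) -> (new_state, symbol_to_write, move direction).

BLANK = "_"

def run(delta, tape, start, accept, reject, max_steps=10_000):
    """Run the machine; return (final_state, tape_contents)."""
    cells = dict(enumerate(tape))          # sparse tape: index -> symbol
    state, head = start, 0
    for _ in range(max_steps):
        if state in (accept, reject):
            break
        symbol = cells.get(head, BLANK)
        state, write, move = delta[(state, symbol)]
        cells[head] = write
        head += 1 if move == "R" else -1
    out = "".join(cells[i] for i in sorted(cells)).strip(BLANK)
    return state, out

# Example machine: scan right, flipping 0 <-> 1; accept at the first blank.
delta = {
    ("scan", "0"): ("scan", "1", "R"),
    ("scan", "1"): ("scan", "0", "R"),
    ("scan", BLANK): ("acc", BLANK, "R"),
}
state, out = run(delta, "1011", start="scan", accept="acc", reject="rej")
print(state, out)   # -> acc 0100
```

The dictionary-backed tape stands in for the infinite tape: any cell never written still reads as blank.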
Despite its simplicity, a Turing machine can compute anything that any physical computer can compute. This is the content of the Church--Turing thesis (Turing, 1936): every function that is effectively computable by any mechanical procedure is computable by a Turing machine.
Computability and Decidability
Not all problems are computable. The most famous example is the halting problem: given a program P and an input x, decide whether P eventually halts when run on x.
Proof sketch (by diagonalization). Suppose a machine H exists that decides, for any program P and input x, whether P halts on x.
Construct a new program D that takes a program P as input:
- If H says P halts on P, then D loops forever
- If H says P loops on P, then D halts
Now consider D(D): if D halts on D, then H says D halts on D, so D loops -- contradiction. If D loops on D, then H says D loops on D, so D halts -- contradiction. Therefore H cannot exist.
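The diagonal construction above can be mirrored in Python. This is an illustrative sketch with invented names (`make_diagonal`, `claims_never_halts`); no real halting decider exists, so we show how any concrete candidate is refuted.

```python
# Sketch of the diagonal argument. Suppose halts(p, x) claimed to decide
# whether program p halts on input x. Build the adversarial program D:

def make_diagonal(halts):
    """Return the program D that inverts the decider's verdict about itself."""
    def D(p):
        if halts(p, p):       # decider says p halts on itself ...
            while True:       # ... so D loops forever
                pass
        return "halted"       # decider says p loops on itself, so D halts
    return D

# Any concrete decider is refuted. Take the one that always answers "loops":
claims_never_halts = lambda p, x: False
D = make_diagonal(claims_never_halts)
# The decider says D does not halt on D, so D returns immediately --
# that is, D *does* halt on D, contradicting the decider's own answer.
print(D(D))   # -> halted
```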
Undecidability has direct consequences for machine learning:
- Convergence: We cannot build a general-purpose tool that determines whether a given training procedure will converge.
- Verification: We cannot build a general-purpose verifier that proves a neural network satisfies a given specification on all inputs (though restricted verifiers exist for specific architectures and properties (Katz et al., 2017)).
- Architecture search: There is no algorithm that, given a dataset, finds the provably optimal architecture in finite time.
These are not engineering limitations to be overcome -- they are mathematical impossibilities.
Complexity Classes
Even among computable problems, some are fundamentally harder than others. Complexity theory classifies problems by the resources (time and space) required to solve them.
| Class | Definition | ML Example |
|---|---|---|
| P | Decidable in time O(n^k) for some constant k | Matrix multiplication, sorting, linear regression (via normal equations) |
| NP | A "yes" certificate is verifiable in polynomial time | Boolean satisfiability, graph coloring |
| NP-hard | At least as hard as every problem in NP | Training a 2-layer ReLU network to global optimality (Blum & Rivest, 1989) |
| NP-complete | In NP and NP-hard | Feature selection with cardinality constraints |
| #P | Counting the number of solutions to an NP problem | Computing the partition function of a Boltzmann machine |
| PSPACE | Decidable with polynomial space (possibly exponential time) | Game playing (Go, chess on generalized boards) |
| BPP | Decidable in polynomial time with bounded error probability | Approximate sampling from intractable distributions |
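The P-versus-NP asymmetry in the table is concrete for Boolean satisfiability: checking a proposed assignment (a certificate) is fast, while the obvious solver enumerates exponentially many candidates. The sketch below uses an invented clause encoding (positive integers for variables, negatives for negated literals); it is an illustration, not a practical SAT solver.

```python
from itertools import product

# A CNF formula is a list of clauses; each clause is a list of literals,
# where +i means variable i and -i means its negation.

def verify(cnf, assignment):
    """Check a certificate in time linear in the formula size (the NP part)."""
    return all(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in cnf
    )

def solve(cnf, n_vars):
    """Brute force: up to 2**n_vars candidate assignments in the worst case."""
    for bits in product([False, True], repeat=n_vars):
        assignment = dict(enumerate(bits, start=1))
        if verify(cnf, assignment):
            return assignment
    return None

# (x1 or not x2) and (x2 or x3) and (not x1 or not x3)
cnf = [[1, -2], [2, 3], [-1, -3]]
model = solve(cnf, 3)
print(model is not None and verify(cnf, model))   # -> True
```

Doubling the number of variables roughly squares the solver's worst-case work, while the verifier's cost grows only with the formula's length.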
Universal Function Approximation
A central theoretical justification for neural networks is their expressiveness: the universal approximation theorem (Cybenko, 1989) states that for any continuous function f on a compact set K ⊂ ℝⁿ and any tolerance ε > 0, there is a one-hidden-layer network with finitely many sigmoidal units whose output is within ε of f everywhere on K.
- Unbounded activations. Hornik (1991) showed the theorem does not depend on the specific sigmoidal shape; subsequent work extended it to any non-polynomial activation, including ReLU.
- Width vs. depth. A single hidden layer of some width N suffices, but N may need to be exponentially large in the input dimension. Deep networks can approximate certain functions with polynomially many parameters where shallow ones require exponentially many (Telgarsky, 2016; Eldan & Shamir, 2016).
- Approximation rate. For Lipschitz functions on [0, 1]^d, a ReLU network with W parameters achieves error on the order of W^(-1/d) -- exhibiting the curse of dimensionality (Yarotsky, 2017). Compositional structure in the target function can break this curse.
- Practical caveat. The theorem is existential: it guarantees some network exists, but says nothing about whether gradient descent can find it, how much data is needed, or whether the required width is feasible.
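The width-accuracy trade-off can be seen constructively in one dimension. The sketch below (invented helper names; pure Python, no training) builds a width-N one-hidden-layer ReLU network that interpolates a target function at N + 1 evenly spaced knots, and checks that the worst-case error shrinks as the width grows. This sidesteps the optimization question entirely: the weights are written down by hand, which is exactly the existential flavor of the theorem.

```python
import math

relu = lambda z: max(z, 0.0)

def relu_interpolant(f, N):
    """Width-N network g(x) = f(0) + sum_i w_i * relu(x - i/N) that matches
    f at the knots i/N on [0, 1]; weights are slope differences."""
    h = 1.0 / N
    slopes = [(f((i + 1) * h) - f(i * h)) / h for i in range(N)]
    weights = [slopes[0]] + [slopes[i] - slopes[i - 1] for i in range(1, N)]
    return lambda x: f(0.0) + sum(w * relu(x - i * h)
                                  for i, w in enumerate(weights))

def max_error(f, g, grid=1000):
    """Worst-case gap between f and g on a fine grid over [0, 1]."""
    return max(abs(f(x / grid) - g(x / grid)) for x in range(grid + 1))

f = lambda x: math.sin(3 * x)
err = {N: max_error(f, relu_interpolant(f, N)) for N in (4, 16, 64)}
print(err[4] > err[16] > err[64])   # -> True: error shrinks as width grows
```

In d dimensions the same knot-based construction needs a grid of roughly N^d knots, which is the curse of dimensionality from the approximation-rate bullet above.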
Computational Complexity of Common ML Operations
| Operation | Time | Space | Notes |
|---|---|---|---|
| Matrix multiply (n × n) | O(n³) | O(n²) | Dominated by memory bandwidth on GPUs |
| Self-attention (sequence length n, head dim d) | O(n²d) | O(n²) | FlashAttention reduces space to O(n) |
| Softmax (length n) | O(n) | O(n) | Numerically: subtract max first |
| Backprop through L layers | ≈ 2× the forward pass | O(L · m) | m = activation memory per layer |
| SVD of an m × n matrix | O(mn · min(m, n)) | O(mn) | |
| Gradient descent step | O(P) | O(P) | P = number of parameters |
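The "subtract max first" note on softmax is worth making concrete: exp of a large logit overflows floating point, while shifting by the maximum leaves the result mathematically unchanged (the shift cancels in the ratio). A minimal pure-Python sketch:

```python
import math

def softmax(logits):
    """Numerically stable softmax: shift so the largest exponent is zero."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]   # all arguments are <= 0
    total = sum(exps)
    return [e / total for e in exps]

# Naive math.exp(1000.0) raises OverflowError; the shifted version is fine.
probs = softmax([1000.0, 1000.0, 999.0])
print([round(p, 3) for p in probs])   # -> [0.422, 0.422, 0.155]
```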
The Relationship Between Theory and Practice
- Optimization: Training is NP-hard in the worst case, yet SGD finds good solutions reliably on real data.
- Generalization: Classical learning theory (VC dimension, Rademacher complexity) predicts overparameterized models should overfit catastrophically, yet they generalize well -- the "double descent" phenomenon (Belkin et al., 2019).
- Approximation: The universal approximation theorem requires exponential width in general, yet practical networks with modest width solve hard problems.
Understanding these gaps is an active area of research. The takeaway for practitioners: theoretical limits inform algorithm design but should not be mistaken for practical limits.
Notation Summary
| Symbol | Meaning |
|---|---|
| Γ | Finite alphabet of tape symbols |
| Q | Finite set of machine states |
| δ | Transition function |
| q₀, q_accept, q_reject | Start, accept, and reject states |
| O, Ω, Θ | Asymptotic upper, lower, and tight bounds |
| P, NP, PSPACE, BPP, #P | Complexity classes |
| σ | Activation function |
| K | Compact subset of ℝⁿ |
| ε | Approximation tolerance |
| N | Number of hidden neurons |
| L | Number of layers (depth) |
References
- George Cybenko (1989). Approximation by Superpositions of a Sigmoidal Function. Mathematics of Control, Signals and Systems.
- Kurt Hornik (1991). Approximation Capabilities of Multilayer Feedforward Networks. Neural Networks.
- Alan M. Turing (1936). On Computable Numbers, with an Application to the Entscheidungsproblem. Proceedings of the London Mathematical Society.