Computing Machinery and Intelligence
This chapter addresses a foundational question: what can be computed, and what are the fundamental limits? These limits are not academic curiosities; they directly constrain what machine learning can and cannot do.
Turing Machines
- An infinite tape divided into cells, each holding a symbol from a finite alphabet (including a blank symbol )
- A read/write head that scans one cell at a time
- A finite state register storing the current state
- A transition function that, given the current state and symbol under the head, specifies the new state, the symbol to write, and the direction to move
- Distinguished accept and reject states
A computation begins with the input written on the tape, the head at the leftmost input symbol, and the machine in a designated start state . The machine applies the transition function repeatedly until it reaches an accept or reject state (or runs forever).
Despite its simplicity, a Turing machine can compute anything that any physical computer can compute. This is the content of the Church-Turing thesis (Turing, 1936):
Computability and Decidability
Not all problems are computable. The most famous example is the halting problem.
Proof sketch (by diagonalization). Suppose such a machine exists:
Construct a new program that takes a program as input and runs , feeding its own source as the input so that the self-application is well defined:
- If , then loops forever
- If , then halts
Now consider : if halts on , then , so loops, a contradiction. If loops on , then , so halts, a contradiction. Therefore cannot exist.
- Convergence: We cannot build a general-purpose tool that determines whether a given training procedure will converge.
- Verification: We cannot build a general-purpose verifier that proves a neural network satisfies a given specification on all inputs (though restricted verifiers exist for specific architectures and properties (Katz et al., 2017)).
- Architecture search: There is no algorithm that, given a dataset, finds the provably optimal architecture in finite time.
These are not engineering limitations to be overcome; they are mathematical impossibilities.
Complexity Classes
Even among computable problems, some are fundamentally harder than others. Complexity theory classifies problems by the resources (time and space) required to solve them.
| Class | Definition | ML Example |
|---|---|---|
| Decidable in time for some constant | Matrix multiplication, sorting, linear regression (via normal equations) | |
| A "yes" certificate is verifiable in polynomial time | Boolean satisfiability, graph coloring | |
| -hard | At least as hard as every problem in | Training a 2-layer ReLU network to global optimality (Blum & Rivest, 1988) |
| -complete | In and -hard | Feature selection with cardinality constraints |
| Counting the number of solutions to an problem | Computing the partition function of a Boltzmann machine | |
| Decidable with polynomial space (possibly exponential time) | Game playing (Go, chess with generalized board) | |
| Decidable in polynomial time with bounded error probability | Approximate sampling from intractable distributions |
Universal Function Approximation
A central theoretical justification for neural networks is their expressiveness: they can approximate any continuous function.
- Unbounded activations. Hornik (1991) showed the result holds for any non-polynomial activation, including ReLU (Hornik, 1991).
- Width vs. depth. A single hidden layer of width suffices, but may need to be exponentially large. Deep networks can approximate the same functions with polynomially many parameters (Telgarsky, 2016; Eldan & Shamir, 2016).
- Approximation rate. For Lipschitz functions on , a ReLU network with parameters achieves error , exhibiting the curse of dimensionality (Yarotsky, 2017). Compositional structure in the target function can break this curse.
- Practical caveat. The theorem is existential: it guarantees some network exists, but says nothing about whether gradient descent can find it, how much data is needed, or whether the required width is feasible.
Each ReLU contributes one breakpoint, at and , splitting the line into three linear pieces. Evaluating directly: for both arguments are negative, so ; for only the first ReLU is active, so (a slope-1 ramp); for both are active, so (a constant plateau). The two neurons thus produce the three promised linear pieces. Each additional hidden layer can exponentially increase the number of linear regions: a depth-, width- ReLU network can produce linear regions (Montufar et al., 2014). This is a concrete illustration of the "depth efficiency" phenomenon.
Computational Complexity of Common ML Operations
| Operation | Time | Space | Notes |
|---|---|---|---|
| Matrix multiply () () | Dominated by memory bandwidth on GPUs | ||
| Self-attention (sequence length , head dim ) | FlashAttention reduces space to | ||
| Softmax (length ) | Numerically: subtract max first | ||
| Backprop through layers | = activation memory per layer | ||
| SVD of matrix | |||
| Gradient descent step | = number of parameters |
The Relationship Between Theory and Practice
- Optimization: Training is -hard in the worst case, yet SGD finds good solutions reliably on real data.
- Generalization: Classical learning theory (VC dimension, Rademacher complexity) predicts overparameterized models should overfit catastrophically, yet they generalize well: the "double descent" phenomenon (Belkin et al., 2019).
- Approximation: The universal approximation theorem requires exponential width in general, yet practical networks with modest width solve hard problems.
Understanding these gaps is an active area of research. The takeaway for practitioners: theoretical limits inform algorithm design but should not be mistaken for practical limits.
Notation Summary
| Symbol | Meaning |
|---|---|
| Finite alphabet of tape symbols | |
| Finite set of machine states | |
| Transition function | |
| Start, accept, and reject states | |
| Asymptotic upper, lower, and tight bounds | |
| Complexity classes (polynomial time, nondeterministic polynomial time, at least as hard as ) | |
| Counting the number of solutions to an problem | |
| Problems decidable with polynomial space | |
| Problems decidable in polynomial time with bounded error probability | |
| Activation function | |
| Compact subset of | |
| Approximation tolerance | |
| Number of hidden neurons | |
| Number of layers (depth) |
References
- Mikhail Belkin, Daniel Hsu, Siyuan Ma, Soumik Mandal (2019). Reconciling Modern Machine-Learning Practice and the Classical Bias-Variance Trade-off. Proceedings of the National Academy of Sciences (PNAS). ↗
- Avrim L. Blum, Ronald L. Rivest (1988). Training a 3-Node Neural Network is NP-Complete. Advances in Neural Information Processing Systems (NIPS). ↗
- George Cybenko (1989). Approximation by Superpositions of a Sigmoidal Function. Mathematics of Control, Signals and Systems. ↗
- Ronen Eldan, Ohad Shamir (2016). The Power of Depth for Feedforward Neural Networks. Conference on Learning Theory (COLT). ↗
- Kurt Hornik (1991). Approximation Capabilities of Multilayer Feedforward Networks. Neural Networks. ↗
- Guy Katz, Clark Barrett, David L. Dill, Kyle Julian, Mykel J. Kochenderfer (2017). Reluplex: An Efficient SMT Solver for Verifying Deep Neural Networks. International Conference on Computer Aided Verification (CAV). ↗
- Guido Montufar, Razvan Pascanu, Kyunghyun Cho, Yoshua Bengio (2014). On the Number of Linear Regions of Deep Neural Networks. Advances in Neural Information Processing Systems (NeurIPS). ↗
- Matus Telgarsky (2016). Benefits of Depth in Neural Networks. Conference on Learning Theory (COLT). ↗
- Alan M. Turing (1936). On Computable Numbers, with an Application to the Entscheidungsproblem. Proceedings of the London Mathematical Society. ↗
- Dmitry Yarotsky (2017). Error Bounds for Approximations with Deep ReLU Networks. Neural Networks. ↗