Graph Neural Networks
Graph Neural Networks (GNNs) extend deep learning to data with irregular, non-Euclidean structure: social networks, molecular graphs, knowledge graphs, point clouds, and meshes. Unlike images (regular grids) or text (sequences), graphs have variable-size neighborhoods and no canonical ordering of nodes. This chapter covers the mathematical foundations: graph theory basics, spectral methods, the message passing framework, and key architectures.
Graphs and Adjacency
The adjacency matrix $A \in \{0, 1\}^{n \times n}$ has $A_{ij} = 1$ if $(i, j) \in E$ (or $A_{ij} = w_{ij}$ for weighted graphs). The degree matrix is $D = \mathrm{diag}(d_1, \dots, d_n)$, where $d_i = \sum_j A_{ij}$. The neighborhood of node $i$ is $\mathcal{N}(i) = \{j : (i, j) \in E\}$.
| Graph Type | Properties | Example |
|---|---|---|
| Undirected | $A = A^\top$ | Social networks, molecules |
| Directed | $A \neq A^\top$ in general | Citation networks, web graphs |
| Weighted | $A_{ij} = w_{ij} \in \mathbb{R}$ | Similarity graphs, distance networks |
| Bipartite | $V = V_1 \cup V_2$, edges only between $V_1$ and $V_2$ | User-item interactions |
| Heterogeneous | Multiple node/edge types | Knowledge graphs |
| Dynamic | Edges/nodes change over time | Temporal interaction networks |
| Hypergraph | Edges connect arbitrary subsets of nodes | Co-authorship, group interactions |
Graph Laplacian
The unnormalized graph Laplacian is $L = D - A$. Its key properties:
- Symmetric and positive semi-definite ($x^\top L x \geq 0$ for all $x \in \mathbb{R}^n$)
- Null space: $L \mathbf{1} = \mathbf{0}$ -- the constant vector is an eigenvector with eigenvalue $0$
- Connected components: The number of zero eigenvalues equals the number of connected components of $G$
- Quadratic form: $x^\top L x = \sum_{(i, j) \in E} (x_i - x_j)^2$
The symmetric normalized Laplacian is $L_{\mathrm{sym}} = D^{-1/2} L D^{-1/2} = I - D^{-1/2} A D^{-1/2}$, with eigenvalues in $[0, 2]$.
The random walk Laplacian is $L_{\mathrm{rw}} = D^{-1} L = I - D^{-1} A$, whose eigenvectors are related to the stationary distribution of a random walk on $G$.
This connects directly to GNNs: message passing averages neighboring features, which is equivalent to applying a low-pass filter on the graph. The eigenvectors of $L$ with small eigenvalues correspond to "smooth" signals (low-frequency components), while large eigenvalues correspond to "oscillating" signals (high-frequency).
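These properties are easy to verify numerically. A minimal NumPy sketch on a 4-node path graph (the graph and the signal $x$ are arbitrary illustrations):

```python
import numpy as np

# 4-node path graph 0-1-2-3 (arbitrary illustration).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
D = np.diag(A.sum(axis=1))
L = D - A

eigvals = np.linalg.eigvalsh(L)

# Positive semi-definite: all eigenvalues are (numerically) >= 0,
# and the constant vector lies in the null space.
assert eigvals[0] > -1e-10
assert np.allclose(L @ np.ones(4), 0.0)

# Quadratic form: x^T L x equals the sum of (x_i - x_j)^2 over edges.
x = np.array([1.0, 2.0, 4.0, 8.0])
quad = x @ L @ x        # (1-2)^2 + (2-4)^2 + (4-8)^2 = 21
```

The single zero eigenvalue also confirms that the path graph has exactly one connected component.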
Spectral Graph Theory
The eigendecomposition $L = U \Lambda U^\top$ (where $U = [u_1, \dots, u_n]$ and $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_n)$ with $0 = \lambda_1 \leq \lambda_2 \leq \dots \leq \lambda_n$) defines the graph Fourier transform: $\hat{x} = U^\top x$, with inverse $x = U \hat{x}$.
The coefficient $\hat{x}_k = u_k^\top x$ measures how much the signal "oscillates" at frequency $\lambda_k$.
A spectral filter acts as $g_\theta * x = U\, g_\theta(\Lambda)\, U^\top x$. This is the graph analogue of the convolution theorem: convolution in the spatial domain equals multiplication in the frequency domain.
ChebNet (Defferrard et al., 2016) approximates $g_\theta(\Lambda)$ with a $K$-th order Chebyshev polynomial: $g_\theta * x \approx \sum_{k=0}^{K} \theta_k T_k(\tilde{L})\, x$, where $\tilde{L} = \frac{2}{\lambda_{\max}} L - I$. Since $T_k(\tilde{L})\, x$ can be computed via the recurrence $T_k(\tilde{L})\, x = 2 \tilde{L}\, T_{k-1}(\tilde{L})\, x - T_{k-2}(\tilde{L})\, x$ (with $T_0 = I$, $T_1 = \tilde{L}$), this requires only $K$ matrix-vector products with $\tilde{L}$ (no eigendecomposition). The result is a $K$-hop localized filter with $K + 1$ parameters.
GCN (Kipf & Welling, 2017) uses $K = 1$ with $\lambda_{\max} \approx 2$ and a single parameter $\theta = \theta_0 = -\theta_1$, giving the first-order approximation $g_\theta * x \approx \theta\, (I + D^{-1/2} A D^{-1/2})\, x$ that (after the renormalization trick) leads to the spatial message passing formula.
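The Chebyshev recurrence can be sketched in a few lines; the toy graph and the coefficients `theta` below are made-up placeholders. As a sanity check, the recurrence-based output is compared against explicit spectral filtering with the same polynomial:

```python
import numpy as np

# Toy graph (placeholder) and its rescaled Laplacian.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A
lam_max = np.linalg.eigvalsh(L).max()
L_tilde = (2.0 / lam_max) * L - np.eye(4)      # spectrum rescaled into [-1, 1]

def cheb_filter(x, theta):
    """sum_k theta_k T_k(L_tilde) x via T_k = 2 L_tilde T_{k-1} - T_{k-2}."""
    T_prev, T_curr = x, L_tilde @ x            # T_0 x and T_1 x
    out = theta[0] * T_prev + theta[1] * T_curr
    for k in range(2, len(theta)):
        T_prev, T_curr = T_curr, 2.0 * (L_tilde @ T_curr) - T_prev
        out = out + theta[k] * T_curr
    return out

x = np.array([1.0, 0.0, 0.0, 0.0])
y = cheb_filter(x, theta=[0.5, 0.3, 0.2])      # K = 2: a 2-hop localized filter

# Cross-check against direct spectral filtering with g(lam) = sum_k theta_k T_k(lam),
# using T_0 = 1, T_1 = lam, T_2 = 2 lam^2 - 1.
lam, U = np.linalg.eigh(L_tilde)
g = 0.5 + 0.3 * lam + 0.2 * (2.0 * lam**2 - 1.0)
y_spectral = U @ (g * (U.T @ x))
```

Note that `cheb_filter` never forms the eigendecomposition; the `eigh` call exists only for the cross-check.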
Message Passing Framework
- Message computation: Compute a message from each neighbor $j \in \mathcal{N}(i)$: $m_{ij}^{(\ell)} = \psi\big(h_i^{(\ell)}, h_j^{(\ell)}, e_{ij}\big)$
- Aggregation: Combine messages using a permutation-invariant function: $m_i^{(\ell)} = \bigoplus_{j \in \mathcal{N}(i)} m_{ij}^{(\ell)}$
- Update: Combine the aggregated message with the node's own representation: $h_i^{(\ell+1)} = \phi\big(h_i^{(\ell)}, m_i^{(\ell)}\big)$
where $\bigoplus$ is a permutation-invariant aggregation (sum, mean, max), $\psi$ is the message function, and $\phi$ is the update function.
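The three steps can be sketched directly; the `psi` and `phi` stand-ins below are illustrative lambdas, not any particular architecture:

```python
import numpy as np

def message_passing_layer(H, edges, psi, phi):
    """One layer of message passing with sum aggregation.

    H: (n, d) node features; edges: list of (i, j) meaning j sends to i.
    """
    n, d = H.shape
    M = np.zeros((n, d))                 # aggregated messages per node
    for i, j in edges:
        M[i] += psi(H[i], H[j])          # message, then sum aggregation
    return np.array([phi(H[i], M[i]) for i in range(n)])

# Toy example: undirected path 0-1-2, identity message, additive update.
edges = [(0, 1), (1, 0), (1, 2), (2, 1)]
H = np.array([[1.0], [2.0], [3.0]])
H_next = message_passing_layer(H, edges,
                               psi=lambda hi, hj: hj,
                               phi=lambda hi, m: hi + m)
# Node 1 keeps its own 2 and receives 1 + 3 from its neighbors: 6.
```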
| Aggregation | Formula | Properties | Issue |
|---|---|---|---|
| Sum | $\sum_{j \in \mathcal{N}(i)} m_j$ | Injective for multisets (most expressive) | Sensitive to degree |
| Mean | $\frac{1}{\lvert\mathcal{N}(i)\rvert}\sum_{j \in \mathcal{N}(i)} m_j$ | Degree-invariant | Loses neighborhood-size information |
| Max | $\max_{j \in \mathcal{N}(i)} m_j$ | Robust to outliers | Loses multiplicity information |
| Attention-weighted | $\sum_{j \in \mathcal{N}(i)} \alpha_{ij} m_j$ | Data-dependent, adaptive | More parameters, harder to train |
Xu et al. (2019) proved that sum aggregation is maximally expressive among these choices, as it can distinguish different multisets. This motivates the GIN (Graph Isomorphism Network) architecture.
In matrix form, one layer is $H^{(\ell+1)} = \sigma\big(\hat{A} H^{(\ell)} W^{(\ell)}\big)$, where $\hat{A} = \tilde{D}^{-1/2} (A + I) \tilde{D}^{-1/2}$ is the normalized adjacency with self-loops. This is a sparse matrix multiplication followed by a dense linear transform and nonlinearity -- exactly the form that GPUs handle efficiently. The sparsity of $\hat{A}$ (most real graphs are sparse: $\lvert E \rvert \ll n^2$) means the computation per layer is $O(\lvert E \rvert\, d + n\, d^2)$ rather than $O(n^2 d)$.
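The sparse-then-dense structure can be made concrete with SciPy sparse matrices; the graph, features, and weight matrix below are random placeholders:

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
n, d_in, d_out = 5, 8, 4

# Random sparse undirected graph (placeholder data).
A = sp.random(n, n, density=0.3, format="csr", random_state=0)
A = A + A.T
A.setdiag(0)
A.eliminate_zeros()
A.data[:] = 1.0                                # binarize: unweighted edges

A_tilde = A + sp.eye(n, format="csr")          # add self-loops
deg = np.asarray(A_tilde.sum(axis=1)).ravel()
D_inv_sqrt = sp.diags(1.0 / np.sqrt(deg))
A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt      # D^{-1/2} (A + I) D^{-1/2}

H = rng.standard_normal((n, d_in))             # node features (placeholder)
W = rng.standard_normal((d_in, d_out))         # layer weights (placeholder)
H_next = np.maximum(A_hat @ H @ W, 0.0)        # sparse propagate, dense transform, ReLU
```

The propagation `A_hat @ H` touches only the stored nonzeros, which is where the $O(\lvert E \rvert\, d)$ term comes from.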
GCN and GAT
The GCN layer is $H^{(\ell+1)} = \sigma\big(\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^{(\ell)} W^{(\ell)}\big)$, where $\tilde{A} = A + I$ (add self-loops) and $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$. The normalization $\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$ is symmetric and ensures that the aggregation is scale-invariant (its spectral radius is bounded by 1).
GAT computes attention coefficients $\alpha_{ij} = \mathrm{softmax}_{j \in \mathcal{N}(i)}\big(\mathrm{LeakyReLU}\big(a^\top [W h_i \,\Vert\, W h_j]\big)\big)$ and aggregates $h_i' = \sigma\big(\sum_{j \in \mathcal{N}(i)} \alpha_{ij} W h_j\big)$, where $a$ is a learnable attention vector, $W$ is a shared linear transform, and $\Vert$ denotes concatenation. Multi-head attention extends this: $h_i' = \big\Vert_{k=1}^{K}\, \sigma\big(\sum_{j \in \mathcal{N}(i)} \alpha_{ij}^{(k)} W^{(k)} h_j\big)$.
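A sketch of the attention computation for a single node; `W`, `a`, and the features are random placeholders, and `attention_coefficients` is a hypothetical helper name:

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_out = 4, 3
W = rng.standard_normal((d_out, d_in))         # shared linear transform (placeholder)
a = rng.standard_normal(2 * d_out)             # attention vector (placeholder)

def leaky_relu(z, slope=0.2):
    return np.where(z > 0, z, slope * z)

def attention_coefficients(h_i, neighbors):
    """alpha_ij = softmax_j(LeakyReLU(a^T [W h_i || W h_j]))."""
    scores = np.array([
        leaky_relu(a @ np.concatenate([W @ h_i, W @ h_j]))
        for h_j in neighbors
    ])
    e = np.exp(scores - scores.max())          # numerically stable softmax
    return e / e.sum()

h_i = rng.standard_normal(d_in)
neighbors = [rng.standard_normal(d_in) for _ in range(3)]
alpha = attention_coefficients(h_i, neighbors)
# alpha is a probability distribution over the neighborhood.
```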
| Model | Aggregation | Attention | Complexity per node | Key Property |
|---|---|---|---|---|
| GCN | Degree-normalized mean | Fixed (by degree) | $O(\lvert\mathcal{N}(i)\rvert\, d)$ | Simple first-order spectral filter |
| GAT | Attention-weighted sum | Learned (pairwise) | $O(\lvert\mathcal{N}(i)\rvert\, d)$ | Adaptive neighbor weighting |
| GraphSAGE | Sample + aggregate | Mean/LSTM/pool | $O(S\, d)$ ($S$ = sample size) | Scalable (fixed neighborhood) |
| GIN | Sum (no normalization) | None | $O(\lvert\mathcal{N}(i)\rvert\, d)$ | Maximally expressive (matches 1-WL) |
| MPNN | General (learnable) | Optional | $O(\lvert\mathcal{N}(i)\rvert\, d)$ | Unifying framework with edge features |
| GATv2 (Brody et al., 2022) | Attention-weighted sum | Dynamic (full expressivity) | $O(\lvert\mathcal{N}(i)\rvert\, d)$ | Fixes GAT's static attention |
Expressiveness: The WL Test
The 1-WL (color refinement) algorithm iteratively refines node colors:
- Initialize: $c_i^{(0)} = \mathrm{hash}(x_i)$ (based on node features)
- Update: $c_i^{(t+1)} = \mathrm{hash}\big(c_i^{(t)},\, \{\!\{ c_j^{(t)} : j \in \mathcal{N}(i) \}\!\}\big)$, where $\{\!\{\cdot\}\!\}$ denotes a multiset
- Two graphs are declared "possibly isomorphic" if they have the same multiset of final colors
The 1-WL test is a necessary (but not sufficient) condition for graph isomorphism.
- If two nodes have different 1-WL colors, a sufficiently expressive MPNN can distinguish them.
- If two nodes have the same 1-WL colors, no MPNN can distinguish them (regardless of depth or width).
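The refinement loop is a few lines of plain Python. The classic pair below -- one 6-cycle versus two disjoint triangles, both 2-regular -- illustrates why the test is necessary but not sufficient (the function name and round count are illustrative choices):

```python
def wl_colors(adj, rounds=3):
    """1-WL color refinement. adj: dict node -> list of neighbors."""
    colors = {v: 0 for v in adj}               # uniform init (no node features)
    for _ in range(rounds):
        # Signature = (own color, sorted multiset of neighbor colors).
        signatures = {v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
                      for v in adj}
        # Compress signatures back to small integer colors.
        relabel = {sig: i for i, sig in enumerate(sorted(set(signatures.values())))}
        colors = {v: relabel[signatures[v]] for v in adj}
    return colors

hexagon = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
two_triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1],
                 3: [4, 5], 4: [3, 5], 5: [3, 4]}

c1 = sorted(wl_colors(hexagon).values())
c2 = sorted(wl_colors(two_triangles).values())
# Identical color multisets: 1-WL declares the graphs "possibly isomorphic"
# even though they are not isomorphic (one is connected, the other is not).
```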
The Graph Isomorphism Network (GIN) achieves the 1-WL upper bound by using sum aggregation and an injective update: $h_i^{(\ell+1)} = \mathrm{MLP}\big((1 + \epsilon^{(\ell)})\, h_i^{(\ell)} + \sum_{j \in \mathcal{N}(i)} h_j^{(\ell)}\big)$.
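A sketch of the GIN update with a toy two-layer MLP; the weights and $\epsilon$ are placeholders. The example feeds in the neighbor multisets $\{\!\{1\}\!\}$ and $\{\!\{1, 1\}\!\}$, which have the same mean but different sums:

```python
import numpy as np

rng = np.random.default_rng(2)
d, d_hidden = 4, 8
W1 = rng.standard_normal((d_hidden, d))        # toy MLP weights (placeholders)
W2 = rng.standard_normal((d, d_hidden))
eps = 0.1                                      # stand-in for the learnable epsilon

def mlp(z):
    return W2 @ np.maximum(W1 @ z, 0.0)

def gin_update(h_i, neighbor_feats):
    """h_i' = MLP((1 + eps) h_i + sum_j h_j)."""
    return mlp((1.0 + eps) * h_i + np.sum(neighbor_feats, axis=0))

h_i = np.zeros(d)
out_a = gin_update(h_i, [np.ones(d), np.ones(d)])   # neighbor multiset {1, 1}
out_b = gin_update(h_i, [np.ones(d)])               # neighbor multiset {1}
# Sum aggregation separates these multisets; mean aggregation would map
# both to the same input and thus the same output.
```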
Several approaches push expressivity beyond the 1-WL bound:
- $k$-WL / $k$-FWL: Operate on $k$-tuples of nodes; $k$-WL for $k \geq 2$ is strictly more powerful than 1-WL
- Equivariant subgraph GNNs: Process subgraphs around each node
- Random features: Add random node features to break symmetry (probabilistic)
- Positional encodings: Use Laplacian eigenvectors or random walk statistics as additional features
- Graph Transformers: Full pairwise attention (not restricted to edges) achieves higher expressivity
Over-Smoothing
With repeated neighborhood averaging, $h_i^{(\ell)} \to \sum_j \pi_j\, h_j^{(0)}$ as $\ell \to \infty$ for every node $i$, where $\pi$ is the stationary distribution of the random walk on $G$. All node representations converge to the same weighted average of the input features.
| Mitigation | Mechanism | Notes |
|---|---|---|
| Residual connections | Analogous to ResNet; preserves input signal | |
| JK (Jumping Knowledge) | Concatenate or aggregate all layers: $h_i = \mathrm{concat}\big(h_i^{(1)}, \dots, h_i^{(L)}\big)$ | Uses multi-scale features (Xu et al., 2018) |
| DropEdge | Randomly remove edges during training | Reduces effective receptive field |
| PairNorm | Normalize representations to maintain pairwise distance | Prevents collapse to a constant vector |
| DeeperGCN | Pre-activation residual + generalized aggregation | Enables training GNNs with 50+ layers |
| Limit depth | Use only 2-4 layers | Practical default for most tasks |
In practice, most GNN applications use 2-4 layers. Unlike Transformers (which benefit from 100+ layers), GNNs hit diminishing returns quickly because the receptive field grows exponentially with depth, and over-smoothing degrades representations.
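The collapse is easy to demonstrate numerically: repeatedly applying the normalized propagation matrix (with no weights or nonlinearities, for clarity) makes all node representations nearly parallel. The graph and features are placeholders:

```python
import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
A_tilde = A + np.eye(4)
d = A_tilde.sum(axis=1)
A_hat = A_tilde / np.sqrt(np.outer(d, d))      # D^{-1/2} (A + I) D^{-1/2}

H = np.random.default_rng(3).standard_normal((4, 2))
for _ in range(50):                            # 50 rounds of pure propagation
    H = A_hat @ H

# All rows are now nearly parallel: pairwise cosine similarities approach 1,
# i.e. the nodes have become indistinguishable up to scale.
H_unit = H / np.linalg.norm(H, axis=1, keepdims=True)
cos = H_unit @ H_unit.T
```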
Graph-Level Prediction
Graph-level tasks require pooling the final node representations $\{h_i^{(L)}\}$ into a single vector $h_G$. Common readout functions:
- Sum: $h_G = \sum_{i \in V} h_i^{(L)}$ (most expressive for distinguishing graphs, but size-dependent)
- Mean: $h_G = \frac{1}{n} \sum_{i \in V} h_i^{(L)}$ (size-invariant, but less expressive)
- Set2Set: Attention-based pooling (learnable, order-invariant)
- Virtual node: Add a node connected to all others; its final representation is the graph embedding
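A minimal sketch of the sum and mean readouts, including the size-(in)dependence noted above; the embeddings are arbitrary toy values:

```python
import numpy as np

H_final = np.array([[1.0, 2.0],
                    [3.0, 4.0],
                    [5.0, 6.0]])               # final-layer embeddings: 3 nodes

h_graph_sum = H_final.sum(axis=0)              # size-dependent readout
h_graph_mean = H_final.mean(axis=0)            # size-invariant readout

# Two disjoint copies of the same graph double the sum readout
# but leave the mean readout unchanged.
H_double = np.vstack([H_final, H_final])
```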
Applications
| Domain | Task | Graph Structure | Typical Architecture |
|---|---|---|---|
| Chemistry | Molecular property prediction | Atoms = nodes, bonds = edges | SchNet, DimeNet, GemNet |
| Drug discovery | Drug-target interaction | Molecular + protein graphs | Heterogeneous GNN |
| Social networks | Community detection, link prediction | Users = nodes, connections = edges | GAT, GraphSAGE |
| Recommendation | User-item matching | Bipartite interaction graph | LightGCN, PinSage |
| Computer vision | Scene understanding, point clouds | Object/pixel graphs | DGCNN, PointNet++ |
| Physics simulation | Particle/fluid dynamics | Particles = nodes, interactions = edges | GNS (Sanchez-Gonzalez et al., 2020) |
| Combinatorial optimization | TSP, scheduling | Problem-specific graphs | GNN + reinforcement learning |
| Knowledge graphs | Link prediction, QA | Entity-relation triples | R-GCN, CompGCN |
Notation Summary
| Symbol | Meaning |
|---|---|
| $G = (V, E)$ | Graph with vertices $V$ and edges $E$ |
| $n = \lvert V \rvert$ | Number of nodes |
| $A \in \mathbb{R}^{n \times n}$ | Adjacency matrix |
| $D$ | Degree matrix |
| $L = D - A$ | Unnormalized graph Laplacian |
| $L_{\mathrm{sym}} = D^{-1/2} L D^{-1/2}$ | Normalized Laplacian |
| $\lambda_k$, $u_k$ | $k$-th eigenvalue/eigenvector of $L$ |
| $\mathcal{N}(i)$ | Neighbors of node $i$ |
| $h_i^{(\ell)}$ | Node representation at layer $\ell$ |
| $\alpha_{ij}$ | Attention coefficient (GAT) |
| $\bigoplus$ | Permutation-invariant aggregation |
| $e_{ij}$ | Edge features |
| WL | Weisfeiler-Leman graph isomorphism test |