
Graph Neural Networks

Graph Neural Networks (GNNs) extend deep learning to data with irregular, non-Euclidean structure: social networks, molecular graphs, knowledge graphs, point clouds, and meshes. Unlike images (regular grids) or text (sequences), graphs have variable-size neighborhoods and no canonical ordering of nodes. This chapter covers the mathematical foundations: graph theory basics, spectral methods, the message passing framework, and key architectures.

Graphs and Adjacency

A **graph** $G = (V, E)$ consists of:
  • **Vertices** (nodes) $V = \{1, 2, \ldots, n\}$ with feature vectors $x_i \in \mathbb{R}^{d_0}$
  • **Edges** $E \subseteq V \times V$ with optional features $e_{ij} \in \mathbb{R}^{d_e}$

The adjacency matrix $A \in \{0,1\}^{n \times n}$ has $A_{ij} = 1$ if $(i,j) \in E$ (or $A_{ij} = w_{ij}$ for weighted graphs). The degree matrix is $D = \text{diag}(d_1, \dots, d_n)$ where $d_i = \sum_j A_{ij}$. The neighborhood of node $i$ is $\mathcal{N}(i) = \{j : (i,j) \in E\}$.
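The definitions above can be made concrete with plain NumPy. This is a minimal sketch on a hypothetical four-node toy graph (a triangle plus one pendant edge), not an example from the chapter:

```python
import numpy as np

# Undirected toy graph on 4 nodes: triangle 0-1-2 plus the pendant edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
n = 4

A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0               # symmetric: undirected graph

degrees = A.sum(axis=1)                    # d_i = sum_j A_ij
D = np.diag(degrees)                       # degree matrix
neighbors = {i: set(np.nonzero(A[i])[0]) for i in range(n)}

print(degrees)        # [2. 2. 3. 1.]
print(neighbors[2])   # {0, 1, 3}
```

For a directed graph one would simply drop the symmetrizing assignment `A[j, i] = 1.0`.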

| Graph Type | Properties | Example |
|---|---|---|
| Undirected | $A = A^\top$ | Social networks, molecules |
| Directed | $A \neq A^\top$ | Citation networks, web graphs |
| Weighted | $A_{ij} \in \mathbb{R}$ | Similarity graphs, distance networks |
| Bipartite | $V = V_1 \cup V_2$, edges only between $V_1, V_2$ | User-item interactions |
| Heterogeneous | Multiple node/edge types | Knowledge graphs |
| Dynamic | Edges/nodes change over time | Temporal interaction networks |
| Hypergraph | Edges connect $> 2$ nodes | Co-authorship, group interactions |

Graph Laplacian

The **unnormalized Laplacian** is $L = D - A$. Key properties:
  • Symmetric and positive semi-definite ($L \succeq 0$)
  • Null space: $L \mathbf{1} = 0$ -- the constant vector is an eigenvector with eigenvalue $0$
  • Connected components: The number of zero eigenvalues equals the number of connected components of $G$
  • Quadratic form: $x^\top L x = \frac{1}{2}\sum_{(i,j) \in E} w_{ij}(x_i - x_j)^2$

The **symmetric normalized Laplacian** is $\hat{L} = I - D^{-1/2} A D^{-1/2}$, with eigenvalues in $[0, 2]$.

The **random walk Laplacian** is $L_{\text{rw}} = I - D^{-1}A$, whose eigenvectors are related to the stationary distribution of a random walk on $G$.

**The Laplacian measures signal smoothness.** The quadratic form $x^\top L x = \frac{1}{2}\sum_{(i,j)} (x_i - x_j)^2$ is the **Dirichlet energy** -- it is small when neighboring nodes have similar values and large when they differ. This is the graph analogue of $\int |\nabla f|^2 dx$ for continuous functions.
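The quadratic-form identity can be checked numerically. A minimal sketch on an assumed three-node path graph, verifying that $x^\top L x$ equals the sum of squared differences over edges:

```python
import numpy as np

# Path graph 0-1-2: check that x^T L x = sum over edges of (x_i - x_j)^2.
edges = [(0, 1), (1, 2)]
n = 3
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
L = np.diag(A.sum(axis=1)) - A            # unnormalized Laplacian L = D - A

x = np.array([1.0, 2.0, 4.0])
quad = x @ L @ x                          # Dirichlet energy via quadratic form
edge_sum = sum((x[i] - x[j]) ** 2 for i, j in edges)

print(quad, edge_sum)                     # both 5.0

# The constant vector has zero energy: L 1 = 0.
assert np.allclose(L @ np.ones(n), 0.0)
```

A smooth signal (neighbors similar) gives small `quad`; a rapidly varying one gives large `quad`, matching the low-/high-frequency picture below.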

This connects directly to GNNs: message passing averages neighboring features, which is equivalent to applying a low-pass filter on the graph. The eigenvectors of $L$ with small eigenvalues correspond to "smooth" signals (low-frequency components), while large eigenvalues correspond to "oscillating" signals (high-frequency).

**Spectral clustering.** The eigenvectors of $L$ corresponding to the smallest non-zero eigenvalues (the **Fiedler vectors**) approximately solve the graph partitioning problem. Spectral clustering computes $k$ such eigenvectors, treats them as node embeddings in $\mathbb{R}^k$, and runs $k$-means. This is equivalent to relaxing the discrete min-cut problem to a continuous optimization: $\min_{X \in \mathbb{R}^{n \times k}} \text{tr}(X^\top L X)$ subject to $X^\top X = I$.
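For $k = 2$ the relaxation reduces to thresholding the sign of the Fiedler vector, which a short NumPy sketch can illustrate (the two-triangle toy graph is an assumption for illustration, not from the chapter):

```python
import numpy as np

# Two triangles {0,1,2} and {3,4,5} joined by the single edge (2,3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
n = 6
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
L = np.diag(A.sum(axis=1)) - A

# eigh returns eigenvalues in ascending order; column 1 is the Fiedler vector.
eigvals, eigvecs = np.linalg.eigh(L)
fiedler = eigvecs[:, 1]

# The sign pattern of the Fiedler vector gives a 2-way partition (relaxed min-cut).
labels = (fiedler > 0).astype(int)
print(labels)   # one triangle in each cluster
```

For general $k$ one would take the first $k$ nontrivial eigenvectors as embeddings and run $k$-means, as described above.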

Spectral Graph Theory

The eigendecomposition $L = U \Lambda U^\top$ (where $\Lambda = \text{diag}(\lambda_0, \ldots, \lambda_{n-1})$ with $0 = \lambda_0 \leq \lambda_1 \leq \cdots$) defines the graph Fourier transform:

$$\hat{x} = U^\top x \quad \text{(forward)}, \qquad x = U \hat{x} \quad \text{(inverse)}$$

The coefficient $\hat{x}_k = u_k^\top x$ measures how much the signal $x$ "oscillates" at frequency $\lambda_k$.

**Spectral convolution** on a graph applies a filter $g_\theta$ in the frequency domain:

$$x *_G g_\theta = U \, g_\theta(\Lambda) \, U^\top x = U \, \text{diag}(g_\theta(\lambda_0), \ldots, g_\theta(\lambda_{n-1})) \, U^\top x$$

This is the graph analogue of the convolution theorem: convolution in the spatial domain equals multiplication in the frequency domain.

**From spectral to spatial methods.** Spectral convolution has two problems: (1) computing the eigendecomposition is $O(n^3)$, and (2) the filter $g_\theta(\Lambda)$ has $n$ free parameters (one per eigenvalue), so it does not transfer across graphs.

ChebNet [@defferrard2016convolutional] approximates $g_\theta(\Lambda)$ with a $K$-th order Chebyshev polynomial: $g_\theta(\Lambda) \approx \sum_{k=0}^K \theta_k T_k(\tilde{\Lambda})$ where $\tilde{\Lambda} = 2\Lambda/\lambda_{\max} - I$. Since $T_k(\tilde{L})x$ (with the rescaled Laplacian $\tilde{L} = 2L/\lambda_{\max} - I$) can be computed via the recurrence $T_k(\tilde{L})x = 2\tilde{L} \, T_{k-1}(\tilde{L})x - T_{k-2}(\tilde{L})x$, this requires only matrix-vector products with $L$ (no eigendecomposition). The result is a $K$-hop localized filter with $K+1$ parameters.
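The recurrence can be sketched in a few lines of NumPy. The filter coefficients below are arbitrary illustrative values, and the result is checked against the exact spectral filter on a toy triangle graph:

```python
import numpy as np

def cheb_filter(L, x, theta, lam_max):
    """Apply sum_k theta_k T_k(L_tilde) x using only matrix-vector products."""
    n = L.shape[0]
    L_t = 2.0 * L / lam_max - np.eye(n)   # rescale spectrum into [-1, 1]
    Tx_prev, Tx = x, L_t @ x              # T_0 x = x,  T_1 x = L_tilde x
    out = theta[0] * Tx_prev + theta[1] * Tx
    for k in range(2, len(theta)):
        Tx_prev, Tx = Tx, 2.0 * L_t @ Tx - Tx_prev   # Chebyshev recurrence
        out += theta[k] * Tx
    return out

# Sanity check on a triangle graph against the exact spectral filter.
A = np.ones((3, 3)) - np.eye(3)
L = np.diag(A.sum(axis=1)) - A
lam, U = np.linalg.eigh(L)
theta = np.array([0.5, -0.3, 0.2])        # arbitrary coefficients, K = 2
x = np.array([1.0, 0.0, -2.0])

y = cheb_filter(L, x, theta, lam_max=lam.max())

# Exact filter: g(lambda) = theta_0 T_0 + theta_1 T_1 + theta_2 T_2 per eigenvalue.
lt = 2 * lam / lam.max() - 1
g = theta[0] + theta[1] * lt + theta[2] * (2 * lt**2 - 1)
y_exact = U @ (g * (U.T @ x))
assert np.allclose(y, y_exact)
```

Note that `cheb_filter` never diagonalizes `L`; the eigendecomposition appears only in the correctness check.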

**GCN** (Kipf & Welling, 2017) uses $K = 1$ with a single parameter, giving the first-order approximation that leads to the spatial message passing formula.

Message Passing Framework

Most GNNs follow the **message passing** paradigm [@gilmer2017neural]. At each layer $l$, each node $i$ updates its representation by:
  1. Message computation: Compute a message from each neighbor $j$:

$$m_{j \to i}^{(l)} = \phi^{(l)}(h_i^{(l)}, h_j^{(l)}, e_{ij})$$

  2. Aggregation: Combine messages using a permutation-invariant function:

$$m_i^{(l)} = \bigoplus_{j \in \mathcal{N}(i)} m_{j \to i}^{(l)}$$

  3. Update: Combine the aggregated message with the node's own representation:

$$h_i^{(l+1)} = \psi^{(l)}(h_i^{(l)}, m_i^{(l)})$$

where $\bigoplus$ is a permutation-invariant aggregation (sum, mean, max), $\phi^{(l)}$ is the message function, and $\psi^{(l)}$ is the update function.
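The three steps can be sketched as one NumPy layer. This is a minimal illustration, assuming linear message/update maps and sum aggregation (the function name `mp_layer` and random weights are placeholders, not from the chapter):

```python
import numpy as np

def mp_layer(edges, h, W_msg, W_upd):
    """One message passing layer on an undirected edge list:
    phi = linear map of the sender, aggregate = sum, psi = ReLU(self + messages)."""
    n, _ = h.shape
    m = np.zeros((n, W_msg.shape[1]))
    for i, j in edges:
        m[i] += h[j] @ W_msg              # message j -> i
        m[j] += h[i] @ W_msg              # message i -> j (undirected)
    return np.maximum(0.0, h @ W_upd + m) # update with ReLU nonlinearity

rng = np.random.default_rng(0)
edges = [(0, 1), (1, 2)]                  # toy path graph
h = rng.normal(size=(3, 4))               # 3 nodes, 4-dim features
W_msg = rng.normal(size=(4, 4))
W_upd = rng.normal(size=(4, 4))

h_next = mp_layer(edges, h, W_msg, W_upd)
print(h_next.shape)   # (3, 4)
```

Because the sum over neighbors ignores their order, the layer is permutation-invariant by construction.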

**Aggregation function matters.** The choice of $\bigoplus$ affects the expressive power:
| Aggregation | Formula | Properties | Issue |
|---|---|---|---|
| Sum | $\sum_j m_j$ | Injective for multisets (most expressive) | Sensitive to degree |
| Mean | $\frac{1}{\lvert \mathcal{N}(i) \rvert}\sum_j m_j$ | Degree-invariant; captures the feature distribution | Cannot distinguish multisets with the same mean |
| Max | $\max_j m_j$ | Robust to outliers | Loses multiplicity information |
| Attention-weighted | $\sum_j \alpha_{ij} m_j$ | Data-dependent, adaptive | More parameters, harder to train |

Xu et al. [@xu2019how] proved that sum aggregation is maximally expressive among these choices, as it can distinguish different multisets. This motivates the GIN (Graph Isomorphism Network) architecture.

**Message passing as matrix operation.** For GCN-style aggregation, the layer update is:

$$H^{(l+1)} = \sigma(\hat{A} H^{(l)} W^{(l)})$$

where $\hat{A} = \tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}$ is the normalized adjacency with self-loops. This is a sparse matrix multiplication followed by a dense linear transform and nonlinearity -- exactly the form that GPUs handle efficiently. The sparsity of $A$ (most real graphs are sparse: $|E| = O(n)$) means the computation per layer is $O(|E| \cdot d + n \cdot d^2)$ rather than $O(n^2 d)$.

GCN and GAT

**Graph Convolutional Network (GCN)** [@kipf2017gcn] uses normalized adjacency for aggregation:

$$H^{(l+1)} = \sigma\left(\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^{(l)} W^{(l)}\right)$$

where $\tilde{A} = A + I$ (add self-loops) and $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$. The normalization $\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}$ is symmetric and ensures that the aggregation is scale-invariant (the spectral radius is bounded by 1).
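The propagation rule translates directly into NumPy. A minimal sketch with a random weight matrix (names and the toy path graph are illustrative assumptions):

```python
import numpy as np

def gcn_layer(A, H, W):
    """H' = ReLU( D~^{-1/2} (A + I) D~^{-1/2} H W )."""
    n = A.shape[0]
    A_t = A + np.eye(n)                           # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_t.sum(axis=1))   # D~^{-1/2} as a vector
    A_hat = d_inv_sqrt[:, None] * A_t * d_inv_sqrt[None, :]
    return np.maximum(0.0, A_hat @ H @ W)

# 4-node path graph, 3-dim input features -> 2-dim output features.
A = np.zeros((4, 4))
for i in range(3):
    A[i, i + 1] = A[i + 1, i] = 1.0

rng = np.random.default_rng(1)
H = rng.normal(size=(4, 3))
W = rng.normal(size=(3, 2))
H_next = gcn_layer(A, H, W)
print(H_next.shape)   # (4, 2)
```

In practice `A_hat` would be stored as a sparse matrix so the cost stays $O(|E| \cdot d)$, as noted above.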

**Graph Attention Network (GAT)** [@velickovic2018gat] replaces fixed normalization with learned attention weights:

$$\alpha_{ij} = \frac{\exp\left(\text{LeakyReLU}(a^\top [W h_i \,\|\, W h_j])\right)}{\sum_{k \in \mathcal{N}(i) \cup \{i\}} \exp\left(\text{LeakyReLU}(a^\top [W h_i \,\|\, W h_k])\right)}$$

$$h_i^{(l+1)} = \sigma\left(\sum_{j \in \mathcal{N}(i) \cup \{i\}} \alpha_{ij} W h_j^{(l)}\right)$$

where $a \in \mathbb{R}^{2d'}$ is a learnable attention vector, $W \in \mathbb{R}^{d' \times d}$ is a shared linear transform, and $\|$ denotes concatenation. Multi-head attention extends this: $h_i' = \big\|_{k=1}^K \sigma\big(\sum_j \alpha_{ij}^k W^k h_j\big)$.
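The attention computation for a single node and head can be sketched as follows; the helper name `gat_attention` and the random parameters are illustrative assumptions:

```python
import numpy as np

def leaky_relu(z, slope=0.2):
    return np.where(z > 0, z, slope * z)

def gat_attention(h, W, a, i, neighbors):
    """Softmax-normalized attention alpha_ij over N(i) ∪ {i} (single head)."""
    nbrs = list(neighbors) + [i]               # GAT includes the self-loop
    scores = np.array([
        leaky_relu(a @ np.concatenate([W @ h[i], W @ h[j]]))
        for j in nbrs
    ])
    scores = np.exp(scores - scores.max())     # numerically stable softmax
    alpha = scores / scores.sum()
    return dict(zip(nbrs, alpha))

rng = np.random.default_rng(2)
h = rng.normal(size=(4, 3))      # 4 nodes, input dim d = 3
W = rng.normal(size=(5, 3))      # output dim d' = 5
a = rng.normal(size=(10,))       # attention vector in R^{2d'}

alpha = gat_attention(h, W, a, i=0, neighbors=[1, 2])
print(sum(alpha.values()))       # 1.0: weights normalize over the neighborhood
```

GATv2 differs only in where the nonlinearity sits: the score becomes $a^\top \text{LeakyReLU}(W[h_i \,\|\, h_j])$, which makes the attention input-dependent for every query.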

| Model | Aggregation | Attention | Complexity per node | Key Property |
|---|---|---|---|---|
| GCN | Degree-normalized mean | Fixed (by degree) | $O(\lvert \mathcal{N}(i) \rvert \cdot d^2)$ | Simple, efficient baseline |
| GAT | Attention-weighted sum | Learned (pairwise) | $O(\lvert \mathcal{N}(i) \rvert \cdot d^2)$ | Adaptive neighbor weighting |
| GraphSAGE | Sample + aggregate | Mean/LSTM/pool | $O(k \cdot d^2)$ ($k$ = sample size) | Scalable (fixed neighborhood) |
| GIN | Sum (no normalization) | None | $O(\lvert \mathcal{N}(i) \rvert \cdot d^2)$ | Maximally expressive (1-WL) |
| MPNN | General (learnable) | Optional | $O(\lvert \mathcal{N}(i) \rvert \cdot d^2)$ | Unifying framework |
| GATv2 [@brody2022how] | Attention-weighted sum | Dynamic (full expressivity) | $O(\lvert \mathcal{N}(i) \rvert \cdot d^2)$ | Fixes GAT's static attention |

Expressiveness: The WL Test

The **1-dimensional Weisfeiler-Leman (1-WL)** graph isomorphism test iteratively refines node colors:
  1. Initialize: $c_i^{(0)} = \text{hash}(x_i)$ (based on node features)
  2. Update: $c_i^{(l+1)} = \text{hash}\left(c_i^{(l)}, \{\!\{c_j^{(l)} : j \in \mathcal{N}(i)\}\!\}\right)$
  3. Two graphs are declared "possibly isomorphic" if they have the same multiset of final colors

The 1-WL test is a necessary (but not sufficient) condition for graph isomorphism.

**Message passing GNNs are at most as powerful as the 1-WL test** [@xu2019how; @morris2019weisfeiler]. Specifically:
  • If two nodes have different 1-WL colors, a sufficiently expressive MPNN can distinguish them.
  • If two nodes have the same 1-WL colors, no MPNN can distinguish them (regardless of depth or width).

The **Graph Isomorphism Network (GIN)** achieves the 1-WL upper bound by using sum aggregation and an injective update: $h_i^{(l+1)} = \text{MLP}\left((1 + \epsilon) \cdot h_i^{(l)} + \sum_{j \in \mathcal{N}(i)} h_j^{(l)}\right)$.
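The 1-WL refinement above is short enough to implement directly. A sketch (refining colors jointly on the disjoint union of two graphs, with tuples standing in for the hash) that exhibits both a success and the classic failure case of two 2-regular graphs:

```python
# 1-WL color refinement on the disjoint union of two graphs. Returns True
# if the graphs end with the same multiset of colors ("possibly isomorphic").
def wl_test(adj1, adj2, rounds=3):
    n1 = len(adj1)
    adj = {v: list(nbrs) for v, nbrs in adj1.items()}
    adj.update({v + n1: [u + n1 for u in nbrs] for v, nbrs in adj2.items()})
    colors = {v: 0 for v in adj}                      # uniform initial color
    for _ in range(rounds):
        sigs = {v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
                for v in adj}
        palette = {s: c for c, s in enumerate(sorted(set(sigs.values())))}
        colors = {v: palette[sigs[v]] for v in adj}   # injective re-coloring
    c1 = sorted(colors[v] for v in adj1)
    c2 = sorted(colors[v + n1] for v in adj2)
    return c1 == c2

# Classic failure case: C6 vs. two disjoint triangles (both 2-regular).
c6 = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
two_c3 = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}
print(wl_test(c6, two_c3))    # True: 1-WL cannot tell them apart

# Degree information alone separates a path from a star.
path4 = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
star4 = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
print(wl_test(path4, star4))  # False
```

The $C_6$ vs. $2 \times C_3$ pair is exactly the kind of example that motivates the higher-order methods below: every node sees the same local view, so no message passing GNN can separate the two graphs either.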

**Going beyond 1-WL.** The 1-WL test (and thus standard MPNNs) cannot count certain substructures (e.g., triangles, cycles of length $> 3$) or distinguish some non-isomorphic regular graphs. Higher-order methods include:
  • $k$-WL / $k$-FWL: Operate on $k$-tuples of nodes; $k$-WL for $k \geq 3$ is strictly more powerful
  • Equivariant subgraph GNNs: Process subgraphs around each node
  • Random features: Add random node features to break symmetry (probabilistic)
  • Positional encodings: Use Laplacian eigenvectors or random walk statistics as additional features
  • Graph Transformers: Full pairwise attention (not restricted to edges) achieves higher expressivity

Over-Smoothing

**Over-smoothing** occurs when GNN depth increases and node representations converge to a common vector, losing discriminative power. Formally, as $L \to \infty$:

$$\lim_{L \to \infty} \left(\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}\right)^L H = \mathbf{1} \pi^\top H$$

where $\pi$ is the stationary distribution of the random walk on $G$. All node representations converge to the same weighted average of the input features.
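The collapse is easy to observe numerically. A sketch using the row-stochastic propagation matrix $\tilde{D}^{-1}\tilde{A}$ (the random-walk variant, whose powers converge to $\mathbf{1}\pi^\top$ exactly) on an assumed toy path graph:

```python
import numpy as np

# Over-smoothing sketch: powers of the row-stochastic propagation matrix
# P = D~^{-1}(A + I) drive all node representations to the same vector.
rng = np.random.default_rng(3)
n = 6
A = np.zeros((n, n))
for i in range(n - 1):                     # connected path graph
    A[i, i + 1] = A[i + 1, i] = 1.0
A_t = A + np.eye(n)                        # self-loops make the walk aperiodic
P = A_t / A_t.sum(axis=1, keepdims=True)   # random-walk normalization

H = rng.normal(size=(n, 4))                # random node features
spread = lambda X: np.ptp(X, axis=0).max() # max range of a feature over nodes

H_deep = np.linalg.matrix_power(P, 200) @ H
print(spread(H), spread(H_deep))           # spread collapses toward 0
```

The convergence rate is governed by the second-largest eigenvalue of `P`, mirroring the spectral-gap statement above.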

**Why over-smoothing happens and how to mitigate it.** Each message passing layer applies a low-pass filter (averaging with neighbors). Repeated application exponentially suppresses high-frequency components. The convergence rate depends on the spectral gap $\lambda_1$ of the normalized Laplacian: larger gap means faster convergence (faster over-smoothing).
| Mitigation | Mechanism | Notes |
|---|---|---|
| Residual connections | $h^{(l+1)} = h^{(l)} + \text{GNN}(h^{(l)})$ | Analogous to ResNet; preserves input signal |
| JK (Jumping Knowledge) | Concatenate or aggregate all layers: $h_i = f(h_i^{(1)}, \ldots, h_i^{(L)})$ | Uses multi-scale features [@xu2018representation] |
| DropEdge | Randomly remove edges during training | Reduces effective receptive field |
| PairNorm | Normalize representations to maintain pairwise distances | Prevents collapse to a constant vector |
| DeeperGCN | Pre-activation residual + generalized aggregation | Enables training GNNs with 50+ layers |
| Limit depth | Use only 2-4 layers | Practical default for most tasks |

In practice, most GNN applications use 2-4 layers. Unlike Transformers (which benefit from 100+ layers), GNNs hit diminishing returns quickly because the receptive field grows exponentially with depth, and over-smoothing degrades representations.

Graph-Level Prediction

For graph-level tasks (graph classification, property prediction), node representations must be aggregated into a single graph-level vector:

$$h_G = \text{READOUT}\left(\{h_i^{(L)} : i \in V\}\right)$$

Common readout functions:

  • Sum: $h_G = \sum_i h_i$ (most expressive for distinguishing graphs, but size-dependent)
  • Mean: $h_G = \frac{1}{\lvert V \rvert}\sum_i h_i$ (size-invariant, but less expressive)
  • Set2Set: Attention-based pooling (learnable, order-invariant)
  • Virtual node: Add a node connected to all others; its final representation is the graph embedding
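The size-dependence trade-off between sum and mean is easy to demonstrate: duplicating every node (a hypothetical "doubled graph") changes the sum readout but leaves the mean readout untouched. A minimal sketch with random embeddings:

```python
import numpy as np

# Readout sketch: pool final node embeddings into one graph-level vector.
rng = np.random.default_rng(4)
h = rng.normal(size=(5, 8))            # 5 nodes, 8-dim final embeddings

h_sum = h.sum(axis=0)                  # size-dependent, most expressive
h_mean = h.mean(axis=0)                # size-invariant, less expressive

# Duplicating every node doubles the sum but leaves the mean unchanged.
h2 = np.vstack([h, h])
assert np.allclose(h2.sum(axis=0), 2 * h_sum)
assert np.allclose(h2.mean(axis=0), h_mean)
```

This is why mean readout cannot distinguish a graph from two disjoint copies of itself, while sum readout can.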

Applications

| Domain | Task | Graph Structure | Typical Architecture |
|---|---|---|---|
| Chemistry | Molecular property prediction | Atoms = nodes, bonds = edges | SchNet, DimeNet, GemNet |
| Drug discovery | Drug-target interaction | Molecular + protein graphs | Heterogeneous GNN |
| Social networks | Community detection, link prediction | Users = nodes, connections = edges | GAT, GraphSAGE |
| Recommendation | User-item matching | Bipartite interaction graph | LightGCN, PinSage |
| Computer vision | Scene understanding, point clouds | Object/pixel graphs | DGCNN, PointNet++ |
| Physics simulation | Particle/fluid dynamics | Particles = nodes, interactions = edges | GNS (Sanchez-Gonzalez et al., 2020) |
| Combinatorial optimization | TSP, scheduling | Problem-specific graphs | GNN + reinforcement learning |
| Knowledge graphs | Link prediction, QA | Entity-relation triples | R-GCN, CompGCN |

Notation Summary

| Symbol | Meaning |
|---|---|
| $G = (V, E)$ | Graph with vertices and edges |
| $n = \lvert V \rvert$ | Number of nodes |
| $A$ | Adjacency matrix |
| $D$ | Degree matrix |
| $L = D - A$ | Unnormalized graph Laplacian |
| $\hat{L} = I - D^{-1/2}AD^{-1/2}$ | Normalized Laplacian |
| $\lambda_k, u_k$ | $k$-th eigenvalue/eigenvector of $L$ |
| $\mathcal{N}(i)$ | Neighbors of node $i$ |
| $h_i^{(l)}$ | Representation of node $i$ at layer $l$ |
| $\alpha_{ij}$ | Attention coefficient (GAT) |
| $\bigoplus$ | Permutation-invariant aggregation |
| $e_{ij}$ | Edge features |
| WL | Weisfeiler-Leman graph isomorphism test |

References