Skip to main content

Graph Neural Networks

Graph Neural Networks (GNNs) extend deep learning to data with irregular, non-Euclidean structure: social networks, molecular graphs, knowledge graphs, point clouds, and meshes. Unlike images (regular grids) or text (sequences), graphs have variable-size neighborhoods and no canonical ordering of nodes. This chapter covers the mathematical foundations: graph theory basics, spectral methods, the message passing framework, and key architectures.

Graphs and Adjacency

A **graph** $G = (V, E)$ consists of: - **Vertices** (nodes) $V = \{1, 2, \ldots, n\}$ with feature vectors $x_i \in \mathbb{R}^{d_0}$ - **Edges** $E \subseteq V \times V$ with optional features $e_{ij} \in \mathbb{R}^{d_e}$

The adjacency matrix A{0,1}n×nA \in \{0,1\}^{n \times n} has Aij=1A_{ij} = 1 if (i,j)E(i,j) \in E (or Aij=wijA_{ij} = w_{ij} for weighted graphs). The degree matrix D=diag(d1,,dn)D = \text{diag}(d_1, \dots, d_n) where di=jAijd_i = \sum_j A_{ij}. The neighborhood of node ii is N(i)={j:(i,j)E}\mathcal{N}(i) = \{j : (i,j) \in E\}.

Graph TypePropertiesExample
UndirectedA=AA = A^\topSocial networks, molecules
DirectedAAA \neq A^\topCitation networks, web graphs
WeightedAijRA_{ij} \in \mathbb{R}Similarity graphs, distance networks
BipartiteV=V1V2V = V_1 \cup V_2, edges only between V1,V2V_1, V_2User-item interactions
HeterogeneousMultiple node/edge typesKnowledge graphs
DynamicEdges/nodes change over timeTemporal interaction networks
HypergraphEdges connect >2> 2 nodesCo-authorship, group interactions

Graph Laplacian

The **unnormalized Laplacian** is $L = D - A$. Key properties:
  • Symmetric and positive semi-definite (L0L \succeq 0)
  • Null space: L1=0L \mathbf{1} = 0, so the constant vector is an eigenvector with eigenvalue 00
  • Connected components: The number of zero eigenvalues equals the number of connected components of GG
  • Quadratic form: xLx=12(i,j)Ewij(xixj)2x^\top L x = \frac{1}{2}\sum_{(i,j) \in E} w_{ij}(x_i - x_j)^2

The symmetric normalized Laplacian is L^=ID1/2AD1/2\hat{L} = I - D^{-1/2} A D^{-1/2}, with eigenvalues in [0,2][0, 2].

The random walk Laplacian is Lrw=ID1AL_{\text{rw}} = I - D^{-1}A, whose eigenvectors are related to the stationary distribution of a random walk on GG.

**Laplacian of a triangle graph.** Take the complete graph on three nodes $K_3$ (every pair connected). The adjacency, degree, and Laplacian matrices are A=(011101110),D=(200020002),L=DA=(211121112).A = \begin{pmatrix} 0 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}, \quad D = \begin{pmatrix} 2 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 2 \end{pmatrix}, \quad L = D - A = \begin{pmatrix} 2 & -1 & -1 \\ -1 & 2 & -1 \\ -1 & -1 & 2 \end{pmatrix}.

To get the eigenvalues, note that A=JIA = J - I where JJ is the all-ones matrix, so L=2I(JI)=3IJL = 2I - (J - I) = 3I - J. The matrix JJ has eigenvalues {3,0,0}\{3, 0, 0\} (eigenvalue 33 for 1\mathbf{1}, eigenvalue 00 for the two-dimensional orthogonal complement). Therefore L=3IJL = 3I - J has eigenvalues

λ0=33=0,λ1=λ2=30=3.\lambda_0 = 3 - 3 = 0, \qquad \lambda_1 = \lambda_2 = 3 - 0 = 3.

The single zero eigenvalue (with eigenvector 1\mathbf{1}) confirms one connected component, matching the properties above. Checking the Dirichlet energy on x=(1,0,0)x = (1, 0, 0)^\top: xLx=2x^\top L x = 2, which equals 12(i,j)E(xixj)2=12[(10)2+(10)2+(00)2]2=2\frac{1}{2}\sum_{(i,j)\in E}(x_i - x_j)^2 = \frac{1}{2}\big[(1-0)^2 + (1-0)^2 + (0-0)^2\big] \cdot 2 = 2 (each of the two edges touching node 11 contributes 11, counted in both directions). The symmetric normalized operator is D1/2AD1/2=12AD^{-1/2}AD^{-1/2} = \frac{1}{2}A, with eigenvalues {1,12,12}\{1, -\tfrac12, -\tfrac12\}, so the normalized Laplacian L^=I12A\hat{L} = I - \frac{1}{2}A has eigenvalues {0,32,32}[0,2]\{0, \tfrac32, \tfrac32\} \subset [0, 2], as guaranteed.

**The Laplacian measures signal smoothness.** The quadratic form $x^\top L x = \frac{1}{2}\sum_{(i,j)} (x_i - x_j)^2$ is the **Dirichlet energy**: it is small when neighboring nodes have similar values and large when they differ. This is the graph analogue of $\int |\nabla f|^2 dx$ for continuous functions.

This connects directly to GNNs: message passing averages neighboring features, which is equivalent to applying a low-pass filter on the graph. The eigenvectors of LL with small eigenvalues correspond to "smooth" signals (low-frequency components), while large eigenvalues correspond to "oscillating" signals (high-frequency).

**Spectral clustering.** The eigenvectors of $L$ corresponding to the smallest non-zero eigenvalues (the **Fiedler vectors**) approximately solve the graph partitioning problem. Spectral clustering computes $k$ such eigenvectors, treats them as node embeddings in $\mathbb{R}^k$, and runs $k$-means. This is equivalent to relaxing the discrete min-cut problem to a continuous optimization: $\min_{X \in \mathbb{R}^{n \times k}} \text{tr}(X^\top L X)$ subject to $X^\top X = I$.

Spectral Graph Theory

The eigendecomposition L=UΛUL = U \Lambda U^\top (where Λ=diag(λ0,,λn1)\Lambda = \text{diag}(\lambda_0, \ldots, \lambda_{n-1}) with 0=λ0λ10 = \lambda_0 \leq \lambda_1 \leq \cdots) defines the graph Fourier transform:

x^=Ux(forward),x=Ux^(inverse)\hat{x} = U^\top x \quad \text{(forward)}, \quad x = U \hat{x} \quad \text{(inverse)}

The coefficient x^k=ukx\hat{x}_k = u_k^\top x measures how much the signal xx "oscillates" at frequency λk\lambda_k.

**Spectral convolution** on a graph applies a filter $g_\theta$ in the frequency domain:

xGgθ=Ugθ(Λ)Ux=Udiag(gθ(λ0),,gθ(λn1))Uxx *_G g_\theta = U \, g_\theta(\Lambda) \, U^\top x = U \, \text{diag}(g_\theta(\lambda_0), \ldots, g_\theta(\lambda_{n-1})) \, U^\top x

This is the graph analogue of the convolution theorem: convolution in the spatial domain equals multiplication in the frequency domain.

**From spectral to spatial methods.** Spectral convolution has two problems: (1) computing the eigendecomposition is $O(n^3)$, and (2) the filter $g_\theta(\Lambda)$ has $n$ free parameters (one per eigenvalue), so it does not transfer across graphs.

ChebNet (Defferrard et al., 2016) approximates gθ(Λ)g_\theta(\Lambda) with a KK-th order Chebyshev polynomial: gθ(Λ)k=0KθkTk(Λ~)g_\theta(\Lambda) \approx \sum_{k=0}^K \theta_k T_k(\tilde{\Lambda}) where Λ~=2Λ/λmaxI\tilde{\Lambda} = 2\Lambda/\lambda_{\max} - I. Since Tk(L)T_k(L) can be computed via the recurrence Tk(L)x=2LTk1(L)xTk2(L)xT_k(L)x = 2L \cdot T_{k-1}(L)x - T_{k-2}(L)x, this requires only matrix-vector products with LL (no eigendecomposition). The result is a KK-hop localized filter with K+1K+1 parameters.

GCN (Kipf & Welling, 2017) uses K=1K = 1 with a single parameter, giving the first-order approximation that leads to the spatial message passing formula.

Message Passing Framework

Most GNNs follow the **message passing** paradigm [@gilmer2017neural]. At each layer $l$, each node $i$ updates its representation by:
  1. Message computation: Compute a message from each neighbor jj:

mji(l)=ϕ(l)(hi(l),hj(l),eij)m_{j \to i}^{(l)} = \phi^{(l)}(h_i^{(l)}, h_j^{(l)}, e_{ij})

  1. Aggregation: Combine messages using a permutation-invariant function:

mi(l)=jN(i)mji(l)m_i^{(l)} = \bigoplus_{j \in \mathcal{N}(i)} m_{j \to i}^{(l)}

  1. Update: Combine the aggregated message with the node's own representation:

hi(l+1)=ψ(l)(hi(l),mi(l))h_i^{(l+1)} = \psi^{(l)}(h_i^{(l)}, m_i^{(l)})

where \bigoplus is a permutation-invariant aggregation (sum, mean, max), ϕ\phi is the message function, and ψ\psi is the update function.

**Aggregation function matters.** The choice of $\bigoplus$ affects the expressive power:
AggregationFormulaPropertiesIssue
Sumjmj\sum_j m_jInjective for multisets (most expressive)Sensitive to degree
Mean$\frac{1}{\mathcal{N}}\sum_j m_j$
Maxmaxjmj\max_j m_jRobust to outliersLoses multiplicity information
Attention-weightedjαijmj\sum_j \alpha_{ij} m_jData-dependent, adaptiveMore parameters, harder to train

Xu et al. (Xu et al., 2019) proved that sum aggregation is maximally expressive among these choices, as it can distinguish different multisets. This motivates the GIN (Graph Isomorphism Network) architecture.

**Message passing as matrix operation.** For GCN-style aggregation, the layer update is:

H(l+1)=σ(A^H(l)W(l))H^{(l+1)} = \sigma(\hat{A} H^{(l)} W^{(l)})

where A^=D~1/2A~D~1/2\hat{A} = \tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2} is the normalized adjacency with self-loops. This is a sparse matrix multiplication followed by a dense linear transform and nonlinearity, exactly the form that GPUs handle efficiently. The sparsity of AA (most real graphs are sparse: E=O(n)|E| = O(n)) means the computation per layer is O(Ed+nd2)O(|E| \cdot d + n \cdot d^2) rather than O(n2d)O(n^2 d).

GCN and GAT

**Graph Convolutional Network (GCN)** [@kipf2017gcn] uses normalized adjacency for aggregation:

H(l+1)=σ(D~1/2A~D~1/2H(l)W(l))H^{(l+1)} = \sigma\left(\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^{(l)} W^{(l)}\right)

where A~=A+I\tilde{A} = A + I (add self-loops) and D~ii=jA~ij\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}. The normalization D~1/2A~D~1/2\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2} is symmetric and ensures that the aggregation is scale-invariant (the spectral radius is bounded by 1).

**Graph Attention Network (GAT)** [@velickovic2018gat] replaces fixed normalization with learned attention weights:

αij=exp(LeakyReLU(a[WhiWhj]))kN(i){i}exp(LeakyReLU(a[WhiWhk]))\alpha_{ij} = \frac{\exp(\text{LeakyReLU}(a^\top [W h_i \| W h_j]))}{\sum_{k \in \mathcal{N}(i) \cup \{i\}} \exp(\text{LeakyReLU}(a^\top [W h_i \| W h_k]))}

hi(l+1)=σ(jN(i){i}αijWhj(l))h_i^{(l+1)} = \sigma\left(\sum_{j \in \mathcal{N}(i) \cup \{i\}} \alpha_{ij} W h_j^{(l)}\right)

where aR2da \in \mathbb{R}^{2d'} is a learnable attention vector, WRd×dW \in \mathbb{R}^{d' \times d} is a shared linear transform, and \| denotes concatenation. Multi-head attention extends this: hi=k=1Kσ(jαijkWkhj)h_i' = \|_{k=1}^K \sigma(\sum_j \alpha_{ij}^k W^k h_j).

ModelAggregationAttentionComplexity per nodeKey Property
GCNDegree-normalized meanFixed (by degree)$O(\mathcal{N}
GATAttention-weighted sumLearned (pairwise)$O(\mathcal{N}
GraphSAGESample + aggregateMean/LSTM/poolO(kd2)O(k \cdot d^2) (kk = sample size)Scalable (fixed neighborhood)
GINSum (no normalization)None$O(\mathcal{N}
MPNNGeneral (learnable)Optional$O(\mathcal{N}
GATv2 (Brody et al., 2022)Attention-weighted sumDynamic (full expressivity)$O(\mathcal{N}

Expressiveness: The WL Test

The **1-dimensional Weisfeiler-Leman (1-WL)** graph isomorphism test iteratively refines node colors:
  1. Initialize: ci(0)=hash(xi)c_i^{(0)} = \text{hash}(x_i) (based on node features)
  2. Update: ci(l+1)=hash(ci(l),{ ⁣{cj(l):jN(i)} ⁣})c_i^{(l+1)} = \text{hash}(c_i^{(l)}, \{\!\{c_j^{(l)} : j \in \mathcal{N}(i)\}\!\})
  3. Two graphs are declared "possibly isomorphic" if they have the same multiset of final colors

The 1-WL test is a necessary (but not sufficient) condition for graph isomorphism.

**One round of 1-WL on a path graph.** Consider the path $P_4$ with nodes $1\!-\!2\!-\!3\!-\!4$ and no node features, so all nodes start with the same color $c^{(0)} = a$. The neighborhoods are $\mathcal{N}(1) = \{2\}$, $\mathcal{N}(2) = \{1,3\}$, $\mathcal{N}(3) = \{2,4\}$, $\mathcal{N}(4) = \{3\}$.

Round 1. Each node hashes its own color together with the multiset of neighbor colors:

Node(c(0),{ ⁣{neighbors} ⁣})(c^{(0)}, \{\!\{\text{neighbors}\}\!\})New color
1(a,{ ⁣{a} ⁣})(a, \{\!\{a\}\!\})bb
2(a,{ ⁣{a,a} ⁣})(a, \{\!\{a, a\}\!\})cc
3(a,{ ⁣{a,a} ⁣})(a, \{\!\{a, a\}\!\})cc
4(a,{ ⁣{a} ⁣})(a, \{\!\{a\}\!\})bb

The degree-11 endpoints {1,4}\{1, 4\} split off from the degree-22 interior {2,3}\{2, 3\}: after one round the colors already encode node degree.

Round 2 (faded guidance). Now feed the colors c(1)=(b,c,c,b)c^{(1)} = (b, c, c, b) back in. Work out each node's signature (ci(1),{ ⁣{cj(1):jN(i)} ⁣})(c^{(1)}_i, \{\!\{c^{(1)}_j : j \in \mathcal{N}(i)\}\!\}) yourself; you should find that nodes 11 and 44 both get (b,{ ⁣{c} ⁣})(b, \{\!\{c\}\!\}) and nodes 22 and 33 both get (c,{ ⁣{b,c} ⁣})(c, \{\!\{b, c\}\!\}). The induced partition {{1,4},{2,3}}\{\{1,4\}, \{2,3\}\} is identical to round 1, so the coloring has stabilized: no further round can refine it. The test cannot tell node 22 from node 33, which reflects the genuine automorphism of P4P_4 that swaps the two ends, and is exactly the kind of symmetric pair that no message passing GNN can separate either.

**Message passing GNNs are at most as powerful as the 1-WL test** [@xu2019how; @morris2019weisfeiler]. Specifically:
  • If two nodes have different 1-WL colors, a sufficiently expressive MPNN can distinguish them.
  • If two nodes have the same 1-WL colors, no MPNN can distinguish them (regardless of depth or width).

The Graph Isomorphism Network (GIN) achieves the 1-WL upper bound by using sum aggregation and an injective update: hi(l+1)=MLP((1+ϵ)hi(l)+jN(i)hj(l))h_i^{(l+1)} = \text{MLP}((1 + \epsilon) \cdot h_i^{(l)} + \sum_{j \in \mathcal{N}(i)} h_j^{(l)}).

**Going beyond 1-WL.** The 1-WL test (and thus standard MPNNs) cannot count certain substructures (e.g., triangles, cycles of length $> 3$) or distinguish some non-isomorphic regular graphs. Higher-order methods include:
  • kk-WL / kk-FWL: Operate on kk-tuples of nodes; kk-WL for k3k \geq 3 is strictly more powerful
  • Equivariant subgraph GNNs: Process subgraphs around each node
  • Random features: Add random node features to break symmetry (probabilistic)
  • Positional encodings: Use Laplacian eigenvectors or random walk statistics as additional features
  • Graph Transformers: Full pairwise attention (not restricted to edges) achieves higher expressivity

Over-Smoothing

**Over-smoothing** occurs when GNN depth increases and node representations converge to a common vector, losing discriminative power. Formally, as $L \to \infty$:

limL(D~1/2A~D~1/2)LH=(D~1/21)(D~1/21)D~1/212H\lim_{L \to \infty} (\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2})^L H = \frac{(\tilde{D}^{1/2}\mathbf{1})(\tilde{D}^{1/2}\mathbf{1})^\top}{\|\tilde{D}^{1/2}\mathbf{1}\|^2} H

The dominant eigenvector of the symmetric operator D~1/2A~D~1/2\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2} (eigenvalue 11) is uD~1/21u \propto \tilde{D}^{1/2}\mathbf{1}, so the limit is the rank-one orthogonal projector onto uu. All node representations collapse onto this single direction (a degree-weighted average of the input features), losing the information that distinguished them. The equivalent statement for the random walk operator D~1A~\tilde{D}^{-1}\tilde{A} is 1πH\mathbf{1}\pi^\top H, where π\pi is the stationary distribution of the random walk on GG.

**Why over-smoothing happens and how to mitigate it.** Each message passing layer applies a low-pass filter (averaging with neighbors). Repeated application exponentially suppresses high-frequency components. The convergence rate depends on the spectral gap $\lambda_1$ of the normalized Laplacian: larger gap means faster convergence (faster over-smoothing).
MitigationMechanismReference
Residual connectionsh(l+1)=h(l)+GNN(h(l))h^{(l+1)} = h^{(l)} + \text{GNN}(h^{(l)})Analogous to ResNet; preserves input signal
JK (Jumping Knowledge)Concatenate or aggregate all layers: hi=f(hi(1),,hi(L))h_i = f(h_i^{(1)}, \ldots, h_i^{(L)})Uses multi-scale features (Xu et al., 2018)
DropEdgeRandomly remove edges during trainingReduces effective receptive field
PairNormNormalize representations to maintain distancePrevents collapse to constant vector
DeeperGCNPre-activation residual + generalized aggregationEnables training GNNs with 50+ layers
Limit depthUse only 2-4 layersPractical default for most tasks

In practice, most GNN applications use 2-4 layers. Unlike Transformers (which benefit from 100+ layers), GNNs hit diminishing returns quickly because the receptive field grows exponentially with depth, and over-smoothing degrades representations.

Graph-Level Prediction

For graph-level tasks (graph classification, property prediction), node representations must be aggregated into a single graph-level vector:

hG=READOUT({hi(L):iV})h_G = \text{READOUT}(\{h_i^{(L)} : i \in V\})

Common readout functions:

  • Sum: hG=ihih_G = \sum_i h_i (most expressive for distinguishing graphs, but size-dependent)
  • Mean: hG=1Vihih_G = \frac{1}{|V|}\sum_i h_i (size-invariant, but less expressive)
  • Set2Set: Attention-based pooling (learnable, order-invariant)
  • Virtual node: Add a node connected to all others; its final representation is the graph embedding

Applications

DomainTaskGraph StructureTypical Architecture
ChemistryMolecular property predictionAtoms = nodes, bonds = edgesSchNet, DimeNet, GemNet
Drug discoveryDrug-target interactionMolecular + protein graphsHeterogeneous GNN
Social networksCommunity detection, link predictionUsers = nodes, connections = edgesGAT, GraphSAGE
RecommendationUser-item matchingBipartite interaction graphLightGCN, PinSage
Computer visionScene understanding, point cloudsObject/pixel graphsDGCNN, PointNet++
Physics simulationParticle/fluid dynamicsParticles = nodes, interactions = edgesGNS (Sanchez-Gonzalez et al., 2020)
Combinatorial optimizationTSP, schedulingProblem-specific graphsGNN + reinforcement learning
Knowledge graphsLink prediction, QAEntity-relation triplesR-GCN, CompGCN

Notation Summary

SymbolMeaning
G=(V,E)G = (V, E)Graph with vertices and edges
$n =V
AAAdjacency matrix
DDDegree matrix
L=DAL = D - AUnnormalized graph Laplacian
L^=ID1/2AD1/2\hat{L} = I - D^{-1/2}AD^{-1/2}Normalized Laplacian
λk,uk\lambda_k, u_kkk-th eigenvalue/eigenvector of LL
N(i)\mathcal{N}(i)Neighbors of node ii
hi(l)h_i^{(l)}Node ii representation at layer ll
αij\alpha_{ij}Attention coefficient (GAT)
\bigoplusPermutation-invariant aggregation
eije_{ij}Edge features
WLWeisfeiler-Leman graph isomorphism test

References