5
Graphs
In multiple branches of science, from sociology to particle physics,
graphs are a fundamental model of systems of relations and interactions.
Graphs give rise to a basic type of symmetry modelled by the group of
permutations. This is why we start our investigation by studying them.
Most other objects of interest to us, such as grids and sets, can be
obtained as a particular case of graphs, and many architectures we will
discuss here will be expressible in the language of graph neural networks.
This also includes the modern large language model (LLM) stack, which
we present as a case study in this Chapter.
5.1 Domain
In many ways, graphs are the main modality of data we receive from nature.
Indeed, at virtually all levels of organisation of biological and physical
systems—from molecules to the connectomic structure of the neurons in the
brain —graphs offer a natural abstraction; namely, that of an interconnected
network of nodes. Similar conclusions can be drawn about human-designed
systems, such as transportation networks or social networks. In fact, it may
even be argued that the domains where deep learning has traditionally seen most success—such as images, text or speech—can all, at the very least, be seen as special cases of graph structures, and very often as grid-like projections of a far more irregularly structured phenomenon. Furthermore, the
very organisation of cognition seems to be irregularly structured, with con-
cepts freely related and recombined to form new knowledge. All of this taken
into account, it is likely not very controversial to claim that a truly artificially-
intelligent system will necessarily need to know how to process and reason
about graph-structured data. Conveniently, this framework also fits very nicely
in our GDL blueprint, and can be used as a basis for studying many other
geometric domains. Now, it is our turn to study graph representation learning.
[Figure 5.1 graphic: left, "This is a graph, G = (V, E)", drawn with nodes u, v, a, b, c, d; right, "This is its tensorial representation", with node feature matrix X ∈ ℝ^{|V|×k} (rows x_a, x_b, x_c, x_d, x_u, x_v) and adjacency matrix A = (1 0 0 1 0 1; 0 1 0 0 0 1; 0 0 1 1 1 0; 1 0 1 1 0 1; 0 0 1 0 1 1; 1 1 0 1 1 1) ∈ ℝ^{|V|×|V|}.]
Figure 5.1
An undirected graph G = (V, E) containing the set of nodes V = {u, v, a, b, c, d} and
edges E = {(a, d), (a, v), (b, v), (c, d), (c, u), (d, v), (u, v)} (symmetric edges omitted for
brevity). This graph is also represented using tensors (node features X, adjacency matrix
A) for a node permutation g = [a, b, c, d, u, v].
A graph G = (V, E) is a collection of nodes V and edges E ⊆ V × V between pairs of nodes. For the purpose of the following discussion, we will further assume the nodes to be endowed with s-dimensional node features, denoted by x_u for all u ∈ V; see Figure 5.1 for an illustration. Social networks are perhaps
among the most commonly studied examples of graphs, where nodes repre-
sent users, edges correspond to friendship relations between them, and node
features model user properties such as age, profile picture, etc. It is also often
possible to endow the edges, or entire graphs, with features; but as this does
not alter our main findings, we will defer discussing it to Section 5.2.2.
The key structural property of graphs is that the nodes in V are usually not
assumed to be provided in any particular order, and thus any operations per-
formed on graphs should not depend on the ordering of nodes. The desirable
property that functions acting on graphs should satisfy is thus permutation
invariance, and it implies that for any two isomorphic graphs—the likes of
which we saw in Figure 4.4—the outcomes of these functions are identical.
We can see this as a particular setting of our blueprint, where the domain Ω = G and the space X(G, ℝ^d) is that of d-dimensional node-wise signals. The symmetry we consider is given by the permutation group G = S_n, whose elements are all the possible orderings of the set of node indices {1, . . . , n}.
5.1.1 Sets
Let us first illustrate the concept of permutation invariance on sets, a special case of graphs without edges (i.e., E = ∅). By stacking the node features as rows of the n × d matrix X = (x_1, . . . , x_n)^⊤, we do effectively specify an ordering of
the nodes. The action of the permutation g ∈ S_n on the set of nodes amounts to the reordering of the rows of X, which can be represented as an n × n permutation matrix ρ(g) = P, where each row and column contains exactly one 1 and all the other entries are zeros.
A function f operating on this set is then said to be permutation invariant if, for any such permutation matrix P, it holds that f(PX) = f(X). One simple such function is

f(X) = ϕ( Σ_{u∈V} ψ(x_u) ),   (5.46)

where the function ψ is independently applied to every node's features, and ϕ is applied on its sum-aggregated outputs: as sum is independent of the order in which its inputs are provided, such a function is invariant with respect to the permutation of the node set, and is hence guaranteed to always return the same output, no matter how the nodes are permuted. Such a layer is very powerful—under suitable conditions on ψ's output dimensionality, it can be shown that any permutation-invariant set function must be representable using the form in Equation 5.46 (Wagstaff et al. 2019).
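To make Equation 5.46 concrete, here is a minimal NumPy sketch of such a sum-decomposed set function; the specific choices of ψ and ϕ (a single ReLU layer and a single linear map, with hypothetical weights W_psi, W_phi) are our own illustrative assumptions, not prescribed by the text.

```python
import numpy as np

def set_invariant(X, W_psi, W_phi):
    """f(X) = phi(sum_u psi(x_u)); invariant to any reordering of the rows of X."""
    psi = np.maximum(X @ W_psi, 0.0)   # psi applied to every node independently
    pooled = psi.sum(axis=0)           # sum aggregation: independent of input order
    return pooled @ W_phi              # phi applied to the aggregated representation

# Sanity check of permutation invariance on random inputs
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
W_psi, W_phi = rng.normal(size=(3, 8)), rng.normal(size=(8, 2))
P = np.eye(5)[rng.permutation(5)]      # random 5x5 permutation matrix
assert np.allclose(set_invariant(P @ X, W_psi, W_phi), set_invariant(X, W_psi, W_phi))
```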
Functions like the above provide a 'global' set-level output, but very often, we will be interested in functions that act 'locally', in a node-wise manner. For example, we may want to apply some function to update the features in every node, obtaining the set of latent node features. If we stack these latent features into a matrix H, the function F(X) = H is no longer permutation invariant: the order of the rows of H should be tied to the order of the rows of X, so that we know which output node feature corresponds to which input node. We need instead a more fine-grained notion of permutation equivariance, stating that, once we "commit" to a permutation of inputs, it consistently permutes the resulting objects. Formally, F(X) is a permutation equivariant function if, for any permutation matrix P, it holds that F(PX) = PF(X). A shared node-wise linear transform

F_Θ(X) = XΘ,   (5.47)

specified by a weight matrix Θ ∈ ℝ^{d×d}, is one possible construction of such a permutation equivariant function, producing in our example latent features of the form h_u = Θ^⊤ x_u.
This construction arises naturally from our Geometric Deep Learning blueprint. Namely, we can characterise linear equivariants (functions of the form F(PX) = PF(X)), for which it is easy to verify that any such map can be written as a linear combination of two generators: the identity F_1(X) = X and the average F_2(X) = (1/n) 1 1^⊤ X, which assigns to every node the mean feature (1/n) Σ_{u=1}^n x_u. In Section 5.2.3, we show how the popular Deep Sets (Zaheer et al. 2017) architecture follows precisely this idea.
5.1.2 From sets to graphs
We can now generalise the notions of permutation invariance and equivariance from sets to graphs. In the generic setting E ≠ ∅, the graph connectivity can be represented by the n × n adjacency matrix A, defined as

a_uv = { 1 if (u, v) ∈ E; 0 otherwise }.   (5.48)
Note that now the adjacency and feature matrices A and X are "synchronised", in the sense that a_uv specifies the adjacency information between the nodes described by the uth and vth rows of X – cf. Figure 5.1. Therefore, applying a permutation matrix P to the node features X automatically implies applying it to A's rows and columns, PAP^⊤. We say that (a graph-wise function) f is permutation invariant if

f(PX, PAP^⊤) = f(X, A)   (5.49)

and (a node-wise function) F is permutation equivariant if

F(PX, PAP^⊤) = PF(X, A)   (5.50)

for any permutation matrix P.
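Equations 5.49–5.50 can be verified numerically: the sketch below (our own test harness; the function names are hypothetical) applies random permutation matrices to (X, A) and checks both conditions for two simple example functions.

```python
import numpy as np

def random_permutation_matrix(n, rng):
    """An n x n matrix with exactly one 1 per row and column."""
    return np.eye(n)[rng.permutation(n)]

def check_invariant(f, X, A, rng, trials=5):
    """Numerically test f(PX, PAP^T) == f(X, A) for random permutation matrices P."""
    return all(np.allclose(f(P @ X, P @ A @ P.T), f(X, A))
               for P in (random_permutation_matrix(X.shape[0], rng) for _ in range(trials)))

def check_equivariant(F, X, A, rng, trials=5):
    """Numerically test F(PX, PAP^T) == P F(X, A) for random permutation matrices P."""
    return all(np.allclose(F(P @ X, P @ A @ P.T), P @ F(X, A))
               for P in (random_permutation_matrix(X.shape[0], rng) for _ in range(trials)))

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 4))
A = (rng.random((6, 6)) < 0.4).astype(float)
A = np.maximum(A, A.T)                                           # symmetric (undirected) adjacency

print(check_invariant(lambda X, A: X.sum(axis=0), X, A, rng))    # True: feature sum
print(check_equivariant(lambda X, A: A @ X, X, A, rng))          # True: neighbourhood sum
```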
Here again, we can first characterise linear equivariant functions.
As
observed by Maron et al. 2018, any linear F satisfying equation (5.50) can
be expressed as a linear combination of fifteen linear generators; remarkably,
this family of generators is independent of n. Amongst these generators, our
blueprint specifically advocates for those that are also local, i.e., whereby the
output on node u directly depends on its neighbouring nodes in the graph. We
can formalise this constraint explicitly in our model construction, by defining
what it means for a node to be neighbouring another.
An (undirected) neighbourhood of node u, sometimes also called the 1-hop neighbourhood, is defined as

N_u = {v : (u, v) ∈ E or (v, u) ∈ E}   (5.51)

and the neighbourhood features are defined as the multiset

X_{N_u} = {{x_v : v ∈ N_u}}.   (5.52)

Operating on 1-hop neighbourhoods aligns well with the locality aspect of our blueprint: namely, defining our metric over graphs as the shortest path distance between nodes using edges in E.
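As a small illustration of Equations 5.51–5.52, the sketch below recovers N_u and the neighbourhood feature multiset directly from an edge list; representing the multiset as a sorted list of tuples is just a convenient, order-insensitive encoding we chose.

```python
def neighbourhood(u, edges):
    """N_u = {v : (u, v) in E or (v, u) in E} for an edge set given as pairs."""
    return {b if a == u else a for (a, b) in edges if u in (a, b)}

def neighbourhood_features(u, edges, x):
    """X_{N_u} as a multiset of feature vectors, encoded as a sorted list of tuples."""
    return sorted(tuple(x[v]) for v in neighbourhood(u, edges))

# The graph from Figure 5.1, with toy scalar node features
edges = {("a", "d"), ("a", "v"), ("b", "v"), ("c", "d"), ("c", "u"), ("d", "v"), ("u", "v")}
x = {n: [float(i)] for i, n in enumerate("abcduv")}
print(neighbourhood("v", edges))              # {'a', 'b', 'd', 'u'}
print(neighbourhood_features("v", edges, x))  # [(0.0,), (1.0,), (3.0,), (4.0,)]
```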
The GDL blueprint thus yields a general recipe for constructing permutation equivariant functions on graphs, by specifying a local function ϕ that operates over the features of a node and its neighbourhood, ϕ(x_u, X_{N_u}). Then, a permutation equivariant function F can be constructed by applying ϕ to every node's
Figure 5.2
An illustration of constructing permutation-equivariant functions over graphs, by applying a permutation-invariant function ϕ to every neighbourhood. In this case, ϕ is applied to the features x_b of node b as well as the multiset of its neighbourhood features, X_{N_b} = {{x_a, x_b, x_c, x_d, x_e}}. Applying ϕ in this manner to every node's neighbourhood recovers the rows of the resulting matrix of latent features H = F(X, A).
neighbourhood in isolation (see Figure 5.2):

F(X, A) = [ ϕ(x_1, X_{N_1}) ; ϕ(x_2, X_{N_2}) ; … ; ϕ(x_n, X_{N_n}) ].   (5.53)
As F is constructed by applying a shared function ϕ to each node locally, its permutation equivariance rests on ϕ's output being independent of the ordering of the nodes in N_u. Thus, if ϕ is built to be permutation invariant, then this property is satisfied.
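The construction of Equation 5.53 can be sketched directly: apply a shared ϕ to every node and its neighbourhood multiset. The particular ϕ below (concatenating x_u with an order-invariant neighbour sum) is only an example of a permutation-invariant local function, not the unique choice.

```python
import numpy as np

def local_equivariant_layer(X, A, phi):
    """F(X, A): stack phi(x_u, X_{N_u}) over every node u (Equation 5.53)."""
    rows = []
    for u in range(X.shape[0]):
        neigh = np.flatnonzero(A[u])            # indices v with a_uv = 1
        rows.append(phi(X[u], X[neigh]))        # phi must not depend on neighbour order
    return np.stack(rows)

def phi_sum(x_u, X_neigh):
    """Example invariant phi: concatenate x_u with the sum of neighbour features."""
    pooled = X_neigh.sum(axis=0) if len(X_neigh) else np.zeros_like(x_u)
    return np.concatenate([x_u, pooled])

rng = np.random.default_rng(2)
X = rng.normal(size=(6, 3))
A = (rng.random((6, 6)) < 0.4).astype(float)
A = np.maximum(A, A.T)
H = local_equivariant_layer(X, A, phi_sum)      # latent node features, one row per node
```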
It is also worth noticing that the difference between functions defined on
sets and more general graphs in this example is that in the latter case we need
to explicitly account for the structure of the domain. As a consequence, graphs
stand apart in the sense that the domain becomes part of the input in machine
learning problems, whereas when dealing with sets and grids (both particu-
lar cases of graphs) we can specify only the features and assume the domain
to be fixed. This distinction will be a recurring motif in our discussion. As
a result, the notion of geometric stability (invariance to domain deformation)
is crucial in most problems of learning on graphs. It straightforwardly follows
from our construction that permutation invariant and equivariant functions pro-
duce identical outputs on isomorphic (topologically-equivalent) graphs. These
results can be generalised to approximately isomorphic graphs, and several
results on stability under graph perturbations exist (Levie et al. 2018). We will
return to this important point in our discussion on manifolds, which we will
use as a vehicle to study such invariance in further detail.
Second, due to their additional structure, graphs and grids, unlike sets, can be coarsened in a non-trivial way, giving rise to a variety of pooling operations.
5.2 Model
Graph Neural Networks (GNNs) are the realisation of our Geometric Deep
Learning blueprint on graphs leveraging the properties of the permutation
group. GNNs are among the most general class of deep learning architectures
currently in existence, and as we will see in this text, most other deep learning
architectures can be understood as a special case of the GNN with additional
geometric structure.
As per our discussion in Section 5.1, we consider a graph to be specified with an adjacency matrix A and node features X. We will study GNN architectures that are permutation equivariant functions F(X, A) constructed by applying shared permutation invariant functions ϕ(x_u, X_{N_u}) over local neighbourhoods. Under various guises, this local function ϕ can be referred to as "diffusion", "propagation", or "message passing", and the overall computation of such F as a "GNN layer".
5.2.1 The three “flavours” of GNNs
The design and study of GNN layers is one of the most active areas of deep
learning at the time of writing, making it a landscape that is challenging to
navigate. Fortunately, we find that the vast majority of the literature may be
directly derived from only three spatial
“flavours” of GNN layers (Figure
5.3), which we will present here. These flavours govern the extent to which
ϕ transforms the neighbourhood features, allowing for varying degrees of
complexity when modelling interactions across the graph.
In all three flavours, permutation invariance is ensured by aggregating features from X_{N_u} (potentially transformed, by means of some function ψ) with some permutation-invariant function ⊕, and then updating the features of node u, by means of some function ϕ. Typically, ψ and ϕ are learnable, whereas ⊕ is realised as a nonparametric operation such as sum, mean, or maximum, though it can also be constructed e.g. using recurrent neural networks with appropriate constraints (R. L. Murphy et al. 2018).
In the convolutional flavour (Kipf and Welling 2016; Defferrard, Bresson,
and Vandergheynst 2016; F. Wu et al. 2019), the features of the neighbourhood
[Figure 5.3 graphic: three panels over the neighbourhood of node b (with neighbours a, c, d, e)—convolutional, with fixed coefficients c_bu; attentional, with computed coefficients α_bu; and message-passing, with vector messages m_bu.]
Figure 5.3
A visualisation of the dataflow for the three flavours of GNN layers. We use the neighbourhood of node b from Figure 5.2 to illustrate this. Left-to-right: convolutional, where sender node features are multiplied with a constant, c_uv; attentional, where this multiplier is implicitly computed via an attention mechanism of the receiver over the sender: α_uv = a(x_u, x_v); and message-passing, where vector-based messages are computed based on both the sender and receiver: m_uv = ψ(x_u, x_v).
nodes are directly aggregated with fixed weights,

h_u = ϕ( x_u , ⊕_{v∈N_u} c_uv ψ(x_v) ).   (5.54)

Here, c_uv specifies the importance of node v to node u's representation. It is a constant that often directly depends on the entries in A representing the structure of the graph. Note that when the aggregation operator ⊕ is chosen to be the summation, it can be considered as a linear diffusion or position-dependent linear filtering, a generalisation of convolution—giving the flavour its name.
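A minimal NumPy sketch of Equation 5.54 with summation as ⊕. The particular constant c_uv = 1/√(d_u d_v) (symmetric degree normalisation, as used by Kipf and Welling 2016) and the single-layer ψ and ϕ are illustrative assumptions rather than the only valid choices.

```python
import numpy as np

def conv_gnn_layer(X, A, W_psi, W_phi):
    """h_u = phi(x_u, sum_v c_uv psi(x_v)), with fixed c_uv = 1/sqrt(d_u d_v)."""
    A_hat = A + np.eye(A.shape[0])                    # add self-loops so u also sees itself
    deg = A_hat.sum(axis=1)
    C = A_hat / np.sqrt(np.outer(deg, deg))           # fixed aggregation coefficients c_uv
    messages = C @ (X @ W_psi)                        # aggregate linearly transformed neighbours
    return np.maximum(np.concatenate([X, messages], axis=1) @ W_phi, 0.0)   # phi: ReLU layer

rng = np.random.default_rng(3)
X = rng.normal(size=(6, 4))
A = (rng.random((6, 6)) < 0.4).astype(float)
A = np.maximum(A, A.T); np.fill_diagonal(A, 0)
W_psi, W_phi = rng.normal(size=(4, 8)), rng.normal(size=(12, 8))
H = conv_gnn_layer(X, A, W_psi, W_phi)                # (6, 8) updated node features
```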
In the attentional flavour (Veličković et al. 2018; Monti et al. 2017; J. Zhang et al. 2018; Brody, Alon, and Yahav 2022), these interactions are implicit:

h_u = ϕ( x_u , ⊕_{v∈N_u} a(x_u, x_v) ψ(x_v) ).   (5.55)

Here, a is a learnable self-attention mechanism that computes the importance coefficients α_uv = a(x_u, x_v) implicitly. They are often softmax-normalised across all neighbours. When ⊕ is the summation, the aggregation is still a linear combination of the neighbourhood node features, but now the weights are feature-dependent.
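A sketch of Equation 5.55 in the same style, with softmax-normalised attention coefficients; the additive scoring function below is loosely inspired by GAT (Veličković et al. 2018), but its exact parametrisation here (vectors a_src, a_dst) is our own simplifying assumption.

```python
import numpy as np

def softmax_rows(Z):
    Z = Z - Z.max(axis=-1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=-1, keepdims=True)

def attn_gnn_layer(X, A, W_psi, a_src, a_dst, W_phi):
    """h_u = phi(x_u, sum_v alpha_uv psi(x_v)), alpha_uv softmax-normalised over N_u."""
    Z = X @ W_psi                                                    # psi(x_v) for every node
    scores = np.tanh(Z @ a_dst[:, None] + (Z @ a_src[:, None]).T)    # a(x_u, x_v), shape (n, n)
    A_hat = A + np.eye(A.shape[0])
    scores = np.where(A_hat > 0, scores, -1e9)                       # mask out non-neighbours
    alpha = softmax_rows(scores)                                     # normalise per receiver u
    messages = alpha @ Z
    return np.maximum(np.concatenate([X, messages], axis=1) @ W_phi, 0.0)

rng = np.random.default_rng(4)
X = rng.normal(size=(6, 4))
A = (rng.random((6, 6)) < 0.4).astype(float)
A = np.maximum(A, A.T); np.fill_diagonal(A, 0)
W_psi = rng.normal(size=(4, 8))
a_src, a_dst = rng.normal(size=8), rng.normal(size=8)
W_phi = rng.normal(size=(12, 8))
H = attn_gnn_layer(X, A, W_psi, a_src, a_dst, W_phi)
```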
Finally, the message-passing flavour (Battaglia et al. 2016; Gilmer et al. 2017; Battaglia et al. 2018) amounts to computing arbitrary vectors ("messages") across edges,

h_u = ϕ( x_u , ⊕_{v∈N_u} ψ(x_u, x_v) ).   (5.56)

Here, ψ is a learnable message function, computing v's vector sent to u, and the aggregation can be considered as a form of message passing on the graph.
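And a sketch of the fully general flavour of Equation 5.56, where the message function ψ sees both endpoints of each edge; the explicit Python loop keeps the correspondence with the equation obvious, at the cost of speed, and the specific single-layer ψ and ϕ are again placeholder choices.

```python
import numpy as np

def mpnn_layer(X, A, psi, phi):
    """h_u = phi(x_u, sum_{v in N_u} psi(x_u, x_v))."""
    out = []
    for u in range(X.shape[0]):
        msgs = [psi(X[u], X[v]) for v in np.flatnonzero(A[u])]
        agg = np.sum(msgs, axis=0) if msgs else np.zeros_like(psi(X[u], X[u]))
        out.append(phi(X[u], agg))
    return np.stack(out)

rng = np.random.default_rng(5)
W_m, W_u = rng.normal(size=(8, 8)), rng.normal(size=(12, 8))
psi = lambda x_u, x_v: np.maximum(np.concatenate([x_u, x_v]) @ W_m, 0.0)  # message function
phi = lambda x_u, m: np.concatenate([x_u, m]) @ W_u                       # update function

X = rng.normal(size=(6, 4))
A = (rng.random((6, 6)) < 0.4).astype(float)
A = np.maximum(A, A.T); np.fill_diagonal(A, 0)
H = mpnn_layer(X, A, psi, phi)
```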
One important thing to note is a representational containment between these approaches: convolution ⊆ attention ⊆ message-passing. Indeed, attentional GNNs can represent convolutional GNNs by an attention mechanism implemented as a look-up table a(x_u, x_v) = c_uv, and both convolutional and attentional GNNs are special cases of message-passing where the messages are only the sender nodes' features: ψ(x_u, x_v) = c_uv ψ(x_v) for convolutional GNNs and ψ(x_u, x_v) = a(x_u, x_v)ψ(x_v) for attentional GNNs. Note that the latter containment holds even if the attentional coefficients are normalised across neighbours; we will explore this further in Section 5.4.1.
This does not imply that message passing GNNs are always the most useful
variant; as they have to compute vector-valued messages across edges, they are
typically harder to train and require unwieldy amounts of memory. Further, on
a wide range of naturally-occurring graphs, the graph’s edges encode for down-
stream class similarity (i.e. an edge (u, v) implies that u and v are likely to have
the same output). For such graphs (often called homophilic), convolutional
aggregation across neighbourhoods is often a far better choice, both in terms
of regularisation and scalability. Attentional GNNs offer a “middle-ground”:
they allow for modelling complex interactions within neighbourhoods while
computing only scalar-valued quantities across the edges, making them more
scalable than message-passing.
The “three flavour” categorisation presented here is provided with brevity
in mind and inevitably neglects a wealth of nuances, insights, perspectives,
generalisations, and historical contexts to GNN models. However, based on
arguments we will present in the following subsections, its expressivity may
be sufficient to represent any graph function of interest.
5.2.2 A maximally potent GNN: the Graph Network model
We started out our discussion by deliberately assuming the graph only had
features in the nodes; however, for many domains of scientific and industrial
interest, we may be interested in featurising the graph at other granularities.
One very important such context is computational chemistry: when working with molecular data, we may expect to see, simultaneously, node features, x_u, such as atom type and charge; edge features, x_uv, such as bond types and whether the bond is part of a ring; and graph features, x_G, such as the molecular weight or fingerprints. We may then be interested in using a GNN to compute representations h_u, h_uv, h_G at all of these levels.
To illustrate how all the concepts we introduced before generalise to this set-
ting, we present the Graph Network architecture (Battaglia et al. 2018). It is a
[Figure 5.4 graphic: inputs x_u, x_v, x_uv, x_G feed into ψ to produce edge representations h_uv; these are aggregated over v ∈ N_u and combined with x_u and x_G by ϕ to produce node representations h_u; finally, node and edge representations are aggregated over u ∈ V and (u, v) ∈ E and combined with x_G by ρ to produce h_G.]
Figure 5.4
The dataflow of the Graph Network model (Battaglia et al. 2018), integrating node
features (in blue), edge features (in green) and graph features (in red) to compute appro-
priate representations. Equivariant (blue) and invariant (red) layers feature prominently
across the Graph Network’s three functions ψ, ϕ and ρ.
general blueprint for building permutation equivariant functions over attributed
graphs, and all of the flavours we discussed can be obtained by an appropriate
restriction of its equations. It proceeds in three phases, at each step using all of
the relevant information (Figure 5.4).
First, edge representations, h_uv, are computed using the features of the relevant two nodes (u and v), the features of the edge, and the features of the graph:

h_uv = ψ(x_u, x_v, x_uv, x_G),   (5.57)

where ψ would be analogous to the previously defined message function.
Treating h_uv as messages, we can compute node representations. The Graph Network takes into account the node's features, the aggregated edge representations from all of the neighbours, and the graph features:

h_u = ϕ( x_u , ⊕_{v∈N_u} h_uv , x_G ).   (5.58)
Finally, since graph-level representations are a global property of the graph, they need to be computed invariantly. Embodying the spirit of the geometric deep learning blueprint, we aggregate all the node and edge representations to compute h_G:

h_G = ρ( ⊕_{u∈V} h_u , ⊕_{(u,v)∈E} h_uv , x_G ).   (5.59)
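To tie Equations 5.57–5.59 together, here is a compact sketch of one Graph Network block, using sums for every ⊕ and single linear maps for ψ, ϕ and ρ; these parametrisations (and the weight shapes in the usage example) are placeholders—the framework itself leaves them free.

```python
import numpy as np

def graph_network(X, edges, X_e, x_g, W_psi, W_phi, W_rho):
    """One Graph Network block: edge update (5.57), node update (5.58), graph update (5.59).

    X: (n, d) node features; edges: list of (u, v) pairs; X_e: (m, d_e) edge features;
    x_g: (d_g,) graph features.
    """
    n = X.shape[0]
    # (5.57): h_uv = psi(x_u, x_v, x_uv, x_G)
    H_e = np.stack([np.concatenate([X[u], X[v], X_e[k], x_g]) @ W_psi
                    for k, (u, v) in enumerate(edges)])
    # (5.58): h_u = phi(x_u, sum of incident edge representations, x_G)
    agg = np.zeros((n, H_e.shape[1]))
    for k, (u, v) in enumerate(edges):
        agg[u] += H_e[k]                       # each edge (u, v) contributes to receiver u here
    H_n = np.stack([np.concatenate([X[u], agg[u], x_g]) @ W_phi for u in range(n)])
    # (5.59): h_G = rho(sum_u h_u, sum_uv h_uv, x_G)
    h_g = np.concatenate([H_n.sum(axis=0), H_e.sum(axis=0), x_g]) @ W_rho
    return H_e, H_n, h_g

rng = np.random.default_rng(6)
X, X_e, x_g = rng.normal(size=(4, 4)), rng.normal(size=(3, 2)), rng.normal(size=3)
edges = [(0, 1), (1, 2), (2, 3)]
W_psi = rng.normal(size=(4 + 4 + 2 + 3, 8))    # psi input: x_u, x_v, x_uv, x_G
W_phi = rng.normal(size=(4 + 8 + 3, 8))        # phi input: x_u, aggregated h_uv, x_G
W_rho = rng.normal(size=(8 + 8 + 3, 8))        # rho input: pooled h_u, pooled h_uv, x_G
H_e, H_n, h_g = graph_network(X, edges, X_e, x_g, W_psi, W_phi, W_rho)
```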
Lastly, the reader may be curious whether such a framework truly is maximally potent—could it, for example, also support structures such as hypergraphs, where more than two nodes may be involved in a hyperedge? We will ponder this question from several angles in the following sections.
5.2.3 Deep Sets, Transformers, and Graph Rewiring
The “three flavours” view of GNNs presented thus far largely concerns itself
with the design of their message and update functions (ψ and ϕ). However,
looking at the general message passing equation (Equation 5.56), there are in
fact two other "moving parts": the choice of aggregator, ⊕, and the choice of neighbourhoods, N_u. In the process of analysing these choices, it turns out we can form tight connections between GNNs and many other influential neural network architectures. We will begin by analysing the choice of N_u.
In fact, the choice of N_u is directly tied to the input graph structure we are assuming. Thus far, we presumed that N_u is given to us, and we did not question its structure. However, such a situation is in reality quite rare: many graphs
we observe in the real world are merely approximations of the “ground-truth”
interactions between the nodes. For example, in a social network, we only
observe the friendship links users have explicitly formed, rather than the com-
plete collection of friendship interactions between those people. And within a
molecule, there exist physical interactions between all pairs of atoms—not just
ones explicitly connected by a chemical bond. This example becomes espe-
cially prominent when we consider interactions within macromolecules such as proteins, which fold in 3D space, allowing atoms that are quite distant in the molecular graph to interact more directly.
Beyond the need to handle such approximations to ground-truth graphs, it
is also well-understood that certain choices of graph are simply pathologically
bad and their use will lead to issues regardless of the GNN architecture we
use (Alon and Yahav 2020; Di Giovanni et al. 2023). One typical instance of
this arises when the graph has bottlenecks—small collections of edges which
are responsible for carrying representations between large groups of nodes. For
example, in a barbell graph—two densely connected communities joined by a single edge—that bridging edge is under significant representational pressure to transport information between the nodes in the two communities. If it is important for those communities to directly
exchange features, it is possible to prove that in many cases, GNNs will not
even be able to fit their training data accurately under such a graph.
To study various standard choices of N_u without discussing properties of any given input graphs, we will now briefly return to the domain of unordered sets. In the language of Section 5.1, we assume that we are given a matrix of
Figure 5.5
Three strategies for choosing an edge set for a GNN, left-to-right: empty, leading to Deep Sets (Zaheer et al. 2017); complete, leading to Transformers (Vaswani et al. 2017); and precomputed, leading to EGP (Deac, Lackenby, and Veličković 2022). In all cases we highlight one of the nodes' neighbourhoods, N_1, in red.
node features, X, but without any specified adjacency or ordering information
between the nodes. The specific architectures will arise by deciding to what
extent to model interactions between the nodes; see Figure 5.5.
5.2.3.1 Empty edge set Unordered sets are provided without any additional
structure or geometry whatsoever—hence, it could be argued that the most
natural way to process them is to treat each set element entirely independently.
This translates to a permutation equivariant function over such inputs, which
was already introduced in Section 5.1: a shared transformation applied to every
node in isolation. Assuming the same notation as when describing GNNs, such
models can be represented as

h_u = ψ(x_u),   (5.60)

where ψ is a learnable transformation. It may be observed that this is a special case of a convolutional GNN with N_u = {u}—or, equivalently, A = I. Such an architecture is commonly referred to as Deep Sets, in recognition of the work of Zaheer et al. (2017), which theoretically proved several universal-approximation properties of such architectures. It should be noted that the need
to process unordered sets commonly arises in computer vision and graphics
when dealing with point clouds; therein, such models are known as PointNets
(Qi et al. 2017). Philosophically, it may also be noted that this architecture is
the furthest we can go with “pure” sets; any additional inputs to ψ imply that
there exist connections between the nodes, making our domain a graph.
5.2.3.2 Complete edge set While assuming an empty edge set is a very
efficient construct for building functions over unordered sets, often we would
expect that elements of the set exhibit some form of relational structure—
i.e., that there exists a latent graph between the nodes. Setting A = I discards
any such structure, and may yield suboptimal performance. Conversely, we
could assume that, in the absence of any other prior knowledge, we cannot upfront
exclude any possible links between nodes. In this approach we assume the
complete graph, A = 11^⊤; equivalently, N_u = V. As we do not assume access
to any coefficients of interaction, running convolutional-type GNNs over such
a graph would amount to:
h_u = ϕ( x_u , ⊕_{v∈V} ψ(x_v) ),   (5.61)

where the second input, ⊕_{v∈V} ψ(x_v), is identical for all nodes u, and as such makes the model equivalently expressive to ignoring that input altogether; i.e. the A = I case mentioned above.
This motivates the use of a more expressive GNN flavour, the attentional,
h_u = ϕ( x_u , ⊕_{v∈V} a(x_u, x_v) ψ(x_v) ),   (5.62)

which yields the self-attention operator, the core of the Transformer architecture (Vaswani et al. 2017). Assuming some kind of normalisation over the attentional coefficients (e.g. softmax), we can constrain all the scalars a(x_u, x_v) to be in the range [0, 1]; as such, we can think of self-attention as inferring a soft adjacency matrix, a_uv = a(x_u, x_v), as a byproduct of gradient-based optimisation for some downstream task.
The above perspective means that we can pose Transformers exactly as attentional GNNs over a complete graph (Joshi 2020).
However, this is in apparent conflict with Transformers being initially proposed for modelling sequences—the representations h_u should be mindful of node u's position in the sequence, which complete-graph aggregation would ignore. Transformers address this issue by introducing positional encodings: modulating either the node features x_u or the computed attention coefficients a_uv in a way that depends on the absolute or relative positions of u and v in the sequence. The original Transformer architecture, for example, adds to x_u a vector of sine and cosine waves evaluated at position u, with frequencies that vary across the feature dimensions.
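For concreteness, here is a sketch of the sinusoidal positional encodings used by the original Transformer, added to the token features x_u; the base constant 10000 and the sine/cosine interleaving follow Vaswani et al. (2017), while the even dimensionality is assumed for simplicity.

```python
import numpy as np

def sinusoidal_positional_encoding(num_positions, dim):
    """PE[u, 2i] = sin(u / 10000^(2i/dim)); PE[u, 2i+1] = cos(u / 10000^(2i/dim))."""
    positions = np.arange(num_positions)[:, None]              # u = 0, 1, ..., n-1
    freqs = 1.0 / (10000.0 ** (np.arange(0, dim, 2) / dim))    # one frequency per sin/cos pair
    angles = positions * freqs                                 # shape (num_positions, dim // 2)
    pe = np.zeros((num_positions, dim))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

X_tokens = np.zeros((5, 16))                                   # hypothetical token embeddings x_u
X_tokens = X_tokens + sinusoidal_positional_encoding(5, 16)    # additive positional modulation
```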
On graphs, where no natural ordering of nodes exists, multiple alternatives to such positional encodings have been proposed. One particularly promising
direction involves a realisation that the positional encodings used in ‘vanilla’
Transformers can be directly related to the discrete Fourier transform (DFT),
and hence to the eigenvectors of the graph Laplacian of a “circular grid”.
Hence, such positional encodings are implicitly representing our assumption
that input nodes are connected in a grid. For more general graph structures,
one may simply use the Laplacian eigenvectors of the (assumed) graph—an
observation exploited by Dwivedi and Bresson (2020) within their empirically
powerful Graph Transformer model. Designing Transformers for graphs is a
very active and important area of research; we direct the reader to Müller et
al. (2023) for an excellent overview.
Finally, we note that while our discussion might have been sufficient to
describe many important aspects of self-attention, additional careful details
are needed to relate the modern large language model (LLM) stack to our
framework. We cover this in detail as a case study in Section 5.4.
5.2.3.3 Inferred edge set Finally, one can try to learn the latent relational
structure, leading to some general A that is neither I nor 11^⊤. The problem
of inferring a latent adjacency matrix A for a GNN to use (often called latent
graph inference) is of high interest for graph representation learning. This is
due to the fact that assuming A = I may be representationally inferior, and
A = 11^⊤ may be challenging to implement due to memory requirements and
large neighbourhoods to aggregate over. Additionally, it is closest to the “true”
problem: inferring an adjacency matrix A implies detecting useful structure
between the rows of X, which may then help formulate hypotheses such as
causal relations between variables.
Unfortunately, such a framing necessarily induces a step-up in modelling
complexity. Specifically, it requires properly balancing a structure learning
objective (which is discrete, and hence challenging for gradient-based opti-
misation) with any downstream task the graph is used for. This makes latent
graph inference a highly challenging and intricate problem. Accordingly, many
of the current popular methods in this space side-step the issue of learnability
by using fully nonparametric methods (Wang et al. 2019), or pre-computing
the graph structure to use (Deac, Lackenby, and Veličković 2022; Shirzad et al. 2023). These methods are also popular when modifying a given input
graph to e.g. relieve bottlenecks—an approach commonly referred to as (non-
parametric) graph rewiring (Gasteiger, Weißenberger, and Günnemann 2019;
Topping et al. 2021).
5.2.4 The expressive power of GNNs, and the Weisfeiler-Lehman test
The remaining design choice of GNNs to discuss is the choice of aggregation function, ⊕. While studying the choice of neighbourhoods allowed us to relate many popular architectures to GNNs, studying aggregators will allow us to reason about the GNNs' expressive power, as we will now discover.
Figure 5.6
How the Weisfeiler-Lehman algorithm featurises a given graph of five nodes and six edges, with categorical features drawn from a one-element set (i.e., effectively a featureless graph). At each step, the colours are refined using injective mappings of neighbourhood multisets, until the distribution of colours stops changing—in this case, yielding a multiset of five node colours. This histogram of colours can then be compared against another graph's.
It would be very useful to understand whether the GNN flavours we dis-
cussed are capable of representing any permutation equivariant function we
care about over graphs, or if there exists a need to study alternative frame-
works. To quantify the kinds of functions a GNN can express, we typically
study them from the perspective of deciding graph isomorphism: given two non-isomorphic graphs, G_1 and G_2, and a GNN embedding them into a latent space h, does it hold that h_{G_1} ≠ h_{G_2}? If not, any kind of distinct classification of these two graphs is hopeless.
To simplify our analysis, we will assume that the graphs have categorical node features, x_u ∈ {0, 1}^k; that is, these features are one-hot vectors of k dimensions. We can interpret this as a colouring of their nodes. Recall that, in our derivation so far, we express GNNs using the abstract formula h_u = ϕ(x_u, X_{N_u}), where the features of the neighbourhood are represented as ⊕_{v∈N_u} ψ(x_u, x_v). In order to maximise the capacity of the GNN, note that h_u will retain maximal information about the graph—and hence, have maximal expressive capacity—when ϕ is injective in its inputs: that is, two different neighbourhood multisets must be mapped to different vectors by ϕ.
In the categorical-features case, we can obtain injectivity by choosing a very simple message function: ψ(x_u, x_v) = x_v, and then picking an aggregator ⊕ that preserves all multiset statistics of X_{N_u}, especially around repeated elements. Such an aggregator is summation (⊕ = Σ), as it will preserve information about all cardinalities of repeated items in the multiset. Weaker
aggregators, such as averaging or maximisation, may reduce these cardinalities to their relative ratios ((1/|N_u|) Σ), or eliminate them altogether (max). Xu et al. (2018) leverage this insight to derive the graph isomorphism network (GIN), a maximally-expressive GNN over categorical node features:

h_u = ϕ( (1 + ϵ) x_u + Σ_{v∈N_u} x_v ),   (5.63)

where ϵ ∈ ℝ is a learnable scalar designed to give an injective representation to the receiver node (u), and ϕ is a deep multilayer perceptron (which can represent an injective function). The expressive power of such GNNs is then equivalent to that of the Weisfeiler-Lehman graph isomorphism test (Weisfeiler and Leman 1968), a classical algorithm in graph theory which iteratively refines the colours of the nodes using an (injective) hash of the neighbours' colours (Morris et al. 2019). An illustration of how this algorithm operates is provided in Figure 5.6.
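A sketch of one-dimensional Weisfeiler-Lehman colour refinement, where the injective "hash" is implemented by interning each (colour, sorted neighbour-colour multiset) pair into a dictionary; the example reproduces the 6-cycle versus two-triangles failure case discussed in the next subsection.

```python
def wl_refine(adj, iters=10):
    """1-WL: adj maps each node to its set of neighbours; returns a stable colouring."""
    colours = {u: 0 for u in adj}                  # featureless graph: a single initial colour
    for _ in range(iters):
        signatures = {u: (colours[u], tuple(sorted(colours[v] for v in adj[u]))) for u in adj}
        palette = {sig: i for i, sig in enumerate(sorted(set(signatures.values())))}
        refined = {u: palette[signatures[u]] for u in adj}
        if len(set(refined.values())) == len(set(colours.values())):
            return refined                         # number of colour classes stopped growing
        colours = refined
    return colours

# A 6-cycle and two disjoint triangles: every neighbourhood looks locally identical,
# so both graphs keep a single colour and 1-WL cannot tell them apart.
cycle6 = {i: {(i - 1) % 6, (i + 1) % 6} for i in range(6)}
triangles = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}, 3: {4, 5}, 4: {3, 5}, 5: {3, 4}}
print(sorted(wl_refine(cycle6).values()) == sorted(wl_refine(triangles).values()))  # True
```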
5.2.5 *More expressive GNNs
While the equivalence of message passing (as described in Equation 5.56)
to the Weisfeiler-Lehman test is certainly a satisfying result, it leaves much
to be desired. For example, it implies that message passing neural networks
provably cannot distinguish a 6-cycle from two disjoint triangles; the
colour refinement procedure will never update any colour, as all neighbour-
hoods "locally look the same". In some applications, this is a major drawback: for example, in organic chemistry, where molecules are modelled as graphs, cyclic structures such as aromatic rings are abundant and important.
There are multiple ways to extend the expressive power of GNNs that we
will discuss next; note that none of them invalidates the equivalence of GNNs to the WL test, as they modify the assumptions concerning the input provided to the GNN. In fact, we may often leverage specific limiting
examples—like the 6-cycle/triangle one above—to derive meaningful ways in
which such limitations can be overcome in general.
5.2.5.1 Feature augmentation We start from perhaps the most obvious
reason why the WL test may fail: too many neighbourhoods being locally
identical. This issue arises because there are no distinguishing features in the
nodes; sometimes, the graph may even be featureless, as we discussed. It also
immediately hints at a cheap way to surpass this limitation: simply augment
the graph with more descriptive features.
An approach like this “pre-colours” different nodes (or edges) of the graph
by attaching some additional features to them, giving the WL algorithm (and
hence also GNNs!) additional information to work with when distinguish-
ing non-isomorphic graphs. These pre-colours do not have to be particularly
meaningful— already randomly featurising the nodes is provably sufficient to
increase expressive power (Sato, Yamada, and Kashima 2021). Random fea-
tures allow message passing to identify a node in its own k-hop neighbourhood,
empowering GNNs to detect cycles of a certain cardinality.
Naturally, feature augmentation also works very well when performed in a
more informed manner (R. Murphy et al. 2019); it is even possible to count
substructures we expect to care about (e.g. aromatic rings in molecules) and
use these counts for additional featurisation (Bouritsas et al. 2022).
5.2.5.2 Message passing modulation While introducing additional struc-
tural features boosts expressive power, the GNN is not committed to using
such features. As such, another promising class of models injects such features
directly into the equation of the message function ψ; in that way, the model is
forced to take this information into account.
There are many representative instances of such modulated message mech-
anisms; we call out two early successful examples. Firstly, directional graph
networks (Beaini et al. 2021), wherein Laplacian eigenvectors are used to
define canonical “flows” on a graph, and they explicitly guide the scaling
factors of various incoming messages. Secondly, the LSPE model (Dwivedi
et al. 2021) has dedicated message passing over the positional features it
leverages, once again coercing the model to process them nontrivially.
Such ideas have also permeated into graph Transformers, with perhaps
the most prominent example being the Graphormer (C. Ying et al. 2021).
Graphormers compute all-pairs shortest paths between nodes in the graph, and
use them as an explicit bias for the self-attention operator.
5.2.5.3 Computation graph design While pre-computing interesting
graph properties and then leveraging them (either as additional features or
through modifying messages directly) is an attractive approach, it also arguably
requires a degree of feature engineering. It might be more favourable to
empower the model to extract meaningful structure by itself—but it provably
cannot do so over many input graphs, owing to the equivalence to the WL test.
This motivates a broad class of methods that modify the input graph of a GNN
in a way that will allow message passing to better detect certain substructures.
We’ve already covered some of these approaches in Section 5.2.3.3 under
the umbrella of “graph rewiring”; one key difference is that approaches we
will discuss here often introduce additional nodes into the graph. Such nodes
(often referred to as virtual nodes) carry representations that promote structure
[Figure 5.7 graphic: an input graph on nodes 1, 2, 3 and three sampled subgraphs, each obtained by deleting one edge (1↮3, 1↮2, 2↮3), shown alongside their adjacency matrices; the node symmetry S_n acts within each subgraph and the subgraph symmetry S_m acts on the collection of subgraphs.]
Figure 5.7
A subgraph GNN operates over several sampled subgraphs of the given input graph simultaneously. ESAN (Bevilacqua et al. 2021) is a subgraph GNN that operates in a way that respects both the permutation symmetries of the nodes (S_n) and the subgraphs (S_m)—hence, it is equivariant to the product group S_n × S_m.
discovery; when appropriately placed, they can provably improve the quality of the GNN's dataflow (Wilson, Bechler-Speicher, and Veličković 2024).
Perhaps the easiest place to start tracking these approaches is by study-
ing extensions to the Weisfeiler-Lehman test itself. It was indeed extended
in the late 1970s to a hierarchy of increasingly powerful higher-dimensional k-WL tests that operate on k-tuples of nodes. Christopher Morris et al. (2019) and Haggai Maron et al. (2018; 2019) showed different designs of higher-dimensional GNN architectures equivalent to k-WL tests; however, such models lose the advantage of locality and linear complexity of the standard message-passing GNN. While such approaches featurise all relevant k-tuples indiscriminately, it is also possible to perform
more targeted message passing over specific topological substructures in the
graph, such as simplicial and cellular complexes of the graph (Bodnar, Frasca,
Wang, et al. 2021; Bodnar, Frasca, Otter, et al. 2021).
While approaches inspired by the above typically add nodes and edges into
the graph, it is also possible to recover improved expressive power by selec-
tively deleting nodes or edges—effectively, by message passing over selected
subgraphs of the input graph (Figure 5.7). One particularly interesting repre-
sentative subgraph GNN from the perspective of equivariances is ESAN, the equivariant subgraph aggregation network (Bevilacqua et al. 2021). ESAN operates over multisets of subgraphs, whose symmetry arises both from the structure of each constituent subgraph and from the multiset as a whole.
The principle of equivariance requires that changing the order of the m subgraphs in the multiset (permutation group S_m) and the order of the n nodes in the subgraphs (permutation group S_n) yields an equivalent representation; in other words, the architecture is equivariant w.r.t. the product group G = S_m × S_n. This example serves as a relevant instance of combining symmetries within the same architecture; we will explore more involved instances of this when we discuss geometric graphs in later Chapters.
5.2.5.4 Synchronisation and “clocks” Thus far, we have discussed sev-
eral ways in which the message passing equation may be edited: its parametric
functions (ψ, ϕ), aggregation function (⊕), features (x_u), and computation graph (N_u). It may seem as if we've covered all the angles possible by doing
so, but in fact, there remains one hidden but important aspect – the synchroni-
sation between the different steps of the mechanism. In reality, Equation 5.56
stores embeddings over time:

h_u^{(t+1)} = ϕ( x_u^{(t)} , ⊕_{v∈N_u} ψ( x_u^{(t)}, x_v^{(t)} ) ),   (5.64)

where t ∈ ℕ identifies the current time-step of the GNN, and x_u^{(t)} corresponds to the GNN input embeddings available at time t—note that we may simply equate x_u^{(t+1)} = h_u^{(t+1)} in most cases, though they may also differ.
In almost all cases, Equation 5.64 is executed synchronously: all h_u^{(t+1)} embeddings are computed simultaneously, in parallel – in essence, there is no
distinction between the notion of time-step and the notion of GNN layer. But
what if we relaxed this assumption, and allowed specific subsets of messages
to be computed or aggregated at one step, rather than all-at-once?
It turns out that this modification, even when only one message at a time
is processed (see Figure 5.8), can yield improvements to expressive power.
This effect was first studied and proved by Faber and Wattenhofer (2024) for
their GwAC asynchronous architecture. The argument is as follows: clearly,
an asynchronous GNN can match the execution of a synchronous one—we
just need to use an execution order that matches the relevant synchronisation
points. As such, the model is at least 1-WL expressive. Further, for specific
execution orders, it can be shown that it is possible to exactly trace the outline
of particular cycle structures in the graph, or even exactly identify each node
(as we did with random features previously), allowing it to discern structures that
1-WL cannot.
GwAC does not take a very strong stance on the execution order in its asyn-
chronous trace, but several other architectures have since emerged that do. One
[Figure 5.8 graphic: an asynchronous execution trace over nodes a, b, c, d, one operation per clock tick—m_ba = ψ(x_b, x_a); m_bc = ψ(x_b, x_c); x_b = ϕ(x_b, m_bc); m_bb = ψ(x_b, x_b); x_b = ϕ(x_b, m_ba); m_db = ψ(x_d, x_b); x_b = ϕ(x_b, m_bb); x_d = ϕ(x_d, m_db).]
Figure 5.8
An execution trace of a fully-asynchronous GNN: at each time-step (denoted by
advancing clock), exactly one message is either computed or aggregated. Note that
the snapshot of the node features dynamically changes whenever ϕ is invoked.
such architecture is the Co-GNN (Finkelshtein et al. 2024), wherein each node
can select whether to send or receive messages at a particular GNN layer.
While powerful, asynchronous GNNs do not neatly align with the currently
prevalent massively-parallel approaches to machine learning models. There-
fore, a middle-ground approach that can exploit such parallelism, while still
having clear traits of asynchronous computation, would be very attractive.
Dudzik et al. (2024) have shown that it is indeed possible to build such
a model, through the concept of asynchrony invariance: keeping the model
fully synchronous, but constraining it to give identical embeddings even if we
made its execution trace asynchronous in certain ways. This approach neatly
resonates with the ideas of symmetry that we explored thus far; however, it
requires working with generalisations of groups (monoids), and as such we
will only study it significantly later in this book—namely, in its final Chapter.
5.2.5.5 Continuous node features There is no doubt that the scaffold
offered by the k-WL hierarchy has had an outsized impact on the development
of graph representation learning. As we’ve just seen, this algorithm family,
and the observation that GNNs naturally align to it, has informed countless
contributions in feature engineering, architecture design, graph rewiring, and
even asynchronous models. Its flexibility does not come without limitations,
however—and perhaps the most glaring of these is its assumption of discrete node features. For many scientific and industrial tasks arising from real-world data, the features we have available will often be real-valued. We will focus only on this particular extension, and how it leads to transcending the WL hierarchy. That said, we remark that several other approaches not well-aligned with k-WL have emerged in recent times; e.g., Tönshoff et al. (2021).
While Xu et al. (2018) left the study of aligning GNNs with real-valued features for deciding graph isomorphism to future work, it was already evident from the work of Zaheer et al. (2017) that extending such results even in the domain of sets was going to be difficult. The essence of the difficulties governing this transition was finally exposed and settled by Corso et al. (2020). The
key issue arises in the larger flexibility of the objects that need to be injectively
aggregated: while in the WL setting we dealt with aggregating multisets of one-
hot vectors—and hence all we had to track for injectivity was the preservation
of their cardinalities—with real-valued features, non-isomorphic graphs can
have clashing embeddings from unexpected directions.
To illustrate this, let us first recall that, according to 1-WL, aggregator choices for ⊕ are ordered by expressivity as max < (1/|N_u|) Σ < Σ, with Σ being maximally expressive. Now, assume that our inbound messages are in ℝ rather than {0, 1}^k. It is relatively straightforward to play around with examples of real-valued, non-isomorphic multisets where not only this hierarchy is broken, but some may not be injectively distinguished by any of these three aggregators:
• {{4, 1, 1}} vs. {{3, 3, 0}}: distinguished by max but not by mean or sum.
• {{4, 2, 2, 0}} vs. {{4, 4, 0, 0}}: cannot be distinguished by max, mean or sum—however, they can be distinguished by standard deviation.
Following these leads, with a little help from standard facts in algebraic topology, it is possible to show that no single aggregator choice for ⊕ is sufficient to injectively embed real-valued neighbourhoods. In fact, it can be shown that the number of aggregators, N, needed for injectivity is equal to the maximal neighbourhood size, max_{u∈V} |N_u|, in our dataset. While it is not yet clear what property these collections of aggregators need to satisfy to be injective, Corso et al. (2020) showed that the first N moments of the neighbourhood multiset (whose elements are modelled using a random variable X), defined as follows:

M_1 = E(X),    M_n = ⁿ√(E[(X − E(X))^n])   (n > 1),   (5.65)
would serve as a sufficient collection of aggregators. They leveraged this insight to propose the principal neighbourhood aggregation (PNA) architecture, which retains only the numerically stable moments (n ∈ {1, 2, ∞}, amounting to mean, standard deviation, min and max).
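As a small illustration, the sketch below aggregates a real-valued neighbourhood with the four moment-based aggregators named in the text (mean, standard deviation, min, max); concatenating them is our simplifying assumption—PNA additionally combines aggregators with degree-based scalers.

```python
import numpy as np

def moment_aggregate(msgs):
    """Concatenate mean, std, min and max over a non-empty (k, d) stack of neighbour messages."""
    return np.concatenate([msgs.mean(axis=0), msgs.std(axis=0),
                           msgs.min(axis=0), msgs.max(axis=0)])

# The two multisets from the discussion above: identical mean, sum and max,
# but distinguishable once the standard deviation is included.
m1 = np.array([[4.0], [2.0], [2.0], [0.0]])
m2 = np.array([[4.0], [4.0], [0.0], [0.0]])
print(moment_aggregate(m1))   # [2.0, 1.414..., 0.0, 4.0]
print(moment_aggregate(m2))   # [2.0, 2.0,      0.0, 4.0]
```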
Concurrently with these developments, Wagstaff et al. (2019) settled the extension to real-valued features for pure unordered sets, extending the results of Zaheer et al. (2017). Remarkably, they find a very compatible result—proving that it is possible to injectively embed sets of real-valued features with a single aggregator, so long as the number of latent features is larger than or equal to the maximal set size in the dataset. The fact that both Corso et al. (2020) and Wagstaff et al. (2019) rely on the maximal set size hints at a deep connection between these results.
PNA offers improved expressivity of GNNs through the use of multiple fixed aggregators. What if, instead, we had a parametric aggregator? Several proposals attempt to introduce scalar parameters in an otherwise rigid aggregator backbone (Pellegrini et al. 2020). It is also possible to replace aggregators with a scaffold relying on fully-generic neural networks (such as MLPs) serving as the ⊕ operator; however, care must be taken that such networks support a commutative monoid structure (Ong and Veličković 2022). As monoids are structures that generalise groups, and hence lie outside the focus of the GDL blueprint, we will defer this particular example to the later Chapters of the book.
5.2.6 What lies beyond message passing?
Throughout the previous five subsections, we made clear inroads towards sur-
passing the 1-WL limitation on expressivity for GNNs as specified by Equation
5.56. This raises clear questions as to whether message passing is sufficient to
describe all possible permutation-equivariant layers we might come across.
For some approaches we discussed, it is clear the equation still applies: feature augmentation does not change the operations of the model; message passing modulation conditions functions like ψ(x_u, x_v) into ψ_{G,(u,v)}(x_u, x_v)—making them depend on the graph structure around nodes u and v; and methods like PNA amount to representing ⊕ using a combination of aggregators.
For other approaches, the situation is less clear, and the jury is still very
much out concerning the universality of message passing; not even the authors
of this book are in agreement. To stimulate a useful discussion, we will present a claim that message passing is universal up to a change of computation graph; the argument was first written up by Veličković (2022) and popularised
by Francesco Di Giovanni in his NeurIPS’22 Workshop talk.
The claim is as follows (Figure 5.9): let GNN_θ(G, X) be any neural network with learnable parameters θ, applied over a graph G with (node/edge/graph...)
Figure 5.9
Schematic of the conjecture formalised by Francesco Di Giovanni at his NeurIPS'22 Graph Learning Frontiers Workshop talk, illustrated on the case of a simplicial complex GNN (Bodnar, Frasca, Wang, et al. 2021). Such models perform message passing over substructures—spanning more than two vertices, and hence not conforming to a binary message function such as ψ(x_u, x_v). However, it is possible to equivalently re-express such computation by first rewiring the graph—in this case, by creating new nodes that will represent each simplicial complex, appropriately wiring them to their constituent nodes—and then performing the usual message passing equation. It is conjectured that every (graph) neural network may be expressed in this manner.
features X. Then, it is conjectured that, for some R and θ′:

GNN_θ(G, X) = MPNN_{θ′}(R(G), R(X)),   (5.66)

where MPNN_{θ′}(G′, X′) is a message passing GNN as specified in Equation 5.56, with parameters specified by θ′, neighbourhoods specified by the graph G′, and input features specified by X′. The rewiring function, R : G × ℝ^{n_in} → G × ℝ^{n_out}, modulates the input graph in an appropriate way. Note that we allow R to act on the node features as well, as e.g. the operation may introduce new nodes into the graph, which then need to be appropriately initialised.
If the conjecture in Equation 5.66 holds, then the message passing primitive would be universal—perhaps for all of deep learning over discrete structures.
For the specific computation graph modulations we presented earlier, we can
indeed propose how to construct appropriate rewiring operations:
• For message passing over substructures (such as k-tuples), we can invent new nodes which represent these substructures, and then connect them to all of their constituent nodes (in a manner that depends on whether permutation invariance holds across the substructure).
• For subgraph GNNs, we can duplicate each node once per subgraph, and connect the various copies of a node together using edges of a separate type.
• For asynchronous GNNs, we can artificially slow down messages by introducing slow nodes in-between specific edges (Azabou et al. 2023).
While these examples offer reasonable motivation, interesting edge cases
with unclear semantics are still ample; for example, message functions may
be computed as a result of optimising an embedding for certain properties
(Bartunov, Fuchs, and Lillicrap 2022), or even the entire output of certain
continuous-layer GNNs may be recovered by analytically solving for some
constraints (Xhonneux, Qu, and Tang 2020). Further, there are many propos-
als for machine learning models over domains exhibiting richer structure than
graphs, such as sheaves (Bodnar et al. 2022), bundles (Bamberger et al. 2024)
and general topological spaces (Papamarkou et al. 2024). Any attempt at for-
mally demonstrating whether message passing can be a universal operator for
deep learning will need to take such works into account, but we believe such
efforts (in either direction) will be very much worth it!
5.3 Case Study: Recommender Systems And Social Networks
The first popularised large-scale industrial applications of graph representation
learning have occurred within social networks, primarily in the context of rec-
ommender systems. Recommenders are tasked with deciding which content to
serve to users, potentially depending on their previous history of interactions
on the service. This is typically realised by first using a GNN to embed the
properties of each item of content (and potentially individual users), u, into a
vector, h_u, leveraging a graph comprising known user-content (e.g. a product
purchase) or content-content (e.g. products co-purchased together) interactions
to perform message passing over.
Note that, in a production environment, such graphs are frequently hetero-
geneous: we might have multiple types of nodes (e.g. user, product. . . ) as well
as multiple types of edges (e.g. purchase, impression, co-purchase. . . )—and
this information should condition the message passing operation. There exist
several ways to do this in GNNs: either simply passing node/edge type as a
categorical feature, or maintaining separate message function parameters for
each edge type, or anything in between (C. Zhang et al. 2019).
Once computed, these embeddings are optimised through a link prediction
objective (Figure 5.10): supervise the embeddings of various nodes (pieces of
content) such that they are kept close together if they are deemed related (e.g.
Figure 5.10
Illustration of a typical recommender system setting with graph neural networks. In such settings, nodes u ∈ V may correspond to users, content or products, and edges (u, v) ∈ E may correspond to purchases, impressions, co-purchases, or similar. A GNN first embeds all nodes into a shared latent space (h_u for node u ∈ V), after which a link prediction objective is optimised; for example, we might optimise the inner product h_a^⊤ h_b to predict whether user a would purchase product b (y_ab), given observed purchase logs. This prediction can then be used to drive recommendations downstream.
commonly viewed together). The proximity of two nodes' embeddings (e.g. their inner product, h_u^⊤ h_v) can then be interpreted as the probability that they are linked by an edge in the content graph. For any content queried by users, one approach could then serve its nearest neighbours in the embedding space.
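A sketch of the inner-product link predictor described above, scored with a logistic link and trained against observed edges and randomly sampled non-edges; the negative-sampling scheme and binary cross-entropy objective are common choices, not the only ones used in production systems.

```python
import numpy as np

def link_logits(H, pairs):
    """Score each candidate edge (u, v) by the inner product h_u^T h_v."""
    return np.array([H[u] @ H[v] for (u, v) in pairs])

def link_prediction_loss(H, pos_edges, neg_edges):
    """Binary cross-entropy: observed edges -> label 1, sampled non-edges -> label 0."""
    logits = np.concatenate([link_logits(H, pos_edges), link_logits(H, neg_edges)])
    labels = np.concatenate([np.ones(len(pos_edges)), np.zeros(len(neg_edges))])
    probs = 1.0 / (1.0 + np.exp(-logits))
    return -np.mean(labels * np.log(probs + 1e-9) + (1 - labels) * np.log(1 - probs + 1e-9))

rng = np.random.default_rng(7)
H = rng.normal(size=(6, 8))          # node embeddings, as produced by an upstream GNN
pos = [(0, 1), (1, 2), (2, 3)]       # observed interactions (e.g. purchases)
neg = [(0, 3), (4, 5)]               # randomly sampled negative pairs
print(link_prediction_loss(H, pos, neg))
```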
Importantly, at the scales faced by many deployed recommender systems,
naïvely deploying a message passing model is frequently prohibitively expen-
sive. The costs may come either from available memory – keeping entire graphs of millions or even billions of entities in memory may be impossible – or from required latency – the model's forward pass may simply be too slow for a seamless user experience. A key early example of a GNN mindful of such issues is GraphSAGE (Hamilton, Ying, and Leskovec 2017), which does not feed the entire graph into the GNN at once—instead, it processes a batch of nodes at once, and dynamically subsamples each of those nodes' neighbourhoods to run the GNN over (Figure 5.11). Beyond the subsampling approach,
two other popular mechanisms for building scalable GNNs include partition-
ing the graph across multiple training steps (Chiang et al. 2019) or simplifying
Figure 5.11
GraphSAGE (Hamilton, Ying, and Leskovec 2017) processes a batch of three nodes, aggregating their K-hop neighbourhoods (here, K = 2). For each of the three nodes, it randomly samples, with replacement, a certain number of k-order neighbours of each node (for 1 ≤ k ≤ K; here, three for k = 1, and two for k = 2), to form the final subsampled graph. This graph is then used to steer the GNN's message passing. Only the messages necessary to compute the central nodes' feature vectors, h_u, are computed, in order to optimise memory usage. Note that, while drawn here as all distinct, the sampled nodes may be duplicated or overlapping, within as well as across the sampled patches.
the message passing equation by leveraging concepts from diffusion (F. Wu
et al. 2019; Rossi et al. 2020) or PageRank (Bojchevski et al. 2020).
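A sketch of the neighbour-subsampling idea behind GraphSAGE: for each seed node in the batch, sample a fixed number of neighbours per hop (with replacement) and keep only the edges needed to compute the seeds' embeddings. The fan-out values and the tiny toy graph are arbitrary illustrative choices.

```python
import random

def sample_neighbourhoods(adj, seeds, fanouts, rng=random):
    """Per-hop edge samples for a batch of seed nodes (GraphSAGE-style subsampling).

    adj: dict node -> list of neighbours; fanouts: e.g. [3, 2] for a 2-hop sample.
    """
    layers, frontier = [], list(seeds)
    for fanout in fanouts:
        sampled_edges, next_frontier = [], []
        for u in frontier:
            if not adj[u]:
                continue
            picks = [rng.choice(adj[u]) for _ in range(fanout)]    # sample with replacement
            sampled_edges.extend((u, v) for v in picks)
            next_frontier.extend(picks)
        layers.append(sampled_edges)
        frontier = next_frontier
    return layers   # message passing is then run only over these sampled edges

adj = {0: [1, 2, 3], 1: [0, 4], 2: [0], 3: [0, 4], 4: [1, 3]}
print(sample_neighbourhoods(adj, seeds=[0], fanouts=[3, 2]))
```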
Among the pioneers of this methodology is the American image sharing and social media company Pinterest: besides presenting one of the first successful deployments of GNNs in production, their method, PinSage^26, successfully made graph representation learning scalable to graphs of millions of nodes and billions of edges (R. Ying et al. 2018). Related applications, particularly in the space of product recommendations, soon followed. Popular GNN-backed recommenders that have been deployed in production include Alibaba's Aligraph (Zhu et al. 2019), Amazon's P-Companion (Hao et al. 2020), Spotify's 2T-HGNN (De Nadai et al. 2024), LinkedIn's LiGNN (Borisyuk et al. 2024) and Snapchat's GiGL (Zhao et al. 2025). GNNs have also been deployed industrially beyond the context of recommenders—for example, they power the expected time of arrival (ETA) predictions within Google Maps (Derrow-Pinion et al. 2021). Taking all of these deployments together^27, it is not far-fetched to say that millions of people come into contact with graph machine learning systems on a daily basis.
Within the context of content analysis on social networks, another noteworthy effort is Fabula AI, which is among the first GNN-based startups to be acquired (in 2019, by Twitter). The startup, founded by one of the authors of the text and his team, developed novel technology for detecting misinformation on social networks (Monti et al. 2019).
Fabula's solution consists of modelling the spread of a particular news item by the network of users who shared it. The users are connected if one of them re-shared the information from the other, but also if they follow each other on the social network. This graph is then fed into a graph neural network, which classifies the entire graph as either 'true' or 'fake' content, with labels based on agreement between fact-checking bodies. Besides demonstrating strong predictive power which stabilises quickly (often within a few hours of the news spreading), analysing the embeddings of individual user nodes revealed clear clustering of users who tend to share incorrect information, exemplifying the well-known 'echo chamber' effect.

Figure 5.12
An illustration of training a large language model (LLM) on the next-token prediction task for the input text "A house of cards". The modern LLM stack interchanges two types of layer: a feedforward network (FFN) applied to each token representation in isolation, and a self-attention layer, allowing tokens to exchange their embeddings using an attentional GNN. Note that it is often necessary to restrict the allowed dataflow so that a token may only interact with the tokens preceding it; this protects against leaking privileged information when all token representations in the text are supervised at once. Accordingly, modern LLMs are graph neural networks operating over graphs of tokens. See also Figure 5.13 for a zoomed-in view of the individual operations.
5.4 Case Study: Large Language Models
At the time of writing this Chapter—and surely for a significant amount of time after—the most prevalent artificial intelligence system is the large language model (LLM; Figure 5.12). These systems have outstanding capabilities in language, demonstrating coherent and useful answers to various prompts of interest to users; this capability is reflected both in their widespread usage (through systems such as Gemini, Claude and ChatGPT) and in the significant scientific interest they attract. Naturally, it would be highly useful to be able to describe and reason about LLMs within the framework proposed in this book.
As it turns out, we already charted all of the necessary components to derive
the geometric deep learning properties of LLMs. In fact, we even hinted at this
in Section 5.2.3.2: attentional GNNs over complete graphs, i.e.

h_u = \phi\left(x_u, \bigoplus_{v \in V} a(x_u, x_v)\, \psi(x_v)\right),    (5.67)
may already be sufficient to describe the Transformer architecture (Vaswani
et al. 2017), which forms the key backbone of contemporary language models.
In this setting, the nodes of the graph correspond to tokens (fundamental units
of text—they may be words but also, e.g., word fragments), and their input
features, x_u, are one-hot indicators specifying which token is at position u. There are two key missing pieces: detailing how dot-product self-attention conforms to this equation, and appropriately taking care of causal masking; we detail both next.
5.4.1 Dot-product self-attention
The dot-product self-attention operator^28, as used by Transformers, is defined as follows (for a general neighbourhood N_u):
z_u = \sum_{v \in N_u} \alpha_{uv}\, v_v    (5.68)

where v_v = V x_v (the value vector of v) and α_{uv} is the softmax-normalised dot-product^29 between q_u = Q x_u and k_v = K x_v:

e_{uv} = q_u^\top k_v, \qquad \alpha_{uv} = \frac{\exp e_{uv}}{\sum_{v' \in N_u} \exp e_{uv'}}    (5.69)
Note that Q, K and V are the only parametric components of this operator. From this analysis, we can align the equation to attentional GNNs as presented in Equation 5.67: set ⊕ = Σ (sum aggregation), let ψ(x_v) = V x_v, and let a(x_u, x_v) = α_{uv}, where the softmax normalisation is implied. Transformers typically interchange self-attention layers with feedforward layers, which are multilayer perceptrons applied, in parallel, to each token's embedding separately—much like in Deep Sets^30. Such a multilayer perceptron takes the role of ϕ, e.g.:
\phi(x_u, z_u) = W_2\, \sigma\left(W_1 (x_u + z_u) + b_1\right) + b_2    (5.70)
where W_·, b_· are the learnable weight and bias parameters of each layer, σ is an activation function, and the input token features are residually added to the attended vector z_u. This MLP might also feature normalisation operations, such as layer normalisation (Ba, Kiros, and Hinton 2016).
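To make this correspondence concrete, the following minimal Python sketch implements Equations 5.68–5.70 for a single attention head, written node by node to emphasise the GNN reading. All shapes, parameter initialisations and names are illustrative rather than taken from any particular system.

import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 8                                    # number of tokens and model width
X = rng.normal(size=(n, d))                    # token features x_u
Q, K, V = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
W1, b1 = rng.normal(size=(d, d)) * 0.1, np.zeros(d)
W2, b2 = rng.normal(size=(d, d)) * 0.1, np.zeros(d)

def attention_layer(X, neighbourhoods):
    """One attention block, computed per node to emphasise the GNN view."""
    H = np.zeros_like(X)
    for u, N_u in enumerate(neighbourhoods):
        q_u = Q @ X[u]
        e = np.array([q_u @ (K @ X[v]) for v in N_u])            # logits, Eq. (5.69)
        alpha = np.exp(e - e.max())
        alpha /= alpha.sum()                                     # softmax normalisation
        z_u = sum(a * (V @ X[v]) for a, v in zip(alpha, N_u))    # aggregation, Eq. (5.68)
        H[u] = W2 @ np.maximum(W1 @ (X[u] + z_u) + b1, 0) + b2   # update phi, Eq. (5.70)
    return H

complete_graph = [list(range(n)) for _ in range(n)]              # N_u = V
print(attention_layer(X, complete_graph).shape)                  # (4, 8)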
While this derivation (Figure 5.13) already aligns Transformers to the attentional GNN flavour, we find it an instructive exercise to demonstrate how they can be expressed as a message passing GNN. This derivation will work for all softmax-normalised attentional GNNs, even if they do not use dot-product attention.

Figure 5.13
Zooming in on the schematic from Figure 5.12, this Figure presents the individual operations that commonly feature in the modern Transformer block. The dataflow also emphasises how such models can be seen as graph neural networks: the adjacency matrix is applied in the form of an attention mask (setting non-edge logits to −∞); the value vectors v_u may be seen as messages, and the attentional coefficients α_{uv} determine the interaction strengths when aggregating these messages. The latter part of the block (normalisation and feedforward network) corresponds to the GNN's update function, ϕ. Note that the figure also depicts the frequently used rotary position embedding (Su et al. 2024) as a rotation matrix R(u, v) applied within the dot product: e_{uv} = q_u^⊤ R(u, v) k_v.
To help us in this endeavour, let us write Equation 5.68 as follows:

z_u = \frac{\sum_{v \in N_u} (\exp e_{uv})\, v_v}{\sum_{v \in N_u} \exp e_{uv}}    (5.71)
The key insight is that both the numerator and the denominator in this fraction can be computed with simple sum-aggregated local messages. This allows us to express our message as a combination of two messages, (n, d), where n ∈ R^k is the numerator and d ∈ R is the denominator. Both n and d can be computed as functions of x_u and x_v only, and therefore conform to the required form.

Now it only remains to define our aggregation function, ⊕. Note that, in this case, we need to define how our combined messages are aggregated^31; i.e.,

(n, d) \oplus (n', d') = (n + n', d + d')    (5.72)
Once aggregated, it remains to divide the numerator by the denominator—this can be performed within the definition of ϕ. Hence, Transformers are indeed (attentional and message-passing) graph neural networks^32 (Joshi 2020).
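The following sketch (again with illustrative names and toy shapes) carries out this message-passing view: every edge contributes the pair (exp(e_uv)·v_v, exp(e_uv)), the pairs are sum-aggregated as in Equation 5.72, and the final division is deferred to the update function. A sanity check confirms that the result matches the standard softmax form of Equation 5.68.

import numpy as np

rng = np.random.default_rng(1)
n, d = 4, 8
X = rng.normal(size=(n, d))
Q, K, V = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))

def attention_by_message_passing(X, neighbourhoods):
    """Softmax attention via sum-aggregated (numerator, denominator) messages."""
    Z = np.zeros_like(X)
    for u, N_u in enumerate(neighbourhoods):
        num = np.zeros(d)                       # aggregated numerator, n
        den = 0.0                               # aggregated denominator, d
        for v in N_u:
            e_uv = (Q @ X[u]) @ (K @ X[v])      # scalar logit, a function of x_u, x_v only
            num += np.exp(e_uv) * (V @ X[v])    # message component n
            den += np.exp(e_uv)                 # message component d
        Z[u] = num / den                        # the division happens inside phi
    return Z

neigh = [list(range(n)) for _ in range(n)]
Z = attention_by_message_passing(X, neigh)

# Sanity check against the standard softmax formulation of Equation 5.68:
logits = (X @ Q.T) @ (X @ K.T).T
alpha = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
assert np.allclose(Z, alpha @ (X @ V.T))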
Why attentional? It is useful to pause for a moment and ponder why the attentional GNN flavour was so impactful in such a rich natural language domain, as opposed to more generic message passing architectures. While we cannot definitively settle this question here, we do offer a few conversation-starters.
On one hand, words are meaningless without context: as our input node features are simply one-hot indices into a token vocabulary, it is unlikely that complex messages are necessary^33 to make sense of them. The success of attention in this space reinforces the idea that one can easily reason about word representations by leveraging weighted sums of other words' representations.
On another hand, we must acknowledge the massive scale at which Transformers are deployed today, with many large language models sporting hundreds of billions of parameters. This is another setting that presents an advantage for attentional GNNs: especially when using linear or dot-product attention without given edge features, no vectors (only scalars) need to materialise on the edges of the graph, saving significant memory^34. In contrast, the most general message passing GNNs require explicitly computing message vectors and storing them on each edge.
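As a rough, purely illustrative back-of-the-envelope comparison (toy numbers, not measurements of any deployed system), consider the per-layer edge storage over a complete graph of tokens:

def edge_memory_gib(num_tokens, width, bytes_per_scalar=2):
    """Compare per-layer edge storage on a complete graph of tokens.

    Dot-product attention materialises one scalar logit per edge, whereas
    generic message passing materialises a 'width'-dimensional vector per edge.
    """
    num_edges = num_tokens * num_tokens
    attn = num_edges * bytes_per_scalar
    mpnn = num_edges * width * bytes_per_scalar
    gib = 1024 ** 3
    return attn / gib, mpnn / gib

print(edge_memory_gib(num_tokens=8192, width=4096))
# roughly (0.125, 512.0): scalar logits fit easily; per-edge vectors do not.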
Finally, the specific computations of Transformers—which consist of, e.g.,
many carefully placed full matrix products—align very well with the modern
AI hardware stack (e.g. GPUs and TPUs) and hence, Transformers are GNNs
that are currently winning the hardware lottery (Hooker 2021)!
5.4.2 Causal masking
It is precisely the aforementioned scalability of language model training which
yields the second key constraint to consider. The parameters of an LLM are
typically pre-trained using a next-token prediction objective: given a specific
sequence of tokens, predict the next one in the sequence. This objective is then
deployed over tokenised texts scraped across the entire Internet.
Given the incredible scales involved, it is critical that this training proceeds
in a manner which is as parallelisable as possible. Because of this, the model is
typically trained over entire chunks of contiguous text at once, and supervision
is applied on all of the tokens simultaneously.
This decision enables further scalability but limits the allowed dataflow
between tokens; namely, it necessitates that information flows in a feedfor-
ward manner throughout the model. Should any information flow from future
tokens towards past ones while also supervising on all tokens at once, it would
allow the model to "cheat" on the next-token prediction objective (information about the next token would leak into the token tasked with predicting it).

Figure 5.14
Two typical harmful effects on information propagation in decoder-only Transformers: overmixing (left) and oversquashing (right). Overmixing leads to important information (depicted in red) being diffused too quickly across the token embeddings. Oversquashing leads to overly relying on earlier tokens, as they have substantially more paths to the final token embedding (depicted in blue for an early token vs. red for a late token). Both issues occur when the attentional coefficients, α_{uv}, are insufficiently sharp.
Enforcing this constraint in an LLM amounts to masking the allowed token adjacency to always be a lower-triangular matrix—often also referred to as a causal mask. Note that this is a form of nonparametric graph rewiring: it sets the neighbourhoods N_u = {v | v ≤ u}. In Figure 5.13, this is exemplified by the relevant rewired adjacency matrix being passed as the attention mask (setting logits in the u-th row that are not in N_u to −∞).
Taking all of these constraints together, we have now fully specified the so-called decoder-only Transformer^35, which is the essence of the generative pre-trained Transformer (GPT) framework (Radford et al. 2018).
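As a minimal sketch of this rewiring (with toy values and no claim to match any production implementation), the causal mask can be realised by setting the logits of non-edges to −∞ before the softmax:

import numpy as np

def causal_attention_weights(logits):
    """Apply a causal mask to an (n, n) matrix of attention logits.

    Row u may only attend to columns v <= u, i.e. N_u = {v | v <= u}.
    """
    n = logits.shape[0]
    mask = np.tril(np.ones((n, n), dtype=bool))       # lower-triangular adjacency
    masked = np.where(mask, logits, -np.inf)          # non-edges get -inf logits
    masked -= masked.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(masked)
    return weights / weights.sum(axis=-1, keepdims=True)

logits = np.arange(16, dtype=float).reshape(4, 4)
print(np.round(causal_attention_weights(logits), 3))
# Row 0 attends only to token 0; row 3 attends to all four tokens.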
5.4.3 Analysing LLM graph dataflow
Now we know that decoder-only Transformers (the prevalent LLM architecture
at the time of writing) are actually secretly (?) graph machine learning models
under the hood. A natural question arises: can we leverage this knowledge to
make sense of how data propagates within LLMs – and potentially even make
meaningful improvements by tinkering with their graph topology?
We can answer this question in the affirmative—in fact, the specific choice
of graph structure implied by a causal mask turns out to be harmful for
dataflow, from the perspective of (m)any metrics of graph “health” known to
graph theory. We will provide two examples of these issues for illustrative pur-
poses (see also Figure 5.14); while we remain cognisant of the rapid pace of
research in this space, we hope they serve as useful inspiration.
Firstly, because of the fully-connected structure of this causal graph, mes-
sages are always passed fully across all allowed pairs of tokens. This can lead
to a challenging phenomenon known as overmixing (Barbero et al. 2025),
wherein the model loses the fidelity of information about individual tokens,
as they rapidly mix with other tokens’ embeddings across different layers.
A complementary issue arises due to the causality of the computational graph, coupled with the fact that only the last token's embedding is used to predict the next one. The causal graph necessarily induces a form of token asymmetry, with earlier tokens having significantly more paths to the last token's embedding than later ones. This leads to the oversquashing problem (Barbero et al. 2024): as information is asymmetrically compressed into the final token's embedding (after L layers), h^{(L)}_n, early tokens have a significantly higher influence over it (in terms of the Jacobian, ∂h^{(L)}_n / ∂x_v, for the input token at position v). This asymmetry is so pronounced that it is possible to show that, in the limit where the number of decoder-only Transformer layers goes to infinity (L → ∞), only the first token influences the final predictions.
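The path-counting asymmetry behind oversquashing is easy to see numerically. The following toy sketch counts, for each input token, the number of dataflow paths through an L-layer causal graph that terminate at the final token's embedding; it illustrates the counting argument only, and is not a bound from the cited works.

import numpy as np

def causal_path_counts(num_tokens, num_layers):
    """Count dataflow paths from each input token to the final output token.

    A[v, u] = 1 iff token u may attend to token v (v <= u), so the entry
    (A^L)[v, n-1] counts paths from input v to the last token after L layers.
    """
    A = np.triu(np.ones((num_tokens, num_tokens), dtype=np.int64))
    paths = np.linalg.matrix_power(A, num_layers)
    return paths[:, -1]

print(causal_path_counts(num_tokens=6, num_layers=3))
# [21 15 10  6  3  1]: the first token has far more paths to the final
# embedding than the last one, biasing deep models towards early tokens.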
While both of the above issues present important impediments to the model's dataflow, they may in principle be fixed by appropriately chosen attentional coefficients, α_{uv}. For example, the coefficients may be selected such that they preserve the most vulnerable paths for longer, or delay mixing by concentrating attention in specific places. In both of these cases, a desirable property is sharpness—being able to choose disproportionately higher values of α_{uv} for a smaller subset of tokens. While such sharp circuits may indeed be discovered within trained LLMs (Elhage et al. 2021), they only tend to appear in data distributions "familiar" to the model after training. It can, in fact, be proved using standard results in real analysis (Veličković et al. 2025) that the coefficients α_{uv} must all disperse below any chosen positive ϵ for sufficiently large input position u at inference time (when going beyond the maximal input size encountered during training).
Many of these issues are well known to researchers in graph representation learning; overmixing is frequently referred to as oversmoothing^36 (Rusch, Bronstein, and Mishra 2023), whereas oversquashing is known by the same name (Alon and Yahav 2020). While this certainly invites a plethora of interesting methods from the field, it is also important to acknowledge the unique challenges that come from the feedforward structure of this particular graph, which are seldom studied. Initial steps towards bridging this gap have recently been made (Vitvitskyi et al. 2025), but a lot more work is needed.
Suggested Further Reading
Graph neural networks, as well as graph theory more generally, are very excit-
ing and vibrant areas of mathematics and computer science that are well worth
studying further than what our book has space for.
For a graceful introduction to both, we warmly recommend the book from
Hamilton (2020). While it is somewhat dated in the specific GNN models it
studies, it provides a gentle and intuitive overview of computing important
metrics on a graph, as well as many relevant models that preceded graph neural
networks (such as node embedding techniques).
Then, should you wish to dive deeper into the many exciting facets of
graph machine learning, the book of L. Wu et al. (2022) is an excellent and
comprehensive reference which still holds up very well today.
In parallel with diving deeper into GNNs, or even before doing so, we
believe it is worthwhile to gain a strong foundation in graph theory. This will
be of significant value: both in gaining important intuitions for understand-
ing GNNs, and for inspiration on ways to further improve them. For this, we
recommend the foundational text from Bollobás (1998).
Exercises
1. A linear permutation-equivariant function on a set is a function of the form F(X) = FX satisfying FPX = PFX for every permutation matrix P. Show that any linear permutation-equivariant function can be written as a linear combination of the identity and average functions, i.e. FX = (αI + β11^⊤)X for scalars α and β.
2. Let X be a feature space containing a countable set of possible values. This means that every possible set of values coming from X can be represented by a bitmask in B^X (where B = {0, 1} is the set of bits), and any permutation-invariant real function over such a set must be representable as f : B^X → R. Show that such a function can always be expressed as a Deep Sets model, i.e., f(X) = ϕ(Σ_{i∈V} ψ(x_i)), for some choice of ψ : X → R and ϕ : R → R, for every set V ∈ B^X.
3. Consider the recipe for building permutation-equivariant graph neural networks F(X, A) from Equation 5.53. Recall how it leverages a local function ϕ(x_u, X_{N_u}), applied to the neighbourhood of each node u ∈ V in parallel. Prove that, if ϕ is permutation invariant, that is, ϕ(x_u, PX_{N_u}) = ϕ(x_u, X_{N_u}), then F will be permutation equivariant, that is, F(PX, PAP^⊤) = PF(X, A), for every permutation matrix P. (Note that, when N_u ⊂ V, the space of P differs across the two equations.)
4. A popular way to implement attentional GNNs computes attentional logits as a(x_u, x_v) = σ(a^⊤ W[x_u ∥ x_v]), where a : R^l and W : R^{l×2k} are the parameters, ∥ : R^n × R^m → R^{n+m} is vector concatenation, and σ : R → R is a monotonic activation function. Another popular variant applies these two operations in reverse order: a(x_u, x_v) = a^⊤ σ(W[x_u ∥ x_v]). Explain, using a proof, which of the two variants is more expressive (Hint: consider the ranking between attentional logits across all the nodes in the neighbourhood). Which of the two forms is more scalable, and why? (Note that, as the softmax function is monotonic, you may ignore it for this question.)
5. Recall that subgraph GNNs should be equivariant to G = S_n × S_m, where S_n is the n-element permutation group, n is the number of nodes in the graph and m is the number of its subgraphs we have sampled. Assuming that the input to a subgraph GNN is a node feature matrix X ∈ R^{n×k} and an adjacency tensor A ∈ R^{n×n×m}, write out the equation such a G-equivariant GNN should satisfy, and a possible implementation of such a GNN that allows subgraph representations to meaningfully interact. Prove that, for a specific choice of m subgraphs, your model can accurately decide isomorphism for a particular pair of graphs where the 1-WL test would fail.
6. Express the equations of dot-product self-attention, as leveraged within a decoder-only Transformer, in the message passing framework (i.e., by specifying the functions ψ, ϕ and ⊕ in Equation 5.56).
7. Consider a single-layer decoder-only Transformer, with model equations as described in Section 5.4.1. Further, assume that its feedforward network, ϕ, is κ_1-Lipschitz; that is, ‖∂ϕ(x_u, z_u)/∂x_u‖ ≤ κ_1 and ‖∂ϕ(x_u, z_u)/∂z_u‖ ≤ κ_1, and further assume that ‖V‖_2 ≤ κ_2, where V is the matrix transforming token features into values (i.e., v_u = V x_u). Prove that, if v ≤ u:

\left\| \frac{\partial h_u}{\partial x_v} \right\| \le \kappa_1 \left( \delta_{uv} + \kappa_2\, \alpha_{uv} \right)    (5.73)

where α_{uv} is the attentional coefficient between v and u, and δ_{uv} is 1 if u = v and 0 otherwise. Then, using this result, show that, in an L-layer decoder-only Transformer:

\left\| \frac{\partial h^{(L)}_n}{\partial x_v} \right\| \le C \sum_{k_1 \ge v} \sum_{k_2 \ge k_1} \cdots \sum_{k_{L-1} \ge k_{L-2}} \bar{\alpha}^{(L)}_{n, k_{L-1}} \left( \prod_{\ell=2}^{L-1} \bar{\alpha}^{(\ell)}_{k_\ell, k_{\ell-1}} \right) \bar{\alpha}^{(1)}_{k_1, v}    (5.74)

where C > 0 is a constant, k_1, . . . , k_{L-1} are intermediate nodes on a particular path of length L + 1 from v to n, and \bar{α}^{(l)}_{uv} are simple functions of α^{(l)}_{uv}—the attentional coefficients between v and u in the l-th layer. This result implies that a deep decoder-only Transformer is topologically biased to pay more attention towards early tokens, as they will have substantially more paths to the final token (n).
Notes
1
For example, photos taken in the real world are two-dimensional projec-
tions of (at least!) three-dimensional scenes, often featuring objects that are
inherently not grid-structured themselves.
2
Depending on the application field, nodes may also be called vertices,
and edges are often referred to as links or relations. We will use these terms
interchangeably.
3
There are exactly n! such permutations, so S_n is, even for modest n, a very large group.
4
We use the bold notation for our function F(X) to emphasise it outputs
node-wise vector features and is hence a matrix-valued function.
5
When the graph is undirected, i.e. (u, v) ∈ E iff (v, u) ∈ E, the adjacency matrix is symmetric, A = A^⊤.
6
PAP^⊤ is the representation of S_n acting on matrices.
7
As a way to emphasise the fact that our functions operating over graphs
now need to take into account the adjacency information, we use the notation
f (X, A).
8
This corresponds to the Bell number B_4, which counts the number of ways to partition a set of 4 elements, in this case given by the 4-indices (u, v), (u′, v′) indexing a linear map acting on the adjacency matrix.
9
Often, the node u itself is included in its own neighbourhood.
10
A multiset, denoted {{ . . . }}, is a set where the same element can appear
more than once. This is the case here because the features of different nodes
can be equal.
11
More precisely, we cannot define a non-trivial coarsening assuming set
structure alone. There exist established approaches that infer topological
structure from unordered sets, and those can admit non-trivial coarsening.
12
Though many works make a distinction between “spatial” and “spectral”
GNNs, we find that this distinction is rather artificial and must be taken with a
grain of salt. Many proposed spectral filters are expressible in terms of simple
matrix operations (e.g. polynomials of the graph Laplacian) and hence boil
down to simple spatial operations that fall under the “convolutional” flavour
presented here.
13
Most commonly, ψ and ϕ are learnable affine transformations with activation functions; e.g. ψ(x) = Wx + b; ϕ(x, z) = σ(Wx + Uz + b), where W, U, b are learnable parameters and σ is an activation function such as the rectified linear unit. The additional input of x_u to ϕ represents an optional skip-connection, which is often very useful.
14
It is worthy to note that this flavour does not express every GNN layer that is
convolutional (in the sense of commuting with the graph structure), but covers
most such approaches proposed in practice.
15
This is a direct consequence of the permutation invariance of ⊕.
16
It is also appropriate to apply the message-passing flavour. While popu-
lar for physics simulations and relational reasoning (e.g. Battaglia et al. 2016;
Santoro et al. 2017), they have not been as widely used as Transformers. This
is likely due to the memory issues associated with computing vector mes-
sages over a complete graph, or the fact that vector-based messages are less
interpretable than the “soft adjacency” provided by self-attention.
17
Note that, due to the assumption of GNNs' permutation invariance in h_G, it is certain that the embeddings of isomorphic graphs will be equal.
18
László Babai (2016) credits the invention of the k-WL tests to himself with Rudolf Mathon and, independently, Neil Immerman and Eric Lander (the latter was trained as a mathematician but is more broadly known for his works in genomics and biology, and also as the Science Advisor in President Biden's administration).
19
The k-GNN architecture introduced by Morris et al. (2019) incurs O(n^k) memory complexity. In a follow-up work, Morris, Rattan, and Mutzel (2020) also devised a local version of k-GNNs based on aggregation in local neighbourhoods, taking the sparsity of the underlying graph into account, and showed that this local version has at least the same power as the ordinary k-WL. Invariant Graph Networks (IGN), introduced by Maron et al. (2018), are based on k-order tensors and are k-WL-equivalent. IGNs are derived from a different variant of k-WL and are more advantageous in terms of their complexity compared to k-GNNs. In particular, the 3-WL-equivalent IGN has "only" quadratic complexity.
20
ESAN itself is an instance of the Deep Sets of Symmetric Objects (DSS)
architecture proposed by Maron et al. (2020).
21
Corso et al. (2020) prove that no single aggregator can injectively embed
real-valued neighbourhoods by leveraging the Borsuk-Ulam theorem (Borsuk
1933). It can be used to conclude, for example, that there must always exist
two points on opposite sides of the Earth with an identical continuous property
(such as temperature, atmospheric pressure, etc.).
22
This connection was briefly explored and highlighted by two of the authors of these papers (Fuchs and Veličković 2023)—we recommend it as a good, lightweight introduction to characterising expressivity of functions on sets and graphs.
23
Specifically, Petar Veličković is a firm believer that the message passing primitive can be used to express everything of interest in deep learning—even if sometimes it is not the most natural perspective for looking at the relevant structures. He often calls upon the idea that even fundamental algorithms such as matrix multiplication can be observed as a form of message passing, where data along one dimension of the first matrix are exchanging messages with data along a corresponding dimension of the second—followed by an appropriate sum-aggregation operation. Michael Bronstein has often responded to this by saying that, when drilling down to the scale of matrix multiplies, such an argument is hardly practically useful.
24
This implicitly assumes that any discrete structure can be reasoned about
using some kind of graph.
25
Such an approach may have unexpected benefits to the model’s expressive
power—as we discussed within the context of subgraph GNNs.
26
It is worth noting that GraphSAGE (Hamilton, Ying, and Leskovec 2017) was explicitly developed in order to prototype a GNN system capable of running in production at Pinterest—which later became PinSage. Pinterest also presented follow-up work, PinnerSage (Pal et al. 2020), which effectively integrates user-specific contextual information into the recommender.
27
Note that the list above only shows published academic papers detailing
how various companies have deployed GNNs in production. It excludes less-
detailed blog posts which describe GNN deployment within products like Uber
Eats and AirBnb, as well as GNN platforms like TF-GNN (Ferludin et al. 2022)
that directly state “Many production models at Google use TF-GNN” without
providing specifics.
28
The discussion within this section focusses squarely on analysing a single-
head dot product attention operator; however, it is simple to adapt it for
multi-head attention, a variant more commonly used in modern Transform-
ers (where the attention operator is replicated across k independent heads).
To do so, simply factorise all the relevant functions and parameters to operate
across multiple heads, similar to how we’ve analysed principal neighbourhood
aggregation.
29
The Transformer paper refers to this as "scaled dot-product attention", as the attentional logits are additionally scaled by a factor of 1/√d_k; that is, e_{uv} = (1/√d_k) q_u^⊤ k_v, where d_k is the dimensionality of the key vectors k_v. This is introduced to allow richer gradients at larger scales of dimensionality.
30
Interestingly, both Deep Sets (Zaheer et al. 2017) and Transformers
(Vaswani et al. 2017) were presented at the same conference: N(eur)IPS’17
in Long Beach, California.
31
It is also possible to define this operation in a way that does not delay the renormalisation for later; for example, by maintaining an online average: (n, d) ⊕ (n′, d′) = ((dn + d′n′)/(d + d′), d + d′).
32
Interestingly, beyond the permutation equivariance which is implied by this derivation, dot-product attention is also rotation invariant: if we rotate all key and query vectors by the same rotation matrix, we will obtain exactly the same attention logits (since rotations preserve inner products). This observation will be leveraged in future Chapters.
33
This may no longer be the case when nodes become entire concepts.
34
A very good example of this principle is the GATv2 architecture (Brody,
Alon, and Yahav 2022). GATv2s are attentional GNNs that leverage deep mul-
tilayer perceptrons as their attention function, a, allowing them to provably
express certain patterns of attention that are out of reach of dot product self-
attention. However, the computation of the intermediate layers of deep MLPs
requires storing latent embeddings in the graph’s edges, making the model less
practical for use over large, fully-connected graphs of language tokens.
35
The name "decoder-only" refers to the roots of the Transformer architecture (Vaswani et al. 2017). It was initially used for machine translation—a supervised learning task—rather than next-token prediction—a self-supervised learning task. When used for machine translation, it was not necessary for the entire architecture to have a causal mask: there could be a dedicated component (the encoder), which sets N_u = V, along with a causally masked component (the decoder), setting N_u = {v | v ≤ u}, generating the translated sentence word-for-word. When only the decoder part is used, we recover the GPT setting. When only the encoder part is used, we recover the BERT setting (Devlin et al. 2019).
36
In fact, the finding that decoder-only Transformers overmix information—
in spite of their attentional coefficients—is similar to the discovery that graph
attention networks oversmooth (X. Wu et al. 2023) in spite of the same
mechanism.