5
Graphs
In multiple branches of science, from sociology to particle physics,
graphs are a fundamental model of systems of relations and interactions.
Graphs give rise to a basic type of symmetry modelled by the group of
permutations. This is why we start our investigation by studying them.
Most other objects of interest to us, such as grids and sets, can be
obtained as a particular case of graphs, and many architectures we will
discuss here will be expressible in the language of graph neural networks.
This also includes the modern large language model (LLM) stack, which
we present as a case study in this Chapter.
5.1 Domain
In many ways, graphs are the main modality of data we receive from nature.
Indeed, at virtually all levels of organisation of biological and physical
systems—from molecules to the connectomic structure of the neurons in the
brain —graphs offer a natural abstraction; namely, that of an interconnected
network of nodes. Similar conclusions can be drawn about human-designed
systems, such as transportation networks or social networks. In fact, it may
even be argued that the domains where deep learning has traditionally seen most success—such as images, text or speech—can all, at the very least, be seen as special cases of graph structures, and very often as grid-like projections of a far more irregularly structured phenomenon. Furthermore, the
very organisation of cognition seems to be irregularly structured, with con-
cepts freely related and recombined to form new knowledge. All of this taken
into account, it is likely not very controversial to claim that a truly artificially-
intelligent system will necessarily need to know how to process and reason
about graph-structured data. Conveniently, this framework also fits very nicely
in our GDL blueprint, and can be used as a basis for studying many other
geometric domains. Now, it is our turn to study graph representation learning.
[Figure 5.1 graphic: left, "This is a graph, G = (V, E)", drawn with nodes u, v, a, b, c, d; right, "This is its tensorial representation", with node feature matrix X ∈ ℝ^{|V|×k} (rows x_a, x_b, x_c, x_d, x_u, x_v) and adjacency matrix A = (1 0 0 1 0 1; 0 1 0 0 0 1; 0 0 1 1 1 0; 1 0 1 1 0 1; 0 0 1 0 1 1; 1 1 0 1 1 1) ∈ ℝ^{|V|×|V|}.]
Figure 5.1
An undirected graph G = (V, E) containing the set of nodes V = {u, v, a, b, c, d} and
edges E = {(a, d), (a, v), (b, v), (c, d), (c, u), (d, v), (u, v)} (symmetric edges omitted for
brevity). This graph is also represented using tensors (node features X, adjacency matrix
A) for a node permutation g = [a, b, c, d, u, v].
A graph G = (V, E) is a collection of nodes V and edges E ⊆ V × V between pairs of nodes. For the purpose of the following discussion, we will further assume the nodes to be endowed with s-dimensional node features, denoted by x_u for all u ∈ V; see Figure 5.1 for an illustration. Social networks are perhaps
among the most commonly studied examples of graphs, where nodes repre-
sent users, edges correspond to friendship relations between them, and node
features model user properties such as age, profile picture, etc. It is also often
possible to endow the edges, or entire graphs, with features; but as this does
not alter our main findings, we will defer discussing it to Section 5.2.2.
The key structural property of graphs is that the nodes in V are usually not
assumed to be provided in any particular order, and thus any operations per-
formed on graphs should not depend on the ordering of nodes. The desirable
property that functions acting on graphs should satisfy is thus permutation
invariance, and it implies that for any two isomorphic graphs—the likes of
which we saw in Figure 4.4—the outcomes of these functions are identical.
We can see this as a particular setting of our blueprint, where the domain Ω = G and the space X(G, ℝ^d) is that of d-dimensional node-wise signals. The symmetry we consider is given by the permutation group G = S_n, whose elements are all the possible orderings of the set of node indices {1, . . . , n}.
5.1.1 Sets
Let us first illustrate the concept of permutation invariance on sets, a special case of graphs without edges (i.e., E = ∅). By stacking the node features as rows of the n × d matrix X = (x_1, . . . , x_n)^⊤, we do effectively specify an ordering of
the nodes. The action of the permutation g ∈ S_n on the set of nodes amounts to the reordering of the rows of X, which can be represented as an n × n permutation matrix ρ(g) = P, where each row and column contains exactly one 1 and all the other entries are zeros.
A function f operating on this set is then said to be permutation invariant if, for any such permutation matrix P, it holds that f(PX) = f(X). One simple such function is

f(X) = ϕ( Σ_{u∈V} ψ(x_u) ),   (5.46)

where the function ψ is independently applied to every node's features, and ϕ is applied on its sum-aggregated outputs: as sum is independent of the order in which its inputs are provided, such a function is invariant with respect to the permutation of the node set, and is hence guaranteed to always return the same output, no matter how the nodes are permuted. Such a layer is very powerful—under suitable conditions on ψ's output dimensionality, it can be shown that any permutation-invariant set function must be representable using the form in Equation 5.46 (Wagstaff et al. 2019).
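To make Equation 5.46 concrete, here is a minimal NumPy sketch of such a sum-decomposed set function; the specific choices of ψ and ϕ (a single ReLU layer and a single linear map, with hypothetical weights W_psi, W_phi) are our own illustrative assumptions, not prescribed by the text.

```python
import numpy as np

def set_invariant(X, W_psi, W_phi):
    """f(X) = phi(sum_u psi(x_u)); invariant to any reordering of the rows of X."""
    psi = np.maximum(X @ W_psi, 0.0)   # psi applied to every node independently
    pooled = psi.sum(axis=0)           # sum aggregation: independent of input order
    return pooled @ W_phi              # phi applied to the aggregated representation

# Sanity check of permutation invariance on random inputs
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
W_psi, W_phi = rng.normal(size=(3, 8)), rng.normal(size=(8, 2))
P = np.eye(5)[rng.permutation(5)]      # random 5x5 permutation matrix
assert np.allclose(set_invariant(P @ X, W_psi, W_phi), set_invariant(X, W_psi, W_phi))
```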
Functions like the above provide a 'global' set-level output, but very often, we will be interested in functions that act 'locally', in a node-wise manner. For example, we may want to apply some function to update the features in every node, obtaining the set of latent node features. If we stack these latent features into a matrix H, the function F(X) = H is no longer permutation invariant: the order of the rows of H should be tied to the order of the rows of X, so that we know which output node feature corresponds to which input node. We need instead a more fine-grained notion of permutation equivariance, stating that, once we "commit" to a permutation of inputs, it consistently permutes the resulting objects. Formally, F(X) is a permutation equivariant function if, for any permutation matrix P, it holds that F(PX) = PF(X). A shared node-wise linear transform

F_Θ(X) = XΘ,   (5.47)

specified by a weight matrix Θ ∈ ℝ^{d×d}, is one possible construction of such a permutation equivariant function, producing in our example latent features of the form h_u = Θ^⊤ x_u.
This construction arises naturally from our Geometric Deep Learning blueprint. Namely, we can characterise linear equivariants (functions of the form F(PX) = PF(X)), for which it is easy to verify that any such map can be written as a linear combination of two generators: the identity F_1(X) = X and the average F_2(X) = (1/n) 1 1^⊤ X, which assigns to every node the mean feature (1/n) Σ_{u=1}^n x_u. In Section 5.2.3, we show how the popular Deep Sets (Zaheer et al. 2017) architecture follows precisely this idea.
5.1.2 From sets to graphs
We can now generalise the notions of permutation invariance and equivariance from sets to graphs. In the generic setting E ≠ ∅, the graph connectivity can be represented by the n × n adjacency matrix A, defined as

a_uv = { 1 if (u, v) ∈ E; 0 otherwise }.   (5.48)
Note that now the adjacency and feature matrices A and X are "synchronised", in the sense that a_uv specifies the adjacency information between the nodes described by the uth and vth rows of X – cf. Figure 5.1. Therefore, applying a permutation matrix P to the node features X automatically implies applying it to A's rows and columns, PAP^⊤. We say that (a graph-wise function) f is permutation invariant if

f(PX, PAP^⊤) = f(X, A)   (5.49)

and (a node-wise function) F is permutation equivariant if

F(PX, PAP^⊤) = PF(X, A)   (5.50)

for any permutation matrix P.
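Equations 5.49–5.50 can be verified numerically: the sketch below (our own test harness; the function names are hypothetical) applies random permutation matrices to (X, A) and checks both conditions for two simple example functions.

```python
import numpy as np

def random_permutation_matrix(n, rng):
    """An n x n matrix with exactly one 1 per row and column."""
    return np.eye(n)[rng.permutation(n)]

def check_invariant(f, X, A, rng, trials=5):
    """Numerically test f(PX, PAP^T) == f(X, A) for random permutation matrices P."""
    return all(np.allclose(f(P @ X, P @ A @ P.T), f(X, A))
               for P in (random_permutation_matrix(X.shape[0], rng) for _ in range(trials)))

def check_equivariant(F, X, A, rng, trials=5):
    """Numerically test F(PX, PAP^T) == P F(X, A) for random permutation matrices P."""
    return all(np.allclose(F(P @ X, P @ A @ P.T), P @ F(X, A))
               for P in (random_permutation_matrix(X.shape[0], rng) for _ in range(trials)))

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 4))
A = (rng.random((6, 6)) < 0.4).astype(float)
A = np.maximum(A, A.T)                                           # symmetric (undirected) adjacency

print(check_invariant(lambda X, A: X.sum(axis=0), X, A, rng))    # True: feature sum
print(check_equivariant(lambda X, A: A @ X, X, A, rng))          # True: neighbourhood sum
```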
Here again, we can first characterise linear equivariant functions.
As
observed by Maron et al. 2018, any linear F satisfying equation (5.50) can
be expressed as a linear combination of fifteen linear generators; remarkably,
this family of generators is independent of n. Amongst these generators, our
blueprint specifically advocates for those that are also local, i.e., whereby the
output on node u directly depends on its neighbouring nodes in the graph. We
can formalise this constraint explicitly in our model construction, by defining
what it means for a node to be neighbouring another.
An (undirected) neighbourhood of node u, sometimes also called the 1-hop neighbourhood, is defined as

N_u = {v : (u, v) ∈ E or (v, u) ∈ E}   (5.51)

and the neighbourhood features are defined as the multiset

X_{N_u} = {{x_v : v ∈ N_u}}.   (5.52)

Operating on 1-hop neighbourhoods aligns well with the locality aspect of our blueprint: namely, defining our metric over graphs as the shortest path distance between nodes using edges in E.
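As a small illustration of Equations 5.51–5.52, the sketch below recovers N_u and the neighbourhood feature multiset directly from an edge list; representing the multiset as a sorted list of tuples is just a convenient, order-insensitive encoding we chose.

```python
def neighbourhood(u, edges):
    """N_u = {v : (u, v) in E or (v, u) in E} for an edge set given as pairs."""
    return {b if a == u else a for (a, b) in edges if u in (a, b)}

def neighbourhood_features(u, edges, x):
    """X_{N_u} as a multiset of feature vectors, encoded as a sorted list of tuples."""
    return sorted(tuple(x[v]) for v in neighbourhood(u, edges))

# The graph from Figure 5.1, with toy scalar node features
edges = {("a", "d"), ("a", "v"), ("b", "v"), ("c", "d"), ("c", "u"), ("d", "v"), ("u", "v")}
x = {n: [float(i)] for i, n in enumerate("abcduv")}
print(neighbourhood("v", edges))              # {'a', 'b', 'd', 'u'}
print(neighbourhood_features("v", edges, x))  # [(0.0,), (1.0,), (3.0,), (4.0,)]
```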
The GDL blueprint thus yields a general recipe for constructing permutation equivariant functions on graphs, by specifying a local function ϕ that operates over the features of a node and its neighbourhood, ϕ(x_u, X_{N_u}). Then, a permutation equivariant function F can be constructed by applying ϕ to every node's
Figure 5.2
An illustration of constructing permutation-equivariant functions over graphs, by applying a permutation-invariant function ϕ to every neighbourhood. In this case, ϕ is applied to the features x_b of node b as well as the multiset of its neighbourhood features, X_{N_b} = {{x_a, x_b, x_c, x_d, x_e}}. Applying ϕ in this manner to every node's neighbourhood recovers the rows of the resulting matrix of latent features H = F(X, A).
neighbourhood in isolation (see Figure 5.2):

F(X, A) = [ ϕ(x_1, X_{N_1}) ; ϕ(x_2, X_{N_2}) ; … ; ϕ(x_n, X_{N_n}) ].   (5.53)
As F is constructed by applying a shared function ϕ to each node locally, its permutation equivariance rests on ϕ's output being independent of the ordering of the nodes in N_u. Thus, if ϕ is built to be permutation invariant, then this property is satisfied.
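The construction of Equation 5.53 can be sketched directly: apply a shared ϕ to every node and its neighbourhood multiset. The particular ϕ below (concatenating x_u with an order-invariant neighbour sum) is only an example of a permutation-invariant local function, not the unique choice.

```python
import numpy as np

def local_equivariant_layer(X, A, phi):
    """F(X, A): stack phi(x_u, X_{N_u}) over every node u (Equation 5.53)."""
    rows = []
    for u in range(X.shape[0]):
        neigh = np.flatnonzero(A[u])            # indices v with a_uv = 1
        rows.append(phi(X[u], X[neigh]))        # phi must not depend on neighbour order
    return np.stack(rows)

def phi_sum(x_u, X_neigh):
    """Example invariant phi: concatenate x_u with the sum of neighbour features."""
    pooled = X_neigh.sum(axis=0) if len(X_neigh) else np.zeros_like(x_u)
    return np.concatenate([x_u, pooled])

rng = np.random.default_rng(2)
X = rng.normal(size=(6, 3))
A = (rng.random((6, 6)) < 0.4).astype(float)
A = np.maximum(A, A.T)
H = local_equivariant_layer(X, A, phi_sum)      # latent node features, one row per node
```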
It is also worth noticing that the difference between functions defined on
sets and more general graphs in this example is that in the latter case we need
to explicitly account for the structure of the domain. As a consequence, graphs
stand apart in the sense that the domain becomes part of the input in machine
learning problems, whereas when dealing with sets and grids (both particu-
lar cases of graphs) we can specify only the features and assume the domain
to be fixed. This distinction will be a recurring motif in our discussion. As
a result, the notion of geometric stability (invariance to domain deformation)
is crucial in most problems of learning on graphs. It straightforwardly follows
from our construction that permutation invariant and equivariant functions pro-
duce identical outputs on isomorphic (topologically-equivalent) graphs. These
results can be generalised to approximately isomorphic graphs, and several
results on stability under graph perturbations exist (Levie et al. 2018). We will
return to this important point in our discussion on manifolds, which we will
use as a vehicle to study such invariance in further detail.
Second, due to their additional structure, graphs and grids, unlike sets, can be coarsened in a non-trivial way, giving rise to a variety of pooling operations.
5.2 Model
Graph Neural Networks (GNNs) are the realisation of our Geometric Deep
Learning blueprint on graphs leveraging the properties of the permutation
group. GNNs are among the most general class of deep learning architectures
currently in existence, and as we will see in this text, most other deep learning
architectures can be understood as a special case of the GNN with additional
geometric structure.
As per our discussion in Section 5.1, we consider a graph to be specified with an adjacency matrix A and node features X. We will study GNN architectures that are permutation equivariant functions F(X, A) constructed by applying shared permutation invariant functions ϕ(x_u, X_{N_u}) over local neighbourhoods. Under various guises, this local function ϕ can be referred to as "diffusion", "propagation", or "message passing", and the overall computation of such F as a "GNN layer".
5.2.1 The three “flavours” of GNNs
The design and study of GNN layers is one of the most active areas of deep
learning at the time of writing, making it a landscape that is challenging to
navigate. Fortunately, we find that the vast majority of the literature may be
directly derived from only three spatial
“flavours” of GNN layers (Figure
5.3), which we will present here. These flavours govern the extent to which
ϕ transforms the neighbourhood features, allowing for varying degrees of
complexity when modelling interactions across the graph.
In all three flavours, permutation invariance is ensured by aggregating features from X_{N_u} (potentially transformed, by means of some function ψ) with some permutation-invariant function ⊕, and then updating the features of node u, by means of some function ϕ. Typically, ψ and ϕ are learnable, whereas ⊕ is realised as a nonparametric operation such as sum, mean, or maximum, though it can also be constructed e.g. using recurrent neural networks with appropriate constraints (R. L. Murphy et al. 2018).
In the convolutional flavour (Kipf and Welling 2016; Defferrard, Bresson,
and Vandergheynst 2016; F. Wu et al. 2019), the features of the neighbourhood
[Figure 5.3 graphic: three panels over the neighbourhood of node b (with neighbours a, c, d, e)—convolutional, with fixed coefficients c_bu; attentional, with computed coefficients α_bu; and message-passing, with vector messages m_bu.]
Figure 5.3
A visualisation of the dataflow for the three flavours of GNN layers. We use the neighbourhood of node b from Figure 5.2 to illustrate this. Left-to-right: convolutional, where sender node features are multiplied with a constant, c_uv; attentional, where this multiplier is implicitly computed via an attention mechanism of the receiver over the sender: α_uv = a(x_u, x_v); and message-passing, where vector-based messages are computed based on both the sender and receiver: m_uv = ψ(x_u, x_v).
nodes are directly aggregated with fixed weights,

h_u = ϕ( x_u , ⊕_{v∈N_u} c_uv ψ(x_v) ).   (5.54)

Here, c_uv specifies the importance of node v to node u's representation. It is a constant that often directly depends on the entries in A representing the structure of the graph. Note that when the aggregation operator ⊕ is chosen to be the summation, it can be considered as a linear diffusion or position-dependent linear filtering, a generalisation of convolution—giving the flavour its name.
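A minimal NumPy sketch of Equation 5.54 with summation as ⊕. The particular constant c_uv = 1/√(d_u d_v) (symmetric degree normalisation, as used by Kipf and Welling 2016) and the single-layer ψ and ϕ are illustrative assumptions rather than the only valid choices.

```python
import numpy as np

def conv_gnn_layer(X, A, W_psi, W_phi):
    """h_u = phi(x_u, sum_v c_uv psi(x_v)), with fixed c_uv = 1/sqrt(d_u d_v)."""
    A_hat = A + np.eye(A.shape[0])                    # add self-loops so u also sees itself
    deg = A_hat.sum(axis=1)
    C = A_hat / np.sqrt(np.outer(deg, deg))           # fixed aggregation coefficients c_uv
    messages = C @ (X @ W_psi)                        # aggregate linearly transformed neighbours
    return np.maximum(np.concatenate([X, messages], axis=1) @ W_phi, 0.0)   # phi: ReLU layer

rng = np.random.default_rng(3)
X = rng.normal(size=(6, 4))
A = (rng.random((6, 6)) < 0.4).astype(float)
A = np.maximum(A, A.T); np.fill_diagonal(A, 0)
W_psi, W_phi = rng.normal(size=(4, 8)), rng.normal(size=(12, 8))
H = conv_gnn_layer(X, A, W_psi, W_phi)                # (6, 8) updated node features
```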
In the attentional flavour (Veličković et al. 2018; Monti et al. 2017; J. Zhang et al. 2018; Brody, Alon, and Yahav 2022), these interactions are implicit:

h_u = ϕ( x_u , ⊕_{v∈N_u} a(x_u, x_v) ψ(x_v) ).   (5.55)

Here, a is a learnable self-attention mechanism that computes the importance coefficients α_uv = a(x_u, x_v) implicitly. They are often softmax-normalised across all neighbours. When ⊕ is the summation, the aggregation is still a linear combination of the neighbourhood node features, but now the weights are feature-dependent.
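A sketch of Equation 5.55 in the same style, with softmax-normalised attention coefficients; the additive scoring function below is loosely inspired by GAT (Veličković et al. 2018), but its exact parametrisation here (vectors a_src, a_dst) is our own simplifying assumption.

```python
import numpy as np

def softmax_rows(Z):
    Z = Z - Z.max(axis=-1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=-1, keepdims=True)

def attn_gnn_layer(X, A, W_psi, a_src, a_dst, W_phi):
    """h_u = phi(x_u, sum_v alpha_uv psi(x_v)), alpha_uv softmax-normalised over N_u."""
    Z = X @ W_psi                                                    # psi(x_v) for every node
    scores = np.tanh(Z @ a_dst[:, None] + (Z @ a_src[:, None]).T)    # a(x_u, x_v), shape (n, n)
    A_hat = A + np.eye(A.shape[0])
    scores = np.where(A_hat > 0, scores, -1e9)                       # mask out non-neighbours
    alpha = softmax_rows(scores)                                     # normalise per receiver u
    messages = alpha @ Z
    return np.maximum(np.concatenate([X, messages], axis=1) @ W_phi, 0.0)

rng = np.random.default_rng(4)
X = rng.normal(size=(6, 4))
A = (rng.random((6, 6)) < 0.4).astype(float)
A = np.maximum(A, A.T); np.fill_diagonal(A, 0)
W_psi = rng.normal(size=(4, 8))
a_src, a_dst = rng.normal(size=8), rng.normal(size=8)
W_phi = rng.normal(size=(12, 8))
H = attn_gnn_layer(X, A, W_psi, a_src, a_dst, W_phi)
```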
Finally, the message-passing flavour (Battaglia et al. 2016; Gilmer et al. 2017; Battaglia et al. 2018) amounts to computing arbitrary vectors ("messages") across edges,

h_u = ϕ( x_u , ⊕_{v∈N_u} ψ(x_u, x_v) ).   (5.56)

Here, ψ is a learnable message function, computing v's vector sent to u, and the aggregation can be considered as a form of message passing on the graph.
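And a sketch of the fully general flavour of Equation 5.56, where the message function ψ sees both endpoints of each edge; the explicit Python loop keeps the correspondence with the equation obvious, at the cost of speed, and the specific single-layer ψ and ϕ are again placeholder choices.

```python
import numpy as np

def mpnn_layer(X, A, psi, phi):
    """h_u = phi(x_u, sum_{v in N_u} psi(x_u, x_v))."""
    out = []
    for u in range(X.shape[0]):
        msgs = [psi(X[u], X[v]) for v in np.flatnonzero(A[u])]
        agg = np.sum(msgs, axis=0) if msgs else np.zeros_like(psi(X[u], X[u]))
        out.append(phi(X[u], agg))
    return np.stack(out)

rng = np.random.default_rng(5)
W_m, W_u = rng.normal(size=(8, 8)), rng.normal(size=(12, 8))
psi = lambda x_u, x_v: np.maximum(np.concatenate([x_u, x_v]) @ W_m, 0.0)  # message function
phi = lambda x_u, m: np.concatenate([x_u, m]) @ W_u                       # update function

X = rng.normal(size=(6, 4))
A = (rng.random((6, 6)) < 0.4).astype(float)
A = np.maximum(A, A.T); np.fill_diagonal(A, 0)
H = mpnn_layer(X, A, psi, phi)
```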
One important thing to note is a representational containment between these approaches: convolution ⊆ attention ⊆ message-passing. Indeed, attentional GNNs can represent convolutional GNNs by an attention mechanism implemented as a look-up table a(x_u, x_v) = c_uv, and both convolutional and attentional GNNs are special cases of message-passing where the messages are only the sender nodes' features: ψ(x_u, x_v) = c_uv ψ(x_v) for convolutional GNNs and ψ(x_u, x_v) = a(x_u, x_v)ψ(x_v) for attentional GNNs. Note that the latter containment holds even if the attentional coefficients are normalised across neighbours; we will explore this further in Section 5.4.1.
This does not imply that message passing GNNs are always the most useful
variant; as they have to compute vector-valued messages across edges, they are
typically harder to train and require unwieldy amounts of memory. Further, on
a wide range of naturally-occurring graphs, the graph’s edges encode for down-
stream class similarity (i.e. an edge (u, v) implies that u and v are likely to have
the same output). For such graphs (often called homophilic), convolutional
aggregation across neighbourhoods is often a far better choice, both in terms
of regularisation and scalability. Attentional GNNs offer a “middle-ground”:
they allow for modelling complex interactions within neighbourhoods while
computing only scalar-valued quantities across the edges, making them more
scalable than message-passing.
The “three flavour” categorisation presented here is provided with brevity
in mind and inevitably neglects a wealth of nuances, insights, perspectives,
generalisations, and historical contexts to GNN models. However, based on
arguments we will present in the following subsections, its expressivity may
be sufficient to represent any graph function of interest.
5.2.2 A maximally potent GNN: the Graph Network model
We started out our discussion by deliberately assuming the graph only had
features in the nodes; however, for many domains of scientific and industrial
interest, we may be interested in featurising the graph at other granularities.
One very important such context is computational chemistry: when working with molecular data, we may expect to see, simultaneously, node features, x_u, such as atom type and charge; edge features, x_uv, such as bond types and whether the bond is part of a ring; and graph features, x_G, such as the molecular weight or fingerprints. We may then be interested in using a GNN to compute representations h_u, h_uv, h_G at all of these levels.
To illustrate how all the concepts we introduced before generalise to this set-
ting, we present the Graph Network architecture (Battaglia et al. 2018). It is a
[Figure 5.4 graphic: inputs x_u, x_v, x_uv, x_G feed into ψ to produce edge representations h_uv; these are aggregated over v ∈ N_u and combined with x_u and x_G by ϕ to produce node representations h_u; finally, node and edge representations are aggregated over u ∈ V and (u, v) ∈ E and combined with x_G by ρ to produce h_G.]
Figure 5.4
The dataflow of the Graph Network model (Battaglia et al. 2018), integrating node
features (in blue), edge features (in green) and graph features (in red) to compute appro-
priate representations. Equivariant (blue) and invariant (red) layers feature prominently
across the Graph Network’s three functions ψ, ϕ and ρ.
general blueprint for building permutation equivariant functions over attributed
graphs, and all of the flavours we discussed can be obtained by an appropriate
restriction of its equations. It proceeds in three phases, at each step using all of
the relevant information (Figure 5.4).
First, edge representations, h_uv, are computed using the features of the relevant two nodes (u and v), the features of the edge, and the features of the graph:

h_uv = ψ(x_u, x_v, x_uv, x_G),   (5.57)

where ψ would be analogous to the previously defined message function.
Treating h_uv as messages, we can compute node representations. The Graph Network takes into account the node's features, the aggregated edge representations from all of the neighbours, and the graph features:

h_u = ϕ( x_u , ⊕_{v∈N_u} h_uv , x_G ).   (5.58)
Finally, since graph-level representations are a global property of the graph, they need to be computed invariantly. Embodying the spirit of the geometric deep learning blueprint, we aggregate all the node and edge representations to compute h_G:

h_G = ρ( ⊕_{u∈V} h_u , ⊕_{(u,v)∈E} h_uv , x_G ).   (5.59)
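To tie Equations 5.57–5.59 together, here is a compact sketch of one Graph Network block, using sums for every ⊕ and single linear maps for ψ, ϕ and ρ; these parametrisations (and the weight shapes in the usage example) are placeholders—the framework itself leaves them free.

```python
import numpy as np

def graph_network(X, edges, X_e, x_g, W_psi, W_phi, W_rho):
    """One Graph Network block: edge update (5.57), node update (5.58), graph update (5.59).

    X: (n, d) node features; edges: list of (u, v) pairs; X_e: (m, d_e) edge features;
    x_g: (d_g,) graph features.
    """
    n = X.shape[0]
    # (5.57): h_uv = psi(x_u, x_v, x_uv, x_G)
    H_e = np.stack([np.concatenate([X[u], X[v], X_e[k], x_g]) @ W_psi
                    for k, (u, v) in enumerate(edges)])
    # (5.58): h_u = phi(x_u, sum of incident edge representations, x_G)
    agg = np.zeros((n, H_e.shape[1]))
    for k, (u, v) in enumerate(edges):
        agg[u] += H_e[k]                       # each edge (u, v) contributes to receiver u here
    H_n = np.stack([np.concatenate([X[u], agg[u], x_g]) @ W_phi for u in range(n)])
    # (5.59): h_G = rho(sum_u h_u, sum_uv h_uv, x_G)
    h_g = np.concatenate([H_n.sum(axis=0), H_e.sum(axis=0), x_g]) @ W_rho
    return H_e, H_n, h_g

rng = np.random.default_rng(6)
X, X_e, x_g = rng.normal(size=(4, 4)), rng.normal(size=(3, 2)), rng.normal(size=3)
edges = [(0, 1), (1, 2), (2, 3)]
W_psi = rng.normal(size=(4 + 4 + 2 + 3, 8))    # psi input: x_u, x_v, x_uv, x_G
W_phi = rng.normal(size=(4 + 8 + 3, 8))        # phi input: x_u, aggregated h_uv, x_G
W_rho = rng.normal(size=(8 + 8 + 3, 8))        # rho input: pooled h_u, pooled h_uv, x_G
H_e, H_n, h_g = graph_network(X, edges, X_e, x_g, W_psi, W_phi, W_rho)
```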
Lastly, the reader may be curious whether such a framework truly is maximally potent—could it, for example, also support structures such as hypergraphs, where more than two nodes may be involved in a hyperedge? We will ponder this question from several angles in the following sections.
5.2.3 Deep Sets, Transformers, and Graph Rewiring
The “three flavours” view of GNNs presented thus far largely concerns itself
with the design of their message and update functions (ψ and ϕ). However,
looking at the general message passing equation (Equation 5.56), there are in
fact two other "moving parts": the choice of aggregator, ⊕, and the choice of neighbourhoods, N_u. In the process of analysing these choices, it turns out we can form tight connections between GNNs and many other influential neural network architectures. We will begin by analysing the choice of N_u.
In fact, the choice of N_u is directly tied to the input graph structure we are assuming. Thus far, we presumed that N_u is given to us, and we did not question its structure. However, such a situation is in reality quite rare: many graphs
we observe in the real world are merely approximations of the “ground-truth”
interactions between the nodes. For example, in a social network, we only
observe the friendship links users have explicitly formed, rather than the com-
plete collection of friendship interactions between those people. And within a
molecule, there exist physical interactions between all pairs of atoms—not just
ones explicitly connected by a chemical bond. This example becomes espe-
cially prominent when we consider interactions within macromolecules such as proteins, which fold in 3D space, allowing atoms that are quite distant in the molecular graph to interact more directly.
Beyond the need to handle such approximations to ground-truth graphs, it
is also well-understood that certain choices of graph are simply pathologically
bad and their use will lead to issues regardless of the GNN architecture we
use (Alon and Yahav 2020; Di Giovanni et al. 2023). One typical instance of
this arises when the graph has bottlenecks—small collections of edges which
are responsible for carrying representations between large groups of nodes. For
example, in a barbell graph—two densely connected communities joined by a single edge—that bridging edge is under significant representational pressure to transport information between the nodes in the two communities. If it is important for those communities to directly
exchange features, it is possible to prove that in many cases, GNNs will not
even be able to fit their training data accurately under such a graph.
To study various standard choices of N_u without discussing properties of any given input graphs, we will now briefly return to the domain of unordered sets. In the language of Section 5.1, we assume that we are given a matrix of
Figure 5.5
Three strategies for choosing an edge set for a GNN, left-to-right: empty, leading to Deep Sets (Zaheer et al. 2017); complete, leading to Transformers (Vaswani et al. 2017); and precomputed, leading to EGP (Deac, Lackenby, and Veličković 2022). In all cases we highlight one of the nodes' neighbourhoods, N_1, in red.
node features, X, but without any specified adjacency or ordering information
between the nodes. The specific architectures will arise by deciding to what
extent to model interactions between the nodes; see Figure 5.5.
5.2.3.1 Empty edge set Unordered sets are provided without any additional
structure or geometry whatsoever—hence, it could be argued that the most
natural way to process them is to treat each set element entirely independently.
This translates to a permutation equivariant function over such inputs, which
was already introduced in Section 5.1: a shared transformation applied to every
node in isolation. Assuming the same notation as when describing GNNs, such
models can be represented as

h_u = ψ(x_u),   (5.60)

where ψ is a learnable transformation. It may be observed that this is a special case of a convolutional GNN with N_u = {u}—or, equivalently, A = I. Such an architecture is commonly referred to as Deep Sets, in recognition of the work of Zaheer et al. (2017), which theoretically proved several universal-approximation properties of such architectures. It should be noted that the need
to process unordered sets commonly arises in computer vision and graphics
when dealing with point clouds; therein, such models are known as PointNets
(Qi et al. 2017). Philosophically, it may also be noted that this architecture is
the furthest we can go with “pure” sets; any additional inputs to ψ imply that
there exist connections between the nodes, making our domain a graph.
5.2.3.2 Complete edge set While assuming an empty edge set is a very
efficient construct for building functions over unordered sets, often we would
expect that elements of the set exhibit some form of relational structure—
i.e., that there exists a latent graph between the nodes. Setting A = I discards
any such structure, and may yield suboptimal performance. Conversely, we
could assume that, in the absence of any other prior knowledge, we cannot upfront
exclude any possible links between nodes. In this approach we assume the
complete graph, A = 11^⊤; equivalently, N_u = V. As we do not assume access
to any coefficients of interaction, running convolutional-type GNNs over such
a graph would amount to:
h_u = ϕ( x_u , ⊕_{v∈V} ψ(x_v) ),   (5.61)

where the second input, ⊕_{v∈V} ψ(x_v), is identical for all nodes u, and as such makes the model equivalently expressive to ignoring that input altogether; i.e. the A = I case mentioned above.
This motivates the use of a more expressive GNN flavour, the attentional,
h_u = ϕ( x_u , ⊕_{v∈V} a(x_u, x_v) ψ(x_v) ),   (5.62)

which yields the self-attention operator, the core of the Transformer architecture (Vaswani et al. 2017). Assuming some kind of normalisation over the attentional coefficients (e.g. softmax), we can constrain all the scalars a(x_u, x_v) to be in the range [0, 1]; as such, we can think of self-attention as inferring a soft adjacency matrix, a_uv = a(x_u, x_v), as a byproduct of gradient-based optimisation for some downstream task.
The above perspective means that we can pose Transformers exactly as attentional GNNs over a complete graph (Joshi 2020).
However, this is in apparent conflict with Transformers being initially proposed for modelling sequences—the representations h_u should be mindful of node u's position in the sequence, which complete-graph aggregation would ignore. Transformers address this issue by introducing positional encodings: modulating either the node features x_u or the computed attention coefficients a_uv in a way that depends on the absolute or relative positions of u and v in the sequence. The original Transformer architecture, for example, adds to x_u a vector of sine and cosine waves evaluated at position u, with frequencies that vary across the feature dimensions.
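For concreteness, here is a sketch of the sinusoidal positional encodings used by the original Transformer, added to the token features x_u; the base constant 10000 and the sine/cosine interleaving follow Vaswani et al. (2017), while the even dimensionality is assumed for simplicity.

```python
import numpy as np

def sinusoidal_positional_encoding(num_positions, dim):
    """PE[u, 2i] = sin(u / 10000^(2i/dim)); PE[u, 2i+1] = cos(u / 10000^(2i/dim))."""
    positions = np.arange(num_positions)[:, None]              # u = 0, 1, ..., n-1
    freqs = 1.0 / (10000.0 ** (np.arange(0, dim, 2) / dim))    # one frequency per sin/cos pair
    angles = positions * freqs                                 # shape (num_positions, dim // 2)
    pe = np.zeros((num_positions, dim))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

X_tokens = np.zeros((5, 16))                                   # hypothetical token embeddings x_u
X_tokens = X_tokens + sinusoidal_positional_encoding(5, 16)    # additive positional modulation
```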
On graphs, where no natural ordering of nodes exists, multiple alternatives to such positional encodings have been proposed. One particularly promising
direction involves a realisation that the positional encodings used in ‘vanilla’
Transformers can be directly related to the discrete Fourier transform (DFT),
and hence to the eigenvectors of the graph Laplacian of a “circular grid”.
Hence, such positional encodings are implicitly representing our assumption
that input nodes are connected in a grid. For more general graph structures,
one may simply use the Laplacian eigenvectors of the (assumed) graph—an
observation exploited by Dwivedi and Bresson (2020) within their empirically
powerful Graph Transformer model. Designing Transformers for graphs is a
very active and important area of research; we direct the reader to Müller et
al. (2023) for an excellent overview.
Finally, we note that while our discussion might have been sufficient to
describe many important aspects of self-attention, additional careful details
are needed to relate the modern large language model (LLM) stack to our
framework. We cover this in detail as a case study in Section 5.4.
5.2.3.3 Inferred edge set Finally, one can try to learn the latent relational
structure, leading to some general A that is neither I nor 11^⊤. The problem
of inferring a latent adjacency matrix A for a GNN to use (often called latent
graph inference) is of high interest for graph representation learning. This is
due to the fact that assuming A = I may be representationally inferior, and
A = 11^⊤ may be challenging to implement due to memory requirements and
large neighbourhoods to aggregate over. Additionally, it is closest to the “true”
problem: inferring an adjacency matrix A implies detecting useful structure
between the rows of X, which may then help formulate hypotheses such as
causal relations between variables.
Unfortunately, such a framing necessarily induces a step-up in modelling
complexity. Specifically, it requires properly balancing a structure learning
objective (which is discrete, and hence challenging for gradient-based opti-
misation) with any downstream task the graph is used for. This makes latent
graph inference a highly challenging and intricate problem. Accordingly, many
of the current popular methods in this space side-step the issue of learnability
by using fully nonparametric methods (Wang et al. 2019), or pre-computing
the graph structure to use (Deac, Lackenby, and Veličković 2022; Shirzad et al. 2023). These methods are also popular when modifying a given input
graph to e.g. relieve bottlenecks—an approach commonly referred to as (non-
parametric) graph rewiring (Gasteiger, Weißenberger, and Günnemann 2019;
Topping et al. 2021).
5.2.4 The expressive power of GNNs, and the Weisfeiler-Lehman test
The remaining design choice of GNNs to discuss is the choice of aggregation function, ⊕. While studying the choice of neighbourhoods allowed us to relate many popular architectures to GNNs, studying aggregators will allow us to reason about the GNNs' expressive power, as we will now discover.
Figure 5.6
How the Weisfeiler-Lehman algorithm featurises a given graph of five nodes and six edges, with categorical features drawn from a one-element set (i.e., effectively a featureless graph). At each step, the colours are refined using injective mappings of neighbourhood multisets, until the distribution of colours stops changing—in this case, yielding a multiset of five node colours. This histogram of colours can then be compared against another graph's.
It would be very useful to understand whether the GNN flavours we dis-
cussed are capable of representing any permutation equivariant function we
care about over graphs, or if there exists a need to study alternative frame-
works. To quantify the kinds of functions a GNN can express, we typically
study them from the perspective of deciding graph isomorphism: given two non-isomorphic graphs, G_1 and G_2, and a GNN embedding them into a latent space h, does it hold that h_{G_1} ≠ h_{G_2}? If not, any kind of distinct classification of these two graphs is hopeless.
To simplify our analysis, we will assume that the graphs have categorical node features, x_u ∈ {0, 1}^k; that is, these features are one-hot vectors of k dimensions. We can interpret this as a colouring of their nodes. Recall that, in our derivation so far, we express GNNs using the abstract formula h_u = ϕ(x_u, X_{N_u}), where the features of the neighbourhood are represented as ⊕_{v∈N_u} ψ(x_u, x_v). In order to maximise the capacity of the GNN, note that h_u will retain maximal information about the graph—and hence, have maximal expressive capacity—when ϕ is injective in its inputs: that is, two different neighbourhood multisets must be mapped to different vectors by ϕ.
In the categorical-features case, we can obtain injectivity by choosing a very simple message function: ψ(x_u, x_v) = x_v, and then picking an aggregator ⊕ that preserves all multiset statistics of X_{N_u}, especially around repeated elements. Such an aggregator is summation (⊕ = Σ), as it will preserve information about all cardinalities of repeated items in the multiset. Weaker
aggregators, such as averaging or maximisation, may reduce these cardinalities to their relative ratios ((1/|N_u|) Σ), or eliminate them altogether (max). Xu et al. (2018) leverage this insight to derive the graph isomorphism network (GIN), a maximally-expressive GNN over categorical node features:

h_u = ϕ( (1 + ϵ) x_u + Σ_{v∈N_u} x_v ),   (5.63)

where ϵ ∈ ℝ is a learnable scalar designed to give an injective representation to the receiver node (u), and ϕ is a deep multilayer perceptron (which can represent an injective function). The expressive power of such GNNs is then equivalent to that of the Weisfeiler-Lehman graph isomorphism test (Weisfeiler and Leman 1968), a classical algorithm in graph theory which iteratively refines the colours of the nodes using an (injective) hash of the neighbours' colours (Morris et al. 2019). An illustration of how this algorithm operates is provided in Figure 5.6.
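A sketch of one-dimensional Weisfeiler-Lehman colour refinement, where the injective "hash" is implemented by interning each (colour, sorted neighbour-colour multiset) pair into a dictionary; the example reproduces the 6-cycle versus two-triangles failure case discussed in the next subsection.

```python
def wl_refine(adj, iters=10):
    """1-WL: adj maps each node to its set of neighbours; returns a stable colouring."""
    colours = {u: 0 for u in adj}                  # featureless graph: a single initial colour
    for _ in range(iters):
        signatures = {u: (colours[u], tuple(sorted(colours[v] for v in adj[u]))) for u in adj}
        palette = {sig: i for i, sig in enumerate(sorted(set(signatures.values())))}
        refined = {u: palette[signatures[u]] for u in adj}
        if len(set(refined.values())) == len(set(colours.values())):
            return refined                         # number of colour classes stopped growing
        colours = refined
    return colours

# A 6-cycle and two disjoint triangles: every neighbourhood looks locally identical,
# so both graphs keep a single colour and 1-WL cannot tell them apart.
cycle6 = {i: {(i - 1) % 6, (i + 1) % 6} for i in range(6)}
triangles = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}, 3: {4, 5}, 4: {3, 5}, 5: {3, 4}}
print(sorted(wl_refine(cycle6).values()) == sorted(wl_refine(triangles).values()))  # True
```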
5.2.5 *More expressive GNNs
While the equivalence of message passing (as described in Equation 5.56)
to the Weisfeiler-Lehman test is certainly a satisfying result, it leaves much
to be desired. For example, it implies that message passing neural networks
provably cannot distinguish a 6-cycle from two disjoint triangles; the
colour refinement procedure will never update any colour, as all neighbour-
hoods "locally look the same". In some applications, this is a major drawback: for example, in organic chemistry, where molecules are modelled as graphs, cyclic structures such as aromatic rings are abundant and important.
There are multiple ways to extend the expressive power of GNNs that we
will discuss next; note that none of them invalidates the equivalence of GNNs to the WL test, as they modify the assumptions concerning the input provided to the GNN. In fact, we may often leverage specific limiting
examples—like the 6-cycle/triangle one above—to derive meaningful ways in
which such limitations can be overcome in general.
5.2.5.1 Feature augmentation We start from perhaps the most obvious
reason why the WL test may fail: too many neighbourhoods being locally
identical. This issue arises because there are no distinguishing features in the
nodes; sometimes, the graph may even be featureless, as we discussed. It also
immediately hints at a cheap way to surpass this limitation: simply augment
the graph with more descriptive features.
An approach like this “pre-colours” different nodes (or edges) of the graph
by attaching some additional features to them, giving the WL algorithm (and
hence also GNNs!) additional information to work with when distinguish-
ing non-isomorphic graphs. These pre-colours do not have to be particularly
meaningful— already randomly featurising the nodes is provably sufficient to
increase expressive power (Sato, Yamada, and Kashima 2021). Random fea-
tures allow message passing to identify a node in its own k-hop neighbourhood,
empowering GNNs to detect cycles of a certain cardinality.
Naturally, feature augmentation also works very well when performed in a
more informed manner (R. Murphy et al. 2019); it is even possible to count
substructures we expect to care about (e.g. aromatic rings in molecules) and
use these counts for additional featurisation (Bouritsas et al. 2022).
5.2.5.2 Message passing modulation While introducing additional struc-
tural features boosts expressive power, the GNN is not committed to using
such features. As such, another promising class of models injects such features
directly into the equation of the message function ψ; in that way, the model is
forced to take this information into account.
There are many representative instances of such modulated message mech-
anisms; we call out two early successful examples. Firstly, directional graph
networks (Beaini et al. 2021), wherein Laplacian eigenvectors are used to
define canonical “flows” on a graph, and they explicitly guide the scaling
factors of various incoming messages. Secondly, the LSPE model (Dwivedi
et al. 2021) has dedicated message passing over the positional features it
leverages, once again coercing the model to process them nontrivially.
Such ideas have also permeated into graph Transformers, with perhaps
the most prominent example being the Graphormer (C. Ying et al. 2021).
Graphormers compute all-pairs shortest paths between nodes in the graph, and
use them as an explicit bias for the self-attention operator.
5.2.5.3 Computation graph design While pre-computing interesting
graph properties and then leveraging them (either as additional features or
through modifying messages directly) is an attractive approach, it also arguably
requires a degree of feature engineering. It might be more favourable to
empower the model to extract meaningful structure by itself—but it provably
cannot do so over many input graphs, owing to the equivalence to the WL test.
This motivates a broad class of methods that modify the input graph of a GNN
in a way that will allow message passing to better detect certain substructures.
We’ve already covered some of these approaches in Section 5.2.3.3 under
the umbrella of “graph rewiring”; one key difference is that approaches we
will discuss here often introduce additional nodes into the graph. Such nodes
(often referred to as virtual nodes) carry representations that promote structure
[Figure 5.7 graphic: an input graph on nodes 1, 2, 3 and three sampled subgraphs, each obtained by deleting one edge (1↮3, 1↮2, 2↮3), shown alongside their adjacency matrices; the node symmetry S_n acts within each subgraph and the subgraph symmetry S_m acts on the collection of subgraphs.]
Figure 5.7
A subgraph GNN operates over several sampled subgraphs of the given input graph simultaneously. ESAN (Bevilacqua et al. 2021) is a subgraph GNN that operates in a way that respects both the permutation symmetries of the nodes (S_n) and the subgraphs (S_m)—hence, it is equivariant to the product group S_n × S_m.
discovery; when appropriately placed, they can provably improve the quality of the GNN's dataflow (Wilson, Bechler-Speicher, and Veličković 2024).
Perhaps the easiest place to start tracking these approaches is by study-
ing extensions to the Weisfeiler-Lehman test itself. It was indeed extended
in the late 1970s to a hierarchy of increasingly powerful higher-dimensional k-WL tests that operate on k-tuples of nodes. Christopher Morris et al. (2019) and Haggai Maron et al. (2018; 2019) showed different designs of higher-dimensional GNN architectures equivalent to k-WL tests; however, such models lose the advantage of locality and linear complexity of the standard message-passing GNN. While such approaches featurise all relevant k-tuples indiscriminately, it is also possible to perform
more targeted message passing over specific topological substructures in the
graph, such as simplicial and cellular complexes of the graph (Bodnar, Frasca,
Wang, et al. 2021; Bodnar, Frasca, Otter, et al. 2021).
While approaches inspired by the above typically add nodes and edges into
the graph, it is also possible to recover improved expressive power by selec-
tively deleting nodes or edges—effectively, by message passing over selected
subgraphs of the input graph (Figure 5.7). One particularly interesting repre-
sentative subgraph GNN from the perspective of equivariances is ESAN, the equivariant subgraph aggregation network (Bevilacqua et al. 2021). ESAN operates over multisets of subgraphs, whose symmetry arises both from the structure of each constituent subgraph and from the multiset as a whole.
The principle of equivariance requires that changing the order of the m subgraphs in the multiset (permutation group S_m) and the order of the n nodes in the subgraphs (permutation group S_n) yields an equivalent representation; in other words, the architecture is equivariant w.r.t. the product group G = S_m × S_n. This example serves as a relevant instance of combining symmetries within the same architecture; we will explore more involved instances of this when we discuss geometric graphs in later Chapters.
5.2.5.4 Synchronisation and “clocks” Thus far, we have discussed sev-
eral ways in which the message passing equation may be edited: its parametric
functions (ψ, ϕ), aggregation function (⊕), features (x_u), and computation graph (N_u). It may seem as if we've covered all the angles possible by doing
so, but in fact, there remains one hidden but important aspect – the synchroni-
sation between the different steps of the mechanism. In reality, Equation 5.56
stores embeddings over time:

h_u^{(t+1)} = ϕ( x_u^{(t)} , ⊕_{v∈N_u} ψ( x_u^{(t)}, x_v^{(t)} ) ),   (5.64)

where t ∈ ℕ identifies the current time-step of the GNN, and x_u^{(t)} corresponds to the GNN input embeddings available at time t—note that we may simply equate x_u^{(t+1)} = h_u^{(t+1)} in most cases, though they may also differ.
In almost all cases, Equation 5.64 is executed synchronously: all h_u^{(t+1)} embeddings are computed simultaneously, in parallel – in essence, there is no
distinction between the notion of time-step and the notion of GNN layer. But
what if we relaxed this assumption, and allowed specific subsets of messages
to be computed or aggregated at one step, rather than all-at-once?
It turns out that this modification, even when only one message at a time
is processed (see Figure 5.8), can yield improvements to expressive power.
This effect was first studied and proved by Faber and Wattenhofer (2024) for
their GwAC asynchronous architecture. The argument is as follows: clearly,
an asynchronous GNN can match the execution of a synchronous one—we
just need to use an execution order that matches the relevant synchronisation
points. As such, the model is at least 1-WL expressive. Further, for specific
execution orders, it can be shown that it is possible to exactly trace the outline
of particular cycle structures in the graph, or even exactly identify each node
(as we did with random features previously), allowing it to discern structures that
1-WL cannot.
GwAC does not take a very strong stance on the execution order in its asyn-
chronous trace, but several other architectures have since emerged that do. One
[Figure 5.8 graphic: an asynchronous execution trace over nodes a, b, c, d, one operation per clock tick—m_ba = ψ(x_b, x_a); m_bc = ψ(x_b, x_c); x_b = ϕ(x_b, m_bc); m_bb = ψ(x_b, x_b); x_b = ϕ(x_b, m_ba); m_db = ψ(x_d, x_b); x_b = ϕ(x_b, m_bb); x_d = ϕ(x_d, m_db).]
Figure 5.8
An execution trace of a fully-asynchronous GNN: at each time-step (denoted by
advancing clock), exactly one message is either computed or aggregated. Note that
the snapshot of the node features dynamically changes whenever ϕ is invoked.
such architecture is the Co-GNN (Finkelshtein et al. 2024), wherein each node
can select whether to send or receive messages at a particular GNN layer.
While powerful, asynchronous GNNs do not neatly align with the currently
prevalent massively-parallel approaches to machine learning models. There-
fore, a middle-ground approach that can exploit such parallelism, while still
having clear traits of asynchronous computation, would be very attractive.
Dudzik et al. (2024) have shown that it is indeed possible to build such
a model, through the concept of asynchrony invariance: keeping the model
fully synchronous, but constraining it to give identical embeddings even if we
made its execution trace asynchronous in certain ways. This approach neatly
resonates with the ideas of symmetry that we explored thus far; however, it
requires working with generalisations of groups (monoids), and as such we
will only study it significantly later in this book—namely, in its final Chapter.
5.2.5.5 Continuous node features There is no doubt that the scaffold
offered by the k-WL hierarchy has had an outsized impact on the development
of graph representation learning. As we’ve just seen, this algorithm family,
and the observation that GNNs naturally align to it, has informed countless
contributions in feature engineering, architecture design, graph rewiring, and
even asynchronous models. Its flexibility does not come without limitations,
however—and perhaps the most glaring of these is its assumption of discrete node features. For many scientific and industrial tasks arising from real-world data, the features we have available will often be real-valued. We will focus only on this particular extension, and how it leads to transcending the WL hierarchy. That said, we remark that several other approaches not well-aligned with k-WL have emerged in recent times; e.g., Tönshoff et al. (2021).
While Xu et al. (2018) left the study of aligning GNNs with real-valued features for deciding graph isomorphism to future work, it was already evident from the work of Zaheer et al. (2017) that extending such results even in the domain of sets was going to be difficult. The essence of the difficulties governing this transition was finally exposed and settled by Corso et al. (2020). The
key issue arises in the larger flexibility of the objects that need to be injectively
aggregated: while in the WL setting we dealt with aggregating multisets of one-
hot vectors—and hence all we had to track for injectivity was the preservation
of their cardinalities—with real-valued features, non-isomorphic graphs can
have clashing embeddings from unexpected directions.
To illustrate this, let us first recall that, according to 1-WL, aggregator choices for ⊕ are ordered by expressivity as max < (1/|N_u|) Σ < Σ, with Σ being maximally expressive. Now, assume that our inbound messages are in ℝ rather than {0, 1}^k. It is relatively straightforward to play around with examples of real-valued, non-isomorphic multisets where not only this hierarchy is broken, but some may not be injectively distinguished by any of these three aggregators:
• {{4, 1, 1}} vs. {{3, 3, 0}}: distinguished by max but not by mean or sum.
• {{4, 2, 2, 0}} vs. {{4, 4, 0, 0}}: cannot be distinguished by max, mean or sum—however, they can be distinguished by standard deviation.
Following these leads, with a little help from standard facts in algebraic topology, it is possible to show that no single aggregator choice for ⊕ is sufficient to injectively embed real-valued neighbourhoods. In fact, it can be shown that the number of aggregators, N, needed for injectivity is equal to the maximal neighbourhood size, max_{u∈V} |N_u|, in our dataset. While it is not yet clear what property these collections of aggregators need to satisfy to be injective, Corso et al. (2020) showed that the first N moments of the neighbourhood multiset (whose elements are modelled using a random variable X), defined as follows:

M_1 = E(X),    M_n = ⁿ√(E[(X − E(X))^n])   (n > 1),   (5.65)
would serve as a sufficient collection of aggregators. They leveraged this insight to propose the principal neighbourhood aggregation (PNA) architecture, which retains only the numerically stable moments (n ∈ {1, 2, ∞}, amounting to mean, standard deviation, min and max).
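As a small illustration, the sketch below aggregates a real-valued neighbourhood with the four moment-based aggregators named in the text (mean, standard deviation, min, max); concatenating them is our simplifying assumption—PNA additionally combines aggregators with degree-based scalers.

```python
import numpy as np

def moment_aggregate(msgs):
    """Concatenate mean, std, min and max over a non-empty (k, d) stack of neighbour messages."""
    return np.concatenate([msgs.mean(axis=0), msgs.std(axis=0),
                           msgs.min(axis=0), msgs.max(axis=0)])

# The two multisets from the discussion above: identical mean, sum and max,
# but distinguishable once the standard deviation is included.
m1 = np.array([[4.0], [2.0], [2.0], [0.0]])
m2 = np.array([[4.0], [4.0], [0.0], [0.0]])
print(moment_aggregate(m1))   # [2.0, 1.414..., 0.0, 4.0]
print(moment_aggregate(m2))   # [2.0, 2.0,      0.0, 4.0]
```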
Concurrently with these developments, Wagstaff et al. (2019) settled the extension to real-valued features for pure unordered sets, extending the results of Zaheer et al. (2017). Remarkably, they find a very compatible result—proving that it is possible to injectively embed sets of real-valued features with a single aggregator, so long as the number of latent features is larger than or equal to the maximal set size in the dataset. The fact that both Corso et al. (2020) and Wagstaff et al. (2019) rely on the maximal set size hints at a deep connection between these results.
PNA offers improved expressivity of GNNs through the use of multiple fixed aggregators. What if, instead, we had a parametric aggregator? Several proposals attempt to introduce scalar parameters in an otherwise rigid aggregator backbone (Pellegrini et al. 2020). It is also possible to replace aggregators with a scaffold relying on fully-generic neural networks (such as MLPs) serving as the ⊕ operator; however, care must be taken that such networks support a commutative monoid structure (Ong and Veličković 2022). As monoids are structures that generalise groups, and hence lie outside the focus of the GDL blueprint, we will defer this particular example to the later Chapters of the book.
5.2.6 What lies beyond message passing?
Throughout the previous five subsections, we made clear inroads towards sur-
passing the 1-WL limitation on expressivity for GNNs as specified by Equation
5.56. This raises clear questions as to whether message passing is sufficient to
describe all possible permutation-equivariant layers we might come across.
For some approaches we discussed, it is clear the equation still applies: feature augmentation does not change the operations of the model; message passing modulation conditions functions like ψ(x_u, x_v) into ψ_{G,(u,v)}(x_u, x_v)—making them depend on the graph structure around nodes u and v; and methods like PNA amount to representing ⊕ using a combination of aggregators.
For other approaches, the situation is less clear, and the jury is still very
much out concerning the universality of message passing; not even the authors
of this book are in agreement. To stimulate a useful discussion, we will present a claim that message passing is universal up to a change of computation graph; the argument was first written up by Veličković (2022) and popularised
by Francesco Di Giovanni in his NeurIPS’22 Workshop talk.
The claim is as follows (Figure 5.9): let GNN_θ(G, X) be any neural network with learnable parameters θ, applied over a graph G with (node/edge/graph...)
Figure 5.9
Schematic of the conjecture formalised by Francesco Di Giovanni at his NeurIPS'22 Graph Learning Frontiers Workshop talk, illustrated on the case of a simplicial complex GNN (Bodnar, Frasca, Wang, et al. 2021). Such models perform message passing over substructures—spanning more than two vertices, and hence not conforming to a binary message function such as ψ(x_u, x_v). However, it is possible to equivalently re-express such computation by first rewiring the graph—in this case, by creating new nodes that will represent each simplicial complex, appropriately wiring them to their constituent nodes—and then performing the usual message passing equation. It is conjectured that every (graph) neural network may be expressed in this manner.
features X. Then, it is conjectured that, for some R and θ′:

GNN_θ(G, X) = MPNN_{θ′}(R(G), R(X)),   (5.66)

where MPNN_{θ′}(G′, X′) is a message passing GNN as specified in Equation 5.56, with parameters specified by θ′, neighbourhoods specified by the graph G′, and input features specified by X′. The rewiring function, R : G × ℝ^{n_in} → G × ℝ^{n_out}, modulates the input graph in an appropriate way. Note that we allow R to act on the node features as well, as e.g. the operation may introduce new nodes into the graph, which then need to be appropriately initialised.
If the conjecture in Equation 5.66 holds, then the message passing primitive would be universal—perhaps for all of deep learning over discrete structures.
For the specific computation graph modulations we presented earlier, we can
indeed propose how to construct appropriate rewiring operations:
• For message passing over substructures (such as k-tuples), we can invent new nodes which represent these substructures, and then connect them to all of their constituent nodes (in a manner that depends on whether permutation invariance holds across the substructure).
• For subgraph GNNs, we can duplicate each node once per subgraph, and connect the various copies of a node together using edges of a separate type.
• For asynchronous GNNs, we can artificially slow down messages by introducing slow nodes in-between specific edges (Azabou et al. 2023).
While these examples offer reasonable motivation, interesting edge cases
with unclear semantics are still ample; for example, message functions may
be computed as a result of optimising an embedding for certain properties
(Bartunov, Fuchs, and Lillicrap 2022), or even the entire output of certain
continuous-layer GNNs may be recovered by analytically solving for some
constraints (Xhonneux, Qu, and Tang 2020). Further, there are many propos-
als for machine learning models over domains exhibiting richer structure than
graphs, such as sheaves (Bodnar et al. 2022), bundles (Bamberger et al. 2024)
and general topological spaces (Papamarkou et al. 2024). Any attempt at for-
mally demonstrating whether message passing can be a universal operator for
deep learning will need to take such works into account, but we believe such
efforts (in either direction) will be very much worth it!
5.3 Case Study: Recommender Systems And Social Networks
The first popularised large-scale industrial applications of graph representation
learning have occurred within social networks, primarily in the context of rec-
ommender systems. Recommenders are tasked with deciding which content to
serve to users, potentially depending on their previous history of interactions
on the service. This is typically realised by first using a GNN to embed the
properties of each item of content (and potentially individual users), u, into a
vector, h_u, leveraging a graph comprising known user-content (e.g. a product
purchase) or content-content (e.g. products co-purchased together) interactions
to perform message passing over.
Note that, in a production environment, such graphs are frequently hetero-
geneous: we might have multiple types of nodes (e.g. user, product. . . ) as well
as multiple types of edges (e.g. purchase, impression, co-purchase. . . )—and
this information should condition the message passing operation. There exist
several ways to do this in GNNs: either simply passing node/edge type as a
categorical feature, or maintaining separate message function parameters for
each edge type, or anything in between (C. Zhang et al. 2019).
Once computed, these embeddings are optimised through a link prediction
objective (Figure 5.10): supervise the embeddings of various nodes (pieces of
content) such that they are kept close together if they are deemed related (e.g.
Figure 5.10
Illustration of a typical recommender system setting with graph neural networks. In such settings, nodes u ∈ V may correspond to users, content or products, and edges (u, v) ∈ E may correspond to purchases, impressions, co-purchases, or similar. A GNN first embeds all nodes into a shared latent space (h_u for node u ∈ V), after which a link prediction objective is optimised; for example, we might optimise the inner product h_a^⊤ h_b to predict whether user a would purchase product b (y_ab), given observed purchase logs. This prediction can then be used to drive recommendations downstream.
commonly viewed together). The proximity of two nodes' embeddings (e.g. their inner product, h_u^⊤ h_v) can then be interpreted as the probability that they are linked by an edge in the content graph. For any content queried by users, one approach could then serve its nearest neighbours in the embedding space.
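A sketch of the inner-product link predictor described above, scored with a logistic link and trained against observed edges and randomly sampled non-edges; the negative-sampling scheme and binary cross-entropy objective are common choices, not the only ones used in production systems.

```python
import numpy as np

def link_logits(H, pairs):
    """Score each candidate edge (u, v) by the inner product h_u^T h_v."""
    return np.array([H[u] @ H[v] for (u, v) in pairs])

def link_prediction_loss(H, pos_edges, neg_edges):
    """Binary cross-entropy: observed edges -> label 1, sampled non-edges -> label 0."""
    logits = np.concatenate([link_logits(H, pos_edges), link_logits(H, neg_edges)])
    labels = np.concatenate([np.ones(len(pos_edges)), np.zeros(len(neg_edges))])
    probs = 1.0 / (1.0 + np.exp(-logits))
    return -np.mean(labels * np.log(probs + 1e-9) + (1 - labels) * np.log(1 - probs + 1e-9))

rng = np.random.default_rng(7)
H = rng.normal(size=(6, 8))          # node embeddings, as produced by an upstream GNN
pos = [(0, 1), (1, 2), (2, 3)]       # observed interactions (e.g. purchases)
neg = [(0, 3), (4, 5)]               # randomly sampled negative pairs
print(link_prediction_loss(H, pos, neg))
```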
Importantly, at the scales faced by many deployed recommender systems,
naïvely deploying a message passing model is frequently prohibitively expen-
sive. The costs may come either from available memory – keeping entire graphs of millions or even billions of entities in memory may be impossible – or from required latency – the model's forward pass may simply be too slow for a seamless user experience. A key early example of a GNN mindful of such issues is GraphSAGE (Hamilton, Ying, and Leskovec 2017), which does not feed the entire graph into the GNN at once—instead, it processes a batch of nodes at once, and dynamically subsamples each of those nodes' neighbourhoods to run the GNN over (Figure 5.11). Beyond the subsampling approach,
two other popular mechanisms for building scalable GNNs include partition-
ing the graph across multiple training steps (Chiang et al. 2019) or simplifying
Figure 5.11
GraphSAGE (Hamilton, Ying, and Leskovec 2017) processes a batch of three nodes, aggregating their K-hop neighbourhoods (here, K = 2). For each of the three nodes, it randomly samples, with replacement, a certain number of k-order neighbours of each node (for 1 ≤ k ≤ K; here, three for k = 1, and two for k = 2), to form the final subsampled graph. This graph is then used to steer the GNN's message passing. Only the messages necessary to compute the central nodes' feature vectors, h_u, are computed, in order to optimise memory usage. Note that, while drawn here as all distinct, the sampled nodes may be duplicated or overlapping, within as well as across the sampled patches.
the message passing equation by leveraging concepts from diffusion (F. Wu
et al. 2019; Rossi et al. 2020) or PageRank (Bojchevski et al. 2020).
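A sketch of the neighbour-subsampling idea behind GraphSAGE: for each seed node in the batch, sample a fixed number of neighbours per hop (with replacement) and keep only the edges needed to compute the seeds' embeddings. The fan-out values and the tiny toy graph are arbitrary illustrative choices.

```python
import random

def sample_neighbourhoods(adj, seeds, fanouts, rng=random):
    """Per-hop edge samples for a batch of seed nodes (GraphSAGE-style subsampling).

    adj: dict node -> list of neighbours; fanouts: e.g. [3, 2] for a 2-hop sample.
    """
    layers, frontier = [], list(seeds)
    for fanout in fanouts:
        sampled_edges, next_frontier = [], []
        for u in frontier:
            if not adj[u]:
                continue
            picks = [rng.choice(adj[u]) for _ in range(fanout)]    # sample with replacement
            sampled_edges.extend((u, v) for v in picks)
            next_frontier.extend(picks)
        layers.append(sampled_edges)
        frontier = next_frontier
    return layers   # message passing is then run only over these sampled edges

adj = {0: [1, 2, 3], 1: [0, 4], 2: [0], 3: [0, 4], 4: [1, 3]}
print(sample_neighbourhoods(adj, seeds=[0], fanouts=[3, 2]))
```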
Among the pioneers of this methodology is the American image sharing and social media company Pinterest: besides presenting one of the first successful deployments of GNNs in production, their method, PinSage^26, successfully made graph representation learning scalable to graphs of millions of nodes and billions of edges (R. Ying et al. 2018). Related applications, particularly in the space of product recommendations, soon followed. Popular GNN-backed recommenders that have been deployed in production include Alibaba's Aligraph (Zhu et al. 2019), Amazon's P-Companion (Hao et al. 2020), Spotify's 2T-HGNN (De Nadai et al. 2024), LinkedIn's LiGNN (Borisyuk et al. 2024) and Snapchat's GiGL (Zhao et al. 2025). GNNs have also been deployed industrially beyond the context of recommenders—for example, they power the expected time of arrival (ETA) predictions within Google Maps (Derrow-Pinion et al. 2021). Taking all of these deployments together^27, it is not far-fetched to say that millions of people come into contact with graph machine learning systems on a daily basis.
Within the context of content analysis on social networks, another noteworthy effort is Fabula AI, which is among the first GNN-based startups to be acquired (in 2019, by Twitter). The startup, founded by one of the authors of the text and his team, developed novel technology for detecting misinformation on social networks (Monti et al. 2019).
Fabula's solution consists of modelling the spread of a particular news item by the network of users who shared it. The users are connected if one of them re-shared the information from the other, but also if they follow each other on the social network. This graph is then fed into a graph neural network, which classifies the entire graph as either 'true' or 'fake' content, with labels based on agreement between fact-checking bodies. Besides demonstrating strong predictive power which stabilises quickly (often within a few hours of the news spreading), analysing the embeddings of individual user nodes revealed clear clustering of users who tend to share incorrect information, exemplifying the well-known 'echo chamber' effect.

Figure 5.12
An illustration of training a large language model (LLM) on the next-token prediction task for the input text "A house of cards". The modern LLM stack interchanges two types of layer: a feedforward network (FFN) applied to each token representation in isolation, and a self-attention layer, allowing tokens to exchange their embeddings using an attentional GNN. Note that it is often necessary to restrict the allowed dataflow so that a token may only interact with the tokens preceding it; this protects against leaking privileged information when all token representations in the text are supervised at once. Accordingly, modern LLMs are graph neural networks operating over graphs of tokens. See also Figure 5.13 for a zoomed-in view of the individual operations.
5.4 Case Study: Large Language Models
At the time of writing this Chapter—and surely for a significant amount of time after—the most prevalent artificial intelligence system is the large language model (LLM; Figure 5.12). These systems have outstanding capabilities in language, demonstrating coherent and useful answers to various prompts of interest to users; this capability is reflected both in their widespread usage (through systems such as Gemini, Claude and ChatGPT) and in the significant scientific interest they attract. Naturally, it would be highly useful to be able to describe and reason about LLMs within the framework proposed in this book.
As it turns out, we already charted all of the necessary components to derive
the geometric deep learning properties of LLMs. In fact, we even hinted at this
in Section 5.2.3.2: attentional GNNs over complete graphs, i.e.

h_u = \phi\left(x_u, \bigoplus_{v \in V} a(x_u, x_v)\, \psi(x_v)\right),    (5.67)
may already be sufficient to describe the Transformer architecture (Vaswani
et al. 2017), which forms the key backbone of contemporary language models.
In this setting, the nodes of the graph correspond to tokens (fundamental units
of text—they may be words but also, e.g., word fragments), and their input
features, x_u, are one-hot indicators specifying which token is at position u. There are two key missing pieces: detailing how dot-product self-attention conforms to this equation, and appropriately taking care of causal masking; we detail both next.
5.4.1 Dot-product self-attention
The dot-product self-attention operator^28, as used by Transformers, is defined as follows (for a general neighbourhood N_u):
z_u = \sum_{v \in N_u} \alpha_{uv}\, v_v    (5.68)

where v_v = V x_v (the value vector of v) and α_{uv} is the softmax-normalised dot-product^29 between q_u = Q x_u and k_v = K x_v:

e_{uv} = q_u^\top k_v, \qquad \alpha_{uv} = \frac{\exp e_{uv}}{\sum_{v' \in N_u} \exp e_{uv'}}    (5.69)
Note that Q, K and V are the only parametric components of this operator. From this analysis, we can align the equation to attentional GNNs as presented in Equation 5.67: set ⊕ = Σ (sum aggregation), let ψ(x_v) = V x_v, and let a(x_u, x_v) = α_{uv}, where the softmax normalisation is implied. Transformers typically interchange self-attention layers with feedforward layers, which are multilayer perceptrons applied, in parallel, to each token's embedding separately—much like in Deep Sets^30. Such a multilayer perceptron takes the role of ϕ, e.g.:
\phi(x_u, z_u) = W_2\, \sigma\left(W_1 (x_u + z_u) + b_1\right) + b_2    (5.70)
where W_·, b_· are the learnable weight and bias parameters of each layer, σ is an activation function, and the input token features are residually added to the attended vector z_u. This MLP might also feature normalisation operations, such as layer normalisation (Ba, Kiros, and Hinton 2016).
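To make this correspondence concrete, the following minimal Python sketch implements Equations 5.68–5.70 for a single attention head, written node by node to emphasise the GNN reading. All shapes, parameter initialisations and names are illustrative rather than taken from any particular system.

import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 8                                    # number of tokens and model width
X = rng.normal(size=(n, d))                    # token features x_u
Q, K, V = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
W1, b1 = rng.normal(size=(d, d)) * 0.1, np.zeros(d)
W2, b2 = rng.normal(size=(d, d)) * 0.1, np.zeros(d)

def attention_layer(X, neighbourhoods):
    """One attention block, computed per node to emphasise the GNN view."""
    H = np.zeros_like(X)
    for u, N_u in enumerate(neighbourhoods):
        q_u = Q @ X[u]
        e = np.array([q_u @ (K @ X[v]) for v in N_u])            # logits, Eq. (5.69)
        alpha = np.exp(e - e.max())
        alpha /= alpha.sum()                                     # softmax normalisation
        z_u = sum(a * (V @ X[v]) for a, v in zip(alpha, N_u))    # aggregation, Eq. (5.68)
        H[u] = W2 @ np.maximum(W1 @ (X[u] + z_u) + b1, 0) + b2   # update phi, Eq. (5.70)
    return H

complete_graph = [list(range(n)) for _ in range(n)]              # N_u = V
print(attention_layer(X, complete_graph).shape)                  # (4, 8)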
While this derivation (Figure 5.13) already aligns Transformers to the attentional GNN flavour, we find it an instructive exercise to demonstrate how they can be expressed as a message passing GNN. This derivation will work for all softmax-normalised attentional GNNs, even if they do not use dot-product attention.

Figure 5.13
Zooming in on the schematic from Figure 5.12, this Figure presents the individual operations that commonly feature in the modern Transformer block. The dataflow also emphasises how such models can be seen as graph neural networks: the adjacency matrix is applied in the form of an attention mask (setting non-edge logits to −∞); the value vectors v_u may be seen as messages, and the attentional coefficients α_{uv} determine the interaction strengths when aggregating these messages. The latter part of the block (normalisation and feedforward network) corresponds to the GNN's update function, ϕ. Note that the figure also depicts the frequently used rotary position embedding (Su et al. 2024) as a rotation matrix R(u, v) applied within the dot product: e_{uv} = q_u^⊤ R(u, v) k_v.
To help us in this endeavour, let us write Equation 5.68 as follows:

z_u = \frac{\sum_{v \in N_u} (\exp e_{uv})\, v_v}{\sum_{v \in N_u} \exp e_{uv}}    (5.71)
The key insight is that both the numerator and the denominator in this fraction can be computed with simple sum-aggregated local messages. This allows us to express our message as a combination of two messages, (n, d), where n ∈ R^k is the numerator and d ∈ R is the denominator. Both n and d can be computed as functions of x_u and x_v only, and therefore conform to the required form.

Now it only remains to define our aggregation function, ⊕. Note that, in this case, we need to define how our combined messages are aggregated^31; i.e.,

(n, d) \oplus (n', d') = (n + n', d + d')    (5.72)
Once aggregated, it remains to divide the numerator by the denominator—this can be performed within the definition of ϕ. Hence, Transformers are indeed (attentional and message-passing) graph neural networks^32 (Joshi 2020).
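The following sketch (again with illustrative names and toy shapes) carries out this message-passing view: every edge contributes the pair (exp(e_uv)·v_v, exp(e_uv)), the pairs are sum-aggregated as in Equation 5.72, and the final division is deferred to the update function. A sanity check confirms that the result matches the standard softmax form of Equation 5.68.

import numpy as np

rng = np.random.default_rng(1)
n, d = 4, 8
X = rng.normal(size=(n, d))
Q, K, V = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))

def attention_by_message_passing(X, neighbourhoods):
    """Softmax attention via sum-aggregated (numerator, denominator) messages."""
    Z = np.zeros_like(X)
    for u, N_u in enumerate(neighbourhoods):
        num = np.zeros(d)                       # aggregated numerator, n
        den = 0.0                               # aggregated denominator, d
        for v in N_u:
            e_uv = (Q @ X[u]) @ (K @ X[v])      # scalar logit, a function of x_u, x_v only
            num += np.exp(e_uv) * (V @ X[v])    # message component n
            den += np.exp(e_uv)                 # message component d
        Z[u] = num / den                        # the division happens inside phi
    return Z

neigh = [list(range(n)) for _ in range(n)]
Z = attention_by_message_passing(X, neigh)

# Sanity check against the standard softmax formulation of Equation 5.68:
logits = (X @ Q.T) @ (X @ K.T).T
alpha = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
assert np.allclose(Z, alpha @ (X @ V.T))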
Why attentional? It is useful to pause for a moment and ponder why the attentional GNN flavour was so impactful in such a rich natural language domain, as opposed to more generic message passing architectures. While we cannot definitively settle this question here, we do offer a few conversation-starters.
On one hand, words are meaningless without context: as our input node features are simply one-hot indices into a token vocabulary, it is unlikely that complex messages are necessary^33 to make sense of them. The success of attention in this space reinforces the idea that one can easily reason about word representations by leveraging weighted sums of other words' representations.
On another hand, we must acknowledge the massive scale at which Transformers are deployed today, with many large language models sporting hundreds of billions of parameters. This is another setting that presents an advantage for attentional GNNs: especially when using linear or dot-product attention without given edge features, no vectors (only scalars) need to materialise on the edges of the graph, saving significant memory^34. In contrast, the most general message passing GNNs require explicitly computing message vectors and storing them on each edge.
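As a rough, purely illustrative back-of-the-envelope comparison (toy numbers, not measurements of any deployed system), consider the per-layer edge storage over a complete graph of tokens:

def edge_memory_gib(num_tokens, width, bytes_per_scalar=2):
    """Compare per-layer edge storage on a complete graph of tokens.

    Dot-product attention materialises one scalar logit per edge, whereas
    generic message passing materialises a 'width'-dimensional vector per edge.
    """
    num_edges = num_tokens * num_tokens
    attn = num_edges * bytes_per_scalar
    mpnn = num_edges * width * bytes_per_scalar
    gib = 1024 ** 3
    return attn / gib, mpnn / gib

print(edge_memory_gib(num_tokens=8192, width=4096))
# roughly (0.125, 512.0): scalar logits fit easily; per-edge vectors do not.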
Finally, the specific computations of Transformers—which consist of, e.g.,
many carefully placed full matrix products—align very well with the modern
AI hardware stack (e.g. GPUs and TPUs) and hence, Transformers are GNNs
that are currently winning the hardware lottery (Hooker 2021)!
5.4.2 Causal masking
It is precisely the aforementioned scalability of language model training which
yields the second key constraint to consider. The parameters of an LLM are
typically pre-trained using a next-token prediction objective: given a specific
sequence of tokens, predict the next one in the sequence. This objective is then
deployed over tokenised texts scraped across the entire Internet.
Given the incredible scales involved, it is critical that this training proceeds
in a manner which is as parallelisable as possible. Because of this, the model is
typically trained over entire chunks of contiguous text at once, and supervision
is applied on all of the tokens simultaneously.
This decision enables further scalability but limits the allowed dataflow
between tokens; namely, it necessitates that information flows in a feedfor-
ward manner throughout the model. Should any information flow from future
tokens towards past ones while also supervising on all tokens at once, it would
allow the model to "cheat" on the next-token prediction objective (information about the next token would leak into the token tasked with predicting it).

Figure 5.14
Two typical harmful effects on information propagation in decoder-only Transformers: overmixing (left) and oversquashing (right). Overmixing leads to important information (depicted in red) being diffused too quickly across the token embeddings. Oversquashing leads to overly relying on earlier tokens, as they have substantially more paths to the final token embedding (depicted in blue for an early token vs. red for a late token). Both issues occur when the attentional coefficients, α_{uv}, are insufficiently sharp.
Enforcing this constraint in an LLM amounts to masking the allowed token adjacency to always be a lower-triangular matrix—often also referred to as a causal mask. Note that this is a form of nonparametric graph rewiring: it sets the neighbourhoods N_u = {v | v ≤ u}. In Figure 5.13, this is exemplified by the relevant rewired adjacency matrix being passed as the attention mask (setting logits in the u-th row that are not in N_u to −∞).
Taking all of these constraints together, we have now fully specified the so-called decoder-only Transformer^35, which is the essence of the generative pre-trained Transformer (GPT) framework (Radford et al. 2018).
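As a minimal sketch of this rewiring (with toy values and no claim to match any production implementation), the causal mask can be realised by setting the logits of non-edges to −∞ before the softmax:

import numpy as np

def causal_attention_weights(logits):
    """Apply a causal mask to an (n, n) matrix of attention logits.

    Row u may only attend to columns v <= u, i.e. N_u = {v | v <= u}.
    """
    n = logits.shape[0]
    mask = np.tril(np.ones((n, n), dtype=bool))       # lower-triangular adjacency
    masked = np.where(mask, logits, -np.inf)          # non-edges get -inf logits
    masked -= masked.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(masked)
    return weights / weights.sum(axis=-1, keepdims=True)

logits = np.arange(16, dtype=float).reshape(4, 4)
print(np.round(causal_attention_weights(logits), 3))
# Row 0 attends only to token 0; row 3 attends to all four tokens.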
5.4.3 Analysing LLM graph dataflow
Now we know that decoder-only Transformers (the prevalent LLM architecture
at the time of writing) are actually secretly (?) graph machine learning models
under the hood. A natural question arises: can we leverage this knowledge to
make sense of how data propagates within LLMs – and potentially even make
meaningful improvements by tinkering with their graph topology?
We can answer this question in the affirmative—in fact, the specific choice
of graph structure implied by a causal mask turns out to be harmful for
dataflow, from the perspective of (m)any metrics of graph “health” known to
graph theory. We will provide two examples of these issues for illustrative pur-
poses (see also Figure 5.14); while we remain cognisant of the rapid pace of
research in this space, we hope they serve as useful inspiration.
Firstly, because of the fully-connected structure of this causal graph, mes-
sages are always passed fully across all allowed pairs of tokens. This can lead
to a challenging phenomenon known as overmixing (Barbero et al. 2025),
wherein the model loses the fidelity of information about individual tokens,
as they rapidly mix with other tokens’ embeddings across different layers.
A complementary issue arises due to the causality of the computational graph, coupled with the fact that only the last token's embedding is used to predict the next one. The causal graph necessarily induces a form of token asymmetry, with earlier tokens having significantly more paths to the last token's embedding than later ones. This leads to the oversquashing problem (Barbero et al. 2024): as information is asymmetrically compressed into the final token's embedding (after L layers), h^{(L)}_n, early tokens have a significantly higher influence over it (in terms of the Jacobian, ∂h^{(L)}_n / ∂x_v, for the input token at position v). This asymmetry is so pronounced that it is possible to show that, in the limit where the number of decoder-only Transformer layers goes to infinity (L → ∞), only the first token influences the final predictions.
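The path-counting asymmetry behind oversquashing is easy to see numerically. The following toy sketch counts, for each input token, the number of dataflow paths through an L-layer causal graph that terminate at the final token's embedding; it illustrates the counting argument only, and is not a bound from the cited works.

import numpy as np

def causal_path_counts(num_tokens, num_layers):
    """Count dataflow paths from each input token to the final output token.

    A[v, u] = 1 iff token u may attend to token v (v <= u), so the entry
    (A^L)[v, n-1] counts paths from input v to the last token after L layers.
    """
    A = np.triu(np.ones((num_tokens, num_tokens), dtype=np.int64))
    paths = np.linalg.matrix_power(A, num_layers)
    return paths[:, -1]

print(causal_path_counts(num_tokens=6, num_layers=3))
# [21 15 10  6  3  1]: the first token has far more paths to the final
# embedding than the last one, biasing deep models towards early tokens.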
While both of the above issues present important impediments to the model's dataflow, they may in principle be fixed by appropriately chosen attentional coefficients, α_{uv}. For example, the coefficients may be selected such that they preserve the most vulnerable paths for longer, or delay mixing by concentrating attention in specific places. In both of these cases, a desirable property is sharpness—being able to choose disproportionately higher values of α_{uv} for a smaller subset of tokens. While such sharp circuits may indeed be discovered within trained LLMs (Elhage et al. 2021), they only tend to appear in data distributions "familiar" to the model after training. It can, in fact, be proved using standard results in real analysis (Veličković et al. 2025) that the coefficients α_{uv} must all disperse below any chosen positive ϵ for sufficiently large input position u at inference time (when going beyond the maximal input size encountered during training).
Many of these issues are well known to researchers in graph representation learning; overmixing is frequently referred to as oversmoothing^36 (Rusch, Bronstein, and Mishra 2023), whereas oversquashing is known by the same name (Alon and Yahav 2020). While this certainly invites a plethora of interesting methods from the field, it is also important to acknowledge the unique challenges that come from the feedforward structure of this particular graph, which are seldom studied. Initial steps towards bridging this gap have recently been made (Vitvitskyi et al. 2025), but a lot more work is needed.
Suggested Further Reading
Graph neural networks, as well as graph theory more generally, are very excit-
ing and vibrant areas of mathematics and computer science that are well worth
studying further than what our book has space for.
For a graceful introduction to both, we warmly recommend the book from
Hamilton (2020). While it is somewhat dated in the specific GNN models it
studies, it provides a gentle and intuitive overview of computing important
metrics on a graph, as well as many relevant models that preceded graph neural
networks (such as node embedding techniques).
Then, should you wish to dive deeper into the many exciting facets of
graph machine learning, the book of L. Wu et al. (2022) is an excellent and
comprehensive reference which still holds up very well today.
In parallel with diving deeper into GNNs, or even before doing so, we
believe it is worthwhile to gain a strong foundation in graph theory. This will
be of significant value: both in gaining important intuitions for understand-
ing GNNs, and for inspiration on ways to further improve them. For this, we
recommend the foundational text from Bollobás (1998).
Exercises
1. A linear permutation-equivariant function on a set is a function of the form F(X) = FX satisfying FPX = PFX for every permutation matrix P. Show that any linear permutation-equivariant function can be written as a linear combination of the identity and average functions, i.e. FX = (αI + β11^⊤)X for scalars α and β.
2. Let X be a feature space containing a countable set of possible values. This means that every possible set of values coming from X can be represented by a bitmask in B^X (where B = {0, 1} is the set of bits), and any permutation-invariant real function over such a set must be representable as f : B^X → R. Show that such a function can always be expressed as a Deep Sets model, i.e., f(X) = ϕ(Σ_{i∈V} ψ(x_i)), for some choice of ψ : X → R and ϕ : R → R, for every set V ∈ B^X.
3. Consider the recipe for building permutation-equivariant graph neural networks F(X, A) from Equation 5.53. Recall how it leverages a local function ϕ(x_u, X_{N_u}), applied to the neighbourhood of each node u ∈ V in parallel. Prove that, if ϕ is permutation invariant, that is, ϕ(x_u, PX_{N_u}) = ϕ(x_u, X_{N_u}), then F will be permutation equivariant, that is, F(PX, PAP^⊤) = PF(X, A), for every permutation matrix P. (Note that, when N_u ⊂ V, the space of P differs across the two equations.)
4. A popular way to implement attentional GNNs computes attentional logits as a(x_u, x_v) = σ(a^⊤ W[x_u ∥ x_v]), where a : R^l and W : R^{l×2k} are the parameters, ∥ : R^n × R^m → R^{n+m} is vector concatenation, and σ : R → R is a monotonic activation function. Another popular variant applies these two operations in reverse order: a(x_u, x_v) = a^⊤ σ(W[x_u ∥ x_v]). Explain, using a proof, which of the two variants is more expressive (Hint: consider the ranking between attentional logits across all the nodes in the neighbourhood). Which of the two forms is more scalable, and why? (Note that, as the softmax function is monotonic, you may ignore it for this question.)
5. Recall that subgraph GNNs should be equivariant to G = S_n × S_m, where S_n is the n-element permutation group, n is the number of nodes in the graph and m is the number of its subgraphs we have sampled. Assuming that the input to a subgraph GNN is a node feature matrix X ∈ R^{n×k} and an adjacency tensor A ∈ R^{n×n×m}, write out the equation such a G-equivariant GNN should satisfy, and a possible implementation of such a GNN that allows subgraph representations to meaningfully interact. Prove that, for a specific choice of m subgraphs, your model can accurately decide isomorphism for a particular pair of graphs where the 1-WL test would fail.
6. Express the equations of dot-product self-attention, as leveraged within a decoder-only Transformer, in the message passing framework (i.e., by specifying the functions ψ, ϕ and ⊕ in Equation 5.56).
7. Consider a single-layer decoder-only Transformer, with model equations as described in Section 5.4.1. Further, assume that its feedforward network, ϕ, is κ_1-Lipschitz; that is, ‖∂ϕ(x_u, z_u)/∂x_u‖ ≤ κ_1 and ‖∂ϕ(x_u, z_u)/∂z_u‖ ≤ κ_1, and further assume that ‖V‖_2 ≤ κ_2, where V is the matrix transforming token features into values (i.e., v_u = V x_u). Prove that, if v ≤ u:

\left\| \frac{\partial h_u}{\partial x_v} \right\| \le \kappa_1 \left( \delta_{uv} + \kappa_2\, \alpha_{uv} \right)    (5.73)

where α_{uv} is the attentional coefficient between v and u, and δ_{uv} is 1 if u = v and 0 otherwise. Then, using this result, show that, in an L-layer decoder-only Transformer:

\left\| \frac{\partial h^{(L)}_n}{\partial x_v} \right\| \le C \sum_{k_1 \ge v} \sum_{k_2 \ge k_1} \cdots \sum_{k_{L-1} \ge k_{L-2}} \bar{\alpha}^{(L)}_{n, k_{L-1}} \left( \prod_{\ell=2}^{L-1} \bar{\alpha}^{(\ell)}_{k_\ell, k_{\ell-1}} \right) \bar{\alpha}^{(1)}_{k_1, v}    (5.74)

where C > 0 is a constant, k_1, . . . , k_{L-1} are intermediate nodes on a particular path of length L + 1 from v to n, and \bar{α}^{(l)}_{uv} are simple functions of α^{(l)}_{uv}—the attentional coefficients between v and u in the l-th layer. This result implies that a deep decoder-only Transformer is topologically biased to pay more attention towards early tokens, as they will have substantially more paths to the final token (n).
Notes
1
For example, photos taken in the real world are two-dimensional projec-
tions of (at least!) three-dimensional scenes, often featuring objects that are
inherently not grid-structured themselves.
2
Depending on the application field, nodes may also be called vertices,
and edges are often referred to as links or relations. We will use these terms
interchangeably.
3
There are exactly n! such permutations, so S_n is, even for modest n, a very large group.
4
We use the bold notation for our function F(X) to emphasise it outputs
node-wise vector features and is hence a matrix-valued function.
5
When the graph is undirected, i.e. (u, v) ∈ E iff (v, u) ∈ E, the adjacency matrix is symmetric, A = A^⊤.
6
PAP^⊤ is the representation of S_n acting on matrices.
7
As a way to emphasise the fact that our functions operating over graphs
now need to take into account the adjacency information, we use the notation
f (X, A).
8
This corresponds to the Bell number B_4, which counts the number of ways to partition a set of 4 elements, in this case given by the 4-indices (u, v), (u′, v′) indexing a linear map acting on the adjacency matrix.
9
Often, the node u itself is included in its own neighbourhood.
10
A multiset, denoted {{ . . . }}, is a set where the same element can appear
more than once. This is the case here because the features of different nodes
can be equal.
11
More precisely, we cannot define a non-trivial coarsening assuming set
structure alone. There exist established approaches that infer topological
structure from unordered sets, and those can admit non-trivial coarsening.
12
Though many works make a distinction between “spatial” and “spectral”
GNNs, we find that this distinction is rather artificial and must be taken with a
grain of salt. Many proposed spectral filters are expressible in terms of simple
matrix operations (e.g. polynomials of the graph Laplacian) and hence boil
down to simple spatial operations that fall under the “convolutional” flavour
presented here.
13
Most commonly, ψ and ϕ are learnable affine transformations with activation functions; e.g. ψ(x) = Wx + b; ϕ(x, z) = σ(Wx + Uz + b), where W, U, b are learnable parameters and σ is an activation function such as the rectified linear unit. The additional input of x_u to ϕ represents an optional skip-connection, which is often very useful.
14
It is worthy to note that this flavour does not express every GNN layer that is
convolutional (in the sense of commuting with the graph structure), but covers
most such approaches proposed in practice.
15
This is a direct consequence of the permutation invariance of ⊕.
16
It is also appropriate to apply the message-passing flavour. While popu-
lar for physics simulations and relational reasoning (e.g. Battaglia et al. 2016;
Santoro et al. 2017), they have not been as widely used as Transformers. This
is likely due to the memory issues associated with computing vector mes-
sages over a complete graph, or the fact that vector-based messages are less
interpretable than the “soft adjacency” provided by self-attention.
17
Note that, due to the assumption of GNNs' permutation invariance in h_G, it is certain that the embeddings of isomorphic graphs will be equal.
18
László Babai (2016) credits the invention of the k-WL tests to himself with Rudolf Mathon and, independently, Neil Immerman and Eric Lander (the latter was trained as a mathematician but is more broadly known for his works in genomics and biology, and also as the Science Advisor in President Biden's administration).
19
The k-GNN architecture introduced by Morris et al. (2019) incurs O(n^k) memory complexity. In a follow-up work, Morris, Rattan, and Mutzel (2020) also devised a local version of k-GNNs based on aggregation in local neighbourhoods, taking the sparsity of the underlying graph into account, and showed that this local version has at least the same power as the ordinary k-WL. Invariant Graph Networks (IGN), introduced by Maron et al. (2018), are based on k-order tensors and are k-WL-equivalent. IGNs are derived from a different variant of k-WL and are more advantageous in terms of their complexity compared to k-GNNs. In particular, the 3-WL-equivalent IGN has "only" quadratic complexity.
20
ESAN itself is an instance of the Deep Sets of Symmetric Objects (DSS)
architecture proposed by Maron et al. (2020).
21
Corso et al. (2020) prove that no single aggregator can injectively embed
real-valued neighbourhoods by leveraging the Borsuk-Ulam theorem (Borsuk
1933). It can be used to conclude, for example, that there must always exist
two points on opposite sides of the Earth with an identical continuous property
(such as temperature, atmospheric pressure, etc.).
22
This connection was briefly explored and highlighted by two of the authors of these papers (Fuchs and Veličković 2023)—we recommend it as a good, lightweight introduction to characterising expressivity of functions on sets and graphs.
23
Specifically, Petar Veličković is a firm believer that the message passing primitive can be used to express everything of interest in deep learning—even if sometimes it is not the most natural perspective for looking at the relevant structures. He often calls upon the idea that even fundamental algorithms such as matrix multiplication can be observed as a form of message passing, where data along one dimension of the first matrix are exchanging messages with data along a corresponding dimension of the second—followed by an appropriate sum-aggregation operation. Michael Bronstein has often responded to this by saying that, when drilling down to the scale of matrix multiplies, such an argument is hardly practically useful.
24
This implicitly assumes that any discrete structure can be reasoned about
using some kind of graph.
25
Such an approach may have unexpected benefits to the model’s expressive
power—as we discussed within the context of subgraph GNNs.
26
It is worth noting that GraphSAGE (Hamilton, Ying, and Leskovec 2017) was explicitly developed in order to prototype a GNN system capable of running in production at Pinterest—which later became PinSage. Pinterest also presented follow-up work, PinnerSage (Pal et al. 2020), which effectively integrates user-specific contextual information into the recommender.
27
Note that the list above only shows published academic papers detailing
how various companies have deployed GNNs in production. It excludes less-
detailed blog posts which describe GNN deployment within products like Uber
Eats and AirBnb, as well as GNN platforms like TF-GNN (Ferludin et al. 2022)
that directly state “Many production models at Google use TF-GNN” without
providing specifics.
28
The discussion within this section focusses squarely on analysing a single-
head dot product attention operator; however, it is simple to adapt it for
multi-head attention, a variant more commonly used in modern Transform-
ers (where the attention operator is replicated across k independent heads).
To do so, simply factorise all the relevant functions and parameters to operate
across multiple heads, similar to how we’ve analysed principal neighbourhood
aggregation.
29
The Transformer paper refers to this as "scaled dot-product attention", as the attentional logits are additionally scaled by a factor of 1/√d_k; that is, e_{uv} = (1/√d_k) q_u^⊤ k_v, where d_k is the dimensionality of the key vectors k_v. This is introduced to allow richer gradients at larger scales of dimensionality.
30
Interestingly, both Deep Sets (Zaheer et al. 2017) and Transformers
(Vaswani et al. 2017) were presented at the same conference: N(eur)IPS’17
in Long Beach, California.
31
It is also possible to define this operation in a way that does not delay the renormalisation for later; for example, by maintaining an online average: (n, d) ⊕ (n′, d′) = ((dn + d′n′)/(d + d′), d + d′).
32
Interestingly, beyond the permutation equivariance which is implied by this derivation, dot-product attention is also rotation invariant: if we rotate all key and query vectors by the same rotation matrix, we will obtain exactly the same attention logits (since rotations preserve inner products). This observation will be leveraged in future Chapters.
33
This may no longer be the case when nodes become entire concepts.
34
A very good example of this principle is the GATv2 architecture (Brody,
Alon, and Yahav 2022). GATv2s are attentional GNNs that leverage deep mul-
tilayer perceptrons as their attention function, a, allowing them to provably
express certain patterns of attention that are out of reach of dot product self-
attention. However, the computation of the intermediate layers of deep MLPs
requires storing latent embeddings in the graph’s edges, making the model less
practical for use over large, fully-connected graphs of language tokens.
35
The name "decoder-only" refers to the roots of the Transformer architecture (Vaswani et al. 2017). It was initially used for machine translation—a supervised learning task—rather than next-token prediction—a self-supervised learning task. When used for machine translation, it was not necessary for the entire architecture to have a causal mask: there could be a dedicated component (the encoder), which sets N_u = V, along with a causally masked component (the decoder), setting N_u = {v | v ≤ u}, generating the translated sentence word-for-word. When only the decoder part is used, we recover the GPT setting. When only the encoder part is used, we recover the BERT setting (Devlin et al. 2019).
36
In fact, the finding that decoder-only Transformers overmix information—
in spite of their attentional coefficients—is similar to the discovery that graph
attention networks oversmooth (X. Wu et al. 2023) in spite of the same
mechanism.