4
Foundations of Geometric Deep Learning
Symmetries are defined over signal spaces, which are in turn defined over a physical
domain.
In order to beat the curse of dimensionality, we extend the invariance
prior with deformation stability and scale separation.
These priors can be used constructively in a generic neural architecture:
the GDL blueprint.
In the previous chapter, we have introduced the formalism of group theory
to account for the symmetries of a generic learning task. While this formalism
provides the basic backbone to think about geometric deep learning, we need
to complement it with additional structural assumptions in order to effectively
beat the curse of dimensionality.
First, the input space X of many learning problems is high-dimensional, and its
symmetry structure is challenging to determine in full generality. Our first task
will be to describe input spaces instead as signal spaces, where each $x \in X$ is
itself a mapping $x : \Omega \to C$ from a physical domain $\Omega$ to an output channel
space C. In some applications, it may be convenient to consider instead the
domain $\Omega$ as the input itself, and thus as varying. Once these objects are defined,
we will see how the notions of invariance and equivariance apply to this setting,
thus defining the familiar geometric neural network layers.
The physical domain $\Omega$ is equipped with a metric structure, which will enable
us to go substantially further than the algebraic notions of invariance and equivariance.
Indeed, this added metric structure enables the notion of scale, i.e. the
fact that signal measurements can be arranged into neighborhoods of varying
size. As it turns out, this metric structure provides two key geometric priors:
stability to deformations and scale separation. Finally, combining both
algebraic and geometric priors will naturally lead to a blueprint for designing
neural networks: the GDL blueprint.
4.1 Geometric Domains
We concluded in chapter 2 that learning in high dimensions is intractable with-
out inductive biases, and while knowledge about symmetries could provide
a strong inductive bias, we saw in the previous section that it is unlikely we
will have knowledge of the full symmetry group of the label function. For-
tunately nature provides us with a solution: in a great many problems, the
data we observe are signals on a low-dimensional domain with structure and
symmetries that can be leveraged for more statistically efficient learning.
4.1.1 Physical Domains
The physical domain $\Omega$ is where signal measurements take place, and depending
on the application it may be defined with an increasing amount of structure.
The following chapters will delve into the key representative applications, but
here we give a first glimpse of the diversity of physical domains that arise.
In computer vision applications, images x can be viewed as signals over a
two-dimensional grid $\Omega_N = \left\{\left(\tfrac{i}{N}, \tfrac{j}{N}\right)\right\}_{1 \le i,j \le N}$, where each pixel u encodes
color intensities $x(u) \in \{1, \dots, 255\}^3 := C$. It will be convenient to view these
domains as discretizations of an underlying continuous object as $N \to \infty$, given
by a rectangular domain $\bar{\Omega} = [0, 1]^2 \subset \mathbb{R}^2$, or in some settings its unbounded
extension $\bar{\Omega} = \mathbb{R}^2$, as well as $\bar{C} = \mathbb{R}^k$.
Another representative setting is given by $\Omega = \{1, \dots, N\}$, a set of N elements.
This represents the crudest form of structure, absent any notion of topology
or metric structure, and applies to settings such as point clouds in computer
graphics. The next fundamental domain is given by graphs $\Omega = G = (V, E)$,
where elements can now be related to each other, thus defining a topology
and ultimately a metric structure.
Finally, we also encounter situations where the input to the learning system
is the domain itself, such as 3D meshes in computer graphics, or graphs in
some algorithmic tasks. The problems we consider thus come in two broad
categories: problems where each data point $x \in X$ is a signal $x : \Omega \to C$ on the
same domain $\Omega$ (e.g. every image can be viewed as a function on the plane),
and problems where each data point corresponds to a different domain (e.g.
each data point is a different graph or mesh).
The key takeaway is that the domain $\Omega$ is generally a much simpler object
than the input space X, enabling a richer description of the relevant symmetries.
Before executing this plan, we first need to understand how to relate X
to $\Omega$ and, importantly, how symmetries defined over $\Omega$ carry over to X.
4.1.2 Spaces of Signals on a Domain
Now that we have introduced the signal domain $\Omega$, we can revisit the input
space of the ML model as a vector space of functions over $\Omega$:
The space of C-valued signals on $\Omega$¹ (for $\Omega$ a set, possibly with additional
structure, and C a vector space, whose dimensions are called channels)
$$X(\Omega, C) = \{x : \Omega \to C\} \qquad (4.36)$$
is a function space that has a vector space structure. Addition and scalar
multiplication of signals are defined as
$$(\alpha x + \beta y)(u) = \alpha x(u) + \beta y(u) \quad \text{for all } u \in \Omega,$$
with real scalars α, β. Given an inner product $\langle v, w \rangle_C$ on C and a measure²
µ on $\Omega$ (with respect to which we can define an integral), we can define an
inner product on $X(\Omega, C)$ as
$$\langle x, y \rangle = \int_\Omega \langle x(u), y(u) \rangle_C \, d\mu(u). \qquad (4.37)$$
We observe two simple, yet important properties of this space: while in many
cases linear combinations of points on $\Omega$ are not well-defined³, we can linearly
combine signals on it, i.e., the space of signals forms a vector space. Moreover,
since we can define an inner product between signals, this space is a Hilbert
space.
How Signals Transform: Regular and Induced Representations
Given a symmetry group G acting on the signal domain $\Omega$, this action can be naturally
upgraded to X through the notion of induced representation. Let us now
describe it in simple terms.
To fix ideas, suppose first that $C = \mathbb{R}$. Denoting the action of G on $\Omega$ by
$$G \times \Omega \to \Omega \qquad (4.38)$$
$$(g, u) \mapsto g \cdot u, \qquad (4.39)$$
we now consider the linear representation
$$G \times X \to X \qquad (4.40)$$
$$(g, x) \mapsto \rho(g)x, \qquad (4.41)$$
with $(\rho(g)x)(u) := x(g^{-1} \cdot u)$. We verify that ρ is indeed a group representation,
which simply converts transformations of the domain $\Omega$ into transformations of
signals, acting by composition. Figure 4.1 illustrates a simple example when G
is the translation group in $\Omega$ (assuming for simplicity no boundary conditions).
Figure 4.1
Example of the translation group defining a transformation over X.
Figure 4.2
Example of a color transformation by reflection g.y = ±y.
This formalism not only enables us to express domain transformations, but also
more general situations where groups act not only on the domain $\Omega$
but also on the channel (or fiber) space C. As a motivating example, suppose now
that $C = \mathbb{R}^3$ is the RGB colorspace (assumed continuous), and that, on top of
translations/rotations over $\Omega$, we can also perform color transformations in C;
see Figure 4.2. Then, if $\tilde{G}$ is a group acting on C, we can define a joint
representation for $G \times \tilde{G}$ as follows:
$$(\rho(g, \tilde{g})x)(u) := \tilde{g}^{-1} \cdot (x(g^{-1} \cdot u)). \qquad (4.42)$$
Finally, another relevant scenario is when the same group G defines a joint
group action on both the domain and the channel space, in which case we write
$(\rho(g)x)(u) = g^{-1} \cdot (x(g^{-1} \cdot u))$. Figure 4.3 illustrates this scenario over 2D vector
fields and the rotation group.
In summary, we have now seen how an initial low-dimensional description
of the symmetries, expressed over physical domains, can be naturally
extended to signals on that domain.
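To make the induced representation concrete, here is a minimal numerical sketch (not from the text; the grid size, the signal, and the use of NumPy are arbitrary choices) for the cyclic translation group acting on a one-dimensional periodic grid Ω = {0, …, N−1}: the domain action u ↦ g·u is lifted to signals via (ρ(g)x)(u) = x(g⁻¹·u), and the representation property ρ(gh) = ρ(g)ρ(h) is checked directly.

```python
import numpy as np

N = 8
x = np.random.randn(N)            # a signal x : Omega -> R on the periodic grid

def rho(g, x):
    # (rho(g) x)(u) = x(g^{-1}.u) = x((u - g) mod N); np.roll shifts the signal by g
    return np.roll(x, g)

g, h = 3, 5
# Representation property: rho(g + h) = rho(g) rho(h) (the group here is Z_N)
assert np.allclose(rho((g + h) % N, x), rho(g, rho(h, x)))

# Evaluating the transformed signal at u reads off the original signal at u - g
u = 2
assert np.isclose(rho(g, x)[u], x[(u - g) % N])
```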
Figure 4.3
The rotation group acting both on signal domain and output channel space, here illus-
trated over a vector field.
4.1.3 Fixed domain problems & domain symmetries
When the domain is fixed and known in advance, we can determine its sym-
metry group and use this knowledge for more efficient learning. This works
when the symmetries of the domain are also symmetries of the label function,
which is a common occurrence. For example, our data may consist of signals
on the plane (e.g. camera images) or on the sphere (e.g. global weather signals).
Shifts (translations) are symmetries of the plane, and 3D rotations are sym-
metries of the sphere (since these maps are invertible and structure-preserving,
mapping the domain onto itself while preserving the relevant structure such as
distances).
As we have just seen, these symmetries can be applied not just to points
in the domain itself, but also to signals on the domain (see Section 4.1.2), and will
generally not change the semantics of the signal. Hence, our neural network
should process inputs related by symmetries in the same way.
Unlike the full group of symmetries of the label function, we know the symmetries
of the domain a priori, so we can use this knowledge to our advantage to
constrain the function class F. How this is done exactly via invariance and
equivariance will be discussed in Section 4.5.
Finally, an important word on the definition of symmetry: since we defined
symmetries as structure-preserving invertible maps from the domain to itself,
what counts as a symmetry depends on what structure one wishes to preserve.
For instance, in $\Omega = \mathbb{R}^2$, rigid transformations of the form $\phi(x) = Ax + b$, where
$A^\top A = I_2$, are often symmetries of image classification problems, since they do
not change the semantics. However, in some cases this is not true: the ‘classic’
example is digit classification between the digits ‘6’ and ‘9’. Unlike generic
object classification, this task is not rotation-invariant. We will thus be careful
to remember that symmetries are inherently a user-specified notion.
4.1.4 Variable domain problems & domain isomorphisms
When the domain is variable, so is the symmetry group. There may be ways to
deal with the symmetries of each domain in the category of domains under con-
sideration, but more importantly we need to make sure we deal correctly with
domains that look different but are actually equivalent (isomorphic) in some
sense. In general, the appropriate notion of equivalence depends on the con-
text, but what we can say in general is that two domains are equivalent if there
is an invertible and structure-preserving map (isomorphism) between them. We
will see various examples of structured domains and structure-preserving maps
throughout the chapters of part II, but for now we will sketch two important
examples: graphs and meshes/manifolds.
One way to represent a graph is as a set of node labels (e.g. arbitrary but
unique integers) together with a set of edges (pairs of node labels). Then simply
changing the node labels gives us a new graph that is nevertheless equivalent,
and the mapping that sends each label of the original graph to the new label is
an isomorphism.
If we have two objects that are not identical but still equivalent (under some
notion of equivalence that depends on the context), we need to ensure that the
functions we learn produce an identical or at least equivalent output for the two
objects. For graph classification (chapter 5), this means that two graphs related
by a graph isomorphism (See fig. 4.4, left) should be assigned the same label.
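To illustrate this requirement on a toy example (an ad hoc sketch, not the text's construction): relabelling the nodes of a graph acts on its adjacency matrix as A ↦ PAPᵀ for a permutation matrix P, and any function we learn for graph classification should return the same value on both matrices. Here a sorted degree sequence stands in for such a permutation-invariant function.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
A = rng.integers(0, 2, size=(n, n))
A = np.triu(A, 1)
A = (A + A.T).astype(float)              # a random undirected graph on n nodes

perm = rng.permutation(n)
P = np.eye(n)[perm]                      # permutation matrix encoding the relabelling
A_relabelled = P @ A @ P.T               # the same graph with new node labels

def f(adj):
    # a toy permutation-invariant descriptor: the sorted degree sequence
    return np.sort(adj.sum(axis=1))

assert np.array_equal(f(A), f(A_relabelled))
```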
Another example is meshes or manifolds (chapter 9), which may be con-
sidered equivalent if there exists a distance-preserving map (i.e. we consider
isometries to be isomorphisms); see Figure 4.5. Again, here we want two
meshes related by an isometry to be treated the same way; we want the method
to be intrinsic. We shall see how to build invariant representations for these
type of surfaces in Chapter 9.
4.2 Invariant Function Classes
At this point, it is useful to recall our learning setup from Chapter 2, and the
symmetry priors from Chapter 3 that presumably alleviate the learning task.
We have an unknown function $f^\ast : X(\Omega) \to Y$ that we wish to learn using a
hypothesis class F. Moreover, we also have a symmetry group G acting on
X, and the promise that $f^\ast$ is G-invariant: $f^\ast(g.x) = f^\ast(x)$ for all $x \in X$ and all
$g \in G$. What is the easiest way to combine these objects?
Figure 4.4
Graph isomorphisms. Above: representing a graph by integer node labels and edges
(isomorphism = relabeling), and below: representing a graph as an adjacency matrix
(isomorphism = permutation, $A \mapsto PAP^\top$).
Assume for simplicity that the group G is compact, meaning that it admits a
Haar measure⁴ µ(g). This allows us to define a group averaging operator
$$S_G f(x) := \frac{1}{\mu(G)} \int_G f(g.x)\, \mu(dg). \qquad (4.43)$$
In words, the group averaging operator averages any function $f \in F$ over the
group orbits $\{g.x;\, g \in G\}$; see Figure 4.6. Observe that, since we are assuming
that $f^\ast$ is G-invariant, it holds that $S_G f^\ast = f^\ast$. Since our target function is constant
along orbits, it seems like a good idea to enforce our hypothesis also to be
constant along orbits, and hence to replace $f \in F$ by its smoothed version $S_G f$.
In other words, we can now turn our hypothesis space F into a G-invariant
space via the averaging operator: $S_G F := \{S_G f;\, f \in F\}$.
Figure 4.5
The hand shape is both the domain and the input to the learning system.
Figure 4.6
Illustration of the orbits of the rotation group, as it acts on images via the standard
representation. The group averaging operator $S_G$ averages a hypothesis f(x) over each
orbit.
Example 1. Consider $\Omega = \{1, \dots, N\}$, the discrete set of N elements, and $G = S_N$,
the symmetric group on N elements, which acts on $\Omega$ by permuting its
elements. We consider a hypothesis class $F \subset \{f : \mathbb{R}^N \to \mathbb{R}\}$ given by real polynomials
$f(x) = p(x_1, \dots, x_N)$ of degree up to k. The associated averaged class
$S_G F = \{S_G p(x);\, p \in F\}$ corresponds in this case to the space of symmetric
polynomials of degree k. It is generated by the so-called ring of power-sum
polynomials $\{q_j(x_1, \dots, x_N) := \sum_{i=1}^N x_i^j\}$.
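A direct numerical rendering of this example (illustrative only; the particular polynomial, the small N, and the brute-force enumeration of $S_N$ are choices made for the sketch) is to average a non-symmetric polynomial over all permutations and check that the result is a symmetric function, while an already symmetric power-sum polynomial is left unchanged.

```python
import itertools
import numpy as np

N = 4
perms = [list(p) for p in itertools.permutations(range(N))]

def S_G(f):
    # (S_G f)(x) = (1/|G|) * sum over g in S_N of f(g.x)
    return lambda x: np.mean([f(x[p]) for p in perms])

f = lambda x: x[0] ** 2 + 3.0 * x[1]      # a non-symmetric polynomial hypothesis
q2 = lambda x: np.sum(x ** 2)             # power-sum polynomial, already symmetric

x = np.random.randn(N)
g = np.random.permutation(N)

assert np.isclose(S_G(f)(x), S_G(f)(x[g]))    # S_G f is permutation-invariant
assert np.isclose(S_G(q2)(x), q2(x))          # S_G leaves invariant functions unchanged
```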
Now let us formalize the intuition that replacing F with $S_G F$ is beneficial
for learning purposes, focusing on a regression problem $R(f) = \mathbb{E}_{x \sim \nu} |f(x) - f^\ast(x)|^2 := \|f - f^\ast\|_\nu^2$,
where ν is the data distribution. We assume that the data
distribution is compatible with the symmetry structure from G: if we denote by
$[x] = \{g.x;\, g \in G\}$ the orbit associated with x, then the distribution of x, conditional
on [x], is the uniform measure on the orbit, precisely the Haar measure
of the group. In other words, the data distribution is unchanged if we replace x
by g.x for any $g \in G$.
Recall the decomposition of error from Chapter 2. Let us first address the
approximation error. We decompose
$$
\begin{aligned}
\|f - f^\ast\|^2 &= \|f - S_G f + S_G f - f^\ast\|^2 \\
&= \|f - S_G f\|^2 + \|S_G f - f^\ast\|^2 + 2\,\langle f - S_G f,\, S_G f - f^\ast\rangle \\
&= \|f - S_G f\|^2 + \|S_G f - f^\ast\|^2 + 2\,\mathbb{E}_{[x]}\mathbb{E}_{x|[x]}\big[(f(x) - S_G f(x))(S_G f(x) - f^\ast(x))\big] \\
&= \|f - S_G f\|^2 + \|S_G f - f^\ast\|^2 + 2\,\mathbb{E}_{[x]}\big[(S_G f(x) - S_G f(x))(S_G f(x) - f^\ast(x))\big] \\
&= \|f - S_G f\|^2 + \|S_G f - f^\ast\|^2,
\end{aligned}
$$
where the fourth equality holds because $S_G f$ and $f^\ast$ are constant along orbits
and x conditioned on its orbit follows the Haar measure, so that $\mathbb{E}_{x|[x]}[f(x)] = S_G f(x)$
and the cross-term vanishes. In words, the group averaging is an orthogonal
projection in $L^2(X, \nu)$. Moreover, we immediately deduce that
$\|f - f^\ast\|^2 \ge \|S_G f - f^\ast\|^2$ for any $f \in F$, and therefore
that the approximation error of $S_G F$ is upper bounded by that of F. Additionally,
we won't formalize it here, but it is not hard to see that, since group
averaging is a projection, the statistical error is reduced by averaging. We
thus have a hypothesis class $S_G F$ for which both the approximation and the
statistical errors are upper bounded by those of F.
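The orthogonality argument can also be checked empirically. The Monte-Carlo sketch below (an illustrative toy setup: the target, the hypothesis, the symmetric group on three coordinates, and the i.i.d. Gaussian data are all arbitrary choices) estimates both sides of $\|f - f^\ast\|^2 = \|f - S_G f\|^2 + \|S_G f - f^\ast\|^2$ under a permutation-invariant data distribution.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
N = 3
perms = [list(p) for p in itertools.permutations(range(N))]

f_star = lambda x: np.sum(x ** 2)            # a G-invariant target
f = lambda x: x[0] ** 2 + 2.0 * x[1]         # an arbitrary, non-invariant hypothesis
S_G_f = lambda x: np.mean([f(x[p]) for p in perms])

X = rng.standard_normal((50_000, N))         # i.i.d. coordinates: an exchangeable distribution

lhs = np.mean([(f(x) - f_star(x)) ** 2 for x in X])
rhs = np.mean([(f(x) - S_G_f(x)) ** 2 for x in X]) + \
      np.mean([(S_G_f(x) - f_star(x)) ** 2 for x in X])
print(lhs, rhs)                              # the two agree up to Monte-Carlo error
```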
This sounds like great news indeed, reaffirming the advantage of incorporating
known symmetries of the target function into our hypothesis class. However,
there are (at least) two important issues at this point. First, the group averaging
operator defines a convenient mathematical object, but it requires an averaging
over group orbits, which can quickly become prohibitively expensive
for sufficiently large groups. Next, we have argued that statistical errors will
be smaller in the symmetric hypothesis space, but we haven't actually quantified
by how much. As it turns out, the gains can be roughly quantified as
follows: given a dataset of n points, symmetric learning amounts to having
instead $|G| \cdot n$ data points whenever G is discrete. In the continuous setting, the
gains amount to using [x] instead of $x \in \mathbb{R}^{|\Omega|}$, since [x] lives on a space of lower
dimensionality $|\Omega| - \dim(G)$ than x. While this may be a substantial gain, it does
not unfortunately break the curse of dimensionality. Indeed, in the example of
images, the Euclidean group in $\mathbb{R}^2$ has at most 6 degrees of freedom, a tiny
number against the thousands or millions of pixels!
We thus need to address these two major aspects with additional ideas. For
that purpose, we will now introduce two important concepts that extend the
notion of symmetric representation and enable efficient learning in the face of
high-dimensional data. We first discuss stability to general domain transformations
that are ‘near’-symmetries, and then introduce the concept of scale
separation, which allows us to decompose the learning task across different
resolutions by exploiting local interactions.

Figure 4.7
Two objects moving at different velocities in a video define a transformation outside
the translation group.
4.3 Deformation Stability
The symmetry formalism introduced so far captures an idealised world where
we know exactly which transformations are to be considered as symmetries,
and we want to respect these symmetries exactly. For instance in com-
puter vision, we might assume that planar translations are exact symmetries.
However, the real world is noisy and this model falls short.
First, while these simple groups provide a way to understand global symmetries
of the domain (and by extension, of signals on it, $X(\Omega)$), they do
not capture local symmetries well. For instance, consider a video scene with
several objects, each moving in its own direction. In subsequent
frames, the resulting scene will contain approximately the same semantic infor-
mation, yet no global translation explains the transformation from one frame
to another. Next, in other cases, such as a deformable 3D object viewed by
a camera, it is simply very hard to describe the group of transformations that
preserve the object identity.
These examples illustrate that in reality we are more interested in a far larger
set of transformations where global, exact invariance is replaced by a local,
inexact one. In our discussion, we will distinguish between two scenarios:
the setting where the domain is fixed, and signals x X() are undergoing
deformations, and the setting where the domain itself may be deformed.
Figure 4.8
A warping of the domain that is not explained by any rigid transformation nearly pre-
serves the semantics.
4.3.1 Stability to signal deformations
In many applications, we know a priori that a small deformation of the signal
x should not change the output of f(x), so it is tempting to consider such deformations
as symmetries. For instance, we could view small diffeomorphisms
$\tau \in \mathrm{Diff}(\Omega)$, or even small bijections, as symmetries. However, small deformations
can be composed to form large deformations, so “small deformations”
do not form a group⁵, and we cannot ask for invariance or equivariance to
small deformations only. Since large deformations can actually materially
change the semantic content of the input, it is not a good idea to use the full
group $\mathrm{Diff}(\Omega)$ as symmetry group either.
A better approach is to quantify how “far” a given $\tau \in \mathrm{Diff}(\Omega)$ is from a
given symmetry subgroup $G \subset \mathrm{Diff}(\Omega)$ (e.g. translations) with a complexity
measure $c(\tau)$, so that $c(\tau) = 0$ whenever $\tau \in G$. We can now replace our previous
definition of exact invariance and equivariance under group actions with a
‘softer’ notion of deformation stability (or approximate invariance):
$$\|f(\rho(\tau)x) - f(x)\| \le C\, c(\tau)\, \|x\| \quad \text{for all } x \in X(\Omega),\ \tau \in \mathrm{Diff}(\Omega), \qquad (4.44)$$
where $(\rho(\tau)x)(u) = x(\tau^{-1}u)$ as before, and where C is some constant independent
of the signal x. A function $f \in F(X(\Omega))$ satisfying the above equation is said
to be geometrically stable. We will see examples of such functions in Section 4.4.
Since $c(\tau) = 0$ for $\tau \in G$, this definition generalises the G-invariance property
defined above. Its utility in applications depends on introducing an appropriate
deformation cost. In the case of images defined over a continuous Euclidean
plane, a popular choice is $c^2(\tau) := \int_\Omega \|\nabla\tau(u)\|^2\, du$, which measures the ‘elasticity’
of τ, i.e., how different it is from the displacement by a constant vector
field. This deformation cost is in fact a norm often called the Dirichlet energy,
and can be used to quantify how far τ is from the translation group.
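As a rough illustration (an ad hoc finite-difference sketch with arbitrary grid size and fields, not a construction from the text), one can discretize this elasticity cost for a displacement field sampled on a grid: a constant displacement, i.e. a pure translation, has essentially zero cost, while a non-rigid warp does not.

```python
import numpy as np

H, W = 32, 32
yy, xx = np.meshgrid(np.linspace(0, 1, H), np.linspace(0, 1, W), indexing="ij")

def dirichlet_energy(disp):
    # mean squared finite-difference gradient of the displacement field (H, W, 2),
    # a discrete stand-in for the integral of ||grad tau(u)||^2 over the image plane
    grads = np.gradient(disp, axis=(0, 1))
    return sum((g ** 2).sum() for g in grads) / (H * W)

translation = np.stack([np.full((H, W), 0.1), np.full((H, W), -0.2)], axis=-1)
warp = np.stack([0.05 * np.sin(2 * np.pi * xx), 0.05 * np.cos(2 * np.pi * yy)], axis=-1)

print(dirichlet_energy(translation))   # ~0: a constant displacement field is a pure shift
print(dirichlet_energy(warp))          # > 0: an elastic warp has non-zero deformation cost
```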
Figure 4.9
The set of all bijective mappings from $\Omega$ into itself forms the (set) automorphism group
$\mathrm{Aut}(\Omega)$, of which a symmetry group G (shown as a circle) is a subgroup. Geometric stability
extends the notion of G-invariance and equivariance to transformations ‘around’
G (shown as a gray ring), quantified in the sense of some metric between transformations.
In this example, a smooth distortion of the image is close to a shift.
4.3.2 Stability to domain deformations
In many applications, the object being deformed is not the signal, but the geometric
domain itself. Canonical instances of this are applications dealing
with graphs and manifolds: a graph can model a social network at different
instants of time, containing slightly different social relations (follow graph),
or a manifold can model a 3D object undergoing non-rigid deformations. This
deformation can be quantified as follows. If D denotes the space of all possible
variable domains (such as the space of all graphs, or the space of Riemannian
manifolds), one can define for $\Omega, \tilde{\Omega} \in D$ an appropriate metric (‘distance’)
$d(\Omega, \tilde{\Omega})$ satisfying $d(\Omega, \tilde{\Omega}) = 0$ if $\Omega$ and $\tilde{\Omega}$ are equivalent in some sense: for
example, the graph edit distance vanishes when the graphs are isomorphic,
and the Gromov-Hausdorff distance between Riemannian manifolds equipped
with geodesic distances vanishes when two manifolds are isometric⁶.
A common construction of such distances between domains relies on some
family of invertible mappings $\eta : \Omega \to \tilde{\Omega}$ that try to ‘align’ the domains in a
way that the corresponding structures are best preserved. For example, in the
case of graphs or Riemannian manifolds (regarded as metric spaces with the
geodesic distance), this alignment can compare pair-wise adjacency or distance
structures (d and $\tilde{d}$, respectively),
$$d_D(\Omega, \tilde{\Omega}) = \inf_{\eta \in G} \|d - \tilde{d} \circ (\eta \times \eta)\|,$$
where G is the group of isomorphisms such as bijections or isometries, and
the norm is defined over the product space $\Omega \times \Omega$. In other words, a distance
between elements of $\Omega, \tilde{\Omega}$ is ‘lifted’ to a distance between the domains themselves,
by accounting for all the possible alignments that preserve the internal
structure⁷. Given a signal $x \in X(\Omega)$ and a deformed domain $\tilde{\Omega}$, one can then
consider the deformed signal $\tilde{x} = x \circ \eta^{-1} \in X(\tilde{\Omega})$.
By slightly abusing the notation, we define $X(D) = \{(X(\Omega), \Omega) : \Omega \in D\}$
as the ensemble of possible input signals defined over a varying domain. A
function $f : X(D) \to Y$ is stable to domain deformations if
$$\|f(x, \Omega) - f(\tilde{x}, \tilde{\Omega})\| \le C\, \|x\|\, d_D(\Omega, \tilde{\Omega}) \qquad (4.45)$$
for all $\Omega, \tilde{\Omega} \in D$, and $x \in X(\Omega)$. We will discuss this notion of stability in the
context of manifolds in Chapter 9, where isometric deformations play a crucial
role. Furthermore, it can be shown that stability to domain deformations is
a natural generalisation of stability to signal deformations, by viewing the
latter in terms of deformations of the volume form (Gama, Ribeiro, and Bruna
2019).
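To make the alignment construction above tangible, the brute-force sketch below (a toy example on tiny graphs; the adjacency-matrix comparison and the Frobenius norm are illustrative choices, not the formal graph edit distance) lifts a comparison of adjacency structures to a distance between domains by minimising over all node bijections: the distance vanishes exactly for isomorphic graphs.

```python
import itertools
import numpy as np

A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)       # a 4-cycle

relabel = np.eye(4)[[2, 0, 3, 1]]
A_iso = relabel @ A @ relabel.T                 # the same graph with relabelled nodes
A_chord = A.copy()
A_chord[0, 2] = A_chord[2, 0] = 1.0             # a 4-cycle with one extra chord

def aligned_distance(A, B):
    # d(Omega, Omega~) = min over bijections eta of the mismatch between the
    # two adjacency structures, via exhaustive search over permutations
    n = A.shape[0]
    best = np.inf
    for p in itertools.permutations(range(n)):
        P = np.eye(n)[list(p)]
        best = min(best, np.linalg.norm(A - P @ B @ P.T))
    return best

print(aligned_distance(A, A_iso))    # 0.0: isomorphic domains are at distance zero
print(aligned_distance(A, A_chord))  # > 0: the extra edge cannot be aligned away
```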
4.4 Scale Separation
While deformation stability substantially strengthens the global symmetry pri-
ors, it is not sufficient in itself to overcome the curse of dimensionality, in the
sense that, informally speaking, there are still “too many” functions that respect
(4.44) as the size of the domain grows. A key insight to overcome this curse
is to exploit the multiscale structure of physical tasks. Let us first provide the
essential idea before formalizing it with tools from harmonic analysis.
By viewing the input features $x_1, x_2, \dots, x_N$, with $x_i = x(u_i) \in C$, of a learning task
as ‘particles’, one can interpret a learning system as building an ‘energy’ function
or Hamiltonian $H(x_1, \dots, x_N)$ that computes the likelihood of a certain
attribute; e.g. a cat detector would be associated with a ‘cat’ function H so that
H(x) is large whenever the image x contains a cat. The simplest Hamiltonians
in physics are those which are separable: $H(x) = \sum_i h(x_i)$. In our example,
that would correspond to a cat detector that simply aggregates individual pixel
features, irrespective of their location. Since h is a low-dimensional function,
learning such separable functionals poses no statistical challenge; however,
this model, called a mean-field model in physics, disregards any form of
interaction between features/particles, and as such is not able to capture many
properties of interest.
It thus seems like a good idea to consider a broader class of energy functions
that account for interactions. While generic N-particle interactions define
function spaces with the unavoidable curse of dimensionality, a natural idea is
to focus only on dominant interactions. As it turns out, many physical systems
are dominated by local interactions, meaning that ‘nearby’ particles provide
most of the information. In other words, the energy model now becomes
$H(x) = \sum_i h(\{x_j;\, i \sim j\})$, where $i \sim j$ informally denotes that particles i and j
are neighbors. Luckily, these neighborhoods are often of a size depending only
on the dimension of the physical domain $\Omega$, and thus much smaller than N. Back to our cat
detector example, such an Ansatz would focus on capturing interactions between
neighboring pixels, which makes sense, since shapes and textures forming
complex visual concepts are defined by notions of geometry, and natural
images are known to have a power-law correlation structure (van der Schaaf
and van Hateren 1996). The idea of focusing on local interactions has been
present in ML since at least the work of Geman and Geman (1984) and their
seminal Markov Random Field models.
We now turn to formalize this powerful inductive bias using multiscale
representations. Before describing multiscale representations, we need to intro-
duce the main elements of Fourier transforms, which rely on frequency rather
than scale.
4.4.1 Fourier Transform and Global invariants
Arguably the most famous signal decomposition is the Fourier transform,
the cornerstone of harmonic analysis. The classical one-dimensional Fourier
transform
$$\hat{x}(\xi) = \int_{-\infty}^{+\infty} x(u)\, e^{-i\xi u}\, du$$
expresses the function $x(u) \in L^2(\Omega)$ on the domain $\Omega = \mathbb{R}$ as a linear combination
of orthogonal oscillating basis functions $\varphi_\xi(u) = e^{i\xi u}$, indexed by their rate
of oscillation (or frequency) ξ. Such an organisation into frequencies reveals
important information about the signal, e.g. its smoothness and localisation.
The Fourier basis itself has a deep geometric foundation and can be interpreted
as the natural vibrations of the domain, related to its geometric structure (see
e.g. Berger (2012)).
Figure 4.10
Left: Fourier basis functions have global support. As a result, local signals produce
energy across all frequencies. Right: Wavelets $\psi_{u,\xi}(v) = \psi\!\left(\frac{v-u}{\xi}\right)$ at different resolutions
ξ and positions u. Contrary to the Fourier atoms, wavelets are localized both in space
and frequency.
The Fourier transform⁸ plays a crucial role in signal processing as it offers
a dual formulation of convolution,
$$(x \star \theta)(u) = \int_{-\infty}^{+\infty} x(v)\, \theta(u - v)\, dv,$$
a standard model of linear signal filtering (here and in the following, x denotes
the signal and θ the filter). As we will show in the following, the convolution
operator is diagonalised in the Fourier basis, making it possible to express
convolution as the product of the respective Fourier transforms,
$$\widehat{(x \star \theta)}(\xi) = \hat{x}(\xi) \cdot \hat{\theta}(\xi),$$
a fact known in signal processing as the Convolution Theorem. We have stated
the Fourier transform in the one-dimensional case for simplicity of exposition,
but the reader should rest assured that all these important correspondences
remain valid in full generality.
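The discrete, periodic analogue of the Convolution Theorem can be verified in a few lines; the sketch below (illustrative, with an arbitrary random signal and filter) compares a circular convolution computed directly in the spatial domain against the inverse FFT of the pointwise product of the two spectra.

```python
import numpy as np

N = 64
x = np.random.randn(N)        # signal
theta = np.random.randn(N)    # filter

# circular convolution computed directly in the spatial domain ...
conv = np.array([sum(x[v] * theta[(u - v) % N] for v in range(N)) for u in range(N)])

# ... and via the pointwise product of discrete Fourier transforms
conv_fft = np.fft.ifft(np.fft.fft(x) * np.fft.fft(theta)).real

assert np.allclose(conv, conv_fft)
```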
As it turns out, many fundamental differential operators such as the Lapla-
cian are described as convolutions on Euclidean domains. Since such differ-
ential operators can be defined intrinsically over very general geometries, this
provides a formal procedure to extend Fourier transforms beyond Euclidean
domains, including graphs, groups and manifolds. We will discuss this in detail
in Chapters 7 and 9.
An essential aspect of Fourier transforms is that they reveal global properties
of the signal and the domain, such as smoothness or conductance. Such global
behavior is convenient in the presence of global symmetries of the domain such
as translation, but not to study more general diffeomorphisms. This requires
a representation that trades off spatial and frequential localisation, as we see
next.
4.4.2 Multiscale representations
The notion of local invariance can be articulated by switching from a Fourier
frequency-based representation to a scale-based representation, the cornerstone
of multi-scale decomposition methods such as wavelets⁹. The essential
insight of multi-scale methods is to decompose functions defined over the
domain $\Omega$ into elementary functions that are localised both in space and
frequency. Contrary to Fourier, wavelet atoms are localised and multi-scale,
allowing us to capture fine details of the signal with atoms having small spatial
support, and coarse details with atoms having large spatial support. The term
atom here is synonymous with ‘basis element’ in Fourier analysis, with the
caveat that wavelets are redundant (over-complete). In the case of wavelets,
this is achieved by correlating a translated and dilated filter (mother wavelet)
ψ, producing a combined spatio-frequency representation called a continuous
wavelet transform
$$(W_\psi x)(u, \xi) = \xi^{-1/2} \int_{-\infty}^{+\infty} \psi\!\left(\frac{v - u}{\xi}\right) x(v)\, dv.$$
The translated and dilated filters are called wavelet atoms; their spatial position
and dilation correspond to the coordinates u and ξ of the wavelet transform.
These coordinates are usually sampled dyadically ($\xi = 2^j$ and $u = 2^j k$), with j
referred to as scale. Multi-scale signal representations bring important benefits
in terms of capturing regularity properties beyond global smoothness, such as
piece-wise smoothness, which made them a popular tool in signal and image
processing and numerical analysis in the 90s.
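The simplest concrete instance of such a decomposition is one level of the Haar transform, sketched below on a random 1D signal (purely illustrative; it is a discrete, orthogonal relative of the continuous wavelet transform above, not the same object): local averages capture the coarse scale, local differences capture the localised details, and together they reconstruct the signal exactly.

```python
import numpy as np

x = np.random.randn(64)

approx = (x[0::2] + x[1::2]) / np.sqrt(2)    # coarse scale: localised averages
detail = (x[0::2] - x[1::2]) / np.sqrt(2)    # fine scale: localised differences

# exact reconstruction: separating scales loses no information
x_rec = np.empty_like(x)
x_rec[0::2] = (approx + detail) / np.sqrt(2)
x_rec[1::2] = (approx - detail) / np.sqrt(2)
assert np.allclose(x, x_rec)
```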
4.4.3 Deformation stability of Multiscale representations
The benefit of multiscale localised wavelet decompositions over Fourier
decompositions is revealed when considering the effect of small deformations
‘nearby’ the underlying symmetry group. Let us illustrate this important con-
cept in the Euclidean domain and the translation group. Since the Fourier
representation diagonalises the shift operator (which can be thought of as
convolution, as we will see in more detail in Chapter 7), it is an efficient
representation for translation transformations. However, Fourier decomposi-
tions are unstable under high-frequency deformations. In contrast, wavelet
decompositions offer a stable representation in such cases.
Figure 4.11
Multiscale decomposition of an input image using wavelets: each scale decomposes the
existing coarse-grained image into an even coarser version, and the details needed to
reconstruct it.

Indeed, let us consider $\tau \in \mathrm{Aut}(\Omega)$ and its associated linear representation
ρ(τ). When $\tau(u) = u - v$ is a shift, as we will verify in Chapter 7, the operator
$\rho(\tau) = S_v$ is a shift operator that commutes with convolution. Since convolution
operators are diagonalised by the Fourier transform, the action of a shift in
the frequency domain amounts to shifting the complex phase of the Fourier
transform,
$$\widehat{(S_v x)}(\xi) = e^{-i\xi v}\, \hat{x}(\xi).$$
Thus, the Fourier modulus $f(x) = |\hat{x}|$, removing the complex phase, is a simple
shift-invariant function: $f(S_v x) = f(x)$. However, if we have only an approximate
translation, $\tau(u) = u - \tilde{\tau}(u)$ with $\|\nabla\tilde{\tau}\|_\infty = \sup_u \|\nabla\tilde{\tau}(u)\| \le \epsilon$, the situation is
entirely different: it is possible to show that
$$\frac{\|f(\rho(\tau)x) - f(x)\|}{\|x\|} = O(1)$$
irrespective of how small ϵ is (i.e., how close τ is to being a shift). Consequently,
such a Fourier representation is unstable under deformations, however
small. This instability manifests itself in general domains and non-rigid transformations;
we will see another instance of it in the analysis of 3D
shapes using the natural extension of Fourier transforms described in Chapter
9.
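This contrast is easy to reproduce numerically. In the toy sketch below (illustrative; the frequency, the shift, and the dilation factor are arbitrary), the Fourier modulus is exactly invariant to a circular shift, but a slight dilation of the axis, a deformation with small gradient ε, moves the spectral line of a high-frequency signal and changes the modulus by an O(1) amount.

```python
import numpy as np

N = 256
t = np.arange(N) / N
x = np.cos(2 * np.pi * 60 * t)                # a high-frequency signal

modulus = lambda s: np.abs(np.fft.fft(s))     # the representation f(x) = |x_hat|

# exact shift invariance: f(S_v x) = f(x)
print(np.linalg.norm(modulus(np.roll(x, 17)) - modulus(x)))     # ~0 (float error only)

# a slight dilation of the axis, x(u) -> x((1 + eps) u): the spectral peak moves
# from frequency 60 to 60 (1 + eps), so the modulus changes by O(1), not O(eps)
eps = 0.05
x_warped = np.cos(2 * np.pi * 60 * (1 + eps) * t)
print(np.linalg.norm(modulus(x_warped) - modulus(x)) / np.linalg.norm(modulus(x)))  # ~1
```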
Wavelets offer a remedy to this problem that also reveals the power of multi-scale
representations. In the above example, we can show (Mallat 2012) that
the wavelet decomposition $W_\psi x$ is approximately equivariant to deformations,¹⁰
$$\frac{\|\rho(\tau)(W_\psi x) - W_\psi(\rho(\tau)x)\|}{\|x\|} = O(\epsilon).$$
In other words, decomposing the signal information into scales using localised
filters rather than frequencies turns a globally unstable representation into a
family of locally stable features. Importantly, such measurements at different
scales are not yet invariant, and need to be progressively processed towards the
low frequencies, hinting at the deep compositional nature of modern neural networks,
which is captured in our Blueprint for Geometric Deep Learning, presented
next.
4.4.4 Scale Separation Prior
We can build on this insight by considering a multiscale coarsening of the
data domain $\Omega$ into a hierarchy $\Omega_1, \dots, \Omega_J$. As it turns out, such coarsening
can be defined on very general domains, including grids, graphs, and manifolds.
Informally, a coarsening assimilates nearby points $u, u' \in \Omega$ together,
and thus only requires an appropriate notion of metric in the domain. If
$X_j(\Omega_j, C_j) := \{x_j : \Omega_j \to C_j\}$ denotes signals defined over the coarsened domain
$\Omega_j$, we informally say that a function $f : X(\Omega) \to Y$ is locally stable at scale j
if it admits a factorisation of the form $f \approx f_j \circ P_j$, where $P_j : X(\Omega) \to X_j(\Omega_j)$ is a
non-linear coarse graining and $f_j : X_j(\Omega_j) \to Y$. In other words, while the target
function f might depend on complex long-range interactions between features
over the whole domain, in locally-stable functions it is possible to separate
the interactions across scales, by first focusing on localised interactions that
are then propagated towards the coarse scales. Figures 4.12 and 4.13 illustrate
simple examples of this scale factorization.
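As a toy rendering of the factorisation $f \approx f_j \circ P_j$ (an illustrative 1D sketch; the signal, the smooth template, and the pooling factor are arbitrary choices), the target below, a correlation with a coarse-scale template, is computed both at full resolution and after an average-pooling coarse-graining, and the two values nearly coincide: the fine-scale details were not needed.

```python
import numpy as np

def pool(x, s):
    # average pooling P_j : X(Omega) -> X(Omega_j), coarsening by a factor s
    return x.reshape(-1, s).mean(axis=1)

N, s = 1024, 16
t = np.arange(N) / N
template = np.exp(-((t - 0.30) ** 2) / 0.02)          # smooth, coarse-scale template
x = np.exp(-((t - 0.32) ** 2) / 0.03) + 0.01 * np.random.randn(N)

f_fine = np.dot(x, template) / N                      # target evaluated on Omega
f_coarse = np.dot(pool(x, s), pool(template, s)) / (N // s)   # same target after coarse-graining

print(f_fine, f_coarse)                               # nearly identical values
```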
Such principles¹¹ are of fundamental importance in many areas of physics
and mathematics, as manifested for instance in statistical physics in the so-
called renormalisation group, or leveraged in important numerical algorithms
such as the Fast Multipole Method. In machine learning, multiscale repre-
sentations and local invariance are the fundamental mathematical principles
underpinning the efficiency of Convolutional Neural Networks and Graph
Neural Networks and are typically implemented in the form of local pooling. In
future work, we will further develop tools from computational harmonic anal-
ysis that unify these principles across our geometric domains and will shed
light onto the statistical learning benefits of scale separation.
Figure 4.12
Illustration of Scale Separation for image classification tasks. The classifier f′ defined
on signals on the coarse grid X(Ω′) should satisfy $f \approx f' \circ P$, where $P : X(\Omega) \to X(\Omega')$.
Here, the coarse scales dominate: to discriminate between these two classes, one can
coarse-grain the domain and drop all the details below a certain resolution.
Figure 4.13
Illustration of Scale Separation for image classification tasks. The classifier f′ defined
on signals on the coarse grid X(Ω′) should satisfy $f \approx f' \circ P$, where $P : X(\Omega) \to X(\Omega')$.
Here, the fine scales dominate: to discriminate between these two classes, it is sufficient
to focus on local statistics and then combine them with a simple averaging.
4.5 The GDL Blueprint
The geometric principles of Symmetry, Geometric Stability, and Scale Sep-
aration discussed so far can be combined to provide a universal blueprint for
learning stable representations of high-dimensional data. These representations
will be produced by functions f operating on signals $X(\Omega, C)$ defined on the
domain $\Omega$, which is endowed with a symmetry group G.
The geometric priors we have described so far do not prescribe a specific
architecture for building such representation, but rather a series of necessary
conditions. However, they hint at an axiomatic construction that provably sat-
isfies these geometric priors, while ensuring a highly expressive representation
that can approximate any target function satisfying such priors.
4.5.1 Invariant and Equivariant Layers
A simple initial observation is that, in order to obtain a highly expressive representation,
we are required to introduce a non-linear element, since if f is linear
and G-invariant, then for all $x \in X(\Omega)$,¹²
$$f(x) = \frac{1}{\mu(G)} \int_G f(g.x)\, d\mu(g) = f\!\left(\frac{1}{\mu(G)} \int_G (g.x)\, d\mu(g)\right),$$
which indicates that f only depends on x through the G-average
$Ax = \frac{1}{\mu(G)} \int_G (g.x)\, d\mu(g)$. In the case of images and translation, this would entail using
only the average RGB color of the input!
While this reasoning shows that the family of linear invariants is not a very
rich object, the family of linear equivariants provides a much more powerful
tool, since it enables the construction of rich and stable features by composition
with appropriate non-linear maps, as we will now explain. Indeed,
if $B : X(\Omega, C) \to X(\Omega, C')$ is G-equivariant, satisfying $B(g.x) = g.B(x)$ for all
$x \in X$ and $g \in G$, and $\sigma : C' \to C''$ is an arbitrary (non-linear) map, then we
easily verify that the composition $U := (\sigma \circ B) : X(\Omega, C) \to X(\Omega, C'')$ is also G-equivariant,
where $\sigma : X(\Omega, C') \to X(\Omega, C'')$ is the element-wise instantiation
of σ given as $(\sigma(x))(u) := \sigma(x(u))$.
This simple property allows us to define a very general family of G-invariants,
by composing U with the group average, $A \circ U : X(\Omega, C) \to C''$.
A natural question is thus whether any G-invariant function can be approximated
at arbitrary precision by such a model, for appropriate choices of B
and σ. It is not hard to adapt the standard Universal Approximation Theorems
from unstructured vector inputs to show that shallow ‘geometric’ networks are
also universal approximators, by properly generalising the group average to
a general non-linear invariant¹³. However, as already described in the case of
Fourier versus wavelet invariants, there is a fundamental tension between shallow
global invariance and deformation stability. This motivates an alternative
representation, which considers instead localised equivariant maps¹⁴.
Assuming that $\Omega$ is further equipped with a distance metric d, we call an
equivariant map U localised if $(Ux)(u)$ depends only on the values of $x(v)$ for
$v \in N_u = \{v : d(u, v) \le r\}$, for some small radius r; the latter set $N_u$ is called the
receptive field.
Figure 4.14
Geometric Deep Learning blueprint, exemplified on a graph. A typical Graph Neural
Network architecture may contain permutation equivariant layers (computing node-
wise features), local pooling (graph coarsening), and a permutation-invariant global
pooling layer (readout layer).
A single layer of a local equivariant map U cannot approximate functions with
long-range interactions, but a composition of several local equivariant maps
$U_J \circ U_{J-1} \circ \cdots \circ U_1$ increases the receptive field¹⁵ while preserving the stability
properties of local equivariants. The receptive field is further increased by
interleaving downsampling operators that coarsen the domain (again assuming
a metric structure), completing the parallel with Multiresolution Analysis
(MRA, see e.g. Mallat (1999)).
In summary, the geometry of the input domain, with knowledge of an
underlying symmetry group, provides three key building blocks: (i) a local
equivariant map, (ii) a global invariant map, and (iii) a coarsening operator.
These building blocks provide a rich function approximation space with prescribed
invariance and stability properties by combining them together in a
scheme we refer to as the Geometric Deep Learning Blueprint (Figure 4.14).
Geometric Deep Learning Blueprint

Let $\Omega$ and $\Omega'$ be domains, G a symmetry group over $\Omega$, and write $\Omega' \subseteq \Omega$
if $\Omega'$ can be considered a compact version of $\Omega$.
We define the following building blocks:

• Linear G-equivariant layer $B : X(\Omega, C) \to X(\Omega', C')$ satisfying
$B(g.x) = g.B(x)$ for all $g \in G$ and $x \in X(\Omega, C)$.

• Nonlinearity $\sigma : C \to C'$ applied element-wise as $(\sigma(x))(u) = \sigma(x(u))$.

• Local pooling (coarsening) $P : X(\Omega, C) \to X(\Omega', C)$, such that $\Omega' \subseteq \Omega$.

• G-invariant layer (global pooling) $A : X(\Omega, C) \to Y$ satisfying
$A(g.x) = A(x)$ for all $g \in G$ and $x \in X(\Omega, C)$.

Using these blocks allows constructing G-invariant functions $f : X(\Omega, C) \to Y$
of the form
$$f = A \circ \sigma_J \circ B_J \circ P_{J-1} \circ \dots \circ P_1 \circ \sigma_1 \circ B_1,$$
where the blocks are selected such that the output space of each block
matches the input space of the next one. Different blocks may exploit
different choices of symmetry groups G.
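A minimal instantiation of this blueprint, in the spirit of a Deep Sets-style model for the symmetric group acting on a set of feature vectors, is sketched below. The specific equivariant parametrisation (a per-element linear map plus a mean aggregation), the layer widths, and the ReLU nonlinearity are illustrative choices, not prescribed by the blueprint.

```python
import numpy as np

rng = np.random.default_rng(0)

class EquivariantLinear:
    # B(x)_i = x_i W1 + (mean_j x_j) W2, a linear permutation-equivariant layer
    def __init__(self, c_in, c_out):
        self.W1 = 0.1 * rng.standard_normal((c_in, c_out))
        self.W2 = 0.1 * rng.standard_normal((c_in, c_out))
    def __call__(self, x):                        # x has shape (n, c_in)
        return x @ self.W1 + x.mean(axis=0, keepdims=True) @ self.W2

sigma = lambda x: np.maximum(x, 0.0)              # element-wise nonlinearity
A = lambda x: x.sum(axis=0)                       # G-invariant global pooling (readout)

B1, B2 = EquivariantLinear(3, 8), EquivariantLinear(8, 8)
f = lambda x: A(sigma(B2(sigma(B1(x)))))          # f = A . sigma . B2 . sigma . B1

x = rng.standard_normal((5, 3))                   # a set of 5 elements with 3 channels
perm = rng.permutation(5)
assert np.allclose(f(x), f(x[perm]))              # invariance to the domain symmetry
```

Adding a local pooling block P between the two equivariant layers would complete the correspondence with the composition above; it is omitted here because a plain set carries no metric to coarsen (see note 14).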
4.5.2 Different settings of Geometric Deep Learning
One can make an important distinction between the setting where the domain
is assumed to be fixed and one is only interested in varying input signals
defined on that domain, and the setting where the domain is part of the input,
as $\Omega$ varies together with the signals defined on it. A classical instance of the former
case is encountered in computer vision applications, where images are assumed
to be defined on a fixed domain (grid). Graph classification is an example of the
latter setting, where both the structure of the graph and the signal defined on it
(e.g. node features) are important. In the case of a varying domain, geometric
stability (in the sense of insensitivity to deformations of $\Omega$) plays a crucial
role in Geometric Deep Learning architectures.
This blueprint has the right level of generality to be used across a wide range
of geometric domains. Different Geometric Deep Learning methods thus differ
in their choice of the domain, symmetry group, and the specific implementation
details of the aforementioned building blocks. As we will see in the following,
a large class of deep learning architectures currently in use fall into this scheme
and can thus be derived from common geometric principles.
In the following Chapters we will describe the various geometric domains
focusing on the ‘5G’, as well as the specific implementations of Geometric
Deep Learning on these domains.
Architecture     Domain             Symmetry group G
CNN              Grid               Translation
Spherical CNN    Sphere / SO(3)     Rotation SO(3)
Mesh CNN         Manifold           Isometry Iso(Ω) / Gauge symmetry SO(2)
GNN              Graph              Permutation Σ_n
Deep Sets        Set                Permutation Σ_n
Transformer      Complete Graph     Permutation Σ_n
E(3) GNNs        Geometric Graph    Permutation Σ_n × Euclidean group E(3)
LSTM             1D Grid            Time warping
4.6 Summary
In this chapter, we have taken the ideas developed in Chapter 3 and deployed
them in learning problems with physical structure. This physical structure is
encoded in a simple domain $\Omega$ where rich symmetries can be defined, and
then extended via composition to the signal space $X(\Omega)$. In order to make the
symmetric learning principles effective, we need to augment the prior of invariance/equivariance
to include also stability to deformations, as well as the scale
separation principle. Taken together, these principles define a computational
blueprint (the GDL blueprint) that enables a rich, efficient class of invariant
and geometrically stable representations parametrized by neural networks. In
the following chapters, we will instantiate this GDL blueprint on key representative
domains and explore the specific geometric properties associated with each
of them.
Exercises
1. Count the number of symmetries of a label function. Assume finite input space and
label space.
2. Prove that if we have the full symmetry group of a label function, we only need one
label per class to learn the true function. (Maybe start with a sub-question: what are
the orbits of the full symmetry group of a label function?)
3. Show basic properties of group actions: e.g. how the inverse of a group element
yields the inverse map on $\Omega$, how the identity acts as the identity map, etc.
4. Multiplication tables: does any finite group have a multiplication table? Can any
table we write down be interpreted as defining a group? If no, list the constraints the
table needs to satisfy.
5. Uniqueness of generators. Are they unique? If not, find an alternative set of
generators for D3, different from the ones in the text (rotation, flip). Write the
multiplication table and Cayley diagram in terms of these generators.
6. If we have a group with generators, and we take a subset of generators, do we get a
subgroup?
7. If we have a group with generators and a subgroup, does the subgroup necessarily
correspond to a subset of generators?
8. If we have a group element, it can always be written as a product of generators. Is
this way of writing it unique?
Notes
1. When $\Omega$ has some additional structure, we may further restrict the kinds
of signals in $X(\Omega, C)$. For example, when $\Omega$ is a smooth manifold, we may
require the signals to be smooth. Whenever possible, we will omit the range C
for brevity.
2. When the domain $\Omega$ is discrete, µ can be chosen as the counting measure,
in which case the integral becomes a sum. In the following, we will omit the
measure and use du for brevity.
3. $\Omega$ must be a vector space in order for an expression αu + βv to make sense.
4. Think about it as the generalization of the uniform measure over the group,
so that each group element has equal ‘mass’; see Appendix B.
5. E.g., the composition of two ϵ-isometries is a 2ϵ-isometry, violating the
closure property.
6. The graph edit distance measures the minimal cost of making two graphs
isomorphic by a sequence of graph edit operations. The Gromov-Hausdorff
distance measures the smallest possible metric distortion of a correspondence
between two metric spaces; see Gromov (1981).
7. Two graphs can be aligned by the Quadratic Assignment Problem (QAP),
which considers in its simplest form two graphs $G, \tilde{G}$ of the same size n, and
solves $\min_{P \in \Sigma_n} \mathrm{trace}(A P \tilde{A} P^\top)$, where $A, \tilde{A}$ are the respective adjacency
matrices and $\Sigma_n$ is the group of n × n permutation matrices. The graph edit
distance can be associated with such a QAP (Bougleux et al. 2015).
8. In the following, we will use convolution and (cross-)correlation
$$(x \star \theta)(u) = \int_{-\infty}^{+\infty} x(v)\, \theta(u + v)\, dv$$
interchangeably, as is common in machine learning: the difference between
the two is whether the filter is reflected, and since the filter is typically
learnable, the distinction is purely notational.
9. See Mallat (1999) for a comprehensive introduction.
10. This notation implies that ρ(τ) acts on the spatial coordinate of $(W_\psi x)(u, \xi)$.
11. The Fast Multipole Method (FMM) is a numerical technique originally developed
to speed up the calculation of long-ranged forces in n-body problems.
FMM groups sources that lie close together and treats them as a single source.
12. Here, µ(g) is known as the Haar measure of the group G, and the integral
is performed over the entire group.
13. Such proofs have been demonstrated, for example, for the Deep Sets model
by Zaheer et al. (2017).
14. Meaningful metrics can be defined on grids, graphs, manifolds, and groups.
A notable exception is sets, where there is no predefined notion of metric.
15. The term ‘receptive field’ originated in the neuroscience literature, referring
to the spatial domain that affects the output of a given neuron.