4
Foundations of Geometric Deep Learning
Symmetries are defined over signal spaces, which are in turn defined over a physical
domain.
In order to beat the curse of dimensionality, we extend the invariance
prior with deformation stability and scale separation.
These priors can be used constructively in a generic neural architecture:
the GDL blueprint.
In the previous chapter, we have introduced the formalism of group theory
to account for the symmetries of a generic learning task. While this formalism
provides the basic backbone to think about geometric deep learning, we need
to complement it with additional structural assumptions in order to effectively
beat the curse of dimensionality.
First, the input space X of many learning problems is high-dimensional, and its
symmetry structure is challenging to determine in full generality. Our first task
will be to describe input spaces instead as signal spaces, where each $x \in X$ is
itself a mapping $x : \Omega \to C$ from a physical domain $\Omega$ to an output channel
space C. In some applications, it may be convenient to consider instead the
domain $\Omega$ as the input itself, and thus as varying. Once these objects are defined,
we will see how the notions of invariance and equivariance apply to this setting,
thus defining the familiar geometric neural network layers.
The physical domain $\Omega$ is equipped with a metric structure, which will enable
us to go substantially further than the algebraic notions of invariance and equivariance.
Indeed, this added metric structure enables the notion of scale, i.e. the
fact that signal measurements can be arranged into neighborhoods of varying
size. As it turns out, this metric structure provides two key geometric priors:
stability to deformations and scale separation. Finally, combining both
algebraic and geometric priors will naturally lead to a blueprint for designing
neural networks: the GDL blueprint.
4.1 Geometric Domains
We concluded in chapter 2 that learning in high dimensions is intractable with-
out inductive biases, and while knowledge about symmetries could provide
a strong inductive bias, we saw in the previous section that it is unlikely we
will have knowledge of the full symmetry group of the label function. For-
tunately nature provides us with a solution: in a great many problems, the
data we observe are signals on a low-dimensional domain with structure and
symmetries that can be leveraged for more statistically efficient learning.
4.1.1 Physical Domains
The physical domain $\Omega$ is where signal measurements take place, and depending
on the application it may be defined with an increasing amount of structure.
The following chapters will delve into the key representative applications, but
here we give a first glimpse of the diversity of physical domains that arise.
In computer vision applications, images x can be viewed as signals over a
two-dimensional grid $\Omega_N = \left\{\left(\tfrac{i}{N}, \tfrac{j}{N}\right)\right\}_{1 \le i,j \le N}$, where each pixel u encodes
color intensities $x(u) \in \{1, \dots, 255\}^3 := C$. It will be convenient to view these
domains as discretizations of an underlying continuous object as $N \to \infty$, given
by a rectangular domain $\bar{\Omega} = [0, 1]^2 \subset \mathbb{R}^2$, or in some settings its unbounded
extension $\bar{\Omega} = \mathbb{R}^2$, as well as $\bar{C} = \mathbb{R}^k$.
Another representative setting is given by $\Omega = \{1, \dots, N\}$, a set of N elements.
This represents the crudest form of structure, absent any notion of topology
or metric structure, and applies to settings such as point clouds in computer
graphics. The next fundamental domain is given by graphs $\Omega = G = (V, E)$,
where elements can now be related to each other, thus defining a topology
and ultimately a metric structure.
Finally, we also encounter situations where the input to the learning system
is the domain itself, such as 3D meshes in computer graphics, or graphs in
some algorithmic tasks. The problems we consider thus come in two broad
categories: problems where each data point $x \in X$ is a signal $x : \Omega \to C$ on the
same domain $\Omega$ (e.g. every image can be viewed as a function on the plane),
and problems where each data point corresponds to a different domain (e.g.
each data point is a different graph or mesh).
The key takeaway is that the domain $\Omega$ is generally a much simpler object
than the input space X, enabling a richer description of the relevant symmetries.
Before executing this plan, we first need to understand how to relate X
to $\Omega$ and, importantly, how symmetries defined over $\Omega$ carry over to X.
4.1.2 Spaces of Signals on a Domain
Now that we have introduced the signal domain $\Omega$, we can revisit the input
space of the ML model as a vector space of functions over $\Omega$:
The space of C-valued signals on $\Omega$¹ (for $\Omega$ a set, possibly with additional
structure, and C a vector space, whose dimensions are called channels)
$$X(\Omega, C) = \{x : \Omega \to C\} \qquad (4.36)$$
is a function space that has a vector space structure. Addition and scalar
multiplication of signals are defined as
$$(\alpha x + \beta y)(u) = \alpha x(u) + \beta y(u) \quad \text{for all } u \in \Omega,$$
with real scalars α, β. Given an inner product $\langle v, w \rangle_C$ on C and a measure²
µ on $\Omega$ (with respect to which we can define an integral), we can define an
inner product on $X(\Omega, C)$ as
$$\langle x, y \rangle = \int_\Omega \langle x(u), y(u) \rangle_C \, d\mu(u). \qquad (4.37)$$
We observe two simple, yet important properties of this space: while in many
cases linear combinations of points on $\Omega$ are not well-defined³, we can linearly
combine signals on it, i.e., the space of signals forms a vector space. Moreover,
since we can define an inner product between signals, this space is a Hilbert
space.
How Signals Transform: Regular and Induced Representations
Given a symmetry group G acting on the signal domain $\Omega$, this action can be naturally
upgraded to X through the notion of induced representation. Let us now
describe it in simple terms.
To fix ideas, suppose first that $C = \mathbb{R}$. Denoting the action of G on $\Omega$ by
$$G \times \Omega \to \Omega \qquad (4.38)$$
$$(g, u) \mapsto g \cdot u, \qquad (4.39)$$
we now consider the linear representation
$$G \times X \to X \qquad (4.40)$$
$$(g, x) \mapsto \rho(g)x, \qquad (4.41)$$
with $(\rho(g)x)(u) := x(g^{-1} \cdot u)$. We verify that ρ is indeed a group representation,
which simply converts transformations of the domain $\Omega$ into transformations of
signals, acting by composition. Figure 4.1 illustrates a simple example when G
is the translation group in $\Omega$ (assuming for simplicity no boundary conditions).
Figure 4.1
Example of the translation group defining a transformation over X.
Figure 4.2
Example of a color transformation by reflection g.y = ±y.
This formalism not only enables us to express domain transformations, but also
more general situations where groups act not only on the domain $\Omega$
but also on the channel (or fiber) space C. As a motivating example, suppose now
that $C = \mathbb{R}^3$ is the RGB colorspace (assumed continuous), and that, on top of
translations/rotations over $\Omega$, we can also perform color transformations in C;
see Figure 4.2. Then, if $\tilde{G}$ is a group acting on C, we can define a joint
representation for $G \times \tilde{G}$ as follows:
$$(\rho(g, \tilde{g})x)(u) := \tilde{g}^{-1} \cdot (x(g^{-1} \cdot u)). \qquad (4.42)$$
Finally, another relevant scenario is when the same group G defines a joint
group action on both the domain and the channel space, in which case we write
$(\rho(g)x)(u) = g^{-1} \cdot (x(g^{-1} \cdot u))$. Figure 4.3 illustrates this scenario over 2D vector
fields and the rotation group.
In summary, we have now seen how an initial low-dimensional description
of the symmetries, expressed over physical domains, can be naturally
extended to signals on that domain.
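To make the induced representation concrete, here is a minimal numerical sketch (not from the text; the grid size, the signal, and the use of NumPy are arbitrary choices) for the cyclic translation group acting on a one-dimensional periodic grid Ω = {0, …, N−1}: the domain action u ↦ g·u is lifted to signals via (ρ(g)x)(u) = x(g⁻¹·u), and the representation property ρ(gh) = ρ(g)ρ(h) is checked directly.

```python
import numpy as np

N = 8
x = np.random.randn(N)            # a signal x : Omega -> R on the periodic grid

def rho(g, x):
    # (rho(g) x)(u) = x(g^{-1}.u) = x((u - g) mod N); np.roll shifts the signal by g
    return np.roll(x, g)

g, h = 3, 5
# Representation property: rho(g + h) = rho(g) rho(h) (the group here is Z_N)
assert np.allclose(rho((g + h) % N, x), rho(g, rho(h, x)))

# Evaluating the transformed signal at u reads off the original signal at u - g
u = 2
assert np.isclose(rho(g, x)[u], x[(u - g) % N])
```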
Figure 4.3
The rotation group acting both on signal domain and output channel space, here illus-
trated over a vector field.
4.1.3 Fixed domain problems & domain symmetries
When the domain is fixed and known in advance, we can determine its sym-
metry group and use this knowledge for more efficient learning. This works
when the symmetries of the domain are also symmetries of the label function,
which is a common occurrence. For example, our data may consist of signals
on the plane (e.g. camera images) or on the sphere (e.g. global weather signals).
Shifts (translations) are symmetries of the plane, and 3D rotations are sym-
metries of the sphere (since these maps are invertible and structure-preserving,
mapping the domain onto itself while preserving the relevant structure such as
distances).
As we have just seen, these symmetries can be applied not just to points
in the domain itself, but also to signals on the domain (see Section 4.1.2), and will
generally not change the semantics of the signal. Hence, our neural network
should process inputs related by symmetries in the same way.
Unlike the full group of symmetries of the label function, we know the symmetries
of the domain a priori, so we can use this knowledge to our advantage to
constrain the function class F. How this is done exactly via invariance and
equivariance will be discussed in Section 4.5.
Finally, an important word on the definition of symmetry: since we defined
symmetries as structure-preserving invertible maps from the domain to itself,
what counts as a symmetry depends on what structure one wishes to preserve.
For instance, in $\Omega = \mathbb{R}^2$, rigid transformations of the form $\phi(x) = Ax + b$, where
$A^\top A = I_2$, are often symmetries of image classification problems, since they do
not change the semantics. However, in some cases this is not true: the ‘classic’
example is digit classification between the digits ‘6’ and ‘9’. Unlike generic
object classification, this task is not rotation-invariant. We will thus be careful
to remember that symmetries are inherently a user-specified notion.
4.1.4 Variable domain problems & domain isomorphisms
When the domain is variable, so is the symmetry group. There may be ways to
deal with the symmetries of each domain in the category of domains under con-
sideration, but more importantly we need to make sure we deal correctly with
domains that look different but are actually equivalent (isomorphic) in some
sense. In general, the appropriate notion of equivalence depends on the con-
text, but what we can say in general is that two domains are equivalent if there
is an invertible and structure-preserving map (isomorphism) between them. We
will see various examples of structured domains and structure-preserving maps
throughout the chapters of part II, but for now we will sketch two important
examples: graphs and meshes/manifolds.
One way to represent a graph is as a set of node labels (e.g. arbitrary but
unique integers) together with a set of edges (pairs of node labels). Then simply
changing the node labels gives us a new graph that is nevertheless equivalent,
and the mapping that sends each label of the original graph to the new label is
an isomorphism.
If we have two objects that are not identical but still equivalent (under some
notion of equivalence that depends on the context), we need to ensure that the
functions we learn produce an identical or at least equivalent output for the two
objects. For graph classification (chapter 5), this means that two graphs related
by a graph isomorphism (See fig. 4.4, left) should be assigned the same label.
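To illustrate this requirement on a toy example (an ad hoc sketch, not the text's construction): relabelling the nodes of a graph acts on its adjacency matrix as A ↦ PAPᵀ for a permutation matrix P, and any function we learn for graph classification should return the same value on both matrices. Here a sorted degree sequence stands in for such a permutation-invariant function.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
A = rng.integers(0, 2, size=(n, n))
A = np.triu(A, 1)
A = (A + A.T).astype(float)              # a random undirected graph on n nodes

perm = rng.permutation(n)
P = np.eye(n)[perm]                      # permutation matrix encoding the relabelling
A_relabelled = P @ A @ P.T               # the same graph with new node labels

def f(adj):
    # a toy permutation-invariant descriptor: the sorted degree sequence
    return np.sort(adj.sum(axis=1))

assert np.array_equal(f(A), f(A_relabelled))
```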
Another example is meshes or manifolds (chapter 9), which may be con-
sidered equivalent if there exists a distance-preserving map (i.e. we consider
isometries to be isomorphisms); see Figure 4.5. Again, here we want two
meshes related by an isometry to be treated the same way; we want the method
to be intrinsic. We shall see how to build invariant representations for these
type of surfaces in Chapter 9.
4.2 Invariant Function Classes
At this point, it is useful to recall our learning setup from Chapter 2, and the
symmetry priors from Chapter 3 that presumably alleviate the learning task.
We have an unknown function $f^\ast : X(\Omega) \to Y$ that we wish to learn using a
hypothesis class F. Moreover, we also have a symmetry group G acting on
X, and the promise that $f^\ast$ is G-invariant: $f^\ast(g.x) = f^\ast(x)$ for all $x \in X$ and all
$g \in G$. What is the easiest way to combine these objects?
Figure 4.4
Graph isomorphisms. Above: representing a graph by integer node labels and edges
(isomorphism = relabeling), and below: representing a graph as an adjacency matrix
(isomorphism = permutation, $A \mapsto PAP^\top$).
Assume for simplicity that the group G is compact, meaning that it admits a
Haar measure⁴ µ(g). This allows us to define a group averaging operator
$$S_G f(x) := \frac{1}{\mu(G)} \int_G f(g.x)\, \mu(dg). \qquad (4.43)$$
In words, the group averaging operator averages any function $f \in F$ over the
group orbits $\{g.x;\, g \in G\}$; see Figure 4.6. Observe that, since we are assuming
that $f^\ast$ is G-invariant, it holds that $S_G f^\ast = f^\ast$. Since our target function is constant
along orbits, it seems like a good idea to enforce our hypothesis also to be
constant along orbits, and hence to replace $f \in F$ by its smoothed version $S_G f$.
In other words, we can now turn our hypothesis space F into a G-invariant
space via the averaging operator: $S_G F := \{S_G f;\, f \in F\}$.
Figure 4.5
The hand shape is both the domain and the input to the learning system.
Figure 4.6
Illustration of the orbits of the rotation group, as it acts on images via the standard
representation. The group averaging operator $S_G$ averages a hypothesis f(x) over each
orbit.
Example 1. Consider $\Omega = \{1, \dots, N\}$, the discrete set of N elements, and $G = S_N$,
the symmetric group on N elements, which acts on $\Omega$ by permuting its
elements. We consider a hypothesis class $F \subset \{f : \mathbb{R}^N \to \mathbb{R}\}$ given by real polynomials
$f(x) = p(x_1, \dots, x_N)$ of degree up to k. The associated averaged class
$S_G F = \{S_G p(x);\, p \in F\}$ corresponds in this case to the space of symmetric
polynomials of degree k. It is generated by the so-called ring of power-sum
polynomials $\{q_j(x_1, \dots, x_N) := \sum_{i=1}^N x_i^j\}$.
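A direct numerical rendering of this example (illustrative only; the particular polynomial, the small N, and the brute-force enumeration of $S_N$ are choices made for the sketch) is to average a non-symmetric polynomial over all permutations and check that the result is a symmetric function, while an already symmetric power-sum polynomial is left unchanged.

```python
import itertools
import numpy as np

N = 4
perms = [list(p) for p in itertools.permutations(range(N))]

def S_G(f):
    # (S_G f)(x) = (1/|G|) * sum over g in S_N of f(g.x)
    return lambda x: np.mean([f(x[p]) for p in perms])

f = lambda x: x[0] ** 2 + 3.0 * x[1]      # a non-symmetric polynomial hypothesis
q2 = lambda x: np.sum(x ** 2)             # power-sum polynomial, already symmetric

x = np.random.randn(N)
g = np.random.permutation(N)

assert np.isclose(S_G(f)(x), S_G(f)(x[g]))    # S_G f is permutation-invariant
assert np.isclose(S_G(q2)(x), q2(x))          # S_G leaves invariant functions unchanged
```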
Now let us formalize the intuition that replacing F with $S_G F$ is beneficial
for learning purposes, focusing on a regression problem $R(f) = \mathbb{E}_{x \sim \nu} |f(x) - f^\ast(x)|^2 := \|f - f^\ast\|_\nu^2$,
where ν is the data distribution. We assume that the data
distribution is compatible with the symmetry structure from G: if we denote by
$[x] = \{g.x;\, g \in G\}$ the orbit associated with x, then the distribution of x, conditional
on [x], is the uniform measure on the orbit, precisely the Haar measure
of the group. In other words, the data distribution is unchanged if we replace x
by g.x for any $g \in G$.
Recall the decomposition of error from Chapter 2. Let us first address the
approximation error. We decompose
$$
\begin{aligned}
\|f - f^\ast\|^2 &= \|f - S_G f + S_G f - f^\ast\|^2 \\
&= \|f - S_G f\|^2 + \|S_G f - f^\ast\|^2 + 2\,\langle f - S_G f,\, S_G f - f^\ast\rangle \\
&= \|f - S_G f\|^2 + \|S_G f - f^\ast\|^2 + 2\,\mathbb{E}_{[x]}\mathbb{E}_{x|[x]}\big[(f(x) - S_G f(x))(S_G f(x) - f^\ast(x))\big] \\
&= \|f - S_G f\|^2 + \|S_G f - f^\ast\|^2 + 2\,\mathbb{E}_{[x]}\big[(S_G f(x) - S_G f(x))(S_G f(x) - f^\ast(x))\big] \\
&= \|f - S_G f\|^2 + \|S_G f - f^\ast\|^2,
\end{aligned}
$$
where the fourth equality holds because $S_G f$ and $f^\ast$ are constant along orbits
and x conditioned on its orbit follows the Haar measure, so that $\mathbb{E}_{x|[x]}[f(x)] = S_G f(x)$
and the cross-term vanishes. In words, the group averaging is an orthogonal
projection in $L^2(X, \nu)$. Moreover, we immediately deduce that
$\|f - f^\ast\|^2 \ge \|S_G f - f^\ast\|^2$ for any $f \in F$, and therefore
that the approximation error of $S_G F$ is upper bounded by that of F. Additionally,
we won't formalize it here, but it is not hard to see that, since group
averaging is a projection, the statistical error is reduced by averaging. We
thus have a hypothesis class $S_G F$ for which both the approximation and the
statistical errors are upper bounded by those of F.
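The orthogonality argument can also be checked empirically. The Monte-Carlo sketch below (an illustrative toy setup: the target, the hypothesis, the symmetric group on three coordinates, and the i.i.d. Gaussian data are all arbitrary choices) estimates both sides of $\|f - f^\ast\|^2 = \|f - S_G f\|^2 + \|S_G f - f^\ast\|^2$ under a permutation-invariant data distribution.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
N = 3
perms = [list(p) for p in itertools.permutations(range(N))]

f_star = lambda x: np.sum(x ** 2)            # a G-invariant target
f = lambda x: x[0] ** 2 + 2.0 * x[1]         # an arbitrary, non-invariant hypothesis
S_G_f = lambda x: np.mean([f(x[p]) for p in perms])

X = rng.standard_normal((50_000, N))         # i.i.d. coordinates: an exchangeable distribution

lhs = np.mean([(f(x) - f_star(x)) ** 2 for x in X])
rhs = np.mean([(f(x) - S_G_f(x)) ** 2 for x in X]) + \
      np.mean([(S_G_f(x) - f_star(x)) ** 2 for x in X])
print(lhs, rhs)                              # the two agree up to Monte-Carlo error
```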
This sounds like great news indeed, reaffirming the advantage of incorporating
known symmetries of the target function into our hypothesis class. However,
there are (at least) two important issues at this point. First, the group averaging
operator defines a convenient mathematical object, but it requires an averaging
over group orbits, which can quickly become prohibitively expensive
for sufficiently large groups. Next, we have argued that statistical errors will
be smaller in the symmetric hypothesis space, but we haven't actually quantified
by how much. As it turns out, the gains can be roughly quantified as
follows: given a dataset of n points, symmetric learning amounts to having
instead $|G| \cdot n$ data points whenever G is discrete. In the continuous setting, the
gains amount to using [x] instead of $x \in \mathbb{R}^{|\Omega|}$, since [x] lives on a space of lower
dimensionality $|\Omega| - \dim(G)$ than x. While this may be a substantial gain, it does
not unfortunately break the curse of dimensionality. Indeed, in the example of
images, the Euclidean group in $\mathbb{R}^2$ has at most 6 degrees of freedom, a tiny
number against the thousands or millions of pixels!
We thus need to address these two major aspects with additional ideas. For
that purpose, we will now introduce two important concepts that extend the
notion of symmetric representation and enable efficient learning in the face of
high-dimensional data. We first discuss stability to general domain transformations
that are ‘near’-symmetries, and then introduce the concept of scale
separation, which allows us to decompose the learning task across different
resolutions by exploiting local interactions.

Figure 4.7
Two objects moving at different velocities in a video define a transformation outside
the translation group.
4.3 Deformation Stability
The symmetry formalism introduced so far captures an idealised world where
we know exactly which transformations are to be considered as symmetries,
and we want to respect these symmetries exactly. For instance in com-
puter vision, we might assume that planar translations are exact symmetries.
However, the real world is noisy and this model falls short.
First, while these simple groups provide a way to understand global symmetries
of the domain (and by extension, of signals on it, $X(\Omega)$), they do
not capture local symmetries well. For instance, consider a video scene with
several objects, each moving in its own direction. In subsequent
frames, the resulting scene will contain approximately the same semantic infor-
mation, yet no global translation explains the transformation from one frame
to another. Next, in other cases, such as a deformable 3D object viewed by
a camera, it is simply very hard to describe the group of transformations that
preserve the object identity.
These examples illustrate that in reality we are more interested in a far larger
set of transformations where global, exact invariance is replaced by a local,
inexact one. In our discussion, we will distinguish between two scenarios:
the setting where the domain is fixed, and signals x X() are undergoing
deformations, and the setting where the domain itself may be deformed.
Figure 4.8
A warping of the domain that is not explained by any rigid transformation nearly pre-
serves the semantics.
4.3.1 Stability to signal deformations
In many applications, we know a priori that a small deformation of the signal
x should not change the output of f(x), so it is tempting to consider such deformations
as symmetries. For instance, we could view small diffeomorphisms
$\tau \in \mathrm{Diff}(\Omega)$, or even small bijections, as symmetries. However, small deformations
can be composed to form large deformations, so “small deformations”
do not form a group⁵, and we cannot ask for invariance or equivariance to
small deformations only. Since large deformations can actually materially
change the semantic content of the input, it is not a good idea to use the full
group $\mathrm{Diff}(\Omega)$ as symmetry group either.
A better approach is to quantify how “far” a given $\tau \in \mathrm{Diff}(\Omega)$ is from a
given symmetry subgroup $G \subset \mathrm{Diff}(\Omega)$ (e.g. translations) with a complexity
measure $c(\tau)$, so that $c(\tau) = 0$ whenever $\tau \in G$. We can now replace our previous
definition of exact invariance and equivariance under group actions with a
‘softer’ notion of deformation stability (or approximate invariance):
$$\|f(\rho(\tau)x) - f(x)\| \le C\, c(\tau)\, \|x\| \quad \text{for all } x \in X(\Omega),\ \tau \in \mathrm{Diff}(\Omega), \qquad (4.44)$$
where $(\rho(\tau)x)(u) = x(\tau^{-1}u)$ as before, and where C is some constant independent
of the signal x. A function $f \in F(X(\Omega))$ satisfying the above equation is said
to be geometrically stable. We will see examples of such functions in Section 4.4.
Since $c(\tau) = 0$ for $\tau \in G$, this definition generalises the G-invariance property
defined above. Its utility in applications depends on introducing an appropriate
deformation cost. In the case of images defined over a continuous Euclidean
plane, a popular choice is $c^2(\tau) := \int_\Omega \|\nabla\tau(u)\|^2\, du$, which measures the ‘elasticity’
of τ, i.e., how different it is from the displacement by a constant vector
field. This deformation cost is in fact a norm often called the Dirichlet energy,
and can be used to quantify how far τ is from the translation group.
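As a rough illustration (an ad hoc finite-difference sketch with arbitrary grid size and fields, not a construction from the text), one can discretize this elasticity cost for a displacement field sampled on a grid: a constant displacement, i.e. a pure translation, has essentially zero cost, while a non-rigid warp does not.

```python
import numpy as np

H, W = 32, 32
yy, xx = np.meshgrid(np.linspace(0, 1, H), np.linspace(0, 1, W), indexing="ij")

def dirichlet_energy(disp):
    # mean squared finite-difference gradient of the displacement field (H, W, 2),
    # a discrete stand-in for the integral of ||grad tau(u)||^2 over the image plane
    grads = np.gradient(disp, axis=(0, 1))
    return sum((g ** 2).sum() for g in grads) / (H * W)

translation = np.stack([np.full((H, W), 0.1), np.full((H, W), -0.2)], axis=-1)
warp = np.stack([0.05 * np.sin(2 * np.pi * xx), 0.05 * np.cos(2 * np.pi * yy)], axis=-1)

print(dirichlet_energy(translation))   # ~0: a constant displacement field is a pure shift
print(dirichlet_energy(warp))          # > 0: an elastic warp has non-zero deformation cost
```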
Figure 4.9
The set of all bijective mappings from $\Omega$ into itself forms the (set) automorphism group
$\mathrm{Aut}(\Omega)$, of which a symmetry group G (shown as a circle) is a subgroup. Geometric stability
extends the notion of G-invariance and equivariance to transformations ‘around’
G (shown as a gray ring), quantified in the sense of some metric between transformations.
In this example, a smooth distortion of the image is close to a shift.
4.3.2 Stability to domain deformations
In many applications, the object being deformed is not the signal, but the geometric
domain itself. Canonical instances of this are applications dealing
with graphs and manifolds: a graph can model a social network at different
instants of time, containing slightly different social relations (follow graph),
or a manifold can model a 3D object undergoing non-rigid deformations. This
deformation can be quantified as follows. If D denotes the space of all possible
variable domains (such as the space of all graphs, or the space of Riemannian
manifolds), one can define for $\Omega, \tilde{\Omega} \in D$ an appropriate metric (‘distance’)
$d(\Omega, \tilde{\Omega})$ satisfying $d(\Omega, \tilde{\Omega}) = 0$ if $\Omega$ and $\tilde{\Omega}$ are equivalent in some sense: for
example, the graph edit distance vanishes when the graphs are isomorphic,
and the Gromov-Hausdorff distance between Riemannian manifolds equipped
with geodesic distances vanishes when two manifolds are isometric⁶.
A common construction of such distances between domains relies on some
family of invertible mappings $\eta : \Omega \to \tilde{\Omega}$ that try to ‘align’ the domains in a
way that the corresponding structures are best preserved. For example, in the
case of graphs or Riemannian manifolds (regarded as metric spaces with the
geodesic distance), this alignment can compare pair-wise adjacency or distance
structures (d and $\tilde{d}$, respectively),
$$d_D(\Omega, \tilde{\Omega}) = \inf_{\eta \in G} \|d - \tilde{d} \circ (\eta \times \eta)\|,$$
where G is the group of isomorphisms such as bijections or isometries, and
the norm is defined over the product space $\Omega \times \Omega$. In other words, a distance
between elements of $\Omega, \tilde{\Omega}$ is ‘lifted’ to a distance between the domains themselves,
by accounting for all the possible alignments that preserve the internal
structure⁷. Given a signal $x \in X(\Omega)$ and a deformed domain $\tilde{\Omega}$, one can then
consider the deformed signal $\tilde{x} = x \circ \eta^{-1} \in X(\tilde{\Omega})$.
By slightly abusing the notation, we define $X(D) = \{(X(\Omega), \Omega) : \Omega \in D\}$
as the ensemble of possible input signals defined over a varying domain. A
function $f : X(D) \to Y$ is stable to domain deformations if
$$\|f(x, \Omega) - f(\tilde{x}, \tilde{\Omega})\| \le C\, \|x\|\, d_D(\Omega, \tilde{\Omega}) \qquad (4.45)$$
for all $\Omega, \tilde{\Omega} \in D$, and $x \in X(\Omega)$. We will discuss this notion of stability in the
context of manifolds in Chapter 9, where isometric deformations play a crucial
role. Furthermore, it can be shown that stability to domain deformations is
a natural generalisation of stability to signal deformations, by viewing the
latter in terms of deformations of the volume form (Gama, Ribeiro, and Bruna
2019).
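To make the alignment construction above tangible, the brute-force sketch below (a toy example on tiny graphs; the adjacency-matrix comparison and the Frobenius norm are illustrative choices, not the formal graph edit distance) lifts a comparison of adjacency structures to a distance between domains by minimising over all node bijections: the distance vanishes exactly for isomorphic graphs.

```python
import itertools
import numpy as np

A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)       # a 4-cycle

relabel = np.eye(4)[[2, 0, 3, 1]]
A_iso = relabel @ A @ relabel.T                 # the same graph with relabelled nodes
A_chord = A.copy()
A_chord[0, 2] = A_chord[2, 0] = 1.0             # a 4-cycle with one extra chord

def aligned_distance(A, B):
    # d(Omega, Omega~) = min over bijections eta of the mismatch between the
    # two adjacency structures, via exhaustive search over permutations
    n = A.shape[0]
    best = np.inf
    for p in itertools.permutations(range(n)):
        P = np.eye(n)[list(p)]
        best = min(best, np.linalg.norm(A - P @ B @ P.T))
    return best

print(aligned_distance(A, A_iso))    # 0.0: isomorphic domains are at distance zero
print(aligned_distance(A, A_chord))  # > 0: the extra edge cannot be aligned away
```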
4.4 Scale Separation
While deformation stability substantially strengthens the global symmetry pri-
ors, it is not sufficient in itself to overcome the curse of dimensionality, in the
sense that, informally speaking, there are still “too many” functions that respect
(4.44) as the size of the domain grows. A key insight to overcome this curse
is to exploit the multiscale structure of physical tasks. Let us first provide the
essential idea before formalizing it with tools from harmonic analysis.
By viewing the input features $x_1, x_2, \dots, x_N$, with $x_i = x(u_i) \in C$, of a learning task
as ‘particles’, one can interpret a learning system as building an ‘energy’ function
or Hamiltonian $H(x_1, \dots, x_N)$ that computes the likelihood of a certain
attribute; e.g. a cat detector would be associated with a ‘cat’ function H so that
H(x) is large whenever the image x contains a cat. The simplest Hamiltonians
in physics are those which are separable: $H(x) = \sum_i h(x_i)$. In our example,
that would correspond to a cat detector that simply aggregates individual pixel
features, irrespective of their location. Since h is a low-dimensional function,
learning such separable functionals poses no statistical challenge; however,
this model, called a mean-field model in physics, disregards any form of
interaction between features/particles, and as such is not able to capture many
properties of interest.
It thus seems like a good idea to consider a broader class of energy functions
that account for interactions. While generic N-particle interactions define
function spaces with the unavoidable curse of dimensionality, a natural idea is
to focus only on dominant interactions. As it turns out, many physical systems
are dominated by local interactions, meaning that ‘nearby’ particles provide
most of the information. In other words, the energy model now becomes
$H(x) = \sum_i h(\{x_j;\, i \sim j\})$, where $i \sim j$ informally denotes that particles i and j
are neighbors. Luckily, these neighborhoods are often of a size depending only
on the dimension of the physical domain $\Omega$, and thus much smaller than N. Back to our cat
detector example, such an Ansatz would focus on capturing interactions between
neighboring pixels, which makes sense, since shapes and textures forming
complex visual concepts are defined by notions of geometry, and natural
images are known to have a power-law correlation structure (van der Schaaf
and van Hateren 1996). The idea of focusing on local interactions has been
present in ML since at least the work of Geman and Geman (1984) and their
seminal Markov Random Field models.
We now turn to formalize this powerful inductive bias using multiscale
representations. Before describing multiscale representations, we need to intro-
duce the main elements of Fourier transforms, which rely on frequency rather
than scale.
4.4.1 Fourier Transform and Global invariants
Arguably the most famous signal decomposition is the Fourier transform,
the cornerstone of harmonic analysis. The classical one-dimensional Fourier
transform
$$\hat{x}(\xi) = \int_{-\infty}^{+\infty} x(u)\, e^{-i\xi u}\, du$$
expresses the function $x(u) \in L^2(\Omega)$ on the domain $\Omega = \mathbb{R}$ as a linear combination
of orthogonal oscillating basis functions $\varphi_\xi(u) = e^{i\xi u}$, indexed by their rate
of oscillation (or frequency) ξ. Such an organisation into frequencies reveals
important information about the signal, e.g. its smoothness and localisation.
The Fourier basis itself has a deep geometric foundation and can be interpreted
as the natural vibrations of the domain, related to its geometric structure (see
e.g. Berger (2012)).
Figure 4.10
Left: Fourier basis functions have global support. As a result, local signals produce
energy across all frequencies. Right: Wavelets $\psi_{u,\xi}(v) = \psi\!\left(\frac{v-u}{\xi}\right)$ at different resolutions
ξ and positions u. Contrary to the Fourier atoms, wavelets are localized both in space
and frequency.
The Fourier transform⁸ plays a crucial role in signal processing as it offers
a dual formulation of convolution,
$$(x \star \theta)(u) = \int_{-\infty}^{+\infty} x(v)\, \theta(u - v)\, dv,$$
a standard model of linear signal filtering (here and in the following, x denotes
the signal and θ the filter). As we will show in the following, the convolution
operator is diagonalised in the Fourier basis, making it possible to express
convolution as the product of the respective Fourier transforms,
$$\widehat{(x \star \theta)}(\xi) = \hat{x}(\xi) \cdot \hat{\theta}(\xi),$$
a fact known in signal processing as the Convolution Theorem. We have stated
the Fourier transform in the one-dimensional case for simplicity of exposition,
but the reader should rest assured that all these important correspondences
remain valid in full generality.
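The discrete, periodic analogue of the Convolution Theorem can be verified in a few lines; the sketch below (illustrative, with an arbitrary random signal and filter) compares a circular convolution computed directly in the spatial domain against the inverse FFT of the pointwise product of the two spectra.

```python
import numpy as np

N = 64
x = np.random.randn(N)        # signal
theta = np.random.randn(N)    # filter

# circular convolution computed directly in the spatial domain ...
conv = np.array([sum(x[v] * theta[(u - v) % N] for v in range(N)) for u in range(N)])

# ... and via the pointwise product of discrete Fourier transforms
conv_fft = np.fft.ifft(np.fft.fft(x) * np.fft.fft(theta)).real

assert np.allclose(conv, conv_fft)
```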
As it turns out, many fundamental differential operators such as the Lapla-
cian are described as convolutions on Euclidean domains. Since such differ-
ential operators can be defined intrinsically over very general geometries, this
provides a formal procedure to extend Fourier transforms beyond Euclidean
domains, including graphs, groups and manifolds. We will discuss this in detail
in Chapters 7 and 9.
An essential aspect of Fourier transforms is that they reveal global properties
of the signal and the domain, such as smoothness or conductance. Such global
behavior is convenient in the presence of global symmetries of the domain such
as translation, but not to study more general diffeomorphisms. This requires
a representation that trades off spatial and frequential localisation, as we see
next.
4.4.2 Multiscale representations
The notion of local invariance can be articulated by switching from a Fourier
frequency-based representation to a scale-based representation, the cornerstone
of multi-scale decomposition methods such as wavelets⁹. The essential
insight of multi-scale methods is to decompose functions defined over the
domain $\Omega$ into elementary functions that are localised both in space and
frequency. Contrary to Fourier, wavelet atoms are localised and multi-scale,
allowing us to capture fine details of the signal with atoms having small spatial
support, and coarse details with atoms having large spatial support. The term
atom here is synonymous with ‘basis element’ in Fourier analysis, with the
caveat that wavelets are redundant (over-complete). In the case of wavelets,
this is achieved by correlating a translated and dilated filter (mother wavelet)
ψ, producing a combined spatio-frequency representation called a continuous
wavelet transform
$$(W_\psi x)(u, \xi) = \xi^{-1/2} \int_{-\infty}^{+\infty} \psi\!\left(\frac{v - u}{\xi}\right) x(v)\, dv.$$
The translated and dilated filters are called wavelet atoms; their spatial position
and dilation correspond to the coordinates u and ξ of the wavelet transform.
These coordinates are usually sampled dyadically ($\xi = 2^j$ and $u = 2^j k$), with j
referred to as scale. Multi-scale signal representations bring important benefits
in terms of capturing regularity properties beyond global smoothness, such as
piece-wise smoothness, which made them a popular tool in signal and image
processing and numerical analysis in the 90s.
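The simplest concrete instance of such a decomposition is one level of the Haar transform, sketched below on a random 1D signal (purely illustrative; it is a discrete, orthogonal relative of the continuous wavelet transform above, not the same object): local averages capture the coarse scale, local differences capture the localised details, and together they reconstruct the signal exactly.

```python
import numpy as np

x = np.random.randn(64)

approx = (x[0::2] + x[1::2]) / np.sqrt(2)    # coarse scale: localised averages
detail = (x[0::2] - x[1::2]) / np.sqrt(2)    # fine scale: localised differences

# exact reconstruction: separating scales loses no information
x_rec = np.empty_like(x)
x_rec[0::2] = (approx + detail) / np.sqrt(2)
x_rec[1::2] = (approx - detail) / np.sqrt(2)
assert np.allclose(x, x_rec)
```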
4.4.3 Deformation stability of Multiscale representations
The benefit of multiscale localised wavelet decompositions over Fourier
decompositions is revealed when considering the effect of small deformations
‘nearby’ the underlying symmetry group. Let us illustrate this important con-
cept in the Euclidean domain and the translation group. Since the Fourier
representation diagonalises the shift operator (which can be thought of as
convolution, as we will see in more detail in Chapter 7), it is an efficient
representation for translation transformations. However, Fourier decomposi-
tions are unstable under high-frequency deformations. In contrast, wavelet
decompositions offer a stable representation in such cases.
Figure 4.11
Multiscale decomposition of an input image using wavelets: each scale decomposes the
existing coarse-grained image into an even coarser version, and the details needed to
reconstruct it.

Indeed, let us consider $\tau \in \mathrm{Aut}(\Omega)$ and its associated linear representation
ρ(τ). When $\tau(u) = u - v$ is a shift, as we will verify in Chapter 7, the operator
$\rho(\tau) = S_v$ is a shift operator that commutes with convolution. Since convolution
operators are diagonalised by the Fourier transform, the action of a shift in
the frequency domain amounts to shifting the complex phase of the Fourier
transform,
$$\widehat{(S_v x)}(\xi) = e^{-i\xi v}\, \hat{x}(\xi).$$
Thus, the Fourier modulus $f(x) = |\hat{x}|$, removing the complex phase, is a simple
shift-invariant function: $f(S_v x) = f(x)$. However, if we have only an approximate
translation, $\tau(u) = u - \tilde{\tau}(u)$ with $\|\nabla\tilde{\tau}\|_\infty = \sup_u \|\nabla\tilde{\tau}(u)\| \le \epsilon$, the situation is
entirely different: it is possible to show that
$$\frac{\|f(\rho(\tau)x) - f(x)\|}{\|x\|} = O(1)$$
irrespective of how small ϵ is (i.e., how close τ is to being a shift). Consequently,
such a Fourier representation is unstable under deformations, however
small. This instability manifests itself in general domains and non-rigid transformations;
we will see another instance of it in the analysis of 3D
shapes using the natural extension of Fourier transforms described in Chapter
9.
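This contrast is easy to reproduce numerically. In the toy sketch below (illustrative; the frequency, the shift, and the dilation factor are arbitrary), the Fourier modulus is exactly invariant to a circular shift, but a slight dilation of the axis, a deformation with small gradient ε, moves the spectral line of a high-frequency signal and changes the modulus by an O(1) amount.

```python
import numpy as np

N = 256
t = np.arange(N) / N
x = np.cos(2 * np.pi * 60 * t)                # a high-frequency signal

modulus = lambda s: np.abs(np.fft.fft(s))     # the representation f(x) = |x_hat|

# exact shift invariance: f(S_v x) = f(x)
print(np.linalg.norm(modulus(np.roll(x, 17)) - modulus(x)))     # ~0 (float error only)

# a slight dilation of the axis, x(u) -> x((1 + eps) u): the spectral peak moves
# from frequency 60 to 60 (1 + eps), so the modulus changes by O(1), not O(eps)
eps = 0.05
x_warped = np.cos(2 * np.pi * 60 * (1 + eps) * t)
print(np.linalg.norm(modulus(x_warped) - modulus(x)) / np.linalg.norm(modulus(x)))  # ~1
```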
Wavelets offer a remedy to this problem that also reveals the power of multi-scale
representations. In the above example, we can show (Mallat 2012) that
the wavelet decomposition $W_\psi x$ is approximately equivariant to deformations,¹⁰
$$\frac{\|\rho(\tau)(W_\psi x) - W_\psi(\rho(\tau)x)\|}{\|x\|} = O(\epsilon).$$
In other words, decomposing the signal information into scales using localised
filters rather than frequencies turns a globally unstable representation into a
family of locally stable features. Importantly, such measurements at different
scales are not yet invariant, and need to be progressively processed towards the
low frequencies, hinting at the deep compositional nature of modern neural networks,
which is captured in our Blueprint for Geometric Deep Learning, presented
next.
4.4.4 Scale Separation Prior
We can build on this insight by considering a multiscale coarsening of the
data domain $\Omega$ into a hierarchy $\Omega_1, \dots, \Omega_J$. As it turns out, such coarsening
can be defined on very general domains, including grids, graphs, and manifolds.
Informally, a coarsening assimilates nearby points $u, u' \in \Omega$ together,
and thus only requires an appropriate notion of metric in the domain. If
$X_j(\Omega_j, C_j) := \{x_j : \Omega_j \to C_j\}$ denotes signals defined over the coarsened domain
$\Omega_j$, we informally say that a function $f : X(\Omega) \to Y$ is locally stable at scale j
if it admits a factorisation of the form $f \approx f_j \circ P_j$, where $P_j : X(\Omega) \to X_j(\Omega_j)$ is a
non-linear coarse graining and $f_j : X_j(\Omega_j) \to Y$. In other words, while the target
function f might depend on complex long-range interactions between features
over the whole domain, in locally-stable functions it is possible to separate
the interactions across scales, by first focusing on localised interactions that
are then propagated towards the coarse scales. Figures 4.12 and 4.13 illustrate
simple examples of this scale factorization.
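As a toy rendering of the factorisation $f \approx f_j \circ P_j$ (an illustrative 1D sketch; the signal, the smooth template, and the pooling factor are arbitrary choices), the target below, a correlation with a coarse-scale template, is computed both at full resolution and after an average-pooling coarse-graining, and the two values nearly coincide: the fine-scale details were not needed.

```python
import numpy as np

def pool(x, s):
    # average pooling P_j : X(Omega) -> X(Omega_j), coarsening by a factor s
    return x.reshape(-1, s).mean(axis=1)

N, s = 1024, 16
t = np.arange(N) / N
template = np.exp(-((t - 0.30) ** 2) / 0.02)          # smooth, coarse-scale template
x = np.exp(-((t - 0.32) ** 2) / 0.03) + 0.01 * np.random.randn(N)

f_fine = np.dot(x, template) / N                      # target evaluated on Omega
f_coarse = np.dot(pool(x, s), pool(template, s)) / (N // s)   # same target after coarse-graining

print(f_fine, f_coarse)                               # nearly identical values
```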
Such principles¹¹ are of fundamental importance in many areas of physics
and mathematics, as manifested for instance in statistical physics in the so-
called renormalisation group, or leveraged in important numerical algorithms
such as the Fast Multipole Method. In machine learning, multiscale repre-
sentations and local invariance are the fundamental mathematical principles
underpinning the efficiency of Convolutional Neural Networks and Graph
Neural Networks and are typically implemented in the form of local pooling. In
future work, we will further develop tools from computational harmonic anal-
ysis that unify these principles across our geometric domains and will shed
light onto the statistical learning benefits of scale separation.
Figure 4.12
Illustration of Scale Separation for image classification tasks. The classifier f′ defined
on signals on the coarse grid X(Ω′) should satisfy $f \approx f' \circ P$, where $P : X(\Omega) \to X(\Omega')$.
Here, the coarse scales dominate: to discriminate between these two classes, one can
coarse-grain the domain and drop all the details below a certain resolution.
Figure 4.13
Illustration of Scale Separation for image classification tasks. The classifier f′ defined
on signals on the coarse grid X(Ω′) should satisfy $f \approx f' \circ P$, where $P : X(\Omega) \to X(\Omega')$.
Here, the fine scales dominate: to discriminate between these two classes, it is sufficient
to focus on local statistics and then combine them with a simple averaging.
4.5 The GDL Blueprint
The geometric principles of Symmetry, Geometric Stability, and Scale Sep-
aration discussed so far can be combined to provide a universal blueprint for
learning stable representations of high-dimensional data. These representations
will be produced by functions f operating on signals $X(\Omega, C)$ defined on the
domain $\Omega$, which is endowed with a symmetry group G.
The geometric priors we have described so far do not prescribe a specific
architecture for building such representation, but rather a series of necessary
conditions. However, they hint at an axiomatic construction that provably sat-
isfies these geometric priors, while ensuring a highly expressive representation
that can approximate any target function satisfying such priors.
4.5.1 Invariant and Equivariant Layers
A simple initial observation is that, in order to obtain a highly expressive representation,
we are required to introduce a non-linear element, since if f is linear
and G-invariant, then for all $x \in X(\Omega)$,¹²
$$f(x) = \frac{1}{\mu(G)} \int_G f(g.x)\, d\mu(g) = f\!\left(\frac{1}{\mu(G)} \int_G (g.x)\, d\mu(g)\right),$$
which indicates that f only depends on x through the G-average
$Ax = \frac{1}{\mu(G)} \int_G (g.x)\, d\mu(g)$. In the case of images and translation, this would entail using
only the average RGB color of the input!
While this reasoning shows that the family of linear invariants is not a very
rich object, the family of linear equivariants provides a much more powerful
tool, since it enables the construction of rich and stable features by composition
with appropriate non-linear maps, as we will now explain. Indeed,
if $B : X(\Omega, C) \to X(\Omega, C')$ is G-equivariant, satisfying $B(g.x) = g.B(x)$ for all
$x \in X$ and $g \in G$, and $\sigma : C' \to C''$ is an arbitrary (non-linear) map, then we
easily verify that the composition $U := (\sigma \circ B) : X(\Omega, C) \to X(\Omega, C'')$ is also G-equivariant,
where $\sigma : X(\Omega, C') \to X(\Omega, C'')$ is the element-wise instantiation
of σ given as $(\sigma(x))(u) := \sigma(x(u))$.
This simple property allows us to define a very general family of G-invariants,
by composing U with the group average, $A \circ U : X(\Omega, C) \to C''$.
A natural question is thus whether any G-invariant function can be approximated
at arbitrary precision by such a model, for appropriate choices of B
and σ. It is not hard to adapt the standard Universal Approximation Theorems
from unstructured vector inputs to show that shallow ‘geometric’ networks are
also universal approximators, by properly generalising the group average to
a general non-linear invariant¹³. However, as already described in the case of
Fourier versus wavelet invariants, there is a fundamental tension between shallow
global invariance and deformation stability. This motivates an alternative
representation, which considers instead localised equivariant maps¹⁴.
Assuming that $\Omega$ is further equipped with a distance metric d, we call an
equivariant map U localised if $(Ux)(u)$ depends only on the values of $x(v)$ for
$v \in N_u = \{v : d(u, v) \le r\}$, for some small radius r; the latter set $N_u$ is called the
receptive field.
Figure 4.14
Geometric Deep Learning blueprint, exemplified on a graph. A typical Graph Neural
Network architecture may contain permutation equivariant layers (computing node-
wise features), local pooling (graph coarsening), and a permutation-invariant global
pooling layer (readout layer).
A single layer of a local equivariant map U cannot approximate functions with
long-range interactions, but a composition of several local equivariant maps
$U_J \circ U_{J-1} \circ \cdots \circ U_1$ increases the receptive field¹⁵ while preserving the stability
properties of local equivariants. The receptive field is further increased by
interleaving downsampling operators that coarsen the domain (again assuming
a metric structure), completing the parallel with Multiresolution Analysis
(MRA, see e.g. Mallat (1999)).
In summary, the geometry of the input domain, with knowledge of an
underlying symmetry group, provides three key building blocks: (i) a local
equivariant map, (ii) a global invariant map, and (iii) a coarsening operator.
These building blocks provide a rich function approximation space with prescribed
invariance and stability properties by combining them together in a
scheme we refer to as the Geometric Deep Learning Blueprint (Figure 4.14).
Geometric Deep Learning Blueprint

Let $\Omega$ and $\Omega'$ be domains, G a symmetry group over $\Omega$, and write $\Omega' \subseteq \Omega$
if $\Omega'$ can be considered a compact version of $\Omega$.
We define the following building blocks:

• Linear G-equivariant layer $B : X(\Omega, C) \to X(\Omega', C')$ satisfying
$B(g.x) = g.B(x)$ for all $g \in G$ and $x \in X(\Omega, C)$.

• Nonlinearity $\sigma : C \to C'$ applied element-wise as $(\sigma(x))(u) = \sigma(x(u))$.

• Local pooling (coarsening) $P : X(\Omega, C) \to X(\Omega', C)$, such that $\Omega' \subseteq \Omega$.

• G-invariant layer (global pooling) $A : X(\Omega, C) \to Y$ satisfying
$A(g.x) = A(x)$ for all $g \in G$ and $x \in X(\Omega, C)$.

Using these blocks allows constructing G-invariant functions $f : X(\Omega, C) \to Y$
of the form
$$f = A \circ \sigma_J \circ B_J \circ P_{J-1} \circ \dots \circ P_1 \circ \sigma_1 \circ B_1,$$
where the blocks are selected such that the output space of each block
matches the input space of the next one. Different blocks may exploit
different choices of symmetry groups G.
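A minimal instantiation of this blueprint, in the spirit of a Deep Sets-style model for the symmetric group acting on a set of feature vectors, is sketched below. The specific equivariant parametrisation (a per-element linear map plus a mean aggregation), the layer widths, and the ReLU nonlinearity are illustrative choices, not prescribed by the blueprint.

```python
import numpy as np

rng = np.random.default_rng(0)

class EquivariantLinear:
    # B(x)_i = x_i W1 + (mean_j x_j) W2, a linear permutation-equivariant layer
    def __init__(self, c_in, c_out):
        self.W1 = 0.1 * rng.standard_normal((c_in, c_out))
        self.W2 = 0.1 * rng.standard_normal((c_in, c_out))
    def __call__(self, x):                        # x has shape (n, c_in)
        return x @ self.W1 + x.mean(axis=0, keepdims=True) @ self.W2

sigma = lambda x: np.maximum(x, 0.0)              # element-wise nonlinearity
A = lambda x: x.sum(axis=0)                       # G-invariant global pooling (readout)

B1, B2 = EquivariantLinear(3, 8), EquivariantLinear(8, 8)
f = lambda x: A(sigma(B2(sigma(B1(x)))))          # f = A . sigma . B2 . sigma . B1

x = rng.standard_normal((5, 3))                   # a set of 5 elements with 3 channels
perm = rng.permutation(5)
assert np.allclose(f(x), f(x[perm]))              # invariance to the domain symmetry
```

Adding a local pooling block P between the two equivariant layers would complete the correspondence with the composition above; it is omitted here because a plain set carries no metric to coarsen (see note 14).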
4.5.2 Different settings of Geometric Deep Learning
One can make an important distinction between the setting where the domain
is assumed to be fixed and one is only interested in varying input signals
defined on that domain, and the setting where the domain is part of the input,
as $\Omega$ varies together with the signals defined on it. A classical instance of the former
case is encountered in computer vision applications, where images are assumed
to be defined on a fixed domain (grid). Graph classification is an example of the
latter setting, where both the structure of the graph and the signal defined on it
(e.g. node features) are important. In the case of a varying domain, geometric
stability (in the sense of insensitivity to deformations of $\Omega$) plays a crucial
role in Geometric Deep Learning architectures.
This blueprint has the right level of generality to be used across a wide range
of geometric domains. Different Geometric Deep Learning methods thus differ
in their choice of the domain, symmetry group, and the specific implementation
details of the aforementioned building blocks. As we will see in the following,
a large class of deep learning architectures currently in use fall into this scheme
and can thus be derived from common geometric principles.
In the following Chapters we will describe the various geometric domains
focusing on the ‘5G’, as well as the specific implementations of Geometric
Deep Learning on these domains.
Architecture     Domain             Symmetry group G
CNN              Grid               Translation
Spherical CNN    Sphere / SO(3)     Rotation SO(3)
Mesh CNN         Manifold           Isometry Iso(Ω) / Gauge symmetry SO(2)
GNN              Graph              Permutation Σ_n
Deep Sets        Set                Permutation Σ_n
Transformer      Complete Graph     Permutation Σ_n
E(3) GNNs        Geometric Graph    Permutation Σ_n × Euclidean group E(3)
LSTM             1D Grid            Time warping
4.6 Summary
In this chapter, we have taken the ideas developed in Chapter 3 and deployed
them in learning problems with physical structure. This physical structure is
encoded in a simple domain $\Omega$ where rich symmetries can be defined, and
then extended via composition to the signal space $X(\Omega)$. In order to make the
symmetric learning principles effective, we need to augment the prior of invariance/equivariance
to include also stability to deformations, as well as the scale
separation principle. Taken together, these principles define a computational
blueprint (the GDL blueprint) that enables a rich, efficient class of invariant
and geometrically stable representations parametrized by neural networks. In
the following chapters, we will instantiate this GDL blueprint on key representative
domains and explore the specific geometric properties associated with each
of them.
Exercises
1. Count the number of symmetries of a label function. Assume finite input space and
label space.
2. Prove that if we have the full symmetry group of a label function, we only need one
label per class to learn the true function. (Maybe start with a sub-question: what are
the orbits of the full symmetry group of a label function?)
3. Show basic properties of group actions: e.g. how the inverse of a group element
yields the inverse map on $\Omega$, how the identity acts as the identity map, etc.
4. Multiplication tables: does any finite group have a multiplication table? Can any
table we write down be interpreted as defining a group? If no, list the constraints the
table needs to satisfy.
5. Uniqueness of generators. Are they unique? If not, find an alternative set of
generators for D3, different from the ones in the text (rotation, flip). Write the
multiplication table and Cayley diagram in terms of these generators.
6. If we have a group with generators, and we take a subset of generators, do we get a
subgroup?
7. If we have a group with generators and a subgroup, does the subgroup necessarily
correspond to a subset of generators?
8. If we have a group element, it can always be written as a product of generators. Is
this way of writing it unique?
Notes
1. When $\Omega$ has some additional structure, we may further restrict the kinds
of signals in $X(\Omega, C)$. For example, when $\Omega$ is a smooth manifold, we may
require the signals to be smooth. Whenever possible, we will omit the range C
for brevity.
2. When the domain $\Omega$ is discrete, µ can be chosen as the counting measure,
in which case the integral becomes a sum. In the following, we will omit the
measure and use du for brevity.
3. $\Omega$ must be a vector space in order for an expression αu + βv to make sense.
4. Think about it as the generalization of the uniform measure over the group,
so that each group element has equal ‘mass’; see Appendix B.
5. E.g., the composition of two ϵ-isometries is a 2ϵ-isometry, violating the
closure property.
6. The graph edit distance measures the minimal cost of making two graphs
isomorphic by a sequence of graph edit operations. The Gromov-Hausdorff
distance measures the smallest possible metric distortion of a correspondence
between two metric spaces; see Gromov (1981).
7. Two graphs can be aligned by the Quadratic Assignment Problem (QAP),
which considers in its simplest form two graphs $G, \tilde{G}$ of the same size n, and
solves $\min_{P \in \Sigma_n} \mathrm{trace}(A P \tilde{A} P^\top)$, where $A, \tilde{A}$ are the respective adjacency
matrices and $\Sigma_n$ is the group of n × n permutation matrices. The graph edit
distance can be associated with such a QAP (Bougleux et al. 2015).
8. In the following, we will use convolution and (cross-)correlation
$$(x \star \theta)(u) = \int_{-\infty}^{+\infty} x(v)\, \theta(u + v)\, dv$$
interchangeably, as is common in machine learning: the difference between
the two is whether the filter is reflected, and since the filter is typically
learnable, the distinction is purely notational.
9. See Mallat (1999) for a comprehensive introduction.
10. This notation implies that ρ(τ) acts on the spatial coordinate of $(W_\psi x)(u, \xi)$.
11. The Fast Multipole Method (FMM) is a numerical technique originally developed
to speed up the calculation of long-ranged forces in n-body problems.
FMM groups sources that lie close together and treats them as a single source.
12. Here, µ(g) is known as the Haar measure of the group G, and the integral
is performed over the entire group.
13. Such proofs have been demonstrated, for example, for the Deep Sets model
by Zaheer et al. (2017).
14. Meaningful metrics can be defined on grids, graphs, manifolds, and groups.
A notable exception is sets, where there is no predefined notion of metric.
15. The term ‘receptive field’ originated in the neuroscience literature, referring
to the spatial domain that affects the output of a given neuron.