1 Introduction
The last decade has witnessed an experimental revolution in data science and
machine learning, epitomised by deep learning methods. Indeed, many high-
dimensional learning tasks previously thought to be beyond reach, such as computer vision, playing Go, or protein folding, are in fact feasible with
appropriate data and computational scale. Remarkably, the essence of deep
learning is built from two simple algorithmic principles: first, the notion
of representation or feature learning, whereby adapted, often hierarchical,
features capture the appropriate notion of regularity for each task, and sec-
ond, learning by gradient descent-type optimisation, typically implemented as
backpropagation.
While learning generic functions in high dimensions is a cursed estimation
problem, most tasks of interest are not generic, and come with essential prede-
fined regularities arising from the underlying low-dimensionality and structure
of the physical world. This book is concerned with exposing these regulari-
ties through unified geometric principles that can be applied throughout a wide
spectrum of applications.
Exploiting the known symmetries of a large system is a powerful and clas-
sical remedy against the curse of dimensionality, and forms the basis of most
physical theories. Deep learning systems are no exception, and since the early
days researchers have adapted neural networks to exploit the low-dimensional
geometry arising from physical measurements, e.g. grids in images, sequences
in time-series, or position and momentum in molecules, and their associated
symmetries, such as translation or rotation. Throughout our exposition, we will
describe these models, as well as many others, as natural instances of the same
underlying principle of geometric regularity.
Geometric Deep Learning is a ‘geometric unification’ endeavour in the spirit
of Klein’s Erlangen Programme that serves a dual purpose. On one hand, it
provides a common mathematical framework to study the classical successful
neural network architectures, such as CNNs, RNNs, GNNs, and Transform-
ers. We will show that all of the above can be obtained by the choice of a
geometric domain, its symmetry group, and appropriately constructed invari-
ant and equivariant neural network layers—what we refer to as the ‘Geometric
Deep Learning blueprint’. We will exemplify instances of this blueprint on the
‘5G of Geometric Deep Learning’: Graphs, Grids, Groups, Geometric Graphs,
and Gauges. On the other hand, Geometric Deep Learning gives a construc-
tive procedure to incorporate prior physical knowledge into neural networks
and provides a principled way to build new architectures. With this premise,
we shall now explore how the ideas central to Geometric Deep Learning have
crystallised through time.
1.1 On the Shoulders of Giants
“Symmetry, as wide or as narrow as you may define its meaning, is one idea
by which man through the ages has tried to comprehend and create order,
beauty, and perfection.” This somewhat poetic definition of symmetry is given
in the eponymous book of the great mathematician Hermann Weyl (1952),
his Schwanengesang on the eve of retirement from the Institute for Advanced
Study in Princeton. Weyl traces the special place symmetry has occupied in
science and art to the ancient times, from Sumerian symmetric designs to the
Pythagoreans who believed the circle to be perfect due to its rotational sym-
metry. Plato considered the ve regular polyhedra bearing his name today so
fundamental that they must be the basic building blocks shaping the material
world.
Yet, though Plato is credited with coining the term συμμετρία, which lit-
erally translates as ‘same measure’, he used it only vaguely to convey the
beauty of proportion in art and harmony in music. It was the astronomer and
mathematician Johannes Kepler (1611) who attempted the first rigorous analysis of the symmetric shape of water crystals. In his treatise ‘On the Six-Cornered Snowflake’, he attributed the six-fold dihedral structure of snowflakes to the hexagonal packing of particles, an idea that, though it preceded the clear understanding of how matter is formed, still holds today as the basis of crystallography (Ball 2011).
In modern mathematics, symmetry is almost univocally expressed in the
language of group theory. The origins of this theory are attributed to Évariste
Galois, who coined the term and used it to study the solvability of polynomial
equations in the 1830s.
Two other names associated with group theory are
those of Sophus Lie and Felix Klein, who met and worked fruitfully together
for a period of time (Tobies 2019). The former would develop the theory of
continuous symmetries that today bears his name (Lie groups); the latter proclaimed group theory to be the organising principle of geometry in his Erlangen Programme, which we already mentioned in the Preface. Given that Klein’s Programme is the inspiration for our book, it is worthwhile to spend more time on its historical context and revolutionary impact.

Figure 1.1
[Portraits: Plato (ca. 370 BC); Johannes Kepler (1571–1630).] Plato believed that symmetric polyhedra (“Platonic solids”) were the fundamental building blocks of nature. Johannes Kepler attributed for the first time the six-fold symmetry of water crystals to the hexagonal packing of particles, antedating modern crystallography.
1.1.1 A Strange New Universe out of Nothing
The foundations of modern geometry were formalised in ancient Greece nearly
2300 years ago by Euclid in a treatise named the Elements (Στοιχεῖα).
Euclidean geometry (which is still taught at school as ‘the geometry’) was
a set of results built upon five intuitive axioms or postulates. The Fifth Postulate, stating that it is possible to pass only one line parallel to a given line through a point outside it, appeared less obvious, and since antiquity a long line of illustrious mathematicians had broken their teeth trying to prove it, to no avail.
An early approach to the problem of the parallels appears in the eleventh
century Persian treatise A commentary on the difficulties concerning the postu-
lates of Euclid’s Elements by Omar Khayyam.
The eighteenth-century Italian
Jesuit priest Giovanni Saccheri (1733) was likely aware of this previous work
judging by the title of his own work Euclides ab omni nævo vindicatus (‘Euclid
cleared of every stain’, see Figure 1.2). Like Khayyam, he considered the
summit angles of a quadrilateral with sides perpendicular to the base. The con-
clusion that acute angles lead to infinitely many non-intersecting lines that can
be passed through a point not on a straight line seemed so counter-intuitive
that he rejected it as ‘repugnatis naturæ linæ rectæ’ (‘repugnant to the nature
of straight lines’).
Figure 1.2
[Portraits: Euclid (ca. 300 BC); Omar Khayyam (1048–1131).] The “Father of Geometry”, Euclid, laid the foundations of modern geometry in the Elements. Omar Khayyam made early attempts to prove Euclid’s fifth postulate, which were also referenced in Saccheri’s work (Euclides ab omni nævo vindicatus).
The nineteenth century brought the realisation that the Fifth Postulate is
not essential and one can construct alternative geometries based on different
notions of parallelism. One such early example is projective geometry, arising,
as the name suggests, in perspective drawing and architecture. In this geom-
etry, points and lines are interchangeable, and there are no parallel lines in
the usual sense: any two lines meet at a ‘point at infinity’. While results in projective geometry had been known since antiquity, the subject was first systematically studied by Jean-Victor Poncelet (1822); see Figure 1.3.
Figure 1.3
[Portraits: Gérard Desargues (1591–1661); Jean-Victor Poncelet (1788–1867).] Based on the prior work of Gérard Desargues, Jean-Victor Poncelet revived the interest in projective geometry, one of the earliest examples of geometries not requiring the parallel postulate.
The credit for the first construction of a true non-Euclidean geometry is
disputed. The princeps mathematicorum Carl Friedrich Gauss worked on it
around 1813 but never published any results.
The first publication on the sub-
ject of non-Euclidean geometry was ‘On the Origins of Geometry’, by the
Russian mathematician Nikolai Lobachevsky (1829)—as depicted in Figure
1.4. In this work, he considered the Fifth Postulate an arbitrary limitation and
proposed an alternative one, that more than one line parallel to a given one can pass through a point outside it. Such a construction requires a space with negative curvature, what we now call a hyperbolic space, a notion that was still not fully mastered at that time. Lobachevsky’s idea appeared heretical, and he was openly derided by colleagues.
A similar construction was independently
discovered by the Hungarian János Bolyai, who published it in 1832 under the
name ‘absolute geometry.’ In an earlier letter to his father dated 1823, he wrote
enthusiastically about this new development: ‘I have discovered such wonder-
ful things that I was amazed... out of nothing I have created a strange new
universe.’
Figure 1.4
[Portraits: Carl Friedrich Gauss (1777–1855); János Bolyai (1802–1860); Nikolai Lobachevsky (1792–1856); Bernhard Riemann (1826–1866).] Several mathematicians played pivotal roles in proposing early variants of non-Euclidean geometries in the nineteenth century. Gauss reportedly worked on the topic without publishing any material on it. The first publication—focussed on hyperbolic geometry—came from Lobachevsky (On the Origins of Geometry, depicted here), with Bolyai having worked on it concurrently. Riemann introduced many such geometries later on, one of which was elliptic geometry.
In the meantime, new geometries continued to pour forth as if from a cornucopia. August Möbius (1827), of the eponymous surface fame, studied affine geometry. Gauss’s student Bernhard Riemann introduced a very broad class of geometries, today called Riemannian in his honour, in his habilitation lecture, subsequently published under the title Über die Hypothesen, welche der Geometrie zu Grunde liegen (‘On the Hypotheses on which Geometry is Based’, 1854). A special case of Riemannian geometry is the ‘elliptic’
geometry of the sphere, another construction violating Euclid’s Fifth Postu-
late, as there is no point on the sphere through which a line can be drawn that
never intersects a given line. Towards the second half of the nineteenth cen-
tury, Euclid’s monopoly over geometry was completely shattered. New types
of geometry (Euclidean, affine, projective, hyperbolic, spherical) emerged
and became independent fields of study. However, the relationships of these
geometries and their hierarchy were not understood.
It was in this exciting but messy situation that Felix Klein came forth, with the genius insight of using group theory as an algebraic abstraction of symmetry to organise the ‘geometric zoo’.
It appeared that Euclidean geometry is a special
case of affine geometry, which is in turn a special case of projective geom-
etry (or, in terms of group theory, the Euclidean group is a subgroup of the
projective group). Klein, and independently the Italian geometer Eugenio Bel-
trami, further showed that constant-curvature non-Euclidean geometries (i.e.,
the hyperbolic geometry of Lobachevsky and Bolyai and the spherical geometry of Riemann) could be obtained as special cases of projective geometry; see also Figure 1.5. More general Riemannian geometry with non-constant
curvature was not included in Klein’s unified geometric picture, and it took
another fifty years before it was integrated, largely thanks to the work of Élie
Cartan in the 1920s.
Figure 1.5
[Portraits: Felix Klein (1849–1925); Eugenio Beltrami (1835–1900).] Felix Klein’s Erlangen Programme offered a way to organise and categorise existing geometries according to their symmetries. Additionally, Klein and Beltrami independently proved that non-Euclidean geometries with constant curvature are special cases of projective geometry.
Klein’s Erlangen Programme has had a profound methodological and cul-
tural impact on geometry and mathematics in general. It was in a sense the
‘second algebraisation’ of geometry (the first one being the analytic geometry of René Descartes and the method of coordinates bearing his latinised name Cartesius) that allowed results to be produced that were impossible with previous methods. Category Theory, abstracting the relations between objects and now pervasive in pure mathematics, can be “regarded as a continuation of the Klein Erlangen Programme, in the sense that a geometrical space with its group of transformations is generalized to a category with its algebra of mappings,” in the words of its creators Samuel Eilenberg and Saunders Mac Lane—see Marquis (2009) and Figure 1.6.
Figure 1.6
[Portraits: Élie Cartan (1869–1951); Samuel Eilenberg (1913–1998); Saunders Mac Lane (1909–2005).] While the Erlangen Programme outlined a path towards unifying all geometries under a common lens, it was not until the work of Élie Cartan several decades later that this unification was entirely formalised. Additionally, the Erlangen Programme inspired the creation of entire novel fields of mathematics, as evidenced by ‘General Theory of Natural Equivalences’, the original text on Category Theory by Eilenberg and Mac Lane.
1.1.2 The Theory of Everything
Considering projective geometry the most general of all, Klein complained
‘how persistently the mathematical physicist disregards the advantages
afforded him in many cases by only a moderate cultivation of the projective
view’ (Klein 1872). His advocacy of the exploitation of geometry and the principles of symmetry in physics foretold the following century, which was truly revolutionary for the field.
In Göttingen, Klein’s colleague Emmy Noether (1918) proved that every differentiable symmetry of the action of a physical system has a corresponding conservation law. It was by all means a stunning result: beforehand, meticulous experimental observation was required to discover fundamental laws such as
the conservation of energy, and even then, it was an empirical result that did not seem to come from anywhere. Noether’s Theorem, “a guiding star to 20th and 21st century physics” in the words of the Nobel laureate Frank Wilczek, allowed one to show, for example, that the conservation of energy emerges from the translational symmetry of time, a rather intuitive idea that the results of an experiment should not depend on whether it is conducted today or tomorrow. The first page of this landmark result is depicted in Figure 1.7.
Another symmetry associated with charge conservation, the global gauge
invariance of the electromagnetic field, first appeared in Maxwell’s formu-
lation of electrodynamics (Maxwell 1865); however, its importance initially
remained unnoticed. The same Hermann Weyl who wrote so dithyrambically
about symmetry is the one who first introduced the concept of gauge invari-
ance in physics in the early 20th century, emphasizing its role as a principle from which electromagnetism can be derived. It took several decades until this fundamental principle, in the generalised form developed by Yang and Mills (1954), proved successful in providing a unified framework to describe the quantum-mechanical behaviour of electromagnetism and the weak and strong forces, finally culminating in the Standard Model that captures all the fundamental forces of nature but gravity. As succinctly put by another Nobel-winning physicist, Philip Anderson (1972), “it is only slightly overstating the case to say that physics is the study of symmetry.”
Figure 1.7
[Portraits: Emmy Noether (1882–1935); Hermann Weyl (1885–1955); Chen-Ning Yang (b. 1922); Robert Mills (1927–1999).] The Erlangen Programme had significant spillover effects in physics. The landmark result was Noether’s theorem, which specifies how conservation laws arise directly from symmetry constraints. Later on, gauge symmetries, first introduced by Weyl and then developed by Yang and Mills, proved an effective abstraction for discovering the presently best-known model of the physical world: the Standard Model.
1.2 Towards Geometric Deep Learning
An impatient reader might wonder at this point, what does all this excursion
into the history of geometry and physics, however exciting it might be, have
to do with deep learning? As we will see, the geometric notions of symmetry
and invariance have been recognised as crucial even in early attempts to do
‘pattern recognition’, and it is fair to say that geometry has accompanied the nascent field of artificial intelligence from its very beginning. While it is hard to agree on a specific point in time when ‘artificial intelligence’ was born as a scientific field (after all, humans have been obsessed with comprehending intelligence and learning since the dawn of civilisation), and even the history of deep learning is disputed, we will attempt the less risky task of looking at the precursors of geometric deep learning, the main topic of our book. This history can be packed into less than a century.
1.2.1 Early Neural Networks and the AI Winter
By the 1930s, it had become clear that the mind resides in the brain, and
research efforts turned to explaining brain functions such as memory, per-
ception, and reasoning, in terms of brain network structures. McCulloch and
Pitts (1943) are credited with the first mathematical abstraction of the neu-
ron, showing its capability to compute logical functions. Just a year after the
legendary Dartmouth College workshop that coined the very term ‘artificial
intelligence’, the American psychologist Frank Rosenblatt (1957) from the Cornell Aeronautical Laboratory proposed a class of neural networks he called ‘perceptrons’; cf. Figure 1.8. Perceptrons, first implemented on a digital
machine and then in dedicated hardware, managed to solve simple pattern
recognition problems such as classification of geometric shapes. However, the
quick rise of ‘connectionism’ (how AI researchers working on artificial neural
networks labeled themselves) received a bucket of cold water in the form of
the now infamous book Perceptrons by Minsky and Papert (1969); the neural
network model they used is depicted in Figure 1.9.
In the deep learning community, it is common to retrospectively blame
Minsky and Papert for the onset of the first AI Winter, which made neural
networks fall out of fashion for over a decade. A typical narrative mentions
the ‘XOR Affair’, a proof that perceptrons were unable to learn even very simple logical functions, as evidence of their poor expressive power. Some sources
even add a pinch of drama recalling that Rosenblatt and Minsky went to the
same school and even alleging that Rosenblatt’s premature death in a boating
accident in 1971 was a suicide in the aftermath of the criticism of his work by
colleagues.
Figure 1.8
[Schematic of the perceptron, with retinal/sensory units (S-units), association units (A-units), and response units (R-units); portrait of Frank Rosenblatt (1928–1971).] The Perceptron proposed by Frank Rosenblatt (1957) was one of the simplest neural network architectures.
The reality is probably more mundane and more nuanced at the same time.
First, a far more plausible reason for the ‘AI Winter’ in the USA is the 1969 Mansfield Amendment, which required the military to fund “mission-oriented direct research” rather than basic undirected research. Since many efforts in artificial intelligence at the time, including Rosenblatt’s research, were funded by military agencies and did not show immediate utility, the cut in funding had dramatic effects. Second, neural networks and artificial intelligence in general were over-hyped: it is enough to recall a 1958 New Yorker article calling perceptrons a “first serious rival to the human brain ever devised” and “remarkable machines” that were “capable of what amounts to thought,” or the overconfident MIT Summer Vision Project expecting a “construction of a significant part of a visual system” and the ability to perform “pattern recognition” to be achieved during one summer term of 1966. A realisation by the research
community that initial hopes to ‘solve intelligence’ had been overly optimistic
was just a matter of time.
If one, however, looks into the substance of the dispute, it is apparent that
what Rosenblatt called a ‘perceptron’ is rather different from what Minsky
and Papert understood under this term. Minsky and Papert focused their analy-
sis and criticism on a narrow class of single-layer neural networks they called
‘simple perceptrons’ (and what is typically associated with this term in modern
times, see Figure 1.8) that compute a weighted linear combination of the inputs
followed by a nonlinear function.
On the other hand, Rosenblatt considered a
broader class of architectures that antedated many ideas of what would now be
considered ‘modern’ deep learning, including multi-layered networks with ran-
dom and local connectivity.
Rosenblatt could have probably rebutted some
of the criticism concerning the expressive power of perceptrons had he known about the proof of the Thirteenth Hilbert Problem by Vladimir Arnold (1956) and Andrey Kolmogorov (1957) establishing that a continuous multivariate function can be written as a superposition of continuous functions of a single variable. The Arnold–Kolmogorov theorem was a precursor of a subsequent class of results known as ‘universal approximation theorems’ for multi-layer (or ‘deep’) neural networks that put these concerns to rest.

Figure 1.9
[Portraits: Marvin Minsky (1927–2016); Seymour Papert (1928–2016).] The influential book Perceptrons by Minsky and Papert (1969) considered simple single-layer neural networks depicted here. It was perhaps the earliest geometric approach to learning, including the introduction of group invariance.
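To make the object of the dispute concrete, the following sketch (our own illustration of such a ‘simple perceptron’, with hypothetical data rather than any historical implementation) computes a weighted linear combination of the inputs, passes it through a threshold nonlinearity, and applies Rosenblatt’s error-correcting update rule:

```python
import numpy as np

# A minimal sketch of the 'simple perceptron' discussed above (illustrative data
# and sizes, not a historical implementation): a weighted linear combination of
# the inputs followed by a threshold nonlinearity, trained with Rosenblatt's
# error-correcting rule, which updates the weights only on misclassified examples.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
scores = X @ np.array([2.0, -1.0]) + 0.5
mask = np.abs(scores) > 0.5                       # keep a margin so the toy data is separable
X, y = X[mask], np.sign(scores[mask])

w, b = np.zeros(2), 0.0
for _ in range(20):
    for x, target in zip(X, y):
        pred = 1.0 if w @ x + b >= 0 else -1.0    # threshold nonlinearity
        if pred != target:                        # update only on mistakes
            w, b = w + target * x, b + target
print(np.mean(np.where(X @ w + b >= 0, 1.0, -1.0) == y))   # training accuracy, 1.0 once converged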
While most remember the book of Minsky and Papert for the role it played
in cutting the wings of the early-day connectionists and lament the lost opportunities, an important overlooked aspect is that it presented, for the first time, a
geometric analysis of learning problems. This fact is reflected in the very name
of the book, subtitled An introduction to Computational Geometry. At the time
it was a radically new idea, and Block (1970) in his critical review of the book
(which essentially stood to the defense of Rosenblatt), wondered whether “the
new subject of ‘Computational Geometry”’ would “grow into an active field
of mathematics, or will it peter out in a miscellany of dead ends?” (the former
happened: computational geometry is now a well-established field
).
Furthermore, Minsky and Papert probably deserve the credit for the first
introduction of group theory into the realm of machine learning: their Group
Invariance theorem stated that if a neural network is invariant to some group,
then its output could be expressed as functions of the orbits of the group (we
will define these terms in subsequent Chapters). While they used this result to
prove limitations of what a perceptron could learn, similar approaches were
subsequently used by Amari (1978) for the construction of invariant features
in pattern recognition problems. An evolution of these ideas in the works of
Sejnowski, Kienker, and Hinton (1986) and Shawe-Taylor (1989, 1993), unfor-
tunately rarely cited today, provided the foundations of the geometric learning
blueprint described in this book.
1.2.2 Universal Approximation and the Curse of Dimensionality
The aforementioned notion of universal approximation deserves further discus-
sion. The term refers to the ability to approximate any continuous multivariate
function to any desired accuracy; in the machine learning literature, this type
of results is usually credited to Cybenko (1989) and Hornik (1991). Figure
1.10 outlines the pioneering researchers working on universal approximation
results.
Figure 1.10
[Portraits: David Hilbert (1862–1943); Andrey Kolmogorov (1903–1987); Vladimir Arnold (1937–2010); George Cybenko; Kurt Hornik.] David Hilbert’s Thirteenth Problem, proven by Andrey Kolmogorov and Vladimir Arnold, was one of the first results showing that a multivariate continuous function could be expressed as a composition and sum of simple one-dimensional functions. George Cybenko and Kurt Hornik proved results specific to neural networks, showing that a perceptron with one hidden layer can approximate any continuous function to any desired accuracy.
Unlike ‘simple’ (single-layer) perceptrons criticised by Minsky and Papert
(1969), multilayer neural networks are universal approximators and thus are
an appealing choice of architecture for learning problems. We can think of
supervised machine learning as a function approximation problem: given the
outputs of some unknown function (e.g., an image classifier) on a training set
(e.g., images of cats and dogs), we try to find a function from some hypoth-
esis class that fits the training data well and allows us to predict the outputs on
previously unseen inputs (‘generalisation’).
Universal approximation guarantees that we can express functions from a
very broad regularity class (continuous functions) by means of a multi-layer
neural network. In other words, there exists a neural network with a certain
number of neurons and certain weights that approximates a given function
mapping from the input to the output space (e.g., from the space of images
to the space of labels). However, universal approximation theorems do not tell
us how to find such weights. In fact, learning (i.e., finding weights) in neu-
ral networks was a big challenge in the early days. Rosenblatt showed
a learning algorithm only for a single-layer perceptron; to train multi-layer
neural networks, Ivakhnenko and Lapa (1966) used a layer-wise learning algo-
rithm called ‘group method of data handling.’ This allowed Ivakhnenko (1971)
to go as deep as eight layers — a remarkable feat for the early 1970s!
A breakthrough came with the invention of backpropagation, an algorithm
using the chain rule to compute the gradient of the loss function with respect to the weights, which allowed gradient descent-based optimisation techniques to be used to train neural networks. As of today, this is the standard approach in
deep learning. While the origins of backpropagation date back to at least the
1960s,
the first convincing demonstration of this method in neural networks
was in the widely cited Nature paper of Rumelhart, Hinton, and Williams
(1986). The introduction of this simple and efficient learning method has been
a key contributing factor to the return of neural networks to the AI scene in the
1980s and 1990s. The key contributors in this space are discussed in Figure
1.11.
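The idea is easy to state in a few lines of code; the sketch below (a toy illustration with hypothetical layer sizes, not the historical implementations of Linnainmaa, Werbos, or Rumelhart et al.) applies the chain rule by hand to a one-hidden-layer network and takes gradient descent steps:

```python
import numpy as np

# A minimal sketch of backpropagation (illustrative sizes and data): a
# one-hidden-layer network y = W2 @ tanh(W1 @ x) trained with squared loss.
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(16, 8))    # hidden-layer weights (hypothetical sizes)
W2 = rng.normal(scale=0.1, size=(1, 16))    # output-layer weights

def backprop_step(x, target, lr=0.01):
    global W1, W2
    h = np.tanh(W1 @ x)                     # forward pass: hidden activations
    y = W2 @ h                              # forward pass: network output
    err = y - target                        # d(loss)/d(y) for 0.5*(y - t)^2
    grad_W2 = np.outer(err, h)              # chain rule: dL/dW2
    dh = W2.T @ err                         # error backpropagated to the hidden layer
    grad_W1 = np.outer(dh * (1 - h**2), x)  # tanh'(z) = 1 - tanh(z)^2
    W2 -= lr * grad_W2                      # gradient descent updates
    W1 -= lr * grad_W1
    return 0.5 * float((err**2).sum())

x, t = rng.normal(size=8), np.array([1.0])
for _ in range(200):
    loss = backprop_step(x, t)
print(loss)   # the loss shrinks over the iterations on this single example
```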
Figure 1.11
[Portraits: Alexey Ivakhnenko (1913–2007); Seppo Linnainmaa (b. 1945); Paul Werbos (b. 1947); David Rumelhart (1942–2011).] Four of the most important figures in the development of modern methods of learning via gradient descent. Ivakhnenko derived a layer-wise learning algorithm which allowed for training an eight-layer neural network in 1971. The backpropagation algorithm was initially described by Linnainmaa, and first deployed in deep neural networks by Werbos. Rumelhart’s work offered the first convincing demonstration of the method in deep neural networks, which propelled the use of backpropagation in future works.
Looking at neural networks through the lens of approximation theory has
led some cynics to call deep learning a “glorified curve fitting.” We will let the
reader judge how true this maxim is by trying to answer the following impor-
tant question: how many samples (training examples) are needed to accurately
approximate a function? Approximation theorists will immediately retort that
the class of continuous functions that multilayer perceptrons can represent
is obviously way too large: one can pass infinitely many different continu-
ous functions through a finite collection of points. It is necessary to impose additional regularity assumptions such as Lipschitz continuity, in which case one can provide a bound on the required number of samples. Unfortunately, these bounds scale exponentially with dimension, a phenomenon colloquially known as the ‘curse of dimensionality’, which is unacceptable in machine learning problems: even small-scale pattern recognition problems, such as image classification, deal with input spaces of thousands of dimensions. If one had to rely only on classical results from approximation theory, machine learning would be impossible. In our illustration, the number of examples of cat and dog images that would be required in theory in order to learn to tell them apart would be way larger than the number of atoms in the universe: there are simply not enough cats and dogs around to do it.
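The scaling itself can be seen from a standard back-of-the-envelope estimate (a textbook approximation-theoretic bound, quoted here only as an illustration): for a 1-Lipschitz target on the d-dimensional unit cube, guaranteeing uniform error ε essentially requires a sample in every cell of an ε-grid of the domain,

```latex
% Assumptions: f is 1-Lipschitz on [0,1]^d and the error is measured uniformly.
\[
  N(\epsilon, d) \;\sim\; \left(\frac{1}{\epsilon}\right)^{d},
\]
```

so with d in the thousands, as for even modestly sized images, N exceeds any realistic dataset by astronomical margins.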
The struggle of machine learning methods to scale to high dimensions was
brought up by the British mathematician Sir James Lighthill (1973) in a paper
that AI historians call the ‘Lighthill Report’, in which he used the term ‘com-
binatorial explosion’ and claimed that existing AI methods could only work
on toy problems and would become intractable in real-world applications.
Lighthill further complained that “most workers in AI research and in related
fields confess to a pronounced feeling of disappointment in what has been
achieved in the past twenty-five years” and “in no part of the field have the
discoveries made so far produced the major impact that was then promised.”
These were not mere frustrations of a grumpy academic stuck with a hard
problem: the Report was commissioned by the British Science Research Coun-
cil to evaluate academic research in the field of artificial intelligence, and its
pessimistic conclusions resulted in funding cuts in the UK. Together
with similar decisions by the American funding agencies, this amounted to
a wrecking ball for AI research in the 1970s.
For us, the realisation that classical functional analysis cannot provide an
adequate framework to deal with learning problems will be the motivation
to seek stronger, geometric forms of regularity that can be implemented in a particular wiring of the neural network, such as the local connectivity of
convolutional neural networks. It is fair to say that the triumphant reemergence
of deep learning a decade ago owes, at least in part, to these insights. It is also
probably true that the role of symmetry and invariance as a broad organising and design principle in neural networks was not given enough credit at the time, and our highlighting of these principles here benefits from hindsight.
1.2.3 Secrets of the Visual Cortex and the Neocognitron
The inspiration for the first neural network architectures of the new ‘geo-
metric’ type came from neuroscience. In a series of experiments that would
become classical and bring them a Nobel Prize in medicine, the duo of
Harvard neurophysiologists David Hubel and Torsten Wiesel (1959; 1962)
unveiled the structure and function of a part of the brain responsible for pat-
tern recognition—the visual cortex. By presenting changing light patterns to
a cat and measuring the response of its brain cells (neurons), as depicted in Figure 1.12, they showed that the neurons in the visual cortex have a multi-layer structure with local spatial connectivity: a cell would produce a response only if cells in its proximity (its ‘receptive field’) were activated. Furthermore, the organisation appeared to be hierarchical, where the responses of ‘simple cells’ reacting to local primitive oriented step-like stimuli were aggregated by ‘complex cells’, which produced responses to more complex patterns. It was hypothesized that cells in deeper layers of the visual cortex would respond to increasingly complex patterns composed of simpler ones, with a semi-joking suggestion of the existence of a ‘grandmother cell’ that reacts only when shown the face of one’s grandmother.
Figure 1.12
[Schematic of the experiment: a visual stimulus is presented to a cat while a recording electrode measures the electrical signal from the visual area of its brain; portraits of David Hubel (1926–2013) and Torsten Wiesel (b. 1924).] The classical experiment of Hubel and Wiesel (1962) revealed the structure of the brain visual cortex and inspired a new generation of neural network architectures mimicking its local connectivity.
The understanding of the structure of the visual cortex has had a profound impact on early works in computer vision and pattern recognition, with mul-
tiple attempts to imitate its main ingredients. Kunihiko Fukushima (1980),
at that time a researcher at the Japan Broadcasting Corporation, developed a
new neural network architecture “similar to the hierarchy model of the visual nervous system proposed by Hubel and Wiesel”, which was given the name neocognitron.
The neocognitron consisted of interleaved S- and C-layers of
neurons (a naming convention reflecting its inspiration in the biological visual
cortex); the neurons in each layer were arranged in 2D arrays following the
structure of the input image (‘retinotopic’), with multiple ‘cell-planes’ (fea-
ture maps in modern terminology) per layer. The S-layers were designed to be
translationally symmetric: they aggregated inputs from a local receptive field
using shared learnable weights, resulting in cells in a single cell-plane having receptive fields with the same function but at different positions. The rationale was to pick up patterns that could appear anywhere in the input. The C-layers were fixed and performed local pooling (a weighted average), affording insensitivity to the specific location of the pattern: a C-neuron would be activated if any of the neurons in its input were activated.
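The two ingredients can be illustrated in a few lines (a one-dimensional toy sketch of the idea, not Fukushima’s original equations; the weights and input below are hypothetical):

```python
import numpy as np

# Minimal 1D sketch of the two ingredients described above (not Fukushima's
# original formulation): an S-layer sliding one shared weight vector over the
# input (translational weight sharing) and a C-layer pooling locally.
def s_layer(signal, weights):
    k = len(weights)
    # the same weights are applied at every position -> equivariance to shifts
    return np.array([max(0.0, weights @ signal[i:i + k])   # half-rectification
                     for i in range(len(signal) - k + 1)])

def c_layer(features, pool=2):
    # local pooling: respond if any unit in the pooling window responds
    return np.array([features[i:i + pool].max()
                     for i in range(0, len(features) - pool + 1, pool)])

x = np.array([0., 0., 1., 2., 1., 0., 0., 0.])
w = np.array([1., 2., 1.])                    # a hypothetical local 'bump' template
print(c_layer(s_layer(x, w)))                 # shifting x shifts (and then pools) the response
```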
Since the main application of the neocognitron was character recognition,
translation invariance was crucial. This property was a fundamental difference from earlier neural networks such as Rosenblatt’s perceptron: in order to use a perceptron reliably, one had to first normalise the position of the input pattern, whereas in the neocognitron the insensitivity to the pattern position was baked into the architecture. The neocognitron achieved it by interleaving
translationally-equivariant local feature extraction layers with pooling, cre-
ating a multiscale representation—we will refer to this principle as scale
separation and study in subsequent Chapters why it can also help deal with
a broader class of geometric transformations in addition to translations. Com-
putational experiments showed that Fukushima’s architecture was able to
successfully recognise complex patterns such as letters or digits, even in the
presence of noise and geometric distortions.
Looking from the vantage point of four decades of progress in the field,
one finds that the neocognitron already had strikingly many characteristics of
modern deep learning architectures: depth (Fukushima simulated a seven-layer network in his paper), local receptive fields, shared weights, and pooling. It even used the half-rectifier (ReLU) activation function, which is often believed to have been introduced in recent deep learning architectures. The main distinction from modern systems was in the way the network was trained: the neocogni-
tron was a ‘self-organised’ architecture trained in an unsupervised manner,
since backpropagation had still not been widely used in the neural network
community.
1.2.4 Convolutional Neural Networks
Fukushima’s design was further developed by Yann LeCun, a fresh graduate
from the University of Paris
with a PhD thesis on the use of backpropa-
gation for training neural networks. In his first post-doctoral position at the
AT&T Bell Laboratories, LeCun and colleagues built a system to recognise
hand-written digits on envelopes in order to allow the US Postal Service to
automatically route mail. In a paper that is now classical, LeCun et al. (1989)
described the first three-layer convolutional neural network (CNN).
Simi-
larly to the neocognitron, LeCun’s CNN also used local connectivity with
shared weights and pooling (Figure 1.13). However, it forwent Fukushima’s
more complex nonlinear filtering (inhibitory connections) in favour of sim-
ple linear filters that could be efficiently implemented as convolutions using
multiply-accumulate operations on a digital signal processor (DSP).
This
design choice, departing from the neuroscience inspiration and terminology
and moving into the realm of signal processing, would play a crucial role in
the ensuing success of deep learning. Another key novelty of CNN was the use
of backpropagation for training.
Figure 1.13
[Schematic of LeCun’s network: 256 input units; layer H1 of 12 × 64 = 768 hidden units (20,000 links from 12 kernels of size 5 × 5); layer H2 of 12 × 16 = 192 hidden units (40,000 links from 12 kernels of size 5 × 5 × 8); layer H3 of 30 fully connected hidden units; 10 output units. Portraits of Kunihiko Fukushima and Yann LeCun.] The original variants of convolutional neural network architectures have been introduced by Fukushima and LeCun (though the name “convolutional” would appear later). Implemented on a digital signal processor, LeCun’s CNN allowed real-time handwritten digit recognition for the first time.
LeCun’s work convincingly showed the power of gradient-based methods for complex pattern recognition tasks, and his system was one of the first practical deep learning-based systems for computer vision. An evolution of this architecture,
a CNN with five layers named LeNet-5 as a pun on the author’s name (LeCun
et al. 1998), was used by US banks to read handwritten cheques. The computer
vision research community, however, in its vast majority steered away from
neural networks and took a different path. The typical architecture of visual
recognition systems of the first decade of the new millennium was a care-
fully hand-crafted feature extractor (typically detecting interesting points in an
image and providing their local description in a way that is robust to perspec-
tive transformations and contrast changes
) followed by a simple classifier
(most often a support vector machine (SVM) and, more rarely, a small neural
network).
1.2.5 Recurrent Neural Networks, Vanishing Gradients, and LSTMs
While CNNs were mainly applied to modelling data with spatial symmetries
such as images, another development was brewing: one that recognises that
data is often non-static, but can evolve in a temporal manner as well. The sim-
plest form of temporally-evolving data is a time series consisting of a sequence
of steps with a data point provided at each step.
A model handling time-series data hence needs to be capable of meaning-
fully adapting to sequences of arbitrary length, a feat that CNNs were not trivially capable of. This led to the development of Recurrent Neural Networks (RNNs) in the late 1980s and early 1990s, the earliest examples of which simply apply a shared update rule distributed through time (Jordan 1986; Elman 1990). At every step, the RNN state is updated as a function of the previous state and the current input.
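In its simplest form, this update can be sketched as follows (illustrative sizes and random weights; this is the generic Elman-style recurrence rather than any specific published model):

```python
import numpy as np

# A minimal sketch of an Elman-style recurrent update: the same update rule,
# with the same weights, is applied at every time step (illustrative sizes).
rng = np.random.default_rng(0)
W_h = rng.normal(scale=0.1, size=(4, 4))   # state-to-state weights
W_x = rng.normal(scale=0.1, size=(4, 3))   # input-to-state weights

def rnn(inputs, h0=np.zeros(4)):
    h = h0
    for x in inputs:                       # shared update rule through time
        h = np.tanh(W_h @ h + W_x @ x)     # new state from old state and input
    return h

sequence = rng.normal(size=(10, 3))        # a sequence of 10 three-dimensional steps
print(rnn(sequence))                       # final state summarises the whole sequence
```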
One issue that plagued RNNs from the beginning was the vanishing gradi-
ent problem. Vanishing gradients arise in very deep neural networks (of which RNNs are often an instance, given that their effective depth is equal to the sequence length) with sigmoidal activations: when such a network is unrolled over many steps, the gradients quickly approach zero as backpropagation is performed. This effect strongly limits the influence of earlier steps in the sequence on the predictions made in the latter steps.
A solution to this problem was first elaborated in Sepp Hochreiter’s Diploma
thesis (1991), carried out in Munich under the supervision of Jürgen Schmid-
huber, in the form of an architecture that was dubbed Long Short-Term
Memory (LSTM; Figure 1.14).
LSTMs combat the vanishing gradient prob-
lem by having a memory cell, with explicit gating mechanisms deciding how
much of that cell’s state to overwrite at every step — allowing one to learn the
degree of forgetting from data (or, alternatively, of remembering for a long time). In contrast, the simple RNNs of Jordan (1986) and Elman (1990) perform a full
overwrite at every step.
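The contrast between the two behaviours can be sketched as follows (a simplified gated update for illustration, not the full LSTM equations; the gate values are hypothetical):

```python
import numpy as np

# An illustrative sketch (a simplified gated update, not the full LSTM
# equations): the gate g decides how much of the old memory is overwritten
# at each step, whereas a simple RNN discards the old state entirely.
rng = np.random.default_rng(0)

def simple_rnn_step(h, candidate):
    return candidate                          # full overwrite (shown only for contrast)

def gated_step(h, candidate, g):
    # g near 0 keeps the memory almost intact; g near 1 approaches a full overwrite
    return (1.0 - g) * h + g * candidate

h = np.array([1.0, -1.0])
g = np.array([0.01, 0.9])                     # hypothetical per-unit gate values
for _ in range(100):
    candidate = np.tanh(rng.normal(size=2))   # hypothetical new content at this step
    h = gated_step(h, candidate, g)
print(h)   # the slowly-gated unit retains a clear trace of its initial value; the other does not
```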
This ability to remember context for a long time turned out to be crucial in natural language processing and speech analysis applications, where recurrent models showed successful applications starting from the early 2000s (Gers and Schmidhuber 2001; Graves et al. 2004). However, as happened with CNNs in computer vision, the breakthrough would have to wait for another decade to come.

Figure 1.14
[Portraits: Sepp Hochreiter; Jürgen Schmidhuber.] The creators of the long short-term memory cell—the first recurrent neural network module capable of dealing with long-range dependencies—Hochreiter and Schmidhuber, pictured alongside one of the earliest schematic depictions of their creation.
1.2.6 Gating Mechanisms and Time Warping
In our context, it may not be initially obvious whether recurrent models
embody any kind of symmetry or invariance principle similar to the trans-
lational invariance we have seen in CNNs. Nearly three decades after the
development of RNNs, Corentin Tallec and Yann Ollivier (2018) showed
that there is a type of symmetry underlying recurrent neural networks: time
warping.
Time series have a very important but subtle nuance: it is rarely the case that
the individual steps of a time series are evenly spaced. Indeed, in many natural
phenomena, we may deliberately choose to take many measurements at some
time points and very few (or no) measurements during others. In a way, the
notion of ‘time’ presented to the RNN working on such data undergoes a form
of warping operation. Can we somehow guarantee that our RNN is “resistant”
to time warping, in the sense that we will always be able to find a set of useful
parameters for it, regardless of the warping operation applied?
In order to handle time warping with a dynamic rate, the network needs
to be able to dynamically estimate how quickly time is being warped (the
so-called ‘warping derivative’) at every step. This estimate is then used to
selectively overwrite the RNN’s state. Intuitively, for larger values of the warp-
ing derivative, a stricter state overwrite should be performed, as more time has elapsed since the previous observation.
The idea of dynamically overwriting data in a neural network is imple-
mented through the gating mechanism. Tallec and Ollivier effectively show
that, in order to satisfy time warping invariance, an RNN needs to have a
gating mechanism. This provides a theoretical justification for gated RNN models, such as the aforementioned LSTMs of Hochreiter and Schmidhuber or the Gated Recurrent Unit (GRU) (Cho et al. 2014). The full overwriting used in simple RNNs corresponds to an implicit assumption of a constant time warping derivative equal to one, a situation unlikely to occur in most real-world scenarios, which also explains the success of LSTMs.
1.2.7 The Triumph of Deep Learning
As already mentioned, the reception of what would later be called “deep learning” had initially been rather lukewarm. The computer vision commu-
nity, where the first decade of the new century was dominated by handcrafted
feature descriptors, appeared to be particularly hostile to the use of neural
networks. However, the balance of power was changed by the rapid growth
in computing power and the amounts of available annotated visual data. It
became possible to implement and train increasingly bigger and more com-
plex CNNs that allowed addressing increasingly challenging visual pattern
recognition tasks,
culminating in a Holy Grail of computer vision at that
time: the ImageNet Large Scale Visual Recognition Challenge (Figure 1.15).
Established by the American-Chinese researcher Fei-Fei Li, ImageNet was an
annual challenge consisting of the classification of millions of human-labelled
images into 1000 different categories.
Figure 1.15
[Plot of ImageNet top-5 classification error by year: NEC-UIUC 28% (2010), XRCE 26% (2011), AlexNet 16.4% (2012), ZFNet 11.7% (2013), GoogLeNet/VGGNet 6.7% (2014), ResNet 3.6% (2015), GoogLeNet-v4 3.1% (2016), with human performance around 5%; portrait of Fei-Fei Li.] ImageNet, a benchmark developed by Fei-Fei Li at Stanford University, was one of the ‘Holy Grail’ challenges in computer vision in the early 2010s. The dramatic performance improvement provided by the convolutional neural network AlexNet in 2012 is considered the turning point leading to the widespread adoption of deep learning in the field.
A CNN architecture developed at the University of Toronto by Krizhevsky,
Sutskever, and Hinton (2012) managed to beat by a large margin all the competing approaches, such as smartly engineered feature detectors based on decades of research in the field. AlexNet (as the architecture was called in honour of its developer, Alex Krizhevsky; see Figure 1.16) was significantly bigger in terms of the number of parameters and layers compared to its older sibling LeNet-5, but conceptually the same. The key difference was the use of a graphics processor (GPU) for training, now the mainstream hardware platform for deep learning.
Figure 1.16
[Schematic of AlexNet: a 224 × 224 input, five convolutional layers (CONV1–CONV5, with 11 × 11, 5 × 5, and 3 × 3 kernels, a stride of 4 in the first layer, and interleaved max pooling) followed by three fully connected layers (FC1–FC3) ending in 1000 outputs; portrait of Alex Krizhevsky.] Alex Krizhevsky, pictured next to a schematic of AlexNet—the first deep neural network-based solution to win the ImageNet contest, by a significant margin. AlexNet’s result is commonly seen as a pivotal moment in the development of modern deep learning, ushering in significant developments in subsequent years, eventually leading to surpassing human performance on ImageNet only a few years later.
The success of CNNs on ImageNet became the turning point for deep learn-
ing and heralded its broad acceptance in the following decade. A similar
transformation happened in natural language processing and speech recog-
nition, which moved entirely to neural network-based approaches during the
2010s; indicatively, Google and Facebook switched their machine translation
systems to LSTM-based architectures around 2016–2017. Multi-billion dollar
industries emerged as a result of this breakthrough, with deep learning success-
fully used in commercial systems ranging from speech recognition in the Apple
iPhone to Tesla self-driving cars. More than forty years after the scathing
review of Rosenblatt’s work, the connectionists were finally vindicated.
1.2.8 Graph Neural Networks and Their Chemical Precursors
If the history of symmetry is tightly intertwined with physics, the history of
graph neural networks—one of the central topics of our book—has roots in
another branch of natural science: chemistry. Chemistry has historically been
– and still is – one of the most data-intensive academic disciplines. The emer-
gence of modern chemistry in the eighteenth century resulted in the rapid
growth of known chemical compounds and an early need for their organi-
sation. This role was initially played by periodicals such as the Chemisches
Zentralblatt and “chemical dictionaries” like the Gmelins Handbuch der anorganischen Chemie (an early compendium of inorganic compounds first published in 1817) and Beilsteins Handbuch der organischen Chemie (a similar effort for organic chemistry), all initially published in German, which was
the dominant language of science until the early 20th century.
In the English-speaking world, the Chemical Abstracts Service (CAS) was
created in 1907 and has gradually become the central repository for the world’s
published chemical information.
However, the sheer amount of data (the
Beilstein alone has grown to over 500 volumes and nearly half a million pages
over its lifetime) has quickly made it impractical to print and use such chemical
databases.
Since the mid-nineteenth century, chemists have established a universally
understood way to refer to chemical compounds through structural formulae,
indicating a compound’s atoms, the bonds between them, and even their 3D
geometry. But such structures did not lend themselves to easy retrieval. In
the first half of the 20th century, with the rapid growth of newly discovered
compounds and their commercial use, the problem of organising, searching,
and comparing molecules became of crucial importance: for example, when a
pharmaceutical company sought to patent a new drug, the Patent Office had to
verify whether a similar compound had been previously deposited.
To address this challenge, several systems for indexing molecules were
introduced in the 1940s, forming foundations for a new discipline that would
later be called chemoinformatics. One such system, named the ‘GKD chemical
cipher’ after the authors Gordon, Kendall, and Davison (1948), was developed
at the English tyre firm Dunlop to be used with early punchcard-based com-
puters.
In essence, the GKD cipher was an algorithm for parsing a molecular
structure into a string that could be more easily looked up by a human or a
computer.
However, the GKD cipher and other related methods
were far from sat-
isfactory. In chemical compounds, similar structures often result in similar
properties. Chemists are trained to develop intuition to spot such analogies,
and look for them when comparing compounds.
On the other hand, when
a molecule is represented as a string (such as in the GKD cipher), the con-
stituents of a single chemical structure may be mapped into different positions
of the cipher. As a result, two molecules containing a similar substructure (and
thus possibly similar properties) might be encoded in very different ways (see
an example in Figure 1.17).
Figure 1.17
[Portrait: George Vlăduţ.] A figure from Vlăduţ et al. (1959) showing a chemical molecule (top left) and its fragment (top right) and the corresponding GKD-ciphers (bottom). Note that this coding system breaks the spatial locality of connected atoms in the molecule, such that the fragment cipher cannot be found by simple substring matching in that of the full molecule. This drawback of early chemical representation methods was one of the motivations for the search for structural representations of molecules as graphs.
This realisation encouraged the development of “topological ciphers” trying to capture the structure of the molecule. The first works of this kind were done at the Dow Chemical company by Opler and Norton (1956) and at the US Patent Office by Ray and Kirsch (1957), both heavy users of chemical databases. One of the most famous such descriptors, known as the ‘Morgan fingerprint’, was developed by Harry Morgan (1965) at the Chemical Abstracts Service and is used to this day.
A figure that has played a key role in developing early “structural”
approaches for searching chemical databases is the Romanian-born Soviet
researcher George Vlăduţ (Figure 1.17). A chemist by training (with a PhD in organic chemistry defended at the Moscow Mendeleev Institute in 1952), he experienced a traumatic encounter with the gargantuan Beilstein handbook in his freshman years. This steered his research interests towards chemoinformatics (Vlăduţ et al. 1959), a field in which he worked for the rest of his life.
Vlăduţ is credited as one of the pioneers of using graph theory for modelling
the structures and reactions of chemical compounds. In a sense, this should not
come as a surprise: graph theory has been historically tied to chemistry, and
even the term ‘graph’ (referring to a set of nodes and edges, rather than a plot
of a function) was introduced by the mathematician James Sylvester (1878) as
a mathematical abstraction of chemical molecules; cf. Figure 1.18.
In particular, Vlăduţ advocated the formulation of molecular structure com-
parison as the graph isomorphism problem; his most famous work was on
classifying chemical reactions as the partial isomorphism (maximum common subgraph) of the reactant and product molecules (Vleduts 1963).

Figure 1.18
[Portraits: August Kekulé (1829–1896); James Joseph Sylvester (1814–1897).] The structural formula of benzene (C6H6) proposed by the 19th-century German chemist August Kekulé. The term “graph” (in the sense used in graph theory) was first introduced as a model of molecules by James Sylvester in an 1878 Nature note, explicitly relating molecules to graphs expressing their “Kekuléan diagrams”.
Vlăduţ’s work inspired a pair of young researchers, Boris Weisfeiler (an algebraic geometer) and Andrey Lehman (self-described as a “programmer”). In a classical joint paper (Weisfeiler and Leman 1968), the duo introduced an iterative algorithm for testing whether a pair of graphs are isomorphic (i.e., whether the graphs have the same structure up to a reordering of nodes), which became known as the Weisfeiler-Lehman (WL) test; see Figure 1.19. Though the two had known each other since their school years, their ways parted shortly after their publication, and each became accomplished in his respective field.
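The procedure itself is remarkably simple; the following sketch of 1-dimensional WL colour refinement (our own illustrative code, not the original 1968 formulation) refines node colours from neighbourhood colour multisets and compares the resulting histograms:

```python
# A compact sketch of 1-dimensional WL colour refinement (illustrative code):
# nodes repeatedly receive new colours computed from their own colour and the
# multiset of their neighbours' colours. Different colour histograms certify
# non-isomorphism; identical histograms are inconclusive (e.g. for some
# strongly regular graphs, as discussed below).
def wl_colours(adj, iterations=3):
    colours = {v: 0 for v in adj}                        # start with a uniform colouring
    for _ in range(iterations):
        signatures = {v: (colours[v], tuple(sorted(colours[u] for u in adj[v])))
                      for v in adj}
        palette = {sig: i for i, sig in enumerate(sorted(set(signatures.values())))}
        colours = {v: palette[signatures[v]] for v in adj}
    return sorted(colours.values())                      # colour histogram (up to relabelling)

# Example: a triangle and a path on three nodes are told apart immediately.
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
path = {0: [1], 1: [0, 2], 2: [1]}
print(wl_colours(triangle) != wl_colours(path))          # True: definitely not isomorphic
```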
Weisfeiler and Lehman’s initial conjecture that their algorithm solved the
graph isomorphism problem (and does so in polynomial time) was incorrect:
while Leman (1970) demonstrated it computationally for graphs with at most
nine nodes, a larger counterexample was found a year later (Adelson-Velskii
et al. 1969) (and in fact, a strongly regular graph failing the WL test called the
‘Shrikhande graph’ had been known even earlier (Shrikhande 1959)).
The paper of Weisfeiler and Lehman has become foundational in under-
standing graph isomorphism. To put their work in historical perspective, one should remember that in the 1960s, complexity theory was still embryonic and algorithmic graph theory was only taking its first baby steps. As Lehman recollected in the late 1990s: “in the 60s, one could in a matter of days re-discover all the facts, ideas, and techniques in graph isomorphism theory. I doubt that the word ‘theory’ is applicable; everything was at such a basic level.” In the con-
text of graph neural networks, Weisfeiler and Lehman have recently become
household names with the proof of the equivalence of their graph isomorphism test to message passing (Xu et al. 2018; Morris et al. 2019).

Figure 1.19
[Portraits: Boris Weisfeiler; Andrey Lehman.] The creators of the eponymous Weisfeiler-Lehman graph isomorphism test, the bedrock of expressivity analysis for both graph isomorphism and graph neural networks, are pictured alongside the front page of their paper.
1.2.9 Back to the Origins
Though chemists have been using GNN-like algorithms for decades, it is likely
that their works on molecular representation remained practically unknown in
the machine learning community. We find it hard to pinpoint precisely when the concept of graph neural networks began to emerge: partly because most of the early work did not treat graphs as first-class citizens, partly because graph neural networks became practical only in the late 2010s, and partly because this field emerged from the confluence of several adjacent research areas; nonetheless, here we will discuss several pioneering works, many of which were designed by the researchers shown in Figure 1.20.
Early forms of graph neural networks can be traced back at least to the
1990s, with examples including “Labeling RAAM” by Alessandro Sperduti
(1994), the “backpropagation through structure” of Goller and Kuchler (1996),
and adaptive processing of data structures (Sperduti and Starita 1997; Frasconi,
Gori, and Sperduti 1998). While these works were primarily concerned with
operating over “structures” (often trees or directed acyclic graphs), many of
the invariances preserved in their architectures are reminiscent of the GNNs
more commonly in use today.
The first proper treatment of the processing of generic graph structures (and
the coining of the term ‘graph neural network’) happened after the turn of the
twenty-first century. A University of Siena team led by Marco Gori (2005) and
Franco Scarselli (2008) proposed the first “GNN”. They relied on recurrent mechanisms, required the neural network parameters to specify contraction mappings, and thus computed node representations by searching for a fixed point; this in itself necessitated a special form of backpropagation and did
not depend on node features at all. All of the above issues were rectified by
the Gated GNN (GGNN) model of Yujia Li et al. (2015), which brought many
benefits of modern RNNs, such as gating mechanisms (Cho et al. 2014) and
backpropagation through time. The neural network for graphs (NN4G) pro-
posed by Alessio Micheli (2009) around the same time used a feedforward
rather than recurrent architecture, in fact more closely resembling modern GNNs.
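To make the contrast concrete, here is a minimal sketch of the two regimes (our own illustration with toy update rules and parameter names; it is not the formulation of any of the cited papers): a recurrent GNN that iterates a shared, contractive update to a fixed point, versus a feedforward GNN in the spirit of NN4G that simply stacks a fixed number of layers.

```python
import numpy as np

def fixed_point_gnn(adj, x, w_self, w_neigh, tol=1e-6, max_iter=100):
    """Toy recurrent GNN: iterate one shared update until the node states converge.
    For the fixed point to exist and be unique, the update must be a contraction
    (e.g. by keeping the weight norms small) -- the constraint that the early
    recurrent formulations had to impose on their parameters."""
    h = np.zeros_like(x)
    for _ in range(max_iter):
        # Mix each node's input features with the sum of its neighbours' states.
        h_new = np.tanh(x @ w_self + adj @ h @ w_neigh)
        if np.linalg.norm(h_new - h) < tol:  # (approximate) fixed point reached
            return h_new
        h = h_new
    return h

def feedforward_gnn(adj, x, layers):
    """Toy feedforward GNN in the spirit of NN4G: a fixed stack of layers."""
    h = x
    for w_self, w_neigh in layers:
        h = np.tanh(h @ w_self + adj @ h @ w_neigh)
    return h
```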
Figure 1.20
Six of the pioneering authors of graph neural networks (GNNs) within the machine learning community: Alessandro Sperduti, Christoph Goller, Andreas Küchler, Marco Gori, Franco Scarselli, and Alessio Micheli. Sperduti developed the earliest known GNN-like architecture to be published at NeurIPS, in 1994. Goller and Küchler published an early form of backpropagation through structures in the 1990s. The term “GNN” was coined by Gori and Scarselli’s work in 2005, firmly establishing a term that remains very popular to this day, although the concurrent NN4G work of Micheli uses a mechanism that more closely resembles modern GNN implementations.
Another important class of graph neural networks, often referred to as “spec-
tral”, relied on the notion of the Graph Fourier transform (Bruna et al. 2013).
The roots of this construction are in the signal processing and computational
harmonic analysis communities, where dealing with non-Euclidean signals
became prominent in the late 2000s and early 2010s.⁵⁸ Influential papers by
Shuman et al. (2013) and Sandryhaila and Moura (2013) popularised the notion
of “Graph Signal Processing” (GSP) and the generalisation of Fourier trans-
forms based on the eigenvectors of graph adjacency and Laplacian matrices.
The graph convolutional neural network relying on spectral filters by Deffer-
rard, Bresson, and Vandergheynst (2016) and the GCN model from Kipf and
Welling (2016)—with its presentation inspired by spectral filters—are among
the most cited in the field.
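As a rough illustration of the ‘spectral’ viewpoint (a sketch under our own simplifications, using the combinatorial Laplacian, rather than the construction of any particular paper): the graph Fourier transform expands a node signal in the eigenbasis of the graph Laplacian, and a spectral filter rescales the resulting coefficients before transforming back.

```python
import numpy as np

def spectral_graph_filter(adj, signal, spectral_response):
    """Filter a signal defined on the nodes of a graph by acting on the Laplacian spectrum.
    adj: symmetric (n, n) adjacency matrix; signal: length-n node signal;
    spectral_response: elementwise function of the Laplacian eigenvalues."""
    laplacian = np.diag(adj.sum(axis=1)) - adj        # combinatorial graph Laplacian
    eigvals, eigvecs = np.linalg.eigh(laplacian)      # eigenvectors = graph Fourier basis
    signal_hat = eigvecs.T @ signal                   # graph Fourier transform
    filtered_hat = spectral_response(eigvals) * signal_hat  # filter in the spectral domain
    return eigvecs @ filtered_hat                     # inverse graph Fourier transform

# Example: a heat-kernel-style low-pass filter on a three-node path graph.
adj = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
smoothed = spectral_graph_filter(adj, np.array([1., 0., 1.]), lambda lam: np.exp(-lam))
```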
It is worth noting that, while the concept of GNNs experienced several inde-
pendent re-derivations in the 2010s arising from several perspectives (besides
the already mentioned connections to computational chemistry and signal pro-
cessing, we should highlight probabilistic graphical models (Dai, Dai, and
Song 2016) and natural language processing (Vaswani et al. 2017)), the fact
that all of these models arrived at different instances of a common blueprint
is certainly telling. In fact, it has recently been posited (Veličković 2022) that
GNNs may offer a universal framework for processing discretised data. Other
neural architectures we will discuss in this book (such as CNNs or RNNs) may
be recovered as special cases of GNNs, by inserting appropriate priors into the
message passing functions or the graph structure over which the messages are
computed.
In a somewhat ironic twist of fate, modern GNNs were triumphantly re-
introduced to chemistry (Figure 1.21), the field they originated from, by David
Duvenaud et al. (2015) as a replacement for Morgan’s handcrafted molecular
fingerprints, and by Justin Gilmer et al. (2017) in the form of message-passing
neural networks equivalent to the Weisfeiler-Lehman test. After fifty years,
the circle finally closed. At the time of writing, graph neural networks have
become a standard tool in chemistry and are already used in drug discovery and
design pipelines. A notable milestone was the GNN-based discovery of novel
antibiotic compounds (Stokes et al. 2020). DeepMind’s AlphaFold
2 (Jumper et al. 2021) used a form of GNNs in order to address a hallmark
problem in structural biology – the problem of protein folding.
Figure 1.21
The two works spearheaded by David Duvenaud and Justin Gilmer, respectively, have
greatly popularised the use of GNNs for both drug screening and quantum chemistry—
applications that remain prominent to this day.
In 1999, Andrey Lehman wrote to a mathematician colleague that he had
the “pleasure to learn that ‘Weisfeiler-Leman’ was known and still caused
interest”.⁵⁹ He did not live to see the rise of GNNs based on his work of
fifty years earlier. Nor did George Vlăduţ see the realisation of his ideas in
chemoinformatics, many of which remained on paper during his lifetime.
1.3 The ‘Erlangen Programme’ of Deep Learning
Our historical overview of the geometric foundations of deep learning has now
naturally brought us to the blueprint that underpins this book. Taking Convo-
lutional and Graph Neural Networks as two prototypical examples, at first
glance completely unrelated, we find several common traits. First, both operate
on data (images in the case of CNNs or molecules in the case of GNNs) that has
some underlying geometric domain (respectively, a grid or a graph). Second,
in both cases the tasks have a natural notion of invariance (e.g. to the position
of an object in image classification, or the numbering of atoms in a molecule
in chemical property prediction) that can be formulated through an appropriate
symmetry group (translation in the former example and permutation in the lat-
ter). Third, both CNNs and GNNs incorporate the respective symmetries as an
inductive bias by making their layers interact appropriately with the action of
the symmetry group on the input. In CNNs, it comes in the form of convolu-
tional layers whose output transforms in the same way as the input (we will
call this property translation-equivariance, which is the same as saying that
convolution commutes with the shift operator). In GNNs, it assumes the form
of a symmetric message passing function that aggregates the neighbour nodes
irrespective of their order, so that the overall output of a message passing layer
transforms in the same way under a permutation of the input (permutation
equivariance). Finally, in some architectural instances, the data are processed
in a multi-scale fashion; this corresponds to pooling in CNNs (uniformly sub-
sampling the grid that underlies the image) or graph coarsening in some types
of GNNs.
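The following minimal sketch (our own toy example, not any published model) makes the graph half of this argument explicit: a sum-aggregation message passing layer commutes with any relabelling of the nodes, mirroring the way convolution commutes with shifts.

```python
import numpy as np

def message_passing_layer(adj, features, w_self, w_neigh):
    """One sum-aggregation layer: h_i = tanh(W1 x_i + W2 * sum_j a_ij x_j)."""
    return np.tanh(features @ w_self + adj @ features @ w_neigh)

# Permutation equivariance check: relabelling the nodes of the input
# relabels the output in exactly the same way, i.e. f(PX, PAP^T) = P f(X, A).
rng = np.random.default_rng(0)
n, d = 5, 3
adj = rng.integers(0, 2, size=(n, n)).astype(float)
adj = np.triu(adj, 1)
adj = adj + adj.T                                  # symmetric adjacency, no self-loops
x = rng.normal(size=(n, d))
w1, w2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))

perm = rng.permutation(n)
P = np.eye(n)[perm]                                # permutation matrix

out = message_passing_layer(adj, x, w1, w2)
out_permuted = message_passing_layer(P @ adj @ P.T, P @ x, w1, w2)
assert np.allclose(P @ out, out_permuted)
```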
Overall, this appears to be a very general principle that can be applied to
a broad range of problems, types of data, and architectures. We will use the
Geometric Deep Learning Blueprint to derive from first principles some of
the most common and popular neural network architectures (CNNs, GNNs,
LSTMs, DeepSets, Transformers), which as of today constitute the majority
of the deep learning toolkit. As we will see in Chapters ?–?, all of the above
can be obtained by an appropriate choice of the domain and the associated
symmetry group.
Further extensions of the Blueprint, such as incorporating the symmetries
of the data in addition to those of the domain, make it possible to obtain a new class of
equivariant GNN architectures that will be discussed later in this book.
Such architectures have recently come into the limelight in molecular modelling, and
account for the fact that in molecular graphs (where nodes represent atoms and
edges are chemical bonds), the permutation of the nodes and the rotation of the
node features (atomic coordinates) are different symmetries.
Suggested Further Reading
On the topic of symmetry, Weyl’s classic (1952) is hard to beat. Straumann
(1996) provides a fascinating account of the development of early gauge the-
ories and Weyl’s role in it. An overview of the historical development of
non-Euclidean geometry is given by Faber (1983). Brannan, Esplen, and Gray
(2011) give a ‘Kleinian’ introduction to geometry, and Sharpe (2000) shows
Cartan’s generalisation of the Erlangen Programme. For readers interested in
the role of symmetry in physics we recommend the book Physics from Symme-
try (Schwichtenberg 2015), which lives up to the promise in its title, and the
magnum opus of Sir Roger Penrose (2005). For biographical notes on the life
and work of Klein and Noether, see Tobies (2019) and Quigg (2019); the book of
Uspensky gives a good overview of the intellectual and political environment
in which Weisfeiler and Lehman worked. Finally, we invite the readers to actu-
ally read the controversial book of Minsky and Papert (1969) and its shorter
critical review by Block (1970).
Notes
1
Fully titled Strena, Seu De Nive Sexangula (‘New Year’s gift, or on the Six-
Cornered Snowflake’, see Figure 1.1), it was, as suggested by the title, a small
booklet sent by Kepler in 1611 as a Christmas gift to his patron and friend
Johannes Matthäus Wackher von Wackenfels.
2
Galois famously described the ideas of group theory (which he considered
in the context of finding solutions to polynomial equations) and coined the
term ‘group’ (groupe in French) in a letter to a friend written on the eve of his
fatal duel. He asked to communicate his ideas to prominent mathematicians
of the time, expressing the hope that they would be able to ‘decipher all this
mess’ (‘déchiffrer tout ce gâchis’). Galois died two days later from wounds
suffered in the duel aged only 20, but his work has been transformational in
mathematics.
3
Omar Khayyam is nowadays mainly remembered as a poet and author of the
immortal line “a flask of wine, a book of verse, and thou beside me.”
4
The publication of Euclides vindicatus required the approval of the Inqui-
sition, which came just a few months before the author’s death. Rediscovered
by the Italian differential geometer Eugenio Beltrami in the nineteenth cen-
tury, Saccheri’s work is now considered an early, almost-successful, attempt to
construct hyperbolic geometry.
5
Poncelet was a military engineer and participant of Napoleon’s Russian cam-
paign, where he was captured and held as prisoner until the end of the war. It
was during this captivity period that he wrote the Traité des propriétés projec-
tives des figures (‘Treatise on the projective properties of figures’, 1822) that
revived interest in projective geometry. Earlier foundational work on this
subject was done by his compatriot Gérard Desargues (1643).
6
In the 1832 letter to Farkas Bolyai following the publication of his son’s
results, Gauss famously wrote: “To praise it would amount to praising myself.
For the entire content of the work coincides almost exactly with my own
meditations which have occupied my mind for the past thirty or thirty-five
years.” Gauss was also the first to use the term ‘non-Euclidean geometry’ (in
a letter to Heinrich Christian Schumacher), referring stricto sensu to his own
construction of hyperbolic geometry (Faber 1983).
7
A model for hyperbolic geometry known as the pseudosphere, a surface
with constant negative curvature, was shown by Beltrami, who also proved that
hyperbolic geometry was logically consistent. The term ‘hyperbolic geometry’
was introduced by Klein in his 1873 paper Über die sogenannte nicht-
Euklidische Geometrie (‘On the so-called non-Euclidean geometry’). The
prefix ‘so-called’ might strike the reader as carrying a somewhat negative flavour, and indeed
it does: in his Erlangen Programme, Klein wrote that “with the name non-
Euclidean geometry have been associated a multitude of non-mathematical
ideas, which have been as zealously cherished by some as resolutely rejected
by others.” He did, however, drop the derogatory adjective in his later works.
8
For example, an 1834 pamphlet signed only with the initials “S.S.” (believed
by some to belong to Lobachevsky’s long-time opponent Ostrogradsky) claimed
that Lobachevsky made “an obscure and heavy theory” out of “the lightest and
clearest chapter of mathematics, geometry,” wondered why one would print such
“ridiculous fantasies,” and suggested that the book was a “joke or satire.”
9
According to a popular belief, repeated in many sources including
Wikipedia, the Erlangen Programme was delivered in Klein’s inaugural
address in October 1872. Klein indeed gave such a talk (though on December
7, 1872), but it was for a non-mathematical audience and concerned primarily
his ideas of mathematical education (Tobies 2019).
10
Klein’s projective model of hyperbolic geometry is often called the ‘Klein
disk’ or the ‘Cayley-Beltrami-Klein model’.
11
At the time, Göttingen was Germany’s and the world’s leading centre
of mathematics. Though Erlangen is proud of its association with Klein, he
stayed there for only three years, moving in 1875 to the Technical University
of Munich (then called Technische Hochschule), followed by Leipzig (1880),
and finally settling down in Göttingen from 1886 until his retirement.
12
Emmy Noether is rightfully regarded as one of the most important women in
mathematics and one of the greatest mathematicians of the twentieth century.
She was unlucky to be born and live in an epoch when the academic world
was still entrenched in the medieval beliefs of the unsuitability of women
for science. Her career as one of the few women in mathematics having to
overcome prejudice and contempt was a truly trailblazing one. It should be
said to the credit of her male colleagues that some of them tried to break the
rules. When Klein and David Hilbert first unsuccessfully attempted to secure
a teaching position for Noether at Göttingen, they met fierce opposition from
the academic hierarchs. Hilbert reportedly retorted sarcastically to concerns
brought up in one such discussion: “I do not see that the sex of the candidate
is an argument against her admission as a Privatdozent. After all, the Senate
is not a bathhouse.” (Reid 1976) Nevertheless, Noether enjoyed great esteem
among her close collaborators and students, and her male peers in Göttingen
affectionately referred to her as Der Noether, in the masculine (Quigg 2019).
13
Weyl first conjectured (incorrectly) in 1919 that invariance under the change
of scale or “gauge” was a local symmetry of electromagnetism. The term
gauge, or Eich in German, was chosen by analogy to the various track gauges
of railroads. After the development of quantum mechanics, Weyl (1929) mod-
ified the gauge choice by replacing the scale factor with a change of wave
phase. See Straumann (1996).
14
The Dartmouth Summer Research Project on Artificial Intelligence was
a 1956 summer workshop at Dartmouth College that is considered to be the
founding event of the field of artificial intelligence.
15
From ‘perception’ and the Greek suffix -τρoν denoting an instrument.
16
That is why the authors can only smile at similar recent claims about the
‘consciousness’ of deep neural networks: nihil sub sole novum.
17
Sic in quotes, as pattern recognition had not yet become a common term.
18
Specifically, Minsky and Papert (1969) considered binary classification
problems on a 2D grid (‘retina’ in their terminology) and a set of linear thresh-
old functions. While the inability to compute the XOR function is always
brought up as the main point of criticism in the book, much of the attention
was dedicated to geometric predicates such as parity and connectedness. This
problem is alluded to on the cover of the book, which is adorned by two pat-
terns: one is connected, one is not. Even for a human it is very difficult to
determine which is which.
19
Kussul et al. (2001) showed that Rosenblatt’s 3-layer perceptron trained
with modern methods and implemented on 21st-century hardware was able
to achieve an accuracy of 99.2% on the MNIST digit recognition task, on par
with modern models.
20
Hilbert’s Thirteenth Problem is one of the 23 problems compiled by
David Hilbert in 1900, asking whether the solution of a general 7th-degree
equation can be expressed using continuous functions of two arguments.
Kolmogorov and his student Arnold solved a generalised version of this problem;
the result is now known as the Arnold–Kolmogorov Superposition Theorem.
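In its commonly cited form (quoted here as a gloss, with the usual notation), the superposition theorem states that any continuous function on the n-dimensional unit cube decomposes into sums and compositions of continuous univariate functions:

```latex
f(x_1,\dots,x_n) \;=\; \sum_{q=0}^{2n} \Phi_q\!\left( \sum_{p=1}^{n} \psi_{q,p}(x_p) \right),
\qquad \Phi_q \colon \mathbb{R}\to\mathbb{R}, \quad \psi_{q,p} \colon [0,1]\to\mathbb{R} \ \ \text{continuous.}
```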
21
Computational geometry has subject code I.3.5 in the ACM Computing
Classification System.
22
Backpropagation is based on the chain rule of differentiation that itself
goes back to the co-inventor of differential calculus Gottfried Wilhelm von
Leibniz in 1676. A precursor of backpropagation was used by Kelley (1960)
to perform optimisation of complex nonlinear multi-stage systems. Efficient
backpropagation that is still in use today was described by Seppo Linnainmaa
(1970) in his master’s thesis in Finnish. The earliest use in neural networks is
due to Paul Werbos (1982), which is usually cited as the origin of the method.
See Schmidhuber (2015).
23
There are even examples of continuous nowhere differentiable functions
such as the construction of Weierstrass (1872).
24
Roughly, Lipschitz-continuous functions do not arbitrarily shrink or expand
the distance between points on the domain. For differentiable functions, Lip-
schitz continuity can be expressed as an upper bound on the norm of the
gradient, implying that the function does not ‘jump’ too abruptly.
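In symbols (our gloss on this note): a function f is L-Lipschitz if it never stretches distances by more than a factor of L, and for differentiable f this can be expressed as a uniform bound on the norm of the gradient:

```latex
\|f(x) - f(y)\| \,\le\, L\,\|x - y\| \quad \text{for all } x, y;
\qquad \text{for differentiable } f:\quad \sup_{x}\,\|\nabla f(x)\| \,\le\, L .
```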
25
The first to use the term was Richard Bellman (1957) in the preface of
his book Dynamic Programming, where he refers to dimensionality as ‘a curse
which has hung over the head of the physicist and astronomer for many a year’.
26
The number of protons in the observable universe, known as the Eddington
number, is estimated at 10⁸⁰.
27
The term ‘receptive field’ predates Hubel and Wiesel and was used by
neurophysiologists from the early twentieth century, see Sherrington (1906).
28
The term ‘grandmother cell’ is likely to have first appeared in Jerry Lettvin’s
course ‘Biological Foundations for Perception and Knowledge’ held at MIT in
1969. A similar concept of ‘gnostic neurons’ was introduced two years earlier
in a book by the Polish neuroscientist Jerzy Konorski (1967). See Gross (2002).
29
The name ‘neocognitron’ suggests it was an improved version of an earlier
architecture of Fukushima (1975), the cognitron.
30
In the words of the author himself, having an output that is “dependent
only upon the shape of the stimulus pattern, and is not affected by the position
where the pattern is presented”.
31
ReLU-type activations date back to at least the 1960s and have been
previously employed by Fukushima (1969, 1975).
32
Université Pierre-et-Marie-Curie, today part of the Sorbonne University.
33
In LeCun’s 1989 paper, the architecture was not named; the term ‘con-
volutional neural network’ or ‘convnet’ would appear in a later paper in
1998.
34
LeCun’s first CNN was trained on a CPU (a SUN-4/250 machine). However,
the image recognition system using a trained CNN was run on AT&T DSP-32C
(a second-generation digital signal processor with 256KB of memory capa-
ble of performing 125m floating point multiply-and-accumulate operations per
second with 32-bit precision), achieving over 30 classifications per second.
35
One of the most popular feature descriptors was the scale-invariant feature
transform (SIFT), introduced by David Lowe (1999). It is one of the most cited
computer vision papers.
36
A prototypical approach was the “bag-of-words” model, representing images as
histograms of vector-quantised local descriptors (Sivic and Zisserman 2003).
37
As it often happens, it is hard to point to the first RNN design. Recurrence in
neural networks was already described by McCulloch and Pitts (1943), who noted
that “the nervous system contains many circular paths” and referred to “pre-
cise specification of these implications by means of recursive functions and
determination of those that can be embodied in the activity of nervous nets”.
The book of Minsky (1967) uses McCulloch-Pitts neurons and calls recurrent
architectures “networks with cycles.” The technical report of Rumelhart, Hin-
ton, and Williams (1985) (an earlier version of the famous Nature paper (1986)
by the same authors) contains generalisations for learning in RNNs, which are
named “recurrent nets,” and credited to Minsky and Papert (1969).
38
The paper presenting LSTMs was initially rejected from NIPS in 1995 and
appeared only two years later as Hochreiter and Schmidhuber (1997).
39
In particular, the group of Jürgen Schmidhuber developed deep large-scale
CNN models that won several vision competitions, including Chinese charac-
ter recognition (Cireşan et al. 2010) and traffic sign recognition (Ciresan et
al. 2012).
40
AlexNet achieved an error rate more than 10.8 percentage points lower than that of the runner-up.
41
AlexNet had eleven layers and was trained on 1.2M images from ImageNet (for
comparison, LeNet-5 had five layers and was trained on 60K MNIST digits).
Additional important changes compared to LeNet-5 were the use of ReLU acti-
vation (instead of tanh), maximum pooling, dropout regularisation, and data
augmentation.
42
It took nearly a week to train AlexNet on a pair of GPUs.
43
Though GPUs were initially designed for graphics applications, they turned
out to be a convenient hardware platform for general-purpose computations
(“GPGPU”). The first such works implemented linear algebra algorithms (Krüger and
Westermann 2005). The first use of GPUs for neural networks was by Oh and
Jung (2004), predating AlexNet by nearly a decade.
44
Originally Pharmaceutisches Central-Blatt, it was the oldest German
chemical abstracts journal published between 1830 and 1969.
45
The Gmelins Handbuch, named after Leopold Gmelin who published the first version
in 1817, saw its last print edition appear in the 1990s. The database
currently contains 1.5 million compounds and 1.3 million different reactions
discovered between 1772 and 1995.
46
In 1906, the American Chemical Society authorised the publication of
Chemical Abstracts, charging it with the mission of abstracting the world’s
literature of chemistry and assigning an initial budget of fifteen thousand dol-
lars. The first publication appeared in 1907 under the stewardship of William
Noyes. Over a century after its establishment, the CAS registry contains nearly 200
million organic and inorganic substances disclosed in publications since the
early 1800s.
47
A special purpose computer (“Electronic Structural Correlator”), to be used
in conjunction with a punch card sorter, was proposed by the Gordon-Kendall-
Davison group in connection with their system of chemical ciphering, but never
built.
48
There were several contemporaneous systems that competed with each
other, see Wiswesser (1968).
49
For example, the association of the benzene ring with odoriferous proper-
ties was the reason for the naming of the chemical class of aromatic compounds
in the 19th century.
50
Not much biographical information is available about Harry Morgan.
According to an obituary, after publishing his famous molecular fingerprints
paper, he moved to a managerial position at IBM, where he stayed until
retirement in 1993. He died in 2007.
51
According to Vladimir Uspensky, Vlăduţ told the anecdote of his encounter
with Beilstein in the first lecture of his undergraduate course on organic chem-
istry during his Patterson-Crane Award acceptance speech at the American
Chemical Society.
52
We were unable to find solid proof of whether or how Weisfeiler and
Lehman interacted with Vlăduţ, as most of the people who had known both are
now dead. The strongest evidence is a comment in their classical paper (Weis-
feiler and Leman 1968) acknowledging Vlăduţ for “formulating the problem”.
It is also certain that Weisfeiler and Lehman were aware of the methods devel-
oped in the chemical community, in particular the method of Morgan (1965),
which they cited in their paper as a “similar procedure”.
53
Andrey Lehman’s surname is often also spelled Leman, a variant that he
preferred himself, stating in an email that the former spelling arose from a
book by the German publisher Springer, who believed “that every Leman is
a hidden Lehman”. Since Lehman’s family had Teutonic origins by his own
admission, we stick here to the German spelling.
54
Lehman unsuccessfully attempted to defend a thesis based on his work on
graph isomorphism in 1971, which was rejected due to the personal enmity of
the head of the dissertation committee with the verdict “it is not mathematics”.
To this, Lehman bitterly responded: “I am not a mathematician, I am a
programmer”. He eventually defended another dissertation in 1973, on topics in
databases.
55
There are in fact multiple versions of the Weisfeiler-Lehman test. The original
paper described what is now called the “2-WL” test, which is, however,
equivalent to 1-WL, also known as the node colour refinement algorithm.
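For concreteness, a minimal sketch of the node colour refinement (1-WL) procedure mentioned here; the graph representation, the relabelling used in place of hashing, and the uniform initial colouring are our own simplifications.

```python
def wl_colour_refinement(adj_list, max_rounds=None):
    """1-WL / node colour refinement: repeatedly combine each node's colour with the
    multiset of its neighbours' colours until the colouring stops changing."""
    n = len(adj_list)
    colours = [0] * n                                # uniform initial colouring
    for _ in range(max_rounds or n):
        signatures = [
            (colours[v], tuple(sorted(colours[u] for u in adj_list[v])))
            for v in range(n)
        ]
        # Relabel the signatures with small integers so successive rounds stay comparable.
        palette = {sig: i for i, sig in enumerate(sorted(set(signatures)))}
        new_colours = [palette[sig] for sig in signatures]
        if new_colours == colours:                   # stable colouring reached
            break
        colours = new_colours
    return colours

# Two graphs can be isomorphic only if their colour histograms agree; e.g. on the
# path graph [[1], [0, 2], [1]] the endpoints and the middle node receive different colours.
```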
56
Since little biographical information is available in English on our heroes,
we will use this note to outline the rest of their careers. All three ended
up in the United States. George Vlăduţ applied for emigration in 1974, which
was a shock to his bosses and resulted in his demotion from his post as head
of laboratory (emigration was considered a “mortal sin” in the USSR; to people
from the West, it is now hard to imagine what an epic effort it used to be for
Soviet citizens). Vlăduţ left his family behind and worked at the Institute
for Scientific Information in Philadelphia until his death in 1990. Being of
Jewish origin, Boris Weisfeiler decided to emigrate in 1975 due to growing
official antisemitism in the USSR; the last straw was the refusal to publish
a monograph on which he had worked extensively, as too many authors had
“non-Russian surnames”. He became a professor at Pennsylvania State University
working on algebraic geometry after a short period of stay at the Institute for
Advanced Study in Princeton. An avid mountaineer, he disappeared during a
hike in Chile in 1985. Andrey Lehman left the USSR in 1990 and subsequently
worked as a programmer in multiple American startups. He died in 2012.
57
In the chemical community, multiple works proposed GNN-like models
including Kireev (1995), Baskin, Palyulin, and Zefirov (1997), and Merkwirth
and Lengauer (2005).
58
It is worth noting that in the field of computer graphics and geometry pro-
cessing, non-Euclidean harmonic analysis predates Graph Signal Processing
by at least a decade. We can trace spectral filters on manifolds and meshes to
the works of Taubin, Zhang, and Golub (1996), Karni and Gotsman (2000),
and Lévy (2006).
59
From a private email sent by Lehman to Ilia Ponomarenko in 1999.