6
Grids
Grids are a common, particular type of graph that comes with additional geometric priors.
These priors are captured by the translation group, enabling the use of oriented features.
We describe how Fourier analysis provides the natural tool to fully characterise translation-equivariant maps, and we present canonical instances including convolutional and recurrent neural networks.
Going beyond pure translations, we demonstrate how studying equivariance to time warping necessitates the use of gating in RNNs, leading to the popular long short-term memory model.
The second type of objects we consider are grids. It is fair to say that the
impact of deep learning has been particularly dramatic in computer vision,
natural language processing, and speech recognition. These applications all
share a geometric common denominator: an underlying grid structure.
Grids are a particular case of graphs, but they come with specific structure that sets them apart. On the one hand, grids come with a canonical arrangement of nodes that makes the permutation invariance required for general graphs unnecessary. On the other hand, machine learning problems defined over grids often exhibit a strong geometric prior, encoded by the translation group: an object's identity does not depend on its absolute spatial location, or an amino acid sequence encodes similar structures independently of the location of their associated genes in the genome.
The translation group provides a particularly simple instantiation of the
GDL blueprint, thanks to the harmonic analysis perspective brought by Fourier
analysis. Indeed, since the translation group is commutative, we can explicitly
characterize the class of linear equivariant maps (that is, convolutions) using
a shared eigendecomposition: the Fourier transform. Moreover,
grids are discretizations of low-dimensional physical domains, where locality
is properly defined.
Figure 6.1
An illustration of the one-dimensional grid, with periodic boundary conditions.
In this chapter, we will first derive the fundamental links between convolutions and Fourier transforms. Then we will discuss the distinction between the graph and the grid formalisms, and finally move on to describe representative architectures for image and sequence applications.
6.1 Domain
A grid D of size N in s dimensions can best be viewed as the set Z_N^s, containing s-tuples of integers (j_1, ..., j_s) modulo N. It will be convenient to view the domain as having no boundary, and to interpret the grid as a discretization of an underlying torus T^s = R^s/Z^s. This domain is itself a group, with the operation inherited from the additive modular structure of either Z_N or R/Z: given u, v ∈ D, the point u + v is well-defined in D. We denote this group as the translation group.
Let us now investigate how the GDL blueprint instantiates in this setting.
For ease of exposition, we will focus on the one-dimensional setting (s = 1). Let us recall that the key questions we need to address are: (i) to characterise the class of linear equivariant maps associated with the symmetry group, and (ii) to implement locality.
Circulant matrices and convolutions We can view the one-dimensional grid as a ring graph (illustrated in Figure 6.1), with nodes indexed by 0, 1, ..., N − 1 modulo N, and the adjacency matrix with elements a_{u,(u+1) mod N} = 1 and zero otherwise.
As mentioned earlier, while it is tempting to reduce ourselves to the general graph formulation seen in Chapter 5, there are two main differences that set our present setting apart. First, each node u has identical connectivity, to its neighbours u − 1 and u + 1, and is thus structure-wise indistinguishable from the others. In the language of group theory, we say that the grid is a homogeneous space: for any two points u and v in the domain, there is a group element g ∈ G that maps u to g.u = v. Next, since the nodes of the grid have a fixed ordering, we also have a fixed ordering of the neighbours: we can call u − 1 the 'left neighbour' and u + 1 the 'right neighbour'. More generally, the grid allows us to define an orientation in a way that is globally consistent across all neighbourhoods. If we use our previous approach for designing an
equivariant function F using a local aggregation function ϕ, we now have f(x_u) = ϕ(x_{u−1}, x_u, x_{u+1}) at every node of the grid: ϕ does not need to be permutation invariant anymore. For a particular choice of a linear transformation ϕ(x_{u−1}, x_u, x_{u+1}) = θ_{−1} x_{u−1} + θ_0 x_u + θ_1 x_{u+1}, we can write F(X) as a matrix product,
F(X) = \begin{pmatrix}
θ_0 & θ_1 & & & θ_{−1} \\
θ_{−1} & θ_0 & θ_1 & & \\
 & \ddots & \ddots & \ddots & \\
 & & θ_{−1} & θ_0 & θ_1 \\
θ_1 & & & θ_{−1} & θ_0
\end{pmatrix}
\begin{pmatrix}
x_0 \\ x_1 \\ \vdots \\ x_{n−2} \\ x_{n−1}
\end{pmatrix}
Note this very special multi-diagonal structure with one element repeated
along each diagonal, sometimes referred to as “weight sharing” in the machine
learning literature.
More generally, given a vector θ = (θ_0, ..., θ_{n−1}), a circulant matrix C(θ) = (θ_{(u−v) mod n}) is obtained by appending circularly shifted versions of the vector θ. Circulant matrices are synonymous with discrete convolutions,^1

(x ⋆ θ)_u = \sum_{v=0}^{n−1} x_{v mod n} θ_{(u−v) mod n} ,

as one has C(θ)x = x ⋆ θ. A particular choice of θ = (0, 1, 0, ..., 0) yields a special circulant matrix that shifts vectors to the right by one position. This matrix is called the (right) shift or translation operator and denoted by S.^2
Circulant matrices can be characterised by their commutativity property: the product of circulant matrices is commutative, i.e. C(θ)C(η) = C(η)C(θ) for any θ and η. Since the shift is a circulant matrix, we get the familiar translation or shift equivariance of the convolution operator,

SC(θ)x = C(θ)Sx.

Such a commutativity property should not be surprising, since the underlying symmetry group (the translation group) is Abelian. Moreover, the opposite direction is true as well, i.e. a matrix is circulant iff it commutes with the shift. This, in turn, allows us to define convolution as a translation equivariant linear operation, and is a nice illustration of the power of geometric priors and the overall philosophy of Geometric ML: convolution emerges from the first principle of translational symmetry.
Note that unlike the situation on sets and graphs, the number of linearly independent shift-equivariant functions (convolutions) grows with the size of the domain (since we have one degree of freedom in each diagonal of a circulant matrix). However, the scale separation prior guarantees that filters can be local, resulting in the same Θ(1)-parameter complexity per layer, as we will verify in Section 6.2.1 when discussing the use of these principles in the implementation of Convolutional Neural Network architectures.
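To make the correspondence between circulant matrices, circular convolution, and shift equivariance concrete, here is a minimal NumPy sketch (the filter and variable names are purely illustrative):

```python
import numpy as np

n = 8
theta = np.zeros(n)
theta[[0, 1, -1]] = [0.5, 0.25, 0.25]   # a local filter supported on {-1, 0, +1}

# Circulant matrix C(theta), with entries theta[(u - v) mod n]
C = np.array([[theta[(u - v) % n] for v in range(n)] for u in range(n)])

# Right-shift (translation) operator S: (Sx)_u = x_{(u-1) mod n}
S = np.roll(np.eye(n), 1, axis=0)

x = np.random.randn(n)

# C(theta) x coincides with the circular convolution of x with theta
conv = np.array([sum(x[v] * theta[(u - v) % n] for v in range(n)) for u in range(n)])
assert np.allclose(C @ x, conv)

# Shift equivariance: S C(theta) x == C(theta) S x
assert np.allclose(S @ (C @ x), C @ (S @ x))
```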
Derivation of the discrete Fourier transform In Chapter 4 we introduced Fourier transforms, and we have already mentioned their connection to convolution: the fact that the Fourier transform diagonalises the convolution operation is an important property used in signal processing to perform convolution in the frequency domain as an element-wise product of the Fourier transforms. Let us verify this important property in our setting of one-dimensional grids. For this purpose, recall a fact from linear algebra: (diagonalisable) matrices are jointly diagonalisable iff they mutually commute.^3 In other words, there exists a common eigenbasis for all the circulant matrices, in which they differ only by their eigenvalues. We can therefore pick one circulant matrix and compute its eigenvectors; we are assured that these will be the eigenvectors of all other circulant matrices as well. It is convenient to pick the shift operator S, for which the eigenvectors happen to be the discrete Fourier basis^4
φ_k = \frac{1}{\sqrt{N}} \left(1, e^{2πik/N}, e^{4πik/N}, \ldots, e^{2πi(N−1)k/N}\right) , \qquad k = 0, 1, \ldots, N − 1,

which we can arrange into an N × N Fourier matrix Φ = (φ_0, ..., φ_{N−1}). Indeed, we verify directly that Sφ_k = e^{−2πik/N} φ_k.
Multiplication by Φ^∗ gives the Discrete Fourier Transform (DFT),^5 and by Φ the inverse DFT,

\hat{x}_k = \frac{1}{\sqrt{N}} \sum_{u=0}^{N−1} x_u e^{−2πiku/N} , \qquad x_u = \frac{1}{\sqrt{N}} \sum_{k=0}^{N−1} \hat{x}_k e^{+2πiku/N} .
Since all circulant matrices are jointly diagonalisable,^6 they are also diagonalised by the Fourier transform and differ only in their eigenvalues. Since the eigenvalues of the circulant matrix C(θ) are the Fourier transform of the filter (see e.g. Bamieh 2018), \hat{θ} = Φ^∗θ, we obtain the Convolution Theorem:

C(θ)x = Φ \begin{pmatrix} \hat{θ}_0 & & \\ & \ddots & \\ & & \hat{θ}_{n−1} \end{pmatrix} Φ^∗ x = Φ(\hat{θ} ⊙ \hat{x})
Because the Fourier matrix Φ has a special algebraic structure, the products Φ^∗x and Φx can be computed with O(n log n) complexity using a Fast Fourier Transform (FFT) algorithm. This is one of the reasons why frequency-domain filtering is so popular in signal processing; furthermore, the filter is typically designed directly in the frequency domain, so the Fourier transform \hat{θ} is never explicitly computed.
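A quick numerical check of the Convolution Theorem, continuing the sketch above (all names are illustrative): the Fourier matrix Φ diagonalises C(θ), the diagonal holds the DFT of θ, and filtering can equivalently be done in the frequency domain.

```python
import numpy as np

n = 8
theta = np.zeros(n)
theta[[0, 1, -1]] = [0.5, 0.25, 0.25]
x = np.random.randn(n)
C = np.array([[theta[(u - v) % n] for v in range(n)] for u in range(n)])

# Spatial-domain filtering equals frequency-domain filtering (circular convolution theorem)
assert np.allclose(C @ x, np.fft.ifft(np.fft.fft(theta) * np.fft.fft(x)).real)

# The Fourier matrix diagonalises C(theta); the diagonal is the DFT of theta
# (NumPy uses the unnormalised DFT convention)
Phi = np.array([[np.exp(2j * np.pi * k * u / n) for k in range(n)] for u in range(n)]) / np.sqrt(n)
D = Phi.conj().T @ C @ Phi
assert np.allclose(D, np.diag(np.fft.fft(theta)), atol=1e-12)
```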
Besides the didactic value of the derivation of the Fourier transform and convolution we have done here, it provides a scheme to generalise these concepts to graphs. Realising that the adjacency matrix of the ring graph is exactly the shift operator, one can develop the graph Fourier transform and an analogue of the convolution operator by computing the eigenvectors of the adjacency matrix (see e.g. Sandryhaila and Moura 2013). Early attempts to develop graph neural networks by analogy to CNNs, sometimes termed 'spectral GNNs', exploited this exact blueprint.^7
We will see in Chapter 8 that this analogy has some important limitations. The first limitation comes from the fact that a grid is fixed, and hence all signals on it can be represented in the same Fourier basis. In contrast, on general graphs, the Fourier basis depends on the structure of the graph. Hence, we cannot directly compare Fourier transforms on two different graphs, a problem that translates into a lack of generalisation in machine learning problems. Secondly, multi-dimensional grids, which are constructed as tensor products of one-dimensional grids, retain the underlying structure: the Fourier basis elements and the corresponding frequencies (eigenvalues) can be organised in multiple dimensions. In images, for example, we can naturally talk about horizontal and vertical frequencies, and filters have a notion of direction. On graphs, the structure of the Fourier domain is one-dimensional, as we can only organise the Fourier basis functions by the magnitude of the corresponding frequencies. As a result, graph filters are oblivious to direction, i.e. isotropic.
Derivation of the continuous Fourier transform We can revisit the previous analysis in the continuous setting where the grid is replaced by the real line. Like in Chapter 4, we consider functions defined on Ω = R and the translation operator (S_v f)(u) = f(u − v) shifting f by some position v. Applying S_v to the Fourier basis functions φ_ξ(u) = e^{iξu} yields, by associativity of the exponent,

S_v e^{iξu} = e^{iξ(u−v)} = e^{−iξv} e^{iξu} ,
i.e., φ_ξ(u) is the complex eigenvector of S_v with the complex eigenvalue e^{−iξv}, exactly mirroring the situation we had in the discrete setting. Since S_v is a unitary operator (i.e., ‖S_v x‖_p = ‖x‖_p for any p and x ∈ L_p(R)), any eigenvalue λ must satisfy |λ| = 1, which corresponds precisely to the eigenvalues e^{−iξv} found above. Moreover, the spectrum of the translation operator is simple, meaning that two functions sharing the same eigenvalue must necessarily be collinear. Indeed, suppose that S_v f = e^{−iξ_0 v} f for some ξ_0. Taking the Fourier transform of both sides, we obtain

∀ξ , \quad e^{−iξv} \hat{f}(ξ) = e^{−iξ_0 v} \hat{f}(ξ) ,

which implies that \hat{f}(ξ) = 0 for ξ ≠ ξ_0, and thus f = α φ_{ξ_0}.
For a general linear operator C that is translation equivariant (S_v C = C S_v), we have

S_v C e^{iξu} = C S_v e^{iξu} = e^{−iξv} C e^{iξu} ,

implying that C e^{iξu} is also an eigenfunction^8 of S_v with eigenvalue e^{−iξv}, from where it follows from the simplicity of the spectrum that C e^{iξu} = β φ_ξ(u); in other words, the Fourier basis is the eigenbasis of all translation equivariant operators. As a result, C is diagonal in the Fourier domain and can be expressed as C e^{iξu} = \hat{p}_C(ξ) e^{iξu}, where \hat{p}_C(ξ) is a transfer function acting on different frequencies ξ. Finally, for an arbitrary function x(u), by linearity,

(Cx)(u) = C \int_{−∞}^{+∞} \hat{x}(ξ) e^{iξu} dξ = \int_{−∞}^{+∞} \hat{x}(ξ) \hat{p}_C(ξ) e^{iξu} dξ = \int_{−∞}^{+∞} p_C(v) x(u − v) dv = (x ⋆ p_C)(u) ,^9

where p_C(u) is the inverse Fourier transform of \hat{p}_C(ξ). It thus follows that every linear translation equivariant operator is a convolution.
6.2 Model
Now that we have completely characterised the class of linear equivariant maps, given by convolutions x ↦ x ⋆ h, and that we can furthermore impose locality of these maps by simply constraining the support of the filter h, we can directly deploy the GDL blueprint from Chapter 4, yielding the household CNN (Convolutional Neural Network) and RNN (Recurrent Neural Network) architectures.
6.2.1 Convolutional Neural Networks
Convolutional Neural Networks are perhaps the earliest and most well known
example of deep learning architectures following the blueprint of Geometric
Deep Learning. In Section 6.1 we have fully characterised the class of linear
and local translation equivariant operators, given by convolutions C(θ)x = x ⋆ θ with a localised filter θ.^10 Let us first focus on scalar-valued ('single-channel' or 'grayscale') discretised images, where the domain is the grid Ω = [H] × [W], with u = (u_1, u_2) and x ∈ X(Ω, R).
Any convolution with a compactly supported filter of size H_f × W_f can be written as a linear combination of generators θ_{1,1}, ..., θ_{H_f,W_f}, given for example by the unit peaks θ_{vw}(u_1, u_2) = δ(u_1 − v, u_2 − w). Any local linear equivariant map is thus expressible as^11

F(x) = \sum_{v=1}^{H_f} \sum_{w=1}^{W_f} α_{vw} C(θ_{vw}) x , \qquad (6.75)
Figure 6.2
The process of convolving an image x with a filter C(θ). The filter parameters θ can
be expressed as a linear combination of generators θ_{vw}.
which, in coordinates, corresponds to the familiar 2D convolution (see Figure 6.2 for an overview):

F(x)_{uv} = \sum_{a=1}^{H_f} \sum_{b=1}^{W_f} α_{ab} x_{u+a, v+b} . \qquad (6.76)
Other choices of the basis θ_{vw} are also possible and will yield equivalent operations (for potentially different choices of α_{vw}). A popular example are directional derivatives θ_{vw}(u_1, u_2) = δ(u_1, u_2) − δ(u_1 − v, u_2 − w), for (v, w) ≠ (0, 0), taken together with the local average θ_0(u_1, u_2) = 1/(H_f W_f). In fact, directional derivatives can be considered a grid-specific analogue of diffusion processes on graphs, which we recover if we assume each pixel to be a node connected to its immediate neighbouring pixels in the grid.
When the scalar input channel is replaced by multiple channels (e.g., RGB colours, or more generally an arbitrary number of feature maps), the convolutional filter becomes a convolutional tensor expressing arbitrary linear combinations of input features into output feature maps. In coordinates, this can be expressed as

F(x)_{uvj} = \sum_{a=1}^{H_f} \sum_{b=1}^{W_f} \sum_{c=1}^{M} α_{jabc} x_{u+a, v+b, c} , \qquad j ∈ [N] , \qquad (6.77)
where M and N are respectively the number of input and output channels. This basic operation encompasses a broad class of neural network architectures which, as we will show in the next section, have had a profound impact across many areas of computer vision, signal processing, and beyond. Here, rather than dissecting the myriad of possible architectural variants of CNNs, we prefer to focus on some of the essential innovations that enabled their widespread use.
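As a concrete, deliberately naive illustration of Equation (6.77), the following NumPy sketch computes the multi-channel convolution with explicit loops; all names are ours, and the 'valid' boundary handling (no padding) is an assumption.

```python
import numpy as np

def conv2d_multichannel(x, alpha):
    """x: (H, W, M) input; alpha: (N, Hf, Wf, M) filter tensor.
    Returns F(x) with shape (H - Hf + 1, W - Wf + 1, N), following Eq. (6.77)."""
    H, W, M = x.shape
    N, Hf, Wf, _ = alpha.shape
    out = np.zeros((H - Hf + 1, W - Wf + 1, N))
    for u in range(out.shape[0]):
        for v in range(out.shape[1]):
            for j in range(N):
                # sum over the filter offsets (a, b) and the input channels c
                out[u, v, j] = np.sum(alpha[j] * x[u:u + Hf, v:v + Wf, :])
    return out

x = np.random.randn(28, 28, 3)        # e.g. a 28x28 RGB image (M = 3)
alpha = np.random.randn(16, 3, 3, 3)  # N = 16 output channels, 3x3 filters
print(conv2d_multichannel(x, alpha).shape)  # (26, 26, 16)
```

In practice one would use an optimised convolution primitive from a deep learning framework; the loop form only makes the weight sharing of Equation (6.77) explicit.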
Figure 6.3
ReLU, often considered a ‘modern’ architectural choice, was already used in the
Neocognitron (Fukushima 1980). Rectification is equivalent to the principle of demodu-
lation, which is fundamental in electrical engineering as the basis for many transmission
protocols, such as FM radio; and also has a prominent role in models for neuronal activ-
ity.
Efficient multiscale computation As discussed in the GDL template for general symmetries, extracting translation invariant features out of the convolutional operator F requires a non-linear step, e.g. the ubiquitous ReLU activation illustrated in Figure 6.3. Convolutional features are processed through a non-linear activation function σ, acting element-wise on the input, i.e., σ : X(Ω) → X(Ω), with σ(x)(u) = σ(x(u)). Perhaps the most popular example at the time of writing is the Rectified Linear Unit (ReLU): σ(x) = max(x, 0). This non-linearity effectively rectifies the signals, pushing their energy towards lower frequencies, and enabling the computation of high-order interactions across scales by iterating the construction.
Already in the early works of Fukushima 1980 and LeCun et al. 1998, CNNs and similar architectures had a multiscale structure, where after each convolutional layer (6.77) one performs a grid coarsening P : X(Ω) → X(Ω′), where the grid Ω′ has coarser resolution than Ω. This enables multiscale filters with effectively increasing receptive field, yet retaining a constant number of parameters per scale. Several signal coarsening strategies P (referred to as pooling) may be used; the most common are applying a low-pass anti-aliasing filter (e.g. local average) followed by grid downsampling, or non-linear max-pooling.
In summary, a 'vanilla' CNN layer can be expressed as the composition of the basic objects already introduced in our Geometric Deep Learning blueprint:

h = P(σ(F(x))) , \qquad (6.78)
i.e. an equivariant linear layer F, a coarsening operation P, and a non-linearity σ. It is also possible to perform translation invariant global pooling operations within CNNs. Intuitively, this involves each pixel (which, after several convolutions, summarises a patch centered around it) proposing the final representation of the image,^12 with the ultimate choice being guided by a form of aggregation of these proposals. A popular choice here is the average function, as its outputs will retain similar magnitudes irrespective of the image size (Springenberg et al. 2014).
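A minimal sketch of Equation (6.78), here written in PyTorch (our choice of framework, purely for illustration): an equivariant convolution F, a pointwise non-linearity σ, a coarsening P, and a final translation-invariant global average pooling.

```python
import torch
import torch.nn as nn

class VanillaCNNLayer(nn.Module):
    """One 'vanilla' CNN layer h = P(sigma(F(x))), as in Eq. (6.78)."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.F = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)  # equivariant linear map
        self.sigma = nn.ReLU()                                                    # pointwise non-linearity
        self.P = nn.MaxPool2d(kernel_size=2)                                      # grid coarsening (pooling)

    def forward(self, x):
        return self.P(self.sigma(self.F(x)))

cnn = nn.Sequential(VanillaCNNLayer(3, 16), VanillaCNNLayer(16, 32))
x = torch.randn(8, 3, 32, 32)        # a batch of RGB images
h = cnn(x)                           # shape (8, 32, 8, 8)
y = h.mean(dim=(-2, -1))             # global average pooling: translation-invariant (8, 32) features
```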
Prominent examples following this CNN blueprint (some of which we will
discuss next) are displayed in Figure 6.4.
Deep and Residual Networks A CNN architecture, in its simplest form, is therefore specified by hyperparameters (H_f^k, W_f^k, N_k, p_k)_{k ≤ K}, with M_{k+1} = N_k and p_k ∈ {0, 1} indicating whether grid coarsening is performed or not. While all these hyperparameters are important in practice, a particularly important question is to understand the role of depth K in CNN architectures, and what are the fundamental tradeoffs involved in choosing such a key hyperparameter, especially in relation to the filter sizes (H_f^k, W_f^k).
While a rigorous answer to this question is still elusive, mounting empirical evidence collected throughout the recent years suggests a favourable tradeoff towards deeper (large K) yet thinner (small (H_f^k, W_f^k)) models.^13 In this context, a crucial insight by He et al. 2016 was to reparametrise each convolutional layer to model a perturbation of the previous features, rather than a generic non-linear transformation:

h = P(x + σ(F(x))) . \qquad (6.79)
The resulting residual networks provide several key advantages over the previous formulation. In essence, the residual parametrisation is consistent with the view that the deep network is a discretisation of an underlying continuous dynamical system, modelled as an ordinary differential equation (ODE).^14 Crucially, learning a dynamical system by modeling its velocity turns out to be much easier than learning its position directly. In our learning setup, this translates into an optimisation landscape with more favorable geometry, leading to the ability to train much deeper architectures than was possible before.
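A sketch of the residual reparametrisation of Equation (6.79), again in PyTorch with illustrative names:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """h = P(x + sigma(F(x))): the layer models a perturbation of its input, Eq. (6.79)."""
    def __init__(self, channels, coarsen=False):
        super().__init__()
        self.F = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.sigma = nn.ReLU()
        self.P = nn.MaxPool2d(2) if coarsen else nn.Identity()  # optional grid coarsening

    def forward(self, x):
        return self.P(x + self.sigma(self.F(x)))

x = torch.randn(4, 16, 32, 32)
h = ResidualBlock(16, coarsen=True)(x)   # shape (4, 16, 16, 16)
```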
As will be discussed in future work, learning using deep neural networks defines a non-convex optimisation problem, which can be efficiently solved using gradient-descent methods under certain simplifying regimes. The key advantage of the ResNet parametrisation has been rigorously analysed in simple scenarios (Hardt and Ma 2016), and remains an active area of theoretical investigation. Finally, Neural ODEs (R. T. Chen et al. 2018) are a recent popular architecture that pushes the analogy with ODEs even further, by learning the parameters of the ODE ẋ = σ(F(x)) directly and relying on standard numerical integration.
Figure 6.4
Prominent examples of CNN architectures. Top-to-bottom: LeNet (LeCun et al. 1998),
AlexNet (Krizhevsky, Sutskever, and Hinton 2012), ResNet (He et al. 2016) and U-Net
(Ronneberger, Fischer, and Brox 2015). Drawn using the PlotNeuralNet package (Iqbal
2018).
Normalisation Another important algorithmic innovation that boosted the
empirical performance of CNNs significantly is the notion of normalisation. In
early models of neural activity, it was hypothesised that neurons perform some form of local 'gain control', where the layer coefficients x_k are replaced by x̃_k = σ_k^{−1}(x_k − µ_k). Here, µ_k and σ_k encode the first- and second-order moment information of x_k, respectively. Further, they can be either computed globally or locally.

In the context of Deep Learning, this principle was widely adopted through the batch normalisation layer (Ioffe and Szegedy 2015),^15 followed by several variants (Ba, Kiros, and Hinton 2016; Salimans and Kingma 2016; Ulyanov, Vedaldi, and Lempitsky 2016; Cooijmans et al. 2016; Wu and He 2018).
Despite some attempts to rigorously explain the benefits of normalisation in
terms of better conditioned optimisation landscapes (Santurkar et al. 2018), a
general theory that can provide guiding principles is still missing at the time of
writing.
Data augmentation While CNNs encode the geometric priors associated
with translation invariance and scale separation, they do not explicitly account
for other known transformations that preserve semantic information, e.g. lighting or colour changes, or small rotations and dilations. A pragmatic approach to incorporate these priors with minimal architectural changes is to perform data augmentation, where one manually applies said transformations to the input images and adds them into the training set.
Data augmentation has been empirically successful and is widely used—not
only to train state-of-the-art vision architectures, but also to prop up several
developments in self-supervised and causal representation learning (T. Chen
et al. 2020; Grill et al. 2020; Mitrovic et al. 2020). However, it is provably
sub-optimal in terms of sample complexity (Mei, Misiakiewicz, and Monta-
nari 2021; Bietti, Venturi, and Bruna 2021); a more efficient strategy considers
instead architectures with richer invariance groups, as discussed in Chapter 7.
6.2.2 Recurrent Neural Networks
Our discussion has thus far always assumed the inputs to be solely spatial
across a given domain. However, in many common use cases, the inputs can
also be considered sequential (e.g. video, text or speech). In this case, we
assume that the input consists of arbitrarily many steps, wherein at each step t
we are provided with an input signal, which we represent as X^(t) ∈ X(Ω^(t)).^16
While in general the domain can evolve in time together with the signals on it, it is typically assumed that the domain is kept fixed across all the t, i.e. Ω^(t) = Ω. Here, we will exclusively focus on this case, but note that exceptions
are common. Social networks are an example where one often has to account
for the domain changing through time, as new links are regularly created as
well as erased. The domain in this setting is often referred to as a dynamic
graph (D. Xu et al. 2020; Rossi, Chamberlain, et al. 2020).
Often, the individual X^(t) inputs will exhibit useful symmetries and hence may be nontrivially treated by any of our previously discussed architectures. Some common examples include: videos (Ω is a fixed grid, and signals are a sequence of frames); fMRI scans (Ω is a fixed mesh representing the geometry of the brain cortex, where different regions are activated at different times as a response to presented stimuli); and traffic flow networks (Ω is a fixed graph representing the road network, on which e.g. the average traffic speed is recorded at various nodes).
Let us assume an encoder function f(X^(t)) providing latent representations at the level of granularity appropriate for the problem and respectful of the symmetries of the input domain. As an example,^17 consider processing video frames: that is, at each timestep, we are given a grid-structured input represented as an n × d matrix X^(t), where n is the number of pixels (fixed in time) and d is the number of input channels (e.g. d = 3 for RGB frames). Further, we are interested in analysis at the level of entire frames, in which case it is appropriate to implement f as a translation invariant CNN, outputting a k-dimensional representation z^(t) = f(X^(t)) of the frame at time-step t.
We are now left with the task of appropriately summarising a sequence of vectors z^(t) across all the steps. A canonical way to dynamically aggregate this information in a way that respects the temporal progression of inputs and also easily allows for online arrival of novel data-points, is using a Recurrent Neural Network (RNN).^18 What we will show here is that RNNs are an interesting geometric architecture to study in their own right, since they implement a rather unusual type of symmetry over the inputs z^(t).
SimpleRNNs At each step, the recurrent neural network computes an m-dimensional summary vector h^(t) of all the input steps up to and including t. This (partial) summary is computed conditional on the current step's features and the previous step's summary, through a shared update function, R : R^k × R^m → R^m, as follows (see Figure 6.5 for a summary):

h^(t) = R(z^(t), h^(t−1)) \qquad (6.80)
and, as both z^(t) and h^(t−1) are flat vector representations, R may be most easily expressed as a single fully-connected neural network layer (often known as SimpleRNN;^19 see Elman (1990) and Jordan (1997)):
Figure 6.5
Illustration of processing video input with RNNs. Each input video frame X^(t) is processed using a shared function f, e.g. a translation invariant CNN, into a flat representation z^(t). Then the RNN update function R is iterated across these vectors, iteratively updating a summary vector h^(t) which summarises all the inputs up to and including z^(t). The computation is seeded with an initial summary vector h^(0), which may be either pre-determined or learnable.
h^(t) = σ(Wz^(t) + Uh^(t−1) + b) \qquad (6.81)
where W ∈ R^{m×k}, U ∈ R^{m×m} and b ∈ R^m are learnable parameters, and σ is an activation function. While this introduces loops in the network's computational graph, in practice the network is unrolled for an appropriate number of steps, allowing for backpropagation through time (Robinson and Fallside 1987; Werbos 1988; Mozer 1989) to be applied.
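A minimal NumPy sketch of the SimpleRNN update of Equation (6.81) and its unrolling over a sequence (shapes follow the text, z^(t) ∈ R^k and h^(t) ∈ R^m; all names are illustrative):

```python
import numpy as np

k, m, T = 4, 8, 10
rng = np.random.default_rng(0)
W = rng.normal(size=(m, k))   # input-to-summary weights
U = rng.normal(size=(m, m))   # summary-to-summary weights
b = np.zeros(m)

def rnn_update(z_t, h_prev):
    """One step of Eq. (6.81), h_t = sigma(W z_t + U h_prev + b), with sigma = tanh."""
    return np.tanh(W @ z_t + U @ h_prev + b)

zs = [rng.normal(size=k) for _ in range(T)]   # input sequence z^(1), ..., z^(T)
h = np.zeros(m)                               # initial summary h^(0)
for z_t in zs:                                # unroll the recurrence through time
    h = rnn_update(z_t, h)
# h now summarises the entire sequence and could be passed to a classifier
```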
The summary vectors may then be appropriately leveraged for the downstream task: if a prediction is required at every step of the sequence, then a shared predictor may be applied to each h^(t) individually. For classifying entire sequences, typically the final summary, h^(T), is passed to a classifier. Here, T is the length of the sequence.
Specifically, the initial summary vector is usually either set to the zero-vector, i.e. h^(0) = 0, or it is made learnable. Analysing the manner in which the initial summary vector is set also allows us to deduce an interesting form of translation equivariance exhibited by RNNs.
Translation equivariance in RNNs Since we interpret the individual steps t as discrete time-steps, the input vectors z^(t) can be seen as living on a one-dimensional^20 grid of time-steps. While it might be attractive to attempt extending our translation equivariance analysis from CNNs here, it cannot be done in a trivial manner.
To see why, let us assume that we have produced a new sequence z′^(t) = z^(t+1) by performing a left-shift of our sequence by one step. It might be tempting to attempt showing h′^(t) = h^(t+1), as one expects with translation equivariance; however, this will not generally hold. Consider t = 1; directly applying and expanding the update function, we recover the following:

h′^(1) = R(z′^(1), h^(0)) = R(z^(2), h^(0)) \qquad (6.82)
h^(2) = R(z^(2), h^(1)) = R(z^(2), R(z^(1), h^(0))) \qquad (6.83)

Hence, unless we can guarantee that h^(0) = R(z^(1), h^(0)), we will not recover translation equivariance. Similar analysis can then be done for steps t > 1.
Fortunately, with a slight refactoring of how we represent z, and for a suitable choice of R, it is possible to satisfy the equality above, and hence demonstrate a setting in which RNNs are equivariant to shifts. Our problem was largely one of boundary conditions: the equality above includes z^(1), which our left-shift operation destroyed. To abstract this problem away, we will observe how an RNN processes an appropriately left-padded sequence, z̄^(t), defined as follows:

\bar{z}^{(t)} = \begin{cases} 0 & t \le t' \\ z^{(t - t')} & t > t' \end{cases}

Such a sequence now allows for left-shifting^21 by up to t′ steps without destroying any of the original input elements. Further, note we do not need to handle right-shifting separately; indeed, equivariance to right shifts naturally follows from the RNN equations.
We can now again analyse the operation of the RNN over a left-shifted version of z̄^(t), which we denote by z̄′^(t) = z̄^(t+1), as we did in Equations 6.82–6.83:
h′^(1) = R(z̄′^(1), h^(0)) = R(z̄^(2), h^(0))
h^(2) = R(z̄^(2), h^(1)) = R(z̄^(2), R(z̄^(1), h^(0))) = R(z̄^(2), R(0, h^(0)))

where the substitution z̄^(1) = 0 holds as long as t′ ≥ 1, i.e. as long as any padding is applied.^22 Now, we can guarantee equivariance to left-shifting by one step (h′^(t) = h^(t+1)) as long as h^(0) = R(0, h^(0)).
Said differently, h^(0) must be chosen to be a fixed point of a function γ(h) = R(0, h). If the update function R is conveniently chosen, then not only can we guarantee existence of such fixed points, but we can even directly obtain them by iterating the application of R until convergence; e.g., as follows:

h_0 = 0 , \qquad h_{k+1} = γ(h_k) , \qquad (6.84)

where the index k refers to the iteration of R in our computation, as opposed to the index (t) denoting the time step of the RNN. If we choose R such that γ is a contraction mapping,^23 such an iteration will indeed converge to a unique fixed point. Accordingly, we can then iterate Equation (6.84) until h_{k+1} = h_k, and we can set h^(0) = h_k. Note that this computation is equivalent to left-padding the sequence with "sufficiently many" zero-vectors.
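A small sketch of the fixed-point computation of Equation (6.84), continuing the NumPy SimpleRNN above; scaling down the recurrent weights U so that γ is (very likely) a contraction is our own illustrative assumption:

```python
import numpy as np

k, m = 4, 8
rng = np.random.default_rng(0)
W = rng.normal(size=(m, k))
U = 0.1 * rng.normal(size=(m, m))   # small recurrent weights, so gamma contracts
b = 0.5 * rng.normal(size=m)

def R(z_t, h_prev):
    return np.tanh(W @ z_t + U @ h_prev + b)

def gamma(h):
    return R(np.zeros(k), h)         # gamma(h) = R(0, h)

h = np.zeros(m)                      # h_0 = 0
for _ in range(100):                 # iterate h_{k+1} = gamma(h_k) until convergence, Eq. (6.84)
    h_next = gamma(h)
    if np.allclose(h_next, h, atol=1e-12):
        break
    h = h_next

h0 = h                               # use the fixed point as the initial summary h^(0)
assert np.allclose(gamma(h0), h0, atol=1e-8)
```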
Depth in RNNs It is also easy to stack multiple RNNs: simply use the h^(t) vectors as an input sequence for a second RNN. This kind of construction is occasionally called a “deep RNN”, which is potentially misleading. Effectively, due to the repeated application of the recurrent operation, even a single RNN “layer” has depth equal to the number of input steps.
This often introduces uniquely challenging learning dynamics when opti-
mising RNNs, as each training example induces many gradient updates to
the shared parameters of the update network. Here we will focus on perhaps
the most prominent such issue—that of vanishing and exploding gradients
(Bengio, Simard, and Frasconi 1994)—which is especially problematic in
RNNs, given their depth and parameter sharing. Further, it has single-handedly
spurred some of the most influential research on RNNs. For a more detailed
overview, we refer the reader to Pascanu, Mikolov, and Bengio (2013), who
have studied the training dynamics of RNNs in great detail, and exposed these
challenges from a variety of perspectives: analytical, geometrical, and the lens
of dynamical systems.
To illustrate vanishing gradients, consider a SimpleRNN with a sigmoidal activation function σ, whose derivative magnitude |σ′| is always between 0 and 1; see Figure 6.6. Multiplying many such values results in gradients that quickly tend to zero, implying that early steps in the input sequence may not be able to have any influence in updating the network parameters at all.
Figure 6.6
Illustration of the logistic function, σ(x) = 1/(1 + exp(−x)). Another popular choice is the hyperbolic tangent, σ(x) = tanh x. They are called sigmoidal due to the distinct S-shape of their plots.
For example, consider the next-word prediction task (common in e.g. predictive keyboards), and the input text “Petar is Serbian. He was born on ... [long paragraph] ... Petar currently lives in ______”. Here, predicting the next word as “Serbia” may only be reasonably concluded by considering the very start of the paragraph, but gradients have likely vanished by the time they reach this input step, making learning from such examples very challenging.
Deep feedforward neural networks have also suffered from the vanishing
gradient problem, until the invention of the ReLU activation (which has gradi-
ents equal to exactly zero or one—thus fixing the vanishing gradient problem).
However, in RNNs, using ReLUs may easily lead to exploding gradients, as the
output space of the update function is now unbounded, and gradient descent
will update the cell once for every input step, quickly building up the scale of
the updates. Historically, the vanishing gradient phenomenon was recognised
early on as a significant obstacle in the use of recurrent networks. Coping with
this problem motivated the development of more sophisticated RNN layers,
which we describe next.
6.3 Case Study: Long Short-Term Memory
A key invention that significantly reduced the effects of vanishing gradients
in RNNs is that of gating mechanisms, which allow the network to selectively
overwrite information in a data-driven way. Prominent examples of these gated
RNNs include the Long Short-Term Memory (LSTM; Hochreiter and Schmidhuber (1997)) and the Gated Recurrent Unit (GRU; Cho et al. (2014)). Here
we will primarily discuss the LSTM—specifically, the variant presented by
Graves (2013)—in order to illustrate the operations of such models. Concepts
from LSTMs easily carry over to other gated RNNs. Throughout this section, it will likely be useful to refer to Figure 6.7, which illustrates all of the LSTM operations that we will discuss in the text.
Figure 6.7
The dataflow of the long short-term memory (LSTM), with its components and memory cell (M) clearly highlighted. Based on the current input z^(t), previous summary h^(t−1) and previous cell state c^(t−1), the LSTM predicts the updated cell state c^(t) and summary h^(t).
The LSTM augments the recurrent computation by introducing a memory cell, which stores cell state vectors, c^(t) ∈ R^m, that are preserved between computational steps. The LSTM computes summary vectors, h^(t), directly based on c^(t), and c^(t) is, in turn, computed using z^(t), h^(t−1) and c^(t−1). Critically, the cell is not completely overwritten based on z^(t) and h^(t−1), which would expose the network to the same issues as the SimpleRNN. Instead, a certain quantity of the previous cell state may be retained, and the proportion by which this occurs is explicitly learned from data.
Just like in SimpleRNN, we compute features by using a single fully-connected neural network layer over the current input step and previous summary:^24
c̃^(t) = tanh(W_c z^(t) + U_c h^(t−1) + b_c) \qquad (6.85)
But, as mentioned, we do not allow all of this vector to enter the cell, hence why we call it the vector of candidate features, and denote it as c̃^(t). Instead, the LSTM directly learns gating vectors, which are real-valued vectors in the range [0, 1], and decide how much of the signal should be allowed to enter, exit, and overwrite the memory cell.
Three such gates are computed, all based on z^(t) and h^(t−1): the input gate i^(t), which computes the proportion of the candidate vector allowed to enter the cell; the forget gate f^(t), which computes the proportion of the previous cell state to be retained; and the output gate o^(t), which computes the proportion
of the new cell state to be used for the final summary vector. Typically all of
these gates are also derived using a single fully connected layer, albeit with the logistic sigmoid activation logistic(x) = 1/(1 + exp(−x)), in order to guarantee that the outputs are in the [0, 1] range:^25
i^(t) = logistic(W_i z^(t) + U_i h^(t−1) + b_i) \qquad (6.86)
f^(t) = logistic(W_f z^(t) + U_f h^(t−1) + b_f) \qquad (6.87)
o^(t) = logistic(W_o z^(t) + U_o h^(t−1) + b_o) \qquad (6.88)
Finally, these gates are appropriately applied to decode the new cell state, c^(t), which is then modulated by the output gate to produce the summary vector h^(t), as follows:
c^(t) = i^(t) ⊙ c̃^(t) + f^(t) ⊙ c^(t−1) \qquad (6.89)
h^(t) = o^(t) ⊙ tanh(c^(t)) \qquad (6.90)
where ⊙ is element-wise vector multiplication. Applied together, Equations (6.85)–(6.90) completely specify the update rule for the LSTM, which now takes into account the cell vector c^(t) as well:^26

(h^(t), c^(t)) = R(z^(t), (h^(t−1), c^(t−1)))
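Putting Equations (6.85)–(6.90) together, a compact NumPy sketch of a single LSTM step (parameter shapes and names are illustrative):

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(z_t, h_prev, c_prev, params):
    """One LSTM update (h_t, c_t) = R(z_t, (h_prev, c_prev)), Eqs. (6.85)-(6.90)."""
    Wc, Uc, bc, Wi, Ui, bi, Wf, Uf, bf, Wo, Uo, bo = params
    c_tilde = np.tanh(Wc @ z_t + Uc @ h_prev + bc)   # candidate features   (6.85)
    i_t = logistic(Wi @ z_t + Ui @ h_prev + bi)      # input gate           (6.86)
    f_t = logistic(Wf @ z_t + Uf @ h_prev + bf)      # forget gate          (6.87)
    o_t = logistic(Wo @ z_t + Uo @ h_prev + bo)      # output gate          (6.88)
    c_t = i_t * c_tilde + f_t * c_prev               # new cell state       (6.89)
    h_t = o_t * np.tanh(c_t)                         # new summary          (6.90)
    return h_t, c_t

k, m = 4, 8
rng = np.random.default_rng(0)
params = [rng.normal(size=(m, k)) if name == 'W' else
          rng.normal(size=(m, m)) if name == 'U' else np.zeros(m)
          for name in ['W', 'U', 'b'] * 4]
h, c = np.zeros(m), np.zeros(m)
for z_t in [rng.normal(size=k) for _ in range(10)]:
    h, c = lstm_step(z_t, h, c, params)
```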
Note that, as the values of f^(t) are derived from z^(t) and h^(t−1), and are therefore directly learnable from data, the LSTM effectively learns how to appropriately forget past experiences. Indeed, the values of f^(t) directly appear in the backpropagation update for all the LSTM parameters (W_∗, U_∗, b_∗), allowing the network to explicitly control, in a data-driven way, the degree of vanishing for the gradients across the time steps.
Besides tackling the vanishing gradient issue head-on, it turns out that
gated RNNs also unlock a very useful form of invariance to time-warping
transformations, which remains out of reach of SimpleRNNs.
Time warping invariance of gated RNNs We will start by illustrating, in a continuous-time setting,^27 what it means to warp time, and what is required of a recurrent model in order to achieve invariance to such transformations. Our exposition will largely follow the work of Tallec and Ollivier (2018), who initially described this phenomenon; indeed, they were among the first to actually study RNNs from the lens of invariances.
Let us assume a continuous time-domain signal z(t), on which we would like to apply an RNN. To align the RNN's discrete-time computation of summary vectors h^(t),^28 with an analogue in the continuous domain, h(t), we will observe
Figure 6.8
Time-warping operations can be simple, such as time rescaling; e.g. τ(t) = 0.7t (displayed here), which, in a discrete setting, would amount to new inputs being received every 1.43 steps. However, it also admits a wide spectrum of variably-changing sampling rates, e.g. sampling may freely accelerate or decelerate throughout the time domain.
its linear Taylor expansion:

h(t + δ) ≈ h(t) + δ \frac{dh(t)}{dt} \qquad (6.91)

and, setting δ = 1, we recover a relationship between h(t) and h(t + 1), which is exactly what the RNN update function R (Equation 6.80) computes. Namely, the RNN update function satisfies the following differential equation:

\frac{dh(t)}{dt} = h(t + 1) − h(t) = R(z(t + 1), h(t)) − h(t) \qquad (6.92)
We would like the RNN to be resilient to the way in which the signal is sampled (e.g. by changing the time unit of measurement), in order to account for any imperfections or irregularities therein. Formally, we denote a time warping operation τ : R_+ → R_+ as any monotonically increasing differentiable mapping between times. The notation τ is chosen because time warping represents an automorphism of time. See Figure 6.8 for an illustration of warping.
Further, we state that a class of models is invariant to time warping if, for
any model of the class and any such τ, there exists another (possibly the same)
model from the class that processes the warped data in the same way as the
original model did in the non-warped case.
This is a potentially very useful property. If we have an RNN class capable
of modelling short-term dependencies well, and we can also show that this
class is invariant to time warping, then we know it is possible to train such a
model in a way that will usefully capture long-term dependencies as well (as
they would correspond to a time dilation warping of a signal with short-term
dependencies). As we will shortly see, it is no coincidence that gated RNN
models such as the LSTM were proposed to model long-range dependencies.
Achieving time warping invariance is tightly coupled with the presence of gating
mechanisms, such as the input/forget/output gates of LSTMs.
When time gets warped by τ, the signal observed by the RNN at time t is z(τ(t)) and, to remain invariant to such warpings, it should predict an equivalently-warped summary function h(τ(t)). Using Taylor expansion arguments once more, we derive a form of Equation 6.92 for the warped time, that the RNN update R should satisfy:

\frac{dh(τ(t))}{dτ(t)} = R(z(τ(t + 1)), h(τ(t))) − h(τ(t)) \qquad (6.93)

However, the above derivative is computed with respect to the warped time τ(t), and hence does not take into account the original signal. To make our model take into account the warping transformation explicitly, we need to differentiate the warped summary function with respect to t. Applying the chain rule, this yields the following differential equation:

\frac{dh(τ(t))}{dt} = \frac{dh(τ(t))}{dτ(t)} \frac{dτ(t)}{dt} = \frac{dτ(t)}{dt} R(z(τ(t + 1)), h(τ(t))) − \frac{dτ(t)}{dt} h(τ(t)) \qquad (6.94)
and, for our (continuous-time) RNN to remain invariant to any time warping τ(t), it needs to be able to explicitly represent the derivative dτ(t)/dt, which is not assumed known upfront! We need to introduce a learnable function Γ which approximates this derivative. For example, Γ could be a neural network taking into account z(t + 1) and h(t) and predicting scalar outputs.
Now, remark that, from the point of view of a discrete RNN model under time warping, its input z^(t) will correspond to z(τ(t)), and its summary h^(t) will correspond to h(τ(t)). To obtain the required relationship of h^(t) to h^(t+1) in order to remain invariant to time warping, we will use a one-step Taylor expansion of h(τ(t)):

h(τ(t + δ)) ≈ h(τ(t)) + δ \frac{dh(τ(t))}{dt}

and, once again, setting δ = 1 and substituting Equation 6.94, then discretising:
h^(t+1) = h^(t) + \frac{dτ(t)}{dt} R(z^(t+1), h^(t)) − \frac{dτ(t)}{dt} h^(t) = \frac{dτ(t)}{dt} R(z^(t+1), h^(t)) + \left(1 − \frac{dτ(t)}{dt}\right) h^(t)
Finally, we swap dτ(t)/dt with the aforementioned learnable function, Γ. This gives us the required form for our time warping-invariant RNN:
h^(t+1) = Γ(z^(t+1), h^(t)) R(z^(t+1), h^(t)) + (1 − Γ(z^(t+1), h^(t))) h^(t) \qquad (6.95)
We may quickly deduce that SimpleRNNs (Equation 6.81) are not time warping invariant, given that they do not feature the second term in Equation 6.95. Instead, they fully overwrite h^(t) with R(z^(t+1), h^(t)), which corresponds to assuming no time warping at all; dτ(t)/dt = 1, i.e. τ(t) = t.
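A sketch of the time-warping-invariant update of Equation (6.95): a learnable scalar gate Γ interpolates between the candidate update R and the previous summary (NumPy; the specific parametrisations of Γ and R are our own illustrative choices):

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

k, m = 4, 8
rng = np.random.default_rng(0)
W, U, b = rng.normal(size=(m, k)), rng.normal(size=(m, m)), np.zeros(m)   # parameters of R
wg, ug, bg = rng.normal(size=k), rng.normal(size=m), 0.0                  # parameters of Gamma

def R(z_next, h):
    return np.tanh(W @ z_next + U @ h + b)

def Gamma(z_next, h):
    # learnable scalar gate in (0, 1), approximating the warping derivative d tau(t) / dt
    return logistic(wg @ z_next + ug @ h + bg)

def gated_update(z_next, h):
    """Eq. (6.95): h_{t+1} = Gamma * R(z_{t+1}, h_t) + (1 - Gamma) * h_t."""
    g = Gamma(z_next, h)
    return g * R(z_next, h) + (1.0 - g) * h

h = np.zeros(m)
for z_next in [rng.normal(size=k) for _ in range(10)]:
    h = gated_update(z_next, h)
```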
Further, our link between continuous-time RNNs and the discrete RNN based on R rested on the accuracy of the Taylor approximation, which holds only if the time-warping derivative is not too large, i.e., dτ(t)/dt ≤ 1. The intuitive explanation of this is: if our time warping operation ever contracts time in a way that makes time increments (t → t + 1) large enough that intermediate data changes are not sampled, the model can never hope to process time-warped inputs in the same way as original ones; it simply would not have access to the same information. Conversely, time dilations of any form (which, in discrete terms, correspond to interspersing the input time-series with zeroes) are perfectly allowed within our framework.
Combined with our requirement of a monotonically increasing τ (dτ(t)/dt > 0), we can bound the output space of Γ as 0 < Γ(z^(t+1), h^(t)) < 1, which motivates the use of the logistic sigmoid activation for Γ, e.g.:

Γ(z^(t+1), h^(t)) = logistic(W_Γ z^(t+1) + U_Γ h^(t) + b_Γ)
exactly matching the LSTM gating equations (e.g. Equation 6.86). The main difference is that LSTMs compute gating vectors, whereas Equation 6.95 implies Γ should output a scalar. Vectorised gates (Hochreiter 1991) allow fitting a different warping derivative in every dimension of h^(t), allowing for reasoning over multiple time horizons simultaneously.
It is worth taking a pause here to summarise what we have done. By requiring that our RNN class is invariant to (non-destructive) time warping, we have derived the necessary form that it must have (Equation 6.95), and shown that it exactly corresponds to the class of gated RNNs. The gates' primary role under this perspective is to accurately fit the derivative dτ(t)/dt of the warping transformation.
The notion of class invariance is somewhat distinct from the invariances we studied previously. Namely, once we train a gated RNN on a time-warped input with τ_1(t), in many cases we cannot zero-shot transfer it to a signal warped by a different τ_2(t). Rather, class invariance only guarantees that gated RNNs are powerful enough to fit both of these signals in the same manner, but potentially with vastly different model parameters. That being said, the realisation that effective gating mechanisms are tightly related to fitting the warping derivative can yield useful prescriptions for gated RNN optimisation, as we now briefly demonstrate.
For example, we can often assume that the range of the dependencies we are interested in tracking within our signal will be in the range [T_l, T_h] time-steps. By analysing the analytic solutions to Equation 6.94, it can be shown that the characteristic forgetting time of h^(t) by our gated RNN is proportional to 1/Γ(z^(t+1), h^(t)). Hence, we would like our gating values to lie between 1/T_h and 1/T_l in order to effectively remember information within the assumed range.
Further, if we assume that z^(t) and h^(t) are roughly zero-centered (which is a common by-product of applying transformations such as layer normalisation; Ba, Kiros, and Hinton 2016), we can assume that E[Γ(z^(t+1), h^(t))] ≈ logistic(b_Γ). Controlling the bias vector of the gating mechanism is hence a very powerful way of controlling the effective gate value.^29
Combining the two observations, we conclude that an appropriate range of gating values can be obtained by initialising b_Γ ∼ −log(U(T_l, T_h) − 1), where U is the uniform real distribution. Such a recommendation was dubbed chrono initialisation by Tallec and Ollivier (2018), and has been empirically shown to improve the long-range dependency modelling of gated RNNs.
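A sketch of this chrono initialisation of the gate bias (the dependency range [T_l, T_h] is an assumed hyperparameter, and all names are illustrative):

```python
import numpy as np

def chrono_bias(m, t_low, t_high, rng):
    """Initialise gate biases so that logistic(b) is roughly 1/T for T ~ U(t_low, t_high)."""
    T = rng.uniform(t_low, t_high, size=m)   # characteristic forgetting times
    return -np.log(T - 1.0)                  # b_Gamma ~ -log(U(T_l, T_h) - 1)

rng = np.random.default_rng(0)
b_gamma = chrono_bias(m=8, t_low=10, t_high=100, rng=rng)
gate = 1.0 / (1.0 + np.exp(-b_gamma))        # expected gate values, roughly in [1/T_h, 1/T_l]
assert np.all((gate > 1.0 / 100) & (gate < 1.0 / 10 + 1e-9))
```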
Exercises
1. We noted that a matrix is circulant if and only if it commutes with the shift operator S. Let us prove this fundamental result for 1D grids.
(i) Write down the explicit elements of the right shift matrix S of N elements.
(ii) Let C be an arbitrary N × N matrix. Write down the elements of the resulting matrices SC and CS.
(iii) By requiring translation equivariance (SC = CS), deduce the relationship between c_{ij} and c_{i+1,j+1}.
(iv) Use this relationship to conclude that the elements c_{ij} depend only on (i − j) mod N, proving that C must be a circulant.
2. Convolutional neural networks often use strided convolutions (or pooling) to
coarsen the grid. For example, a stride of s = 2 drops every second pixel of the out-
put grid. Show that an architecture with a stride of 2 is no longer equivariant to all
translations, and characterise which translations it is equivariant to.
3. Given that grids are a special case of graphs, we may attempt to express a CNN using the language of graph neural networks (GNNs) from Chapter 5. Consider a 1D cyclical grid with n nodes as a ring graph, comprising nodes V = {0, 1, ..., n − 1} and neighbourhoods N_u = {(u − 1) mod n, u, (u + 1) mod n}, with node features x_u ∈ R^3 being typical RGB values.
(i) Assume we use a GCN (graph convolutional network) model over this graph:

h_u = ϕ\left(x_u, \sum_{v ∈ N_u} \frac{1}{\sqrt{d_u d_v}} ψ(x_v)\right) .
This GNN flavour has some of the closest ties to the spectral properties underlying CNNs, and therefore it represents a valuable attempt. However, it sadly cannot express all possible convolutions over images (even if we expand the neighbourhoods). Characterise which image CNNs are representable using the GCN.
(ii) We can attempt fixing this by moving to a more expressive flavour of GNNs; recall that the most expressive option is message passing GNNs:

h_u = ϕ\left(x_u, \bigoplus_{v ∈ N_u} ψ(x_u, x_v)\right) ,

for a permutation-invariant aggregator ⊕. Provide a simple counter-example showing that even this expressive GNN cannot reliably simulate all CNN filters over the grid.
(iii) What allows CNNs to bypass this restriction? Leverage this assumption to provide a
correction to the message-passing equation that can represent all possible CNNs over
the grid.
4. In this question we will study time rescalings, a special case of time warping transformations. Time rescalings have the form τ(t) = αt, where α ∈ (0, 1]. It is easy to verify that rescalings monotonically dilate time, satisfying the requirements for a non-destructive time warping operation. We will start by studying continuous-time RNNs, taking input signals z(t) to continuous summaries h(t).
(i) State the linear Taylor expansion of h(t) around t. Under which conditions is this a good approximation? Use the Taylor expansion to relate h(t) to h(t + 1), using the discrete RNN update R(z^(t+1), h^(t)) and the summary derivative dh(t)/dt.
(ii) Now introduce the proposed time rescaling, and derive the required condition for class invariance to such operations.
(iii) Derive the update rule for a discrete RNN that is invariant to time rescaling.
(iv) Assume that a time-rescaling-invariant RNN has been pre-trained over data which has been rescaled using τ_1(t) = α_1 t. Is it possible to zero-shot transfer this RNN to make predictions over a signal that has been rescaled by τ_2(t) = α_2 t? If so, how?
Notes
1
Because of the periodic boundary conditions, it is a circular or cyclic convolution. In signal processing, θ is often referred to as the “filter”, and in CNNs, its coefficients are learnable.
2
The left shift operator is given by S^⊤. Obviously, shifting left and then right (or vice versa) does not do anything, which means S is orthogonal: S^⊤S = SS^⊤ = I.
3
We must additionally assume distinct eigenvalues, otherwise there might be
multiple possible diagonalisations. This assumption is satisfied with our choice
of S.
4
S is orthogonal but non-symmetric, hence, its eigenvectors are orthogonal
but the eigenvalues are complex (roots of unity).
5
Note that the eigenvectors are complex, so we need to take complex
conjugation when transposing Φ.
6
Since the Fourier transform is an orthogonal matrix (Φ^∗Φ = I), geometrically it acts as a change of the system of coordinates that amounts to an N-dimensional rotation. In this system of coordinates (“Fourier domain”), the action of a circulant matrix C becomes an element-wise product.
7
In graph signal processing, the eigenvectors of the graph Laplacian are often used as an alternative to the adjacency matrix to construct the graph Fourier transform, see Shuman et al. 2013. On grids, both matrices have joint eigenvectors, but on graphs they result in somewhat different though related constructions.
8
Eigenfunction is synonymous with ‘eigenvector’ and is used when referring
to eigenvectors of continuous operators.
9
The spectral characterisation of the translation group is a particular case of a more general result in Functional Analysis, Stone's Theorem, which derives an equivalent characterisation for any one-parameter unitary group.
10
Recall, C(θ) is a circulant matrix with parameters θ.
11
Note that we usually imagine x and θ_{vw} as 2D matrices, but in this equation, both x and θ_{vw} have their two coordinate dimensions flattened into one, making x a vector and C(θ_{vw}) a matrix.
12
CNNs which only consist of the operations mentioned in this paragraph are
often dubbed “all-convolutional”. In contrast, many CNNs flatten the image
across the spatial axes and pass them to an MLP classifier, once sufficient
equivariant and coarsening layers have been applied. This loses translation
invariance.
13
Historically, ResNet models are predated by highway networks (Srivastava, Greff, and Schmidhuber 2015), which allow for more general gating mechanisms to control the residual information flow.
14
In this case, the ResNet is performing a Forward Euler discretisation of an ODE: ẋ = σ(F(x)).
15
We note that normalising activations of neural networks has seen attention
even before the advent of batch normalisation. See, e.g., Lyu and Simoncelli
2008.
16
Whether the domain is considered static or dynamic concerns time scales:
e.g., a road network does change over time (as new roads are built and old
ones are demolished), but significantly slower compared to the flow of traffic.
Grids 191
Similarly, in social networks, changes in engagement (e.g. Twitter users re-
tweeting a tweet) happen at a much higher frequency than changes in the follow
graph.
17
We do not lose generality in our example; equivalent analysis can be done
e.g. for node-level outputs on a spatiotemporal graph; the only difference is in
the choice of encoder f (which will then be a permutation equivariant GNN).
18
Note that the z^(t) vectors can be seen as points on a temporal grid: hence, processing them with a CNN is also viable in some cases. Transformers are also increasingly popular models for processing generic sequential inputs.
19
In spite of their name, SimpleRNNs are remarkably expressive. For exam-
ple, it was shown by Siegelmann and Sontag (1995) that such models are
Turing-complete, meaning that they can likely represent any computation we
may ever be able to execute on computers.
20
Note that this construction is extendable to grids in higher dimensions,
allowing us to, e.g., process signals living on images in a scanline fashion.
Such a construction powered a popular series of models, such as the PixelRNN
from Oord, Kalchbrenner, and Kavukcuoglu (2016).
21
Note that equivalent analyses will arise if we use a different padding vector
than 0.
22
In a very similar vein, we can derive equivariance to left-shifting by s steps as long as t′ ≥ s.
23
Contractions are functions γ : X → X such that, under some norm ‖·‖ on X, applying γ contracts the distances between points: for all x, y ∈ X, and some q ∈ [0, 1), it holds that ‖γ(x) − γ(y)‖ ≤ q‖x − y‖. Iterating such a function then necessarily converges to a unique fixed point, as a direct consequence of Banach's Fixed Point Theorem (Banach 1922).
24
Note that we have set the activation function to tanh here; as LSTMs are
designed to ameliorate the vanishing gradient problem, it is now appropriate to
use a sigmoidal activation.
25
Note that the three gates are themselves vectors, i.e. i^(t), f^(t), o^(t) ∈ [0, 1]^m. This allows them to control how much each of the m dimensions is allowed through the gate.
26
This is still compatible with the RNN update blueprint from Equation (6.80); simply consider the summary vector to be the concatenation of h^(t) and c^(t), sometimes denoted by h^(t) ‖ c^(t).
27
We focus on the continuous setting as it will be easier to reason about
manipulations of time there.
28
We will use h(t) to denote a continuous signal at time t, and h^(t) to denote a discrete signal at time-step t.
29
This insight was already spotted by Gers and J. Schmidhuber (2000) and
Jozefowicz, Zaremba, and Sutskever (2015), who empirically recommended
initialising the forget-gate bias of LSTMs to a constant positive vector, such as
1.