6
Grids
Grids are a common, particular type of graph that comes with additional geometric priors.
These priors are captured by the translation group, enabling the use of oriented features.
We describe how Fourier analysis provides the natural tool to fully characterise translation-equivariant maps, and we present canonical instances including convolutional and recurrent neural networks.
Going beyond pure translations, we demonstrate how studying equivariance to time warping necessitates the use of gating in RNNs, leading to the popular long short-term memory model.
The second type of objects we consider are grids. It is fair to say that the
impact of deep learning has been particularly dramatic in computer vision,
natural language processing, and speech recognition. These applications all
share a geometric common denominator: an underlying grid structure.
Grids are a particular case of graphs, but they come with specific structure that sets them apart. On the one hand, grids come with a canonical arrangement of nodes that makes the permutation invariance required for general graphs unnecessary. On the other hand, machine learning problems defined over grids often exhibit a strong geometric prior, encoded by the translation group: an object's identity does not depend on its absolute spatial location, or an amino acid sequence encodes similar structures independently of the location of their associated genes in the genome.
The translation group provides a particularly simple instantiation of the
GDL blueprint, thanks to the harmonic analysis perspective brought by Fourier
analysis. Indeed, since the translation group is commutative, we can explicitly
characterize the class of linear equivariant maps (that is, convolutions) using
a shared eigendecomposition: the Fourier transform. Moreover,
grids are discretizations of low-dimensional physical domains, where locality
is properly defined.
Figure 6.1
An illustration of the one-dimensional grid, with periodic boundary conditions.
In this chapter, we will first derive the fundamental links between convolutions and Fourier transforms. Then we will discuss the distinction between the graph and the grid formalisms, and finally move on to describe representative architectures for image and sequence applications.
6.1 Domain
A grid D of size N in s dimensions can best be viewed as the set Z_N^s, containing s-tuples of integers (j_1, ..., j_s) modulo N. It will be convenient to view the domain as having no boundary, and to interpret the grid as a discretization of an underlying torus T^s = R^s/Z^s. This domain is itself a group, with the operation inherited from the additive modular structure of either Z_N or R/Z: given u, v ∈ D, the point u + v is well-defined in D. We denote this group as the translation group.
Let us now investigate how the GDL blueprint instantiates in this setting.
For ease of exposition, we will focus on the one-dimensional setting (s = 1). Let us recall that the key questions we need to address are: (i) to characterise the class of linear equivariant maps associated with the symmetry group, and (ii) to implement locality.
Circulant matrices and convolutions We can view the one-dimensional grid as a ring graph (illustrated in Figure 6.1), with nodes indexed by 0, 1, ..., N − 1 modulo N, and the adjacency matrix with elements a_{u,(u+1) mod N} = 1 and zero otherwise.
As mentioned earlier, while it is tempting to reduce ourselves to the general graph formulation seen in Chapter 5, there are two main differences that set our present setting apart. First, each node u has identical connectivity, to its neighbours u − 1 and u + 1, and is thus structure-wise indistinguishable from the others. In the language of group theory, we say that the grid is a homogeneous space: for any two points u and v in the domain, there is a group element g ∈ G that maps u to g.u = v. Next, since the nodes of the grid have a fixed ordering, we also have a fixed ordering of the neighbours: we can call u − 1 the 'left neighbour' and u + 1 the 'right neighbour'. More generally, the grid allows us to define an orientation in a way that is globally consistent across all neighbourhoods. If we use our previous approach for designing an
equivariant function F using a local aggregation function ϕ, we now have f(x_u) = ϕ(x_{u−1}, x_u, x_{u+1}) at every node of the grid: ϕ does not need to be permutation invariant anymore. For a particular choice of a linear transformation ϕ(x_{u−1}, x_u, x_{u+1}) = θ_{−1} x_{u−1} + θ_0 x_u + θ_1 x_{u+1}, we can write F(X) as a matrix product,
F(X) = \begin{pmatrix}
θ_0 & θ_1 & & & θ_{−1} \\
θ_{−1} & θ_0 & θ_1 & & \\
 & \ddots & \ddots & \ddots & \\
 & & θ_{−1} & θ_0 & θ_1 \\
θ_1 & & & θ_{−1} & θ_0
\end{pmatrix}
\begin{pmatrix}
x_0 \\ x_1 \\ \vdots \\ x_{n−2} \\ x_{n−1}
\end{pmatrix}
Note this very special multi-diagonal structure with one element repeated
along each diagonal, sometimes referred to as “weight sharing” in the machine
learning literature.
More generally, given a vector θ = (θ_0, ..., θ_{n−1}), a circulant matrix C(θ) = (θ_{(u−v) mod n}) is obtained by appending circularly shifted versions of the vector θ. Circulant matrices are synonymous with discrete convolutions,^1

(x ⋆ θ)_u = \sum_{v=0}^{n−1} x_{v mod n} θ_{(u−v) mod n} ,

as one has C(θ)x = x ⋆ θ. A particular choice of θ = (0, 1, 0, ..., 0) yields a special circulant matrix that shifts vectors to the right by one position. This matrix is called the (right) shift or translation operator and denoted by S.^2
Circulant matrices can be characterised by their commutativity property: the product of circulant matrices is commutative, i.e. C(θ)C(η) = C(η)C(θ) for any θ and η. Since the shift is a circulant matrix, we get the familiar translation or shift equivariance of the convolution operator,

SC(θ)x = C(θ)Sx.

Such a commutativity property should not be surprising, since the underlying symmetry group (the translation group) is Abelian. Moreover, the opposite direction is true as well, i.e. a matrix is circulant iff it commutes with the shift. This, in turn, allows us to define convolution as a translation equivariant linear operation, and is a nice illustration of the power of geometric priors and the overall philosophy of Geometric ML: convolution emerges from the first principle of translational symmetry.
Note that unlike the situation on sets and graphs, the number of linearly independent shift-equivariant functions (convolutions) grows with the size of the domain (since we have one degree of freedom in each diagonal of a circulant matrix). However, the scale separation prior guarantees that filters can be local, resulting in the same Θ(1)-parameter complexity per layer, as we will verify in Section 6.2.1 when discussing the use of these principles in the implementation of Convolutional Neural Network architectures.
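To make the correspondence between circulant matrices, circular convolution, and shift equivariance concrete, here is a minimal NumPy sketch (the filter and variable names are purely illustrative):

```python
import numpy as np

n = 8
theta = np.zeros(n)
theta[[0, 1, -1]] = [0.5, 0.25, 0.25]   # a local filter supported on {-1, 0, +1}

# Circulant matrix C(theta), with entries theta[(u - v) mod n]
C = np.array([[theta[(u - v) % n] for v in range(n)] for u in range(n)])

# Right-shift (translation) operator S: (Sx)_u = x_{(u-1) mod n}
S = np.roll(np.eye(n), 1, axis=0)

x = np.random.randn(n)

# C(theta) x coincides with the circular convolution of x with theta
conv = np.array([sum(x[v] * theta[(u - v) % n] for v in range(n)) for u in range(n)])
assert np.allclose(C @ x, conv)

# Shift equivariance: S C(theta) x == C(theta) S x
assert np.allclose(S @ (C @ x), C @ (S @ x))
```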
Derivation of the discrete Fourier transform In Chapter 4 we introduced Fourier transforms, and we have already mentioned their connection to convolution: the fact that the Fourier transform diagonalises the convolution operation is an important property used in signal processing to perform convolution in the frequency domain as an element-wise product of the Fourier transforms. Let us verify this important property in our setting of one-dimensional grids. For this purpose, recall a fact from linear algebra: (diagonalisable) matrices are jointly diagonalisable iff they mutually commute.^3 In other words, there exists a common eigenbasis for all the circulant matrices, in which they differ only by their eigenvalues. We can therefore pick one circulant matrix and compute its eigenvectors; we are assured that these will be the eigenvectors of all other circulant matrices as well. It is convenient to pick the shift operator S, for which the eigenvectors happen to be the discrete Fourier basis^4
φ_k = \frac{1}{\sqrt{N}} \left(1, e^{2πik/N}, e^{4πik/N}, \ldots, e^{2πi(N−1)k/N}\right) , \qquad k = 0, 1, \ldots, N − 1,

which we can arrange into an N × N Fourier matrix Φ = (φ_0, ..., φ_{N−1}). Indeed, we verify directly that Sφ_k = e^{−2πik/N} φ_k.
Multiplication by Φ^∗ gives the Discrete Fourier Transform (DFT),^5 and by Φ the inverse DFT,

\hat{x}_k = \frac{1}{\sqrt{N}} \sum_{u=0}^{N−1} x_u e^{−2πiku/N} , \qquad x_u = \frac{1}{\sqrt{N}} \sum_{k=0}^{N−1} \hat{x}_k e^{+2πiku/N} .
Since all circulant matrices are jointly diagonalisable,^6 they are also diagonalised by the Fourier transform and differ only in their eigenvalues. Since the eigenvalues of the circulant matrix C(θ) are the Fourier transform of the filter (see e.g. Bamieh 2018), \hat{θ} = Φ^∗θ, we obtain the Convolution Theorem:

C(θ)x = Φ \begin{pmatrix} \hat{θ}_0 & & \\ & \ddots & \\ & & \hat{θ}_{n−1} \end{pmatrix} Φ^∗ x = Φ(\hat{θ} ⊙ \hat{x})
Because the Fourier matrix Φ has a special algebraic structure, the products Φ^∗x and Φx can be computed with O(n log n) complexity using a Fast Fourier Transform (FFT) algorithm. This is one of the reasons why frequency-domain filtering is so popular in signal processing; furthermore, the filter is typically designed directly in the frequency domain, so the Fourier transform \hat{θ} is never explicitly computed.
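A quick numerical check of the Convolution Theorem, continuing the sketch above (all names are illustrative): the Fourier matrix Φ diagonalises C(θ), the diagonal holds the DFT of θ, and filtering can equivalently be done in the frequency domain.

```python
import numpy as np

n = 8
theta = np.zeros(n)
theta[[0, 1, -1]] = [0.5, 0.25, 0.25]
x = np.random.randn(n)
C = np.array([[theta[(u - v) % n] for v in range(n)] for u in range(n)])

# Spatial-domain filtering equals frequency-domain filtering (circular convolution theorem)
assert np.allclose(C @ x, np.fft.ifft(np.fft.fft(theta) * np.fft.fft(x)).real)

# The Fourier matrix diagonalises C(theta); the diagonal is the DFT of theta
# (NumPy uses the unnormalised DFT convention)
Phi = np.array([[np.exp(2j * np.pi * k * u / n) for k in range(n)] for u in range(n)]) / np.sqrt(n)
D = Phi.conj().T @ C @ Phi
assert np.allclose(D, np.diag(np.fft.fft(theta)), atol=1e-12)
```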
Besides the didactic value of the derivation of the Fourier transform and convolution we have done here, it provides a scheme to generalise these concepts to graphs. Realising that the adjacency matrix of the ring graph is exactly the shift operator, one can develop the graph Fourier transform and an analogue of the convolution operator by computing the eigenvectors of the adjacency matrix (see e.g. Sandryhaila and Moura 2013). Early attempts to develop graph neural networks by analogy to CNNs, sometimes termed 'spectral GNNs', exploited this exact blueprint.^7
We will see in Chapter 8 that this analogy has some important limitations. The first limitation comes from the fact that a grid is fixed, and hence all signals on it can be represented in the same Fourier basis. In contrast, on general graphs, the Fourier basis depends on the structure of the graph. Hence, we cannot directly compare Fourier transforms on two different graphs, a problem that translates into a lack of generalisation in machine learning problems. Secondly, multi-dimensional grids, which are constructed as tensor products of one-dimensional grids, retain the underlying structure: the Fourier basis elements and the corresponding frequencies (eigenvalues) can be organised in multiple dimensions. In images, for example, we can naturally talk about horizontal and vertical frequencies, and filters have a notion of direction. On graphs, the structure of the Fourier domain is one-dimensional, as we can only organise the Fourier basis functions by the magnitude of the corresponding frequencies. As a result, graph filters are oblivious to direction, i.e. isotropic.
Derivation of the continuous Fourier transform We can revisit the previous analysis in the continuous setting where the grid is replaced by the real line. Like in Chapter 4, we consider functions defined on Ω = R and the translation operator (S_v f)(u) = f(u − v) shifting f by some position v. Applying S_v to the Fourier basis functions φ_ξ(u) = e^{iξu} yields, by associativity of the exponent,

S_v e^{iξu} = e^{iξ(u−v)} = e^{−iξv} e^{iξu} ,
i.e., φ_ξ(u) is the complex eigenvector of S_v with the complex eigenvalue e^{−iξv}, exactly mirroring the situation we had in the discrete setting. Since S_v is a unitary operator (i.e., ‖S_v x‖_p = ‖x‖_p for any p and x ∈ L_p(R)), any eigenvalue λ must satisfy |λ| = 1, which corresponds precisely to the eigenvalues e^{−iξv} found above. Moreover, the spectrum of the translation operator is simple, meaning that two functions sharing the same eigenvalue must necessarily be collinear. Indeed, suppose that S_v f = e^{−iξ_0 v} f for some ξ_0. Taking the Fourier transform of both sides, we obtain

∀ξ , \quad e^{−iξv} \hat{f}(ξ) = e^{−iξ_0 v} \hat{f}(ξ) ,

which implies that \hat{f}(ξ) = 0 for ξ ≠ ξ_0, and thus f = α φ_{ξ_0}.
For a general linear operator C that is translation equivariant (S_v C = C S_v), we have

S_v C e^{iξu} = C S_v e^{iξu} = e^{−iξv} C e^{iξu} ,

implying that C e^{iξu} is also an eigenfunction^8 of S_v with eigenvalue e^{−iξv}, from where it follows from the simplicity of the spectrum that C e^{iξu} = β φ_ξ(u); in other words, the Fourier basis is the eigenbasis of all translation equivariant operators. As a result, C is diagonal in the Fourier domain and can be expressed as C e^{iξu} = \hat{p}_C(ξ) e^{iξu}, where \hat{p}_C(ξ) is a transfer function acting on different frequencies ξ. Finally, for an arbitrary function x(u), by linearity,

(Cx)(u) = C \int_{−∞}^{+∞} \hat{x}(ξ) e^{iξu} dξ = \int_{−∞}^{+∞} \hat{x}(ξ) \hat{p}_C(ξ) e^{iξu} dξ = \int_{−∞}^{+∞} p_C(v) x(u − v) dv = (x ⋆ p_C)(u) ,^9

where p_C(u) is the inverse Fourier transform of \hat{p}_C(ξ). It thus follows that every linear translation equivariant operator is a convolution.
6.2 Model
Now that we have completely characterised the class of linear equivariant maps, given by convolutions x ↦ x ⋆ h, and that we can furthermore impose locality of these maps by simply constraining the support of the filter h, we can directly deploy the GDL blueprint from Chapter 4, yielding the household CNN (Convolutional Neural Network) and RNN (Recurrent Neural Network) architectures.
6.2.1 Convolutional Neural Networks
Convolutional Neural Networks are perhaps the earliest and most well known
example of deep learning architectures following the blueprint of Geometric
Deep Learning. In Section 6.1 we have fully characterised the class of linear
and local translation equivariant operators, given by convolutions C(θ)x = x ⋆ θ with a localised filter θ.^10 Let us first focus on scalar-valued ('single-channel' or 'grayscale') discretised images, where the domain is the grid Ω = [H] × [W], with u = (u_1, u_2) and x ∈ X(Ω, R).
Any convolution with a compactly supported filter of size H_f × W_f can be written as a linear combination of generators θ_{1,1}, ..., θ_{H_f,W_f}, given for example by the unit peaks θ_{vw}(u_1, u_2) = δ(u_1 − v, u_2 − w). Any local linear equivariant map is thus expressible as^11

F(x) = \sum_{v=1}^{H_f} \sum_{w=1}^{W_f} α_{vw} C(θ_{vw}) x , \qquad (6.75)
Figure 6.2
The process of convolving an image x with a filter C(θ). The filter parameters θ can
be expressed as a linear combination of generators θ_{vw}.
which, in coordinates, corresponds to the familiar 2D convolution (see Figure 6.2 for an overview):

F(x)_{uv} = \sum_{a=1}^{H_f} \sum_{b=1}^{W_f} α_{ab} x_{u+a, v+b} . \qquad (6.76)
Other choices of the basis θ_{vw} are also possible and will yield equivalent operations (for potentially different choices of α_{vw}). A popular example are directional derivatives θ_{vw}(u_1, u_2) = δ(u_1, u_2) − δ(u_1 − v, u_2 − w), for (v, w) ≠ (0, 0), taken together with the local average θ_0(u_1, u_2) = 1/(H_f W_f). In fact, directional derivatives can be considered a grid-specific analogue of diffusion processes on graphs, which we recover if we assume each pixel to be a node connected to its immediate neighbouring pixels in the grid.
When the scalar input channel is replaced by multiple channels (e.g., RGB colours, or more generally an arbitrary number of feature maps), the convolutional filter becomes a convolutional tensor expressing arbitrary linear combinations of input features into output feature maps. In coordinates, this can be expressed as

F(x)_{uvj} = \sum_{a=1}^{H_f} \sum_{b=1}^{W_f} \sum_{c=1}^{M} α_{jabc} x_{u+a, v+b, c} , \qquad j ∈ [N] , \qquad (6.77)
where M and N are respectively the number of input and output channels. This basic operation encompasses a broad class of neural network architectures which, as we will show in the next section, have had a profound impact across many areas of computer vision, signal processing, and beyond. Here, rather than dissecting the myriad of possible architectural variants of CNNs, we prefer to focus on some of the essential innovations that enabled their widespread use.
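As a concrete, deliberately naive illustration of Equation (6.77), the following NumPy sketch computes the multi-channel convolution with explicit loops; all names are ours, and the 'valid' boundary handling (no padding) is an assumption.

```python
import numpy as np

def conv2d_multichannel(x, alpha):
    """x: (H, W, M) input; alpha: (N, Hf, Wf, M) filter tensor.
    Returns F(x) with shape (H - Hf + 1, W - Wf + 1, N), following Eq. (6.77)."""
    H, W, M = x.shape
    N, Hf, Wf, _ = alpha.shape
    out = np.zeros((H - Hf + 1, W - Wf + 1, N))
    for u in range(out.shape[0]):
        for v in range(out.shape[1]):
            for j in range(N):
                # sum over the filter offsets (a, b) and the input channels c
                out[u, v, j] = np.sum(alpha[j] * x[u:u + Hf, v:v + Wf, :])
    return out

x = np.random.randn(28, 28, 3)        # e.g. a 28x28 RGB image (M = 3)
alpha = np.random.randn(16, 3, 3, 3)  # N = 16 output channels, 3x3 filters
print(conv2d_multichannel(x, alpha).shape)  # (26, 26, 16)
```

In practice one would use an optimised convolution primitive from a deep learning framework; the loop form only makes the weight sharing of Equation (6.77) explicit.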
Figure 6.3
ReLU, often considered a ‘modern’ architectural choice, was already used in the
Neocognitron (Fukushima 1980). Rectification is equivalent to the principle of demodu-
lation, which is fundamental in electrical engineering as the basis for many transmission
protocols, such as FM radio; and also has a prominent role in models for neuronal activ-
ity.
Efficient multiscale computation As discussed in the GDL template for general symmetries, extracting translation invariant features out of the convolutional operator F requires a non-linear step, e.g. the ubiquitous ReLU activation illustrated in Figure 6.3. Convolutional features are processed through a non-linear activation function σ, acting element-wise on the input, i.e., σ : X(Ω) → X(Ω), with σ(x)(u) = σ(x(u)). Perhaps the most popular example at the time of writing is the Rectified Linear Unit (ReLU): σ(x) = max(x, 0). This non-linearity effectively rectifies the signals, pushing their energy towards lower frequencies, and enabling the computation of high-order interactions across scales by iterating the construction.
Already in the early works of Fukushima 1980 and LeCun et al. 1998, CNNs and similar architectures had a multiscale structure, where after each convolutional layer (6.77) one performs a grid coarsening P : X(Ω) → X(Ω′), where the grid Ω′ has coarser resolution than Ω. This enables multiscale filters with effectively increasing receptive field, yet retaining a constant number of parameters per scale. Several signal coarsening strategies P (referred to as pooling) may be used; the most common are applying a low-pass anti-aliasing filter (e.g. local average) followed by grid downsampling, or non-linear max-pooling.
In summary, a 'vanilla' CNN layer can be expressed as the composition of the basic objects already introduced in our Geometric Deep Learning blueprint:

h = P(σ(F(x))) , \qquad (6.78)
i.e. an equivariant linear layer F, a coarsening operation P, and a non-linearity σ. It is also possible to perform translation invariant global pooling operations within CNNs. Intuitively, this involves each pixel (which, after several convolutions, summarises a patch centered around it) proposing the final representation of the image,^12 with the ultimate choice being guided by a form of aggregation of these proposals. A popular choice here is the average function, as its outputs will retain similar magnitudes irrespective of the image size (Springenberg et al. 2014).
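A minimal sketch of Equation (6.78), here written in PyTorch (our choice of framework, purely for illustration): an equivariant convolution F, a pointwise non-linearity σ, a coarsening P, and a final translation-invariant global average pooling.

```python
import torch
import torch.nn as nn

class VanillaCNNLayer(nn.Module):
    """One 'vanilla' CNN layer h = P(sigma(F(x))), as in Eq. (6.78)."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.F = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)  # equivariant linear map
        self.sigma = nn.ReLU()                                                    # pointwise non-linearity
        self.P = nn.MaxPool2d(kernel_size=2)                                      # grid coarsening (pooling)

    def forward(self, x):
        return self.P(self.sigma(self.F(x)))

cnn = nn.Sequential(VanillaCNNLayer(3, 16), VanillaCNNLayer(16, 32))
x = torch.randn(8, 3, 32, 32)        # a batch of RGB images
h = cnn(x)                           # shape (8, 32, 8, 8)
y = h.mean(dim=(-2, -1))             # global average pooling: translation-invariant (8, 32) features
```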
Prominent examples following this CNN blueprint (some of which we will
discuss next) are displayed in Figure 6.4.
Deep and Residual Networks A CNN architecture, in its simplest form, is therefore specified by hyperparameters (H_f^k, W_f^k, N_k, p_k)_{k ≤ K}, with M_{k+1} = N_k and p_k ∈ {0, 1} indicating whether grid coarsening is performed or not. While all these hyperparameters are important in practice, a particularly important question is to understand the role of depth K in CNN architectures, and what are the fundamental tradeoffs involved in choosing such a key hyperparameter, especially in relation to the filter sizes (H_f^k, W_f^k).
While a rigorous answer to this question is still elusive, mounting empirical evidence collected throughout the recent years suggests a favourable tradeoff towards deeper (large K) yet thinner (small (H_f^k, W_f^k)) models.^13 In this context, a crucial insight by He et al. 2016 was to reparametrise each convolutional layer to model a perturbation of the previous features, rather than a generic non-linear transformation:

h = P(x + σ(F(x))) . \qquad (6.79)
The resulting residual networks provide several key advantages over the previous formulation. In essence, the residual parametrisation is consistent with the view that the deep network is a discretisation of an underlying continuous dynamical system, modelled as an ordinary differential equation (ODE).^14 Crucially, learning a dynamical system by modeling its velocity turns out to be much easier than learning its position directly. In our learning setup, this translates into an optimisation landscape with more favorable geometry, leading to the ability to train much deeper architectures than was possible before.
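A sketch of the residual reparametrisation of Equation (6.79), again in PyTorch with illustrative names:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """h = P(x + sigma(F(x))): the layer models a perturbation of its input, Eq. (6.79)."""
    def __init__(self, channels, coarsen=False):
        super().__init__()
        self.F = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.sigma = nn.ReLU()
        self.P = nn.MaxPool2d(2) if coarsen else nn.Identity()  # optional grid coarsening

    def forward(self, x):
        return self.P(x + self.sigma(self.F(x)))

x = torch.randn(4, 16, 32, 32)
h = ResidualBlock(16, coarsen=True)(x)   # shape (4, 16, 16, 16)
```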
As will be discussed in future work, learning using deep neural networks defines a non-convex optimisation problem, which can be efficiently solved using gradient-descent methods under certain simplifying regimes. The key advantage of the ResNet parametrisation has been rigorously analysed in simple scenarios (Hardt and Ma 2016), and remains an active area of theoretical investigation. Finally, Neural ODEs (R. T. Chen et al. 2018) are a recent popular architecture that pushes the analogy with ODEs even further, by learning the parameters of the ODE ẋ = σ(F(x)) directly and relying on standard numerical integration.
Figure 6.4
Prominent examples of CNN architectures. Top-to-bottom: LeNet (LeCun et al. 1998),
AlexNet (Krizhevsky, Sutskever, and Hinton 2012), ResNet (He et al. 2016) and U-Net
(Ronneberger, Fischer, and Brox 2015). Drawn using the PlotNeuralNet package (Iqbal
2018).
Normalisation Another important algorithmic innovation that boosted the
empirical performance of CNNs significantly is the notion of normalisation. In
early models of neural activity, it was hypothesised that neurons perform some form of local 'gain control', where the layer coefficients x_k are replaced by x̃_k = σ_k^{−1}(x_k − µ_k). Here, µ_k and σ_k encode the first- and second-order moment information of x_k, respectively. Further, they can be either computed globally or locally.

In the context of Deep Learning, this principle was widely adopted through the batch normalisation layer (Ioffe and Szegedy 2015),^15 followed by several variants (Ba, Kiros, and Hinton 2016; Salimans and Kingma 2016; Ulyanov, Vedaldi, and Lempitsky 2016; Cooijmans et al. 2016; Wu and He 2018).
Despite some attempts to rigorously explain the benefits of normalisation in
terms of better conditioned optimisation landscapes (Santurkar et al. 2018), a
general theory that can provide guiding principles is still missing at the time of
writing.
Data augmentation While CNNs encode the geometric priors associated
with translation invariance and scale separation, they do not explicitly account
for other known transformations that preserve semantic information, e.g. lighting or colour changes, or small rotations and dilations. A pragmatic approach to incorporate these priors with minimal architectural changes is to perform data augmentation, where one manually applies said transformations to the input images and adds them into the training set.
Data augmentation has been empirically successful and is widely used—not
only to train state-of-the-art vision architectures, but also to prop up several
developments in self-supervised and causal representation learning (T. Chen
et al. 2020; Grill et al. 2020; Mitrovic et al. 2020). However, it is provably
sub-optimal in terms of sample complexity (Mei, Misiakiewicz, and Monta-
nari 2021; Bietti, Venturi, and Bruna 2021); a more efficient strategy considers
instead architectures with richer invariance groups, as discussed in Chapter 7.
6.2.2 Recurrent Neural Networks
Our discussion has thus far always assumed the inputs to be solely spatial
across a given domain. However, in many common use cases, the inputs can
also be considered sequential (e.g. video, text or speech). In this case, we
assume that the input consists of arbitrarily many steps, wherein at each step t
we are provided with an input signal, which we represent as X^(t) ∈ X(Ω^(t)).^16
While in general the domain can evolve in time together with the signals on it, it is typically assumed that the domain is kept fixed across all the t, i.e. Ω^(t) = Ω. Here, we will exclusively focus on this case, but note that exceptions
are common. Social networks are an example where one often has to account
for the domain changing through time, as new links are regularly created as
well as erased. The domain in this setting is often referred to as a dynamic
graph (D. Xu et al. 2020; Rossi, Chamberlain, et al. 2020).
Often, the individual X^(t) inputs will exhibit useful symmetries and hence may be nontrivially treated by any of our previously discussed architectures. Some common examples include: videos (Ω is a fixed grid, and signals are a sequence of frames); fMRI scans (Ω is a fixed mesh representing the geometry of the brain cortex, where different regions are activated at different times as a response to presented stimuli); and traffic flow networks (Ω is a fixed graph representing the road network, on which e.g. the average traffic speed is recorded at various nodes).
Let us assume an encoder function f(X^(t)) providing latent representations at the level of granularity appropriate for the problem and respectful of the symmetries of the input domain. As an example,^17 consider processing video frames: that is, at each timestep, we are given a grid-structured input represented as an n × d matrix X^(t), where n is the number of pixels (fixed in time) and d is the number of input channels (e.g. d = 3 for RGB frames). Further, we are interested in analysis at the level of entire frames, in which case it is appropriate to implement f as a translation invariant CNN, outputting a k-dimensional representation z^(t) = f(X^(t)) of the frame at time-step t.
We are now left with the task of appropriately summarising a sequence of vectors z^(t) across all the steps. A canonical way to dynamically aggregate this information in a way that respects the temporal progression of inputs and also easily allows for online arrival of novel data-points, is using a Recurrent Neural Network (RNN).^18 What we will show here is that RNNs are an interesting geometric architecture to study in their own right, since they implement a rather unusual type of symmetry over the inputs z^(t).
SimpleRNNs At each step, the recurrent neural network computes an m-dimensional summary vector h^(t) of all the input steps up to and including t. This (partial) summary is computed conditional on the current step's features and the previous step's summary, through a shared update function, R : R^k × R^m → R^m, as follows (see Figure 6.5 for a summary):

h^(t) = R(z^(t), h^(t−1)) \qquad (6.80)
and, as both z^(t) and h^(t−1) are flat vector representations, R may be most easily expressed as a single fully-connected neural network layer (often known as SimpleRNN;^19 see Elman (1990) and Jordan (1997)):
Figure 6.5
Illustration of processing video input with RNNs. Each input video frame X^(t) is processed using a shared function f, e.g. a translation invariant CNN, into a flat representation z^(t). Then the RNN update function R is iterated across these vectors, iteratively updating a summary vector h^(t) which summarises all the inputs up to and including z^(t). The computation is seeded with an initial summary vector h^(0), which may be either pre-determined or learnable.
h^(t) = σ(Wz^(t) + Uh^(t−1) + b) \qquad (6.81)
where W ∈ R^{m×k}, U ∈ R^{m×m} and b ∈ R^m are learnable parameters, and σ is an activation function. While this introduces loops in the network's computational graph, in practice the network is unrolled for an appropriate number of steps, allowing for backpropagation through time (Robinson and Fallside 1987; Werbos 1988; Mozer 1989) to be applied.
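A minimal NumPy sketch of the SimpleRNN update of Equation (6.81) and its unrolling over a sequence (shapes follow the text, z^(t) ∈ R^k and h^(t) ∈ R^m; all names are illustrative):

```python
import numpy as np

k, m, T = 4, 8, 10
rng = np.random.default_rng(0)
W = rng.normal(size=(m, k))   # input-to-summary weights
U = rng.normal(size=(m, m))   # summary-to-summary weights
b = np.zeros(m)

def rnn_update(z_t, h_prev):
    """One step of Eq. (6.81), h_t = sigma(W z_t + U h_prev + b), with sigma = tanh."""
    return np.tanh(W @ z_t + U @ h_prev + b)

zs = [rng.normal(size=k) for _ in range(T)]   # input sequence z^(1), ..., z^(T)
h = np.zeros(m)                               # initial summary h^(0)
for z_t in zs:                                # unroll the recurrence through time
    h = rnn_update(z_t, h)
# h now summarises the entire sequence and could be passed to a classifier
```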
The summary vectors may then be appropriately leveraged for the downstream task: if a prediction is required at every step of the sequence, then a shared predictor may be applied to each h^(t) individually. For classifying entire sequences, typically the final summary, h^(T), is passed to a classifier. Here, T is the length of the sequence.
Specifically, the initial summary vector is usually either set to the zero-vector, i.e. h^(0) = 0, or it is made learnable. Analysing the manner in which the initial summary vector is set also allows us to deduce an interesting form of translation equivariance exhibited by RNNs.
Translation equivariance in RNNs Since we interpret the individual steps t as discrete time-steps, the input vectors z^(t) can be seen as living on a one-dimensional^20 grid of time-steps. While it might be attractive to attempt extending our translation equivariance analysis from CNNs here, it cannot be done in a trivial manner.
To see why, let us assume that we have produced a new sequence z′^(t) = z^(t+1) by performing a left-shift of our sequence by one step. It might be tempting to attempt showing h′^(t) = h^(t+1), as one expects with translation equivariance; however, this will not generally hold. Consider t = 1; directly applying and expanding the update function, we recover the following:

h′^(1) = R(z′^(1), h^(0)) = R(z^(2), h^(0)) \qquad (6.82)
h^(2) = R(z^(2), h^(1)) = R(z^(2), R(z^(1), h^(0))) \qquad (6.83)

Hence, unless we can guarantee that h^(0) = R(z^(1), h^(0)), we will not recover translation equivariance. Similar analysis can then be done for steps t > 1.
Fortunately, with a slight refactoring of how we represent z, and for a suitable choice of R, it is possible to satisfy the equality above, and hence demonstrate a setting in which RNNs are equivariant to shifts. Our problem was largely one of boundary conditions: the equality above includes z^(1), which our left-shift operation destroyed. To abstract this problem away, we will observe how an RNN processes an appropriately left-padded sequence, z̄^(t), defined as follows:

\bar{z}^{(t)} = \begin{cases} 0 & t \le t' \\ z^{(t - t')} & t > t' \end{cases}

Such a sequence now allows for left-shifting^21 by up to t′ steps without destroying any of the original input elements. Further, note we do not need to handle right-shifting separately; indeed, equivariance to right shifts naturally follows from the RNN equations.
We can now again analyse the operation of the RNN over a left-shifted version of z̄^(t), which we denote by z̄′^(t) = z̄^(t+1), as we did in Equations 6.82–6.83:
h′^(1) = R(z̄′^(1), h^(0)) = R(z̄^(2), h^(0))
h^(2) = R(z̄^(2), h^(1)) = R(z̄^(2), R(z̄^(1), h^(0))) = R(z̄^(2), R(0, h^(0)))

where the substitution z̄^(1) = 0 holds as long as t′ ≥ 1, i.e. as long as any padding is applied.^22 Now, we can guarantee equivariance to left-shifting by one step (h′^(t) = h^(t+1)) as long as h^(0) = R(0, h^(0)).
Said differently, h^(0) must be chosen to be a fixed point of a function γ(h) = R(0, h). If the update function R is conveniently chosen, then not only can we guarantee existence of such fixed points, but we can even directly obtain them by iterating the application of R until convergence; e.g., as follows:

h_0 = 0 , \qquad h_{k+1} = γ(h_k) , \qquad (6.84)

where the index k refers to the iteration of R in our computation, as opposed to the index (t) denoting the time step of the RNN. If we choose R such that γ is a contraction mapping,^23 such an iteration will indeed converge to a unique fixed point. Accordingly, we can then iterate Equation (6.84) until h_{k+1} = h_k, and we can set h^(0) = h_k. Note that this computation is equivalent to left-padding the sequence with "sufficiently many" zero-vectors.
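A small sketch of the fixed-point computation of Equation (6.84), continuing the NumPy SimpleRNN above; scaling down the recurrent weights U so that γ is (very likely) a contraction is our own illustrative assumption:

```python
import numpy as np

k, m = 4, 8
rng = np.random.default_rng(0)
W = rng.normal(size=(m, k))
U = 0.1 * rng.normal(size=(m, m))   # small recurrent weights, so gamma contracts
b = 0.5 * rng.normal(size=m)

def R(z_t, h_prev):
    return np.tanh(W @ z_t + U @ h_prev + b)

def gamma(h):
    return R(np.zeros(k), h)         # gamma(h) = R(0, h)

h = np.zeros(m)                      # h_0 = 0
for _ in range(100):                 # iterate h_{k+1} = gamma(h_k) until convergence, Eq. (6.84)
    h_next = gamma(h)
    if np.allclose(h_next, h, atol=1e-12):
        break
    h = h_next

h0 = h                               # use the fixed point as the initial summary h^(0)
assert np.allclose(gamma(h0), h0, atol=1e-8)
```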
Depth in RNNs It is also easy to stack multiple RNNs: simply use the h^(t) vectors as an input sequence for a second RNN. This kind of construction is occasionally called a “deep RNN”, which is potentially misleading. Effectively, due to the repeated application of the recurrent operation, even a single RNN “layer” has depth equal to the number of input steps.
This often introduces uniquely challenging learning dynamics when opti-
mising RNNs, as each training example induces many gradient updates to
the shared parameters of the update network. Here we will focus on perhaps
the most prominent such issue—that of vanishing and exploding gradients
(Bengio, Simard, and Frasconi 1994)—which is especially problematic in
RNNs, given their depth and parameter sharing. Further, it has single-handedly
spurred some of the most influential research on RNNs. For a more detailed
overview, we refer the reader to Pascanu, Mikolov, and Bengio (2013), who
have studied the training dynamics of RNNs in great detail, and exposed these
challenges from a variety of perspectives: analytical, geometrical, and the lens
of dynamical systems.
To illustrate vanishing gradients, consider a SimpleRNN with a sigmoidal activation function σ, whose derivative magnitude |σ′| is always between 0 and 1; see Figure 6.6. Multiplying many such values results in gradients that quickly tend to zero, implying that early steps in the input sequence may not be able to have any influence in updating the network parameters at all.
Figure 6.6
Illustration of the logistic function, σ(x) = 1/(1 + exp(−x)). Another popular choice is the hyperbolic tangent, σ(x) = tanh x. They are called sigmoidal due to the distinct S-shape of their plots.
For example, consider the next-word prediction task (common in e.g. predictive keyboards), and the input text “Petar is Serbian. He was born on ... [long paragraph] ... Petar currently lives in ______”. Here, predicting the next word as “Serbia” may only be reasonably concluded by considering the very start of the paragraph, but gradients have likely vanished by the time they reach this input step, making learning from such examples very challenging.
Deep feedforward neural networks have also suffered from the vanishing
gradient problem, until the invention of the ReLU activation (which has gradi-
ents equal to exactly zero or one—thus fixing the vanishing gradient problem).
However, in RNNs, using ReLUs may easily lead to exploding gradients, as the
output space of the update function is now unbounded, and gradient descent
will update the cell once for every input step, quickly building up the scale of
the updates. Historically, the vanishing gradient phenomenon was recognised
early on as a significant obstacle in the use of recurrent networks. Coping with
this problem motivated the development of more sophisticated RNN layers,
which we describe next.
6.3 Case Study: Long Short-Term Memory
A key invention that significantly reduced the effects of vanishing gradients
in RNNs is that of gating mechanisms, which allow the network to selectively
overwrite information in a data-driven way. Prominent examples of these gated
RNNs include the Long Short-Term Memory (LSTM; Hochreiter and Schmidhuber (1997)) and the Gated Recurrent Unit (GRU; Cho et al. (2014)). Here
we will primarily discuss the LSTM—specifically, the variant presented by
Graves (2013)—in order to illustrate the operations of such models. Concepts
from LSTMs easily carry over to other gated RNNs. Throughout this section, it will likely be useful to refer to Figure 6.7, which illustrates all of the LSTM operations that we will discuss in the text.
Figure 6.7
The dataflow of the long short-term memory (LSTM), with its components and memory cell (M) clearly highlighted. Based on the current input z^(t), previous summary h^(t−1) and previous cell state c^(t−1), the LSTM predicts the updated cell state c^(t) and summary h^(t).
The LSTM augments the recurrent computation by introducing a memory cell, which stores cell state vectors, c^(t) ∈ R^m, that are preserved between computational steps. The LSTM computes summary vectors, h^(t), directly based on c^(t), and c^(t) is, in turn, computed using z^(t), h^(t−1) and c^(t−1). Critically, the cell is not completely overwritten based on z^(t) and h^(t−1), which would expose the network to the same issues as the SimpleRNN. Instead, a certain quantity of the previous cell state may be retained, and the proportion by which this occurs is explicitly learned from data.
Just like in SimpleRNN, we compute features by using a single fully-connected neural network layer over the current input step and previous summary:^24
c̃^(t) = tanh(W_c z^(t) + U_c h^(t−1) + b_c) \qquad (6.85)
But, as mentioned, we do not allow all of this vector to enter the cell, hence why we call it the vector of candidate features, and denote it as c̃^(t). Instead, the LSTM directly learns gating vectors, which are real-valued vectors in the range [0, 1], and decide how much of the signal should be allowed to enter, exit, and overwrite the memory cell.
Three such gates are computed, all based on z^(t) and h^(t−1): the input gate i^(t), which computes the proportion of the candidate vector allowed to enter the cell; the forget gate f^(t), which computes the proportion of the previous cell state to be retained; and the output gate o^(t), which computes the proportion
of the new cell state to be used for the final summary vector. Typically all of
these gates are also derived using a single fully connected layer, albeit with the logistic sigmoid activation logistic(x) = 1/(1 + exp(−x)), in order to guarantee that the outputs are in the [0, 1] range:^25
i^(t) = logistic(W_i z^(t) + U_i h^(t−1) + b_i) \qquad (6.86)
f^(t) = logistic(W_f z^(t) + U_f h^(t−1) + b_f) \qquad (6.87)
o^(t) = logistic(W_o z^(t) + U_o h^(t−1) + b_o) \qquad (6.88)
Finally, these gates are appropriately applied to decode the new cell state, c^(t), which is then modulated by the output gate to produce the summary vector h^(t), as follows:
c^(t) = i^(t) ⊙ c̃^(t) + f^(t) ⊙ c^(t−1) \qquad (6.89)
h^(t) = o^(t) ⊙ tanh(c^(t)) \qquad (6.90)
where ⊙ is element-wise vector multiplication. Applied together, Equations (6.85)–(6.90) completely specify the update rule for the LSTM, which now takes into account the cell vector c^(t) as well:^26

(h^(t), c^(t)) = R(z^(t), (h^(t−1), c^(t−1)))
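Putting Equations (6.85)–(6.90) together, a compact NumPy sketch of a single LSTM step (parameter shapes and names are illustrative):

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(z_t, h_prev, c_prev, params):
    """One LSTM update (h_t, c_t) = R(z_t, (h_prev, c_prev)), Eqs. (6.85)-(6.90)."""
    Wc, Uc, bc, Wi, Ui, bi, Wf, Uf, bf, Wo, Uo, bo = params
    c_tilde = np.tanh(Wc @ z_t + Uc @ h_prev + bc)   # candidate features   (6.85)
    i_t = logistic(Wi @ z_t + Ui @ h_prev + bi)      # input gate           (6.86)
    f_t = logistic(Wf @ z_t + Uf @ h_prev + bf)      # forget gate          (6.87)
    o_t = logistic(Wo @ z_t + Uo @ h_prev + bo)      # output gate          (6.88)
    c_t = i_t * c_tilde + f_t * c_prev               # new cell state       (6.89)
    h_t = o_t * np.tanh(c_t)                         # new summary          (6.90)
    return h_t, c_t

k, m = 4, 8
rng = np.random.default_rng(0)
params = [rng.normal(size=(m, k)) if name == 'W' else
          rng.normal(size=(m, m)) if name == 'U' else np.zeros(m)
          for name in ['W', 'U', 'b'] * 4]
h, c = np.zeros(m), np.zeros(m)
for z_t in [rng.normal(size=k) for _ in range(10)]:
    h, c = lstm_step(z_t, h, c, params)
```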
Note that, as the values of f^(t) are derived from z^(t) and h^(t−1), and are therefore directly learnable from data, the LSTM effectively learns how to appropriately forget past experiences. Indeed, the values of f^(t) directly appear in the backpropagation update for all the LSTM parameters (W_∗, U_∗, b_∗), allowing the network to explicitly control, in a data-driven way, the degree of vanishing for the gradients across the time steps.
Besides tackling the vanishing gradient issue head-on, it turns out that
gated RNNs also unlock a very useful form of invariance to time-warping
transformations, which remains out of reach of SimpleRNNs.
Time warping invariance of gated RNNs We will start by illustrating, in a continuous-time setting,^27 what it means to warp time, and what is required of a recurrent model in order to achieve invariance to such transformations. Our exposition will largely follow the work of Tallec and Ollivier (2018), who initially described this phenomenon; indeed, they were among the first to actually study RNNs from the lens of invariances.
Let us assume a continuous time-domain signal z(t), on which we would like to apply an RNN. To align the RNN's discrete-time computation of summary vectors h^(t),^28 with an analogue in the continuous domain, h(t), we will observe
Figure 6.8
Time-warping operations can be simple, such as time rescaling; e.g. τ(t) = 0.7t (displayed here), which, in a discrete setting, would amount to new inputs being received every 1.43 steps. However, it also admits a wide spectrum of variably-changing sampling rates, e.g. sampling may freely accelerate or decelerate throughout the time domain.
its linear Taylor expansion:

h(t + δ) ≈ h(t) + δ \frac{dh(t)}{dt} \qquad (6.91)

and, setting δ = 1, we recover a relationship between h(t) and h(t + 1), which is exactly what the RNN update function R (Equation 6.80) computes. Namely, the RNN update function satisfies the following differential equation:

\frac{dh(t)}{dt} = h(t + 1) − h(t) = R(z(t + 1), h(t)) − h(t) \qquad (6.92)
We would like the RNN to be resilient to the way in which the signal is sampled (e.g. by changing the time unit of measurement), in order to account for any imperfections or irregularities therein. Formally, we denote a time warping operation τ : R_+ → R_+ as any monotonically increasing differentiable mapping between times. The notation τ is chosen because time warping represents an automorphism of time. See Figure 6.8 for an illustration of warping.
Further, we state that a class of models is invariant to time warping if, for
any model of the class and any such τ, there exists another (possibly the same)
model from the class that processes the warped data in the same way as the
original model did in the non-warped case.
This is a potentially very useful property. If we have an RNN class capable
of modelling short-term dependencies well, and we can also show that this
class is invariant to time warping, then we know it is possible to train such a
model in a way that will usefully capture long-term dependencies as well (as
they would correspond to a time dilation warping of a signal with short-term
dependencies). As we will shortly see, it is no coincidence that gated RNN
models such as the LSTM were proposed to model long-range dependencies.
Achieving time warping invariance is tightly coupled with the presence of gating
mechanisms, such as the input/forget/output gates of LSTMs.
When time gets warped by τ, the signal observed by the RNN at time t is z(τ(t)) and, to remain invariant to such warpings, it should predict an equivalently-warped summary function h(τ(t)). Using Taylor expansion arguments once more, we derive a form of Equation 6.92 for the warped time, that the RNN update R should satisfy:

\frac{dh(τ(t))}{dτ(t)} = R(z(τ(t + 1)), h(τ(t))) − h(τ(t)) \qquad (6.93)

However, the above derivative is computed with respect to the warped time τ(t), and hence does not take into account the original signal. To make our model take into account the warping transformation explicitly, we need to differentiate the warped summary function with respect to t. Applying the chain rule, this yields the following differential equation:

\frac{dh(τ(t))}{dt} = \frac{dh(τ(t))}{dτ(t)} \frac{dτ(t)}{dt} = \frac{dτ(t)}{dt} R(z(τ(t + 1)), h(τ(t))) − \frac{dτ(t)}{dt} h(τ(t)) \qquad (6.94)
and, for our (continuous-time) RNN to remain invariant to any time warping τ(t), it needs to be able to explicitly represent the derivative dτ(t)/dt, which is not assumed known upfront! We need to introduce a learnable function Γ which approximates this derivative. For example, Γ could be a neural network taking into account z(t + 1) and h(t) and predicting scalar outputs.
Now, remark that, from the point of view of a discrete RNN model under time warping, its input z^(t) will correspond to z(τ(t)), and its summary h^(t) will correspond to h(τ(t)). To obtain the required relationship of h^(t) to h^(t+1) in order to remain invariant to time warping, we will use a one-step Taylor expansion of h(τ(t)):

h(τ(t + δ)) ≈ h(τ(t)) + δ \frac{dh(τ(t))}{dt}

and, once again, setting δ = 1 and substituting Equation 6.94, then discretising:
h^(t+1) = h^(t) + \frac{dτ(t)}{dt} R(z^(t+1), h^(t)) − \frac{dτ(t)}{dt} h^(t) = \frac{dτ(t)}{dt} R(z^(t+1), h^(t)) + \left(1 − \frac{dτ(t)}{dt}\right) h^(t)
Finally, we swap dτ(t)/dt with the aforementioned learnable function, Γ. This gives us the required form for our time warping-invariant RNN:
h^(t+1) = Γ(z^(t+1), h^(t)) R(z^(t+1), h^(t)) + (1 − Γ(z^(t+1), h^(t))) h^(t) \qquad (6.95)
We may quickly deduce that SimpleRNNs (Equation 6.81) are not time warping invariant, given that they do not feature the second term in Equation 6.95. Instead, they fully overwrite h^(t) with R(z^(t+1), h^(t)), which corresponds to assuming no time warping at all; dτ(t)/dt = 1, i.e. τ(t) = t.
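A sketch of the time-warping-invariant update of Equation (6.95): a learnable scalar gate Γ interpolates between the candidate update R and the previous summary (NumPy; the specific parametrisations of Γ and R are our own illustrative choices):

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

k, m = 4, 8
rng = np.random.default_rng(0)
W, U, b = rng.normal(size=(m, k)), rng.normal(size=(m, m)), np.zeros(m)   # parameters of R
wg, ug, bg = rng.normal(size=k), rng.normal(size=m), 0.0                  # parameters of Gamma

def R(z_next, h):
    return np.tanh(W @ z_next + U @ h + b)

def Gamma(z_next, h):
    # learnable scalar gate in (0, 1), approximating the warping derivative d tau(t) / dt
    return logistic(wg @ z_next + ug @ h + bg)

def gated_update(z_next, h):
    """Eq. (6.95): h_{t+1} = Gamma * R(z_{t+1}, h_t) + (1 - Gamma) * h_t."""
    g = Gamma(z_next, h)
    return g * R(z_next, h) + (1.0 - g) * h

h = np.zeros(m)
for z_next in [rng.normal(size=k) for _ in range(10)]:
    h = gated_update(z_next, h)
```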
Further, our link between continuous-time RNNs and the discrete RNN based on R rested on the accuracy of the Taylor approximation, which holds only if the time-warping derivative is not too large, i.e., dτ(t)/dt ≤ 1. The intuitive explanation of this is: if our time warping operation ever contracts time in a way that makes time increments (t → t + 1) large enough that intermediate data changes are not sampled, the model can never hope to process time-warped inputs in the same way as original ones; it simply would not have access to the same information. Conversely, time dilations of any form (which, in discrete terms, correspond to interspersing the input time-series with zeroes) are perfectly allowed within our framework.
Combined with our requirement of a monotonically increasing τ (dτ(t)/dt > 0), we can bound the output space of Γ as 0 < Γ(z^(t+1), h^(t)) < 1, which motivates the use of the logistic sigmoid activation for Γ, e.g.:

Γ(z^(t+1), h^(t)) = logistic(W_Γ z^(t+1) + U_Γ h^(t) + b_Γ)
exactly matching the LSTM gating equations (e.g. Equation 6.86). The main difference is that LSTMs compute gating vectors, whereas Equation 6.95 implies Γ should output a scalar. Vectorised gates (Hochreiter 1991) allow fitting a different warping derivative in every dimension of h^(t), allowing for reasoning over multiple time horizons simultaneously.
It is worth taking a pause here to summarise what we have done. By requiring that our RNN class is invariant to (non-destructive) time warping, we have derived the necessary form that it must have (Equation 6.95), and shown that it exactly corresponds to the class of gated RNNs. The gates' primary role under this perspective is to accurately fit the derivative dτ(t)/dt of the warping transformation.
The notion of class invariance is somewhat distinct from the invariances we studied previously. Namely, once we train a gated RNN on a time-warped input with τ_1(t), in many cases we cannot zero-shot transfer it to a signal warped by a different τ_2(t). Rather, class invariance only guarantees that gated RNNs are powerful enough to fit both of these signals in the same manner, but potentially with vastly different model parameters. That being said, the realisation that effective gating mechanisms are tightly related to fitting the warping derivative can yield useful prescriptions for gated RNN optimisation, as we now briefly demonstrate.
For example, we can often assume that the range of the dependencies we are interested in tracking within our signal will be in the range [T_l, T_h] time-steps. By analysing the analytic solutions to Equation 6.94, it can be shown that the characteristic forgetting time of h^(t) by our gated RNN is proportional to 1/Γ(z^(t+1), h^(t)). Hence, we would like our gating values to lie between 1/T_h and 1/T_l in order to effectively remember information within the assumed range.
Further, if we assume that z^(t) and h^(t) are roughly zero-centered (which is a common by-product of applying transformations such as layer normalisation; Ba, Kiros, and Hinton 2016), we can assume that E[Γ(z^(t+1), h^(t))] ≈ logistic(b_Γ). Controlling the bias vector of the gating mechanism is hence a very powerful way of controlling the effective gate value.^29
Combining the two observations, we conclude that an appropriate range of gating values can be obtained by initialising b_Γ ∼ −log(U(T_l, T_h) − 1), where U is the uniform real distribution. Such a recommendation was dubbed chrono initialisation by Tallec and Ollivier (2018), and has been empirically shown to improve the long-range dependency modelling of gated RNNs.
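A sketch of this chrono initialisation of the gate bias (the dependency range [T_l, T_h] is an assumed hyperparameter, and all names are illustrative):

```python
import numpy as np

def chrono_bias(m, t_low, t_high, rng):
    """Initialise gate biases so that logistic(b) is roughly 1/T for T ~ U(t_low, t_high)."""
    T = rng.uniform(t_low, t_high, size=m)   # characteristic forgetting times
    return -np.log(T - 1.0)                  # b_Gamma ~ -log(U(T_l, T_h) - 1)

rng = np.random.default_rng(0)
b_gamma = chrono_bias(m=8, t_low=10, t_high=100, rng=rng)
gate = 1.0 / (1.0 + np.exp(-b_gamma))        # expected gate values, roughly in [1/T_h, 1/T_l]
assert np.all((gate > 1.0 / 100) & (gate < 1.0 / 10 + 1e-9))
```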
Exercises
1. We noted that a matrix is circulant if and only if it commutes with the shift operator S. Let us prove this fundamental result for 1D grids.
(i) Write down the explicit elements of the right shift matrix S of N elements.
(ii) Let C be an arbitrary N × N matrix. Write down the elements of the resulting matrices SC and CS.
(iii) By requiring translation equivariance (SC = CS), deduce the relationship between c_{ij} and c_{i+1,j+1}.
(iv) Use this relationship to conclude that the elements c_{ij} depend only on (i − j) mod N, proving that C must be a circulant.
2. Convolutional neural networks often use strided convolutions (or pooling) to
coarsen the grid. For example, a stride of s = 2 drops every second pixel of the out-
put grid. Show that an architecture with a stride of 2 is no longer equivariant to all
translations, and characterise which translations it is equivariant to.
3. Given that grids are a special case of graphs, we may attempt to express a CNN using the language of graph neural networks (GNNs) from Chapter 5. Consider a 1D cyclical grid with n nodes as a ring graph, comprising nodes V = {0, 1, ..., n − 1} and neighbourhoods N_u = {(u − 1) mod n, u, (u + 1) mod n}, with node features x_u ∈ R^3 being typical RGB values.
(i) Assume we use a GCN (graph convolutional network) model over this graph:

h_u = ϕ\left(x_u, \sum_{v ∈ N_u} \frac{1}{\sqrt{d_u d_v}} ψ(x_v)\right) .
This GNN flavour has some of the closest ties to the spectral properties underlying CNNs, and therefore it represents a valuable attempt. However, it sadly cannot express all possible convolutions over images (even if we expand the neighbourhoods). Characterise which image CNNs are representable using the GCN.
(ii) We can attempt fixing this by moving to a more expressive flavour of GNNs; recall that the most expressive option is message passing GNNs:

h_u = ϕ\left(x_u, \bigoplus_{v ∈ N_u} ψ(x_u, x_v)\right) ,

for a permutation-invariant aggregator ⊕. Provide a simple counter-example showing that even this expressive GNN cannot reliably simulate all CNN filters over the grid.
(iii) What allows CNNs to bypass this restriction? Leverage this assumption to provide a
correction to the message-passing equation that can represent all possible CNNs over
the grid.
4. In this question we will study time rescalings, a special case of time warping transformations. Time rescalings have the form τ(t) = αt, where α ∈ (0, 1]. It is easy to verify that rescalings monotonically dilate time, satisfying the requirements for a non-destructive time warping operation. We will start by studying continuous-time RNNs, taking input signals z(t) to continuous summaries h(t).
(i) State the linear Taylor expansion of h(t) around t. Under which conditions is this a good approximation? Use the Taylor expansion to relate h(t) to h(t + 1), using the discrete RNN update R(z^(t+1), h^(t)) and the summary derivative dh(t)/dt.
(ii) Now introduce the proposed time rescaling, and derive the required condition for class invariance to such operations.
(iii) Derive the update rule for a discrete RNN that is invariant to time rescaling.
(iv) Assume that a time-rescaling-invariant RNN has been pre-trained over data which has been rescaled using τ_1(t) = α_1 t. Is it possible to zero-shot transfer this RNN to make predictions over a signal that has been rescaled by τ_2(t) = α_2 t? If so, how?
Notes
1
Because of the periodic boundary conditions, it is a circular or cyclic convolution. In signal processing, θ is often referred to as the “filter”, and in CNNs, its coefficients are learnable.
2
The left shift operator is given by S^⊤. Obviously, shifting left and then right (or vice versa) does not do anything, which means S is orthogonal: S^⊤S = SS^⊤ = I.
3
We must additionally assume distinct eigenvalues, otherwise there might be
multiple possible diagonalisations. This assumption is satisfied with our choice
of S.
4
S is orthogonal but non-symmetric, hence, its eigenvectors are orthogonal
but the eigenvalues are complex (roots of unity).
5
Note that the eigenvectors are complex, so we need to take complex
conjugation when transposing Φ.
6
Since the Fourier transform is an orthogonal matrix (Φ^∗Φ = I), geometrically it acts as a change of the system of coordinates that amounts to an N-dimensional rotation. In this system of coordinates (“Fourier domain”), the action of a circulant matrix C becomes an element-wise product.
7
In graph signal processing, the eigenvectors of the graph Laplacian are often used as an alternative to the adjacency matrix to construct the graph Fourier transform, see Shuman et al. 2013. On grids, both matrices have joint eigenvectors, but on graphs they result in somewhat different though related constructions.
8
Eigenfunction is synonymous with ‘eigenvector’ and is used when referring
to eigenvectors of continuous operators.
9
The spectral characterisation of the translation group is a particular case of a more general result in Functional Analysis, Stone's Theorem, which derives an equivalent characterisation for any one-parameter unitary group.
10
Recall, C(θ) is a circulant matrix with parameters θ.
11
Note that we usually imagine x and θ_{vw} as 2D matrices, but in this equation, both x and θ_{vw} have their two coordinate dimensions flattened into one, making x a vector and C(θ_{vw}) a matrix.
12
CNNs which only consist of the operations mentioned in this paragraph are
often dubbed “all-convolutional”. In contrast, many CNNs flatten the image
across the spatial axes and pass them to an MLP classifier, once sufficient
equivariant and coarsening layers have been applied. This loses translation
invariance.
13
Historically, ResNet models are predated by highway networks (Srivastava, Greff, and Schmidhuber 2015), which allow for more general gating mechanisms to control the residual information flow.
14
In this case, the ResNet is performing a Forward Euler discretisation of an ODE: ẋ = σ(F(x)).
15
We note that normalising activations of neural networks has seen attention
even before the advent of batch normalisation. See, e.g., Lyu and Simoncelli
2008.
16
Whether the domain is considered static or dynamic concerns time scales:
e.g., a road network does change over time (as new roads are built and old
ones are demolished), but significantly slower compared to the flow of traffic.
Grids 191
Similarly, in social networks, changes in engagement (e.g. Twitter users re-
tweeting a tweet) happen at a much higher frequency than changes in the follow
graph.
17
We do not lose generality in our example; equivalent analysis can be done
e.g. for node-level outputs on a spatiotemporal graph; the only difference is in
the choice of encoder f (which will then be a permutation equivariant GNN).
18
Note that the z^(t) vectors can be seen as points on a temporal grid: hence, processing them with a CNN is also viable in some cases. Transformers are also increasingly popular models for processing generic sequential inputs.
19
In spite of their name, SimpleRNNs are remarkably expressive. For exam-
ple, it was shown by Siegelmann and Sontag (1995) that such models are
Turing-complete, meaning that they can likely represent any computation we
may ever be able to execute on computers.
20
Note that this construction is extendable to grids in higher dimensions,
allowing us to, e.g., process signals living on images in a scanline fashion.
Such a construction powered a popular series of models, such as the PixelRNN
from Oord, Kalchbrenner, and Kavukcuoglu (2016).
21
Note that equivalent analyses will arise if we use a different padding vector
than 0.
22
In a very similar vein, we can derive equivariance to left-shifting by s steps as long as t′ ≥ s.
23
Contractions are functions γ : X → X such that, under some norm ‖·‖ on X, applying γ contracts the distances between points: for all x, y ∈ X, and some q ∈ [0, 1), it holds that ‖γ(x) − γ(y)‖ ≤ q‖x − y‖. Iterating such a function then necessarily converges to a unique fixed point, as a direct consequence of Banach's Fixed Point Theorem (Banach 1922).
24
Note that we have set the activation function to tanh here; as LSTMs are
designed to ameliorate the vanishing gradient problem, it is now appropriate to
use a sigmoidal activation.
25
Note that the three gates are themselves vectors, i.e. i^(t), f^(t), o^(t) ∈ [0, 1]^m. This allows them to control how much each of the m dimensions is allowed through the gate.
26
This is still compatible with the RNN update blueprint from Equation (6.80); simply consider the summary vector to be the concatenation of h^(t) and c^(t), sometimes denoted by h^(t) ‖ c^(t).
27
We focus on the continuous setting as it will be easier to reason about
manipulations of time there.
28
We will use h(t) to denote a continuous signal at time t, and h^(t) to denote a discrete signal at time-step t.
29
This insight was already spotted by Gers and J. Schmidhuber (2000) and
Jozefowicz, Zaremba, and Sutskever (2015), who empirically recommended
initialising the forget-gate bias of LSTMs to a constant positive vector, such as
1.