3 Foundations of Equivariant Deep Learning
We define symmetries as structure-preserving transformations of an
object, and study examples.
Symmetries of a learning problem are transformations of the input space
that leave the output invariant.
To exploit symmetries in deep learning, we use equivariant and invariant
layers.
Symmetries form groups. In this chapter we study basic notions from
group theory.
Groups act on data via group actions or representations. In this chapter
we study basic notions from representation theory.
We study equivariant mappings and equivariant networks in general, and
provide an overview of common equivariant neural network layers.
In this chapter we begin our study of the foundations of geometric deep
learning. The central concept is symmetry: a transformation that leaves all
relevant structure of an object unchanged. We will look at the way in which
symmetries arise in machine learning, and study the central algebraic structure
used in the mathematical study of symmetries: the group. We define general
and linear group actions / representations, which tell us how the elements of
an abstract group can be represented by concrete (linear) maps. Using group
representations, we can specify how a group acts on various kinds of geomet-
ric data, from images to point clouds to tensor fields. Next we discuss invariant
and equivariant maps, which are the building blocks of equivariant networks
and the primary tools we use to leverage knowledge of symmetries to improve
the statistical efficiency of deep learning methods. We end the chapter with
the equivariance cookbook, which provides a number of common kinds of
equivariant maps from which equivariant networks can be constructed.
Our focus in this chapter is on the algebraic (group-theoretic) aspects of
GDL, whereas the next chapter will cover the geometric ones. Concretely, this
means that in the present chapter we consider data that lives in a plain vector
space with a group of symmetries acting on it, but without too much further
structure, while in the next chapter this vector space is taken to be the space of
signals (functions) on a space (e.g. graph, manifold, etc.). Thus, for now, we
can talk about symmetries, groups, representations and equivariance, but we
will not consider concepts such as locality and convolution, which depend on
the structure of the underlying space.
3.1 Symmetry Transformations
Let us begin by giving a general definition of symmetry, before we consider
the role it plays in geometric deep learning. Just as the best way to define
“vector” is “element of a vector space” (defining the thing by its relation to
other things of the same kind), one could define “symmetry” as “element of
a symmetry group”. But although such a definition would be general, compact
and precise, it would not be very intuitive. So we begin with a more infor-
mal definition, and then show how the notion of group (to be defined) neatly
captures the idea.
The modern concept of symmetry applies to a wide range of abstract
mathematical structures, but let us first consider the familiar example of an
equilateral triangle (fig. 3.1, left). We say that rotation by 120° is a sym-
metry of the triangle, because after applying this transformation the triangle
looks exactly as it did before. The same is true for reflection about one of
the three bisectors, as well as arbitrary sequences of rotations and reflections.
Furthermore, the inverse of any of these symmetries is itself also a symmetry.
Figure 3.1
Euclidean symmetries (left) and general isometries (right) of a triangle.
When drawn on paper, the original and transformed triangle look exactly
the same. However, for other ways of encoding or representing the triangle,
we may find that there are actually some superficial changes. For instance, if
we encode the triangle as an ordered list of three corner points in (x, y) coor-
dinates, then a rotation by 2π/3 of each coordinate vector leads to an overall
permutation (cyclic shift) of the three corners. Nevertheless, if we forget about
the ordering, we still have the same set of corners. Since we do not consider
the order of the corners to be part of the essential structure of the triangle, the
rotated triangle (list of corner coordinates) should be considered equivalent to
the one we started with. This phenomenon is key to geometric deep learning,
because any sensible computation on such data (including for instance those
performed by a neural network) should treat equivalent inputs equivalently,
even if they have superficial differences.
In this example and also in general, a symmetry transformation is a special
kind of structure-preserving map. What is meant by structure is highly context-
dependent, but in order to discuss symmetries one has to choose some notion
of structure of interest. If we are doing Euclidean geometry, the structure of
interest is Euclidean distance structure, and so here structure-preserving map
means distance-preserving map, aka isometry or Euclidean transformation.
More precisely, g : R^n → R^n is an isometry with respect to the distance metric d if

    d(gu, gv) = d(u, v)

for all u, v ∈ R^n. The isometries of the plane consist of rotations, reflections,
translations, and all possible compositions of these.
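As a quick numerical illustration (our own sketch, not part of the original text), we can check in NumPy that a rotation followed by a translation preserves pairwise distances, while a generic linear map does not:

import numpy as np

rng = np.random.default_rng(0)
u, v = rng.normal(size=2), rng.normal(size=2)

theta = 0.7                                   # an arbitrary rotation angle
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
t = np.array([1.5, -0.3])                     # an arbitrary translation

g = lambda x: R @ x + t                       # an isometry of the plane
A = rng.normal(size=(2, 2))                   # a generic (non-isometric) linear map

d = lambda x, y: np.linalg.norm(x - y)
print(np.isclose(d(g(u), g(v)), d(u, v)))     # True: g preserves distances
print(np.isclose(d(A @ u, A @ v), d(u, v)))   # False (almost surely): A does not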
Applying an arbitrary isometry to our triangle, we obtain a new but congru-
ent (or isomorphic) triangle (fig. 3.1, right). When the transformed triangle
is not just congruent but actually the same as the one we started with, we call
the transformation a symmetry of the triangle. In other words, a symmetry
transformation is a way in which the triangle is congruent to itself.
With this in mind we could say in general that a symmetry is a way in which
an object is equivalent (or isomorphic) to itself, or more prosaically:
A symmetry transformation (or automorphism) of an object is an
invertible and structure-preserving map from the object to itself.
Oftentimes, we are interested in the symmetries of a space. Consider for
example the plane endowed with a Euclidean distance metric. This space has as
symmetries all isometries of the plane, i.e. all rotations, reflections and trans-
lations. They are symmetries because they are maps from the plane to itself,
they are invertible, and they preserve distances. In this case it happens to be the
case that all structure-preserving maps are invertible, but in general one has to
add this as an additional requirement.
The same set may be endowed with different structures, leading to different
symmetries. For instance we could see the plane as:
1. A set, so that symmetries are bijections (invertible maps; permutations),
2. A vector space, so that symmetries are invertible linear transformations,
3. An affine space, so that symmetries are invertible affine maps (linear maps
plus translation),
4. A topological space, so that symmetries are homeomorphisms (continuous
maps with continuous inverse),
5. A smooth manifold, so that symmetries are diffeomorphisms (smooth maps
with smooth inverse).
We could continue the list, regarding the plane each time as an inner product
space, Euclidean metric space, and so on. Those not familiar with these notions
should not worry: the only point we wish to make is that what counts as a
symmetry depends on exactly what we consider to be the structure of interest.
3.2 Symmetries of Learning Problems
How do symmetries appear in machine learning, and why are they relevant?
In this section we look at two kinds of symmetries that arise in supervised
learning: symmetries of the target function (manifested as invariance or equiv-
ariance) and symmetries of the parameterization. The former will be used
throughout this book to develop statistically efficient networks that respect
such symmetries, whereas the latter has relevance for optimization but will
not be the focus of this book.
3.2.1 Invariance
In supervised learning (section 2.1) the goal is to use data samples to approxi-
mate the target function f : X → Y mapping from input space X to output space
Y. For concreteness we will consider the problem of classification, where Y is
the set of class labels, but we could just as well consider a continuous output
space in case of regression.
In the previous section, we defined a symmetry of an object as an invertible
structure-preserving transformation of the object. For learning problems such
as classification, the object of interest is the input space X, and we consider
the label function f to define the structure of interest (“class structure”). A
symmetry of the learning problem is thus an invertible and label-preserving
map g : X → X. To say that g preserves the label, or leaves it invariant, is to
say that for all x ∈ X, we have (f ∘ g)(x) = f(x). That is, applying g and then f to
any input x ∈ X gives the same result as just applying f directly. In diagrammatic
Figure 3.2
Rotation is a symmetry of the label function in image classification.
form, we can express this as:

    X ----g----> X
     \          /
      f        f
       \      /
        v    v
          Y                                           (3.7)
A commutative diagram like this expresses the fact that if we compose maps
(arrows) along any directed path with the same begin and end points, the result-
ing composite maps are the same. Hence this diagram expresses the fact that g
followed by f equals f.
As an example of a label symmetry, consider image classification, where X
is the space of planar images and Y is a set of labels. Typically, the class label
does not depend on the position and orientation of the object, so a translation
or rotation of the image will be a symmetry (see fig. 3.2). Similarly, if we are
classifying a point cloud, then X is a space of point sequences. Typically, the
label does not depend on the order of the points, so any permutation of points
will be a symmetry. Additionally, it may be that rotating and translating all of
the points simultaneously does not change the label.
The examples of symmetries mentioned above are quite natural, but in gen-
eral there will be many more. For instance, if we have two inputs x, x′ ∈ X with
the same label (i.e. f(x) = f(x′)), we can define a transformation g that swaps
them, while leaving all other points unchanged. This will be a symmetry of f,
but it is unlikely we would know it if we didn’t already know f !
If somehow we came to know all symmetries of a label function f, learning
would become almost entirely trivial. The reason is that if we observe a data point
x with its label, we immediately learn the label of all points x′ in the same class,
because there exists a symmetry that swaps x and x′. So in theory we would
need only one label per class. This example nicely demonstrates how knowl-
edge of symmetries can in principle be used to improve data efficiency, but it
also suggests that if we knew all symmetries of the label function we wouldn’t
really need learning. In other words, for problems where we do need learning,
Figure 3.3
One invariant and three equivariant computer vision problems: image classification,
oriented keypoint detection, image segmentation, optical flow estimation.
we should not expect to be able to know all of the symmetries a priori. In chap-
ter 4, we will consider a natural class of symmetries arising from the domain
on which the data lives, and which we often know about a priori. For now we
will simply assume knowledge of some symmetries without worrying about
the specific kind or how we have come to know them.
3.2.2 Equivariance
For invariant learning problems, the output remains unchanged when we apply
a symmetry transformation to the input. There is another class of problems
where instead the target function is equivariant, meaning that the output trans-
forms along with the input (see fig. 3.3 for examples). In image segmentation
for instance, the input is an RGB image and the output is a segmentation
mask, which assigns a label to each pixel. When we shift the input, the out-
put should also shift. Another example is image registration, where the goal is
to output a transformation that puts the image in a canonical pose (e.g. putting
a brain scan in upright position and perhaps applying a deformation so that
brain regions can be compared voxel-wise). Similarly, learned or hand-crafted
keypoint detectors (e.g. SIFT; Lowe 1999; Lenc and Vedaldi 2016) should be
equivariant: when the image is transformed, the keypoint position, scale, and
orientation transform along with it.
As these examples show, the output will in general be a different kind of
geometric quantity than the input, meaning that it lives in a different space and
transforms in a different way under the same symmetry. For instance, the same
rotation can act on an input signal (e.g. a brain scan with n^3 voxels) and on the
output pose variable (e.g. a 3 ×3 rotation matrix). What it means to “rotate”
a voxel grid or 3 ×3 matrix is different. For this reason we will formalize the
concepts of group (e.g. rotations) and group representations / actions (how to
apply a rotation to various kinds of quantities) later in this chapter. For now,
we will simply write g for a symmetry, and write ρ_X(g) : X → X and
ρ_Y(g) : Y → Y for the way it acts on the input and output space.
Equivariance of the target function f : X → Y can then be stated as:

    f ∘ ρ_X(g) = ρ_Y(g) ∘ f.                          (3.8)

This is an equality of mappings, which means that the mappings on the left
and right-hand side should agree on all inputs x ∈ X. Hence, this equation says
that if we apply a symmetry transformation g to an input x using ρ_X and then
apply f, we get the same output as applying f first and then applying the same
symmetry g to the output using ρ_Y.
We can also express equivariance in diagrammatic form (appendix E), so that
we can easily see the domains and codomains of the maps involved:

    Y --ρ_Y(g)--> Y
    ^             ^
    |f            |f
    |             |
    X --ρ_X(g)--> X                                   (3.9)
To say that this diagram commutes is to say that starting in the bottom left X,
going up and then right is the same as right and then up.
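To make eq. (3.8) concrete, here is a minimal numerical check (our own sketch; the specific maps are illustrative choices): a pixelwise function applied to a planar image commutes with a 90° rotation, implemented for both ρ_X and ρ_Y by np.rot90, whereas a position-dependent map does not:

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(8, 8))                   # a small "image"

rho_X = lambda img: np.rot90(img)             # action of a 90-degree rotation on inputs
rho_Y = lambda img: np.rot90(img)             # the same action on outputs

f = lambda img: np.tanh(3 * img)              # a pixelwise map: equivariant
print(np.allclose(f(rho_X(x)), rho_Y(f(x))))  # True: f o rho_X(g) = rho_Y(g) o f

f_bad = lambda img: img * np.arange(8)        # position-dependent weighting: not equivariant
print(np.allclose(f_bad(rho_X(x)), rho_Y(f_bad(x))))   # False in general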
We can obtain yet another way of expressing equivariance by multiplying
eq. (3.8) on the left by ρ_Y(g)^{-1} and cancelling:

    ρ_Y(g)^{-1} ∘ f ∘ ρ_X(g) = f.

This shows most clearly that also in the case of equivariance we can think of
g as a symmetry, because it leaves f invariant when acting on the input and
output space simultaneously, similar to the invariance equation f ∘ ρ_X(g) = f.
Indeed one can see that invariance (eq. (3.7)) is a special case of equivariance
(eq. (3.9)) where ρ_Y(g) = id_Y is the identity map for all g ∈ G.
3.2.2.1 Equivariance and Data Efficiency As with invariance, equivari-
ance puts a constraint on the hypothesis space F, thereby reducing the
statistical complexity of learning (chapter 2). The more symmetries we have,
the more constraints we get, and the smaller the remaining hypothesis space.
Although we have so far discussed equivariance of the network as a whole,
in practice one would build an equivariant network by composing learnable
equivariant layers, each of which is subject to a constraint that reduces the
number of parameters in that layer.
To get a better understanding of the role of equivariance in pattern recog-
nition, consider fig. 3.4. In this example, a neural network layer learns a
style-agnostic representation of the pattern A (two inputs x, x′ ∈ X, repre-
senting two styles of ‘A,’ are mapped into the same point, y = f(x) = f(x′) ∈ Y).
However, when the input patterns are transformed by g (in our example, a
rotation ρ_X(g)x and ρ_X(g)x′), a general neural network does not necessarily
generalize in a symmetry-consistent manner, i.e., it might be that f(ρ_X(g)x) ≠
f(ρ_X(g)x′).
By imposing equivariance, we guarantee that the model generalizes in a way
that is consistent with the symmetry: from the equivariance condition,
    f(ρ_X(g)x) = ρ_Y(g)f(x)
    f(ρ_X(g)x′) = ρ_Y(g)f(x′);

together with the assumption f(x) = f(x′), it follows that f(ρ_X(g)x) = f(ρ_X(g)x′).
Figure 3.4
Example of the role of equivariance in pattern recognition.
In addition to these heuristic arguments, there is a growing body of work
in learning theory that formally characterizes the benefits of equivariance (Lyle
et al. 2019; Sannai, Imaizumi, and Kawano 2021; Elesedy and Zaidi 2021;
Behboodi, Cesa, and Cohen 2022).
3.2.3 Symmetries of the Parameterization
There is another way in which symmetries appear in machine learning. When
we approximate the true label function f : X → Y by a parameterized function
f : X × Θ → Y, we may be introducing symmetries of the parameterization.
Whereas input-space symmetries act on X, a parameter-space symmetry acts
on Θ. A mapping h : Θ → Θ is a symmetry of the parameterization if for all
u ∈ X and θ ∈ Θ:

    f(u, hθ) = f(u, θ).                               (3.10)
In other words, the mappings f (–, θ) and f (–, hθ) are the same; the parameter
vectors θ and hθ correspond to the same classifier.
In neural networks, one can usually permute hidden units (along with their
incoming and outgoing weights), without changing the input-output mapping.
Furthermore, for certain kinds of non-linearities like ReLU(x) = max(x, 0), we
can scale the input and inversely scale the output of the nonlinearity without
changing the function: α^{-1} ReLU(αx) = ReLU(x) for any α > 0.
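The following sketch (our own illustration; the layer sizes are arbitrary) verifies both kinds of parameter-space symmetry for a small ReLU network: permuting hidden units together with their incoming and outgoing weights, and rescaling one hidden unit by α > 0 while dividing its outgoing weights by α:

import numpy as np

rng = np.random.default_rng(2)
W1, b1 = rng.normal(size=(5, 3)), rng.normal(size=5)   # input -> hidden
W2, b2 = rng.normal(size=(2, 5)), rng.normal(size=2)   # hidden -> output
relu = lambda z: np.maximum(z, 0)
net = lambda x, W1, b1, W2, b2: W2 @ relu(W1 @ x + b1) + b2

x = rng.normal(size=3)
out = net(x, W1, b1, W2, b2)

# 1) Permute the hidden units (rows of W1 and b1, columns of W2).
perm = rng.permutation(5)
out_perm = net(x, W1[perm], b1[perm], W2[:, perm], b2)

# 2) Scale one hidden unit by alpha and its outgoing weights by 1/alpha.
alpha = 3.7
W1s, b1s, W2s = W1.copy(), b1.copy(), W2.copy()
W1s[0] *= alpha; b1s[0] *= alpha; W2s[:, 0] /= alpha
out_scaled = net(x, W1s, b1s, W2s, b2)

print(np.allclose(out, out_perm), np.allclose(out, out_scaled))   # True True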
Knowledge of parameter-space symmetries can be useful in optimization,
Bayesian inference, and model averaging (Kunin et al. 2021; Ainsworth,
Hayase, and Srinivasa 2022; Wei et al. 2022; Kurle et al. 2022; Zhao et
al. 2022; Zhao et al. 2023; Wei and Lau 2023). In some applications it is useful
for one neural network to process the weights of another, in which case the
former should respect the weight-space symmetries (Navon et al. 2023; Zhou
et al. 2023). In the rest of this book we will primarily focus on input space
symmetries and how to use them to improve statistical efficiency.
3.3 Symmetry Groups
Having defined symmetries and seen how they emerge in machine learning, we
will now begin our study of the mathematical machinery used to think about
symmetries in a systematic and general way. We begin in this section with
the concept of a symmetry group, before studying group representations and
equivariant maps.
The set of all symmetries of an object has a number of salient properties,
which have been distilled into a few abstract group axioms. The first property
to note is that the identity transformation (“do nothing”) is always a symme-
try, in a trivial way. Secondly, symmetries may be combined to obtain new
symmetries: if g and h are two symmetries, then their composition g ∘ h (first
applying h and then g) is also a symmetry, and this composition operation is
associative. Finally, by definition symmetries are always invertible, and the
inverse is also a symmetry. We encourage the reader to verify for themselves
that these properties indeed hold for symmetries as we have defined them in
section 3.1.
It is possible to define a symmetry group as a set of symmetry transforma-
tions g : X → X, satisfying these axioms (identity, closure under composition
and inverses). However it turns out, as it often does, that in order to study a
group it is not so important what symmetries are (i.e. what they do to ele-
ments of X), but rather how they compose with other symmetries. So the
most common definition of group does not say what the group elements are
(symmetries), but only encodes the compositional structure:
A group is a set G along with a binary operation ∘ : G × G → G
called composition or the group operation (for brevity, denoted by
juxtaposition g ∘ h = gh) satisfying the following axioms:

Associativity: (gh)k = g(hk) for all g, h, k ∈ G.
Identity: there exists a unique e ∈ G satisfying eg = ge = g for all g ∈ G.
Inverse: For each g ∈ G there is a unique inverse g^{-1} ∈ G such that
gg^{-1} = g^{-1}g = e.
Note that as with function composition or matrix multiplication, but unlike
multiplication of numbers, the group operation need not be commutative, i.e.
we may have gh ≠ hg. Groups for which gh = hg for all g, h ∈ G are called
commutative or Abelian. Groups can be finite or infinite, discrete or continuous.
So-called Lie groups, like the group of rotations in R^n, are smooth mani-
folds, meaning we can do calculus on them (more on this in later chapters).
Section 3.3.3 shows a number of examples of groups. It is worth checking for
each one that it indeed satisfies the axioms of a group.
As mentioned before, we have defined a group as an abstract object, meaning
we have not stated what the group elements are (e.g. symmetry transfor-
mations of some object, as we defined them in section 3.1), only how they
compose. By contrast, a concrete group (sometimes called symmetry group)
would have as elements symmetry transformations (section 3.1) g : X → X,
and the group operation is given by function composition. In the special case of
matrix groups, the group elements are matrices A ∈ G ⊆ R^{n×n}, and the group
operation is given by matrix multiplication. Starting from a concrete group
of transformations, we obtain an abstract group by simply forgetting that the
elements are functions or matrices, while remembering the composition rule
∘ : G × G → G. As it turns out, knowing how group elements compose is all
you need to study every aspect of the group itself.
In practice, almost all of the groups we consider in this book can be defined
as matrix groups, so let us state the definition explicitly:
An n-dimensional real matrix group is a set of matrices G ⊆ R^{n×n}
satisfying the following axioms:

Closure: For all A, B ∈ G, the matrix product AB is also in G.
Identity: the identity matrix I is in G.
Inverse: All A ∈ G are invertible (full rank) and the matrix inverse
A^{-1} is also in G.
In a matrix group, the composition operation is given by matrix-matrix mul-
tiplication, which is automatically associative, and so we do not need to add
this axiom. In an abstract group where we explicitly define the composition
rule as a map ∘ : G × G → G, the composition of two elements of G is always in
G, but since the matrix product of two elements of G ⊆ R^{n×n} need not be in G,
we need to add this closure axiom explicitly for matrix groups.
Whereas a concrete group is made up of functions g : X → X that can be
applied to elements x ∈ X, the data specifying an abstract group does not tell us
how to act on geometric data. This ability can be reintroduced by defining one
or more group representations (or group actions), as discussed in section 3.4.
Group representations are useful even when we are dealing with a matrix
group, because oftentimes we want the same group to act on different kinds
of geometric quantities, such as the inputs, outputs and internal features of a
neural network.
3.3.1 Subgroups
We noted in section 3.1 that what can be called a symmetry depends on what
structure we want to preserve. When we add structure, there will be fewer
symmetries; our group is reduced to a subgroup.
Let G be a group. A subgroup H ⊆ G is a subset of G that is itself a
group. Concretely, this means that H is closed under composition and
inverses: when we compose two elements in H the result is again in H,
and the inverse of any element in H is also in H.
For any group, the subset H = {e} is a subgroup, and similarly any group is
a (non-proper) subgroup of itself. For a slightly less trivial example, consider
the group of planar translations (R^2, +), and the subgroup given by translations
along the x-axis (which leave the y-coordinate unchanged).
A classical example comes from the Erlangen program: Klein noted that
different geometries are associated with different groups, which in some cases
form subgroups that preserve progressively more geometric structure. For
instance, the group of Euclidean transformations is a subgroup of the affine
group, which is a subgroup of the projective group. Projective, affine and
Euclidean geometry are concerned with the study of invariants of their respec-
tive groups, such as distances for Euclidean geometry, ratios of surface areas
for affine geometry, and cross-ratios for projective geometry.
3.3.2 *Mappings between Groups: Homomorphisms and Isomorphisms
Whenever we study a new kind of mathematical gadget, it is a good idea to think
about the appropriate notion of structure-preserving mapping between gadgets.
In the case of groups, these are called group homomorphisms. They are maps
between groups that preserve the compositional structure of a group:

Let G and H be groups. A group homomorphism is a mapping ϕ : G → H
such that for all g, h ∈ G:
ϕ(gh) = ϕ(g)ϕ(h).
When interpreting the equation above, note that on the left hand side we are
composing g and h using the composition operation in G, whereas on the right
hand side we are composing ϕ(g) and ϕ(h) in H. One can show that a group
homomorphism maps the identity of G to the identity of H.
If we have two group homomorphisms ϕ_1 : G → H and ϕ_2 : H → K, then
their composition ϕ_2 ∘ ϕ_1 : G → K is also a homomorphism. Moreover, for any
group G the identity map id_G : G → G is a homomorphism, which shows that
groups and group homomorphisms form a category (appendix E).
One can show that the image of a group homomorphism, im ϕ = {h ∈
H | ∃ g ∈ G : ϕ(g) = h}, is a subgroup of H. Similarly, the kernel ker ϕ = {g ∈
G | ϕ(g) = e_H} is a subgroup of G. The interested reader may study the
first isomorphism theorem, which relates the image and kernel of a group
homomorphism.
Group homomorphisms always preserve positive compositional relations of
the form gh = k, but they need not preserve negative relations of the form gh ≠ k.
As an extreme example, for any groups G and H, the mapping ϕ(g) = e_H that
sends every element of G to the identity of H is a group homomorphism.
When a group homomorphism does preserve distinctness, i.e. g ≠ h ⟹ ϕ(g) ≠
ϕ(h), it is called an injective group homomorphism or monomorphism. Hence
one can think of an injective group homomorphism as pointing to a G-shaped
subgroup of H, labelling the elements of this subgroup by elements of G
without duplicates.
If a mapping is both injective (mapping each element of the domain to a
different element of the codomain) and surjective (reaching every element of
the codomain) it is bijective, or invertible. A bijective group homomorphism is
called a group isomorphism. Isomorphic groups are for all intents and purposes
the same, because we can translate any statement about one of them to an
equally true statement about the other, and so it is very common to consider
them as the same group.
3.3.3 Examples of Groups
Throughout this book we will encounter various kinds of symmetry groups:
commutative and non-commutative, finite and infinite, discrete and continuous.
In order to build intuition for the general concept of group, and to become
familiar with the groups we will consider later on in this book, we present
in this section an overview of the most important ones. It is not necessary to
understand and become familiar with all of these; it is sufficient for now to
look at a few examples to make the concept of group feel less abstract.
3.3.3.1 Finite Groups When studying a new concept it is helpful to look
at simple examples first. What is the simplest example of a group? Looking at
the axioms of a group, we notice that the identity axiom tells us that any group
has to have an identity element, so we cannot have an empty group with zero
elements. Can we have a one-element group, with just the identity element
e? We define T = {e}, composition by e ∘ e = e (the only possible choice) and
inverse by e^{-1} = e (idem). By definition we have an identity, and an inverse for
all elements (just e), so the identity and inverse axioms are satisfied. We check
that composition is associative: (ee)e = (e)e = e = e(e) = e(ee), so indeed T is a
group, called the trivial group. As a matrix group, we may represent e as the
1×1 identity matrix I_1 = [1], or a higher dimensional identity matrix, as these
satisfy II = I and I^{-1} = I just like the abstract group element e. When we have
an object whose symmetry group is trivial, it means that the object does not
have any non-trivial (non-identity) symmetries.
What about a group with two elements? The group should have an identity
e, and let's call the other one m. By the identity axiom, we should have ee = e,
em = m, and me = m, fixing 3 of the 4 compositions. For the last one, we have to
choose between mm = m or mm = e. However, the first option runs into trouble
because if we multiply both sides of the equation mm = m by m^{-1} (however
we choose m^{-1}) we find m = e, a contradiction. Thus we must have mm = e and
m^{-1} = m. We can realize this group as a matrix group with 1×1 matrices (i.e.
scalars) by choosing G = {1, –1}. This also reveals the geometric meaning of
this group: it consists of the mirror reflection m and identity e. Like the trivial
group, the two-element group is commutative.
Perhaps the most fundamental class of finite groups are the permutation
groups. Recall that a permutation of an ordered list of n elements is a way to
rearrange (“shuffle”) them, for example swapping two elements. We can rep-
resent a permutation as a bijective (i.e. invertible) function σ : n → n, where
n = {1, ..., n}. Alternatively, we can represent a permutation by a permutation
matrix P, which is a matrix with a single 1 in each row and column and all
other entries 0. When acting on a vector x ∈ R^n, this matrix will permute the
coordinates.
The set of all permutations of n elements forms a group called the symmet-
ric group S_n (not to be confused with the general term symmetry group). To
see that S_n is indeed a group, note that 1) the identity id_n : n → n is a per-
mutation, 2) the composite of two permutations is a permutation (closure),
and 3) a permutation has an inverse and this inverse is itself a permutation.
To specify a permutation σ, we need to choose σ(1) (n possibilities), σ(2)
(n – 1 possibilities), etc., up to σ(n) (only one choice left). Thus, the group S_n
has n · (n – 1) · (n – 2) ··· 2 · 1 = n! elements. We can equivalently define S_n as a
concrete group of mappings σ, as a matrix group, or as an abstract group.
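As a sketch of these three points (our own illustration), we can realize S_3 as a matrix group in NumPy by enumerating all 3! = 6 permutation matrices and checking identity, closure, and inverses:

import numpy as np
from itertools import permutations

def perm_matrix(sigma):
    # P with P[i, sigma[i]] = 1, so that (P x)_i = x_{sigma(i)}
    P = np.zeros((len(sigma), len(sigma)))
    for i, j in enumerate(sigma):
        P[i, j] = 1
    return P

S3 = [perm_matrix(s) for s in permutations(range(3))]      # all 6 elements

def contains(G, M):
    return any(np.array_equal(M, A) for A in G)

print(contains(S3, np.eye(3)))                             # identity is in the group
print(all(contains(S3, A @ B) for A in S3 for B in S3))    # closed under composition
print(all(contains(S3, A.T) for A in S3))                  # inverse (= transpose) is in the group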
For the interested reader, we mention a few more finite groups, with-
out going into detail. The rotational symmetries of regular polygons such as
triangles, squares, pentagons and hexagons (and so on) are captured by the
cyclic groups C_n, and their rotation and reflection symmetries by the dihe-
dral groups D_n. Finite groups of symmetries in three dimensions include the
crystallographic point groups and the groups of symmetries of Platonic solids.
According to Cayley’s theorem, any finite group with n elements can be
viewed as a subgroup of S_n. Furthermore, in a major milestone in mathemat-
ics, the finite simple groups (from which all finite groups can be made) have
been classified; all such groups belong to one of several infinite families, or to
a relatively short list of sporadic groups.
3.3.3.2 Infinite Discrete Groups Groups need not be finite. Consider for
instance the set of integers Z = {..., –2, –1, 0, 1, 2, ...}. With composition given
by addition and 0 as the identity, this set forms a group. Geometrically, we
can think of this group as a group of discrete translations in one dimension.
Similarly, we can think of Z^2 = Z × Z as a group of discrete translations in two
dimensions. In chapter 7 we will consider these groups as the symmetries of
regular grids (e.g. a grid of image pixels).
In addition to discrete translational symmetries, some regular grids / lat-
tices also have discrete rotational and reflection symmetry, such as four-fold
rotational symmetry of a square grid and six-fold rotational symmetry of a
hexagonal grid. We will consider such groups in chapter 8 when we study
group convolution. A related class of groups are the 17 wallpaper groups,
which describe the possible symmetries of wallpaper patterns (Schattschnei-
der 1978). Their 3D analogues are called (crystallographic) space groups, and
these have been classified into 230 distinct groups (Aroyo 2017).
3.3.3.3 The Classical Matrix Groups The classical groups are an
important class of groups that occur throughout mathematics, physics, and
computer science, several of which will play an important role in this
book. They were so named by Weyl (1939) in his famous book “The Classi-
cal Groups” on invariant polynomials, representation theory, and the Erlangen
programme. Most generally defined, the classical groups are groups of invert-
ible matrices with real, complex or quaternionic entries, possibly subject to
additional constraints (e.g., volume preservation det(A) = 1 or preservation of
certain kinds of generalized inner product). Here we will only consider certain
important examples of real and complex classical groups; many more examples
and details may be found e.g. in Goodman and Wallach (2009).
The quintessential matrix groups are the general linear groups GL(n, R)
and GL(n, C) of all n×n real or complex invertible matrices. These form
groups, because they include the identity matrix, inverses for each element,
and because the product of two invertible matrices is again invertible. They can
be thought of as the symmetry groups of the vector spaces R^n and C^n, being
the invertible transformations that preserve their vector space structure. The
classical groups (and indeed all matrix groups) are subgroups of the general
linear groups, which are the “largest” of the matrix groups. In this sense, the
general linear group plays the same role for matrix groups as the symmetric
group plays for finite groups.
From GL(n, R) we can obtain subgroups that preserve not just the linear
structure of R^n, but also lengths, angles, orientation, and volume. The orthog-
onal group O(n) contains only those matrices that preserve the standard inner
product x^T y. That is, we should have (Ax)^T(Ay) = x^T y for all x, y ∈ R^n, or
equivalently, A^T A = I:

    O(n) = {A ∈ GL(n, R) : A^T A = I}.                (3.11)

Thus, this group consists of rotations and reflections.
Geometrically, the inner product represents the angle between two vectors,
so orthogonal matrices preserve angles. The determinant of a matrix A tells us
Group         GL(n, R)       SL(n, R)        O(n)               SO(n)
Constraints   det(A) ≠ 0     det(A) = 1      A^T A = I          A^T A = I, det(A) = 1
Dimension     n^2            n^2 – 1         n(n–1)/2           n(n–1)/2
Mapping       (how a generic element maps the standard basis; see caption)
Invariants    –              Volume,         Volume,            Volume, Orientation,
                             Orientation     Distance, Angle    Distance, Angle

Table 3.1
Classical real matrix groups and their properties. Each group is defined as the set of
n×n real matrices satisfying a constraint (row 1). An equality constraint may elimi-
nate degrees of freedom, leaving only a certain number of dimensions (row 2). Note
that O(n) and SO(n) have the same dimensionality, because orthogonal matrices have
det(A) = ±1, so the additional det(A) = 1 constraint for SO(n) eliminates only a discrete
degree of freedom. In the mapping row we show for each group how a generic element
acts on the standard basis e_1 = (1, 0) (blue) and e_2 = (0, 1) (red) for n = 2 dimensions.
Finally, we list the geometric properties left invariant by each group. All matrix groups
leave invariant the vector space structure, meaning the origin 0 and all linear relations
z = αx + βy, and additionally the properties listed in the last row.
how much it expands volume, with negative values indicating a reversal of ori-
entation. Elements of the orthogonal group have det(A) = ±1, i.e., they preserve
volume but not necessarily orientation (reflections have determinant –1).
The special linear group SL(n, R) consists of volume-preserving linear
maps:

    SL(n, R) = {A ∈ GL(n, R) : det(A) = 1}.           (3.12)

It does not necessarily preserve angles. Similarly, the special orthogonal group
consists of volume- and inner-product-preserving linear maps:

    SO(n) = {A ∈ GL(n, R) : A^T A = I, det(A) = 1}.   (3.13)
In table 3.1 we show these groups and some of their properties.
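As a minimal sketch of these defining constraints (our own illustration), membership in O(n), SO(n) and SL(n, R) can be tested numerically by checking A^T A = I and det(A):

import numpy as np

def in_O(A):  return np.allclose(A.T @ A, np.eye(len(A)))
def in_SO(A): return in_O(A) and np.isclose(np.linalg.det(A), 1)
def in_SL(A): return np.isclose(np.linalg.det(A), 1)

theta = 0.4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # a rotation
F = np.diag([1.0, -1.0])                          # a reflection
S = np.array([[2.0, 0.0], [0.0, 0.5]])            # an area-preserving squeeze

print(in_O(R), in_SO(R))    # True True
print(in_O(F), in_SO(F))    # True False  (det = -1)
print(in_SL(S), in_O(S))    # True False  (preserves area, not angles)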
The complex analogs of SL(n, R), O(n) and SO(n) are called the complex
special linear group SL(n, C), unitary group U(n) and special unitary group
SU(n). While in our book we will focus primarily on real matrix groups, we
should mention here that the unitary groups play an important role in physics:
U(1) ×SU(2) is the “electroweak symmetry” associated with electromagnetic
and weak forces, whereas SU(3) appears in quantum chromodynamics that
describes strong interactions between quarks.
3.3.3.4 Motion Groups Further examples of groups that are particularly
important in applications are the Euclidean motion groups. Euclidean motions
can be defined as isometries (distance-preserving maps) of Euclidean space
(e.g. R
n
with the standard inner product). This includes translations, rota-
tions, and reflections, and indeed one can show that any Euclidean motion
is a sequence of these. These form the Euclidean group, denoted E(n). When
reflections are excluded, it is called the special Euclidean group SE(n).
It is not possible to perform a translation x ↦ x + t for x, t ∈ R^n by mul-
tiplying x by an n×n matrix, because A0 = 0 for any matrix A. For this
reason it is common to add a homogeneous coordinate, representing x by
x̄ = (x_1, ..., x_n, 1) ∈ R^{n+1}. A Euclidean transformation can then be represented
by a matrix of shape (n+1) × (n+1) of the following form:

    [ R  t ]
    [ 0  1 ],                                         (3.14)

where R ∈ O(n) is an n×n orthogonal matrix (rotation and reflection) and
t ∈ R^n is a translation vector. One may verify that multiplying such a matrix by
x̄ results in the required transformation x ↦ Rx + t. Furthermore, the com-
position operation of E(n) is given by matrix multiplication, and so E(n) can be
defined as a matrix group.
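To illustrate eq. (3.14) concretely (a sketch of our own; the angles and offsets are arbitrary), the following composes two planar Euclidean motions as 3×3 homogeneous matrices and applies the result to a point:

import numpy as np

def euclidean(theta, t):
    # homogeneous matrix for "rotate by theta, then translate by t" in 2D
    M = np.eye(3)
    M[:2, :2] = [[np.cos(theta), -np.sin(theta)],
                 [np.sin(theta),  np.cos(theta)]]
    M[:2, 2] = t
    return M

g1 = euclidean(np.pi / 2, [1.0, 0.0])
g2 = euclidean(np.pi / 4, [0.0, 2.0])

x = np.array([1.0, 1.0])
x_h = np.append(x, 1.0)                    # homogeneous coordinates

y_composed = ((g1 @ g2) @ x_h)[:2]         # compose in the group, then act
y_sequential = (g1 @ (g2 @ x_h))[:2]       # act with g2, then with g1
print(np.allclose(y_composed, y_sequential))   # True: matrix product = composition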
By introducing uniform scaling transformations (replacing R by sR in
eq. (3.14)), we obtain the group of similarity transformations Sim(n). The
group of affine transformations contains in addition to rotations, transla-
tions, and reflections, also scale and shear transformations. Using the same
homogeneous coordinate representation, we can conveniently encode affine
transformations using the same form as eq. (3.14), but taking R GL(n, R) to
be an arbitrary invertible matrix rather than an orthogonal one.
Projective transformations can also be encoded conveniently using homo-
geneous coordinates. An n-dimensional projective transformation may be
encoded as a non-singular (n+1) × (n+1) matrix, and composition is given
by matrix multiplication. In this representation, two matrices related by a
scalar multiplication A = λB are to be considered equal. To apply a projec-
tive transformation to a vector x, one uses matrix-vector multiplication on the
homogeneous coordinate vector x̄, followed by a division by the homogeneous
coordinate (which is no longer equal to 1 in general). Projective transforma-
tions are a mainstay of classical computer vision (Hartley and Zisserman 2004)
and graphics. Table 3.2 lists several motion groups and their properties.
Group        PGL(n+1)        Aff(n)                (S)Sim(n)                  (S)E(n)
Matrix       H               [A t; 0 1]            [sQ t; 0 1]                [Q t; 0 1]
Domain       H ∈ GL(n+1)     A ∈ GL(n), t ∈ R^n    Q ∈ (S)O(n), t ∈ R^n,      Q ∈ (S)O(n), t ∈ R^n
                                                   s ∈ R_+
Dimension    (n+1)^2 – 1     n^2 + n               n(n–1)/2 + n + 1           n(n–1)/2 + n
Mapping      (how a square grid is mapped by an element of each group; see caption)
Invariants   Colinearity,    Parallelism,          Volume Ratio, Distance     Volume, Distance,
             Cross-ratio     Volume Ratio          Ratio, Angle               Angle

Table 3.2
Motion groups and their properties. Motions in n dimensions are represented by
(n+1) × (n+1) dimensional matrices using homogeneous coordinates (see text). For each group
we show the form of the matrix, the domain of the sub-matrices, and the dimension of
the group. Note that for the projective group PGL(n+1) (corresponding to projective
transformations of n-dimensional space), any matrix H ∈ GL(n+1) represents a valid
projective transformation, but matrices related by a scalar multiple should be identified
and thus the dimension is reduced by 1. The mapping row shows how a square grid
could be mapped by an element in the respective groups, and the invariants row lists
several invariants of the group.
3.4 Group Actions And Representations
In geometry in general, and geometric deep learning in particular, one often
wishes to perform computations involving different kinds of geometric quanti-
ties, such as scalars, vectors, and tensors, as well as fields of such quantities.
The key thing that distinguishes and in fact characterizes these and other types
of geometric quantities is not their dimensionality, but rather the way in which
they transform under symmetry transformations. The idea that the type of a
geometric quantity is determined by the way in which the group of symmetries
acts on it was championed by Weyl (1939) and is widely used in physics.
As an example, consider the special orthogonal group SO(2) consisting of
planar rotations around the origin. It is intuitively obvious that we can apply
the same rotation g ∈ SO(2) to a point on the circle S^1, on the plane R^2, or to
a planar image (a function on the plane). We are thus dealing with one group
acting on several different spaces. Moreover, the group could act in different
ways on the same space, for instance rotating clockwise or counterclockwise.
In geometric deep learning, all feature spaces are associated with a group
representation, which tells us how to transform data living in that feature space.
This includes the input space, output space, and all intermediate feature spaces.
There are two equivalent ways to give precise meaning to the idea of hav-
ing a group G act on a set X. The first approach is to define a group action
as a map α : G × X → X satisfying certain criteria. By currying we can obtain
from α another map ρ : G → [X, X] (again satisfying certain criteria) called a
group representation, where [X, X] is the set of mappings from X to itself.
The reason for the name is that each abstract group element g is represented in
the space X by a concrete mapping ρ(g) : X → X. It should be noted that the
term “representation” has traditionally been used mostly for the important case
where ρ(g) is a linear map acting on a vector space X. We will use the term
representation for the general concept, and where necessary make a distinc-
tion between linear or Vect-representations and general or Set-representations.
Furthermore, since group actions and representations are equivalent (up to
currying), we will also permit ourselves to use these terms interchangeably.
Let us now look at the aforementioned “certain criteria” that a group action
must satisfy. Without further restrictions, a map of the type α : G × X → X
could do rather strange things. For example, the identity element e ∈ G might
not act as the identity on X, i.e., one could have α(e, x) ≠ x. Also, if we act
first with h and then with g, the result might be different from acting with gh
(composing g and h in G). Thus, we want a group action to respect the structure
of the group, which consists of a designated identity element and the rule of
composition:
Let G be a group and let X be a set. A group action of G on X is a
mapping α : G × X → X satisfying:

    α(e, x) = x                             (Identity)
    α(gh, x) = α(g, α(h, x)),               (Composition)

for all g, h ∈ G, x ∈ X, and where e ∈ G is the identity.
When there is little chance of confusion, group actions can also be denoted
as g ·x = α(g, x).
If we have a group action α, we can obtain a group representation ρ by
defining ρ(g) = α(g, –). We can also define a representation without reference
to an action, but again not all assignments will do; ρ must satisfy two criteria
equivalent to the ones for group actions:
Let G be a group, let X be a set, and denote by [X, X] the set of
mappings from X to X. A (Set-)representation of G on X is a map-
ping ρ : G → [X, X] that assigns to each g ∈ G a mapping ρ(g) : X → X
satisfying:

    ρ(e) = id_X                             (Identity)
    ρ(g) ∘ ρ(h) = ρ(gh),                    (Composition)

for all g, h ∈ G, and where e ∈ G is the identity.
The first axiom (identity) says that the identity e ∈ G should be represented
by the identity mapping on X, i.e. it should do nothing. The second axiom
(composition) ensures that ρ preserves all positive relations of the form gh = k
that hold in the group. This includes in particular relations such as gg^{-1} = e =
g^{-1}g that express that g^{-1} is the inverse of g, which implies that ρ(g) must be
an invertible map for any g ∈ G.
Negative relations, of the form gh ≠ k, may not be preserved, however. For
instance, for any group we can define the trivial representation, which maps
all elements g to the identity map on X. One may check that this is indeed a
valid group action. An action is called faithful when ρ is injective (meaning
that it is difference-preserving: g ≠ h ⟹ ρ(g) ≠ ρ(h)). In this case, all negative
relations are also preserved.
In many cases, G and X have additional structure, such as topological or
smooth (Lie group) structure, in which case one has to adapt the definition
in order to make the math work nicely. For instance, by requiring ρ to be a
continuous or smooth map, and taking [X, X] to be the space of continuous /
smooth maps from X to itself.
In GDL, data usually live in a vector space, and most of the representations
we encounter in practice are linear, meaning that the maps ρ(g) : X → X are
all linear transformations of a vector space X. As we will see in later chapters,
this is true even though the neural networks we work with are highly non-
linear. Such linear representations in vector spaces are defined analogously to
ordinary representations in sets:
Let G be a group, let X be a vector space, and denote by [X, X] the
set of linear mappings from X to X. A linear representation (or Vect-
representation) of G on X is a mapping ρ : G → [X, X] that assigns to
each g ∈ G a linear mapping ρ(g) : X → X satisfying:

    ρ(e) = id_X                             (Identity)
    ρ(g) ∘ ρ(h) = ρ(gh),                    (Composition)

for all g, h ∈ G, and where e ∈ G is the identity.
We note that one can equivalently define a (linear) group representation as a
group homomorphism (section 3.3.2) from G to the group of invertible (linear)
transformations of X.
In practice we work in a basis of X relative to which the linear maps
ρ(g) : X → X can be represented by matrices. Hence, we can also define
ρ as a matrix-valued function. Then, composition corresponds to matrix
multiplication, and the application ρ(g)(x) is matrix-vector multiplication.
Linear representations are useful because they turn abstract group theory
into linear algebra, which is well understood and for which we have highly
efficient specialized hardware. Even when working with a group of matrices,
it is useful to consider linear representations, and one should be careful to dis-
tinguish the concepts of matrix groups and linear representations. For instance,
one could consider the group SO(2) as a matrix group consisting of 2 ×2 rota-
tion matrices, and have it act via n ×n matrices ρ(g) on the space of signals
on the circle, sampled at n points. Similarly, we can consider the group of
d ×d permutation matrices, acting on the space of d ×d adjacency matrices
by d^2 × d^2 matrices ρ(g). We will cover these examples in more detail later.
We have formally defined a (linear) representation as a map ρ : G → X^X,
but it is common to refer to the representation space X as the representation
when it is clear which ρ is associated with X. Nevertheless it is important to
note that according to Weyl’s principle, the type of geometrical quantity is
determined by the representation, not by the space alone. Consider for example
the group SO(2) of 2×2 rotation matrices. An SO(2)-vector is a quantity x ∈
R^2 that transforms according to the standard (vector) representation ρ_1(g) = g.
On the other hand, a quantity in the same space R^2 is called a pair of SO(2)-
scalars if it transforms according to the trivial representation ρ_0(g) = I_2 (where
I_2 is the 2×2 identity matrix). Thus the representation determines the type
of geometric quantity, and as we will see a mapping between two types of
geometric quantities / representations is called an equivariant map.
3.4.1 Examples of Representations
It is high time to consider some examples of group representations and group
actions. Most groups considered in this book are matrix groups, and for these
groups there is a natural choice of representation often called the standard
representation, where each group element (matrix) A ∈ G is represented by
itself: ρ(A) = A. This definition trivially satisfies the identity and composition
requirements to qualify as a representation.
As an example, consider the matrix group SO(2) = {R ∈ R^{2×2} | RR^T =
I, det(R) = 1} of 2×2 rotation matrices. Clearly, each rotation matrix R ∈
SO(2) defines a linear mapping R^2 → R^2 given by matrix-vector multiplica-
tion: x ↦ Rx; this is the standard representation ρ(R) = R of SO(2). Note that
we can equivalently define SO(2) as the set of rotation angles [0, 2π) with com-
position given by addition modulo 2π. In that case we cannot define ρ(θ) = θ
because θ ∈ [0, 2π) is an angle and not a matrix, but we can still define an
equivalent representation:

    ρ(θ) = [ cos(θ)  –sin(θ) ]
           [ sin(θ)   cos(θ) ]                        (3.15)

Using the rules of trigonometry one can show that matrix multiplication then
mimics angle addition: ρ(θ)ρ(θ′) = ρ(θ + θ′), and that ρ(0) = I.
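A quick numerical sanity check (our own sketch) of the representation properties of eq. (3.15):

import numpy as np

def rho(theta):
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

a, b = 0.9, 2.3
print(np.allclose(rho(a) @ rho(b), rho(a + b)))   # composition: rho(a)rho(b) = rho(a+b)
print(np.allclose(rho(0.0), np.eye(2)))           # identity:    rho(0) = I
print(np.allclose(rho(a) @ rho(-a), np.eye(2)))   # inverses:    rho(a)^{-1} = rho(-a)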
Similarly, the standard representation of the permutation group S_n maps each
abstract permutation to a matrix that permutes the coefficients of the vector on
which it acts. For instance, if π_12 is the permutation that swaps the first two
elements of a length-3 sequence, the corresponding permutation matrix is:

    ρ(π_12) = [ 0 1 0 ]
              [ 1 0 0 ]
              [ 0 0 1 ]                               (3.16)

This representation is useful in GDL applications involving sets and graphs
(chapter 5), where each of the n nodes/elements is equipped with a feature
x_i ∈ R that we wish to permute. Indeed we find that (ρ(π)x)_i = x_{π(i)}. If instead
of one scalar per element we have a d-dimensional vector x_i ∈ R^d we can stack
them into a matrix X ∈ R^{n×d} and act on it with the same matrix: X ↦ ρ(π)X.
Equivalently, we can stack them into one column vector x ∈ R^{nd}, in which case
we use an equivalent representation matrix with d identical blocks:

    ρ_d(π) = [ ρ(π)                ]
             [       ρ(π)          ]
             [             ⋱       ]
             [                ρ(π) ]                  (3.17)
The permutation group S_n can also act on matrices of size n×n, by simul-
taneously permuting the rows and columns. If ρ(π) is the standard matrix
representation of S_n defined above, we get a representation on matrices via the
tensor representation (section 3.6.6): (ρ ⊗ ρ)(π)A = ρ(π)Aρ(π)^T. This repre-
sentation is useful for instance for describing the transformation behaviour of
the adjacency matrix or edge-feature matrix of a graph.
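The following sketch (our own illustration) applies the same permutation to per-node features X ∈ R^{n×d} via ρ(π)X and to an adjacency matrix via ρ(π)Aρ(π)^T, and checks that the edge structure is preserved under the relabelling:

import numpy as np

rng = np.random.default_rng(3)
n, d = 4, 2
X = rng.normal(size=(n, d))                            # one d-dimensional feature per node
A = np.triu(rng.random((n, n)) < 0.5, 1).astype(float) # random upper-triangular edges
A = A + A.T                                            # symmetric adjacency, no self-loops

perm = rng.permutation(n)
P = np.eye(n)[perm]                                    # standard permutation representation

X_p = P @ X                                            # permuted node features
A_p = P @ A @ P.T                                      # tensor representation acting on A

inv = np.argsort(perm)                                 # new position of each old node
i, j = 0, 2
print(A[i, j] == A_p[inv[i], inv[j]])                  # True: same edge, relabelled endpoints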
Another representation of the permutation group is the sign (or parity)
representation. Any permutation π can be decomposed into a sequence of
transpositions (swaps of a pair of elements), and although this decomposition
is not unique, the parity of the number of transpositions N(π) is always the same.
The sign representation is given by ρ(π) = (–1)^{N(π)}, i.e. it is 1 for an even
number of transpositions, and –1 for an odd number of transpositions.
For any group G we can define the trivial representation ρ(g) = 1, which
maps every element to the 1 ×1 identity matrix. One may verify that this
is indeed a representation. In geometric deep learning, we use the trivial
representation for features that should be invariant.
The matrix determinant has the property that det I = 1 and det AB =
det A det B for any square matrices A and B of the same size. Hence, it defines
a representation of the general linear group and any subgroup thereof (and
hence, of any matrix group). For example, for an element of the orthogo-
nal group A ∈ O(n), the determinant tells us whether A flips the handedness
(det A = –1) or leaves it unchanged (det A = 1). Flipping handedness twice is
the same as not changing it, and indeed (–1) ·(–1) = 1.
The examples above are all linear representations, so let us consider a Set-
representation (group action). For any group G, the composition ∘ : G × G → G
defines an action of G on the set X = G, i.e. on itself. Written as a represen-
tation, this sends g ∈ G to the mapping ρ(g) : G → G defined by ρ(g) = g ∘ –, i.e.
ρ(g)(h) = g ∘ h.
A final example that will play an important role in the next chapter and the
rest of the book is the action of a group on a space of functions. If we have
a group G acting on any space Ω, we can naturally define an action of G on
the space of real-valued functions f : Ω → R. Denoting the action of G on Ω
by g · u, this is defined as: (ρ(g)f)(u) = f(g^{-1} · u). For instance, if G is the group
of rotations and translations of the plane Ω = R^2, then ρ is the way in which G
acts on the space of grayscale images on R^2.
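As a sketch of this construction (our own; a periodic 1D signal stands in for an image), acting on the argument with g^{-1} is what makes ρ a representation, i.e. ρ(g)ρ(h) = ρ(gh):

import numpy as np

n = 8
f = np.arange(n, dtype=float)                  # a signal on the cyclic group Z/nZ

def act(shift, signal):
    # (rho(g) f)(u) = f(u - shift): translate the signal by `shift`
    u = np.arange(len(signal))
    return signal[(u - shift) % len(signal)]

g, h = 3, 2
lhs = act(g, act(h, f))                        # rho(g) rho(h) f
rhs = act(g + h, f)                            # rho(gh) f, since gh is translation by g + h
print(np.allclose(lhs, rhs))                   # True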
3.5 Equivariant Maps & Equivariant Nets
Having defined groups and representations, we are finally ready to give a
proper mathematical definition of equivariance:
Let f : X → Y be a mapping of sets (or vector spaces), G a group,
and let ρ_1 : G → [X, X] and ρ_2 : G → [Y, Y] be two (linear) represen-
tations of G acting on the domain X and codomain Y of f. Then f is
G-equivariant if for all g ∈ G:

    f ∘ ρ_1(g) = ρ_2(g) ∘ f.                          (3.18)

An equivariant (linear) map can also be called a mapping or morphism
of representations, written f : (X, ρ_1) → (Y, ρ_2).
Unlike our preliminary discussion of equivariance in section 3.2.2, this def-
inition makes it clear that the equivariance constraint should be satisfied for
every element in the group, and that the ρ_i are not arbitrary maps but should satisfy
the requirements for being a representation (section 3.4).
If we have two equivariant maps, then their composite is also equivariant:

    f_2 ∘ f_1 ∘ ρ_1(g) = f_2 ∘ ρ_2(g) ∘ f_1 = ρ_3(g) ∘ f_2 ∘ f_1.    (3.19)

As always in mathematics and programming, we can only compose functions if
the codomain (output space) of the first map matches the domain (input space)
of the second. For example, we can compose f_1 : X_1 → X_2 and f_2 : X_2 → X_3 to
obtain f_2 ∘ f_1 : X_1 → X_3 (read: f_2 after f_1), but we cannot form f_1 ∘ f_2 because f_1
expects input of type X_1 but f_2 returns output of type X_3 instead.
It is very important to note that when composing equivariant maps, we must
ensure that not only the domain/codomain match as sets or vector spaces, but
also as representations, or else the composite will not be equivariant. In other
words, we should view an equivariant map f_1 not as a mapping of sets but of
representations. So in the example above, f_1 maps from representation (X_1, ρ_1)
to (X_2, ρ_2), and f_2 maps from (X_2, ρ_2) to (X_3, ρ_3). Since the output representa-
tion of f_1 matches the input representation of f_2, we can compose them and the
composite f_2 ∘ f_1 will be an equivariant map from (X_1, ρ_1) to (X_3, ρ_3).
Following this rule, we can create arbitrary computation graphs from
equivariant maps, i.e. create equivariant networks:
Definition 1 (Equivariant Network) Let G be a group. A G-equivariant net-
work is a computation graph (DAG), where every node is associated with a
vector space X_i and representation ρ_i of G in X_i, and all mappings are equiv-
ariant. That is, every node is computed as a function of its parents in the graph
by a (possibly parameterized) mapping that is equivariant with respect to the
representations acting on the input and output space of the map.
Thus, when designing an equivariant network, we not only have to choose
the feature spaces X_i (e.g. how many dimensions each has), but also how we want
them to transform (except for the input and output space, where the represen-
tation is usually determined by the problem). Every mapping in the network
should be equivariant, including non-linearities, normalization layers, learn-
able linear layers, etc., because even a single non-equivariant layer breaks the
equivariance of the network as a whole.
One can create an invariant network simply by making the last layer invari-
ant, i.e. choosing the representation of the output space to be the trivial
one. For example in a classical convolutional network, one could choose
a global average pooling layer (averaging over space), which produces a
translation-invariant output.
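A minimal sketch of this recipe (our own illustration): a translation-equivariant layer (here a circular correlation with a filter w) followed by global average pooling gives a translation-invariant output:

import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=16)                        # a periodic 1D signal
w = rng.normal(size=5)                         # a "learnable" filter

def equivariant_layer(signal):
    # circular correlation: commutes with circular shifts of the signal
    return np.array([np.dot(w, np.roll(signal, -i)[:len(w)])
                     for i in range(len(signal))])

def invariant_net(signal):
    return equivariant_layer(signal).mean()    # trivial (invariant) output representation

print(np.isclose(invariant_net(x), invariant_net(np.roll(x, 5))))   # True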
Mappings in a general computation graph can have multiple inputs and pro-
duce multiple outputs. In this case a simultaneous transformation g applied to
all inputs using their representation should lead to a corresponding transfor-
mation of the outputs. Such multi-variable mappings can also be considered as
mappings with a single input and output that is the concatenation of in/output
vector spaces. In an equivariant network each of those vector spaces has a rep-
resentation associated with it, and their concatenation will transform according
to a block-diagonal representation obtained by stacking representations along
the diagonal (see section 3.6.1).
The feature spaces of equivariant networks usually consist of a number n_λ of copies (called the multiplicity) of a few (typically low-dimensional) representations ρ_λ acting on R^{d_λ}. As discussed in section 3.4, each representation ρ_λ corresponds to a type of geometric feature (e.g. scalar, vector, tensor). Thus, a geometric feature space consists of a number of copies of geometric features of various types. This means that the channels / dimensions of a geometric feature vector are grouped at two levels: each dimension belongs to a particular geometric feature vector of dimension d_λ (e.g. a 3D vector), and all n_λ geometric feature vectors of this type belong to a single isotypic subspace of dimension n_λ d_λ. For example, we could have two geometric feature vectors v_{1,1}, v_{1,2} of type ρ_1 (forming an isotypic subspace of dimension 2d_1), and one feature vector v_{2,1} of type ρ_2:
$$\begin{pmatrix} v_{1,1} \\ v_{1,2} \\ v_{2,1} \end{pmatrix} \mapsto \begin{pmatrix} \rho_1(g) & & \\ & \rho_1(g) & \\ & & \rho_2(g) \end{pmatrix} \begin{pmatrix} v_{1,1} \\ v_{1,2} \\ v_{2,1} \end{pmatrix} \qquad (3.20)$$
In the next chapter we will consider feature spaces consisting of signals on a domain, in which case the geometric quantities are not just stacked "vertically" as we have done here, but also "horizontally". For example, a point
cloud (chapter 6) consists of 3D point features (the standard representation of
SE(3)) stacked horizontally, and we have a permutation group acting along the
horizontal axis by permuting points. A planar vector field (chapter 8) consists
of 2D vectors transforming according to the standard representation of SO(2),
and additionally the group SE(2) acts by moving vectors to a new position on
the plane as well as rotating them.
In addition to geometric features, a learnable mapping takes parameters as
input. Parameters should in general be considered as invariant quantities (trans-
forming according to the trivial representation), and so don’t require special
treatment and don’t impose any constraints on the layer.
3.6 The Equivariance Cookbook
We have seen how equivariant networks can be constructed from equivariant
building blocks, but have not discussed the building blocks themselves. Here
we consider a number of common neural network layers and their equivariant
counterparts.
3.6.1 Concatenation, Direct Sums, and Geometric Feature Channels
A common operation in neural networks is to concatenate a number of vectors
x_i ∈ R^{d_i}, usually denoted x = concat(x_1, …, x_n). When the vector spaces R^{d_i} are associated with representations ρ_i of the same group G, the concatenated vector x transforms according to a block-diagonal representation:

$$\rho(g) = \operatorname{block\text{-}diag}(\rho_1(g), \ldots, \rho_n(g)) = \begin{pmatrix} \rho_1(g) & & \\ & \ddots & \\ & & \rho_n(g) \end{pmatrix} \qquad (3.21)$$
In other words, concat(ρ_1(g)x_1, …, ρ_n(g)x_n) = ρ(g) concat(x_1, …, x_n). For each g ∈ G, the matrix ρ(g) has blocks of size d_i × d_i on the diagonal, and zeros elsewhere.
In more abstract and basis-free language, concatenating vectors corresponds to the direct sum of vector spaces, and block-diagonal stacking of representations corresponds to a direct sum of representations. We will not go into this here, but note that these operations are denoted by ⊕_i R^{d_i} ≅ R^{Σ_i d_i} and ρ = ⊕_i ρ_i, respectively.
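As a small numerical illustration (a sketch assuming a 2D rotation representation stacked with a 1D trivial one; the variable names are ours), the block-diagonal matrix acts on the concatenated vector exactly as the individual representations act on the parts:

```python
import numpy as np

# Two toy representations of the same group element g: the 2D standard
# (rotation) representation of SO(2) and a 1D trivial (invariant) one.
theta = 0.7
rho1 = np.array([[np.cos(theta), -np.sin(theta)],
                 [np.sin(theta),  np.cos(theta)]])   # rho_1(g), acts on R^2
rho2 = np.array([[1.0]])                             # rho_2(g), trivial, acts on R^1

x1, x2 = np.array([1.0, 2.0]), np.array([3.0])

# Block-diagonal (direct sum) representation rho(g) = block-diag(rho_1(g), rho_2(g))
rho = np.zeros((3, 3))
rho[:2, :2] = rho1
rho[2:, 2:] = rho2

x = np.concatenate([x1, x2])
# concat(rho_1(g) x_1, rho_2(g) x_2) == rho(g) concat(x_1, x_2)
assert np.allclose(np.concatenate([rho1 @ x1, rho2 @ x2]), rho @ x)
```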
One example where direct sums appear in applications is when we deal with point cloud data. Here our data consists of a number of points x_i ∈ R^3, which we can stack into one big vector x = concat(x_1, …, x_n). Each point is associated with a representation of the rotation group G = SO(3), namely the standard representation (section 3.4.1) ρ_i(R)x_i = Rx_i (where R is a rotation matrix). Thus, the whole point cloud transforms as ρ(R) = block-diag(ρ_1(R), …, ρ_n(R)), i.e. a
matrix with n copies of R along the diagonal. If we want to add some rotation-
invariant features, such as a timestamp, we can do so by adding a number of
copies of the trivial representation ρ_0(g) = 1 along the diagonal.
Instead of stacking the x_i into a flat vector, we could also stack them into a matrix X ∈ R^{3×n}, in which case the action is given by X ↦ RX. For consistency, however, it is mathematically more convenient to always work with the matrix–vector form, even if in our implementation we may choose an array of a different shape for our data.
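The two forms can be checked numerically; the following sketch (plain NumPy, with variable names chosen here) builds block-diag(R, …, R) as the Kronecker product I_n ⊗ R and compares it with the matrix form X ↦ RX:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
X = rng.normal(size=(3, n))                 # point cloud, one point per column

a = 0.5                                     # rotation about the z-axis, an element of SO(3)
R = np.array([[np.cos(a), -np.sin(a), 0.0],
              [np.sin(a),  np.cos(a), 0.0],
              [0.0,        0.0,       1.0]])

# Matrix form of the action: X -> R X
Y_matrix = R @ X

# Flattened form: concatenate the points and act with block-diag(R, ..., R),
# which equals the Kronecker product I_n (x) R.
x = X.T.reshape(-1)                         # concat(x_1, ..., x_n), each x_i in R^3
rho = np.kron(np.eye(n), R)                 # rho(R) with n copies of R on the diagonal
y = rho @ x

assert np.allclose(y.reshape(n, 3).T, Y_matrix)
```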
Concatenation can be performed explicitly as a computation step in the network, but can also be used to describe mathematically feature spaces consisting of multiple geometric quantities. In conventional neural networks, feature spaces can be described as a concatenation of scalar features or, in the case of convolutional networks, of channels. In equivariant neural networks, one typically considers feature spaces consisting of multiple kinds of geometric features (representations) of various types, each occurring with some multiplicity. For instance, an SO(3) rotation-equivariant network might use a certain number of scalar features, 3D vectors, and 3 × 3 tensors as features. If the i-th feature type has dimension d_i and we have c_i copies of it, the total dimension of the representation space would be d = Σ_i c_i d_i.
3.6.2 Vector Addition, Scalar Multiplication & Linear Combinations
Another common operation is vector addition: z = x + y ∈ R^d. This operation
is equivariant, provided that x and y not only have the same dimension d but
also transform according to the same representation ρ. In that case we have:
ρ(g)x + ρ(g)y = ρ(g)(x + y), (3.22)
that is, vector addition is equivariant.
Similarly, multiplying by an invariant scalar (i.e. either a constant or an
input-dependent but group invariant quantity) is also equivariant. The reason
is that scalar multiplication commutes with matrix multiplication: ρ(g)αx =
αρ(g)x. Combining vector addition and scalar multiplication, we can take
linear combinations of vectors, and this is equivariant as well when the rep-
resentations are the same and scalars are invariant: z = Σ_i α_i x_i, for scalars α_i ∈ R.
3.6.3 Change of Basis
If we have a d-dimensional matrix representation ρ and an arbitrary non-singular (invertible) matrix A ∈ R^{d×d}, we can create a new representation by changing the basis:

$$\rho'(g) = A\rho(g)A^{-1}.$$
We verify that ρ′ is indeed a representation:

$$\rho'(g)\rho'(h) = A\rho(g)A^{-1}A\rho(h)A^{-1} = A\rho(g)\rho(h)A^{-1} = A\rho(gh)A^{-1} = \rho'(gh).$$

Two representations ρ and ρ′ related in this way are called equivalent or isomorphic representations.
When ρ and ρ′ are equivalent by a change of basis A, then A is an equivariant linear map between them:

$$A\rho(g) = A\rho(g)A^{-1}A = \rho'(g)A. \qquad (3.23)$$
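A quick numerical check of both facts, using the 2D rotation representation and a random change of basis (a sketch; the names rho and rho_prime are chosen here):

```python
import numpy as np

rng = np.random.default_rng(2)

def rho(theta):
    # 2D standard representation of a planar rotation by angle theta
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

A = rng.normal(size=(2, 2))           # generic (almost surely invertible) change of basis
A_inv = np.linalg.inv(A)

def rho_prime(theta):
    return A @ rho(theta) @ A_inv     # rho'(g) = A rho(g) A^{-1}

g, h = 0.3, 1.1
# Homomorphism property: rho'(g) rho'(h) = rho'(gh); for rotations gh is angle addition.
assert np.allclose(rho_prime(g) @ rho_prime(h), rho_prime(g + h))
# A is an equivariant map from (R^2, rho) to (R^2, rho'): A rho(g) = rho'(g) A
assert np.allclose(A @ rho(g), rho_prime(g) @ A)
```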
3.6.4 Permutation Representations and Pointwise Nonlinearities
A typical neural network consists mostly of learnable linear maps and fixed
elementwise nonlinearities. In an equivariant network, all maps must be equiv-
ariant, including nonlinearities. Whereas the equivariance constraint applied
to learnable linear layers simply leads to a reduction in the number of parame-
ters, nonlinearities generally don’t have any parameters and so we must choose
them to be equivariant. Equivariance is expressed relative to group represen-
tations acting on the input and output space of the nonlinear map, and so
the representations and nonlinearity must be chosen together. In this section
and several that follow, we consider various kinds of nonlinearities and the
representations to which they are equivariant.
Consider a discrete group G, a feature space X = R^d, and a representation ρ of G acting on it. If ρ is a permutation representation, meaning that ρ(g) is a permutation matrix for all g ∈ G, then any pointwise nonlinearity is equivariant with respect to ρ acting on the input and output space^22. This is because a
pointwise nonlinearity acts the same way on each coordinate, and a permuta-
tion simply shuffles the coordinates, so it does not matter if we shuffle before
or after applying the nonlinearity.
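For instance, the following NumPy snippet (with hypothetical variable names) verifies that ReLU commutes with a random permutation matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 6
perm = rng.permutation(d)
P = np.eye(d)[perm]                  # permutation matrix: (P x)[i] = x[perm[i]]
x = rng.normal(size=d)

relu = lambda v: np.maximum(v, 0.0)

# Pointwise nonlinearities commute with permutation representations:
assert np.allclose(relu(P @ x), P @ relu(x))
```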
In graph and set neural networks (chapter 5), we have a permutation group
acting by permutation matrices, and so we can freely use arbitrary point-
wise nonlinearities. Similarly, discrete translations of grids (chapter 7) such
as images and audio signals are an example of permutation representations,
because a discrete shift simply moves each pixel to another position (i.e.
dimension) without changing the numerical values. Thus, pointwise nonlinear-
ities in convolutional networks do not break discrete translation equivariance.
More generally, we get permutation representations when working with
scalar signals on a space having a group action (chapter 4) such as a homo-
geneous space (chapter 8). However, if the domain is continuous then we
are forced to discretize and as a result aliasing can occur, which will break
equivariance to continuous transformations. Since nonlinearities can introduce high-frequency structure into the signal, they are likely to lead to aliasing. In order to preserve equivariance when applying pointwise nonlinearities, it is thus necessary to take measures to avoid aliasing (Cesa, Lang, and Weiler 2022).
3.6.5 Gating and Attention
When the representation matrices ρ(g) are not permutation matrices, pointwise
nonlinearities will in general not be equivariant. For instance, consider the stan-
dard representation of a rotation θ ∈ [0, 2π) as a 2 × 2 matrix (eq. (3.15)) acting on R^2. A π/2 rotation maps (x, y) ↦ (–y, x), for example, and this operation will
not commute with ReLU because of the sign flip. Similarly, a field of such
quantities (a vector field; see chapter 8) will require special nonlinearities.
In cases like these, gating nonlinearities can be used. In conventional neural
networks, gating refers to multiplying a scalar or vector feature by a scalar
value called the gate, which is typically squashed into the (0, 1) or (–1, 1) range
by a sigmoid or tanh nonlinearity:
$$\operatorname{gate}(x, g) = \operatorname{sigmoid}(g) \odot x. \qquad (3.24)$$

Here x, g ∈ R^d are feature vectors, and ⊙ indicates elementwise multiplication. Sometimes, a single scalar gate g ∈ R is used to scale an entire feature vector x ∈ R^d or a sub-vector of it. Gating is used in LSTM and GRU blocks
(Hochreiter and Schmidhuber 1997; Chung et al. 2014), squeeze-and-excite
blocks (Hu et al. 2020), and many other architectures.
In equivariant networks, gating can be used on equivariant features, provided
that the gate is invariant and we use one scalar gate to multiply a whole geo-
metric feature vector (not one gate per component of the vector). As noted in
section 3.6.2, multiplying by an invariant scalar is equivariant. Invariants could
be produced by an invariant linear map (section 3.6.8) which is a special case
of an equivariant map. Alternatively, if the representation matrices ρ(g) are
orthogonal (as is often the case) one could compute the norm of geometric fea-
ture vectors, or more generally an invariant polynomial (Hilbert 1890; Derksen
and Kemper 2002; Villar et al. 2021). Then one can apply a nonlinearity such
as sigmoid or tanh (clearly, the output will still be invariant). Finally, one can
multiply a geometric feature vector x_i (transforming according to a representation ρ_i) by a scalar gate. Since scalar and matrix multiplication commute, this is always equivariant.
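A minimal sketch of such a norm-based gate, assuming an orthogonal representation (the function `norm_gate` and the constant offset inside the sigmoid are choices made here, not from the text):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def norm_gate(v):
    # Gate a geometric feature vector v by an invariant scalar computed from its norm.
    # If v transforms by an orthogonal representation, ||v|| is invariant, so the gate
    # is invariant and the output transforms exactly like v (equivariance).
    return sigmoid(np.linalg.norm(v) - 1.0) * v

theta = 0.9
rho = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])   # orthogonal representation on R^2

v = np.array([0.5, -1.5])
assert np.allclose(norm_gate(rho @ v), rho @ norm_gate(v))
```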
Attention is an operation that, like gating, involves multiplicative interactions between data-dependent quantities, but additionally involves an averaging step. Let X ∈ R^{N×D} be a matrix of N feature vectors of dimension D, and let A ∈ R^{M×N} be a matrix of logits or unnormalized attention weights. Then attention is computed as:

$$\operatorname{attention}(X, A)_{md} = \sum_{n=1}^{N} \operatorname{softmax}(A)_{mn} X_{nd}, \qquad (3.25)$$

where the softmax is applied along the N axis, thus producing a normalized distribution over the N feature vectors X_{n,:}.
There are two kinds of equivariance we may wish the attention operation to have. In graph neural networks and (non-autoregressive) transformers (chapter 5), one wishes the attention to be equivariant to permutations of the N feature vectors and attention weights; this is automatically satisfied. In applications involving geometry (e.g. geometric graphs, chapter 6), one may wish attention to be equivariant to a group G (e.g. rotations) acting on each feature vector X_{n,:} identically. Since attention computes a weighted average (section 3.6.2) of vectors with the same representation, it is equivariant to such transformations provided that the logits A are invariant. In the case of self-attention, where A is computed by inner products, this is guaranteed whenever the queries and keys have the same orthogonal group representation.
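The following sketch illustrates the geometric case, taking the queries and keys to simply be the features themselves so that the logits form an invariant Gram matrix (`self_attention` is a hypothetical helper, not an implementation from the text):

```python
import numpy as np

def softmax(a, axis):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
    # X: (N, D) geometric feature vectors. The logits are pairwise inner products
    # (a Gram matrix), so they are invariant when every row is rotated by the same
    # orthogonal matrix; the output is a weighted average of the rows and therefore
    # transforms like the rows themselves (equivariance).
    A = X @ X.T                       # (N, N) logits
    return softmax(A, axis=1) @ X     # (N, D) output

rng = np.random.default_rng(4)
X = rng.normal(size=(5, 3))

theta = 0.4
rho = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                [np.sin(theta),  np.cos(theta), 0.0],
                [0.0,            0.0,           1.0]])   # orthogonal rep acting on each row

# Rotating every input vector rotates every output vector in the same way.
assert np.allclose(self_attention(X @ rho.T), self_attention(X) @ rho.T)
```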
3.6.6 Tensor Product Nonlinearities
Gating nonlinearities, in which we multiply a scalar and some other geometric quantity, are a special case of a more general operation called the tensor product of representations. Given vectors x_i ∈ R^{d_i}, their tensor product (or outer product) T = ⊗_i x_i lives in R^{d_1 × ··· × d_n}. The element at index (j_1, …, j_n) is the product of the corresponding elements of the vectors: T_{j_1,…,j_n} = ∏_i x_{i,j_i}.
For example, the tensor product of two column vectors is a matrix given by x ⊗ y = xy^⊤, i.e. (x ⊗ y)_{j_1 j_2} = x_{j_1} y_{j_2}. We can linearly combine tensors, so the tensors of a given shape form a vector space^23.
When the vectors x_i have a group representation associated with them, the space of tensors also naturally becomes a representation. Applying g ∈ G to x, y using their representations ρ_1, ρ_2, we find:

$$(\rho_1(g)x) \otimes (\rho_2(g)y) = (\rho_1(g)x)(\rho_2(g)y)^\top = \rho_1(g)(xy^\top)\rho_2(g)^\top. \qquad (3.26)$$
Thus, T = x ⊗ y ∈ R^{d_1×d_2} is mapped to ρ_1(g)Tρ_2(g)^⊤, and indeed this defines a group representation on R^{d_1×d_2}. We could write this as ρ(g)T = ρ_1(g)Tρ_2(g)^⊤, but it should be noted that ρ(g) is a linear map from the space of tensors R^{d_1×d_2} to itself, i.e. as a matrix it has shape d_1d_2 × d_1d_2.
Mathematically it is often convenient to gloss over the particular shape (since reshaping doesn't really do anything), but when implementing algorithms the shape does matter. If we flatten t = vec(T) ∈ R^{d_1 d_2}, we can take ρ(g) to be a matrix of size d_1d_2 × d_1d_2. The venerable Matrix Cookbook (Petersen and Pedersen 2012, equation 520) tells us that:

$$\operatorname{vec}\!\big(\rho_1(g)\, T\, \rho_2(g)^\top\big) = \big(\rho_2(g) \otimes \rho_1(g)\big) \operatorname{vec}(T), \qquad (3.27)$$
where ρ_2(g) ⊗ ρ_1(g) denotes the tensor / Kronecker product of matrices:

$$A \otimes B = \begin{pmatrix} a_{11}B & \cdots & a_{1n}B \\ \vdots & \ddots & \vdots \\ a_{m1}B & \cdots & a_{mn}B \end{pmatrix} \qquad (3.28)$$

for matrices A ∈ R^{m×n} and B ∈ R^{p×q}.
In summary, we can write the action of G on the space of matrices either as T ↦ ρ_1(g)Tρ_2(g)^⊤, or in flattened form as t ↦ (ρ_2(g) ⊗ ρ_1(g)) t. This is called the tensor product of the representations ρ_1 and ρ_2, and can be extended to arbitrary products ⊗_i ρ_i.
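The identity in eq. (3.27) is easy to verify numerically; note that it assumes the column-stacking (column-major) convention for vec. A small NumPy sketch with arbitrary stand-in matrices:

```python
import numpy as np

rng = np.random.default_rng(5)
d1, d2 = 3, 4
rho1 = rng.normal(size=(d1, d1))     # stand-ins for representation matrices rho_1(g)
rho2 = rng.normal(size=(d2, d2))     # and rho_2(g); the identity holds for any matrices
T = rng.normal(size=(d1, d2))

# Matrix form of the tensor-product action: T -> rho_1(g) T rho_2(g)^T
lhs = rho1 @ T @ rho2.T

# Flattened form (eq. 3.27): vec(rho_1 T rho_2^T) = (rho_2 (x) rho_1) vec(T),
# where vec stacks the *columns* of T (column-major / Fortran order).
vec = lambda M: M.reshape(-1, order="F")
rhs = np.kron(rho2, rho1) @ vec(T)

assert np.allclose(vec(lhs), rhs)
```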
When pairwise tensor products are used as a nonlinearity, the dimensionality
is greatly increased with each application. Hence it is common to com-
bine tensor products with the Clebsch-Gordan decomposition (Kondor, Lin,
and Trivedi 2018), where a (fixed) change of basis is applied to the tensor
product in order to block-diagonalize it into irreducible representations (see
section 3.6.9), after which only a limited number of low-order representations
is retained.
Another way in which the tensor representation comes up is in graph neural networks, where each graph on n nodes is defined by an adjacency matrix A ∈ R^{n×n} (in practice, the entries of the matrix are binary, but one can also consider general edge or node-pair features). When we permute the nodes of the graph, we have to permute both the rows and the columns of the adjacency matrix. This corresponds to the tensor representation ρ_1 ⊗ ρ_1, where ρ_1(P) = P is the standard matrix representation of the permutation group. That is, the adjacency matrix transforms as A ↦ PAP^⊤.
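A short NumPy check of this (with hypothetical variable names): relabelling the nodes with a permutation matrix P conjugates the adjacency matrix, and in flattened form this is exactly the Kronecker product P ⊗ P:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 5
A = (rng.random((n, n)) < 0.4).astype(float)
A = np.triu(A, 1); A = A + A.T             # symmetric adjacency matrix, no self-loops

perm = rng.permutation(n)
P = np.eye(n)[perm]                        # permutation matrix for the relabelling

# Relabelling the nodes permutes both rows and columns: A -> P A P^T
A_relabelled = P @ A @ P.T

# Equivalently, the flattened adjacency transforms by the tensor representation
# rho_1 (x) rho_1, i.e. by the Kronecker product P (x) P (column-major vec).
vec = lambda M: M.reshape(-1, order="F")
assert np.allclose(np.kron(P, P) @ vec(A), vec(A_relabelled))
```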
3.6.7 Equivariant Normalization Layers
Normalization layers, such as Batch-, Layer-, Group- and Instance-
Normalization, are an important part of modern network architectures includ-
ing convolutional and attention-based models (Ioffe and Szegedy 2015; Ba,
Kiros, and Hinton 2016; Wu and He 2018; Ulyanov, Vedaldi, and Lempitsky
2016). Normalizing features throughout the network can improve optimization
speed and stability, and improve generalization. Generally speaking, normal-
ization layers compute a mean µ and standard deviation σ of features, use them
to normalize the features so that they have mean zero and standard deviation
one, and then optionally apply a learned shift and scale operation. They differ
in the axes over which the mean and variance are computed, and hence the
shape of µ and σ.
Regardless of the axes over which normalization is performed, a few
simple rules suffice to make normalization layers equivariant. As noted in
section 3.6.2, we are always permitted to linearly combine geometric fea-
ture vectors, provided that they all have the same representation and that the
scalar coefficients are invariant. Thus, we can compute the mean of geometric
feature vectors of the same representation, and subtract it from those feature
vectors, and the result will be equivariant with the same representation again.
We should however not average over the components of a single geometric
feature vector; the components are not in themselves geometric vectors with a
representation and so their mean is not a geometric quantity either. The stan-
dard deviation is a function of the sum of squares of the (mean-subtracted)
components, and this quantity (i.e. the norm) will be invariant as long as the
representation is orthogonal.
Let us describe an equivariant layer-normalization layer. Consider an array X ∈ R^{N×C×D}, where N is the batch size, C is the number of channels (the multiplicity), and D is the dimensionality of the geometric feature vectors^24. We assume that each vector X_{nc} transforms according to the same orthogonal representation. We will normalize over the C axis separately for each N. Thus, our mean has shape N × D and is computed as:

$$\mu_{nd} = \frac{1}{C} \sum_{c=1}^{C} X_{ncd}.$$
That is, we compute N mean vectors of dimension D by averaging over the C axis. We subtract this from X to obtain X̃_{ncd} = X_{ncd} − µ_{nd}. Then, we compute the squared norm for each geometric feature vector, and compute their average:

$$\sigma_n^2 = \frac{1}{CD} \sum_{c=1}^{C} \sum_{d=1}^{D} \tilde{X}_{ncd}^2. \qquad (3.29)$$
Finally, we divide:

$$Y_{ncd} = \frac{\tilde{X}_{ncd}}{\sqrt{\sigma_n^2 + \epsilon}}.$$
It is common to apply a gain and bias after normalization. As noted, scaling
by an invariant gain is always permitted, provided one uses the same gain for all
coefficients of a geometric feature vector. Applying a bias is only equivariant
if the bias vector satisfies ρ(g)b = b for all g ∈ G, i.e. if the bias is invariant.
Depending on ρ, this may mean that it has to be 0, while in other cases there
may be a space of solutions of some dimensionality.
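Putting the pieces together, here is a sketch of the layer described above (`equivariant_layer_norm` is a name chosen here; the division by the square root of σ²_n + ϵ follows the usual layer-norm convention), together with a numerical equivariance check for an orthogonal representation:

```python
import numpy as np

def equivariant_layer_norm(X, eps=1e-5):
    # X has shape (N, C, D): batch, channels (multiplicity), and the dimension of the
    # geometric feature vectors, which are all assumed to share one orthogonal
    # representation acting on the last axis.
    mu = X.mean(axis=1, keepdims=True)                 # (N, 1, D): mean over channels
    Xc = X - mu                                        # equivariant (same representation)
    var = (Xc ** 2).mean(axis=(1, 2), keepdims=True)   # (N, 1, 1): invariant, eq. (3.29)
    return Xc / np.sqrt(var + eps)

rng = np.random.default_rng(7)
X = rng.normal(size=(2, 4, 3))

theta = 1.2
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0,            0.0,           1.0]])   # orthogonal representation on R^3

# Transforming every geometric feature vector before or after normalization gives
# the same result: the mean is equivariant and the per-sample scale is invariant.
assert np.allclose(equivariant_layer_norm(X @ R.T),
                   equivariant_layer_norm(X) @ R.T)
```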
3.6.8 Equivariant Linear Maps
One of the most common types of map in neural networks is the learnable linear map, which includes fully connected layers, convolutions, depth-wise convolutions, and 1 × 1 convolutions. In the linear case it is often possible to understand the space of equivariant maps and parameterize it. In later chapters we will consider various cases with special structure, such as planar convolution, group convolution, and graph convolution. In this section we consider the case of a general linear map, also known as a fully connected layer.
Consider a group G, matrix representations ρ_1(g) ∈ R^{n×n} and ρ_2(g) ∈ R^{m×m} of G, and a matrix W ∈ R^{m×n}. For W to be equivariant, it has to satisfy the equation ρ_2(g)W = Wρ_1(g) for all g ∈ G. This system of equations is linear in W, and so we can solve for W to find a basis for the space of solutions. If W_i ∈ R^{m×n} (for i = 1, …, k) is a basis for the space of solutions, then any equivariant matrix W can be written as a linear combination of basis matrices: W = Σ_{i=1}^k θ_i W_i, and any such linear combination is equivariant.
It is common to add a learnable bias vector to the output, computing the affine map Wx + b instead of the linear map Wx, but this is not necessarily equivariant even if W is. We can reduce the affine case to the linear one by appending a constant 1 dimension to x. This quantity is invariant, so we add a constant 1 block (trivial representation) of size 1 × 1 to ρ_1 as well, yielding an (n + 1)-dimensional matrix representation. Then, solving for W as before yields an m × (n + 1) matrix containing the equivariant weight matrix and bias vector.
Depending on the group and the representations, it might be possible to
solve the equivariance constraint analytically, and this has been done for many
groups of interest. Indeed, as we will discuss in the next section, represen-
tation theory presents us with a systematic way to understand the space of
equivariant linear maps between two representations by breaking them down
into irreducible representations for which equivariant linear maps take a very
simple form.
Nevertheless, it can sometimes be more practical to solve the equivariance equations numerically. To do so, we would like to first bring the equivariance equation into the standard matrix–vector form Aw = 0, where w = vec(W) is the flattened weight matrix and A is a matrix of coefficients derived from ρ_1 and ρ_2. We first write the constraint as ρ_2(g)Wρ_1(g)^{-1} − W = 0. Next, we vectorize (flatten) using the vec operator. Using the general fact (Petersen and Pedersen (2012), Section 10.2.2, equation 520) that vec(AXB) = (B^⊤ ⊗ A) vec(X),
where ⊗ is the Kronecker product of eq. (3.28), we obtain:

$$\big(\rho_1(g)^{-\top} \otimes \rho_2(g) - I_n \otimes I_m\big) \operatorname{vec}(W) = 0 \qquad (3.30)$$
Thus, for each group element g we get a constraint matrix (ρ_1(g)^{-⊤} ⊗ ρ_2(g) − I_n ⊗ I_m) of size nm × nm. Assuming the group is finite with |G| elements, we can stack these matrices into a constraint matrix A of size |G|nm × nm. We can then use a standard method such as the SVD to solve the linear system Aw = 0 and obtain our (flattened) basis matrices w_i = vec(W_i). If the group is not finite, it is often sufficient to sample a large number of group elements and solve the system for those.
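A sketch of this numerical procedure in NumPy (`equivariant_basis` is a hypothetical helper, not a library function), applied to the permutation group S_3 acting on R^3, where the solution space turns out to be two-dimensional, matching the result of Zaheer et al. (2017) discussed below:

```python
import numpy as np
from itertools import permutations

def equivariant_basis(reps_in, reps_out, tol=1e-8):
    """Numerically solve rho_2(g) W = W rho_1(g) for all provided group elements.

    reps_in[k] is rho_1(g_k) (n x n), reps_out[k] is rho_2(g_k) (m x m); pass all
    elements of a finite group, or a large sample otherwise. Returns a basis of
    the space of equivariant matrices W (each of shape m x n).
    """
    n, m = reps_in[0].shape[0], reps_out[0].shape[0]
    blocks = []
    for r1, r2 in zip(reps_in, reps_out):
        # Constraint rho_2(g) W rho_1(g)^{-1} - W = 0 in flattened (column-major) form:
        # (rho_1(g)^{-T} kron rho_2(g) - I) vec(W) = 0
        blocks.append(np.kron(np.linalg.inv(r1).T, r2) - np.eye(n * m))
    A = np.concatenate(blocks, axis=0)        # stacked (|G| nm) x nm constraint matrix
    _, s, Vt = np.linalg.svd(A)
    null_rows = Vt[s < tol]                   # right-singular vectors with ~zero singular value
    return [v.reshape(m, n, order="F") for v in null_rows]

# Example: linear maps R^3 -> R^3 equivariant to S_3 permuting the coordinates.
perms = [np.eye(3)[list(p)] for p in permutations(range(3))]
basis = equivariant_basis(perms, perms)
print(len(basis))   # -> 2 (the space spanned by the identity and the all-ones matrix)
```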
Performing the linear combination W = Σ_{i=1}^k θ_i W_i can be expensive when the W_i are large matrices, or when the number of basis matrices k is large. Once training is complete, one can perform the linear combination once and simply store W as parameters, as in a conventional neural network, but during training this operation must be performed during each forward pass. Fortunately, the matrices W_i can often be chosen to be sparse with non-overlapping sparsity patterns. As a result, the matrix W will display a weight-sharing pattern (Ravanbakhsh, Schneider, and Poczos 2017).
One example is convolution, which, as we discuss in chapter 7, corresponds to a matrix where the entries on each diagonal are the same. Another example, discussed in chapter 5, is the permutation-equivariant linear map. Consider a linear map W from R^n to itself that is equivariant to permutations g ∈ S_n acting by permutation of coordinates. One can show that such a map has only two parameters (Zaheer et al. 2017):

$$W = \theta_1 W_1 + \theta_2 W_2, \qquad (3.31)$$

where W_1 = I_n is the identity matrix, and W_2 = 11^⊤ − I_n is the matrix with zero diagonal and one everywhere else. Thus, instead of computing the linear combination, we could in this case copy the weight θ_1 across the diagonal, and copy θ_2 to all off-diagonal entries of W.
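The weight-sharing form can be implemented without ever materializing W (a minimal sketch; `deepsets_linear` is a name chosen here):

```python
import numpy as np

def deepsets_linear(x, theta1, theta2):
    # Permutation-equivariant linear map on R^n (Zaheer et al. 2017):
    # W = theta1 * I + theta2 * (11^T - I), applied without building W explicitly.
    return theta1 * x + theta2 * (x.sum() - x)

rng = np.random.default_rng(8)
n = 6
x = rng.normal(size=n)
perm = rng.permutation(n)
P = np.eye(n)[perm]

theta1, theta2 = 0.8, -0.3
# Explicit weight matrix for comparison with the weight-sharing form above.
W = theta1 * np.eye(n) + theta2 * (np.ones((n, n)) - np.eye(n))

assert np.allclose(deepsets_linear(x, theta1, theta2), W @ x)
# Equivariance: permuting the input permutes the output in the same way.
assert np.allclose(deepsets_linear(P @ x, theta1, theta2),
                   P @ deepsets_linear(x, theta1, theta2))
```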
As noted in section 3.5 and further studied in section 3.6.1, it is common in equivariant networks to use block-diagonal representations. In this case the constraint for W splits into blocks as well:

$$\begin{pmatrix} \rho^2_1(g) & 0 \\ 0 & \rho^2_2(g) \end{pmatrix} \begin{pmatrix} W_{11} & W_{12} \\ W_{21} & W_{22} \end{pmatrix} \begin{pmatrix} \rho^1_1(g)^{-1} & 0 \\ 0 & \rho^1_2(g)^{-1} \end{pmatrix} = \begin{pmatrix} W_{11} & W_{12} \\ W_{21} & W_{22} \end{pmatrix} \qquad (3.32)$$
The equivariance constraint can then be solved for each block of W separately, since the (i, j)-th block only needs to be equivariant with respect to the i-th block of ρ_2 and the j-th block of ρ_1:

$$\rho^2_i(g)\, W_{ij}\, \rho^1_j(g)^{-1} = W_{ij}. \qquad (3.33)$$
Furthermore, once we have solved these equations for all blocks, we can form
the full matrix W by linearly combining basis blocks and parameters, which
is more efficient than linearly combining full basis matrices.
3.6.9 *Irreducible Representations
We have seen that we can combine representations into a larger one by stack-
ing them block-diagonally, and we have seen that we can change the basis of
a representation to obtain an equivalent one. Using these operations, we can
create rather a lot of representations. It is natural to ask whether we can go
in the other direction: Given some arbitrary representation, can we change the
basis to make it block-diagonal (with the same block structure for all g ∈ G)?
And could we do this recursively on the blocks, until the representation matri-
ces are diagonal? These questions lead us to one of the most central and useful
concepts in representation theory, namely the concept of irreducible represen-
tations (irreps). We will see how irreps provide a systematic understanding of
the space of representations and the equivariant linear maps between them.
To define irreducible representations we must first define invariant sub-
spaces. Let ρ be a representation in a vector space V. An invariant subspace is a
linear subspace W ⊆ V such that for all g ∈ G and w ∈ W, we have ρ(g)w ∈ W. So, while vectors w ∈ W are not invariant themselves (ρ(g)w ≠ w in general),
they stay inside the subspace W when acted on by any group element, and so
we say that the subspace W is invariant.
For any representation ρ on V, the subspaces W = {0} and W = V are invari-
ant, so these are called trivial invariant subspaces. A representation with a
non-trivial invariant subspace is called reducible, and one without such a
subspace is called irreducible.
If we have a reducible representation ρ with a non-trivial invariant subspace W ⊂ V, we can find a change of basis so that ρ becomes block upper-triangular:

$$\rho'(g) = A\rho(g)A^{-1} = \begin{pmatrix} \rho_{11}(g) & \rho_{12}(g) \\ 0 & \rho_{22}(g) \end{pmatrix} \qquad (3.34)$$

A vector of the form w̃ = (w, 0) will remain of this form: ρ′(g)w̃ = (ρ_{11}(g)w, 0), i.e. it remains in the subspace W.
If it is possible to find a basis where additionally ρ_{12}(g) = 0, i.e. where ρ′ is block-diagonal, we say that ρ is decomposable, and otherwise it is called
indecomposable. Irreducible representations are always indecomposable, but
the converse is not true in general. However, under some additional assump-
tions, which hold for most cases of practical interest in GDL, the converse
is true, which means we can keep decomposing a representation until all the
blocks are irreducible (and thus indecomposable). In particular, this is true for
(complex) representations of finite groups, and for finite-dimensional unitary
representations.
Since (under benign conditions) a representation can be decomposed into a
direct sum of irreps, one would like to know all irreps for the group of interest.
Then, we would have a complete understanding of the space of representations
of our group. Fortunately, the irreps of most groups we care about have been
classified, and so usually this is a matter of looking up the irreps in a reference
work such as Vilenkin and Klimyk (1991). Computing the irreps by hand for a group of interest can also be educational^25.
In chapter 7 we will consider grids, and groups of shifts acting on them.
These groups are commutative, and such groups have one-dimensional (com-
plex) irreps, so that any representation of such a group can be fully diagonal-
ized. Take for instance the group of 1D cyclic shifts C_n (section 3.3.3.1), which has irreps of the form ρ_k(t) = e^{2πikt/n} (where i is the imaginary unit, t ∈ C_n is a translation, and k ∈ {0, …, n − 1}).
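Numerically, this is the statement that the DFT matrix diagonalizes the (regular representation of the) cyclic shift, with the irrep values on the diagonal (a small sketch; S and F are names chosen here):

```python
import numpy as np

n = 8
# Regular representation of a unit shift on the cyclic group C_n: (S x)[i] = x[i-1].
S = np.eye(n)[(np.arange(n) - 1) % n]

# Rows of F evaluate the complex irreps rho_k(t) = exp(2 pi i k t / n) on C_n.
k, t = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
F = np.exp(2j * np.pi * k * t / n) / np.sqrt(n)       # unitary DFT matrix

# Changing basis with F diagonalizes the shift; the diagonal entries are the
# one-dimensional irrep values rho_k(1) = exp(2 pi i k / n).
D = F @ S @ np.conj(F).T
assert np.allclose(D, np.diag(np.exp(2j * np.pi * np.arange(n) / n)))
```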
Those familiar with Fourier analysis may recognize the irreps ρ_k as the basis functions used in the Fourier transform, and indeed representation theory reveals a vast generalization of Fourier analysis that applies to a large class of groups. Each group has a different set of (inequivalent) irreps ρ_λ indexed by some set Λ called the spectrum. The values λ can be interpreted as (generalized) frequencies, and depending on the group, the set Λ could be finite, countably infinite (e.g. Λ = Z), or uncountably infinite (e.g. Λ = R or Λ = C). In chapter 8 we will discuss this topic in more depth.
The spectrum also gives us a convenient way to describe an arbitrary representation in a basis-independent way. If we have a fully reducible representation ρ of a group G with spectrum Λ, we can decompose it as a direct sum of irreps: ρ = ⊕_j ρ_{λ_j} for some list of frequencies λ_j ∈ Λ. An even more compact description is to count the multiplicity m_λ, i.e. how often each irrep ρ_λ occurs in ρ. The list of frequencies or multiplicities can be called the type of ρ. When we wish to be more precise, we can also include the change of basis matrix A that decomposes ρ into irreps as part of the type, so that the type precisely identifies one representation ρ.
3.6.10 *Schur’s Lemma
Knowing how representations decompose into irreps is very helpful in under-
standing the space of equivariant maps. We saw in section 3.6.8 (eq. (3.32))
that if we have two representations ρ_1, ρ_2 that split into blocks ρ^1_i, ρ^2_j, then a (ρ_1, ρ_2)-equivariant linear map W splits into (ρ^1_j, ρ^2_i)-equivariant blocks W_{ij}.
So by decomposing the representations into irreps, we can reduce the problem
of characterizing the space of equivariant linear maps to the case of irreps. In
other words, if we know how to parameterize equivariant W_{ij}, all we need to do is stack them to form an equivariant W. Schur's lemma tells us how to parameterize the space of equivariant linear maps W_{ij} between irreps i, j.
Schur’s lemma has two parts which depend on different conditions, so we
treat them separately.
Lemma 2 (Schur I) Let G be a group, and let ρ_1, ρ_2 be irreducible representations of G on vector spaces V_1, V_2 over a field K (e.g. the real or complex numbers). Then a (ρ_1, ρ_2)-equivariant linear map is either an isomorphism or the zero map.
In simple terms, assuming we are working with finite-dimensional vector spaces and concrete matrices, to say that representations ρ_1, ρ_2 are isomorphic or equivalent is to say that an appropriate change of basis (section 3.6.3) turns one into the other; that is, the equivariant map W_{12} between them is invertible. Conversely, if the representations are not equivalent, that means there is no isomorphism between them, and so according to Schur any equivariant linear map must be zero. Thus, already we know that the blocks W_{ij} of W corresponding to inequivalent irreps must be zero.
Lemma 3 (Schur II) Let G, ρ_1, ρ_2 be as in Schur I, and assume additionally that ρ_1 = ρ_2 (so that V_1 = V_2), that this space is finite dimensional, and that K is algebraically closed (e.g. the complex but not the real numbers). Then a (ρ_1, ρ_2)-equivariant linear map is a scalar multiple of the identity, i.e. W = cI where c ∈ K.
Thus, at least in the case of complex representations, we know how to parameterize the equivariant linear maps W_{ij} for irreps i, j by a single complex parameter! The situation for real representations is a little bit more involved, but one can show that the space of equivariant linear maps between equivalent real irreps is either 1-, 2- or 4-dimensional, depending on whether the real representation is of real, complex or quaternionic type. This can be ascertained via a map called the Frobenius-Schur indicator. More mathematical details can be found in Serre 1977; Etingof et al. 2011. In practice, it may be most convenient to compute the equivariant linear maps for given irreps numerically using the technique presented in section 3.6.8, or to use a library that has already implemented them for the group of interest (Geiger and Smidt 2022; Geiger et al. 2022; Cesa, Lang, and Weiler 2022).
Exercises
1. Consider a function f : X → Y and assume that it is equivariant, i.e. that:

$$f(\rho_X(g)x) = \rho_Y(g)f(x), \qquad (3.35)$$
for g any element of some group G and actions ρ_X and ρ_Y of G.
Prove that if two inputs have the same output, f(x) = f(x′), then the transformed inputs ρ_X(g)x and ρ_X(g)x′ also have the same output.
2. Consider a label function f : X → Y between finite sets X, Y. How many symme-
tries (invariances) does f have?
3. Consider an unknown label function f : X → Y between finite sets, and assume that
we know the full symmetry group G of f . What is the minimum number of labelled
examples (x, y) we need to determine f ?
4. Consider a group action α(g, x) and show that the map f_1(x) = α(g, x) is inverse to f_2(x) = α(g^{-1}, x).
5. Consider the cyclic group C_4 = {e, r, r², r³}, where e is the identity and r is a 90-degree rotation. Create a table with the 4 group elements along the rows and columns, and fill entry i, j with the product (group composition) of the corresponding group elements. You have written the multiplication table for C_4.
6. Given a permutation σ : [1, …, n] → [1, …, n], how do we produce the corresponding n × n permutation matrix P_σ? Show that the composition of maps σ ∘ σ′ corresponds to matrix multiplication P_σ P_σ′.
7. Show that given matrix representations ρ_1, …, ρ_n of G, the map ρ(g) = block-diag(ρ_1(g), …, ρ_n(g)) is also a representation. (Sec. 3.6.1).
8. Consider a G-invariant function f : X → Y. We say that f is complete if for any x, x′ for which there exists no g ∈ G such that gx = x′, we have f(x) ≠ f(x′). Show that if f : X → Y is a complete invariant mapping, any invariant mapping f′ : X → Z can be written as f′(x) = f″(f(x)) for some mapping f″ : Y → Z.
9. Show that a group homomorphism maps the identity to the identity.
10. Show that the normalization operation Y_{ncd} = X̃_{ncd}/√(σ²_n + ϵ) defined in Sec. 3.6.7 is equivariant.
Notes
1
More specifically, vectors are abstract objects that can be added (vector
addition) and scaled (scalar multiplication) according to certain rules. One
often hears vectors defined as “arrows” or “arrays of numbers”; mathematically
speaking, none of these definitions is correct: the former requires an additional
structure called inner product (which defines direction), the latter requires an
additional assumption of a basis (in which a vector can be represented using
coordinates).
2
In the context of Euclidean geometry, congruent and isomorphic mean the
same thing, but congruent is more common in elementary courses whereas
isomorphism is applied throughout mathematics.
3
For example, all linear maps preserve the structure of a linear space in that
they preserve linear relations: αx + βy = z ⇒ α(Ax) + β(Ay) = Az. However,
only the invertible linear maps from a vector space to itself could be called
symmetries (automorphisms) of that space.
4
An affine space is like a vector space, except that it does not have a
designated origin 0 that must be preserved by symmetries.
5
This kind of notation is common in category theory and is explained in
Appendix E.
6
After the Norwegian mathematician Niels Henrik Abel (1802–1829).
7
We mostly work with finite-dimensional real matrices, but one may also consider
groups of complex matrices, and infinite-dimensional matrices.
8
Cayley’s theorem is a special case of a deep result in category theory called
the Yoneda Lemma.
9
The classification of finite simple groups spans tens of thousands of pages
and was completed largely between 1955 and 2004.
10
Quaternions are non-commutative hypercomplex numbers of the form a +
bi + cj + dk. They are often used as a description of 3D rotations.
11
By “preserving vector space structure” we mean that linear maps A preserve
linear relations: αx + βy = z ⇒ α(Ax) + β(Ay) = Az for scalars α, β and vectors
x, y, z.
12
More precisely, ∠(x, y) = cos^{–1}(x^⊤y / (∥x∥∥y∥)).
13
A field in this context is a scalar-, vector-, or tensor-valued function on some
geometric domain.
14
For example, physicists would often say that “a tensor is an object that
transforms as a tensor”. In representation-theoretic terms, one could say that
tensors are quantities that transform according to a tensor representation of the
general linear group.
15
Recall that mappings like ρ(e) and id_X are equal if and only if they give equal outputs for all inputs, so the Identity axiom could also be written as ρ(e)(x) = x for all x ∈ X.
16
More precisely, to preserve these relations means that gh = k ⇒ ρ(g) ∘ ρ(h) = ρ(k).
17
Despite its name, the trivial representation has many important uses, since
it is the representation we use for any invariant quantity.
18
One can give a general definition using the category-theoretic notion of an
“action of a group object internal to a category”.
19
When the vector space is infinite dimensional, as it will be when we consider
spaces of signals / functions on a continuous domain, the “matrix” will be
infinite dimensional as well.
20
Learning the group or its representation is an interesting and important
active area of research.
21
In/equivariance in CNNs holds only up to border effects and aliasing unless
specifically addressed; see Azulay and Weiss 2019; Zhang 2019; Vasconcelos
et al. 2021; Karras et al. 2021; Weiler and Cesa 2019; Cesa, Lang, and Weiler
2022.
22
One can also consider representations by signed permutation matri-
ces, which are matrices with a single 1 or –1 in each row and column.
For such representations, one can use the CReLU nonlinearity CReLU(x) =
(ReLU(x), ReLU(–x)). See Cohen and Welling (2017) and Shang et al. (2016).
23
In mathematics and physics, one usually considers tensor products of vec-
tors and dual vectors (which have different representations associated with
them), leading to tensors of mixed type, with some number of upper and lower
indices. Here we will not consider dual vectors, though.
24
Here we assume that all geometric feature vectors are of the same dimen-
sion and have the same representation. The approach can be generalized to the case where we have C_λ copies of various representations ρ_λ by applying the calculation for each λ separately.
25
One approach would be to explicitly construct and then decompose the
regular representation (G acting on the space of functions on G). According to
the Peter-Weyl theorem, under suitable conditions, the regular representation
decomposes into a direct sum of all irreps, with multiplicity equal to their
dimension.