3
Foundations of Equivariant Deep Learning
We define symmetries as structure-preserving transformations of an
object, and study examples.
Symmetries of a learning problem are transformations of the input space
that leave the output invariant.
To exploit symmetries in deep learning, we use equivariant and invariant
layers.
Symmetries form groups. In this chapter we study basic notions from
group theory.
Groups act on data via group actions or representations. In this chapter
we study basic notions from representation theory.
We study equivariant mappings and equivariant networks in general, and
provide an overview of common equivariant neural network layers.
In this chapter we begin our study of the foundations of geometric deep
learning. The central concept is symmetry: a transformation that leaves all
relevant structure of an object unchanged. We will look at the way in which
symmetries arise in machine learning, and study the central algebraic structure
used in the mathematical study of symmetries: the group. We define general
and linear group actions / representations, which tell us how the elements of
an abstract group can be represented by concrete (linear) maps. Using group
representations, we can specify how a group acts on various kinds of geomet-
ric data, from images to point clouds to tensor fields. Next we discuss invariant
and equivariant maps, which are the building blocks of equivariant networks
and the primary tools we use to leverage knowledge of symmetries to improve
the statistical efficiency of deep learning methods. We end the chapter with
the equivariance cookbook, which provides a number of common kinds of
equivariant maps from which equivariant networks can be constructed.
Our focus in this chapter is on the algebraic (group-theoretic) aspects of
GDL, whereas the next chapter will cover the geometric ones. Concretely, this
means that in the present chapter we consider data that lives in a plain vector
62 Chapter 3
space with a group of symmetries acting on it, but without too much further
structure, while in the next chapter this vector space is taken to be the space of
signals (functions) on a space (e.g. graph, manifold, etc.). Thus, for now, we
can talk about symmetries, groups, representations and equivariance, but we
will not consider concepts such as locality and convolution, which depend on
the structure of the underlying space.
3.1 Symmetry Transformations
Let us begin by giving a general definition of symmetry, before we consider
the role it plays in geometric deep learning. Just as the best way to define
“vector” is “element of a vector space” (defining the thing by its relation to
other things of the same kind¹), one could define “symmetry” as “element of
a symmetry group”. But although such a definition would be general, compact
and precise, it would not be very intuitive. So we begin with a more infor-
mal definition, and then show how the notion of group (to be defined) neatly
captures the idea.
The modern concept of symmetry applies to a wide range of abstract
mathematical structures, but let us first consider the familiar example of an
equilateral triangle (fig. 3.1, left). We say that rotation by 120° is a symmetry
of the triangle, because after applying this transformation the triangle
looks exactly as it did before. The same is true for reflection about one of
the three bisectors, as well as arbitrary sequences of rotations and reflections.
Furthermore, the inverse of any of these symmetries is itself also a symmetry.
Figure 3.1
Euclidean symmetries (left) and general isometries (right) of a triangle.
When drawn on paper, the original and transformed triangle look exactly
the same. However, for other ways of encoding or representing the triangle,
we may find that there are actually some superficial changes. For instance, if
we encode the triangle as an ordered list of three corner points in (x, y) coor-
dinates, then a rotation by 2π/3 of each coordinate vector leads to an overall
permutation (cyclic shift) of the three corners. Nevertheless, if we forget about
the ordering, we still have the same set of corners. Since we do not consider
the order of the corners to be part of the essential structure of the triangle, the
rotated triangle (list of corner coordinates) should be considered equivalent to
the one we started with. This phenomenon is key to geometric deep learning,
because any sensible computation on such data (including for instance those
performed by a neural network) should treat equivalent inputs equivalently,
even if they have superficial differences.
In this example and also in general, a symmetry transformation is a special
kind of structure-preserving map. What is meant by structure is highly context-
dependent, but in order to discuss symmetries one has to choose some notion
of structure of interest. If we are doing Euclidean geometry, the structure of
interest is Euclidean distance structure, and so here structure-preserving map
means distance-preserving map, aka isometry or Euclidean transformation.
More precisely, g : R^n → R^n is an isometry with respect to the distance metric d if
    d(gu, gv) = d(u, v)
for all u, v ∈ R^n. The isometries of the plane consist of rotations, reflections,
translations, and all possible compositions of these.
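As a minimal numerical illustration (a NumPy sketch with arbitrarily chosen points and an arbitrarily chosen transformation), one can check that a rotation followed by a translation of the plane satisfies the isometry condition d(gu, gv) = d(u, v):

```python
# Sketch: numerically check that a planar rotation followed by a translation
# is an isometry, i.e. d(gu, gv) = d(u, v).
import numpy as np

theta = 0.7                      # an arbitrary rotation angle
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
t = np.array([2.0, -1.0])        # an arbitrary translation vector

def g(x):
    """Apply the isometry g(x) = Rx + t."""
    return R @ x + t

u = np.array([0.3, 1.2])         # two arbitrary points in the plane
v = np.array([-0.5, 0.8])

d_before = np.linalg.norm(u - v)
d_after = np.linalg.norm(g(u) - g(v))
print(np.isclose(d_before, d_after))  # True: distances are preserved
```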
Applying an arbitrary isometry to our triangle, we obtain a new but congruent
(or isomorphic²) triangle (fig. 3.1, right). When the transformed triangle
is not just congruent but actually the same as the one we started with, we call
the transformation a symmetry of the triangle. In other words, a symmetry
transformation is a way in which the triangle is congruent to itself.
With this in mind we could say in general that a symmetry is a way in which
an object is equivalent (or isomorphic) to itself, or more prosaically:
A symmetry transformation (or automorphism) of an object is an
invertible and structure-preserving map from the object to itself.
Oftentimes, we are interested in the symmetries of a space. Consider for
example the plane endowed with a Euclidean distance metric. This space has as
symmetries all isometries of the plane, i.e. all rotations, reflections and trans-
lations. They are symmetries because they are maps from the plane to itself,
they are invertible, and they preserve distances. In this case it happens that
all structure-preserving maps are invertible, but in general one has to
add this as an additional requirement.³
The same set may be endowed with different structures, leading to different
symmetries. For instance we could see the plane as:
1. A set, so that symmetries are bijections (invertible maps; permutations),
2. A vector space, so that symmetries are invertible linear transformations,
3. An affine space⁴, so that symmetries are invertible affine maps (linear maps
plus translation),
4. A topological space, so that symmetries are homeomorphisms (continuous
maps with continuous inverse),
5. A smooth manifold, so that symmetries are diffeomorphisms (smooth maps
with smooth inverse).
We could continue the list, regarding the plane each time as an inner product
space, Euclidean metric space, and so on. Those not familiar with these notions
should not worry: the only point we wish to make is that what counts as a
symmetry depends on exactly what we consider to be the structure of interest.
3.2 Symmetries of Learning Problems
How do symmetries appear in machine learning, and why are they relevant?
In this section we look at two kinds of symmetries that arise in supervised
learning: symmetries of the target function (manifested as invariance or equiv-
ariance) and symmetries of the parameterization. The former will be used
throughout this book to develop statistically efficient networks that respect
such symmetries, whereas the latter has relevance for optimization but will
not be the focus of this book.
3.2.1 Invariance
In supervised learning (section 2.1) the goal is to use data samples to approximate
the target function f : X → Y mapping from input space X to output space
Y. For concreteness we will consider the problem of classification, where Y is
the set of class labels, but we could just as well consider a continuous output
space in case of regression.
In the previous section, we defined a symmetry of an object as an invertible
structure-preserving transformation of the object. For learning problems such
as classification, the object of interest is the input space X, and we consider
the label function f to define the structure of interest (“class structure”). A
symmetry of the learning problem is thus an invertible and label-preserving
map g : X → X. To say that g preserves the label, or leaves it invariant, is to
say that for all x ∈ X, we have (f ∘ g)(x) = f(x). That is, applying g and then f to
any input x ∈ X gives the same result as just applying f directly. In diagrammatic
Figure 3.2
Rotation is a symmetry of the label function in image classification.
form, we can express this as:

              Y
            ↗   ↖
          f       f
        X ────g────▶ X                                        (3.7)
A commutative diagram like this expresses the fact that if we compose maps
(arrows) along any directed path with the same begin and end points, the result-
ing composite maps are the same. Hence this diagram expresses the fact that g
followed by f equals f.⁵
As an example of a label symmetry, consider image classification, where X
is the space of planar images and Y is a set of labels. Typically, the class label
does not depend on the position and orientation of the object, so a translation
or rotation of the image will be a symmetry (see fig. 3.2). Similarly, if we are
classifying a point cloud, then X is a space of point sequences. Typically, the
label does not depend on the order of the points, so any permutation of points
will be a symmetry. Additionally, it may be that rotating and translating all of
the points simultaneously does not change the label.
The examples of symmetries mentioned above are quite natural, but in gen-
eral there will be many more. For instance, if we have two inputs x, x′ ∈ X with
the same label (i.e. f(x) = f(x′)), we can define a transformation g that swaps
them, while leaving all other points unchanged. This will be a symmetry of f,
but it is unlikely we would know it if we didn’t already know f!
If somehow we came to know all symmetries of a label function f, learning
would become almost entirely trivial. The reason is that if we observe a data point
x with its label, we immediately learn the label of all points x′ in the same class,
because there exists a symmetry that swaps x and x′. So in theory we would
need only one label per class. This example nicely demonstrates how knowl-
edge of symmetries can in principle be used to improve data efficiency, but it
also suggests that if we knew all symmetries of the label function we wouldn’t
really need learning. In other words, for problems where we do need learning,
Figure 3.3
One invariant and three equivariant computer vision problems: image classification,
oriented keypoint detection, image segmentation, optical flow estimation.
we should not expect to be able to know all of the symmetries a priori. In chap-
ter 4, we will consider a natural class of symmetries arising from the domain
on which the data lives, and which we often know about a priori. For now we
will simply assume knowledge of some symmetries without worrying about
the specific kind or how we have come to know them.
3.2.2 Equivariance
For invariant learning problems, the output remains unchanged when we apply
a symmetry transformation to the input. There is another class of problems
where instead the target function is equivariant, meaning that the output trans-
forms along with the input (see fig. 3.3 for examples). In image segmentation
for instance, the input is an RGB image and the output is a segmentation
mask, which assigns a label to each pixel. When we shift the input, the out-
put should also shift. Another example is image registration, where the goal is
to output a transformation that puts the image in a canonical pose (e.g. putting
a brain scan in upright position and perhaps applying a deformation so that
brain regions can be compared voxel-wise). Similarly, learned or hand-crafted
keypoint detectors (e.g. SIFT; Lowe 1999; Lenc and Vedaldi 2016) should be
equivariant: when the image is transformed, the keypoint position, scale, and
orientation transforms along with it.
As these examples show, the output will in general be a different kind of
geometric quantity than the input, meaning that it lives in a different space and
transforms in a different way under the same symmetry. For instance, the same
rotation can act on an input signal (e.g. brain scan with n
3
voxels) and on the
output pose variable (e.g. a 3 ×3 rotation matrix). What it means to “rotate”
a voxel grid or 3 ×3 matrix is different. For this reason we will formalize the
concepts of group (e.g. rotations) and group representations / actions (how to
apply a rotation to various kinds of quantities) later in this chapter. For now,
we will simply write g for a symmetry, and write ρ_X(g) : X → X and ρ_Y(g) :
Y → Y for the way it acts on the input and output space.
Equivariance of the target function f : X → Y can then be stated as:
    f ∘ ρ_X(g) = ρ_Y(g) ∘ f.                              (3.8)
This is an equality of mappings, which means that the mappings on the left
and right-hand side should agree on all inputs x ∈ X. Hence, this equation says
that if we apply a symmetry transformation g to an input x using ρ_X and then
apply f, we get the same output as applying f first and then applying the same
symmetry g to the output using ρ_Y.
We can also express equivariance in diagrammatic form (appendix E), so that
we can easily see the domains and codomains of the maps involved:
              ρ_Y(g)
        Y ──────────▶ Y
        ▲             ▲
      f │             │ f
        X ──────────▶ X
              ρ_X(g)                                          (3.9)
To say that this diagram commutes is to say that starting in the bottom left X,
going up and then right is the same as right and then up.
We can obtain yet another way of expressing equivariance by multiplying
eq. (3.8) on the left by ρ_Y(g)⁻¹ and cancelling:
    ρ_Y(g)⁻¹ ∘ f ∘ ρ_X(g) = f.
This shows most clearly that also in the case of equivariance we can think of
g as a symmetry, because it leaves f invariant when acting on the input and
output space simultaneously, similar to the invariance equation f ∘ ρ_X(g) = f.
Indeed one can see that invariance (eq. (3.7)) is a special case of equivariance
(eq. (3.9)) where ρ_Y(g) = id_Y is the identity map for all g ∈ G.
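As a minimal illustration of both conditions (a NumPy sketch with an arbitrarily chosen permutation and input), consider vectors acted on by permutation matrices: an elementwise nonlinearity is equivariant, while summing the entries is invariant:

```python
# Sketch: invariance and equivariance under permutations.
import numpy as np

rng = np.random.default_rng(0)
n = 5
x = rng.normal(size=n)

perm = rng.permutation(n)
P = np.eye(n)[perm]                      # rho_X(g): a permutation matrix

relu = lambda z: np.maximum(z, 0)        # elementwise map, f: R^n -> R^n
total = lambda z: z.sum()                # readout, f: R^n -> R

# Equivariance: f(rho_X(g) x) == rho_Y(g) f(x), here with rho_Y = rho_X.
print(np.allclose(relu(P @ x), P @ relu(x)))   # True

# Invariance: f(rho_X(g) x) == f(x), i.e. rho_Y(g) is the identity.
print(np.isclose(total(P @ x), total(x)))      # True
```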
3.2.2.1 Equivariance and Data Efficiency As with invariance, equivari-
ance puts a constraint on the hypothesis space F, thereby reducing the
statistical complexity of learning (chapter 2). The more symmetries we have,
the more constraints we get, and the smaller the remaining hypothesis space.
Although we have so far discussed equivariance of the network as a whole,
in practice one would build an equivariant network by composing learnable
equivariant layers, each of which is subject to a constraint that reduces the
number of parameters in that layer.
To get a better understanding of the role of equivariance in pattern recognition,
consider fig. 3.4. In this example, a neural network layer learns a
style-agnostic representation of the pattern A (two inputs x, x′ ∈ X, representing
two styles of ‘A,’ are mapped into the same point, y = f(x) = f(x′) ∈ Y).
However, when the input patterns are transformed by g (in our example, a
rotation ρ_X(g)x and ρ_X(g)x′), a general neural network does not necessarily
generalize in a symmetry-consistent manner, i.e., it might be that f(ρ_X(g)x) ≠ f(ρ_X(g)x′).
By imposing equivariance, we guarantee that the model generalizes in a way
that is consistent with the symmetry: from the equivariance condition,
    f(ρ_X(g)x) = ρ_Y(g)f(x)
    f(ρ_X(g)x′) = ρ_Y(g)f(x′);
together with the assumption f(x) = f(x′), it follows that f(ρ_X(g)x) = f(ρ_X(g)x′).
Figure 3.4
Example of the role of equivariance in pattern recognition.
In addition to these heuristic arguments, there is a growing body of work
in learning theory that formally characterizes the benefits of equivariance (Lyle
et al. 2019; Sannai, Imaizumi, and Kawano 2021; Elesedy and Zaidi 2021;
Behboodi, Cesa, and Cohen 2022).
3.2.3 Symmetries of the Parameterization
There is another way in which symmetries appear in machine learning. When
we approximate the true label function f : X → Y by a parameterized function
f : X × Θ → Y, we may be introducing symmetries of the parameterization.
Whereas input-space symmetries act on X, a parameter-space symmetry acts
on Θ. A mapping h : Θ → Θ is a symmetry of the parameterization if for all
u ∈ X and θ ∈ Θ:
    f(u, hθ) = f(u, θ).                                   (3.10)
In other words, the mappings f(–, θ) and f(–, hθ) are the same; the parameter
vectors θ and hθ correspond to the same classifier.
In neural networks, one can usually permute hidden units (along with their
incoming and outgoing weights), without changing the input-output mapping.
Furthermore, for certain kinds of non-linearities like ReLU(x) = max(x, 0), we
can scale the input and inversely scale the output of the nonlinearity without
changing the function: α⁻¹ ReLU(αx) = ReLU(x) for any α > 0.
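Both kinds of parameter-space symmetry can be checked numerically; the following is a minimal sketch, assuming a two-layer ReLU network f(x) = W_2 ReLU(W_1 x) with arbitrarily chosen weights:

```python
# Sketch: parameter-space symmetries of a 2-layer ReLU MLP
#   f(x; W1, W2) = W2 @ relu(W1 @ x)
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid, d_out = 4, 6, 3
W1 = rng.normal(size=(d_hid, d_in))
W2 = rng.normal(size=(d_out, d_hid))
x = rng.normal(size=d_in)

relu = lambda z: np.maximum(z, 0)
f = lambda x, W1, W2: W2 @ relu(W1 @ x)

# 1) Permute hidden units (rows of W1 and columns of W2 simultaneously).
perm = rng.permutation(d_hid)
print(np.allclose(f(x, W1, W2), f(x, W1[perm], W2[:, perm])))   # True

# 2) Rescale one hidden unit by alpha > 0 and undo it on the outgoing weights,
#    using that relu(alpha * z) = alpha * relu(z) for alpha > 0.
alpha = 2.5
W1_s, W2_s = W1.copy(), W2.copy()
W1_s[0, :] *= alpha
W2_s[:, 0] /= alpha
print(np.allclose(f(x, W1, W2), f(x, W1_s, W2_s)))              # True
```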
Knowledge of parameter-space symmetries can be useful in optimization,
Bayesian inference, and model averaging (Kunin et al. 2021; Ainsworth,
Hayase, and Srinivasa 2022; Wei et al. 2022; Kurle et al. 2022; Zhao et
al. 2022; Zhao et al. 2023; Wei and Lau 2023). In some applications it is useful
for one neural network to process the weights of another, in which case the
former should respect the weight-space symmetries (Navon et al. 2023; Zhou
et al. 2023). In the rest of this book we will primarily focus on input space
symmetries and how to use them to improve statistical efficiency.
3.3 Symmetry Groups
Having defined symmetries and seen how they emerge in machine learning, we
will now begin our study of the mathematical machinery used to think about
symmetries in a systematic and general way. We begin in this section with
the concept of a symmetry group, before studying group representations and
equivariant maps.
The set of all symmetries of an object has a number of salient properties,
which have been distilled into a few abstract group axioms. The first property
to note is that the identity transformation (“do nothing”) is always a symme-
try, in a trivial way. Secondly, symmetries may be combined to obtain new
symmetries: if g and h are two symmetries, then their composition g ∘ h (first
applying h and then g) is also a symmetry, and this composition operation is
associative. Finally, by definition symmetries are always invertible, and the
inverse is also a symmetry. We encourage the reader to verify for themselves
that these properties indeed hold for symmetries as we have defined them in
section 3.1.
It is possible to define a symmetry group as a set of symmetry transformations
g : X → X, satisfying these axioms (identity, closure under composition
and inverses). However it turns out, as it often does, that in order to study a
group it is not so important what symmetries are (i.e. what they do to ele-
ments of X), but rather how they compose with other symmetries. So the
most common definition of group does not say what the group elements are
(symmetries), but only encodes the compositional structure:
A group is a set G along with a binary operation ∘ : G × G → G
called composition or the group operation (for brevity, denoted by
juxtaposition g ∘ h = gh) satisfying the following axioms:
Associativity: (gh)k = g(hk) for all g, h, k ∈ G.
Identity: there exists a unique e ∈ G satisfying eg = ge = g for all g ∈ G.
Inverse: For each g ∈ G there is a unique inverse g⁻¹ ∈ G such that
gg⁻¹ = g⁻¹g = e.
Note that, as with function composition or matrix multiplication but unlike
multiplication of numbers, the group operation need not be commutative, i.e.
we may have gh ≠ hg. Groups for which gh = hg for all g, h ∈ G are called
commutative or Abelian.⁶
Groups can be finite or infinite, discrete or continuous.
So-called Lie groups, like the group of rotations in R^n, are smooth manifolds,
meaning we can do calculus on them (more on this in later chapters).
Section 3.3.3 shows a number of examples of groups. It is worth checking for
each one that it indeed satisfies the axioms of a group.
As mentioned before, we have defined a group as an abstract object, meaning
we have not stated what the group elements are (e.g. symmetry transfor-
mations of some object, as we defined them in section 3.1), only how they
compose. By contrast, a concrete group (sometimes called symmetry group)
would have as elements symmetry transformations (section 3.1) g : X → X,
and the group operation is given by function composition. In the special case of
matrix groups, the group elements are matrices⁷ A ∈ G ⊆ R^{n×n}, and the group
operation is given by matrix multiplication. Starting from a concrete group
of transformations, we obtain an abstract group by simply forgetting that the
elements are functions or matrices, while remembering the composition rule
∘ : G × G → G. As it turns out, knowing how group elements compose is all
you need to study every aspect of the group itself.
In practice, almost all of the groups we consider in this book can be defined
as matrix groups, so let us state the definition explicitly:
An n-dimensional real matrix group is a set of matrices G ⊆ R^{n×n}
satisfying the following axioms:
Closure: For all A, B ∈ G, the matrix product AB is also in G.
Identity: the identity matrix I is in G.
Inverse: All A ∈ G are invertible (full rank) and the matrix inverse
A⁻¹ is also in G.
In a matrix group, the composition operation is given by matrix-matrix mul-
tiplication, which is automatically associative, and so we do not need to add
this axiom. In an abstract group where we explicitly define the composition
rule as a map ∘ : G × G → G, the composition of two elements in G is always in
G, but since the matrix product of two elements of G ⊆ R^{n×n} need not be in G,
we need to add this closure axiom explicitly for matrix groups.
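As a small illustration of these axioms (a NumPy sketch using shear matrices as an arbitrarily chosen example), one can check that the set {[[1, s], [0, 1]] : s ∈ R} is closed under products, contains the identity, and contains all inverses:

```python
# Sketch: the shear matrices {[[1, s], [0, 1]] : s in R} form a matrix group.
import numpy as np

def shear(s):
    return np.array([[1.0, s],
                     [0.0, 1.0]])

A, B = shear(0.8), shear(-2.5)

print(np.allclose(A @ B, shear(0.8 + (-2.5))))    # closure: product is again a shear
print(np.allclose(shear(0.0), np.eye(2)))         # identity: s = 0 gives I
print(np.allclose(np.linalg.inv(A), shear(-0.8))) # inverse: negate the shear parameter
```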
Whereas a concrete group is made up of functions g : X → X that can be
applied to elements x ∈ X, the data specifying an abstract group does not tell us
how to act on geometric data. This ability can be reintroduced by defining one
or more group representations (or group actions), as discussed in section 3.4.
Group representations are useful even when we are dealing with a matrix
group, because oftentimes we want the same group to act on different kinds
of geometric quantities, such as the inputs, outputs and internal features of a
neural network.
3.3.1 Subgroups
We noted in section 3.1 that what can be called a symmetry depends on what
structure we want to preserve. When we add structure, there will be fewer
symmetries; our group is reduced to a subgroup.
Let G be a group. A subgroup H ⊆ G is a subset of G that is itself a
group. Concretely, this means that H is closed under composition and
inverses: when we compose two elements in H the result is again in H,
and the inverse of any element in H is also in H.
For any group, the subset H = {e} is a subgroup, and similarly any group is
a (non-proper) subgroup of itself. For a slightly less trivial example, consider
the group of planar translations (R^2, +), and the subgroup given by translations
along the x-axis (which leave the y-coordinate unchanged).
A classical example comes from the Erlangen program: Klein noted that
different geometries are associated with different groups, which in some cases
form subgroups that preserve progressively more geometric structure. For
instance, the group of Euclidean transformations is a subgroup of the affine
group, which is a subgroup of the projective group. Projective, affine and
Euclidean geometry are concerned with the study of invariants of their respec-
tive groups, such as distances for Euclidean geometry, ratios of surface areas
for affine geometry, and cross-ratios for projective geometry.
3.3.2 *Mappings between Groups: Homomorphisms and Isomorphisms
Whenever we study a new kind of mathematical gadget, it is a good idea to think
about the appropriate notion of structure-preserving mapping between gadgets.
In the case of groups, these are called group homomorphisms. They are maps
between groups that preserve the compositional structure of a group:
Let G and H be groups. A group homomorphism is a mapping ϕ : G → H
such that for all g, h ∈ G:
    ϕ(gh) = ϕ(g)ϕ(h).
When interpreting the equation above, note that on the left hand side we are
composing g and h using the composition operation in G, whereas on the right
hand side we are composing ϕ(g) and ϕ(h) in H. One can show that a group
homomorphism maps the identity of G to the identity of H.
If we have two group homomorphisms ϕ_1 : G → H and ϕ_2 : H → K, then
their composition ϕ_2 ∘ ϕ_1 : G → K is also a homomorphism. Moreover, for any
group G the identity map id_G : G → G is a homomorphism, which shows that
groups and group homomorphisms form a category (appendix E).
One can show that the image of a group homomorphism, im ϕ = {h ∈ H | ∃ g ∈ G : ϕ(g) = h},
is a subgroup of H. Similarly, the kernel ker ϕ = {g ∈ G | ϕ(g) = e_H} is a subgroup of G.
The interested reader may study the first isomorphism theorem, which relates
the image and kernel of a group homomorphism.
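A concrete example (a NumPy sketch with arbitrarily chosen matrices): the determinant is a group homomorphism from the group of invertible matrices under matrix multiplication to the nonzero real numbers under multiplication, since det(AB) = det(A) det(B):

```python
# Sketch: det is a group homomorphism from GL(n, R) (matrix multiplication)
# to the nonzero reals (ordinary multiplication).
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))
B = rng.normal(size=(3, 3))     # random matrices are invertible almost surely

# phi(gh) = phi(g) phi(h):
print(np.isclose(np.linalg.det(A @ B),
                 np.linalg.det(A) * np.linalg.det(B)))   # True

# The identity of GL(n, R) maps to the identity of the multiplicative reals:
print(np.isclose(np.linalg.det(np.eye(3)), 1.0))          # True
```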
Group homomorphisms always preserve positive compositional relations of
the form gh = k, but they need not preserve negative relations of the form gh ≠ k.
As an extreme example, for any groups G and H, the mapping ϕ(g) = e_H that
sends every element of G to the identity of H is a group homomorphism.
When a group homomorphism does preserve distinctness, i.e. g ≠ h ⇒ ϕ(g) ≠ ϕ(h),
it is called an injective group homomorphism or monomorphism. Hence
one can think of an injective group homomorphism as pointing to a G-shaped
subgroup of H, labelling the elements of this subgroup by elements of G
without duplicates.
If a mapping is both injective (mapping each element of the domain to a
different element of the codomain) and surjective (reaching every element of
the codomain) it is bijective, or invertible. A bijective group homomorphism is
called a group isomorphism. Isomorphic groups are for all intents and purposes
the same, because we can translate any statement about one of them to an
equally true statement about the other, and so it is very common to consider
them as the same group.
3.3.3 Examples of Groups
Throughout this book we will encounter various kinds of symmetry groups;
commutative and non-commutative, finite and infinite, discrete and continuous.
In order to build intuition for the general concept of group, and to become
familiar with the groups we will consider later on in this book, we present
in this section an overview of the most important ones. It is not necessary to
understand and become familiar with all of these; it is sufficient for now to
look at a few examples to make the concept of group feel less abstract.
3.3.3.1 Finite Groups When studying a new concept it is helpful to look
at simple examples first. What is the simplest example of a group? Looking at
the axioms of a group, we notice that the identity axiom tells us that any group
has to have an identity element, so we cannot have an empty group with zero
elements. Can we have a one-element group, with just the identity element
e? We define T = {e}, composition by e ∘ e = e (the only possible choice) and
inverse by e⁻¹ = e (idem). By definition we have an identity, and an inverse for
all elements (just e), so the identity and inverse axioms are satisfied. We check
that composition is associative: (ee)e = (e)e = e = e(e) = e(ee), so indeed T is a
group, called the trivial group. As a matrix group, we may represent e as the
1 × 1 identity matrix I_1 = [1], or a higher-dimensional identity matrix, as these
satisfy II = I and I⁻¹ = I just like the abstract group element e. When we have
an object whose symmetry group is trivial, it means that the object does not
have any non-trivial (non-identity) symmetries.
What about a group with two elements? The group should have an identity
e, and let’s call the other one m. By the identity axiom, we should have ee = e,
em = m, and me = m, fixing 3 of the 4 compositions. For the last one, we have to
choose between mm = m or mm = e. However, the first option runs into trouble
because if we multiply both sides of the equation mm = m by m⁻¹ (however
we choose m⁻¹) we find m = e, a contradiction. Thus we must have mm = e and
m⁻¹ = m. We can realize this group as a matrix group with 1 × 1 matrices (i.e.
scalars) by choosing G = {1, −1}. This also reveals the geometric meaning of
this group: it consists of the mirror reflection m and identity e. Like the trivial
group, the two-element group is commutative.
Perhaps the most fundamental class of finite groups are the permutation
groups. Recall that a permutation of an ordered list of n elements is a way to
rearrange (“shuffle”) them, for example swapping two elements. We can represent
a permutation as a bijective (i.e. invertible) function σ : n → n, where
n = {1, …, n}. Alternatively, we can represent a permutation by a permutation
matrix P, which is a matrix with a single 1 in each row and column and all
other entries 0. When acting on a vector x ∈ R^n, this matrix will permute the
coordinates.
The set of all permutations of n elements forms a group called the symmetric
group S_n (not to be confused with the general term symmetry group). To
see that S_n is indeed a group, note that 1) the identity id_n : n → n is a permutation,
2) the composite of two permutations is a permutation (closure),
and 3) a permutation has an inverse and this inverse is itself a permutation.
To specify a permutation σ, we need to choose σ(1) (n possibilities), σ(2)
(n − 1 possibilities), etc., up to σ(n) (only one choice left). Thus, the group S_n
has n · (n − 1) · (n − 2) ⋯ 2 · 1 = n! elements. We can equivalently define S_n as a
concrete group of mappings σ, as a matrix group, or as an abstract group.
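The correspondence between these views can be checked in a few lines (a sketch for n = 3, with arbitrarily chosen permutations): composing permutations as functions agrees with multiplying their permutation matrices:

```python
# Sketch: S_n as functions vs. as permutation matrices.
import numpy as np
from itertools import permutations

n = 3
def perm_matrix(sigma):
    """Matrix P with P @ e_j = e_{sigma(j)}, i.e. P[sigma[j], j] = 1."""
    P = np.zeros((n, n))
    for j, i in enumerate(sigma):
        P[i, j] = 1.0
    return P

sigma = (1, 0, 2)                # swap the first two elements
tau = (2, 0, 1)                  # a cyclic shift

# Composition as functions: (sigma o tau)(j) = sigma(tau(j)).
comp = tuple(sigma[tau[j]] for j in range(n))

# The matrix of the composite equals the product of the matrices.
print(np.allclose(perm_matrix(comp), perm_matrix(sigma) @ perm_matrix(tau)))

# S_n has n! elements.
print(len(list(permutations(range(n)))))   # 6 = 3!
```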
For the interested reader, we mention a few more finite groups, without
going into detail. The rotational symmetries of regular polygons such as
triangles, squares, pentagons and hexagons (and so on) are captured by the
cyclic groups C_n, and their rotation and reflection symmetries by the dihedral
groups D_n. Finite groups of symmetries in three dimensions include the
crystallographic point groups and the groups of symmetries of Platonic solids.
According to Cayley’s theorem⁸, any finite group with n elements can be
viewed as a subgroup of S_n. Furthermore, in a major milestone in mathematics⁹,
the finite simple groups (from which all finite groups can be made) have
been classified; all such groups belong to one of several infinite families, or to
a relatively short list of sporadic groups.
3.3.3.2 Infinite Discrete Groups Groups need not be finite. Consider for
instance the set of integers Z = {…, −2, −1, 0, 1, 2, …}. With composition given
by addition and 0 as the identity, this set forms a group. Geometrically, we
can think of this group as a group of discrete translations in one dimension.
Similarly, we can think of Z^2 = Z × Z as a group of discrete translations in two
dimensions. In chapter 7 we will consider these groups as the symmetries of
regular grids (e.g. a grid of image pixels).
In addition to discrete translational symmetries, some regular grids / lat-
tices also have discrete rotational and reflection symmetry, such as four-fold
rotational symmetry of a square grid and six-fold rotational symmetry of a
hexagonal grid. We will consider such groups in chapter 8 when we study
group convolution. A related class of groups are the 17 wallpaper groups,
which describe the possible symmetries of wallpaper patterns (Schattschnei-
der 1978). Their 3D analogues are called (crystallographic) space groups, and
these have been classified into 230 distinct groups (Aroyo 2017).
3.3.3.3 The Classical Matrix Groups The classical groups are an
important class of groups that occur throughout mathematics, physics, and
computer science, and several of them will play an important role in this
book. They were so named by Weyl (1939) in his famous book “The Classical
Groups” on invariant polynomials, representation theory, and the Erlangen
programme. Most generally defined, the classical groups are groups of invertible
matrices with real, complex or quaternionic¹⁰ entries, possibly subject to
additional constraints (e.g., volume preservation det(A) = 1 or preservation of
certain kinds of generalized inner product). Here we will only consider certain
important examples of real and complex classical groups; many more examples
and details may be found e.g. in Goodman and Wallach (2009).
The quintessential matrix groups are the general linear groups GL(n, R)
and GL(n, C) of all n ×n real or complex invertible matrices. These form
groups, because they include the identity matrix, inverses for each element,
and because the product of two invertible matrices is again invertible. They can
be thought of as the symmetry groups of the vector spaces R^n and C^n, being
the invertible transformations that preserve their vector space structure.¹¹ The
classical groups (and indeed all matrix groups) are subgroups of the general
linear group, which is the “largest” of the matrix groups. In this sense, the
general linear group plays the same role for matrix groups as the symmetric
group plays for finite groups.
From GL(n, R) we can obtain subgroups that preserve not just the linear
structure of R^n, but also lengths, angles, orientation, and volume. The orthogonal
group O(n) contains only those matrices that preserve the standard inner
product x^T y. That is, we should have (Ax)^T (Ay) = x^T y for all x, y ∈ R^n, or
equivalently, A^T A = I:
    O(n) = {A ∈ GL(n, R) : A^T A = I}.                    (3.11)
Thus, this group consists of rotations and reflections.
Geometrically, the inner product represents the angle between two vectors¹²,
so orthogonal matrices preserve angles. The determinant of a matrix A tells us
how much it expands volume, with negative values indicating a reversal of orientation.
Elements of the orthogonal group have det(A) = ±1, i.e., they preserve
volume but not necessarily orientation (reflections have determinant −1).

Group        GL(n, R)      SL(n, R)       O(n)                 SO(n)
Constraints  det(A) ≠ 0    det(A) = 1     A^T A = I            A^T A = I, det(A) = 1
Dimension    n^2           n^2 − 1        n(n−1)/2             n(n−1)/2
Mapping      (figure)      (figure)       (figure)             (figure)
Invariants   −             Volume,        Volume, Distance,    Volume, Orientation,
                           Orientation    Angle                Distance, Angle

Table 3.1
Classical real matrix groups and their properties. Each group is defined as the set of
n × n real matrices satisfying a constraint (row 1). An equality constraint may eliminate
degrees of freedom, leaving only a certain number of dimensions (row 2). Note
that O(n) and SO(n) have the same dimensionality, because orthogonal matrices have
det(A) = ±1, so the additional det(A) = 1 constraint for SO(n) eliminates only a discrete
degree of freedom. In the mapping row we show for each group how a generic element
acts on the standard basis e_1 = (1, 0) (blue) and e_2 = (0, 1) (red) for n = 2 dimensions.
Finally, we list the geometric properties left invariant by each group. All matrix groups
leave invariant the vector space structure, meaning the origin 0 and all linear relations
z = αx + βy, and additionally the properties listed in the last row.
The special linear group SL(n, R) consists of volume-preserving linear
maps:
    SL(n, R) = {A ∈ GL(n, R) : det(A) = 1}.               (3.12)
It does not necessarily preserve angles. Similarly, the special orthogonal group
consists of volume- and inner-product-preserving linear maps:
    SO(n) = {A ∈ GL(n, R) : A^T A = I, det(A) = 1}.       (3.13)
In table 3.1 we show these groups and some of their properties.
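As a quick numerical illustration of the properties in table 3.1 (a sketch using an orthogonal matrix obtained from the QR decomposition of a random matrix):

```python
# Sketch: properties of a random orthogonal matrix.
import numpy as np

rng = np.random.default_rng(0)
n = 4
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))    # Q is orthogonal: Q^T Q = I

print(np.allclose(Q.T @ Q, np.eye(n)))          # defining constraint of O(n)
print(np.isclose(abs(np.linalg.det(Q)), 1.0))   # det = +1 or -1

# Orthogonal maps preserve inner products (hence distances and angles).
x, y = rng.normal(size=n), rng.normal(size=n)
print(np.isclose((Q @ x) @ (Q @ y), x @ y))     # True
```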
The complex analogs of SL(n, R), O(n) and SO(n) are called the complex
special linear group SL(n, C), unitary group U(n) and special unitary group
SU(n). While in our book we will focus primarily on real matrix groups, we
should mention here that the unitary groups play an important role in physics:
U(1) ×SU(2) is the “electroweak symmetry” associated with electromagnetic
and weak forces, whereas SU(3) appears in quantum chromodynamics, which
describes strong interactions between quarks.
3.3.3.4 Motion groups Further examples of groups that are particularly
important in applications are the Euclidean motion groups. Euclidean motions
can be defined as isometries (distance-preserving maps) of Euclidean space
(e.g. R^n with the standard inner product). This includes translations, rotations,
and reflections, and indeed one can show that any Euclidean motion
is a sequence of these. These form the Euclidean group, denoted E(n). When
reflections are excluded, it is called the special Euclidean group SE(n).
It is not possible to perform a translation x ↦ x + t for x, t ∈ R^n by multiplying
x by an n × n matrix, because A0 = 0 for any matrix A. For this
reason it is common to add a homogeneous coordinate, representing x by
x̄ = (x_1, …, x_n, 1) ∈ R^{n+1}. A Euclidean transformation can then be represented
by a matrix of shape (n + 1) × (n + 1) of the following form:
    [ R  t ]
    [ 0  1 ],                                             (3.14)
where R ∈ O(n) is an n × n orthogonal matrix (rotation and reflection) and
t ∈ R^n is a translation vector. One may verify that multiplying such a matrix by
x̄ results in the required transformation x ↦ Rx + t. Furthermore, the composition
operation of E(n) is given by matrix multiplication, and so it can be
defined as a matrix group.
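A minimal sketch of this construction for n = 2 (with arbitrarily chosen angles and translations) shows that the homogeneous-coordinate matrix realizes x ↦ Rx + t and that composition is just matrix multiplication:

```python
# Sketch: SE(2) via 3x3 homogeneous-coordinate matrices.
import numpy as np

def se2(theta, t):
    """Homogeneous matrix [[R, t], [0, 1]] for a rotation R(theta) and translation t."""
    c, s = np.cos(theta), np.sin(theta)
    M = np.eye(3)
    M[:2, :2] = [[c, -s], [s, c]]
    M[:2, 2] = t
    return M

g1 = se2(0.5, np.array([1.0, 2.0]))
g2 = se2(-1.1, np.array([0.3, -0.7]))

x = np.array([2.0, 1.0])
x_bar = np.append(x, 1.0)                 # homogeneous coordinates (x1, x2, 1)

# Applying g1 to x_bar gives R x + t in the first two components.
R, t = g1[:2, :2], g1[:2, 2]
print(np.allclose((g1 @ x_bar)[:2], R @ x + t))             # True

# Applying g2 then g1 equals applying the single composite matrix g1 @ g2.
print(np.allclose(g1 @ (g2 @ x_bar), (g1 @ g2) @ x_bar))    # True
```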
By introducing uniform scaling transformations (replacing R by sR in
eq. (3.14)), we obtain the group of similarity transformations Sim(n). The
group of affine transformations contains in addition to rotations, transla-
tions, and reflections, also scale and shear transformations. Using the same
homogeneous coordinate representation, we can conveniently encode affine
transformations using the same form as eq. (3.14), but taking R ∈ GL(n, R) to
be an arbitrary invertible matrix rather than an orthogonal one.
Projective transformations can also be encoded conveniently using homo-
geneous coordinates. An n-dimensional projective transformation may be
encoded as a non-singular (n + 1) × (n + 1) matrix, and composition is given
by matrix multiplication. In this representation, two matrices related by a
scalar multiplication A = λB are to be considered equal. To apply a projective
transformation to a vector x, one uses matrix-vector multiplication on the
homogeneous coordinate vector x̄, followed by a division by the homogeneous
coordinate (which is no longer equal to 1 in general). Projective transformations
are a mainstay of classical computer vision (Hartley and Zisserman 2004)
and graphics. Table 3.2 lists several motion groups and their properties.
Group        PGL(n + 1)        Aff(n)                (S)Sim(n)                  (S)E(n)
Matrix       H                 [A t; 0 1]            [sQ t; 0 1]                [Q t; 0 1]
Domain       H ∈ GL(n + 1)     A ∈ GL(n), t ∈ R^n    Q ∈ (S)O(n), t ∈ R^n,      Q ∈ (S)O(n), t ∈ R^n
                                                     s ∈ R^+
Dimension    (n + 1)^2 − 1     n^2 + n               n(n−1)/2 + n + 1           n(n−1)/2 + n
Mapping      (figure)          (figure)              (figure)                   (figure)
Invariants   Colinearity,      Parallelism,          Volume Ratio, Distance     Volume, Distance,
             Cross-ratio       Volume Ratio          Ratio, Angle               Angle

Table 3.2
Motion groups and their properties. Motions in n dimensions are represented by
(n + 1) × (n + 1)-dimensional matrices using homogeneous coordinates (see text). For each group
we show the form of the matrix, the domain of the sub-matrices, and the dimension of
the group. Note that for the projective group PGL(n + 1) (corresponding to projective
transformations of n-dimensional space), any matrix H ∈ GL(n + 1) represents a valid
projective transformation, but matrices related by a scalar multiple should be identified
and thus the dimension is reduced by 1. The mapping row shows how a square grid
could be mapped by an element in the respective groups, and the invariants row lists
several invariants of the group.
3.4 Group Actions And Representations
In geometry in general, and geometric deep learning in particular, one often
wishes to perform computations involving different kinds of geometric quantities,
such as scalars, vectors, and tensors, as well as fields of such quantities.¹³
The key thing that distinguishes and in fact characterizes these and other types
of geometric quantities is not their dimensionality, but rather the way in which
they transform under symmetry transformations. The idea that the type of a
geometric quantity is determined by the way in which the group of symmetries
acts on it was championed by Weyl (1939) and is widely used in physics.¹⁴
As an example, consider the special orthogonal group SO(2) consisting of
planar rotations around the origin. It is intuitively obvious that we can apply
the same rotation g ∈ SO(2) to a point on the circle S^1, on the plane R^2, or to
a planar image (a function on the plane). We are thus dealing with one group
acting on several different spaces. Moreover, the group could act in different
ways on the same space, for instance rotating clockwise or counterclockwise.
In geometric deep learning, all feature spaces are associated with a group
representation, which tells us how to transform data living in that feature space.
This includes the input space, output space, and all intermediate feature spaces.
There are two equivalent ways to give precise meaning to the idea of having
a group G act on a set X. The first approach is to define a group action
as a map α : G × X → X satisfying certain criteria. By currying we can obtain
from α another map ρ : G → [X, X] (again satisfying certain criteria) called a
group representation, where [X, X] is the set of mappings from X to itself.
The reason for the name is that each abstract group element g is represented in
the space X by a concrete mapping ρ(g) : X → X. It should be noted that the
term “representation” has traditionally been used mostly for the important case
where ρ(g) is a linear map acting on a vector space X. We will use the term
representation for the general concept, and where necessary make a distinction
between linear or Vect-representations and general or Set-representations.
Furthermore, since group actions and representations are equivalent (up to
currying), we will also permit ourselves to use these terms interchangeably.
Let us now look at the aforementioned “certain criteria” that a group action
must satisfy. Without further restrictions, a map of the type α : G × X → X
could do rather strange things. For example, the identity element e ∈ G might
not act as the identity on X, i.e., one could have α(e, x) ≠ x. Also, if we act
first with h and then with g, the result might be different from acting with gh
(composing g and h in G). Thus, we want a group action to respect the structure
of the group, which consists of a designated identity element and the rule of
composition:

Let G be a group and let X be a set. A group action of G on X is a
mapping α : G × X → X satisfying:
    α(e, x) = x                                  (Identity)
    α(gh, x) = α(g, α(h, x)),                    (Composition)
for all g, h ∈ G, x ∈ X, and where e ∈ G is the identity.
When there is little chance of confusion, group actions can also be denoted
as g ·x = α(g, x).
If we have a group action α, we can obtain a group representation ρ by
defining ρ(g) = α(g, –). We can also define a representation without reference
to an action, but again not all assignments will do; ρ must satisfy two criteria
equivalent to the ones for group actions:
Let G be a group, let X be a set, and denote by [X, X] the set of
mappings from X to X. A (Set-)representation of G on X is a mapping
ρ : G → [X, X] that assigns to each g ∈ G a mapping ρ(g) : X → X
satisfying:
    ρ(e) = id_X                                  (Identity)
    ρ(g) ∘ ρ(h) = ρ(gh),                         (Composition)
for all g, h ∈ G, and where e ∈ G is the identity.
The first axiom (identity) says that the identity e ∈ G should be represented
by the identity mapping on X, i.e. it should do nothing¹⁵. The second axiom
(composition) ensures that ρ preserves all positive relations of the form gh = k
that hold in the group¹⁶. This includes in particular relations such as gg⁻¹ = e =
g⁻¹g that express that g⁻¹ is the inverse of g, which implies that ρ(g) must be
an invertible map for any g ∈ G.
Negative relations, of the form gh ≠ k, may not be preserved, however. For
instance, for any group we can define the trivial representation¹⁷, which maps
all elements g to the identity map on X. One may check that this is indeed a
valid group action. An action is called faithful when ρ is injective (meaning
that it is difference-preserving: g ≠ h ⇒ ρ(g) ≠ ρ(h)). In this case, all negative
relations are also preserved.
In many cases, G and X have additional structure, such as topological or
smooth (Lie group) structure, in which case one has to adapt the definition
in order to make the math work nicely. For instance, by requiring ρ to be a
continuous or smooth map, and taking [X, X] to be the space of continuous /
smooth maps from X to itself.¹⁸
In GDL, data usually live in a vector space, and most of the representations
we encounter in practice are linear, meaning that the maps ρ(g) : X → X are
all linear transformations of a vector space X. As we will see in later chapters,
this is true even though the neural networks we work with are highly non-
linear. Such linear representations in vector spaces are defined analogously to
ordinary representations in sets:
Let G be a group, let X be a vector space, and denote by [X, X] the
set of linear mappings from X to X. A linear representation (or Vect-representation)
of G on X is a mapping ρ : G → [X, X] that assigns to
each g ∈ G a linear mapping ρ(g) : X → X satisfying:
    ρ(e) = id_X                                  (Identity)
    ρ(g) ∘ ρ(h) = ρ(gh),                         (Composition)
for all g, h ∈ G, and where e ∈ G is the identity.
We note that one can equivalently define a (linear) group representation as a
group homomorphism (section 3.3.2) from G to the group of invertible (linear)
transformations of X.
In practice we work in a basis of X relative to which the linear maps
ρ(g) : X → X can be represented by matrices¹⁹. Hence, we can also define
ρ as a matrix-valued function. Then, composition corresponds to matrix
multiplication, and the application ρ(g)(x) is matrix-vector multiplication.
Linear representations are useful because they turn abstract group theory
into linear algebra, which is well understood and for which we have highly
efficient specialized hardware. Even when working with a group of matrices,
it is useful to consider linear representations, and one should be careful to dis-
tinguish the concepts of matrix groups and linear representations. For instance,
one could consider the group SO(2) as a matrix group consisting of 2 × 2 rotation
matrices, and have it act via n × n matrices ρ(g) on the space of signals
on the circle, sampled at n points. Similarly, we can consider the group of
d × d permutation matrices, acting on the space of d × d adjacency matrices
by d^2 × d^2 matrices ρ(g). We will cover these examples in more detail later.
We have formally defined a (linear) representation as a map ρ : G → [X, X],
but it is common to refer to the representation space X as the representation
when it is clear which ρ is associated with X. Nevertheless it is important to
note that according to Weyl’s principle, the type of geometrical quantity is
determined by the representation, not by the space alone. Consider for example
the group SO(2) of 2 × 2 rotation matrices. An SO(2)-vector is a quantity x ∈ R^2
that transforms according to the standard (vector) representation ρ_1(g) = g.
On the other hand, a quantity in the same space R^2 is called a pair of SO(2)-scalars
if it transforms according to the trivial representation ρ_0(g) = I_2 (where
I_2 is the 2 × 2 identity matrix). Thus the representation determines the type
of geometric quantity, and as we will see, a mapping between two types of
geometric quantities / representations is called an equivariant map.
3.4.1 Examples of Representations
It is high time to consider some examples of group representations and group
actions. Most groups considered in this book are matrix groups, and for these
groups there is a natural choice of representation often called the standard
representation, where each group element (matrix) A ∈ G is represented by
itself: ρ(A) = A. This definition trivially satisfies the identity and composition
requirements to qualify as a representation.
As an example, consider the matrix group SO(2) = {R ∈ R^{2×2} | RR^T = I, det(R) = 1}
of 2 × 2 rotation matrices. Clearly, each rotation matrix R ∈ SO(2)
defines a linear mapping R^2 → R^2 given by matrix-vector multiplication:
x ↦ Rx; this is the standard representation ρ(R) = R of SO(2). Note that
we can equivalently define SO(2) as the set of rotation angles [0, 2π) with composition
given by addition modulo 2π. In that case we cannot define ρ(θ) = θ
because θ ∈ [0, 2π) is an angle and not a matrix, but we can still define an
equivalent representation:
    ρ(θ) = [ cos(θ)  −sin(θ) ]
           [ sin(θ)   cos(θ) ]                            (3.15)
Using the rules of trigonometry one can show that matrix multiplication then
mimics angle addition: ρ(θ)ρ(θ′) = ρ(θ + θ′), and that ρ(0) = I.
Similarly, the standard representation of the permutation group S_n maps each
abstract permutation to a matrix that permutes the coefficients of the vector on
which it acts. For instance, if π_12 is the permutation that swaps the first two
elements of a length-3 sequence, the corresponding permutation matrix is:
    ρ(π_12) = [ 0 1 0 ]
              [ 1 0 0 ]
              [ 0 0 1 ]                                   (3.16)
This representation is useful in GDL applications involving sets and graphs
(chapter 5), where each of the n nodes/elements is equipped with a feature
x_i ∈ R that we wish to permute. Indeed we find that (ρ(π)x)_i = x_{π(i)}. If instead
of one scalar per element we have a d-dimensional vector x_i ∈ R^d we can stack
them into a matrix X ∈ R^{n×d} and act on it with the same matrix: X ↦ ρ(π)X.
Equivalently, we can stack them into one column vector x ∈ R^{nd} in which case
we use an equivalent representation matrix with d identical blocks:
    ρ_d(π) = [ ρ(π)                  ]
             [        ρ(π)           ]
             [               ⋱       ]
             [                  ρ(π) ]                    (3.17)
The permutation group S_n can also act on matrices of size n × n, by simultaneously
permuting the rows and columns. If ρ(π) is the standard matrix
representation of S_n defined above, we get a representation on matrices via the
tensor representation (section 3.6.6): (ρ ⊗ ρ)(π)A = ρ(π)Aρ(π)^T. This representation
is useful for instance for describing the transformation behaviour of
the adjacency matrix or edge-feature matrix of a graph.
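The following sketch (with arbitrarily chosen data; the stacking order of the nd-dimensional vector is an assumption) builds these representations explicitly and checks that they act as described:

```python
# Sketch: standard, block-diagonal and tensor representations of S_n.
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 3

perm = rng.permutation(n)
P = np.eye(n)[perm]                      # standard representation rho(pi)
# (P @ x)[i] = x[perm[i]], matching (rho(pi) x)_i = x_{pi(i)} above.

# Node features: one d-dimensional vector per element.
X = rng.normal(size=(n, d))

# Acting on the matrix X vs. acting on the channel-stacked vector with
# the block-diagonal representation rho_d(pi) (d identical blocks).
rho_d = np.kron(np.eye(d), P)            # block-diag(P, ..., P), d copies
x_stacked = X.T.reshape(-1)              # stack the d feature channels
print(np.allclose(rho_d @ x_stacked, (P @ X).T.reshape(-1)))   # True

# Tensor representation on an n x n (e.g. adjacency) matrix: A -> P A P^T.
A = rng.normal(size=(n, n))
rho_tensor = np.kron(P, P)               # (rho ⊗ rho)(pi) as an n^2 x n^2 matrix
print(np.allclose(rho_tensor @ A.reshape(-1), (P @ A @ P.T).reshape(-1)))
```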
Another representation of the permutation group is the sign (or parity)
representation. Any permutation π can be decomposed into a sequence of
transpositions (swaps of a pair of elements), and although this decomposition
is not unique, the parity of the number of transpositions N(π) is always the same. The sign
representation is given by ρ(π) = (−1)^{N(π)}, i.e. it is 1 for an even number of
transpositions, and −1 for an odd number of transpositions.
For any group G we can define the trivial representation ρ(g) = 1, which
maps every element to the 1 ×1 identity matrix. One may verify that this
is indeed a representation. In geometric deep learning, we use the trivial
representation for features that should be invariant.
The matrix determinant has the property that det I = 1 and det AB =
det A det B for any square matrices A and B of the same size. Hence, it defines
a representation of the general linear group and any subgroup thereof (and
hence, of any matrix group). For example, for an element of the orthogonal
group A ∈ O(n), the determinant tells us whether A flips the handedness
(det A = −1) or leaves it unchanged (det A = 1). Flipping handedness twice is
the same as not changing it, and indeed (−1) · (−1) = 1.
The examples above are all linear representations, so let us consider a Set-
representation (group action). For any group G, the composition ∘ : G × G → G
defines an action of G on the set X = G, i.e. on itself. Written as a representation,
this sends g ∈ G to the mapping ρ(g) : G → G defined by ρ(g) = g ∘ –, i.e.
ρ(g)(h) = g ∘ h.
A final example that will play an important role in the next chapter and the
rest of the book is the action of a group on a space of functions. If we have
a group G acting on any space Ω, we can naturally define an action of G on
the space of real-valued functions f : Ω → R. Denoting the action of G on Ω
by g · u, this is defined as: (ρ(g)f)(u) = f(g⁻¹ · u). For instance, if G is the group
of rotations and translations of the plane Ω = R^2, then ρ is the way in which G
acts on the space of grayscale images on R^2.
3.5 Equivariant Maps & Equivariant Nets
Having defined groups and representations, we are finally ready to give a
proper mathematical definition of equivariance:
Let f : X → Y be a mapping of sets (or vector spaces), G a group,
and let ρ_1 : G → [X, X] and ρ_2 : G → [Y, Y] be two (linear) representations
of G acting on the domain X and codomain Y of f. Then f is
G-equivariant if for all g ∈ G:
    f ∘ ρ_1(g) = ρ_2(g) ∘ f.                              (3.18)
An equivariant (linear) map can also be called a mapping or morphism
of representations, written f : (X, ρ_1) → (Y, ρ_2).
Unlike our preliminary discussion of equivariance in section 3.2.2, this def-
inition makes it clear that the equivariance constraint should be satisfied for
every element in a group, and that the ρ_i are not arbitrary maps but should satisfy
the requirements for being a representation (section 3.4).
If we have two equivariant maps then their composite is also equivariant:
    f_2 ∘ f_1 ∘ ρ_1(g) = f_2 ∘ ρ_2(g) ∘ f_1 = ρ_3(g) ∘ f_2 ∘ f_1          (3.19)
As always in mathematics and programming, we can only compose functions if
the codomain (output space) of the first map matches the domain (input space)
of the second. For example, we can compose f_1 : X_1 → X_2 and f_2 : X_2 → X_3 to
obtain f_2 ∘ f_1 : X_1 → X_3 (read: f_2 after f_1), but we cannot form f_1 ∘ f_2 because f_1
expects input of type X_1 but f_2 returns output of type X_3 instead.
It is very important to note that when composing equivariant maps, we must
ensure that not only the domain/codomain match as sets or vector spaces, but
also as representations, or else the composite will not be equivariant. In other
words, we should view an equivariant map f_1 not as a mapping of sets but of
representations. So in the example above, f_1 maps from representation (X_1, ρ_1)
to (X_2, ρ_2), and f_2 maps from (X_2, ρ_2) to (X_3, ρ_3). Since the output representation
of f_1 matches the input representation of f_2, we can compose them and the
composite f_2 ∘ f_1 will be an equivariant map from (X_1, ρ_1) to (X_3, ρ_3).
Following this rule, we can create arbitrary computation graphs from
equivariant maps, i.e. create equivariant networks:
Definition 1 (Equivariant Network) Let G be a group. A G-equivariant net-
work is a computation graph (DAG), where every node is associated with a
vector space X_i and representation ρ_i of G in X_i, and all mappings are equivariant.
That is, every node is computed as a function of its parents in the graph
by a (possibly parameterized) mapping that is equivariant with respect to the
representations acting on the input and output space of the map.
Thus, when designing an equivariant network, we not only have to choose
the feature spaces X_i (e.g. how many dimensions each has), but also how we want
them to transform²⁰ (except for the input and output space, where the represen-
tation is usually determined by the problem). Every mapping in the network
should be equivariant, including non-linearities, normalization layers, learn-
able linear layers, etc., because even a single non-equivariant layer breaks the
equivariance of the network as a whole.
One can create an invariant network simply by making the last layer invari-
ant, i.e. choosing the representation of the output space to be the trivial
one. For example in a classical convolutional network, one could choose
a global average pooling layer (averaging over space), which produces a
translation-invariant output
21
.
Mappings in a general computation graph can have multiple inputs and pro-
duce multiple outputs. In this case a simultaneous transformation g applied to
all inputs using their representation should lead to a corresponding transfor-
mation of the outputs. Such multi-variable mappings can also be considered as
mappings with a single input and output that is the concatenation of in/output
vector spaces. In an equivariant network each of those vector spaces has a rep-
resentation associated with it, and their concatenation will transform according
to a block-diagonal representation obtained by stacking representations along
the diagonal (see section 3.6.1).
The feature spaces of equivariant networks usually consist of a number n_λ of copies (called the multiplicity) of a few (typically low-dimensional) representations ρ_λ acting on R^{d_λ}. As discussed in section 3.4, each representation ρ_λ corresponds to a type of geometric feature (e.g. scalar, vector, tensor). Thus, a geometric feature space consists of a number of copies of geometric features of various types. This means that the channels / dimensions of a geometric feature vector are grouped at two levels: each dimension belongs to a particular geometric feature vector of dimension d_λ (e.g. a 3D vector), and all n_λ geometric feature vectors of this type belong to a single isotypic subspace of dimension n_λ d_λ. For example, we could have two geometric feature vectors v_{1,1}, v_{1,2} of type ρ_1 (forming an isotypic subspace of dimension 2d_1), and one feature vector v_{2,1} of type ρ_2:
concat(v_{1,1}, v_{1,2}, v_{2,1}) ↦ block-diag(ρ_1(g), ρ_1(g), ρ_2(g)) concat(v_{1,1}, v_{1,2}, v_{2,1}).   (3.20)
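As a concrete instance of eq. (3.20), the following sketch (assuming NumPy/SciPy; the choice of SO(2) vector features plus an invariant scalar is illustrative, not from the text) stacks two features of one type and one of another and applies the corresponding block-diagonal representation.

```python
import numpy as np
from scipy.linalg import block_diag

def rot2(theta):
    """Standard representation of SO(2) on R^2."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

# Two features of type rho_1 (2D vectors) and one of type rho_2 (invariant scalar).
v11, v12, v21 = np.array([1.0, 0.0]), np.array([0.0, 2.0]), np.array([3.0])
x = np.concatenate([v11, v12, v21])          # stacked feature vector in R^5

theta = 0.7
rho = block_diag(rot2(theta), rot2(theta), np.eye(1))  # block-diag(rho_1, rho_1, rho_2)

# Acting on the stacked vector is the same as acting on each feature separately.
assert np.allclose(rho @ x,
                   np.concatenate([rot2(theta) @ v11, rot2(theta) @ v12, v21]))
```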
In the next chapter we will consider feature spaces consisting of signals on a domain, in which case the geometric quantities are not just stacked “vertically” as we have done here, but also “horizontally”. For example, a point
cloud (chapter 6) consists of 3D point features (the standard representation of
SE(3)) stacked horizontally, and we have a permutation group acting along the
horizontal axis by permuting points. A planar vector field (chapter 8) consists
of 2D vectors transforming according to the standard representation of SO(2),
and additionally the group SE(2) acts by moving vectors to a new position on
the plane as well as rotating them.
In addition to geometric features, a learnable mapping takes parameters as
input. Parameters should in general be considered as invariant quantities (trans-
forming according to the trivial representation), and so don’t require special
treatment and don’t impose any constraints on the layer.
3.6 The Equivariance Cookbook
We have seen how equivariant networks can be constructed from equivariant
building blocks, but have not discussed the building blocks themselves. Here
we consider a number of common neural network layers and their equivariant
counterparts.
3.6.1 Concatenation, Direct Sums, and Geometric Feature Channels
A common operation in neural networks is to concatenate a number of vectors x_i ∈ R^{d_i}, usually denoted x = concat(x_1, …, x_n). When the vector spaces R^{d_i} are associated with representations ρ_i of the same group G, the concatenated vector x transforms according to a block-diagonal representation:
ρ(g) = block-diag(ρ_1(g), …, ρ_n(g)).   (3.21)
In other words, concat(ρ_1(g)x_1, …, ρ_n(g)x_n) = ρ(g) concat(x_1, …, x_n). For each g ∈ G, the matrix ρ(g) has blocks of size d_i × d_i on the diagonal, and zeros elsewhere.
In more abstract and basis-free language, concatenating vectors corresponds to the direct sum of vector spaces, and block-diagonal stacking of representations corresponds to a direct sum of representations. We will not go into this here, but note that these operations are denoted by ⊕_i R^{d_i} ≅ R^{Σ_i d_i} and ρ = ⊕_i ρ_i, respectively.
One example where direct sums appear in applications is when we deal with point cloud data. Here our data consists of a number of points x_i ∈ R^3, which we can stack into one big vector x = concat(x_1, …, x_n). Each point is associated with a representation of the rotation group G = SO(3), namely the standard representation (section 3.4.1) ρ_i(R)x_i = Rx_i (where R is a rotation matrix). Thus, the whole point cloud transforms as ρ(R) = block-diag(ρ_1(R), …, ρ_n(R)), i.e. a
matrix with n copies of R along the diagonal. If we want to add some rotation-
invariant features, such as a timestamp, we can do so by adding a number of
copies of the trivial representation ρ_0(g) = 1 along the diagonal.
Instead of stacking the x_i into a flat vector, we could also stack the vectors into a matrix X ∈ R^{3×n}, in which case the action is given by X ↦ RX. For consistency, however, it is mathematically more convenient to always work with the matrix-vector form, even if in our implementation we may choose a differently shaped array for our data.
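The equivalence between the matrix form X ↦ RX and the flattened block-diagonal (matrix-vector) form can be checked numerically; the sketch below is illustrative, assuming NumPy and a random rotation.

```python
import numpy as np

rng = np.random.default_rng(0)

# A random rotation matrix R in SO(3), obtained via QR decomposition.
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
R = Q if np.linalg.det(Q) > 0 else -Q

n = 4
X = rng.normal(size=(3, n))                 # points as columns of a 3 x n matrix
x = X.T.reshape(-1)                         # flattened vector concat(x_1, ..., x_n)

# block-diag(R, ..., R) acting on the flattened vector = Kronecker product I_n ⊗ R.
rho = np.kron(np.eye(n), R)

assert np.allclose(rho @ x, (R @ X).T.reshape(-1))
```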
Concatenation can be performed explicitly as a computation step in the network, but can also be used to describe feature spaces consisting of multiple geometric quantities mathematically. In conventional neural networks, feature spaces can be described as a concatenation of scalar features, or, in the case of convolutional networks, of channels. In equivariant neural networks, one typically considers feature spaces consisting of multiple kinds of geometric features (representations) of various types, each occurring with some multiplicity. For instance, an SO(3) rotation equivariant network might use a certain number of scalar features, 3D vectors, and 3×3 tensors as features. If the i-th feature type has dimension d_i and we have c_i copies of it, the total dimension of the representation space would be d = Σ_i c_i d_i.
3.6.2 Vector Addition, Scalar Multiplication & Linear Combinations
Another common operation is vector addition: z = x + y ∈ R^d. This operation
is equivariant, provided that x and y not only have the same dimension d but
also transform according to the same representation ρ. In that case we have:
ρ(g)x + ρ(g)y = ρ(g)(x + y), (3.22)
that is, vector addition is equivariant.
Similarly, multiplying by an invariant scalar (i.e. either a constant or an
input-dependent but group invariant quantity) is also equivariant. The reason
is that scalar multiplication commutes with matrix multiplication: ρ(g)αx =
αρ(g)x. Combining vector addition and scalar multiplication, we can take
linear combinations of vectors, and this is equivariant as well when the rep-
resentations are the same and scalars are invariant: z = Σ_i α_i x_i, for scalars α_i ∈ R.
3.6.3 Change of Basis
If we have a d-dimensional matrix representation ρ and an arbitrary non-singular (invertible) matrix A ∈ R^{d×d}, we can create a new representation by changing the basis:
ρ′(g) = Aρ(g)A^{-1}.
We verify that ρ′ is indeed a representation:
ρ′(g)ρ′(h) = Aρ(g)A^{-1}Aρ(h)A^{-1} = Aρ(g)ρ(h)A^{-1} = Aρ(gh)A^{-1} = ρ′(gh).
Two representations ρ and ρ′ related in this way are called equivalent or isomorphic representations.
When ρ and ρ′ are equivalent by a change of basis A, then A is an equivariant linear map between them:
Aρ(g) = Aρ(g)A^{-1}A = ρ′(g)A.   (3.23)
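A quick numerical check of eq. (3.23); the cyclic-shift representation and the random basis A below are illustrative choices, not from the text.

```python
import numpy as np

rng = np.random.default_rng(1)

def rho(k, d=4):
    """Permutation representation of the cyclic group: shift coordinates by k."""
    return np.roll(np.eye(d), shift=k, axis=0)

A = rng.normal(size=(4, 4))                  # a random invertible change of basis
A_inv = np.linalg.inv(A)
rho_prime = lambda k: A @ rho(k) @ A_inv     # the equivalent representation

for k in range(4):
    # rho' is again a homomorphism ...
    assert np.allclose(rho_prime(k) @ rho_prime(1), rho_prime((k + 1) % 4))
    # ... and A intertwines rho and rho': A rho(g) = rho'(g) A.
    assert np.allclose(A @ rho(k), rho_prime(k) @ A)
```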
3.6.4 Permutation Representations and Pointwise Nonlinearities
A typical neural network consists mostly of learnable linear maps and fixed
elementwise nonlinearities. In an equivariant network, all maps must be equiv-
ariant, including nonlinearities. Whereas the equivariance constraint applied
to learnable linear layers simply leads to a reduction in the number of parame-
ters, nonlinearities generally don’t have any parameters and so we must choose
them to be equivariant. Equivariance is expressed relative to group represen-
tations acting on the input and output space of the nonlinear map, and so
the representations and nonlinearity must be chosen together. In this section
and several that follow, we consider various kinds of nonlinearities and the
representations to which they are equivariant.
Consider a discrete group G, a feature space X = R^d, and a representation ρ of G acting on it. If ρ is a permutation representation, meaning that ρ(g) is a permutation matrix for all g ∈ G, then any pointwise nonlinearity is equivariant with respect to ρ acting on the input and output space^22. This is because a pointwise nonlinearity acts the same way on each coordinate, and a permutation simply shuffles the coordinates, so it does not matter if we shuffle before or after applying the nonlinearity.
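This commutation is easy to verify directly; the following minimal NumPy sketch (illustrative, not from the text) checks that ReLU commutes with an arbitrary permutation matrix.

```python
import numpy as np

rng = np.random.default_rng(2)

d = 6
perm = rng.permutation(d)
P = np.eye(d)[perm]                 # a permutation matrix: (P x)_i = x_{perm[i]}

relu = lambda x: np.maximum(x, 0.0)
x = rng.normal(size=d)

# Shuffling before or after the pointwise nonlinearity gives the same result.
assert np.allclose(relu(P @ x), P @ relu(x))
```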
In graph and set neural networks (chapter 5), we have a permutation group
acting by permutation matrices, and so we can freely use arbitrary point-
wise nonlinearities. Similarly, discrete translations of grids (chapter 7) such
as images and audio signals are an example of permutation representations,
because a discrete shift simply moves each pixel to another position (i.e.
dimension) without changing the numerical values. Thus, pointwise nonlinear-
ities in convolutional networks do not break discrete translation equivariance.
More generally, we get permutation representations when working with
scalar signals on a space having a group action (chapter 4) such as a homo-
geneous space (chapter 8). However, if the domain is continuous then we
are forced to discretize and as a result aliasing can occur, which will break
equivariance to continuous transformations. Since nonlinearities can introduce high-frequency structure into the signal, they are likely to lead to
aliasing. In order to preserve equivariance to pointwise nonlinearities it is thus
necessary to take measures to avoid aliasing (Cesa, Lang, and Weiler 2022).
3.6.5 Gating and Attention
When the representation matrices ρ(g) are not permutation matrices, pointwise
nonlinearities will in general not be equivariant. For instance, consider the stan-
dard representation of a rotation θ ∈ [0, 2π) as a 2×2 matrix (eq. (3.15)) acting on R^2. A π/2 rotation maps (x, y) ↦ (–y, x), for example, and this operation will
not commute with ReLU because of the sign flip. Similarly, a field of such
quantities (a vector field; see chapter 8) will require special nonlinearities.
In cases like these, gating nonlinearities can be used. In conventional neural
networks, gating refers to multiplying a scalar or vector feature by a scalar
value called the gate, which is typically squashed into the (0, 1) or (–1, 1) range
by a sigmoid or tanh nonlinearity:
gate(x, g) = sigmoid(g) ⊙ x.   (3.24)
Here x, g ∈ R^d are feature vectors, and ⊙ indicates elementwise multiplication. Sometimes, a single scalar gate g ∈ R is used to scale an entire feature vector x ∈ R^d or a sub-vector of it. Gating is used in LSTM and GRU blocks
(Hochreiter and Schmidhuber 1997; Chung et al. 2014), squeeze-and-excite
blocks (Hu et al. 2020), and many other architectures.
In equivariant networks, gating can be used on equivariant features, provided
that the gate is invariant and we use one scalar gate to multiply a whole geo-
metric feature vector (not one gate per component of the vector). As noted in
section 3.6.2, multiplying by an invariant scalar is equivariant. Invariants could
be produced by an invariant linear map (section 3.6.8) which is a special case
of an equivariant map. Alternatively, if the representation matrices ρ(g) are
orthogonal (as is often the case) one could compute the norm of geometric fea-
ture vectors, or more generally an invariant polynomial (Hilbert 1890; Derksen
and Kemper 2002; Villar et al. 2021). Then one can apply a nonlinearity such
as sigmoid or tanh (clearly, the output will still be invariant). Finally, one can
multiply a geometric feature vector x_i (transforming according to a representation ρ_i) by a scalar gate. Since scalar and matrix multiplication commute, this is always equivariant.
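A minimal sketch of such a gated nonlinearity, assuming NumPy, the standard (orthogonal) representation of SO(2) on R^2, and a norm-based invariant gate; all names are illustrative.

```python
import numpy as np

def rot2(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_nonlinearity(v):
    """Scale the whole geometric feature v by an invariant scalar gate.

    The gate is a function of the (rotation-invariant) norm of v, so the
    output transforms with the same representation as the input."""
    gate = sigmoid(np.linalg.norm(v) - 1.0)   # invariant scalar in (0, 1)
    return gate * v

v = np.array([1.5, -0.5])
R = rot2(1.2)
# Equivariance: gating then rotating equals rotating then gating.
assert np.allclose(R @ gated_nonlinearity(v), gated_nonlinearity(R @ v))
```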
Attention is an operation that like gating involves multiplicative interactions
between data-dependent quantities, but additionally involves an averaging step.
Let X ∈ R^{N×D} be a matrix of N feature vectors of dimension D, and let A ∈ R^{M×N} be a matrix of logits or unnormalized attention weights. Then attention is computed as:
attention(X, A)_{md} = Σ_{n=1}^{N} softmax(A)_{mn} X_{nd},   (3.25)
where the softmax is applied along the N axis, thus producing a normalized distribution over the N feature vectors X_{n,:}.
There are two kinds of equivariance we may wish the attention operation to have. In graph neural networks and (non-autoregressive) transformers (chapter 5), one wishes this attention to be equivariant to permutations of the N feature vectors and attention weights; this is automatically satisfied. In applications involving geometry (e.g. geometric graphs, chapter 6), one may wish attention to be equivariant to a group G (e.g. rotations) acting on each feature vector X_{n,:} identically. Since attention computes a weighted average (section 3.6.2) of vectors with the same representation, it is equivariant to such transformations provided that the logits A are invariant. In the case of self-attention, where A is computed by inner products, this is guaranteed whenever the queries and keys have the same orthogonal group representation.
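The following sketch (an illustrative NumPy construction, not the book's implementation) computes self-attention over N rotation-equivariant 2D features with logits given by inner products; because the representation is orthogonal the logits are invariant, and the output is equivariant to a rotation applied to every feature.

```python
import numpy as np

def rot2(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def softmax(a, axis):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
    """X: (N, 2) array of 2D geometric features.

    Logits are inner products X X^T (invariant for an orthogonal representation);
    the output is a weighted average of the rows of X."""
    A = X @ X.T                          # (N, N) invariant logits
    return softmax(A, axis=1) @ X        # weighted average of same-type vectors

rng = np.random.default_rng(3)
X = rng.normal(size=(5, 2))
R = rot2(0.9)

# Rotating every feature rotates the attention output in the same way.
assert np.allclose(self_attention(X @ R.T), self_attention(X) @ R.T)
```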
3.6.6 Tensor Product Nonlinearities
Gating nonlinearities, where we multiply a scalar and some other geometric quantity, are a special case of a more general operation called the tensor product of representations. Given vectors x_i ∈ R^{d_i}, their tensor product (or outer product) T = ⊗_i x_i lives in R^{d_1×···×d_n}. The element at index (j_1, …, j_n) is the product of the corresponding elements of the vectors: T_{j_1,…,j_n} = Π_i x_{i,j_i}. For example, the tensor product of two column vectors is a matrix given by x ⊗ y = xy^T, i.e. (x ⊗ y)_{j_1 j_2} = x_{j_1} y_{j_2}. We can linearly combine tensors, so the tensors of a given shape form a vector space^23.
When the vectors x_i have a group representation associated with them, the space of tensors also naturally becomes a representation. Applying g ∈ G to x, y using their representations ρ_1, ρ_2, we find:
(ρ_1(g)x) ⊗ (ρ_2(g)y) = (ρ_1(g)x)(ρ_2(g)y)^T = ρ_1(g)(xy^T)ρ_2(g)^T.   (3.26)
Thus, T = x ⊗ y ∈ R^{d_1×d_2} is mapped to ρ_1(g)Tρ_2(g)^T, and indeed this defines a group representation on R^{d_1×d_2}. We could write this as ρ(g)T = ρ_1(g)Tρ_2(g)^T, but it should be noted that ρ(g) is a linear map from the space of tensors R^{d_1×d_2} to itself, i.e. as a matrix it has shape d_1d_2 × d_1d_2.
Mathematically it is often convenient to gloss over the particular shape (since reshaping doesn’t really do anything), but when implementing algorithms the shape does matter. If we flatten t = vec(T) ∈ R^{d_1 d_2}, we can take ρ(g) to be a matrix of size d_1d_2 × d_1d_2. The venerable Matrix Cookbook (Petersen and Pedersen 2012, equation 520) tells us that:
vec(ρ_1(g) T ρ_2(g)^T) = (ρ_2(g) ⊗ ρ_1(g)) vec(T),   (3.27)
where ρ_2(g) ⊗ ρ_1(g) denotes the tensor / Kronecker product of matrices:
A ⊗ B = [ a_11 B  ···  a_1n B
            ⋮      ⋱     ⋮
          a_m1 B  ···  a_mn B ]   (3.28)
for matrices A ∈ R^{m×n} and B ∈ R^{p×q}.
In summary, we can write the action of G on the space of matrices either as T ↦ ρ_1(g)Tρ_2(g)^T, or in flattened form as t ↦ (ρ_2(g) ⊗ ρ_1(g))t. This is called the tensor product of representations ρ_1 and ρ_2, and can be extended to arbitrary products ⊗_i ρ_i.
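Equation (3.27) is easy to verify numerically; the sketch below (assuming NumPy, with arbitrary matrices standing in for ρ_1(g), T and ρ_2(g)) uses the column-stacking vec convention of the Matrix Cookbook.

```python
import numpy as np

rng = np.random.default_rng(4)

d1, d2 = 3, 4
rho1 = rng.normal(size=(d1, d1))     # stands in for rho_1(g)
rho2 = rng.normal(size=(d2, d2))     # stands in for rho_2(g)
T = rng.normal(size=(d1, d2))        # a tensor (matrix) feature

vec = lambda M: M.reshape(-1, order="F")   # column-stacking vec

lhs = vec(rho1 @ T @ rho2.T)               # matrix form of the tensor action
rhs = np.kron(rho2, rho1) @ vec(T)         # flattened form: (rho_2 ⊗ rho_1) vec(T)
assert np.allclose(lhs, rhs)
```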
When pairwise tensor products are used as a nonlinearity, the dimensionality
is greatly increased with each application. Hence it is common to com-
bine tensor products with the Clebsch-Gordan decomposition (Kondor, Lin,
and Trivedi 2018), where a (fixed) change of basis is applied to the tensor
product in order to block-diagonalize it into irreducible representations (see
section 3.6.9), after which only a limited number of low-order representations
is retained.
Another way in which the tensor representation comes up is in graph neural networks, where each graph on n nodes is defined by an adjacency matrix A ∈ R^{n×n} (in practice, the entries of the matrix are binary, but one can also consider general edge or node-pair features). When we permute the nodes of the graph, we have to permute both the rows and columns of the adjacency matrix. This corresponds to the tensor representation ρ_1 ⊗ ρ_1, where ρ_1(P) = P is the standard matrix representation of the permutation group. That is, the adjacency matrix transforms as A ↦ PAP^T.
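A small NumPy check (illustrative) that permuting rows and columns of the adjacency matrix agrees with acting by P ⊗ P on its flattening:

```python
import numpy as np

rng = np.random.default_rng(5)

n = 5
A = rng.integers(0, 2, size=(n, n))          # an arbitrary adjacency matrix
P = np.eye(n)[rng.permutation(n)]            # a permutation matrix

vec = lambda M: M.reshape(-1, order="F")     # column-stacking flattening

# Permuting rows and columns (matrix form) vs. the tensor representation P ⊗ P.
assert np.allclose(vec(P @ A @ P.T), np.kron(P, P) @ vec(A))
```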
3.6.7 Equivariant Normalization Layers
Normalization layers, such as Batch-, Layer-, Group- and Instance-
Normalization, are an important part of modern network architectures includ-
ing convolutional and attention-based models (Ioffe and Szegedy 2015; Ba,
Kiros, and Hinton 2016; Wu and He 2018; Ulyanov, Vedaldi, and Lempitsky
2016). Normalizing features throughout the network can improve optimization
speed and stability, and improve generalization. Generally speaking, normal-
ization layers compute a mean µ and standard deviation σ of features, use them
to normalize the features so that they have mean zero and standard deviation
one, and then optionally apply a learned shift and scale operation. They differ
in the axes over which the mean and variance are computed, and hence the
shape of µ and σ.
Regardless of the axes over which normalization is performed, a few
simple rules suffice to make normalization layers equivariant. As noted in
section 3.6.2, we are always permitted to linearly combine geometric fea-
ture vectors, provided that they all have the same representation and that the
scalar coefficients are invariant. Thus, we can compute the mean of geometric
feature vectors of the same representation, and subtract it from those feature
vectors, and the result will be equivariant with the same representation again.
We should however not average over the components of a single geometric
feature vector; the components are not in themselves geometric vectors with a
representation and so their mean is not a geometric quantity either. The stan-
dard deviation is a function of the sum of squares of the (mean-subtracted)
components, and this quantity (i.e. the norm) will be invariant as long as the
representation is orthogonal.
Let us describe an equivariant layer-normalization layer. Consider an array X ∈ R^{N×C×D}, where N is the batch size, C is the number of channels (multiplicity), and D is the dimensionality of the geometric feature vectors^24. We assume that each vector X_{nc} transforms according to the same orthogonal representation. We will normalize over the C axis separately for each N. Thus, our mean has shape N × D and is computed as:
µ_{nd} = (1/C) Σ_{c=1}^{C} X_{ncd}.
That is, we compute N mean vectors of dimension D by averaging over the C axis. We subtract this from X to obtain X̃_{ncd} = X_{ncd} - µ_{nd}. Then, we compute the squared norm for each geometric feature vector, and compute their average:
σ_n^2 = (1/CD) Σ_{c=1}^{C} Σ_{d=1}^{D} X̃_{ncd}^2.   (3.29)
Finally, we divide: Y_{ncd} = X̃_{ncd} / √(σ_n^2 + ϵ).
It is common to apply a gain and bias after normalization. As noted, scaling
by an invariant gain is always permitted, provided one uses the same gain for all
coefficients of a geometric feature vector. Applying a bias is only equivariant
if the bias vector satisfies ρ(g)b = b for all g ∈ G, i.e. if the bias is invariant.
Depending on ρ, this may mean that it has to be 0, while in other cases there
may be a space of solutions of some dimensionality.
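A minimal NumPy implementation of the layer-normalization scheme just described (the array layout and ε value are illustrative assumptions); it normalizes over the channel axis C and commutes with any orthogonal representation acting along the D axis.

```python
import numpy as np

def equivariant_layer_norm(X, eps=1e-5):
    """X: (N, C, D) array of geometric features; one representation for all X[n, c].

    Subtracts the per-sample mean over channels (itself a geometric feature of the
    same type) and divides by an invariant scale derived from squared norms."""
    mu = X.mean(axis=1, keepdims=True)                      # shape (N, 1, D)
    Xc = X - mu
    var = (Xc ** 2).mean(axis=(1, 2), keepdims=True)        # shape (N, 1, 1), invariant
    return Xc / np.sqrt(var + eps)

# Check equivariance for a random orthogonal representation acting on the D axis.
rng = np.random.default_rng(6)
N, C, D = 2, 8, 3
X = rng.normal(size=(N, C, D))
Q, _ = np.linalg.qr(rng.normal(size=(D, D)))                # an orthogonal matrix rho(g)

assert np.allclose(equivariant_layer_norm(X @ Q.T),
                   equivariant_layer_norm(X) @ Q.T)
```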
3.6.8 Equivariant Linear Maps
Some of the most common maps in neural networks are learnable linear maps, such as fully connected layers, convolutions, depth-wise convolutions, and 1×1 convolutions. In the linear case it is often possible to understand
the space of equivariant maps and parameterize it. In later chapters we will
consider various cases with special structure, such as planar convolution, group
convolution, and graph convolution. In this section we consider the case of a
general linear map, also known as a fully connected layer.
Consider a group G, matrix representations ρ_1(g) ∈ R^{n×n} and ρ_2(g) ∈ R^{m×m} of G, and a matrix W ∈ R^{m×n}. For W to be equivariant, it has to satisfy the equation ρ_2(g)W = Wρ_1(g) for all g ∈ G. This system of equations is linear in W, and so we can solve for W to find a basis for the space of solutions. If W_i ∈ R^{m×n} (for i = 1, …, k) is a basis for the space of solutions, then any equivariant matrix W can be written as a linear combination of basis matrices: W = Σ_{i=1}^{k} θ_i W_i, and any such linear combination is equivariant.
It is common to add a learnable bias vector to the output, computing the
affine map Wx + b instead of the linear map Wx, but this is not necessarily
equivariant even if W is. We can reduce the affine case to the linear one by
appending a constant 1 dimension to x. This quantity is invariant, so we add a
constant 1 block (trivial representation) of size 1×1 to ρ_1 as well, yielding an (n + 1)-dimensional matrix representation. Then, solving for W as before yields an m × (n + 1) matrix containing the equivariant weight matrix and bias vector.
Depending on the group and the representations, it might be possible to
solve the equivariance constraint analytically, and this has been done for many
groups of interest. Indeed, as we will discuss in the next section, represen-
tation theory presents us with a systematic way to understand the space of
equivariant linear maps between two representations by breaking them down
into irreducible representations for which equivariant linear maps take a very
simple form.
Nevertheless, it can sometimes be more practical to solve the equivariance equations numerically. To do so, we would like to first bring the equivariance equation into the standard matrix-vector form Aw = 0, where w = vec(W) is the flattened weight matrix and A is a matrix of coefficients derived from ρ_1 and ρ_2. We first write the constraint as ρ_2(g)Wρ_1(g)^{-1} - W = 0. Next, we vectorize (flatten) using the vec operator. Using the general fact (Petersen and Pedersen 2012, section 10.2.2, equation 520) that vec(AXB) = (B^T ⊗ A) vec(X),
where ⊗ is the Kronecker product, eq. (3.28), we obtain:
(ρ_1(g)^{-T} ⊗ ρ_2(g) - I_n ⊗ I_m) vec(W) = 0   (3.30)
Thus, for each group element g we get a constraint matrix (ρ_1(g)^{-T} ⊗ ρ_2(g) - I_n ⊗ I_m) of size nm × nm. Assuming the group is finite with |G| elements, we can stack these matrices into a constraint matrix A of size |G|nm × nm. We can then use a standard method such as SVD to solve the linear system Aw = 0 and obtain our (flattened) basis matrices w_i = vec(W_i).
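As a sketch of this numerical procedure (assuming NumPy; the cyclic group C_4 acting on R^4 by shifts and the helper names are illustrative choices), the following code stacks the constraints of eq. (3.30) and extracts a basis of equivariant matrices from the null space via SVD.

```python
import numpy as np

def shift_rep(k, d=4):
    """Permutation representation of the cyclic group C_d: shift coordinates by k."""
    return np.roll(np.eye(d), shift=k, axis=0)

def equivariant_basis(reps_in, reps_out, tol=1e-8):
    """Basis of matrices W with rho_out(g) W = W rho_in(g) for all listed g.

    Stacks the constraints (rho_in(g)^{-T} ⊗ rho_out(g) - I ⊗ I) vec(W) = 0
    and returns null-space vectors reshaped into (m, n) matrices."""
    n, m = reps_in[0].shape[0], reps_out[0].shape[0]
    blocks = []
    for rin, rout in zip(reps_in, reps_out):
        rin_invT = np.linalg.inv(rin).T
        blocks.append(np.kron(rin_invT, rout) - np.eye(n * m))
    A = np.concatenate(blocks, axis=0)                  # shape (|G| n m, n m)
    _, s, Vt = np.linalg.svd(A)
    null = Vt[np.sum(s > tol):]                         # rows spanning the null space
    # Column-stacking vec: reshape each null vector back into an m x n matrix.
    return [v.reshape((m, n), order="F") for v in null]

group = [shift_rep(k) for k in range(4)]
basis = equivariant_basis(group, group)
print(f"{len(basis)} basis matrices")                   # 4 (the circulant matrices)
for k in range(4):
    for B in basis:
        assert np.allclose(shift_rep(k) @ B, B @ shift_rep(k))
```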
i