2
Foundations of Supervised Learning
Supervised learning requires handling approximation, estimation, and optimization errors.
Learning in high dimensions requires strong notions of regularity.
Standard function classes based on local and global smoothness do not scale efficiently to complex real data.
Machine learning, in essence, is concerned with predicting future outcomes
from past experience. In this chapter, we will formalize this goal, focusing
on supervised learning, arguably the simplest and most widespread machine
learning modality. Although it may be viewed as ‘merely’ doing curve-fitting,
it turns out that such fitting ability signals a profound interplay between high-dimensional statistics, optimization and mathematical analysis.
Indeed, as the input of a machine learning system becomes more and more complex, as in image, video or chemical data, guaranteeing that the model will preserve its predictive power on unseen data becomes increasingly hard. This corresponds to the colloquial curse of dimensionality, a term originally coined by Bellman in the 1950s that captures the exponential dependency on the input dimension of the cost of running a given algorithmic or statistical procedure to a desired performance level.
In order to beat the curse of dimensionality, it is therefore necessary to make
modeling assumptions that encode certain regularities about the data. In this
chapter we will review function classes based on classic notions of regularity
‘borrowed’ from mathematical analysis, which provide a clean mathematical
picture of high-dimensional learning, and set the backdrop for the geometric function classes of Chapter 4.
2.1 Main Ingredients of Supervised Learning
The starting point of supervised machine learning (and also of our GDL journey) is a set of $N$ observations $\{(x_i, y_i)\}_{i=1}^N$, where $x_i \in \mathcal{X}$ are the input observations and $y_i \in \mathcal{Y}$ are the labels to be predicted. The defining feature in this setup is that $\mathcal{X}$ is a high-dimensional space: one canonical example assumes $\mathcal{X} = \mathbb{R}^d$ to be a Euclidean space of large dimension $d \gg 1$. The label space $\mathcal{Y}$ might be either low- or high-dimensional; in the former case, $\mathcal{Y} = \mathbb{R}$ encompasses regression tasks, such as predicting the free energy of a chemical compound, as well as classification tasks $\mathcal{Y} = \{1, \dots, K\}$, such as image classification. In the latter, $\mathcal{Y} \simeq \mathcal{X}$ captures so-called structured prediction problems, where the output shares some characteristics with the input; for example, in medical imaging applications, the output might indicate suspicious regions on an MRI, or in dynamical systems, where $\mathcal{Y} = \mathcal{X}$ and the goal is to predict a high-dimensional state $X_{t+1} \in \mathcal{X}$ from the current state $X_t \in \mathcal{X}$.

Figure 2.1
Representative supervised learning tasks: (i) regression to a low-dimensional space, (ii) classification, (iii) structured prediction between high-dimensional spaces.
Data Distribution In order to formalize the predictive power of a model, it is necessary to model how data (both past and future) is generated. The standard assumption in supervised learning is that the pairs $(x_i, y_i)$ are drawn i.i.d. from an underlying data probability distribution $P$ defined over $\mathcal{X} \times \mathcal{Y}$, which will also be used to generate future data. It is important to emphasize that this is a simplifying assumption, made to establish insightful theoretical guarantees of future performance (or generalisation, as we will see next); many ML algorithms can operate and produce useful results in the absence of this i.i.d. property.$^{1}$ Another important remark is that this data distribution is not known to the Learner, so it cannot be leveraged during the training process; instead, it has intrinsic theoretical value to analyse the prediction performance of different learning algorithms.
Figure 2.2
Cartoon illustration of the joint distribution defining a supervised learning problem. The target function in this example is the conditional expectation $f(x) = \mathbb{E}_P[y\,|\,x]$.
Loss Function Besides the data distribution, another essential component to assess a ML model is the loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$, a non-negative function defined over the label space satisfying $\ell(y, y) = 0$ for any $y \in \mathcal{Y}$. For classification tasks, we may use $\ell(y, y') = 1_{y \neq y'}$, while for regression the squared loss $\ell(y, y') = (y - y')^2$ is the ubiquitous choice, owing to its fundamental role in statistics and signal processing. Now, given any function $f : \mathcal{X} \to \mathcal{Y}$, this loss can be used to assess the point-wise error $\ell(f(x), y)$ at a particular datapoint $(x, y)$, as well as the population risk (or error)
$$\mathcal{R}(f) := \mathbb{E}[\ell(f(x), y)] = \int_{\mathcal{X} \times \mathcal{Y}} \ell(f(x), y)\, dP(x, y)\ . \qquad (2.1)$$
This expectation may be re-written as $\mathcal{R}(f) = \int_{\mathcal{X}} \mathcal{R}_x(f(x))\, dP_x(x)$, where $P_x$ is the marginal distribution with respect to $x$, i.e. $P_x(A) := \int_{A \times \mathcal{Y}} dP(x, y)$ for any measurable set $A \subseteq \mathcal{X}$, and $\mathcal{R}_x(z) := \int_{\mathcal{Y}} \ell(z, y)\, dP_{y|x}(y)$ is the expected loss under the conditional distribution of $y$ given $x$.$^{2}$ Therefore, the function $f^*(x) := \arg\min_{y \in \mathcal{Y}} \mathcal{R}_x(y)$ is by definition the one achieving the smallest error, and is referred to as the Bayes optimal predictor. Note that for general data distributions $P$, the associated Bayes error will be strictly positive, due to the underlying uncertainty in the conditional distribution of $y$ given $x$.
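To make these definitions concrete, here is a minimal numerical sketch (the Gaussian linear model below is an illustrative assumption, not a setup from the text): it estimates by Monte-Carlo the population risk of the Bayes predictor $f^*(x) = \mathbb{E}_P[y\,|\,x]$ and of a trivial competitor, under an assumed joint distribution $P$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical joint distribution P over X x Y (an illustrative assumption):
# x ~ N(0, I_d) and y = <w*, x> + noise, so that for the squared loss the
# Bayes predictor is the conditional expectation f*(x) = <w*, x>, and the
# Bayes error equals the noise variance.
d, n_mc = 10, 200_000
w_star = rng.normal(size=d) / np.sqrt(d)
noise_std = 0.5

x = rng.normal(size=(n_mc, d))
y = x @ w_star + noise_std * rng.normal(size=n_mc)

def risk(f):
    """Monte-Carlo estimate of the population risk R(f) = E[(f(x) - y)^2]."""
    return np.mean((f(x) - y) ** 2)

bayes = lambda z: z @ w_star          # f*(x) = E[y | x]
zero = lambda z: np.zeros(len(z))     # a trivial competitor

print(f"R(f*) ~ {risk(bayes):.3f} (Bayes error = {noise_std**2:.2f})")
print(f"R(0)  ~ {risk(zero):.3f} (strictly larger)")
```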
Model Class While the Bayes predictor describes the optimal choice for any supervised learning task, it is an infeasible solution for several reasons. First, it is defined via the population loss, which is unknown to the Learner. Next, it defines a function $f^*$ that may be arbitrarily complex to describe, or even to approximate. Towards bridging these gaps, we first introduce our hypothesis class $\mathcal{F}$: a family of functions $f : \mathcal{X} \to \mathcal{Y}$, typically parametrised as $\mathcal{F} = \{f_\theta;\ \theta \in \Theta\}$. Most if not all of the examples in this book will consider $\mathcal{F}$ consisting of neural network parametrisations, where $\theta$ encodes the network weights. Learning thus happens in two stages: first we decide on the function class $\mathcal{F}$, for instance by selecting a certain neural network architecture, and next we determine the best hypothesis within $\mathcal{F}$. The excess risk of a hypothesis $f \in \mathcal{F}$ is defined as the deviation from the optimal Bayes risk, $\mathcal{R}(f) - \mathcal{R}(f^*)$. The infimum excess risk $\inf_{f \in \mathcal{F}} \mathcal{R}(f) - \mathcal{R}(f^*)$ captures the underlying ability of the function class to represent the solution of interest. That said, optimizing the excess risk over a class $\mathcal{F}$ still requires oracle access to the true data distribution. How can one overcome this limitation?
Empirical Risk Averaging the loss over the training set defines the empirical risk (or error)
$$\hat{\mathcal{R}}(f) := \frac{1}{N} \sum_{i=1}^N \ell(f(x_i), y_i)\ . \qquad (2.2)$$
Since the training set $\{(x_i, y_i)\}_i$ is a random sample, the empirical risk is a random function, whose expectation is precisely $\mathcal{R}(f)$. For any fixed $f$, $\hat{\mathcal{R}}(f)$ is an average of $N$ i.i.d. random variables $\ell(f(x_i), y_i)$. As such, from the law of large numbers we know that $\hat{\mathcal{R}}(f)$ converges almost surely to its expectation $\mathcal{R}(f)$ as $N \to \infty$; moreover, under mild moment assumptions, we also have the central-limit-theorem convergence $\sqrt{\tfrac{N}{\sigma_f^2}}\,(\hat{\mathcal{R}}(f) - \mathcal{R}(f)) \Rightarrow \mathcal{N}(0, 1)$, where $\sigma_f^2 = \mathrm{Var}(\ell(f(x), y))$, capturing the familiar scale $O(1/\sqrt{N})$ of the statistical fluctuations. This classic asymptotic behavior can be further quantified with non-asymptotic guarantees using tools from measure concentration. For instance, assuming $|\ell(f(x), y)| \leq C_f$ is a bounded random variable, Hoeffding's inequality asserts that
$$P\left(\left|\hat{\mathcal{R}}(f) - \mathcal{R}(f)\right| \geq t\right) \leq 2 \exp\left(-\frac{1}{2}\, C_f^{-2}\, t^2 N\right)\ ,$$
indicating that fluctuations $t \gg \frac{1}{\sqrt{N}}$ are extremely unlikely.
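These fluctuations are easy to probe numerically. The following minimal sketch (which assumes synthetic, uniformly distributed loss values rather than any particular model) compares the empirical tail probability of $|\hat{\mathcal{R}}(f) - \mathcal{R}(f)|$ at a fixed hypothesis with the corresponding Hoeffding bound for variables of range $C$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Fixed hypothesis f with a bounded loss; here we simply draw the loss values
# l(f(x_i), y_i) ~ Unif[0, C] directly (an illustrative assumption).
C, N, n_trials = 1.0, 500, 20_000
losses = rng.uniform(0, C, size=(n_trials, N))

R_hat = losses.mean(axis=1)    # one empirical risk per independent trial
R = 0.5 * C                    # population risk of this synthetic loss

t = 0.05
empirical_tail = np.mean(np.abs(R_hat - R) >= t)
# Hoeffding bound for variables with range C: 2 exp(-2 N t^2 / C^2).
bound = 2 * np.exp(-2 * N * t**2 / C**2)
print(f"P(|R_hat - R| >= {t}) ~ {empirical_tail:.1e}, Hoeffding bound {bound:.3f}")
```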
A seemingly naïve strategy to obtain good predictors in $\mathcal{F}$ is thus to minimise the empirical risk, rather than the population risk. However, before claiming victory by means of this simple argument, it is important to realize that we need more powerful control of statistical fluctuations. Indeed,
we have so far focused on measuring the fluctuations at a fixed hypothesis $f$, but the training process is precisely about finding the right one!

The relevant notion is not to compare how individual random variables $\hat{\mathcal{R}}(f)$ fluctuate around their expectations $\mathcal{R}(f)$, but how the random function $\hat{\mathcal{R}}$ fluctuates around its expectation $\mathcal{R}$. We thus need to control fluctuations uniformly over the class $\mathcal{F}$, that is, $\sup_{f \in \mathcal{F}} |\hat{\mathcal{R}}(f) - \mathcal{R}(f)|$. As it turns out, such uniform control can be achieved under fairly general conditions, by appropriately constraining the hypothesis class, as we shall see next. The resulting Empirical Risk Minimization (ERM) is in fact a very powerful algorithm, and the main statistical paradigm to perform supervised learning.
Empirical Risk Minimisation Given a hypothesis class $\mathcal{F} = \{f_\theta;\ \theta \in \Theta\}$ parametrised by $\theta$, the ERM defines an estimator of the form $f_{\hat{\theta}}$, where
$$\hat{\theta} \in \arg\min_{\theta \in \Theta} \hat{\mathcal{R}}(f_\theta) = \frac{1}{N} \sum_{i=1}^N \ell(f_\theta(x_i), y_i)\ . \qquad (2.3)$$
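As a concrete illustration, the sketch below instantiates equation 2.3 for a linear hypothesis class with the squared loss (an assumed setup for illustration), minimising the empirical risk by plain gradient descent.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative assumption: linear hypothesis class f_theta(x) = <theta, x>
# with the squared loss, so that ERM is a least-squares problem, solved here
# by gradient descent on the empirical risk (2.3).
d, N = 5, 200
theta_star = rng.normal(size=d)
X = rng.normal(size=(N, d))
y = X @ theta_star + 0.1 * rng.normal(size=N)

def emp_risk(theta):
    return np.mean((X @ theta - y) ** 2)

theta, lr = np.zeros(d), 0.1
for _ in range(500):
    grad = 2 / N * X.T @ (X @ theta - y)   # gradient of the empirical risk
    theta -= lr * grad

print(f"empirical risk at ERM solution: {emp_risk(theta):.4f}")
print(f"parameter error ||theta - theta*||: {np.linalg.norm(theta - theta_star):.3f}")
```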
In words, we are searching for a parameter in the domain $\Theta$ that best explains the observed data, in the sense of the chosen loss $\ell$. While this certainly seems a necessary condition to obtain a good predictor, to what extent is it also sufficient? By expressing the population risk $\mathcal{R}(f)$ as $\mathcal{R}(f) = \hat{\mathcal{R}}(f) + (\mathcal{R}(f) - \hat{\mathcal{R}}(f))$, we can think of a good hypothesis as a function $f$ such that (i) the empirical risk $\hat{\mathcal{R}}(f)$ is small, and (ii) the fluctuation $\mathcal{R}(f) - \hat{\mathcal{R}}(f)$ is small. While the first property is what ERM is explicitly designed to achieve, and can be readily assessed by measuring the empirical risk on the available data, the second relies on structural properties of the data distribution, the hypothesis class and the optimization algorithm. The key observation that we develop next is that these two sources of error are typically in tension, and one needs to trade them off: the familiar bias-variance tradeoff.
Having small (or even zero) training error amounts to solving (approximately or exactly) in $\theta$ a system of $N$ equations: $f_\theta(x_i) = y_i$ for all $i = 1, \dots, N$. The ability to solve systems of equations, even when they are non-linear, depends fundamentally on the number of degrees of freedom at our disposal, i.e. on the intrinsic dimensionality of the model class, here roughly captured by the number of parameters encoded in $\theta$. In other words, when the number of parameters is large relative to $N$, the ERM operates in the so-called overparametrised regime, and one should expect many parameter choices that fit the data equally well. However, recall that we are using the empirical risk $\hat{\mathcal{R}}$ as a proxy for the population risk $\mathcal{R}$, the true functional we are interested in minimising. Thus, how can one make an informed choice amongst all parameters in $\Theta$ that explain (or interpolate) the data equally well?
Regularization The folklore answer, Occam's razor, suggests picking the 'simplest' solution amongst those that interpolate the data. More formally, this is achieved in statistical learning via regularisation, or capacity control. Given a non-negative penalty $\gamma : \Theta \to \mathbb{R}$ encoding the 'complexity' of the associated hypothesis $f_\theta$, we consider the regularised form of equation 2.3:
$$\hat{\theta}_\delta \in \arg\min_{\theta \in \Theta;\ \gamma(\theta) \leq \delta} \hat{\mathcal{R}}(f_\theta)\ . \qquad (2.4)$$
In words, we introduce a hyper-parameter $\delta$ controlling the maximum allowed complexity of our estimator. As the reader might anticipate, this parameter controls the fundamental tradeoff between the ability to explain the data (bias) and the danger of overfitting to it (variance). It is important to emphasize that regularization and overparametrisation can (and often do) coexist. In statistical terms, the capacity of the model class is now encoded in the single scalar $\delta$. We will see in Section 2.3 some examples of regularisation in neural networks, with their associated norms $\gamma(\theta)$. Before that, let us formalize the tradeoff achieved by regularisation.
Example: polynomial regression on univariate data Figure 2.3 displays the familiar univariate example of fitting a non-polynomial target function observed at $n$ points with polynomials of increasing degree $r$. The bias-variance tradeoff is illustrated by the transition from the underfitting regime where $r < n$ to the overfitting regime where $r > n$. The transition $r = n$ captures a singular situation where unregularised ERM becomes ill-conditioned, and the resulting population risk blows up; see Figure 2.5.
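The following minimal sketch reproduces the qualitative behavior of this example (the sinusoidal target and sample size are illustrative assumptions): as the degree $r$ grows, the training error of unregularised polynomial ERM decreases monotonically while the test error eventually explodes.

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative assumptions: a sinusoidal target observed at n points, fit by
# unregularised polynomial ERM of increasing degree r (np.polyfit).
n = 12
x_train = np.sort(rng.uniform(-1, 1, size=n))
y_train = np.sin(2 * np.pi * x_train)
x_test = np.linspace(-1, 1, 2000)
y_test = np.sin(2 * np.pi * x_test)

for r in [2, 5, 9, 11]:
    coeffs = np.polyfit(x_train, y_train, deg=r)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {r:2d}: train {train_err:.2e}, test {test_err:.2e}")
```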
2.2 Decomposition of Risk
Recall that the ultimate goal of supervised learning is to perform well on unseen data, or, more precisely, to construct an estimator $\hat{f}$ from the available data that has small excess risk $\mathcal{R}(\hat{f}) - \mathcal{R}(f^*)$. Let us consider an estimator $\hat{f} = f_{\hat{\theta}}$ that approximately solves the regularised ERM from equation 2.4, and let us now quantify its excess risk.
We start by adding and subtracting the best model (in terms of population risk) achievable under the constraint $\gamma(\theta) \leq \delta$:
$$\mathcal{R}(\hat{f}) - \mathcal{R}(f^*) = \mathcal{R}(\hat{f}) - \inf_{\gamma(\theta) \leq \delta} \mathcal{R}(f_\theta) + \underbrace{\inf_{\gamma(\theta) \leq \delta} \mathcal{R}(f_\theta) - \mathcal{R}(f^*)}_{\text{approximation error}}\ . \qquad (2.5)$$
We have written the excess error (which is non-negative by definition) as the sum of two non-negative terms. The second one, termed approximation error, measures our ability to design accurate enough hypothesis classes with the right 'inductive biases', and penalizes us for reducing the capacity of our model class via the restriction $\gamma(\theta) \leq \delta$.

Figure 2.3
Polynomial regression on univariate functions. Regressions with polynomials of small degree underfit, and the error is dominated by bias, while polynomials of high degree overfit, and the error is dominated by variance.

Further, by exploiting the fact that $\hat{f}$ approximately solves equation 2.4, we decompose the first term as
$$\mathcal{R}(\hat{f}) - \inf_{\gamma(\theta) \leq \delta} \mathcal{R}(f_\theta) = \left[\mathcal{R}(\hat{f}) - \hat{\mathcal{R}}(\hat{f})\right] + \left[\inf_{\gamma(\theta) \leq \delta} \hat{\mathcal{R}}(f_\theta) - \inf_{\gamma(\theta) \leq \delta} \mathcal{R}(f_\theta)\right] + \left[\hat{\mathcal{R}}(\hat{f}) - \inf_{\gamma(\theta) \leq \delta} \hat{\mathcal{R}}(f_\theta)\right]$$
$$\leq \underbrace{2 \sup_{\gamma(\theta) \leq \delta} \left|\mathcal{R}(f_\theta) - \hat{\mathcal{R}}(f_\theta)\right|}_{\text{statistical error}} + \underbrace{\hat{\mathcal{R}}(\hat{f}) - \inf_{\gamma(\theta) \leq \delta} \hat{\mathcal{R}}(f_\theta)}_{\text{optimization error}}\ ,$$
since $\inf_{\gamma(\theta) \leq \delta} \hat{\mathcal{R}}(f_\theta) - \inf_{\gamma(\theta) \leq \delta} \mathcal{R}(f_\theta) \leq \hat{\mathcal{R}}(f_{\theta^*}) - \mathcal{R}(f_{\theta^*})$, where $\theta^* \in \arg\min_{\gamma(\theta) \leq \delta} \mathcal{R}(f_\theta)$.
Besides the previous approximation error, we now identify two other fundamental sources of error: the statistical error penalizes us for optimizing the 'wrong' risk functional, that is, the empirical ($\hat{\mathcal{R}}$) rather than the population ($\mathcal{R}$) objective, and is captured here through the uniform deviation between these two functionals within the ball $\gamma(\theta) \leq \delta$. The remaining term is the optimization error, since in general equation 2.4 does not admit closed-form solutions. Moreover, in general the empirical risk $\hat{\mathcal{R}}$ is a non-convex function of $\theta$, which creates potential roadblocks in the associated minimisation problem. Figure 2.4 illustrates the resulting decomposition of risk.

Figure 2.4
Decomposition of risk into approximation, estimation and optimization errors. The approximation error arises when constraining/regularising the risk over a ball $\mathcal{F}_\delta$. The estimation error corresponds to the fact that the empirical risk $\hat{\mathcal{R}}$ deviates from the population risk $\mathcal{R}$. Finally, the optimization error captures the inability of optimization methods to solve generic non-convex problems.
The important takeaway from this error decomposition is that in any supervised learning task, the complexity parameter $\delta$ mediates a fundamental tradeoff between the different sources of error, even in overparametrised regimes where the number of parameters (or neurons) far exceeds $N$. Generally speaking, a small complexity parameter $\delta$ makes the statistical error smaller (since we need to control fluctuations between two functions over a smaller domain), while making the approximation error larger. We may thus interpret this as the bias-variance tradeoff arising in classic model selection in statistics (Bottou and Bousquet 2007).
2.3 Over-Parametrised Regime
In the context of neural networks, or more generally models having many more parameters than data-points, one may wonder how this tradeoff plays out.

The so-called double-descent phenomenon (Belkin et al. 2019) seemingly goes against this bias-variance tradeoff, whereby highly over-parametrised neural networks exhibit excellent generalisation performance; see Figure 2.5. In fact, the double-descent phenomenon is consistent with the bias-variance tradeoff described above, but it serves as a cautionary tale that the 'effective' complexity of a hypothesis class is generally not well captured by simply counting the number of parameters.

Figure 2.5
Illustration of the double-descent phenomenon, whereby the population risk of unregularised ERM undergoes a phase transition as the number of parameters crosses the number of datapoints, with a second 'descent' phase.
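The phenomenon can be reproduced in a few lines with minimum-norm least squares on random features; the setup below (random ReLU features of a linear target) is an illustrative assumption, not the experiment of Belkin et al. 2019.

```python
import numpy as np

rng = np.random.default_rng(4)

# Illustrative assumptions: min-norm least squares on random ReLU features of
# a linear target; the test risk typically peaks as the number of features M
# crosses the number of samples N, then descends again.
d, N, N_test = 20, 100, 2000
w_star = rng.normal(size=d) / np.sqrt(d)
X, Xt = rng.normal(size=(N, d)), rng.normal(size=(N_test, d))
y = X @ w_star + 0.1 * rng.normal(size=N)
yt = Xt @ w_star

W = rng.normal(size=(d, 2000)) / np.sqrt(d)       # shared random weights
for M in [10, 50, 90, 100, 110, 200, 1000, 2000]:
    F = np.maximum(X @ W[:, :M], 0)               # train features
    Ft = np.maximum(Xt @ W[:, :M], 0)             # test features
    theta = np.linalg.pinv(F) @ y                 # min-norm least squares
    print(f"M = {M:4d}: test risk {np.mean((Ft @ theta - yt) ** 2):.3f}")
```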
A successful learning scheme thus needs to encode the appropriate notion of regularity or inductive bias for $f$, imposed through the construction of the function class $\mathcal{F}$ and the use of regularisation. We briefly introduce this concept in the following section.

Modern machine learning operates with large, high-quality datasets, which, together with appropriate computational resources, motivate the design of rich function classes $\mathcal{F}$ with the capacity to interpolate such large data. This mindset plays well with neural networks, since even the simplest choices of architecture yield a dense class of functions.
A set $A \subseteq B$ is said to be dense in $B$ if its closure satisfies $\bar{A} := \{\lim_{i \to \infty} a_i : a_i \in A\} = B$. This implies that any point in $B$ is arbitrarily close to a point in $A$.$^{3}$ A typical Universal Approximation result shows that the class of functions represented, e.g., by a two-layer perceptron, $f(x) = c^\top \mathrm{sign}(Ax + b)$, is dense in the space of continuous functions on $\mathbb{R}^d$.
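The following univariate sketch illustrates this density statement numerically (the target and grid of thresholds are illustrative assumptions): fitting only the output weights of a fixed bank of step functions already approximates a continuous function closely.

```python
import numpy as np

# Illustrative sketch of density, in the univariate case: approximate a
# continuous target on [0, 1] by a combination of step functions
# sign(x - b_m), i.e. a two-layer perceptron with fixed thresholds b_m.
M = 50
b = np.linspace(0, 1, M)
x = np.linspace(0, 1, 1000)
target = np.sin(4 * np.pi * x) + x

Phi = np.sign(x[:, None] - b[None, :])            # step-function features
c, *_ = np.linalg.lstsq(Phi, target, rcond=None)  # fit the output layer
err = np.max(np.abs(Phi @ c - target))
print(f"sup-norm error with {M} steps: {err:.3f}")
```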
The capacity to approximate almost arbitrary functions is the subject of various Universal Approximation Theorems; several such results were proved and popularised in the 1990s by applied mathematicians and computer scientists (see e.g. Cybenko 1989; Hornik 1991; Barron 1993; Leshno et al. 1993; Maiorov 1999; Pinkus 1999).
Figure 2.6
Multilayer Perceptrons (Rosenblatt 1958), the simplest feed-forward neural networks,
are universal approximators: with just one hidden layer, they can represent combinations of step functions, allowing them to approximate any continuous function with arbitrary precision.
Universal Approximation, however, does not imply an absence of inductive bias. Given a hypothesis space $\mathcal{F}$ with universal approximation, we can define a complexity measure $\gamma : \mathcal{F} \to \mathbb{R}_+$ (the same notion that we introduced at the end of Section 2.1), and redefine our interpolation problem as
$$\tilde{f} \in \arg\min_{g \in \mathcal{F}} \gamma(g) \quad \text{s.t.} \quad g(x_i) = f(x_i)\ \text{for}\ i = 1, \dots, N,$$
i.e., we are looking for the most regular functions within our hypothesis class. For standard function spaces, this complexity measure can be defined as a norm.$^{4}$ In this sense, and in light of the risk decomposition outlined in Section 2.2, we can revisit the Universal Approximation Theorems with a more skeptical eye: they 'just' convey the idea that the approximation error $\inf_{\gamma(f) \leq \delta} \mathcal{R}(f) - \mathcal{R}(f^*)$ converges to zero as $\delta \to \infty$. Stated as such, they lack quantitative power to understand the resulting tradeoff between approximation and statistical errors.
In order to overcome this limitation, it is thus necessary to quantify approximation rates. In low dimensions, splines are a workhorse for function approximation. They can be formulated as above, with a norm capturing the classical notion of smoothness, such as the squared norm of second derivatives $\int |f''(x)|^2\, dx$ for cubic splines. This representation allows us to quantify approximation errors by controlling the norm of the spline approximation.
In the case of neural networks, the complexity measure $\gamma$ can be expressed in terms of the network weights, i.e. $\gamma(f_\theta) = \gamma(\theta)$. The $L_2$-norm of the network weights, known as weight decay, or the so-called path-norm (Neyshabur, Tomioka, and Srebro 2015), are popular choices in the deep learning literature. From a Bayesian perspective, such complexity measures can also be interpreted as the negative log of the prior for the function of interest. The regularisation term $\gamma(\theta)$ may take different forms across different ML systems. For instance, for linear models of the form $f_\theta(x) = \theta^\top \Phi(x)$ with $\theta \in \mathbb{R}^m$, we may choose an $L_p$ norm $\gamma(\theta) = \|\theta\|_p^p = \sum_{j=1}^m |\theta_j|^p$, with $p = 1, 2$ capturing the familiar settings of sparse and ridge regression, respectively.
It is important to emphasize that the regularization framework encompasses not only methods that explicitly consider regularisation, in either constrained (as above) or penalized form (by adding the associated Lagrange multipliers, leading to $\min_{\theta \in \Theta} \hat{\mathcal{R}}(f_\theta) + \lambda \gamma(\theta)$), but also methods where the regularization comes implicitly through the choice of optimization scheme. For example, it is well-known that gradient descent initialized at zero on an under-determined least-squares objective will choose the interpolating solution with minimal $L_2$ norm. The extension of such implicit regularisation results to modern neural networks is the subject of current studies (see e.g. Gunasekar et al. 2017; Blanc et al. 2020; Shamir and Vardi 2020; Razin and Cohen 2020).
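This implicit bias is straightforward to verify numerically; the sketch below (with an assumed random under-determined system) compares the gradient-descent solution initialized at zero with the minimum-norm interpolator given by the pseudo-inverse.

```python
import numpy as np

rng = np.random.default_rng(6)

# Illustrative sketch: gradient descent initialized at zero on an
# under-determined least-squares problem converges to the interpolating
# solution of minimal L2 norm, which is exactly the pseudo-inverse solution.
n, m = 20, 100                              # fewer equations than unknowns
X = rng.normal(size=(n, m))
y = rng.normal(size=n)

theta = np.zeros(m)
lr = 1.0 / np.linalg.norm(X, 2) ** 2        # safe step size
for _ in range(20_000):
    theta -= lr * X.T @ (X @ theta - y)

theta_mn = np.linalg.pinv(X) @ y            # min-norm interpolator
print(f"residual: {np.linalg.norm(X @ theta - y):.2e}")
print(f"distance to min-norm solution: {np.linalg.norm(theta - theta_mn):.2e}")
```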
The important takeaway is that regularization and overparametrisation are
compatible in modern DL via the implicit bias of gradient-based learning, and
this bias, or ‘prior’, is often captured as a certain norm in the space of functions.
All in all, a natural question arises: how to define effective priors that capture
the expected regularities and complexities of real-world prediction tasks, and
that enable a quantification of the approximation error?
2.4 The Curse of Dimensionality In High-Dimensional Learning
While interpolation in low dimensions (with $d = 1, 2$ or $3$) is a classic signal processing task with very precise mathematical control of estimation errors using increasingly sophisticated regularity classes (such as spline interpolants, wavelets, curvelets, or ridgelets), the situation for high-dimensional problems is entirely different. In this generic high-dimensional regime, one is often confronted with an impossible tradeoff between 'cursed' approximation rates and 'cursed' estimation rates.$^{5}$
In order to convey the essence of the idea, let us consider a classical notion of regularity that can be easily extended to high dimensions: 1-Lipschitz functions $f : \mathcal{X} \to \mathbb{R}$, i.e. functions satisfying $|f(x) - f(x')| \leq \|x - x'\|$ for all $x, x' \in \mathcal{X}$. This hypothesis says that if we perturb the input $x$ slightly (as measured by the norm $\|x - x'\|$), the output $f(x)$ is not allowed to change much. This is a weaker form of regularity than Sobolev smoothness, which additionally asks for $f$ to have $s$ derivatives that are integrable. If our only knowledge of the target function $f$ is that it is 1-Lipschitz, how many observations do we expect to require to ensure that our estimate $\tilde{f}$ will be close to $f$? Figure 2.7 reveals that the general answer is necessarily exponential in the dimension $d$, signaling that the Lipschitz class grows 'too quickly' as the input dimension increases: in many applications with even modest dimension $d$, the number of samples would be bigger than the number of atoms in the universe. The situation is no better if one replaces the Lipschitz class by a global smoothness hypothesis, such as the Sobolev class $H^s(\mathbb{R}^d)$.$^{6}$ Indeed, classic results (Tsybakov 2008) establish a minimax rate of approximation and learning for the Sobolev class of the order $\epsilon^{-d/s}$, showing that the extra smoothness assumptions on $f$ only improve the statistical picture when $s \gtrsim d$, an unrealistic assumption in practice. Thus, if 'classic' notions of regularity are not adapted to perform high-dimensional learning, how should one replace them with effective ones?
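Before answering, the geometric argument of Figure 2.7 can be probed numerically; the sketch below (uniform samples on $[0,1]^d$, an assumed setup) measures the typical distance from a fresh query point to its nearest sample, the accuracy scale available to any estimator of a generic 1-Lipschitz target.

```python
import numpy as np

rng = np.random.default_rng(7)

# Illustrative sketch: with N uniform samples on [0,1]^d, the distance from a
# fresh query to its nearest sample decays like N^(-1/d); for a 1-Lipschitz
# target this is the accuracy scale, so N must grow exponentially with d.
N, n_queries = 1000, 200
for d in [1, 2, 5, 10, 20]:
    X = rng.uniform(size=(N, d))
    q = rng.uniform(size=(n_queries, d))
    dists = np.linalg.norm(q[:, None, :] - X[None, :, :], axis=-1).min(axis=1)
    print(f"d = {d:2d}: mean nearest-neighbour distance {dists.mean():.3f}")
```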
2.5 Linear Approximation With Positive-Definite Kernels
A powerful framework to define flexible approximation spaces in high dimension is via positive semidefinite kernels: a psd kernel $k(x, x')$ is a symmetric mapping on $\mathcal{X} \times \mathcal{X}$ satisfying $\sum_{i,j=1}^n \alpha_i \alpha_j k(x_i, x_j) \geq 0$ for any $\alpha \in \mathbb{R}^n$ and any collection of points $(x_1, \dots, x_n)$. A kernel captures an a-priori notion of similarity in the input space, e.g. a kernel of the form $k(x, x') = x^\top x'$ in $\mathcal{X} = \mathbb{R}^d$ measures linear correlation, while a Gaussian kernel $k(x, x') = \exp(-\frac{1}{2\tau^2}\|x - x'\|^2)$ measures whether $x, x'$ are close at a certain scale, in the sense that $k(x, x') \approx 1$ whenever $\|x - x'\| \lesssim \tau$ and $k(x, x') \approx 0$ otherwise.
To each positive semidefinite kernel on $\mathcal{X}$ one can associate a Hilbert space of functions defined over $\mathcal{X}$ with additional structure, the so-called reproducing property. In essence, there exists a (possibly infinite-dimensional) Hilbert space $\mathcal{H}$ and a 'feature map' $\varphi : \mathcal{X} \to \mathcal{H}$ such that the kernel can be viewed as an inner product: $k(x, x') = \langle \varphi(x), \varphi(x') \rangle_{\mathcal{H}}$. The reproducing property refers to the fact that in this Hilbert space, function evaluations can be seen as inner products: for each $f \in \mathcal{H}$ and $x \in \mathcal{X}$, one has $f(x) = \langle f, \varphi(x) \rangle_{\mathcal{H}}$. Given that in supervised learning one acquires information about the target function $f^*$ via (noisy) evaluations on the training set, i.e. $y_i = f^*(x_i) + \xi_i$, the reproducing property allows us to view these measurements as bona-fide projections of the target onto certain directions, spanned precisely by the data features $\varphi(x_i) \in \mathcal{H}$.

Figure 2.7
We consider a Lipschitz function $f(x) = \sum_{j=1}^{2^d} z_j\, \phi(x - x_j)$, where $z_j = \pm 1$, $x_j \in \mathbb{R}^d$ is placed in each quadrant, and $\phi$ is a locally supported Lipschitz 'bump'. Unless we observe the function in most of the $2^d$ quadrants, we will incur a constant error in predicting it. This simple geometric argument can be formalised through the notion of Maximum Discrepancy (Luxburg and Bousquet 2004), defined for the Lipschitz class as $\kappa(d) = \mathbb{E}_{x, x'}\left[\sup_{f \in \mathrm{Lip}(1)} \frac{1}{N}\sum_l f(x_l) - \frac{1}{N}\sum_l f(x'_l)\right] \asymp N^{-1/d}$, which measures the largest expected discrepancy between two independent $N$-sample expectations. Ensuring that $\kappa(d) \leq \epsilon$ requires $N = \Theta(\epsilon^{-d})$; the corresponding sample $\{x_l\}_l$ defines an $\epsilon$-net of the domain. For a $d$-dimensional Euclidean domain of diameter 1, its size grows exponentially as $\epsilon^{-d}$.
The resulting Reproducing Kernel Hilbert Space (RKHS) $\mathcal{H}$ thus consists of functions $f_\theta : \mathcal{X} \to \mathbb{R}$ that are linear in $\theta$ when expressed using the representation $f_\theta(x) = \langle \theta, \varphi(x) \rangle_{\mathcal{H}}$, and comes with a Hilbert regularization norm $\gamma(f_\theta) = \|\theta\|_{\mathcal{H}}$. The Hilbertian structure enables a precise control of all sources of error, including approximation, generalisation and optimization, even in the high-dimensional regime. The initial choice of kernel thus captures the 'inductive bias' associated with the corresponding RKHS. For instance, in the previous Gaussian kernel example, the associated RKHS norm $\|f_\theta\|_{\mathcal{H}}$ corresponds to a weighted Fourier norm of the form $\|f_\theta\|_{\mathcal{H}}^2 \simeq \int |\hat{f}(\omega)|^2 \exp\left(\frac{\tau^2}{2}\|\omega\|^2\right) d\omega$, which indicates that $\mathcal{H}$ contains only infinitely differentiable smooth functions with exponential Fourier decay.
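As a concrete instance of this framework, here is a minimal kernel ridge regression sketch with the Gaussian kernel (target, bandwidth $\tau$ and ridge parameter $\lambda$ are illustrative assumptions); the estimator takes the standard form $\hat{f}(x) = \sum_i \alpha_i k(x, x_i)$ with $\alpha = (K + \lambda\, \mathrm{Id})^{-1} y$.

```python
import numpy as np

rng = np.random.default_rng(8)

# Illustrative kernel ridge regression with the Gaussian kernel; the target,
# bandwidth tau and ridge lam are assumptions for illustration. The estimator
# is f(x) = sum_i alpha_i k(x, x_i), with alpha = (K + lam * I)^(-1) y.
def gaussian_kernel(A, B, tau=0.5):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * tau**2))

d, N, N_test, lam = 2, 200, 500, 1e-3
f = lambda x: np.sin(3 * x[:, 0]) * np.cos(2 * x[:, 1])
X = rng.uniform(-1, 1, (N, d))
Xt = rng.uniform(-1, 1, (N_test, d))
y = f(X) + 0.1 * rng.normal(size=N)

K = gaussian_kernel(X, X)
alpha = np.linalg.solve(K + lam * np.eye(N), y)
pred = gaussian_kernel(Xt, X) @ alpha
print(f"test risk: {np.mean((pred - f(Xt)) ** 2):.4f}")
```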
In the context of Neural Networks, some popular kernels are obtained by building feature maps $\varphi(x)$ using infinitely wide neural networks, and by approximating them using Random Features. For instance, by considering $k(x, x') = \mathbb{E}_{w \sim \mu}[\sigma(w^\top x)\,\sigma(w^\top x')]$, where $\mu$ is a fixed probability distribution in $\mathbb{R}^d$ and $\sigma$ an arbitrary activation function, the associated Random Feature approximation is obtained by a Monte-Carlo estimate of $k$:
$$\hat{k}(x, x') = \frac{1}{M}\sum_{m=1}^M \sigma(w_m^\top x)\,\sigma(w_m^\top x')\ , \quad \text{with}\ w_m \overset{\text{iid}}{\sim} \mu\ .$$
The associated (random) RKHS is thus generated by the random features $\{\sigma(w_m^\top \cdot)\}_{m \leq M}$, which can be interpreted in this case as a width-$M$ shallow neural network with random weights $w_m$. The resulting approximation space is then obtained by linear combinations of these random features; in other words, a shallow neural network where only the last output layer is trained. Another popular kernel method associated with neural networks is the so-called Neural Tangent Kernel (NTK), which also considers the gradient features $\{\nabla_w \sigma(w_m^\top \cdot)\}_m$ and captures the training dynamics in certain overparametrised scalings (the so-called lazy regime).
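The Monte-Carlo nature of this construction is easy to check numerically; in the sketch below (with $\sigma = \mathrm{ReLU}$ and $\mu = \mathcal{N}(0, \mathrm{Id})$ as assumed choices), $\hat{k}(x, x')$ concentrates around its infinite-width value as $M$ grows.

```python
import numpy as np

rng = np.random.default_rng(9)

# Illustrative sketch of the random-feature Monte-Carlo approximation of
# k(x, x') = E_w[sigma(w.x) sigma(w.x')], with sigma = ReLU and mu = N(0, I_d)
# as assumed choices; k_hat concentrates around k as the width M grows.
d = 5
x, x2 = rng.normal(size=d), rng.normal(size=d)

W_ref = rng.normal(size=(1_000_000, d))     # large sample as reference
k_ref = np.mean(np.maximum(W_ref @ x, 0) * np.maximum(W_ref @ x2, 0))

for M in [10, 100, 1000, 10_000]:
    W = rng.normal(size=(M, d))             # w_m ~ mu, m = 1..M
    k_hat = np.mean(np.maximum(W @ x, 0) * np.maximum(W @ x2, 0))
    print(f"M = {M:5d}: k_hat = {k_hat:.3f} (reference {k_ref:.3f})")
```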
While the RKHS framework offers wide flexibility through the choice of the kernel, it is inherently a linear approximation framework where there is no representation (or 'feature') learning, since the choice of the kernel (and therefore the associated feature maps) is made before seeing the training data. In other words, kernel methods are unable to extract meaningful information in high dimensions unless one already has a preconceived idea of 'where to look'. To illustrate the limitations of kernels in high dimensions, it is useful to consider isotropic kernels, i.e. $k(x, x') = k(Ux, Ux')$ for any rotation $U \in O_d$. In words, these kernels have no preconceived notion of privileged directions in the input space. Under these conditions, the kernel ridge regression estimator $\hat{f}$ associated with a dataset $\{(x_i, y_i = f(x_i))\}_{i \leq n}$ is essentially equivalent to performing polynomial regression on $f$: if $n \approx d^k$, then $\hat{f}$ is the best degree-$k$ polynomial approximation of $f$ (Misiakiewicz and Montanari 2023). The model is thus unable to adapt to any form of low-dimensional structure present in the target function $f$, and is only effective to learn targets with global smoothness.

In conclusion, such a limitation is incompatible with many aspects of modern DL, such as the pre-training paradigm, in which a large training set is used to learn certain features, which are then transferred to a downstream task via a fine-tuning stage.
2.6 From Linear To Non-Linear Approximation
The alternative feature learning framework corresponds instead to non-linear approximation. While the previous kernel approximation framework considers a fixed family of feature maps $\{\varphi_{w_j}(x)\}_{j \leq \dim(\mathcal{H})}$ and approximates a target $f$ as $f \approx \sum_j \theta_j \varphi_{w_j}$ by adjusting the weights $\theta_j$, in non-linear approximation one also adjusts the parameters $w_j$ defining the features. The additional degrees of freedom brought by adjusting $w_j$ besides $\theta_j$ bring additional adaptivity, which can become dramatic in the high-dimensional regime. In other words, the non-linear approximation aspect of feature learning improves the approximation properties of the model, at the expense of a more challenging optimization. Indeed, the joint optimization of $w_j$ and $\theta_j$ defines a non-convex problem which is generally challenging to analyse, yet often solvable in practice using gradient-descent techniques.

In summary, some of the limitations of linear approximation and associated kernel methods may be overcome by considering feature learning, which has the ability to adapt to specific structures of the target function. Yet, the question remains: how to define these trainable features in such a way that prior information on the target function is efficiently encoded?
2.7 Approximation With Shallow Networks And Beyond
To illustrate the difficulty of this question, it is useful to consider the simplest instance of feature learning in NNs, given by shallow neural networks. This architecture defines a class of functions of the form
$$\mathcal{F} = \left\{ f_\theta(x) = \sum_{m \leq M} \alpha_m\, \sigma(w_m^\top x + b_m)\ ;\ \theta = \{\alpha_m, w_m, b_m\}_m \right\}\ , \qquad (2.6)$$
where $\sigma$ is an activation and $\theta$ collects the weights of the network. The model thus builds an approximation as a linear combination of ridge functions, which are non-linear functions of one-dimensional projections of the input. While this class enjoys universal approximation in the overparametrised limit $M \to \infty$, this approximation becomes quantitative in the high-dimensional regime (i.e., with an approximation error $\|f^* - f_\theta^{(M)}\| \lesssim M^{-\eta}$ with $\eta > 0$ independent of the input dimension) only under strong assumptions on the target. For instance, if $f^*$ is such that it depends on a collection of low-dimensional projections of the input (see Figure 2.8), then the shallow class, when operating in the non-linear regime, is able to break this curse of dimensionality (Bach 2017a). However, in most real-world applications (such as computer vision, speech analysis, physics, or chemistry), functions of interest tend to exhibit complex long-range correlations that cannot be expressed with low-dimensional projections (Figure 2.8), making this hypothesis unrealistic. It is thus necessary to define an alternative source of regularity, by exploiting the spatial structure of the physical domain and the geometric priors of $f$, as we describe in the next chapters.
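To illustrate the non-linear regime discussed above, here is a minimal feature-learning sketch (the single-index target, width and step size are illustrative assumptions): a shallow ReLU network of the form 2.6 where both layers are trained by gradient descent on a target depending on a single projection of the input.

```python
import numpy as np

rng = np.random.default_rng(10)

# Illustrative feature-learning sketch: a shallow ReLU network (equation 2.6)
# with both layers trained by gradient descent on a single-index target
# f*(x) = g(<u, x>); all choices (target, width, step size) are assumptions.
d, M, N, lr, steps = 20, 64, 500, 0.05, 3000
u = rng.normal(size=d); u /= np.linalg.norm(u)
g = lambda t: np.tanh(3 * t)
X = rng.normal(size=(N, d))
y = g(X @ u)

W = rng.normal(size=(M, d)) / np.sqrt(d)    # feature weights w_m (trained)
b = 0.5 * rng.normal(size=M)                # biases b_m (trained)
a = rng.normal(size=M) / np.sqrt(M)         # output weights alpha_m (trained)

def risk():
    return np.mean((np.maximum(X @ W.T + b, 0) @ a - y) ** 2)

print(f"train risk at initialization: {risk():.4f}")
for _ in range(steps):
    pre = X @ W.T + b                       # pre-activations, N x M
    H = np.maximum(pre, 0)                  # hidden features
    r = H @ a - y                           # residuals
    act = (pre > 0).astype(float)           # ReLU derivative
    grad_a = H.T @ r / N
    grad_b = a * (act.T @ r) / N
    grad_W = a[:, None] * ((act * r[:, None]).T @ X) / N
    a -= lr * grad_a; b -= lr * grad_b; W -= lr * grad_W
print(f"train risk after training:    {risk():.4f}")
```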
Figure 2.8
If the unknown function $f$ is presumed to be well approximated as $f(x) \approx g(Ax)$ for some unknown $A \in \mathbb{R}^{k \times d}$ with $k \ll d$, then shallow neural networks can capture this inductive bias, see e.g. Bach 2017a. In typical applications, such dependency on low-dimensional projections is unrealistic, as illustrated in this example: a low-pass filter projects the input images to a low-dimensional subspace; while it conveys most of the semantics, substantial information is lost.
Suggested Further Reading and Bibliographical Notes
For a comprehensive introduction to learning theory, we refer the reader to the recent 'Learning Theory from First Principles' by Francis Bach. In particular, the topics outlined in Section 2.1, Section 2.2 and Section 2.5 are presented from the ground up and with a great level of detail. Further aspects of RKHS theory and their applications to learning theory are detailed in 'Learning with Kernels', by Scholkopf and Smola. For further theoretical aspects of statistical learning theory, and a comprehensive review of measure concentration, we refer the reader to 'High-Dimensional Statistics', by Martin Wainwright. For topics specifically related to neural network approximation, described in Section 2.4 and Section 2.7, we recommend the reader to explore 'Deep Learning Theory' by Matus Telgarsky. Finally, the notions of linear and non-linear approximation outlined in Section 2.6 are explored in depth in 'A Wavelet Tour of Signal Processing', by Stephane Mallat (Chapter 9).
Approximation Theory for Neural Networks: The approximation properties of generic fully connected neural networks were heavily studied in the 80s and 90s, culminating in a fairly complete characterisation. (Baum and Haussler 1988) considered the minimal size of neural networks with the ability to fit a certain (finite) dataset. Shortly after, several works studied the density of shallow neural networks in the space of continuous functions, e.g. (Hecht-Nielsen 1987; Carroll and Dickinson 1989; Cybenko 1989; Funahashi 1989; Hornik, Stinchcombe, and White 1989), under certain assumptions on the activation function, culminating in the 'modern' formulation of the UAT (Leshno et al. 1993), which only asks the activation to not be a polynomial. Beyond qualitative approximation, the quantitative study of approximation rates using neural networks was pioneered by Barron (Barron 1993) and Maiorov and Meir (Maiorov 1999; Meir and Maiorov 2000). Barron also identified the fundamental separation between linear and non-linear approximation using shallow neural networks, subsequently studied in (Bach 2017a; Ma, Wu, et al. 2022).
Kernels and Neural Networks: The interplay between Neural Networks and Kernel methods dates at least to Neal (Neal 1996); see also (Williams 1996), where a connection was made between infinitely-wide neural networks and certain Gaussian Processes. (Weston, Ratle, and Collobert 2008) further explored the interplay between kernels and NNs, with the motivation to leverage the tractable formalism of kernel methods to ease the optimization challenges of deep architectures. Shortly after, Cho and Saul (Cho and Saul 2009) computed the dot-product kernels associated with deep architectures with randomly initialized isotropic weights. In parallel, Rahimi and Recht (Rahimi and Recht 2007) introduced random feature approximations for certain kernel classes, which in retrospect provided the framework for today's quantitative non-asymptotic theory (Misiakiewicz and Montanari 2023; Bach 2017a, 2017b). In essence, this framework provides an explicit statistical characterization of neural networks where only the last layer is allowed to train. The general setting of non-linear NN approximation is out of the scope of kernel methods, with the notable exception of the Neural Tangent Kernel from (Jacot, Gabriel, and Hongler 2018). The NTK considers a linearisation of the Neural Network at its initialization, and shows that, for certain choices of initialisation scaling, the training dynamics of the actual network and its linearisation agree in the overparametrised limit. This regime is however unable to account for the feature learning abilities of NNs (Chizat, Oyallon, and Bach 2019).
Exercises
1. Decomposition of Risk: Quadratic Function and Least Squares
Consider a linear hypothesis class of the form $\mathcal{F} = \{f_\theta(x) = \theta^\top \varphi(x)\}$, where $\varphi : \mathbb{R}^d \to \mathbb{R}^m$ is a fixed, finite-dimensional feature map, and $\theta \in \mathbb{R}^m$ parametrises the hypothesis. Consider the least-squares loss, hence $\ell(f_\theta(x), y) = (\theta^\top \varphi(x) - y)^2$.
(i) By developing the square, show that for any $\theta$ we have
$$\hat{\mathcal{R}}(f_\theta) - \mathcal{R}(f_\theta) = \theta^\top(\hat{\Sigma} - \Sigma)\theta - 2\theta^\top(\hat{m} - m) + (\hat{\sigma}_y^2 - \sigma_y^2)\ ,$$
where $\hat{\Sigma} = \frac{1}{n}\sum_i \varphi(x_i)\varphi(x_i)^\top$, $\Sigma = \mathbb{E}[\hat{\Sigma}]$, $\hat{m} = \frac{1}{n}\sum_i y_i\,\varphi(x_i)$, $m = \mathbb{E}[\hat{m}]$, $\hat{\sigma}_y^2 = \frac{1}{n}\sum_i y_i^2$, and $\sigma_y^2 = \mathbb{E}[\hat{\sigma}_y^2]$.
(ii) Assume now that the covariates are isotropic, $\Sigma = \mathrm{Id}$, and that we are in the realizable setting $y|x = \theta^{*\top}\varphi(x)$. Show that the deviation then takes the form
$$\hat{\mathcal{R}}(f_\theta) - \mathcal{R}(f_\theta) = (\theta - \theta^*)^\top(\hat{\Sigma} - \Sigma)(\theta - \theta^*)\ .$$
(iii) Conclude that the uniform bound in the ball $\mathcal{F}_\delta = \{f_\theta \in \mathcal{F};\ \|\theta\| \leq \delta\}$ is
$$\sup_{f \in \mathcal{F}_\delta} |\hat{\mathcal{R}}(f) - \mathcal{R}(f)| = \delta^2\, \|\hat{\Sigma} - \Sigma\|_{\mathrm{op}}\ .$$
In this example, the uniform control of the excess risk can thus be directly captured by the operator norm of the empirical covariance fluctuations (notice that the operator norm $\|A\|_{\mathrm{op}} = \sup_{\|\theta\| \leq 1}\|A\theta\|$ still contains a supremum over the domain). The additional structure of the operator norm of a linear operator leads to a tight control of the excess risk in linear models using tools from Random Matrix Theory; see Misiakiewicz and Montanari 2023 for an in-depth analysis of the excess risk in linear models (including infinite-dimensional linear models given by RKHS).
2. Generalization Bounds with Rademacher Complexity
Reference: (Bach 2024, Section 4.5). Denote $z = (x, y) \in \mathcal{Z} := \mathcal{X} \times \mathcal{Y}$ and $\mathcal{H} = \{(x, y) \mapsto \ell(y, f(x));\ f \in \mathcal{F}\}$ the class of functions from $\mathcal{Z}$ to $\mathbb{R}$ associated to the hypothesis class $\mathcal{F}$. In this exercise, we will obtain generalisation bounds on $\sup_{\mathcal{F}}\{\mathcal{R}(f) - \hat{\mathcal{R}}(f)\}$.
(i) Show that $\sup_{\mathcal{F}}\{\mathcal{R}(f) - \hat{\mathcal{R}}(f)\} = \sup_{\mathcal{H}}\{\mathbb{E}[h(z)] - \frac{1}{n}\sum_{i=1}^n h(z_i)\}$.
Let $\mathcal{D} = \{z_1, \dots, z_n\}$ be the dataset. The Rademacher complexity of the class $\mathcal{H}$ with respect to the data distribution is defined as
$$\mathcal{R}_n(\mathcal{H}) := \mathbb{E}_{\varepsilon, \mathcal{D}}\left[\sup_{h \in \mathcal{H}} \frac{1}{n}\langle \varepsilon, h(\mathcal{D})\rangle\right]\ ,$$
where $h(\mathcal{D}) = (h(z_1), \dots, h(z_n))$ and $\varepsilon = (\varepsilon_1, \dots, \varepsilon_n)$ is a vector of $n$ i.i.d. Rademacher variables, which take values $\varepsilon_i = \pm 1$ with probability $1/2$.
(ii) Show that $\mathcal{R}_n$ is a linear functional, i.e. $\mathcal{R}_n(\alpha\mathcal{H} + \alpha'\mathcal{H}') = |\alpha|\,\mathcal{R}_n(\mathcal{H}) + |\alpha'|\,\mathcal{R}_n(\mathcal{H}')$ for $\alpha, \alpha' \in \mathbb{R}$, that it is translation-invariant, i.e. $\mathcal{R}_n(\mathcal{H} + \{h_0\}) = \mathcal{R}_n(\mathcal{H})$, and that $\mathcal{R}_n(\mathcal{H}) = \mathcal{R}_n(\mathrm{CH}(\mathcal{H}))$, where $\mathrm{CH}$ denotes the convex hull.
(iii) Massart's lemma for a finite hypothesis class: Assume $\mathcal{H} = \{h_1, \dots, h_m\}$ and that $\mathcal{H}$ is a.s. bounded, i.e. $\frac{1}{n}\sum_{i=1}^n h(z_i)^2 \leq R^2$ for all $h \in \mathcal{H}$. Show that $\mathcal{R}_n(\mathcal{H}) \leq \sqrt{\frac{2\log m}{n}}\, R$.
(iv) Symmetrisation: Show that
$$\mathbb{E}\left[\sup_{\mathcal{H}}\left\{\mathbb{E}[h(z)] - \frac{1}{n}\sum_{i=1}^n h(z_i)\right\}\right] \leq 2\,\mathcal{R}_n(\mathcal{H})\ ,$$
and that the same bound applies to $\mathbb{E}\left[\sup_{\mathcal{H}}\{\frac{1}{n}\sum_{i=1}^n h(z_i) - \mathbb{E}[h(z)]\}\right]$. (Introduce an independent copy $\mathcal{D}' = \{z'_1, \dots, z'_n\}$ of the dataset and use the identity $\mathbb{E}[h(z)] - h(z_i) = \mathbb{E}[h(z'_i) - h(z_i)\,|\,\mathcal{D}]$.)
The previous bound controls the expected generalisation error. To obtain high-probability bounds, we now consider McDiarmid's inequality, which exploits the a.s. boundedness of the random variables.
Lemma 1 (McDiarmid's inequality) Let $Z_1, \dots, Z_n$ be independent random variables defined over $\mathcal{Z}$, and $f : \mathcal{Z}^n \to \mathbb{R}$ such that, for all $i \in \{1, \dots, n\}$ and all $z_1, \dots, z_n, z'_i$, it satisfies
$$|f(z_1, \dots, z_{i-1}, z_i, z_{i+1}, \dots, z_n) - f(z_1, \dots, z_{i-1}, z'_i, z_{i+1}, \dots, z_n)| \leq c\ .$$
Then
$$P\left(|f(Z_1, \dots, Z_n) - \mathbb{E}[f(Z_1, \dots, Z_n)]| \geq t\right) \leq 2\exp(-2t^2/(nc^2))\ .$$
(v) Apply McDiarmid's inequality both to $\mathcal{R}$ and to $\mathcal{R}_n$ to obtain the following high-probability bound: if $h(z) \in [0, 1]$ for all $z$ and all $h$, then with probability greater than $1 - \delta$, for any $h \in \mathcal{H}$,
$$\mathbb{E}[h(z)] \leq \frac{1}{n}\sum_{i=1}^n h(z_i) + 2\,\hat{\mathcal{R}}_n(\mathcal{H}) + 3\sqrt{\frac{\log(2\delta^{-1})}{2n}}\ ,$$
where we defined the empirical Rademacher complexity as
$$\hat{\mathcal{R}}_n(\mathcal{H}) := \mathbb{E}_{\varepsilon}\left[\sup_{h \in \mathcal{H}} \frac{1}{n}\langle \varepsilon, h(\mathcal{D})\rangle\right]\ .$$
The Rademacher complexity provides a powerful tool to control the generalisation error; see (Bach 2024, Section 4.5) for further properties and specific examples where hypothesis classes are given by balls in generic normed spaces.
3. Learning Lipschitz Functions in $d$ Dimensions
Let $\{(x_i, y_i)\}_{i=1,\dots,n}$ with $x_i \sim \mathrm{Unif}([-1, 1]^d)$ and $y_i = f(x_i)$, where $f$ is 1-Lipschitz: $|f(x) - f(y)| \leq \|x - y\|$. Show that in order to learn $f$ up to uniform accuracy $\epsilon$, i.e. $\mathbb{E}[\sup_{x \in [-1,1]^d} |f(x) - \hat{f}(x)|] \leq \epsilon$, where the expectation is with respect to the randomness of the training data, it is necessary and sufficient that $n = \Theta(\epsilon^{-d})$.
4. Linear vs Non-Linear Approximation in High Dimension
In this exercise, we will compare linear approximation schemes (associated with kernel methods) with non-linear approximation (associated with neural networks in the feature-learning regime).
Let $\mathcal{F} = L^2(\mathbb{R}^d, \nu)$ be the space of square-integrable functions with respect to a data distribution $\nu$, and consider $\mathcal{B} = \{\phi_m\}_{m \in \Lambda}$ an orthonormal basis of $\mathcal{F}$. For each subset $S \subseteq \Lambda$, we define $U_S := \mathrm{span}\{\phi_m;\ m \in S\}$. The linear approximation of $f \in \mathcal{F}$ is given by
$$\epsilon_S(f) := \min_{g \in U_S} \|f - g\|\ , \qquad g_S(f) := \arg\min_{g \in U_S} \|f - g\|\ .$$
Finally, given a target family $\mathcal{F}^* \subseteq \mathcal{F}$, the minimax $N$-term linear approximation error is defined as
$$\epsilon_N(\mathcal{F}^*) := \inf_{|S| \leq N} \sup_{f \in \mathcal{F}^*} \epsilon_S(f)\ .$$
(i) Show that $g_S(f)$ can be explicitly computed as $g_S(f) = \sum_{m \in S} \langle f, \phi_m\rangle\, \phi_m$.
Consider now an order in $\Lambda$, and $S = S_N$ the nested sets $S_N = \{1, \dots, N\}$.
(ii) Show that for $f \in \mathcal{F}$, we have $\lim_{N \to \infty} \epsilon_{S_N}(f) = 0$.
(iii) Give an example of a family $\mathcal{F}^*$ such that $\epsilon_N(\mathcal{F}^*)$ does not converge to $0$ as $N \to \infty$.
(iv) Show that if every $f \in \mathcal{F}^*$ is such that $\sum_m m^{2s} |\langle f, \phi_m\rangle|^2 < \infty$ for $s > 1/2$, then $\epsilon_N(\mathcal{F}^*) = O(N^{-s})$. This quantifies the previous approximation guarantee in the so-called Sobolev class.
We now define the non-linear approximation error as
$$\bar{\epsilon}_N(\mathcal{F}^*) := \sup_{f \in \mathcal{F}^*} \inf_{|S| = N} \epsilon_S(f)\ .$$
(v) Show that for any $N$ and any $\mathcal{F}^*$, $\bar{\epsilon}_N(\mathcal{F}^*) \leq \epsilon_N(\mathcal{F}^*)$.
Non-linear approximation schemes are thus more powerful than their linear counterparts. We conclude this exercise by illustrating how the gap between linear and non-linear approximation can be dramatic in the high-dimensional regime.
For that purpose, consider a rotationally-invariant data distribution $\nu$, i.e., such that for any $U \in O_d$ and any measurable set $A \subseteq \mathbb{R}^d$ we have $\nu(U(A)) = \nu(A)$. Consider an orthonormal basis of $L^2(\mathbb{R}^d, \nu)$ of the form $\mathcal{B} = \{\phi_{j,k}\}_{j \in \mathbb{N},\, k \leq N_{d,j}}$, where the orthogonal subspaces $V_j = \mathrm{span}(\phi_{j,k};\ k = 1, \dots, N_{d,j})$ are closed under rotations: if $f \in V_j$, then $f \circ U \in V_j$ for all $U$. $N_{d,j}$ is thus the dimension of each invariant subspace, and it can be assumed that $N_{d,j} \simeq d^j$ for sufficiently large $d$. These invariant subspaces are often referred to as harmonics. Finally, consider a function class $\mathcal{F}^*$ of the form
$$\mathcal{F}^* = \left\{f;\ \sum_j j^{2s} \|P_{V_j} f\|^2 < \infty;\ f(x) = h(x \cdot \theta)\ \text{for some}\ \theta \in S^{d-1}\right\}\ ,$$
where $P_V$ denotes the orthogonal projection onto $V$. $\mathcal{F}^*$ thus consists of high-dimensional functions which only depend on one-dimensional projections of the input, the so-called single-index models, and whose link function $h$ has polynomial decay of order $s$ in the harmonic decomposition.
(vi) Show that, for any basis $\mathcal{B}$ as above, the minimax linear approximation error $\epsilon_N(\mathcal{F}^*)$ is minimised at $S = \{(j, k);\ j \leq j^*(N)\}$, where $j^*(N)$ is the largest $j$ such that $\sum_{j' \leq j} N_{d,j'} < N$.
(vii) Deduce that the minimax linear approximation error $\epsilon_N(\mathcal{F}^*)$ has a staircase behavior: it remains constant as long as $j^*(N)$ does not increase.
(viii) Using the asymptotics $N_{d,j} \simeq d^j$, deduce that to reach approximation error $\epsilon$, one needs $N \simeq d^{\,\epsilon^{-1/s}}$ terms.
(ix) In contrast, show that the non-linear approximation error, when optimized over harmonic bases as above,
$$\bar{\epsilon}_N(\mathcal{F}^*) := \sup_{f \in \mathcal{F}^*} \inf_{\mathcal{B},\, |S| = N} \epsilon_S(f)\ ,$$
requires only $N \simeq \epsilon^{-1/s}$ terms.
5. Universal Approximation with Fourier Analysis
[From M. Telgarsky, DLT Section 2.3, and Barron 1993] In this exercise, we will derive the Universal Approximation Theorem for shallow neural networks using Fourier representations.
(i) Verify that for $f \in L^1(\mathbb{R}^d) \cap C^1$, its Fourier transform $\hat{f}(\xi) = \int f(x)\, e^{-2\pi i x \cdot \xi}\, dx$ is also in $L^1(\mathbb{R}^d)$, and moreover
$$\int \|\xi\|\, |\hat{f}(\xi)|\, d\xi < \infty\ .$$
(ii) Show the inverse Fourier representation $f(x) = \int \hat{f}(\xi)\, e^{2\pi i \xi \cdot x}\, d\xi$, where the equality holds point-wise.
(iii) Let $\hat{f}(\xi) = |\hat{f}(\xi)|\, e^{2\pi i \varphi_f(\xi)} \in \mathbb{C}$ be the polar representation. Show that
$$f(x) = \int \cos(2\pi(\xi \cdot x + \varphi_f(\xi)))\, |\hat{f}(\xi)|\, d\xi\ .$$
(iv) Using the Fundamental Theorem of Calculus, show that
$$\cos(2\pi(\xi \cdot x + \varphi_f(\xi))) = \cos(2\pi\varphi_f(\xi)) + 2\pi\int_0^{\|\xi\|} \left[\sin(2\pi(-b + \varphi_f(-\xi))) - \sin(2\pi(b + \varphi_f(\xi)))\right] 1(\xi \cdot x \geq b)\, db\ .$$
(v) Conclude that there exists a measure $\mu(b, \xi)$ satisfying $\int |\mu(b, \xi)|\, db\, d\xi \leq 4\pi \int \|\xi\|\, |\hat{f}(\xi)|\, d\xi < \infty$ such that
$$f(x) = \int \mu(b, \xi)\, 1(x \cdot \xi \geq b)\, db\, d\xi\ ;$$
in other words, we can express $f$ as an infinitely-wide shallow neural network with threshold activation function $x \mapsto 1(x \cdot \xi \geq b)$.
Notes
1. The study of ML beyond the i.i.d. assumption is the topic of Transfer, Continual, or Adversarial Learning; see the notes at the end of the chapter.
2. If $P$ admits a density (which we denote by $P$ by abuse of language) with respect to a base measure $\mu$ in $\mathcal{X} \times \mathcal{Y}$, so $P \ll \mu$, then $P_{y|x}$ admits a density with respect to the marginal $\mu_y(y) = \int_{\mathcal{X}} dP(x, y)$, with $dP_{y|x}(y) = Z^{-1} P(x, y)\, d\mu_y(y)$, where $Z = \int_{\mathcal{Y}} P(x, y)\, d\mu_y(y)$. We will gloss over the measure-theoretic considerations when dealing with conditional distributions, and assume throughout that all joint distributions are absolutely continuous with respect to the base measure of the domain.
3. Note that the denseness of a set is a property that depends on the choice of a topology. Consider $B$ the set of square-integrable, bounded scalar functions, and $A \subseteq B$ the set of bounded, compactly-supported functions. Then $A$ is dense in $B$ under the topology induced by $L^2(\mathbb{R})$, whereas it is not dense under the topology induced by $L^\infty(\mathbb{R})$.
4. Informally, a norm $\|x\|$ can be regarded as a "length" of the vector $x$. A Banach space is a complete vector space equipped with a norm.
5. 'Cursed' here refers to rates of approximation/estimation that have an exponential dependence on dimension, e.g. $\epsilon \simeq n^{-c/d}$, where $n$ refers to the statistical/computational resources, e.g. the number of neurons to approximate a target, or the number of datapoints to estimate it.
6. A function $f$ is in the Sobolev class $H^s(\mathbb{R}^d)$ if $f \in L^2(\mathbb{R}^d)$ and its generalised $s$-th order derivative is square-integrable: $\int \|\omega\|^{2s+1} |\hat{f}(\omega)|^2\, d\omega < \infty$, where $\hat{f}$ is the Fourier transform of $f$; see Chapter 7.