2
Foundations of Supervised Learning
Supervised learning requires handling approximation, estimation, and optimization errors.
Learning in high dimensions requires strong notions of regularity.
Standard function classes based on local and global smoothness do not scale efficiently to complex real data.
Machine learning, in essence, is concerned with predicting future outcomes
from past experience. In this chapter, we will formalize this goal, focusing
on supervised learning, arguably the simplest and most widespread machine
learning modality. Although it may be viewed as ‘merely’ doing curve-fitting,
it turns out that such fitting ability signals a profound interplay between high-dimensional statistics, optimization and mathematical analysis.
Indeed, as the input of a machine learning system becomes more and more complex, as in image, video or chemical data, guaranteeing that the model will preserve its predictive power on unseen data becomes increasingly hard. This corresponds to the colloquial curse of dimensionality, a term originally coined by Bellman in the 1950s that captures the exponential dependency on the input dimension of the cost of running a given algorithmic or statistical procedure to a desired performance level.
In order to beat the curse of dimensionality, it is therefore necessary to make
modeling assumptions that encode certain regularities about the data. In this
chapter we will review function classes based on classic notions of regularity
‘borrowed’ from mathematical analysis, which provide a clean mathematical
picture of high-dimensional learning, and set the backdrop for the geometric function classes of Chapter 4.
2.1 Main Ingredients of Supervised Learning
The starting point of supervised machine learning (and also of our GDL journey) is a set of $N$ observations $\{(x_i, y_i)\}_{i=1}^N$, where $x_i \in \mathcal{X}$ are the input observations and $y_i \in \mathcal{Y}$ are the labels to be predicted. The defining feature in this setup is that $\mathcal{X}$ is a high-dimensional space: one canonical example assumes $\mathcal{X} = \mathbb{R}^d$ to be a Euclidean space of large dimension $d \gg 1$. The label space $\mathcal{Y}$ might be either low- or high-dimensional; in the former case, $\mathcal{Y} = \mathbb{R}$ encompasses regression tasks, such as predicting the free energy of a chemical compound, as well as classification tasks $\mathcal{Y} = \{1, \dots, K\}$, such as image classification. In the latter, $\mathcal{Y} \simeq \mathcal{X}$ captures so-called structured prediction problems, where the output shares some characteristics with the input; for example, in medical imaging applications, the output might indicate suspicious regions on an MRI, or in dynamical systems, where $\mathcal{Y} = \mathcal{X}$ and the goal is to predict a high-dimensional state $X_{t+1} \in \mathcal{X}$ from the current state $X_t \in \mathcal{X}$.

Figure 2.1
Representative supervised learning tasks: (i) regression to a low-dimensional space, (ii) classification, (iii) structured prediction between high-dimensional spaces.
Data Distribution In order to formalize the predictive power of a model, it is necessary to model how data (both past and future) is generated. The standard assumption in supervised learning is that the pairs $(x_i, y_i)$ are drawn i.i.d. from an underlying data probability distribution $P$ defined over $\mathcal{X} \times \mathcal{Y}$, which will also be used to generate future data. It is important to emphasize that this is a simplifying assumption, made to establish insightful theoretical guarantees of future performance (or generalisation, as we will see next); many ML algorithms can operate and produce useful results in the absence of this i.i.d. property.$^{1}$ Another important remark is that this data distribution is not known to the Learner, so it cannot be leveraged during the training process; instead, it has intrinsic theoretical value to analyse the prediction performance of different learning algorithms.
Figure 2.2
Cartoon illustration of the joint distribution defining a supervised learning problem. The target function in this example is the conditional expectation $f(x) = \mathbb{E}_P[y\,|\,x]$.
Loss Function Besides the data distribution, another essential component to assess a ML model is the loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$, a non-negative function defined over the label space satisfying $\ell(y, y) = 0$ for any $y \in \mathcal{Y}$. For classification tasks, we may use $\ell(y, y') = 1_{y \neq y'}$, while for regression the squared loss $\ell(y, y') = (y - y')^2$ is the ubiquitous choice, owing to its fundamental role in statistics and signal processing. Now, given any function $f : \mathcal{X} \to \mathcal{Y}$, this loss can be used to assess the point-wise error $\ell(f(x), y)$ at a particular datapoint $(x, y)$, as well as the population risk (or error)
$$\mathcal{R}(f) := \mathbb{E}[\ell(f(x), y)] = \int_{\mathcal{X} \times \mathcal{Y}} \ell(f(x), y)\, dP(x, y)\ . \qquad (2.1)$$
This expectation may be re-written as $\mathcal{R}(f) = \int_{\mathcal{X}} \mathcal{R}_x(f(x))\, dP_x(x)$, where $P_x$ is the marginal distribution with respect to $x$, i.e. $P_x(A) := \int_{A \times \mathcal{Y}} dP(x, y)$ for any measurable set $A \subseteq \mathcal{X}$, and $\mathcal{R}_x(z) := \int_{\mathcal{Y}} \ell(z, y)\, dP_{y|x}(y)$ is the expected loss under the conditional distribution of $y$ given $x$.$^{2}$ Therefore, the function $f^*(x) := \arg\min_{y \in \mathcal{Y}} \mathcal{R}_x(y)$ is by definition the one achieving the smallest error, and is referred to as the Bayes optimal predictor. Note that for general data distributions $P$, the associated Bayes error will be strictly positive, due to the underlying uncertainty in the conditional distribution of $y$ given $x$.
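To make these definitions concrete, here is a minimal numerical sketch (the Gaussian linear model below is an illustrative assumption, not a setup from the text): it estimates by Monte-Carlo the population risk of the Bayes predictor $f^*(x) = \mathbb{E}_P[y\,|\,x]$ and of a trivial competitor, under an assumed joint distribution $P$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical joint distribution P over X x Y (an illustrative assumption):
# x ~ N(0, I_d) and y = <w*, x> + noise, so that for the squared loss the
# Bayes predictor is the conditional expectation f*(x) = <w*, x>, and the
# Bayes error equals the noise variance.
d, n_mc = 10, 200_000
w_star = rng.normal(size=d) / np.sqrt(d)
noise_std = 0.5

x = rng.normal(size=(n_mc, d))
y = x @ w_star + noise_std * rng.normal(size=n_mc)

def risk(f):
    """Monte-Carlo estimate of the population risk R(f) = E[(f(x) - y)^2]."""
    return np.mean((f(x) - y) ** 2)

bayes = lambda z: z @ w_star          # f*(x) = E[y | x]
zero = lambda z: np.zeros(len(z))     # a trivial competitor

print(f"R(f*) ~ {risk(bayes):.3f} (Bayes error = {noise_std**2:.2f})")
print(f"R(0)  ~ {risk(zero):.3f} (strictly larger)")
```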
Model Class While the Bayes predictor describes the optimal choice for any supervised learning task, it is an infeasible solution for several reasons. First, it is defined via the population loss, which is unknown to the Learner. Next, it defines a function $f^*$ that may be arbitrarily complex to describe, or even to approximate. Towards bridging these gaps, we first introduce our hypothesis class $\mathcal{F}$: a family of functions $f : \mathcal{X} \to \mathcal{Y}$, typically parametrised as $\mathcal{F} = \{f_\theta;\ \theta \in \Theta\}$. Most if not all of the examples in this book will consider $\mathcal{F}$ consisting of neural network parametrisations, where $\theta$ encodes the network weights. Learning thus happens in two stages: first we decide on the function class $\mathcal{F}$, for instance by selecting a certain neural network architecture, and next we determine the best hypothesis within $\mathcal{F}$. The excess risk of a hypothesis $f \in \mathcal{F}$ is defined as the deviation from the optimal Bayes risk, $\mathcal{R}(f) - \mathcal{R}(f^*)$. The infimum excess risk $\inf_{f \in \mathcal{F}} \mathcal{R}(f) - \mathcal{R}(f^*)$ captures the underlying ability of the function class to represent the solution of interest. That said, optimizing the excess risk over a class $\mathcal{F}$ still requires oracle access to the true data distribution. How can one overcome this limitation?
Empirical Risk Averaging the loss over the training set defines the empirical risk (or error)
$$\hat{\mathcal{R}}(f) := \frac{1}{N} \sum_{i=1}^N \ell(f(x_i), y_i)\ . \qquad (2.2)$$
Since the training set $\{(x_i, y_i)\}_i$ is a random sample, the empirical risk is a random function, whose expectation is precisely $\mathcal{R}(f)$. For any fixed $f$, $\hat{\mathcal{R}}(f)$ is an average of $N$ i.i.d. random variables $\ell(f(x_i), y_i)$. As such, from the law of large numbers we know that $\hat{\mathcal{R}}(f)$ converges almost surely to its expectation $\mathcal{R}(f)$ as $N \to \infty$; moreover, under mild moment assumptions, we also have the central-limit-theorem convergence $\sqrt{\tfrac{N}{\sigma_f^2}}\,(\hat{\mathcal{R}}(f) - \mathcal{R}(f)) \Rightarrow \mathcal{N}(0, 1)$, where $\sigma_f^2 = \mathrm{Var}(\ell(f(x), y))$, capturing the familiar scale $O(1/\sqrt{N})$ of the statistical fluctuations. This classic asymptotic behavior can be further quantified with non-asymptotic guarantees using tools from measure concentration. For instance, assuming $|\ell(f(x), y)| \leq C_f$ is a bounded random variable, Hoeffding's inequality asserts that
$$P\left(\left|\hat{\mathcal{R}}(f) - \mathcal{R}(f)\right| \geq t\right) \leq 2 \exp\left(-\frac{1}{2}\, C_f^{-2}\, t^2 N\right)\ ,$$
indicating that fluctuations $t \gg \frac{1}{\sqrt{N}}$ are extremely unlikely.
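These fluctuations are easy to probe numerically. The following minimal sketch (which assumes synthetic, uniformly distributed loss values rather than any particular model) compares the empirical tail probability of $|\hat{\mathcal{R}}(f) - \mathcal{R}(f)|$ at a fixed hypothesis with the corresponding Hoeffding bound for variables of range $C$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Fixed hypothesis f with a bounded loss; here we simply draw the loss values
# l(f(x_i), y_i) ~ Unif[0, C] directly (an illustrative assumption).
C, N, n_trials = 1.0, 500, 20_000
losses = rng.uniform(0, C, size=(n_trials, N))

R_hat = losses.mean(axis=1)    # one empirical risk per independent trial
R = 0.5 * C                    # population risk of this synthetic loss

t = 0.05
empirical_tail = np.mean(np.abs(R_hat - R) >= t)
# Hoeffding bound for variables with range C: 2 exp(-2 N t^2 / C^2).
bound = 2 * np.exp(-2 * N * t**2 / C**2)
print(f"P(|R_hat - R| >= {t}) ~ {empirical_tail:.1e}, Hoeffding bound {bound:.3f}")
```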
A seemingly naïve strategy to obtain good predictors in $\mathcal{F}$ is thus to minimise the empirical risk, rather than the population risk. However, before claiming victory by means of this simple argument, it is important to realize that we need more powerful control of statistical fluctuations. Indeed,
we have so far focused on measuring the fluctuations at a fixed hypothesis $f$, but the training process is precisely about finding the right one!

The relevant notion is not to compare how individual random variables $\hat{\mathcal{R}}(f)$ fluctuate around their expectations $\mathcal{R}(f)$, but how the random function $\hat{\mathcal{R}}$ fluctuates around its expectation $\mathcal{R}$. We thus need to control fluctuations uniformly over the class $\mathcal{F}$, that is, $\sup_{f \in \mathcal{F}} |\hat{\mathcal{R}}(f) - \mathcal{R}(f)|$. As it turns out, such uniform control can be achieved under fairly general conditions, by appropriately constraining the hypothesis class, as we shall see next. The resulting Empirical Risk Minimization (ERM) is in fact a very powerful algorithm, and the main statistical paradigm to perform supervised learning.
Empirical Risk Minimisation Given a hypothesis class $\mathcal{F} = \{f_\theta;\ \theta \in \Theta\}$ parametrised by $\theta$, the ERM defines an estimator of the form $f_{\hat{\theta}}$, where
$$\hat{\theta} \in \arg\min_{\theta \in \Theta} \hat{\mathcal{R}}(f_\theta) = \frac{1}{N} \sum_{i=1}^N \ell(f_\theta(x_i), y_i)\ . \qquad (2.3)$$
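As a concrete illustration, the sketch below instantiates equation 2.3 for a linear hypothesis class with the squared loss (an assumed setup for illustration), minimising the empirical risk by plain gradient descent.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative assumption: linear hypothesis class f_theta(x) = <theta, x>
# with the squared loss, so that ERM is a least-squares problem, solved here
# by gradient descent on the empirical risk (2.3).
d, N = 5, 200
theta_star = rng.normal(size=d)
X = rng.normal(size=(N, d))
y = X @ theta_star + 0.1 * rng.normal(size=N)

def emp_risk(theta):
    return np.mean((X @ theta - y) ** 2)

theta, lr = np.zeros(d), 0.1
for _ in range(500):
    grad = 2 / N * X.T @ (X @ theta - y)   # gradient of the empirical risk
    theta -= lr * grad

print(f"empirical risk at ERM solution: {emp_risk(theta):.4f}")
print(f"parameter error ||theta - theta*||: {np.linalg.norm(theta - theta_star):.3f}")
```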
In words, we are searching for a parameter in the domain $\Theta$ that best explains the observed data, in the sense of the chosen loss $\ell$. While this certainly seems a necessary condition to obtain a good predictor, to what extent is it also sufficient? By expressing the population risk $\mathcal{R}(f)$ as $\mathcal{R}(f) = \hat{\mathcal{R}}(f) + (\mathcal{R}(f) - \hat{\mathcal{R}}(f))$, we can think of a good hypothesis as a function $f$ such that (i) the empirical risk $\hat{\mathcal{R}}(f)$ is small, and (ii) the fluctuation $\mathcal{R}(f) - \hat{\mathcal{R}}(f)$ is small. While the first property is what ERM is explicitly designed to achieve, and can be readily assessed by measuring the empirical risk on the available data, the second relies on structural properties of the data distribution, the hypothesis class and the optimization algorithm. The key observation that we develop next is that these two sources of error are typically in tension, and one needs to trade them off: the familiar bias-variance tradeoff.
Having small (or even zero) training error amounts to solving (approximately or exactly) in $\theta$ a system of $N$ equations: $f_\theta(x_i) = y_i$ for all $i = 1, \dots, N$. The ability to solve systems of equations, even when they are non-linear, depends fundamentally on the number of degrees of freedom at our disposal, i.e. on the intrinsic dimensionality of the model class, here roughly captured by the number of parameters encoded in $\theta$. In other words, when the number of parameters is large relative to $N$, the ERM operates in the so-called overparametrised regime, and one should expect many parameter choices that fit the data equally well. However, recall that we are using the empirical risk $\hat{\mathcal{R}}$ as a proxy for the population risk $\mathcal{R}$, the true functional we are interested in minimising. Thus, how can one make an informed choice amongst all parameters in $\Theta$ that explain (or interpolate) the data equally well?
Regularization The folklore answer, Occam's razor, suggests picking the 'simplest' solution amongst those that interpolate the data. More formally, this is achieved in statistical learning via regularisation, or capacity control. Given a non-negative penalty $\gamma : \Theta \to \mathbb{R}$ encoding the 'complexity' of the associated hypothesis $f_\theta$, we consider the regularised form of equation 2.3:
$$\hat{\theta}_\delta \in \arg\min_{\theta \in \Theta;\ \gamma(\theta) \leq \delta} \hat{\mathcal{R}}(f_\theta)\ . \qquad (2.4)$$
In words, we introduce a hyper-parameter $\delta$ controlling the maximum allowed complexity of our estimator. As the reader might anticipate, this parameter controls the fundamental tradeoff between the ability to explain the data (bias) and the danger of overfitting to it (variance). It is important to emphasize that regularization and overparametrisation can (and often do) coexist. In statistical terms, the capacity of the model class is now encoded in the single scalar $\delta$. We will see in Section 2.3 some examples of regularisation in neural networks, with their associated norms $\gamma(\theta)$. Before that, let us formalize the tradeoff achieved by regularisation.
Example: polynomial regression on univariate data Figure 2.3 displays the familiar univariate example of fitting a non-polynomial target function observed at $n$ points with polynomials of increasing degree $r$. The bias-variance tradeoff is illustrated by the transition from the underfitting regime where $r < n$ to the overfitting regime where $r > n$. The transition $r = n$ captures a singular situation where unregularised ERM becomes ill-conditioned, and the resulting population risk blows up; see Figure 2.5.
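The following minimal sketch reproduces the qualitative behavior of this example (the sinusoidal target and sample size are illustrative assumptions): as the degree $r$ grows, the training error of unregularised polynomial ERM decreases monotonically while the test error eventually explodes.

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative assumptions: a sinusoidal target observed at n points, fit by
# unregularised polynomial ERM of increasing degree r (np.polyfit).
n = 12
x_train = np.sort(rng.uniform(-1, 1, size=n))
y_train = np.sin(2 * np.pi * x_train)
x_test = np.linspace(-1, 1, 2000)
y_test = np.sin(2 * np.pi * x_test)

for r in [2, 5, 9, 11]:
    coeffs = np.polyfit(x_train, y_train, deg=r)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {r:2d}: train {train_err:.2e}, test {test_err:.2e}")
```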
2.2 Decomposition of Risk
Recall that the ultimate goal of supervised learning is to perform well on unseen data, or, more precisely, to construct an estimator $\hat{f}$ from the available data that has small excess risk $\mathcal{R}(\hat{f}) - \mathcal{R}(f^*)$. Let us consider an estimator $\hat{f} = f_{\hat{\theta}}$ that approximately solves the regularised ERM from equation 2.4, and let us now quantify its excess risk.
We start by adding and subtracting the best model (in terms of population risk) achievable under the constraint $\gamma(\theta) \leq \delta$:
$$\mathcal{R}(\hat{f}) - \mathcal{R}(f^*) = \mathcal{R}(\hat{f}) - \inf_{\gamma(\theta) \leq \delta} \mathcal{R}(f_\theta) + \underbrace{\inf_{\gamma(\theta) \leq \delta} \mathcal{R}(f_\theta) - \mathcal{R}(f^*)}_{\text{approximation error}}\ . \qquad (2.5)$$
We have written the excess error (which is non-negative by definition) as the sum of two non-negative terms. The second one, termed approximation error, measures our ability to design accurate enough hypothesis classes with the right 'inductive biases', and penalizes us for reducing the capacity of our model class via the restriction $\gamma(\theta) \leq \delta$.

Figure 2.3
Polynomial regression on univariate functions. Regressions with polynomials of small degree underfit, and the error is dominated by bias, while polynomials of high degree overfit, and the error is dominated by variance.

Further, by exploiting the fact that $\hat{f}$ approximately solves equation 2.4, we decompose the first term as
$$\mathcal{R}(\hat{f}) - \inf_{\gamma(\theta) \leq \delta} \mathcal{R}(f_\theta) = \left[\mathcal{R}(\hat{f}) - \hat{\mathcal{R}}(\hat{f})\right] + \left[\inf_{\gamma(\theta) \leq \delta} \hat{\mathcal{R}}(f_\theta) - \inf_{\gamma(\theta) \leq \delta} \mathcal{R}(f_\theta)\right] + \left[\hat{\mathcal{R}}(\hat{f}) - \inf_{\gamma(\theta) \leq \delta} \hat{\mathcal{R}}(f_\theta)\right]$$
$$\leq \underbrace{2 \sup_{\gamma(\theta) \leq \delta} \left|\mathcal{R}(f_\theta) - \hat{\mathcal{R}}(f_\theta)\right|}_{\text{statistical error}} + \underbrace{\hat{\mathcal{R}}(\hat{f}) - \inf_{\gamma(\theta) \leq \delta} \hat{\mathcal{R}}(f_\theta)}_{\text{optimization error}}\ ,$$
since $\inf_{\gamma(\theta) \leq \delta} \hat{\mathcal{R}}(f_\theta) - \inf_{\gamma(\theta) \leq \delta} \mathcal{R}(f_\theta) \leq \hat{\mathcal{R}}(f_{\theta^*}) - \mathcal{R}(f_{\theta^*})$, where $\theta^* \in \arg\min_{\gamma(\theta) \leq \delta} \mathcal{R}(f_\theta)$.
Besides the previous approximation error, we now identify two other fundamental sources of error: the statistical error penalizes us for optimizing the 'wrong' risk functional, that is, the empirical ($\hat{\mathcal{R}}$) rather than the population ($\mathcal{R}$) objective, and is captured here through the uniform deviation between these two functionals within the ball $\gamma(\theta) \leq \delta$. The remaining term is the optimization error, since in general equation 2.4 does not admit closed-form solutions. Moreover, in general the empirical risk $\hat{\mathcal{R}}$ is a non-convex function of $\theta$, which creates potential roadblocks in the associated minimisation problem. Figure 2.4 illustrates the resulting decomposition of risk.

Figure 2.4
Decomposition of risk into approximation, estimation and optimization errors. The approximation error arises when constraining/regularising the risk over a ball $\mathcal{F}_\delta$. The estimation error corresponds to the fact that the empirical risk $\hat{\mathcal{R}}$ deviates from the population risk $\mathcal{R}$. Finally, the optimization error captures the inability of optimization methods to solve generic non-convex problems.
The important takeaway from this error decomposition is that in any supervised learning task, the complexity parameter $\delta$ mediates a fundamental tradeoff between the different sources of error, even in overparametrised regimes where the number of parameters (or neurons) far exceeds $N$. Generally speaking, a small complexity parameter $\delta$ makes the statistical error smaller (since we need to control fluctuations between two functions over a smaller domain), while making the approximation error larger. We may thus interpret this as the bias-variance tradeoff arising in classic model selection in statistics (Bottou and Bousquet 2007).
2.3 Over-Parametrised Regime
In the context of neural networks, or more generally models having many more parameters than data-points, one may wonder how this tradeoff plays out.

The so-called double-descent phenomenon (Belkin et al. 2019) seemingly goes against this bias-variance tradeoff, whereby highly over-parametrised neural networks exhibit excellent generalisation performance; see Figure 2.5. In fact, the double-descent phenomenon is consistent with the bias-variance tradeoff described above, but it serves as a cautionary tale that the 'effective' complexity of a hypothesis class is generally not well captured by simply counting the number of parameters.

Figure 2.5
Illustration of the double-descent phenomenon, whereby the population risk of unregularised ERM undergoes a phase transition as the number of parameters crosses the number of datapoints, with a second 'descent' phase.
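The phenomenon can be reproduced in a few lines with minimum-norm least squares on random features; the setup below (random ReLU features of a linear target) is an illustrative assumption, not the experiment of Belkin et al. 2019.

```python
import numpy as np

rng = np.random.default_rng(4)

# Illustrative assumptions: min-norm least squares on random ReLU features of
# a linear target; the test risk typically peaks as the number of features M
# crosses the number of samples N, then descends again.
d, N, N_test = 20, 100, 2000
w_star = rng.normal(size=d) / np.sqrt(d)
X, Xt = rng.normal(size=(N, d)), rng.normal(size=(N_test, d))
y = X @ w_star + 0.1 * rng.normal(size=N)
yt = Xt @ w_star

W = rng.normal(size=(d, 2000)) / np.sqrt(d)       # shared random weights
for M in [10, 50, 90, 100, 110, 200, 1000, 2000]:
    F = np.maximum(X @ W[:, :M], 0)               # train features
    Ft = np.maximum(Xt @ W[:, :M], 0)             # test features
    theta = np.linalg.pinv(F) @ y                 # min-norm least squares
    print(f"M = {M:4d}: test risk {np.mean((Ft @ theta - yt) ** 2):.3f}")
```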
A successful learning scheme thus needs to encode the appropriate notion of regularity or inductive bias for $f$, imposed through the construction of the function class $\mathcal{F}$ and the use of regularisation. We briefly introduce this concept in the following section.

Modern machine learning operates with large, high-quality datasets, which, together with appropriate computational resources, motivate the design of rich function classes $\mathcal{F}$ with the capacity to interpolate such large data. This mindset plays well with neural networks, since even the simplest choices of architecture yield a dense class of functions.
A set $A \subseteq B$ is said to be dense in $B$ if its closure satisfies $\bar{A} := \{\lim_{i \to \infty} a_i : a_i \in A\} = B$. This implies that any point in $B$ is arbitrarily close to a point in $A$.$^{3}$ A typical Universal Approximation result shows that the class of functions represented, e.g., by a two-layer perceptron, $f(x) = c^\top \mathrm{sign}(Ax + b)$, is dense in the space of continuous functions on $\mathbb{R}^d$.
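The following univariate sketch illustrates this density statement numerically (the target and grid of thresholds are illustrative assumptions): fitting only the output weights of a fixed bank of step functions already approximates a continuous function closely.

```python
import numpy as np

# Illustrative sketch of density, in the univariate case: approximate a
# continuous target on [0, 1] by a combination of step functions
# sign(x - b_m), i.e. a two-layer perceptron with fixed thresholds b_m.
M = 50
b = np.linspace(0, 1, M)
x = np.linspace(0, 1, 1000)
target = np.sin(4 * np.pi * x) + x

Phi = np.sign(x[:, None] - b[None, :])            # step-function features
c, *_ = np.linalg.lstsq(Phi, target, rcond=None)  # fit the output layer
err = np.max(np.abs(Phi @ c - target))
print(f"sup-norm error with {M} steps: {err:.3f}")
```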
The capacity to approximate almost arbitrary functions is the subject of various Universal Approximation Theorems; several such results were proved and popularised in the 1990s by applied mathematicians and computer scientists (see e.g. Cybenko 1989; Hornik 1991; Barron 1993; Leshno et al. 1993; Maiorov 1999; Pinkus 1999).
Figure 2.6
Multilayer Perceptrons (Rosenblatt 1958), the simplest feed-forward neural networks,
are universal approximators: with just one hidden layer, they can represent combinations of step functions, allowing them to approximate any continuous function with arbitrary precision.
Universal Approximation, however, does not imply an absence of inductive bias. Given a hypothesis space $\mathcal{F}$ with universal approximation, we can define a complexity measure $\gamma : \mathcal{F} \to \mathbb{R}_+$ (the same notion that we introduced at the end of Section 2.1), and redefine our interpolation problem as
$$\tilde{f} \in \arg\min_{g \in \mathcal{F}} \gamma(g) \quad \text{s.t.} \quad g(x_i) = f(x_i)\ \text{for}\ i = 1, \dots, N,$$
i.e., we are looking for the most regular functions within our hypothesis class. For standard function spaces, this complexity measure can be defined as a norm.$^{4}$ In this sense, and in light of the risk decomposition outlined in Section 2.2, we can revisit the Universal Approximation Theorems with a more skeptical eye: they 'just' convey the idea that the approximation error $\inf_{\gamma(f) \leq \delta} \mathcal{R}(f) - \mathcal{R}(f^*)$ converges to zero as $\delta \to \infty$. Stated as such, they lack quantitative power to understand the resulting tradeoff between approximation and statistical errors.
In order to overcome this limitation, it is thus necessary to quantify approximation rates. In low dimensions, splines are a workhorse for function approximation. They can be formulated as above, with a norm capturing the classical notion of smoothness, such as the squared norm of second derivatives $\int |f''(x)|^2\, dx$ for cubic splines. This representation allows us to quantify approximation errors by controlling the norm of the spline approximation.
In the case of neural networks, the complexity measure $\gamma$ can be expressed in terms of the network weights, i.e. $\gamma(f_\theta) = \gamma(\theta)$. The $L_2$-norm of the network weights, known as weight decay, or the so-called path-norm (Neyshabur, Tomioka, and Srebro 2015), are popular choices in the deep learning literature. From a Bayesian perspective, such complexity measures can also be interpreted as the negative log of the prior for the function of interest. The regularisation term $\gamma(\theta)$ may take different forms across different ML systems. For instance, for linear models of the form $f_\theta(x) = \theta^\top \Phi(x)$ with $\theta \in \mathbb{R}^m$, we may choose an $L_p$ norm $\gamma(\theta) = \|\theta\|_p^p = \sum_{j=1}^m |\theta_j|^p$, with $p = 1, 2$ capturing the familiar settings of sparse and ridge regression, respectively.
It is important to emphasize that the regularization framework encompasses not only methods that explicitly consider regularisation, in either constrained (as above) or penalized form (by adding the associated Lagrange multipliers, leading to $\min_{\theta \in \Theta} \hat{\mathcal{R}}(f_\theta) + \lambda \gamma(\theta)$), but also methods where the regularization comes implicitly through the choice of optimization scheme. For example, it is well-known that gradient descent initialized at zero on an under-determined least-squares objective will choose the interpolating solution with minimal $L_2$ norm. The extension of such implicit regularisation results to modern neural networks is the subject of current studies (see e.g. Gunasekar et al. 2017; Blanc et al. 2020; Shamir and Vardi 2020; Razin and Cohen 2020).
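This implicit bias is straightforward to verify numerically; the sketch below (with an assumed random under-determined system) compares the gradient-descent solution initialized at zero with the minimum-norm interpolator given by the pseudo-inverse.

```python
import numpy as np

rng = np.random.default_rng(6)

# Illustrative sketch: gradient descent initialized at zero on an
# under-determined least-squares problem converges to the interpolating
# solution of minimal L2 norm, which is exactly the pseudo-inverse solution.
n, m = 20, 100                              # fewer equations than unknowns
X = rng.normal(size=(n, m))
y = rng.normal(size=n)

theta = np.zeros(m)
lr = 1.0 / np.linalg.norm(X, 2) ** 2        # safe step size
for _ in range(20_000):
    theta -= lr * X.T @ (X @ theta - y)

theta_mn = np.linalg.pinv(X) @ y            # min-norm interpolator
print(f"residual: {np.linalg.norm(X @ theta - y):.2e}")
print(f"distance to min-norm solution: {np.linalg.norm(theta - theta_mn):.2e}")
```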
The important takeaway is that regularization and overparametrisation are
compatible in modern DL via the implicit bias of gradient-based learning, and
this bias, or ‘prior’, is often captured as a certain norm in the space of functions.
All in all, a natural question arises: how to define effective priors that capture
the expected regularities and complexities of real-world prediction tasks, and
that enable a quantification of the approximation error?
2.4 The Curse of Dimensionality In High-Dimensional Learning
While interpolation in low dimensions (with $d = 1, 2$ or $3$) is a classic signal processing task with very precise mathematical control of estimation errors using increasingly sophisticated regularity classes (such as spline interpolants, wavelets, curvelets, or ridgelets), the situation for high-dimensional problems is entirely different. In this generic high-dimensional regime, one is often confronted with an impossible tradeoff between 'cursed' approximation rates and 'cursed' estimation rates.$^{5}$
In order to convey the essence of the idea, let us consider a classical notion of regularity that can be easily extended to high dimensions: 1-Lipschitz functions $f : \mathcal{X} \to \mathbb{R}$, i.e. functions satisfying $|f(x) - f(x')| \leq \|x - x'\|$ for all $x, x' \in \mathcal{X}$. This hypothesis says that if we perturb the input $x$ slightly (as measured by the norm $\|x - x'\|$), the output $f(x)$ is not allowed to change much. This is a weaker form of regularity than Sobolev smoothness, which additionally asks for $f$ to have $s$ derivatives that are integrable. If our only knowledge of the target function $f$ is that it is 1-Lipschitz, how many observations do we expect to require to ensure that our estimate $\tilde{f}$ will be close to $f$? Figure 2.7 reveals that the general answer is necessarily exponential in the dimension $d$, signaling that the Lipschitz class grows 'too quickly' as the input dimension increases: in many applications with even modest dimension $d$, the number of samples would be bigger than the number of atoms in the universe. The situation is no better if one replaces the Lipschitz class by a global smoothness hypothesis, such as the Sobolev class $H^s(\mathbb{R}^d)$.$^{6}$ Indeed, classic results (Tsybakov 2008) establish a minimax rate of approximation and learning for the Sobolev class of the order $\epsilon^{-d/s}$, showing that the extra smoothness assumptions on $f$ only improve the statistical picture when $s \gtrsim d$, an unrealistic assumption in practice. Thus, if 'classic' notions of regularity are not adapted to perform high-dimensional learning, how should one replace them with effective ones?
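Before answering, the geometric argument of Figure 2.7 can be probed numerically; the sketch below (uniform samples on $[0,1]^d$, an assumed setup) measures the typical distance from a fresh query point to its nearest sample, the accuracy scale available to any estimator of a generic 1-Lipschitz target.

```python
import numpy as np

rng = np.random.default_rng(7)

# Illustrative sketch: with N uniform samples on [0,1]^d, the distance from a
# fresh query to its nearest sample decays like N^(-1/d); for a 1-Lipschitz
# target this is the accuracy scale, so N must grow exponentially with d.
N, n_queries = 1000, 200
for d in [1, 2, 5, 10, 20]:
    X = rng.uniform(size=(N, d))
    q = rng.uniform(size=(n_queries, d))
    dists = np.linalg.norm(q[:, None, :] - X[None, :, :], axis=-1).min(axis=1)
    print(f"d = {d:2d}: mean nearest-neighbour distance {dists.mean():.3f}")
```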
2.5 Linear Approximation With Positive-Definite Kernels
A powerful framework to define flexible approximation spaces in high dimension is via positive semidefinite kernels: a psd kernel $k(x, x')$ is a symmetric mapping on $\mathcal{X} \times \mathcal{X}$ satisfying $\sum_{i,j=1}^n \alpha_i \alpha_j k(x_i, x_j) \geq 0$ for any $\alpha \in \mathbb{R}^n$ and any collection of points $(x_1, \dots, x_n)$. A kernel captures an a-priori notion of similarity in the input space, e.g. a kernel of the form $k(x, x') = x^\top x'$ in $\mathcal{X} = \mathbb{R}^d$ measures linear correlation, while a Gaussian kernel $k(x, x') = \exp(-\frac{1}{2\tau^2}\|x - x'\|^2)$ measures whether $x, x'$ are close at a certain scale, in the sense that $k(x, x') \approx 1$ whenever $\|x - x'\| \lesssim \tau$ and $k(x, x') \approx 0$ otherwise.
To each positive semidefinite kernel on $\mathcal{X}$ one can associate a Hilbert space of functions defined over $\mathcal{X}$ with additional structure, the so-called reproducing property. In essence, there exists a (possibly infinite-dimensional) Hilbert space $\mathcal{H}$ and a 'feature map' $\varphi : \mathcal{X} \to \mathcal{H}$ such that the kernel can be viewed as an inner product: $k(x, x') = \langle \varphi(x), \varphi(x') \rangle_{\mathcal{H}}$. The reproducing property refers to the fact that in this Hilbert space, function evaluations can be seen as inner products: for each $f \in \mathcal{H}$ and $x \in \mathcal{X}$, one has $f(x) = \langle f, \varphi(x) \rangle_{\mathcal{H}}$. Given that in supervised learning one acquires information about the target function $f^*$ via (noisy) evaluations on the training set, i.e. $y_i = f^*(x_i) + \xi_i$, the reproducing property allows us to view these measurements as bona-fide projections of the target onto certain directions, spanned precisely by the data features $\varphi(x_i) \in \mathcal{H}$.

Figure 2.7
We consider a Lipschitz function $f(x) = \sum_{j=1}^{2^d} z_j\, \phi(x - x_j)$, where $z_j = \pm 1$, $x_j \in \mathbb{R}^d$ is placed in each quadrant, and $\phi$ is a locally supported Lipschitz 'bump'. Unless we observe the function in most of the $2^d$ quadrants, we will incur a constant error in predicting it. This simple geometric argument can be formalised through the notion of Maximum Discrepancy (Luxburg and Bousquet 2004), defined for the Lipschitz class as $\kappa(d) = \mathbb{E}_{x, x'}\left[\sup_{f \in \mathrm{Lip}(1)} \frac{1}{N}\sum_l f(x_l) - \frac{1}{N}\sum_l f(x'_l)\right] \asymp N^{-1/d}$, which measures the largest expected discrepancy between two independent $N$-sample expectations. Ensuring that $\kappa(d) \leq \epsilon$ requires $N = \Theta(\epsilon^{-d})$; the corresponding sample $\{x_l\}_l$ defines an $\epsilon$-net of the domain. For a $d$-dimensional Euclidean domain of diameter 1, its size grows exponentially as $\epsilon^{-d}$.
The resulting Reproducing Kernel Hilbert Space (RKHS) $\mathcal{H}$ thus consists of functions $f_\theta : \mathcal{X} \to \mathbb{R}$ that are linear in $\theta$ when expressed using the representation $f_\theta(x) = \langle \theta, \varphi(x) \rangle_{\mathcal{H}}$, and comes with a Hilbert regularization norm $\gamma(f_\theta) = \|\theta\|_{\mathcal{H}}$. The Hilbertian structure enables a precise control of all sources of error, including approximation, generalisation and optimization, even in the high-dimensional regime. The initial choice of kernel thus captures the 'inductive bias' associated with the corresponding RKHS. For instance, in the previous Gaussian kernel example, the associated RKHS norm $\|f_\theta\|_{\mathcal{H}}$ corresponds to a weighted Fourier norm of the form $\|f_\theta\|_{\mathcal{H}}^2 \simeq \int |\hat{f}(\omega)|^2 \exp\left(\frac{\tau^2}{2}\|\omega\|^2\right) d\omega$, which indicates that $\mathcal{H}$ contains only infinitely differentiable smooth functions with exponential Fourier decay.
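As a concrete instance of this framework, here is a minimal kernel ridge regression sketch with the Gaussian kernel (target, bandwidth $\tau$ and ridge parameter $\lambda$ are illustrative assumptions); the estimator takes the standard form $\hat{f}(x) = \sum_i \alpha_i k(x, x_i)$ with $\alpha = (K + \lambda\, \mathrm{Id})^{-1} y$.

```python
import numpy as np

rng = np.random.default_rng(8)

# Illustrative kernel ridge regression with the Gaussian kernel; the target,
# bandwidth tau and ridge lam are assumptions for illustration. The estimator
# is f(x) = sum_i alpha_i k(x, x_i), with alpha = (K + lam * I)^(-1) y.
def gaussian_kernel(A, B, tau=0.5):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * tau**2))

d, N, N_test, lam = 2, 200, 500, 1e-3
f = lambda x: np.sin(3 * x[:, 0]) * np.cos(2 * x[:, 1])
X = rng.uniform(-1, 1, (N, d))
Xt = rng.uniform(-1, 1, (N_test, d))
y = f(X) + 0.1 * rng.normal(size=N)

K = gaussian_kernel(X, X)
alpha = np.linalg.solve(K + lam * np.eye(N), y)
pred = gaussian_kernel(Xt, X) @ alpha
print(f"test risk: {np.mean((pred - f(Xt)) ** 2):.4f}")
```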
In the context of Neural Networks, some popular kernels are obtained by building feature maps $\varphi(x)$ using infinitely wide neural networks, and by approximating them using Random Features. For instance, by considering $k(x, x') = \mathbb{E}_{w \sim \mu}[\sigma(w^\top x)\,\sigma(w^\top x')]$, where $\mu$ is a fixed probability distribution in $\mathbb{R}^d$ and $\sigma$ an arbitrary activation function, the associated Random Feature approximation is obtained by a Monte-Carlo estimate of $k$:
$$\hat{k}(x, x') = \frac{1}{M}\sum_{m=1}^M \sigma(w_m^\top x)\,\sigma(w_m^\top x')\ , \quad \text{with}\ w_m \overset{\text{iid}}{\sim} \mu\ .$$
The associated (random) RKHS is thus generated by the random features $\{\sigma(w_m^\top \cdot)\}_{m \leq M}$, which can be interpreted in this case as a width-$M$ shallow neural network with random weights $w_m$. The resulting approximation space is then obtained by linear combinations of these random features; in other words, a shallow neural network where only the last output layer is trained. Another popular kernel method associated with neural networks is the so-called Neural Tangent Kernel (NTK), which also considers the gradient features $\{\nabla_w \sigma(w_m^\top \cdot)\}_m$ and captures the training dynamics in certain overparametrised scalings (the so-called lazy regime).
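The Monte-Carlo nature of this construction is easy to check numerically; in the sketch below (with $\sigma = \mathrm{ReLU}$ and $\mu = \mathcal{N}(0, \mathrm{Id})$ as assumed choices), $\hat{k}(x, x')$ concentrates around its infinite-width value as $M$ grows.

```python
import numpy as np

rng = np.random.default_rng(9)

# Illustrative sketch of the random-feature Monte-Carlo approximation of
# k(x, x') = E_w[sigma(w.x) sigma(w.x')], with sigma = ReLU and mu = N(0, I_d)
# as assumed choices; k_hat concentrates around k as the width M grows.
d = 5
x, x2 = rng.normal(size=d), rng.normal(size=d)

W_ref = rng.normal(size=(1_000_000, d))     # large sample as reference
k_ref = np.mean(np.maximum(W_ref @ x, 0) * np.maximum(W_ref @ x2, 0))

for M in [10, 100, 1000, 10_000]:
    W = rng.normal(size=(M, d))             # w_m ~ mu, m = 1..M
    k_hat = np.mean(np.maximum(W @ x, 0) * np.maximum(W @ x2, 0))
    print(f"M = {M:5d}: k_hat = {k_hat:.3f} (reference {k_ref:.3f})")
```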
While the RKHS framework offers wide flexibility through the choice of the kernel, it is inherently a linear approximation framework where there is no representation (or 'feature') learning, since the choice of the kernel (and therefore the associated feature maps) is made before seeing the training data. In other words, kernel methods are unable to extract meaningful information in high dimensions unless one already has a preconceived idea of 'where to look'. To illustrate the limitations of kernels in high dimensions, it is useful to consider isotropic kernels, i.e. $k(x, x') = k(Ux, Ux')$ for any rotation $U \in O_d$. In words, these kernels have no preconceived notion of privileged directions in the input space. Under these conditions, the kernel ridge regression estimator $\hat{f}$ associated with a dataset $\{(x_i, y_i = f(x_i))\}_{i \leq n}$ is essentially equivalent to performing polynomial regression on $f$: if $n \approx d^k$, then $\hat{f}$ is the best degree-$k$ polynomial approximation of $f$ (Misiakiewicz and Montanari 2023). The model is thus unable to adapt to any form of low-dimensional structure present in the target function $f$, and is only effective to learn targets with global smoothness.

In conclusion, such a limitation is incompatible with many aspects of modern DL, such as the pre-training paradigm, in which a large training set is used to learn certain features, which are then transferred to a downstream task via a fine-tuning stage.
2.6 From Linear To Non-Linear Approximation
The alternative feature learning framework corresponds instead to non-linear approximation. While the previous kernel approximation framework considers a fixed family of feature maps $\{\varphi_{w_j}(x)\}_{j \leq \dim(\mathcal{H})}$ and approximates a target $f$ as $f \approx \sum_j \theta_j \varphi_{w_j}$ by adjusting the weights $\theta_j$, in non-linear approximation one also adjusts the parameters $w_j$ defining the features. The additional degrees of freedom brought by adjusting $w_j$ besides $\theta_j$ bring additional adaptivity, which can become dramatic in the high-dimensional regime. In other words, the non-linear approximation aspect of feature learning improves the approximation properties of the model, at the expense of a more challenging optimization. Indeed, the joint optimization of $w_j$ and $\theta_j$ defines a non-convex problem which is generally challenging to analyse, yet often solvable in practice using gradient-descent techniques.

In summary, some of the limitations of linear approximation and associated kernel methods may be overcome by considering feature learning, which has the ability to adapt to specific structures of the target function. Yet, the question remains: how to define these trainable features in such a way that prior information on the target function is efficiently encoded?
2.7 Approximation With Shallow Networks And Beyond
To illustrate the difficulty of this question, it is useful to consider the simplest instance of feature learning in NNs, given by shallow neural networks. This architecture defines a class of functions of the form
$$\mathcal{F} = \left\{ f_\theta(x) = \sum_{m \leq M} \alpha_m\, \sigma(w_m^\top x + b_m)\ ;\ \theta = \{\alpha_m, w_m, b_m\}_m \right\}\ , \qquad (2.6)$$
where $\sigma$ is an activation and $\theta$ collects the weights of the network. The model thus builds an approximation as a linear combination of ridge functions, which are non-linear functions of one-dimensional projections of the input. While this class enjoys universal approximation in the overparametrised limit $M \to \infty$, this approximation becomes quantitative in the high-dimensional regime (i.e., with an approximation error $\|f^* - f_\theta^{(M)}\| \lesssim M^{-\eta}$ with $\eta > 0$ independent of the input dimension) only under strong assumptions on the target. For instance, if $f^*$ is such that it depends on a collection of low-dimensional projections of the input (see Figure 2.8), then the shallow class, when operating in the non-linear regime, is able to break this curse of dimensionality (Bach 2017a). However, in most real-world applications (such as computer vision, speech analysis, physics, or chemistry), functions of interest tend to exhibit complex long-range correlations that cannot be expressed with low-dimensional projections (Figure 2.8), making this hypothesis unrealistic. It is thus necessary to define an alternative source of regularity, by exploiting the spatial structure of the physical domain and the geometric priors of $f$, as we describe in the next chapters.
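To illustrate the non-linear regime discussed above, here is a minimal feature-learning sketch (the single-index target, width and step size are illustrative assumptions): a shallow ReLU network of the form 2.6 where both layers are trained by gradient descent on a target depending on a single projection of the input.

```python
import numpy as np

rng = np.random.default_rng(10)

# Illustrative feature-learning sketch: a shallow ReLU network (equation 2.6)
# with both layers trained by gradient descent on a single-index target
# f*(x) = g(<u, x>); all choices (target, width, step size) are assumptions.
d, M, N, lr, steps = 20, 64, 500, 0.05, 3000
u = rng.normal(size=d); u /= np.linalg.norm(u)
g = lambda t: np.tanh(3 * t)
X = rng.normal(size=(N, d))
y = g(X @ u)

W = rng.normal(size=(M, d)) / np.sqrt(d)    # feature weights w_m (trained)
b = 0.5 * rng.normal(size=M)                # biases b_m (trained)
a = rng.normal(size=M) / np.sqrt(M)         # output weights alpha_m (trained)

def risk():
    return np.mean((np.maximum(X @ W.T + b, 0) @ a - y) ** 2)

print(f"train risk at initialization: {risk():.4f}")
for _ in range(steps):
    pre = X @ W.T + b                       # pre-activations, N x M
    H = np.maximum(pre, 0)                  # hidden features
    r = H @ a - y                           # residuals
    act = (pre > 0).astype(float)           # ReLU derivative
    grad_a = H.T @ r / N
    grad_b = a * (act.T @ r) / N
    grad_W = a[:, None] * ((act * r[:, None]).T @ X) / N
    a -= lr * grad_a; b -= lr * grad_b; W -= lr * grad_W
print(f"train risk after training:    {risk():.4f}")
```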
Figure 2.8
If the unknown function $f$ is presumed to be well approximated as $f(x) \approx g(Ax)$ for some unknown $A \in \mathbb{R}^{k \times d}$ with $k \ll d$, then shallow neural networks can capture this inductive bias, see e.g. Bach 2017a. In typical applications, such dependency on low-dimensional projections is unrealistic, as illustrated in this example: a low-pass filter projects the input images to a low-dimensional subspace; while it conveys most of the semantics, substantial information is lost.
Suggested Further Reading and Bibliographical Notes
For a comprehensive introduction to learning theory, we refer the reader to the recent 'Learning Theory from First Principles' by Francis Bach. In particular, the topics outlined in Section 2.1, Section 2.2 and Section 2.5 are presented from the ground up and with a great level of detail. Further aspects of RKHS theory and their applications to learning theory are detailed in 'Learning with Kernels', by Scholkopf and Smola. For further theoretical aspects of statistical learning theory, and a comprehensive review of measure concentration, we refer the reader to 'High-Dimensional Statistics', by Martin Wainwright. For topics specifically related to neural network approximation, described in Section 2.4 and Section 2.7, we recommend the reader to explore 'Deep Learning Theory' by Matus Telgarsky. Finally, the notions of linear and non-linear approximation outlined in Section 2.6 are explored in depth in 'A Wavelet Tour of Signal Processing', by Stephane Mallat (Chapter 9).
Approximation Theory for Neural Networks: The approximation properties of generic fully connected neural networks were heavily studied in the 80s and 90s, culminating in a fairly complete characterisation. (Baum and Haussler 1988) considered the minimal size of neural networks with the ability to fit a certain (finite) dataset. Shortly after, several works studied the density of shallow neural networks in the space of continuous functions, e.g. (Hecht-Nielsen 1987; Carroll and Dickinson 1989; Cybenko 1989; Funahashi 1989; Hornik, Stinchcombe, and White 1989), under certain assumptions on the activation function, culminating in the 'modern' formulation of the UAT (Leshno et al. 1993), which only asks the activation to not be a polynomial. Beyond qualitative approximation, the quantitative study of approximation rates using neural networks was pioneered by Barron (Barron 1993) and Maiorov and Meir (Maiorov 1999; Meir and Maiorov 2000). Barron also identified the fundamental separation between linear and non-linear approximation using shallow neural networks, subsequently studied in (Bach 2017a; Ma, Wu, et al. 2022).
Kernels and Neural Networks: The interplay between Neural Networks and Kernel methods dates at least to Neal (Neal 1996); see also (Williams 1996), where a connection was made between infinitely-wide neural networks and certain Gaussian Processes. (Weston, Ratle, and Collobert 2008) further explored the interplay between kernels and NNs, with the motivation to leverage the tractable formalism of kernel methods to ease the optimization challenges of deep architectures. Shortly after, Cho and Saul (Cho and Saul 2009) computed the dot-product kernels associated with deep architectures with randomly initialized isotropic weights. In parallel, Rahimi and Recht (Rahimi and Recht 2007) introduced random feature approximations for certain kernel classes, which in retrospect provided the framework for today's quantitative non-asymptotic theory (Misiakiewicz and Montanari 2023; Bach 2017a, 2017b). In essence, this framework provides an explicit statistical characterization of neural networks where only the last layer is allowed to train. The general setting of non-linear NN approximation is out of the scope of kernel methods, with the notable exception of the Neural Tangent Kernel from (Jacot, Gabriel, and Hongler 2018). The NTK considers a linearisation of the Neural Network at its initialization, and shows that, for certain choices of initialisation scaling, the training dynamics of the actual network and its linearisation agree in the overparametrised limit. This regime is however unable to account for the feature learning abilities of NNs (Chizat, Oyallon, and Bach 2019).
Exercises
1. Decomposition of Risk: Quadratic Function and Least Squares
Consider a linear hypothesis class of the form $\mathcal{F} = \{f_\theta(x) = \theta^\top \varphi(x)\}$, where $\varphi : \mathbb{R}^d \to \mathbb{R}^m$ is a fixed, finite-dimensional feature map, and $\theta \in \mathbb{R}^m$ parametrises the hypothesis. Consider the least-squares loss, hence $\ell(f_\theta(x), y) = (\theta^\top \varphi(x) - y)^2$.
(i) By developing the square, show that for any $\theta$ we have
$$\hat{\mathcal{R}}(f_\theta) - \mathcal{R}(f_\theta) = \theta^\top(\hat{\Sigma} - \Sigma)\theta - 2\theta^\top(\hat{m} - m) + (\hat{\sigma}_y^2 - \sigma_y^2)\ ,$$
where $\hat{\Sigma} = \frac{1}{n}\sum_i \varphi(x_i)\varphi(x_i)^\top$, $\Sigma = \mathbb{E}[\hat{\Sigma}]$, $\hat{m} = \frac{1}{n}\sum_i y_i\,\varphi(x_i)$, $m = \mathbb{E}[\hat{m}]$, $\hat{\sigma}_y^2 = \frac{1}{n}\sum_i y_i^2$, and $\sigma_y^2 = \mathbb{E}[\hat{\sigma}_y^2]$.
(ii) Assume now that the covariates are isotropic, $\Sigma = \mathrm{Id}$, and that we are in the realizable setting $y|x = \theta^{*\top}\varphi(x)$. Show that the deviation then takes the form
$$\hat{\mathcal{R}}(f_\theta) - \mathcal{R}(f_\theta) = (\theta - \theta^*)^\top(\hat{\Sigma} - \Sigma)(\theta - \theta^*)\ .$$
(iii) Conclude that the uniform bound in the ball $\mathcal{F}_\delta = \{f_\theta \in \mathcal{F};\ \|\theta\| \leq \delta\}$ is
$$\sup_{f \in \mathcal{F}_\delta} |\hat{\mathcal{R}}(f) - \mathcal{R}(f)| = \delta^2\, \|\hat{\Sigma} - \Sigma\|_{\mathrm{op}}\ .$$
In this example, the uniform control of the excess risk can thus be directly captured by the operator norm of the empirical covariance fluctuations (notice that the operator norm $\|A\|_{\mathrm{op}} = \sup_{\|\theta\| \leq 1}\|A\theta\|$ still contains a supremum over the domain). The additional structure of the operator norm of a linear operator leads to a tight control of the excess risk in linear models using tools from Random Matrix Theory; see Misiakiewicz and Montanari 2023 for an in-depth analysis of the excess risk in linear models (including infinite-dimensional linear models given by RKHS).
2. Generalization Bounds with Rademacher Complexity
Reference: (Bach 2024, Section 4.5). Denote $z = (x, y) \in \mathcal{Z} := \mathcal{X} \times \mathcal{Y}$ and $\mathcal{H} = \{(x, y) \mapsto \ell(y, f(x));\ f \in \mathcal{F}\}$ the class of functions from $\mathcal{Z}$ to $\mathbb{R}$ associated to the hypothesis class $\mathcal{F}$. In this exercise, we will obtain generalisation bounds on $\sup_{\mathcal{F}}\{\mathcal{R}(f) - \hat{\mathcal{R}}(f)\}$.
(i) Show that $\sup_{\mathcal{F}}\{\mathcal{R}(f) - \hat{\mathcal{R}}(f)\} = \sup_{\mathcal{H}}\{\mathbb{E}[h(z)] - \frac{1}{n}\sum_{i=1}^n h(z_i)\}$.
Let $\mathcal{D} = \{z_1, \dots, z_n\}$ be the dataset. The Rademacher complexity of the class $\mathcal{H}$ with respect to the data distribution is defined as
$$\mathcal{R}_n(\mathcal{H}) := \mathbb{E}_{\varepsilon, \mathcal{D}}\left[\sup_{h \in \mathcal{H}} \frac{1}{n}\langle \varepsilon, h(\mathcal{D})\rangle\right]\ ,$$
where $h(\mathcal{D}) = (h(z_1), \dots, h(z_n))$ and $\varepsilon = (\varepsilon_1, \dots, \varepsilon_n)$ is a vector of $n$ i.i.d. Rademacher variables, which take values $\varepsilon_i = \pm 1$ with probability $1/2$.
(ii) Show that $\mathcal{R}_n$ is a linear functional, i.e. $\mathcal{R}_n(\alpha\mathcal{H} + \alpha'\mathcal{H}') = |\alpha|\,\mathcal{R}_n(\mathcal{H}) + |\alpha'|\,\mathcal{R}_n(\mathcal{H}')$ for $\alpha, \alpha' \in \mathbb{R}$, that it is translation-invariant, i.e. $\mathcal{R}_n(\mathcal{H} + \{h_0\}) = \mathcal{R}_n(\mathcal{H})$, and that $\mathcal{R}_n(\mathcal{H}) = \mathcal{R}_n(\mathrm{CH}(\mathcal{H}))$, where $\mathrm{CH}$ denotes the convex hull.
(iii) Massart's lemma for a finite hypothesis class: Assume $\mathcal{H} = \{h_1, \dots, h_m\}$ and that $\mathcal{H}$ is a.s. bounded, i.e. $\frac{1}{n}\sum_{i=1}^n h(z_i)^2 \leq R^2$ for all $h \in \mathcal{H}$. Show that $\mathcal{R}_n(\mathcal{H}) \leq \sqrt{\frac{2\log m}{n}}\, R$.
(iv) Symmetrisation: Show that
$$\mathbb{E}\left[\sup_{\mathcal{H}}\left\{\mathbb{E}[h(z)] - \frac{1}{n}\sum_{i=1}^n h(z_i)\right\}\right] \leq 2\,\mathcal{R}_n(\mathcal{H})\ ,$$
and that the same bound applies to $\mathbb{E}\left[\sup_{\mathcal{H}}\{\frac{1}{n}\sum_{i=1}^n h(z_i) - \mathbb{E}[h(z)]\}\right]$. (Introduce an independent copy $\mathcal{D}' = \{z'_1, \dots, z'_n\}$ of the dataset and use the identity $\mathbb{E}[h(z)] - h(z_i) = \mathbb{E}[h(z'_i) - h(z_i)\,|\,\mathcal{D}]$.)
The previous bound controls the expected generalisation error. To obtain high-probability bounds, we now consider McDiarmid's inequality, which exploits the a.s. boundedness of the random variables.
Lemma 1 (McDiarmid's inequality) Let $Z_1, \dots, Z_n$ be independent random variables defined over $\mathcal{Z}$, and $f : \mathcal{Z}^n \to \mathbb{R}$ such that, for all $i \in \{1, \dots, n\}$ and all $z_1, \dots, z_n, z'_i$, it satisfies
$$|f(z_1, \dots, z_{i-1}, z_i, z_{i+1}, \dots, z_n) - f(z_1, \dots, z_{i-1}, z'_i, z_{i+1}, \dots, z_n)| \leq c\ .$$
Then
$$P\left(|f(Z_1, \dots, Z_n) - \mathbb{E}[f(Z_1, \dots, Z_n)]| \geq t\right) \leq 2\exp(-2t^2/(nc^2))\ .$$
(v) Apply McDiarmid's inequality both to $\mathcal{R}$ and to $\mathcal{R}_n$ to obtain the following high-probability bound: if $h(z) \in [0, 1]$ for all $z$ and all $h$, then with probability greater than $1 - \delta$, for any $h \in \mathcal{H}$,
$$\mathbb{E}[h(z)] \leq \frac{1}{n}\sum_{i=1}^n h(z_i) + 2\,\hat{\mathcal{R}}_n(\mathcal{H}) + 3\sqrt{\frac{\log(2\delta^{-1})}{2n}}\ ,$$
where we defined the empirical Rademacher complexity as
$$\hat{\mathcal{R}}_n(\mathcal{H}) := \mathbb{E}_{\varepsilon}\left[\sup_{h \in \mathcal{H}} \frac{1}{n}\langle \varepsilon, h(\mathcal{D})\rangle\right]\ .$$
The Rademacher complexity provides a powerful tool to control the generalisation error; see (Bach 2024, Section 4.5) for further properties and specific examples where hypothesis classes are given by balls in generic normed spaces.
3. Learning Lipschitz Functions in $d$ Dimensions
Let $\{(x_i, y_i)\}_{i=1,\dots,n}$ with $x_i \sim \mathrm{Unif}([-1, 1]^d)$ and $y_i = f(x_i)$, where $f$ is 1-Lipschitz: $|f(x) - f(y)| \leq \|x - y\|$. Show that in order to learn $f$ up to uniform accuracy $\epsilon$, i.e. $\mathbb{E}[\sup_{x \in [-1,1]^d} |f(x) - \hat{f}(x)|] \leq \epsilon$, where the expectation is with respect to the randomness of the training data, it is necessary and sufficient that $n = \Theta(\epsilon^{-d})$.
4. Linear vs Non-Linear Approximation in High Dimension
In this exercise, we will compare linear approximation schemes (associated with kernel methods) with non-linear approximation (associated with neural networks in the feature-learning regime).
Let $\mathcal{F} = L^2(\mathbb{R}^d, \nu)$ be the space of square-integrable functions with respect to a data distribution $\nu$, and consider $\mathcal{B} = \{\phi_m\}_{m \in \Lambda}$ an orthonormal basis of $\mathcal{F}$. For each subset $S \subseteq \Lambda$, we define $U_S := \mathrm{span}\{\phi_m;\ m \in S\}$. The linear approximation of $f \in \mathcal{F}$ is given by
$$\epsilon_S(f) := \min_{g \in U_S} \|f - g\|\ , \qquad g_S(f) := \arg\min_{g \in U_S} \|f - g\|\ .$$
Finally, given a target family $\mathcal{F}^* \subseteq \mathcal{F}$, the minimax $N$-term linear approximation error is defined as
$$\epsilon_N(\mathcal{F}^*) := \inf_{|S| \leq N} \sup_{f \in \mathcal{F}^*} \epsilon_S(f)\ .$$
(i) Show that $g_S(f)$ can be explicitly computed as $g_S(f) = \sum_{m \in S} \langle f, \phi_m\rangle\, \phi_m$.
Consider now an order in $\Lambda$, and $S = S_N$ the nested sets $S_N = \{1, \dots, N\}$.
(ii) Show that for $f \in \mathcal{F}$, we have $\lim_{N \to \infty} \epsilon_{S_N}(f) = 0$.
(iii) Give an example of a family $\mathcal{F}^*$ such that $\epsilon_N(\mathcal{F}^*)$ does not converge to $0$ as $N \to \infty$.
(iv) Show that if every $f \in \mathcal{F}^*$ is such that $\sum_m m^{2s} |\langle f, \phi_m\rangle|^2 < \infty$ for $s > 1/2$, then $\epsilon_N(\mathcal{F}^*) = O(N^{-s})$. This quantifies the previous approximation guarantee in the so-called Sobolev class.
We now define the non-linear approximation error as
$$\bar{\epsilon}_N(\mathcal{F}^*) := \sup_{f \in \mathcal{F}^*} \inf_{|S| = N} \epsilon_S(f)\ .$$
(v) Show that for any $N$ and any $\mathcal{F}^*$, $\bar{\epsilon}_N(\mathcal{F}^*) \leq \epsilon_N(\mathcal{F}^*)$.
Non-linear approximation schemes are thus more powerful than their linear counterparts. We conclude this exercise by illustrating how the gap between linear and non-linear approximation can be dramatic in the high-dimensional regime.
For that purpose, consider a rotationally-invariant data distribution $\nu$, i.e., such that for any $U \in O_d$ and any measurable set $A \subseteq \mathbb{R}^d$ we have $\nu(U(A)) = \nu(A)$. Consider an orthonormal basis of $L^2(\mathbb{R}^d, \nu)$ of the form $\mathcal{B} = \{\phi_{j,k}\}_{j \in \mathbb{N},\, k \leq N_{d,j}}$, where the orthogonal subspaces $V_j = \mathrm{span}(\phi_{j,k};\ k = 1, \dots, N_{d,j})$ are closed under rotations: if $f \in V_j$, then $f \circ U \in V_j$ for all $U$. $N_{d,j}$ is thus the dimension of each invariant subspace, and it can be assumed that $N_{d,j} \simeq d^j$ for sufficiently large $d$. These invariant subspaces are often referred to as harmonics. Finally, consider a function class $\mathcal{F}^*$ of the form
$$\mathcal{F}^* = \left\{f;\ \sum_j j^{2s} \|P_{V_j} f\|^2 < \infty;\ f(x) = h(x \cdot \theta)\ \text{for some}\ \theta \in S^{d-1}\right\}\ ,$$
where $P_V$ denotes the orthogonal projection onto $V$. $\mathcal{F}^*$ thus consists of high-dimensional functions which only depend on one-dimensional projections of the input, the so-called single-index models, and whose link function $h$ has polynomial decay of order $s$ in the harmonic decomposition.
(vi) Show that, for any basis $\mathcal{B}$ as above, the minimax linear approximation error $\epsilon_N(\mathcal{F}^*)$ is minimised at $S = \{(j, k);\ j \leq j^*(N)\}$, where $j^*(N)$ is the largest $j$ such that $\sum_{j' \leq j} N_{d,j'} < N$.
(vii) Deduce that the minimax linear approximation error $\epsilon_N(\mathcal{F}^*)$ has a staircase behavior: it remains constant as long as $j^*(N)$ does not increase.
(viii) Using the asymptotics $N_{d,j} \simeq d^j$, deduce that to reach approximation error $\epsilon$, one needs $N \simeq d^{\,\epsilon^{-1/s}}$ terms.
(ix) In contrast, show that the non-linear approximation error, when optimized over harmonic bases as above,
$$\bar{\epsilon}_N(\mathcal{F}^*) := \sup_{f \in \mathcal{F}^*} \inf_{\mathcal{B},\, |S| = N} \epsilon_S(f)\ ,$$
requires only $N \simeq \epsilon^{-1/s}$ terms.
5. Universal Approximation with Fourier Analysis
[From M. Telgarsky, DLT Section 2.3, and Barron 1993] In this exercise, we will derive the Universal Approximation Theorem for shallow neural networks using Fourier representations.
(i) Verify that for $f \in L^1(\mathbb{R}^d) \cap C^1$, its Fourier transform $\hat{f}(\xi) = \int f(x)\, e^{-2\pi i x \cdot \xi}\, dx$ is also in $L^1(\mathbb{R}^d)$, and moreover
$$\int \|\xi\|\, |\hat{f}(\xi)|\, d\xi < \infty\ .$$
(ii) Show the inverse Fourier representation $f(x) = \int \hat{f}(\xi)\, e^{2\pi i \xi \cdot x}\, d\xi$, where the equality holds point-wise.
(iii) Let $\hat{f}(\xi) = |\hat{f}(\xi)|\, e^{2\pi i \varphi_f(\xi)} \in \mathbb{C}$ be the polar representation. Show that
$$f(x) = \int \cos(2\pi(\xi \cdot x + \varphi_f(\xi)))\, |\hat{f}(\xi)|\, d\xi\ .$$
(iv) Using the Fundamental Theorem of Calculus, show that
$$\cos(2\pi(\xi \cdot x + \varphi_f(\xi))) = \cos(2\pi\varphi_f(\xi)) + 2\pi\int_0^{\|\xi\|} \left[\sin(2\pi(-b + \varphi_f(-\xi))) - \sin(2\pi(b + \varphi_f(\xi)))\right] 1(\xi \cdot x \geq b)\, db\ .$$
(v) Conclude that there exists a measure $\mu(b, \xi)$ satisfying $\int |\mu(b, \xi)|\, db\, d\xi \leq 4\pi \int \|\xi\|\, |\hat{f}(\xi)|\, d\xi < \infty$ such that
$$f(x) = \int \mu(b, \xi)\, 1(x \cdot \xi \geq b)\, db\, d\xi\ ;$$
in other words, we can express $f$ as an infinitely-wide shallow neural network with threshold activation function $x \mapsto 1(x \cdot \xi \geq b)$.
Notes
1. The study of ML beyond the i.i.d. assumption is the topic of Transfer, Continual, or Adversarial Learning; see the notes at the end of the chapter.
2. If $P$ admits a density (which we denote by $P$ by abuse of language) with respect to a base measure $\mu$ in $\mathcal{X} \times \mathcal{Y}$, so $P \ll \mu$, then $P_{y|x}$ admits a density with respect to the marginal $\mu_y(y) = \int_{\mathcal{X}} dP(x, y)$, with $dP_{y|x}(y) = Z^{-1} P(x, y)\, d\mu_y(y)$, where $Z = \int_{\mathcal{Y}} P(x, y)\, d\mu_y(y)$. We will gloss over the measure-theoretic considerations when dealing with conditional distributions, and assume throughout that all joint distributions are absolutely continuous with respect to the base measure of the domain.
3. Note that the denseness of a set is a property that depends on the choice of a topology. Consider $B$ the set of square-integrable, bounded scalar functions, and $A \subseteq B$ the set of bounded, compactly-supported functions. Then $A$ is dense in $B$ under the topology induced by $L^2(\mathbb{R})$, whereas it is not dense under the topology induced by $L^\infty(\mathbb{R})$.
4. Informally, a norm $\|x\|$ can be regarded as a "length" of the vector $x$. A Banach space is a complete vector space equipped with a norm.
5. 'Cursed' here refers to rates of approximation/estimation that have an exponential dependence on dimension, e.g. $\epsilon \simeq n^{-c/d}$, where $n$ refers to the statistical/computational resources, e.g. the number of neurons to approximate a target, or the number of datapoints to estimate it.
6. A function $f$ is in the Sobolev class $H^s(\mathbb{R}^d)$ if $f \in L^2(\mathbb{R}^d)$ and its generalised $s$-th order derivative is square-integrable: $\int \|\omega\|^{2s+1} |\hat{f}(\omega)|^2\, d\omega < \infty$, where $\hat{f}$ is the Fourier transform of $f$; see Chapter 7.