we have so far focused on measuring the fluctuations at a fixed hypothesis $f$, but the training process is precisely about finding the right one! The relevant notion is not how the individual random variables $\hat{R}(f)$ fluctuate around their expectations $R(f)$, but how the random function $\hat{R}$ fluctuates around its expectation $R$. We thus need to control fluctuations uniformly over the hypothesis class $\mathcal{F}$, that is, $\sup_{f \in \mathcal{F}} |\hat{R}(f) - R(f)|$. As it turns out, such uniform control can be achieved under fairly general conditions by appropriately constraining the hypothesis class, as we shall see next. The resulting Empirical Risk Minimisation (ERM) is in fact a very powerful algorithm, and the main statistical paradigm to perform Supervised Learning.
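To see what this uniform deviation looks like in practice, here is a minimal simulation sketch that estimates $\sup_{f \in \mathcal{F}} |\hat{R}(f) - R(f)|$ for a small finite class of threshold classifiers under the 0-1 loss. The data distribution, the hypothesis class and every name in the code are illustrative assumptions, not taken from the text.

```python
# Toy estimate of the uniform deviation sup_{f in F} |R_hat(f) - R(f)|
# for a finite class of threshold classifiers on [0, 1] under the 0-1 loss.
# The hypothesis class and the data distribution are assumed for illustration.
import numpy as np

rng = np.random.default_rng(0)

# Hypothesis class F: f_t(x) = 1{x > t} for a grid of thresholds t.
thresholds = np.linspace(0.0, 1.0, 51)

def risks(X, Y):
    """0-1 risk of every threshold classifier on the sample (X, Y)."""
    preds = (X[None, :] > thresholds[:, None]).astype(float)  # |F| x n
    return (preds != Y[None, :]).mean(axis=1)                 # one risk per f in F

def sample(n):
    """Assumed ground truth: y = 1{x > 0.3}, with 10% label noise."""
    X = rng.uniform(0.0, 1.0, size=n)
    noise = rng.uniform(size=n) < 0.1
    Y = ((X > 0.3).astype(int) + noise.astype(int)) % 2       # labels in {0, 1}
    return X, Y.astype(float)

# Approximate the population risks R(f) with a very large independent sample.
X_pop, Y_pop = sample(200_000)
R = risks(X_pop, Y_pop)

for N in [10, 100, 1_000, 10_000]:
    X, Y = sample(N)
    R_hat = risks(X, Y)
    print(f"N={N:6d}   sup_f |R_hat(f) - R(f)| = {np.abs(R_hat - R).max():.3f}")
```

The printed deviation should shrink as $N$ grows, which is the kind of uniform control discussed above.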
Empirical Risk Minimisation   Given a hypothesis class $\mathcal{F} = \{ f_\theta \,;\, \theta \in \Theta \}$ parametrised by $\theta$, the ERM defines an estimator of the form $f_{\hat{\theta}}$, where
$$
\hat{\theta} \in \operatorname*{arg\,min}_{\theta \in \Theta} \hat{R}(f_\theta) = \frac{1}{N} \sum_{i=1}^{N} \ell\big(f_\theta(x_i), y_i\big) \,. \qquad (2.3)
$$
In words, we are searching for a parameter in the domain Θ that best explains
the observed data, in the sense of the chosen loss ℓ. While this certainly seems a
necessary condition to get a good predictor, to what extent is it also sufficient?
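As a concrete, if simplistic, instance of (2.3), the sketch below runs ERM for a linear model under the squared loss, minimising the empirical risk by plain gradient descent on synthetic data. The model, loss, optimiser and all constants are assumptions made purely for illustration.

```python
# Minimal ERM sketch in the spirit of (2.3): theta_hat minimises the average
# squared loss over N training pairs, for a linear model f_theta(x) = <theta, x>.
# The synthetic data and the optimiser are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training sample (x_i, y_i), i = 1..N, from an assumed linear ground truth.
N, d = 200, 5
X = rng.normal(size=(N, d))
theta_star = rng.normal(size=d)
y = X @ theta_star + 0.1 * rng.normal(size=N)

def empirical_risk(theta):
    """R_hat(f_theta) = (1/N) sum_i (f_theta(x_i) - y_i)^2, i.e. squared loss."""
    return np.mean((X @ theta - y) ** 2)

# Plain gradient descent on the empirical risk.
theta_hat = np.zeros(d)
lr = 0.1
for _ in range(500):
    grad = (2.0 / N) * X.T @ (X @ theta_hat - y)   # gradient of the empirical risk
    theta_hat -= lr * grad

print("empirical risk at theta_hat:", empirical_risk(theta_hat))
print("distance to ground truth   :", np.linalg.norm(theta_hat - theta_star))
```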
By expressing the population risk $R(f)$ as $R(f) = \hat{R}(f) + \big(R(f) - \hat{R}(f)\big)$, we can think of a good hypothesis as a function $f$ such that (i) the empirical risk $\hat{R}(f)$ is small, and (ii) the fluctuation $R(f) - \hat{R}(f)$ is small. While the first property is what ERM is explicitly designed to achieve, and can be readily assessed by measuring the empirical risk on the available data, the second relies on structural properties of the data distribution, the hypothesis class and the optimisation algorithm. The key observation that we develop next is that these two sources of error are typically in tension, and one needs to trade them off: this is the familiar bias-variance tradeoff.
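This tension can be made visible with a rough numerical experiment: fit polynomials of increasing degree by least squares (ERM under the squared loss) and compare the empirical risk on the training sample with the risk on a large held-out sample used as a proxy for the population risk. The data-generating process and the chosen degrees below are purely illustrative assumptions.

```python
# Rough illustration of the tension between small empirical risk and small
# fluctuation R(f) - R_hat(f), using polynomial regression of increasing degree.
# Ground truth, noise level and degrees are assumed for illustration.
import numpy as np

rng = np.random.default_rng(1)

def sample(n):
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(3.0 * x) + 0.3 * rng.normal(size=n)   # assumed ground truth + noise
    return x, y

x_train, y_train = sample(30)
x_test, y_test = sample(100_000)   # large sample as a proxy for the population risk

for degree in [1, 3, 9, 15]:
    coeffs = np.polyfit(x_train, y_train, degree)    # least-squares fit = ERM here
    R_hat = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    R = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree={degree:2d}   R_hat={R_hat:.3f}   R~{R:.3f}   gap={R - R_hat:.3f}")
```

In a typical run the empirical risk decreases with the degree while the gap $R - \hat{R}$ grows, which is precisely the tradeoff described above.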
Having small (or even zero) training error amounts to solving (approximately or exactly) in $\theta$ a system of $N$ equations: $f_\theta(x_i) = y_i$ for all $i = 1, \dots, N$. The
ability to solve systems of equations, even when they are non-linear, depends
fundamentally on the number of degrees of freedom at our disposal, i.e. on the
intrinsic dimensionality of the model class, here roughly captured in the num-
ber of parameters encoded in θ. In other words, when the number of parameters
is large relative to N, the ERM operates in the so-called overparametrised
regime, and one should expect many parameter choices that fit the data equally
well. However, recall that we are using the empirical risk $\hat{R}$ as a proxy for the population risk $R$, the true functional we are interested in minimising. Thus, how should we make an informed choice amongst all the parameters in $\Theta$ that explain (or interpolate) the data equally well?
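The sketch below illustrates this overparametrised regime in the simplest possible setting, a linear model with more parameters than data points: the interpolation constraints $f_\theta(x_i) = y_i$ admit infinitely many solutions, and two of them, indistinguishable on the training data, can disagree sharply on a fresh input. The setup and all numbers are illustrative assumptions.

```python
# Overparametrised toy example: linear model with d >> N, so the system
# f_theta(x_i) = y_i is underdetermined and has infinitely many solutions.
# The dimensions and data below are assumed for illustration.
import numpy as np

rng = np.random.default_rng(2)

N, d = 20, 100                         # many more parameters than data points
X = rng.normal(size=(N, d))
y = rng.normal(size=N)

# One interpolating solution: the minimum-norm least-squares solution.
theta_min_norm = np.linalg.pinv(X) @ y

# Another interpolating solution: add any direction in the null space of X.
v = rng.normal(size=d)
v -= np.linalg.pinv(X) @ (X @ v)       # project out the row-space component of v
theta_other = theta_min_norm + 5.0 * v

print("train residual (min-norm):", np.linalg.norm(X @ theta_min_norm - y))
print("train residual (other)   :", np.linalg.norm(X @ theta_other - y))

# Identical (near-zero) training error, yet very different predictions at a new input.
x_new = rng.normal(size=d)
print("prediction gap at new x  :", abs(x_new @ (theta_min_norm - theta_other)))
```

Both parameter vectors interpolate the training data, so the empirical risk cannot distinguish between them; choosing among such interpolants is exactly the question raised above.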