approximation outlined in Section 2.6 are explored in depth in ‘A Wavelet Tour
of Signal Processing’ by Stéphane Mallat (Chapter 9).
Approximation Theory for Neural Networks: The approximation properties
of generic fully connected neural networks were heavily studied in the 1980s
and 1990s, culminating in a fairly complete characterisation. (Baum and Haussler
1988) considered the minimal size of a neural network needed to fit a given
(finite) dataset. Shortly after, several works studied the density of shallow
neural networks in the space of continuous functions, e.g. (Hecht-Nielsen 1987;
Carroll and Dickinson 1989; Cybenko 1989; Funahashi 1989; Hornik, Stinchcombe,
and White 1989), under certain assumptions on the activation function, leading
to the ‘modern’ formulation of the universal approximation theorem (UAT)
(Leshno et al. 1993), which only requires that the activation not be a polynomial.
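Schematically, and up to technical conditions on the activation, this characterisation states that, for a continuous activation $\sigma$, the class of shallow networks
\[
\Big\{\, x \mapsto \sum_{m=1}^{M} a_m\,\sigma\big(\langle w_m, x\rangle + b_m\big) \;:\; M \in \mathbb{N},\ a_m, b_m \in \mathbb{R},\ w_m \in \mathbb{R}^d \Big\}
\]
is dense in $C(K)$ for every compact $K \subset \mathbb{R}^d$, in the uniform norm, if and only if $\sigma$ is not a polynomial.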
Beyond qualitative approximation, the quantitative study of approximation rates using
neural networks was pioneered by Barron (Barron 1993) and Maiorov and
Meir (Maiorov 1999; Meir and Maiorov 2000). Barron also identified the
fundamental separation between linear and non-linear approximation using
shallow neural networks, subsequently studied in (Bach 2017a; Ma, Wu, et
al. 2022).
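Schematically, and omitting constants, Barron’s rate can be stated as follows: if $f$ admits a Fourier transform $\hat f$ with
\[
C_f \;=\; \int_{\mathbb{R}^d} \|\omega\|\, |\hat f(\omega)|\, d\omega \;<\; \infty,
\]
then for every $n$ there is a shallow network $f_n$ with $n$ neurons satisfying $\|f - f_n\|_{L^2(\mu)} \lesssim C_f/\sqrt{n}$ over a bounded domain, a rate independent of the input dimension $d$; no fixed $n$-dimensional linear approximation space achieves such a dimension-free rate uniformly over this class, which is the separation alluded to above.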
Kernels and Neural Networks: The interplay between Neural Networks
and Kernel methods dates back at least to Neal (Neal 1996); see also (Williams
1996), where a connection was made between infinitely-wide neural net-
works and certain Gaussian Processes. (Weston, Ratle, and Collobert 2008)
further explored the interplay between kernels and NNs, motivated by leveraging
the tractable formalism of kernel methods to ease the optimization challenges
of deep architectures. Shortly after, Cho and Saul (Cho and Saul 2009) computed
the dot-product kernels associated with deep architectures with randomly
initialized isotropic weights. In parallel, Rahimi and Recht (Rahimi and Recht
2007) introduced random feature approximations for certain kernel classes,
which in retrospect provided the framework for today’s quantitative
non-asymptotic theory (Misiakiewicz and Montanari 2023; Bach 2017a, 2017b).
In essence, this framework provides an explicit statistical characterization
of neural networks in which only the last layer is trained.
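Concretely, in the simplest instance of this framework (notation ours), a random feature model freezes randomly drawn first-layer weights and fits only the output layer:
\[
f_a(x) \;=\; \frac{1}{\sqrt{M}} \sum_{m=1}^{M} a_m\, \sigma(\langle w_m, x\rangle), \qquad w_1, \dots, w_M \overset{\text{i.i.d.}}{\sim} \nu \ \text{fixed},
\]
so that the model is linear in the trainable parameters $a \in \mathbb{R}^M$ and, as $M \to \infty$, it recovers the kernel method associated with
\[
k(x, x') \;=\; \mathbb{E}_{w \sim \nu}\big[\sigma(\langle w, x\rangle)\,\sigma(\langle w, x'\rangle)\big].
\]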
The general setting of non-linear NN approximation is beyond the scope of kernel
methods, with the notable exception of the Neural Tangent Kernel (NTK) of
(Jacot, Gabriel, and Hongler 2018). The NTK considers a linearisation of the
neural network at its initialisation, and shows that, for a certain choice of
initialisation scaling, the training dynamics of the actual network and of its
linearisation agree in the overparametrised limit. This regime is, however,
unable to account for the feature learning abilities of NNs (Chizat, Oyallon,
and Bach 2019).
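For reference, the linearisation underlying the NTK regime can be written schematically as
\[
f_\theta(x) \;\approx\; f_{\theta_0}(x) + \big\langle \nabla_\theta f_{\theta_0}(x),\, \theta - \theta_0 \big\rangle,
\]
again a model that is linear in its parameters, now with features $\nabla_\theta f_{\theta_0}(x)$ and associated kernel
\[
K_{\mathrm{NTK}}(x, x') \;=\; \big\langle \nabla_\theta f_{\theta_0}(x),\, \nabla_\theta f_{\theta_0}(x') \big\rangle,
\]
which remains essentially constant along training in the overparametrised limit.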