Big Ideas in Applied Math: Smoothness and Degree of Approximation

At its core, computational mathematics is about representing the infinite by the finite. Even attempting to store a single arbitrary real number requires an infinite amount of memory to store its infinitely many potentially nonrepeating digits.1It is a well-known result that a real number has a repeating decimal expansion if, and only if, it is rational. When dealing with problems in calculus, however, the problem is even more severe as we want to compute with functions on the real line. In effect, a function is an uncountably long list of real numbers, one for each value of the function’s domain. We certainly cannot store an infinite list of numbers with infinitely many digits on a finite computer!

Thus, our only hope to compute with functions is to devise some sort of finite representation for them. The most natural representation is to describe function by a list of real numbers (and then store approximations of these numbers on our computer). For instance, we may approximate the function by a polynomial a_1 + a_2 x + \cdots + a_{M} x^{M-1} of degree M-1 and then store the M coefficients a_1,\ldots,a_{M}. From this list of numbers, we then can reconstitute an approximate version of the function. For instance, in our polynomial example, our approximate version of the function is just a_1 + a_2 x + \cdots + a_{M} x^{M-1}. Naturally, there is a tradeoff between the length of our list of numbers and how accurate our approximation is. This post is about that tradeoff.

The big picture idea is that the “smoother” a function is, the easier it will be to approximate it with a small number of parameters. Informally, we have the following rule of thumb: if a function f on a one-dimensional domain possesses s nice derivatives, then f can be approximated by a M-parameter approximation with error decaying at least as fast as 1/M^s. This basic result appears in many variants in approximation theory with different precise definitions of the term “s nice derivatives” and “M-parameter approximation”. Let us work out the details of the approximation problem in one concrete setting using Fourier series.

Approximation by Fourier Series

Consider a complex-valued and 2\pi-period function f defined on the real line. (Note that, by a standard transformation, there are close connections between approximation of 2\pi-periodic functions on the whole real line and functions defined on a compact interval [a,b].2Specifically, suppose that h is a function defined on [a,b]. Then define a function g on [-1,1] by g(x) = h((b-a)x + (b+a)). This linear rescaling of the domain is very simple and easy to understand. Now define a 2\pi-periodic function f on \mathbb{R} by f(\theta) = g(\cos(\theta)). There are very close connections between approximation of f and g. For example, Fourier cosine expansions of f are equivalent to Chebyshev polynomial expansions of g. For more on this subject, Trefethen’s Approximation Theory and Approximation Practice is an excellent reference.) If f is square-integrable, then f possesses a Fourier expansion

(1)   \begin{equation*} f(\theta) = \sum_{k=-\infty}^\infty} \hat{f}_k e^{{\rm i} k\theta}, \quad \hat{f}_k = \frac{1}{2\pi} \int_{-\pi}^\pi f(\theta) e^{-{\rm i}k\theta} \, d\theta. \end{equation*}

The infinite series converges in the L^2 sense, meaning that \lim_{m\to\infty} \|f - f_m\|_{L^2} = 0, where f_m is the truncated Fourier series f_M(\theta) = \sum_{k=-m}^m \hat{f}_k e^{{\rm i} k\theta} and \|\cdot\| is L^2 norm \|f\|_2 = \sqrt{\int_{-\pi}^\pi |f(\theta)|^2 \, d\theta.3One may show with considerable analysis that Fourier series converges in others senses, for example almost everywhere convergence. We also have the Plancherel theorem, which states that

(2)   \begin{equation*} \|f\|_{L^2}^2 = \int_{-\pi}^\pi |f(\theta)|^2\, d\theta = 2\pi \sum_{k=-\infty}^{\infty} |\hat{f}_k|^2. \end{equation*}

Note that the convergence of the Fourier series is fundamentally a statement about approximation. The fact that the Fourier series converges means that the truncated Fourier series f_m of 2m+1 terms acts as an arbitrarily close approximate of f (measured with the L^2 norm). However, the number M=2m+1 of terms we need to store might be quite large if a large value of M is needed for \|f - f_m\|_{L^2} to be small. However, as we shall soon see, if the function f is “smooth”, \|f - f_M\|_{L^2} will be small for even moderate values of M.

Smoothness

Often, in analysis, we refer to a function as smooth when it possesses derivatives of all orders. In one dimension, this means that the jth derivative f^{(j)} exists for every integer j \ge 0. In this post, we shall speak of smoothness as a more graded notion: loosely, a function g is smoother than a function f if g possesses more derivatives than f or the magnitude of g‘s derivatives are smaller. This conception of smoothness accords more with the plain-English definition of smoothness: the graph of a very mildly varying function g with a discontinuity in its 33rd derivative looks much smoother to the human eye than the graph of a highly oscillatory and jagged function f that nonetheless possesses derivatives of all orders.4One might refer to the precise concept of possessing derivatives of all orders as C^\infty smoothness.

For a function defined in terms of a Fourier series, it is natural to compute its derivative by formally differentiating the Fourier series term-by-term:

(3)   \begin{equation*} f'(\theta) = \sum_{k=-\infty}^\infty ({\rm i}k) \hat{f}_k e^{{\rm i} k\theta}. \end{equation*}

This formal Fourier series converges if, and only if, the L^2 norm of this putative derivative f', as computed with the Plancherel theorem, is finite: \|f'\|_{L^2}^2 = 2\pi \sum_{k=-\infty}^\infty |k|^2 |\hat{f}_k|^2 < \infty. This “derivative” f' may not be a derivative of f in the classical sense. For instance, using this definition, the absolute value function g(x) = |x| possesses a derivative g'(x) = +1 for x > 0 and g'(x) = -1 for x < 0. For the derivative of f to exist in the sense of Eq. (3), f need not be differentiable at every point, but it must define a square-integrable functions at the points where it is differentiable. We shall call the derivative f' as given by Eq. (3) to be a weak derivative of f.

If a function f has s square-integrable weak derivatives f^{(j)}(\theta) = \sum_{k=-\infty}^\infty ({\rm i}k)^j \hat{f}_k e^{{\rm i}k\theta} for 1\le j \le s, then we say that f belongs to the Sobolev space H^s. The Sobolev space H^s is equipped with the norm

(4)   \begin{equation*} \|f\|_{H^s} = \sqrt{\sum_{j=0}^s \|f^{(j)}\|_{L^2}^2}. \end{equation*}

The Sobolev norm \|\cdot\|_{H^s} is a quantitative measure of qualitative smoothness. The smaller the Sobolev norm H^s of f, the smaller the derivatives of f are. As we will see, we can use this to bound the approximation error.

Smoothness and Degree of Approximation

If f is approximated by f_m, the error of approximation is given by

(5)   \begin{equation*} f(\theta) - f_m(\theta) = \sum_{|k|>m} \hat{f}_k e^{{\rm i}k\theta}, \quad \|f - f_m\|_{L^2} = \sqrt{ \sum_{|k|>m} |\hat{f}_k|^2 }. \end{equation*}

Suppose that f \in H^s (that is, f has s square integrable derivatives). Then we may deduce the inequality

(6)   \begin{equation*} \begin{split} \|f - f_m\|_{L^2} &= \sqrt{ \sum_{|k|>m} |\hat{f}_k|^2 } \\ &= \sqrt{ \sum_{|k|>m} |k|^{-2s}|k|^{2s}|\hat{f}_k|^2 }\\ &\le m^{-s} \sqrt{ \sum_{|k|>m} |k|^{2s}|\hat{f}_k|^2 } \\ &\le m^{-s}\|f\|_{H^s}. \end{split} \end{equation*}

The first “\le” follows from the fact that |k|^{-2s} < m^{-2s} for |k|>m. This very important result is a precise instantiation of our rule of thumb from earlier: if f possesses s nice (i.e. square-integrable) derivatives, then the (L^2) approximation error for an M=2m+1-term Fourier approximation decays at least as fast as 1/M^s.

Higher Dimensions

The results for one dimension can easily be extended to consider functions f defined on d-dimensional space which are 2\pi-periodic in every argument.5e.g. f(\theta_1,\ldots, \theta_j+2\pi,\ldots,\theta_d) = f(\theta_1,\ldots,\theta_j,\ldots,\theta_d) for every 1\le j \le d For physics-based scientific simulation, we are often interested in d=2 or d=3, but for more modern problems in data science, we might be interested in very large dimensions d.

Letting \mathbb{Z}^d denote the set of all d-tuples of integers, one can show that one has the d-dimensional Fourier series

(7)   \begin{equation*} f(\theta) = \sum_{k \in \mathbb{Z}^d} f_{\hat k} e^{{\rm i}k\cdot \theta}, \quad \hat{f}_k = \frac{1}{(2\pi)^d} \int_{[-\pi,\pi]^d} f(\theta) e^{-{\rm i}k\cdot \theta} \, d\theta. \end{equation*}

Here, we denote k\cdot \theta to be the Euclidean inner product of the d-dimensional vectors k and \theta, k\cdot \theta = k_1\theta_1 + \cdots + k_d\theta_d. A natural generalization of the Plancherel theorem holds as well. Let \max |k| denote the maximum of |k_1|,\ldots,|k_d|. Then, we have the truncated Fourier series f_m = \sum_{\max |k| \le m} \hat{f}_k e^{{\rm i}k\cdot \theta}. Using the same calculations from the previous section, we deduce a very similar approximation property

(8)   \begin{equation*} \|f - f_m\|_{L^2} = \sqrt{ \sum_{\max|k|>m} |\hat{f}_k|^2 }  \le m^{-s}\|f\|_{H^s}. \end{equation*}

There’s a pretty big catch though. The approximate function f_m possesses M = (2m+1)^d terms! In order to include each of the first m Fourier modes in each of d dimensions, the number of terms M in our Fourier approximation must grow exponentially in d! In particular, the approximation error satisfies a bound

(9)   \begin{equation*} \|f - f_m\|_{L^2}  \le \left(\frac{M^{1/d}-1}{2}\right)^{-s}\|f\|_{H^s} \le C(s,d)M^{-s/d} \|f\|_{H^s}, \end{equation*}

where C(s,d) \ge 0 is a constant depending only on s and d.

This is the so-called curse of dimensionality: to approximate a function in d dimensions, we need exponentially many terms in the dimension d. In higher-dimensions, our rule of thumb needs to be modified: if a function f on a d-dimensional domain possesses s nice derivatives, then f can be approximated by a M-parameter approximation with error decaying at least as fast as 1/M^{s/d}.

The Theory of Nonlinear Widths: The Speed Limit of Approximation Theory

So far, we have shown that if one approximates a function f on a d-dimensional space by truncating its Fourier series to M terms, the approximation error decays at least as fast as 1/M^{s/d}. Can we do better than this, particularly in high-dimensions where the error decay can be very slow if s \ll d?

One must be careful about how one phrases this question. Suppose I ask “what is the best way of approximating a function f“? A subversive answer is that we may approximate f by a single-parameter approximation of the form \hat{f} = af with a=1! Consequently, there is a one-parameter approximation procedure that approximates every function perfectly. The problem with this one-parameter approximation is obvious: the one-parameter approximation is terrible at approximating most functions different than f. Thus, the question “what is the best way of approximating a particular function?” is ill-posed. We must instead ask the question “what is the best way of approximating an entire class of functions?” For us, the class of functions shall be those which are sufficiently smooth: specifically, we shall consider the class of functions whose Sobolev norm satisfies a bound \|f\|_{H^s} \le B. Call this class W.

As outlined at the beginning, an approximation procedure usually begins by taking the function f and writing down a list of M numbers a_1,\ldots,a_M. Then, from this list of numbers we reconstruct a function \hat{f} which serves as an approximation to f. Formally, this can be viewed as a mathematical function \Phi which takes f \in W to a tuple (a_1,\ldots,a_M) \in \mathbb{C}^d followed by a function \Lambda which takes (a_1,\ldots,a_M) and outputs a continuous 2\pi-periodic function \hat{f} \in C_{2\pi}(\mathbb{R}).

(10)   \begin{equation*} f \stackrel{\Phi}{\longmapsto}(a_1,\ldots,a_M)\stackrel{\Lambda}{\longmapsto} \hat{f}, \quad \Phi : W \to \mathbb{C}^M, \quad \Lambda : \mathbb{C}^M \to C_{2\pi}(\mathbb{R}). \end{equation*}

Remarkably, there is a mathematical theory which gives sharp bounds on the expressive power of any approximation procedure of this type. This theory of nonlinear widths serves as a sort of speed limit in approximation theory: no method of approximation can be any better than the theory of nonlinear widths says it can. The statement is somewhat technical, and we advise the reader to look up a precise statement of the result before using it any serious work. Roughly, the theory of nonlinear widths states that for any continuous approximation procedure6That is, an approximation procedure for which \Phi : W \to \mathbb{C}^d is a continuous function where W is equipped with the topology defined by the norm \|\cdot\|_{H^s}. that is able to approximate every function in W with L^2 approximation error no more than \epsilon, the number of parameters M must be at least some constant multiple of \epsilon^{-d/s}. Equivalently, the worst-case approximation error for a function in W with an M parameter continuous approximation is at least some multiple of 1/M^{s/d}.

In particular, the theory of nonlinear widths states that the approximation property of truncated Fourier series are as good as any method for approximating functions in the class W, as they exactly meet the “speed limit” given by the theory of nonlinear widths. Thus, approximating using truncated Fourier series is, in a certain very precise sense, as good as any other approximation technique you can think of in approximating arbitrary functions from W: splines, rational functions, wavelets, and artificial neural networks must follow the same speed limit. Make no mistake, these other methods have definite advantages, but degree of approximation for the class W is not one of them. Also, note that the theory of nonlinear widths shows that the curse of dimensionality is not merely an artifact of Fourier series; it affects all high-dimensional approximation techniques.

For the interested reader, see the following footnotes for two important ways one may perform approximations better than the theory of nonlinear widths within the scope of its rules.7The theory of nonlinear widths holds for continuous methods of approximation. This means that discontinuous approximation procedures may circumvent its bounds. Indeed, such discontinuous approximation procedures exist using probabilistic techniques. These methods are of questionable use in practice since discontinuous approximation procedures, by their nature, are extremely sensitive to the perturbations which are ubiquitous in performing computations on computers.8The theory of nonlinear widths holds for means of approximating the entire class W. More efficient methods may exist for meaningful subclasses of W. For instance, Mhaskar and Poggio show that for functions f satisfying a compositional property, that they can effectively be approximated by multilayer artificial neural networks.

Upshot: The smoother a function is, the better it can be approximated. Specifically, one can approximate a function on d dimensions with s nice derivatives with approximation error decaying with rate at least 1/M^{s/d}. In the case of 2\pi-periodic functions, such an approximation can easily be obtained by truncating the function’s Fourier series. This error decay rate is the best one can hope for to approximate all functions of this type.

One thought on “Big Ideas in Applied Math: Smoothness and Degree of Approximation

Leave a Reply

Your email address will not be published. Required fields are marked *