Big Ideas in Applied Math: Galerkin Approximation

My first experience with the numerical solution of partial differential equations (PDEs) was with finite difference methods. I found finite difference methods to be somewhat fiddly: it is quite an exercise in patience to, for example, work out the appropriate fifth-order finite difference approximation to a second order differential operator on an irregularly spaced grid and even more of a pain to prove that the scheme is convergent. I found that I liked the finite element method a lot better1Finite element methods certainly have their own fiddly-nesses (as anyone who has worked with a serious finite element code can no doubt attest to). as there was a unifying underlying functional analytic theory, Galerkin approximation, which showed how, in a sense, the finite element method computed the best possible approximate solution to the PDE among a family of potential solutions. However, I came to feel later that Galerkin approximation was, in a sense, the more fundamental concept, with the finite element method being one particular instantiation (with spectral methods, boundary element methods, and the conjugate gradient method being others). In this post, I hope to give a general introduction to Galerkin approximation as computing the best possible approximate solution to a problem within a certain finite-dimensional space of possibilities.

Systems of Linear Equations

Let us begin with a linear algebraic example, which is unburdened by some of the technicalities of partial differential equations. Suppose we want to solve a very large system of linear equations Ax = b, where the matrix A is symmetric and positive definite (SPD). Suppose that A is N\times N where N is so large that we don’t even want to store all N components of the solution x on our computer. What can we possibly do?

One solution is to consider only solutions x lying in a subspace \mathcal{X} of the set of all possible solutions \mathbb{R}^N. If this subspace has a basis x_1,x_2,\ldots,x_M \in \mathcal{X}, then the solution x \in \mathcal{X} can be represented as x = a_1x_1 + \cdots + a_Mx_M and one only has to store the M < N numbers a_1,\ldots,a_M. In general, x will not belong to the subspace \mathcal{X} and we must settle for an approximate solution \hat{x} \in \mathcal{X}.

The next step is to convert the system of linear equations Ax = b into a form which is more amenable to approximate solution on a subspace \mathcal{X}. Note that the equation Ax = b encodes N different linear equations a_i^\top x = b_i where a_i^\top is the ith row of A and b_i is the ith element of b. Note that the ith equation is equivalent to the condition e_i^\top A x = e_i^\top b, where e_i is the vector with zeros in all entries except for the ith entry, which is a one. More generally, by multiplying the equation Ax = b by an arbitrary test row vector y^\top, we get y^\top Ax = y^\top b for all y \in \mathbb{R}^N. We refer to this as a variational formulation of the linear system of equations Ax = b. In fact, one can easily show that the variational problem is equivalent to the system of linear equations:

(1)   \begin{equation*} Ax = b \:\mbox{ if, and only if, }\: y^\top Ax = y^\top b \mbox{ for every } y \in \mathbb{R}^N. \end{equation*}

Since we are seeking an approximate solution from the subspace \mathcal{X}, it is only natural that we also restrict our test vectors y to lie in the subspace \mathcal{X}. Thus, we seek an approximate solution \hat{x} to the system of equations Ax = b as the solution of the variational problem

(2)   \begin{equation*} y^\top A\hat{x} = y^\top b \mbox{ for every } y \in\mathcal{X}. \end{equation*}

One can relatively easily show this problem possesses a unique solution \hat{x}.2Here is a linear algebraic proof. As we shall see below, the same conclusion will also follow from the general Lax-Milgram theorem. Let P be a matrix whose columns form a basis for \mathcal{X}. Then every y\in \mathcal{X} can be written as y = Pu for some u \in \mathbb{R}^M. Thus, writing \hat{x} = Pw, the variational problem Eq. (2) becomes u^\top P^\top A P w = u^\top P^\top b for every u\in \mathbb{R}^M. But this is just a variational formulation of the equation (P^\top A P)w = P^\top b. The matrix P^\top A P is SPD since v^\top(P^\top A P)v = (Pv)^\top A (Pv) > 0 for v\ne 0 since A is SPD. Thus (P^\top A P)w = P^\top b has a unique solution w = (P^\top A P)^{-1}P^\top b. Thus \hat{x} = Pw = P(P^\top AP)^{-1}P^\top b is the unique solution to the variational problem Eq. (2). In what sense is \hat{x} a good approximate solution to Ax = b? To answer this question, we need to introduce a special way of measuring the error of an approximate solution to Ax = b. We define the A-inner product of vectors x and y to be \langle x, y \rangle_A := y^\top Ax and the associated A-norm \|x\|_A = \sqrt{\langle x, x\rangle_A} = \sqrt{x^\top A x}.3Note that the A-norm can be seen as a weighted Euclidean norm, where the components of the vector x in the directions of the eigenvectors of A are scaled by their corresponding eigenvalues. Concretely, if x = a_1 q_1 + \cdots + a_N q_N where q_j is an eigenvector of A with eigenvalue \lambda_j (Aq_j = \lambda_j q_j), then we have \|x\|_A^2 = \lambda_1 a_1^2 + \cdots + \lambda_N a_N^2. All of the properties satisfied by the familiar Euclidean inner product and norm carry over to the new A-inner product and norm (e.g., the Pythagorean theorem). Indeed, for those familiar, one can show \langle \cdot, \cdot \rangle_A satisfies all the axioms for an inner product space.

We shall now show that the error x - \hat{x} between x and its Galerkin approximation \hat{x} is A-orthogonal to the space \mathcal{X} in the sense that \langle y, x - \hat{x}\rangle_A = 0 for all y \in \mathcal{X}. This follows from the straightforward calculation, for y \in \mathcal{X},

(3)   \begin{equation*} \langle y, x - \hat{x} \rangle_A = y^\top A (x - \hat{x}) = y^\top A x - y^\top A \hat{x} = y^\top b - y^\top b = 0, \end{equation*}

where y^\top A x = y^\top b since x solves the variational problem Eq. (1) and y^\top A \hat{x} = y^\top b since \hat{x} solves the variational problem Eq. (2).

The fact that the error x - \hat{x} is A-orthogonal to \mathcal{X} can be used to show that \hat{x} is, in a sense, the best approximate solution to Ax = b in the subspace \mathcal{X}. First note that, for any approximate solution z \in \mathcal{X} to Ax = b, the vector \hat{x} - z \in \mathcal{X} is A-orthogonal to x - \hat{x}. Thus, by the Pythagorean theorem,

(4)   \begin{equation*} \|x - z\|^2_A = \|(x - \hat{x}) + (\hat{x}-z)\|_A^2 = \|x-\hat{x}\|_A^2 + \|\hat{x} - z\|_A^2 \ge \|x - \hat{x} \|_A^2. \end{equation*}

Thus, the Galerkin approximation \hat{x} is the best approximate solution to Ax = b in the subspace \mathcal{X} with respect to the A-norm: \|x - z\|_A \ge \|x - \hat{x} \|_A for every z \in \mathcal{X}. Consequently, if one picks a subspace \mathcal{X} in which the solution x almost lies,4In the sense that \inf_{z \in \mathcal{X}} \|x - z\|_A is small. then \hat{x} will be a good approximate solution to Ax = b, irrespective of how small the dimension of the subspace \mathcal{X} is.
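To make this concrete, here is a minimal NumPy sketch (my own illustration, not from the discussion above; the matrix, right-hand side, subspace, and dimensions are arbitrary choices) which forms the Galerkin approximation \hat{x} = P(P^\top AP)^{-1}P^\top b from footnote 2 and numerically checks the A-orthogonality Eq. (3) and the optimality Eq. (4).

```python
import numpy as np

rng = np.random.default_rng(0)

# A hypothetical SPD system Ax = b, small enough that we can also form the true x
N, M = 200, 20
G = rng.standard_normal((N, N))
A = G @ G.T + N * np.eye(N)            # symmetric positive definite
b = rng.standard_normal(N)
x = np.linalg.solve(A, b)              # "true" solution, for comparison only

# Subspace X = range(P), spanned by M basis vectors stored as columns of P
P = rng.standard_normal((N, M))

# Galerkin approximation: solve the small M-by-M system (P^T A P) w = P^T b
w = np.linalg.solve(P.T @ A @ P, P.T @ b)
x_hat = P @ w

# The error x - x_hat is A-orthogonal to the subspace X (cf. Eq. (3)) ...
print(np.max(np.abs(P.T @ (A @ (x - x_hat)))))       # ≈ 0, up to roundoff

# ... so x_hat is the best approximation to x in X in the A-norm (cf. Eq. (4))
A_norm = lambda v: np.sqrt(v @ A @ v)
z = P @ rng.standard_normal(M)                        # any other candidate in X
print(A_norm(x - x_hat) <= A_norm(x - z))             # True
```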

Variational Formulations of Differential Equations

As I hope I’ve conveyed in the previous section, Galerkin approximation is not a technique that only works for finite element methods or even just PDEs. However, differential and integral equations are one of the most important applications of Galerkin approximation since the space of all possible solutions to a differential or integral equation is infinite-dimensional: approximation in a finite-dimensional space is absolutely critical. In this section, I want to give a brief introduction to how one can develop variational formulations of differential equations amenable to Galerkin approximation. For simplicity of presentation, I shall focus on a one-dimensional problem described by an ordinary differential equation (ODE) boundary value problem. All of this generalizes wholesale to partial differential equations in multiple dimensions, though there are some additional technical and notational difficulties (some of which I will address in footnotes). Variational formulation of differential equations is a topic with important technical subtleties which I will end up brushing past. Rigorous references are Chapters 5 and 6 from Evans’ Partial Differential Equations or Chapters 0-2 from Brenner and Scott’s The Mathematical Theory of Finite Element Methods.

As our model problem for which we seek a variational formulation, we will focus on the one-dimensional Poisson equation, which appears in the study of electrostatics, gravitation, diffusion, heat flow, and fluid mechanics. The unknown u is a real-valued function on an interval, which we take to be [0,1].5In higher dimensions, one can consider an arbitrary domain \Omega \subseteq \mathbb{R}^d with, for example, a Lipschitz boundary. We assume Dirichlet boundary conditions, meaning that u is equal to zero on the boundary: u(0) = u(1) = 0.6In higher dimensions, one has u = 0 on the boundary \partial \Omega of the region \Omega. Poisson’s equation then reads7In higher dimensions, -\Delta u = f on \Omega and u = 0 on \partial\Omega, where \Delta is the Laplacian operator.

(5)   \begin{equation*} -u''(x) = f(x) \mbox{ for every } x \in (0,1), \quad u(0) = u(1) = 0. \end{equation*}

We wish to develop a variational formulation of this differential equation, similar to how we developed a variational formulation of the linear system of equations in the previous section. To develop our variational formulation, we take inspiration from physics. If u(x) represents, say, the temperature at a point x, we are never able to measure u(x) exactly. Rather, we can measure the temperature in a region around x with a thermometer. No matter how carefully we engineer our thermometer, our thermometer tip will have some volume occupying a region R in space. The temperature u_{\rm meas} measured by our thermometer will be the average temperature in the region R or, more generally, a weighted average u_{\rm meas} = \int_0^1 u(x) v(x) \, dx where v(\cdot) is a weighting function which is zero outside the region R. Now let’s use our thermometer to “measure” our differential equation:

(6)   \begin{equation*} \int_0^1-v(x) u''(x) \, dx = \int_0^1 v(x)f(x) \, dx. \end{equation*}

This integral expression is some kind of variational formulation of our differential equation, as it is an equation involving the solution to our differential equation u which must hold for every averaging function v. (The precise meaning of “every” will be forthcoming.) It will benefit us greatly to make this expression more “symmetric” with respect to u and v. To do this, we shall integrate by parts:8Integrating by parts is harder in higher dimensions. My personal advice for integrating by parts in higher dimensions is to remember that integration by parts is ultimately a result of the product rule. As such, to integrate by parts, we first write an expression involving our integrand using the product rule of some differential operator and then integrate both sides. In this case, notice that \nabla \cdot (v \nabla u) = v\Delta u + \nabla v \cdot \nabla u. Rearranging and integrating, we see that \int_\Omega -v\Delta u \, dx = \int_\Omega \nabla v \cdot \nabla u \, dx - \int_\Omega \nabla \cdot (v\nabla u) \, dx. We then apply the divergence theorem to the last term to get \int_\Omega \nabla \cdot (v\nabla u) \, dx = \int_{\partial \Omega} v \nabla u \cdot n \, dS(x), where n represents an outward-facing unit normal to the boundary \partial \Omega and dS(x) represents integration on the surface \partial \Omega. If v is zero on \partial \Omega, this boundary term vanishes and, using -\Delta u = f, we conclude \int_\Omega \nabla v \cdot \nabla u \, dx  = \int_\Omega v f \, dx for all nice functions v on \Omega satisfying v = 0 on \partial \Omega.

(7)   \begin{equation*} \int_0^1-v(x)u''(x) \, dx = \int_0^1 v'(x)u'(x) \, dx + v(0) u'(0) - v(1)u'(1). \end{equation*}

In particular, if v is zero on the boundary, v(0) = v(1) = 0, then the two boundary terms vanish and we’re left with the variational equation

(8)   \begin{equation*} \int_0^1 v'(x)u'(x) \, dx = \int_0^1 v(x) f(x) \, dx \mbox{ for all \textit{nice} functions $v$ on $[0,1]$ with } v(0) = v(1) = 0. \end{equation*}
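As a quick sanity check (my own addition, not part of the derivation above), one can verify Eq. (8) numerically for a manufactured solution: take u(x) = \sin(\pi x), so that f(x) = -u''(x) = \pi^2 \sin(\pi x), and one admissible test function v(x) = x(1-x). Both sides of Eq. (8) come out to 4/\pi.

```python
import numpy as np

# Manufactured solution u(x) = sin(pi x) of -u'' = f with f(x) = pi^2 sin(pi x),
# and one test function v(x) = x(1 - x) satisfying v(0) = v(1) = 0.
n = 100_000
x = (np.arange(n) + 0.5) / n          # midpoints of a uniform grid on (0,1)
dx = 1.0 / n

u_prime = np.pi * np.cos(np.pi * x)
v, v_prime = x * (1 - x), 1 - 2 * x
f = np.pi**2 * np.sin(np.pi * x)

lhs = np.sum(v_prime * u_prime) * dx  # ∫ v'(x) u'(x) dx, by the midpoint rule
rhs = np.sum(v * f) * dx              # ∫ v(x) f(x) dx
print(lhs, rhs)                       # both ≈ 4/pi ≈ 1.2732
```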

Compare the variational formulation of the Poisson equation Eq. (8) to the variational formulation of the system of linear equations Ax = b in Eq. (1). The solution vector x in the differential equation context is a function u satisfying the boundary condition of u being zero on the boundary u(0) = u(1) = 0. The right-hand side b is replaced by a function f on the interval [0,1]. The test vector y is replaced by a test function v on the interval [0,1]. The matrix product expression y^\top A x is replaced by the integral \int_0^1 v'(x)u'(x) \, dx. The product y^\top b is replaced by the integral \int_0^1 v(x) f(x) \, dx. As we shall soon see, there is a unifying theory which treats both of these contexts simultaneously.

Before presenting this unifying theory, we must address the question of which functions v we will consider in our variational formulation. One can show that all of the calculations we did in this section hold if v is a continuously differentiable function on [0,1] which is zero at the endpoints 0 and 1 and u is a twice continuously differentiable function on [0,1]. Because of technical functional analytic considerations, we shall actually want to expand the class of functions in our variational formulation to even more functions v. Specifically, we shall consider all functions v which are (A) square-integrable (\int_0^1|v(x)|^2 \,dx is finite), (B) possess a square-integrable derivative9More specifically, we only insist that v possess a square-integrable weak derivative. v' (\int_0^1|v'(x)|^2 \,dx is finite), and (C) are zero on the boundary. We refer to this class of functions as the Sobolev space H_0^1((0,1)).10The class of functions satisfying (A) and (B) but not necessarily (C) is the Sobolev space H^1((0,1)). For an arbitrary function in H^1((0,1)), the existence of a well-defined restriction to the boundary points 0 and 1 is actually nontrivial to show, requiring the construction of a trace operator. Chapter 5 of Evans’ Partial Differential Equations is a good introduction to Sobolev spaces. The Sobolev spaces H_0^1((0,1)) and H^1((0,1)) naturally extend to spaces H_0^1(\Omega) and H^1(\Omega) for an arbitrary domain \Omega \subseteq \mathbb{R}^d with a nice boundary.

Now this is where things get really strange. Note that it is possible for a function u to satisfy the variational formulation Eq. (8) but for u not to satisfy the Poisson equation Eq. (5). A simple example is when f possesses a discontinuity (say, a step discontinuity where f is 0 and then jumps to 1). Then no continuously differentiable u will satisfy Eq. (5) at every point in (0,1), and yet a solution u to the variational problem Eq. (8) exists! The variational formulation actually allows us to give a reasonable definition of “solving the differential equation” when a classical solution to -u'' = f does not exist. Our only requirement for the variational problem is that u, itself, belongs to the space H_0^1((0,1)). A solution to the variational problem Eq. (8) is called a weak solution to the differential equation Eq. (5) because, as we have argued, a weak solution need not always solve Eq. (5).11One can show that any classical solution to Eq. (5) solves Eq. (8). Given certain conditions on f, one can go the other way, showing that weak solutions are indeed bona fide classical solutions. This is the subject of regularity theory.

The Lax-Milgram Theorem

Let us now build up an abstract language which allows us to use Galerkin approximation both for linear systems of equations and PDEs (as well as other contexts). If one compares the expressions y^\top A x from the linear systems context and \int_0^1 v'(x)u'(x) \, dx from the differential equation context, one recognizes that both these expressions are so-called bilinear forms: they depend on two arguments (x and y, or u and v) and are linear in each argument when the other is held fixed. For example, if one defines a(x,y) = y^\top A x, one has a(x,\alpha_1 y_1 + \alpha_2 y_2) = \alpha_1 a(x,y_1) + \alpha_2 a(x,y_2). Similarly, if one defines a(u,v) = \int_0^1 v'(x)u'(x) \, dx, then a(\alpha_1 u_1 + \alpha_2 u_2,v) = \alpha_1 a(u_1,v) + \alpha_2 a(u_2, v).

Implicitly swimming in the background is some space of vectors or functions on which this bilinear form a(\cdot,\cdot) is defined. In the linear system of equations context, this space is \mathbb{R}^N, the space of N-dimensional vectors, and in the differential equation context, this space is H_0^1((0,1)) as defined in the previous section.12The connection between vectors and functions is even more natural if one considers a function u : [0,1] \to \mathbb{R} as a vector of infinite length, with one entry for each real number x \in [0,1]. Call this space \mathcal{V}. We shall assume that \mathcal{V} is a special type of linear space called a Hilbert space: an inner product space (with inner product \langle \cdot, \cdot \rangle_\mathcal{V}) in which every Cauchy sequence converges to an element of \mathcal{V} (in the inner product-induced norm).13Note that every inner product space has a unique completion to a Hilbert space. For example, if one considers the space C_{0}^\infty((0,1)) of C^\infty smooth functions which vanish near the endpoints of (0,1) with the inner product \langle u, v \rangle_{H^1_0((0,1))} = \int_0^1 u(x)v(x) + u'(x)v'(x) \, dx, the completion is H_0^1((0,1)). A natural extension to higher dimensions holds. The Cauchy sequence convergence property, also known as metric completeness, is important because we shall often deal with a sequence of elements u_1,u_2,\ldots \in \mathcal{V} for which we will need to establish convergence to an element u \in \mathcal{V}. (Think of u_1,u_2,\ldots as a sequence of Galerkin approximations to a solution u.)

With these formalities, an abstract variational problem takes the form

(9)   \begin{equation*} \mbox{Find $u \in \mathcal{V}$ such that }a(u,v) = \ell(v) \mbox{ for all } v \in \mathcal{V}, \end{equation*}

where a(\cdot,\cdot) is a bilinear form on \mathcal{V} and \ell(\cdot) is a linear form on \mathcal{V} (a linear map \ell: \mathcal{V} \to \mathbb{R}). There is a beautiful and general theorem called the Lax-Milgram theorem which establishes existence and uniqueness of solutions to a problem like Eq. (9).

Theorem (Lax-Milgram): Let a(\cdot,\cdot) and \ell satisfy the following properties:

  1. (Boundedness of a) There exists a constant C \ge 0 such that, for every u,v \in \mathcal{V}, |a(u,v)| \le C \|u\|_{\mathcal{V}}\|v\|_{\mathcal{V}}.
  2. (Coercivity) There exists a positive constant c > 0 such that a(u,u) \ge c \|u\|_{\mathcal{V}}^2 for every u \in \mathcal{V}.
  3. (Boundedness of \ell) There exists a constant K such that |\ell(v)| \le K \|v\|_{\mathcal{V}} for every v\in \mathcal{V}.

Then the variational problem Eq. (9) possesses a unique solution.

For our cases, a will also be symmetric: a(u,v) = a(v,u) for all u,v \in \mathcal{V}. While the Lax-Milgram theorem holds without symmetry, let us continue our discussion with this additional symmetry assumption. Note that, taken together, properties (1-2) say that the a-inner product, defined as \langle u, v \rangle_a = a(u,v), is comparable to the standard inner product \langle u, v\rangle_{\mathcal{V}}: it can be neither too much larger nor too much smaller.14That is, the a-norm and the \mathcal{V}-norm are equivalent in the sense that \sqrt{c}\|v\|_{\mathcal{V}} \le \|v\|_a \le \sqrt{C} \|v\|_{\mathcal{V}}, so the norms \|\cdot\|_\mathcal{V} and \|\cdot\|_a define the same topology.

Let us now see how the Lax-Milgram theorem applies to our two examples. A reader who wants a more “big picture” perspective can comfortably skip to the next section. For those who want to see Lax-Milgram in action, see the discussion below.

Applying the Lax-Milgram Theorem
Begin with the linear system of equations example, taking \mathcal{V} = \mathbb{R}^N with inner product \langle x, y \rangle_{\mathbb{R}^N} = y^\top x, a(x,y) = y^\top Ax, and \ell(y) = y^\top b. Note that we have the inequality \lambda_{\rm min} x^\top x \le x^\top A x \le \lambda_{\rm max} x^\top x.15This is an easy consequence of the Courant-Fischer theorem. More generally, note that, since A is symmetric, A has an orthonormal basis of eigenvectors q_1,\ldots,q_N with eigenvalues \lambda_{\rm max} = \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_N = \lambda_{\rm min}. Then x = x_1q_1 + \cdots + x_Nq_N and x^\top Ax = \lambda_{\rm max} x_1^2 + \cdots + \lambda_{\rm min} x_N^2. The inequalities follow from noting the Parseval relation \|x\|^2 = x^\top x = x_1^2 + \cdots + x_N^2 and noting that x^\top Ax / x^\top x is a convex combination of the eigenvalues of A. In particular, we have that \|x\|_A = \sqrt{x^\top A x} \le \sqrt{\lambda_{\rm max}} \|x\|. Property (1) then follows from the Cauchy-Schwarz inequality applied to the A-inner product: |a(x,y)| = |\langle x, y\rangle_A| \le \|x\|_A \|y\|_A \le \lambda_{\rm max} \|x\|\|y\|. Property (2) is simply the established inequality a(x,x) = x^\top Ax \ge \lambda_{\rm min} x^\top x. Property (3) also follows from the Cauchy-Schwarz inequality: |\ell(y)| = |y^\top b| \le \|b\|\|y\|. Thus, by Lax-Milgram, the variational problem y^\top Ax = y^\top b for all y \in \mathbb{R}^N has a unique solution x. Note that the linear systems example shows why the coercivity property (2) is necessary. If A is positive semi-definite but not positive-definite, then there exists an eigenvector v of A with eigenvalue 0. Then v^\top Av = 0 \not\ge c v^\top v for any positive constant c and A is singular, so the variational formulation of Ax = b has no solution for some choices of the vector b.
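For a quick numerical illustration of these constants (a small sketch of my own, using an arbitrary random SPD matrix), note that the Lax-Milgram constants in this example are C = \lambda_{\rm max} and c = \lambda_{\rm min}:

```python
import numpy as np

rng = np.random.default_rng(1)

# Small random SPD example: the Lax-Milgram constants are C = lambda_max, c = lambda_min
G = rng.standard_normal((50, 50))
A = G @ G.T + np.eye(50)
lam_min, lam_max = np.linalg.eigvalsh(A)[[0, -1]]    # eigvalsh returns ascending eigenvalues

a = lambda x, y: y @ A @ x
x, y = rng.standard_normal((2, 50))

print(lam_min * (x @ x) <= a(x, x) <= lam_max * (x @ x))                  # coercivity and upper bound
print(abs(a(x, y)) <= lam_max * np.linalg.norm(x) * np.linalg.norm(y))    # boundedness of a
```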

Applying the Lax-Milgram theorem to differential equations can require powerful inequalities. In this case, the \mathcal{V} = H_0^1((0,1))-inner product is given by \langle u, v \rangle_{H_0^1((0,1))} = \int_0^1 v(x)u(x) + v'(x)u'(x) \, dx, a(u,v) = \int_0^1 v'(x)u'(x) \, dx, and \ell(v) = \int_0^1 v(x)f(x) \, dx. Condition (1) follows from an application of the Cauchy-Schwarz inequality for integrals:16 In higher dimensions, we need even more Cauchy-Schwarz! First, we note that the absolute value of an integral is at most the integral of the absolute value, |a(u,v)| \le \int_\Omega |\nabla v \cdot \nabla u| \, dx. Second, we apply the Cauchy-Schwarz inequality for the vectors \nabla u and \nabla v to get |\nabla v \cdot \nabla u| \le \|\nabla v\| \|\nabla u\| where, e.g., \|\nabla v\| is the Euclidean norm of the vector \nabla v. This gives |a(u,v)| \le \int_\Omega \|\nabla v\| \|\nabla u\| \, dx. Next, we apply the Cauchy-Schwarz inequality for integrals to get |a(u,v)| \le \left(\int_\Omega \|\nabla v\|^2 \, dx\right)^{1/2} \left(\int_\Omega \|\nabla u\|^2 \, dx\right)^{1/2}. Finally, we note that \int_\Omega \|\nabla v\|^2 \, dx \le \int_\Omega v^2 + \|\nabla v\|^2 \, dx = \langle v, v\rangle_{H_0^1(\Omega)} = \|v\|_{H_0^1(\Omega)}^2 and thus obtain |a(u,v)| \le \|v\|_{H_0^1(\Omega)}\|u\|_{H_0^1(\Omega)}. This is the desired inequality with constant one.

(10)   \begin{equation*} \begin{split} |a(u,v)| &\le \int_0^1 |v'(x)||u'(x)| \, dx \\ &\le \left(\int_0^1 |v'(x)|^2 \, dx\right)^{1/2} \left(\int_0^1 |u'(x)|^2 \, dx\right)^{1/2} \\ &\le \|v\|_{H_0^1((0,1))}\|u\|_{H_0^1((0,1))}. \end{split} \end{equation*}

Let’s go line-by-line. First, we note that the absolute value of an integral is at most the integral of the absolute value. Next, we apply the Cauchy-Schwarz inequality for integrals. Finally, we note that \int_0^1|v'(x)|^2 \, dx \le \int_0^1|v(x)|^2 + |v'(x)|^2 \, dx = \langle v, v\rangle_{H_0^1((0,1))} = \|v\|_{H_0^1((0,1))}^2. This establishes Property (1) with constant C = 1. As we can already see, only one third of the way into verifying the hypotheses of Lax-Milgram, establishing these inequalities can require several steps. Ultimately, however, strong knowledge of just a few core inequalities (e.g., Cauchy-Schwarz) may be all that’s needed.

Proving coercivity (Property (2)) actually requires a very special inequality, Poincaré’s inequality.17In higher dimensions, Poincaré’s inequality takes the form \int_\Omega \|\nabla u\|^2 \, dx \ge k\int_\Omega u^2 \, dx for a constant k depending only on the domain \Omega. In its simplest incarnation, the inequality states that there exists a constant k such that, for all functions u \in H_0^1((0,1)),18A simple proof of Poincaré’s inequality for continuously differentiable u with u(0) = u(1) = 0 goes as follows. Note that, by the fundamental theorem of calculus, u(x) = u(0) + \int_0^x u'(y) \, dy = \int_0^x u'(y) \, dy. Applying the Cauchy-Schwarz inequality for integrals gives |u(x)|\le \left(\int_0^x 1^2 \, dy\right)^{1/2}\left(\int_0^x |u'(y)|^2 \, dy\right)^{1/2} \le \sqrt{x} \left(\int_0^x |u'(y)|^2 \, dy\right)^{1/2} \le \left(\int_0^1 |u'(y)|^2 \, dy\right)^{1/2} since 0\le x\le 1. Thus |u(x)|^2 \le \int_0^1 |u'(y)|^2 \, dy for all x \in [0,1]; integrating over x gives \int_0^1\int_0^1|u'(y)|^2 \, dy\, dx = \int_0^1|u'(x)|^2 \, dx \ge \int_0^1|u(x)|^2 \, dx. This proves Poincaré’s inequality with the constant k = 1.

(11)   \begin{equation*} \int_0^1|u'(x)|^2 \, dx \ge k\int_0^1|u(x)|^2 \, dx. \end{equation*}
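For a concrete check, take u(x) = \sin(\pi x): then \int_0^1 |u'(x)|^2 \, dx = \pi^2/2 while \int_0^1 |u(x)|^2 \, dx = 1/2, so Eq. (11) holds comfortably. (In fact, on the interval (0,1) the inequality holds with the optimal constant k = \pi^2, attained exactly by u(x) = \sin(\pi x), though the elementary argument in the footnote only delivers k = 1.)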

With this inequality in tow, property (2) follows after another lengthy string of inequalities:19The same estimate holds in higher dimensions, with the appropriate generalization of Poincaré’s inequality.

(12)   \begin{equation*} \begin{split} a(u,u) &= \int_0^1 |u'(x)|^2 \, dx \\ &= \frac{1}{2} \int_0^1 |u'(x)|^2 \, dx + \frac{1}{2} \int_0^1 |u'(x)|^2 \, dx \\ &\ge \frac{1}{2} \int_0^1 |u'(x)|^2 \, dx + \frac{k}{2} \int_0^1 |u(x)|^2 \, dx \\ &\ge \min \left(\frac{1}{2},\frac{k}{2}\right) \left( \int_0^1 |u'(x)|^2 \, dx + \int_0^1|u(x)|^2 \, dx\right) \\ &= \min \left(\frac{1}{2},\frac{k}{2}\right) \|u\|_{H_0^1((0,1))}^2. \end{split} \end{equation*}

For Property (3) to hold, the function f must be square-integrable. With this hypothesis, Property (3) is much easier than Properties (1-2) and we leave it as an exercise for the interested reader (or to a footnote20The proof is similar in one or higher dimensions, so we state it for an arbitrary domain \Omega for brevity. By Cauchy-Schwarz, we have that \ell(v) = \int_\Omega vf \, dx \le \left( \int_{\Omega} f^2 \, dx \right)^{1/2}\left( \int_{\Omega} v^2 \, dx \right)^{1/2} \le \left( \int_{\Omega} f^2 \, dx \right)^{1/2} \|v\|_{H_0^1(\Omega)}. for the uninterested reader).

This may seem like a lot of work, but the result we have achieved is stunning. We have proven (modulo a lot of omitted details) that the Poisson equation -u'' = f has a unique weak solution as long as f is square-integrable!21And in the footnotes, we have upgraded this proof to existence of a unique weak solution to the Poisson equation -\Delta u = f on a domain \Omega. What is remarkable about this proof is that it uses the Lax-Milgram theorem and some inequalities alone: no specialized knowledge about the physics underlying the Poisson equation was necessary. Going through the details of Lax-Milgram has been a somewhat lengthy affair for an introductory post, but hopefully this discussion has illuminated the power of functional analytic tools (like Lax-Milgram) in studying differential equations. Now, with a healthy dose of theory in hand, let us return to Galerkin approximation.

General Galerkin Approximation

With our general theory set up, Galerkin approximation for a general variational problem is the same as it was for a system of linear equations. First, we pick an approximation space \mathcal{X} which is a subspace of \mathcal{V}. We then have the Galerkin variational problem

(13)   \begin{equation*} \mbox{Find $\hat{u} \in \mathcal{X}$ such that } a(\hat{u},v) = \ell(v) \mbox{ for every } v \in \mathcal{X}. \end{equation*}

Provided a and \ell satisfy the conditions of the Lax-Milgram theorem, there is a unique solution \hat{u} to the problem Eq. (13). Moreover, the special property of Galerkin approximation holds: the error u-\hat{u} is a-orthogonal to the subspace \mathcal{X}. Consequently, \hat{u} is the best approximate solution to the variational problem Eq. (9) in the a-norm. To see the a-orthogonality, we have that, for any v \in \mathcal{X},

(14)   \begin{equation*} \langle u-\hat{u}, v\rangle_a = a(u-\hat{u},v) = a(u,v) - a(\hat{u},v) = \ell(v) - \ell(v) = 0, \end{equation*}

where we use the variational equation Eq. (9) for a(u,v) = \ell(v) and Eq. (13) for a(\hat{u},v) = \ell(v). Note the similarities with Eq. (3). Thus, using the Pythagorean theorem for the a-norm, for any other approximate solution w \in \mathcal{X}, we have22Compare Eq. (4).

(15)   \begin{equation*} \|u - w\|^2_a = \|(u - \hat{u}) + (\hat{u}-w)\|_a^2 = \|u-\hat{u}\|_a^2 + \|\hat{u} - w\|_a^2 \ge \|u - \hat{u} \|_a^2. \end{equation*}

Put simply, \hat{u} is the best approximation to u in the a-norm.23Using the fact that the norms \|\cdot\|_{\mathcal{V}} and \|\cdot\|_a are equivalent by virtue of Properties (1-2), one can also show that \hat{u} is within a constant factor C/c of the best approximation in the norm \|\cdot\|_{\mathcal{V}}. This is known as Céa’s Lemma.

Galerkin approximation is powerful because it allows us to approximate an infinite-dimensional problem by a finite-dimensional one. If we let \phi_1,\ldots,\phi_M be a basis for the space \mathcal{X}, then the approximate solution \hat{u} can be represented as \hat{u} = x_1 \phi_1 + \cdots + x_M \phi_M. Since \phi_1,\ldots,\phi_M form a basis of \mathcal{X}, to check that the Galerkin variational problem Eq. (13) holds for all v \in \mathcal{X} it is sufficient to check that it holds for v = \phi_1, v=\phi_2,\ldots,v=\phi_M.24An arbitrary v\in \mathcal{X} can be written as v = y_1 \phi_1 + \cdots + y_M\phi_M, so a(\hat{u},v) = a\left(\hat{u}, \sum_{j=1}^M y_j \phi_j \right) = \sum_{j=1}^M y_j a(\hat{u},\phi_j) = \sum_{j=1}^M y_j \ell(\phi_j) = \ell\left(\sum_{j=1}^M y_j \phi_j\right) = \ell(v). Thus, plugging \hat{u} = \sum_{j=1}^M x_j \phi_j and v = \phi_i into Eq. (13), we get (using bilinearity of a)

(16)   \begin{equation*} a(\hat{u},v) = a\left(\sum_{j=1}^M x_j \phi_j, \phi_i\right) = \sum_{j=1}^M a(\phi_j,\phi_i) x_j = \ell(v) = \ell(\phi_i), \quad i =1,2,\ldots,M. \end{equation*}

If we define a_{ij} = a(\phi_j,\phi_i) and b_i = \ell(\phi_i), then this gives us a matrix equation Ax = b for the unknowns x_1,\ldots,x_M parametrizing \hat{u}. Thus, we can compute our Galerkin approximation by solving a linear system of equations.
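To see this pipeline end-to-end, here is a minimal sketch (my own illustration, with a simple one-point quadrature for the load vector and an arbitrary choice of M) of a piecewise-linear finite element discretization of the model problem -u'' = f, u(0) = u(1) = 0, using the “hat” basis functions attached to the interior nodes of a uniform grid. For this basis, the stiffness entries a(\phi_j,\phi_i) = \int_0^1 \phi_i'(x)\phi_j'(x) \, dx work out to 2/h on the diagonal and -1/h on the first off-diagonals.

```python
import numpy as np

# Galerkin / finite element sketch for -u'' = f on (0,1), u(0) = u(1) = 0,
# using piecewise-linear "hat" functions phi_1, ..., phi_M at the interior nodes.
M = 50                               # number of basis functions (interior nodes)
h = 1.0 / (M + 1)                    # uniform grid spacing
nodes = h * np.arange(1, M + 1)      # interior nodes x_1, ..., x_M

f = lambda x: np.pi**2 * np.sin(np.pi * x)   # manufactured f, so that u(x) = sin(pi x)

# Stiffness matrix a_ij = a(phi_j, phi_i): tridiagonal with 2/h and -1/h
A = (2 / h) * np.eye(M) - (1 / h) * (np.eye(M, k=1) + np.eye(M, k=-1))

# Load vector b_i = l(phi_i) = ∫ f(x) phi_i(x) dx, approximated here by h * f(x_i)
b = h * f(nodes)

# Solve the Galerkin system Ax = b; since the hat functions are nodal,
# the coefficient x_i is the value of the approximation u_hat at the node x_i
x = np.linalg.solve(A, b)

print(np.max(np.abs(x - np.sin(np.pi * nodes))))   # small, and shrinks as M grows
```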

We’ve covered a lot of ground so let’s summarize. Galerkin approximation is a technique which allows us to approximately solve a large- or infinite-dimensional problem by searching for an approximate solution in a smaller finite-dimensional space \mathcal{X} of our choosing. This Galerkin approximation is the best approximate solution to our original problem in the a-norm. By choosing a basis \phi_1,\ldots,\phi_M for our approximation space \mathcal{X}, we reduce the problem of computing a Galerkin approximation to a linear system of equations.

Design of a Galerkin approximation scheme for a variational problem thus boils down to choosing the approximation space \mathcal{X} and a basis \phi_1,\ldots,\phi_M. Picking \mathcal{X} to be a space of piecewise polynomial functions (splines) gives the finite element method. Picking \mathcal{X} to be a space spanned by a collection of trigonometric functions gives a Fourier spectral method. One can use a space spanned by wavelets as well. The Galerkin framework is extremely general: give it a subspace \mathcal{X} and it will give you a linear system of equations to solve to give you the best approximate solution in \mathcal{X}.

Two design considerations factor into the choice of space \mathcal{X} and basis \phi_1,\ldots,\phi_M. First, one wants to pick a space \mathcal{X} in which the solution u almost lies. This is the rationale behind spectral methods. Smooth functions are very well-approximated by short truncated Fourier expansions, so, if the solution u is smooth, spectral methods will converge very quickly. Finite element methods, which often use low-order piecewise polynomial functions, converge much more slowly to a smooth u. The second design consideration is the ease of solving the system Ax = b resulting from the Galerkin approximation. If the basis functions \phi_1,\ldots,\phi_M are local in the sense that most pairs of basis functions \phi_i and \phi_j aren’t nonzero at the same point x (more formally, \phi_i and \phi_j have disjoint supports for most i and j), the system Ax = b will be sparse and thus usually much easier to solve. Traditional spectral methods usually result in harder-to-solve dense linear systems of equations.25There are clever ways of making spectral methods which lead to sparse matrices. Conversely, if one uses high-order piecewise polynomials in a finite element approximation, one can get convergence properties similar to a spectral method. These are called spectral element methods. It should be noted that both spectral and finite element methods lead to ill-conditioned matrices A, making integral equation-based approaches preferable if one needs high accuracy.26For example, only one researcher using a finite-element method was able to meet Trefethen’s challenge to solve the Poisson equation to eight digits of accuracy on an L-shaped domain (see Section 6 of this paper). Getting that solution required using a finite-element method of order 15! Integral equations, themselves, are often solved using Galerkin approximation, leading to so-called boundary element methods.
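To illustrate the spectral end of this trade-off, here is another small sketch of my own (with an arbitrary right-hand side f(x) = x, whose exact solution is u(x) = (x - x^3)/6) of a sine-basis Galerkin method for the model problem. Because a(\phi_j,\phi_k) = \int_0^1 \phi_j'(x)\phi_k'(x)\,dx vanishes for j \ne k when \phi_k(x) = \sin(k\pi x), the Galerkin matrix is diagonal, so the “linear system” is solved by a single division per coefficient.

```python
import numpy as np

# Sine-basis (spectral Galerkin) sketch for -u'' = f on (0,1), u(0) = u(1) = 0.
# With phi_k(x) = sin(k pi x), the Galerkin matrix a(phi_j, phi_k) = (k^2 pi^2 / 2) delta_jk
# is diagonal.
M = 30
k = np.arange(1, M + 1)

f = lambda x: x                          # right-hand side; exact solution u(x) = (x - x^3)/6
u_exact = lambda x: (x - x**3) / 6

# l(phi_k) = ∫ f(x) sin(k pi x) dx, computed with the midpoint rule
n = 10_000
xm = (np.arange(n) + 0.5) / n
ell = (f(xm) * np.sin(np.pi * np.outer(k, xm))).sum(axis=1) / n

coeffs = ell / (k**2 * np.pi**2 / 2)     # "solving" the diagonal Galerkin system

# Evaluate the Galerkin approximation and compare with the exact solution
xs = np.linspace(0, 1, 201)
u_hat = np.sin(np.pi * np.outer(xs, k)) @ coeffs
print(np.max(np.abs(u_hat - u_exact(xs))))   # decreases as M grows
```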

Upshot: Galerkin approximation is a powerful and extremely flexible methodology for approximately solving large- or infinite-dimensional problems by finding the best approximate solution in a smaller finite-dimensional subspace. To use a Galerkin approximation, one must convert the problem to a variational formulation and pick a basis for the approximation space. After doing this, computing the Galerkin approximation reduces to solving a system of linear equations with dimension equal to the dimension of the approximation space.

6 thoughts on “Big Ideas in Applied Math: Galerkin Approximation”

  1. The “big idea” described here, which is indeed big, really goes back to Rayleigh. Ritz extended it to what we would recognize today as the (Bubnov) Galerkin method. Finite element interpolation on triangles dates to Courant (a student of Hilbert’s) in the 1920s, or perhaps teens.

  2. The Golomb-Weinberger principle gives another (seemingly less well-known) perspective on “best approximation”. A related approach is the Nystrom method. Comparing these might be instructive.
