The Elegant Geometry of Generalized Eigenvalue Perturbation Theory

In this post, I want to discuss a beautiful and simple geometric picture of the perturbation theory of definite generalized eigenvalue problems. As a culmination, we’ll see a taste of the beautiful perturbation theory of Mathias and Li, which appears not to be widely known by virtue of only being explained in a technical report. Perturbation theory for the generalized eigenvalue problem is a bit of a niche subject, but I hope you will stick around for some elegant arguments. In addition to explaining the mathematics, I hope this post serves as an allegory for the importance of having the right way of thinking about a problem; often, the solution to a seemingly unsolvable problem becomes almost self-evident when one has the right perspective.

What is a Generalized Eigenvalue Problem?

This post is about the definite generalized eigenvalue problem, so it’s probably worth spending a few words talking about what generalized eigenvalue problems are and why you would want to solve them. Slightly simplifying some technicalities, a generalized eigenvalue problem consists of finding nonzero vectors x and (possibly complex) numbers \lambda such that Ax = \lambda \, Bx.1In an unfortunate choice of naming, there is actually a completely different sense in which it makes sense to talk about generalized eigenvectors, in the context of the Jordan normal form for standard eigenvalue problems. The vector x is called an eigenvector and \lambda its eigenvalue. For our purposes, A and B will be real symmetric (or even complex Hermitian) matrices; one can also consider generalized eigenvalue problems for nonsymmetric and even non-square matrices A and B, but the symmetric case covers many applications of practical interest. The generalized eigenvalue problem is so-named because it generalizes the standard eigenvalue problem Ax = \lambda x, which is a special case of the generalized eigenvalue problem with B = I.2One can also further generalize the generalized eigenvalue problem to polynomial and nonlinear eigenvalue problems.
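If you’d like to follow along numerically, generalized eigenvalue problems with a symmetric A and a positive definite B can be solved directly with SciPy’s scipy.linalg.eigh routine. Here is a minimal sketch; the matrices are made up for illustration, and taking B positive definite is an assumption of convenience (the more general definite case is the subject of the rest of this post).

```python
import numpy as np
from scipy.linalg import eigh

# Made-up symmetric matrices; B is positive definite so that eigh's
# generalized mode applies directly.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
B = np.array([[2.0, 0.5],
              [0.5, 1.0]])

# eigh(A, B) solves A x = lambda B x for a symmetric/Hermitian pair.
eigenvalues, eigenvectors = eigh(A, B)
print("generalized eigenvalues:", eigenvalues)

# Check the defining relation A x = lambda B x for each eigenpair.
for lam, x in zip(eigenvalues, eigenvectors.T):
    print(np.allclose(A @ x, lam * B @ x))
```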

Why might we want to solve a generalized eigenvalue problem? The applications are numerous (e.g., in chemistry, quantum computation, systems and control theory, etc.). My interest in perturbation theory for generalized eigenvalue problems arose in the analysis of a quantum algorithm for eigenvalue problems in chemistry, and the theory discussed in this article played a big role in that analysis. To give an application which is easier to communicate than the quantum computation application which motivated my own interest, let’s discuss an application in classical mechanics.

The Lagrangian formalism is a way of reformulating Newton’s laws of motion in a general coordinate system.3The benefits of the Lagrangian framework are far deeper than working in generalized coordinate systems, but this is beyond the scope of our discussion and mostly beyond the scope of what I am knowledgeable enough to write meaningfully about. If q denotes a vector of generalized coordinates describing our system and \dot{q} denotes q‘s time derivative, then the time evolution of a system with Lagrangian functional L(q,\dot{q}) is given by the Euler–Lagrange equations \tfrac{d}{dt} \nabla_{\dot{q}} L = \nabla_q L. If we choose q to represent the deviation of our system from equilibrium,4That is, in our generalized coordinate system, q = 0 is a (static) equilibrium configuration for which \ddot{q} = 0 whenever q = 0 and \dot{q} = 0. then our Lagrangian is well-approximated by its second-order Taylor series:

    \begin{equation*} L(q,\dot{q}) \approx L_0 + \frac{1}{2} q^\top A q - \frac{1}{2} \dot{q}^\top B\dot{q}. \end{equation*}

By the Euler–Lagrange equations, the equations of motion for small deviations from this equilibrium point are described by

    \begin{equation*} B\ddot{q} = -Aq. \end{equation*}

A fundamental set of solutions of this system of differential equations is given by \mathrm{e}^{\pm \sqrt{-\lambda} \, t} x, where \lambda and x are the generalized eigenvalues and eigenvectors of the pair (A,B).5That is, all solutions to B\ddot{q} = -Aq can be written (uniquely) as linear combinations of solutions of the form \mathrm{e}^{\pm \sqrt{-\lambda} \, t} x. In particular, if all the generalized eigenvalues are positive, then the equilibrium is stable and the square roots of the eigenvalues are the frequencies of the modes of vibration. In the simplest mechanical systems, such as masses in one dimension connected by springs with the natural coordinate system, the matrix B is diagonal with diagonal entries equal to the different masses. In even slightly more complicated “freshman physics” problems, it is quite easy to find examples where, in the natural coordinate system, the matrix B is nondiagonal.6Almost always, the matrix B is positive definite. As this example shows, generalized eigenvalue problems aren’t crazy weird things since they emerge as natural descriptions of simple mechanical systems like coupled pendulums.
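To make this concrete, here is a small numerical sketch of the simplest situation just described: two masses in a line coupled by springs, with a diagonal mass matrix B. The masses and spring constants are made up for illustration; for a system like coupled pendulums, the same computation goes through with a nondiagonal B.

```python
import numpy as np
from scipy.linalg import eigh

# Two masses coupled by three springs (wall--m1--m2--wall); values made up.
m1, m2 = 1.0, 2.0
k1, k2, k3 = 1.0, 0.5, 1.5

# Stiffness-like matrix A and mass matrix B in the natural coordinates, so
# that small deviations from equilibrium satisfy B q'' = -A q.
A = np.array([[k1 + k2, -k2],
              [-k2,     k2 + k3]])
B = np.diag([m1, m2])

# The generalized eigenvalues of (A, B) are positive, so the equilibrium is
# stable; their square roots are the frequencies of the normal modes.
lam, modes = eigh(A, B)
print("vibration frequencies:", np.sqrt(lam))
print("mode shapes (columns):\n", modes)
```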

One reason generalized eigenvalue problems aren’t more well-known is that one can easily reduce a generalized eigenvalue problem to a standard one. If the matrix B is invertible, then the generalized eigenvalues of (A,B) are just the eigenvalues of the matrix B^{-1}A. For several reasons, this is a less-than-appealing way of reducing a generalized eigenvalue problem to a standard eigenvalue problem. A better way, appropriate when A and B are both symmetric and B is positive definite, is to reduce the generalized eigenvalue problem for (A,B) to the symmetrically reduced matrix B^{-1/2}AB^{-1/2}, which possesses the same eigenvalues as (A,B). In particular, the matrix B^{-1/2}AB^{-1/2} remains symmetric, which shows that (A,B) has real eigenvalues by the spectral theorem. In the mechanical context, one can think of this reformulation as a change of coordinate system in which the “mass matrix” B becomes the identity matrix I.
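As a quick numerical sanity check of this reduction, the sketch below forms B^{-1/2}AB^{-1/2} from the eigendecomposition of a made-up positive definite B and confirms that its eigenvalues match the generalized eigenvalues of (A,B).

```python
import numpy as np
from scipy.linalg import eigh

# Made-up symmetric A and positive definite B.
rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
A = (M + M.T) / 2
N = rng.standard_normal((4, 4))
B = N @ N.T + 4 * np.eye(4)   # shifted to be safely positive definite

# Form B^{-1/2} from the eigendecomposition of B.
d, Q = np.linalg.eigh(B)
B_inv_half = Q @ np.diag(d ** -0.5) @ Q.T

# The symmetric reduction has the same eigenvalues as the pair (A, B).
reduced = B_inv_half @ A @ B_inv_half
print(np.allclose(np.linalg.eigvalsh(reduced),
                  eigh(A, B, eigvals_only=True)))
```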

There are several good reasons not to simply reduce a generalized eigenvalue problem to a standard one, and perturbation theory gives a particularly good reason. In order for us to change coordinates to turn the B matrix into an identity matrix, we must first know the B matrix. If I presented you with an elaborate mechanical system which you wanted to study, you would need to perform measurements to determine the A and B matrices. But all measurements are imperfect and the entries of A and B are inevitably corrupted by measurement errors. In the presence of these measurement errors, we must give up on computing the normal modes of vibration perfectly; we must content ourselves with computing the normal modes of vibration plus-or-minus some error term which we hope is small whenever our measurement errors are small. In this setting, reducing the problem to B^{-1/2}AB^{-1/2} seems less appealing, as I have to understand how the measurement errors in A and B are amplified in computing the triple product B^{-1/2}AB^{-1/2}. This also suggests that computing B^{-1/2}AB^{-1/2} may be a poor algorithmic strategy in practice: if the matrix B is ill-conditioned, there might be a great deal of error amplification in the computed product B^{-1/2}AB^{-1/2}. One might hope to devise algorithms with better numerical stability properties if we don’t reduce the matrix pair (A,B) to a single matrix. This is not to say that reducing a generalized eigenvalue problem to a standard one isn’t a useful tool—it definitely is. However, it is not something one should do reflexively. Sometimes, a generalized eigenvalue problem is best left as is and analyzed in its native form.

The rest of this post will focus on the following question: if A and B are real symmetric matrices (satisfying a definiteness condition, to be elaborated upon below), how do the eigenvalues of the pair (A+E,B+F) compare to those of (A,B), where E and F are small real symmetric perturbations? In fact, it shall be no additional toil to handle the complex Hermitian case as well while we’re at it, so we shall do so. (Recall that a Hermitian matrix A satisfies A^* = A, where (\cdot)^* is the conjugate transpose. Since the complex conjugate does not change a real number, a real Hermitian matrix is necessarily symmetric A^* = A^\top = A.) For the remainder of this post, let A, B, E, and F be Hermitian matrices of the same size. Let \widetilde{A} := A+E and \widetilde{B} := B + F denote the perturbed matrices.

Symmetric Treatment

As I mentioned at the top of this post, our mission will really be to find the right way of thinking about perturbation theory for the generalized eigenvalue problem, after which the theory will follow much more directly than if we were to take a frontal assault on the problem. As we go, we shall collect nuggets of insight, each of which I hope will follow quite naturally from the last. When we find such an insight, we shall display it on its own line.

The first insight is that we can think of the pair A and B interchangeably. If \lambda is a nonzero eigenvalue of the pair (A,B), satisfying Ax = \lambda\, Bx, then Bx = \lambda^{-1} \, Ax. That is, \lambda^{-1} is an eigenvalue of the pair (B,A). Lack of symmetry is an ugly feature in a mathematical theory, so we seek to smooth it out. After thinking a little bit, notice that we can phrase the generalized eigenvalue condition symmetrically as \beta \, Ax = \alpha \, Bx with the associated eigenvalue being given by \lambda = \alpha/\beta. This observation may seem trivial at first, but let us collect it for good measure.

Treat A and B symmetrically by writing the eigenvalue as \lambda = \alpha/\beta with \beta\, Ax = \alpha\, Bx.

Before proceeding, let’s ask a question that, in our new framing, becomes quite natural: what happens when \beta = 0? The case \beta = 0 is problematic because it leads to a division by zero in the expression \lambda = \alpha/\beta. However, if \alpha \ne 0, the situation still makes sense: we’ve found a vector x for which Ax \ne 0 but Bx = 0. It makes sense to consider x still an eigenvector of (A,B) with eigenvalue \alpha / 0 = \infty! Dividing by zero should justifiably make one squeamish, but it really is quite natural in this case to treat x as a genuine eigenvector with eigenvalue \infty.

Things get even worse if we find a vector x for which Ax = Bx = 0. Then, any (\alpha,\beta) can reasonably be considered an eigenvalue of (A,B) since \alpha \, Bx = 0 = \beta\, Ax. In such a case, all complex numbers are simultaneously eigenvalues of (A,B), in which case we call (A,B) singular.7More precisely, a pair (A,B) is singular if the determinant \det(tB - A) is identically zero for all t \in \mathbb{C}. For the generalized eigenvalue problem to make sense for a pair (A,B), it is natural to require that (A,B) not be singular. In fact, we shall assume an even stronger “definiteness” condition which ensures that (A,B) has only real (or infinite) eigenvalues. Let us return to this question of definiteness in a moment and for now assume that (A,B) is not singular and possesses real eigenvalues.

With this small aside taken care of, let us return to the main thread. By modeling eigenvalues as pairs (\alpha,\beta), we’ve traded one ugliness for another. While reformulating the eigenvalue as a pair (\alpha,\beta) treats A and B symmetrically, it also adds an additional indeterminacy, scale. For instance, if (\alpha,\beta) is an eigenvalue of (A,B), then so is (10\alpha,10\beta). Thus, it’s better to think of (\alpha,\beta) not as a single pair of numbers but as that pair together with all of its possible scalings.8Projective space provides a natural framework for studying such vectors up to scale indeterminacy. For reasons that shall hopefully become more clear as we go forward, it will be helpful to only consider all the possible positive scalings of (\alpha,\beta)—e.g., all (t\alpha,t\beta) for t > 0. Geometrically, the set of all positive scalings of a point in two-dimensional space is precisely just a ray emanating from the origin.

Represent eigenvalue pairs (\alpha,\beta) as rays emanating from the origin to account for scale ambiguity.

Now comes a standard eigenvalue trick. It’s something you would never think to do originally, but once you see it once or twice you learn to try it as a matter of habit. The trick: multiply the eigenvalue-eigenvector relation by the (conjugate) transpose of x:9For another example of the trick, try applying it to the standard eigenvalue problem Ax = \lambda x. Multiplying by x^* and rearranging gives \lambda = x^*Ax/x^*x—the eigenvalue \lambda is equal to the expression x^*Ax/x^*x, which is so important it is given a name: the Rayleigh quotient. In fact, the largest and smallest eigenvalues of A can be found by maximizing and minimizing the Rayleigh quotient.

    \begin{equation*} \beta \, Ax = \alpha \, Bx \implies \beta \, x^*Ax = \alpha \, x^*Bx \implies \frac{\alpha}{\beta} = \frac{x^*Ax}{x^*Bx}. \end{equation*}

The above equation is highly suggestive: since \alpha and \beta are only determined up to a scaling factor, it shows we can take \alpha = x^*Ax and \beta = x^*Bx. And by different scalings of the eigenvector x, we can scale x^*Ax = \alpha and x^*Bx = \beta by any positive factor we want. (This retroactively shows why it makes sense to only consider positive scalings of \alpha and \beta.10To make this point more carefully, we shall make great use of the identification between pairs (\alpha,\beta) and the pair of quadratic forms (x^*Ax,x^*Bx). Thus, even though (\alpha,\beta) and (-\alpha,-\beta) lead to equivalent eigenvalues since \alpha/\beta = (-\alpha)/(-\beta), (\alpha,\beta) and (-\alpha,-\beta) don’t necessarily both arise from a pair of quadratic forms: if (\alpha,\beta) = (x^* Ax,x^* Bx), this does not mean there exists y such that (y^* Ay,y^* By) = (-\alpha,-\beta). Therefore, we only consider (\alpha,\beta) equivalent to (t\alpha,t\beta) if t > 0.) The expression x^*Ax is so important that we give it a name: the quadratic form (associated with A and evaluated at x).

The eigenvalue pair (\alpha,\beta) can be taken equal to the pair of quadratic forms (x^*Ax,x^*Bx).
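Here is a tiny numerical illustration of this identification on a made-up pair (with B positive definite so SciPy applies directly): for each unit-norm eigenvector x, the pair of quadratic forms (x^*Ax, x^*Bx) recovers the eigenvalue as its ratio.

```python
import numpy as np
from scipy.linalg import eigh

# Made-up symmetric pair with B positive definite.
A = np.array([[1.0, 2.0],
              [2.0, -3.0]])
B = np.array([[3.0, 1.0],
              [1.0, 2.0]])

lam, X = eigh(A, B)
for j in range(2):
    x = X[:, j] / np.linalg.norm(X[:, j])    # unit-norm eigenvector
    alpha, beta = x @ A @ x, x @ B @ x       # the pair of quadratic forms
    print(lam[j], alpha / beta)              # the two numbers agree
```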

Complexifying Things

Now comes another standard mathematical trick: represent points in two-dimensional space by complex numbers. In particular, we identify the pair (\alpha,\beta) with the complex number \alpha + \mathrm{i}\beta.11Recall that we are assuming that \alpha / \beta is real, so we can pick a scaling in which both \alpha and \beta are real numbers. Assume we have done this. Similar to the previous trick, it’s not fully clear why this will pay off, but let’s note it as an insight.

Identify the pair (\alpha,\beta) with the complex number \alpha + \mathrm{i}\beta.

Now, we combine all the previous observations. The eigenvalue \lambda = x^*Ax / x^*Bx is best thought of as a pair (\alpha,\beta) which, up to scale, can be taken to be \alpha = x^*Ax and \beta = x^*Bx. But then we represent (\alpha,\beta) as the complex number

    \begin{equation*} \alpha + \mathrm{i} \beta = x^*Ax + \mathrm{i} x^*Bx = x^*(A + \mathrm{i} B) x. \end{equation*}

Let’s stop for a moment and appreciate how far we’ve come. The generalized eigenvalue problem Ax = \lambda\, Bx is associated with the expression x^*(A+\mathrm{i} B)x. If we just went straight from one to the other, this reduction would appear like some crazy stroke of inspiration: why would I ever think to write down x^*(A+\mathrm{i} B)x? However, by just following our nose, led by a desire to treat A and B symmetrically and applying a couple of standard tricks, this expression appears naturally. The expression x^*(A+\mathrm{i} B)x will be very useful to us because it is linear in A and B: for the perturbed problem (\widetilde{A},\widetilde{B}) = (A+E,B+F), we have x^*(\widetilde{A}+\mathrm{i} \widetilde{B})x = x^*(A+\mathrm{i} B)x + x^*(E+\mathrm{i} F)x, so x^*(\widetilde{A}+\mathrm{i} \widetilde{B})x is a small perturbation of x^*(A+\mathrm{i} B)x. This observation will be central to the perturbation theory to come.

If x is the eigenvector, then the complex number \alpha + \mathrm{i} \beta is x^*(A+\mathrm{i} B)x.

Definiteness and the Crawford Number

With these insights in hand, we can now return to the point we left earlier about what it means for a generalized eigenvalue problem to be “definite”. We know that if there exists a vector x for which Ax = Bx = 0, then the problem is singular. If we multiply by x^*, we see that this means that x^*Ax = x^*Bx = 0 as well and thus x^*(A+\mathrm{i}B)x = 0. It is thus quite natural to assume the following definiteness condition:

The pair (A,B) is said to be definite if x^*(A+\mathrm{i}B)x \ne 0 for all complex nonzero vectors x.

A definite problem is guaranteed to be not singular, but the reverse is not necessarily true; one can easily find pairs (A,B) which are not definite and also not singular.12For example, consider A = B = \operatorname{diag}(1,-1). (A,B) is not definite since x^*Ax = x^*Bx = 0 for x = (1,1). However, (A,B) is not singular; the only eigenvalue of the pair (A,B) is 1 and \det(tB-A) = -(t-1)^2 is not identically zero. (Note x^*Ax = x^*Bx = 0 does not imply Ax = Bx = 0 unless A and B are both positive (or negative) semidefinite.)

The “natural” symmetric condition for (A,B) to be “definite” is for x^*(A+\mathrm{i} B)x \ne 0 for all vectors x.

Since the expression x^*(A+\mathrm{i}B)x is just scaled by a positive factor by scaling the vector x, it is sufficient to check the definiteness condition x^*(A+\mathrm{i}B)x \ne 0 for only complex unit vectors x. This leads naturally to a quantitative characterization of the degree of definiteness of a pair (A,B):

The Crawford number13The name Crawford number was coined by G. W. Stewart in 1979 in recognition of Crawford’s pioneering work on the perturbation theory of the definite generalized eigenvalue problem. c(A,B) of a pair (A,B) is the minimum value of |x^*(A+\mathrm{i}B)x| = \sqrt{(x^*Ax)^2 + (x^*Bx)^2} over all complex unit vectors x.

The Crawford number naturally quantifies the degree of definiteness.14In fact, it has been shown that the Crawford number is, in a sense, the distance from a definite matrix pair (A,B) to a pair which is not simultaneously diagonalizable by congruence. A problem which has a large Crawford number (relative to a perturbation) will remain definite after perturbation, whereas the pair may become indefinite if the size of the perturbation exceeds the Crawford number. Geometrically, the Crawford number has the following interpretation: x^*(A+\mathrm{i}B)x must lie on or outside the circle of radius c(A,B) centered at 0 for all (complex) unit vectors x.

The “degree of definiteness” can be quantified by the Crawford number c(A,B) := \min_{\|x\|=1} |x^*(A+\mathrm{i}B)x|.
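If you want to compute the Crawford number numerically, one route (which I’ll state without proof) uses the standard fact that, for a definite pair, c(A,B) equals the maximum over angles \theta of the smallest eigenvalue of (\cos\theta) A + (\sin\theta) B. The sketch below applies this to a made-up diagonal pair and cross-checks against the definition by sampling random complex unit vectors.

```python
import numpy as np

def crawford_number(A, B, grid=2001):
    # Max over angles of the smallest eigenvalue of cos(t) A + sin(t) B;
    # for a definite pair this equals min over unit x of |x*(A + iB)x|.
    ts = np.linspace(0, 2 * np.pi, grid)
    return max(np.linalg.eigvalsh(np.cos(t) * A + np.sin(t) * B)[0] for t in ts)

# A made-up definite pair: A is indefinite, B is positive definite.
A = np.diag([3.0, -1.0])
B = np.diag([1.0, 2.0])
print("Crawford number:", crawford_number(A, B))

# Cross-check against the definition by random sampling; the sampled minimum
# should not dip below the value computed above.
rng = np.random.default_rng(0)
xs = rng.standard_normal((50000, 2)) + 1j * rng.standard_normal((50000, 2))
xs /= np.linalg.norm(xs, axis=1, keepdims=True)
forms = np.einsum("ij,jk,ik->i", xs.conj(), A + 1j * B, xs)
print("sampled min of |x*(A+iB)x|:", np.abs(forms).min())
```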

Now comes another step in our journey which is a bit more challenging. For a matrix C (in our case C = A+\mathrm{i}B), the set of complex numbers x^*Cx for all unit vectors x has been the subject of considerable study. In fact, this set has a name:

The field of values of a matrix C is the set W(C) := \{ x^*Cx : x\in\mathbb{C}^n, \: \|x\| = 1\}.

In particular, the Crawford number is just the absolute value of the point in the field of values W(A+\mathrm{i}B) closest to zero.

It is a very cool and highly nontrivial fact (called the Toeplitz–Hausdorff Theorem) that the field of values is always a convex set: the line segment connecting any two points in the field of values lies entirely within the field of values. Thus, as a consequence, the field of values W(A+\mathrm{i}B) for a definite matrix pair (A,B) is always on “one side” of the complex plane (in the sense that there exists a line through zero which W(A+\mathrm{i}B) lies strictly on one side of15This is a consequence of the hyperplane separation theorem together with the fact that 0\notin W(A+\mathrm{i}B).).

The numbers x^*(A+iB)x for unit vectors x lie on one half of the complex plane.

The field of values W(A+\mathrm{i}B) lies outside the circle of radius c(A,B) centered at 0 and thus on one side of the complex plane.

From Eigenvalues to Eigenangles

All of this geometry is lovely, but we need some way of relating it to the eigenvalues. As we observed a while ago, each eigenvalue is best thought of as a ray emanating from the origin, owing to the fact that the pair (\alpha,\beta) can be scaled by an arbitrary positive factor. A ray is naturally associated with an angle, so it is natural to characterize an eigenvalue pair (\alpha,\beta) by the angle describing its arc.

But the angle of a ray is only defined up to additions of full rotations (2\pi radians). As such, to associate to each ray a unique angle we need to pin down this indeterminacy in some fixed way. Moreover, this indeterminacy should play nice with the field of values W(A+\mathrm{i}B) and the field of values W(\widetilde{A}+\mathrm{i}\widetilde{B}) of the perturbation. But in the last section, we saw that each of these fields of values lies (strictly) on one half of the complex plane. Thus, we can find a ray R which does not intersect either field of values!

One possible choice is to measure the angle from this ray. We shall make a slightly different choice which plays better when we treat (\alpha,\beta) as a complex number \alpha + \mathrm{i}\beta. Recall that a number \theta is an argument for \alpha + \mathrm{i}\beta if \alpha + \mathrm{i}\beta = r\mathrm{e}^{i\theta} for some real number r \ge 0. The argument is multi-valued since \theta + 2\pi n is an argument for \alpha+\mathrm{i}\beta as long as \theta is (for all integers n). However, once we exclude our ray R, we can assign each complex number \alpha+\mathrm{i}\beta not on this ray a unique argument which depends continuously on (\alpha,\beta). Denote this “branch” of the argument by \operatorname{arg}. If (\alpha,\beta) represents an eigenvalue \lambda = \alpha/\beta, we call \theta = \arg(\alpha+\mathrm{i}\beta) an eigenangle.

Represent an eigenvalue pair (\alpha,\beta) by its associated eigenangle \theta = \arg(\alpha+i\beta).

How are these eigenangles related to the eigenvalues? It’s now a trigonometry problem:

    \begin{equation*} \lambda = \frac{\alpha}{\beta} = \frac{\mbox{adjacent}}{\mbox{opposite}} = \cot \left( \operatorname{arg}(\alpha+\mathrm{i}\beta) \right). \end{equation*}

The eigenvalues are the cotangents of the eigenangles!

The eigenvalue \lambda = \alpha/\beta is the cotangent of the eigenangle \theta = \arg(\alpha+\mathrm{i}\beta).
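A quick numerical check of this relationship, using the same made-up pair as in the earlier sketch, with B positive definite so that every \alpha + \mathrm{i}\beta lies in the upper half of the complex plane and the principal argument can serve as the eigenangle:

```python
import numpy as np
from scipy.linalg import eigh

# Made-up pair; B positive definite keeps alpha + i*beta in the upper half-plane.
A = np.array([[1.0, 2.0],
              [2.0, -3.0]])
B = np.array([[3.0, 1.0],
              [1.0, 2.0]])

lam, X = eigh(A, B)
for j in range(2):
    x = X[:, j]
    alpha, beta = x @ A @ x, x @ B @ x
    theta = np.angle(alpha + 1j * beta)    # the eigenangle
    print(lam[j], 1 / np.tan(theta))       # eigenvalue vs. cot(eigenangle)
```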

Variational Characterization

Now comes another difficulty spike in our line of reasoning, perhaps the largest in our whole deduction. To properly motivate things, let us first review some facts about the standard Hermitian/symmetric eigenvalue problem. The big idea is that eigenvalues can be thought of as the solution to a certain optimization problem. The largest eigenvalue of a Hermitian/symmetric matrix A is given by the maximization problem

    \begin{equation*} \lambda_{\rm max}(A) = \max_{\|x\| = 1} x^*Ax. \end{equation*}

The largest eigenvalue is the maximum of the quadratic form over unit vectors x. What about the other eigenvalues? The answer is not obvious, but the famous Courant–Fischer Theorem shows that the jth largest eigenvalue \lambda_j(A) can be written as the following minimax optimization problem

    \begin{equation*} \lambda_j(A) = \min_{\dim \mathcal{X} = n-j+1} \max_{\substack{x \in \mathcal{X} \\ \|x\| = 1}} x^*Ax. \end{equation*}

The minimum is taken over all subspaces \mathcal{X} of dimension n-j+1 whereas the maximum is taken over all unit vectors x within the subspace \mathcal{X}. Symmetrically, one can also formulate the eigenvalues as a max-min optimization problem

    \begin{equation*} \lambda_j(A) = \max_{\dim \mathcal{X} = j} \min_{\substack{x \in \mathcal{X} \\ \|x\| = 1}} x^*Ax. \end{equation*}
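As a partial numerical check (it only verifies that the optimal subspace in the max-min formula attains the stated value, not the full optimality claim), recall that the maximizing j-dimensional subspace is spanned by the eigenvectors of the j largest eigenvalues, and that the minimum of the quadratic form over a subspace with orthonormal basis Q is the smallest eigenvalue of Q^\top A Q. The matrix below is made up for illustration.

```python
import numpy as np

# A made-up symmetric matrix.
rng = np.random.default_rng(5)
n = 5
M = rng.standard_normal((n, n))
A = (M + M.T) / 2

# Eigenvalues in decreasing order, eigenvectors as columns.
lam, V = np.linalg.eigh(A)
lam, V = lam[::-1], V[:, ::-1]

for j in range(1, n + 1):
    Q = V[:, :j]                                  # orthonormal basis of the span
    min_over_subspace = np.linalg.eigvalsh(Q.T @ A @ Q)[0]
    print(j, lam[j - 1], min_over_subspace)       # the last two columns agree
```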

These variational/minimax characterizations of the eigenvalues of a Hermitian/symmetric matrix are essential to perturbation theory for Hermitian/symmetric eigenvalue problems, so it is only natural to go looking for a variational characterization of the generalized eigenvalue problem. There is one natural way of doing this that works for B positive definite: specifically, one can show that

    \begin{equation*} \lambda_j(A,B) = \min_{\dim \mathcal{X} = n-j+1} \max_{\substack{x \in \mathcal{X} \\ \|x\| = 1}} \frac{x^*Ax}{x^*Bx}. \end{equation*}

This characterization, while useful in its own right, is tricky to deal with because it is nonlinear in A and B. It also treats A and B non-symmetrically, which should set off our alarm bells that there might be a better way. Indeed, the ingenious idea, due to G. W. Stewart in 1979, is to instead provide a variational characterization of the eigenangles! Specifically, Stewart was able to show16Stewart’s original definition of the eigenangles differs from ours; we adopt the definition of Mathias and Li. The result amounts to the same thing.

(1)   \begin{align*} \theta_j &= \min_{\dim \mathcal{X} = n-j+1} \max_{\substack{x\in\mathcal{X} \\ \|x\|=1}} \arg(x^*(A+\mathrm{i}B)x), \\ \theta_j &= \max_{\dim \mathcal{X} = j} \min_{\substack{x\in\mathcal{X} \\ \|x\|=1}} \arg(x^*(A+\mathrm{i}B)x), \end{align*}

for the eigenangles \theta_1\ge \theta_2\ge\cdots\ge \theta_n.17Note that since the cotangent is decreasing on [0,\pi], this means that the eigenvalues \lambda_1 = \cot \theta_1 \le \lambda_2 = \cot \theta_2 \le \cdots are now in increasing order, in contrast to our convention from earlier in this section. This shows, in particular, that the field of values is subtended by the smallest and largest eigenangles.

The eigenangles satisfy a minimax variational characterization.

How Big is the Perturbation?

We’re tantalizingly close to our objective. The final piece in our jigsaw puzzle before we’re able to start proving perturbation theorems is to quantify the size of the perturbing matrices E and F. Based on what we’ve done so far, we see that the eigenvalues are naturally associated with the complex number x^*(A+\mathrm{i}B)x, so it is natural to characterize the size of the perturbing pair (E,F) by the distance between x^*(A+\mathrm{i}B)x and x^*(\widetilde{A}+\mathrm{i}\widetilde{B})x. But the difference between these two quantities is just

    \begin{equation*} x^*(\widetilde{A}+\mathrm{i}\widetilde{B})x - x^*(A+\mathrm{i}B)x = x^*(E+\mathrm{i}F)x. \end{equation*}

We’re naturally led to the question: how big can x^*(E+\mathrm{i}F)x be? If the vector x has a large norm, then quite large, so let’s fix x to be a unit vector. With this assumption in place, the maximum size of x^*(E+\mathrm{i}F)x is simply the distance of the farthest point in the field of values W(E+\mathrm{i}F) from zero. This quantity has a name:

The numerical radius of a matrix G (in our case G = E+\mathrm{i}F) is r(G) := \max_{\|x\|=1} |x^*Gx|.18This maximum is taken over all complex unit vectors x.

The size of the perturbation (E,F) is the numerical radius r(E+\mathrm{i}F) = \max_{\|x\|=1} |x^*(E+\mathrm{i}F)x|.

It is easy to upper-bound the numerical radius r(E+\mathrm{i}F) by more familiar quantities. For instance, one can straightforwardly show the bound r(E+\mathrm{i}F) \le \sqrt{\|E\|^2+\|F\|^2}, where \|\cdot\| is the spectral norm. We prefer to state results using the numerical radius because of its elegance: it is, in some sense, the “correct” measure of the size of the pair (E,F) in the context of this theory.
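For Hermitian E and F, the numerical radius r(E+\mathrm{i}F) can be computed from the identity r(E+\mathrm{i}F) = \max_\theta \| (\cos \theta) E + (\sin\theta) F \| (this identity reappears in a footnote later in this post), which is easy to approximate on a grid of angles. A sketch with made-up perturbations, also checking the spectral-norm bound just mentioned:

```python
import numpy as np

def numerical_radius(E, F, grid=2001):
    # r(E + iF) = max over angles t of the spectral norm of cos(t) E + sin(t) F,
    # valid for Hermitian E and F.
    ts = np.linspace(0, 2 * np.pi, grid)
    return max(np.linalg.norm(np.cos(t) * E + np.sin(t) * F, 2) for t in ts)

# Made-up symmetric perturbations E and F.
rng = np.random.default_rng(1)
E = rng.standard_normal((3, 3)); E = (E + E.T) / 2
F = rng.standard_normal((3, 3)); F = (F + F.T) / 2

r = numerical_radius(E, F)
bound = np.hypot(np.linalg.norm(E, 2), np.linalg.norm(F, 2))
print("r(E + iF) =", r, "<= sqrt(||E||^2 + ||F||^2) =", bound)
```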

Stewart’s Perturbation Theory

Now, after many words of prelude, we finally get to our first perturbation theorem. With the work we’ve done in place, the result is practically effortless.

Let \widetilde{\theta}_1\ge \widetilde{\theta}_2\ge \cdots\ge\widetilde{\theta}_n denote the eigenangles of the perturbed pair (\widetilde{A},\widetilde{B}) and consider the jth eigenangle. Let \mathcal{X}^* be the subspace of dimension n-j+1 achieving the minimum in the first equation of the variational principle (1) for the original unperturbed pair (A,B). Then we have

(2)   \begin{equation*} \widetilde{\theta}_j = \min_{\dim \mathcal{X} = n-j+1} \max_{\substack{x\in\mathcal{X} \\ \|x\|=1}} \arg(x^*(\widetilde{A}+\mathrm{i}\widetilde{B})x) \le \max_{\substack{x\in\mathcal{X}^* \\ \|x\|=1}} \arg(x^*(A+\mathrm{i}B)x + x^*(E+\mathrm{i}F)x). \end{equation*}

This is something of a standard trick when dealing with variational problems in matrix analysis: take the solution (in this case the minimizing subspace) for the original problem and plug it in for the perturbed problem. The solution may no longer be optimal, but it at least gives an upper (or lower) bound. The complex number x^*(A+\mathrm{i}B)x must lie at least a distance c(A,B) from zero and |x^*(E+\mathrm{i}F)x| \le r(E+\mathrm{i}F). We’re truly toast if the perturbation is large enough to push x^*(\widetilde{A}+\mathrm{i}\widetilde{B})x all the way to zero, so we should assume that r(E+\mathrm{i}F) < c(A,B).

For our perturbation theory to work, we must assume r(E+\mathrm{i}F) < c(A,B).

x^*(A+\mathrm{i}B)x lies on or outside the circle centered at zero with radius c(A,B). x^*(\widetilde{A}+\mathrm{i}\widetilde{B})x might lie anywhere in a circle centered at x^*(A+\mathrm{i}B)x with radius r(E+\mathrm{i}F), so one must have r(E+\mathrm{i}F) < c(A,B) to ensure the perturbed problem remains definite (equivalently, x^*(\widetilde{A}+\mathrm{i}\widetilde{B})x\ne 0 for every x).

Making the assumption that r(E+\mathrm{i}F) < c(A,B), bounding the right-hand side of (2) requires finding the most-counterclockwise angle necessary to subtend a circle of radius r(E+\mathrm{i}F) centered at x^*(A+\mathrm{i}B)x, which must lie at least a distance c(A,B) from the origin. The worst-case scenario is when x^*(A+\mathrm{i}B)x is exactly a distance c(A,B) from the origin, as is shown in the following diagram.

In the worst case, x^*(A+\mathrm{i}B)x lies on the circle centered at zero with radius c(A,B), which is subtended above by angle \theta_j + \sin^{-1}(r(E+iF)/c(A,B)).

Solving the geometry problem for the counterclockwise-most subtending angle in this worst-case situation, we conclude the eigenangle bound \widetilde{\theta}_j -\theta_j \le \sin^{-1}(r(E+\mathrm{i}F)/c(A,B)). An entirely analogous argument using the max-min variational principle (1) proves an identical lower bound, thus showing

(3)   \begin{equation*} \sin |\theta_j - \widetilde{\theta}_j| \le \frac{r(E+\mathrm{i}F)}{c(A,B)}. \end{equation*}

In the language of eigenvalues, we have19I’m being a little sloppy here. For a result like this to truly hold, I believe all of the perturbed and unperturbed eigenangles should all be contained in one half of the complex plane.

    \begin{equation*} |\cot^{-1}(\widetilde{\lambda}_j) - \cot^{-1}(\lambda_j)| \le \sin^{-1}\left( \frac{r(E+\mathrm{i}F)}{c(A,B)} \right). \end{equation*}
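To see the bound (3) in action, here is a numerical sketch on a made-up definite pair with a small random symmetric perturbation. B is taken positive definite so the eigenangles can be read off from the principal argument, and the helper functions repeat the grid-based computations of the Crawford number and numerical radius sketched earlier.

```python
import numpy as np
from scipy.linalg import eigh

def eigenangles(A, B):
    # Assumes B positive definite, so every alpha + i*beta lies in the upper
    # half-plane and the principal argument serves as the eigenangle.
    lam = eigh(A, B, eigvals_only=True)
    return np.sort(np.angle(lam + 1j))[::-1]      # theta_1 >= ... >= theta_n

def crawford_number(A, B, grid=2001):
    ts = np.linspace(0, 2 * np.pi, grid)
    return max(np.linalg.eigvalsh(np.cos(t) * A + np.sin(t) * B)[0] for t in ts)

def numerical_radius(E, F, grid=2001):
    ts = np.linspace(0, 2 * np.pi, grid)
    return max(np.linalg.norm(np.cos(t) * E + np.sin(t) * F, 2) for t in ts)

# Made-up definite pair and small symmetric perturbations.
rng = np.random.default_rng(2)
n = 4
M = rng.standard_normal((n, n)); A = (M + M.T) / 2
N = rng.standard_normal((n, n)); B = N @ N.T + n * np.eye(n)
E = 1e-3 * rng.standard_normal((n, n)); E = (E + E.T) / 2
F = 1e-3 * rng.standard_normal((n, n)); F = (F + F.T) / 2

theta = eigenangles(A, B)
theta_tilde = eigenangles(A + E, B + F)
lhs = np.sin(np.abs(theta - theta_tilde)).max()
rhs = numerical_radius(E, F) / crawford_number(A, B)
print("max_j sin|theta_j - theta~_j| =", lhs, "<=", rhs)
```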

Interpreting Stewart’s Theory

After much work, we have finally proven our first generalized eigenvalue perturbation theorem. After taking a moment to celebrate, let’s ask ourselves: what does this result tell us?

Let’s start with the good. This result shows us that if the perturbation, measured by the numerical radius r(E+\mathrm{i}F), is much smaller than the definiteness of the original problem, measured by the Crawford number c(A,B), then the eigenangles change by a small amount. What does this mean in terms of the eigenvalues? For small eigenvalues (say, less than one in magnitude), small changes in the eigenangles also lead to small changes of the eigenvalues. However, for eigenvalues large in magnitude, small changes in the eigenangle are magnified into potentially large changes in the eigenvalues. One can view this result in a positive or negative framing. On the one hand, large eigenvalues could be subject to dramatic changes by small perturbations; on the other hand, the small eigenvalues aren’t “taken down with the ship” and are much more well-behaved.

Stewart’s theory is beautiful. The variational characterization of the eigenangles (1) is a master stroke and exactly the extension one would want from the standard Hermitian/symmetric theory. From the variational characterization, the perturbation theorem follows almost effortlessly from a little trigonometry. However, Stewart’s theory has one important deficit: the Crawford number. All that Stewart’s theory tells us is that all of the eigenangles change by at most roughly “perturbation size over Crawford number”. If the Crawford number is quite small because the problem is nearly indefinite, this becomes a tough pill to swallow.

The Crawford number is in some ways essential: if the perturbation size exceeds the Crawford number, the problem can become indefinite or even singular. Thus, we have no hope of fully removing the Crawford number from our analysis. But might it be the case that some eigenangles change by much less than “perturbation size over Crawford number”? Could we possibly improve to a result of the form “the eigenangles change by roughly perturbation size over something (potentially) much larger than the Crawford number”? Sun improved Stewart’s analysis in 1982, but the scourge of the Crawford number remained.20Sun’s bound does not explicitly have the Crawford number, instead using the quantity \zeta := \max_{\|x\|=1} \sqrt{|x^*(E+\mathrm{i}F)x|/|x^*(A+iB)x|} and another hard-to-concisely describe quantity. In many cases, one has nothing better to do than to bound \zeta \le r(E+\mathrm{i}F)/c(A,B), in which case the Crawford number has appeared again. The theory of Mathias and Li, published in a technical report in 2004, finally produced a bound in which the Crawford number is replaced by a potentially much larger quantity.

The Mathias–Li Insight and Reduction to Diagonal Form

Let’s go back to the Stewart theory and look for a possible improvement. Recall in the Stewart theory that we considered the point x^*(A+\mathrm{i}B)x on the complex plane. We then argued that, in the worst case, this point would lie a distance c(A,B) from the origin and then drew a circle around it with radius r(E+\mathrm{i}F). To improve on Stewart’s bound, we must somehow do something better than using the fact that |x^*(A+\mathrm{i}B)x|\ge c(A,B). The insight of the Mathias–Li theory is, in some sense, as simple as this: rather than using the fact that |x^*(A+\mathrm{i}B)x| \ge c(A,B) (as in Stewart’s analysis), use how far x^*(A+\mathrm{i}B)x actually is from zero, where x ranges over the unit-norm eigenvectors of (A,B).21This insight is made more nontrivial by the fact that, in the context of generalized eigenvalue problems, it is often not convenient to choose the eigenvectors to have unit norm. As Mathias and Li note, two other normalizations for x are more popular. If B is positive definite, one often normalizes x such that \beta = x^*Bx = 1—the eigenvectors are thus made “B-orthonormal”, generalizing the fact that the eigenvectors of a Hermitian/symmetric matrix are orthonormal. Another popular normalization is to scale x such that |\alpha+i\beta| = |x^*(A+\mathrm{i}B)x| = 1. In this way, just taking the eigenvector x to have unit norm is already a nontrivial insight.

Before going further, let us quickly make a small reduction which will simplify our lives greatly. Letting X denote a matrix whose columns are the unit-norm eigenvectors of (A,B), one can verify that X^*AX and X^*BX are diagonal matrices with entries \alpha_1,\ldots,\alpha_n and \beta_1,\ldots,\beta_n respectively. With this in mind, it can make our lives a lot easier to just do a change of variables A \mapsto X^*AX and B\mapsto X^*BX (which in turn sends E\mapsto X^*EX and F \mapsto X^*FX). The change of variables A \mapsto X^*AX is very common in linear algebra and is called a congruence transformation.

Perform a change of variables by a congruence transformation with the matrix of eigenvectors.

While this change of variables makes our lives a lot easier, we must first worry about how this change of variables might affect the size of the perturbation matrices (E,F). It turns out this change of variables is not totally benign, but it is not maximally harmful either. Specifically, the numerical radius r(E+\mathrm{i}F) can grow by as much as a factor of n.22This is because, by virtue of having unit-norm columns, the spectral norm of the X matrix is \|X\| \le \|X\|_{\rm F} \le \sqrt{n}. Further, note the following variational characterization of the numerical radius r(E+\mathrm{i}F) = \max_\theta \| (\cos \theta) E + (\sin\theta) F \|. Plugging these two facts together yields r(X^*EX+\mathrm{i}\, X^*FX) \le \|X\|^2r(E+\mathrm{i}F) \le nr(E+\mathrm{i}F). This factor of n isn’t great, but it is much better than if the bound were to degrade by a factor of the condition number \|X\|\|X^{-1}\|, which can be arbitrarily large.

This change of variables may increase r(E+\mathrm{i}F) by at most a factor of n.

From now on, we shall tacitly assume that this change of variables has taken place, with A and B being diagonal and E and F being such that r(E+\mathrm{i}F) is at most a factor n larger than it was previously. We denote by \alpha_j and \beta_j the jth diagonal elements of A and B, which are given by \alpha_j = x_j^*Ax_j and \beta_j = x_j^*Bx_j where x_j is the jth unit-norm eigenvector.
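A short numerical sketch of this change of variables on a made-up pair: with X the matrix of unit-norm eigenvectors, the congruence transformation diagonalizes A and B simultaneously, and the diagonal entries are exactly the quadratic forms \alpha_j and \beta_j.

```python
import numpy as np
from scipy.linalg import eigh

# Made-up definite pair with B positive definite.
rng = np.random.default_rng(3)
n = 3
M = rng.standard_normal((n, n)); A = (M + M.T) / 2
N = rng.standard_normal((n, n)); B = N @ N.T + n * np.eye(n)

# Columns of X are the eigenvectors, rescaled to unit norm.
_, X = eigh(A, B)
X /= np.linalg.norm(X, axis=0)

# The congruence transformation sends (A, B) to a pair of diagonal matrices.
A_diag, B_diag = X.T @ A @ X, X.T @ B @ X
print(np.allclose(A_diag, np.diag(np.diag(A_diag))))
print(np.allclose(B_diag, np.diag(np.diag(B_diag))))
print("alpha_j:", np.diag(A_diag))
print("beta_j :", np.diag(B_diag))
```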

Mathias and Li’s Perturbation Theory

We first assume the perturbation (E,F) is smaller than the Crawford number in the sense r(E+\mathrm{i}F) < c(A,B), which is required to ensure that the perturbed problem (\widetilde{A},\widetilde{B}) does not lose definiteness. This will be the only place in this analysis where we use the Crawford number.

Draw a circle of radius r(E+\mathrm{i}F) around \alpha_j + \mathrm{i}\beta_j.

If \theta_j is the associated eigenangle, then this circle is subtended by arcs with angles

    \begin{equation*} \ell_j = \theta_j - \sin^{-1}\left(\frac{r(E+\mathrm{i}F)}{|\alpha_j+\mathrm{i}\beta_j|}\right), \quad u_j = \theta_j + \sin^{-1}\left(\frac{r(E+\mathrm{i}F)}{|\alpha_j+\mathrm{i}\beta_j|}\right). \end{equation*}

It would be nice if the perturbed eigenangles \widetilde{\theta}_j were guaranteed to lie in these arcs (i.e., \ell_j \le \widetilde{\theta}_j \le u_j). Unfortunately this is not necessarily the case. If one \alpha_j + \mathrm{i}\beta_j is close to the origin, it will have a large arc which may intersect with other arcs; if this happens, we can’t guarantee that each perturbed eigenangle will remain within its individual arc. We can still say something though.

What follows is somewhat technical, so let’s start with the takeaway conclusion: for any choice of j of the lower bounds \ell_1,\ldots,\ell_n, the perturbed eigenangle \widetilde{\theta}_j is at least as large as the smallest of the chosen bounds. In particular, this means that \widetilde{\theta}_j is at least as large as the jth largest of all the lower bounds. That is, if we rearrange the lower bounds \ell_1,\ldots,\ell_n in decreasing order \ell_1^\downarrow \ge \ell_2^\downarrow \ge \cdots \ge \ell_n^\downarrow, we have \widetilde{\theta}_j \ge \ell_j^\downarrow. An entirely analogous argument gives an upper bound, yielding

(4)   \begin{equation*} \ell_j^\downarrow \le \widetilde{\theta}_j \le u_j^\downarrow. \end{equation*}
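Before getting to the derivation, here is a numerical sanity check of (4) on a made-up example, reusing the grid-based numerical radius from earlier. The pair is first congruence-transformed to diagonal form by its unit-norm eigenvectors, as described above, and B is taken positive definite so the eigenangles can be read off from the principal argument.

```python
import numpy as np
from scipy.linalg import eigh

def numerical_radius(E, F, grid=2001):
    ts = np.linspace(0, 2 * np.pi, grid)
    return max(np.linalg.norm(np.cos(t) * E + np.sin(t) * F, 2) for t in ts)

# Made-up definite pair, congruence-transformed to diagonal form.
rng = np.random.default_rng(4)
n = 4
M = rng.standard_normal((n, n)); A = (M + M.T) / 2
N = rng.standard_normal((n, n)); B = N @ N.T + n * np.eye(n)
_, X = eigh(A, B)
X /= np.linalg.norm(X, axis=0)
A, B = X.T @ A @ X, X.T @ B @ X               # now both diagonal

# Small made-up symmetric perturbations (small enough that definiteness holds).
E = 1e-3 * rng.standard_normal((n, n)); E = (E + E.T) / 2
F = 1e-3 * rng.standard_normal((n, n)); F = (F + F.T) / 2
r = numerical_radius(E, F)

# Arcs around each alpha_j + i*beta_j subtended by the perturbation.
alpha, beta = np.diag(A), np.diag(B)
theta = np.angle(alpha + 1j * beta)
half_width = np.arcsin(r / np.abs(alpha + 1j * beta))
lower = np.sort(theta - half_width)[::-1]     # jth largest lower bound
upper = np.sort(theta + half_width)[::-1]     # jth largest upper bound

# Perturbed eigenangles, sorted in decreasing order.
lam_tilde = eigh(A + E, B + F, eigvals_only=True)
theta_tilde = np.sort(np.angle(lam_tilde + 1j))[::-1]

print(np.all(lower <= theta_tilde), np.all(theta_tilde <= upper))
```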

For those interested in the derivation, read on in the following optional section:

Derivation of the Mathias–Li Bounds
Since A and B are diagonal, the eigenvectors of the pair (A,B) are just the standard basis vectors, the jth of which we will denote e_j. The trick will be to use the max-min characterization (1) with the subspace \mathcal{X} spanned by some collection of j basis vectors e_{i_1},\ldots,e_{i_j}. Churning through a couple inequalities in quick fashion,23See pg. 17 of the Mathias and Li report. we obtain

    \begin{align*} \widetilde{\theta}_j &\ge \min_{\substack{x \in \mathcal{X} \\ \|x\| = 1}} \arg \left( x^*(A+\mathrm{i}B)x + x^*(E+\mathrm{i}F)x \right) \\ &\ge \min \left\{ \arg(y+z) : y \in \operatorname{conv} \{ \alpha_{i_1}+\mathrm{i}\beta_{i_1},\ldots,\alpha_{i_j}+\mathrm{i}\beta_{i_j} \}, z\in W(E+\mathrm{i}F) \right\} \\ &\ge \min \left\{ \arg(y+z) : y \in \operatorname{conv} \{ \alpha_{i_1}+\mathrm{i}\beta_{i_1},\ldots,\alpha_{i_j}+\mathrm{i}\beta_{i_j} \}, |z|\le r(E+\mathrm{i}F) \right\} \\ &\ge \min \left\{ \arg(w) : w\in\operatorname{conv} \bigcup_{k=1}^j \{ a\in\mathbb{C} : |\alpha_{i_k} + \mathrm{i}\beta_{i_k} - a| \le r(E+\mathrm{i}F) \} \right\} \\ &= \min_{k=1,\ldots,j} \ell_{i_k}. \end{align*}

Here, \operatorname{conv} denotes the convex hull. Since this holds for every set of indices i_1,\ldots,i_j, it in particular holds for the set of indices which makes \min_{k=1,\ldots,j} \ell_{i_k} the largest. Thus, \widetilde{\theta}_j \ge \ell^\downarrow_j.

How to Use Mathias–Li’s Perturbation Theory

The eigenangle perturbation bound (4) can be instantiated in a variety of ways. We briefly sketch two. The first is to bound |\alpha_j + \mathrm{i}\beta_j| by its minimum over all j, which then gives a bound on u^\downarrow_j (and \ell^\downarrow_j)

    \begin{equation*} |\alpha_j + \mathrm{i} \beta_j| \ge \min_{1\le j\le n} |\alpha_j + \mathrm{i} \beta_j| \implies u_j^\downarrow \le \theta_j + \sin^{-1} \frac{r(E+\mathrm{i}F)}{\min_{1\le j\le n} |\alpha_j + \mathrm{i} \beta_j|}. \end{equation*}

Plugging into (4) and simplifying gives

(5)   \begin{equation*} \sin \left| \widetilde{\theta}_j - \theta_j \right| \le \frac{r(E+\mathrm{i}F)}{\min_{1\le j\le n} |\alpha_j + \mathrm{i} \beta_j|}. \end{equation*}

This improves on Stewart’s bound (3) by replacing the Crawford number c(A,B) by \min_{1\le j\le n} |\alpha_j + \mathrm{i} \beta_j|; as Mathias and Li show, \min_{1\le j\le n} |\alpha_j + \mathrm{i} \beta_j| is always larger than or equal to c(A,B) and can be much, much larger.24Recall that Mathias and Li’s bound first requires us to do a change of variables where A and B both become diagonal, which can increase r(E+\mathrm{i}F) by a factor of n. Thus, for an apples-to-apples comparison with Stewart’s theory where A and B are non-diagonal, (5) should be interpreted as \sin \left| \widetilde{\theta}_j - \theta_j \right| \le n\,r(E+\mathrm{i}F)/\min_{1\le j\le n} |\alpha_j + \mathrm{i} \beta_j|.

For the second instantiation of (4), we recognize that if an eigenangle \theta_j is sufficiently well-separated from other eigenangles (relative to the size of the perturbation and \min_{1\le j\le n} |\alpha_j + \mathrm{i} \beta_j|), then we have u_j^\downarrow \le u_j and \ell_j^\downarrow \ge \ell_j. (The precise meaning of “sufficiently well-separated” requires some tedious algebra; if you’re interested, see Footnote 7 in Mathias and Li’s paper.25You may also be interested in Corollary 2.2 in this preprint by myself and coauthors.) Under this separation condition, (4) then reduces to

(6)   \begin{equation*} \sin \left| \widetilde{\theta}_j - \theta_j \right| \le \frac{r(E+\mathrm{i}F)}{|\alpha_j + \mathrm{i} \beta_j|}. \end{equation*}

This result improves on Stewart’s result (3) by even more, since we have now replaced the Crawford number c(A,B) by |\alpha_j + \mathrm{i} \beta_j| for a sufficiently small perturbation. In fact, a result of this form is nearly as good as one could hope for.26Specifically, the condition number of the eigenangle \theta_j is |\alpha_j + \mathrm{i} \beta_j|^{-1}, so we know for sufficiently small perturbations we have \left| \widetilde{\theta}_j - \theta_j \right| \lessapprox (\mbox{size of perturbation}) \times |\alpha_j + \mathrm{i} \beta_j|^{-1} and |\alpha_j + \mathrm{i} \beta_j|^{-1} is the smallest number for which such a relation holds. Mathias and Li’s theory allows for a statement of this form to be made rigorous for a finite-size perturbation. Again, the only small deficit is the additional factor of “n” from the change of variables to diagonal form.

The Elegant Geometry of Generalized Eigenvalue Perturbation Theory

As I said at the start of this post, what fascinates me about this generalized eigenvalue perturbation theory is the beautiful and elegant geometry. When I saw it for the first time, it felt like a magic trick: a definite generalized eigenvalue problem with real eigenvalues was transformed by sleight of hand into a geometry problem on the complex plane, with solutions involving just a little high school geometry and trigonometry. After studying the theory, I began to appreciate it for a different reason: upon closer examination, the magic trick was revealed to be a sequence of deductions, each following naturally from the last. To the pioneers of this subject—Stewart, Sun, Mathias, Li, and others—this sequence of logical deductions was not preordained, and their discovery of this theory doubtlessly required careful thought and leaps of insight. Now that this theory has been discovered, however, we get the benefit of retrospection, and can retell a narrative of this theory where each step follows naturally from the last. When told this way, one almost imagines being able to develop this theory by oneself, where at each stage we appeal to some notion of mathematical elegance (e.g., by treating A and B symmetrically) or apply a standard trick (e.g., identifying a pair (\alpha,\beta) with the complex number \alpha + \mathrm{i}\beta). Since this theory took several decades to fall into place, we should not let this storytelling exercise fool us into thinking the prospective act of developing a new theory will be as straightforward and linear as this retelling, pruned of dead ends and halts in progress, might suggest.

That said, I do think the development of the perturbation theory of the generalized eigenvalue problem does have a lesson for those of us who seek to develop mathematical theories: be guided by mathematical elegance. At several points in the development of the perturbation theory, we obtained great gains by treating quantities which play a symmetric role in the problem symmetrically in the theory or by treating a pair of real numbers as a complex number and asking how to interpret that complex number. My hope is that this perturbation theory serves as a good example for how letting oneself be guided by intuition, a small array of standard tricks, and a search for elegance can lead one to conceptualize a problem in the right way which leads (after a considerable amount of effort and a few lucky breaks) to a natural solution.
