Low-Rank Approximation Toolbox: Generalized Nyström Approximation

May 20, 2026 by Ethan N. Epperly

Today, I want to talk about the generalized Nyström approximation, which I regard as one of the “big three” approaches to constructing a low-rank approximation to a matrix.¹ Understanding this approximation, including the conditions under which it works and the sharpest possible error bounds for it, is the subject of two recent papers of mine:

Faster Randomized Linear Algebra with Structured Random Matrices, joint with Chris Camaño, Raphael Meyer, and Joel Tropp.
Sharp analysis of sketched least squares and randomized low-rank approximation, joint with Robert Webber.

On the occasion of the release of the second paper this morning, I felt it was a good time to talk about the generalized Nyström approximation on this blog. In this post, I will motivate the generalized Nyström approximation and describe when it might be preferable to alternatives.

Existing Characters: Nyström Approximation and the Randomized SVD

To begin our story, let me remind you of a couple of characters we’ve met in previous installments of this blog: the randomized Nyström approximation and the randomized SVD.

Randomized Nyström approximation is a method for producing a low-rank approximation to a positive semidefinite² (psd) matrix $A$. To form this approximation, begin by drawing a random test matrix $\Omega$, say, with independent standard normal random entries. (We will have more to say about the choice of $\Omega$ below.) Using this test matrix, the Nyström approximation is defined as³

$\hat{A} = A\Omega (\Omega^\top A\Omega)^{-1} (A\Omega)^\top.$

To implement this algorithm in practice, one should take care to use numerically stable pseudocode; see this paper for details.
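As a minimal sketch of the formula above, here is a direct NumPy transcription. It is not the numerically stable implementation referred to above: it uses `np.linalg.pinv` in place of the inverse, and the test matrix `A` (with eigenvalues $1, 1/2, 1/4, \ldots$) and the helper name `nystrom` are assumptions made purely for illustration.

```python
import numpy as np

def nystrom(A, k, rng):
    """Rank-k Nystrom approximation of a psd matrix A (plain formula, not the stabilized variant)."""
    n = A.shape[0]
    Omega = rng.standard_normal((n, k))    # Gaussian test matrix
    Y = A @ Omega                          # the only product with A
    core = Omega.T @ Y                     # k-by-k matrix Omega^T A Omega
    return Y @ np.linalg.pinv(core) @ Y.T  # pinv stands in for the inverse

rng = np.random.default_rng(0)
n, k = 60, 10
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
eigs = 2.0 ** -np.arange(n)                # assumed spectrum: 1, 1/2, 1/4, ...
A = (U * eigs) @ U.T                       # synthetic psd matrix
A_hat = nystrom(A, k, rng)
rel_err = np.linalg.norm(A - A_hat) / np.linalg.norm(A)
print(f"relative Frobenius error: {rel_err:.2e}")
```

Note that the matrix `A` is touched only once, in the product `A @ Omega`.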

The randomized SVD is a method for constructing a low-rank approximation to a general, non-symmetric or even non-square matrix $B$ . Again, begin by constructing a random test matrix $\Omega$ . To construct a low-rank approximation, compute the product $B\Omega$ and orthonormalize its columns (e.g., by QR decomposition) to obtain

$Q \coloneqq \operatorname{orth}(B\Omega).$

Then, to construct a low-rank approximation, we employ a second product with the matrix $B$ , yielding the low-rank approximation

$\hat{B} = QC \quad \text{for } C = Q^\top B.$
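The two steps above can be sketched in a few lines of NumPy. The synthetic matrix `B` (singular values $1, 1/2, 1/4, \ldots$) and the helper name `randomized_svd_approx` are assumptions for illustration; the comparison with the best rank-$k$ error uses the Eckart-Young theorem.

```python
import numpy as np

def randomized_svd_approx(B, k, rng):
    """Return factors (Q, C) with B ≈ Q @ C, using two products with B."""
    Omega = rng.standard_normal((B.shape[1], k))
    Q, _ = np.linalg.qr(B @ Omega)   # first pass: orthonormal basis for range(B @ Omega)
    C = Q.T @ B                      # second pass
    return Q, C

rng = np.random.default_rng(1)
m, n, k = 80, 50, 12
U, _ = np.linalg.qr(rng.standard_normal((m, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
sigma = 2.0 ** -np.arange(n)             # assumed singular values
B = (U * sigma) @ V.T                    # synthetic rectangular matrix

Q, C = randomized_svd_approx(B, k, rng)
err = np.linalg.norm(B - Q @ C)
best = np.sqrt(np.sum(sigma[k:] ** 2))   # best rank-k Frobenius error (Eckart-Young)
print(f"randomized SVD error {err:.2e}, best rank-{k} error {best:.2e}")
```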

How do these two algorithms compare? There are at least three major differences between the two algorithms. Here are the first two:

Scope. Nyström approximation applies only to psd matrices, and randomized SVD applies to a general rectangular matrix.
Single-pass? The Nyström approximation requires only a single pass over the matrix $A$ to form. (Each entry of $A$ needs to be read once to form the product $A\Omega$ , after which we have all the information we need from $A$ to form $\hat{A}$ .) By contrast, the randomized SVD requires two passes, one to compute $B\Omega$ and a second to compute $Q^\top B$ .

The third point is more subtle and concerns the accuracy of these algorithms. As we saw in a previous post, the randomized SVD approximation satisfies the error bound

(1) $\expect \norm{B - \hat{B}}_{\rm F}^2 \le \min_{r \le k-2} \left( 1 + \frac{r}{k-(r+1)} \right) \norm{B - \lowrank{B}_r}_{\rm F}^2.$

Here, $\norm{\cdot}_{\rm F}$ is the matrix Frobenius norm and $\lowrank{B}_r$ denotes the best rank- $r$ approximation to $B$ . This result, due to Halko, Martinsson, & Tropp (2011), shows that the error of rank- $k$ randomized SVD is comparable to the error of the best rank- $r$ approximation to $B$ of any rank $r\le k-2$ . See this post for more discussion of this error bound.

Here is an analogous bound for the randomized Nyström approximation, taken from Corollary 8.3 in this paper of Tropp and Webber:

(2) $\left(\expect \norm{A - \hat{A}}_{\rm F}^2 \right)^{1/2} \le \min_{r\le k-4} \left(1 + \frac{r+1}{k-(r+3)}\right) \left( \norm{A - \lowrank{A}_r}_{\rm F} + \frac{1}{\sqrt{k-r}} \norm{A - \lowrank{A}_r}_* \right).$

This bound is more complicated than the bound for the randomized SVD in several ways. For us, let us focus on one main difference: The error of the randomized Nyström approximation depends on the nuclear norm error $\norm{A - \lowrank{A}_r}_*$ of the best rank- $r$ approximation.

The nuclear norm

$\norm{C}_* = \sigma_1(C) + \sigma_2(C) + \cdots$

is defined as the sum of the singular values of a matrix. It is always at least as large as the Frobenius norm and is much larger when the singular values of $C$ decrease at a slow rate. The matrix with the slowest-possible rate of singular value decrease is the identity matrix. For the $n\times n$ identity matrix, the Frobenius norm is $\norm{\rm I}_{\rm F} = \sqrt{n}$, and the nuclear norm is $\norm{\rm I}_* = n$, a factor $\sqrt{n}$ larger!
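A quick numerical illustration of this gap, using NumPy's built-in matrix norms. The second matrix, with singular values $1, 1/2, 1/4, \ldots$, is an assumed example of rapid decay, included for contrast.

```python
import numpy as np

n = 100
identity = np.eye(n)
fro_identity = np.linalg.norm(identity, "fro")   # sqrt(n)
nuc_identity = np.linalg.norm(identity, "nuc")   # n

decaying = np.diag(2.0 ** -np.arange(n))         # singular values 1, 1/2, 1/4, ...
fro_decaying = np.linalg.norm(decaying, "fro")
nuc_decaying = np.linalg.norm(decaying, "nuc")

print(nuc_identity / fro_identity)   # ratio sqrt(n) for the identity
print(nuc_decaying / fro_decaying)   # modest ratio under rapid decay
```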

The conclusion of this discussion is that, for matrices with slowly decaying eigenvalues⁴, the randomized Nyström error bound (2) can be much larger than the randomized SVD error bound (1). For such problems, the error of the randomized SVD can be much smaller than the error of randomized Nyström approximation. We add this to our list of comparisons:

Frobenius-norm error bounds? The (Frobenius-norm) error of the randomized SVD $\norm{B - \hat{B}}_{\rm F}$ is bounded in terms of the Frobenius-norm error of the best rank-$r$ approximation $\norm{B - \lowrank{B}_r}_{\rm F}$ for $r \approx k$. For the Nyström approximation, the error $\norm{A - \hat{A}}_{\rm F}$ is bounded by a more complicated expression that also involves the nuclear-norm error of the best rank-$r$ approximation.

Generalized Nyström Approximation: The Best of Both Worlds

It is natural to ask: Is there one algorithm that achieves the positive attributes of both the randomized Nyström and randomized SVD algorithms? Is there a single-pass low-rank approximation algorithm that can be applied to any rectangular matrix and achieves Frobenius-norm error bounds? The answer is yes, and we will derive such an approximation now.

As with the randomized SVD and randomized Nyström approximation, we first compute the product $Y = B\Omega$ of $B$ with a random matrix $\Omega$ . We may then search for the best approximation $\hat{B}$ to $B$ spanned by the columns of $Y$ . Such an approximation takes the form $\hat{B} = YW$ , and we may find the best $W$ by solving a least-squares problem

(3) $W = \argmin_W \norm{B - YW}_{\rm F}.$

Symbolically, the solution to this problem may be written as $W = Y^\dagger B$ , where ${}^\dagger$ is the Moore–Penrose pseudoinverse. The resulting low-rank approximation is $\hat{B} = Y(Y^\dagger B)$ . In fact, this approximation coincides with the approximation generated by the randomized SVD algorithm. As with the standard randomized SVD, this approximation takes two passes over $B$ to form, one to form $Y = B\Omega$ and a second to form $W = Y^\dagger B$ .

To obtain a one-pass algorithm, we need a faster way of computing an (approximate) solution to the least-squares problem (3). As we have seen before on this blog, sketching provides a natural approach to quickly and approximately solving a least-squares problem. Specifically, we draw another random test matrix $\Phi \in \real^{n\times p}$ and solve the “sketched” least squares problem

(4) $\hat{W} = \argmin_W \norm{\Phi^\top B - (\Phi^\top Y)W}_{\rm F}.$

For this approach to be effective, we need oversampling: the dimension $p$ of the sketching matrix $\Phi \in \real^{n\times p}$ should be larger than the rank $k$, e.g., $p = 2k$. The solution to (4) is

$\hat{W} = (\Phi^\top Y)^\dagger (\Phi^\top B)$

and the low-rank approximation is

$\hat{B} = Y(\Phi^\top Y)^\dagger (\Phi^\top B) = (B\Omega) (\Phi^\top B\Omega)^\dagger (\Phi^\top B).$

This type of approximation is called a generalized Nyström approximation, and it can be computed in a single pass over $B$. (Namely, one should acquire, in the same pass, the products $B\Omega$ and $\Phi^\top B$.) The generalized Nyström approximation also satisfies our other desired properties, applying to general, rectangular matrices and, as we will see, achieving Frobenius-norm error bounds.
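Here is a minimal NumPy sketch of the derivation above: it solves the sketched least-squares problem (4) and compares the result against the two-pass solution of (3) for the same $Y$. The helper name `generalized_nystrom` and the synthetic matrix are assumptions for illustration. For a rectangular $m\times n$ matrix, I take $\Phi$ to have $m$ rows so that $\Phi^\top B$ makes sense.

```python
import numpy as np

def generalized_nystrom(B, k, p, rng):
    """Return (Y, W_hat) with B ≈ Y @ W_hat, touching B only through B @ Omega and Phi.T @ B."""
    m, n = B.shape
    Omega = rng.standard_normal((n, k))
    Phi = rng.standard_normal((m, p))
    Y = B @ Omega      # these two products can be accumulated
    Z = Phi.T @ B      # in the same pass over B
    W_hat = np.linalg.lstsq(Phi.T @ Y, Z, rcond=None)[0]   # sketched problem (4)
    return Y, W_hat

rng = np.random.default_rng(2)
m, n, k = 80, 50, 12
U, _ = np.linalg.qr(rng.standard_normal((m, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
B = (U * 2.0 ** -np.arange(n)) @ V.T       # assumed singular values 1, 1/2, 1/4, ...

Y, W_hat = generalized_nystrom(B, k, p=2 * k, rng=rng)
err_gn = np.linalg.norm(B - Y @ W_hat)
Q, _ = np.linalg.qr(Y)
err_rsvd = np.linalg.norm(B - Q @ (Q.T @ B))   # two-pass solution of (3) for the same Y
print(f"generalized Nystrom {err_gn:.2e} vs. randomized SVD {err_rsvd:.2e}")
```

Since the randomized SVD solves (3) exactly, its error can never exceed the generalized Nyström error for the same $Y$; the price of the single pass is the gap between the two numbers.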

History

In the modern randomized linear algebra literature, the generalized Nyström approximation appears to have been concurrently discovered by Woolfe, Liberty, Rokhlin, & Tygert (2008) and Clarkson & Woodruff (2009). An algebraically equivalent but more numerically stable version of the generalized Nyström approximation was developed by Tropp, Yurtsever, Udell, & Cevher (2017). Nakatsukasa (2020) re-examined the low-rank approximation format, developed a different class of numerically stable implementations, and suggested the name generalized Nyström approximation. Alex Townsend and Per-Gunnar Martinsson trace the origins of this low-rank approximation format much further back, to the work of Wedderburn (1934).

Implementation

This post is concerned with the generalized Nyström approximation as a type of low-rank approximation format. To use generalized Nyström approximation in practice, one must use an appropriate algorithm which computes the decomposition in a stable way.

Perhaps the simplest algorithm for computing a generalized Nyström approximation was studied by Nakatsukasa (2020). One begins by computing the matrices

$Y = B\Omega, \quad Z = B^\top \Phi, \quad C = \Phi^\top Y.$

Then, to represent the generalized Nyström approximation $\hat{B}$ as a factored matrix, one takes a QR decomposition $C = QR$ and defines $F = YR^{-1}$ and $G = ZQ$ . The generalized Nyström approximation has been computed in factored form: $\hat{B} = FG^\top$ . In cases where the core matrix $C$ is rank-deficient up to machine precision, the numerical stability of this procedure can sometimes be aided by using a truncated SVD or column-pivoted QR decomposition of $C$ ; see Nakatsukasa’s paper for details. An alternate implementation which outputs $\hat{B}$ as a compact SVD was developed by Tropp, Yurtsever, Udell, and Cevher (2017).
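The QR-based procedure can be sketched as follows. This assumes $Z = B^\top\Phi$, so that $\hat{B} = Y R^{-1} Q^\top \Phi^\top B = FG^\top$, and it uses a synthetic test matrix of my own choosing. It does not include the truncation or pivoting safeguards mentioned above, so it should be read as an illustration rather than a robust implementation.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k, p = 50, 8, 16
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
B = (U * 2.0 ** -np.arange(n)) @ V.T   # synthetic test matrix (assumed)
Omega = rng.standard_normal((n, k))
Phi = rng.standard_normal((n, p))

Y = B @ Omega
Z = B.T @ Phi
C = Phi.T @ Y                       # p-by-k core matrix

Q, R = np.linalg.qr(C)              # reduced QR: Q is p-by-k, R is k-by-k
F = np.linalg.solve(R.T, Y.T).T     # F = Y R^{-1} without forming R^{-1}
G = Z @ Q
B_hat = F @ G.T                     # factored form of the approximation

reference = Y @ np.linalg.pinv(C) @ Z.T   # the defining formula, for comparison
print(np.linalg.norm(B_hat - reference))
```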

Relationship to Other Formats

As the name suggests, the generalized Nyström approximation format generalizes the Nyström approximation beyond psd matrices. Indeed, the standard Nyström approximation

$\hat{A} = (A\Omega) (\Omega^\top A \Omega)^\dagger (\Omega^\top A)$

is precisely the generalized Nyström approximation of $A$ with identical test matrices $\Phi = \Omega$.

Perhaps less obviously, the generalized Nyström approximation also generalizes the randomized SVD approximation. Indeed, the randomized SVD approximation $\hat{B} = (B\Omega)(B\Omega)^\dagger B$ is the generalized Nyström approximation with trivial left test matrix $\Phi = I$.
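Both special cases are easy to confirm numerically. The sketch below uses a dense helper `gen_nystrom` (a name I am introducing here) that evaluates the formula directly, with random test problems chosen only for illustration.

```python
import numpy as np

def gen_nystrom(B, Omega, Phi):
    """Dense generalized Nystrom approximation, straight from the formula."""
    Y = B @ Omega
    return Y @ np.linalg.pinv(Phi.T @ Y) @ (Phi.T @ B)

rng = np.random.default_rng(4)
n, k = 40, 6

# Case 1: a psd matrix with Phi = Omega recovers the Nystrom approximation.
M = rng.standard_normal((n, n))
A = M @ M.T
Omega = rng.standard_normal((n, k))
Y = A @ Omega
nystrom = Y @ np.linalg.pinv(Omega.T @ Y) @ Y.T
same_as_nystrom = np.allclose(gen_nystrom(A, Omega, Omega), nystrom)

# Case 2: Phi = I recovers the randomized SVD approximation Q Q^T B.
B = rng.standard_normal((n, 30))
Omega_B = rng.standard_normal((30, k))
Q, _ = np.linalg.qr(B @ Omega_B)
rsvd = Q @ (Q.T @ B)
same_as_rsvd = np.allclose(gen_nystrom(B, Omega_B, np.eye(n)), rsvd)

print(same_as_nystrom, same_as_rsvd)
```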

Generalized Nyström Approximation = Sketch-and-Solve + Randomized SVD

What is the generalized Nyström approximation? There are several interpretations. For instance, if $p = k$ and $\Phi^\top B \Omega$ is invertible, the generalized Nyström approximation is the unique approximation satisfying the interpolatory condition

$\hat{B}\Omega = B\Omega \quad \text{and} \quad \hat{B}^\top\Phi = B^\top\Phi.$
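The interpolatory conditions can be checked numerically in the case $p = k$. This is a small sketch on a random rectangular matrix (my choice of test problem), with $\Phi$ sized to match the number of rows of $B$.

```python
import numpy as np

rng = np.random.default_rng(5)
m, n, k = 30, 20, 5
B = rng.standard_normal((m, n))
Omega = rng.standard_normal((n, k))
Phi = rng.standard_normal((m, k))          # p = k: no oversampling

core = Phi.T @ B @ Omega                   # k-by-k, invertible with probability one
B_hat = (B @ Omega) @ np.linalg.solve(core, Phi.T @ B)

right_ok = np.allclose(B_hat @ Omega, B @ Omega)
left_ok = np.allclose(B_hat.T @ Phi, B.T @ Phi)
print(right_ok, left_ok)
```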

Notwithstanding the validity and usefulness of other interpretations, my view is that the most useful interpretation of generalized Nyström approximation is the one we started with:

Generalized Nyström approximation is a sketched version of the randomized SVD approximation.

To see this insight in action, we will use it to analyze the generalized Nyström approximation with Gaussian test matrices. Let $\Omega \in \real^{n\times k}$ and $\Phi \in \real^{n\times p}$ be populated with independent standard Gaussian random entries, and as we have been, assume $p\ge k$ . Let us analyze the expected (squared) Frobenius-norm error of the generalized Nyström approximation.

We will use the following result for sketching with a Gaussian embedding, due to Bartan & Pilanci (2020) and discussed in this previous blog post.

Theorem 1 (sketch-and-solve): Consider a (matrix) least-squares problem
$W = Y^\dagger B= \argmin_W \norm{B - YW}_{\rm F}$
with dimensions $Y \in \real^{n \times k}$ and $B \in \real^{n\times n}$ . Let $\Phi \in \real^{n\times p}$ be a (standard) Gaussian test matrix, and instate the sketch-and-solve solution
$\hat{W} = (\Phi^\top Y)^\dagger \Phi^\top B = \argmin_W \norm{\Phi^\top B - (\Phi^\top Y)W}_{\rm F}.$
Then
$\expect \norm{B - Y\hat{W}}_{\rm F}^2 = \expect \norm{B - Y(\Phi^\top Y)^\dagger\Phi^\top B}_{\rm F}^2 = \left( 1 + \frac{k}{p - (k+1)} \right)\norm{B - YY^\dagger B}_{\rm F}^2.$
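Because Theorem 1 is an exact identity in expectation, it can be sanity-checked by Monte Carlo. The sketch below is only an empirical illustration, with dimensions and trial count chosen by me: it averages the sketch-and-solve error over many draws of $\Phi$ and compares against the predicted value.

```python
import numpy as np

rng = np.random.default_rng(6)
n, k, p, trials = 30, 3, 12, 1000
B = rng.standard_normal((n, n))
Y = rng.standard_normal((n, k))

Q, _ = np.linalg.qr(Y)
optimal = np.linalg.norm(B - Q @ (Q.T @ B)) ** 2      # ||B - Y Y^+ B||_F^2

total = 0.0
for _ in range(trials):
    Phi = rng.standard_normal((n, p))                 # fresh Gaussian sketch
    W_hat = np.linalg.lstsq(Phi.T @ Y, Phi.T @ B, rcond=None)[0]
    total += np.linalg.norm(B - Y @ W_hat) ** 2
empirical = total / trials
predicted = (1 + k / (p - (k + 1))) * optimal

print(f"empirical {empirical:.2f}, predicted {predicted:.2f}")
```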

We can apply this result to the generalized Nyström approximation by setting $Y = B\Omega$ . Let $\expect_{\Phi}$ denote the expectation over $\Phi$ alone. Then

$\expect_\Phi \norm{B - B\Omega(\Phi^\top B\Omega)^\dagger\Phi^\top B}_{\rm F}^2 = \left( 1 + \frac{k}{p - (k + 1)} \right)\norm{B - (B\Omega)(B\Omega)^\dagger B}_{\rm F}^2.$

But $(B\Omega)(B\Omega)^\dagger B$ is just the randomized SVD approximation. Invoking the randomized SVD bound (1) yields

$\begin{align*}\expect \norm{B - B\Omega(\Phi^\top B\Omega)^\dagger\Phi^\top B}_{\rm F}^2 &= \expect_\Omega\left[\expect_\Phi \norm{B - B\Omega(\Phi^\top B\Omega)^\dagger\Phi^\top B}_{\rm F}^2\right] \\&= \left( 1 + \frac{k}{p - (k + 1)} \right)\expect_\Omega \norm{B - (B\Omega)(B\Omega)^\dagger B}_{\rm F}^2 \\&\le \left( 1 + \frac{k}{p - (k + 1)} \right)\left[\min_{r < k-1}\left( 1 + \frac{r}{k - (r + 1)} \right) \norm{B - \lowrank{B}_r}_{\rm F}^2\right].\end{align*}$

Voilà! We have obtained explicit bounds for the generalized Nyström method with little effort. We record this result as a theorem:

Theorem 2 (generalized Nyström approximation): With the present setting, it holds that
$\expect \norm{B - B\Omega(\Phi^\top B\Omega)^\dagger\Phi^\top B}_{\rm F}^2 \le \left( 1 + \frac{k}{p - (k + 1)} \right)\left[\min_{r < k-1}\left( 1 + \frac{r}{k - (r + 1)} \right) \norm{B - \lowrank{B}_r}_{\rm F}^2\right].$

This result is Theorem 4.3 in this paper of Tropp, Yurtsever, Udell, & Cevher (2017). A slight refinement of this bound appears in my paper with Robert Webber, and we show that our new bound is sharp on hard examples. Thus, Theorem 2 is nearly the best possible error bound for the generalized Nyström approximation.

Choice of Random Matrix

For most of this post, we have focused on the cases where the random test matrices $\Omega$ and $\Phi$ are unstructured matrices with Gaussian random entries. But can we use more structured random test matrices? Say, sparse test matrices? Do these lead to faster low-rank approximation algorithms?

For the randomized SVD, the results are disappointing. Computing $Y = B\Omega$ with a sparse random matrix is fast. But then we compute $Q = \operatorname{orth}(Y)$ and $C = Q^\top B$; the matrix $Q$ is dense and unstructured, so computing $C = Q^\top B$ is slow, and all benefits of the sparse test matrix are erased.

The issue with the randomized SVD is that it’s a two-pass algorithm: The first pass, computing $B\Omega$ , can be done using a sparse random test matrix. But the second pass $Q^\top B$ requires a matrix product with a dense matrix $Q$ .

The situation is much better for the generalized Nyström approximation, which requires only a single pass and can be implemented by multiplying $B$ only against sparse matrices. Indeed, generating $\Omega$ and $\Phi$ to be sparse matrices, the generalized Nyström approximation can be written

$\hat{B} = Y(\Phi^\top Y)^\dagger Z \quad \text{for } Y = B\Omega \text{ and } Z = \Phi^\top B.$

We see that the only interaction we need with the matrix $B$ has been isolated into matrix products with the random test matrices, and we obtain speedups by using sparse random test matrices for $\Omega$ and $\Phi$.
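To make this concrete, here is a sketch of generalized Nyström approximation with one possible sparse test matrix: a sparse sign matrix with a fixed number `zeta` of nonzeros per row. This particular construction, the helper name `sparse_sign`, and all parameter values are assumptions for illustration, not a prescription. To keep the example NumPy-only, the test matrices are stored densely; a real implementation would store them in a sparse format so that the products with $B$ are actually cheap.

```python
import numpy as np

def sparse_sign(n, k, zeta, rng):
    """n-by-k test matrix with zeta entries equal to +-1/sqrt(zeta) in each row.

    Stored densely to keep this sketch NumPy-only; a real implementation
    would use a sparse format so that products with B are cheap.
    """
    S = np.zeros((n, k))
    for i in range(n):
        cols = rng.choice(k, size=zeta, replace=False)
        S[i, cols] = rng.choice([-1.0, 1.0], size=zeta) / np.sqrt(zeta)
    return S

rng = np.random.default_rng(7)
n, k, p, zeta = 200, 10, 20, 4
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
B = (U * 2.0 ** -np.arange(n)) @ V.T       # synthetic test matrix (assumed)

Omega = sparse_sign(n, k, zeta, rng)
Phi = sparse_sign(n, p, zeta, rng)
Y = B @ Omega                              # the only two products with B
Z = Phi.T @ B
B_hat = Y @ np.linalg.lstsq(Phi.T @ Y, Z, rcond=None)[0]
rel_err = np.linalg.norm(B - B_hat) / np.linalg.norm(B)
print(f"relative error with sparse test matrices: {rel_err:.2e}")
```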

Structured test matrices, like sparse ones, can be very powerful. But basic theoretical questions remain about their properties. We tackle these theoretical questions in my new paper (joint with Chris Camaño, Raphael Meyer, and Joel Tropp), and we provide experiments demonstrating how structured sketching matrices can lead to large speedups in generalized Nyström approximation and other linear algebra tasks. I think it’s a really neat paper, and my wonderful collaborator Chris did some really beautiful experiments for it. I hope you’ll check it out!

Note to Self: Trace Estimation with Tensor Products

March 3, 2026 by Ethan N. Epperly

Let $x_1,\ldots,x_\ell$ be independent random vectors and let $x = x_1 \otimes \cdots \otimes x_\ell$ denote their tensor product. Assume the vectors $x_i$ are isotropic, in the sense that

$\expect[x_i^{\vphantom{\top}} x_i^\top] = I.$

The vector $x$ inherits the isotropy property as well: $\expect[xx^\top] = I$. As a consequence, we can use the vector $x$ to form an unbiased estimator for the matrix trace: $\expect[x^\top Ax] = \tr(A)$. Trace estimation has been a frequent topic on this blog.

What is the variance of the trace estimate $x^\top Ax$ ? This question was addressed by Raphael Meyer and Haim Avron. The variance of the trace estimator $x^\top A x$ is

$\Var(x^\top A x) = \expect[(x^\top A x)^2] - \expect[x^\top A x]^2 = \expect[(x^\top A x)^2] - \tr(A)^2.$

As such, bounding the variance is equivalent to bounding the second moment $\expect[(x^\top A x)^2]$ . Suppose that the individual base vectors $x_i$ satisfy a moment bound

(1) $\expect[(x_i^\top A x_i)^2] \le \alpha \tr(A)^2 \quad \text{for every psd matrix } A.$

For instance, Gaussian, random sign, and uniformly random vectors on the sphere all satisfy this bound with $\alpha \le 3$ . Under assumption (1), Meyer and Avron show that the tensor-product trace estimator $x^\top Ax$ satisfies the bound

(2) $\expect[(x^\top A x)^2] \le \alpha^\ell \tr(A)^2 \quad \text{for any psd matrix } A.$

Here, a matrix $A$ is positive semidefinite (psd) if it is symmetric and satisfies $v^\top A v \ge 0$ for any vector $v$. Observe that the constant in the bound (2) is exponentially larger than the constant in bound (1). Unfortunately, as Meyer and Avron show, this exponentially large variance is a real property of the tensor-structured trace estimator, at least on worst-case examples.
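For random sign vectors in small dimensions, the expectations above can be computed exactly by enumerating every sign pattern. The following sketch does this for $\ell = 2$ factors, using $\alpha = 3$ and a random psd matrix (both the dimensions and the test matrix are my choices for illustration).

```python
import itertools
import numpy as np

rng = np.random.default_rng(8)
dims = (2, 3)                              # x = x_1 (x) x_2 lives in R^6
N = dims[0] * dims[1]
M = rng.standard_normal((N, N))
A = M @ M.T                                # a psd test matrix

values = []
for s1 in itertools.product([-1.0, 1.0], repeat=dims[0]):
    for s2 in itertools.product([-1.0, 1.0], repeat=dims[1]):
        x = np.kron(s1, s2)                # tensor product of two sign vectors
        values.append(x @ A @ x)
values = np.array(values)

mean = values.mean()                       # exact E[x^T A x]
second_moment = np.mean(values ** 2)       # exact E[(x^T A x)^2]
bound = 3.0 ** len(dims) * np.trace(A) ** 2

print(mean, np.trace(A))
print(second_moment, bound)
```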

Meyer and Avron’s paper is really nice, and it contains many different results for Kronecker-structured trace estimation beyond the bound (2). I highly recommend checking out their paper, which just appeared in the SIAM Journal on Matrix Analysis and Applications! In this post, I’ll give an alternate, somewhat shorter proof of the Meyer–Avron bound (2).

Suppose that $x = y \otimes z$ is a tensor product of independent isotropic vectors $y\in \real^{d_1}$ and $z \in \real^{n_2}$, and suppose that

(3) $\expect[(y^\top A_1 y)^2] \le \alpha_1 \tr(A_1)^2 \quad \text{and} \quad \expect[(z^\top A_2 z)^2] \le \alpha_2 \tr(A_2)^2$

for any psd matrices $A_1$ and $A_2$. Now, let $A$ be a $(d_1n_2)\times(d_1n_2)$ psd matrix, and partition $A$ into $n_2\times n_2$ blocks as

(4) $A = \begin{bmatrix} A_{11} & \cdots & A_{1d_1} \\ \vdots & \ddots & \vdots \\ A_{d_1 1} & \cdots & A_{d_1 d_1} \end{bmatrix}.$

Then

$\begin{align*}x^\top A x &= \begin{bmatrix} y_1 z \\ \vdots \\ y_{d_1}z\end{bmatrix}^\top\begin{bmatrix} A_{11} & \cdots & A_{1d_1} \\ \vdots & \ddots & \vdots \\ A_{d_1 1} & \cdots & A_{d_1 d_1} \end{bmatrix}\begin{bmatrix} y_1 z \\ \vdots \\ y_{d_1}z\end{bmatrix} \\ &= \underbrace{\begin{bmatrix} y_1 \\ \vdots \\ y_{d_1}\end{bmatrix}^\top}_{y^\top}\begin{bmatrix} z^\top A_{11}z & \cdots & z^\top A_{1d_1}z \\ \vdots & \ddots & \vdots \\ z^\top A_{d_1 1}z & \cdots & z^\top A_{d_1 d_1}z \end{bmatrix}\underbrace{\begin{bmatrix} y_1 \\ \vdots \\ y_{d_1}\end{bmatrix}}_y.\end{align*}$
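This block identity is easy to verify numerically. In the sketch below (random test data of my choosing), the array `small` holds the $d_1\times d_1$ matrix with entries $z^\top A_{ij} z$, and its trace is compared with $z^\top(A_{11}+\cdots+A_{d_1d_1})z$.

```python
import numpy as np

rng = np.random.default_rng(9)
d1, n2 = 3, 4
M = rng.standard_normal((d1 * n2, d1 * n2))
A = M @ M.T                                       # psd test matrix
y = rng.standard_normal(d1)
z = rng.standard_normal(n2)
x = np.kron(y, z)                                 # stacked blocks y_i z

blocks = A.reshape(d1, n2, d1, n2)                # blocks[i, :, j, :] is A_ij
small = np.einsum("a,iajb,b->ij", z, blocks, z)   # entries z^T A_ij z

lhs = x @ A @ x
rhs = y @ small @ y
diag_sum = np.einsum("iaib->ab", blocks)          # A_11 + ... + A_{d1 d1}
print(lhs, rhs)
print(np.trace(small), z @ diag_sum @ z)
```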

Taking the expectation over the random vector $y$ alone and applying (3) to the middle matrix, which is psd because $A$ is, we obtain

$\begin{align*}\expect_y [(x^\top A x)^2] &\le \alpha_1 \left( \tr \begin{bmatrix} z^\top A_{11}z & \cdots & z^\top A_{1d_1}z \\ \vdots & \ddots & \vdots \\ z^\top A_{d_1 1}z & \cdots & z^\top A_{d_1 d_1}z \end{bmatrix} \right)^2 \\ &= \alpha_1 [z^\top(A_{11}+\cdots+ A_{d_1d_1})z]^2.\end{align*}$

Taking the expectation over $z$ and invoking the law of total expectation then yields

$\begin{align*}\expect [(x^\top A x)^2] &\le \alpha_1 \alpha_2 \tr(A_{11}+\cdots+ A_{d_1d_1})^2=\alpha_1 \alpha_2\tr(A)^2.\end{align*}$

In the last line, we observe that the trace of $A$ is the sum of the traces of its diagonal blocks $A_{ii}$ . Voilà! We have deduced the bound

$\expect [(x^\top A x)^2] \le \alpha_1\alpha_2 \tr(A)^2,$

which immediately yields the Meyer–Avron bound (2) by iteration.

The Other Markov’s Inequality

January 16, 2026 by Ethan N. Epperly

If a polynomial function is trapped in a box, how much can it wiggle? This question is answered by Markov’s inequality, which states that for a degree- $n$ polynomial $p$ that maps $[-1,1]$ into $[-1,1]$ , it holds that

(1) $\max_{x \in [-1,1]} |p'(x)| \le n^2.$

That is, if a polynomial $p$ is trapped within a square box $[-1,1] \times [-1,1]$ , the fastest it can wiggle—as measured by its first derivative—is the square of its degree.

How tight is this inequality? Do polynomials we know and love come close to saturating it, or is this bound very loose for them? A first polynomial which is natural to investigate is the power $p(x) = x^n$ . This function maps $[-1,1]$ into $[-1,1]$ , and the maximum value of its derivative is

$\max_{x \in [-1,1]} |p'(x)| = \max_{x \in [-1,1]} |nx^{n-1}| = n\ll n^2.$

Markov’s inequality is quite loose for the power function.

To saturate the inequality, we need something wigglier. The heavyweight champions for polynomial wiggliness are the Chebyshev polynomials $T_n$, which are motivated and described at length in this previous post. For our purposes, what’s important is that Chebyshev polynomials really know how to wiggle. Just look at how much more rapidly the degree-7 Chebyshev polynomial (blue solid line) moves around than the degree-7 power $x^7$ (orange dashed line).

In addition to seeming a lot more wiggly to the eye, the degree- $n$ Chebyshev polynomials have much larger derivatives, saturating Markov’s inequality:

$\max_{x \in [-1,1]} |T_n'(x)| = n^2.$

These two examples illustrate a good rule of thumb. The derivative of a polynomial which is “simple” or “power-like” will be of size about $n$ , whereas the derivative of a “clever” or “Chebyshev-like” polynomial will be much higher at about $n^2$ .
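The two derivative computations can be reproduced with NumPy's polynomial classes. This is a small numerical check for degree $n = 7$ on a fine grid that includes the endpoints $\pm 1$ (the grid resolution is my choice).

```python
import numpy as np
from numpy.polynomial import Chebyshev, Polynomial

n = 7
x = np.linspace(-1.0, 1.0, 20001)            # grid containing the endpoints -1 and 1

T = Chebyshev.basis(n)                       # Chebyshev polynomial T_n
power = Polynomial.basis(n)                  # the power x^n

max_dT = np.max(np.abs(T.deriv()(x)))        # approximately n^2
max_dpower = np.max(np.abs(power.deriv()(x)))  # approximately n
print(max_dT, max_dpower)
```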

The inequality (1) was proven by Andrey Markov. It is much less well-known than Andrey Markov’s famous inequality in probability theory. A generalization of Markov’s inequality (1) was proven by Andrey’s brother Vladimir Markov, who proved a bound on the $k$th derivative of any degree-$n$ polynomial $p$ mapping $[-1,1]$ into $[-1,1]$:

(2) $\max_{x \in [-1,1]} |p^{(k)}(x)| \le \max_{x \in [-1,1]} |T_n^{(k)}(x)| = \frac{n^2(n^2-1^2)\cdots(n^2-(k-1)^2)}{1\times 3 \times \cdots \times (2k-1)}.$

The inequality (1) is a special case of (2) with $k=1$ . The inequality (2) is often called the Markov brothers’ inequality, and this name is sometimes also attached to the special case (1)—to help distinguish it from the probabilistic Markov inequality.
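As a quick check of the closed-form expression in (2), the sketch below compares it with the $k$th derivative of $T_n$ evaluated at the endpoint $x = 1$, where I am taking for granted that the maximum is attained. The helper name `markov_brothers_bound` is mine.

```python
from numpy.polynomial import Chebyshev

def markov_brothers_bound(n, k):
    """Right-hand side of (2): n^2 (n^2 - 1^2) ... (n^2 - (k-1)^2) / (1 * 3 * ... * (2k-1))."""
    num, den = 1, 1
    for j in range(k):
        num *= n**2 - j**2
        den *= 2 * j + 1
    return num / den

n, k = 5, 2
formula = markov_brothers_bound(n, k)
at_endpoint = Chebyshev.basis(n).deriv(k)(1.0)   # k-th derivative of T_n at x = 1
print(formula, at_endpoint)
```

For $k = 1$, the formula reduces to $n^2$, recovering (1).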

For the rest of this post, we will focus on the basic Markov inequality (1). This inequality is easily extended to polynomials trapped in a box $[a,b] \times [c,d]$ of general sidelengths. Indeed, any polynomial $p$ mapping $[a,b]\to[c,d]$ can be transmuted to a polynomial $\tilde{p}$ mapping $[-1,1]$ to $[-1,1]$ by an affine change of variables:

$\tilde{p}(x) = -1 + 2\cdot \frac{p(\ell(x)) - c}{d-c} \quad \text{for } \ell(x) = a + (b-a) \cdot \frac{x+1}{2}.$

The precise form of this change of variables is not important. What’s important is that $\tilde p$ maps $[-1,1]$ to $[-1,1]$ and, by the chain rule, the maximum derivatives of $p$ and $\tilde p$ are related by

$\max_{t \in [a,b]} |p'(t)| = \frac{d-c}{b-a} \cdot \max_{x \in [-1,1]} |\tilde p'(x)|.$

Therefore, we obtain a general version of Markov’s inequality (1).

Markov’s inequality (general domain/codomain). Let $p$ be a degree- $n$ polynomial that maps $[a,b]$ to $[c,d]$ . Then
(3) $\max_{t \in [a,b]} |p'(t)| \le \frac{d-c}{b-a} \cdot n^2.$

What’s Markov’s inequality good for? A lot, actually. In this post, we’ll see one application of this inequality: proving polynomial inapproximability. There are lots of times in computational math where it’s valuable to approximate some function like $\mathrm{e}^{-t}$, $\sqrt{t}$, or $t^{-1}$ by a polynomial. There are lots of techniques for producing polynomial approximations and understanding the rate of convergence of the best-possible polynomial approximation. But sometimes we just want a quick estimate of the form, say, “a polynomial needs to be of degree at least $n$ to approximate that function up to additive error $0.1$”. That is, we seek lower bounds on the polynomial degree needed to approximate a given function to a specified level of accuracy.

The general Markov’s inequality (3) provides a direct way of doing this. Our treatment follows Chapter 5 of Faster Algorithms via Approximation Theory by Sushant Sachdeva and Nisheesh K Vishnoi. The argument will consist of two steps. First, we trap the function in a box. Then, we show it wiggles a lot (i.e., there is a point at which the derivative is large). Therefore, by Markov’s inequality, we conclude that the degree of the polynomial must be sufficiently large.

Let’s start with the function $\mathrm{e}^{-t}$ and ask the question:

What polynomial degree $n$ do we need to approximate this function to error 0.1 on the interval $[0,b]$ ?

We are interested in the case where the interval is large, so we assume $b\ge 2$ . To address this question, suppose that $p$ is a polynomial that satisfies

$|p(t) - \e^{-t}| \le 0.1 \quad \text{for all }t \in [0,b].$

We will use this information to trap $p$ in a box and show it wiggles.

Trap it in a box. Since $0 \le \e^{-t}\le 1$ for $t \ge 0$, it must hold that $-0.1 \le p(t) \le 1.1$ for all $t \in [0,b]$. Therefore, the polynomial $p$ is trapped in the box $[0,b] \times [-0.1,1.1]$.
Show it wiggles. The function $\e^{-t}$ decreases quite rapidly. At zero, it is $\e^{-0} = 1$ and at two, it is $\e^{-2} = 0.135\ldots < 0.14$. Therefore, since $p(t)$ is within $0.1$ of $\e^{-t}$ for all $t \in [0,b]$, it must hold that $p(0)$ is at least $0.9$ and $p(2)$ is at most $0.24$. Therefore, by the mean value theorem, there is some $t^*$ between $0$ and $2$ for which
$|p'(t^*)| = \frac{p(0) - p(2)}{2-0} \ge \frac{0.9 - 0.24}{2}=0.33.$

We are ready for the conclusion. Combine the two bullet points above with Markov’s inequality (3) to obtain

$0.33 \le |p'(t^*)| \le \max_{t \in [0,b]} |p'(t)| \le \frac{1.1-(-0.1)}{b-0} \cdot n^2.$

Rearrange to obtain

$n \ge \sqrt{\frac{0.33}{1.2}} \cdot \sqrt{b} > 0.5 \sqrt{b}.$

We conclude that we need a polynomial of degree at least $0.5\sqrt{b}$ to approximate $\e^{-t}$ on $[0,b]$ . This rough estimate actually turns out to be pretty sharp. By using an appropriate polynomial approximation technique (say, interpolation at the Chebyshev points), a polynomial of degree $\mathcal{O}(\sqrt{b})$ also suffices to approximate $\e^{-t}$ to this level of accuracy on $[0,b]$ .
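To see the lower bound in action, here is a small experiment using `numpy.polynomial.Chebyshev.interpolate`, which interpolates at Chebyshev points of the first kind. For a few interval lengths $b$, it finds the smallest degree at which the interpolant reaches error $0.1$ on a fine grid and compares with $0.5\sqrt{b}$. The grid size, the values of $b$, and the helper name `min_degree` are my choices, and the grid maximum only approximates the true maximum error.

```python
import numpy as np
from numpy.polynomial import Chebyshev

def min_degree(b, tol=0.1, max_deg=200):
    """Smallest degree whose Chebyshev interpolant of exp(-t) has grid error <= tol on [0, b]."""
    t = np.linspace(0.0, b, 5001)
    target = np.exp(-t)
    for deg in range(max_deg + 1):
        p = Chebyshev.interpolate(lambda s: np.exp(-s), deg, domain=[0.0, b])
        if np.max(np.abs(p(t) - target)) <= tol:
            return deg
    return None

degrees = {b: min_degree(b) for b in (25, 100, 400)}
for b, deg in degrees.items():
    print(f"b = {b}: degree {deg}, lower bound {0.5 * np.sqrt(b):.1f}")
```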

We’ve illustrated with just a single example, but this same technique can also be used to give (sometimes sharp, sometimes not) inapproximability results for other functions we know and love like $1/x$ and $\sqrt{x}$. For a bit of fun, see if you can get results for approximating a power $x^s$ on $[-1,1]$ by a polynomial $p$ of degree $n \ll s$. You may be surprised by what you find!

I find the Markov inequality technique for proving polynomial inapproximability to be pretty cool. As we saw in a previous post, we usually understand the difficulty of approximating a function in terms of the rate of convergence of the best (polynomial) approximation, which is tied to fine properties of the function and its smoothness. The Markov inequality approach answers a different question and uses entirely different information about the function. Rather than asking about the asymptotic rate of convergence, the Markov inequality approach tells you at what degree a polynomial starts approximating the function at all. And rather than using information about smoothness, the Markov approach shows that polynomial approximation is hard for any function that changes a lot over a small interval. As an exercise, you can show that the same argument for inapproximability of $\e^{-t}$ also shows a polynomial of degree $\Omega(\sqrt{b})$ is necessary for approximating the ramp function

$f_{\rm ramp}(t) = \begin{cases} 1-t/2, & 0 \le t \le 2, \\ 0, & 2 < t \le b.\end{cases}$

From the perspective of rate of convergence, these two functions could not be more different from one another, as polynomial approximations to $\e^{-t}$ converge at an exponential rate, whereas polynomial approximations to $f_{\rm ramp}$ converge at the rate $\order(1/n)$ . But in terms of the polynomial degree to start approximating these functions, both functions require the same degree $\Theta(\sqrt{b})$ . Pretty neat, I think.

Reference. This blog post is my take on an argument presented in Chapter 5 of Faster Algorithms via Approximation Theory by Sushant Sachdeva and Nisheesh K Vishnoi. It’s a very nice monograph, and I highly recommend you check it out!

Vandermonde Matrices are Merely Exponentially Ill-Conditioned

August 13, 2025 by Ethan N. Epperly

I am excited to share that my paper Does block size matter in randomized block Krylov low-rank approximation? has recently been released on arXiv. In that paper, we study the randomized block Krylov iteration (RBKI) algorithm for low-rank approximation. Existing results show that RBKI is efficient at producing rank-$k$ approximations with a large block size of $k$ or a small block size of $1$, but these analyses give poor bounds for intermediate block sizes $1\ll b \ll k$. Yet these intermediate block sizes are often the most efficient in practice. In our paper, we close this theoretical gap, showing RBKI is efficient for any block size $1\le b \le k$. Check out the paper for details!

In our paper, the core technical challenge is understanding the condition number of a random block Krylov matrix of the form

$K = \begin{bmatrix} G & AG & \cdots & A^{t-1}G \end{bmatrix},$

where $A$ is an $n\times n$ real symmetric positive semidefinite matrix and $G$ is an $n\times b$ random Gaussian matrix. Bounding this condition number was not easy, and our proof required several ingredients. In this post, I want to talk about just one of them: Gautschi’s bound on the conditioning of Vandermonde matrices. (Check out Gautschi’s original paper here.)

Evaluating a Polynomial and Vandermonde Matrices

Let us begin with one of the humblest but most important characters in mathematics, the univariate polynomial

$p_a(u) = a_1 + a_2 u+ \cdots + a_t u^{t-1}.$

Here, we have set the degree to be $t-1$ and have written the polynomial $p_a(\cdot)$ parametrically in terms of its vector of coefficients $a = (a_1,\ldots,a_t)$ . The polynomials of degree at most $t-1$ form a linear space of dimension $t$ , and the monomials $1,u,\ldots,u^{t-1}$ provide a basis for this space. In this post, we will permit the coefficients $a \in \complex^t$ to be complex numbers.

Given a polynomial, we often wish to evaluate it at a set of inputs. Specifically, let $\lambda_1,\ldots,\lambda_s \in \complex$ be $s$ (distinct) input locations. If we evaluate $p_a$ at each location, we obtain a list of $s$ (output) values, which we denote by $f = (p_a(\lambda_1),\ldots,p_a(\lambda_s))$, each of which is given by the formula

$p_a(\lambda_i) = \sum_{j=1}^t \lambda_i^{j-1} a_j.$

Observe that the outputs $f$ are a nonlinear function of the input values $\lambda$ but a linear function of the coefficients $a$ . We may call the mapping $a \mapsto f$ the coefficients-to-values map.

Every linear transformation between vectors can be realized as a matrix–vector product, and the matrix for the coefficients-to-values map is called a Vandermonde matrix $V$ . It is given by the formula

$V= \begin{bmatrix} 1 & \lambda_1 & \lambda_1^2 & \cdots & \lambda_1^{t-1} \\ 1 & \lambda_2 & \lambda_2^2 & \cdots & \lambda_2^{t-1} \\ 1 & \lambda_3 & \lambda_3^2 & \cdots & \lambda_3^{t-1} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & \lambda_s & \lambda_s^2 & \cdots & \lambda_s^{t-1} \end{bmatrix} \in \complex^{s\times t}.$

The Vandermonde matrix defines the coefficients-to-values map in the sense that $f = Va$ .
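As a quick sanity check, here is a minimal NumPy sketch of the coefficients-to-values map. The locations, coefficients, and the helper name `vandermonde` are illustrative assumptions; `np.vander` with `increasing=True` produces the column ordering $1, \lambda_i, \lambda_i^2, \ldots$ used in this post.

```python
import numpy as np

def vandermonde(lam, t):
    """Matrix whose i-th row is (1, lam_i, lam_i^2, ..., lam_i^(t-1))."""
    return np.vander(np.asarray(lam), N=t, increasing=True)

lam = np.array([0.5, 1.0, 2.0, 3.0, -1.0])   # s = 5 illustrative locations
a = np.array([1.0, -2.0, 0.5])               # t = 3 coefficients (a_1, a_2, a_3)

V = vandermonde(lam, len(a))                 # shape (s, t)
f = V @ a                                    # coefficients-to-values map f = V a

# Direct evaluation of p_a(u) = a_1 + a_2 u + a_3 u^2 for comparison.
f_direct = a[0] + a[1] * lam + a[2] * lam**2
```

The two vectors `f` and `f_direct` should agree up to rounding error.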

Interpolating by a Polynomial and Inverting a Vandermonde Matrix

Going forward, let us set $s = t$ so the number of locations $\lambda = (\lambda_1,\ldots,\lambda_t)$ equals the number of coefficients $a = (a_1,\ldots,a_t)$ . The Vandermonde matrix maps the vector of coefficients $a$ to the vector of values $f = (f_1,\ldots,f_t) = (p_a(\lambda_1),\ldots,p_a(\lambda_t))$ . Its inverse $V^{-1}$ maps a set of values $f$ to a set of coefficients $a$ defining a polynomial $p_a$ which interpolates the values $f$ :

$p_a(\lambda_i) = f_i \quad \text{for } i =1,\ldots,t .$

More concisely, multiplying by the inverse of a Vandermonde matrix solves the problem of polynomial interpolation.

To solve the polynomial interpolation problem with Vandermonde matrices, we can do the following. Given values $f$ , we first solve the linear system of equations $Va = f$ , obtaining a vector of coefficients $a = V^{-1}f$ . Then, define the interpolating polynomial

(1) $q(u) = a_1 + a_2 u + \cdots + a_t u^{t-1} \quad \text{with } a = V^{-1} f.$

The polynomial $q$ interpolates the values $f$ at the locations $\lambda_i$ ; that is, $q(\lambda_i) = p_a(\lambda_i) = f_i$ for each $i$ .
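The Vandermonde route to interpolation can be sketched in a few lines of NumPy. The data below are made up for illustration, and the sketch calls `np.linalg.solve` rather than forming $V^{-1}$ explicitly, which is the usual practice for linear systems.

```python
import numpy as np

lam = np.array([0.0, 1.0, 2.0, 3.0])         # t = 4 distinct locations (illustrative)
f = np.array([1.0, 3.0, 2.0, 5.0])           # values to interpolate (illustrative)

V = np.vander(lam, N=len(lam), increasing=True)
a = np.linalg.solve(V, f)                    # coefficients a = V^{-1} f

def q(u):
    """Evaluate q(u) = a_1 + a_2 u + ... + a_t u^(t-1)."""
    return sum(a_j * u**j for j, a_j in enumerate(a))

# Largest interpolation error over the locations; should be at rounding level.
residual = np.max(np.abs(q(lam) - f))
```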

Lagrange Interpolation

Inverting the Vandermonde matrix is one way to solve the polynomial interpolation problem, but the polynomial interpolation problem can also be solved directly. To do so, first notice that we can construct a special polynomial $(u - \lambda_2)(u - \lambda_3)\cdots(u-\lambda_t)$ that is zero at the locations $\lambda_2,\ldots,\lambda_t$ but nonzero at the first location $\lambda_1$ . (Remember that we have assumed that $\lambda_1,\ldots,\lambda_t$ are distinct.) Further, by rescaling this polynomial to

$\ell_1(u) = \frac{(u - \lambda_2)(u - \lambda_3)\cdots(u-\lambda_t)}{(\lambda_1 - \lambda_2)(\lambda_1 - \lambda_3)\cdots(\lambda_1-\lambda_t)},$

we obtain a polynomial whose value at $\lambda_1$ is $1$ . Likewise, for each $i$ , we can define a similar polynomial

(2) $\ell_i(u) = \frac{(u - \lambda_1)\cdots(u-\lambda_{i-1}) (u-\lambda_{i+1})\cdots(u-\lambda_t)}{(\lambda_i - \lambda_1)\cdots(\lambda_i-\lambda_{i-1}) (\lambda_i-\lambda_{i+1})\cdots(\lambda_i-\lambda_t)},$

which is $1$ at $\lambda_i$ and zero at $\lambda_j$ for $j\ne i$ . Using the Kronecker delta symbol, we may write

$\ell_i(\lambda_j) = \delta_{ij} = \begin{cases} 1, & i = j, \\ 0, & i\ne j. \end{cases}$

The polynomials $\ell_i$ are called the Lagrange polynomials of the locations $\lambda_1,\ldots,\lambda_t$ . Below is an interactive illustration of the second Lagrange polynomial $\ell_2$ associated with the points $\lambda_i = i$ (with $t = 5$ ).

With the Lagrange polynomials in hand, the polynomial interpolation problem is easy. To obtain a polynomial whose values are $f$ , simply multiply each Lagrange polynomial $\ell_i$ by the value $f_i$ and sum up, obtaining

$q(u) = \sum_{i=1}^t f_i \ell_i(u).$

The polynomial $q$ interpolates the values $f$ . Indeed,

(3) $q(\lambda_j) = \sum_{i=1}^t f_i \ell_i(\lambda_j) = \sum_{i=1}^t f_i \delta_{ij} = f_j.$
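The Lagrange construction translates directly into code. The following is a minimal sketch that evaluates formula (2) by brute force at the points $\lambda_i = i$ with $t = 5$ ; the function names and the values `f` are illustrative assumptions, and indices are 0-based as is conventional in Python.

```python
import numpy as np

def lagrange_basis(lam, i, u):
    """Evaluate the Lagrange polynomial ell_i at the scalar u (0-based index i)."""
    others = np.delete(lam, i)
    return np.prod(u - others) / np.prod(lam[i] - others)

def lagrange_interpolant(lam, f, u):
    """Evaluate q(u) = sum_i f_i * ell_i(u)."""
    return sum(f[i] * lagrange_basis(lam, i, u) for i in range(len(lam)))

lam = np.array([1.0, 2.0, 3.0, 4.0, 5.0])     # lambda_i = i with t = 5
f = np.array([2.0, -1.0, 0.5, 3.0, 1.0])      # illustrative values

# Matrix of ell_i(lambda_j); it should be the identity (the Kronecker delta).
delta = np.array([[lagrange_basis(lam, i, lam_j) for lam_j in lam]
                  for i in range(len(lam))])

# Values of the interpolant at the locations; they should reproduce f.
q_at_nodes = np.array([lagrange_interpolant(lam, f, lam_j) for lam_j in lam])
```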

The interpolating polynomial computed is shown in the interactive display below:

From Lagrange to Vandermonde via Elementary Symmetric Polynomials

We now have two ways of solving the polynomial interpolation problem, the Vandermonde way (1) and the Lagrange way (3). Ultimately, the difference between these formulas is one of basis: The Vandermonde formula (1) expresses the interpolating polynomial $q$ as a linear combination of monomials $1,u,\ldots,u^{t-1}$ and the Lagrange formula (3) expresses $q$ as a linear combination of the Lagrange polynomials $\ell_1,\ldots,\ell_t$ . To convert between these formulas, we just need to express the Lagrange polynomial basis in the monomial basis.

To do so, let us examine the Lagrange polynomials more closely. Consider first the case $t = 4$ , and consider the fourth unnormalized Lagrange polynomial

$(u - \lambda_1) (u - \lambda_2) (u - \lambda_3).$

Expanding this polynomial in the monomials $u^i$ , we obtain the expression

$(u - \lambda_1) (u - \lambda_2) (u - \lambda_3) = u^3 - (\lambda_1 + \lambda_2 + \lambda_3) u^2 + (\lambda_1\lambda_2 + \lambda_1\lambda_3 + \lambda_2\lambda_3)u - \lambda_1\lambda_2\lambda_3.$

Looking at the coefficients of this polynomial in $u$ , we recognize some pretty distinctive expressions involving the $\lambda_i$ 's:

$\begin{align*}\text{coefficient of $u^2$} &= - (\lambda_1 + \lambda_2 + \lambda_3), \\\text{coefficient of $u$} &= \lambda_1\lambda_2 + \lambda_1\lambda_3 + \lambda_2\lambda_3, \\\text{coefficient of $1$} &= -\lambda_1\lambda_2\lambda_3.\end{align*}$

Indeed, these expressions are special. Up to a plus-or-minus sign, they are called the elementary symmetric polynomials of the locations $\lambda_i$ . Specifically, given numbers $\mu_1,\ldots,\mu_s$ , the $k$ th elementary symmetric polynomial $e_k(\mu_1,\ldots,\mu_s)$ is defined as the sum of all products $\mu_{i_1} \mu_{i_2} \cdots \mu_{i_k}$ of $k$ values, i.e.,

$e_k(\mu_1,\ldots,\mu_s) = \sum_{1 \le i_1 < i_2 < \cdots < i_k \le s} \mu_{i_1}\mu_{i_2}\cdots \mu_{i_k}.$

The zeroth elementary symmetric polynomial is $1$ by convention.

The elementary symmetric polynomials appear all the time in mathematics. In particular, they are the coefficients of the characteristic polynomial of a matrix and feature heavily in the theory of determinantal point processes. For our purposes, the key observation will be that the elementary symmetric polynomials appear whenever one expands out an expression like $(u + \mu_1) \cdots (u + \mu_k)$ :

Lemma 1 (Expanding a product of linear functions). It holds that
$(u + \mu_1) (u + \mu_2) \cdots (u+\mu_k) = \sum_{j=0}^k e_{k-j}(\mu_1,\ldots,\mu_k) u^j.$
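Lemma 1 is easy to check numerically. The sketch below computes the elementary symmetric polynomials by brute-force enumeration of index subsets, which is fine for small examples but not an efficient algorithm; the helper name `elem_sym` and the test values are illustrative assumptions.

```python
from itertools import combinations
from math import prod

def elem_sym(k, mu):
    """k-th elementary symmetric polynomial e_k(mu); e_0 = 1 by convention."""
    # combinations(mu, 0) yields a single empty tuple, whose product is 1.
    return sum(prod(c) for c in combinations(mu, k))

mu = [2.0, -1.0, 0.5, 3.0]      # illustrative numbers mu_1, ..., mu_k
k = len(mu)
u = 1.7                         # illustrative evaluation point

lhs = prod(u + m for m in mu)                                  # (u + mu_1)...(u + mu_k)
rhs = sum(elem_sym(k - j, mu) * u**j for j in range(k + 1))    # sum_j e_{k-j}(mu) u^j
```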

Using this fact, we obtain an expression for the Lagrange polynomials in the monomial basis. Let $\lambda_{-i} = (\lambda_1,\ldots, \lambda_{i-1},\lambda_{i+1},\ldots,\lambda_t)$ denote the list of locations without $\lambda_i$ . Then the $i$ th Lagrange polynomial is given by

$\ell_i(u) = \frac{\sum_{j=1}^t e_{t-j}(-\lambda_{-i})u^{j-1}}{\prod_{k\ne i} (\lambda_i - \lambda_k)}.$

Not exactly a beautiful expression, but it will get the job done.

Indeed, we can write the interpolating polynomial as

$q(u) = \sum_{i=1}^t f_i\ell_i(u) = \sum_{i=1}^t \sum_{j=1}^t f_i \frac{e_{t-j}(-\lambda_{-i})}{\prod_{k\ne i} (\lambda_i - \lambda_k)} u^{j-1}.$

To make progress, we interchange the order of summation and regroup:

$q(u) = \sum_{j=1}^t \sum_{i=1}^t f_i \frac{e_{t-j}(-\lambda_{-i})}{\prod_{k\ne i} (\lambda_i - \lambda_k)} u^{j-1} = \sum_{j=1}^t \left(\sum_{i=1}^t \frac{e_{t-j}(-\lambda_{-i})}{\prod_{k\ne i} (\lambda_i - \lambda_k)}\cdot f_i \right) u^{j-1}.$

We see that the coefficients of the interpolating polynomial (in the monomial basis) are

$a_j = \sum_{i=1}^t \frac{e_{t-j}(-\lambda_{-i})}{\prod_{k\ne i} (\lambda_i - \lambda_k)}\cdot f_i.$

But we also know that the coefficients are given by $a = V^{-1}f$ . Therefore, we conclude that the entries of the inverse-Vandermonde matrix $V^{-1}$ are

(4) $(V^{-1})_{ji} = \frac{e_{t-j}(-\lambda_{-i})}{\prod_{k\ne i} (\lambda_i - \lambda_k)}.$
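Formula (4) can be verified against a direct computation. The sketch below builds $V^{-1}$ entry by entry from (4), again using brute-force elementary symmetric polynomials; the function names and locations are illustrative assumptions. Note the shift from the 1-based indices of the formula to 0-based indices in the code.

```python
import numpy as np
from itertools import combinations
from math import prod

def elem_sym(k, mu):
    """k-th elementary symmetric polynomial by brute-force enumeration."""
    return sum(prod(c) for c in combinations(mu, k))

def vandermonde_inverse(lam):
    """Build V^{-1} entrywise from formula (4)."""
    t = len(lam)
    Vinv = np.zeros((t, t))
    for i in range(t):
        others = [lam[k] for k in range(t) if k != i]   # locations without lam_i
        denom = prod(lam[i] - m for m in others)
        negated = [-m for m in others]
        for j in range(t):
            # Row j (0-based) corresponds to e_{t-(j+1)} in the 1-based formula.
            Vinv[j, i] = elem_sym(t - 1 - j, negated) / denom
    return Vinv

lam = [0.5, 1.0, 2.0, 3.5]                       # illustrative distinct locations
V = np.vander(lam, N=len(lam), increasing=True)
Vinv = vandermonde_inverse(lam)
```

If the formula is implemented correctly, `Vinv @ V` should be the identity matrix up to rounding error.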

Vandermonde Matrices are Merely Exponentially Ill-Conditioned

Vandermonde matrices are notoriously ill-conditioned, meaning that small changes to the values $f$ can cause large changes to the coefficients $a$ . On its face, this might seem like the problem of polynomial interpolation itself is ill-conditioned, but this is too hasty a conclusion. After all, it is only the mapping from values $f$ to coefficients $a$ in the monomial basis that is ill-conditioned. Fortunately, there are much better, more numerically stable bases for representing a polynomial like the Chebyshev polynomials.

But these more stable methods of polynomial interpolation and approximation are not the subject of this post: Here, our task will be to characterize just how ill-conditioned the computation of $a = V^{-1}f$ is. To characterize this ill-conditioning, we will utilize the condition number of the matrix $V$ . Given a norm $\uinorm{\cdot}$ , the condition number of $V$ is defined to be

$\kappa_{\uinorm{\cdot}}(V) = \uinorm{V} \uinorm{\smash{V^{-1}}}.$

For this post, we will focus on the case where $\uinorm{\cdot}$ is chosen to be the (operator) $1$ -norm, defined as

$\norm{A}_1= \max_{x \ne 0} \frac{\norm{Ax}_1}{\norm{x}_1} \quad \text{where } \norm{x}_1 = \sum_i |x_i|.$

The $1$ -norm has a simple characterization: It is the maximum sum of the absolute values of the entries in any column:

(5) $\norm{A}_1 = \max_j \sum_i |A_{ij}|.$

We shall denote the $1$ -norm condition number $\kappa_1(\cdot)$ .

Bounding $\norm{V}_1$ is straightforward. Indeed, setting $M = \max_{1\le i \le t} |\lambda_i|$ and using (5), we compute

$\norm{V}_1= \max_{1\le j \le t} \sum_{i=1}^t |\lambda_i|^{j-1} \le \max_{1\le j \le t} tM^{j-1} = t\max\{1,M^{t-1}\}.$

An even weaker, but still useful, bound is

$\norm{V}_1\le t(1+M)^{t-1}.$

The harder task is bounding $\norm{\smash{V^{-1}}}_1$ . Fortunately, we have already done most of the hard work needed to bound this quantity. Using our expression (4) for the entries of $V^{-1}$ and using the formula (5) for the $1$ -norm, we have

$\norm{\smash{V^{-1}}}_1= \max_{1\le j \le t} \sum_{i=1}^t |(V^{-1})_{ij}| = \max_{1\le j \le t}\frac{ \sum_{i=1}^t |e_{t-i}(-\lambda_{-j})|}{\prod_{k\ne j} |\lambda_j- \lambda_k|}.$

To bound this expression, we make use of the following “triangle inequality” for elementary symmetric polynomials

$|e_k(\mu_1,\ldots,\mu_s)| \le e_k(|\mu_1|,\ldots,|\mu_s|).$

Using this bound and defining $|\lambda_{-j}| = (|\lambda_1|,\ldots,|\lambda_{j-1}|,|\lambda_{j+1}|,\ldots,|\lambda_t|)$ , we obtain

$\norm{\smash{V^{-1}}}_1\le \max_{1\le j \le t}\frac{ \sum_{i=1}^t e_{t-i}(|\lambda_{-j}|)}{\prod_{k\ne j} |\lambda_j- \lambda_k|}.$

We now use Lemma 1 with $u = 1$ to obtain the expression

(6) $\norm{\smash{V^{-1}}}_1\le \max_{1\le j \le t}\frac{ \prod_{k\ne j} (1 + |\lambda_k|)}{\prod_{k\ne j} |\lambda_j- \lambda_k|}.$

Equation (6) is Gautschi’s bound on the norm of the inverse of a Vandermonde matrix, the result we were working towards proving.

Often, it is helpful to simplify Gautschi’s bound a bit. Setting $M = \max_{1\le i \le t} |\lambda_i|$ as above, the numerator is bounded by $(1+M)^{t-1}$ . To bound the denominator, let $\mathrm{gap} \coloneqq \min_{k < j} |\lambda_j - \lambda_k|$ be the smallest distance between two locations. Using $M$ and $\mathrm{gap}$ , we can weaken the bound (6) to obtain

$\norm{\smash{V^{-1}}}_1\le \left(\frac{1+M}{\mathrm{gap}}\right)^{t-1}.$

Combining this with our bound on $\norm{V}_1$ from above, we obtain a bound on the condition number

$\kappa_1(V) \le t \left( \frac{(1+M)^2}{\mathrm{gap}} \right)^{t-1}.$

We record these results:

Theorem 2 (Gautschi’s bound, simplified). Introduce $M = \max_{1\le i \le t} |\lambda_i|$ and $\mathrm{gap} \coloneqq \min_{k < j} |\lambda_j - \lambda_k|$ . Then
$\norm{\smash{V^{-1}}}_1\le \left(\frac{1+M}{\mathrm{gap}}\right)^{t-1}$
and
$\kappa_1(V) \le t \left( \frac{(1+M)^2}{\mathrm{gap}} \right)^{t-1}.$
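To see these bounds in action, here is a small numerical sketch comparing the computed $1$ -norm quantities with Gautschi’s bound (6) and the simplified bounds of Theorem 2. The locations and the helper name `gautschi_bound` are illustrative assumptions; `np.linalg.norm(..., 1)` computes the maximum absolute column sum of a matrix, matching (5).

```python
import numpy as np

def gautschi_bound(lam):
    """Right-hand side of (6): max_j prod_{k != j} (1 + |lam_k|) / |lam_j - lam_k|."""
    lam = np.asarray(lam)
    vals = []
    for j in range(len(lam)):
        others = np.delete(lam, j)
        vals.append(np.prod(1 + np.abs(others)) / np.prod(np.abs(lam[j] - others)))
    return max(vals)

lam = np.array([-1.0, -0.3, 0.2, 0.9, 1.5])      # illustrative distinct locations
t = len(lam)
V = np.vander(lam, N=t, increasing=True)

inv_norm = np.linalg.norm(np.linalg.inv(V), 1)   # ||V^{-1}||_1
kappa1 = np.linalg.norm(V, 1) * inv_norm         # kappa_1(V)

M = np.max(np.abs(lam))
gap = min(abs(lam[j] - lam[k]) for j in range(t) for k in range(j))
sharp = gautschi_bound(lam)                      # bound (6)
simplified = ((1 + M) / gap) ** (t - 1)          # Theorem 2, first bound
kappa_bound = t * ((1 + M) ** 2 / gap) ** (t - 1)  # Theorem 2, second bound
```

The quantities should be ordered as `inv_norm <= sharp <= simplified`, and `kappa1 <= kappa_bound`.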

Gautschi’s bound suggests that Vandermonde matrices can be very ill-conditioned, which is disappointing. But Gautschi’s bound also shows that Vandermonde matrices are merely exponentially ill-conditioned; that is, they are no worse than exponentially ill-conditioned. The fact that Vandermonde matrices are only exponentially ill-conditioned plays a crucial role in our analysis of randomized block Krylov iteration.

Gautschi’s Bound as a Robust Version of the Fundamental Theorem of Algebra

The fundamental theorem of algebra states that a degree- $(t-1)$ polynomial has precisely $t-1$ roots (counted with multiplicity). Consequently, at $t$ distinct locations, it must be nonzero at at least one of them. But how nonzero must the polynomial be at that one location? How large must it be? On this subject, the fundamental theorem of algebra is silent. However, Gautschi’s bound provides an answer.

To answer this question, we ask: What is the minimum possible size $\norm{f}_1$ of the values $f = Va = (p_a(\lambda_1),\ldots,p_a(\lambda_t))$ ? Well, if we set all the coefficients $a = (0,\ldots,0)$ to zero, then $f = 0$ as well. So to avoid this trivial case, we should enforce a normalization condition on the coefficient vector $a$ , say $\norm{a}_1 = 1$ . With this setting, we are ready to compute. Begin by observing that

$\min_{\norm{a}_1 = 1} \norm{Va}_1 = \min_{a\ne 0} \frac{\norm{Va}_1}{\norm{a}_1}.$

Now, we make the change of variables $a = V^{-1}f$ (equivalently, $f = Va$ ), obtaining

$\min_{\norm{a}_1 = 1} \norm{Va}_1 = \min_{a\ne 0} \frac{\norm{Va}_1}{\norm{a}_1} = \min_{f\ne 0} \frac{\norm{f}_1}{\norm{\smash{V^{-1}f}}_1}.$

Now, take the reciprocal of both sides to obtain

$\left(\min_{\norm{a}_1 = 1} \norm{Va}_1\right)^{-1} = \max_{f\ne 0} \frac{\norm{\smash{V^{-1}f}}_1}{\norm{f}_1} = \norm{\smash{V^{-1}}}_1.$

Ergo, we conclude

$\min_{\norm{a}_1 = 1} \norm{Va}_1 = \norm{\smash{V^{-1}}}_1^{-1}.$

Indeed, this relation holds for all operator norms:

Proposition 3 (Minimum stretch). For any vector norm $\uinorm{\cdot}$ with induced operator norm $\uinorm{\cdot}$ and any square invertible matrix $A$ , it holds that
$\min_{\uinorm{v} = 1} \uinorm{Av} = \min_{v\ne 0} \frac{\uinorm{Av}}{\uinorm{v}} = \uinorm{\smash{A^{-1}}}^{-1}.$

Using this result, we obtain the lower bound $\norm{f}_1\ge \norm{a}_1/\norm{\smash{V^{-1}}}_1$ on the values $f$ of a polynomial with coefficients $a$ . Combining with Gautschi’s bound gives the following robust version of the fundamental theorem of algebra:

Theorem 4 (Robust fundamental theorem of algebra). Fix a polynomial $p(u) = a_1 + a_2 u + \cdots + a_t u^{t-1}$ and distinct locations $\lambda_1,\ldots,\lambda_t$ . Define $M = \max_{1\le i \le t} |\lambda_i|$ and $\mathrm{gap} \coloneqq \min_{k < j} |\lambda_j - \lambda_k|$ . Then
$|p(\lambda_1)| + \cdots + |p(\lambda_t)| \ge \left(\frac{\mathrm{gap}}{1+M}\right)^{t-1} (|a_1| + \cdots + |a_t|).$

Thus, at $t$ distinct locations, a nonzero degree- $(t-1)$ polynomial must be nonzero at at least one point. In fact, the sum of the absolute values at these $t$ locations can be no worse than exponentially small in $t$ .
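Theorem 4 can be spot-checked by drawing random coefficient vectors and comparing the ratio $\norm{Va}_1/\norm{a}_1$ with the guaranteed lower bound. This is only a sketch: the locations, the random seed, and the number of trials are illustrative assumptions, and random sampling explores typical coefficient vectors rather than the worst case.

```python
import numpy as np

rng = np.random.default_rng(0)
lam = np.array([-1.0, -0.3, 0.2, 0.9, 1.5])      # illustrative distinct locations
t = len(lam)

M = np.max(np.abs(lam))
gap = min(abs(lam[j] - lam[k]) for j in range(t) for k in range(j))
factor = (gap / (1 + M)) ** (t - 1)              # lower bound from Theorem 4

V = np.vander(lam, N=t, increasing=True)
ratios = []
for _ in range(1000):
    a = rng.standard_normal(t)                   # random coefficient vector
    values = V @ a                               # (p(lam_1), ..., p(lam_t))
    ratios.append(np.sum(np.abs(values)) / np.sum(np.abs(a)))

smallest_ratio = min(ratios)                     # should be at least `factor`
```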

Gaussian Integration by Parts

August 4, 2025 by Ethan N. Epperly 2 Comments

Gaussian random variables are wonderful, and there are lots of clever tricks for doing computations with them. One particularly nice tool is the Gaussian integration by parts formula, which I learned from my PhD advisor Joel Tropp. Here it is:

Gaussian integration by parts. Let $z$ be a standard Gaussian random variable. Then $\expect[zf(z)] = \expect[f'(z)]$ .

This formula makes many basic computations effortless. For instance, to compute the second moment $\expect[z^2]$ of a standard Gaussian random variable $z$ , we apply the formula with $f(x) = x$ to obtain

$\expect[z^2] = \expect[z f(z)] = \expect[f'(z)] = \expect[1] = 1.$

Therefore, the second moment is one. Since $z$ has mean zero, this also means that the variance of a standard Gaussian random variable is one.

The fourth moment is no harder to compute. Using $f(x) = x^3$ , we compute

$\expect[z^4] = \expect[zf(z)] = \expect[f'(z)] = 3\expect[z^2] = 3.$

Easy peasy. We’ve shown the fourth moment of a standard Gaussian variable is three, with no fancy integral tricks required.

Iterating this trick, we can compute all the even moments of a standard Gaussian random variable. Indeed,

$\expect[z^{2p}] = \expect[z\cdot z^{2p-1}] = (2p-1) \expect[z^{2p-2}] = (2p-1)(2p-3) \expect[z^{2p-4}] = \cdots = (2p-1)(2p-3)\cdots 1.$

We conclude that the $(2p)$ th moment of a standard Gaussian random variable is $(2p-1)!!$ , where $!!$ indicates the (in)famous double factorial $(2p-1)!! = (2p-1)\times(2p-3)\times\cdots \times 3 \times 1$ .
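The moment formula can be confirmed numerically without any random sampling. The sketch below uses Gauss–Hermite quadrature, via NumPy’s `hermegauss`, whose weight function is $\exp(-x^2/2)$ ; the rule is exact for polynomials of sufficiently low degree, so the computed moments should match $(2p-1)!!$ up to rounding. The helper names are illustrative assumptions.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

# Nodes and weights for the weight function exp(-x^2/2). Dividing the weights by
# sqrt(2*pi) turns the quadrature rule into an expectation over a standard Gaussian.
nodes, weights = hermegauss(20)
weights = weights / np.sqrt(2 * np.pi)

def gaussian_expectation(g):
    """Approximate E[g(z)] for a standard Gaussian z."""
    return np.sum(weights * g(nodes))

def double_factorial_odd(p):
    """(2p-1)!! = (2p-1)(2p-3)...(3)(1)."""
    return int(np.prod(np.arange(1, 2 * p, 2)))

# Even moments E[z^(2p)] for p = 1, ..., 5.
moments = {p: gaussian_expectation(lambda x: x ** (2 * p)) for p in range(1, 6)}

# Integration by parts with f(x) = x^3: E[z f(z)] should equal E[f'(z)].
lhs = gaussian_expectation(lambda x: x * x**3)
rhs = gaussian_expectation(lambda x: 3 * x**2)
```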

As a spicier application, let us now compute $\expect[|z|]$ . To do so, we choose $f(x) = \operatorname{sign}(x)$ to be the sign function:

$\operatorname{sign}(x) = \begin{cases} 1, & x > 0, \\ 0, & x = 0, \\ -1, & x < 0.\end{cases}$

This function is not differentiable in a “Calculus 1” sense because it is discontinuous at zero. However, it is differentiable in a “distributional sense” and its derivative is $f'(x) = 2\delta(x)$ , where $\delta$ denotes the famous Dirac delta “function”. We then compute

$\expect[|z|] = \expect[zf(z)] = \expect[f'(z)] = 2\expect[\delta(z)].$

To compute $\expect[\delta(z)]$ , write the integral out using the probability density function $\phi$ of the standard Gaussian distribution:

$\expect[|z|] = 2\expect[\delta(z)] = 2\int_{-\infty}^\infty \phi(x) \delta(x) \, \mathrm{d} x = 2\phi(0).$

The standard Gaussian distribution has density

$\phi(x) = \frac{1}{\sqrt{2\pi}} \exp \left( - \frac{x^2}{2} \right),$

so we conclude

$\expect[|z|] = 2 \cdot \frac{1}{\sqrt{2 \pi}} = \sqrt{\frac{2}{\pi}}.$

A computation involving integrating against the Gaussian density has again been made trivial by using the Gaussian integration by parts formula.
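For the skeptical reader, the value $\expect[|z|] = \sqrt{2/\pi}$ can be checked by integrating $|x|\phi(x)$ numerically. The sketch below uses a plain Riemann sum on a fine grid; the grid range and resolution are illustrative assumptions, chosen so that the truncated tails are negligible.

```python
import numpy as np

x = np.linspace(-10.0, 10.0, 200_001)            # fine grid on [-10, 10]
dx = x[1] - x[0]
phi = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)     # standard Gaussian density

mean_abs = np.sum(np.abs(x) * phi) * dx          # Riemann-sum approximation of E|z|
closed_form = np.sqrt(2 / np.pi)                 # value from integration by parts
```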

Application: Power Method from a Random Start

As an application of the Gaussian integration by parts formula, we can analyze the famous power method for eigenvalue computations with a (Gaussian) random initialization. This discussion is adapted from the tutorial of Kireeva and Tropp (2024).

Setup

Before we can get to the cool application of the Gaussian integration by parts formula, we need to set up the problem and do a bit of algebra. Let $A$ be a matrix, which we’ll assume for simplicity to be symmetric and positive semidefinite. Let $\lambda_1 > \lambda_2 \ge \lambda_3 \ge \cdots \ge \lambda_n \ge 0$ denote the eigenvalues of $A$ . We assume the largest eigenvalue $\lambda_1$ is strictly larger than the next eigenvalue $\lambda_2$ .

The power method computes the largest eigenvalue of $A$ by repeating the iteration $x \gets Ax / \norm{Ax}$ . After many iterations, $x$ approaches an eigenvector of $A$ and $x^\top A x$ approaches an eigenvalue. Letting $x^{(0)}$ denote the initial vector, the $t$ th power iterate is $x^{(t)} = A^t x^{(0)} / \norm{A^t x^{(0)}}$ and the $t$ th eigenvalue estimate is

$\mu^{(t)} = \left(x^{(t)}\right)^\top Ax^{(t)}= \frac{\left(x^{(0)}\right)^\top A^{2t+1} x^{(0)}}{\left(x^{(0)}\right)^\top A^{2t} x^{(0)}}.$

It is common to initialize the power method with a vector $x^{(0)}$ with (independent) standard Gaussian random coordinates. In this case, the components $z_1,\ldots,z_n$ of $x^{(0)}$ in an eigenvector basis of $A$ are also independent standard Gaussians, owing to the rotational invariance of the (standard multivariate) Gaussian distribution. Then the $t$ th eigenvalue estimate is

$\mu^{(t)} = \frac{\sum_{i=1}^n \lambda_i^{2t+1} z_i^2}{\sum_{i=1}^n \lambda_i^{2t} z_i^2},$

and the error of $\mu^{(t)}$ as an approximation to the dominant eigenvalue $\lambda_1$ is

$\frac{\lambda_1 - \mu^{(t)}}{\lambda_1} = \frac{\sum_{i=1}^n (\lambda_1 - \lambda_i)/\lambda_1 \cdot \lambda_i^{2t} z_i^2}{\sum_{i=1}^n \lambda_i^{2t} z_i^2}.$

Analysis

Having set everything up, we can now use the Gaussian integration by parts formula to make quick work of the analysis. To begin, observe that the quantity

$\frac{\lambda_1 - \lambda_i}{\lambda_1}$

is zero for $i = 1$ and is at most one for $i > 1$ . Therefore, the error is bounded as

$\frac{\lambda_1 - \mu^{(t)}}{\lambda_1} \le \frac{\sum_{i=2}^n \lambda_i^{2t} z_i^2}{\lambda_1^{2t} z_1^2 + \sum_{i=2}^n \lambda_i^{2t} z_i^2} = \frac{c^2}{\lambda_1^{2t} z_1^2 + c^2} \quad \text{for } c^2 = \sum_{i=2}^n \lambda_i^{2t} z_i^2.$

We have introduced a parameter $c$ to consolidate the terms in this expression not depending on $z_1$ . Since $z_1,\ldots,z_n$ are independent, $z_1$ and $c$ are independent as well.

Now, let us bound the expected value of the error. First, we take an expectation $\expect_{z_1}$ with respect to only the randomness in the first Gaussian variable $z_1$ . Here, we use Gaussian integration by parts in the reverse direction. Introduce the function

$f'(x) = \frac{c^2}{\lambda_1^{2t}x^2 + c^2}.$

This function is the derivative of

$f(x) = \frac{c}{\lambda_1^t} \arctan \left( \frac{\lambda_1^t}{c} \cdot x \right).$

Thus, by the Gaussian integration by parts formula, we have

$\expect_{z_1}\left[\frac{\lambda_1 - \mu^{(t)}}{\lambda_1}\right] \le \expect_{z_1} \left[z_1 \cdot \frac{c}{\lambda_1^t} \arctan \left( \frac{\lambda_1^t}{c} \cdot z_1 \right) \right] = \expect_{z_1} \left[|z_1| \cdot \frac{c}{\lambda_1^t} \left|\arctan \left( \frac{\lambda_1^t}{c} \cdot z_1 \right)\right| \right].$

In the last line, we observed that the bracketed quantity is nonnegative, so we are free to introduce absolute value signs. The arctangent function is always at most $\pi/2$ in absolute value, so we can bound

$\expect_{z_1}\left[\frac{\lambda_1 - \mu^{(t)}}{\lambda_1}\right] \le \frac{\pi}{2} \cdot \frac{c}{\lambda_1^t} \cdot \expect_{z_1} \left[|z_1|\right] = \sqrt{\frac{\pi}{2}} \cdot \frac{c}{\lambda_1^t}.$

Here, we used our result from above that $\expect[|z|] = \sqrt{2/\pi}$ for a standard Gaussian variable $z$ .

We’re in the home stretch! We can bound $c$ as

$c = \sqrt{\sum_{i=2}^n \lambda_i^{2t} z_i^2} \le \lambda_2^{t} \sqrt{\sum_{i=2}^n z_i^2} = \lambda_2^t \norm{(z_2,\ldots,z_n)}.$

We see that $c$ is at most $\lambda_2^t$ times the length of a vector of $n-1$ standard Gaussian entries. As we’ve seen before on this blog, the expected length of a Gaussian vector with $n-1$ entries is at most $\sqrt{n-1}$ . Thus, $\expect[c] \le \lambda_2^t \sqrt{n-1}$ . We conclude that the expected error for power iteration is

$\expect\left[\frac{\lambda_1 - \mu^{(t)}}{\lambda_1}\right] = \expect_c\left[\expect_{z_1} \left[ \frac{\lambda_1 - \mu^{(t)}}{\lambda_1} \right]\right] \le \sqrt{\frac{\pi}{2}} \cdot \expect\left[ \frac{c}{\lambda_1^t} \right] \le \sqrt{\frac{\pi}{2}} \cdot \left(\frac{\lambda_2}{\lambda_1}\right)^t \cdot \sqrt{n-1} .$

We see that the power iteration converges geometrically at a rate of at least $(\lambda_2/\lambda_1)^t$ .
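Here is a small simulation comparing the average relative error of the power method with the bound just derived. Everything about the test problem is an illustrative assumption: a diagonal matrix with $\lambda_1 = 1$ and $\lambda_2 = 0.8$ , dimension $n = 50$ , $t = 20$ iterations, and $2000$ random starts. A Monte Carlo average is only an estimate of the expectation, so this is a plausibility check rather than a proof.

```python
import numpy as np

def power_method_error(A, lam1, t, rng):
    """Relative error (lam1 - mu^(t)) / lam1 after t power iterations."""
    x = rng.standard_normal(A.shape[0])          # Gaussian random start
    for _ in range(t):
        x = A @ x
        x = x / np.linalg.norm(x)
    mu = x @ A @ x                               # eigenvalue estimate
    return (lam1 - mu) / lam1

rng = np.random.default_rng(0)
n, t = 50, 20
# Eigenvalues 1 > 0.8 >= ... >= 0; a diagonal matrix keeps the example transparent.
eigs = np.concatenate(([1.0], np.linspace(0.8, 0.0, n - 1)))
A = np.diag(eigs)

errors = [power_method_error(A, eigs[0], t, rng) for _ in range(2000)]
mean_error = float(np.mean(errors))
bound = np.sqrt(np.pi / 2) * (eigs[1] / eigs[0]) ** t * np.sqrt(n - 1)
```

The empirical `mean_error` should fall below `bound`; in general the bound can be quite pessimistic, since it discards the factors $(\lambda_1 - \lambda_i)/\lambda_1$ .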

The first analyses of power iteration from a random start were done by Kuczyński and Woźniakowski (1992) and require pages of detailed computations involving integrals. This simplified analysis, due to Tropp (2020), makes the analysis effortless by comparison.

Five Years of Blogging

July 8, 2025 by Ethan N. Epperly 1 Comment

Five years ago today, I embarked on a crazy experiment. It was the summer after I finished my undergraduate degree, and I was filled with lots of exciting things I learned from my mentor Shiv Chandrasekaran and my own self-study. I kept asking, “Why did no one teach me that subject in this way? If Shiv hadn’t taught me this trick, how would I ever have learned this? How does everyone not know this cool theorem?” I was young and naïve. But I also was brimming with passion for my subject, and I had a lot I wanted to say.

Applied mathematics, and mathematics in general, is a rare subject in which its best researchers are also among its best communicators. Writers like Joel Tropp, Nick Trefethen, and Nick Higham were my heroes, and I desperately wanted to be like them. I figured that practice is the best way to learn any skill, and I decided that writing a blog was the best way to hone my skills as a mathematical communicator and to share this burning list of mathematical curiosities I had collected on my travels. I had my doubts—was this a waste of my time? would anyone actually want to read this? did I really have anything worth saying?—but I pushed them aside and published my first blog post on July 8, 2020.

To my great surprise, this experiment has been a bigger success than I ever could have imagined. An early success came just a month after starting the blog when my post on Galerkin approximation was posted to Hacker News and received 18,000 page views in a few-day span: what a rush! But perhaps a bigger surprise to me was how writing the blog has helped connect me to people in my field. I can’t count how many times the blog comes up within minutes of meeting a new researcher, and, to my great shock, the blog has now started to garner citations in academic papers. To everyone who’s read this blog and gotten something out of it, thank you.

It is hard to truly internalize how much things have changed for me over the past five years of blogging. I went from incoming PhD student to PhD student to person with a PhD. I met my heroes, not just Joel Tropp, Nick Trefethen, and Nick Higham, but also many more like John Urschel, Heather Wilber, Anne Greenbaum, Yuji Nakatsukasa, and Lin Lin, to name just a few. I am honored that many of them are now my collaborators. Truly, computational mathematics is a wonderful research community, and I am proud to be a member of such a warm and inviting group. I am excited to continue my research, and blogging, later this summer as a Miller postdoctoral fellow at UC Berkeley.

Let me end with a message to my former self or anyone else thinking about writing a blog, producing a YouTube video, or creating any other type of expository content: Just start! The world is hungry for clear explanations of difficult subjects, and your unique perspective on your subject is worth sharing. Your writing might not be very good at first (mine certainly wasn’t), but you’ll improve with practice. Just start. You may be surprised where you end up.

Randomized Kaczmarz: How Should You Sample?

June 16, 2025 by Ethan N. Epperly Leave a comment

The randomized Kaczmarz method is a method for solving systems of linear equations:

(1) $\text{Find $x$ such that } Ax = b.$

Throughout this post, the matrix $A$ will have dimensions $n\times d$ . Beginning from an initial iterate $x_0 = 0$ , randomized Kaczmarz works as follows. For $t = 0,1,2,\ldots$ :

Sample a random row index $i_t$ with probability $\prob \{ i_t = j \} = p_j$ .
Update to enforce the equation $a_{i_t}^\top x = b_{i_t}$ holds exactly:
$x_{t+1} \coloneqq x_t + \frac{b_{i_t} - a_{i_t}^\top x_t}{\norm{a_{i_t}}^2} a_{i_t}.$
Throughout this post, $a_j^\top$ denotes the $j$ th row of $A$ .

What selection probabilities $p_j$ should we use? The answer to this question may depend on whether the system $Ax = b$ is consistent, i.e., whether it possesses a solution $x$ . For this post, we assume the system is consistent; see this previous post for a discussion of the inconsistent case.

The classical selection probabilities for randomized Kaczmarz were proposed by Strohmer and Vershynin in their seminal paper:

(1) $p_j = \frac{\norm{a_j}^2}{\norm{A}_{\rm F}^2} \quad \text{for } j = 1,2,\ldots,n.$

Computing these selection probabilities requires a full pass over the matrix, which can be expensive for the largest problems.¹ A computationally appealing alternative is to implement randomized Kaczmarz with uniform selection probabilities

(2) $p_j = \frac{1}{n} \quad \text{for } j = 1,2,\ldots,n.$
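The randomized Kaczmarz iteration and the two sampling rules can be sketched as follows. The function name `randomized_kaczmarz`, the small Gaussian test matrix, and the iteration count are illustrative assumptions; the test problem is unrelated to the experiments reported in this post.

```python
import numpy as np

def randomized_kaczmarz(A, b, probs, num_iters, rng):
    """Randomized Kaczmarz from x_0 = 0 with row-selection probabilities probs."""
    n, d = A.shape
    x = np.zeros(d)
    row_norms_sq = np.sum(A**2, axis=1)
    for i in rng.choice(n, size=num_iters, p=probs):
        # Update so that the equation a_i^T x = b_i holds exactly.
        x += (b[i] - A[i] @ x) / row_norms_sq[i] * A[i]
    return x

rng = np.random.default_rng(0)
n, d = 30, 5
A = rng.standard_normal((n, d))
x_true = rng.standard_normal(d)
b = A @ x_true                                   # consistent by construction

standard = np.sum(A**2, axis=1) / np.sum(A**2)   # squared row norm sampling
uniform = np.full(n, 1.0 / n)                    # uniform sampling

x_std = randomized_kaczmarz(A, b, standard, 3000, rng)
x_uni = randomized_kaczmarz(A, b, uniform, 3000, rng)
err_std = np.linalg.norm(x_std - x_true) / np.linalg.norm(x_true)
err_uni = np.linalg.norm(x_uni - x_true) / np.linalg.norm(x_true)
```

On such a small, well-conditioned system, both sampling rules should drive the relative error down to rounding level, so this example illustrates the mechanics rather than the difference between the two rules.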

Ignoring computational cost, which sampling rule leads to faster convergence: (1) or (2)?

Surprisingly, to me at least, the simpler strategy (2) often works better than (1). Here is a simple example. Define a matrix $A \in \real^{20\times 20}$ with entries $A_{ij} = \min(i,j)^2$ , and choose the right-hand side $b\in\real^{20}$ with standard Gaussian random entries. The convergence of standard RK with sampling rule (1) and uniform RK with sampling rule (2) is shown in the plot below. After a million iterations, the difference in final accuracy is dramatic: the final relative error was 0.00012 for uniform RK and 0.67 for standard RK!

Error for randomized Kaczmarz with both squared row norm sampling ("standard") and uniformly random rows on matrix with entries min(i,j)^2. Uniform randomized Kaczmarz achieves significantly smaller final error

In fairness, uniform RK does not always outperform standard RK. If we change the matrix entries to $A_{ij} = \min(i,j)$ , then the performance of both methods is similar, with both methods ending with a relative error of about 0.07.

Error for randomized Kaczmarz with both squared row norm sampling ("standard") and uniformly random rows on matrix with entries min(i,j). Uniform randomized Kaczmarz and standard randomized Kaczmarz achieve comparable final errors

Another experiment, presented in section 4.1 of Strohmer and Vershynin’s original paper, provides an example where standard RK converges a bit more than twice as fast as uniform RK (called “simple RK” in their paper). Still, taken all together, these experiments demonstrate that standard RK (sampling probabilities (1)) is often not dramatically better than uniform RK (sampling probabilities (2)), and uniform RK can be much better than standard RK.

Randomized Kaczmarz Error Bounds

Why does uniform RK often outperform standard RK? To answer this question, let’s look at the error bounds for the RK method.

The classical analysis of standard RK shows the method is geometrically convergent:

(3) $\expect\left[ \norm{x_t - x_\star}^2 \right] \le (1 - \kappa_{\rm dem}(A)^{-2})^t \norm{x_\star}^2.$

Here,

(4) $\kappa_{\rm dem}(A) = \frac{\norm{A}_{\rm F}}{\sigma_{\rm min}(A)} = \sqrt{\sum_i \left(\frac{\sigma_i(A)}{\sigma_{\rm min}(A)}\right)^2}$

is known as the Demmel condition number and $\sigma_i(A)$ are the singular values of $A$ . Recall that we have assumed the system $Ax = b$ is consistent, possessing a solution $x_\star$ . If there are multiple solutions, we let $x_\star$ denote the solution of minimum norm.

What about uniform RK? Let $D_A = \diag ( \norm{a_i}^{-1} : i=1,\ldots,n )$ denote a diagonal matrix containing the inverse row norms, and introduce the row-equilibrated matrix $D_A A$ . The row-equilibrated matrix $D_A A$ has been obtained from $A$ by rescaling each of its rows to have unit norm.

Uniform RK can then be related to standard RK run on the row-equilibrated matrix:

Fact (uniform sampling and row equilibration): Uniform RK on the system $Ax = b$ produces the same sequence of (random) iterates $\hat{x}_t$ as standard RK applied to the row-equilibrated system $(D_A A)x = D_A b$ .
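This fact is easy to confirm numerically: the Kaczmarz update is unchanged when a row and its right-hand side entry are rescaled together, and squared row norm sampling on a matrix with unit-norm rows is uniform. The sketch below checks both claims on a small $\min(i,j)^2$ matrix by feeding the same index sequence to both systems; the helper name, matrix size, and seed are illustrative assumptions.

```python
import numpy as np

def kaczmarz_with_indices(A, b, indices):
    """Run Kaczmarz updates from x_0 = 0 using a prescribed sequence of rows."""
    x = np.zeros(A.shape[1])
    for i in indices:
        x = x + (b[i] - A[i] @ x) / (A[i] @ A[i]) * A[i]
    return x

rng = np.random.default_rng(1)
n = 8
idx = np.arange(1, n + 1)
A = np.minimum.outer(idx, idx).astype(float) ** 2    # A_ij = min(i, j)^2
b = A @ rng.standard_normal(n)                       # consistent right-hand side

inv_row_norms = 1.0 / np.linalg.norm(A, axis=1)
A_eq = inv_row_norms[:, None] * A                    # row-equilibrated matrix D_A A
b_eq = inv_row_norms * b                             # D_A b

# Squared row norm sampling probabilities for the equilibrated system.
probs_eq = np.sum(A_eq**2, axis=1) / np.sum(A_eq**2)

indices = rng.integers(0, n, size=200)               # one shared index sequence
x_orig = kaczmarz_with_indices(A, b, indices)
x_eq = kaczmarz_with_indices(A_eq, b_eq, indices)
```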

Therefore, by (3), the iterates $\hat{x}_t$ of uniform RK satisfy the bound

(5) $\expect\left[ \norm{\hat{x}_t - x_\star}^2 \right] \le (1 - \kappa_{\rm dem}(D_A A)^{-2})^t \norm{x_\star}^2.$

Thus, at least using the error bounds (3) and (5), whether standard or uniform RK is better depends on which matrix has a smaller Demmel condition number: $A$ or $D_A A$ .

Row Equilibration and the Condition Number

Does row equilibration increase or decrease the condition number of a matrix? What is the optimal way of scaling the rows of a matrix to minimize its condition number? These are classical questions in numerical linear algebra, and they were addressed in a 1969 paper of van der Sluis. These results were then summarized and generalized in Higham’s delightful monograph Accuracy and Stability of Numerical Algorithms. Here, we present answers to these questions using a variant of van der Sluis’ argument.

First, let’s introduce some more concepts and notation. Define the spectral norm condition number

$\kappa(A) \coloneqq \frac{\sigma_{\rm max}(A)}{\sigma_{\rm min}(A)}.$

The spectral norm and Demmel condition numbers are always comparable: $\kappa(A) \le \kappa_{\rm dem}(A)\le \sqrt{\min(n,d)}\cdot \kappa(A)$ . Also, let $\mathrm{Diag}$ denote the set of all (nonsingular) diagonal matrices.

Our first result shows us that row equilibration never hurts the Demmel condition number by much. In fact, the row-equilibrated matrix produces a nearly optimal Demmel condition number when compared to any row scaling:

Theorem 1 (Row equilibration is a nearly optimal row scaling). Let $A\in\real^{n\times d}$ be wide $n\le d$ and full-rank, and let $D_AA$ denote the row-scaling of $A$ to have unit row norms. Then
$\kappa_{\rm dem}(D_AA) \le \sqrt{n}\cdot \min_{D \in \mathrm{Diag}} \kappa (DA) \le \sqrt{n}\cdot \min_{D \in \mathrm{Diag}} \kappa_{\rm dem} (DA).$

By scaling the rows of a square or wide matrix to have unit norm, we bring the Demmel condition number to within a $\sqrt{n}$ factor of the optimal row scaling. In fact, we even bring the Demmel condition number to within a $\sqrt{n}$ factor of the optimal spectral norm condition number for any row scaling.

Since the convergence rate for randomized Kaczmarz is $\kappa_{\rm dem}(A)^{-2}$ , this result shows that implementing randomized Kaczmarz with uniform sampling yields a convergence rate that is within a factor of $n$ of the optimal convergence rate using any possible sampling distribution.

This result shows us that row equilibration can’t hurt the Demmel condition number by much. But can it help? The following proposition shows that it can help a lot for some problems.

Proposition 2 (Row equilibration can help a lot). Let $A\in\real^{n\times d}$ be wide ($n\le d$) and full-rank, and let $\gamma$ denote the maximum ratio between two row norms:
$\gamma \coloneqq \frac{ \max_i \norm{a_i}}{\min_i \norm{a_i}}.$
Then the Demmel condition number of the original matrix $A$ satisfies
$\kappa_{\rm dem}(A) \le \gamma \cdot \kappa_{\rm dem}(D_A A).$
Moreover, for every $\gamma\ge 1$ , there exists a matrix $A_\gamma$ where this bound is nearly attained:
$\kappa_{\rm dem}(A_\gamma) \ge \sqrt{1-\frac{1}{n}} \cdot \gamma \cdot \kappa_{\rm dem}(D_{A_\gamma}A_\gamma).$

Taken together, Theorem 1 and Proposition 2 show that row equilibration often improves the Demmel condition number, and never increases it by that much. Consequently, uniform RK often converges faster than standard RK for square (or short wide) linear systems, and it never converges much slower.

Proof of Theorem 1

We follow Higham’s approach. Each of the $n$ rows of $D_AA$ has unit norm, so

(7) $\norm{D_AA}_{\rm F} = \sqrt{n}.$

The minimum singular value of $D_A A$ can be written in terms of the Moore–Penrose pseudoinverse $(D_A A)^\dagger = A^\dagger D_A^{-1}$ as follows

$\frac{1}{\sigma_{\rm min}(D_A A)} = \norm{A^\dagger D_A^{-1}}.$

Here, $\norm{\cdot}$ denotes the spectral norm. Then for any nonsingular diagonal matrix $D$ , we have

(8) $\frac{1}{\sigma_{\rm min}(D_A A)} = \norm{A^\dagger D^{-1} (DD_A^{-1})} \le \norm{A^\dagger D^{-1}} \norm{DD_A^{-1}} = \frac{\norm{DD_A^{-1}}}{\sigma_{\rm min}(DA)}.$

Since the matrix $DD_A^{-1}$ is diagonal, its spectral norm is

$\norm{DD_A^{-1}} = \max \left\{ \frac{|D_{ii}|}{|(D_A)_{ii}|} : 1\le i \le n \right\}.$

The diagonal entries of $D_A$ are $\norm{a_i}^{-1}$ , so

$\norm{DD_A^{-1}} = \max \left\{ |D_{ii}|\norm{a_i} : 1\le i\le n \right\}$

is the maximum row norm of the scaled matrix $DA$ . The maximum row norm is never larger than the largest singular value of $DA$ , so $\norm{DD_A^{-1}} \le \sigma_{\rm max}(DA)$ . Therefore, combining this result, (7), and (8), we obtain

$\kappa_{\rm dem}(D_AA) \le \sqrt{n} \cdot \frac{\sigma_{\rm max}(DA)}{\sigma_{\rm min}(DA)} = \sqrt{n}\cdot \kappa (DA).$

Since this bound holds for every $D \in \mathrm{Diag}$ , we are free to minimize over $D$ , leading to the first inequality in the theorem:

$\kappa_{\rm dem}(D_AA) \le \sqrt{n}\cdot \min_{D \in \mathrm{Diag}} \kappa (DA).$

Since the spectral norm condition number is never larger than the Demmel condition number, we obtain the second bound in the theorem.

Proof of Proposition 2

Write $A = D_A^{-1}(D_AA)$ . Using the Moore–Penrose pseudoinverse again, write

(10) $\kappa_{\rm dem}(A) = \norm{D_A^{-1}(D_AA)}_{\rm F} \norm{(D_A A)^\dagger D_A}.$

The Frobenius norm and spectral norm satisfy a (mixed) submultiplicative property

$\norm{BC}_{\rm F} \le \norm{B}\norm{C}_{\rm F}, \quad \norm{BC} \le\norm{B}\norm{C}.$

Applying this result to (10), we obtain

$\kappa_{\rm dem}(A) \le \norm{D_A^{-1}}\norm{D_AA}_{\rm F} \norm{(D_A A)^\dagger} \norm{D_A}.$

We recognize $\gamma = \norm{D_A^{-1}}\norm{D_A}$ and $\kappa_{\rm dem}(D_A A) = \norm{D_AA}_{\rm F} \norm{(D_A A)^\dagger}$ . We conclude

$\kappa_{\rm dem}(A) \le \gamma \cdot \kappa_{\rm dem}(D_A A).$

To show this bound is nearly attained, introduce $A_\gamma = \diag(\gamma,\gamma,\ldots,\gamma,1)$ . Then $D_{A_\gamma} A_\gamma = I$ with $\kappa_{\rm dem}(D_{A_\gamma}A_{\gamma}) = \sqrt{n}$ and

$\kappa_{\rm dem}(A_\gamma) = \frac{\norm{A_{\gamma}}_{\rm F}}{\sigma_{\rm min}(A_\gamma)} = \frac{\sqrt{(n-1)\gamma^2+1}}{1} \ge \sqrt{n} \cdot \sqrt{1-\frac{1}{n}} \cdot \gamma.$

Therefore,

$\kappa_{\rm dem}(A_\gamma) \ge \sqrt{1-\frac{1}{n}} \cdot \gamma \cdot \kappa_{\rm dem}(D_{A_\gamma}A_\gamma).$
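
As a sanity check on this example, here is a small Python sketch. Because $A_\gamma$ is diagonal, its Frobenius norm and smallest singular value can be read off from the diagonal entries, so no SVD is needed. The helper name `demmel_cond_diag` and the values `n = 10` and `gamma = 100.0` are arbitrary choices for illustration.

```python
import math

def demmel_cond_diag(diag):
    """Demmel condition number ||A||_F / sigma_min(A) of a diagonal matrix."""
    fro_norm = math.sqrt(sum(d * d for d in diag))
    sigma_min = min(abs(d) for d in diag)
    return fro_norm / sigma_min

n, gamma = 10, 100.0                      # illustrative values
a_gamma = [gamma] * (n - 1) + [1.0]       # A_gamma = diag(gamma, ..., gamma, 1)
equilibrated = [1.0] * n                  # D_A A_gamma = I

kd_original = demmel_cond_diag(a_gamma)
kd_equilibrated = demmel_cond_diag(equilibrated)

upper = gamma * kd_equilibrated                         # bound from Proposition 2
lower = math.sqrt(1 - 1 / n) * gamma * kd_equilibrated  # near-attainment bound
print(lower, kd_original, upper)
```

For these values, the Demmel condition number of $A_\gamma$ is sandwiched between the two bounds, and the gap between them shrinks as $n$ grows.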

Practical Guidance

What does this theory mean for practice? Ultimately, single-row randomized Kaczmarz is often not the best algorithm for the job for ordinary square (or short–wide) linear systems, anyway—block Kaczmarz or (preconditioned) Krylov methods have been faster in my experience. But, supposing that we have locked in (single-row) randomized Kaczmarz as our algorithm, how should we implement it?

This question is hard to answer, because there are examples where standard RK and uniform RK each converge faster than the other. Theorem 1 suggests uniform RK can require as many as $n\times$ more iterations than standard RK on a worst-case example, which can be a big difference for large problems. But, particularly for badly row-scaled problems, Proposition 2 shows that uniform RK can dramatically outcompete standard RK. Ultimately, I would give two answers.

First, if the matrix $A$ has already been carefully designed to be well-conditioned and computing the row norms is not computationally burdensome, then standard RK may be worth the effort. Despite the theory suggesting standard RK can do quite badly, it took a bit of effort to construct a simple example of a “bad” matrix where uniform RK significantly outcompeted standard RK. (On most examples I constructed, the rates of convergence of the two methods were similar.)

Second, particularly for the largest systems where you only want to make a small number of total passes over the matrix, expending a full pass over the matrix to compute the row norms is a significant expense. And, for poorly row-scaled matrices, sampling using the squared row norms can hurt the convergence rate. Based on these observations, given a matrix of unknown row scaling and conditioning or given a small budget of passes over the matrix, I would use the uniform RK method over the standard RK method.

Finally, let me again emphasize that the theoretical results Theorem 1 and Proposition 2 only apply to square or wide matrices $A$ . Uniform RK also appears to work for consistent systems with a tall matrix, but I am unaware of a theoretical result comparing the Demmel condition numbers of $D_AA$ and $A$ that applies to tall matrices. And for inconsistent systems of equations, it’s a whole different story.

Edit: After initial publication of this post, Mark Schmidt shared that the observation that uniform RK can outperform standard RK was made nearly ten years ago in section 4.2 of the following paper. They support this observation with a different mathematical justification.

A Neat Not-Randomized Algorithm: Polar Express

June 7, 2025 by Ethan N. Epperly

Every once in a while, there’s a paper that comes out that is so delightful that I can’t help but share it on this blog, and I’ve started a little series Neat Randomized Algorithms for exactly this purpose. Today’s entry into this collection is The Polar Express: Optimal Matrix Sign Methods and their Application to the Muon Algorithm by Noah Amsel, David Persson, Christopher Musco, and Robert M. Gower. Despite its authors belonging to the randomized linear algebra oeuvre, this paper is actually about a plain-old deterministic algorithm. But it’s just so delightful that I couldn’t help but share it in this series anyway.

The authors of The Polar Express are motivated by the recent Muon algorithm for neural network optimization. The basic idea of Muon is that it helps to orthogonalize the search directions in a stochastic gradient method. That is, rather than update a weight matrix $W$ with search direction $G$ using the update rule

$W \gets W - \eta G,$

instead use the update

$W\gets W - \eta \operatorname{polar}(G).$

Here,

$\operatorname{polar}(G) \coloneqq \operatorname*{argmin}_{Q \textrm{ with orthonormal columns}} \norm{G - Q}_{\rm F}$

is the closest matrix with orthonormal columns to $G$ and is called the (unitary) polar factor of $G$ . (Throughout this post, we shall assume for simplicity that $G$ is tall and full-rank.) Muon relies on efficient algorithms for rapidly approximating $\operatorname{polar}(G)$ .

Given a singular value decomposition $G = U\Sigma V^\top$ , the polar factor may be computed in closed form as $\operatorname{polar}(G) = UV^\top$ . But computing the SVD is computationally expensive, particularly in GPU computing environments. Are there more efficient algorithms that avoid the SVD? In particular, can we design algorithms that use only matrix multiplications, for maximum GPU efficiency?

The Polar Factor as a Singular Value Transformation

Computing the polar factor $\operatorname{polar}(G)$ of a matrix $G$ effectively applies an operation to $G$ which replaces all of its singular values by one. Such operations are studied in quantum computing, where they are called singular value transformations.

Definition (singular value transformation): Given an odd function $f$ , the singular value transformation of $G = U\Sigma V^\top$ by $f$ is $f[G] \coloneqq Uf(\Sigma)V^\top$ .

On its face, it might seem that the polar factor of $G$ cannot be obtained as a singular value transformation. After all, the constant function $f(x)= 1$ is not odd. But, to obtain the polar factor, we only need a function $f$ which sends positive inputs to $1$ . Thus, the polar factor $\operatorname{polar}(G)$ is given by the singular value transformation associated with the sign function:

$\operatorname{sign}(x) = \begin{cases} 1, & x > 0, \\ 0, & x = 0, \\ -1, & x < 0. \end{cases}$

The sign function is manifestly odd, and the polar factor satisfies

$\operatorname{polar}(G) = \operatorname{sign}[G].$

Singular Value Transformations and Polynomials

How might we go about computing the singular value transformation of a matrix? For an (odd) polynomial, this computation can be accomplished using a sequence of matrix multiplications alone. Indeed, for $p(x) = a_1 x + a_3 x^3 + \cdots + a_{2k+1} x^{2k+1}$ , we have

$p[G] = a_1 G + a_3 G(G^\top G) + \cdots + a_{2k+1} G(G^\top G)^k.$
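
This formula can be verified term by term from the SVD. Since $G^\top G = V\Sigma^2 V^\top$ and $V^\top V = I$ , each monomial satisfies

$G(G^\top G)^j = U\Sigma V^\top \cdot V \Sigma^{2j} V^\top = U \Sigma^{2j+1} V^\top,$

which is exactly the singular value transformation of $G$ by $x \mapsto x^{2j+1}$ . Summing over $j$ with weights $a_{2j+1}$ gives $p[G] = Up(\Sigma)V^\top$ .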

For a general (odd) function $f$ , we can approximately compute the singular value transformation $f[G]$ by first approximating $f$ by a polynomial $p$ , and then using $p[G]$ as a proxy for $f[G]$ . Here is an example:

>> G = randn(2)                            % Random test matrix
G =
   0.979389080992349  -0.198317114406418
  -0.252310961830649  -1.242378171072736
>> [U,S,V] = svd(G);
>> fG = U*sin(S)*V'                        % Singular value transformation
fG =
   0.824317193982434  -0.167053523352195
  -0.189850719961322  -0.935356030417109
>> pG = G - (G*G'*G)/6 + (G*G'*G*G'*G)/120 % Polynomial approximation
pG =
   0.824508188218982  -0.167091255945116
  -0.190054681059327  -0.936356677704568

We see that we get reasonably high accuracy by approximating $\sin[G]$ using its degree-five Taylor polynomial.

The Power of Composition

The most basic approach to computing the sign function would be to use a fixed polynomial of degree $2k+1$ . However, this approach converges fairly slowly as we increase the degree $k$ .

A better strategy is to use compositions. A nice feature of the sign function is the fixed point property: For every $x$ , $\operatorname{sign}(x)$ is a fixed point of the $\operatorname{sign}$ function:

$\operatorname{sign}(\operatorname{sign}(x)) = \operatorname{sign}(x) \quad \text{for all } x \in \real.$

The fixed point property suggests an alternative strategy for computing the sign function using polynomials. Rather than using one polynomial of large degree, we can instead compose many polynomials of low degree. The simplest such compositional algorithm is the Newton–Schulz iteration, which consists of initializing $P\gets G$ and applying the following update until convergence:

$P \gets \frac{3}{2} P - \frac{1}{2} PP^\top P.$

Here is an example execution of the algorithm:

>> P = randn(100) / 25;
>> [U,~,V] = svd(P); polar = U*V'; % True polar decomposition
>> for i = 1:20
      P = 1.5*P-0.5*P*P'*P; % Newton-Schulz iteration
      fprintf("Iteration %d\terror %e\n",i,norm(P - polar));
   end
Iteration 1	error 9.961421e-01
Iteration 2	error 9.942132e-01
Iteration 3	error 9.913198e-01
Iteration 4	error 9.869801e-01
Iteration 5	error 9.804712e-01
Iteration 6	error 9.707106e-01
Iteration 7	error 9.560784e-01
Iteration 8	error 9.341600e-01
Iteration 9	error 9.013827e-01
Iteration 10	error 8.525536e-01
Iteration 11	error 7.804331e-01
Iteration 12	error 6.759423e-01
Iteration 13	error 5.309287e-01
Iteration 14	error 3.479974e-01
Iteration 15	error 1.605817e-01
Iteration 16	error 3.660929e-02
Iteration 17	error 1.985827e-03
Iteration 18	error 5.911348e-06
Iteration 19	error 5.241446e-11
Iteration 20	error 6.686995e-15

As we see, the initial rate of convergence is very slow, and we obtain only a single digit of accuracy after 15 iterations. After this burn-in period, the rate of convergence is very rapid, and the method achieves machine accuracy after 20 iterations.
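
The slow start has a simple scalar explanation. Newton–Schulz is a singular value transformation, so each singular value $\sigma$ of $P$ evolves independently under the map $\sigma \mapsto \tfrac{3}{2}\sigma - \tfrac{1}{2}\sigma^3$ . For small $\sigma$ , this map is nearly multiplication by $3/2$ , so a tiny singular value needs many iterations just to grow to order one. Here is a minimal Python sketch tracking one singular value; the starting value `1e-3` is a hypothetical choice, not taken from the experiment above.

```python
def newton_schulz_scalar(sigma):
    """One Newton-Schulz step applied to a single singular value."""
    return 1.5 * sigma - 0.5 * sigma**3

sigma = 1e-3  # hypothetical small singular value
errors = []
for iteration in range(1, 26):
    sigma = newton_schulz_scalar(sigma)
    errors.append(abs(1.0 - sigma))
    print(f"Iteration {iteration}\terror {errors[-1]:e}")
```

The printed errors show the same pattern as the matrix experiment: a long stretch of slow geometric growth, followed by very rapid convergence once the singular value is close to one.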

The Polar Express

The Newton–Schulz iteration approximates the sign function using a composition of the same polynomial $p$ repeatedly. But we can get better approximations by applying a sequence of different polynomials $p_1,\ldots,p_t$ , resulting in an approximation of the form

$\operatorname{sign}[G] \approx p_t[p_{t-1}[\cdots[p_2[p_1[G]]\cdots]].$

The Polar Express paper asks the question:

What is the optimal choice of polynomials $p_i$ ?

For simplicity, the authors of The Polar Express focus on the case where all of the polynomials $p_i$ have the same (odd) degree $2k+1$ .

On its face, it seems like this problem might be intractable as the best choice of polynomial $p_{i+1}$ could seemingly depend in a complicated way on all of the previous polynomials $p_1,\ldots,p_i$ . Fortunately, the authors of The Polar Express show that there is a very simple way of computing the optimal polynomials. Begin by assuming that the singular values of $G$ lie in an interval $[\ell_0,u_0]$ . We then choose $p_1$ to be the degree-( $2k+1$ ) odd polynomial approximation to the sign function on $[\ell_0,u_0]$ that minimizes the $L_\infty$ error:

$p_1 = \operatorname*{argmin}_{\text{odd degree-($2k+1$) polynomial } p} \max_{x \in [\ell_0,u_0]} |p(x) - \operatorname{sign}(x)|.$

This optimal polynomial can be computed by a version of the Remez algorithm provided in the Polar Express paper. After applying $p_1$ to $G$ , the singular values of $p_1[G]$ lie in a new interval $[\ell_1,u_1]$ . To build the next polynomial $p_2$ , we simply find the optimal approximation to the sign function on this interval:

$p_2 = \operatorname*{argmin}_{\text{odd degree-($2k+1$) polynomial } p} \max_{x \in [\ell_1,u_1]} |p(x) - \operatorname{sign}(x)|.$

Continuing in this way, we can generate as many polynomials $p_1,p_2,\ldots$ as we want.

For given values of $\ell_0$ and $u_0$ , the coefficients of the optimal polynomials $p_1,p_2,\ldots$ can be computed in advance and stored, allowing for rapid deployment at runtime. Moreover, we can always ensure the upper bound is $u_0 = 1$ by normalizing $G\gets G / \norm{G}_{\rm F}$ . As such, there is only one parameter $\ell_0$ that we need to know in order to compute the optimal coefficients. The authors of The Polar Express are motivated by applications in deep learning using 16-bit floating point numbers. In this setting, the lower bound $\ell_0 = 0.001$ is appropriate. (As the authors stress, their method remains convergent even if too large a value of $\ell_0$ is chosen, though convergence may be slowed somewhat.)

Below, I repeat the experiment from above using (degree-5) Polar Express instead of Newton–Schulz. The coefficients for the optimal polynomials are taken from the Polar Express paper.

>> P = randn(100) / 25;
>> [U,~,V] = svd(P); polar = U*V';
>> P2 = P*P'; P = ((17.300387312530933*P2-23.595886519098837*eye(100))*P2+8.28721201814563*eye(100))*P; fprintf("Iteration 1\terror %e\n",norm(P - polar));
Iteration 1	error 9.921347e-01
>> P2 = P*P'; P = ((0.5448431082926601*P2-2.9478499167379106*eye(100))*P2+4.107059111542203*eye(100))*P; fprintf("Iteration 2\terror %e\n",norm(P - polar));
Iteration 2	error 9.676980e-01
>> P2 = P*P'; P = ((0.5518191394370137*P2-2.908902115962949*eye(100))*P2+3.9486908534822946*eye(100))*P; fprintf("Iteration 3\terror %e\n",norm(P - polar));
Iteration 3	error 8.725474e-01
>> P2 = P*P'; P = ((0.51004894012372*P2-2.488488024314874*eye(100))*P2+3.3184196573706015*eye(100))*P; fprintf("Iteration 4\terror %e\n",norm(P - polar));
Iteration 4	error 5.821937e-01
>> P2 = P*P'; P = ((0.4188073119525673*P2-1.6689039845747493*eye(100))*P2+2.300652019954817*eye(100))*P; fprintf("Iteration 5\terror %e\n",norm(P - polar));
Iteration 5	error 1.551595e-01
>> P2 = P*P'; P = ((0.37680408948524835*P2-1.2679958271945868*eye(100))*P2+1.891301407787398*eye(100))*P; fprintf("Iteration 6\terror %e\n",norm(P - polar));
Iteration 6	error 4.588549e-03
>> P2 = P*P'; P = ((0.3750001645474248*P2-1.2500016453999487*eye(100))*P2+1.8750014808534479*eye(100))*P; fprintf("Iteration 7\terror %e\n",norm(P - polar));
Iteration 7	error 2.286853e-07
>> P2 = P*P'; P = ((0.375*P2-1.25*eye(100))*P2+1.875*eye(100))*P; fprintf("Iteration 8\terror %e\n",norm(P - polar));
Iteration 8	error 1.113148e-14

We see that the Polar Express algorithm converges to machine accuracy in only 8 iterations (24 matrix products), a speedup over the 20 iterations (40 matrix products) required by Newton–Schulz. The Polar Express paper contains further examples with even more significant speedups.
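
The scalar viewpoint shows what the Polar Express polynomials are doing to each singular value. Below is a minimal Python sketch that applies the eight degree-5 polynomials from the experiment above to a single singular value. The coefficients are copied from the transcript; the starting value `1e-3` (the lower bound $\ell_0$ mentioned earlier) and the names `COEFFS` and `apply_polynomial` are my own choices for illustration.

```python
# (c5, c3, c1) for p(x) = c5*x^5 + c3*x^3 + c1*x, one tuple per iteration,
# copied from the MATLAB experiment above.
COEFFS = [
    (17.300387312530933, -23.595886519098837, 8.28721201814563),
    (0.5448431082926601, -2.9478499167379106, 4.107059111542203),
    (0.5518191394370137, -2.908902115962949, 3.9486908534822946),
    (0.51004894012372, -2.488488024314874, 3.3184196573706015),
    (0.4188073119525673, -1.6689039845747493, 2.300652019954817),
    (0.37680408948524835, -1.2679958271945868, 1.891301407787398),
    (0.3750001645474248, -1.2500016453999487, 1.8750014808534479),
    (0.375, -1.25, 1.875),
]

def apply_polynomial(x, c5, c3, c1):
    """Evaluate the odd polynomial c5*x^5 + c3*x^3 + c1*x by Horner's rule."""
    x2 = x * x
    return ((c5 * x2 + c3) * x2 + c1) * x

sigma = 1e-3  # illustrative starting singular value (the lower bound l_0)
errors = []
for iteration, (c5, c3, c1) in enumerate(COEFFS, start=1):
    sigma = apply_polynomial(sigma, c5, c3, c1)
    errors.append(abs(1.0 - sigma))
    print(f"Iteration {iteration}\terror {errors[-1]:e}")
```

The first polynomial multiplies a small singular value by roughly $8.3$ rather than the $1.5$ of Newton–Schulz, which is one way to see why the burn-in period is so much shorter.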

Make sure to check out the Polar Express paper for many details not shared here, including extra tricks to improve stability in 16-bit floating point arithmetic, discussions about how to compute the optimal polynomials, and demonstrations of the Polar Express algorithm for training GPT-2.

References: Muon was first formally described in the blog post Muon: An optimizer for hidden layers in neural networks (2024); for more, see this blog post by Jeremy Bernstein and this paper by Jeremy Bernstein and Laker Newhouse. The Polar Express is proposed in The Polar Express: Optimal Matrix Sign Methods and their Application to the Muon Algorithm (2025) by Noah Amsel, David Persson, Christopher Musco, and Robert M. Gower. For more on the matrix sign function and computing it, chapter 5 of Functions of Matrices: Theory and Computation (2008) by Nicholas H. Higham is an enduringly useful reference.

Markov Musings 5: Poincaré Inequalities

May 22, 2025 by Ethan N. Epperly

In the previous posts, we’ve been using eigenvalues to understand the mixing of reversible Markov chains. Our main convergence result was as follows:

$\chi^2\left(\rho^{(n)} \, \middle|\middle| \, \pi\right) \le \left( \max \{ \lambda_2, -\lambda_m \} \right)^{2n} \chi^2\left(\rho^{(0)} \, \middle|\middle| \, \pi\right).$

Here, $\rho^{(n)}$ denotes the distribution of the chain at time $n$ , $\pi$ denotes the stationary distribution, $\chi^2(\cdot \mid\mid \cdot)$ denotes the $\chi^2$ divergence, and $1 = \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_m \ge -1$ denote the decreasingly ordered eigenvalues of the Markov transition matrix $P$ .

Bounding the rate of convergence requires an upper bound on $\lambda_2$ and a lower bound on $\lambda_m$ . In this post, we will talk about techniques for bounding $\lambda_2$ . For more on the smallest eigenvalue $\lambda_m$ , see the previous post.

Setting

Let’s begin by establishing some notation, mostly the same as in previous posts in this series. We work with a reversible Markov chain with transition matrix $P$ and stationary distribution $\pi$ .

As in previous posts, we identify vectors $f \in \real^m$ and functions $f : \{1,\ldots,m\} \to \real$ , treating them as one and the same $f_i = f(i)$ .

For a vector/function $f$ , $\expect_\pi[f]$ and $\Var_\pi(f)$ denote the mean and variance with respect to the stationary distribution $\pi$ :

$\expect_\pi[f] = \sum_{i=1}^m f(i) \pi_i, \quad \Var_\pi(f) \coloneqq \expect_\pi[(f-\expect_\pi[f])^2].$

We will make frequent use of the $\pi$ -inner product

$\langle f, g\rangle \coloneqq \expect_\pi[f\cdot g] = \sum_{i=1}^m f(i) g(i) \pi_i.$

We shall also use expressions such as $\expect_{x \sim \sigma, y\sim \tau} [f(x,y)]$ to denote the expectation of $f(x,y)$ where $x$ is drawn from distribution $\sigma$ and $y$ is drawn from $\tau$ .

We denote the eigenvalues of the transition matrix by $1 = \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_m \ge -1$ . The associated eigenvectors (eigenfunctions) $\varphi_1,\ldots,\varphi_m$ are orthonormal in the $\pi$ -inner product

$\langle \varphi_i ,\varphi_j\rangle = \begin{cases}1, & i = j, \\0, & i \ne j.\end{cases}$

Variance and Local Variance

To discover methods for bounding $\lambda_2$ , we begin by investigating a seemingly simple question:

How variable is the output of a function $f : \{1,\ldots,m\} \to \real$ ?

There are two natural quantities which provide answers to this question: the variance and the local variance. Poincaré inequalities—the main subject of this post—establish a relation between these two numbers. As a consequence, Poincaré inequalities will provide a bound on $\lambda_2$ .

Variance

We begin with the first of our two main characters, the variance $\Var_\pi(f)$ . The variance is a very familiar measure of variation, as it is defined for any random variable. It measures the average squared deviation of $f(x)$ from its mean, where $x$ is drawn from the stationary distribution $\pi$ .

Another helpful formula for the variance is the exchangeable pairs formula:

$\Var_\pi(f) = \frac{1}{2} \expect_{x,y \sim \pi} [(f(x) - f(y))^2].$

The exchangeable pairs formula states that the variance of $f$ is proportional to the average square difference of $f$’s values when measured at locations $x$ and $y$ sampled (independently) from the stationary distribution $\pi$ .
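
To see where the exchangeable pairs formula comes from, expand the square and use that $x$ and $y$ are independent with the same distribution $\pi$ :

$\frac{1}{2} \expect_{x,y \sim \pi} [(f(x) - f(y))^2] = \frac{1}{2}\expect_\pi[f^2] - \expect_\pi[f]\cdot \expect_\pi[f] + \frac{1}{2}\expect_\pi[f^2] = \expect_\pi[f^2] - (\expect_\pi[f])^2 = \Var_\pi(f).$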

Local Variance

The exchangeable pairs formula shows that variance is a measure of the global variability of the function: It measures the amount $f$ varies across locations $x$ and $y$ sampled randomly from the entire set of possible states $\{1,\ldots,m\}$ .

The local variance measures how much $f$ varies between points $x$ and $y$ which are separated by just one step of the Markov chain, thus providing a more local measure of variability. Let $x_0 \sim \pi$ be sampled from the stationary distribution, and let $x_1$ denote one step of the Markov chain after $x_0$ . The local variance is

$\mathcal{E}(f) = \frac{1}{2} \expect [(f(x_0) - f(x_1))^2].$

Other names for the local variance include the Dirichlet form and the quadratic form of the Markov chain.

An important note: The variance of a function $f$ depends only on the stationary distribution $\pi$ . By contrast, the local variance depends on the Markov transition matrix $P$ .

Poincaré Inequalities

If $f$ does not vary much over a single step of the Markov chain, then it seems reasonable to expect that it doesn’t vary much globally. This intuition is made quantitative using Poincaré inequalities.

Definition (Poincaré inequality). A Markov chain is said to satisfy a Poincaré inequality with constant $\alpha$ if
(1) $\Var_\pi(f)\le \alpha \cdot \mathcal{E}(f) \quad \text{for every function } f.$

Poincaré Inequalities and Mixing

Poincaré inequalities are intimately related with the speed of mixing for a Markov chain.

To see why, consider a function $f$ with small local variance. Because $f$ has small local variance, $f(x_0)$ is close to $f(x_1)$ , $f(x_1)$ is close to $f(x_2)$ , etc.; the function $f$ does not change much over a single step of the Markov chain. Does this mean that the (global) variance of $f$ will also be small? Not necessarily. If the Markov chain takes a long time to mix, the small local variance can accumulate to a large global variance over many steps of the Markov chain. Thus, a slowly mixing chain has a large Poincaré constant $\alpha$ . Conversely, if the chain mixes rapidly, the Poincaré constant $\alpha$ is small.

This relation between mixing and Poincaré inequalities is quantified by the following theorem:

Theorem (Poincaré inequalities from eigenvalues). The Markov chain satisfies a Poincaré inequality with constant
$\alpha= \frac{1}{1-\lambda_2}.$
This is the smallest possible Poincaré constant for the Markov chain.

One way to interpret this result is that the eigenvalue $\lambda_2$ gives you a Poincaré inequality (1). But we can flip this result around: Poincaré inequalities (1) establish bounds on the eigenvalue $\lambda_2$ .

Corollary (Eigenvalue bounds from Poincaré inequalities). If the Markov chain satisfies a Poincaré inequality (1) for a certain constant $\alpha$ , then
$\lambda_2 \le 1 - \frac{1}{\alpha}.$

A View to the Continuous Setting

For a particularly vivid example of a Poincaré inequality, it will be helpful to take a brief detour to the world of continuous Markov processes. This series has, to this point, exclusively focused on Markov chains $x_0,x_1,x_2,\ldots$ that have finitely many possible states and are indexed by discrete times $0,1,2,\ldots$ . We can generalize Markov chains by lifting both of these restrictions, considering Markov processes $(x_t)_{t\ge 0}$ which take values in continuous space (such as the real line $\real$ ) and are indexed by continuous times $t\ge 0$ .

The mathematical details for Markov processes are a lot more complicated than for their Markov chain siblings, so we will keep it light on details.

For this example, our Markov process will be the Ornstein–Uhlenbeck process. This process has the somewhat mysterious form

$x_t = e^{-t}x_0 + e^{-t} B_{e^{2t}-1},$

where $(B_s)_{s\ge 0}$ denotes a (standard) Brownian motion, independent of the starting state $x_0$ . At time $s$ , the Brownian motion $B_s$ has a Gaussian distribution with variance $s$ . Thus,

Conditional on its starting value $x_0$ , the Ornstein–Uhlenbeck process $x_t$ has a Gaussian distribution with mean $e^{-t}x_0$ and variance $1-e^{-2t}$ .

From this observation, it appears that the stationary distribution of the Ornstein–Uhlenbeck process is the standard Gaussian distribution. Indeed, this is the case, and the Ornstein–Uhlenbeck process converges to stationarity exponentially fast.
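
The variance in the statement above comes from a one-line computation. Conditional on $x_0$ , the only randomness in $x_t$ is the Brownian term, so

$\Var(x_t \mid x_0) = e^{-2t} \cdot \Var\big(B_{e^{2t}-1}\big) = e^{-2t}(e^{2t}-1) = 1 - e^{-2t}.$

As $t\to\infty$ , the conditional mean $e^{-t}x_0$ tends to $0$ and the conditional variance tends to $1$ , which is consistent with a standard Gaussian stationary distribution.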

Since we have exponential convergence to stationarity,¹ there’s a Poincaré inequality lurking in the background, known as the Gaussian Poincaré inequality. Letting $Z$ denote a standard Gaussian random variable, the Gaussian Poincaré inequality states that

(2) $\Var(f(Z)) \le \expect \big[(f'(Z))^2\big].$

The right-hand side of this inequality is the local variance of the Ornstein–Uhlenbeck process, equal to the expected squared derivative:

$\mathcal{E}(f) = \expect \big[(f'(Z))^2\big].$

The Gaussian Poincaré inequality presents a very clear demonstration of what a Poincaré inequality is: The global variance of the function $f(Z)$ is controlled by its local variability, here quantified by the expected squared derivative $\expect \big[(f'(Z))^2\big]$ .

For general Markov chains or processes, it remains helpful to think of the local variance $\mathcal{E}(f)$ as a generalization of the “expected squared derivative” of the function $f$ .

Our main interest in Poincaré inequalities in this post is instrumental: we seek to use Poincaré inequalities to understand the mixing properties of Markov chains. But the Gaussian Poincaré inequality demonstrates that Poincaré inequalities are also interesting on their own terms. The inequality (2) is a useful inequality for bounding the variance of a function of a Gaussian random variable. As an immediate example, observe that the function $f(x) = \tanh(x)$ has derivative bounded by $1$ : $|f'(x)| \le 1$ . Thus,

$\Var(\tanh Z) \le \expect[(f'(Z))^2] \le 1.$

This inequality is not too difficult to prove directly,² but the point stands that the Gaussian Poincaré inequality—and Poincaré inequalities in general—can be useful on their own terms.³
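
For a quick numerical look at this example, here is a minimal Python sketch that approximates both sides of the Gaussian Poincaré inequality for $f = \tanh$ by the trapezoid rule. The helper name `gaussian_expectation`, the truncation to $[-10,10]$ , and the grid size are my own choices for illustration; the derivative is $f'(x) = 1 - \tanh^2(x)$ .

```python
import math

def gaussian_expectation(g, half_width=10.0, steps=20000):
    """Approximate E[g(Z)] for a standard Gaussian Z with the trapezoid rule."""
    h = 2 * half_width / steps
    total = 0.0
    for k in range(steps + 1):
        z = -half_width + k * h
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * g(z) * math.exp(-0.5 * z * z)
    return total * h / math.sqrt(2 * math.pi)

mean = gaussian_expectation(math.tanh)
variance = gaussian_expectation(lambda z: math.tanh(z) ** 2) - mean**2
local_variance = gaussian_expectation(lambda z: (1 - math.tanh(z) ** 2) ** 2)

print(f"Var(tanh Z)    ~ {variance:.4f}")
print(f"E[(f'(Z))^2]   ~ {local_variance:.4f}")
```

The computed variance should sit below the expected squared derivative, which in turn sits below the crude bound of $1$ .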

Poincaré Inequalities and Eigenvalues

For the remainder of this post, we will develop the connection between Poincaré inequalities and eigenvalues, leading to a proof of our main theorem:

Theorem (Poincaré inequalities from eigenvalues). The Markov chain satisfies a Poincaré inequality with constant
$\alpha= \frac{1}{1-\lambda_2}.$
That is,
(3) $\Var_\pi(f)\le \frac{1}{1-\lambda_2}\cdot\mathcal{E}(f) \quad \text{for all } f\in\real^m.$
There exists a function $f$ for which equality is attained.

We begin by showing that it suffices to consider mean-zero functions $\expect_\pi[f] = 0$ to prove (3). Next, we derive formulas for $\Var_\pi(f)$ and $\mathcal{E}(f)$ using the $\pi$ -inner product $\langle\cdot,\cdot\rangle$ . We conclude by expanding $f$ in eigenvectors of $P$ and deriving the Poincaré inequality (3).

Shift to Mean-Zero

To prove the Poincaré inequality (3), we are free to assume that $f$ has mean zero, $\expect_\pi[f]=0$ . Indeed, both the variance $\Var_\pi(f)$ and local variance $\mathcal{E}(f)$ don’t change if we shift $f$ by a constant $c$ . That is, letting $\mathbb{1}$ denote the function

$\mathbb{1}(i) = 1 \quad\text{for }i =1,\ldots,m,$

then

$\Var_\pi(f+c\mathbb{1})=\Var_\pi(f)\quad\text{and}\quad\mathcal{E}(f+c\mathbb{1})=\mathcal{E}(f)$

for every function $f$ and constant $c$ . Therefore, for proving our Poincaré inequality, we can always shift $f$ so that it is mean-zero:

$\expect_\pi[f] = 0.$

Variance

Our strategy for proving the main theorem will be to develop a more linear algebraic formula for the variance and local variance. Let’s begin with the variance.

Assume $\expect_\pi[f] = 0$ . Then the variance is

$\Var_\pi(f)=\expect_\pi[f^2]=\sum_{i=1}^m f(i)f(i)\pi_i.$

Using the definition of the $\pi$ -inner product, we have shown that

$\Var_\pi(f)=\langle f,f\rangle.$

Local Variance

Now we derive a formula for the local variance:

$\mathcal{E}(f) = \frac{1}{2} \expect[(f(x_0)-f(x_1))^2]\quad \text{where }x_0\sim\pi.$

The probability that $x_0=i$ and $x_1=j$ is $\pi_iP_{ij}$ . Thus,

$\mathcal{E}(f) = \frac{1}{2} \sum_{i,j=1}^m (f(i)-f(j))^2 \pi_iP_{ij}.$

Expanding the parentheses and regrouping strategically, we obtain

$\mathcal{E}(f) = {\rm A} + {\rm B} + {\rm C}$

where

$\begin{align*}{\rm A} &= \frac{1}{2} \sum_{i=1}^m (f(i))^2 \pi_i \left( \sum_{j=1}^m P_{ij}\right),\\{\rm B} &= \frac{1}{2} \sum_{j=1}^m (f(j))^2 \left( \sum_{i=1}^m \pi_i P_{ij} \right), \\{\rm C} &= -\sum_{i=1}^m f(i)\left( \sum_{j=1}^m P_{ij} f(j) \right) \pi_i.\end{align*}$

Let’s take each of these terms one-by-one. For ${\rm A}$ , recognize that $\sum_{j=1}^m P_{ij} = 1$ . Thus,

${\rm A} = \frac{1}{2}\sum_{i=1}^m (f(i))^2 \pi_i = \frac{1}{2}\langle f, f\rangle.$

For $\rm B$ , use detailed balance $\pi_i P_{ij} = \pi_j P_{ji}$ . Then, using the condition $\sum_{i=1}^m P_{ji} = 1$ , we obtain

${\rm B} = \frac{1}{2} \sum_{j=1}^m (f(j))^2 \left(\sum_{i=1}^m \pi_j P_{ji} \right) = \frac{1}{2} \sum_{j=1}^m (f(j))^2 \pi_j = \frac{1}{2} \langle f, f\rangle.$

Finally, for ${\rm C}$ , recognize that $\sum_{j=1}^m P_{ij} f(j)$ is the $i$ th entry of the matrix–vector product $Pf$ . Thus,

${\rm C} = - \sum_{i=1}^m f(i) Pf(i) \,\pi_i = -\langle f, Pf\rangle.$

Adding the three terms, we conclude

$\mathcal{E}(f) = \langle f, (I-P)f\rangle,$

where $I$ denotes the identity matrix.
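
The identity $\mathcal{E}(f) = \langle f, (I-P)f\rangle$ is easy to test on a small example. Here is a minimal Python sketch using a random walk on a weighted graph, which is a standard way to build a reversible chain: with symmetric weights $W_{ij}$ , set $P_{ij} = W_{ij}/d_i$ and $\pi_i = d_i / \sum_k d_k$ , where $d_i = \sum_j W_{ij}$ . The specific weights `W` and function `f` are made up for illustration.

```python
# Symmetric edge weights of a hypothetical 3-state graph (self-loops allowed)
W = [[1.0, 2.0, 0.0],
     [2.0, 0.0, 3.0],
     [0.0, 3.0, 1.0]]
m = len(W)

deg = [sum(row) for row in W]
total = sum(deg)
P = [[w / d for w in row] for row, d in zip(W, deg)]  # transition matrix
pi = [d / total for d in deg]                         # stationary distribution

f = [1.0, -2.0, 4.0]  # arbitrary test function

# Local variance from its definition: (1/2) E[(f(x0) - f(x1))^2] with x0 ~ pi
local_variance = 0.5 * sum(
    pi[i] * P[i][j] * (f[i] - f[j]) ** 2 for i in range(m) for j in range(m)
)

# Quadratic form <f, (I - P) f> in the pi-inner product
Pf = [sum(P[i][j] * f[j] for j in range(m)) for i in range(m)]
quadratic_form = sum(f[i] * (f[i] - Pf[i]) * pi[i] for i in range(m))

print(local_variance, quadratic_form)
```

Detailed balance holds here because $\pi_i P_{ij} = W_{ij} / \sum_k d_k$ is symmetric in $i$ and $j$ , so the two quantities agree up to rounding.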

Conclusion

The Poincaré inequality

$\Var_\pi(f) \le \frac{1}{1-\lambda_2} \cdot\mathcal{E}(f) \quad \text{for all $f$ with $\expect_\pi[f] = 0$}$

is equivalent to the bound

$\frac{\mathcal{E}(f)}{\Var_\pi(f)}\ge 1-\lambda_2 \quad \text{for all $f$ with $\expect_\pi[f] = 0$}.$

Using our newly derived formulas, this in turn is equivalent to showing

$\frac{\langle f, (I-P)f\rangle}{\langle f, f\rangle} \ge 1-\lambda_2 \quad \text{for all $f$ with $\expect_\pi[f] = 0$}.$

We shall prove this version of the Poincaré inequality by expanding $f$ as a linear combination of eigenvectors.

Consider a decomposition of $f$ as a linear combination of $P$'s eigenvectors:

$f = c_1 \varphi_1 + c_2 \varphi_2 + \cdots + c_m\varphi_m.$

As we showed in the last post, the condition $\expect_\pi[f] = 0$ is equivalent to saying that $c_1 = 0$.

Using the orthonormality of $\varphi_1,\ldots,\varphi_m$ under the $\pi$ -inner product and the eigenvalue relation $P\varphi_i = \lambda_i \, \varphi_i$ , we have that

$\begin{align*}\langle f, f\rangle &= c_2^2 + \cdots + c_m^2, \\\langle f, (I-P)f\rangle &= (1-\lambda_2)c_2^2 + \cdots + (1-\lambda_m)c_m^2.\end{align*}$

Thus,

(4) $\begin{align*}\frac{\langle f, (I-P)f\rangle}{\langle f, f\rangle} &= \frac{(1-\lambda_2)c_2^2 + \cdots + (1-\lambda_m)c_m^2}{c_2^2 + \cdots + c_m^2} \\&= (1-\lambda_2)a_2 + \cdots + (1-\lambda_m)a_m,\end{align*}$

where

$a_i = \frac{c_i^2}{c_2^2 + \cdots + c_m^2}.$

The coefficients $a_i$ are nonnegative and add to $1$ :

$a_2+\cdots+a_m = \frac{c_2^2+\cdots+c_m^2}{c_2^2+\cdots+c_m^2} = 1.$

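Recall that the eigenvalues are ordered so that $\lambda_2 \ge \lambda_3 \ge \cdots \ge \lambda_m$, which makes $1-\lambda_2$ the smallest of the factors $1-\lambda_i$. Therefore, the smallest possible value for (4) is achieved by setting $a_2 = 1$ and $a_3 = \cdots = a_m = 0$ (equivalently, setting $c_3 = \cdots=c_m = 0$). Thus, we conclude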

$\frac{\langle f, (I-P)f\rangle}{\langle f, f\rangle}\ge 1-\lambda_2,$

with equality when $f$ is a multiple of $\varphi_2$ .
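The eigenvector argument can also be tested numerically. The following sketch makes several assumptions for illustration: the reversible chain is again built from a random symmetric weight matrix, the eigenvalues are computed from the symmetrized matrix $D^{1/2} P D^{-1/2}$ with $D = \operatorname{diag}(\pi)$, and `rayleigh` is a hypothetical helper name for the ratio $\langle f, (I-P)f\rangle / \langle f, f\rangle$.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 6

# Assumed reversible chain from a symmetric weight matrix
W = rng.random((m, m))
W = W + W.T
r = W.sum(axis=1)
P = W / r[:, None]
pi = r / r.sum()

# S = D^{1/2} P D^{-1/2} is symmetric (by detailed balance) and shares P's eigenvalues
d = np.sqrt(pi)
S = d[:, None] * P / d[None, :]
S = (S + S.T) / 2               # remove rounding asymmetry
eigvals, U = np.linalg.eigh(S)  # eigenvalues in ascending order
lam2 = eigvals[-2]              # second-largest eigenvalue
phi2 = U[:, -2] / d             # eigenvector of P, unit norm in the pi-inner product

def rayleigh(f):
    """Ratio <f, (I-P)f> / <f, f> in the pi-inner product, after centering f."""
    f = f - np.sum(f * pi)
    return np.sum(f * (f - P @ f) * pi) / np.sum(f * f * pi)

ratios = np.array([rayleigh(rng.standard_normal(m)) for _ in range(1000)])
print(ratios.min(), 1 - lam2, rayleigh(phi2))
```

Every sampled ratio should be at least $1-\lambda_2$, and the ratio for $\varphi_2$ should match $1-\lambda_2$ up to rounding error.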

The Schur Product Theorem

February 25, 2025 by Ethan N. Epperly

The Schur product theorem states that the entrywise product $A\circ M$ of two positive semidefinite matrices is also positive semidefinite. This post will present every proof I know for this theorem, and I intend to edit it to add additional proofs if I learn of them. (Please reach out if you know another!) My goal in this post is to be short and sweet, so I will assume familiarity with many properties of positive semidefinite matrices.

For this post, a matrix $A\in\real^{n\times n}$ is positive semidefinite (psd, for short) if it is symmetric and satisfies $x^\top Ax\ge 0$ for all vectors $x\in\real^n$ . All matrices in this post are real, though the proofs we’ll consider also extend to complex matrices. The entrywise product will be denoted $\circ$ and is defined as $(A\circ M)_{ij} = A_{ij}M_{ij}$ . The entrywise product is also known as the Hadamard product or Schur product.

It is also true that the entrywise product of two positive definite matrices is positive definite. The interested reader may wish to check which of the proofs also yield this result.

Proof 1: Trace formula

We start by computing $x^\top (A\circ M)x$ :

$x^\top (A\circ M)x = \sum_{i,j=1}^n x_i (A\circ M)_{ij} x_j = \sum_{i,j=1}^n x_i A_{ij} M_{ij} x_j.$

Now, we may rearrange the sum, use the symmetry of $M$, and repackage it as a trace:

$x^\top (A\circ M)x = \sum_{i,j=1}^n x_i A_{ij} x_j M_{ji} = \tr(\operatorname{diag}(x) A \operatorname{diag}(x) M).$

This is the trace formula for quadratic forms in the Schur product.

Recall that a matrix $A$ is psd if and only if it is a Gram matrix (able to be expressed as $A = B^\top B$). Thus, we may write $A = B^\top B$ and $M = C^\top C$. Substituting these expressions in the trace formula and invoking the cyclic property of the trace, we get

$x^\top (A\circ M)x = \tr(\operatorname{diag}(x) B^\top B \operatorname{diag}(x) C^\top C) = \tr(C\operatorname{diag}(x) B^\top B \operatorname{diag}(x) C^\top).$

The matrix on the right-hand side has the expression

$C\operatorname{diag}(x) B^\top B \operatorname{diag}(x) C^\top = G^\top G \quad \text{for } G = B \operatorname{diag}(x) C^\top.$

Therefore, it is psd and so its trace is nonnegative:

$x^\top (A\circ M)x = \tr(G^\top G) \ge 0.$

We have shown $x^\top (A\circ M)x\ge 0$ for every vector $x$ , so $A\circ M$ is psd.
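The chain of identities in this proof is easy to check numerically. Here is a minimal sketch, assuming random Gram factors `B` and `C` and a random vector `x` (all chosen purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4

# Assumed test data: psd matrices built as Gram matrices
B = rng.standard_normal((n, n))
C = rng.standard_normal((n, n))
A = B.T @ B
M = C.T @ C
x = rng.standard_normal(n)
Dx = np.diag(x)

# Quadratic form in the Schur product (A * M is the entrywise product in NumPy)
quad_form = x @ (A * M) @ x

# Trace formula: tr(diag(x) A diag(x) M)
trace_form = np.trace(Dx @ A @ Dx @ M)

# Gram form: tr(G^T G) with G = B diag(x) C^T
G = B @ Dx @ C.T
gram_form = np.trace(G.T @ G)

print(quad_form, trace_form, gram_form)
```

The three printed values should agree up to rounding error, and the last is a sum of squares, so it is nonnegative.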

Proof 2: Gram matrix

Since $A$ and $M$ are psd, they may be written as $A = B^\top B$ and $M = C^\top C$ . Letting $b_i^\top$ and $c_i^\top$ denote the $i$ th rows of $B$ and $C$ , we have

$A = \sum_i b_ib_i^\top \quad \text{and} \quad M = \sum_j c_jc_j^\top.$

Computing the Schur product and distributing, we have

$A\circ M = \sum_{i,j} (b_ib_i^\top \circ c_jc_j^\top).$

The Schur product of rank-one matrices $b_ib_i^\top$ and $c_jc_j^\top$ is, by direct computation, $(b_i\circ c_j)(b_i\circ c_j)^\top$ . Thus,

$A\circ M = \sum_{i,j} (b_i\circ c_j)(b_i\circ c_j)^\top$

is a sum of (rank-one) psd matrices and is thus psd.
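The rank-one decomposition can be verified directly. In this sketch, the factors `B` and `C` are random matrices chosen for illustration, and the double loop runs over their rows, mirroring the double sum above:

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 4, 3

# Assumed test data: Gram factors with k rows each
B = rng.standard_normal((k, n))
C = rng.standard_normal((k, n))
A = B.T @ B
M = C.T @ C

# Sum of rank-one psd matrices (b_i ∘ c_j)(b_i ∘ c_j)^T over all pairs of rows
schur_sum = np.zeros((n, n))
for b in B:
    for c in C:
        v = b * c
        schur_sum += np.outer(v, v)

# Smallest eigenvalue of the Schur product, computed independently
min_eig = np.linalg.eigvalsh(A * M).min()

print(np.allclose(schur_sum, A * M), min_eig)
```

The sum should reproduce `A * M`, and the smallest eigenvalue should be nonnegative up to rounding error.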

Proof 3: Covariances

Let $x$ and $y$ be independent random vectors with zero mean and covariance matrices $A$ and $M$ . The vector $x\circ y$ is seen to have zero mean as well. Thus, the $ij$ entry of the covariance matrix $\Cov(x\circ y)$ of $x\circ y$ is

$\expect[x_iy_ix_jy_j] = \expect[x_ix_j] \expect[y_iy_j] = A_{ij} M_{ij} = (A\circ M)_{ij}.$

The first equality uses the independence of $x$ and $y$, and the second uses the fact that $A$ and $M$ are the covariance matrices of $x$ and $y$. Thus, the covariance matrix of $x\circ y$ is $A\circ M$. All covariance matrices are psd, so $A\circ M$ is psd as well.¹
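A Monte Carlo experiment gives a rough illustration of the covariance identity. The sketch below assumes Gaussian random vectors and two small hand-picked covariance matrices; neither choice comes from the proof, which needs only independence and zero mean.

```python
import numpy as np

rng = np.random.default_rng(4)

# Assumed example covariance matrices (both psd)
A = np.array([[1.0, 0.5], [0.5, 1.0]])
M = np.array([[1.0, -0.3], [-0.3, 1.0]])

# Independent zero-mean samples with covariances A and M
n_samples = 200_000
x = rng.multivariate_normal(np.zeros(2), A, size=n_samples)
y = rng.multivariate_normal(np.zeros(2), M, size=n_samples)

# Entrywise product of each sample pair; its mean is zero, so no centering is needed
z = x * y
emp_cov = z.T @ z / n_samples

print(emp_cov)
print(A * M)
```

The empirical covariance should approach `A * M` as the number of samples grows, with the usual Monte Carlo error of order $1/\sqrt{n}$.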