Low-Rank Approximation Toolbox: Generalized Nyström Approximation

Today, I want to talk about the generalized Nyström approximation, which I regard as the one of the “big three” approaches to constructing a low-rank approximation to matrix.¹ Understanding this approximation, under what conditions it works and the sharpest possible error bounds for it, is a subject of two recent papers of mine:

Faster Randomized Linear Algebra with Structured Random Matrices, joint with Chris Camaño, Raphael Meyer, and Joel Tropp.
Sharp analysis of sketched least squares and randomized low-rank approximation, joint with Robert Webber.

On the occasion of the release of the second paper this morning, I felt it was a good time to talk about the generalized Nyström approximation on this blog. In this post, I will try and motivate the generalized Nyström approximation, describing the motivation for the method and when it might be preferable to alternatives.

Existing Characters: Nyström Approximation and the Randomized SVD

To begin our story, let me begin with a reminder of a couple of characters we’ve met in previous installments of this blog, randomized Nyström approximation and the randomized SVD.

Randomized Nyström approximation is a method for producing a low-rank approximation to a positive semidefinite² (psd) matrix $A$ . For form this approximation, begin by drawing a random test matrix $\Omega$ , say, with independent standard normal random entries. (We will have more to say about the choice of $\Omega$ below). Using this test matrix, the Nyström approximation is defined as³

$\hat{A} = A\Omega (\Omega^\top A\Omega)^{-1} (A\Omega)^\top.$

To implement this algorithm in practice, one should take care to use numerically stable pseudocode; see this paper for details.

The randomized SVD is a method for constructing a low-rank approximation to a general, non-symmetric or even non-square matrix $B$ . Again, begin by constructing a random test matrix $\Omega$ . To construct a low-rank approximation, compute the product $B\Omega$ and orthonormalize its columns (e.g., by QR decomposition) to obtain

$Q \coloneqq \operatorname{orth}(B\Omega).$

Then, to construct a low-rank approximation, we employ a second product with the matrix $B$ , yielding the low-rank approximation

$\hat{B} = QC \quad \text{for } C = Q^\top B.$

How do these two algorithms compare? There are at least three major differences between the two algorithms. Here are the first two:

Scope. Nyström approximation applies only to psd matrices, and randomized SVD applies to a general rectangular mtrix.
Single-pass? The Nyström approximation requires only a single pass over the matrix $A$ to form. (Each entry of $A$ needs to be read once to form the product $A\Omega$ , after which we have all the information we need from $A$ to form $\hat{A}$ .) By contrast, the randomized SVD requires two passes, one to compute $B\Omega$ and a second to compute $Q^\top B$ .

The third point is more subtle and concerns the accuracy of these algorithms. As we saw in a previous post, the randomized SVD approximation satisfies the error bound

(1) $\expect \norm{B - \hat{B}}_{\rm F}^2 \le \min_{r \le k-2} \left( 1 + \frac{r}{k-(r+1)} \right) \norm{B - \lowrank{B}_r}_{\rm F}^2.$

Here, $\norm{\cdot}_{\rm F}$ is the matrix Frobenius norm and $\lowrank{B}_r$ denotes the best rank- $r$ approximation to $B$ . This result, due to Halko, Martinsson, & Tropp (2011), shows that the error of rank- $k$ randomized SVD is comparable to the error of the best rank- $r$ approximation to $B$ of any rank $r\le k-2$ . See this post for more discussion of this error bound.

Here is analogous bound for the randomized Nyström approximation, taken from Corollary 8.3 in this paper of Tropp and Webber:

(2) $\left(\expect \norm{A - \hat{A}}_{\rm F}^2 \right)^{1/2} \le \min_{r\le k-4} \left(1 + \frac{r+1}{k-(r+3)}\right) \left( \norm{A - \lowrank{A}}_{\rm F} + \frac{1}{\sqrt{k-r}} \norm{A - \lowrank{A}_r}_* \right).$

This bound is more complicated than the bound for the randomized SVD in several ways. For us, let us focus on one main difference: The error of the randomized Nyström approximation depends on the nuclear norm error $\norm{A - \lowrank{A}_r}_*$ of the best rank- $r$ approximation.

The nuclear norm

$\norm{C}_* = \sigma_1(C) + \sigma_2(C) + \cdots$

is defined as the sum of the singular values of a matrix. It is always larger than the Frobenius norm and is much larger when the singular values of $C$ decrease a slow rate. The matrix with the slowest-possible rate of singular decrease is the identity matrix. For this matrix, its Frobenius is $\norm{\rm I}_{\rm F} = \sqrt{n}$ , and its nuclear norm is $\norm{\rm I}_* = n$ —a factor $\sqrt{n}$ larger!

The conclusion of this discussion is that, for matrices with slowly decaying eigenvalues⁴, the the randomized Nyström error bound (2) can be much larger than the randomized SVD error bound (1). For such problems, the error of the randomized SVD can be much smaller than the error of randomized Nyström approximation. We add this to our list of comparisons

Frobenius-norm error bounds? The (Frobenius-norm) error of the randomized SVD $\norm{B - \hat{B}}_{\rm F}$ is bounded in terms of the Frobenius-norm error of the best rank- $r$ approximation $\norm{B - \lowrank{B}_r}_{\rm F}$ for $r \approx k$ . For the Nyström approximation, the error $\norm{A - \hat{A}}_{\rm F}$ is bounded by a more complicated expression that also involves the nuclear norm of the best rank- $r$ approximation.

Generalized Nyström Approximation: The Best of Both Worlds

It is natural to ask: Is there one algorithm that achieves the positive attributes of both the randomized Nyström and randomized SVD algorithms? Is there a single-pass low-rank approximation algorithm that can be applying to any rectangular matrix and achieves Frobenius-norm error bounds? The answer is yes, and we will derive such an approximation now.

As with the randomized SVD and randomized Nyström approximation, we first compute the product $Y = B\Omega$ of $B$ with a random matrix $\Omega$ . We may then search for the best approximation $\hat{B}$ to $B$ spanned by the columns of $Y$ . Such an approximation takes the form $\hat{B} = YW$ , and we may find the best $W$ by solving a least-squares problem

(3) $W = \argmin_W \norm{B - YW}_{\rm F}.$

Symbolically, the solution to this problem may be written as $W = Y^\dagger B$ , where ${}^\dagger$ is the Moore–Penrose pseudoinverse. The resulting low-rank approximation is $\hat{B} = Y(Y^\dagger B)$ . In fact, this approximation coincides with the approximation generated by the randomized SVD algorithm. As with the standard randomized SVD, this approximation takes two passes over $B$ to form, one to form $Y = B\Omega$ and a second to form $W = Y^\dagger B$ .

To obtain a one-pass algorithm, we need a faster way of computing an (approximate) solution to the least-squares problem (3). As we have seen before on this blog, sketching provides a natural approach to quickly and approximately solving a least-squares problem. Specifically, we draw another random test matrix $\Phi \in \real^{n\times p}$ and solve the “sketched” least squares problem

(4) $\hat{W} = \argmin_W \norm{\Phi^\top B - (\Phi^\top Y)W}_{\rm F}.$

For this approach to be effective, we need oversampling: the dimension $p$ of the sketching matrix $\Phi \in \real^{n\times p}$ such be larger than the rank $k$ , e.g.. $p = 2k$ . The solution to (4) is

$\hat{W} = (\Phi^\top Y)^\dagger (\Phi^\top B)$

and the low-rank approximation is

$\hat{B} = Y(\Phi^\top Y)^\dagger (\Phi^\top B) = (B\Omega) (\Phi^\top B\Omega)^\dagger (\Phi^\top B).$

This type of approximation is called a generalized Nyström approximation, and it can be computed in a single pass over $B$ . (Namely, one should acquire—in the same pass—the products $B\Omega$ and $\Phi^\top B$ .) The generalized Nyström approximation also satisfies our other desired properties, applying to general, rectangular matrix and, as we will see, achieving Frobenius-norm approximation error.

History

In the modern randomized linear algebra literature, the generalized Nyström approximation appears to have been concurrently discovered by Woolfe, Liberty, Rokhlin, & Tygert (2008) and Clarkson & Woodruff (2009). An algebraically equivalent but more numerically stable version of the generalized Nyström approximation was developed by Tropp, Yurtsever, Udell, & Cevher (2017). Nakatsukasa (2020) re-examined the low-rank approximation format, developed a different class of numerical stable implementations, and suggested the name generalized Nyström approximation. Alex Townsend and Per-Gunnar Martinsson trace the origins of this low-rank approximation format far earlier back to the works of Wedderburn (1934).

Implementation

This post is concerned with the generalized Nyström approximation as a type of low-rank approximation format. To use generalized Nyström approximation in practice, one must use an appropriate algorithm which computes the decomposition in a stable way.

Perhaps the simplest algorithm for computing a generalized Nyström approximation was studied by Nakatsukasa (2020). One begins by computing the matrices

$Y = B\Omega, \quad Z = B^\top \Omega, \quad C = \Phi^\top Y.$

Then, to represent the generalized Nyström approximation $\hat{B}$ as a factored matrix, one takes a QR decomposition $C = QR$ and defines $F = YR^{-1}$ and $G = ZQ$ . The generalized Nyström approximation has been computed in factored form: $\hat{B} = FG^\top$ . In cases where the core matrix $C$ is rank-deficient up to machine precision, the numerical stability of this procedure can sometimes be aided by using a truncated SVD or column-pivoted QR decomposition of $C$ ; see Nakatsukasa’s paper for details. An alternate implementation which outputs $\hat{B}$ as a compact SVD was developed by Tropp, Yurtsever, Udell, and Cevher (2017).

Relationship to Other Formats

As the name suggests, the generalized Nyström approximation format generalizes the Nyström approximation beyond psd matrices. Indeed, the standard Nyström approximation

$\hat{A} = (A\Omega) (\Omega^\top A \Omega)^\dagger (\Omega^\top A)$

is precisely the generalized Nyström approximation of $A$ with a test matrix $\Omega = \Phi$ .

Perhaps less obviously, the generalized Nyström approximation also generalizes the randomized SVD approximation. Indeed, the randomized SVD approximation $\hat{B} = (B\Omega)(B\Omega)^\dagger B$ is the generalized Nyström approximation with trivial right test matrix $\Phi = I$ .

Generalized Nyström Approximation = Sketch-and-Solve + Randomized SVD

What is the generalized Nyström approximation? There are several interpretations. For instance, if $p = k$ and $\Phi^\top B \Omega$ is invertible, the generalized Nyström approximation is the unique approximation satisfying the interpolatory condition

$\hat{B}\Omega = B\Omega \quad \text{and} \quad \hat{B}^\top\Phi = B^\top\Phi.$

Notwithstanding the validity and usefulness of other interpretations, my view is that the most useful interpretation of generalized Nyström approximation is the one we started with:

Generalized Nyström approximation is a sketched version of the randomized SVD approximation.

To see this insight in action, we will use it to analyze the generalized Nyström approximation with Gaussian test matrices. Let $\Omega \in \real^{n\times k}$ and $\Phi \in \real^{n\times p}$ be populated with independent standard Gaussian random entries, and as we have been, assume $p\ge k$ . Let us analyze the expected (squared) Frobenius-norm error of the generalized Nyström approximation.

We will use the following result for sketching with a Gaussian embedding, due to Bartan & Pilanci (2020) and discussed in this previous blog post.

Theorem 1 (sketch-and-solve): Consider a (matrix) least-squares problem
$W = Y^\dagger B= \argmin_W \norm{B - YW}_{\rm F}$
with dimensions $Y \in \real^{n \times k}$ and $B \in \real^{n\times n}$ . Let $\Phi \in \real^{n\times p}$ be a (standard) Gaussian test matrix, and instate the sketch-and-solve solution
$\hat{W} = (\Phi^\top Y)^\dagger \Phi^\top B = \argmin_W \norm{\Phi^\top B - (\Phi^\top Y)W}_{\rm F}.$
Then
$\expect \norm{B - Y\hat{W}}_{\rm F}^2 = \expect \norm{B - Y(\Phi^\top Y)^\dagger\Phi^\top B}_{\rm F}^2 = \left( 1 + \frac{k}{p - (k+1)} \right)\norm{B - YY^\dagger B}_{\rm F}^2.$

We can apply this result to the generalized Nyström approximation by setting $Y = B\Omega$ . Let $\expect_{\Phi}$ denote the expectation over $\Phi$ alone. Then

$\expect_\Phi \norm{B - B\Omega(\Phi^\top B\Omega)^\dagger\Phi^\top B}_{\rm F}^2 = \left( 1 + \frac{k}{p - (k + 1)} \right)\norm{B - (B\Omega)(B\Omega)^\dagger B}_{\rm F}^2.$

But $(B\Omega)(B\Omega)^\dagger B$ is just the randomized SVD approximation. Invoking the randomized SVD bound (1) yields

$\begin{align*}\expect \norm{B - B\Omega(\Phi^\top B\Omega)^\dagger\Phi^\top B}_{\rm F}^2 &= \expect_\Omega\left[\expect_\Phi \norm{B - B\Omega(\Phi^\top B\Omega)^\dagger\Phi^\top B}_{\rm F}^2\right] \\&= \left( 1 + \frac{k}{p - (k + 1)} \right)\expect_\Omega \norm{B - (B\Omega)(B\Omega)^\dagger B}_{\rm F}^2 \\&\le \left( 1 + \frac{k}{p - (k + 1)} \right)\left[\min_{r < k-1}\left( 1 + \frac{r}{k - (r + 1)} \right) \norm{B - \lowrank{B}_r}_{\rm F}^2\right].\end{align*}$

Voilà! We have obtained explicit bounds for the generalized Nyström method with little effort. We record this result as a theorem:

Theorem 2 (generalized Nyström approximation): With the present setting, it holds that
$\expect \norm{B - B\Omega(\Phi^\top B\Omega)^\dagger\Phi^\top B}_{\rm F}^2 \le \left( 1 + \frac{k}{p - (k + 1)} \right)\left[\min_{r < k-1}\left( 1 + \frac{r}{k - (r + 1)} \right) \norm{B - \lowrank{B}_r}_{\rm F}^2\right].$

This result is Theorem 4.3 in this paper of Tropp, Yurtsever, Udell, & Cevher (2017). A slight refinement of this bound appears in my paper with Robert Webber, and we show that our new bound is sharp on hard examples. Thus, Theorem 2 is nearly the best possible error bound for the generalized Nyström approximation.

Choice of Random Matrix

For most of this post, we have focused on the cases where the random test matrices $\Omega$ and $\Phi$ are unstructured matrices with Gaussian random entries. But can we use more structured random test matrices? Say, sparse test matrices? Do these lead to faster low-rank approximation algorithms?

For the randomized SVD, the results are disappointing. Computing $Y = B\Omega$ with a sparse random matrix is fast. But then we compute $Q = \operatorname{orth}(Y)$ and compute $C = Q^\top B$ ; the matrix $Q$ is dense and unstructured, so computing $C = Q^\top B$ is slower and all benefits of the sparse test matrix have been erased.

The issue with the randomized SVD is that it’s a two-pass algorithm: The first pass, computing $B\Omega$ , can be done using a sparse random test matrix. But the second pass $Q^\top B$ requires a matrix product with a dense matrix $Q$ .

The situation is much better for the generalized Nyström approximation, which requires only a single pass and can be implemented only by multiplying $B$ against sparse matrices. Indeed, generating $\Omega$ and $\Phi$ to be sparse matrices, the generalized Nyström approximation can be written

$\hat{B} = Y(\Phi^* Y)^\dagger Z \quad \text{for } Y = B\Omega \text{ and } Z = \Phi^\top B.$

We see that the only interaction we need with the matrix $B$ has been isolated into matrix products with the random test matrices, and we obtain speedups by replacing using sparse random test matrices for $\Omega$ and $\Phi$ .

Structured test matrices, like sparse ones, can be very powerful. But basic theoretical questions remain about their properties. We tackle these theoretical questions in my new paper (joint with Chris Camaño, Raphael Meyer, and Joel Tropp), and we provide experiments demonstrating how structured sketching matrices can lead to large speedups in generalized Nyström approximation and other linear algebra tasks. I think it’s a really neat paper, and my wonderful collaborator Chris did some really beautiful experiments for it. I hope you’ll check it out!