Low-Rank Approximation Toolbox: Nyström Approximation

Welcome to a new series for this blog, Low-Rank Approximation Toolbox. As I discussed in a previous post, many matrices we encounter in applications are well-approximated by a matrix with a small rank. Efficiently computing low-rank approximations has been a major area of research, with applications in everything from classical problems in computational physics and signal processing to trendy topics like data science. In this series, I want to explore some broadly useful algorithms and theoretical techniques in the field of low-rank approximation.

I want to begin this series by talking about one of the fundamental types of low-rank approximation, the Nyström approximation of an N\times N (real symmetric or complex Hermitian) positive semidefinite (psd) matrix A. Given an arbitrary N\times k “test matrix” \Omega, the Nyström approximation is defined to be

(1)   \[A\langle \Omega\rangle := A\Omega \, (\Omega^*A\Omega)^{-1} \, \Omega^*A. \]

This formula is sensible whenever \Omega^*A\Omega is invertible; if \Omega^*A\Omega is not invertible, then the inverse {}^{-1} should be replaced by the Moore–Penrose pseudoinverse {}^\dagger. For simplicity, I will assume that \Omega^* A \Omega is invertible in this post, though everything we discuss will continue to work if this assumption is dropped. I use {}^* to denote the conjugate transpose of a matrix, which agrees with the ordinary transpose {}^\top for real matrices. I will use the word self-adjoint to refer to a matrix which satisfies A=A^*.

The Nyström approximation (1) answers the question

What is the “best” rank-k approximation to the psd matrix A provided only with the matrix–matrix product A\Omega, where \Omega is a known N\times k matrix (k\ll N)?

Indeed, if we let Y = A\Omega, we observe that the Nyström approximation can be written entirely using Y and \Omega:

    \[A\langle \Omega\rangle = Y \, (\Omega^* Y)^{-1}\, Y^*.\]

This is the central advantage of the Nyström approximation: to compute it, the only access to the matrix A I need is the ability to multiply the matrices A and \Omega. In particular, I only need a single pass over the entries of A to compute the Nyström approximation. This allows the Nyström approximation to be used in settings where other low-rank approximations wouldn’t work, such as when A is streamed to me as a sum of matrices that must be processed as they arrive and then discarded.
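
To make this concrete, here is a minimal NumPy sketch of forming the Nyström approximation from Y = A\Omega and \Omega alone. The helper name nystrom_from_sketch and the example matrix are my own choices for illustration, not part of any particular library.

    import numpy as np

    def nystrom_from_sketch(Y, Omega):
        # Nystrom approximation Y (Omega^* Y)^{-1} Y^*, given Y = A @ Omega
        C = Omega.conj().T @ Y          # equals Omega^* A Omega, assumed invertible
        C = (C + C.conj().T) / 2        # symmetrize to guard against round-off
        return Y @ np.linalg.solve(C, Y.conj().T)

    # Example: a single pass over a psd matrix A with a Gaussian test matrix
    rng = np.random.default_rng(0)
    N, k = 500, 20
    U, _ = np.linalg.qr(rng.standard_normal((N, N)))
    A = (U * 2.0 ** -np.arange(N)) @ U.T    # psd matrix with decaying eigenvalues
    Omega = rng.standard_normal((N, k))     # test matrix
    Y = A @ Omega                           # the only access to A we need
    A_nys = nystrom_from_sketch(Y, Omega)
    print(np.linalg.norm(A - A_nys) / np.linalg.norm(A))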

Choosing the Test Matrix

Every choice of N\times k test matrix \Omega defines a rank-k Nyström approximation A\langle \Omega\rangle by (1). (Actually, A\langle \Omega\rangle only has rank at most k; for this post, we will use rank-k to mean “rank at most k”.) Unfortunately, the Nyström approximation won’t be a good low-rank approximation for every choice of \Omega. For an example of what can go wrong, if we pick \Omega to have columns selected from the eigenvectors of A with small eigenvalues, the approximation A\langle \Omega\rangle will be quite poor.

The very best choice of \Omega would be a matrix whose columns are the k eigenvectors of A associated with its k largest eigenvalues. Unfortunately, computing these eigenvectors to high accuracy is computationally costly. Fortunately, we can get decent low-rank approximations out of much simpler \Omega‘s:

  1. Random: Perhaps surprisingly, we get a fairly good low-rank approximation out of just choosing \Omega to be a random matrix, say, one populated with statistically independent standard normal random entries. Intuitively, a random matrix is likely to have columns with meaningful overlap with the large-eigenvalue eigenvectors of A (and indeed with any k fixed orthonormal vectors). One can also pick more exotic kinds of random matrices, which can have computational benefits.
  2. Random then improve: The more similar the test matrix \Omega is to the large-eigenvalue eigenvectors of A, the better the low-rank approximation will be. Therefore, it makes sense to use the power method (usually called subspace iteration in this context) to improve a random initial test matrix \Omega_{\rm init} so that it is closer to the dominant eigenvectors of A; see the sketch after this list. (Even better than subspace iteration is block Krylov iteration; see section 11.6 of the following survey for details.)
  3. Column selection: If \Omega consists of columns i_1,i_2,\ldots,i_k of the identity matrix, then A\Omega just consists of columns i_1,\ldots,i_k of A. In MATLAB notation,

        \[A(:,[i_1,\ldots,i_k]) = A\Omega \quad \text{for}\quad \Omega = I(:,[i_1,i_2,\ldots,i_k]).\]

    This is highly appealing as it allows us to approximate the matrix A by only reading a small fraction of its entries (provided k\ll N)! Producing a good low-rank approximation requires selecting the right column indices i_1,\ldots,i_k (usually under the constraint of reading a small number of entries from A). In my research with Yifan Chen, Joel A. Tropp, and Robert J. Webber, I’ve argued that the most well-rounded algorithm for this task is a randomly pivoted partial Cholesky decomposition.
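
For concreteness, here is a rough NumPy sketch of option 2: a Gaussian test matrix improved by a few steps of subspace iteration. The function name and the default number of steps are my own choices for illustration.

    import numpy as np

    def subspace_iteration_test_matrix(A, k, q=2, rng=None):
        # Draw a Gaussian test matrix and improve it with q steps of subspace
        # iteration; re-orthonormalizing after each multiply keeps the columns
        # well-conditioned. Each step costs one additional pass over A.
        rng = np.random.default_rng() if rng is None else rng
        Omega = rng.standard_normal((A.shape[0], k))
        for _ in range(q):
            Omega, _ = np.linalg.qr(A @ Omega)
        return Omega

    # Option 3 (column selection) corresponds to reading k columns of A directly:
    # for Omega = I[:, [i_1, ..., i_k]], the product A @ Omega is just A[:, [i_1, ..., i_k]].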

The Projection Formula

Now that we’ve discussed the choice of test matrix, we shall explore the quality of the Nyström approximation as measured by the size of the residual A - A\langle \Omega\rangle. As a first step, we shall show that the residual is psd. This means that A\langle \Omega\rangle is an underapproximation to A.

The positive semidefiniteness of the residual follows from the following projection formula for the Nyström approximation:

    \[A\langle \Omega \rangle = A^{1/2} P_{A^{1/2}\Omega} A^{1/2}.\]

Here, P_{A^{1/2}\Omega} denotes the orthogonal projection onto the column space of the matrix A^{1/2}\Omega. To deduce the projection formula, we break down A as A = A^{1/2}\cdot A^{1/2} in (1):

    \[A\langle \Omega\rangle = A^{1/2} \left( A^{1/2}\Omega \left[ (A^{1/2}\Omega)^* A^{1/2}\Omega \right]^{-1} (A^{1/2}\Omega)^* \right) A^{1/2}.\]

The fact that the parenthesized quantity is P_{A^{1/2}\Omega} can be verified in a variety of ways, such as by QR factorization. (Let A^{1/2} \Omega = QR, where Q has orthonormal columns and R is square, upper triangular, and invertible since \Omega^*A\Omega is. The orthogonal projection is P_{A^{1/2}\Omega} = QQ^*, and the parenthesized expression is (QR)(R^*Q^*QR)^{-1}R^*Q^* = QRR^{-1}R^{-*}R^*Q^* = QQ^* = P_{A^{1/2}\Omega}.)
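
If you would like to see the projection formula in action, here is a small NumPy check; the specific random matrices are only for illustration, and A^{1/2} is computed from an eigendecomposition.

    import numpy as np

    rng = np.random.default_rng(1)
    N, k = 50, 5
    G = rng.standard_normal((N, N))
    A = G @ G.T                         # a psd matrix
    Omega = rng.standard_normal((N, k))

    # Nystrom approximation straight from the definition (1)
    nys = A @ Omega @ np.linalg.solve(Omega.T @ A @ Omega, Omega.T @ A)

    # Projection formula: A^{1/2} P_{A^{1/2} Omega} A^{1/2}
    w, V = np.linalg.eigh(A)
    A_half = (V * np.sqrt(np.clip(w, 0, None))) @ V.T
    Q, _ = np.linalg.qr(A_half @ Omega)   # orthonormal basis for range(A^{1/2} Omega)
    proj = A_half @ (Q @ Q.T) @ A_half

    print(np.linalg.norm(nys - proj) / np.linalg.norm(nys))   # tiny, up to round-off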

With the projection formula in hand, we easily obtain the following expression for the residual:

    \[A - A\langle \Omega\rangle = A^{1/2} (I - P_{A^{1/2}\Omega}) A^{1/2}.\]

To show that this residual is psd, we make use of the conjugation rule.

Conjugation rule: For a matrix B and a self-adjoint matrix H, if H is psd then B^*HB is psd. If B is invertible, then the converse holds: if B^*HB is psd, then H is psd.
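
For completeness, here is the short argument behind the conjugation rule: the forward direction follows by testing against an arbitrary vector x, and the converse follows by applying the forward direction with B^{-1} in place of B:

    \[x^*(B^*HB)x = (Bx)^*H(Bx) \ge 0 \quad \text{for every vector } x, \qquad H = (B^{-1})^*(B^*HB)B^{-1}.\]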

The matrix I - P_{A^{1/2}\Omega} is an orthogonal projection and therefore psd. Thus, by the conjugation rule, the residual of the Nyström approximation is psd:

    \[A - A\langle \Omega\rangle = \left(A^{1/2}\right)^* (I-P_{A^{1/2}\Omega})A^{1/2} \quad \text{is psd}.\]

Optimality of the Nyström Approximation

There’s a question we’ve been putting off that can’t be deferred any longer:

Is the Nyström approximation actually a good low-rank approximation?

As we discussed earlier, the answer to this question depends on the test matrix \Omega. Different choices for \Omega give different approximation errors. See the following papers for Nyström approximation error bounds with different choices of \Omega. While the Nyström approximation can be better or worse depending on the choice of \Omega, the following is true for every choice of \Omega:

The Nyström approximation A\langle \Omega\rangle is the best possible rank-k approximation to A given the information A\Omega.

In precise terms, I mean the following:

Theorem: Out of all self-adjoint matrices \hat{A} spanned by the columns of A\Omega with a psd residual A - \hat{A}, the Nyström approximation has the smallest error as measured by either the spectral or Frobenius norm (or indeed any unitarily invariant norm, see below).

Let’s break this statement down a bit. This result states that the Nyström approximation is the best approximation \hat{A} to A under three conditions:

  1. \hat{A} is self-adjoint.
  2. \hat{A} is spanned by the columns of A\Omega.

I find these first two requirements to be natural. Since A is self-adjoint, it makes sense to require our approximation \hat{A} to be as well. The stipulation that \hat{A} is spanned by the columns of A\Omega seems like a very natural requirement given that we want to consider approximations which only use the information A\Omega. Additionally, requirement 2 ensures that \hat{A} has rank at most k, so we are really only considering low-rank approximations to A.

The last requirement is less natural:

  3. The residual A - \hat{A} is psd.

This is not an obvious requirement to impose on our approximation. Indeed, it was a nontrivial calculation using the projection formula to show that the Nyström approximation itself satisfies this requirement! Nevertheless, this third stipulation is required to make the theorem true. The Nyström approximation (1) is the best “underapproximation” to the matrix A in the span of A\Omega.

Intermezzo: Unitarily Invariant Norms and the Psd Order

To prove our theorem about the optimality of the Nyström approximation, we shall need two ideas from matrix theory: unitarily invariant norms and the psd order. We shall briefly describe each in turn.

A norm \left\|\cdot\right\|_{\rm UI} defined on the set of N\times N matrices is said to be unitarily invariant if the norm of a matrix does not change upon left- or right-multiplication by a unitary matrix:

    \[\left\|UBV\right\|_{\rm UI} = \left\|B\right\|_{\rm UI} \quad \text{for all unitary matrices $U$ and $V$.}\]

Recall that a unitary matrix U (called a real orthogonal matrix if U is real-valued) is one that obeys U^*U = UU^* = I. Unitary matrices preserve the Euclidean lengths of vectors, which makes the class of unitarily invariant norms highly natural. Important examples include the spectral, Frobenius, and nuclear matrix norms.

Every unitarily invariant norm of a matrix B depends entirely on its singular values \sigma_1(B) \ge \sigma_2(B) \ge \cdots \ge \sigma_N(B). For instance, the spectral, Frobenius, and nuclear norms take the forms

    \begin{align*}
    \left\|B\right\|_{\rm op} &= \sigma_1(B), & &\text{(spectral)} \\
    \left\|B\right\|_{\rm F} &= \sqrt{\sum_{j=1}^N (\sigma_j(B))^2}, & &\text{(Frobenius)} \\
    \left\|B\right\|_{*} &= \sum_{j=1}^N \sigma_j(B). & &\text{(nuclear)}
    \end{align*}

In addition to being entirely determined by the singular values, unitarily invariant norms are non-decreasing functions of the singular values: If the jth singular value of B is larger than the jth singular value of C for 1\le j\le N, then \left\|B\right\|_{\rm UI}\ge \left\|C\right\|_{\rm UI} for every unitarily invariant norm \left\|\cdot\right\|_{\rm UI}. For more on unitarily invariant norms, see this short and information-packed blog post from Nick Higham.
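
As a quick sanity check, the NumPy snippet below (with arbitrary example matrices of my own choosing) confirms that these three norms are computed from the singular values by the stated formulas and are unchanged by unitary (here, real orthogonal) multiplications.

    import numpy as np

    rng = np.random.default_rng(2)
    B = rng.standard_normal((6, 6))
    sigma = np.linalg.svd(B, compute_uv=False)

    print(np.isclose(np.linalg.norm(B, 2), sigma[0]))                         # spectral
    print(np.isclose(np.linalg.norm(B, 'fro'), np.sqrt(np.sum(sigma ** 2))))  # Frobenius
    print(np.isclose(np.linalg.norm(B, 'nuc'), np.sum(sigma)))                # nuclear

    # Unitary invariance: left- and right-multiply by orthogonal matrices
    U, _ = np.linalg.qr(rng.standard_normal((6, 6)))
    V, _ = np.linalg.qr(rng.standard_normal((6, 6)))
    print(np.isclose(np.linalg.norm(U @ B @ V, 'nuc'), np.linalg.norm(B, 'nuc')))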

Our second ingredient is the psd order (also known as the Loewner order). A self-adjoint matrix A is larger than a self-adjoint matrix H according to the psd order, written A\succeq H, if the difference A-H is psd. As a consequence, A\succeq 0 if and only if A is psd, where 0 here denotes the zero matrix of the same size as A. Using the psd order, the positive semidefiniteness of the Nyström residual can be written as A - A\langle \Omega\rangle \succeq 0.

If A and H are both psd matrices and A is larger than H in the psd order, A\succeq H\succeq 0, it seems natural to expect that A is larger than H in norm. Indeed, this intuitive statement is true, at least when one restricts oneself to unitarily invariant norms.

Psd order and norms. If A\succeq H\succeq 0, then \left\|A\right\|_{\rm UI} \ge \left\|H\right\|_{\rm UI} for every unitarily invariant norm \left\|\cdot\right\|_{\rm UI}.

This fact is a consequence of the following observations:

  • If A\succeq H, then the eigenvalues of A are larger than those of H, in the sense that the jth largest eigenvalue of A is at least as large as the jth largest eigenvalue of H.
  • The singular values of a psd matrix are its eigenvalues.
  • Unitarily invariant norms are non-decreasing functions of the singular values.
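
The first bullet is Weyl’s monotonicity principle. One way to see it is through the Courant–Fischer characterization of eigenvalues: since x^*Hx \le x^*Ax for every vector x,

    \[\lambda_j(H) = \max_{\dim \mathcal{S} = j} \; \min_{\substack{x\in \mathcal{S} \\ \|x\|=1}} x^*Hx \le \max_{\dim \mathcal{S} = j} \; \min_{\substack{x\in \mathcal{S} \\ \|x\|=1}} x^*Ax = \lambda_j(A).\]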

Optimality of the Nyström Approximation: Proof

In this section, we’ll prove our theorem showing the Nyström approximation is the best low-rank approximation satisfying properties 1, 2, and 3. To this end, let \hat{A} be any matrix satisfying properties 1, 2, and 3. Because of properties 1 (self-adjointness) and 2 (spanned by columns of A\Omega), \hat{A} can be written in the form

    \[\hat{A} = A\Omega \, T \, (A\Omega)^* = A \Omega \, T \, \Omega^*A,\]

where T is a self-adjoint matrix. To make this more similar to the projection formula, we can pull out a factor of A^{1/2} on both sides to obtain

    \[\hat{A} = A^{1/2} (A^{1/2}\Omega\, T\, \Omega^*A^{1/2}) A^{1/2}.\]

To bring this even closer to the projection formula, we reparametrize by introducing a matrix Q with orthonormal columns whose column space is the same as that of A^{1/2}\Omega. Under this reparametrization, \hat{A} takes the form

    \[\hat{A} = A^{1/2} \,QMQ^*\, A^{1/2} \quad \text{where} \quad M\text{ is self-adjoint}.\]
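
Explicitly, if we write the QR factorization A^{1/2}\Omega = QR (where R is invertible because \Omega^*A\Omega is), we can take

    \[M := R\,T\,R^*, \quad \text{so that} \quad A^{1/2}\Omega\, T\, \Omega^*A^{1/2} = QR\,T\,R^*Q^* = QMQ^*.\]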

The residual of this approximation is

(2)   \[A - \hat{A} = A^{1/2} (I - QMQ^*)A^{1/2}. \]

We now make use of the conjugation rule again. To simplify things, we make the assumption that A is invertible. (As an exercise, see if you can adapt this argument to the case when this assumption doesn’t hold!) Since A - \hat{A}\succeq 0 is psd (property 3), the conjugation rule tells us that

    \[I - QMQ^*\succeq 0.\]

What does this observation tell us about M? We can apply the conjugation rule again to conclude

    \[Q^*(I - QMQ^*)Q = Q^*Q - (Q^*Q)M(Q^*Q) = I-M \succeq 0.\]

(Notice that Q^*Q = I since Q has orthonormal columns.)

We are now in a position to show that A - \hat{A}\succeq A - A\langle \Omega\rangle. Indeed,

    \begin{align*}
    A - \hat{A} - (A-A\langle \Omega\rangle) &= A\langle\Omega\rangle - \hat{A} \\
    &= A^{1/2}\underbrace{QQ^*}_{=P_{A^{1/2}\Omega}}A^{1/2} - A^{1/2}QMQ^*A^{1/2} \\
    &= A^{1/2}Q(I-M)Q^*A^{1/2} \\
    &\succeq 0.
    \end{align*}

The second line is the projection formula together with the observation that P_{A^{1/2}\Omega} = QQ^*, and the last line is the conjugation rule combined with the fact that I-M is psd. Thus, having shown

    \[A - \hat{A} \succeq A - A\langle\Omega\rangle \succeq 0,\]

we conclude

    \[\|A - \hat{A}\|_{\rm UI} \ge \left\|A - A\langle \Omega\rangle\right\|_{\rm UI} \quad \text{for every unitarily invariant norm $\left\|\cdot\right\|_{\rm UI}$}.\]
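
As a final sanity check, the theorem can be observed numerically: any other approximation satisfying properties 1–3, for instance a Nyström approximation built from a test matrix of the form \Omega S (whose range lies inside that of A\Omega), can only have a larger error. A rough NumPy sketch, with matrices chosen purely for illustration:

    import numpy as np

    def nystrom(A, Omega):
        # Dense Nystrom approximation, assuming Omega^* A Omega is invertible
        Y = A @ Omega
        return Y @ np.linalg.solve(Omega.T @ Y, Y.T)

    rng = np.random.default_rng(3)
    N, k = 200, 10
    G = rng.standard_normal((N, N))
    A = G @ G.T
    Omega = rng.standard_normal((N, k))

    best = nystrom(A, Omega)
    # A competitor satisfying properties 1-3: it is self-adjoint, spanned by the
    # columns of A @ Omega @ S (hence of A @ Omega), and has a psd residual.
    S = rng.standard_normal((k, k - 3))
    other = nystrom(A, Omega @ S)

    print(np.linalg.norm(A - best, 'fro') <= np.linalg.norm(A - other, 'fro'))   # True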
