Low-Rank Approximation Toolbox: Randomized SVD

The randomized SVD and its relatives are workhorse algorithms for low-rank approximation. In this post, I hope to provide a fairly brief introduction to this useful method. In the following post, we’ll look into its analysis.

The randomized SVD is dead-simple to state: To approximate an $m\times n$ matrix $B$ by a rank- $k$ matrix, perform the following steps:

Collect information. Form $Y = B\Omega$ where $\Omega$ is an appropriately chosen $n\times k$ random matrix.
Orthogonalize. Compute an orthonormal basis $Q$ for the column space of $Y$ by, e.g., thin QR factorization.
Project. Form $C \coloneqq Q^*B$ . (Here, ${}^*$ denotes the conjugate transpose of a complex matrix, which reduces to the ordinary transpose if the matrix is real.)

If all one cares about is a low-rank approximation to the matrix $B$ , one can stop the randomized SVD here, having obtained the approximation

$B \approx X \coloneqq Q C.$

As the name “randomized SVD” suggests, one can easily “upgrade” the approximation $X = QC$ to a compact SVD format:

SVD. Compute a compact SVD of the matrix $C$ : $C = \hat{U}\Sigma V^*$ where $\hat{U}$ and $\Sigma$ and $k\times k$ and $V$ is $n\times k$ .
Clean up. Set $U \coloneqq Q\hat{U}$ .

We now have approximated $B$ by a rank- $k$ matrix $X$ in compact SVD form:

$B \approx X = QC = U\Sigma V^*.$

One can use the factors $U$ , $\Sigma$ , and $V$ of the approximation $X \approx B$ as estimates of the singular vectors and values of the matrix $B$ .

What Can You Do with the Randomized SVD?

The randomized SVD has many uses. Pretty much anywhere you would ordinarily use an SVD is a candidate for the randomized SVD. Applications include model reduction, fast direct solvers, computational biology, uncertainty quantification, among numerous others.

To speak in broad generalities, there are two ways to use the randomized SVD:

As an approximation. First, we could use $X$ as a compressed approximation to $B$ . Using the format $X = QC$ , $X$ requires just $(m+n)k$ numbers of storage, whereas $B$ requires a much larger $mn$ numbers of storage. As I detailed at length in my blog post on low-rank matrices, many operations are cheap for the low-rank matrix $X$ . For instance, we can compute a matrix–vector product $Xz$ in roughly $(m+n)k$ operations rather than the $mn$ operations to compute $Bz$ . For these use cases, we don’t need the “SVD” part of the randomized SVD.
As an approximate SVD. Second, we can actually use the $U$ , $\Sigma$ , and $V$ produced by the randomized SVD as approximations to the singular vectors and values of $B$ . In these applications, we should be careful. Just because $X$ is a good approximation to $B$ , it is not necessarily the case that $U$ , $\Sigma$ , and $V$ are close to the singular vectors and values of $B$ . To use the randomized SVD in this context, it is safest to use posterior diagnostics (such as the ones developed in this paper of mine) to ensure that the singular values/vectors of interest are computed to a high enough accuracy. A useful rule of thumb is the smallest two to five singular values and vectors computed by the randomized SVD are suspect and should be used in applications only with extreme caution. When the appropriate care is taken and for certain problems, the randomized SVD can provide accurate singular vectors far faster than direct methods.

Accuracy of the Randomized SVD

How accurate is the approximation $X$ produced by the randomized SVD? No rank- $k$ approximation can be more accurate than the best rank- $k$ approximation $\lowrank{B}_k$ to the matrix $B$ , furnished by an exact SVD of $B$ truncated to rank $k$ . Thus, we’re interested in how much larger $B - X$ is than $B - \lowrank{B}_k$ .

Frobenius Norm Error

A representative result describing the error of the randomized SVD is the following:

(1) $\mathbb{E} \left\|B - X\right\|_{\rm F}^2 \le \min_{r \le k-2} \left( 1 + \frac{r}{k-(r+1)} \right) \left\|B - \lowrank{B}_r\right\|^2_{\rm F}.$

This result states that the average squared Frobenius-norm error of the randomized SVD is comparable with the best rank- $r$ approximation error for any $r\le k-2$ . In particular, choosing $k = r+2$ , we see that the randomized SVD error is at most $(1+r)$ times the best rank- $r$ approximation

(2) $\mathbb{E} \left\|B - X\right\|_{\rm F}^2 \le (1+r) \left\|B - \lowrank{B}_r \right\|^2_{\rm F} \quad \text{for $k=r+2$}.$

Choosing $k = 2r+1$ , we see that the randomized SVD has at most twice the error of the best rank- $r$ approximation

(3) $\mathbb{E} \left\|B - X\right\|_{\rm F}^2 \le 2 \left\|B - \lowrank{B}_r\right\|^2_{\rm F} \quad \text{for $k=2r+1$}.$

In effect, these results tell us that if we want an approximation that is nearly as good as the best rank- $r$ approximation, using an approximation rank of, say, $k = r+2$ or $k = 2r$ for the randomized SVD suffices. These results hold even for worst-case matrices. For nice matrices with steadily decaying singular values, the randomized SVD can perform even better than equations (2)–(3) would suggest.

Spectral Norm Error

The spectral norm is often a better error metric for matrix computations than the Frobenius norm. Is the randomized SVD also comparable with the best rank- $k$ approximation when we use the spectral norm? In this setting, a representative error bound is

(4) $\mathbb{E} \left\|B - X\right\|^2 \le \min_{r \le k-2} \left( 1 + \frac{2r}{k-(r+1)} \right) \left(\left\|B - \lowrank{B}_r \right\|^2 + \frac{\mathrm{e}^2}{k-r} \left\|B - \lowrank{B}_r\right\|^2_{\rm F} \right).$

The spectral norm error of the randomized SVD depends on the Frobenius norm error of the best rank- $r$ approximation for $r \le k-2$ .

Recall that the spectral and Frobenius norm can be defined in terms of the singular values of the matrix $B$ :

$\norm{B} = \sigma_1(B),\quad \left\|B\right\|_{\rm F}^2 = \sum_i \sigma_i(B)^2.$

Rewriting (4) using these formulas, we get

$\mathbb{E} \left\|B - X\right\|^2 \le \min_{r \le k-2} \left( 1 + \frac{2r}{k-(r+1)} \right) \left(\sigma_{r+1}^2 + \frac{\mathrm{e}^2}{k-r} \sum_{i>r} \sigma_i^2 \right).$

Despite being interested in the largest singular value of the error matrix $B-X$ , this bound demonstrates that the randomized SVD incurs errors based on the entire tail of $B$ ‘s singular values $\sum_{i>r}\sigma_i^2$ . The randomized SVD is much worse than the best rank- $k$ approximation for a matrix $B$ with a long detail of slowly declining singular values.

Improving the Approximation: Subspace Iteration and Block Krylov Iteration

We saw that the randomized SVD produces nearly optimal low-rank approximations when we measure using the Frobenius norm. When we measure using the spectral norm, we have a split decision: If the singular values decay rapidly, the randomized SVD is near-optimal; if the singular values decay more slowly, the randomized SVD can suffer from large errors.

Fortunately, there are two fixes to improve the randomized SVD in the presence of slowly decaying singular values:

Subspace iteration: To improve the quality of the randomized SVD, we can combine it with the famous power method for eigenvalue computations. Instead of $Y = B\Omega$ , we use the powered expression $Y = B(B^*B)^q\Omega$ . Powering has the effect of amplifying the large singular values of $B$ and dimishing the influence of the tail.
Block Krylov iteration: Subspace iteration is effective, but fairly wasteful. To compute $B(B^*B)^q\Omega$ we compute the block Krylov sequence
$B\Omega, BB^*B\Omega, BB^*BB^*B\Omega,\ldots,B(B^*B)^q\Omega$
and throw away all but the last term. To make more economical use of resources, we can use the entire sequence as our information matrix $Y$ :
$Y = \begin{bmatrix} B\Omega & BB^*B\Omega & BB^*BB^*B\Omega& \cdots & B(B^*B)^q\Omega\end{bmatrix}.$
Storing the entire block Krylov sequence is more memory-intensive but makes more efficient use of matrix products than subspace iteration.

Both subspace iteration and block Krylov iteration should be carefully implemented to produce accurate results.

Both subspace iteration and block Krylov iteration diminish the effect of a slowly decaying tail of singular values on the accuracy of the randomized SVD. The more slowly the tail decays, the larger one must take the number of iterations $q$ to obtain accurate results.