Note to Self: Sketch-and-Solve with a Gaussian Embedding

In the comments to my post Does Sketching Work?, I was asked a very interesting question, which I will paraphrase as follows:

Is the sketch-and-solve least-squares solution \hat{x} = \operatorname*{argmin}_{\hat{x}} \norm{(SA)\hat{x} - Sb} an unbiased estimate of the true least squares solution of x = \operatorname*{argmin}_x \norm{Ax - b}?

In this post, I will answer this question for the special case in which S is a Gaussian sketch, and I will also compute the expected least-squares residual \expect \norm{A\hat{x}-b}^2. Throughout this post, I will assume knowledge of sketching; see my previous two posts for a refresher if needed. For this post, A will have dimensions n\times d and S will have dimensions \ell\times n, and these parameters are related n\ge \ell \ge d.

I thought I knew the answer to this question—that sketch-and-solve suffers from inversion bias. Indeed, the fact I remembered is that

(1)   \[\expect\Big[\big((SA)^\top (SA)\big)^{-1}\Big] \ne \big(A^\top A\big)^{-1}\]

for the sketching matrices S in common use. The issue of the inversion bias (1) is discussed in this paper.

However, the inversion bias (1) does not imply that sketch-and-solve is biased. In fact, for a Gaussian sketch1Note that, typically, we use a sketch S where the variance of the entries is 1/d. The sketch-and-solve solution (SA)^\dagger (Sb) does not change under a scaling of the matrix S. Thus, we are free to take the entries to have variance one.

(2)   \[S \in \real^{\ell\times n} \quad \text{with } S_{ij} \sim \text{Normal}(0,1) \text{ iid},\]

sketch-and-solve is unbiased.

Theorem (Gaussian sketch-and-solve is unbiased). Let S be a Gaussian sketch (2) and let A \in \real^{n\times d} be a full column-rank matrix. Then

(3)   \[\expect[(SA)^\dagger(Sb)] = A^\dagger b. \]

That is, the sketch-and-solve solution (SA)^\dagger(Sb) is an unbiased estimate for the least-squares solution A^\dagger b.

Here, {}^\dagger is the Moore–Penrose pseudoinverse, which effectuates the least-squares solution:

    \[A^\dagger b = \operatorname*{argmin}_{x\in\real^d} \norm{Ax - b}.\]

In particular, by the normal equations, we have the relation

(4)   \[A^\dagger = (A^\top A)^{-1} A^\top,\]

valid for any matrix A with full column-rank.

Let us prove this theorem. The Gaussian sketch (1) is orthogonally invariant: that is, V^\top SU has the same distribution as S for any orthogonal matrices U and V. Thus, we are free to reparametrize. Let

    \[A = U\twobyone{\Sigma}{0}V^\top\]

be a (full) singular value decomposition of A. Make the changes of variables

    \[A \mapsto U^\top A V, \quad S \mapsto V^\top SU, \quad b \mapsto U^\top b.\]

After this change of variables, S still has the same distribution (2) and

    \[A = \twobyone{\Sigma}{0}.\]

Partition the reparametrized b and S conformally with A:

    \[b = \twobyone{b_1}{b_2}, \quad S = \onebytwo{S_1}{S_2}.\]

Under this change of variables, the least-squares solution is

(5)   \[A^\dagger b = \Sigma^{-1}b_1.\]

Now, we are ready to prove the claim (3). Observe that SA = S_1\Sigma. Begin by using the normal equations (4) to write out the sketch-and-solve solution

    \[(SA)^\dagger (Sb) = [(SA)^\top (SA)]^{-1}(SA)^\top (Sb).\]

Observe that SA = S_1\Sigma. Thus,

    \[(SA)^\dagger (Sb) = [\Sigma S_1^\top S_1\Sigma]^{-1}\Sigma S_1^\top \onebytwo{S_1}{S_2}\twobyone{b_1}{b_2}.\]

Simplify to obtain

    \[(SA)^\dagger (Sb) = \Sigma^{-1}(S_1^\top S_1)^{-1} S_1^\top (S_1b_1 + S_2 b_2).\]

Using the normal equations (4) and the solution formula (5), we obtain

(6)   \[(SA)^\dagger (Sb) = \Sigma^{-1}b_1 + \Sigma^{-1} S_1^\dagger S_2b_2 = A^\dagger b + \Sigma^{-1} S_1^\dagger S_2b_2. \]

By the definition (2) of S, S_1 and S_2 are independent and \expect[S_2] = 0. Thus, taking expectations, we obtain

    \[\expect[(SA)^\dagger (Sb)] = A^\dagger b + \Sigma^{-1} \expect[S_1^\dagger] \expect[S_2]b_2 = A^\dagger b.\]

The desired claim (3) is proven.

Note that the solution formula (6) holds for any sketching matrix S. We only use Gaussianity at the last step, where we invoke three properties (1) S is mean zero; (2) S is orthogonally invariant; and (3) conditional on the first d columns of S, the last n-d columns of S are mean-zero.

The Least-Squares Residual for Gaussian Sketch-and-Solve

The solution formula (6) can be used to understand other properties of the Gaussian sketch-and-solve solution. In this section, we use (6) to understand the expected squared residual norm

    \[\expect \norm{A\hat{x} - b}^2,\]

where \hat{x} = (SA)^\dagger (Sb) is the sketch-and-solve solution. We write the expected residual norm as

    \[\expect\norm{A\hat{x} - b}^2 = \expect\norm{\twobyone{b_1 + S_1^\dagger S_2b_2}{0} - \twobyone{b_1}{b_2}}^2 = \norm{b_2}^2 + \expect\norm{S_1^\dagger S_2b_2}^2.\]

Now, use the Gaussian–inverse Gaussian matrix product formula to compute the expectation, obtaining

    \[\expect\norm{A\hat{x} - b}^2 = \left(1+\frac{d}{\ell-d-1}\right)\norm{b_2}^2.\]

Finally, we recognize \norm{b_2} is the optimal least-squares residual \min_x \norm{Ax - b}. Thus, we have shown

    \[\expect\norm{A\hat{x} - b}^2 = \left(1+\frac{d}{\ell-d-1}\right)\min_{x\in\real^d} \norm{Ax-b}^2.\]

Note that this is an exact formula for the expected squared residual for the sketch-and-solve solution! Thus, the sketch-and-solve solution has an expected squared residual that factor 1 + d/(\ell-d-1) larger than the optimal value. In particular,

    \[\expect\norm{A\hat{x} - b}^2 = \left(1+\varepsilon\right)\min_{x\in\real^d} \norm{Ax-b}^2 \quad \text{when } \ell = \frac{d}{\varepsilon} + d + 1.\]

Remark (Existing literature): When I originally wrote this post, I had not seen these results anywhere in the literature. Thanks to Michał Dereziński, who provided me with the following references: the following lecture notes of Mert Pilanci, equation 1 in this recent paper of Michał’s, and this recent survey by Michał and Michael Mahoney. I find it striking that these basic and beautiful results appear to only have been explicitly recorded very recently. Very big thanks also to Rob Webber for providing help in preparing this post.

Update (11/21/24): After this post was initially released, Mert Pilanci provided me with an earlier reference (2020); see also this paper for extensions. He also mentions an even earlier paper (2017) that does similar calculations in a statistical setting, but does not obtain exactly these formulas. See also this paper for exact and asymptomatically exact formulas for sketch-and-solve in the statistical setting.

Leave a Reply

Your email address will not be published. Required fields are marked *