Randomized Kaczmarz: How Should You Sample?

The randomized Kaczmarz (RK) method is an iterative method for solving systems of linear equations:

(1)   \[\text{Find $x$ such that } Ax = b. \]

Throughout this post, the matrix A will have dimensions n\times d. Beginning from an initial iterate x_0 = 0, randomized Kaczmarz works as follows. For t = 0,1,2,\ldots:

  • Sample a random row index i_t with probability \prob \{ i_t = j \} = p_j.
  • Update so that the equation a_{i_t}^\top x = b_{i_t} holds exactly:

        \[x_{t+1} \coloneqq x_t + \frac{b_{i_t} - a_{i_t}^\top x_t}{\norm{a_{i_t}}^2} a_{i_t}.\]

    Throughout this post, a_j^\top denotes the jth row of A.
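In code, the iteration is only a few lines. Here is a minimal NumPy sketch (the function name and arguments are my own, for illustration; probs holds the selection probabilities p_j):

    import numpy as np

    def randomized_kaczmarz(A, b, probs, num_iters, rng=None):
        """Randomized Kaczmarz with row-selection probabilities `probs`, starting from x_0 = 0."""
        rng = np.random.default_rng() if rng is None else rng
        n, d = A.shape
        x = np.zeros(d)
        row_norms_sq = np.sum(A**2, axis=1)                   # ||a_j||^2 for each row
        for _ in range(num_iters):
            i = rng.choice(n, p=probs)                        # sample i_t with P{i_t = j} = p_j
            x += (b[i] - A[i] @ x) / row_norms_sq[i] * A[i]   # project onto the i_t-th equation
        return x

The different variants of RK discussed in this post correspond to different choices of probs.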

What selection probabilities p_j should we use? The answer to this question may depend on whether the system (1) is consistent, i.e., whether it possesses a solution x. For this post, we assume (1) is consistent; see this previous post for a discussion of the inconsistent case.

The classical selection probabilities for randomized Kaczmarz were proposed by Strohmer and Vershynin in their seminal paper:

(2)   \[p_j = \frac{\norm{a_j}^2}{\norm{A}_{\rm F}^2} \quad \text{for } j = 1,2,\ldots,n. \]

Computing these selection probabilities requires a full pass over the matrix, which can be expensive for the largest problems. (This initial pass can sometimes be avoided by using rejection sampling, though.) A computationally appealing alternative is to implement randomized Kaczmarz with uniform selection probabilities

(3)   \[p_j = \frac{1}{n} \quad \text{for } j = 1,2,\ldots,n. \]

Ignoring computational cost, which sampling rule leads to faster convergence: (2) or (3)?

Surprisingly, to me at least, the simpler strategy (3) often works better than (2). Here is a simple example. Define a matrix A \in \real^{20\times 20} with entries A_{ij} = \min(i,j)^2, and choose the right-hand side b\in\real^{20} with standard Gaussian random entries. The convergence of standard RK (sampling rule (2)) and uniform RK (sampling rule (3)) is shown in the plot below. After a million iterations, the difference in final accuracy is dramatic: the final relative error was 0.00012 for uniform RK and 0.67 for standard RK!
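Here is a sketch of how one might reproduce this comparison (my own code, not the script used for the figure below; with a different random seed the exact error values will differ):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 20
    idx = np.arange(1, n + 1)
    A = np.minimum.outer(idx, idx).astype(float) ** 2     # A_ij = min(i, j)^2
    b = rng.standard_normal(n)
    x_star = np.linalg.solve(A, b)                        # exact solution of the square system

    row_norms_sq = np.sum(A**2, axis=1)
    samplers = {
        "standard": row_norms_sq / row_norms_sq.sum(),    # squared row norm probabilities
        "uniform": np.full(n, 1.0 / n),                   # uniform probabilities
    }

    for name, probs in samplers.items():
        x = np.zeros(n)
        for i in rng.choice(n, size=1_000_000, p=probs):  # pre-sample a million row indices
            x += (b[i] - A[i] @ x) / row_norms_sq[i] * A[i]
        print(name, np.linalg.norm(x - x_star) / np.linalg.norm(x_star))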

[Figure: Error of randomized Kaczmarz with squared row norm sampling ("standard") and uniform row sampling on the matrix with entries \min(i,j)^2. Uniform randomized Kaczmarz achieves a significantly smaller final error.]

In fairness, uniform RK does not always outperform standard RK. If we change the matrix entries to A_{ij} = \min(i,j), then the performance of both methods is similar, with both methods ending with a relative error of about 0.07.

[Figure: Error of randomized Kaczmarz with squared row norm sampling ("standard") and uniform row sampling on the matrix with entries \min(i,j). The two methods achieve comparable final errors.]

Another experiment, presented in section 4.1 of Strohmer and Vershynin’s original paper, provides an example where standard RK converges a bit more than twice as fast as uniform RK (called “simple RK” in their paper). Still, taken together, these experiments demonstrate that standard RK (sampling probabilities (2)) is often not dramatically better than uniform RK (sampling probabilities (3)), and uniform RK can be much better than standard RK.

Randomized Kaczmarz Error Bounds

Why does uniform RK often outperform standard RK? To answer this question, let’s look at the error bounds for the RK method.

The classical analysis of standard RK shows that the method is geometrically convergent:

(4)   \[\expect\left[ \norm{x_t - x_\star}^2 \right] \le (1 - \kappa_{\rm dem}(A)^{-2})^t \norm{x_\star}^2. \]

Here,

(5)   \[\kappa_{\rm dem}(A) = \frac{\norm{A}_{\rm F}}{\sigma_{\rm min}(A)} = \sqrt{\sum_i \left(\frac{\sigma_i(A)}{\sigma_{\rm min}(A)}\right)^2} \]

is known as the Demmel condition number and \sigma_i(A) are the singular values of A. Recall that we have assumed the system Ax = b is consistent, possessing a solution x_\star. If there are multiple solutions, we let x_\star denote the solution of minimum norm.

What about uniform RK? Let D_A = \diag ( \norm{a_i}^{-1} : i=1,\ldots,n ) denote the diagonal matrix of inverse row norms, and introduce the row-equilibrated matrix D_A A, obtained from A by rescaling each of its rows to have unit norm.

Uniform RK can then be related to standard RK run on the row-equilibrated matrix:

Fact (uniform sampling and row equilibration): Uniform RK on the system Ax = b produces the same sequence of (random) iterates \hat{x}_t as standard RK applied to the row-equilibrated system (D_A A)x = D_A b.
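To see why this fact holds, note two things. First, every row of D_AA has unit norm, so squared row norm sampling applied to D_AA is exactly uniform sampling. Second, the projection step is unchanged if the selected row and entry (a_j, b_j) are rescaled to (ca_j, cb_j) for any c \ne 0:

    \[x_t + \frac{cb_j - (ca_j)^\top x_t}{\norm{ca_j}^2}\, (ca_j) = x_t + \frac{b_j - a_j^\top x_t}{\norm{a_j}^2}\, a_j.\]

Taking c = 1/\norm{a_j} shows that each uniform RK update for Ax = b coincides with the corresponding standard RK update for (D_AA)x = D_Ab.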

Therefore, by (4), the iterates \hat{x}_t of uniform RK satisfy the bound

(6)   \[\expect\left[ \norm{\hat{x}_t - x_\star}^2 \right] \le (1 - \kappa_{\rm dem}(D_A A)^{-2})^t \norm{x_\star}^2.\]

Thus, at least using the error bounds (4) and (6), whether standard or uniform RK is better depends on which matrix has a smaller Demmel condition number: A or D_A A.
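As a quick numerical check, here is a sketch (my own code) that compares the two Demmel condition numbers for the \min(i,j)^2 example from above:

    import numpy as np

    def demmel(M):
        # kappa_dem(M) = ||M||_F / sigma_min(M)
        return np.linalg.norm(M, "fro") / np.linalg.svd(M, compute_uv=False).min()

    idx = np.arange(1, 21)
    A = np.minimum.outer(idx, idx).astype(float) ** 2
    D_A = np.diag(1.0 / np.linalg.norm(A, axis=1))        # inverse row norms

    print("kappa_dem(A)     =", demmel(A))
    print("kappa_dem(D_A A) =", demmel(D_A @ A))

For this badly row-scaled matrix, we expect the row-equilibrated matrix to have the much smaller Demmel condition number, consistent with the convergence plot above.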

Row Equilibration and the Condition Number

Does row equilibration increase or decrease a matrix’s condition number? What is the optimal way of scaling the rows of a matrix to minimize its condition number? These are classical questions in numerical linear algebra, addressed in a 1969 paper of van der Sluis. These results were later summarized and generalized in Higham’s delightful monograph Accuracy and Stability of Numerical Algorithms. Here, we present answers to these questions using a variant of van der Sluis’ argument.

First, let’s introduce some more concepts and notation. Define the spectral norm condition number

    \[\kappa(A) \coloneqq \frac{\sigma_{\rm max}(A)}{\sigma_{\rm min}(A)}.\]

The spectral norm and Demmel condition numbers are always comparable: \kappa(A) \le \kappa_{\rm dem}(A)\le \sqrt{\min(n,d)}\cdot \kappa(A). Also, let \mathrm{Diag} denote the set of all (nonsingular) diagonal matrices.

Our first result shows us that row equilibration never hurts the Demmel condition number by much. In fact, the row-equilibrated matrix produces a nearly optimal Demmel condition number when compared to any row scaling:

Theorem 1 (Row equilibration is a nearly optimal row scaling). Let A\in\real^{n\times d} be wide (n\le d) and full-rank, and let D_AA denote the row scaling of A to have unit row norms. Then

    \[\kappa_{\rm dem}(D_AA) \le \sqrt{n}\cdot \min_{D \in \mathrm{Diag}} \kappa (DA) \le \sqrt{n}\cdot \min_{D \in \mathrm{Diag}} \kappa_{\rm dem} (DA).\]

By scaling the rows of a square or wide matrix to have unit norm, we bring the Demmel condition number to within a \sqrt{n} factor of the optimal row scaling. In fact, we even bring the Demmel condition number to within a \sqrt{n} factor of the optimal spectral norm condition number for any row scaling.

Since the convergence rate for randomized Kaczmarz is \kappa_{\rm dem}(A)^{-2}, this result shows that implementing randomized Kaczmarz with uniform sampling yields a convergence rate that is within a factor of n of the optimal convergence rate using any possible sampling distribution.

This result shows us that row equilibration can’t hurt the Demmel condition number by much. But can it help? The following proposition shows that it can help a lot for some problems.

Proposition 2 (Row equilibration can help a lot). Let A\in\real^{n\times d} be wide (n\le d) and full-rank, and let \gamma denote the maximum ratio between two row norms:

    \[\gamma \coloneqq \frac{ \max_i \norm{a_i}}{\min_i \norm{a_i}}.\]

Then the Demmel condition number of the original matrix A satisfies

    \[\kappa_{\rm dem}(A) \le \gamma \cdot \kappa_{\rm dem}(D_A A).\]

Moreover, for every \gamma\ge 1, there exists a matrix A_\gamma where this bound is nearly attained:

    \[\kappa_{\rm dem}(A_\gamma) \ge \sqrt{1-\frac{1}{n}} \cdot \gamma \cdot \kappa_{\rm dem}(D_{A_\gamma}A_\gamma).\]

Taken together, Theorem 1 and Proposition 2 show that row equilibration often improves the Demmel condition number, and never increases it by much. Consequently, uniform RK often converges faster than standard RK for square (or short, wide) linear systems, and it never converges much slower.
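For a concrete feel for these bounds, here is a small sketch (my own code) checking them on the diagonal example A_\gamma = \diag(\gamma,\ldots,\gamma,1) that appears in the proof of Proposition 2 below:

    import numpy as np

    def demmel(M):
        return np.linalg.norm(M, "fro") / np.linalg.svd(M, compute_uv=False).min()

    n, gamma = 20, 1e3
    A_gamma = np.diag([gamma] * (n - 1) + [1.0])            # diag(gamma, ..., gamma, 1)
    D = np.diag(1.0 / np.linalg.norm(A_gamma, axis=1))      # row equilibration; D @ A_gamma = I

    print(demmel(A_gamma))               # approximately gamma * sqrt(n - 1)
    print(demmel(D @ A_gamma))           # sqrt(n), the Demmel condition number of the identity
    print(gamma * demmel(D @ A_gamma))   # Proposition 2's upper bound on demmel(A_gamma)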

Proof of Theorem 1

We follow Higham’s approach. Each of the n rows of D_AA has unit norm, so

(7)   \[\norm{D_AA}_{\rm F} = \sqrt{n}.\]

The minimum singular value of D_A A can be written in terms of the Moore–Penrose pseudoinverse (D_A A)^\dagger = A^\dagger D_A^{-1} as follows:

    \[\frac{1}{\sigma_{\rm min}(D_A A)} = \norm{A^\dagger D_A^{-1}}.\]

Here, \norm{\cdot} denotes the spectral norm. Then for any nonsingular diagonal matrix D, we have

(8)   \[\frac{1}{\sigma_{\rm min}(D_A A)} = \norm{A^\dagger D^{-1} (DD_A^{-1})} \le \norm{A^\dagger D^{-1}} \norm{DD_A^{-1}} = \frac{\norm{DD_A^{-1}}}{\sigma_{\rm min}(DA)}. \]

Since the matrix DD_A^{-1} is diagonal, its spectral norm is

    \[\norm{DD_A^{-1}} = \max \left\{ \frac{|D_{ii}|}{|(D_A)_{ii}|} : 1\le i \le n \right\}.\]

The diagonal entries of D_A are \norm{a_i}^{-1}, so

    \[\norm{DD_A^{-1}} = \max \left\{ |D_{ii}|\norm{a_i} : 1\le i\le n \right\}\]

is the maximum row norm of the scaled matrix DA. The maximum row norm is never larger than the largest singular value of DA, so \norm{DD_A^{-1}} \le \sigma_{\rm max}(DA). Therefore, combining this result, (7), and (8), we obtain

    \[\kappa_{\rm dem}(D_AA) \le \sqrt{n} \cdot \frac{\sigma_{\rm max}(DA)}{\sigma_{\rm min}(DA)} = \sqrt{n}\cdot \kappa (DA).\]

Since this bound holds for every D \in \mathrm{Diag}, we are free to minimize over D, leading to the first inequality in the theorem:

    \[\kappa_{\rm dem}(D_AA) \le \sqrt{n}\cdot \min_{D \in \mathrm{Diag}} \kappa (DA).\]

Since the spectral norm condition number is smaller than the Demmel condition number, we obtain the second bound in the theorem.

Proof of Proposition 2

Write A = D_A^{-1}(D_AA). Using the Moore–Penrose pseudoinverse again, we have

(9)   \[\kappa_{\rm dem}(A) = \norm{D_A^{-1}(D_AA)}_{\rm F} \norm{(D_A A)^\dagger D_A}.\]

The Frobenius norm and spectral norm satisfy a (mixed) submultiplicative property

    \[\norm{BC}_{\rm F} \le \norm{B}\norm{C}_{\rm F}, \quad \norm{BC} \le\norm{B}\norm{C}.\]

Applying this result to (9), we obtain

    \[\kappa_{\rm dem}(A) \le \norm{D_A^{-1}}\norm{D_AA}_{\rm F} \norm{(D_A A)^\dagger} \norm{D_A}.\]

We recognize \gamma = \norm{D_A^{-1}}\norm{D_A} and \kappa_{\rm dem}(D_A A) = \norm{D_AA}_{\rm F} \norm{(D_A A)^\dagger}. We conclude

    \[\kappa_{\rm dem}(A) \le \gamma \cdot \kappa_{\rm dem}(D_A A).\]

To show this bound is nearly attained, introduce the n\times n diagonal matrix A_\gamma = \diag(\gamma,\gamma,\ldots,\gamma,1). Then D_{A_\gamma} A_\gamma = I with \kappa_{\rm dem}(D_{A_\gamma}A_{\gamma}) = \sqrt{n} and

    \[\kappa_{\rm dem}(A_\gamma) = \frac{\norm{A_{\gamma}}_{\rm F}}{\sigma_{\rm min}(A_\gamma)} = \frac{\sqrt{(n-1)\gamma^2+1}}{1} \ge \sqrt{n} \cdot \sqrt{1-\frac{1}{n}} \cdot \gamma.\]

Therefore,

    \[\kappa_{\rm dem}(A_\gamma) \ge \sqrt{1-\frac{1}{n}} \cdot \gamma \cdot \kappa_{\rm dem}(D_{A_\gamma}A_\gamma).\]

Practical Guidance

What does this theory mean for practice? Ultimately, single-row randomized Kaczmarz is often not the best algorithm for ordinary square (or short, wide) linear systems anyway; block Kaczmarz or (preconditioned) Krylov methods have been faster in my experience. But, supposing that we have locked in (single-row) randomized Kaczmarz as our algorithm, how should we implement it?

This question is hard to answer, because there are examples where standard RK converges faster than uniform RK and examples where the reverse is true. Theorem 1 suggests that uniform RK can require up to n\times more iterations than standard RK on a worst-case example, which can be a big difference for large problems. But, particularly for badly row-scaled problems, Proposition 2 shows that uniform RK can dramatically outcompete standard RK. Ultimately, I would give two answers.

First, if the matrix A has already been carefully designed to be well-conditioned and computing the row norms is not computationally burdensome, then standard RK may be worth the effort. Despite the theory suggesting that standard RK can do quite badly, it took a bit of effort to construct a simple example of a “bad” matrix where uniform RK significantly outcompeted standard RK. (On most examples I constructed, the rates of convergence of the two methods were similar.)

Second, particularly for the largest systems where you only want to make a small number of total passes over the matrix, expending a full pass over the matrix to compute the row norms is a significant expense. And, for poorly row-scaled matrices, sampling using the squared row norms can hurt the convergence rate. Based on these observations, given a matrix of unknown row scaling and conditioning or given a small budget of passes over the matrix, I would use the uniform RK method over the standard RK method.

Finally, let me again emphasize that the theoretical results Theorem 1 and Proposition 2 only apply to square or wide matrices A. Uniform RK also appears to work for consistent systems with a tall matrix, but I am unaware of a theoretical result comparing the Demmel condition numbers of D_AA and A that applies to tall matrices. And for inconsistent systems of equations, it’s a whole different story.

Edit: After the initial publication of this post, Mark Schmidt shared that the observation that uniform RK can outperform standard RK was made nearly ten years ago, in section 4.2 of the following paper. The authors support this observation with a different mathematical justification.

Randomized Kaczmarz is Asymptotically Unbiased for Least Squares

I am delighted to share that my paper Randomized Kaczmarz with tail averaging, joint with Gil Goldshlager and Rob Webber, has been posted to arXiv. This paper benefited greatly from discussions with friends, colleagues, and experts, and I’d like to thank everyone who gave us feedback.

In this post, I want to provide a different and complementary perspective on the results of this paper, and present some more elementary results and derivations that didn’t make it into the main paper.

The randomized Kaczmarz (RK) method is an iterative method for solving systems of linear equations Ax = b, whose dimensions will be n\times d throughout this post. Beginning from a trivial initial solution x_0=0, the method works by repeating the following two steps for t = 0,1,2,\ldots:

  1. Randomly sample a row i_t of A with probability

        \[\prob\{ i_t = j\} = \frac{\norm{a_j}^2}{\norm{A}_{\rm F}^2}.\]

    Throughout this post, a_j^\top denotes the jth row of A.
  2. Orthogonally project x_t onto the solution space of the equation a_{i_t}^\top x = b_{i_t}, obtaining x_{t+1} \coloneqq x_t + (b_{i_t} - a_{i_t}^\top x_t)/\norm{a_{i_t}}^2\cdot a_{i_t}.

The main selling point of RK is that it only interacts with the matrix A through row accesses, which makes the method ideal for very large problems where only a few rows of A can fit in memory at a time.

When applied to a consistent system of linear equations (i.e., a system possessing a solution x_\star satisfying Ax_\star = b), RK is geometrically convergent:

(1)   \[\expect\left[ \norm{x_t - x_\star}^2 \right] \le (1 - \kappa_{\rm dem}^{-2})^t \norm{x_\star}^2. \]

Here,

(2)   \[\kappa_{\rm dem} = \frac{\norm{A}_{\rm F}}{\sigma_{\rm min}(A)} = \sqrt{\sum_i \left(\frac{\sigma_i(A)}{\sigma_{\rm min}(A)}\right)^2} \]

is known as the Demmel condition number, and \sigma_i(A) are the singular values of A. (Personally, I find it convenient to write the Demmel condition number as \kappa_{\rm dem} = \kappa \cdot \sqrt{\operatorname{srank}(A)}, where \kappa = \sigma_{\rm max}(A) / \sigma_{\rm min}(A) is the ordinary condition number and \operatorname{srank}(A) = \norm{A}_{\rm F}^2/\sigma_{\rm max}^2(A) is the stable rank of A. The stable rank is a smooth proxy for the rank or dimensionality of the matrix. With this parameter, the rate of convergence is roughly \e^{-t / (\kappa^2 \operatorname{srank}(A))}, so it takes roughly \kappa^2 \operatorname{srank}(A) row accesses to reduce the error by a constant factor; compare this to gradient descent, which requires \kappa^2 n row accesses.)

However, for inconsistent problems, RK does not converge at all. Indeed, since each step of RK updates the iterate x_t so that an equation a_{i_t}^\top x = b_{i_t} holds exactly, and no solution x_\star satisfies all the equations simultaneously, the RK iterates continue to stochastically fluctuate no matter how long the algorithm is run.

However, while the RK iterates continue to randomly fluctuate when applied to an inconsistent system, their expected value does converge. In fact, it converges to the least-squares solution x_\star to the system, defined as

    \[x_\star = \operatorname{argmin}_{x \in \real^d} \norm{b - Ax}.\]

(If the matrix A is rank-deficient, we define x_\star to be the minimum-norm least-squares solution x_\star = A^\dagger b, where {}^\dagger denotes the Moore–Penrose pseudoinverse.)

Put differently, the bias of the RK iterates x_t, viewed as estimators of the least-squares solution x_\star, converges to zero, and the rate of convergence is geometric. Specifically, we have the following theorem:

Theorem 1 (Exponentially Decreasing Bias): The RK iterates x_t have an exponentially decreasing bias

(3)   \[\norm{\mathbb{E}[x_t] - x_\star}^2 \le (1 - \kappa_{\rm dem}^{-2})^{2t} \norm{x_\star}^2. \]

Observe that the rate of convergence (3) for the bias is twice as fast as the rate of convergence for the error in (1). This factor of two was previously observed by Gower and Richtarik in the context of consistent systems of equations.
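Theorem 1 can be sanity-checked numerically without any Monte Carlo sampling, because the expected iterate obeys an exact linear recursion, \expect[x_{t+1}] = (I - A^\top A/\norm{A}_{\rm F}^2)\,\expect[x_t] + A^\top b/\norm{A}_{\rm F}^2 (this recursion is derived in the proof at the end of the post). Here is a sketch (my own code) that iterates the recursion and compares the bias against the bound (3):

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 100, 20
    A = rng.standard_normal((n, d))
    b = rng.standard_normal(n)                              # generically inconsistent
    x_star, *_ = np.linalg.lstsq(A, b, rcond=None)          # least-squares solution

    fro2 = np.linalg.norm(A, "fro") ** 2
    M = np.eye(d) - A.T @ A / fro2                          # one-step map for the expected iterate
    kappa_dem = np.linalg.norm(A, "fro") / np.linalg.svd(A, compute_uv=False).min()

    Ex, ratios = np.zeros(d), []
    for t in range(1, 201):
        Ex = M @ Ex + A.T @ b / fro2                        # E[x_t] from E[x_{t-1}]
        bias_sq = np.linalg.norm(Ex - x_star) ** 2
        bound = (1 - kappa_dem**-2) ** (2 * t) * np.linalg.norm(x_star) ** 2
        ratios.append(bias_sq / bound)
    print("max of bias^2 / bound:", max(ratios))            # should be at most 1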

The proof of Theorem 1 is straightforward, and we will present it at the bottom of this post. But first, we will discuss a couple of implications: first, we develop convergent versions of the RK algorithm using tail averaging; second, we explore what happens when we implement RK with different sampling probabilities \prob \{i_t = j\}.

Tail Averaging

It may seem like Theorem 1 has little implication for practice. After all, just because the expected value of x_t becomes closer and closer to x_\star, it need not be the case that x_t itself is close to x_\star. However, we can improve the quality of the approximate solution by averaging.

There are multiple different possible ways we could use averaging. A first idea would be to run RK multiple times, obtaining multiple solutions x^{(1)}_t,\ldots,x^{(m)}_t which could then be averaged together. This approach is inefficient as each solution x_t^{(i)} is computed separately.

A better strategy is tail averaging. Fix a burn-in time t_{\rm b}, chosen so that the bias \norm{\expect[x_{t_{\rm b}}] - x_\star} is small. For each s> t_{\rm b}, x_s is a nearly unbiased approximation to the least-squares solution x_\star. To reduce variance, we can average these estimators together

    \[\overline{x}_t = \frac{x_{t_{\rm b} +1} + x_{t_{\rm b}+2} + \cdots + x_t}{t-t_{\rm b}}.\]

The estimator \overline{x}_t is the tail-averaged randomized Kaczmarz (TARK) estimator. By Theorem 1, we know the TARK estimator has an exponentially small bias:

    \[\norm{\expect[\overline{x}_t] - x_\star}^2 \le (1 - \kappa_{\rm dem}^{-2})^{2(t_{\rm b}+1)} \norm{x_\star}^2.\]

In our paper, we also prove a bound for the mean-square error:

    \[\expect [\norm{\overline{x}_t - x_\star}^2] \le (1-\kappa_{\rm dem}^{-2})^{t_{\rm b}+1} \norm{x_\star}^2 + \frac{2\kappa_{\rm dem}^4}{t-t_{\rm b}} \frac{\norm{b-Ax_\star}^2}{\norm{A}_{\rm F}^2}.\]

We see that the rate of convergence is geometric in the burn-in time t_{\rm b} and occurs at an algebraic (Monte Carlo) rate in the final time t. While the Monte Carlo rate of convergence may be unappealing, it is known that this rate is optimal for any method that accesses row–entry pairs (a_i,b_i); see our paper for a new derivation of this fact.
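In code, tail averaging is a one-line change to the RK loop: maintain a running sum of the iterates after the burn-in. Here is a minimal sketch (the function name and arguments are mine; it uses the standard squared row norm sampling):

    import numpy as np

    def tark(A, b, num_iters, burn_in, rng=None):
        """Tail-averaged randomized Kaczmarz: average the iterates x_s for s > burn_in."""
        rng = np.random.default_rng() if rng is None else rng
        n, d = A.shape
        row_norms_sq = np.sum(A**2, axis=1)
        probs = row_norms_sq / row_norms_sq.sum()           # squared row norm sampling
        x, tail_sum = np.zeros(d), np.zeros(d)
        for t in range(1, num_iters + 1):
            i = rng.choice(n, p=probs)
            x = x + (b[i] - A[i] @ x) / row_norms_sq[i] * A[i]
            if t > burn_in:
                tail_sum += x                               # accumulate x_{t_b+1}, ..., x_t
        return tail_sum / (num_iters - burn_in)             # the tail average

For a least-squares problem, one might compare the output of tark(A, b, 10**5, 10**4) against np.linalg.lstsq.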

Tail averaging can be an effective method for squeezing a bit more accuracy out of the RK method for least-squares problems. The figure below shows the error of different RK methods applied to a random least-squares problem, including plain RK, RK with thread averaging (RKA), and RK with underrelaxation (RKU); see the paper’s GitHub for code. (The number of threads is set to 10, and the underrelaxation parameter is 1/\sqrt{t}; we found this underrelaxation schedule to lead to a smaller error than the other popular schedule 1/t.) For this problem, tail-averaged randomized Kaczmarz achieves the lowest error of all of the methods considered, being 6× smaller than RKA, 22× smaller than RK, and 10⁶× smaller than RKU.

[Figure: Error versus iteration for different randomized Kaczmarz methods. Tail-averaged randomized Kaczmarz achieves the lowest error.]

What Least-Squares Solution Are We Converging To?

A second implication of Theorem 1 comes from reanalyzing the RK algorithm if we change the sampling probabilities. Recall that the standard RK algorithm draws random row indices i_t using squared row norm sampling:

    \[\prob \{ i_t = j \} = \frac{\norm{a_j}^2}{\norm{A}_{\rm F}^2} \eqqcolon p^{\rm RK}_j.\]

We have notated the RK sampling probability for row j as p_j^{\rm RK}.

Using the standard RK sampling procedure can sometimes be difficult. To implement it directly, we must make a full pass through the matrix to compute the sampling probabilities. (If we have an upper bound on the squared row norms, there is an alternative procedure based on rejection sampling that avoids this precomputation step.) It can be much more convenient to sample rows i_t uniformly at random: \prob \{ i_t = j \} = 1/n.

To do the following analysis in generality, consider performing RK using general sampling probabilities p_j:

    \[\prob \{i_t = j\} = p_j.\]

What happens to the bias of the RK iterates then?

Define a diagonal matrix

    \[D \coloneqq \diag\left( \sqrt{\frac{p_j}{p_j^{\rm RK}}} : j =1,\ldots,n\right).\]

One can check that the RK algorithm with non-standard sampling probabilities \{p_j\} is equivalent to the standard RK algorithm run on the diagonally reweighted least-squares problem

    \[x_{\rm weighted} = \argmin_{x\in\real^d} \norm{Db-(DA)x}.\]

In particular, applying TARK with uniform sampling probabilities, the tail-averaged iterates \overline{x}_t will converge to the weighted least-squares solution x_{\rm weighted} rather than to the original least-squares solution x_\star.
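Here is a sketch illustrating this reweighting on a synthetic badly row-scaled, inconsistent problem (my own example; it computes both solutions directly with np.linalg.lstsq rather than by running RK, since the point is only that x_{\rm weighted} and x_\star can differ):

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 500, 5
    A = rng.standard_normal((n, d)) * rng.exponential(size=(n, 1))   # badly row-scaled matrix
    b = rng.standard_normal(n)                                       # inconsistent right-hand side

    x_star, *_ = np.linalg.lstsq(A, b, rcond=None)                   # ordinary least-squares solution

    row_norms_sq = np.sum(A**2, axis=1)
    p_uniform = np.full(n, 1.0 / n)
    p_rk = row_norms_sq / row_norms_sq.sum()                         # standard RK probabilities
    D = np.sqrt(p_uniform / p_rk)                                    # diagonal of the reweighting matrix

    x_weighted, *_ = np.linalg.lstsq(D[:, None] * A, D * b, rcond=None)

    print(np.linalg.norm(x_weighted - x_star) / np.linalg.norm(x_star))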

I find this very interesting. In the standard RK method, the squared row norm sampling distribution is chosen to ensure rapid convergence of the RK iterates to the solution of a consistent system of linear equations. However, for a consistent system, the RK method will always converge to the same solution no matter what row sampling strategy is chosen (as long as every non-zero row has a positive probability of being picked). In the least-squares context, however, the conclusion is very different: the choice of row sampling distribution not only affects the rate of convergence, but also which solution is being converged to!

As the plot below demonstrates, the original least-squares solution and the re-weighted least-squares solution can sometimes be quite different from each other. The plot shows the results of fitting a function with many kinks (shown as a black dashed curve) by a polynomial p(u) = \sum_{j=0}^{10} x_j u^j at 10000 equispaced points. (For this experiment, we represent the polynomial by its monomial coefficients x_j, which has issues with numerical stability; a Chebyshev polynomial representation would be better. We use this example only to illustrate the difference between the weighted and original least-squares solutions.) We compare the unweighted least-squares solution (orange solid curve) to the weighted least-squares solution using uniform RK weights p_j = 1/n (blue dash-dotted curve). These two curves differ meaningfully, with the weighted least-squares solution having higher error at the ends of the interval but more accuracy in the middle. These differences can be explained by looking at the weights (diagonal entries of D, grey dotted curve), which are lower at the ends of the interval than in the center.

[Figure: Weighted and unweighted least-squares solutions to the polynomial regression problem. (Tail-averaged) randomized Kaczmarz with uniform sampling converges to the weighted least-squares solution, which is more accurate in the middle of the interval (where the weights are large) than at the edges (where the weights are smaller).]

Does this diagonal rescaling issue matter? Sometimes not. In many applications, the weighted and unweighted least-squares solutions will both be fine. Indeed, in the above example, neither the weighted nor the unweighted solution is the “right” one; the weighted solution is more accurate in the interior of the domain and less accurate at the boundary. However, sometimes getting the true least-squares solution matters, or the amount of reweighting done by uniform sampling is too aggressive for a problem. In these cases, using the classical RK sampling probabilities may be necessary. Fortunately, rejection sampling can often be used to perform squared row norm sampling; see this blog post of mine for details.
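For completeness, here is a sketch of the rejection sampling idea (my own illustrative implementation, assuming we know an upper bound norm_sq_bound on the squared row norms): propose a row uniformly at random and accept it with probability \norm{a_j}^2 divided by the bound, so that accepted rows follow the squared row norm distribution without any upfront pass over the matrix.

    import numpy as np

    def sample_row_squared_norm(A, norm_sq_bound, rng):
        """Sample a row index j with probability proportional to ||a_j||^2 via rejection sampling."""
        n = A.shape[0]
        while True:
            j = rng.integers(n)                             # propose a row uniformly at random
            if rng.random() < np.sum(A[j] ** 2) / norm_sq_bound:
                return j                                    # accept with probability ||a_j||^2 / bound

A proposal is accepted with probability \norm{A}_{\rm F}^2/(n\cdot \text{bound}) on average, so the expected number of proposals per sample is n\cdot\text{bound}/\norm{A}_{\rm F}^2; the tighter the bound, the cheaper the sampler.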

Proof of Theorem 1

Let us now prove Theorem 1, the asymptotic unbiasedness of randomized Kaczmarz. We will assume throughout that A is full-rank; the rank-deficient case is similar but requires a bit more care.

Begin by rewriting the update rule in the following way

    \[x_{t+1} = \left( I - \frac{a_{i_t}^{\vphantom{\top}}a_{i_t}^\top}{\norm{a_{i_t}}^2} \right)x_t + \frac{b_{i_t}a_{i_t}}{\norm{a_{i_t}}^2}.\]

Now, letting \expect_{i_t} denote the average over the random index i_t, we compute

    \begin{align*}\expect_{i_t}[x_{t+1}] &= \sum_{j=1}^n \left[\left( I - \frac{a_j^{\vphantom{\top}} a_j^\top}{\norm{a_j}^2} \right)x_t + \frac{b_ja_j}{\norm{a_j}^2}\right] \prob\{i_t=j\}\\ &=\sum_{j=1}^n \left[\left( \frac{\norm{a_j}^2}{\norm{A}_{\rm F}^2}I - \frac{a_j^{\vphantom{\top}} a_j^\top}{\norm{A}_{\rm F}^2} \right)x_t + \frac{b_ja_j}{\norm{A}_{\rm F}^2}\right].\end{align*}

We can evaluate the sums \sum_{j=1}^n a_j^{\vphantom{\top}}a_j^\top = A^\top A and \sum_{j=1}^n b_j a_j = A^\top b directly. Therefore, we obtain

    \[\expect_{i_t}[x_{t+1}] = \left( I - \frac{A^\top A}{\norm{A}_{\rm F}^2}\right) x_t + \frac{A^\top b}{\norm{A}_{\rm F}^2}.\]

Thus, taking the (total) expectation of both sides, we get

    \[\expect[x_{t+1}] = \left( I - \frac{A^\top A}{\norm{A}_{\rm F}^2}\right) \expect[x_t] + \frac{A^\top b}{\norm{A}_{\rm F}^2}.\]

Iterating this equation and using the initial condition x_0 = 0, we obtain

(4)   \[\expect[x_t] = \left[\sum_{i=0}^{t-1} \left( I - \frac{A^\top A}{\norm{A}_{\rm F}^2}\right)^i \right]\cdot\frac{A^\top b}{\norm{A}_{\rm F}^2}.\]


The equation (4) expresses the expected RK iterate \expect[x_t] using a matrix geometric series. Recall that the infinite geometric series with a scalar ratio |y|<1 satisfies the formula

    \[\sum_{i=0}^\infty y^i = (1-y)^{-1}.\]

Substituting x=1-y, we get

    \[\sum_{i=0}^\infty (1-x)^i = x^{-1}\]

for any 0 < x < 2. With a little effort, one can check that the same formula

    \[\sum_{i=0}^\infty (I-X)^i = X^{-1}\]

holds for a square matrix X satisfying 0 < \sigma_{\rm min}(X) \le \norm{X} < 2. These conditions hold for the matrix X = A^\top A / \norm{A}_{\rm F}^2 since A is full-rank, yielding

(5)   \[\left(\frac{A^\top A}{\norm{A}_{\rm F}^2}\right)^{-1} = \sum_{i=0}^\infty \left( I - \frac{A^\top A}{\norm{A}_{\rm F}^2}\right)^i. \]
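To verify the matrix geometric series formula, one can look at the partial sums. Multiplying by X and telescoping gives

    \[X \sum_{i=0}^{N} (I-X)^i = \sum_{i=0}^N \left[ (I-X)^i - (I-X)^{i+1} \right] = I - (I-X)^{N+1}.\]

For a symmetric matrix X with 0 < \sigma_{\rm min}(X) \le \norm{X} < 2 (such as X = A^\top A/\norm{A}_{\rm F}^2), the eigenvalues of I - X lie strictly between -1 and 1, so (I-X)^{N+1} \to 0 and the partial sums converge to X^{-1}.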

In anticipation of plugging this result into (4), we can break the infinite sum on the right-hand side into two pieces:

(6)   \[\left(\frac{A^\top A}{\norm{A}_{\rm F}^2}\right)^{-1} = \sum_{i=0}^{t-1} \left( I - \frac{A^\top A}{\norm{A}_{\rm F}^2}\right)^i + \sum_{i=t}^\infty \left( I - \frac{A^\top A}{\norm{A}_{\rm F}^2}\right)^i.\]

We are in the home stretch. Plugging (6) into (4) and using the normal equations x_\star = (A^\top A)^{-1} A^\top b, we obtain

    \[\expect[x_t] = x_\star - \left[\sum_{i=t}^\infty \left( I - \frac{A^\top A}{\norm{A}_{\rm F}^2}\right)^i \right]\cdot\frac{A^\top b}{\norm{A}_{\rm F}^2}.\]

Rearrange to obtain

    \[\expect[x_t] - x_\star = - \left(I - \frac{A^\top A}{\norm{A}_{\rm F}^2}\right)^t \left[\sum_{i=0}^\infty \left( I - \frac{A^\top A}{\norm{A}_{\rm F}^2}\right)^i \right]\cdot\frac{A^\top b}{\norm{A}_{\rm F}^2}.\]

Now apply (5) and the normal equations again to obtain

    \[\expect[x_t] - x_\star = - \left(I - \frac{A^\top A}{\norm{A}_{\rm F}^2}\right)^t x_\star.\]

Take norms and use the submultiplicative property of the spectral norm to obtain

(7)   \[\norm{\expect[x_t] - x_\star}^2 \le \norm{I - \frac{A^\top A}{\norm{A}_{\rm F}^2}}^{2t} \norm{x_\star}^2. \]

To complete the proof we must evaluate the norm of the matrix I - A^\top A / \norm{A}_{\rm F}^2. Let A = U \Sigma V^\top be a (thin) singular value decomposition, where \Sigma = \diag(\sigma_{\rm max}(A),\ldots,\sigma_{\rm min}(A)). Then

    \[I - \frac{A^\top A}{\norm{A}_{\rm F}^2} = I - V \cdot\frac{\Sigma^2}{\norm{A}_{\rm F}^2} \cdot V^\top = V \left( I - \frac{\Sigma^2}{\norm{A}_{\rm F}^2}\right)V^\top.\]

We recognize this as a matrix whose eigenvectors are the columns of V and whose eigenvalues are the diagonal entries of the matrix

    \[I - \Sigma^2/\norm{A}_{\rm F}^2 = \diag(1 - \sigma^2_{\rm max}(A)/\norm{A}_{\rm F}^2,\ldots,1-\sigma_{\rm min}^2(A)/\norm{A}_{\rm F}^2).\]

All eigenvalues are nonnegative and the largest eigenvalue is 1 - \sigma_{\rm min}^2(A)/\norm{A}_{\rm F}^2 = 1 - \kappa_{\rm dem}^{-2}. We have invoked the definition of the Demmel condition number (2). Therefore, plugging into (7), we obtain

    \[\norm{\expect[x_t] - x_\star}^2 \le \left(1-\kappa_{\rm dem}^{-2}\right)^{2t} \norm{x_\star}^2,\]

as promised.