I am delighted to share that my paper *Embrace rejection: Kernel matrix approximation with randomly pivoted Cholesky* has been released on arXiv, joint with Joel Tropp and Rob Webber.

On the occasion of this release, I want to talk about *rejection sampling*, one of the most powerful ideas in probabilistic computing. My goal is to provide an accessible and general introduction to this technique, from a more “applied math” perspective than a statistical one. I will conclude with some discussion of how we use it in our paper.

## Motivating Application: Randomized Kaczmarz

Consider the task of solving a linear system of equations $Ax = b$, where $A \in \mathbb{R}^{n \times d}$ is a matrix and $b \in \mathbb{R}^n$ is a vector. For simplicity, we assume that this system is *consistent* (i.e., there exists a solution $x_\star$ such that $Ax_\star = b$). The randomized Kaczmarz algorithm, due to Strohmer and Vershynin, is a mathematically elegant method for solving this problem. Beginning with an initial solution $x_0$ (say, $x_0 = 0$), randomized Kaczmarz repeatedly performs the following steps:

1. **Sample.** Generate a random row index $i_t$ according to the probability distribution $\mathbb{P}\{i_t = i\} = \|a_i\|^2 / \|A\|_{\rm F}^2$. Here, and going forward, $a_i$ denotes the $i$th row of $A$ and $\|A\|_{\rm F} = (\sum_{i=1}^n \|a_i\|^2)^{1/2}$ is the Frobenius norm.
2. **Update.** Set $x_{t+1} := x_t + \dfrac{b_{i_t} - \langle a_{i_t}, x_t \rangle}{\|a_{i_t}\|^2}\, a_{i_t}$.
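As a concrete illustration, here is a minimal NumPy sketch of these two steps (the function name, the zero initialization, and the fixed iteration count are my own illustrative choices, not from the original papers). Note that this sketch precomputes every row norm up front, which is exactly the cost the rest of the post shows how to avoid:

```python
import numpy as np

def randomized_kaczmarz(A, b, num_iters=1000, rng=None):
    """Solve a consistent system Ax = b by randomized Kaczmarz."""
    rng = np.random.default_rng(rng)
    m, n = A.shape
    row_norms_sq = np.sum(A**2, axis=1)
    probs = row_norms_sq / row_norms_sq.sum()  # squared row-norm distribution
    x = np.zeros(n)
    for _ in range(num_iters):
        i = rng.choice(m, p=probs)                        # sample step
        x += (b[i] - A[i] @ x) / row_norms_sq[i] * A[i]   # update (projection) step
    return x
```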

Convergence theory of the randomized Kaczmarz iteration is very well-studied; see, e.g., section 4.1 of this survey for an introduction.

For this post, we will consider a more basic question: How can we perform sampling step 1? If $A$ and $b$ are not very large, this is an easy problem: Simply sweep through all the rows and compute their squared norms $\|a_1\|^2, \ldots, \|a_n\|^2$. Then, sampling can be done using any algorithm for sampling from a weighted list of items. But what if $A$ is very large? Suppose, even, that it is computationally expensive to read through the matrix even once to compute the row norms. What can we do now?

Here is one solution using rejection sampling. Suppose that we know a bound on the squared row norms of $A$:

$$\|a_i\|^2 \le B \quad \text{for each } i = 1, \ldots, n. \tag{1}$$

For instance, if the entries of $A$ take values in the range $[-1, 1]$, then (1) holds with $B = d$. To sample a row with probability proportional to $\|a_i\|^2$, we perform the following rejection sampling procedure:

1. **Propose.** Draw a row index $j$ uniformly at random (i.e., $j$ is equally likely to be any row index between $1$ and $n$).
2. **Accept/reject.** Make a random choice: With probability $\|a_j\|^2 / B$, accept and set $i := j$. With the remaining probability $1 - \|a_j\|^2 / B$, reject and return to step 1.
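In code, this procedure might look like the following sketch (a hypothetical helper of my own; it assumes only the bound $B$ and the ability to read one row of $A$ at a time):

```python
import numpy as np

def sample_row_rejection(A, B, rng=None):
    """Sample a row index with probability proportional to its squared norm,
    touching only one row of A per proposal. Requires B >= max_i ||a_i||^2."""
    rng = np.random.default_rng(rng)
    m = A.shape[0]
    while True:
        j = rng.integers(m)               # propose: uniform row index
        norm_sq = np.sum(A[j] ** 2)       # read only the proposed row
        if rng.random() < norm_sq / B:    # accept with probability ||a_j||^2 / B
            return j
```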

Why does this procedure work? As an intuitive explanation, notice that each row $i$ is equally likely to be proposed and, once proposed, has probability proportional to $\|a_i\|^2$ of being accepted. Therefore, under this procedure, each row has probability proportional to $\|a_i\|^2$ of being selected, just as desired for randomized Kaczmarz. If this intuitive justification is unsatisfying, we’ll see a formal derivation of correctness below (in a more general context).

I find rejection sampling to be super neat. In this case, it allows us to use simple ingredients (uniform row sampling and an upper bound on the row norms) to accomplish a more complicated task (sampling according to the squared row norms). We can sample from an interesting probability distribution (the squared row norm distribution) by thinning out samples from a simple distribution. This is a powerful idea that has many applications. (And, I believe, many more applications are waiting to be discovered.)

## Rejection Sampling in General

Let’s now discuss rejection sampling in a bit more generality. Say I have a list of $n$ things, each of which has a *weight* $w_i \ge 0$. Our goal is to choose a random index $i$ with probability proportional to $w_i$, i.e., $\mathbb{P}\{i = j\} = w_j / \sum_{k=1}^n w_k$. We will call this the *target distribution*. (Note that we do *not* require the weights to be normalized to satisfy $\sum_{i=1}^n w_i = 1$.)

Suppose that sampling from the target distribution is challenging, but we can sample from a *proposal distribution* with weights $v_1, \ldots, v_n$, i.e., we can efficiently generate random $j$ such that $\mathbb{P}\{j = i\} = v_i / \sum_{k=1}^n v_k$. Further, suppose that the proposal distribution dominates the target distribution in the sense that

$$w_i \le v_i \quad \text{for every } i = 1, \ldots, n.$$

Under this general setting, we can sample from the target distribution using rejection sampling, similar to above:

1. **Propose.** Draw a random index $j$ from the proposal distribution.
2. **Accept/reject.** Make a random choice: With probability $w_j / v_j$, accept and set $i := j$. With the remaining probability $1 - w_j / v_j$, reject and return to step 1.
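A generic version of this loop is easy to write down. The interface below (callables for the proposal draw and the two weight evaluations) is my own framing, chosen so that no weight needs to be evaluated until its index is proposed:

```python
import numpy as np

def rejection_sample(propose, v, w, rng=None):
    """Generic rejection sampling.

    propose(rng) must draw an index i with probability v(i) / sum_j v(j);
    v(i) and w(i) return the unnormalized proposal and target weights,
    with w(i) <= v(i) required for every i."""
    rng = np.random.default_rng(rng)
    while True:
        i = propose(rng)                  # step 1: propose from the proposal
        if rng.random() < w(i) / v(i):    # step 2: accept w.p. w_i / v_i
            return i
```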

We recognize the squared row-norm sampling above as an example of this general strategy with $w_i = \|a_i\|^2$ and $v_i = B$ for all $i$.

Let us now convince ourselves more carefully that rejection sampling works. On any given execution of the rejection sampling loop, the probability of proposing index $i$ is $v_i / \sum_k v_k$, after which the probability of accepting is $w_i / v_i$. Therefore, the probability of outputting $i$ at any given loop is

$$\frac{v_i}{\sum_k v_k} \cdot \frac{w_i}{v_i} = \frac{w_i}{\sum_k v_k}.$$

The probability that *any* index on that loop is accepted is thus

$$\sum_{i=1}^n \frac{w_i}{\sum_k v_k} = \frac{\sum_i w_i}{\sum_k v_k}.$$

The rejection sampling loop accepts with fixed positive probability each execution, so it eventually accepts with 100% probability. Thus, the probability of accepting $i$ conditional on *some* index being accepted is

$$\frac{w_i / \sum_k v_k}{\sum_j w_j / \sum_k v_k} = \frac{w_i}{\sum_j w_j}.$$

Therefore, rejection sampling outputs a sample from the target distribution, as promised.

As this analysis shows, the probability of accepting on any given execution of the rejection sampling loop is $\sum_i w_i / \sum_i v_i$, the ratio of the total mass of the target $\sum_i w_i$ to the total mass of the proposal $\sum_i v_i$. Thus, rejection sampling will have a high acceptance rate if $\sum_i w_i \approx \sum_i v_i$ and a low acceptance rate if $\sum_i w_i \ll \sum_i v_i$. The conclusion for practice is that rejection sampling is only computationally efficient if one has access to a good proposal distribution, one that is both easy to sample from and close to the target.
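We can check both conclusions numerically on a toy example (the weights below are arbitrary choices for illustration): the output frequencies should match $w_i / \sum_j w_j$, and the fraction of proposals accepted should match $\sum_i w_i / \sum_i v_i$.

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.array([1.0, 2.0, 3.0])        # unnormalized target weights, sum = 6
v = np.array([4.0, 4.0, 4.0])        # proposal weights with w_i <= v_i, sum = 12
accepted = np.zeros(3)
proposals = 0
while accepted.sum() < 50_000:
    i = rng.choice(3, p=v / v.sum()) # propose from the proposal distribution
    proposals += 1
    if rng.random() < w[i] / v[i]:   # accept with probability w_i / v_i
        accepted[i] += 1

print(accepted / accepted.sum())     # close to w / sum(w) = [1/6, 1/3, 1/2]
print(accepted.sum() / proposals)    # close to sum(w) / sum(v) = 0.5
```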

Here, we have presented rejection sampling as a strategy for sampling from a discrete set of items $\{1, \ldots, n\}$, but this is not all that rejection sampling can do. Indeed, one can also use rejection sampling to sample a real-valued parameter or a multivariate parameter from a given (unnormalized) probability density function (or, if one likes, an unnormalized probability measure).

Before moving on, let us make one final observation. A very convenient thing about rejection sampling is that we only need the *unnormalized* probability weights $w_i$ and $v_i$ to implement the procedure, even though the total weights $\sum_i w_i$ and $\sum_i v_i$ are necessary to define the sampling probabilities $w_i / \sum_j w_j$ and $v_i / \sum_j v_j$. This fact is essential for many applications of rejection sampling. For instance, in the randomized Kaczmarz application, rejection sampling would be much less useful if we needed the total weight $\|A\|_{\rm F}^2$, as computing it requires a full pass over the matrix.

## Application: Randomly pivoted Cholesky

Another application of rejection sampling is to accelerate the randomly pivoted Cholesky algorithm, the subject of my recent paper. Here, I’ll provide an overview of the main idea with a focus on the role of rejection sampling in the procedure. See the paper for more details. Warning: Even as simplified here, this section presents a significantly more involved application of rejection sampling than we saw previously with randomized Kaczmarz!

Randomly pivoted Cholesky (RPCholesky) is a simple but powerful algorithm for computing a low-rank approximation to a symmetric positive semidefinite matrix $A \in \mathbb{R}^{N \times N}$, and it can be used to accelerate kernel machine learning algorithms.

Conceptually, RPCholesky works as follows. Let $A^{(0)} := A$ denote the initial matrix and $\hat{A}^{(0)} := 0$ a trivial initial approximation. For $i = 0, 1, \ldots, k-1$, RPCholesky performs the following steps:

1. **Random sampling.** Draw a random index $s_i$ with probability $\mathbb{P}\{s_i = j\} = A^{(i)}_{jj} / \operatorname{tr}(A^{(i)})$.
2. **Update approximation.** Update $\hat{A}^{(i+1)} := \hat{A}^{(i)} + \dfrac{A^{(i)}(:,s_i)\, A^{(i)}(s_i,:)}{A^{(i)}_{s_i s_i}}$. Here, $A^{(i)}(:,s_i)$ denotes the $s_i$th column of $A^{(i)}$.
3. **Update matrix.** Update $A^{(i+1)} := A^{(i)} - \dfrac{A^{(i)}(:,s_i)\, A^{(i)}(s_i,:)}{A^{(i)}_{s_i s_i}}$.
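The three steps above translate directly into a conceptual, unoptimized NumPy sketch that forms the full residual matrix (the clipping of tiny negative diagonal entries is a numerical safeguard I have added; an optimized implementation would avoid materializing the residual entirely):

```python
import numpy as np

def rpcholesky(A, k, rng=None):
    """Conceptual RPCholesky: rank-k psd approximation of a psd matrix A."""
    rng = np.random.default_rng(rng)
    A = np.array(A, dtype=float)     # residual matrix, updated in place
    N = A.shape[0]
    A_hat = np.zeros_like(A)
    for _ in range(k):
        d = np.diag(A).clip(min=0)               # residual diagonal
        s = rng.choice(N, p=d / d.sum())         # 1. sample by diagonal
        col = A[:, s].copy()
        update = np.outer(col, col) / A[s, s]
        A_hat += update                          # 2. grow the approximation
        A -= update                              # 3. deflate the residual
    return A_hat
```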

The result of this procedure is a rank-$k$ approximation $\hat{A} := \hat{A}^{(k)} \approx A$. With an optimized implementation, RPCholesky requires only $\mathcal{O}(k^2 N)$ operations and evaluates just $(k+1)N$ entries of the input matrix $A$.

The standard RPCholesky algorithm is already fast, but there is room for improvement. The standard RPCholesky method processes the matrix one column at a time, but modern computers are faster at processing *blocks* of columns. Using the power of rejection sampling, we can develop faster versions of RPCholesky that take advantage of blocked computations. These blocked implementations are important: While we could handle a million-by-million matrix using the standard RPCholesky method, blocked implementations can be up to 40× faster. In this post, I’ll describe the simplest way of using rejection sampling to implement RPCholesky.

To set the stage, let’s first establish some notation and make a few observations. Let $S := \{s_0, \ldots, s_{i-1}\}$ denote the set of the first $i$ selected random indices. With a little matrix algebra, one can show that the approximation produced by $i$ steps of RPCholesky obeys the formula (for those interested, observe that $\hat{A}^{(i)}$ is an example of a Nyström approximation; the connection between Nyström approximation and Cholesky decomposition is explained in this previous post of mine):

$$\hat{A}^{(i)} = A(:,S)\, A(S,S)^{-1}\, A(S,:). \tag{1}$$

Here, we are using MATLAB notation for indexing the matrix $A$. The residual matrix $A^{(i)}$ is given by

$$A^{(i)} = A - \hat{A}^{(i)}.$$

Importantly for us, observe that we can evaluate the $j$th diagonal entry of $A^{(i)}$ using the formula

$$A^{(i)}_{jj} = A_{jj} - A(j,S)\, A(S,S)^{-1}\, A(S,j). \tag{2}$$

Compared to operating on the full input matrix $A$, evaluating an entry $A^{(i)}_{jj}$ is cheap, requiring just some arithmetic involving matrices and vectors of size $i$.

Here, we see a situation similar to randomized Kaczmarz. At each step of RPCholesky, we must sample a random index from a target distribution with weights $w_j = A^{(i)}_{jj}$, but it is expensive to evaluate all of the $A^{(i)}_{jj}$’s. This is a natural setting for rejection sampling. As proposal distribution, we may use the initial diagonal of $A$, i.e., $v_j = A_{jj}$. (One may verify that, as required, $A^{(i)}_{jj} \le A_{jj}$ for all $j$: each residual matrix $A^{(i)}$ remains positive semidefinite, and its diagonal entries only decrease as the algorithm proceeds.) This leads to a *rejection sampling implementation of RPCholesky*:

1. **Sample indices.** For $i = 0, 1, \ldots, k-1$, sample the random indices $s_i$ from the target distribution using rejection sampling with proposal distribution $v_j = A_{jj}$. Entries $A^{(i)}_{jj}$ are evaluated on an as-needed basis using (2).
2. **Build the approximation.** Form the approximation $\hat{A} = \hat{A}^{(k)}$ all at once using (1).

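Here is a simplified sketch of this implementation. It uses formula (2) to evaluate residual diagonal entries only when proposed, and formula (1) to build the approximation at the end (the function name and the use of `np.linalg.solve` are my own choices; the accelerated method in the paper is more refined):

```python
import numpy as np

def rpcholesky_rejection(A, k, rng=None):
    """Rejection-sampling RPCholesky (simplified sketch).

    Proposal: the initial diagonal of A. Residual diagonal entries are
    computed on demand; the Nystrom approximation is formed at the end."""
    rng = np.random.default_rng(rng)
    N = A.shape[0]
    d0 = np.diag(A).astype(float)                # proposal weights v_j = A_jj
    S = []
    while len(S) < k:
        j = rng.choice(N, p=d0 / d0.sum())       # propose from initial diagonal
        # residual diagonal entry: A_jj - A(j,S) A(S,S)^{-1} A(S,j)
        if S:
            r = A[j, j] - A[j, S] @ np.linalg.solve(A[np.ix_(S, S)], A[S, j])
        else:
            r = A[j, j]
        if rng.random() < r / d0[j]:             # accept w.p. residual / proposal
            S.append(j)
    # build the Nystrom approximation A(:,S) A(S,S)^{-1} A(S,:) all at once
    return A[:, S] @ np.linalg.solve(A[np.ix_(S, S)], A[S, :])
```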

By using rejection sampling, we have gone from an algorithm that handles the matrix $A$ one column at a time to a new algorithm that processes the columns of $A$ all at once through the formula (1). In the right situations, this new algorithm can be significantly faster than the original RPCholesky method. (In its essence, this rejection sampling algorithm for RPCholesky was first proposed in this paper of mine, which uses rejection sampling to apply RPCholesky to infinite-dimensional positive definite kernel operators. The ability to handle infinite-dimensional sampling problems is another great strength of the rejection sampling approach.)

We have now seen two extreme cases: the standard RPCholesky algorithm, which processes the columns of $A$ one at a time, and a rejection sampling implementation, which operates on the columns of $A$ all at once. In practice, the fastest algorithm sits between these extremes, using rejection sampling to sample column indices in moderate-size blocks (say, between 100 and 1000 columns at a time). This “Goldilocks” algorithm is the accelerated RPCholesky method that we introduce in our paper. Check it out for details!

## Conclusion: Rejection Sampling, an Algorithm Design Tool

We’ve now seen two applications, randomized Kaczmarz and randomly pivoted Cholesky, where rejection sampling can be used to speed up an algorithm. In both cases, we wanted to sample from a distribution over row or column indices, but it was expensive to perform a census to compute the probability weight of each item individually. Rejection sampling offers a different approach, where we generate samples from a cheap-to-sample proposal distribution and only evaluate the probability weights of the proposed items. In the right situations, this rejection sampling approach can lead to significant computational speedups.

Rejection sampling has been used by statistics-oriented folks for decades, but the power of this technique seems less well-known to researchers in applied math, scientific computation, and related areas. I think there are a lot more exciting ways that we can use rejection sampling to design better, faster randomized algorithms.