I am excited to share that our paper XTrace: Making the most of every sample in stochastic trace estimation has been published in the SIAM Journal on Matrix Analysis and Applications. (See also our paper on arXiv.)
Spurred by this exciting news, I wanted to take the opportunity to share one of my favorite results in randomized numerical linear algebra: a “speed limit” result of Meyer, Musco, Musco, and Woodruff that establishes a fundamental limitation on how accurate any trace estimation algorithm can be.
Let’s back up. Given an unknown square matrix , the trace of , defined to be the sum of its diagonal entries
The catch? We assume that we can only access the matrix through matrix–vector products (affectionately known as “matvecs”): Given any vector , we have access to . Our goal is to form an estimate that is as accurate as possible while using as few matvecs as we can get away with.
To simplify things, let’s assume the matrix is symmetric and positive (semi)definite. The classical algorithm for trace estimation is due to Girard and Hutchinson, producing a probabilistic estimate with a small average (relative) error:
If one wants high accuracy, this algorithm is expensive. To achieve just a 1% error () requires roughly matvecs!
This state of affairs was greatly improved by Meyer, Musco, Musco, and Woodruff. Building upon previous work, they proposed the Hutch++ algorithm and proved it outputs an estimate satisfying the following bound:
(1)Now, we only require roughly matvecs to achieve 1% error! Our algorithm, XTrace, satisfies the same error guarantee (1) as Hutch++. On certain problems, XTrace can be quite a bit more accurate than Hutch++.
The MMMW Trace Estimation “Speed Limit”
Given the dramatic improvement of Hutch++ and XTrace over Girard–Hutchinson, it is natural to hope: Is there an algorithm that does even better than Hutch++ and XTrace? For instance, is there an algorithm satisfying an even slightly better error bound of the form
Unfortunately not. Hutch++ and XTrace are essentially as good as it gets.
Let’s add some fine print. Consider an algorithm for the trace estimation problem. Whenever the algorithm wants, it can present a vector and receive back . The algorithm is allowed to be adaptive: It can use the matvecs it has already collected to decide which vector to present next. We measure the cost of the algorithm in terms of the number of matvecs alone, and the algorithm knows nothing about the psd matrix other what it learns from matvecs.
One final stipulation:
Simple entries assumption. We assume that the entries of the vectors presented by the algorithm are real numbers between and with up to digits after the decimal place.
To get a feel for this simple entries assumption, suppose we set . Then would be an allowed input vector, but would not be (too many digits after the decimal place). Similarly, would not be valid because its entries exceed . The simple entries assumption is reasonable as we typically represent numbers on digital computers by storing a fixed number of digits of accuracy.1We typically represent numbers on digital computers by floating point numbers, which essentially represent numbers using scientific notation like . For this analysis of trace estimation, we use fixed point numbers like (no powers of ten allowed)!
With all these stipulations, we are ready to state the “speed limit” for trace estimation proved by Meyer, Musco, Musco, and Woodruff:
Informal theorem (Meyer, Musco, Musco, Woodruff). Under the assumptions above, there is no trace estimation algorithm producing an estimate satisfying
We will see a slightly sharper version of the theorem below, but this statement captures the essence of the result.
To prove the MMMW theorem, we have to take a journey to the beautiful subject of communication complexity. The story is this. Alice and Bob are interested in solving a computational problem together. Alice has her input and Bob has his input , and they are interested in computing a function of both their inputs.
Unfortunately for the two of them, Alice and Bob are separated by a great distance, and can only communicate by sending single bits (0 or 1) of information over a slow network connection. Every bit of communication is costly. The field of communication complexity is dedicated to determining how efficiently Alice and Bob are able to solve problems of this form.
The Gap-Hamming problem is one example of a problem studied in communication complexity. As inputs, Alice and Bob receive vectors with and entries from a third party Eve. Eve promises Alice and Bob that their vectors and satisfy one of two conditions:
(2)Alice and Bob must work together, sending as few bits of communication as possible, to determine which case they are in.
There’s one simple solution to this problem: First, Bob sends his whole input vector to Alice. Each entry of takes one of the two value and can therefore be communicated in a single bit. Having received , Alice computes , determines whether they are in case 0 or case 1, and sends Bob a single bit to communicate the answer. This procedure requires bits of communication.
Can Alice and Bob still solve this problem with many fewer than bits of communication, say bits? Unfortunately not. The following theorem of Chakrabati and Regev shows that roughly bits of communication are needed to solve this problem:
Theorem (Chakrabati–Regev). Any algorithm which solves the Gap-Hamming problem that succeeds with at least probability for every pair of inputs and (satisfying one of the conditions (2)) must take bits of communication.
Here, is big-Omega notation, closely related to big-O notation and big-Theta notation . For the less familiar, it can be helpful to interpret , , and as all standing for “proportional to ”. In plain language, the theorem of Chakrabati and Regev result states that there is no algorithm for the Gap-Hamming problem that much more effective than the basic algorithm where Bob sends his whole input to Alice (in the sense of requiring less than bits of communication).
Reducing Gap-Hamming to Trace Estimation
This whole state of affairs is very sad for Alice and Bob, but what does it have to do with trace estimation? Remarkably, we can use hardness of the Gap-Hamming problem to show there’s no algorithm that fundamentally improves on Hutch++ and XTrace. The argument goes something like this:
- If there were a trace estimation algorithm fundamentally better than Hutch++ and XTrace, we could use it to solve Gap-Hamming in fewer than bits of communication.
- But no algorithm can solve Gap-Hamming in fewer than bits or communication.
- Therefore, no trace estimation algorithm is fundamentally better than Hutch++ and XTrace.
Step 2 is the work of Chakrabati and Regev, and step 3 follows logically from 1 and 2. Therefore, we are left to complete step 1 of the argument.
Assume we have access to a really good trace estimation algorithm. We will use it to solve the Gap-Hamming problem. For simplicity, assume is a perfect square. The basic idea is this:
- Have Alice and Bob reshape their inputs into matrices , and consider (but do not form!) the positive semidefinite matrix
- Observe that
- By working together, Alice and Bob can implement a trace estimation algorithm. Alice will be in charge of running the algorithm, but Alice and Bob must work together to compute matvecs. (Details below!)
- Using the output of the trace estimation algorithm, Alice determines whether they are in case 0 or 1 (i.e., where or ) and sends the result to Bob.
To complete this procedure, we just need to show how Alice and Bob can implement the matvec procedure using minimal communication. Suppose Alice and Bob want to compute for some vector with entries between and with up to decimal digits. First, convert to a vector whose entries are integers between and . Since , interconverting between and is trivial. Alice and Bob’s procedure for computing is as follows:
- Alice sends Bob .
- Having received , Bob forms and sends it to Alice.
- Having received , Alice computes and sends it to Bob.
- Having received , Bob computes and sends its to Alice.
- Alice forms .
Because and are and have entries, all vectors computed in this procedure are vectors of length with integer entries between and . We conclude the communication cost for one matvec is bits.
Consider an algorithm we’ll call BestTraceAlgorithm. Given any accuracy parameter , BestTraceAlgorithm requires at most matvecs and, for any positive semidefinite input matrix of any size, produces an estimate satisfying
(3)We assume that BestTraceAlgorithm is the best possible algorithm in the sense that no algorithm can achieve (3) on all (positive semidefinite) inputs with matvecs.
To solve the Gap-Hamming problem, Alice and Bob just need enough accuracy in their trace estimation to distinguish between cases 0 and 1. In particular, if
then Alice and Bob can distinguish between cases 0 and 1 in (2′)
Suppose that Alice and Bob apply trace estimation to solve the Gap-Hamming problem, using matvecs in total. The total communication is bits. Chakrabati and Regev showed that Gap-Hamming requires bits of communication (for some ) to solve the Gap-Hamming problem with probability. Thus, if , then Alice and Bob fail to solve the Gap-Hamming problem with at least probability. Thus,
The contrapositive of this statement is that if
Say Alice and Bob run BestTraceAlgorithm with parameter . Then, by (3) and Markov’s inequality,
Therefore, BestTraceAlgorithm requires at least
Using the fact that we’ve set , we conclude that any trace estimation algorithm, even BestTraceAlgorithm, requires
In particular, no trace estimation algorithm can achieve mean relative error using even matvecs. This proves the MMMW theorem.