I am excited to share that our paper XTrace: Making the most of every sample in stochastic trace estimation has been published in the SIAM Journal on Matrix Analysis and Applications. (See also our paper on arXiv.)
Spurred by this exciting news, I wanted to take the opportunity to share one of my favorite results in randomized numerical linear algebra: a “speed limit” result of Meyer, Musco, Musco, and Woodruff that establishes a fundamental limitation on how accurate any trace estimation algorithm can be.
Let’s back up. Given an unknown square matrix $A$, the trace of $A$, defined to be the sum of its diagonal entries
\[\tr(A) = \sum_{i=1}^n A_{ii},\]
is a quantity we wish to compute. The catch is that we can only access $A$ through matrix–vector products (matvecs): given a vector $x$, we can compute $Ax$.
To simplify things, let’s assume the matrix $A$ is symmetric and positive (semi)definite. The classical algorithm for trace estimation is due to Girard and Hutchinson, producing a probabilistic estimate $\hat{\tr}$ with a small average (relative) error:
\[\expect\left[\frac{|\hat{\tr}-\tr(A)|}{\tr(A)}\right] \le \varepsilon \quad \text{using } m= \frac{\rm const}{\varepsilon^2} \text{ matvecs}.\]
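To make this concrete, here is a minimal sketch of the Girard–Hutchinson estimator (my own illustration, with $A$ stored as an explicit NumPy array even though only matvecs with $A$ are actually needed):

```python
import numpy as np

def girard_hutchinson(A, m):
    """Estimate tr(A) as the average of m samples g^T (A g), where g is
    a standard Gaussian test vector. Each sample is unbiased:
    E[g^T A g] = tr(A). Uses exactly m matvecs with A."""
    n = A.shape[0]
    total = 0.0
    for _ in range(m):
        g = np.random.randn(n)  # random test vector
        total += g @ (A @ g)    # one matvec with A
    return total / m
```

Averaging independent samples shrinks the error at the usual Monte Carlo rate, which is where the $1/\varepsilon^2$ matvec count comes from.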
This state of affairs was greatly improved by Meyer, Musco, Musco, and Woodruff. Building upon previous work, they proposed the Hutch++ algorithm and proved it outputs an estimate $\hat{\tr}$ satisfying the following bound:

(1) \[\expect\left[\frac{|\hat{\tr}-\tr(A)|}{\tr(A)}\right] \le \varepsilon \quad \text{using } m= \frac{\rm const}{\varepsilon} \text{ matvecs}.\]

Our XTrace algorithm satisfies a bound of the same form (1).
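Here is a minimal sketch of the Hutch++ idea (again my own illustration with an explicit matrix; the even split of the $m$ matvecs into thirds follows the paper’s default):

```python
import numpy as np

def hutchpp(A, m):
    """Sketch of Hutch++: use m//3 matvecs to build an orthonormal basis
    Q approximating the top range of A, compute the trace of A on that
    subspace exactly, and apply Girard-Hutchinson to the remainder."""
    n = A.shape[0]
    k = m // 3
    Q, _ = np.linalg.qr(A @ np.random.randn(n, k))  # range finder: k matvecs
    exact = np.trace(Q.T @ (A @ Q))                 # exact part: k matvecs
    G = np.random.randn(n, m - 2 * k)
    G -= Q @ (Q.T @ G)                              # project away range(Q)
    return exact + np.trace(G.T @ (A @ G)) / (m - 2 * k)
```

Because $A$ is psd, the residual $(I-QQ^\top)A(I-QQ^\top)$ has small trace once the top eigenvalues are captured, and estimating a small trace is easier; this is what improves the matvec count from $1/\varepsilon^2$ to $1/\varepsilon$.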
The MMMW Trace Estimation “Speed Limit”
Given the dramatic improvement of Hutch++ and XTrace over Girard–Hutchinson, it is natural to hope: Is there an algorithm that does even better than Hutch++ and XTrace? For instance, is there an algorithm satisfying an even slightly better error bound of the form
\[\expect\left[\frac{|\hat{\tr}-\tr(A)|}{\tr(A)}\right] \le \varepsilon \quad \text{using } m= \frac{\rm const}{\varepsilon^{0.999}} \text{ matvecs}?\]
Let’s add some fine print. Consider an algorithm for the trace estimation problem. Whenever the algorithm wants, it can present a vector $x$ and receive back the matvec $Ax$. The algorithm is allowed to be adaptive: It can use the matvecs $Ax_1,\ldots,Ax_k$ it has already collected to decide which vector $x_{k+1}$ to present next. We measure the cost of the algorithm in terms of the number of matvecs alone, and the algorithm knows nothing about the psd matrix $A$ other than what it learns from matvecs.
One final stipulation:

Simple entries assumption. We assume that the entries of the vectors $x$ presented by the algorithm are real numbers between $-1$ and $1$ with up to $b$ digits after the decimal place.
To get a feel for this simple entries assumption, suppose we set $b = 2$. Then $x = (0.75, -0.99, 0.5)$ would be an allowed input vector, but $x = (0.333, -0.99, 0.5)$ would not be (too many digits after the decimal place). Similarly, $x = (7.5, -0.99, 0.5)$ would not be valid because its entries exceed $1$. The simple entries assumption is reasonable as we typically represent numbers on digital computers by storing a fixed number of digits of accuracy.¹

¹We typically represent numbers on digital computers by floating point numbers, which essentially represent numbers using scientific notation like $7.5 \times 10^{-1}$. For this analysis of trace estimation, we use fixed point numbers like $0.75$ (no powers of ten allowed)!
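As a quick illustration (my own helper, not part of the MMMW setup), the simple entries assumption can be checked mechanically:

```python
from fractions import Fraction

def satisfies_simple_entries(x, b):
    """True if every entry of x lies in [-1, 1] and has at most b digits
    after the decimal place, i.e., 10**b * x_i is an integer."""
    return all(
        abs(xi) <= 1 and (Fraction(str(xi)) * 10**b).denominator == 1
        for xi in x
    )

print(satisfies_simple_entries([0.75, -0.99, 0.5], b=2))   # True
print(satisfies_simple_entries([0.333, -0.99, 0.5], b=2))  # False: 3 digits
print(satisfies_simple_entries([7.5, -0.99, 0.5], b=2))    # False: exceeds 1
```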
With all these stipulations, we are ready to state the “speed limit” for trace estimation proved by Meyer, Musco, Musco, and Woodruff:
Informal theorem (Meyer, Musco, Musco, Woodruff). Under the assumptions above, there is no trace estimation algorithm producing an estimate $\hat{\tr}$ satisfying
\[\expect\left[\frac{|\hat{\tr}-\tr(A)|}{\tr(A)}\right] \le \varepsilon \quad \text{using } m= \frac{\rm const}{\varepsilon^{0.999}} \text{ matvecs}.\]

We will see a slightly sharper version of the theorem below, but this statement captures the essence of the result.
Communication Complexity
To prove the MMMW theorem, we have to take a journey to the beautiful subject of communication complexity. The story is this. Alice and Bob are interested in solving a computational problem together. Alice has her input $x$ and Bob has his input $y$, and they are interested in computing a function $f(x,y)$ of both their inputs.
Unfortunately for the two of them, Alice and Bob are separated by a great distance, and can only communicate by sending single bits (0 or 1) of information over a slow network connection. Every bit of communication is costly. The field of communication complexity is dedicated to determining how efficiently Alice and Bob are able to solve problems of this form.
The Gap-Hamming problem is one example of a problem studied in communication complexity. As inputs, Alice and Bob receive vectors $x$ and $y$ with $+1$ and $-1$ entries from a third party, Eve. Eve promises Alice and Bob that their vectors $x$ and $y$ satisfy one of two conditions:

(2) \[\text{Case 0: } x^\top y \ge \sqrt{n} \quad \text{or} \quad \text{Case 1: } x^\top y \le -\sqrt{n}.\]

Alice and Bob’s goal is to determine which case they are in.
There’s one simple solution to this problem: First, Bob sends his whole input vector $y$ to Alice. Each entry of $y$ takes one of the two values $\pm 1$ and can therefore be communicated in a single bit. Having received $y$, Alice computes $x^\top y$, determines whether they are in case 0 or case 1, and sends Bob a single bit to communicate the answer. This procedure requires $n+1$ bits of communication.
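Here is this basic protocol as a toy simulation (my own code; bits are merely counted, not actually transmitted):

```python
import numpy as np

def naive_gap_hamming(x, y):
    """Bob sends all n entries of y (1 bit each, since each entry is +1
    or -1); Alice computes x^T y and replies with a 1-bit answer.
    Under the promise (2), the sign of x^T y determines the case."""
    n = len(y)
    bits = n + 1                  # n bits from Bob, 1 bit from Alice
    case = 0 if x @ y > 0 else 1  # Alice's decision
    return case, bits
```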
Can Alice and Bob still solve this problem with many fewer than $n$ bits of communication, say $\sqrt{n}$ bits? Unfortunately not. The following theorem of Chakrabarti and Regev shows that roughly $n$ bits of communication are needed to solve this problem:
Theorem (Chakrabarti–Regev). Any algorithm that solves the Gap-Hamming problem, succeeding with at least $2/3$ probability for every pair of inputs $x$ and $y$ (satisfying one of the conditions (2)), must use $\Omega(n)$ bits of communication.
Here, $\Omega(\cdot)$ is big-Omega notation, closely related to big-O notation $\mathcal{O}(\cdot)$ and big-Theta notation $\Theta(\cdot)$. For the less familiar, it can be helpful to interpret $\Omega(n)$, $\mathcal{O}(n)$, and $\Theta(n)$ as all standing for “proportional to $n$”. In plain language, the Chakrabarti–Regev theorem states that there is no algorithm for the Gap-Hamming problem that is much more effective than the basic algorithm in which Bob sends his whole input to Alice (in the sense of requiring asymptotically fewer than $n$ bits of communication).
Reducing Gap-Hamming to Trace Estimation
This whole state of affairs is very sad for Alice and Bob, but what does it have to do with trace estimation? Remarkably, we can use hardness of the Gap-Hamming problem to show there’s no algorithm that fundamentally improves on Hutch++ and XTrace. The argument goes something like this:

1. If there were a trace estimation algorithm fundamentally better than Hutch++ and XTrace, we could use it to solve Gap-Hamming with much fewer than $n$ bits of communication.
2. But no algorithm can solve Gap-Hamming with much fewer than $n$ bits of communication.
3. Therefore, no trace estimation algorithm is fundamentally better than Hutch++ and XTrace.

Step 2 is the work of Chakrabarti and Regev, and step 3 follows logically from 1 and 2. Therefore, we are left to complete step 1 of the argument.
Protocol
Assume we have access to a really good trace estimation algorithm. We will use it to solve the Gap-Hamming problem. For simplicity, assume $n$ is a perfect square. The basic idea is this:

1. Have Alice and Bob reshape their inputs $x$ and $y$ into $\sqrt{n}\times\sqrt{n}$ matrices $X$ and $Y$, and consider (but do not form!) the positive semidefinite matrix
\[A = (X+Y)^\top (X+Y).\]
2. Observe that
\[\tr(A) = \tr(X^\top X) + 2\tr(X^\top Y) + \tr(Y^\top Y) = 2n + 2(x^\top y).\]
Thus, the two cases in (2) can be equivalently written in terms of $\tr(A)$:
(2′) \[\text{Case 0: } \tr(A)\ge 2n + 2\sqrt{n} \quad \text{or} \quad \text{Case 1: } \tr(A) \le 2n-2\sqrt{n}.\]
(A quick numerical check of this identity appears after this list.)
3. By working together, Alice and Bob can implement a trace estimation algorithm. Alice will be in charge of running the algorithm, but Alice and Bob must work together to compute matvecs. (Details below!)
4. Using the output of the trace estimation algorithm, Alice determines whether they are in case 0 or case 1 (i.e., whether $\tr(A)\ge 2n+2\sqrt{n}$ or $\tr(A)\le 2n-2\sqrt{n}$) and sends the result to Bob.
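As promised in step 2, here is a small numerical check of the trace identity (my own sanity check; the value $n = 64$ is arbitrary):

```python
import numpy as np

n = 64                                  # a perfect square
x = np.random.choice([-1, 1], size=n)
y = np.random.choice([-1, 1], size=n)
r = int(np.sqrt(n))
X, Y = x.reshape(r, r), y.reshape(r, r)
A = (X + Y).T @ (X + Y)                 # positive semidefinite by construction
# tr(X^T X) = tr(Y^T Y) = n and tr(X^T Y) = x^T y, so:
assert np.trace(A) == 2 * n + 2 * (x @ y)
```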
To complete this procedure, we just need to show how Alice and Bob can implement the matvec procedure using minimal communication. Suppose Alice and Bob want to compute $Az$ for some vector $z$ with entries between $-1$ and $1$ with up to $b$ decimal digits. First, convert $z$ to a vector $w = 10^b \cdot z$ whose entries are integers between $-10^b$ and $10^b$. Since $w = 10^b \cdot z$, interconverting between $z$ and $w$ is trivial. Alice and Bob’s procedure for computing $Az$ is as follows (a code simulation appears after the list):
1. Alice sends Bob $w$.
2. Having received $w$, Bob forms $Yw$ and sends it to Alice.
3. Having received $Yw$, Alice computes $v = (X+Y)w = Xw + Yw$ and sends it to Bob.
4. Having received $v$, Bob computes $Y^\top v$ and sends it to Alice.
5. Alice forms $Aw = X^\top v + Y^\top v$ and recovers $Az = 10^{-b} \cdot Aw$.
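In code, the protocol looks like this (a simulation of mine in which Alice’s and Bob’s computations are collapsed into one function; in the real protocol, each intermediate vector is a transmitted message):

```python
import numpy as np

def distributed_matvec(X, Y, z, b):
    """Simulate Alice and Bob's protocol for computing A z, where
    A = (X + Y)^T (X + Y), Alice holds X, and Bob holds Y. Every
    transmitted vector (w, Yw, v, Y^T v) has integer entries."""
    z = np.asarray(z, dtype=float)
    w = np.rint(z * 10**b).astype(np.int64)  # integer version of z
    Yw = Y @ w                 # Bob: compute Y w, send to Alice
    v = X @ w + Yw             # Alice: v = (X + Y) w, send to Bob
    Ytv = Y.T @ v              # Bob: compute Y^T v, send to Alice
    Aw = X.T @ v + Ytv         # Alice: A w = (X + Y)^T v
    return Aw / 10**b          # undo the scaling: A z = 10^{-b} A w
```

One can check that the output matches `((X + Y).T @ (X + Y)) @ z` whenever $z$ satisfies the simple entries assumption.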
Because $X$ and $Y$ are $\sqrt{n}\times\sqrt{n}$ and have $\pm 1$ entries, all vectors computed in this procedure are vectors of length $\sqrt{n}$ with integer entries between $-2\cdot 10^b n$ and $2\cdot 10^b n$. Each such entry can be communicated using $\mathcal{O}(b + \log n)$ bits, so we conclude that the communication cost for one matvec is $T = \mathcal{O}(\sqrt{n}\,(b+\log n))$ bits.
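For concreteness, here is a rough version of this bit count (my own accounting, assuming one sign bit plus magnitude bits per entry and the four transmitted vectors above):

```python
import math

def matvec_comm_bits(n, b):
    """Upper bound on bits exchanged per matvec: four length-sqrt(n)
    integer vectors, each entry of magnitude at most 2 * 10**b * n."""
    entry_bits = 1 + math.ceil(math.log2(2 * 10**b * n + 1))  # sign + magnitude
    return 4 * math.isqrt(n) * entry_bits  # Theta(sqrt(n) * (b + log n))
```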
Analysis
Consider an algorithm we’ll call BestTraceAlgorithm. Given any accuracy parameter $\varepsilon > 0$, BestTraceAlgorithm requires at most $m = m(\varepsilon)$ matvecs and, for any positive semidefinite input matrix $A$ of any size, produces an estimate $\hat{\tr}$ satisfying

(3) \[\expect\left[\frac{|\hat{\tr}-\tr(A)|}{\tr(A)}\right] \le \varepsilon.\]
To solve the Gap-Hamming problem, Alice and Bob just need enough accuracy in their trace estimation to distinguish between cases 0 and 1. In particular, if
\[\left| \frac{\hat{\tr} - \tr(A)}{\tr(A)} \right| \le \frac{1}{\sqrt{n}},\]
then, since $\tr(A) \approx 2n$, the estimate $\hat{\tr}$ lands within roughly $2\sqrt{n}$ of $\tr(A)$, which is accurate enough to tell cases 0 and 1 in (2′) apart.
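Concretely, Alice’s decision rule can just threshold the estimate at the midpoint $2n$ (a sketch of mine):

```python
def decide_case(tr_estimate, n):
    """Case 0 means tr(A) >= 2n + 2 sqrt(n) and case 1 means
    tr(A) <= 2n - 2 sqrt(n), so an estimate accurate to within
    the gap is classified correctly by thresholding at 2n."""
    return 0 if tr_estimate >= 2 * n else 1
```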
Suppose that Alice and Bob apply trace estimation to solve the Gap-Hamming problem, using $m$ matvecs in total. The total communication is $mT$ bits. Chakrabarti and Regev showed that $cn$ bits of communication (for some constant $c > 0$) are needed to solve the Gap-Hamming problem with $2/3$ probability. Thus, if $mT < cn$, then Alice and Bob fail to solve the Gap-Hamming problem with at least $1/3$ probability. Thus,
\[\text{If } m < \frac{cn}{T} = \Theta\left( \frac{\sqrt{n}}{b+\log n} \right), \quad \text{then } \left| \frac{\hat{\tr} - \tr(A)}{\tr(A)} \right| > \frac{1}{\sqrt{n}} \text{ with probability at least } \frac{1}{3}.\]
Taking the contrapositive:
\[\text{If }\left| \frac{\hat{\tr} - \tr(A)}{\tr(A)} \right| \le \frac{1}{\sqrt{n}}\text{ with probability at least } \frac{2}{3}, \quad \text{then } m \ge \Theta\left( \frac{\sqrt{n}}{b+\log n} \right).\]
Say Alice and Bob run BestTraceAlgorithm with parameter $\varepsilon = 1/(3\sqrt{n})$. Then, by (3) and Markov’s inequality,
\[\left| \frac{\hat{\tr} - \tr(A)}{\tr(A)} \right| \le \frac{1}{\sqrt{n}} \quad \text{with probability at least }\frac{2}{3}.\]
Therefore, by the contrapositive statement above, BestTraceAlgorithm must use
\[m \ge \Theta\left( \frac{\sqrt{n}}{b+\log n} \right) \text{ matvecs}.\]
Substituting $\sqrt{n} = \Theta(1/\varepsilon)$ and $\log n = \Theta(\log(1/\varepsilon))$, this lower bound reads
\[m \ge \Theta\left( \frac{1}{\varepsilon\,(b+\log(1/\varepsilon))} \right).\]
In other words, up to the factor $b + \log(1/\varepsilon)$, any trace estimation algorithm needs at least $\mathrm{const}/\varepsilon$ matvecs to achieve expected relative error $\varepsilon$. In particular, there is no algorithm satisfying
\[\expect\left[\frac{|\hat{\tr}-\tr(A)|}{\tr(A)}\right] \le \varepsilon \quad \text{using } m= \frac{\rm const}{\varepsilon^{0.999}} \text{ matvecs}.\]
Hutch++ and XTrace are essentially optimal.