I am delighted to share that me, Joel A. Tropp, and Robert J. Webber‘s paper XTrace: Making the Most of Every Sample in Stochastic Trace Estimation has recently been released as a preprint on arXiv. In it, we consider the implicit trace estimation problem:
Implicit trace estimation problem: Given access to a square matrix via the matrix–vector product operation , estimate its trace .
Algorithms for this task have many uses such as log-determinant computations in machine learning, partition function calculations in statistical physics, and generalized cross validation for smoothing splines. I described another application to counting triangles in a large network in a previous blog post.
Our paper presents new trace estimators XTrace and XNysTrace which are highly efficient, producing accurate trace approximations using a small budget of matrix–vector products. In addition, these algorithms are fast to run and are supported by theoretical results which explain their excellent performance. I really hope that you will check out the paper to learn more about these estimators!
For the rest of this post, I’m going to talk about the most basic stochastic trace estimation algorithm, the Girard–Hutchinson estimator. This seemingly simple algorithm exhibits a number of nuances and forms the backbone for more sophisticated trace estimates such as Hutch++, Nyström++, XTrace, and XNysTrace. Toward the end, this blog post will be fairly mathematical, but I hope that the beginning will be fairly accessible to all.
Girard–Hutchinson Estimator: The Basics
The Girard–Hutchinson estimator for the trace of a square matrix is
Here, are random vectors, usually chosen to be statistically independent, and denotes the conjugate transpose of a vector or matrix. The Girard–Hutchinson estimator only depends on the matrix through the matrix–vector products .
Provided the random vectors are isotropic
the Girard–Hutchinson estimator is unbiased:
Let us confirm this claim in some detail. First, we use linearity of expectation to evaluate
Therefore, to prove that , it is sufficient to prove that for each .
When working with traces, there are two tricks that solve 90% of derivations. The first trick is that, if we view a number as a matrix, then a number equals its trace, . The second trick is the cyclic property: For a matrix and a matrix , we have . The cyclic property should be handled with care when one works with a product of three or more matrices. For instance, we have
One should think of the matrix product as beads on a closed loop of string. One can move the last bead to the front of the other two, , but not interchange two beads, .
With this trick in hand, let’s return to proving that for every . Apply our two tricks:
The expectation is a linear operation and the matrix is non-random, so we can bring the expectation into the trace as
Invoke the isotropy condition (2) and conclude:
Plugging this into (4) confirms the unbiasedness claim (3).
Continue to assume that the ‘s are isotropic (3) and now assume that are independent. By independence, the variance can be written as
Assuming that are identically distributed , we then get
The variance decreases like , which is characteristic of Monte Carlo-type algorithms. Since is unbiased (i.e, (3)), this means that the mean square error decays like so the average error (more precisely root-mean-square error) decays like
This type of convergence is very slow. If I want to decrease the error by a factor of , I must do the work!
Variance-reduced trace estimators like Hutch++ and our new trace estimator XTrace improve the rate of convergence substantially. Even in the worst case, Hutch++ and XTrace reduce the variance at a rate and (root-mean-square) error at rates :
For matrices with rapidly decreasing singular values, the variance and error can decrease much faster than this.
As the rate of convergence for the Girard–Hutchinson estimator is so slow, it is imperative to pick a distribution on test vectors that makes the variance of the single–sample estimate as low as possible. In this section, we will provide several explicit formulas for the variance of the Girard–Hutchinson estimator. Derivations of these formulas will appear at the end of this post. These variance formulas help illuminate the benefits and drawbacks of different test vector distributions.
To express the formulas, we will need some notation. For a complex number we use and to denote the real and imaginary parts. The variance of a random complex number is
The Frobenius norm of a matrix is
If is real symmetric or complex Hermitian with (real) eigenvalues , we have
denotes the ordinary transpose of and denotes the conjugate transpose of .
Real-Valued Test Vectors
We first focus on real-valued test vectors . Since is real, we can use the ordinary transpose rather than the conjugate transpose . Since is a number, it is equal to its own transpose:
The Girard–Hutchinson trace estimator applied to is the same as the Girard–Hutchinson estimator applied to the symmetric part of , .
For the following results, assume is symmetric, .
- Real Gaussian: are independent standard normal random vectors.
- Uniform signs (Rademachers): are independent random vectors with uniform coordinates.
- Real sphere: Assume are uniformly distributed on the real sphere of radius : .
These formulas continue to hold for nonsymmetric by replacing by its symmetric part on the right-hand sides of these variance formulas.
Complex-Valued Test Vectors
We now move our focus to complex-valued test vectors . As a rule of thumb, one should typically expect that the variance for complex-valued test vectors applied to a real symmetric matrix is about half the natural real counterpart—e.g., for complex Gaussians, you get about half the variance than with real Gaussians.
A square complex matrix has a Cartesian decomposition
denote the Hermitian and skew-Hermitian parts of . Similar to how the imaginary part of a complex number is real, the skew-Hermitian part of a complex matrix is Hermitian (and is skew-Hermitian). Since and are both Hermitian, we have
Consequently, the variance of can be broken into Hermitian and skew-Hermitian parts:
For this reason, we will state the variance formulas only for Hermitian , with the formula for general following from the Cartesian decomposition.
For the following results, assume is Hermitian, .
- Complex Gaussian: are independent standard complex random vectors, i.e., each has iid entries distributed as for standard normal random variables.
- Uniform phases (Steinhauses): are independent random vectors whose entries are uniform on the complex unit circle .
- Complex sphere: Assume are uniformly distributed on the complex sphere of radius : .
Let us finally address the question of what the best choice of test vectors is for the Girard–Hutchinson estimator. We will state two results with different restrictions on .
Our first result, due to Hutchinson, is valid for real symmetric matrices with real test vectors.
Optimality (independent test vectors with independent coordinates). If the test vectors are isotropic (2), independent from each other, and have independent entries, then for any fixed real symmetric matrix , the minimum variance for is obtained when are populated with random signs .
The next optimality results will have real and complex versions. To present the results for -valued and an -valued test vectors on unified footing, let denote either or . We let a -Hermitian matrix be either a real symmetric matrix (if ) or a complex Hermitian matrix (if ). Let a -unitary matrix be either a real orthogonal matrix (if ) or a complex unitary matrix (if ).
The condition that the vectors have independent entries is often too restrictive in practice. It rules out, for instance, the case of uniform vectors on the sphere. If we relax this condition, we get a different optimal distribution:
Optimality (independent test vectors). Consider any set of -Hermitian matrices which is invariant under -unitary similary transformations:
Assume that the test vectors are independent and isotropic (2). The worst-case variance is minimized by choosing uniformly on the -sphere: .
More simply, if you wants your stochastic trace estimator to be effective for a class of inputs (closed under -unitary similarity transformations) rather than a single input matrix , then the best distribution are test vectors drawn uniformly from the sphere. Examples of classes of matrices include:
- Fixed eigenvalues. For fixed real eigenvalues , the set of all -Hermitian matrices with these eigenvalues.
- Density matrices. The class of all trace-one psd matrices.
- Frobenius norm ball. The class of all -Hermitian matrices of Frobenius norm at most 1.
Derivation of Formulas
In this section, we provide derivations of the variance formulas. I have chosen to focus on derivations which are shorter but use more advanced techniques rather than derivations which are longer but use fewer tricks.
First assume is real. Since is real symmetric, has an eigenvalue decomposition , where is orthogonal and is a diagonal matrix reporting ‘s eigenvalues. Since the real Gaussian distribution is invariant under orthogonal transformations, has the same distribution as . Therefore,
Here, we used that the variance of a squared standard normal random variable is two.
For non-real matrix, we can break the matrix into its entrywise real and imaginary parts . Thus,
For a vector of uniform random signs, we have for every , so the second sum vanishes. Note that we have assumed symmetric, so the sum over can be replaced by two times the sum over :
Note that are pairwise independent. As a simple exercise, one can verify that the identity
holds for any pairwise independent family of random variances and numbers . Ergo,
In the second-to-last line, we use the fact that is a uniform random sign, which has variance . The final line is a consequence of the symmetry of .
Uniform on the Real Sphere
The simplest proof is I know is by the “camel principle”. Here’s the story (a lightly edited quotation from MathOverflow):
A father left 17 camels to his three sons and, according to the will, the eldest son was to be given a half of the camels, the middle son one-third, and the youngest son the one-ninth. The sons did not know what to do since 17 is not evenly divisible into either two, three, or nine parts, but a wise man helped the sons: he added his own camel, the oldest son took camels, the second son took camels, the third son camels and the wise man took his own camel and went away.
We are interested in a vector which is uniform on the sphere of radius . Performing averages on the sphere is hard, so we add a camel to the problem by “upgrading” to a spherically symmetric vector which has a random length. We want to pick a distribution for which the computation is easy. Fortunately, we already know such a distribution, the Gaussian distribution, for which we already calculated .
The Gaussian vector and the uniform vector on the sphere are related by
where is the squared length of the Gaussian vector . In particular, has the distribution of the sum of squared Gaussian random variables, which is known as a random variable with degrees of freedom.
Now, we take the camel back. Compute the variance of using the chain rule for variance:
Here, and denote the conditional variance and conditional expectation with respect to the random variable . The quick and dirty ways of working with these are to treat the random variable “like a constant” with respect to the conditional variance and expectation.
Plugging in the formula and treating “like a constant”, we obtain
As we mentioned, is a random variable with degrees of freedom and and are known quantities that can be looked up:
We know and . Plugging these all in, we get
Rearranging, we obtain
The trick is the same as for real Gaussians. By invariance of complex Gaussian random vectors under unitary transformations, we can reduce to the case where is a diagonal matrix populated with eigenvalues . Then
Here, we use the fact that is a random variable with two degrees of freedom, which has variance four.
The trick is the same as for uniform signs. A short calculation (remembering that is Hermitian and thus ) reveals that
The random variables are pairwise independent so we have
Since is uniformly distributed on the complex unit circle, we can assume without loss of generality that . Thus, letting be uniform on the complex unit circle,
The real and imaginary parts of have the same distribution so
so . Thus
Uniform on the Complex Sphere: Derivation 1 by Reduction to Real Case
There are at least three simple ways of deriving this result: the camel trick, reduction to the real case, and Haar integration. Each of these techniques illustrates a trick that is useful in its own right beyond the context of trace estimation. Since we have already seen an example of the camel trick for the real sphere, I will present the other two derivations.
Let us begin with the reduction to the real case. Let and denote the real and imaginary parts of a vector or matrix, taken entrywise. The key insight is that if is a uniform random vector on the complex sphere of radius , then
We’ve converted the complex vector into a real vector .
Now, we need to convert the complex matrix into a real matrix . To do this, recall that one way of representing complex numbers is by matrices:
Using this correspondence addition and multiplication of complex numbers can be carried by addition and multiplication of the corresponding matrices.
To convert complex matrices to real matrices, we use a matrix-version of the same representation:
One can check that addition and multiplication of complex matrices can be carried out by addition and multiplication of the corresponding “realified” matrices, i.e.,
holds for all complex matrices and .
We’ve now converted complex matrix and vector into real matrix and vector . Let’s compare to . A short calculation reveals
Since is a uniform random vector on the sphere of radius , is a uniform random vector on the sphere of radius . Thus, by the variance formula for the real sphere, we get
A short calculation verifies that and . Plugging this in, we obtain
Uniform on the Complex Sphere: Derivation 2 by Haar Integration
The proof by reduction to the real case requires some cumbersome calculations and requires that we have already computed the variance in the real case by some other means. The method of Haar integration is more slick, but it requires some pretty high-power machinery. Haar integration may be a little bit overkill for this problem, but this technique is worth learning as it can handle some truly nasty expected value computations that appear, for example, in quantum information.
We seek to compute
The first trick will be to write this expession using a single matrix trace using the tensor (Kronecker) product . For those unfamiliar with the tensor product, the main properties we will be using are
We saw in the proof of unbiasedness that
Therefore, by (6),
Thus, to evaluate , it will be sufficient to evaluate . Forunately, there is a useful formula for these expectation provided by a field of mathematics known as representation theory (see Lemma 1 in this paper):
Here, is the orthogonal projection onto the space of symmetric two-tensors . Therefore, we have that
To evalute the trace on the right-hand side of this equation, there is another formula (see Lemma 6 in this paper):
Therefore, we conclude
Proof of Optimality Properties
In this section, we provide proofs of the two optimality properties.
Optimality: Independent Vectors with Independent Coordinates
Assume is real and symmetric and suppose that is isotropic (2) with independent coordinates. The isotropy condition
implies that , where is the Kronecker symbol. Using this fact, we compute the second moment:
The variance is minimized by choosing with as small as possible. Since , the smallest possible value for is , which is obtained by populating with random signs.
Optimality: Independent Vectors
This result appears to have first been proven by Richard Kueng in unpublished work. We use an argument suggested to me by Robert J. Webber.
Assume is a class of -Hermitian matrices closed under -unitary similarity transformations and that is an isotropic random vector (2). Decompose the test vector as
First, we shall show that the variance is reduced by replacing with a vector drawn uniformly from the sphere
Note that such a can be generated as for a uniformly random -unitary matrix . Therefore, we have
Now apply Jensen’s inequality only over the randomness in to obtain
Finally, note that since is closed under -unitary similarity transformations, the supremum over for is the same as the supremum of , so we obtain
We have successfully proven (7). This argument is a specialized version of a far more general result which appears as Proposition 4.1 in this paper.
Next, we shall prove
where is still defined as in (8). Indeed, using the chain rule for variance, we obtain
Here, we have used that is uniform on the sphere and thus . By definition, is the length of divided by . Therefore,
Therefore, by Jensen’s inequality,
which proves (9).