# Don’t Use Gaussians in Stochastic Trace Estimation

Suppose we are interested in estimating the trace of an $n \times n$ matrix $A$ that can be accessed only through matrix–vector products $x \mapsto Ax$. The classical method for this purpose is the Girard–Hutchinson estimator

$$\hat{\mathrm{tr}} = \frac{1}{m} \sum_{i=1}^m \omega_i^* A \omega_i,$$

where the vectors $\omega_1, \ldots, \omega_m$ are independent, identically distributed (iid) random vectors satisfying the isotropy condition

$$\mathbb{E}[\omega_i \omega_i^*] = I.$$

Examples of vectors satisfying this condition include standard Gaussian random vectors, vectors of uniform random signs $\pm 1$, and vectors drawn uniformly from the sphere of radius $\sqrt{n}$.
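
Concretely, here is a minimal sketch (my own NumPy code, not from any particular package) of the estimator with each of these isotropic test-vector distributions:

```python
import numpy as np

def girard_hutchinson(matvec, n, m, dist="sphere", rng=None):
    """Estimate tr(A) as (1/m) * sum_i w_i^T (A w_i), where matvec(x) = A @ x."""
    rng = np.random.default_rng(rng)
    if dist == "gaussian":      # standard normal entries
        W = rng.standard_normal((m, n))
    elif dist == "signs":       # uniform random signs (Rademacher)
        W = rng.choice([-1.0, 1.0], size=(m, n))
    elif dist == "sphere":      # uniform on the sphere of radius sqrt(n)
        W = rng.standard_normal((m, n))
        W *= np.sqrt(n) / np.linalg.norm(W, axis=1, keepdims=True)
    else:
        raise ValueError(dist)
    return np.mean([w @ matvec(w) for w in W])
```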

Stochastic trace estimation has a number of applications: log-determinant computations in machine learning, partition function calculations in statistical physics, generalized cross validation for smoothing splines, and triangle counting in large networks. Several improvements to the basic Girard–Hutchinson estimator have been developed recently. I am partial to XTrace, an improved trace estimator that I developed with my collaborators.

This post addresses the question:

Which distribution should be used for the test vectors for stochastic trace estimation?

Since the Girard–Hutchinson estimator is unbiased ($\mathbb{E}[\hat{\mathrm{tr}}] = \mathrm{tr}(A)$), the variance of $\hat{\mathrm{tr}}$ is equal to the mean-square error. Thus, the lowest variance trace estimate is the most accurate. In my previous post on trace estimation, I discussed formulas for the variance of the Girard–Hutchinson estimator with different choices of test vectors. In that post, I stated the formulas for different choices of test vectors (Gaussian, random signs, sphere) and showed how those formulas could be proven.

In this post, I will take the opportunity to editorialize on which distribution to pick. The thesis of this post is as follows:

The sphere distribution is essentially always preferable to the Gaussian distribution for trace estimation.

To explain why, let's focus on the case when $A$ is real and symmetric. (The same principles hold in the general case, but the variance formulas are more delicate to state; see my previous post for the formulas.) Let $\lambda_1, \ldots, \lambda_n$ be the eigenvalues of $A$ and define the eigenvalue mean

$$\bar{\lambda} = \frac{\lambda_1 + \cdots + \lambda_n}{n}.$$

Then the variance of the Girard–Hutchinson estimator with Gaussian vectors is

$$\mathrm{Var}(\hat{\mathrm{tr}}_{\mathrm{Gaussian}}) = \frac{2}{m} \sum_{i=1}^n \lambda_i^2.$$

For vectors drawn from the sphere, we have

$$\mathrm{Var}(\hat{\mathrm{tr}}_{\mathrm{sphere}}) = \frac{2}{m} \cdot \frac{n}{n+2} \sum_{i=1}^n (\lambda_i - \bar{\lambda})^2.$$

The sphere distribution improves on the Gaussian distribution in two ways. First, the variance of $\hat{\mathrm{tr}}_{\mathrm{sphere}}$ is smaller than that of $\hat{\mathrm{tr}}_{\mathrm{Gaussian}}$ by a factor of $n/(n+2)$. This improvement is quite minor. Second, and more importantly, $\mathrm{Var}(\hat{\mathrm{tr}}_{\mathrm{Gaussian}})$ is proportional to the sum of $A$'s squared eigenvalues whereas $\mathrm{Var}(\hat{\mathrm{tr}}_{\mathrm{sphere}})$ is proportional to the sum of $A$'s squared eigenvalues after having been shifted to be mean-zero!

The difference between Gaussian and sphere test vectors can be large. To see this, consider a matrix $A$ with eigenvalues uniformly distributed between $0.9$ and $1.1$ with a (Haar orthogonal) random matrix of eigenvectors. For simplicity, since the variance of all Girard–Hutchinson estimates is proportional to $1/m$, we take $m = 1$. Comparing the variance of the Girard–Hutchinson estimator for different distributions of test vector, we see that the sphere distribution leads to a trace estimate whose variance is 300× smaller than with the Gaussian distribution. For this example, the sphere and random sign distributions are similar.
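
A quick back-of-the-envelope check using the two variance formulas above (a sketch; the eigenvalue range 0.9 to 1.1 is the example just described):

```python
import numpy as np

n = 1000
lam = np.linspace(0.9, 1.1, n)        # eigenvalues of the example matrix
lam_bar = lam.mean()

var_gauss = 2 * np.sum(lam**2)                             # Gaussian test vectors
var_sphere = 2 * n / (n + 2) * np.sum((lam - lam_bar)**2)  # sphere test vectors
print(var_gauss / var_sphere)         # roughly 300
```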

## Which Distribution Should You Use: Signs vs. Sphere

The main point of this post is to argue against using the Gaussian distribution. But which distribution should you use: Random signs? The sphere distribution? The answer, for most applications, is one of those two, but exactly which depends on the properties of the matrix $A$.

The variance of the Girard–Hutchinson estimator with the random signs distribution is

$$\mathrm{Var}(\hat{\mathrm{tr}}_{\mathrm{signs}}) = \frac{4}{m} \sum_{i < j} A_{ij}^2.$$

Thus, $\mathrm{Var}(\hat{\mathrm{tr}}_{\mathrm{signs}})$ depends on the size of the off-diagonal entries of $A$; it does not depend on the diagonal of $A$ at all! For matrices with small off-diagonal entries (such as diagonally dominant matrices), the random signs distribution is often the best.

However, for other problems, the sphere distribution is preferable to random signs. The sphere distribution is rotation-invariant, so $\mathrm{Var}(\hat{\mathrm{tr}}_{\mathrm{sphere}})$ is independent of the eigenvectors of the (symmetric) matrix $A$, depending only on $A$'s eigenvalues. By contrast, the variance of the Girard–Hutchinson estimator with the random signs distribution can depend significantly on the eigenvectors of the matrix $A$. For a given set of eigenvalues and the worst-case choice of eigenvectors, $\mathrm{Var}(\hat{\mathrm{tr}}_{\mathrm{sphere}})$ will always be smaller than $\mathrm{Var}(\hat{\mathrm{tr}}_{\mathrm{signs}})$. In fact, the sphere distribution is the minimum-variance distribution for Girard–Hutchinson trace estimation for a matrix with fixed eigenvalues and worst-case eigenvectors; see this section of my previous post for details.
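
As an illustration (my own toy example, not from the original experiments), compare the two variance formulas on a diagonal matrix and on a rotated copy with the same eigenvalues:

```python
import numpy as np

def var_signs(A):   # (m = 1) variance with random sign test vectors
    return 4 * np.sum(np.triu(A, 1) ** 2)

def var_sphere(A):  # (m = 1) variance with test vectors uniform on the sphere
    n = A.shape[0]
    return 2 * n / (n + 2) * (np.sum(A**2) - np.trace(A)**2 / n)

rng = np.random.default_rng(0)
n = 100
D = np.diag(np.linspace(-1, 1, n))               # diagonal: no off-diagonal entries
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
R = Q @ D @ Q.T                                  # same eigenvalues, generic eigenvectors

print(var_signs(D), var_sphere(D))   # 0.0 vs > 0: random signs are exact here
print(var_signs(R), var_sphere(R))   # sphere unchanged; signs now comparable or worse
```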

In my experience, random signs and the sphere distribution are both perfectly adequate for trace estimation and either is a sensible default if you’re developing software. The Gaussian distribution on the other hand… don’t use it unless you have a good reason to.

# How Good Can Stochastic Trace Estimates Be?

I am excited to share that our paper XTrace: Making the most of every sample in stochastic trace estimation has been published in the SIAM Journal on Matrix Analysis and Applications. (See also our paper on arXiv.)

Spurred by this exciting news, I wanted to take the opportunity to share one of my favorite results in randomized numerical linear algebra: a “speed limit” result of Meyer, Musco, Musco, and Woodruff that establishes a fundamental limitation on how accurate any trace estimation algorithm can be.

Let's back up. Given an unknown square matrix $A$, we wish to estimate its trace, defined to be the sum of its diagonal entries:

$$\mathrm{tr}(A) = \sum_{i=1}^n A_{ii}.$$

The catch? We assume that we can only access the matrix $A$ through matrix–vector products (affectionately known as "matvecs"): Given any vector $x$, we have access to $Ax$. Our goal is to form an estimate $\hat{\mathrm{tr}} \approx \mathrm{tr}(A)$ that is as accurate as possible while using as few matvecs as we can get away with.

To simplify things, let's assume the matrix $A$ is symmetric and positive (semi)definite. The classical algorithm for trace estimation is due to Girard and Hutchinson, producing a probabilistic estimate $\hat{\mathrm{tr}}$ with a small average (relative) error:

$$\mathbb{E} \left| \hat{\mathrm{tr}} - \mathrm{tr}(A) \right| \le \varepsilon \, \mathrm{tr}(A) \quad \text{using } O(1/\varepsilon^2) \text{ matvecs}.$$

If one wants high accuracy, this algorithm is expensive. To achieve just a 1% error ($\varepsilon = 0.01$) requires roughly 10,000 matvecs!

This state of affairs was greatly improved by Meyer, Musco, Musco, and Woodruff. Building upon previous work, they proposed the Hutch++ algorithm and proved it outputs an estimate $\hat{\mathrm{tr}}$ satisfying the following bound:

$$\mathbb{E} \left| \hat{\mathrm{tr}} - \mathrm{tr}(A) \right| \le \varepsilon \, \mathrm{tr}(A) \quad \text{using } O(1/\varepsilon) \text{ matvecs}. \tag{1}$$

Now, we only require roughly 100 matvecs to achieve 1% error! Our algorithm, XTrace, satisfies the same error guarantee (1) as Hutch++. On certain problems, XTrace can be quite a bit more accurate than Hutch++.
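
For intuition, here is a minimal sketch of the standard Hutch++ template (my code, not the authors' reference implementation): spend a third of the matvec budget sketching the range of $A$, account for the trace on that subspace exactly, and run Girard–Hutchinson on the deflated remainder.

```python
import numpy as np

def hutchpp(matvec, n, m, rng=None):
    """Estimate tr(A) for psd A using roughly m matvecs. matvec(X) = A @ X."""
    rng = np.random.default_rng(rng)
    k = m // 3
    S = rng.standard_normal((n, k))              # sketching vectors
    G = rng.standard_normal((n, k))              # Girard-Hutchinson vectors
    Q, _ = np.linalg.qr(matvec(S))               # orthonormal basis for a sketch of range(A)
    t_low_rank = np.trace(Q.T @ matvec(Q))       # exact trace on the captured subspace
    G = G - Q @ (Q.T @ G)                        # project out the captured subspace
    t_residual = np.trace(G.T @ matvec(G)) / k   # Girard-Hutchinson on the remainder
    return t_low_rank + t_residual
```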

## The MMMW Trace Estimation “Speed Limit”

Given the dramatic improvement of Hutch++ and XTrace over Girard–Hutchinson, it is natural to hope: Is there an algorithm that does even better than Hutch++ and XTrace? For instance, is there an algorithm satisfying an even slightly better error bound of the form

$$\mathbb{E} \left| \hat{\mathrm{tr}} - \mathrm{tr}(A) \right| \le \varepsilon \, \mathrm{tr}(A) \quad \text{using } O(1/\varepsilon^{0.99}) \text{ matvecs}?$$

Unfortunately not. Hutch++ and XTrace are essentially as good as it gets.

Let's add some fine print. Consider an algorithm for the trace estimation problem. Whenever the algorithm wants, it can present a vector $x$ and receive back $Ax$. The algorithm is allowed to be adaptive: It can use the matvecs it has already collected to decide which vector to present next. We measure the cost of the algorithm in terms of the number of matvecs alone, and the algorithm knows nothing about the psd matrix $A$ other than what it learns from matvecs.

One final stipulation:

Simple entries assumption. We assume that the entries of the vectors presented by the algorithm are real numbers between $-1$ and $1$ with up to $b$ digits after the decimal place.

To get a feel for this simple entries assumption, suppose we set $b = 2$. Then a vector like $(0.11, -0.83)$ would be an allowed input vector, but $(0.111, -0.833)$ would not be (too many digits after the decimal place). Similarly, $(1.5, -0.83)$ would not be valid because its entries exceed $1$ in magnitude. The simple entries assumption is reasonable as we typically represent numbers on digital computers by storing a fixed number of digits of accuracy. (We typically represent numbers on digital computers by floating point numbers, which essentially represent numbers using scientific notation like $1.23 \times 10^{5}$. For this analysis of trace estimation, we use fixed point numbers like $0.12$, with no powers of ten allowed!)
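
A small helper makes the assumption concrete (my illustrative sketch; because of floating-point round-off, the digit test should be read as approximate):

```python
import numpy as np

def has_simple_entries(x, b):
    """Entries in [-1, 1] with at most b digits after the decimal place."""
    x = np.asarray(x)
    scaled = x * 10**b
    return bool(np.all(np.abs(x) <= 1) and np.allclose(scaled, np.round(scaled)))

print(has_simple_entries([0.11, -0.83], 2))   # True
print(has_simple_entries([0.111, -0.833], 2)) # False: too many digits
print(has_simple_entries([1.5, -0.83], 2))    # False: entry exceeds 1
```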

With all these stipulations, we are ready to state the “speed limit” for trace estimation proved by Meyer, Musco, Musco, and Woodruff:

Informal theorem (Meyer, Musco, Musco, Woodruff). Under the assumptions above, there is no trace estimation algorithm producing an estimate $\hat{\mathrm{tr}}$ satisfying

$$\mathbb{E} \left| \hat{\mathrm{tr}} - \mathrm{tr}(A) \right| \le \varepsilon \, \mathrm{tr}(A) \quad \text{using fewer than roughly } 1/\varepsilon \text{ matvecs}.$$

We will see a slightly sharper version of the theorem below, but this statement captures the essence of the result.

## Communication Complexity

To prove the MMMW theorem, we have to take a journey to the beautiful subject of communication complexity. The story is this. Alice and Bob are interested in solving a computational problem together. Alice has her input $x$ and Bob has his input $y$, and they are interested in computing a function $f(x, y)$ of both their inputs.

Unfortunately for the two of them, Alice and Bob are separated by a great distance, and can only communicate by sending single bits (0 or 1) of information over a slow network connection. Every bit of communication is costly. The field of communication complexity is dedicated to determining how efficiently Alice and Bob are able to solve problems of this form.

The Gap-Hamming problem is one example of a problem studied in communication complexity. As inputs, Alice and Bob receive vectors $x, y \in \{\pm 1\}^n$ with $\pm 1$ entries from a third party Eve. Eve promises Alice and Bob that their vectors $x$ and $y$ satisfy one of two conditions:

$$\text{case 0: } x^\top y \ge \sqrt{n} \qquad \text{or} \qquad \text{case 1: } x^\top y \le -\sqrt{n}. \tag{2}$$

Alice and Bob must work together, sending as few bits of communication as possible, to determine which case they are in.

There's one simple solution to this problem: First, Bob sends his whole input vector $y$ to Alice. Each entry of $y$ takes one of the two values $\pm 1$ and can therefore be communicated in a single bit. Having received $y$, Alice computes $x^\top y$, determines whether they are in case 0 or case 1, and sends Bob a single bit to communicate the answer. This procedure requires $n + 1$ bits of communication.

Can Alice and Bob still solve this problem with many fewer than $n$ bits of communication, say $\sqrt{n}$ bits? Unfortunately not. The following theorem of Chakrabarti and Regev shows that roughly $n$ bits of communication are needed to solve this problem:

Theorem (Chakrabarti–Regev). Any algorithm which solves the Gap-Hamming problem that succeeds with at least $2/3$ probability for every pair of inputs $x$ and $y$ (satisfying one of the conditions (2)) must take $\Omega(n)$ bits of communication.

Here, $\Omega(\cdot)$ is big-Omega notation, closely related to big-O notation $O(\cdot)$ and big-Theta notation $\Theta(\cdot)$. For the less familiar, it can be helpful to interpret $\Omega(n)$, $O(n)$, and $\Theta(n)$ as all standing for "proportional to $n$". In plain language, the theorem of Chakrabarti and Regev states that there is no algorithm for the Gap-Hamming problem that is much more effective than the basic algorithm where Bob sends his whole input to Alice (in the sense of requiring asymptotically fewer than $n$ bits of communication).

## Reducing Gap-Hamming to Trace Estimation

This whole state of affairs is very sad for Alice and Bob, but what does it have to do with trace estimation? Remarkably, we can use hardness of the Gap-Hamming problem to show there’s no algorithm that fundamentally improves on Hutch++ and XTrace. The argument goes something like this:

1. If there were a trace estimation algorithm fundamentally better than Hutch++ and XTrace, we could use it to solve Gap-Hamming with asymptotically fewer than $n$ bits of communication.
2. But no algorithm can solve Gap-Hamming with asymptotically fewer than $n$ bits of communication.
3. Therefore, no trace estimation algorithm is fundamentally better than Hutch++ and XTrace.

Step 2 is the work of Chakrabarti and Regev, and step 3 follows logically from 1 and 2. Therefore, we are left to complete step 1 of the argument.

### Protocol

Assume we have access to a really good trace estimation algorithm. We will use it to solve the Gap-Hamming problem. For simplicity, assume $n$ is a perfect square. The basic idea is this:

• Have Alice and Bob reshape their inputs $x, y \in \{\pm 1\}^n$ into $\sqrt{n} \times \sqrt{n}$ matrices $B, C \in \{\pm 1\}^{\sqrt{n} \times \sqrt{n}}$, and consider (but do not form!) the positive semidefinite matrix

$$M = (B + C)^\top (B + C).$$

• Observe that

$$\mathrm{tr}(M) = \|B + C\|_F^2 = \|B\|_F^2 + 2\,\mathrm{tr}(B^\top C) + \|C\|_F^2 = 2n + 2\, x^\top y.$$

Thus, the two cases in (2) can be equivalently written in terms of $\mathrm{tr}(M)$:

$$\text{case 0: } \mathrm{tr}(M) \ge 2n + 2\sqrt{n} \qquad \text{or} \qquad \text{case 1: } \mathrm{tr}(M) \le 2n - 2\sqrt{n}. \tag{2′}$$

• By working together, Alice and Bob can implement a trace estimation algorithm. Alice will be in charge of running the algorithm, but Alice and Bob must work together to compute matvecs. (Details below!)
• Using the output of the trace estimation algorithm, Alice determines whether they are in case 0 or 1 (i.e., whether $x^\top y \ge \sqrt{n}$ or $x^\top y \le -\sqrt{n}$) and sends the result to Bob.

To complete this procedure, we just need to show how Alice and Bob can implement the matvec procedure using minimal communication. Suppose Alice and Bob want to compute $Mw$ for some vector $w$ with entries between $-1$ and $1$ with up to $b$ decimal digits. First, convert $w$ to a vector $\tilde{w} = 10^b w$ whose entries are integers between $-10^b$ and $10^b$. Since $\tilde{w} = 10^b w$, interconverting between $w$ and $\tilde{w}$ (and between $Mw$ and $M\tilde{w}$) is trivial. Alice and Bob's procedure for computing $M\tilde{w}$ is as follows:

• Alice sends Bob $\tilde{w}$.
• Having received $\tilde{w}$, Bob forms $C\tilde{w}$ and sends it to Alice.
• Having received $C\tilde{w}$, Alice computes $z = (B + C)\tilde{w} = B\tilde{w} + C\tilde{w}$ and sends it to Bob.
• Having received $z$, Bob computes $C^\top z$ and sends it to Alice.
• Alice forms $M\tilde{w} = (B + C)^\top z = B^\top z + C^\top z$.

Because $B$ and $C$ are $\sqrt{n} \times \sqrt{n}$ and have $\pm 1$ entries, all vectors computed in this procedure are vectors of length $\sqrt{n}$ with integer entries of magnitude at most $2\sqrt{n} \cdot 10^b$. Each such vector can be communicated using $O(\sqrt{n}\,(b + \log n))$ bits, so we conclude the communication cost for one matvec is $O(\sqrt{n}\,(b + \log n))$ bits.
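
The following simulation (based on my reconstruction $M = (B+C)^\top (B+C)$ above) checks that the four-message protocol really computes $M\tilde{w}$, and that $\mathrm{tr}(M) = 2n + 2\,x^\top y$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 16                                   # a perfect square
r = int(np.sqrt(n))
x = rng.choice([-1, 1], n)               # Alice's input
y = rng.choice([-1, 1], n)               # Bob's input
B, C = x.reshape(r, r), y.reshape(r, r)

w = rng.integers(-100, 101, r)           # integer test vector (b = 2, pre-scaled by 10^b)

# The four messages of the protocol:
#   1. Alice -> Bob: w
Cw = C @ w                               # 2. Bob -> Alice: C w
z = B @ w + Cw                           # 3. Alice -> Bob: z = (B + C) w
Ctz = C.T @ z                            # 4. Bob -> Alice: C^T z
Mw = B.T @ z + Ctz                       # Alice forms M w = (B + C)^T (B + C) w

M = (B + C).T @ (B + C)
assert np.array_equal(Mw, M @ w)
assert np.trace(M) == 2 * n + 2 * (x @ y)
```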

### Analysis

Consider an algorithm we'll call BestTraceAlgorithm. Given any accuracy parameter $\varepsilon > 0$, BestTraceAlgorithm requires at most $m = m(\varepsilon)$ matvecs and, for any positive semidefinite input matrix $A$ of any size, produces an estimate $\hat{\mathrm{tr}}$ satisfying

$$\mathbb{E} \left| \hat{\mathrm{tr}} - \mathrm{tr}(A) \right| \le \varepsilon \, \mathrm{tr}(A). \tag{3}$$

We assume that BestTraceAlgorithm is the best possible algorithm in the sense that no algorithm can achieve (3) on all (positive semidefinite) inputs with fewer than $m$ matvecs.

To solve the Gap-Hamming problem, Alice and Bob just need enough accuracy in their trace estimation to distinguish between cases 0 and 1. In particular, if

$$\left| \hat{\mathrm{tr}} - \mathrm{tr}(M) \right| < 2\sqrt{n},$$

then Alice and Bob can distinguish between cases 0 and 1 in (2′).

Suppose that Alice and Bob apply trace estimation to solve the Gap-Hamming problem, using $m$ matvecs in total. The total communication is $O(m\sqrt{n}\,(b + \log n))$ bits. Chakrabarti and Regev showed that at least $cn$ bits of communication (for some constant $c > 0$) are needed to solve the Gap-Hamming problem with at least $2/3$ probability. Thus, if $m$ is much smaller than $\sqrt{n}/(b + \log n)$, then Alice and Bob fail to solve the Gap-Hamming problem with at least $1/3$ probability. Thus,

$$m \lesssim \frac{\sqrt{n}}{b + \log n} \implies \mathbb{P}\left\{ \left| \hat{\mathrm{tr}} - \mathrm{tr}(M) \right| \ge 2\sqrt{n} \right\} \ge \frac{1}{3}.$$

The contrapositive of this statement is that if

$$\mathbb{P}\left\{ \left| \hat{\mathrm{tr}} - \mathrm{tr}(M) \right| \ge 2\sqrt{n} \right\} < \frac{1}{3}, \quad \text{then} \quad m \gtrsim \frac{\sqrt{n}}{b + \log n}.$$

Say Alice and Bob run BestTraceAlgorithm with parameter $\varepsilon = 1/(6\sqrt{n})$. Then, by (3) and Markov's inequality,

$$\mathbb{P}\left\{ \left| \hat{\mathrm{tr}} - \mathrm{tr}(M) \right| \ge 2\sqrt{n} \right\} \le \frac{\mathbb{E}\left| \hat{\mathrm{tr}} - \mathrm{tr}(M) \right|}{2\sqrt{n}} \le \frac{\varepsilon \, \mathrm{tr}(M)}{2\sqrt{n}} \le \frac{4n\,\varepsilon}{2\sqrt{n}} = \frac{1}{3},$$

where the last inequality uses $\mathrm{tr}(M) = 2n + 2\,x^\top y \le 4n$.

Therefore, BestTraceAlgorithm requires at least

$$m \gtrsim \frac{\sqrt{n}}{b + \log n} \text{ matvecs}.$$

Using the fact that we've set $\varepsilon = 1/(6\sqrt{n})$, so that $\sqrt{n}$ is proportional to $1/\varepsilon$, we conclude that any trace estimation algorithm, even BestTraceAlgorithm, requires

$$m \gtrsim \frac{1}{\varepsilon\,(b + \log(1/\varepsilon))} \text{ matvecs}.$$

In particular, no trace estimation algorithm can achieve mean relative error $\varepsilon$ using even $O(1/\varepsilon^{0.99})$ matvecs. This proves the MMMW theorem.

# Stochastic Trace Estimation

I am delighted to share that my paper with Joel A. Tropp and Robert J. Webber, XTrace: Making the Most of Every Sample in Stochastic Trace Estimation, has recently been released as a preprint on arXiv. In it, we consider the implicit trace estimation problem:

Implicit trace estimation problem: Given access to a square matrix $A$ via the matrix–vector product operation $x \mapsto Ax$, estimate its trace $\mathrm{tr}(A) = \sum_{i=1}^n A_{ii}$.

Algorithms for this task have many uses such as log-determinant computations in machine learning, partition function calculations in statistical physics, and generalized cross validation for smoothing splines. I described another application to counting triangles in a large network in a previous blog post.

Our paper presents new trace estimators XTrace and XNysTrace which are highly efficient, producing accurate trace approximations using a small budget of matrix–vector products. In addition, these algorithms are fast to run and are supported by theoretical results which explain their excellent performance. I really hope that you will check out the paper to learn more about these estimators!

For the rest of this post, I'm going to talk about the most basic stochastic trace estimation algorithm, the Girard–Hutchinson estimator. This seemingly simple algorithm exhibits a number of nuances and forms the backbone for more sophisticated trace estimators such as Hutch++, Nyström++, XTrace, and XNysTrace. Toward the end, this blog post will be fairly mathematical, but I hope that the beginning will be accessible to all.

## Girard–Hutchinson Estimator: The Basics

The Girard–Hutchinson estimator for the trace of a square matrix $A$ is

$$\hat{\mathrm{tr}} = \frac{1}{m} \sum_{i=1}^m \omega_i^* A \omega_i. \tag{1}$$

Here, $\omega_1, \ldots, \omega_m$ are random vectors, usually chosen to be statistically independent, and ${}^*$ denotes the conjugate transpose of a vector or matrix. The Girard–Hutchinson estimator only depends on the matrix $A$ through the matrix–vector products $A\omega_1, \ldots, A\omega_m$.

### Unbiasedness

Provided the random vectors $\omega_1, \ldots, \omega_m$ are isotropic

$$\mathbb{E}[\omega_i \omega_i^*] = I, \tag{2}$$

the Girard–Hutchinson estimator is unbiased:

$$\mathbb{E}[\hat{\mathrm{tr}}] = \mathrm{tr}(A). \tag{3}$$

Let us confirm this claim in some detail. First, we use linearity of expectation to evaluate

$$\mathbb{E}[\hat{\mathrm{tr}}] = \frac{1}{m} \sum_{i=1}^m \mathbb{E}[\omega_i^* A \omega_i]. \tag{4}$$

Therefore, to prove that $\mathbb{E}[\hat{\mathrm{tr}}] = \mathrm{tr}(A)$, it is sufficient to prove that $\mathbb{E}[\omega_i^* A \omega_i] = \mathrm{tr}(A)$ for each $i$.

When working with traces, there are two tricks that solve 90% of derivations. The first trick is that, if we view a number as a $1 \times 1$ matrix, then a number equals its trace, $x = \mathrm{tr}(x)$. The second trick is the cyclic property: For a $d_1 \times d_2$ matrix $B$ and a $d_2 \times d_1$ matrix $C$, we have $\mathrm{tr}(BC) = \mathrm{tr}(CB)$. The cyclic property should be handled with care when one works with a product of three or more matrices. For instance, we have

$$\mathrm{tr}(BCD) = \mathrm{tr}(DBC) = \mathrm{tr}(CDB).$$

However,

$$\mathrm{tr}(BCD) \ne \mathrm{tr}(BDC) \text{ in general}.$$

One should think of the matrix product $BCD$ as beads on a closed loop of string. One can move the last bead to the front of the other two, $\mathrm{tr}(BCD) = \mathrm{tr}(DBC)$, but not interchange two beads, $\mathrm{tr}(BCD) \ne \mathrm{tr}(BDC)$.

With these tricks in hand, let's return to proving that $\mathbb{E}[\omega_i^* A \omega_i] = \mathrm{tr}(A)$ for every $i$. Apply our two tricks:

$$\mathbb{E}[\omega_i^* A \omega_i] = \mathbb{E}[\mathrm{tr}(\omega_i^* A \omega_i)] = \mathbb{E}[\mathrm{tr}(A \omega_i \omega_i^*)].$$

The expectation is a linear operation and the matrix $A$ is non-random, so we can bring the expectation into the trace as

$$\mathbb{E}[\mathrm{tr}(A \omega_i \omega_i^*)] = \mathrm{tr}(A \, \mathbb{E}[\omega_i \omega_i^*]).$$

Invoke the isotropy condition (2) and conclude:

$$\mathrm{tr}(A \, \mathbb{E}[\omega_i \omega_i^*]) = \mathrm{tr}(A \cdot I) = \mathrm{tr}(A).$$

Plugging this into (4) confirms the unbiasedness claim (3).

### Variance

Continue to assume that the $\omega_i$'s are isotropic (2) and now assume that $\omega_1, \ldots, \omega_m$ are independent. By independence, the variance can be written as

$$\mathrm{Var}(\hat{\mathrm{tr}}) = \frac{1}{m^2} \sum_{i=1}^m \mathrm{Var}(\omega_i^* A \omega_i).$$

Assuming that $\omega_1, \ldots, \omega_m$ are identically distributed ($\omega_i \sim \omega$), we then get

$$\mathrm{Var}(\hat{\mathrm{tr}}) = \frac{1}{m} \mathrm{Var}(\omega^* A \omega).$$

The variance decreases like $1/m$, which is characteristic of Monte Carlo-type algorithms. Since $\hat{\mathrm{tr}}$ is unbiased (i.e., (3)), this means that the mean square error decays like $1/m$, so the average error (more precisely, the root-mean-square error) decays like $1/\sqrt{m}$.

This type of convergence is very slow. If I want to decrease the error by a factor of 10, I must do 100× the work!
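
Here is a quick numerical illustration of this slow convergence (my own sketch; the test matrix and trial count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
A = rng.standard_normal((n, n)); A = (A + A.T) / 2   # a random symmetric matrix

for m in (100, 10_000):
    errs = []
    for _ in range(100):
        W = rng.standard_normal((n, m))              # columns are Gaussian test vectors
        est = np.sum(W * (A @ W)) / m                # (1/m) sum_i w_i^T A w_i
        errs.append(abs(est - np.trace(A)))
    print(m, np.mean(errs))                          # 100x the work, only ~10x less error
```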

Variance-reduced trace estimators like Hutch++ and our new trace estimator XTrace improve the rate of convergence substantially. Even in the worst case, Hutch++ and XTrace reduce the variance at a rate $O(1/m^2)$ and the (root-mean-square) error at a rate $O(1/m)$. For matrices with rapidly decreasing singular values, the variance and error can decrease much faster than this.

## Variance Formulas

As the rate of convergence for the Girard–Hutchinson estimator is so slow, it is imperative to pick a distribution on test vectors that makes the variance of the single-sample estimate $\omega^* A \omega$ as low as possible. In this section, we will provide several explicit formulas for the variance of the Girard–Hutchinson estimator. Derivations of these formulas will appear at the end of this post. These variance formulas help illuminate the benefits and drawbacks of different test vector distributions.

To express the formulas, we will need some notation. For a complex number $z = a + bi$, we use $\Re z = a$ and $\Im z = b$ to denote the real and imaginary parts. The variance of a random complex number $z$ is

$$\mathrm{Var}(z) = \mathbb{E} |z - \mathbb{E} z|^2 = \mathrm{Var}(\Re z) + \mathrm{Var}(\Im z).$$

The Frobenius norm of a matrix $A$ is

$$\|A\|_F^2 = \sum_{i,j} |A_{ij}|^2.$$

If $A$ is real symmetric or complex Hermitian with (real) eigenvalues $\lambda_1, \ldots, \lambda_n$, we have

$$\|A\|_F^2 = \sum_{i=1}^n \lambda_i^2. \tag{5}$$

$A^\top$ denotes the ordinary transpose of $A$ and $A^*$ denotes the conjugate transpose of $A$.

### Real-Valued Test Vectors

We first focus on real-valued test vectors $\omega \in \mathbb{R}^n$. Since $\omega$ is real, we can use the ordinary transpose ${}^\top$ rather than the conjugate transpose ${}^*$. Since $\omega^\top A \omega$ is a number, it is equal to its own transpose:

$$\omega^\top A \omega = (\omega^\top A \omega)^\top = \omega^\top A^\top \omega.$$

Therefore,

$$\omega^\top A \omega = \frac{\omega^\top A \omega + \omega^\top A^\top \omega}{2} = \omega^\top \left( \frac{A + A^\top}{2} \right) \omega.$$

The Girard–Hutchinson trace estimator applied to $A$ is the same as the Girard–Hutchinson estimator applied to the symmetric part of $A$, $(A + A^\top)/2$.

For the following results, assume $A$ is symmetric, $A = A^\top$.

1. Real Gaussian: $\omega_i$ are independent standard normal random vectors.

$$\mathrm{Var}(\hat{\mathrm{tr}}) = \frac{2}{m} \|A\|_F^2.$$

2. Uniform signs (Rademachers): $\omega_i$ are independent random vectors with uniform $\pm 1$ coordinates.

$$\mathrm{Var}(\hat{\mathrm{tr}}) = \frac{4}{m} \sum_{i < j} A_{ij}^2.$$

3. Real sphere: Assume $\omega_i$ are uniformly distributed on the real sphere of radius $\sqrt{n}$: $\omega_i \sim \mathrm{Unif}\{x \in \mathbb{R}^n : \|x\| = \sqrt{n}\}$.

$$\mathrm{Var}(\hat{\mathrm{tr}}) = \frac{2}{m} \cdot \frac{n}{n+2} \left( \|A\|_F^2 - \frac{(\mathrm{tr}\,A)^2}{n} \right).$$

These formulas continue to hold for nonsymmetric $A$ by replacing $A$ by its symmetric part $(A + A^\top)/2$ on the right-hand sides of these variance formulas.
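
These formulas are easy to sanity-check by Monte Carlo; here is a small sketch (the matrix and trial count are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 10, 500_000
A = rng.standard_normal((n, n)); A = (A + A.T) / 2   # random symmetric matrix

gauss = rng.standard_normal((trials, n))
signs = rng.choice([-1.0, 1.0], (trials, n))
sphere = rng.standard_normal((trials, n))
sphere *= np.sqrt(n) / np.linalg.norm(sphere, axis=1, keepdims=True)

frob2, tr = np.sum(A**2), np.trace(A)
predictions = {
    "gaussian": 2 * frob2,
    "signs": 4 * np.sum(np.triu(A, 1) ** 2),
    "sphere": 2 * n / (n + 2) * (frob2 - tr**2 / n),
}
for name, W in [("gaussian", gauss), ("signs", signs), ("sphere", sphere)]:
    empirical = np.var(np.einsum("ti,ij,tj->t", W, A, W))  # Var(w^T A w)
    print(name, empirical, predictions[name])
```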

### Complex-Valued Test Vectors

We now move our focus to complex-valued test vectors $\omega \in \mathbb{C}^n$. As a rule of thumb, one should typically expect that the variance for complex-valued test vectors applied to a real symmetric matrix is about half that of the natural real counterpart. For example, with complex Gaussians, you get about half the variance of real Gaussians.

A square complex matrix $A$ has a Cartesian decomposition

$$A = H + iS,$$

where

$$H = \frac{A + A^*}{2}, \qquad S = \frac{A - A^*}{2i}$$

denote the Hermitian and skew-Hermitian parts of $A$. Similar to how the imaginary part of a complex number is real, the matrix $S$ is Hermitian (and $iS$ is skew-Hermitian). Since $H$ and $S$ are both Hermitian, we have

$$\Re(\omega^* A \omega) = \omega^* H \omega, \qquad \Im(\omega^* A \omega) = \omega^* S \omega.$$

Consequently, the variance of $\omega^* A \omega$ can be broken into Hermitian and skew-Hermitian parts:

$$\mathrm{Var}(\omega^* A \omega) = \mathrm{Var}(\omega^* H \omega) + \mathrm{Var}(\omega^* S \omega).$$

For this reason, we will state the variance formulas only for Hermitian $A$, with the formula for general $A$ following from the Cartesian decomposition.

For the following results, assume $A$ is Hermitian, $A = A^*$.

1. Complex Gaussian: $\omega_i$ are independent standard complex random vectors, i.e., each $\omega_i$ has iid entries distributed as $(g_1 + i g_2)/\sqrt{2}$ for $g_1, g_2$ standard normal random variables.

$$\mathrm{Var}(\hat{\mathrm{tr}}) = \frac{1}{m} \|A\|_F^2.$$

2. Uniform phases (Steinhaus vectors): $\omega_i$ are independent random vectors whose entries are iid and uniform on the complex unit circle $\{z \in \mathbb{C} : |z| = 1\}$.

$$\mathrm{Var}(\hat{\mathrm{tr}}) = \frac{1}{m} \sum_{i \ne j} |A_{ij}|^2.$$

3. Complex sphere: Assume $\omega_i$ are uniformly distributed on the complex sphere of radius $\sqrt{n}$: $\omega_i \sim \mathrm{Unif}\{x \in \mathbb{C}^n : \|x\| = \sqrt{n}\}$.

$$\mathrm{Var}(\hat{\mathrm{tr}}) = \frac{1}{m} \cdot \frac{n}{n+1} \left( \|A\|_F^2 - \frac{(\mathrm{tr}\,A)^2}{n} \right).$$
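
The complex formulas can be checked the same way (again my own sketch; note the $1/\sqrt{2}$ normalization of the complex Gaussian entries):

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials = 8, 500_000
A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
A = (A + A.conj().T) / 2                       # random Hermitian matrix

g = (rng.standard_normal((trials, n)) + 1j * rng.standard_normal((trials, n))) / np.sqrt(2)
phases = np.exp(2j * np.pi * rng.random((trials, n)))
sphere = g * (np.sqrt(n) / np.linalg.norm(g, axis=1, keepdims=True))

frob2, tr = np.sum(np.abs(A) ** 2), np.trace(A).real
predictions = {
    "gaussian": frob2,
    "phases": frob2 - np.sum(np.abs(np.diag(A)) ** 2),  # sum over i != j
    "sphere": n / (n + 1) * (frob2 - tr**2 / n),
}
for name, W in [("gaussian", g), ("phases", phases), ("sphere", sphere)]:
    quad = np.einsum("ti,ij,tj->t", W.conj(), A, W).real  # w* A w (real for Hermitian A)
    print(name, np.var(quad), predictions[name])
```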

## Optimality Properties

Let us finally address the question of what the best choice of test vectors is for the Girard–Hutchinson estimator. We will state two results with different restrictions on the test vectors $\omega_1, \ldots, \omega_m$.

Our first result, due to Hutchinson, is valid for real symmetric matrices with real test vectors.

Optimality (independent test vectors with independent coordinates). If the test vectors $\omega_1, \ldots, \omega_m$ are isotropic (2), independent from each other, and have independent entries, then for any fixed real symmetric matrix $A$, the minimum variance for $\hat{\mathrm{tr}}$ is obtained when the $\omega_i$ are populated with random signs $\pm 1$.

The next optimality results will have real and complex versions. To present the results for $\mathbb{R}$-valued and $\mathbb{C}$-valued test vectors on a unified footing, let $\mathbb{F}$ denote either $\mathbb{R}$ or $\mathbb{C}$. We let an $\mathbb{F}$-Hermitian matrix be either a real symmetric matrix (if $\mathbb{F} = \mathbb{R}$) or a complex Hermitian matrix (if $\mathbb{F} = \mathbb{C}$). Let an $\mathbb{F}$-unitary matrix be either a real orthogonal matrix (if $\mathbb{F} = \mathbb{R}$) or a complex unitary matrix (if $\mathbb{F} = \mathbb{C}$).

The condition that the vectors have independent entries is often too restrictive in practice. It rules out, for instance, the case of uniform vectors on the sphere. If we relax this condition, we get a different optimal distribution:

Optimality (independent test vectors). Consider any set $\mathscr{A}$ of $\mathbb{F}$-Hermitian matrices which is invariant under $\mathbb{F}$-unitary similarity transformations:

$$\text{if } A \in \mathscr{A} \text{ and } U \text{ is } \mathbb{F}\text{-unitary, then } U^* A U \in \mathscr{A}.$$

Assume that the test vectors $\omega_1, \ldots, \omega_m$ are independent and isotropic (2). The worst-case variance $\sup_{A \in \mathscr{A}} \mathrm{Var}(\hat{\mathrm{tr}})$ is minimized by choosing the $\omega_i$ uniformly on the $\mathbb{F}$-sphere: $\omega_i \sim \mathrm{Unif}\{x \in \mathbb{F}^n : \|x\| = \sqrt{n}\}$.

More simply, if you want your stochastic trace estimator to be effective for a class of inputs $\mathscr{A}$ (closed under $\mathbb{F}$-unitary similarity transformations) rather than a single input matrix $A$, then the best distribution is test vectors drawn uniformly from the sphere. Examples of classes of matrices $\mathscr{A}$ include:

• Fixed eigenvalues. For fixed real eigenvalues $\lambda_1, \ldots, \lambda_n$, the set of all $\mathbb{F}$-Hermitian matrices with these eigenvalues.
• Density matrices. The class of all trace-one psd matrices.
• Frobenius norm ball. The class of all $\mathbb{F}$-Hermitian matrices of Frobenius norm at most 1.

## Derivation of Formulas

In this section, we provide derivations of the variance formulas. I have chosen to focus on derivations which are shorter but use more advanced techniques rather than derivations which are longer but use fewer tricks.

### Real Gaussians

First assume $A$ is real. Since $A$ is real symmetric, $A$ has an eigenvalue decomposition $A = Q \Lambda Q^\top$, where $Q$ is orthogonal and $\Lambda$ is a diagonal matrix reporting $A$'s eigenvalues. Since the real Gaussian distribution is invariant under orthogonal transformations, $Q^\top \omega$ has the same distribution as $\omega$. Therefore,

$$\mathrm{Var}(\omega^\top A \omega) = \mathrm{Var}(\omega^\top \Lambda \omega) = \mathrm{Var}\left( \sum_{i=1}^n \lambda_i \omega_i^2 \right) = \sum_{i=1}^n \lambda_i^2 \, \mathrm{Var}(\omega_i^2) = 2 \sum_{i=1}^n \lambda_i^2 = 2 \|A\|_F^2.$$

Here, we used that the variance of a squared standard normal random variable is two.

For a non-real matrix, we can break the matrix into its entrywise real and imaginary parts $A = \Re A + i \, \Im A$. Thus,

$$\mathrm{Var}(\omega^\top A \omega) = \mathrm{Var}(\omega^\top (\Re A)\, \omega) + \mathrm{Var}(\omega^\top (\Im A)\, \omega),$$

and each term can be evaluated using the result for real matrices.

### Uniform Signs

First, compute

$$\omega^\top A \omega = \sum_{i \ne j} A_{ij} \omega_i \omega_j + \sum_{i=1}^n A_{ii} \omega_i^2.$$

For a vector $\omega$ of uniform random signs, we have $\omega_i^2 = 1$ for every $i$, so the second sum is the constant $\mathrm{tr}(A)$ and contributes nothing to the variance. Note that we have assumed $A$ symmetric, so the sum over $i \ne j$ can be replaced by two times the sum over $i < j$:

$$\mathrm{Var}(\omega^\top A \omega) = \mathrm{Var}\left( 2 \sum_{i < j} A_{ij} \omega_i \omega_j \right).$$

Note that $\{\omega_i \omega_j : i < j\}$ are pairwise independent. As a simple exercise, one can verify that the identity

$$\mathrm{Var}\left( \sum_k a_k X_k \right) = \sum_k a_k^2 \, \mathrm{Var}(X_k)$$

holds for any pairwise independent family of random variables $X_k$ and numbers $a_k$. Ergo,

$$\mathrm{Var}(\omega^\top A \omega) = 4 \sum_{i < j} A_{ij}^2 \, \mathrm{Var}(\omega_i \omega_j) = 4 \sum_{i < j} A_{ij}^2 = 2 \sum_{i \ne j} A_{ij}^2.$$

In the second-to-last equality, we use the fact that $\omega_i \omega_j$ is a uniform random sign, which has variance $1$. The final equality is a consequence of the symmetry of $A$.

### Uniform on the Real Sphere

The simplest proof I know is by the "camel principle". Here's the story (a lightly edited quotation from MathOverflow):

A father left 17 camels to his three sons and, according to the will, the eldest son was to be given a half of the camels, the middle son one-third, and the youngest son one-ninth. The sons did not know what to do since 17 is not evenly divisible into either two, three, or nine parts, but a wise man helped the sons: he added his own camel, the oldest son took 9 camels, the second son took 6 camels, the third son 2 camels, and the wise man took his own camel and went away.

We are interested in a vector $\omega$ which is uniform on the sphere of radius $\sqrt{n}$. Performing averages on the sphere is hard, so we add a camel to the problem by "upgrading" $\omega$ to a spherically symmetric vector $g$ which has a random length. We want to pick a distribution for $g$ for which the computation is easy. Fortunately, we already know such a distribution, the Gaussian distribution, for which we already calculated $\mathrm{Var}(g^\top A g) = 2\|A\|_F^2$.

The Gaussian vector $g$ and the uniform vector $\omega$ on the sphere are related by

$$g = \frac{R}{\sqrt{n}} \, \omega,$$

where $R^2 = \|g\|^2$ is the squared length of the Gaussian vector $g$ (and $R$ is independent of $\omega$). In particular, $R^2$ has the distribution of the sum of $n$ squared Gaussian random variables, which is known as a $\chi^2$ random variable with $n$ degrees of freedom.

Now, we take the camel back. Compute the variance of $g^\top A g$ using the chain rule for variance:

$$\mathrm{Var}(g^\top A g) = \mathbb{E}\left[ \mathrm{Var}(g^\top A g \mid R) \right] + \mathrm{Var}\left( \mathbb{E}[g^\top A g \mid R] \right).$$

Here, $\mathrm{Var}(\cdot \mid R)$ and $\mathbb{E}[\cdot \mid R]$ denote the conditional variance and conditional expectation with respect to the random variable $R$. The quick and dirty way of working with these is to treat the random variable $R$ "like a constant" with respect to the conditional variance and expectation.

Plugging in the formula $g = (R/\sqrt{n})\,\omega$ and treating $R$ "like a constant", we obtain

$$\mathrm{Var}(g^\top A g) = \frac{\mathbb{E}[R^4]}{n^2} \, \mathrm{Var}(\omega^\top A \omega) + \frac{(\mathrm{tr}\,A)^2}{n^2} \, \mathrm{Var}(R^2).$$

As we mentioned, $R^2$ is a $\chi^2$ random variable with $n$ degrees of freedom, and $\mathbb{E}[R^4]$ and $\mathrm{Var}(R^2)$ are known quantities that can be looked up:

$$\mathbb{E}[R^4] = n(n+2), \qquad \mathrm{Var}(R^2) = 2n.$$

We know $\mathrm{Var}(g^\top A g) = 2\|A\|_F^2$ and $\mathbb{E}[\omega^\top A \omega] = \mathrm{tr}(A)$. Plugging these all in, we get

$$2\|A\|_F^2 = \frac{n+2}{n} \, \mathrm{Var}(\omega^\top A \omega) + \frac{2(\mathrm{tr}\,A)^2}{n}.$$

Rearranging, we obtain

$$\mathrm{Var}(\omega^\top A \omega) = \frac{2n}{n+2} \left( \|A\|_F^2 - \frac{(\mathrm{tr}\,A)^2}{n} \right).$$

### Complex Gaussians

The trick is the same as for real Gaussians. By invariance of complex Gaussian random vectors under unitary transformations, we can reduce to the case where $A = \Lambda$ is a diagonal matrix populated with eigenvalues $\lambda_1, \ldots, \lambda_n$. Then

$$\mathrm{Var}(\omega^* \Lambda \omega) = \mathrm{Var}\left( \sum_{i=1}^n \lambda_i |\omega_i|^2 \right) = \sum_{i=1}^n \lambda_i^2 \, \mathrm{Var}(|\omega_i|^2) = \sum_{i=1}^n \lambda_i^2 = \|A\|_F^2.$$

Here, we use the fact that $2|\omega_i|^2$ is a $\chi^2$ random variable with two degrees of freedom, which has variance four; thus $\mathrm{Var}(|\omega_i|^2) = 1$.

### Random Phases

The trick is the same as for uniform signs. A short calculation (remembering that $A$ is Hermitian and thus $A_{ji} = \overline{A_{ij}}$) reveals that

$$\mathrm{Var}(\omega^* A \omega) = \mathrm{Var}\left( 2 \sum_{i < j} \Re\big( A_{ij} \, \overline{\omega_i} \, \omega_j \big) \right).$$

The random variables $\{\overline{\omega_i}\omega_j : i < j\}$ are pairwise independent so we have

$$\mathrm{Var}(\omega^* A \omega) = 4 \sum_{i < j} \mathrm{Var}\big( \Re\big( A_{ij} \, \overline{\omega_i} \, \omega_j \big) \big).$$

Since $\overline{\omega_i}\omega_j$ is uniformly distributed on the complex unit circle, we can assume without loss of generality that $A_{ij} = |A_{ij}| \ge 0$. Thus, letting $\phi$ be uniform on the complex unit circle,

$$\mathrm{Var}(\omega^* A \omega) = 4 \sum_{i < j} |A_{ij}|^2 \, \mathrm{Var}(\Re\,\phi).$$

The real and imaginary parts of $\phi$ have the same distribution so

$$\mathrm{Var}(\Re\,\phi) = \frac{1}{2}\big( \mathrm{Var}(\Re\,\phi) + \mathrm{Var}(\Im\,\phi) \big) = \frac{1}{2}\,\mathrm{Var}(\phi) = \frac{1}{2}\,\mathbb{E}|\phi|^2 = \frac{1}{2},$$

so $\mathrm{Var}(\Re\,\phi) = 1/2$. Thus

$$\mathrm{Var}(\omega^* A \omega) = 2 \sum_{i < j} |A_{ij}|^2 = \sum_{i \ne j} |A_{ij}|^2.$$

### Uniform on the Complex Sphere: Derivation 1 by Reduction to Real Case

There are at least three simple ways of deriving this result: the camel trick, reduction to the real case, and Haar integration. Each of these techniques illustrates a trick that is useful in its own right beyond the context of trace estimation. Since we have already seen an example of the camel trick for the real sphere, I will present the other two derivations.

Let us begin with the reduction to the real case. Let $\Re$ and $\Im$ denote the real and imaginary parts of a vector or matrix, taken entrywise. The key insight is that if $\omega$ is a uniform random vector on the complex sphere of radius $\sqrt{n}$ in $\mathbb{C}^n$, then

$$\tilde{\omega} = \begin{bmatrix} \Re\,\omega \\ \Im\,\omega \end{bmatrix} \text{ is a uniform random vector on the sphere of radius } \sqrt{n} \text{ in } \mathbb{R}^{2n}.$$

We've converted the complex vector $\omega \in \mathbb{C}^n$ into a real vector $\tilde{\omega} \in \mathbb{R}^{2n}$.

Now, we need to convert the complex matrix $A \in \mathbb{C}^{n \times n}$ into a real matrix $\tilde{A} \in \mathbb{R}^{2n \times 2n}$. To do this, recall that one way of representing complex numbers is by $2 \times 2$ matrices:

$$a + bi \longleftrightarrow \begin{bmatrix} a & -b \\ b & a \end{bmatrix}.$$

Using this correspondence, addition and multiplication of complex numbers can be carried out by addition and multiplication of the corresponding matrices.

To convert complex matrices to real matrices, we use a matrix version of the same representation:

$$A = \Re A + i\,\Im A \longleftrightarrow \tilde{A} = \begin{bmatrix} \Re A & -\Im A \\ \Im A & \Re A \end{bmatrix}.$$

One can check that addition and multiplication of complex matrices can be carried out by addition and multiplication of the corresponding "realified" matrices, i.e.,

$$\widetilde{A + B} = \tilde{A} + \tilde{B}, \qquad \widetilde{AB} = \tilde{A}\tilde{B}$$

holds for all complex matrices $A$ and $B$.
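
Here is a quick numerical check (my illustrative helper, not notation from the post) of the realification map and the quadratic-form identity used next:

```python
import numpy as np

def realify(A):
    """Represent a complex matrix as a real matrix of twice the dimensions."""
    return np.block([[A.real, -A.imag], [A.imag, A.real]])

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3)) + 1j * rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3)) + 1j * rng.standard_normal((3, 3))
assert np.allclose(realify(A @ B), realify(A) @ realify(B))   # multiplication preserved

H = (A + A.conj().T) / 2                       # Hermitian matrix
w = rng.standard_normal(3) + 1j * rng.standard_normal(3)
wt = np.concatenate([w.real, w.imag])          # the "realified" vector
assert np.isclose(wt @ realify(H) @ wt, (w.conj() @ H @ w).real)
```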

We've now converted the complex matrix $A$ and vector $\omega$ into a real matrix $\tilde{A}$ and vector $\tilde{\omega}$. Let's compare $\omega^* A \omega$ to $\tilde{\omega}^\top \tilde{A} \tilde{\omega}$. A short calculation reveals

$$\tilde{\omega}^\top \tilde{A} \tilde{\omega} = \Re(\omega^* A \omega) = \omega^* A \omega,$$

where the final equality holds because $A$ is Hermitian, so $\omega^* A \omega$ is real. Since $\omega$ is a uniform random vector on the complex sphere of radius $\sqrt{n}$, $\tilde{\omega}$ is a uniform random vector on the real sphere in $\mathbb{R}^{2n}$ of radius $\sqrt{n}$, which is $1/\sqrt{2}$ times the radius $\sqrt{2n}$ appearing in the real-sphere formula. Thus, by the variance formula for the real sphere (applied in dimension $2n$, with the rescaling contributing a factor $(1/\sqrt{2})^4 = 1/4$), we get

$$\mathrm{Var}(\tilde{\omega}^\top \tilde{A} \tilde{\omega}) = \frac{1}{4} \cdot \frac{2(2n)}{2n+2} \left( \|\tilde{A}\|_F^2 - \frac{(\mathrm{tr}\,\tilde{A})^2}{2n} \right).$$

A short calculation verifies that $\|\tilde{A}\|_F^2 = 2\|A\|_F^2$ and $\mathrm{tr}\,\tilde{A} = 2\,\mathrm{tr}\,A$. Plugging this in, we obtain

$$\mathrm{Var}(\omega^* A \omega) = \frac{n}{n+1} \left( \|A\|_F^2 - \frac{(\mathrm{tr}\,A)^2}{n} \right).$$

### Uniform on the Complex Sphere: Derivation 2 by Haar Integration

The proof by reduction to the real case requires some cumbersome calculations and requires that we have already computed the variance in the real case by some other means. The method of Haar integration is more slick, but it requires some pretty high-power machinery. Haar integration may be a little bit overkill for this problem, but this technique is worth learning as it can handle some truly nasty expected value computations that appear, for example, in quantum information.

We seek to compute

$$\mathrm{Var}(\omega^* A \omega) = \mathbb{E}\left[ (\omega^* A \omega)^2 \right] - (\mathrm{tr}\,A)^2.$$

The first trick will be to write this expression as a single matrix trace using the tensor (Kronecker) product $\otimes$. For those unfamiliar with the tensor product, the main properties we will be using are

$$(B \otimes C)(D \otimes E) = (BD) \otimes (CE), \qquad \mathrm{tr}(B \otimes C) = \mathrm{tr}(B)\,\mathrm{tr}(C). \tag{6}$$

We saw in the proof of unbiasedness that

$$\omega^* A \omega = \mathrm{tr}(A\,\omega\omega^*).$$

Therefore, by (6),

$$(\omega^* A \omega)^2 = \big( \mathrm{tr}(A\,\omega\omega^*) \big)^2 = \mathrm{tr}\big( (A \otimes A)(\omega \otimes \omega)(\omega \otimes \omega)^* \big).$$

Thus, to evaluate $\mathbb{E}[(\omega^* A \omega)^2]$, it will be sufficient to evaluate $\mathbb{E}[(\omega \otimes \omega)(\omega \otimes \omega)^*]$. Fortunately, there is a useful formula for this expectation provided by a field of mathematics known as representation theory (see Lemma 1 in this paper):

$$\mathbb{E}\left[ (\omega \otimes \omega)(\omega \otimes \omega)^* \right] = \frac{2n}{n+1}\,\Pi.$$

Here, $\Pi$ is the orthogonal projection onto the space of symmetric two-tensors. Therefore, we have that

$$\mathbb{E}\left[ (\omega^* A \omega)^2 \right] = \frac{2n}{n+1}\,\mathrm{tr}\big( (A \otimes A)\Pi \big).$$

To evaluate the trace on the right-hand side of this equation, there is another formula (see Lemma 6 in this paper):

$$\mathrm{tr}\big( (A \otimes A)\Pi \big) = \frac{(\mathrm{tr}\,A)^2 + \mathrm{tr}(A^2)}{2}.$$

Therefore, we conclude

$$\mathrm{Var}(\omega^* A \omega) = \frac{2n}{n+1} \cdot \frac{(\mathrm{tr}\,A)^2 + \mathrm{tr}(A^2)}{2} - (\mathrm{tr}\,A)^2 = \frac{n}{n+1}\left( \|A\|_F^2 - \frac{(\mathrm{tr}\,A)^2}{n} \right).$$

## Proof of Optimality Properties

In this section, we provide proofs of the two optimality properties.

### Optimality: Independent Vectors with Independent Coordinates

Assume $A$ is real and symmetric and suppose that $\omega$ is isotropic (2) with independent coordinates. The isotropy condition

$$\mathbb{E}[\omega\omega^\top] = I$$

implies that $\mathbb{E}[\omega_i \omega_j] = \delta_{ij}$, where $\delta_{ij}$ is the Kronecker symbol. Using this fact, we compute the second moment:

$$\mathbb{E}\left[ (\omega^\top A \omega)^2 \right] = \sum_{i,j,k,l} A_{ij} A_{kl}\,\mathbb{E}[\omega_i \omega_j \omega_k \omega_l] = \sum_{i=1}^n A_{ii}^2\,\mathbb{E}[\omega_i^4] + \sum_{i \ne j} \big( A_{ii} A_{jj} + 2 A_{ij}^2 \big).$$

Thus

$$\mathrm{Var}(\omega^\top A \omega) = \mathbb{E}\left[ (\omega^\top A \omega)^2 \right] - (\mathrm{tr}\,A)^2 = \sum_{i=1}^n A_{ii}^2 \big( \mathbb{E}[\omega_i^4] - 1 \big) + 2 \sum_{i \ne j} A_{ij}^2.$$

The variance is minimized by choosing $\omega$ with $\mathbb{E}[\omega_i^4]$ as small as possible. Since $\mathbb{E}[\omega_i^4] \ge \big( \mathbb{E}[\omega_i^2] \big)^2 = 1$, the smallest possible value for $\mathbb{E}[\omega_i^4]$ is one, which is obtained by populating $\omega$ with random signs.

### Optimality: Independent Vectors

This result appears to have first been proven by Richard Kueng in unpublished work. We use an argument suggested to me by Robert J. Webber.

Assume $\mathscr{A}$ is a class of $\mathbb{F}$-Hermitian matrices closed under $\mathbb{F}$-unitary similarity transformations and that $\omega$ is an isotropic random vector (2). Decompose the test vector as

$$\omega = a \cdot s, \qquad \text{where } a = \frac{\|\omega\|}{\sqrt{n}} \text{ and } s = \sqrt{n}\,\frac{\omega}{\|\omega\|}.$$

First, we shall show that the worst-case variance is reduced by replacing $s$ with a vector $t$ drawn uniformly from the sphere

$$\sup_{A \in \mathscr{A}} \mathrm{Var}\big( (at)^* A (at) \big) \le \sup_{A \in \mathscr{A}} \mathrm{Var}(\omega^* A \omega), \tag{7}$$

where

$$t \sim \mathrm{Unif}\{x \in \mathbb{F}^n : \|x\| = \sqrt{n}\} \text{ is drawn independently of } a. \tag{8}$$

Note that such a $t$ can be generated as $t = Us$ for a uniformly random $\mathbb{F}$-unitary matrix $U$, drawn independently of everything else. Therefore, we have

$$\mathrm{Var}\big( (at)^* A (at) \big) = \mathrm{Var}\big( (aUs)^* A (aUs) \big) = \mathrm{Var}\big( \omega^* (U^* A U)\,\omega \big).$$

Now apply Jensen's inequality only over the randomness in $U$ to obtain

$$\mathrm{Var}\big( \omega^* (U^* A U)\,\omega \big) = \mathbb{E}_U\big[ \mathrm{Var}\big( \omega^* (U^* A U)\,\omega \mid U \big) \big] \le \sup_{U\ \mathbb{F}\text{-unitary}} \mathrm{Var}_\omega\big( \omega^* (U^* A U)\,\omega \big);$$

here the first equality uses that $\mathbb{E}[\omega^* (U^* A U)\,\omega \mid U] = \mathrm{tr}(U^* A U) = \mathrm{tr}(A)$ does not depend on $U$. Finally, note that since $\mathscr{A}$ is closed under $\mathbb{F}$-unitary similarity transformations, the supremum over $U$ of $\mathrm{Var}_\omega(\omega^* (U^* A U)\,\omega)$ is the same as the supremum of $\mathrm{Var}_\omega(\omega^* A' \omega)$ over $A' \in \mathscr{A}$, so we obtain

$$\sup_{A \in \mathscr{A}} \mathrm{Var}\big( (at)^* A (at) \big) \le \sup_{A \in \mathscr{A}} \mathrm{Var}(\omega^* A \omega).$$

We have successfully proven (7). This argument is a specialized version of a far more general result which appears as Proposition 4.1 in this paper.

Next, we shall prove

$$\mathrm{Var}\big( (at)^* A (at) \big) \ge \mathrm{Var}(t^* A t), \tag{9}$$

where $t$ is still defined as in (8). Indeed, using the chain rule for variance, we obtain

$$\mathrm{Var}\big( (at)^* A (at) \big) = \mathrm{Var}\big( a^2 \, t^* A t \big) = \mathbb{E}[a^4]\,\mathrm{Var}(t^* A t) + (\mathrm{tr}\,A)^2\,\mathrm{Var}(a^2).$$

Here, we have used that $t$ is uniform on the sphere and thus $\mathbb{E}[t^* A t] = \mathrm{tr}(A)$. By definition, $a$ is the length of $\omega$ divided by $\sqrt{n}$. Therefore,

$$\mathbb{E}[a^2] = \frac{\mathbb{E}\|\omega\|^2}{n} = \frac{\mathrm{tr}\big( \mathbb{E}[\omega\omega^*] \big)}{n} = 1.$$

Therefore, by Jensen's inequality,

$$\mathbb{E}[a^4] \ge \big( \mathbb{E}[a^2] \big)^2 = 1.$$

Thus

$$\mathrm{Var}\big( (at)^* A (at) \big) \ge \mathrm{Var}(t^* A t) + (\mathrm{tr}\,A)^2\,\mathrm{Var}(a^2) \ge \mathrm{Var}(t^* A t),$$

which proves (9).