# Note to Self: Gaussian Hypercontractivity

The purpose of this note is to describe the Gaussian hypercontractivity inequality. As an application, we’ll obtain a weaker version of the Hanson–Wright inequality.

## The Noise Operator

We begin our discussion with the following question:

Let be a function. What happens to , on average, if we perturb its inputs by a small amount of Gaussian noise?

Let’s be more specific about our noise model. Let be an input to the function and fix a parameter (think of as close to 1). We’ll define the noise corruption of to be

(1)

Here, is the standard multivariate Gaussian distribution. In our definition of , we both add Gaussian noise and shrink the vector by a factor . In particular, we highlight two extreme cases:

• No noise. If , then there is no noise and .
• All noise. If , then there is all noise and . The influence of the original vector has been washed away completely.

The noise corruption (1) immediately gives rise to the noise operator1The noise operator is often called the Hermite operator. The noise operator is related to the Ornstein–Uhlenbeck semigroup operator by a change of variables, . . Let be a function. The noise operator is defined to be:

(2)

The noise operator computes the average value of when evaluated at the noisy input . Observe that the noise operator maps a function to another function . Going forward, we will write to denote .

To understand how the noise operator acts on a function , we can write the expectation in the definition (2) as an integral:

Here, denotes the (Euclidean) length of . We see that is the convolution of with a Gaussian density. Thus, acts to smooth the function .

See below for an illustration. The red solid curve is a function , and the blue dashed curve is .

As we decrease from to , the function is smoothed more and more. When we finally reach , has been smoothed all the way into a constant.

## Random Inputs

The noise operator converts a function to another function . We can evaluate these two functions at a Gaussian random vector , resulting in two random variables and .

We can think of as a modification of the random variable where “a fraction of the variance of has been averaged out”. We again highlight the two extreme cases:

• No noise. If , . None of the variance of has been averaged out.
• All noise. If , is a constant random variable. All of the variance of has been averaged out.

Just as decreasing smoothes the function until it reaches a constant function at , decreasing makes the random variable more and more “well-behaved” until it becomes a constant random variable at . This “well-behavingness” property of the noise operator is made precise by the Gaussian hypercontractivity theorem.

## Moments and Tails

In order to describe the “well-behavingness” properties of the noise operator, we must answer the question:

How can we measure how well-behaved a random variable is?

There are many answers to this question. For this post, we will quantify the well-behavedness of a random variable by using the norm.2Using norms is a common way of measuring the niceness of a function or random variable in applied math. For instance, we can use Sobolev norms or reproducing kernel Hilbert space norms to measure the smoothness of a function in approximation theory, as I’ve discussed before on this blog.

The norm of a (-valued) random variable is defined to be

(3)

The th power of the norm is sometimes known as the th absolute moment of .

The norms of random variables control the tails of a random variable—that is, the probability that a random variable is large in magnitude. A random variables with small tails is typically thought of as a “nice” or “well-behaved” random variable. Random quantities with small tails are usually desirable in applications, as they are more predictable—unlikely to take large values.

The connection between tails and norms can be derived as follows. First, write the tail probability for using th powers:

Then, we apply Markov’s inequality, obtaining

(4)

We conclude that a random variable with finite norm (i.e., ) has tails that decay at at a rate or faster.

## Gaussian Contractivity

Before we introduce the Gaussian hypercontractivity theorem, let’s establish a weaker property of the noise operator, contractivity.

Proposition 1 (Gaussian contractivity). Choose a noise level and a power , and let be a Gaussian random vector. Then contracts the norm of :

This result shows that the noise operator makes the random variable no less nice than was.

Gaussian contractivity is easy to prove. Begin using the definition of the noise operator (2) and norm (3):

Now, we can apply Jensen’s inequality to the convex function , obtaining

Finally, realize that for the independent normal random vectors, we have

Thus, has the same distribution as . Thus, using in place of , we obtain

Gaussian contractivity (Proposition 1) is proven.

## Gaussian Hypercontractivity

The Gaussian contractivity theorem shows that is no less well-behaved than is. In fact, is more well-behaved than is. This is the content of the Gaussian hypercontractivity theorem:

Theorem 2 (Gaussian hypercontractivity): Choose a noise level and a power , and let be a Gaussian random vector. Then

In particular, for ,

We have highlighted the case because it is the most useful in practice.

This result shows that as we take smaller, the random variable becomes more and more well-behaved, with tails decreasing at a rate

The rate of tail decrease becomes faster and faster as becomes closer to zero.

We will prove the Gaussian hypercontractivity at the bottom of this post. For now, we will focus on applying this result.

## Multilinear Polynomials

A multilinear polynomial is a multivariate polynomial in the variables in which none of the variables is raised to a power higher than one. So,

(5)

is multilinear, but

is not multilinear (since is squared).

For multilinear polynomials, we have the following very powerful corollary of Gaussian hypercontractivity:

Corollary 3 (Absolute moments of a multilinear polynomial of Gaussians). Let be a multilinear polynomial of degree . (That is, at most variables occur in any monomial of .) Then, for a Gaussian random vector and for all ,

Let’s prove this corollary. The first observation is that the noise operator has a particularly convenient form when applied to a multilinear polynomial. Let’s test it out on our example (5) from above. For

we have

We see that the expectation applies to each variable separately, resulting in each replaced by . This trend holds in general:

Proposition 4 (noise operator on multilinear polynomials). For any multilinear polynomial , .

We can use Proposition 4 to obtain bounds on the norms of multilinear polynomials of a Gaussian random variable. Indeed, observe that

Thus, by Gaussian hypercontractivity, we have

The final step of our argument will be to compute . Write as

Since is multilinear, for . Since is degree-, . The multilinear monomials are orthonormal with respect to the inner product:

(See if you can see why!) Thus, by the Pythagorean theorem, we have

Similarly, the coefficients of are . Thus,

Thus, putting all of the ingredients together, we have

Setting (equivalently ), Corollary 3 follows.

## Hanson–Wright Inequality

To see the power of the machinery we have developed, let’s prove a version of the Hanson–Wright inequality.

Theorem 5 (suboptimal Hanson–Wright). Let be a symmetric matrix with zero on its diagonal and be a Gaussian random vector. Then

Hanson–Wright has all sorts of applications in computational mathematics and data science. One direct application is to obtain probabilistic error bounds for the error incurred by a stochastic trace estimation formulas.

This version of Hanson–Wright is not perfect. In particular, it does not capture the Bernstein-type tail behavior of the classical Hanson–Wright inequality

But our suboptimal Hanson–Wright inequality is still pretty good, and it requires essentially no work to prove using the hypercontractivity machinery. The hypercontractivity technique also generalizes to settings where some of the proofs of Hanson–Wright fail, such as multilinear polynomials of degree higher than two.

Let’s prove our suboptimal Hanson–Wright inequality. Set . Since has zero on its diagonal, is a multilinear polynomial of degree two in the entries of . The random variable is mean-zero, and a short calculation shows its norm is

Thus, by Corollary 3,

(6)

In fact, since the norms are monotone, (6) holds for as well. Therefore, the standard tail bound for norms (4) gives

(7)

Now, we must optimize the value of to obtain the sharpest possible bound. To make this optimization more convenient, introduce a parameter

In terms of the parameter, the bound (7) reads

The tail bound is minimized by taking , yielding the claimed result

## Proof of Gaussian Hypercontractivity

Let’s prove the Gaussian hypercontractivity theorem. For simplicity, we will stick with the case, but the higher-dimensional generalizations follow along similar lines. The key ingredient will be the Gaussian Jensen inequality, which made a prominent appearance in a previous blog post of mine. Here, we will only need the following version:

Theorem 6 (Gaussian Jensen). Let be a twice differentiable function and let be jointly Gaussian random variables with covariance matrix . Then

(8)

holds for all test functions if, and only if,

(9)

Here, denotes the entrywise product of matrices and is the Hessian matrix of the function .

To me, this proof of Gaussian hypercontractivity using Gaussian Jensen (adapted from Paata Ivanishvili‘s excellent post) is amazing. First, we reformulate the Gaussian hypercontractivity property a couple of times using some functional analysis tricks. Then we do a short calculation, invoke Gaussian Jensen, and the theorem is proved, almost as if by magic.

### Part 1: Tricks

Let’s begin with “tricks” part of the argument.

Trick 1. To prove Gaussian hypercontractivity holds for all functions , it is sufficient to prove for all nonnegative functions .

Indeed, suppose Gaussian hypercontractivity holds for all nonnegative functions . Then, for any function , apply Jensen’s inequality to conclude

Thus, assuming hypercontractivity holds for the nonnegative function , we have

Thus, the conclusion of the hypercontractivity theorem holds for as well, and the Trick 1 is proven.

Trick 2. To prove Gaussian hypercontractivity for all , it is sufficient to prove the following “bilinearized” Gaussian hypercontractivity result:

holds for all with . Here, is the Hölder conjugate to .

Indeed, this follows3This argument may be more clear to parse if we view and as functions on equipped with the standard Gaussian measure . This result is just duality for the norm. from the dual characterization of the norm of :

Trick 2 is proven.

Trick 3. Let be a pair of standard Gaussian random variables with correlation . Then the bilinearized Gaussian hypercontractivity statement is equivalent to

Indeed, define for the random variable in the definition of the noise operator . The random variable is standard Gaussian and has correlation with , concluding the proof of Trick 3.

Finally, we apply a change of variables as our last trick:

Trick 4. Make the change of variables and , yielding the final equivalent version of Gaussian hypercontractivity:

for all functions and (in the appropriate spaces).

### Part 2: Calculation

We recognize this fourth equivalent version of Gaussian hypercontractivity as the conclusion (8) to Gaussian Jensen with

. Thus, to prove Gaussian hypercontractivity, we just need to check the hypothesis (9) of the Gaussian Jensen inequality (Theorem 6).

We now enter the calculation part of the proof. First, we compute the Hessian of :

We have written for the Hölder conjugate to . By Gaussian Jensen, to prove Gaussian hypercontractivity, it suffices to show that

is negative semidefinite for all . There are a few ways we can make our lives easier. Write this matrix as

Scaling by nonnegative and conjugation both preserve negative semidefiniteness, so it is sufficient to prove

Since the diagonal entries of are negative, at least one of ‘s eigenvalues is negative.4Indeed, by the Rayleigh–Ritz variational principle, the smallest eigenvalue of a symmetric matrix is Taking for to be each of the standard basis vectors, shows that the smallest eigenvalue of is smaller than the smallest diagonal entry of . Therefore, to prove is negative semidefinite, we can prove that its determinant (= product of its eigenvalues) is nonnegative. We compute

Now, just plug in the values for , , :

Thus, . We conclude is negative semidefinite, proving the Gaussian hypercontractivity theorem.

# Note to Self: Norm of a Gaussian Random Vector

Let be a standard Gaussian vector—that is, a vector populated by independent standard normal random variables. What is the expected length of ? (Here, and throughout, denotes the Euclidean norm of a vector.) The length of is the square root of the sum of independent standard normal random variables

which is known as a random variable with degrees of freedom. (Not to be confused with a random variable!) Its mean value is given by the rather unpleasant formula

where is the gamma function. If you are familiar with the definition of the gamma function, the derivation of this formula is not too hard—it follows from a change of variables to -dimensional spherical coordinates.

This formula can be difficult to interpret and use. Fortunately, we have the rather nice bounds

(1)

This result appears, for example, page 11 of this paper. These bounds show that, for large , is quite close to . The authors of the paper remark that this inequality can be probed by induction. I had difficulty reproducing the inductive argument for myself. Fortunately, I found a different proof which I thought was very nice, so I thought I would share it here.

Our core tool will be Wendel’s inequality (see (7) in Wendel’s original paper): For and , we have

(2)

Let us first use Wendel’s inequality to prove (1). Indeed, invoke Wendel’s inequality with and and multiply by to obtain

which simplifies directly to (1).

Now, let’s prove Wendel’s inequality (2). The key property for us will be the strict log-convexity of the gamma function: For real numbers and ,

(3)

We take this property as established and use it to prove Wendel’s inequality. First, use the log-convexity property (3) with to obtain

Divide by and use the property that to conclude

(4)

This proves the upper bound in Wendel’s inequality (2). To prove the lower bound, invoke the upper bound (4) with in place of and in place of to obtain

Multiplying by , dividing by , and using again yields

finishing the proof of Wendel’s inequality.

Notes. The upper bound in (1) can be proven directly by Lyapunov’s inequality: , where we use the fact that is the sum of random variables with mean one. The weaker lower bound follows from a weaker version of Wendel’s inequality, Gautschi’s inequality.

After the initial publication of this post, Sam Buchanan mentioned another proof of the lower bound using the Gaussian Poincaré inequality. This inequality states that, for a function ,

To prove the lower bound, set which has gradient . Thus,

Rearrange to obtain .

# Note to Self: Hanson–Wright Inequality

This post is part of a new series for this blog, Note to Self, where I collect together some notes about an idea related to my research. This content may be much more technical than most of the content of this blog and of much less wide interest. My hope in sharing this is that someone will find this interesting and useful for their own work.

This post is about a fundamental tool of high-dimensional probability, the Hanson–Wright inequality. The Hanson–Wright inequality is a concentration inequality for quadratic forms of random vectors—that is, expressions of the form where is a random vector. Many statements of this inequality in the literature have an unspecified constant ; our goal in this post will be to derive a fairly general version of the inequality with only explicit constants.

The core object of the Hanson–Wright inequality is a subgaussian random variable. A random variable is subgaussian if the probability it exceeds a threshold in magnitude decays as

(1)

The name subgaussian is appropriate as the tail probabilities of Gaussian random variables exhibit the same square-exponential decrease .

A (non-obvious) fact is that if is subgaussian in the sense (1) and centered (), then ‘s cumulant generating function (cgf)

is subquadratic: There is a constant (independent of and ), for which

(2)

Moreover,1See Proposition 2.5.2 of Vershynin’s High-Dimensional Probability. a subquadratic cgf (2) also implies the subgaussian tail property (1), with a different parameter .

Since properties (1) and (2) are equivalent (up to a change in the parameter ), we are free to fix a version of property (2) as our definition for a (centered) subgaussian random variable.

Definition (subgaussian random variable): A centered random variable is said to be -subgaussian or subgaussian with variance proxy if its cgf is subquadratic:

(3)

For instance, a mean-zero Gaussian random variable with variance has cgf

(4)

and is thus subgaussian with variance proxy equal to its variance.

Here is a statement of the Hanson–Wright inequality as it typically appears with unspecified constants (see Theorem 6.2.1 of Vershynin’s High-Dimensional Probability):

Theorem (Hanson–Wright): Let be a random vector with independent centered -subgaussian entries and let be a square matrix. Then

where is a constant (not depending on , , , or ).2Here, and denote the Frobenius and spectral norms.

This type of concentration is exactly the same type as provided by Bernstein’s inequality (which I discussed in my post on concentration inequalities). In particular, for small deviations , the tail probabilities decay are subgaussian with variance proxy :

For large deviations , this switches to subexponential tail probabilities with decay rate :

Mediating these two parameter regimes are the size of the matrix , as measured by its Frobenius and spectral norms, and the degree of subgaussianity of , measured by the variance proxy .

## Diagonal-Free Hanson–Wright

Now we come to a first version of the Hanson–Wright inequality with explicit constants, first for a matrix which is diagonal-free—that is, having all zeros on the diagonal. I obtained this version of the inequality myself, though I am very sure that this version of the inequality or an improvement thereof appears somewhere in the literature.

Theorem (Hanson–Wright, explicit constants, diagonal-free): Let random vector with independent centered -subguassian entries and let be a diagonal-free square matrix. Then we have the cgf bound

As a consequence, we have the concentration bound

Let us begin proving this result. Our proof will follow the same steps as Vershynin’s proof in High-Dimensional Probability (which in turn is adapted from an article by Rudelson and Vershynin), but taking care to get explicit constants. Unfortunately, proving all of the relevant tools from first principles would easily triple the length of this post, so I make frequent use of results from the literature.

We begin by the decoupling bound (Theorem 6.1.1 in Vershynin’s High-Dimensional Probability), which allows us to replace one with an independent copy at the cost of a factor of four:

(5)

We seek to compare the bilinear form to the Gaussian bilinear form where and are independent standard Gaussian vectors. We begin with the following cgf bound for the Gaussian quadratic form :

This equation is the result of Example 2.12 in Boucheron, Lugosi, and Massart’s Concentration Inequalities. By applying this result to the Hermitian dilation of in ‘s place, one obtains a similar result for the decoupled bilinear form :

(6)

We now seek to compare to . To do this, we first evaluate the cgf of only over the randomness in . Since we’re only taking an expectation over the random variable , we can apply the subquadratic tail condition (3) to obtain

(7)

Now we perform a similar computation for the quantity in which has been replaced by the Gaussian vector :

We stress that this is an equality since the cgf of a Gaussian random variable is given by (4). Thus we can substitute the left-hand side of the above display into the right-hand side of (7), yielding

(8)

We now perform this same trick again using the randomness in :

(9)

Packaging up (8) and (9) gives

(10)

Combining all these results (5), (6), and (10), we obtain

This cgf implies the desired probability bound as a consequence of the following fact (see Boucheron, Lugosi, and Massart’s Concentration Inequalities page 29 and Exercise 2.8):

Fact (Bernstein concentration from Bernstein cgf bound): Suppose that a random variable satisfies the cgf bound for . Then

## General Hanson–Wright

Now, here’s a more general result (with worse constants) which permits the matrix to possess a diagonal.

Theorem (Hanson–Wright, explicit constants): Let random vector with independent centered -subguassian entries and let be an arbitrary square matrix. Then we have the cgf bound

As a consequence, we have the concentration bound

Decompose the matrix into its diagonal and off-diagonal portions. For any two random variables and (possibly highly dependent), we can bound the cgf of their sum using the following “union bound”:

(11)

The two equality statements are the definition of the cumulant generating function and the inequality is Cauchy–Schwarz.

Using the “union bound”, it is sufficient to obtain bounds for the cgfs of the diagonal and off-diagonal parts and . We begin with the diagonal part. We compute

(12)

For the cgf of , we use the following bound, taken from Appendix B of the following paper:

Substituting this result into (12) gives

(13)

For the second inequality, we used the facts that and .

We now look at the off-diagonal part . We use a version of the decoupling bound (5) where we compare to , where we’ve both replaced one copy of with an independent copy and reinstated the diagonal of (see Remark 6.1.3 in Vershynin’s High-Dimensional Probability):

We can now just repeat the rest of the argument for the diagonal-free Hanson–Wright inequality, yielding the same conclusion

(14)

Combining (11), (13), and (14), we obtain

As with above, this cgf bound implies the desired probability bound.