Note to Self: Gaussian Hypercontractivity

The purpose of this note is to describe the Gaussian hypercontractivity inequality. As an application, we’ll obtain a weaker version of the Hanson–Wright inequality.

The Noise Operator

We begin our discussion with the following question:

Let f : \real^n \to \real be a function. What happens to f, on average, if we perturb its inputs by a small amount of Gaussian noise?

Let’s be more specific about our noise model. Let x \in \real^n be an input to the function f and fix a parameter 0 \le \varrho \le 1 (think of \varrho as close to 1). We’ll define the noise corruption of x to be

(1)   \[\tilde{x}_\varrho = \varrho \cdot x + \sqrt{1-\varrho^2} \cdot g, \quad \text{where } g\sim \operatorname{Normal}(0,I). \]

Here, \operatorname{Normal}(0,I) is the standard multivariate Gaussian distribution. In our definition of \tilde{x}_\varrho, we both add Gaussian noise \sqrt{1-\varrho^2} \cdot g and shrink the vector x by a factor \varrho. In particular, we highlight two extreme cases:

  • No noise. If \varrho = 1, then there is no noise and \tilde{x}_\varrho = x.
  • All noise. If \varrho = 0, then there is all noise and \tilde{x}_\varrho = g. The influence of the original vector x has been washed away completely.

The noise corruption (1) immediately gives rise to the noise operator T_\varrho. (The noise operator is often called the Hermite operator. It is related to the Ornstein–Uhlenbeck semigroup operator P_t by the change of variables P_t = T_{\e^{-t}}.) Let f : \real^n \to \real be a function. The noise operator T_\varrho is defined to be:

(2)   \[(T_\varrho f)(x) = \expect[f(\tilde{x}_\varrho)] = \expect_{g\sim \operatorname{Normal}(0,I)}[f( \varrho \cdot x + \sqrt{1-\varrho^2}\cdot g)]. \]

The noise operator computes the average value of f when evaluated at the noisy input \tilde{x}_\varrho. Observe that the noise operator maps a function f : \real^n \to \real to another function T_\varrho f : \real^n \to \real. Going forward, we will write T_\varrho f(x) to denote (T_\varrho f)(x).

To understand how the noise operator acts on a function f, we can write the expectation in the definition (2) as an integral:

    \[T_\varrho f(x) = \int_{\real^n} f(\varrho x + y) \frac{1}{(2\pi (1-\varrho^2))^{n/2}}\e^{-\frac{|y|^2}{2(1-\varrho^2)}} \, \mathrm{d} y.\]

Here, |y| denotes the (Euclidean) length of y\in\real^n. We see that T_\varrho f(x) is the convolution of f with a Gaussian density, evaluated at the shrunken input \varrho x. Thus, T_\varrho acts to smooth the function f.

See below for an illustration. The red solid curve is a function f, and the blue dashed curve is T_{0.95}f.

As we decrease \varrho from 1 to 0, the function T_\varrho f is smoothed more and more. When we finally reach \varrho = 0, T_\varrho f has been smoothed all the way into a constant.
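To make the definition concrete, here is a minimal numerical sketch of the noise operator, assuming NumPy is available; the helper name noise_operator and the step-function example are illustrative choices of mine, not part of the discussion above. It estimates T_\varrho f(x) by Monte Carlo, averaging f over samples of the noise corruption (1).

```python
import numpy as np

def noise_operator(f, x, rho, num_samples=100_000, rng=None):
    """Monte Carlo estimate of (T_rho f)(x) = E[f(rho*x + sqrt(1-rho^2)*g)]."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.atleast_1d(np.asarray(x, dtype=float))
    g = rng.standard_normal((num_samples, x.size))   # g ~ Normal(0, I)
    noisy = rho * x + np.sqrt(1 - rho**2) * g        # noise corruption of x
    return np.mean([f(z) for z in noisy])

# Example: smoothing a step function
f = lambda z: np.sign(z[0])
print(noise_operator(f, x=[0.5], rho=0.95))          # a value strictly between -1 and 1
```

Applied to a step function, the estimate should look like the value of a smooth, sigmoid-like curve near the jump, consistent with the smoothing picture above.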

Random Inputs

The noise operator converts a function f to another function T_\varrho f. We can evaluate these two functions at a Gaussian random vector x\sim \operatorname{Normal}(0,I), resulting in two random variables f(x) and T_\varrho f(x).

We can think of T_\varrho f(x) as a modification of the random variable f(x) where “a 1-\varrho^2 fraction of the variance of x has been averaged out”. We again highlight the two extreme cases:

  • No noise. If \varrho = 1, T_\varrho f(x) = f(x). None of the variance of x has been averaged out.
  • All noise. If \varrho=0, then T_\varrho f(x) = \expect_{g\sim\operatorname{Normal}(0,I)}[f(g)] is a constant random variable. All of the variance of x has been averaged out.

Just as decreasing \varrho smoothes the function T_\varrho f until it reaches a constant function at \varrho = 0, decreasing \varrho makes the random variable T_\varrho f(x) more and more “well-behaved” until it becomes a constant random variable at \varrho = 0. This “well-behavedness” property of the noise operator is made precise by the Gaussian hypercontractivity theorem.

Moments and Tails

In order to describe the “well-behavedness” properties of the noise operator, we must answer the question:

How can we measure how well-behaved a random variable is?

There are many answers to this question. For this post, we will quantify the well-behavedness of a random variable by using the L_p norm. (Using norms is a common way of measuring the niceness of a function or random variable in applied math. For instance, we can use Sobolev norms or reproducing kernel Hilbert space norms to measure the smoothness of a function in approximation theory, as I’ve discussed before on this blog.)

The L_p norm of a (\real-valued) random variable y is defined to be

(3)   \[\norm{y}_p \coloneqq \left( \expect[|y|^p] \right)^{1/p}.\]

The pth power of the L_p norm \norm{y}_p^p is sometimes known as the pth absolute moment of y.

The L_p norms of random variables control the tails of a random variable—that is, the probability that a random variable is large in magnitude. A random variable with small tails is typically thought of as a “nice” or “well-behaved” random variable. Random quantities with small tails are usually desirable in applications, as they are more predictable—unlikely to take large values.

The connection between tails and L_p norms can be derived as follows. First, write the tail probability \prob \{|y| \ge t\} for t > 0 using pth powers:

    \[\prob \{|y| \ge t\} = \prob\{ |y|^p \ge t^p \}.\]

Then, we apply Markov’s inequality, obtaining

(4)   \[\prob \{|y| \ge t\} = \prob \{ |y|^p \ge t^p \} \le \frac{\expect [|y|^p]}{t^p} = \frac{\norm{y}_p^p}{t^p}.\]

We conclude that a random variable with finite L_p norm (i.e., \norm{y}_p < +\infty) has tails that decay at a rate 1/t^p or faster.
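As a quick sanity check of (4), here is a small sketch (assuming NumPy; the choice y = g^2 and the parameter values are illustrative) comparing an empirical tail probability with the bound \norm{y}_p^p/t^p.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.standard_normal(1_000_000) ** 2          # the random variable y = g^2
p, t = 3, 10.0

pth_moment = np.mean(np.abs(y) ** p)             # ||y||_p^p, the pth absolute moment
tail = np.mean(np.abs(y) >= t)                   # empirical P{|y| >= t}
print(tail, pth_moment / t**p)                   # the tail should not exceed the bound
```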

Gaussian Contractivity

Before we introduce the Gaussian hypercontractivity theorem, let’s establish a weaker property of the noise operator, contractivity.

Proposition 1 (Gaussian contractivity). Choose a noise level 0 < \varrho \le 1 and a power p\ge 1, and let x\sim \operatorname{Normal}(0,I) be a Gaussian random vector. Then T_\varrho contracts the L_p norm of f(x):

    \[\norm{T_\varrho f(x)}_p \le \norm{f(x)}_p.\]

This result shows that the noise operator makes the random variable T_\varrho f(x) no less nice than f(x) was.

Gaussian contractivity is easy to prove. Begin by using the definitions of the noise operator (2) and the L_p norm (3):

    \[\norm{T_\varrho f(x)}_p^p = \expect_{x\sim \operatorname{Normal}(0,I)} \left[ \left|\expect_{g\sim \operatorname{Normal}(0,I)}[f(\varrho x + \sqrt{1-\varrho^2}\cdot g)]\right|^p\right].\]

Now, we can apply Jensen’s inequality to the convex function t\mapsto |t|^p, obtaining

    \[\norm{T_\varrho f(x)}_p^p \le \expect_{x,g\sim \operatorname{Normal}(0,I)} \left[ \left|f(\varrho x + \sqrt{1-\varrho^2}\cdot g)\right|^p\right].\]

Finally, realize that for the independent normal random vectors x,g\sim \operatorname{Normal}(0,I), we have

    \[\varrho x + \sqrt{1-\varrho^2}\cdot g \sim \operatorname{Normal}(0,I).\]

Thus, \varrho x + \sqrt{1-\varrho^2}\cdot g has the same distribution as x. Using x in place of \varrho x + \sqrt{1-\varrho^2}\cdot g, we obtain

    \[\norm{T_\varrho f(x)}_p^p \le \expect_{x\sim \operatorname{Normal}(0,I)} \left[ \left|f(x)\right|^p\right] = \norm{f(x)}_p^p.\]

Gaussian contractivity (Proposition 1) is proven.
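Here is a small Monte Carlo check of Proposition 1, assuming NumPy; the test function f and the parameter values are arbitrary choices of mine. The inner average over g approximates T_\varrho f at each sampled x, and the two printed values should satisfy the contractivity inequality up to sampling error.

```python
import numpy as np

rng = np.random.default_rng(0)
rho, p = 0.7, 4
m, k = 2000, 2000                        # outer samples of x, inner samples of g

f = lambda t: t**3 - t                   # an arbitrary test function f : R -> R
x = rng.standard_normal(m)
g = rng.standard_normal((m, k))
Tf_x = np.mean(f(rho * x[:, None] + np.sqrt(1 - rho**2) * g), axis=1)

lp = lambda y: np.mean(np.abs(y) ** p) ** (1 / p)
print(lp(Tf_x), lp(f(x)))                # expect lp(Tf_x) <= lp(f(x))
```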

Gaussian Hypercontractivity

The Gaussian contractivity result (Proposition 1) shows that T_\varrho f(x) is no less well-behaved than f(x) is. In fact, T_\varrho f(x) is more well-behaved than f(x). This is the content of the Gaussian hypercontractivity theorem:

Theorem 2 (Gaussian hypercontractivity): Choose a noise level 0 < \varrho \le 1 and a power p\ge 1, and let x\sim \operatorname{Normal}(0,I) be a Gaussian random vector. Then

    \[\norm{T_\varrho f(x)}_{1+(p-1)/\varrho^2} \le \norm{f(x)}_p.\]

In particular, for p=2,

    \[\norm{T_\varrho f(x)}_{1+\varrho^{-2}} \le \norm{f(x)}_2.\]

We have highlighted the p=2 case because it is the most useful in practice.

This result shows that as we take \varrho smaller, the random variable T_\varrho f(x) becomes more and more well-behaved, with tails decreasing at a rate

    \[\prob \{ |T_\varrho f(x)| \ge t \} \le \frac{\norm{T_\varrho f(x)}_{1+(p-1)/\varrho^2}^{1+(p-1)/\varrho^2}}{t^{1 + (p-1)/\varrho^2}} \le \frac{\norm{f(x)}_p^{1+(p-1)/\varrho^2}}{t^{1 + (p-1)/\varrho^2}}.\]

The rate of tail decrease becomes faster and faster as \varrho becomes closer to zero.
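The same kind of Monte Carlo experiment as before (again assuming NumPy; all names and parameter values are illustrative) can probe the hypercontractive statement for p = 2: the L_{1+\varrho^{-2}} norm of T_\varrho f(x) should not exceed the L_2 norm of f(x).

```python
import numpy as np

rng = np.random.default_rng(1)
rho = 0.6
q = 1 + 1 / rho**2                      # the hypercontractive exponent for p = 2
m, k = 2000, 2000

f = lambda t: t**3 - t                  # an arbitrary test function
x = rng.standard_normal(m)
g = rng.standard_normal((m, k))
Tf_x = np.mean(f(rho * x[:, None] + np.sqrt(1 - rho**2) * g), axis=1)

lhs = np.mean(np.abs(Tf_x) ** q) ** (1 / q)     # ||T_rho f(x)||_{1 + rho^{-2}}
rhs = np.mean(f(x) ** 2) ** 0.5                 # ||f(x)||_2
print(lhs, rhs)                                 # expect lhs <= rhs
```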

We will prove the Gaussian hypercontractivity theorem at the bottom of this post. For now, we will focus on applying this result.

Multilinear Polynomials

A multilinear polynomial f(x) = f(x_1,\ldots,x_n) is a multivariate polynomial in the variables x_1,\ldots,x_n in which none of the variables x_1,\ldots,x_n is raised to a power higher than one. So,

(5)   \[1+x_1x_2\]

is multilinear, but

    \[1+x_1+x_1x_2^2\]

is not multilinear (since x_2 is squared).

For multilinear polynomials, we have the following very powerful corollary of Gaussian hypercontractivity:

Corollary 3 (Absolute moments of a multilinear polynomial of Gaussians). Let f be a multilinear polynomial of degree k. (That is, every monomial x_{i_1}\cdots x_{i_s} of f involves at most k variables.) Then, for a Gaussian random vector x\sim \operatorname{Normal}(0,I) and for all q \ge 2,

    \[\norm{f(x)}_q \le (q-1)^{k/2} \norm{f(x)}_2.\]

Let’s prove this corollary. The first observation is that the noise operator has a particularly convenient form when applied to a multilinear polynomial. Let’s test it out on our example (5) from above. For

    \[f(x) = 1+x_1x_2,\]

we have

    \begin{align*}T_\varrho f(x) &= \expect_{g_1,g_2 \sim \operatorname{Normal}(0,1)} \left[1+ (\varrho x_1 + \sqrt{1-\varrho^2}\cdot g_1)(\varrho x_2 + \sqrt{1-\varrho^2}\cdot g_2)\right]\\&= 1 + \expect[\varrho x_1 + \sqrt{1-\varrho^2}\cdot g_1]\expect[\varrho x_2 + \sqrt{1-\varrho^2}\cdot g_2]\\&= 1+ (\varrho x_1)(\varrho x_2) \\&= f(\varrho x).\end{align*}

We see that the expectation applies to each variable separately, resulting in each x_i replaced by \varrho x_i. This trend holds in general:

Proposition 4 (noise operator on multilinear polynomials). For any multilinear polynomial f, T_\varrho f(x) = f(\varrho x).
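Here is a brief numerical illustration of Proposition 4, assuming NumPy; the particular multilinear polynomial below is an arbitrary example of mine. The Monte Carlo average of f over the noise corruption should agree with f(\varrho x) up to sampling error.

```python
import numpy as np

rng = np.random.default_rng(2)
rho = 0.8

# An arbitrary degree-2 multilinear polynomial, written to accept batches of inputs
f = lambda z: 1 + z[..., 0] * z[..., 1] - 2 * z[..., 1] * z[..., 2]

x = rng.standard_normal(3)
g = rng.standard_normal((1_000_000, 3))
noisy = rho * x + np.sqrt(1 - rho**2) * g        # samples of the noise corruption of x
print(np.mean(f(noisy)), f(rho * x))             # T_rho f(x) and f(rho x): nearly equal
```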

We can use Proposition 4 to obtain bounds on the L_p norms of multilinear polynomials of a Gaussian random variable. Indeed, observe that

    \[f(x) = f(\varrho \cdot x/\varrho) = T_\varrho f(x/\varrho).\]

Thus, by Gaussian hypercontractivity, we have

    \[\norm{f(x)}_{1+\varrho^{-2}}=\norm{T_\varrho f(x/\varrho)}_{1+\varrho^{-2}} \le \norm{f(x/\varrho)}_2.\]

The final step of our argument will be to compute \norm{f(x/\varrho)}_2. Write f as

    \[f(x) = \sum_{i_1,\ldots,i_s} a_{i_1,\ldots,i_s} x_{i_1} \cdots x_{i_s}.\]

Since f is multilinear, i_j\ne i_\ell for j\ne \ell. Since f is degree-k, s\le k. The multilinear monomials x_{i_1}\cdots x_{i_s} are orthonormal with respect to the L_2 inner product:

    \[\expect[(x_{i_1}\cdots x_{i_s}) \cdot (x_{i_1'}\cdots x_{i_{s'}'})] = \begin{cases} 0 &\text{if } \{i_1,\ldots,i_s\} \ne \{i_1',\ldots,i_{s'}'\}, \\1, & \text{otherwise}.\end{cases}\]

(See if you can see why!) Thus, by the Pythagorean theorem, we have

    \[\norm{f(x)}_2^2 = \sum_{i_1,\ldots,i_s} a_{i_1,\ldots,i_s}^2.\]

Similarly, the coefficients of f(x/\varrho) are a_{i_1,\ldots,i_s} / \varrho^s. Thus,

    \[\norm{f(x/\varrho)}_2^2 = \sum_{i_1,\ldots,i_s} \varrho^{-2s} a_{i_1,\ldots,i_s}^2 \le \varrho^{-2k} \sum_{i_1,\ldots,i_s} a_{i_1,\ldots,i_s}^2 = \varrho^{-2k}\norm{f(x)}_2^2.\]

Thus, putting all of the ingredients together, we have

    \[\norm{f(x)}_{1+\varrho^{-2}}=\norm{T_\varrho f(x/\varrho)}_{1+\varrho^{-2}} \le \norm{f(x/\varrho)}_2 \le \varrho^{-k} \norm{f(x)}_2.\]

Setting q = 1+\varrho^{-2} (equivalently \varrho = 1/\sqrt{q-1}), Corollary 3 follows.
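As a sanity check on Corollary 3 (assuming NumPy; the polynomial and the exponent q are arbitrary choices), one can compare the empirical L_q norm of a multilinear polynomial of Gaussians with the bound (q-1)^{k/2}\norm{f(x)}_2.

```python
import numpy as np

rng = np.random.default_rng(3)
# An arbitrary multilinear polynomial of degree k = 2
f = lambda z: z[..., 0] * z[..., 1] + 3 * z[..., 1] * z[..., 2]
k, q = 2, 4

x = rng.standard_normal((2_000_000, 3))
fx = f(x)
lq = np.mean(np.abs(fx) ** q) ** (1 / q)          # ||f(x)||_q
l2 = np.mean(fx ** 2) ** 0.5                      # ||f(x)||_2
print(lq, (q - 1) ** (k / 2) * l2)                # expect lq <= (q-1)^{k/2} * l2
```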

Hanson–Wright Inequality

To see the power of the machinery we have developed, let’s prove a version of the Hanson–Wright inequality.

Theorem 5 (suboptimal Hanson–Wright). Let A be a symmetric matrix with zero on its diagonal and x\sim \operatorname{Normal}(0,I) be a Gaussian random vector. Then

    \[\prob \{|x^\top A x| \ge t \} \le \exp\left(- \frac{t}{\sqrt{2}\mathrm{e}\norm{A}_{\rm F}} \right) \quad \text{for } t\ge \sqrt{2}\mathrm{e}\norm{A}_{\rm F}.\]

Hanson–Wright has all sorts of applications in computational mathematics and data science. One direct application is to obtain probabilistic error bounds for the error incurred by stochastic trace estimation formulas.

This version of Hanson–Wright is not perfect. In particular, it does not capture the Bernstein-type tail behavior of the classical Hanson–Wright inequality

    \[\prob\{|x^\top Ax| \ge t\} \le 2\exp \left( -\frac{t^2}{4\norm{A}_{\rm F}^2+4\norm{A}t} \right).\]

But our suboptimal Hanson–Wright inequality is still pretty good, and it requires essentially no work to prove using the hypercontractivity machinery. The hypercontractivity technique also generalizes to settings where some of the proofs of Hanson–Wright fail, such as multilinear polynomials of degree higher than two.

Let’s prove our suboptimal Hanson–Wright inequality. Set f(x) = x^\top Ax. Since A has zero on its diagonal, f is a multilinear polynomial of degree two in the entries of x. The random variable f(x) is mean-zero, and a short calculation shows its L_2 norm is

    \[\norm{f(x)}_2 = \sqrt{\Var(f(x))} = \sqrt{2} \norm{A}_{\rm F}.\]

Thus, by Corollary 3,

(6)   \[\norm{f(x)}_q \le (q-1) \norm{f(x)}_2 \le \sqrt{2} q \norm{A}_{\rm F} \quad \text{for every } q\ge 2. \]

In fact, since the L_q norms are monotone, (6) holds for 1\le q\le 2 as well. Therefore, the standard tail bound for L_q norms (4) gives

(7)   \[\prob \{|x^\top A x| \ge t \} \le \frac{\norm{f(x)}_q^q}{t^q} \le \left( \frac{\sqrt{2}q\norm{A}_{\rm F}}{t} \right)^q\quad \text{for }q\ge 1.\]

Now, we must optimize the value of q to obtain the sharpest possible bound. To make this optimization more convenient, introduce a parameter

    \[\alpha \coloneqq \frac{\sqrt{2}q\norm{A}_{\rm F}}{t}.\]

In terms of the \alpha parameter, the bound (7) reads

    \[\prob \{|x^\top A x| \ge t \} \le \exp\left(- \frac{t}{\sqrt{2}\norm{A}_{\rm F}} \alpha \ln \frac{1}{\alpha} \right) \quad \text{for } t\ge \frac{\sqrt{2}\norm{A}_{\rm F}}{\alpha}.\]

The tail bound is minimized by taking \alpha = 1/\mathrm{e}, yielding the claimed result

    \[\prob \{|x^\top A x| \ge t \} \le \exp\left(- \frac{t}{\sqrt{2}\mathrm{e}\norm{A}_{\rm F}} \right) \quad \text{for } t\ge \sqrt{2}\mathrm{e}\norm{A}_{\rm F}.\]
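For a concrete check of this final bound, here is a Monte Carlo sketch (assuming NumPy; the random matrix and the value of t are illustrative choices within the valid range). The empirical tail probability should sit below \exp(-t/(\sqrt{2}\mathrm{e}\norm{A}_{\rm F})), and the last line reconfirms the variance identity used above.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 10
A = rng.standard_normal((n, n))
A = (A + A.T) / 2                                # symmetrize
np.fill_diagonal(A, 0)                           # zero diagonal

x = rng.standard_normal((1_000_000, n))
quad = np.einsum('si,ij,sj->s', x, A, x)         # x^T A x, one value per sample

frob = np.linalg.norm(A, 'fro')
t = 1.5 * np.sqrt(2) * np.e * frob               # a t with t >= sqrt(2)*e*||A||_F
print(np.mean(np.abs(quad) >= t),                # empirical P{|x^T A x| >= t}
      np.exp(-t / (np.sqrt(2) * np.e * frob)))   # the suboptimal Hanson-Wright bound
print(np.std(quad), np.sqrt(2) * frob)           # checks ||f(x)||_2 = sqrt(2)*||A||_F
```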

Proof of Gaussian Hypercontractivity

Let’s prove the Gaussian hypercontractivity theorem. For simplicity, we will stick with the n = 1 case, but the higher-dimensional generalizations follow along similar lines. The key ingredient will be the Gaussian Jensen inequality, which made a prominent appearance in a previous blog post of mine. Here, we will only need the following version:

Theorem 6 (Gaussian Jensen). Let b : \real^2 \to \real be a twice differentiable function and let (x,\tilde{x}) \sim \operatorname{Normal}(0,\Sigma) be jointly Gaussian random variables with covariance matrix \Sigma. Then

(8)   \[b(\expect[h_1(x)], \expect[h_2(\tilde{x})]) \ge \expect [b(h_1(x),h_2(\tilde{x}))]\]

holds for all test functions h_1,h_2 : \real \to \real if, and only if,

(9)   \[\Sigma \circ \nabla^2 b \quad\text{is negative semidefinite on all of $\real^2$}.\]

Here, \circ denotes the entrywise product of matrices and \nabla^2 b : \real^2\to \real^{2\times 2} is the Hessian matrix of the function b.

To me, this proof of Gaussian hypercontractivity using Gaussian Jensen (adapted from Paata Ivanishvili’s excellent post) is amazing. First, we reformulate the Gaussian hypercontractivity property a couple of times using some functional analysis tricks. Then we do a short calculation, invoke Gaussian Jensen, and the theorem is proved, almost as if by magic.

Part 1: Tricks

Let’s begin with the “tricks” part of the argument.

Trick 1. To prove Gaussian hypercontractivity holds for all functions f, it is sufficient to prove it for all nonnegative functions f\ge 0.

Indeed, suppose Gaussian hypercontractivity holds for all nonnegative functions f. Then, for any function f, apply Jensen’s inequality to conclude

    \begin{align*} T_\varrho |f|(x) &= \expect_{g\sim \operatorname{Normal}(0,1)} \left| f(\varrho x+\sqrt{1-\varrho^2}\cdot g)\right| \\&\ge \left| \expect_{g\sim \operatorname{Normal}(0,1)} f(\varrho x+\sqrt{1-\varrho^2}\cdot g)\right| \\&= |T_\varrho f(x)|.\end{align*}

Thus, assuming hypercontractivity holds for the nonnegative function |f|, we have

    \[\norm{T_\varrho f(x)}_{1+(p-1)/\varrho^2} \le \norm{T_\varrho |f|(x)}_{1+(p-1)/\varrho^2} \le \norm{|f|(x)}_p = \norm{f(x)}_p.\]

Thus, the conclusion of the hypercontractivity theorem holds for f as well, and Trick 1 is proven.

Trick 2. To prove Gaussian hypercontractivity for all f\ge 0, it is sufficient to prove the following “bilinearized” Gaussian hypercontractivity result:

    \[\expect[g(x) \cdot T_\varrho f(x)]\le \norm{g(x)}_{q'} \norm{f(x)}_p\]

holds for all g\ge 0 with \norm{g(x)}_{q'} < +\infty. Here, q'=q/(q-1) is the Hölder conjugate to q = 1+(p-1)/\varrho^2.

Indeed, this follows from the dual characterization of the norm of T_\varrho f(x). (This argument may be easier to parse if we view f and g as functions on \real equipped with the standard Gaussian measure \gamma; the claim is just duality for the L_q(\gamma) norm.) The dual characterization states

    \[\norm{T_\varrho f(x)}_q = \sup_{\substack{\norm{g(x)}_{q'} < +\infty \\ g\ge 0}} \frac{\expect[g(x) \cdot T_\varrho f(x)]}{\norm{g(x)}_{q'}}.\]

Trick 2 is proven.

Trick 3. Let x,\tilde{x} be a pair of standard Gaussian random variables with correlation \varrho. Then the bilinearized Gaussian hypercontractivity statement is equivalent to

    \[\expect[g(x) f(\tilde{x})]\le (\expect[(g(x)^{q'})])^{1/q'} (\expect[(f(\tilde{x})^{p})])^{1/p}.\]

Indeed, define \tilde{x} = \varrho x + \sqrt{1-\varrho^2} \cdot g, where g is the Gaussian noise variable appearing in the definition (2) of the noise operator T_\varrho. The random variable \tilde{x} is standard Gaussian and has correlation \varrho with x, concluding the proof of Trick 3.

Finally, we apply a change of variables as our last trick:

Trick 4. Make the change of variables u \coloneqq f^p and v \coloneqq g^{q'}, yielding the final equivalent version of Gaussian hypercontractivity:

    \[\expect[v(x)^{1/q'} u(\tilde{x})^{1/p}]\le (\expect[v(x)])^{1/q'} (\expect[u(\tilde{x})])^{1/p}\]

for all functions u and v (in the appropriate spaces).

Part 2: Calculation

We recognize this fourth equivalent version of Gaussian hypercontractivity as the conclusion (8) of Gaussian Jensen with

    \[b(u,v) = u^{1/p}v^{1/q'}.\]

Thus, to prove Gaussian hypercontractivity, we just need to check the hypothesis (9) of the Gaussian Jensen inequality (Theorem 6).

We now enter the calculation part of the proof. First, we compute the Hessian of b:

    \[\nabla^2 b(u,v) = u^{1/p}v^{1/q'}\cdot\begin{bmatrix} - \frac{1}{pp'} u^{-2} & \frac{1}{pq'} u^{-1}v^{-1} \\ \frac{1}{pq'} u^{-1}v^{-1} & - \frac{1}{qq'} v^{-2}\end{bmatrix}.\]

We have written p' for the Hölder conjugate to p. By Gaussian Jensen, to prove Gaussian hypercontractivity, it suffices to show that

    \[\nabla^2 b(u,v)\circ \twobytwo{1}{\varrho}{\varrho}{1}= u^{1/p}v^{1/q'}\cdot\begin{bmatrix} - \frac{1}{pp'} u^{-2} & \frac{\varrho}{pq'} u^{-1}v^{-1} \\ \frac{\varrho}{pq'} u^{-1}v^{-1} & - \frac{1}{qq'} v^{-2}\end{bmatrix}\]

is negative semidefinite for all u,v\ge 0. There are a few ways we can make our lives easier. Write this matrix as

    \[\nabla^2 b(u,v)\circ \twobytwo{1}{\varrho}{\varrho}{1}= u^{1/p}v^{1/q'}\cdot B^\top\begin{bmatrix} - \frac{p}{p'} & \varrho \\ \varrho & - \frac{q'}{q} \end{bmatrix}B \quad \text{for } B = \operatorname{diag}(p^{-1}u^{-1},(q')^{-1}v^{-1}).\]

Scaling A\mapsto \alpha \cdot A by nonnegative \alpha and conjugation A\mapsto B^\top A B both preserve negative semidefiniteness, so it is sufficient to prove

    \[H = \begin{bmatrix} - \frac{p}{p'} & \varrho \\ \varrho & - \frac{q'}{q} \end{bmatrix} \quad \text{is negative semidefinite}.\]

Since the diagonal entries of H are negative, at least one of H’s eigenvalues is negative. (Indeed, by the Rayleigh–Ritz variational principle, the smallest eigenvalue of a symmetric matrix H is \lambda_{\rm min}(H) = \min_{\norm{x}=1} x^\top Hx. Taking x = e_i for each of the standard basis vectors shows that the smallest eigenvalue of H is at most the smallest diagonal entry of H.) Therefore, to prove H is negative semidefinite, we can prove that its determinant (= product of its eigenvalues) is nonnegative. We compute

    \[\det H = \frac{pq'}{p'q} - \varrho^2 .\]

Now, just plug in the values for p'=p/(p-1), q=1+(p-1)/\varrho^2, q'=q/(q-1):

    \[\det H = \frac{pq'}{p'q} - \varrho^2 = \frac{p-1}{q-1} - \varrho^2 = \frac{p-1}{(p-1)/\varrho^2} - \varrho^2 = 0.\]

Thus, \det H \ge 0. We conclude H is negative semidefinite, proving the Gaussian hypercontractivity theorem.
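As a check on the algebra above, here is a short symbolic sketch (assuming SymPy) verifying that \det H vanishes for the stated choices of p', q, and q'.

```python
import sympy as sp

p, rho = sp.symbols('p rho', positive=True)
p_conj = p / (p - 1)                  # Hölder conjugate p'
q = 1 + (p - 1) / rho**2
q_conj = q / (q - 1)                  # Hölder conjugate q'

H = sp.Matrix([[-p / p_conj, rho], [rho, -q_conj / q]])
print(sp.simplify(H.det()))           # simplifies to 0
```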
