Note to Self: Norm of a Gaussian Random Vector

Let g be a standard Gaussian vector in \mathbb{R}^n, that is, a vector populated by n independent standard normal random variables g_1,\ldots,g_n. What is the expected length \mathbb{E} \|g\| of g? (Here, and throughout, \|\cdot\| denotes the Euclidean norm of a vector.) The length of g is the square root of the sum of squares of n independent standard normal random variables

    \[\|g\| = \sqrt{g_1^2 + \cdots + g_n^2},\]

which is known as a \chi random variable with n degrees of freedom. (Not to be confused with a \chi^\mathbf{2} random variable!) Its mean value is given by the rather unpleasant formula

    \[\mathbb{E} \|g\| = \sqrt{2} \frac{\Gamma((n+1)/2)}{\Gamma(n/2)},\]

where \Gamma(\cdot) is the gamma function. If you are familiar with the definition of the gamma function, the derivation of this formula is not too hard—it follows from a change of variables to n-dimensional spherical coordinates.
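To get a feel for the formula, here is a minimal sketch that evaluates it using only Python's standard library. The function name `expected_gaussian_norm` is my own choice for illustration. It works with the log-gamma function so that large n does not overflow.

```python
import math

def expected_gaussian_norm(n: int) -> float:
    """Mean of a chi random variable with n degrees of freedom."""
    # Gamma((n+1)/2) / Gamma(n/2), computed on the log scale for stability.
    log_ratio = math.lgamma((n + 1) / 2) - math.lgamma(n / 2)
    return math.sqrt(2.0) * math.exp(log_ratio)

# Compare the exact mean with sqrt(n) for a few dimensions.
for n in (1, 2, 10, 100):
    print(n, expected_gaussian_norm(n), math.sqrt(n))
```

For n = 1 the formula reduces to \sqrt{2/\pi}, the mean of the absolute value of a standard normal, and for n = 2 to \sqrt{\pi/2}, the mean of a Rayleigh random variable.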

This formula can be difficult to interpret and use. Fortunately, we have the rather nice bounds

(1)   \[\sqrt{n-1} < \frac{n}{\sqrt{n+1}} < \mathbb{E} \|g\| < \sqrt{n}. \]

This result appears, for example, on page 11 of this paper. These bounds show that, for large n, \mathbb{E} \|g\| is quite close to \sqrt{n}. The authors of the paper remark that this inequality can be proved by induction. I had difficulty reproducing the inductive argument for myself. Fortunately, I found a different proof which I thought was very nice, so I thought I would share it here.
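As a sanity check (not a proof), the chain of inequalities (1) can be tested numerically. The sketch below recomputes the exact mean from the gamma-function formula and checks all three strict inequalities for a range of n. The helper names are hypothetical.

```python
import math

def expected_gaussian_norm(n):
    # sqrt(2) * Gamma((n+1)/2) / Gamma(n/2), via log-gamma for stability.
    return math.sqrt(2.0) * math.exp(math.lgamma((n + 1) / 2) - math.lgamma(n / 2))

def bounds_hold(n):
    """Check sqrt(n-1) < n/sqrt(n+1) < E||g|| < sqrt(n) for a given n."""
    mean = expected_gaussian_norm(n)
    return math.sqrt(n - 1) < n / math.sqrt(n + 1) < mean < math.sqrt(n)

print(all(bounds_hold(n) for n in range(1, 1001)))
```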

Our core tool will be Wendel’s inequality (see (7) in Wendel’s original paper): For x > 0 and 0 < s < 1, we have

(2)   \[\frac{x}{(x+s)^{1-s}} < \frac{\Gamma(x+s)}{\Gamma(x)} < x^s. \]

Let us first use Wendel’s inequality to prove (1). Indeed, invoke Wendel’s inequality with x = n/2 and s = 1/2 and multiply by \sqrt{2} to obtain

    \[\frac{\sqrt{2} \cdot n/2}{(n/2+1/2)^{1/2}} < \sqrt{2}\frac{\Gamma((n+1)/2)}{\Gamma(n/2)} = \mathbb{E}\|g\| < \sqrt{2}\cdot \sqrt{n/2},\]

which simplifies directly to (1).

Now, let’s prove Wendel’s inequality (2). The key property for us will be the strict log-convexity of the gamma function: For real numbers x,y > 0 with x \ne y and 0 < s < 1,

(3)   \[\Gamma((1-s)x + sy) < \Gamma(x)^{1-s} \Gamma(y)^s. \]
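Log-convexity is easy to spot-check numerically. The sketch below measures the gap in (3) on the log scale, which should be strictly positive whenever x and y are distinct. The function name and the test points are my own choices.

```python
import math

def log_convexity_gap(x, y, s):
    """log of the right side of (3) minus log of the left side."""
    lhs = math.lgamma((1 - s) * x + s * y)
    rhs = (1 - s) * math.lgamma(x) + s * math.lgamma(y)
    return rhs - lhs

# A few hypothetical test points (x, y, s) with x != y.
for x, y, s in [(1.0, 2.0, 0.5), (2.0, 7.0, 0.5), (0.5, 3.0, 0.3)]:
    print(x, y, s, log_convexity_gap(x, y, s))
```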

We take this property as established and use it to prove Wendel’s inequality. First, use the log-convexity property (3) with y = x+1 to obtain

    \[\Gamma(x+s) = \Gamma((1-s)x + s(x+1)) < \Gamma(x)^{1-s} \Gamma(x+1)^s.\]

Divide by \Gamma(x) and use the property that \Gamma(x+1)/\Gamma(x) = x to conclude

(4)   \[\frac{\Gamma(x+s)}{\Gamma(x)} < \left( \frac{\Gamma(x+1)}{\Gamma(x)} \right)^s = x^s. \]

This proves the upper bound in Wendel’s inequality (2). To prove the lower bound, invoke the upper bound (4) with x+s in place of x and 1-s in place of s to obtain

    \[\frac{\Gamma(x+1)}{\Gamma(x+s)} < (x+s)^{1-s}.\]

Multiplying by \Gamma(x+s), dividing by (x+s)^{1-s}\Gamma(x), and using \Gamma(x+1)/\Gamma(x) = x again yields

    \[\frac{\Gamma(x+s)}{\Gamma(x)} > \frac{\Gamma(x+1)}{\Gamma(x)} \cdot \frac{1}{(x+s)^{1-s}} = \frac{x}{(x+s)^{1-s}},\]

finishing the proof of Wendel’s inequality.
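Here is a small numerical spot-check of Wendel’s inequality (2) on a grid of x and s. It is a sketch using only the standard library, and the grid values are arbitrary.

```python
import math

def wendel_holds(x, s):
    """Check x / (x+s)^(1-s) < Gamma(x+s) / Gamma(x) < x^s."""
    ratio = math.exp(math.lgamma(x + s) - math.lgamma(x))
    lower = x / (x + s) ** (1 - s)
    upper = x ** s
    return lower < ratio < upper

xs = [0.1, 0.5, 1.0, 2.5, 10.0, 100.0]
ss = [0.1, 0.25, 0.5, 0.75, 0.9]
print(all(wendel_holds(x, s) for x in xs for s in ss))
```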

Notes. The upper bound in (1) can be proven directly by Lyapunov’s inequality: \mathbb{E} \|g\| \le (\mathbb{E} \|g\|^2)^{1/2} = n^{1/2}, where we use the fact that \|g\|^2 = g_1^2 + \cdots + g_n^2 is the sum of n random variables with mean one. The weaker lower bound \mathbb{E} \|g\| \ge \sqrt{n-1} follows from a weaker version of Wendel’s inequality, Gautschi’s inequality.

After the initial publication of this post, Sam Buchanan mentioned another proof of the lower bound \mathbb{E} \|g\| \ge \sqrt{n-1} using the Gaussian Poincaré inequality. This inequality states that, for a function f : \mathbb{R}^n \to \mathbb{R},

    \[\operatorname{Var}(f(g)) \le \mathbb{E} \| \nabla f(g)\|^2.\]

To prove the lower bound, set f(g) := \|g\| which has gradient \nabla f(g) = g/\|g\|. Thus,

    \[\mathbb{E} \| \nabla f(g)\|^2 = 1 \ge \operatorname{Var}(f(g)) = \mathbb{E} \|g\|^2 - (\mathbb{E} \|g\|)^2 = n -  (\mathbb{E} \|g\|)^2.\]

Rearrange to obtain \mathbb{E} \|g\| \ge \sqrt{n-1}.
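The Poincaré argument says that the variance of \|g\| is at most one in every dimension. Since the variance equals n - (\mathbb{E}\|g\|)^2, it can be computed from the gamma-function formula. A short sketch, with a hypothetical function name:

```python
import math

def norm_variance(n):
    """Var(||g||) = n - (E||g||)^2 for a standard Gaussian vector in R^n."""
    mean = math.sqrt(2.0) * math.exp(math.lgamma((n + 1) / 2) - math.lgamma(n / 2))
    return n - mean * mean

for n in (1, 2, 10, 100, 1000):
    print(n, norm_variance(n))
```

For n = 1 this gives 1 - 2/\pi, the variance of the absolute value of a standard normal.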

Note to Self: Hanson–Wright Inequality

This post is part of a new series for this blog, Note to Self, where I collect together some notes about an idea related to my research. This content may be much more technical than most of the content of this blog and of much less wide interest. My hope in sharing this is that someone will find this interesting and useful for their own work.

This post is about a fundamental tool of high-dimensional probability, the Hanson–Wright inequality. The Hanson–Wright inequality is a concentration inequality for quadratic forms of random vectors—that is, expressions of the form x^\top A x where x is a random vector. Many statements of this inequality in the literature have an unspecified constant c > 0; our goal in this post will be to derive a fairly general version of the inequality with only explicit constants.

The core object of the Hanson–Wright inequality is a subgaussian random variable. A random variable Y is subgaussian if the probability it exceeds a threshold t in magnitude decays as

(1)   \[\mathbb{P}\{|Y|\ge t\} \le \mathrm{e}^{-t^2/a} \quad \text{for some $a>0$ and for all sufficiently large $t$.} \]

The name subgaussian is appropriate as the tail probabilities of Gaussian random variables exhibit the same square-exponential decrease \mathrm{e}^{-t^2/a}.

A (non-obvious) fact is that if Y is subgaussian in the sense (1) and centered (\mathbb{E} Y = 0), then Y’s cumulant generating function (cgf)

    \[\xi_Y(t) := \log \mathbb{E} \exp(tY)\]

is subquadratic: There is a constant c > 0 (independent of Y and a), for which

(2)   \[\xi_Y(t) \le ca t^2 \quad \text{for all $t\in\mathbb{R}$}. \]

Moreover, a subquadratic cgf (2) also implies the subgaussian tail property (1), with a different parameter a > 0; see Proposition 2.5.2 of Vershynin’s High-Dimensional Probability.

Since properties (1) and (2) are equivalent (up to a change in the parameter a), we are free to fix a version of property (2) as our definition for a (centered) subgaussian random variable.

Definition (subgaussian random variable): A centered random variable X is said to be v-subgaussian or subgaussian with variance proxy v if its cgf is subquadratic:

(3)   \[\xi_{X}(t) \le\frac{1}{2} vt^2 \quad \text{for all $t\in\mathbb{R}$.} \]

For instance, a mean-zero Gaussian random variable X with variance \sigma^2 has cgf

(4)   \[ \xi_X(t) = \frac{1}{2} \sigma^2 t^2,  \]

and is thus subgaussian with variance proxy v = \sigma^2 equal to its variance.
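A second standard example, not discussed above, is a Rademacher random variable (a uniformly random sign \pm 1). Its cgf is \log\cosh(t), which is bounded by t^2/2, so it is subgaussian with variance proxy v = 1. The sketch below checks definition (3) on a grid. The function name is my own.

```python
import math

def rademacher_cgf(t):
    """log E exp(t X) = log cosh(t) for X uniform on {-1, +1}."""
    a = abs(t)
    # log cosh(a) = a + log(1 + exp(-2a)) - log 2, which avoids overflow.
    return a + math.log1p(math.exp(-2 * a)) - math.log(2)

grid = [k / 10 for k in range(-100, 101)]
print(all(rademacher_cgf(t) <= 0.5 * t * t + 1e-12 for t in grid))
```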

Here is a statement of the Hanson–Wright inequality as it typically appears with unspecified constants (see Theorem 6.2.1 of Vershynin’s High-Dimensional Probability):

Theorem (Hanson–Wright): Let x be a random vector with independent centered v-subgaussian entries and let A be a square matrix. Then

    \[\mathbb{P}\left\{\left|x^\top Ax - \mathbb{E} \left[x^\top A x\right]\right|\ge t \right\} \le 2\exp\left(- \frac{c\cdot t^2}{v^2\left\|A\right\|_{\rm F}^2 + v\left\|A\right\|t} \right),\]

where c>0 is a constant (not depending on v, x, t, or A). Here, \|\cdot\|_{\rm F} and \|\cdot\| denote the Frobenius and spectral norms.

This type of concentration is exactly the same type as provided by Bernstein’s inequality (which I discussed in my post on concentration inequalities). In particular, for small deviations t, the tail probabilities are subgaussian with variance proxy \approx v^2\left\|A\right\|_{\rm F}^2:

    \[\mathbb{P}\left\{\left|x^\top Ax - \mathbb{E}\left[x^\top Ax\right]\right|\ge t \right\} \stackrel{\text{small $t$}}{\lessapprox} 2\exp\left(- \frac{c\cdot t^2}{v^2\left\|A\right\|_{\rm F}^2} \right)\]

For large deviations t, this switches to subexponential tail probabilities with decay rate \approx v\|A\|:

    \[\mathbb{P}\left\{\left|x^\top Ax - \mathbb{E}\left[x^\top Ax\right]\right|\ge t \right\} \stackrel{\text{large $t$}}{\lessapprox} 2\exp\left(- \frac{c\cdot t}{v\|A\|} \right).\]

Mediating these two parameter regimes are the size of the matrix A, as measured by its Frobenius and spectral norms, and the degree of subgaussianity of x, measured by the variance proxy v.

Diagonal-Free Hanson–Wright

Now we come to a first version of the Hanson–Wright inequality with explicit constants, first for a matrix which is diagonal-free—that is, having all zeros on the diagonal. I obtained this version of the inequality myself, though I am very sure that this version of the inequality or an improvement thereof appears somewhere in the literature.

Theorem (Hanson–Wright, explicit constants, diagonal-free): Let x be a random vector with independent centered v-subgaussian entries and let A be a diagonal-free square matrix. Then we have the cgf bound

    \[\xi_{x^\top Ax}(t) \le \frac{16v^2\left\|A\right\|_{\rm F}^2\, t^2}{2(1-4v\left\|A\right\|t)}.\]

As a consequence, we have the concentration bound

    \[\mathbb{P} \{ x^\top A x \ge t \} \le \exp\left( -\frac{t^2/2}{16v^2 \left\|A\right\|_{\rm F}^2+4v\left\|A\right\|t} \right).\]
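To see the explicit bound in action, here is a small sketch under assumptions of my own choosing: x has independent Rademacher (random sign) entries, which are 1-subgaussian, and A is the 4-by-4 matrix with zeros on the diagonal and ones elsewhere, whose spectral norm is n - 1 = 3. With only 16 sign patterns, the exact tail probability can be enumerated and compared with the bound.

```python
import itertools
import math

def hanson_wright_tail(t, v, fro_norm, spec_norm):
    """Right-hand side of the diagonal-free concentration bound."""
    return math.exp(-(t * t / 2) / (16 * v**2 * fro_norm**2 + 4 * v * spec_norm * t))

n = 4
# Hypothetical example matrix: ones off the diagonal, zeros on it.
A = [[0.0 if i == j else 1.0 for j in range(n)] for i in range(n)]
fro_norm = math.sqrt(sum(a * a for row in A for a in row))  # sqrt(n(n-1))
spec_norm = n - 1.0  # largest singular value of (all-ones minus identity)

def quad_form(x):
    return sum(x[i] * A[i][j] * x[j] for i in range(n) for j in range(n))

def exact_tail(t):
    """P{x^T A x >= t}, enumerated over all 2^n sign vectors."""
    signs = list(itertools.product((-1.0, 1.0), repeat=n))
    return sum(quad_form(x) >= t for x in signs) / len(signs)

for t in (4.0, 12.0):
    print(t, exact_tail(t), hanson_wright_tail(t, 1.0, fro_norm, spec_norm))
```

In this toy example the bound holds with plenty of room to spare, which is the price of fully explicit constants.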

Let us begin proving this result. Our proof will follow the same steps as Vershynin’s proof in High-Dimensional Probability (which in turn is adapted from an article by Rudelson and Vershynin), but taking care to get explicit constants. Unfortunately, proving all of the relevant tools from first principles would easily triple the length of this post, so I make frequent use of results from the literature.

We begin with the decoupling bound (Theorem 6.1.1 in Vershynin’s High-Dimensional Probability), which allows us to replace one x with an independent copy \tilde{x} at the cost of a factor of four:

(5)   \[\xi_{x^\top Ax}(t) \le \xi_{\tilde{x}^\top Ax}(4t). \]

We seek to compare the bilinear form \tilde{x}^\top Ax to the Gaussian bilinear form \tilde{g}^\top Ag where \tilde{g} and g are independent standard Gaussian vectors. We begin with the following cgf bound for the Gaussian quadratic form g^\top Ag:

    \[\xi_{g^\top Ag}(t) \le \frac{\left\|A\right\|_{\rm F}^2 \, t^2}{1-2\|A\|\, t}.\]

This bound is the result of Example 2.12 in Boucheron, Lugosi, and Massart’s Concentration Inequalities. By applying this result to the Hermitian dilation of A in A’s place, one obtains a similar result for the decoupled bilinear form \tilde{g}^\top Ag:

(6)   \[\xi_{\tilde{g}^\top Ag}(t) \le \frac{\left\|A\right\|_{\rm F}^2 \, t^2}{2(1-\|A\|\, t)}. \]
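The bound (6) can be checked against an exact expression. Conditioning on g and using the Gaussian cgf, a standard computation (not part of the argument above) gives \xi_{\tilde{g}^\top Ag}(t) = -\tfrac{1}{2}\sum_i \log(1 - \sigma_i^2 t^2) for |t| < 1/\|A\|, where \sigma_i are the singular values of A. The sketch below compares this with (6) for a hypothetical set of singular values.

```python
import math

def bilinear_cgf(t, sing_vals):
    """Exact cgf of g~^T A g for independent standard Gaussian g~, g."""
    return -0.5 * sum(math.log1p(-(t * s) ** 2) for s in sing_vals)

def bound_6(t, sing_vals):
    """Right-hand side of (6), valid for 0 <= t < 1 / ||A||."""
    fro_sq = sum(s * s for s in sing_vals)   # squared Frobenius norm
    spec = max(sing_vals)                    # spectral norm
    return fro_sq * t * t / (2 * (1 - spec * t))

sing_vals = [2.0, 1.0, 0.5]  # hypothetical singular values, so ||A|| = 2
for t in (0.1, 0.25, 0.4, 0.49):
    print(t, bilinear_cgf(t, sing_vals), bound_6(t, sing_vals))
```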

We now seek to compare \xi_{\tilde{x}^\top Ax}(t) to \xi_{\tilde{g}^\top Ag}(t). To do this, we first evaluate the cgf of \tilde{x}^\top Ax only over the randomness in \tilde{x}. Since we’re only taking an expectation over the random variable \tilde{x}, we can apply the subquadratic cgf condition (3) to obtain

(7)   \[\log \mathbb{E}_{\tilde{x}} \exp(t \, \tilde{x}^\top Ax) = \sum_{i=1}^n \log \mathbb{E}_{\tilde{x}} \exp(t \,\tilde{x}_i (Ax)_i) \le  \frac{1}{2} v \left(\sum_{i=1}^n (Ax)_i^2\right)t^2 = \frac{1}{2} v\left\|Ax\right\|^2 \, t^2. \]

Now we perform a similar computation for the quantity \tilde{g}^\top Ax in which \tilde{x} has been replaced by the Gaussian vector \tilde{g}:

    \[\log \mathbb{E}_{\tilde{g}} \exp((\sqrt{v} t) \, \tilde{g}^\top Ax) = \frac{1}{2} v \left\|Ax\right\|^2 \, t^2.\]

We stress that this is an equality since the cgf of a Gaussian random variable is given by (4). Thus we can substitute the left-hand side of the above display into the right-hand side of (7), yielding

(8)   \[\log \mathbb{E}_{\tilde{x}} \exp(t \, \tilde{x}^\top Ax) \le \log \mathbb{E}_{\tilde{g}} \exp((\sqrt{v} t) \, \tilde{g}^\top Ax). \]

We now perform this same trick again using the randomness in x:

(9)   \[\log \mathbb{E}_{\tilde{g},x} \exp((\sqrt{v} t) \, \tilde{g}^\top Ax) \le \log \mathbb{E}_{\tilde{g}} \exp \left(\frac{1}{2} v^2 \left\|A^\top \tilde{g}\right\|^2t^2\right) = \log \mathbb{E}_{\tilde{g},g} \exp(v t \, \tilde{g}^\top Ag). \]

Packaging up (8) and (9) gives

(10)   \[\xi_{\tilde{x}^\top Ax}(t)\le \xi_{\tilde{g}^\top Ag}(vt). \]

Combining all these results (5), (6), and (10), we obtain

    \[\xi_{x^\top Ax}(t) \le \xi_{\tilde{x}^\top Ax}(4t) \le \xi_{\tilde{g}^\top Ag}(4vt) \le \frac{16v^2\left\|A\right\|_{\rm F}^2\, t^2}{2(1-4v\left\|A\right\|t)}.\]

This cgf bound implies the desired probability bound as a consequence of the following fact (see Boucheron, Lugosi, and Massart’s Concentration Inequalities page 29 and Exercise 2.8):

Fact (Bernstein concentration from Bernstein cgf bound): Suppose that a random variable X satisfies the cgf bound \xi_X(t) \le \tfrac{vt^2}{2(1-ct)} for 0 < t < 1/c. Then

    \[\mathbb{P} \left\{ X\ge t \right\} \le \exp\left( -\frac{t^2/2}{v+ct} \right).\]
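One way to see where this fact comes from is the Chernoff bound \mathbb{P}\{X \ge t\} \le \exp(\xi_X(\lambda) - \lambda t) with the particular choice \lambda = t/(v+ct), which lies in (0, 1/c). The sketch below checks numerically that this choice reproduces the stated exponent. The parameter values are arbitrary, and this is my own illustration rather than the argument in the cited reference.

```python
import math

def chernoff_exponent(lam, t, v, c):
    """cgf bound at lam minus lam * t, for 0 < lam < 1/c."""
    return v * lam * lam / (2 * (1 - c * lam)) - lam * t

def bernstein_exponent(t, v, c):
    """Exponent in the Bernstein-type tail bound."""
    return -(t * t / 2) / (v + c * t)

v, c = 3.0, 0.5  # hypothetical parameters
for t in (0.1, 1.0, 10.0):
    lam = t / (v + c * t)  # always lies in (0, 1/c)
    print(t, chernoff_exponent(lam, t, v, c), bernstein_exponent(t, v, c))
```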

General Hanson–Wright

Now, here’s a more general result (with worse constants) which permits the matrix A to possess a diagonal.

Theorem (Hanson–Wright, explicit constants): Let x be a random vector with independent centered v-subgaussian entries and let A be an arbitrary square matrix. Then we have the cgf bound

    \[\xi_{x^\top Ax-\mathbb{E} [x^\top A x]}(t) \le \frac{40v^2\left\|A\right\|_{\rm F}^2\, t^2}{2(1-8v\left\|A\right\|t)}.\]

As a consequence, we have the concentration bound

    \[\mathbb{P} \{ x^\top A x-\mathbb{E} [x^\top A x] \ge t \} \le \exp\left( -\frac{t^2/2}{40v^2 \left\|A\right\|_{\rm F}^2+8v\left\|A\right\|t} \right).\]

Decompose the matrix A = D+F into its diagonal and off-diagonal portions. For any two random variables X and Y (possibly highly dependent), we can bound the cgf of their sum using the following “union bound”:

(11)   \begin{align*} \xi_{X+Y}(t) &= \log \mathbb{E} \left[\exp(tX)\exp(tY)\right] \\&\le \log \left(\left[\mathbb{E} \exp(2tX)\right]^{1/2}\left[\mathbb{E}\exp(2tY)\right]^{1/2}\right) \\&=\frac{1}{2} \xi_X(2t) + \frac{1}{2}\xi_Y(2t). \end{align*}

The two equality statements are the definition of the cumulant generating function and the inequality is Cauchy–Schwarz.
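The “union bound” (11) requires no independence, which can be illustrated on a small discrete example. The joint distribution below is made up for the purpose of the check: X and Y are strongly dependent, and the inequality still holds at every t tested.

```python
import math

# Hypothetical joint distribution of (X, Y): entries are (probability, x, y).
joint = [(0.2, -1.0, 2.0), (0.5, 0.5, 0.5), (0.3, 2.0, -1.0)]

def cgf(t, values):
    """log E exp(t Z) for Z taking value z with probability p."""
    return math.log(sum(p * math.exp(t * z) for p, z in values))

def both_sides(t):
    """Left and right sides of the cgf union bound at t."""
    lhs = cgf(t, [(p, x + y) for p, x, y in joint])
    rhs = (0.5 * cgf(2 * t, [(p, x) for p, x, _ in joint])
           + 0.5 * cgf(2 * t, [(p, y) for p, _, y in joint]))
    return lhs, rhs

for t in (-2.0, -0.5, 0.5, 2.0):
    print(t, both_sides(t))
```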

Using the “union bound”, it is sufficient to obtain bounds for the cgfs of the diagonal and off-diagonal parts x^\top D x - \mathbb{E}[x^\top Ax] and x^\top F x. We begin with the diagonal part. We compute

(12)   \begin{align*}\xi_{x^\top D x - \mathbb{E}[x^\top Ax]}(t) &= \log \mathbb{E} \exp\left(t \sum_{i=1}^n A_{ii}(x_i^2 - \mathbb{E}[x_i^2]) \right) \\ &= \sum_{i=1}^n  \log \mathbb{E} \exp\left((t A_{ii})\cdot(x_i^2 - \mathbb{E}[x_i^2]) \right). \end{align*}

For the cgf of x_i^2 - \mathbb{E}[x_i^2], we use the following bound, taken from Appendix B of the following paper:

    \[\log \mathbb{E} \exp\left(t(x_i^2 - \mathbb{E}[x_i^2]) \right) \le \frac{8v^2t^2}{1-2v|t|}.\]

Substituting this result into (12) gives

(13)   \[\xi_{x^\top D x - \mathbb{E}[x^\top Ax]}(t) \le \sum_{i=1}^n \frac{8v^2|A_{ii}|^2t^2}{1-2v|A_{ii}|t} \le \frac{8v^2\|A\|_{\rm F}^2t^2}{1-2v\|A\|t}\quad \text{for $t>0$}. \]

For the second inequality, we used the facts that \max_i |A_{ii}| \le \|A\| and \sum_i |A_{ii}|^2 \le \|A\|_{\rm F}^2.

We now look at the off-diagonal part x^\top F x. We use a version of the decoupling bound (5) where we compare x^\top F x to \tilde{x}^\top A x, where we’ve both replaced one copy of x with an independent copy and reinstated the diagonal of A (see Remark 6.1.3 in Vershynin’s High-Dimensional Probability):

    \[\xi_{x^\top F x}(t) \le \xi_{\tilde{x}^\top Ax}(4t).\]

We can now just repeat the rest of the argument for the diagonal-free Hanson–Wright inequality, yielding the same conclusion

(14)   \[ \xi_{x^\top Fx}(t) \le \frac{16v^2\left\|A\right\|_{\rm F}^2\, t^2}{2(1-4v\left\|A\right\|t)}.  \]

Combining (11), (13), and (14), we obtain

    \begin{align*}\xi_{x^\top Ax-\mathbb{E} [x^\top A x]}(t) &\le \frac{1}{2} \xi_{x^\top D x - \mathbb{E}[x^\top Ax]}(2t) + \frac{1}{2} \xi_{x^\top Fx}(2t) \\&\le \frac{8v^2\|A\|_{\rm F}^2t^2}{2(1-4v\|A\|t)} + \frac{32v^2\left\|A\right\|_{\rm F}^2\, t^2}{2(1-8v\left\|A\right\|t)} \\&\le \frac{40v^2\left\|A\right\|_{\rm F}^2\, t^2}{2(1-8v\left\|A\right\|t)}.\end{align*}

As above, this cgf bound implies the desired probability bound.