Markov Musings 5: Poincaré Inequalities

In the previous posts, we’ve been using eigenvalues to understand the mixing of reversible Markov chains. Our main convergence result was as follows:

    \[\chi^2\left(\rho^{(n)} \, \middle|\middle| \, \pi\right) \le \left( \max \{ \lambda_2, -\lambda_m \} \right)^{2n} \chi^2\left(\rho^{(0)} \, \middle|\middle| \, \pi\right).\]

Here, \rho^{(n)} denotes the distribution of the chain at time n, \pi denotes the stationary distribution, \chi^2(\cdot \mid\mid \cdot) denotes the \chi^2 divergence, and 1 = \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_m \ge -1 denote the decreasingly ordered eigenvalues of the Markov transition matrix P.

Bounding the rate of convergence requires an upper bound on \lambda_2 and a lower bound on \lambda_m. In this post, we will talk about techniques for bounding \lambda_2. For more on the smallest eigenvalue \lambda_m, see the previous post.
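
As a quick numerical illustration of the convergence bound above, here is a short Python sketch. The chain is built from symmetric edge weights, an illustrative construction not taken from this series; the eigenvalues are obtained from the symmetrized matrix D^{1/2} P D^{-1/2}, which shares its eigenvalues with P when the chain is reversible.

    import numpy as np

    # Small reversible chain built from symmetric weights (illustrative choice).
    rng = np.random.default_rng(0)
    W = rng.random((5, 5)); W = W + W.T
    d = W.sum(axis=1)
    P = W / d[:, None]              # transition matrix (rows sum to 1)
    pi = d / d.sum()                # stationary distribution

    def chi2(rho, pi):
        return np.sum((rho - pi) ** 2 / pi)

    # Eigenvalues of P via the symmetric matrix D^{1/2} P D^{-1/2}
    D_half = np.diag(np.sqrt(pi)); D_half_inv = np.diag(1 / np.sqrt(pi))
    lam = np.sort(np.linalg.eigvalsh(D_half @ P @ D_half_inv))[::-1]
    rate = max(lam[1], -lam[-1])    # max{lambda_2, -lambda_m}

    rho0 = np.zeros(5); rho0[0] = 1.0   # start at state 1
    chi0 = chi2(rho0, pi)
    rho = rho0.copy()
    for n in range(1, 11):
        rho = rho @ P
        assert chi2(rho, pi) <= rate ** (2 * n) * chi0 + 1e-12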

Setting

Let’s begin by establishing some notation, mostly the same as in previous posts in this series. We work with a reversible Markov chain with transition matrix P and stationary distribution \pi.

As in previous posts, we identify vectors f \in \real^m and functions f : \{1,\ldots,m\} \to \real, treating them as one and the same: f_i = f(i).

For a vector/function f, \expect_\pi[f] and \Var_\pi(f) denote the mean and variance with respect to the stationary distribution \pi:

    \[\expect_\pi[f] = \sum_{i=1}^m f(i) \pi_i, \quad \Var_\pi(f) \coloneqq \expect_\pi[(f-\expect_\pi[f])^2].\]

We will make frequent use of the \pi-inner product

    \[\langle f, g\rangle \coloneqq \expect_\pi[f\cdot g] = \sum_{i=1}^m f(i) g(i) \pi_i.\]


We shall also use expressions such as \expect_{x \sim \sigma, y\sim \tau} [f(x,y)] to denote the expectation of f(x,y) where x is drawn from distribution \sigma and y is drawn from \tau.

We denote the eigenvalues of the transition matrix by 1 = \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_m \ge -1. The associated eigenvectors (eigenfunctions) \varphi_1,\ldots,\varphi_m are orthonormal in the \pi-inner product:

    \[\langle \varphi_i ,\varphi_j\rangle = \begin{cases}1, & i = j, \\0, & i \ne j.\end{cases}\]
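
For concreteness, here is a Python sketch that computes these eigenfunctions for a small reversible chain (built from symmetric edge weights, an illustrative choice) and confirms their orthonormality in the \pi-inner product.

    import numpy as np

    # Small reversible chain built from symmetric weights (illustrative choice).
    rng = np.random.default_rng(0)
    W = rng.random((5, 5)); W = W + W.T
    d = W.sum(axis=1)
    P = W / d[:, None]
    pi = d / d.sum()

    # The symmetric matrix S = D^{1/2} P D^{-1/2} shares eigenvalues with P.
    D_half = np.diag(np.sqrt(pi)); D_half_inv = np.diag(1 / np.sqrt(pi))
    lam, V = np.linalg.eigh(D_half @ P @ D_half_inv)
    lam, V = lam[::-1], V[:, ::-1]          # sort decreasingly
    Phi = D_half_inv @ V                    # eigenfunctions of P itself

    # pi-inner product: <f, g> = sum_i f(i) g(i) pi_i
    gram = Phi.T @ np.diag(pi) @ Phi
    assert np.allclose(gram, np.eye(5))     # orthonormal in <.,.>_pi
    assert np.allclose(P @ Phi, Phi * lam)  # P phi_i = lambda_i phi_i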

Variance and Local Variance

To discover methods for bounding \lambda_2, we begin by investigating a seemingly simple question:

How variable is the output of a function f : \{1,\ldots,m\} \to \real?

There are two natural quantities which provide answers to this question: the variance and the local variance. Poincaré inequalities—the main subject of this post—establish a relation between these two numbers. As a consequence, Poincaré inequalities will provide a bound on \lambda_2.

Variance

We begin with the first of our two main characters, the variance \Var_\pi(f). The variance is a very familiar measure of variation, as it is defined for any random variable. It measures the average squared deviation of f(x) from its mean, where x is drawn from the stationary distribution \pi.

Another helpful formula for the variance is the exchangeable pairs formula:

    \[\Var_\pi(f) = \frac{1}{2} \expect_{x,y \sim \pi} [(f(x) - f(y))^2].\]

The exchangeable pairs formula states that the variance of f is proportional to the average square difference of f's values when measured at locations x and y sampled (independently) from the stationary distribution \pi.
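
The exchangeable pairs formula is easy to confirm numerically. Here is a tiny Python sketch, with a distribution \pi and function f chosen arbitrarily for illustration.

    import numpy as np

    # Check Var_pi(f) = 0.5 * E_{x,y ~ pi}[(f(x) - f(y))^2] on a small example.
    pi = np.array([0.1, 0.2, 0.3, 0.25, 0.15])   # illustrative distribution
    f = np.array([1.0, -2.0, 0.5, 3.0, 0.0])     # illustrative function

    mean = np.sum(f * pi)
    var = np.sum((f - mean) ** 2 * pi)

    # Average of (f(x) - f(y))^2 over independent x, y ~ pi
    diff_sq = (f[:, None] - f[None, :]) ** 2
    exch = 0.5 * np.sum(pi[:, None] * pi[None, :] * diff_sq)

    assert np.isclose(var, exch)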

Local Variance

The exchangeable pairs formula shows that variance is a measure of the global variability of the function: It measures the amount f varies across locations x and y sampled randomly from the entire set of possible states \{1,\ldots,m\}.

The local variance measures how much f varies between points x and y which are separated by just one step of the Markov chain, thus providing a more local measure of variability. Let x_0 \sim \pi be sampled from the stationary distribution, and let x_1 denote one step of the Markov chain after x_0. The local variance is

    \[\mathcal{E}(f) = \frac{1}{2} \expect [(f(x_0) - f(x_1))^2].\]

Other names for the local variance include the Dirichlet form and the quadratic form of the Markov chain.

An important note: The variance of a function f depends only on the stationary distribution \pi. By contrast, the local variance depends on the Markov transition matrix P.
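
Since the pair (x_0, x_1) with x_0 \sim \pi takes the value (i,j) with probability \pi_i P_{ij}, the local variance can be computed as a double sum over pairs of states. Here is a short Python sketch, again using an illustrative reversible chain built from symmetric weights.

    import numpy as np

    # Small reversible chain built from symmetric weights (illustrative choice).
    rng = np.random.default_rng(0)
    W = rng.random((5, 5)); W = W + W.T
    d = W.sum(axis=1)
    P = W / d[:, None]
    pi = d / d.sum()
    f = np.array([1.0, -2.0, 0.5, 3.0, 0.0])     # illustrative function

    def local_variance(f, P, pi):
        # E(f) = 0.5 * sum_{i,j} pi_i P_ij (f(i) - f(j))^2
        diff_sq = (f[:, None] - f[None, :]) ** 2
        return 0.5 * np.sum(pi[:, None] * P * diff_sq)

    print(local_variance(f, P, pi))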

Poincaré Inequalities

If f does not vary much over a single step of the Markov chain, then it seems reasonable to expect that it doesn’t vary much globally. This intuition is made quantitative using Poincaré inequalities.

Definition (Poincaré inequality). A Markov chain is said to satisfy a Poincaré inequality with constant \alpha if

(1)   \[\Var_\pi(f)\le \alpha \cdot \mathcal{E}(f) \quad \text{for every function } f.\]

Poincaré Inequalities and Mixing

Poincaré inequalities are intimately related with the speed of mixing for a Markov chain.

To see why, consider a function f with small local variance. Because f has small local variance, f(x_0) is close to f(x_1), f(x_1) is close to f(x_2), etc.; the function f does not change much over a single step of the Markov chain. Does this mean that the (global) variance of f will also be small? Not necessarily. If the Markov chain takes a long time to mix, the small local variance can accumulate to a large global variance over many steps of the Markov chain. Thus, a slowly mixing chain has a large Poincaré constant \alpha. Conversely, if the chain mixes rapidly, the Poincaré constant \alpha is small.

This relation between mixing and Poincaré inequalities is quantified by the following theorem:

Theorem (Poincaré inequalities from eigenvalues). The Markov chain satisfies a Poincaré inequality with constant

    \[\alpha= \frac{1}{1-\lambda_2}.\]

This is the smallest possible Poincaré constant for the Markov chain.

One way to interpret this result is that the eigenvalue \lambda_2 gives you a Poincaré inequality (1). But we can flip this result around: Poincaré inequalities (1) establish bounds on the eigenvalue \lambda_2.

Corollary (Eigenvalue bounds from Poincaré inequalities). If the Markov chain satisfies a Poincaré inequality (1) for a certain constant \alpha, then

    \[\lambda_2 \le 1 - \frac{1}{\alpha}.\]
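
Here is a Python sketch checking both statements on a small illustrative chain (built from symmetric weights): the Poincaré inequality with \alpha = 1/(1-\lambda_2) holds for randomly chosen functions, and rearranging the best constant recovers \lambda_2 = 1 - 1/\alpha.

    import numpy as np

    # Small reversible chain built from symmetric weights (illustrative choice).
    rng = np.random.default_rng(0)
    W = rng.random((5, 5)); W = W + W.T
    d = W.sum(axis=1)
    P = W / d[:, None]
    pi = d / d.sum()

    def variance(f):
        return np.sum((f - np.sum(f * pi)) ** 2 * pi)

    def local_variance(f):
        return 0.5 * np.sum(pi[:, None] * P * (f[:, None] - f[None, :]) ** 2)

    # lambda_2 via the symmetric matrix D^{1/2} P D^{-1/2}
    D_half = np.diag(np.sqrt(pi)); D_half_inv = np.diag(1 / np.sqrt(pi))
    lam = np.sort(np.linalg.eigvalsh(D_half @ P @ D_half_inv))[::-1]
    alpha = 1 / (1 - lam[1])

    # Theorem: Var_pi(f) <= alpha * E(f) for every f
    for _ in range(1000):
        f = rng.standard_normal(5)
        assert variance(f) <= alpha * local_variance(f) + 1e-12

    # Corollary, with equality since alpha is the best constant:
    print(lam[1], 1 - 1 / alpha)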

A View to the Continuous Setting

For a particularly vivid example of a Poincaré inequality, it will be helpful to take a brief detour to the world of continuous Markov processes. This series has—to this point—exclusively focused on Markov chains x_0,x_1,x_2,\ldots that have finitely many possible states and are indexed by discrete times 0,1,2,\ldots. We can generalize Markov chains by lifting both of these restrictions, considering Markov processes (x_t)_{t\ge 0} which take values in a continuous space (such as the real line \real) and are indexed by continuous times t\ge 0.

The mathematical details for Markov processes are a lot more complicated than for their Markov chain siblings, so we will keep it light on details.

For this example, our Markov process will be the Ornstein–Uhlenbeck process. This process has the somewhat mysterious form

    \[x_t = e^{-t}x_0 + e^{-t} B_{e^{2t}-1},\]

where (B_s)_{s\ge 0} denotes a (standard) Brownian motion, independent of the starting state x_0. At time s, the Brownian motion B_s has a Gaussian distribution with variance s. Thus,

Conditional on its starting value x_0, the Ornstein–Uhlenbeck process x_t has a Gaussian distribution with mean e^{-t}x_0 and variance 1-e^{-2t}.

From this observation, it appears that the stationary distribution of the Ornstein–Uhlenbeck process is the standard Gaussian distribution. Indeed, this is the case, and the Ornstein–Uhlenbeck process converges to stationarity exponentially fast.
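
A quick Monte Carlo sketch in Python makes this concrete: starting the process from x_0 \sim N(0,1) and applying the formula above, the distribution at time t is again (approximately) standard Gaussian, and the conditional variance matches 1 - e^{-2t}. The time t = 0.7 and the sample size are arbitrary illustrative choices.

    import numpy as np

    # Monte Carlo sketch of the Ornstein-Uhlenbeck update
    # x_t = e^{-t} x_0 + e^{-t} B_{e^{2t} - 1}.
    rng = np.random.default_rng(0)
    t = 0.7
    x0 = rng.standard_normal(10**6)                              # x_0 ~ N(0, 1)
    B = rng.normal(scale=np.sqrt(np.exp(2*t) - 1), size=10**6)   # B_{e^{2t}-1}
    xt = np.exp(-t) * (x0 + B)

    print(xt.mean(), xt.var())    # approximately 0 and 1: N(0,1) is stationary
    # Conditional variance: e^{-2t} (e^{2t} - 1) = 1 - e^{-2t}
    print(np.exp(-2*t) * (np.exp(2*t) - 1), 1 - np.exp(-2*t))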

Since we have exponential convergence to stationarity (and, as can be checked, the Ornstein–Uhlenbeck process is reversible, in the appropriate sense), there’s a Poincaré inequality lurking in the background, known as the Gaussian Poincaré inequality. Letting Z denote a standard Gaussian random variable, the Gaussian Poincaré inequality states that

(2)   \[\Var(f(Z)) \le \expect \big[(f'(Z))^2\big].\]

The right-hand side of this inequality is the local variance of the Ornstein–Uhlenbeck process, equal to the expected squared derivative:

    \[\mathcal{E}(f) = \expect \big[(f'(Z))^2\big].\]

The Gaussian Poincaré inequality presents a very clear demonstration of what a Poincaré inequality is: The global variance of the function f(Z) is controlled by its local variability, here quantified by the expected squared derivative \mathcal{E}(f).

For general Markov chains or processes, it can remain helpful to think of the local variance \mathcal{E}(f) as a generalization of the “expected squared derivative” of the function f.

Our main interest in Poincaré inequalities in this post is instrumental: we seek to use Poincaré inequalities to understand the mixing properties of Markov chains. But the Gaussian Poincaré inequality demonstrates that Poincaré inequalities are also interesting on their own terms. The inequality (2) is a useful inequality for bounding the variance of a function of a Gaussian random variable. As an immediate example, observe that the function f(x) = \tanh(x) has derivative bounded by 1: |f'(x)| \le 1. Thus,

    \[\Var(\tanh Z) \le \expect[(f'(Z))^2] \le 1.\]

This inequality is not too difficult to prove directly (use the fact that \tanh is 1-Lipschitz continuous together with the exchangeable pairs formula for variance: \Var(\tanh Z) = 0.5\expect[(\tanh Z - \tanh Z')^2] \le 0.5 \expect[(Z-Z')^2] = \Var(Z) = 1, where Z' is an independent copy of Z), but the point stands that the Gaussian Poincaré inequality—and Poincaré inequalities in general—can be useful on their own terms. The Gaussian Poincaré inequality’s multidimensional generalization has especially interesting consequences; see the bottom of this post for an example.
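
Here is a Monte Carlo sketch of this example in Python; the sample size is an arbitrary choice. Since f(x) = \tanh(x) has f'(x) = \operatorname{sech}^2(x), the right-hand side of (2) is \expect[\operatorname{sech}^4 Z].

    import numpy as np

    # Monte Carlo check of the Gaussian Poincare inequality (2) for f = tanh.
    rng = np.random.default_rng(0)
    Z = rng.standard_normal(10**6)

    lhs = np.tanh(Z).var()                 # Var(tanh Z)
    rhs = np.mean(1 / np.cosh(Z) ** 4)     # E[(f'(Z))^2] = E[sech^4 Z]

    print(lhs, rhs)     # lhs is visibly smaller than rhs, and both are at most 1
    assert lhs <= rhs <= 1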

Poincaré Inequalities and Eigenvalues

For the remainder of this post, we will develop the connection between Poincaré inequalities and eigenvalues, leading to a proof of our main theorem:

Theorem (Poincaré inequalities from eigenvalues). The Markov chain satisfies a Poincaré inequality with constant

    \[\alpha= \frac{1}{1-\lambda_2}.\]

That is,

(3)   \[\Var_\pi(f)\le \frac{1}{1-\lambda_2}\cdot\mathcal{E}(f) \quad \text{for all } f\in\real^m.\]

There exists a function f for which equality is attained.

We begin by showing that it suffices to consider mean-zero functions \expect[f] = 0 to prove (3). Next, we derive formulas for \Var_\pi(f) and \mathcal{E}(f) using the \pi-inner product \langle\cdot,\cdot\rangle. We conclude by expanding f in eigenvectors of P and deriving the Poincaré inequality (3).

Shift to Mean-Zero

To prove the Poincaré inequality (3), we are free to assume that f has mean zero, \expect_\pi[f]=0. Indeed, neither the variance \Var_\pi(f) nor the local variance \mathcal{E}(f) changes if we shift f by a constant c. That is, letting \mathbb{1} denote the function

    \[\mathbb{1}(i) = 1 \quad\text{for }i =1,\ldots,m,\]

then

    \[\Var_\pi(f+c\mathbb{1})=\Var_\pi(f)\quad\text{and}\quad\mathcal{E}(f+c\mathbb{1})=\mathcal{E}(f)\]

for every function f and constant c. Therefore, for proving our Poincaré inequality, we can always shift f so that it is mean-zero:

    \[\expect_\pi[f] = 0.\]
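
This shift invariance is easy to check numerically; the chain, function, and constant below are arbitrary illustrative choices.

    import numpy as np

    # Check that Var_pi(f) and E(f) are unchanged when a constant is added to f.
    rng = np.random.default_rng(0)
    W = rng.random((5, 5)); W = W + W.T    # illustrative reversible chain
    d = W.sum(axis=1)
    P = W / d[:, None]
    pi = d / d.sum()

    def variance(f):
        return np.sum((f - np.sum(f * pi)) ** 2 * pi)

    def local_variance(f):
        return 0.5 * np.sum(pi[:, None] * P * (f[:, None] - f[None, :]) ** 2)

    f = rng.standard_normal(5)
    c = 3.7
    assert np.isclose(variance(f + c), variance(f))
    assert np.isclose(local_variance(f + c), local_variance(f))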

Variance

Our strategy for proving the main theorem will be to develop a more linear algebraic formula for the variance and local variance. Let’s begin with the variance.

Assume \expect_\pi[f] = 0. Then the variance is

    \[\Var_\pi(f)=\expect_\pi[f^2]=\sum_{i=1}^m f(i)f(i)\pi_i.\]

Using the definition of the \pi-inner product, we have shown that

    \[\Var_\pi(f)=\langle f,f\rangle.\]

Local Variance

Now we derive a formula for the local variance:

    \[\mathcal{E}(f) = \frac{1}{2} \expect[(f(x_0)-f(x_1))^2]\quad \text{where }x_0\sim\pi.\]

The probability that x_0=i and x_1=j is \pi_iP_{ij}. Thus,

    \[\mathcal{E}(f) = \frac{1}{2} \sum_{i,j=1}^m (f(i)-f(j))^2 \pi_iP_{ij}.\]

Expanding the parentheses and regrouping strategically, we obtain

    \[\mathcal{E}(f) = {\rm A} + {\rm B} + {\rm C}\]

where

    \begin{align*}{\rm A} &= \frac{1}{2} \sum_{i=1}^m (f(i))^2 \pi_i \left( \sum_{j=1}^m P_{ij}\right),\\{\rm B} &= \frac{1}{2} \sum_{j=1}^m (f(j))^2 \left( \sum_{i=1}^m \pi_i P_{ij} \right), \\{\rm C} &= -\sum_{i=1}^m f(i)\left( \sum_{j=1}^m P_{ij} f(j) \right) \pi_i.\end{align*}

Let’s take each of these terms one-by-one. For {\rm A}, recognize that \sum_{j=1}^m P_{ij} = 1. Thus,

    \[{\rm A} = \frac{1}{2}\sum_{i=1}^m (f(i))^2 \pi_i = \frac{1}{2}\langle f, f\rangle.\]

For \rm B, use detailed balance \pi_i P_{ij} = \pi_j P_{ji}. Then, using the condition \sum_{i=1}^m P_{ji} = 1, we obtain

    \[{\rm B} = \frac{1}{2} \sum_{j=1}^m (f(j))^2 \left(\sum_{i=1}^m \pi_j P_{ji} \right) = \frac{1}{2} \sum_{j=1}^m (f(j))^2 \pi_j = \frac{1}{2} \langle f, f\rangle.\]

Finally, for {\rm C}, recognize that \sum_{j=1}^m P_{ij} f(j) is the ith entry of the matrix–vector product Pf. Thus,

    \[{\rm C} = - \sum_{i=1}^m f(i)\, (Pf)(i) \,\pi_i = -\langle f, Pf\rangle.\]

Thus, we conclude

    \[\mathcal{E}(f) = \langle f, (I-P)f\rangle,\]

where I denotes the identity matrix.
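
As a sanity check, here is a Python sketch confirming that this inner-product formula matches the definition of the local variance on a small illustrative reversible chain (built from symmetric weights).

    import numpy as np

    # Check E(f) = <f, (I - P) f>_pi against the definition
    # E(f) = 0.5 * sum_{i,j} pi_i P_ij (f(i) - f(j))^2.
    rng = np.random.default_rng(0)
    W = rng.random((5, 5)); W = W + W.T    # illustrative reversible chain
    d = W.sum(axis=1)
    P = W / d[:, None]
    pi = d / d.sum()
    f = rng.standard_normal(5)

    direct = 0.5 * np.sum(pi[:, None] * P * (f[:, None] - f[None, :]) ** 2)
    inner = np.sum(f * ((np.eye(5) - P) @ f) * pi)   # <f, (I-P)f>_pi

    assert np.isclose(direct, inner)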

Conclusion

The Poincaré inequality

    \[\Var_\pi(f) \le \frac{1}{1-\lambda_2} \cdot\mathcal{E}(f) \quad \text{for all $f$ with $\expect_\pi[f] = 0$}.\]

is equivalent to showing

    \[\frac{\mathcal{E}(f)}{\Var_\pi(f)}\ge 1-\lambda_2 \quad \text{for all $f$ with $\expect_\pi[f] = 0$}.\]

Using our newly derived formulas, this in turn is equivalent to showing

    \[\frac{\langle f, (I-P)f\rangle}{\langle f, f\rangle} \ge 1-\lambda_2 \quad \text{for all $f$ with $\expect_\pi[f] = 0$}.\]

We shall prove this version of the Poincaré inequality by expanding f as a linear combination of eigenvectors.

Consider a decomposition of f as a linear combination of P's eigenvectors:

    \[f = c_1 \varphi_1 + c_2 \varphi_2 + \cdots + c_m\varphi_m.\]

As we showed in the last post, the condition \expect_\pi[f] = 0 is equivalent to saying that c_1 = 0.

Using the orthonormality of \varphi_1,\ldots,\varphi_m under the \pi-inner product and the eigenvalue relation P\varphi_i = \lambda_i \, \varphi_i, we have that

    \begin{align*}\langle f, f\rangle &= c_2^2 + \cdots + c_m^2, \\\langle f, (I-P)f\rangle &= (1-\lambda_2)c_2^2 + \cdots + (1-\lambda_m)c_m^2.\end{align*}



Thus,

(4)   \begin{align*}\frac{\langle f, (I-P)f\rangle}{\langle f, f\rangle} &= \frac{(1-\lambda_2)c_2^2 + \cdots + (1-\lambda_m)c_m^2}{c_2^2 + \cdots + c_m^2} \\&= (1-\lambda_2)a_2 + \cdots + (1-\lambda_m)a_m,\end{align*}

where

    \[a_i = \frac{c_i^2}{c_2^2 + \cdots + c_m^2}.\]

The coefficients a_i are nonnegative and add to 1:

    \[a_2+\cdots+a_m = \frac{c_2^2+\cdots+c_m^2}{c_2^2+\cdots+c_m^2} = 1.\]

Therefore, the smallest possible value for (4) is achieved by setting a_2 = 1 and a_3 = \cdots = a_m = 0 (equivalently, setting c_3 = \cdots=c_m = 0). Thus, we conclude

    \[\frac{\langle f, (I-P)f\rangle}{\langle f, f\rangle}\ge 1-\lambda_2,\]

with equality when f is a multiple of \varphi_2.
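
Finally, here is a Python sketch of this last step on a small illustrative reversible chain: the Rayleigh quotient \langle f, (I-P)f\rangle / \langle f,f\rangle is at least 1-\lambda_2 for mean-zero f, with equality at f = \varphi_2.

    import numpy as np

    # Small reversible chain built from symmetric weights (illustrative choice).
    rng = np.random.default_rng(0)
    W = rng.random((5, 5)); W = W + W.T
    d = W.sum(axis=1)
    P = W / d[:, None]
    pi = d / d.sum()

    # pi-orthonormal eigenfunctions via the symmetric matrix D^{1/2} P D^{-1/2}
    D_half = np.diag(np.sqrt(pi)); D_half_inv = np.diag(1 / np.sqrt(pi))
    lam, V = np.linalg.eigh(D_half @ P @ D_half_inv)
    lam, V = lam[::-1], V[:, ::-1]
    Phi = D_half_inv @ V

    def rayleigh(f):
        return np.sum(f * ((np.eye(5) - P) @ f) * pi) / np.sum(f * f * pi)

    assert np.isclose(rayleigh(Phi[:, 1]), 1 - lam[1])   # equality at phi_2
    for _ in range(1000):
        f = rng.standard_normal(5)
        f -= np.sum(f * pi)                # shift to mean zero
        assert rayleigh(f) >= 1 - lam[1] - 1e-10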
