Markov Musings 3: Spectral Theory

This post is part of a series, Markov Musings, about the mathematical analysis of Markov chains. See here for the first post in the series.

In the previous two posts, we proved the fundamental theorem of Markov chains using couplings and discussed the use of couplings to bound the mixing time, the time required for the chain to mix to near-stationarity.

In this post, we will continue this discussion by discussing spectral methods for understanding the convergence of Markov chains.

Mathematical Setting

In this post, we will study a reversible Markov chain on $\{1,\ldots,m\}$ with transition probability matrix $P \in \real^{m\times m}$ and stationary distribution $\pi$ . Our goal will be to use the eigenvalues and eigenvectors of the matrix $P$ to understand the properties of the Markov chain.

Throughout our discussion, it will be helpful to treat vectors $f\in\real^m$ and functions $f:\{1,\ldots,m\}\to \real$ as being one and the same, and we will use both $f_i$ and $f(i)$ to denote the $i$ th entry of the vector $f$ (aka the evaluation of the function $f$ at $i$ ). By adopting this perspective, a vector is not merely a list of $m$ numbers, but instead labels each state $i=1,2,\ldots,m$ with a numeric value.

Given a function/vector $f$ , we let $\expect[f]$ denote the expected value of $f(X)$ where $X$ is drawn from $\pi$ :

$\expect[f] \coloneqq \expect_{X \sim \pi} [f(X)] = \sum_{i=1}^m \pi_i f(i).$

The variance is defined similarly:

$\Var(f) \coloneqq \expect [(f - \expect[f])^2] = \sum_{i=1}^m (f(i) - \expect[f])^2 \pi_i.$

As a final piece of notation, we let $\delta_i$ denote a probability distribution which assigns 100% probability to outcome $i$ .

Spectral Theory

The eigenvalues and eigenvectors of general square matrices are ill-behaved creatures. Indeed, a general $m\times m$ matrix with real entries can have complex-valued eigenvalues and fail to possess a full suite of $m$ linearly independent eigenvectors. The situation is dramatically better for symmetric matrices which obey the spectral theorem:

Theorem (spectral theorem for real symmetric matrices). An $m\times m$ real symmetric matrix $A$ (i.e., one satisfying $A^\top=A$ ) has $m$ real eigenvalues $\lambda_1,\ldots,\lambda_m$ and $m$ orthonormal eigenvectors $u_1,\ldots,u_m$ .

Unfortunately for us, the transition matrix $P$ for a reversible Markov chain is not always symmetric. But despite this, there’s a surprise: $P$ always has real eigenvalues. This leads us to ask:

Why does are the eigenvalues of the transition matrix $P$ real?

To answer this question, we will need to develop and a more general version of the spectral theorem and use our standing assumption that the Markov chain is reversible.

The Transpose

In our quest to develop a general version of the spectral theorem, we look more deeply into the hypothesis of the theorem, namely that $A$ is equal to its transpose $A^\top$ . Let’s first ask: What does the transpose $A^\top$ mean?

Recall that $\real^m$ is equipped with the standard inner product, sometimes called the Euclidean or dot product. We denote this product by $(\cdot,\cdot)$ and it is defined as

$(f,g) \coloneqq f^\top g = \sum_{i=1}^m f(i) g(i).$

We can think of the standard inner product as defining the amount of the vector $f$ that points in the direction of $g$ .

The transpose of a matrix $A$ is closely related to the standard inner product. Specifically, $A^\top$ is the (unique) matrix satisfying the adjoint property:

$(f,Ag) = (A^\top f, g)\quad \text{for every }f,g\in\real^m.$

So the amount of $f$ in the direction of $Ag$ is the same as the amount of $A^\top f$ in the direction of $g$ .

The Adjoint and the General Spectral Theorem

Since the transpose is intimately related to the standard inner product on $\real^m$ , it is natural to wonder if non-standard inner products on $\real^m$ give rise to non-standard versions of the transpose. This idea proves to be true.

Definition (adjoint). Let $[\cdot,\cdot]$ be any inner product on $\real^m$ and let $A$ be an $m\times m$ matrix. The adjoint of $A$ with respect to the inner product $[\cdot,\cdot]$ is the (unique) matrix $A^*$ such that
$[f,Ag]=[A^*f,g]\quad\text{for every }f,g\in\real^m.$
A matrix $A$ is self-adjoint if it equals its own adjoint, $A=A^*$ .

The spectral theorem naturally extends to the more abstract setting of adjoints with respect to non-standard inner products:

Theorem (general spectral theorem). Let $A$ be an $m\times m$ matrix which is set-adjoint with respect to an inner product $[\cdot,\cdot]$ . Then $A$ has $m$ real eigenvalues $\lambda_1,\ldots,\lambda_m$ and $m$ eigenvectors $u_1,\ldots,u_m$ which are $[\cdot,\cdot]$ –orthonormal:
$[u_i,u_j]=\begin{cases}1, & i=j, \\0,& i\ne j.\end{cases}$

Reversibility and Self-Adjointness

The general spectral theorem offers us a path to understand our observation from above that the eigenvalues of $P$ are always real. Namely, we ask:

Is there an inner product under which $P$ is self-adjoint?

Fortunately, the answer is yes. Define the $\pi$ -inner product

$\langle f, g \rangle \coloneqq \expect[f\cdot g] = \sum_{i=1}^m f(i)g(i) \pi_i.$

This inner product is very natural. If $f$ and $g$ are mean-zero ( $\expect[f] = \expect[g] = 0$ ), it reports the covariance of $f(x)$ and $g(x)$ where $x \sim \pi$ is drawn from $\pi$ .

Let us compute the adjoint of the transition matrix $P$ in the $\pi$ -inner product $\langle \cdot,\cdot \rangle$ :

$\begin{align*}\langle f, Pg \rangle&= \sum_{i=1}^m f(i) \left(\sum_{j=1}^m P_{ij} g(i) \right)\pi_i= \sum_{i,j=1}^m f(i) g(j) \pi_iP_{ij} .\end{align*}$

Recall that we have assumed our Markov chain is reversible. Thus, the detailed balance condition

$\pi_i P_{ij} = \pi_j P_{ji} \quad \text{for } i,j=1,2,\ldots,m$

implies that

$\langle f, Pg \rangle= \sum_{i,j=1}^m f(i) g(j) \pi_jP_{ji} = \sum_{j=1}^m \left( \sum_{i=1}^m f(i) P_{ij} \right) g(j) \pi_j = \langle Pf, g\rangle.$

Thus, we conclude that $P$ is self-adjoint.

We cannot emphasize enough that the reversibility assumption is crucial to ensure that $P$ is self-adjoint. In fact,

Theorem (reversibility and self-adjointness). The transition matrix of a general Markov chain with stationary distribution $\pi$ is self-adjoint in the $\pi$ -inner product if and only if the chain is reversible.

Spectral Theorem for Markov Chains

Since $P$ is self-adjoint in the $\pi$ -inner product, the general spectral theorem implies the following

Corollary (spectral theorem for reversible Markov chain). The reversible Markov transition matrix $P$ has $m$ real eigenvalues $\lambda_1 \ge \cdots \ge \lambda_m$ associated with eigenvectors $\varphi_1,\ldots,\varphi_m$ which are orthonormal in the $\pi$ -inner product:
$\langle\varphi_i,\varphi_j\rangle=\begin{cases}1, &i=j\\0,& i\ne j.\end{cases}$

Since $P$ is a reversible Markov transition matrix—not just any self-adjoint matrix—the eigenvalues of $P$ satisfy additional properties:

Bounded. All of the eigenvalues of $P$ lie between $-1$ and $1$ .¹
Eigenvalue 1. $\lambda_1 = 1$ is an eigenvalue of $P$ with eigenvector $\varphi_1 = \mathbf{1}$ , where $\mathbf{1}$ is a vector of all one’s.

For a primitive chain, we can also have the property:

Contractive. For a primitive chain, all eigenvalues other than $\lambda_1$ have magnitude $< 1$ .²

Distance Between Probability Distributions Redux

In the previous post, we used the total variation distance to compare probability distributions:

Definition (total variation distance). The total variation distance between probability distributions $\sigma$ and $\pi$ is
$\norm{\sigma - \pi}_{\rm TV} = \max_{A \subseteq \{1,\ldots,m\}} |\sigma(A) - \pi(A)| = \frac{1}{2} \sum_{i=1}^m |\sigma_i - \pi_i|.$

The total variation distance is an “ $\ell_1$ “ way of comparing two probability distributions since it can be computed by adding the absolute difference between the probabilities $\sigma_i$ and $\pi_i$ of each outcome. Spectral theory plays more nicely with an “ $\ell_2$ ” way of comparing probability distributions, which we develop now.

Densities

Given two probability densities $\sigma$ and $\pi$ , the density of $\sigma$ with respect to $\pi$ is the function³ $h = d\sigma/d\pi$ given by

$h(i) = \frac{d\sigma}{d\pi}(i) = \frac{\sigma_i}{\pi_i} \quad \text{for } i=1,2,\ldots,m.$

The density $h = d\sigma/d\pi$ satisfies the property that

(1) $\expect_{X \sim \sigma} [g(X)] = \expect_{Y \sim \pi}[g(Y)h(Y)] = \sum_{i=1}^m g(i)h(i) \pi_i.$

To see why we call $h = d\sigma/d\pi$ a density, it may be helpful to appeal to continuous probability for a moment. If $0\le X \le 1$ is a random variable with distribution $\sigma$ , the probability density function of $\sigma$ (with respect to the uniform distribution $\mathrm{Unif}[0,1]$ ) is a function $h$ such that

$\expect_{X \sim \sigma} [g(X)] = \expect_{Y\sim \mathrm{Unif}[0,1]} [g(Y)h(Y)] = \int_0^1 g(x) h(x) \, dx.$

The formula for the continuous case is exactly the same as the finite case (1) with sums $\sum_{i=1}^m (\cdots) \pi_i$ replaced with integrals $\int_0^1 (\cdots) \, dx$ .

The Chi-Squared Divergence

We are now ready to introduce our “ $\ell_2$ ” way of comparing probability distributions.

Definition ( $\chi^2$ -divergence). The $\chi^2$ -divergence of $\sigma$ with respect to $\pi$ is the variance of the density function:
$\chi^2(\sigma \mid\mid \pi) \coloneqq \Var \left(\frac{d\sigma}{d\pi} \right) = \expect \left[\left( \frac{d\sigma}{d\pi} - 1 \right)^2\right] = \expect \left[\left(\frac{d\sigma}{d\pi}\right)^2\right] - 1.$

To see the last equality is a valid formula for $\chi^2(\sigma\mid\mid\pi) \coloneqq \Var(d\sigma/d\pi)$ , note that

(2) $\expect\left[\frac{d\sigma}{d\pi}\right] = \sum_{i=1}^m \frac{d\sigma}{d\pi}(i) \pi_i = \sum_{i=1}^m \frac{\sigma_i}{\pi_i} \pi_i = \sum_{i=1}^m \sigma_i = 1.$

Relations Between Distance Measures

The $\chi^2$ divergence always gives an upper bound on the total variation distance. Indeed, first note that we can express the total variation distance as

$\norm{\sigma - \pi}_{\rm TV} = \frac{1}{2} \expect \left[ \left| \frac{d\sigma}{d\pi} - 1 \right| \right].$

Thus, using Lyapunov’s inequality, we conclude

(3) $\norm{\sigma - \pi}_{\rm TV} = \frac{1}{2} \expect \left[ \left| \frac{d\sigma}{d\pi} - 1 \right| \right] \le \frac{1}{2} \left(\expect \left[ \left( \frac{d\sigma}{d\pi} - 1 \right)^2 \right]\right)^{1/2} = \frac{1}{2} \sqrt{\chi^2(\sigma \mid\mid \pi)}.$

Markov Chain Convergence by Eigenvalues

Now, we prove a quantitative version of the fundamental theorem of Markov chains (for reversible processes) using spectral theory and eigenvalues:⁴

Theorem (Markov chain convergence by eigenvalues). Let $\rho^{(0)},\rho^{(1)},\rho^{(2)},\ldots$ denote the distributions of the Markov chain at times $0,1,2,\ldots$ . Then
$\chi^2\left(\rho^{(n)} \, \middle|\middle| \, \pi\right) \le \left( \max \{ \lambda_2, -\lambda_n \} \right)^{2n} \chi^2\left(\rho^{(0)} \, \middle|\middle| \, \pi\right).$

This raises a natural question: How large can the initial $\chi^2$ divergence between $\rho^{(0)}$ and $\pi$ be? This is answered by the next result.

Proposition (Initial distance to stationarity). For any $i \in \{1,\ldots,m\}$ ,
(4) $\chi^2(\delta_i \mid\mid \pi) \le \frac{1}{\pi_i}.$
For any initial distribution $\rho^{(0)}$ , we have
(5) $\chi^2\left( \rho^{(0)} \, \middle|\middle| \, \pi \right) \le \frac{1}{\min_{1\le i\le m} \pi_i}$

The condition (4) is a direct computation

$\chi^2(\delta_i \mid\mid \pi) = \Var(d\delta_i/d\pi) \le \expect[(d\delta_i/d\pi)^2] = 1/\pi_i.$

For (5), observe that $\chi^2(\rho^{(0)} \mid\mid \pi)$ is a convex function of the initial distribution $\rho^{(0)}$ . Therefore, its maximal value over the convex set of all probability distribution is maximized at an extreme point $\rho^{(0)} = \delta_i$ for some $i$ . Consequently, (5) follows directly from (4).

Using the previous two results and equation (3), we immediately obtain the following:

Corollary (Mixing time). When initialized from a distribution $\rho^{(0)}$ , the chain mixes to $\varepsilon$ -total variation distance to stationarity
$\norm{\rho^{(n)} - \pi}_{\rm TV} \le \varepsilon$
after
(6) $n = \left\lceil \frac{1}{\log(\max\{\lambda_2,-\lambda_n\})}\log \left( \frac{1}{2\varepsilon \min_{1\le i \le m}\sqrt{\pi_i}} \right) \right\rceil \text{ steps}.$

Indeed,

$\begin{align*}\norm{\rho^{(n)} - \pi}_{\rm TV} &\le \frac{1}{2} \sqrt{\chi^2(\rho^{(n)} \mid\mid \pi)} \\&\le \frac{1}{2} (\max\{\lambda_2,-\lambda_n\})^n \sqrt{\chi^2(\rho^{(0)} \mid\mid \pi)} \\&\le \frac{(\max\{\lambda_2,-\lambda_n\})^n}{2\min_{1\le i \le m} \sqrt{\pi_i}}.\end{align*}$

Equating the left-hand side with $\varepsilon$ and rearranging gives (6).

Proof of Markov Chain Convergence Theorem

We conclude with a proof of the Markov chain convergence result.

Recurrence for Densities

Our first task is to derive a recurrence for the densities $d\rho^{(n)}/d\pi$ . To do this, we use the recurrence for the probability distribution:

$\rho^{(n+1)}_j = \sum_{i=1}^m \rho_i^{(n)} P_{ij}.$

Thus,

$\left( \frac{d\rho^{(n+1)}}{d\pi}\right)_j = \frac{\rho_j^{(n+1)}}{\pi_j} = \frac{\sum_{i=1}^m \rho^{(n)}_i P_{ij}}{\pi_j}= \sum_{i=1}^m \rho^{(n)}_i\frac{P_{ij}}{\pi_j}.$

We now invoke the detailed balance $P_{ij}/\pi_j = P_{ji} / \pi_i$ , which implies

$\left( \frac{d\rho^{(n+1)}}{d\pi}\right)_j = \sum_{i=1}^m\rho^{(n)}_i \frac{P_{ji}}{\pi_i} = \sum_{i=1}^m P_{ji} \frac{\rho^{(n)}_i}{\pi_i} = \left( P \frac{d \rho^{(n)}}{d\pi} \right)_j.$

The product $P(d\rho^{(n)}/d\pi)$ is an ordinary matrix–vector product. Thus, we have shown

(7) $\frac{d\rho^{(n+1)}}{d\pi} = P \frac{d\rho^{(n)}}{d\pi} \quad \text{for } n =0,1,\ldots.$

Spectral Decomposition

Now that we have a recurrence for the densities $d\rho^{(n)}/d\pi$ , we can understand how the densities change in time. Expand the initial density in eigenvectors of $P$ :

$\frac{d\rho^{(0)}}{d\pi} = c_1 \mathbf{1} + c_2 \varphi_2 + \cdots + c_m \varphi_m.$

As we verified above in (1), we have $\expect[d\rho^{(0)}/d\pi] = 1$ . Thus, we conclude $c_1 = 1$ . Since the $\varphi_j$ are eigenvectors of $P$ with eigenvalues $\lambda_j$ , the recurrence (7) implies

$\frac{d\rho^{(n)}}{d\pi} = \mathbf{1} + c_2 \lambda_2^n \varphi_2 + \cdots + c_m \lambda_m^n \varphi_m.$

Conclusion

Finally, we compute

$\chi^2(\rho^{(n)} \mid\mid \pi) = \sum_{i=2}^m c_i^2 \lambda_i^{2n} \le (\max \{ \lambda_2, -\lambda_m \})^{2n} \sum_{i=2}^m c_i^2 = (\max \{ \lambda_2, -\lambda_m \})^{2n} \chi^2(\rho^{(0)} \mid\mid \pi).$

The theorem is proven.