# Markov Musings 3: Spectral Theory

This post is part of a series, Markov Musings, about the mathematical analysis of Markov chains. See here for the first post in the series.

In the previous two posts, we proved the fundamental theorem of Markov chains using couplings and discussed the use of couplings to bound the mixing time, the time required for the chain to mix to near-stationarity.

In this post, we will continue this discussion by introducing spectral methods for understanding the convergence of Markov chains.

## Mathematical Setting

In this post, we will study a reversible Markov chain on $\{1,\ldots,n\}$ with transition probability matrix $P$ and stationary distribution $\pi$. Our goal will be to use the eigenvalues and eigenvectors of the matrix $P$ to understand the properties of the Markov chain.

Throughout our discussion, it will be helpful to treat vectors and functions as being one and the same, and we will use both $f_i$ and $f(i)$ to denote the $i$th entry of the vector $f$ (aka the evaluation of the function $f$ at $i$). By adopting this perspective, a vector $f$ is not merely a list of numbers, but instead labels each state $i$ with a numeric value $f(i)$.

Given a function/vector $f$, we let $\mathbb{E}_\pi[f]$ denote the expected value of $f(x)$ where $x$ is drawn from $\pi$:

$$\mathbb{E}_\pi[f] = \sum_{i=1}^n \pi_i f(i).$$

The variance is defined similarly:

$$\operatorname{Var}_\pi[f] = \mathbb{E}_\pi\left[\left(f - \mathbb{E}_\pi[f]\right)^2\right] = \sum_{i=1}^n \pi_i \left(f(i) - \mathbb{E}_\pi[f]\right)^2.$$

As a final piece of notation, we let $\delta_i$ denote a probability distribution which assigns 100% probability to outcome $i$.
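These definitions translate directly into code. Below is a minimal sketch (a hypothetical 3-state example, with NumPy) of computing $\mathbb{E}_\pi[f]$ and $\operatorname{Var}_\pi[f]$ by treating a vector `f` as a function on the states:

```python
import numpy as np

# Hypothetical stationary distribution on the states {0, 1, 2}
pi = np.array([0.5, 0.3, 0.2])

# A "function" on states is just a vector: f(i) = f[i]
f = np.array([1.0, 4.0, 9.0])

# E_pi[f] = sum_i pi_i f(i)
mean = np.dot(pi, f)

# Var_pi[f] = E_pi[(f - E_pi[f])^2]
var = np.dot(pi, (f - mean) ** 2)

print(mean, var)  # mean = 0.5*1 + 0.3*4 + 0.2*9 = 3.5
```

The distribution $\delta_i$ from the last paragraph is simply the standard basis vector `np.eye(3)[i]` in this representation.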

## Spectral Theory

The eigenvalues and eigenvectors of general square matrices are ill-behaved creatures. Indeed, a general matrix with real entries can have complex-valued eigenvalues and fail to possess a full suite of linearly independent eigenvectors. The situation is dramatically better for symmetric matrices which obey the spectral theorem:

Theorem (spectral theorem for real symmetric matrices). An $n\times n$ real symmetric matrix $A$ (i.e., one satisfying $A = A^\top$) has $n$ real eigenvalues $\lambda_1,\ldots,\lambda_n$ and $n$ orthonormal eigenvectors $u_1,\ldots,u_n$.

Unfortunately for us, the transition matrix $P$ for a reversible Markov chain is not always symmetric. But despite this, there’s a surprise: $P$ always has real eigenvalues. This leads us to ask:

Why are the eigenvalues of the transition matrix $P$ real?

To answer this question, we will need to develop a more general version of the spectral theorem and use our standing assumption that the Markov chain is reversible.

### The Transpose

In our quest to develop a general version of the spectral theorem, we look more deeply into the hypothesis of the theorem, namely that $A$ is equal to its transpose $A^\top$. Let’s first ask: What does the transpose mean?

Recall that $\mathbb{R}^n$ is equipped with the standard inner product, sometimes called the Euclidean or dot product. We denote this product by $\langle \cdot, \cdot \rangle$ and it is defined as

$$\langle u, v \rangle = \sum_{i=1}^n u_i v_i.$$

We can think of the standard inner product $\langle u, v \rangle$ as defining the amount of the vector $u$ that points in the direction of $v$.

The transpose $A^\top$ of a matrix $A$ is closely related to the standard inner product. Specifically, $A^\top$ is the (unique) matrix satisfying the adjoint property:

$$\langle Au, v \rangle = \langle u, A^\top v \rangle \quad \text{for all } u, v \in \mathbb{R}^n.$$

So the amount of $Au$ in the direction of $v$ is the same as the amount of $u$ in the direction of $A^\top v$.

### The Adjoint and the General Spectral Theorem

Since the transpose is intimately related to the standard inner product on $\mathbb{R}^n$, it is natural to wonder if non-standard inner products on $\mathbb{R}^n$ give rise to non-standard versions of the transpose. This idea proves to be true.

Definition (adjoint). Let $\langle \cdot, \cdot \rangle_\diamond$ be any inner product on $\mathbb{R}^n$ and let $A$ be an $n\times n$ matrix. The adjoint $A^*$ of $A$ with respect to the inner product $\langle \cdot, \cdot \rangle_\diamond$ is the (unique) matrix such that

$$\langle Au, v \rangle_\diamond = \langle u, A^* v \rangle_\diamond \quad \text{for all } u, v \in \mathbb{R}^n.$$

The spectral theorem naturally extends to the more abstract setting of adjoints with respect to non-standard inner products:

Theorem (general spectral theorem). Let $A$ be an $n\times n$ matrix which is self-adjoint ($A^* = A$) with respect to an inner product $\langle \cdot, \cdot \rangle_\diamond$. Then $A$ has $n$ real eigenvalues $\lambda_1,\ldots,\lambda_n$ and $n$ eigenvectors $u_1,\ldots,u_n$ which are orthonormal:

$$\langle u_i, u_j \rangle_\diamond = \begin{cases} 1, & i = j, \\ 0, & i \ne j. \end{cases}$$

The general spectral theorem offers us a path to understand our observation from above that the eigenvalues of $P$ are always real. Namely, we ask:

Is there an inner product under which $P$ is self-adjoint?

Fortunately, the answer is yes. Define the $\pi$-inner product

$$\langle f, g \rangle_\pi = \mathbb{E}_\pi[f \cdot g] = \sum_{i=1}^n \pi_i f(i) g(i).$$

This inner product is very natural. If $f$ and $g$ are mean-zero ($\mathbb{E}_\pi[f] = \mathbb{E}_\pi[g] = 0$), it reports the covariance of $f(x)$ and $g(x)$ where $x$ is drawn from $\pi$.

Let us compute the adjoint $P^*$ of the transition matrix $P$ in the $\pi$-inner product $\langle \cdot, \cdot \rangle_\pi$. For any functions $f$ and $g$,

$$\langle Pf, g \rangle_\pi = \sum_{i=1}^n \pi_i (Pf)(i)\, g(i) = \sum_{i,j=1}^n \pi_i P_{ij} f(j) g(i).$$

Matching this expression against $\langle f, P^* g \rangle_\pi = \sum_{j=1}^n \pi_j f(j) (P^* g)(j)$ shows that the adjoint is

$$(P^*)_{ji} = \frac{\pi_i P_{ij}}{\pi_j}.$$

Recall that we have assumed our Markov chain is reversible. Thus, the detailed balance condition

$$\pi_i P_{ij} = \pi_j P_{ji} \quad \text{for all } i, j$$

implies that

$$(P^*)_{ji} = \frac{\pi_i P_{ij}}{\pi_j} = \frac{\pi_j P_{ji}}{\pi_j} = P_{ji}.$$

Thus, we conclude that $P^* = P$ is self-adjoint.

We cannot emphasize enough that the reversibility assumption is crucial to ensure that $P$ is self-adjoint. In fact,

Theorem (reversibility and self-adjointness). The transition matrix $P$ of a general Markov chain with stationary distribution $\pi$ is self-adjoint in the $\pi$-inner product if and only if the chain is reversible.
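We can check this theorem numerically. In matrix form, the $\pi$-inner product is $\langle u, v \rangle_\pi = u^\top D v$ with $D = \operatorname{diag}(\pi)$, so the $\pi$-adjoint is $P^* = D^{-1} P^\top D$. The sketch below (a hypothetical example; the reversible chain is built by a Metropolis construction, which satisfies detailed balance by design) verifies that a reversible chain is self-adjoint while a non-reversible one is not:

```python
import numpy as np

# Hypothetical 3-state reversible chain built by Metropolis
# (uniform proposal), so detailed balance holds by construction.
pi = np.array([0.5, 0.3, 0.2])
n = len(pi)
P = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        if i != j:
            P[i, j] = min(1.0, pi[j] / pi[i]) / n
    P[i, i] = 1.0 - P[i].sum()

# pi-adjoint: P* = D^{-1} P^T D with D = diag(pi)
D = np.diag(pi)
P_star = np.linalg.inv(D) @ P.T @ D
print(np.allclose(P_star, P))  # True: reversible => self-adjoint

# Contrast: a non-reversible chain (cycle 0 -> 1 -> 2 -> 0) whose
# stationary distribution is uniform is not self-adjoint.
Q = 0.1 * np.eye(n) + 0.9 * np.roll(np.eye(n), 1, axis=1)
u = np.full(n, 1.0 / n)
Dq = np.diag(u)
Q_star = np.linalg.inv(Dq) @ Q.T @ Dq
print(np.allclose(Q_star, Q))  # False
```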

### Spectral Theorem for Markov Chains

Since $P$ is self-adjoint in the $\pi$-inner product, the general spectral theorem implies the following

Corollary (spectral theorem for reversible Markov chains). The reversible Markov transition matrix $P$ has $n$ real eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$ associated with eigenvectors $\varphi_1,\ldots,\varphi_n$ which are orthonormal in the $\pi$-inner product:

$$\langle \varphi_i, \varphi_j \rangle_\pi = \begin{cases} 1, & i = j, \\ 0, & i \ne j. \end{cases}$$

Since $P$ is a reversible Markov transition matrix (not just any self-adjoint matrix), the eigenvalues of $P$ satisfy additional properties:

• Bounded. All of the eigenvalues of $P$ lie between $-1$ and $1$.¹ Here’s an argument. The vectors $\delta_1,\ldots,\delta_n$ span all of $\mathbb{R}^n$, where $\delta_i$ denotes a vector with $1$ in position $i$ and $0$ elsewhere. Then, by the fundamental theorem of Markov chains, $\delta_i P^t$ converges to $\pi$ as $t \to \infty$ for every $i$. In particular, for any vector $v$, $v P^t$ remains bounded as $t \to \infty$. Thus, since $v P^t$ remains bounded for every vector $v$, all of the eigenvalues of $P$ must be at most $1$ in magnitude.
• Eigenvalue 1. $\lambda_1 = 1$ is an eigenvalue of $P$ with eigenvector $e$, where $e$ is a vector of all ones.

For a primitive chain, we also have the following property:

• Contractive. For a primitive chain, all eigenvalues of $P$ other than $\lambda_1 = 1$ are less than $1$ in magnitude.² This follows because $\delta_i P^t \to \pi$ as $t \to \infty$ for every $i$, so $\pi$ must be the unique left eigenvector of $P$ with eigenvalue of modulus $1$.
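A convenient way to compute these eigenvalues numerically is to symmetrize: $D^{1/2} P D^{-1/2}$ (with $D = \operatorname{diag}(\pi)$) is a symmetric matrix whenever $P$ is self-adjoint in the $\pi$-inner product, and it has the same eigenvalues as $P$. Below is a sketch (hypothetical Metropolis chain, NumPy assumed) checking all three properties:

```python
import numpy as np

# Hypothetical 3-state reversible chain built by Metropolis
# (uniform proposal); all entries are positive, so it is primitive.
pi = np.array([0.5, 0.3, 0.2])
n = len(pi)
P = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        if i != j:
            P[i, j] = min(1.0, pi[j] / pi[i]) / n
    P[i, i] = 1.0 - P[i].sum()

# D^{1/2} P D^{-1/2} is symmetric for a reversible chain, so eigh
# returns the (real) eigenvalues of P, sorted in ascending order.
d = np.sqrt(pi)
lam = np.linalg.eigh(d[:, None] * P / d[None, :])[0]

print(lam)                                      # all in [-1, 1]
print(np.allclose(P @ np.ones(n), np.ones(n)))  # True: P e = e
print(np.max(np.abs(lam[:-1])))                 # < 1: contractive
```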

## Distance Between Probability Distributions Redux

In the previous post, we used the total variation distance to compare probability distributions:

Definition (total variation distance). The total variation distance between probability distributions $\rho$ and $\sigma$ is

$$\|\rho - \sigma\|_{\rm TV} = \max_{A \subseteq \{1,\ldots,n\}} |\rho(A) - \sigma(A)| = \frac{1}{2} \sum_{i=1}^n |\rho_i - \sigma_i|.$$

The total variation distance is an “$\ell^1$” way of comparing two probability distributions since it can be computed by adding the absolute differences between the probabilities $\rho_i$ and $\sigma_i$ of each outcome $i$. Spectral theory plays more nicely with an “$\ell^2$” way of comparing probability distributions, which we develop now.

### Densities

Given two probability distributions $\rho$ and $\pi$, the density of $\rho$ with respect to $\pi$ is the function³ given by

$$\frac{d\rho}{d\pi}(i) = \frac{\rho_i}{\pi_i}.$$

³Note that we typically associate probability distributions $\rho$ and $\pi$ with row vectors, whereas the density $d\rho/d\pi$ is a function which we identify with a column vector. For those interested and familiar with measure theory, here is a good reason why this makes sense. In the continuum setting, probability distributions $\rho$ and $\pi$ are measures whereas the density remains a function, known as the Radon–Nikodym derivative $d\rho/d\pi$. This provides a general way of figuring out which objects for finite state space Markov chains are row vectors and which are column vectors: measures are row vectors whereas functions are column vectors.

The density satisfies the property that

$$\rho(A) = \sum_{i \in A} \pi_i \, \frac{d\rho}{d\pi}(i) \quad \text{for every } A \subseteq \{1,\ldots,n\}. \tag{1}$$

To see why we call $d\rho/d\pi$ a density, it may be helpful to appeal to continuous probability for a moment. If $x$ is a random variable with distribution $\rho$, the probability density function of $x$ (with respect to the uniform measure on the real line) is a function $d\rho/dx$ such that

$$\mathbb{P}(x \in A) = \int_A \frac{d\rho}{dx}(x)\,dx \quad \text{for every } A \subseteq \mathbb{R}.$$

The formula for the continuous case is exactly the same as the finite case (1) with sums $\sum$ replaced with integrals $\int$.

### The Chi-Squared Divergence

We are now ready to introduce our “$\ell^2$” way of comparing probability distributions.

Definition ($\chi^2$-divergence). The $\chi^2$-divergence of $\rho$ with respect to $\pi$ is the variance of the density function:

$$\chi^2(\rho \,\|\, \pi) = \operatorname{Var}_\pi\left[\frac{d\rho}{d\pi}\right] = \sum_{i=1}^n \frac{\rho_i^2}{\pi_i} - 1.$$

To see why the last equality is a valid formula for $\chi^2(\rho \,\|\, \pi)$, note that $\mathbb{E}_\pi[d\rho/d\pi] = 1$ by (1), so

$$\operatorname{Var}_\pi\left[\frac{d\rho}{d\pi}\right] = \mathbb{E}_\pi\left[\left(\frac{d\rho}{d\pi}\right)^2\right] - \left(\mathbb{E}_\pi\left[\frac{d\rho}{d\pi}\right]\right)^2 = \sum_{i=1}^n \pi_i \cdot \frac{\rho_i^2}{\pi_i^2} - 1 = \sum_{i=1}^n \frac{\rho_i^2}{\pi_i} - 1. \tag{2}$$
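As a quick sanity check, the sketch below (hypothetical distributions) computes the $\chi^2$-divergence both as the variance of the density and via the closed form (2), and confirms the two agree:

```python
import numpy as np

# Hypothetical example: compare a distribution rho against pi.
pi = np.array([0.5, 0.3, 0.2])
rho = np.array([0.2, 0.5, 0.3])

# chi^2 as the variance of the density d(rho)/d(pi)...
dens = rho / pi
chi2_var = np.dot(pi, (dens - np.dot(pi, dens)) ** 2)

# ...and via the closed form (2): sum_i rho_i^2 / pi_i - 1.
chi2_formula = np.sum(rho ** 2 / pi) - 1

print(chi2_var, chi2_formula)  # the two formulas agree
```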

### Relations Between Distance Measures

The $\chi^2$-divergence always gives an upper bound on the total variation distance. Indeed, first note that we can express the total variation distance as

$$\|\rho - \pi\|_{\rm TV} = \frac{1}{2}\sum_{i=1}^n |\rho_i - \pi_i| = \frac{1}{2}\sum_{i=1}^n \pi_i \left|\frac{\rho_i}{\pi_i} - 1\right| = \frac{1}{2}\,\mathbb{E}_\pi\left[\left|\frac{d\rho}{d\pi} - 1\right|\right].$$

Thus, using Lyapunov’s inequality $\mathbb{E}_\pi[|h|] \le \left(\mathbb{E}_\pi[h^2]\right)^{1/2}$, we conclude

$$\|\rho - \pi\|_{\rm TV} \le \frac{1}{2}\left(\mathbb{E}_\pi\left[\left(\frac{d\rho}{d\pi} - 1\right)^2\right]\right)^{1/2} = \frac{1}{2}\sqrt{\chi^2(\rho \,\|\, \pi)}. \tag{3}$$
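The inequality (3) is easy to stress-test numerically; the sketch below checks it on randomly generated pairs of distributions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Check (3) on random pairs of distributions on 4 states.
for _ in range(1000):
    pi = rng.random(4); pi /= pi.sum()
    rho = rng.random(4); rho /= rho.sum()
    tv = 0.5 * np.abs(rho - pi).sum()
    chi2 = np.sum(rho ** 2 / pi) - 1
    assert tv <= 0.5 * np.sqrt(chi2) + 1e-12

print("TV <= (1/2) sqrt(chi^2) held on all trials")
```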

## Markov Chain Convergence by Eigenvalues

Now, we prove a quantitative version of the fundamental theorem of Markov chains (for reversible processes) using spectral theory and eigenvalues.⁴ Note that we used the fundamental theorem of Markov chains in the footnotes above to prove the “bounded” and “contractive” properties, so, at present, this purported proof of the fundamental theorem would be circular. Fortunately, we can establish these two claims independently of the fundamental theorem, say, by appealing to the Perron–Frobenius theorem. Thus, this argument does give a genuine non-circular proof of the fundamental theorem.

Theorem (Markov chain convergence by eigenvalues). Let $\rho_0, \rho_1, \rho_2, \ldots$ denote the distributions of the reversible Markov chain at times $0, 1, 2, \ldots$. Then

$$\chi^2(\rho_t \,\|\, \pi) \le \bar{\lambda}^{2t}\, \chi^2(\rho_0 \,\|\, \pi), \quad \text{where } \bar{\lambda} := \max\{\lambda_2, |\lambda_n|\}$$

is the largest eigenvalue of $P$ in magnitude, excluding $\lambda_1 = 1$.

This raises a natural question: How large can the initial divergence between $\rho_0$ and $\pi$ be? This is answered by the next result.

Proposition (initial distance to stationarity). For any state $i$,

$$\chi^2(\delta_i \,\|\, \pi) = \frac{1}{\pi_i} - 1. \tag{4}$$

For any initial distribution $\rho_0$, we have

$$\chi^2(\rho_0 \,\|\, \pi) \le \frac{1}{\pi_{\min}} - 1, \quad \text{where } \pi_{\min} := \min_{1 \le i \le n} \pi_i. \tag{5}$$

Equation (4) is a direct computation:

$$\chi^2(\delta_i \,\|\, \pi) = \sum_{j=1}^n \frac{(\delta_i)_j^2}{\pi_j} - 1 = \frac{1}{\pi_i} - 1.$$

For (5), observe that $\chi^2(\rho_0 \,\|\, \pi)$ is a convex function of the initial distribution $\rho_0$. Therefore, its maximum over the convex set of all probability distributions is attained at an extreme point $\rho_0 = \delta_i$ for some state $i$. Consequently, (5) follows directly from (4) since $1/\pi_i - 1 \le 1/\pi_{\min} - 1$.

Using the previous two results and equation (3), we immediately obtain the following:

Corollary (mixing time). When initialized from a distribution $\rho_0$, the chain mixes to $\varepsilon$-total variation distance to stationarity,

$$\|\rho_t - \pi\|_{\rm TV} \le \varepsilon,$$

after

$$t \ge \frac{\log\left(\frac{1}{2\varepsilon}\right) + \frac{1}{2}\log\left(\frac{1}{\pi_{\min}}\right)}{\log(1/\bar{\lambda})} \tag{6}$$

steps, where $\bar{\lambda} = \max\{\lambda_2, |\lambda_n|\}$ as above. Indeed, combining (3), the convergence theorem, and (5),

$$\|\rho_t - \pi\|_{\rm TV} \le \frac{1}{2}\sqrt{\chi^2(\rho_t \,\|\, \pi)} \le \frac{\bar{\lambda}^t}{2}\sqrt{\chi^2(\rho_0 \,\|\, \pi)} \le \frac{\bar{\lambda}^t}{2}\sqrt{\frac{1}{\pi_{\min}}}.$$

Setting the right-hand side equal to $\varepsilon$ and rearranging gives (6).
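To see the corollary in action, the sketch below (hypothetical 3-state Metropolis chain) compares the bound (6) against the chain's actual mixing time, measured by iterating $\rho_{t+1} = \rho_t P$ from every starting state:

```python
import numpy as np

# Hypothetical 3-state reversible chain (Metropolis construction).
pi = np.array([0.5, 0.3, 0.2])
n = len(pi)
P = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        if i != j:
            P[i, j] = min(1.0, pi[j] / pi[i]) / n
    P[i, i] = 1.0 - P[i].sum()

# lambda_bar: largest eigenvalue magnitude, excluding lambda_1 = 1.
d = np.sqrt(pi)
lam = np.linalg.eigh(d[:, None] * P / d[None, :])[0]  # ascending
lam_bar = np.max(np.abs(lam[:-1]))

# Bound (6) on the eps-mixing time from the worst starting state.
eps = 1e-3
t_bound = (np.log(1 / (2 * eps)) + 0.5 * np.log(1 / pi.min())) \
          / np.log(1 / lam_bar)

# Actual time for every delta_i to reach TV distance <= eps.
rho = np.eye(n)          # row i: chain started from state i
t_actual = 0
while 0.5 * np.abs(rho - pi).sum(axis=1).max() > eps:
    rho = rho @ P
    t_actual += 1

print(t_actual, np.ceil(t_bound))  # the bound dominates the actual time
```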

### Proof of Markov Chain Convergence Theorem

We conclude with a proof of the Markov chain convergence result.

### Recurrence for Densities

Our first task is to derive a recurrence for the densities $d\rho_t/d\pi$. To do this, we use the recurrence for the probability distributions:

$$(\rho_{t+1})_j = \sum_{i=1}^n (\rho_t)_i P_{ij}.$$

Thus,

$$\frac{d\rho_{t+1}}{d\pi}(j) = \frac{(\rho_{t+1})_j}{\pi_j} = \sum_{i=1}^n \frac{\pi_i P_{ij}}{\pi_j} \cdot \frac{(\rho_t)_i}{\pi_i} = \sum_{i=1}^n \frac{\pi_i P_{ij}}{\pi_j} \cdot \frac{d\rho_t}{d\pi}(i).$$

We now invoke the detailed balance condition $\pi_i P_{ij} = \pi_j P_{ji}$, which implies

$$\frac{d\rho_{t+1}}{d\pi}(j) = \sum_{i=1}^n P_{ji}\, \frac{d\rho_t}{d\pi}(i) = \left(P\, \frac{d\rho_t}{d\pi}\right)(j).$$

The product $P\,\frac{d\rho_t}{d\pi}$ is an ordinary matrix–vector product. Thus, we have shown

$$\frac{d\rho_{t+1}}{d\pi} = P\, \frac{d\rho_t}{d\pi}. \tag{7}$$
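The recurrence (7) can be confirmed numerically: evolving the distribution by $\rho_{t+1} = \rho_t P$ and dividing by $\pi$ gives the same result as applying $P$ directly to the density (hypothetical reversible chain, built by Metropolis):

```python
import numpy as np

# Hypothetical 3-state reversible chain (Metropolis construction).
pi = np.array([0.5, 0.3, 0.2])
n = len(pi)
P = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        if i != j:
            P[i, j] = min(1.0, pi[j] / pi[i]) / n
    P[i, i] = 1.0 - P[i].sum()

rho = np.array([1.0, 0.0, 0.0])   # start the chain at state 0

# Distributions evolve as row vectors: rho_{t+1} = rho_t P ...
rho_next = rho @ P
# ... while (7) says densities evolve by an ordinary product with P.
dens_next = P @ (rho / pi)

print(np.allclose(rho_next / pi, dens_next))  # True
```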

### Spectral Decomposition

Now that we have a recurrence for the densities $d\rho_t/d\pi$, we can understand how the densities change in time. Expand the initial density in the eigenvectors $\varphi_1,\ldots,\varphi_n$ of $P$:

$$\frac{d\rho_0}{d\pi} = \sum_{i=1}^n c_i \varphi_i, \quad \text{where } c_i = \left\langle \frac{d\rho_0}{d\pi}, \varphi_i \right\rangle_\pi.$$

As we verified above in (1), we have $\mathbb{E}_\pi[d\rho_0/d\pi] = \rho_0(\{1,\ldots,n\}) = 1$. Thus, taking $\varphi_1 = e$ to be the vector of all ones, we conclude $c_1 = \langle d\rho_0/d\pi, e \rangle_\pi = \mathbb{E}_\pi[d\rho_0/d\pi] = 1$. Since the $\varphi_i$ are eigenvectors of $P$ with eigenvalues $\lambda_i$, the recurrence (7) implies

$$\frac{d\rho_t}{d\pi} = P^t\, \frac{d\rho_0}{d\pi} = e + \sum_{i=2}^n \lambda_i^t c_i \varphi_i.$$

### Conclusion

Finally, we compute (writing $\|f\|_\pi := \langle f, f \rangle_\pi^{1/2}$ and using the orthonormality of the $\varphi_i$)

$$\chi^2(\rho_t \,\|\, \pi) = \operatorname{Var}_\pi\left[\frac{d\rho_t}{d\pi}\right] = \left\|\frac{d\rho_t}{d\pi} - e\right\|_\pi^2 = \sum_{i=2}^n \lambda_i^{2t} c_i^2 \le \bar{\lambda}^{2t} \sum_{i=2}^n c_i^2 = \bar{\lambda}^{2t}\, \chi^2(\rho_0 \,\|\, \pi).$$

The theorem is proven.