In the previous two posts, we proved the fundamental theorem of Markov chains using *couplings* and discussed the use of couplings to bound the *mixing time*, the time required for the chain to mix to near-stationarity.

In this post, we will continue this discussion by discussing *spectral* methods for understanding the convergence of Markov chains.

## Mathematical Setting

In this post, we will study a *reversible* Markov chain on with transition probability matrix and stationary distribution . Our goal will be to use the eigenvalues and eigenvectors of the matrix to understand the properties of the Markov chain.

Throughout our discussion, it will be helpful to treat *vectors* and *functions* as being one and the same, and we will use both and to denote the th entry of the vector (aka the evaluation of the function at ). By adopting this perspective, a vector is not merely a list of numbers, but instead labels each state with a numeric value.

Given a function/vector , we let denote the expected value of where is drawn from :

The variance is defined similarly:

As a final piece of notation, we let denote a probability distribution which assigns 100% probability to outcome .

## Spectral Theory

The eigenvalues and eigenvectors of general square matrices are ill-behaved creatures. Indeed, a general matrix with real entries can have complex-valued eigenvalues and fail to possess a full suite of linearly independent eigenvectors. The situation is dramatically better for *symmetric matrices* which obey the *spectral theorem*:

Theorem (spectral theorem for real symmetric matrices).An real symmetric matrix (i.e., one satisfying ) has real eigenvalues and orthonormal eigenvectors .

Unfortunately for us, the transition matrix for a reversible Markov chain is not always symmetric. But despite this, there’s a surprise: always has real eigenvalues. This leads us to ask:

Why does are the eigenvalues of the transition matrix real?

To answer this question, we will need to develop and a more general version of the spectral theorem and use our standing assumption that the Markov chain is *reversible*.

### The Transpose

In our quest to develop a general version of the spectral theorem, we look more deeply into the hypothesis of the theorem, namely that is equal to its transpose . Let’s first ask: What does the transpose *mean*?

Recall that is equipped with the *standard inner product*, sometimes called the Euclidean or dot product. We denote this product by and it is defined as

We can think of the standard inner product as defining the amount of the vector that points in the direction of .

The transpose of a matrix is closely related to the standard inner product. Specifically, is the (unique) matrix satisfying the *adjoint property*:

So the amount of in the direction of is the same as the amount of in the direction of .

### The Adjoint and the General Spectral Theorem

Since the transpose is intimately related to the standard inner product on , it is natural to wonder if *non-standard inner products* on give rise to *non-standard versions of the transpose*. This idea proves to be true.

Definition (adjoint).Let be any inner product on and let be an matrix. Theadjointof with respect to the inner product is the (unique) matrix such thatA matrix is

self-adjointif it equals its own adjoint, .

The spectral theorem naturally extends to the more abstract setting of adjoints with respect to non-standard inner products:

Theorem (general spectral theorem).Let be an matrix which is set-adjoint with respect to an inner product . Then has real eigenvalues and eigenvectors which are –orthonormal:

### Reversibility and Self-Adjointness

The general spectral theorem offers us a path to understand our observation from above that the eigenvalues of are always real. Namely, we ask:

Is there an inner product under which is self-adjoint?

Fortunately, the answer is yes. Define the -inner product

This inner product is very natural. If and are mean-zero (), it reports the covariance of and where is drawn from .

Let us compute the adjoint of the transition matrix in the -inner product :

Recall that we have assumed our Markov chain is reversible. Thus, the detailed balance condition

implies that

Thus, we conclude that is self-adjoint.

We cannot emphasize enough that the reversibility assumption is crucial to ensure that is self-adjoint. In fact,

Theorem (reversibility and self-adjointness).The transition matrix of a general Markov chain with stationary distribution is self-adjoint in the -inner product if and only if the chain is reversible.

### Spectral Theorem for Markov Chains

Since is self-adjoint in the -inner product, the general spectral theorem implies the following

Corollary (spectral theorem for reversible Markov chain).The reversible Markov transition matrix has real eigenvalues associated with eigenvectors which are orthonormal in the -inner product:

Since is a reversible Markov transition matrix—not just any self-adjoint matrix—the eigenvalues of satisfy additional properties:

**Bounded.**All of the eigenvalues of lie between and .^{1}Here’s an argument. The vectors span all of , where denotes a vector with in position and elsewhere. Then, by the fundamental theorem of Markov chains, converges to for every . In particular, for any vector , . Thus, since for every vector , all of the eigenvalues must be in magnitude.**Eigenvalue 1.**is an eigenvalue of with eigenvector , where is a vector of all one’s.

For a primitive chain, we can also have the property:

**Contractive.**For a primitive chain, all eigenvalues other than have magnitude .^{2}This follows for every , so must be the unique left eigenvector with eigenvalue of modulus .

## Distance Between Probability Distributions Redux

In the previous post, we used the *total variation distance* to compare probability distributions:

Definition (total variation distance).Thetotal variation distancebetween probability distributions and is

The total variation distance is an ““ way of comparing two probability distributions since it can be computed by adding the absolute difference between the probabilities and of each outcome. Spectral theory plays more nicely with an “” way of comparing probability distributions, which we develop now.

### Densities

Given two probability densities and , the *density* of with respect to is the *function*^{3}Note that we typically associate probability distributions and with *row vectors*, whereas the density is a function which we identify with *column vectors*. For those interested and familiar with measure theory, here is a good reason why this makes sense. In the continuum setting, probability distributions and are *measures* whereas the density remains a *function*, known as the Radon–Nikodym derivative . This provides a general way of figuring out which objects for finite state space Markov chains are row vectors and which are column vectors: Measures are row vectors whereas functions are column vectors. given by

The density satisfies the property that

(1)

To see why we call a density, it may be helpful to appeal to continuous probability for a moment. If is a random variable with distribution , the probability density function of (with respect to the uniform distribution ) is a function such that

The formula for the continuous case is exactly the same as the finite case (1) with sums replaced with integrals .

### The Chi-Squared Divergence

We are now ready to introduce our “” way of comparing probability distributions.

Definition (-divergence).The-divergenceof with respect to is the variance of the density function:

To see the last equality is a valid formula for , note that

(2)

### Relations Between Distance Measures

The divergence always gives an upper bound on the total variation distance. Indeed, first note that we can express the total variation distance as

Thus, using Lyapunov’s inequality, we conclude

(3)

## Markov Chain Convergence by Eigenvalues

Now, we prove a quantitative version of the fundamental theorem of Markov chains (for reversible processes) using spectral theory and eigenvalues:^{4}Note that we used the fundamental theorem of Markov chains in above footnotes to prove the “bounded” and “contractive” properties, so, at present, this purported proof of the fundamental theorem would be circular. Fortunately, we can establish these two claims independently of the fundamental theorem, say, by appealing to the Perron–Frobenius theorem. Thus, this argument does give a genuine non-circular proof of the fundamental theorem.

Theorem (Markov chain convergence by eigenvalues).Let denote the distributions of the Markov chain at times . Then

This raises a natural question: How large can the initial divergence between and be? This is answered by the next result.

Proposition (Initial distance to stationarity).For any ,(4)

For any initial distribution , we have(5)

The condition (4) is a direct computation

For (5), observe that is a

*convex function*of the initial distribution . Therefore, its maximal value over the convex set of all probability distribution is maximized at an extreme point for some . Consequently, (5) follows directly from (4).

Using the previous two results and equation (3), we immediately obtain the following:

Corollary (Mixing time).When initialized from a distribution , the chain mixes to -total variation distance to stationarityafter

(6)

Indeed,

Equating the left-hand side with and rearranging gives (6).

### Proof of Markov Chain Convergence Theorem

We conclude with a proof of the Markov chain convergence result.

### Recurrence for Densities

Our first task is to derive a recurrence for the densities . To do this, we use the recurrence for the probability distribution:

Thus,

We now invoke the detailed balance , which implies

The product is an ordinary matrix–vector product. Thus, we have shown

(7)

### Spectral Decomposition

Now that we have a recurrence for the densities , we can understand how the densities change in time. Expand the initial density in eigenvectors of :

As we verified above in (1), we have . Thus, we conclude . Since the are eigenvectors of with eigenvalues , the recurrence (7) implies

### Conclusion

Finally, we compute

The theorem is proven.