In the previous two posts, we proved the fundamental theorem of Markov chains using couplings and discussed the use of couplings to bound the mixing time, the time required for the chain to mix to near-stationarity.
In this post, we will continue this discussion by discussing spectral methods for understanding the convergence of Markov chains.
In this post, we will study a reversible Markov chain on with transition probability matrix and stationary distribution . Our goal will be to use the eigenvalues and eigenvectors of the matrix to understand the properties of the Markov chain.
Throughout our discussion, it will be helpful to treat vectors and functions as being one and the same, and we will use both and to denote the th entry of the vector (aka the evaluation of the function at ). By adopting this perspective, a vector is not merely a list of numbers, but instead labels each state with a numeric value.
Given a function/vector , we let denote the expected value of where is drawn from :
The variance is defined similarly:
As a final piece of notation, we let denote a probability distribution which assigns 100% probability to outcome .
The eigenvalues and eigenvectors of general square matrices are ill-behaved creatures. Indeed, a general matrix with real entries can have complex-valued eigenvalues and fail to possess a full suite of linearly independent eigenvectors. The situation is dramatically better for symmetric matrices which obey the spectral theorem:
Theorem (spectral theorem for real symmetric matrices). An real symmetric matrix (i.e., one satisfying ) has real eigenvalues and orthonormal eigenvectors .
Unfortunately for us, the transition matrix for a reversible Markov chain is not always symmetric. But despite this, there’s a surprise: always has real eigenvalues. This leads us to ask:
Why does are the eigenvalues of the transition matrix real?
To answer this question, we will need to develop and a more general version of the spectral theorem and use our standing assumption that the Markov chain is reversible.
In our quest to develop a general version of the spectral theorem, we look more deeply into the hypothesis of the theorem, namely that is equal to its transpose . Let’s first ask: What does the transpose mean?
Recall that is equipped with the standard inner product, sometimes called the Euclidean or dot product. We denote this product by and it is defined as
We can think of the standard inner product as defining the amount of the vector that points in the direction of .
The transpose of a matrix is closely related to the standard inner product. Specifically, is the (unique) matrix satisfying the adjoint property:
So the amount of in the direction of is the same as the amount of in the direction of .
The Adjoint and the General Spectral Theorem
Since the transpose is intimately related to the standard inner product on , it is natural to wonder if non-standard inner products on give rise to non-standard versions of the transpose. This idea proves to be true.
Definition (adjoint). Let be any inner product on and let be an matrix. The adjoint of with respect to the inner product is the (unique) matrix such that
A matrix is self-adjoint if it equals its own adjoint, .
The spectral theorem naturally extends to the more abstract setting of adjoints with respect to non-standard inner products:
Theorem (general spectral theorem). Let be an matrix which is set-adjoint with respect to an inner product . Then has real eigenvalues and eigenvectors which are –orthonormal:
Reversibility and Self-Adjointness
The general spectral theorem offers us a path to understand our observation from above that the eigenvalues of are always real. Namely, we ask:
Is there an inner product under which is self-adjoint?
Fortunately, the answer is yes. Define the -inner product
This inner product is very natural. If and are mean-zero (), it reports the covariance of and where is drawn from .
Let us compute the adjoint of the transition matrix in the -inner product :
Recall that we have assumed our Markov chain is reversible. Thus, the detailed balance condition
Thus, we conclude that is self-adjoint.
We cannot emphasize enough that the reversibility assumption is crucial to ensure that is self-adjoint. In fact,
Theorem (reversibility and self-adjointness). The transition matrix of a general Markov chain with stationary distribution is self-adjoint in the -inner product if and only if the chain is reversible.
Spectral Theorem for Markov Chains
Since is self-adjoint in the -inner product, the general spectral theorem implies the following
Corollary (spectral theorem for reversible Markov chain). The reversible Markov transition matrix has real eigenvalues associated with eigenvectors which are orthonormal in the -inner product:
Since is a reversible Markov transition matrix—not just any self-adjoint matrix—the eigenvalues of satisfy additional properties:
- Bounded. All of the eigenvalues of lie between and .1Here’s an argument. The vectors span all of , where denotes a vector with in position and elsewhere. Then, by the fundamental theorem of Markov chains, converges to for every . In particular, for any vector , . Thus, since for every vector , all of the eigenvalues must be in magnitude.
- Eigenvalue 1. is an eigenvalue of with eigenvector , where is a vector of all one’s.
For a primitive chain, we can also have the property:
- Contractive. For a primitive chain, all eigenvalues other than have magnitude .2This follows for every , so must be the unique left eigenvector with eigenvalue of modulus .
Distance Between Probability Distributions Redux
In the previous post, we used the total variation distance to compare probability distributions:
Definition (total variation distance). The total variation distance between probability distributions and is
The total variation distance is an ““ way of comparing two probability distributions since it can be computed by adding the absolute difference between the probabilities and of each outcome. Spectral theory plays more nicely with an “” way of comparing probability distributions, which we develop now.
Given two probability densities and , the density of with respect to is the function3Note that we typically associate probability distributions and with row vectors, whereas the density is a function which we identify with column vectors. For those interested and familiar with measure theory, here is a good reason why this makes sense. In the continuum setting, probability distributions and are measures whereas the density remains a function, known as the Radon–Nikodym derivative . This provides a general way of figuring out which objects for finite state space Markov chains are row vectors and which are column vectors: Measures are row vectors whereas functions are column vectors. given by
The density satisfies the property that
To see why we call a density, it may be helpful to appeal to continuous probability for a moment. If is a random variable with distribution , the probability density function of (with respect to the uniform distribution ) is a function such that
The formula for the continuous case is exactly the same as the finite case (1) with sums replaced with integrals .
We are now ready to introduce our “” way of comparing probability distributions.
Definition (-divergence). The -divergence of with respect to is the variance of the density function:
To see the last equality is a valid formula for , note that
Relations Between Distance Measures
The divergence always gives an upper bound on the total variation distance. Indeed, first note that we can express the total variation distance as
Thus, using Lyapunov’s inequality, we conclude
Markov Chain Convergence by Eigenvalues
Now, we prove a quantitative version of the fundamental theorem of Markov chains (for reversible processes) using spectral theory and eigenvalues:4Note that we used the fundamental theorem of Markov chains in above footnotes to prove the “bounded” and “contractive” properties, so, at present, this purported proof of the fundamental theorem would be circular. Fortunately, we can establish these two claims independently of the fundamental theorem, say, by appealing to the Perron–Frobenius theorem. Thus, this argument does give a genuine non-circular proof of the fundamental theorem.
Theorem (Markov chain convergence by eigenvalues). Let denote the distributions of the Markov chain at times . Then
This raises a natural question: How large can the initial divergence between and be? This is answered by the next result.
Proposition (Initial distance to stationarity). For any ,
(4)For any initial distribution , we have
The condition (4) is a direct computation
For (5), observe that is a convex function of the initial distribution . Therefore, its maximal value over the convex set of all probability distribution is maximized at an extreme point for some . Consequently, (5) follows directly from (4).
Using the previous two results and equation (3), we immediately obtain the following:
Corollary (Mixing time). When initialized from a distribution , the chain mixes to -total variation distance to stationarity
Equating the left-hand side with and rearranging gives (6).
Proof of Markov Chain Convergence Theorem
We conclude with a proof of the Markov chain convergence result.
Recurrence for Densities
Our first task is to derive a recurrence for the densities . To do this, we use the recurrence for the probability distribution:
We now invoke the detailed balance , which implies
The product is an ordinary matrix–vector product. Thus, we have shown
Now that we have a recurrence for the densities , we can understand how the densities change in time. Expand the initial density in eigenvectors of :
As we verified above in (1), we have . Thus, we conclude . Since the are eigenvectors of with eigenvalues , the recurrence (7) implies
Finally, we compute
The theorem is proven.