Here, denotes the distribution of the chain at time , denotes the stationary distribution, denotes the divergence, and denote the decreasingly ordered eigenvalues of the Markov transition matrix . To bound the rate of convergence to stationarity, we therefore must upper bound and lower bound .
Having to prove separate bounds for two eigenvalues is inconvenient. In the next post, we will develop tools to bound . But what should we do about Fortunately, there is a trick.
It Pays to be Lazy
Call a Markov chain lazy if, at every step, the chain has at least a 50% chance of staying put. That is, for every .
Any Markov chain can be made into a lazy chain. At every step of the chain, flip a coin. If the coin is heads, perform a step of the Markov chain as normal, drawing the next step randomly according to the transition matrix . If the coin comes up tails, do nothing and stay put.
Algebraically, the lazy chain described in the previous paragraph corresponds to replacing the original transition matrix with the lazy transition matrix , where denotes the identity matrix. It is easy to see that the stationary distribution for is also a stationary distribution for :
The eigenvalues of the lazy transition matrix are
In particular, since all of the original eigenvalues are , all of the eigenvalues of the lazy chain are . Thus, the smallest eigenvalue of is always and thus, for the lazy chain, the convergence is controlled only by :
Since it can be easier to establish bounds on the second eigenvalue than on both the second and last, it can be convenient—for theoretical purposes especially—to work with lazy chains, using the construction above to enforce laziness if necessary.
Should You be Lazy?
We’ve seen one theoretical argument for why it pays to use lazy Markov chains. But should we use lazy Markov chains in practice? The answer ultimately depends on how we want to use the Markov chain.
Here are two very different ways we could use a Markov chain:
- One-shot sampling. Run the Markov chain once for a fixed number of steps , and use the result as an approximate sample from .
For this use case, it is important that the output is genuinely a sample from , and the possibility of large negative eigenvalues can significantly degrade the convergence. In the extreme case where is an eigenvalue, the chain will fail to converge to stationarity at all. For this application, the lazy approach may make sense.
- Averaging. Another way we can use a Markov chain is to compute the average of value of a function over a random variable drawn from the stationary distribution:
As we shall see, this averaging process converges at an (asymptotically) slower rate for the lazy chain than the original chain. Therefore, for this application, we should typically just run the chain as-is, and not worry about making it lazy.
The Case Against Laziness
To see why being lazy hurts us for the averaging problem, we will use the Markov chain central limit theorem. As with many random averaging processes, we expect that in the limit of a large number of steps , the Markov chain average
will become approximately normally distributed. This is the content of the Markov chain central limit theorem (CLT):
Informal theorem (Markov chain central limit theorem). Let denote the states of a Markov chain initialized in the stationary distribution . For a large number of steps, is approximately normally distributed with mean and variance where
The Markov chain CLT shows that the variance of the Markov chain average depends not only on the variance of in the stationary distribution, but also the covariance between and later values . The faster the covariance decreases, the smaller will be and thus the smaller the error for the Markov chain average.
More formally, the Markov chain CLT is given by the following result:
Theorem (Markov chain central limit theorem). As ,
converges in distribution to a normal random variable with mean zero and variance , as defined in (1).
To compare the lazy and non-lazy chains, let’s compute the variance parameter in the Markov chain CLT in terms of the eigenvalues of the chain. For the remainder of this post, let be any function.
Step 1: Spectral Decomposition
As in the previous post, we think of vectors such as as defining a function . Thus, we can expand the function using the eigenvectors
(2)The first eigenvector is the vector of all ones (or, equivalently, the function that outputs for every output).
Step 2: Applying the Transition Matrix to a Function
The transition matrix is defined so that if the Markov chain has probability distribution at time , then the chain has distribution at time . In other words, multiplying by on the right advances a distribution forward one step in time. This leads us to ask, what is the interpretation of multiplying a function by on the left. That is, is there an interpretation to the matrix–vector product ?
Indeed, there is such an interpretation: The th coordinate of is the expectation of conditional on the chain starting at :
(3)There’s a straightforward proof of this fact. Let denote a probability distribution which places 100% of the probability mass on the single site . The th entry of is
We know that is the state of the Markov chain after steps when initialized in the initial distribution . Thus,
This proves the formula (3).
Step 3: Computing the Covariance
With the help of the formula for and the spectral decomposition of , we are ready to compute the covariances appearing in (1).Let . Then
Let’s first compute and . Since is the vector/function of all ones, we have
For the last equality, we use the spectral decomposition (2) along with the orthonormality of the eigenvectors .
Since we assume the chain is initialized in the stationary distribution , it remains in stationarity at time , , so we have .
Now, let’s compute . Use the law of total expectation to write
Now, invoke (3) and the definition of the -inner product to write
Finally, using the spectral decomposition, we obtain
Combining this formula with our earlier observation that and plugging into (4), we obtain
(5)For the second equality, we use that the largest eigenvalue is , so entirely drops out of the covariance.
Step 4: Wrapping it Up
With the formula (5) for the covariance in hand, we are ready to evaluate the variance parameter in the Markov chain CLT. First, note that the variance is
Therefore, combining this formula for the variance and the formula (5) for the covariance, we obtain
Now, apply the formula for the sum of an infinite geometric series to obtain
Conclusion: Laziness is (Asymptotically) Bad for Averaging
Before we get to the conclusions, let’s summarize where we are. We are seeking to use Markov chain average
as an approximation to the stationary mean . The Markov chain central limit theorem shows that, for a large number of steps , the error is approximately normally distributioned with mean zero and variance . So to obtain accurate estimates, we want to be as small as possible.
After some work, we were able to derive a formula (6) for in terms of the eigenvalues of the transition matrix . The formula for for the lazy chain is identical, except with each eigenvalue replaced by . Thus, we have
From these formulas, it is clear that the lazy Markov chain has a larger parameter and is thus less accurate than the non-lazy Markov chain, no matter what the eigenvalues are.
To compare the non-lazy and lazy chains in another way, consider the plot below. The blue solid line shows the amplification factor of an eigenvalue , which represents amount by which the squared coefficient is scaled by in . In the red-dashed line, we see the corresponding amplification factor for the corresponding eigenvalue of the lazy chain. We see that at every value, the lazy chain has a higher amplification factor than the original chain.
Remember that our original motivation for using the lazy chain was to remove the possibility of slow convergence of the chain to stationarity because of negative eigenvalues. But for the averaging application, negative eigenvalues are great. The process of Markov chain averaging shrinks the influence of negative eigenvalues on , whereas positive eigenvalues are amplified. For the averaging application, negative eigenvalues for the chain are a feature, not a bug.
Moral of the Story
Much of Markov chain theory, particularly in theoretical parts of computer science, mathematics, and machine learning, is centered around proving convergence to stationarity. Negative eigenvalues are a genuine obstruction to convergence to stationarity, and using the lazy chain in practice may be a sensible idea if one truly needs a sample from the stationary distribution in a given application.
But one-shot sampling is not the only or even the most common uses for Markov chains in computational practice. For other applications, such as averaging, the negative eigenvalues are actually a help. Using the lazy chain in practice for these problems would be a poor idea.
To me, the broader lesson in this story is that, as applied mathematicians, it can be inadvisable to become too fixated on one particular mathematical benchmark for designing and analyzing algorithms. Proving rapid convergence to stationarity with respect to the total variation distance is one nice way to analyze Markov chains. But it isn’t the only way, and chains not possessing this property because of large negative eigenvalues may actually be better in practice for some problems. Ultimately, applied mathematics should, at least in part, be guided by applications, and paying attention to how algorithms are used in practice should inform how we build and evaluate new ones.