Markov Musings 2: Couplings – Ethan N. Epperly

This post is part of a series, Markov Musings, about the mathematical analysis of Markov chains. See here for the first post in the series.

In the last post, we saw a proof of the fundamental theorem of Markov chains using the method of couplings. In this post, we will use couplings to provide quantitative bounds on the rate of Markov chain convergence.

Mixing Time

Let us begin where the last post ended by recalling some facts about the mixing time of a Markov chain. Consider a Markov chain $x_0,x_1,x_2,\ldots$ with stationary distribution $\pi$ . Recall that the mixing time is

$\tau_{\rm mix} \coloneqq \min \left\{ n \ge 1 : \max_{\rho^{(0)}} \norm{\rho^{(n)} - \pi}_{\rm TV} \le \frac{1}{2e} \right\}.$

Here, $\rho^{(n)}$ denotes the distribution of the state $x_n$ of the chain at time $n$ and $\norm{\cdot}_{\rm TV}$ is the total variation distance. For more on the total variation distance, see the previous post.

At the end of last post, we saw the following theorem, showing that the mixing time controls the rate of Markov chain convergence.

Theorem (mixing time as convergence rate). For any starting distribution $\rho^{(0)}$ ,
$\norm{\rho^{(n)} - \pi}_{\rm TV} \le e^{-\lfloor n / \tau_{\rm mix}\rfloor}.$
Consequently, if $n\ge \tau_{\rm mix}\cdot \lceil \log(1/\epsilon)\rceil$ , then $\norm{\rho^{(n)} - \pi}_{\rm TV} \le \varepsilon$ .

This result motivates us to develop techniques to bound the mixing time $\tau_{\rm mix}$ .

Couplings and Convergence

In the previous post, we made essential use of couplings of probability distributions. In this post, we will extend this notion by using couplings of entire Markov chains.

Definition (coupling of Markov chains). A coupling of Markov chains with transition matrix $P$ are two processes $x_0,x_1,x_2,\ldots$ and $y_0,y_1,y_2,\ldots$ such that (A) each process is individually a Markov chain with transition matrix $P$ : In particular,
$\prob\{x_{n+1} = j \mid x_n = i \} = \prob\{y_{n+1} = j \mid y_n = i \} = P_{ij}$
and (B) if $x_n = y_n$ , then $x_{n+1} = y_{n+1}$ .

A coupling of Markov chains thus consists of a pair of Markov chains which become glued together should they ever touch.

There are several ways we can use couplings to bound the mixing time of Markov chains. Here is one way. Let $x_0,x_1,x_2,\ldots$ and $y_0,y_1,y_2,\ldots$ be a coupling of Markov chains for which $x_0$ is initialized in some initial distribution $x_0 \sim \rho^{(0)}$ and $y_0$ is initialized in the stationary distribution $y_0 \sim \pi$ . Let $\tau_{\rho^{(0)}}$ be the time at which $x$ and $y$ meet:

$\tau_{\rho^{(0)}} \coloneqq \min\{ n \ge 0 : x_n = y_n\}.$

The following theorem shows that the tails of the random variable $\tau$ control the convergence of the Markov chain $x_0,x_1,x_2,\ldots$ to stationarity.

Theorem (convergence from couplings). Then $\norm{\rho^{(n)} - \pi}_{\rm TV} \le \prob\{\tau_{\rho^{(0)}} > n\}$ .

This result is an immediate application of the coupling lemma from last post. Using this bound, we can bound the mixing time

Corollary (mixing time from couplings). $\tau_{\rm mix} \le 2e \max_{\rho^{(0)}} \expect[\tau_{\rho^{(0)}}]$ .

The maximum is taken over all initial distributions $\rho^{(0)}$ .

Indeed, by the above theorem and Markov’s inequality,¹

$\norm{\rho^{(n)} - \pi}_{\rm TV} \le \prob\{\tau_{\rho^{(0)}} > n\}\le \frac{\expect[\tau_{\rho^{(0)}}]}{n}.$

Take a maximum over initial distribution $\rho^{(0)}$ and set $n \coloneqq \tau_{\rm mix}$ . Then the left-hand side is at most $1/2e$ , so

$\frac{1}{2e} \le \max_{\rho^{(0)}}\norm{\rho^{(\tau_{\rm mix})} - \pi}_{\rm TV} = \prob\{\tau_{\rho^{(0)}} > n\} \le \frac{\max_{\rho^{(0)}}\expect[\tau_{\rho^{(0)}}]}{\tau_{\rm mix}}.$

Rearranging gives the corollary.

Example: Bit Strings

There are a number of examples² to prove bounds on mixing times using the method of couplings, but we will focus on a simple but illustrative example: a lazy random walk on the set of bit strings.

A bit string is a sequence of 0’s and 1’s. Bit strings are used to store and represent information on digital computers. For instance, the character “a” is stored as “1100001” using the ASCII encoding.

The Chain

Our example is a Markov chain whose states consist of all bit strings of length $N$ . For each step of the chain,

With 50% probability, we do nothing and leave the state unchanged.
With 50% probability, we choose a (uniform) random position from $1$ to $N$ and flip the bit in that position, changing 0 to 1 or 1 to 0.

One can think of this Markov chain as modeling a random process in which a bit string is corrupted by errors which flip a single bit. The stationary distribution for this chain is the uniform distribution on all bit strings (i.e., each bit string of length $n$ has the same probability $2^{-n}$ ). Therefore, one can also think of this Markov chain as a very inefficient way of generating a string of random bits.³

Designing the Coupling

Let us use the method of couplings to determine how fast this chain mixes. First, observe that there’s an alternative way of stating the transition rule for this Markov chain:

Pick a (uniform) random position from $1$ to $N$ and set the bit in that position to $0$ with 50% probability and $1$ with 50% probability.

Indeed, under this rule, half of the time we leave the state unchanged because we set the bit in the selected position to the value it already had. For the other half of the time, we flip the selected bit.

Now, we construct a coupling. Initialize $x_0$ in an arbitrary distribution $\rho^{(0)}$ and $y_0$ in the stationary distribution (uniform over all bit strings). The key to construct a coupling will be to use the same randomness to update the state of both the “ $x$ ” chain and the “ $y$ ” chain. For this chain, there’s a natural way to do this.

At time $n$ , pick a (uniform) random position $1\le p_n \le N$ and a (uniform) random bit $b_n \in \{0,1\}$ . Set $x_{n+1}$ by changing the $p_n$ th bit of $x_n$ to $b_n$ , and set $y_{n+1}$ by changing the $p_n$ th bit of $y_n$ to $b_n$ .

We couple the two chains by setting the same position $p_n$ to the same bit $b_n$ .

Bounding the Mixing Time

To bound the mixing time, we need to understand when these two chains meet. Observe that if we pick position $p$ to set at some point, the two Markov chains have the same bit in position $p$ at every time later in the chain. Consequently,

If all of the positions $1,\ldots,N$ have been chosen by time $n$ , then $x_n = y_n$ .

The two chains might meet before all of the positions have been chosen, but they are guaranteed to meet after.

Set $\tau_{\rm pick}$ to be the time at which all positions $1,\ldots,N$ have been selected at least once. Then our observation is equivalent to saying that the meeting time $\tau_{\rho^{(0)}}$ is smaller than $\tau_{\rm pick}$ ,

$\tau_{\rho^{(0)}} \le \tau_{\rm pick}.$

Thus, to bound the mixing time, it is sufficient to compute the expected value of $\tau_{\rm pick}$ .

Collecting Coupons

Computing the expected value of $\tau_{\rm pick}$ is actually an example of a very famous problem in probability theory known as the coupon collector problem. The classic version goes like this:

A cereal company is running a promotion where each box of cereal contains a coupon, which is equally likely to be any of one of $N$ different colors. A coupon of each color can be traded in for a toy. What is the expected number of boxes we need to open in order to qualify for the toy?

The coupon collector is exactly the same as our problem, with the $N$ different colors of coupons playing the same role as the $N$ different positions in our bit strings. Going forward, let’s use the language of coupons and boxes to describe this problem, as its more colorful.

When tackling a hard question like computing $\expect[\tau_{\rm pick}]$ , it can be helpful to break into easier pieces. Let’s start with the simplest possible question: How many picks do we need to collect the first coupon? Just one. We always get a coupon in our first box.

With that question answered, let’s ask the next-simplest question: How many picks do we need to collect the second coupon, differently colored than the first? There are $N$ colors and $N-1$ of them are different from the first. So the probability of picking a new coupon in the second box is $(N-1)/N$ . In fact,

Until we succeed at picking the second unique coupon, every box has an independent $(N-1)/N$ chance of containing such a coupon.

The number of boxes needed is therefore a geometric random variable with success probability $p = (N-1)/N$ , and its mean is $1/p = N/(N-1)$ . On average, we need $N/(N-1)$ picks to collect the second coupon.

The reasoning is the same for the third coupon. There are $N-2$ coupons we haven’t collected and $N$ total, so the probability of picking the third coupon in each box is $(N-2)/N$ . Thus, the number of picks is a geometric random variable with success probability $(N-2)/N$ with mean $N/(N-2)$ .

In general, we need an expected $N / (N-k)$ picks to pick the $k$ th coupon. Adding up the expected number of picks for each $k = 1,2,\ldots,N$ , we obtain that the total number of picks required is to collect all the coupons is

$\expect[\tau_{\rm pick}] = 1 + \frac{N}{N-1} + \frac{N}{N-2} + \cdots + \frac{N}{N-(N-1)} = N \left( \frac{1}{N} + \frac{1}{N-1} + \cdots + \frac{1}{2} + 1 \right).$