Markov Musings 2: Couplings

This post is part of a series, Markov Musings, about the mathematical analysis of Markov chains. See here for the first post in the series.


In the last post, we saw a proof of the fundamental theorem of Markov chains using the method of couplings. In this post, we will use couplings to provide quantitative bounds on the rate of Markov chain convergence.

Mixing Time

Let us begin where the last post ended by recalling some facts about the mixing time of a Markov chain. Consider a Markov chain x_0,x_1,x_2,\ldots with stationary distribution \pi. Recall that the mixing time is

    \[\tau_{\rm mix} \coloneqq \min \left\{ n \ge 1 : \max_{\rho^{(0)}} \norm{\rho^{(n)} - \pi}_{\rm TV} \le \frac{1}{2e} \right\}.\]


Here, \rho^{(n)} denotes the distribution of the state x_n of the chain at time n and \norm{\cdot}_{\rm TV} is the total variation distance. For more on the total variation distance, see the previous post.

At the end of the last post, we saw the following theorem, showing that the mixing time controls the rate of Markov chain convergence.

Theorem (mixing time as convergence rate). For any starting distribution \rho^{(0)},

    \[\norm{\rho^{(n)} - \pi}_{\rm TV} \le e^{-\lfloor n / \tau_{\rm mix}\rfloor}.\]

Consequently, if n\ge \tau_{\rm mix}\cdot \lceil \log(1/\varepsilon)\rceil, then \norm{\rho^{(n)} - \pi}_{\rm TV} \le \varepsilon.

This result motivates us to develop techniques to bound the mixing time \tau_{\rm mix}.
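Before doing so, it may help to see these definitions in action. The following is a minimal sketch (assuming NumPy is available; the 3-state transition matrix is made up purely for illustration) that computes \rho^{(n)}, the stationary distribution \pi, and the total variation distance, and reports the first n at which the worst-case distance drops below 1/2e.

    import numpy as np

    # A small, made-up transition matrix for illustration (rows sum to 1).
    P = np.array([[0.5, 0.3, 0.2],
                  [0.2, 0.6, 0.2],
                  [0.1, 0.3, 0.6]])

    # Stationary distribution: left eigenvector of P with eigenvalue 1.
    evals, evecs = np.linalg.eig(P.T)
    pi = np.real(evecs[:, np.argmin(np.abs(evals - 1))])
    pi /= pi.sum()

    def tv_distance(p, q):
        """Total variation distance between two probability vectors."""
        return 0.5 * np.abs(p - q).sum()

    # The worst-case starting distribution is a point mass, so it suffices
    # to check each standard basis vector as rho^(0).
    def worst_case_tv(n):
        Pn = np.linalg.matrix_power(P, n)
        return max(tv_distance(Pn[i], pi) for i in range(len(P)))

    # Mixing time: first n with worst-case TV distance <= 1/(2e).
    n = 1
    while worst_case_tv(n) > 1 / (2 * np.e):
        n += 1
    print("mixing time:", n)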

Couplings and Convergence

In the previous post, we made essential use of couplings of probability distributions. In this post, we will extend this notion by using couplings of entire Markov chains.

Definition (coupling of Markov chains). A coupling of Markov chains with transition matrix P is a pair of processes x_0,x_1,x_2,\ldots and y_0,y_1,y_2,\ldots such that (A) each process is individually a Markov chain with transition matrix P: in particular,

    \[\prob\{x_{n+1} = j \mid x_n = i \} = \prob\{y_{n+1} = j \mid y_n = i \} = P_{ij}\]

and (B) if x_n = y_n, then x_{n+1} = y_{n+1}.

A coupling of Markov chains thus consists of a pair of Markov chains which become glued together should they ever touch.
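One concrete way to build such a coupling, for any transition matrix, is to run the two chains independently until they first meet and have them take identical steps afterwards. Here is a minimal sketch (assuming NumPy; the function names are mine, not standard):

    import numpy as np

    rng = np.random.default_rng(0)

    def step(state, P):
        """Take one step of the Markov chain with transition matrix P."""
        return rng.choice(len(P), p=P[state])

    def coupled_step(x, y, P):
        """One step of a coupling: independent moves until the chains
        meet, identical moves afterwards."""
        if x == y:
            z = step(x, P)              # glued: a single shared move
            return z, z
        return step(x, P), step(y, P)   # independent moves

Each chain individually follows P (condition A), and once the chains agree they take the same moves ever after (condition B).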

There are several ways we can use couplings to bound the mixing time of Markov chains. Here is one way. Let x_0,x_1,x_2,\ldots and y_0,y_1,y_2,\ldots be a coupling of Markov chains for which x_0 is initialized in some initial distribution x_0 \sim \rho^{(0)} and y_0 is initialized in the stationary distribution y_0 \sim \pi. Let \tau_{\rho^{(0)}} be the time at which x and y meet:

    \[\tau_{\rho^{(0)}} \coloneqq \min\{ n \ge 0 : x_n = y_n\}.\]

The following theorem shows that the tails of the random variable \tau_{\rho^{(0)}} control the convergence of the Markov chain x_0,x_1,x_2,\ldots to stationarity.

Theorem (convergence from couplings). For every n \ge 0, \norm{\rho^{(n)} - \pi}_{\rm TV} \le \prob\{\tau_{\rho^{(0)}} > n\}.

This result is an immediate application of the coupling lemma from the last post. Using this result, we can bound the mixing time.

Corollary (mixing time from couplings). \tau_{\rm mix} \le 2e \max_{\rho^{(0)}} \expect[\tau_{\rho^{(0)}}].

The maximum is taken over all initial distributions \rho^{(0)}.

Indeed, by the above theorem and Markov’s inequality (for more on concentration inequalities such as Markov’s inequality, see my post on the subject),

    \[\norm{\rho^{(n)} - \pi}_{\rm TV} \le \prob\{\tau_{\rho^{(0)}} > n\}\le \frac{\expect[\tau_{\rho^{(0)}}]}{n}.\]

Take a maximum over the initial distribution \rho^{(0)}. If n \ge 2e \max_{\rho^{(0)}} \expect[\tau_{\rho^{(0)}}], then the right-hand side is at most 1/2e, so

    \[\max_{\rho^{(0)}}\norm{\rho^{(n)} - \pi}_{\rm TV} \le \frac{\max_{\rho^{(0)}}\expect[\tau_{\rho^{(0)}}]}{n} \le \frac{1}{2e}.\]

Since the mixing time is the smallest n for which the left-hand side is at most 1/2e, the corollary follows.
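As a sanity check on the corollary, the following sketch (assuming NumPy; it reuses the made-up 3-state chain from the earlier snippet) estimates \max_{\rho^{(0)}} \expect[\tau_{\rho^{(0)}}] by simulating an independent coupling from each starting state and compares 2e times that estimate with the mixing time computed directly from matrix powers.

    import numpy as np

    rng = np.random.default_rng(0)

    P = np.array([[0.5, 0.3, 0.2],
                  [0.2, 0.6, 0.2],
                  [0.1, 0.3, 0.6]])
    k = len(P)

    # Stationary distribution: left eigenvector of P with eigenvalue 1.
    evals, evecs = np.linalg.eig(P.T)
    pi = np.real(evecs[:, np.argmin(np.abs(evals - 1))])
    pi /= pi.sum()

    def meeting_time(x0):
        """Meeting time of an independent coupling with x_0 = x0, y_0 ~ pi."""
        x, y, t = x0, rng.choice(k, p=pi), 0
        while x != y:
            x, y, t = rng.choice(k, p=P[x]), rng.choice(k, p=P[y]), t + 1
        return t

    # Monte Carlo estimate of the worst-case expected meeting time.
    max_expected_tau = max(np.mean([meeting_time(x0) for _ in range(2000)])
                           for x0 in range(k))

    # Mixing time computed directly from matrix powers.
    n = 1
    while max(0.5 * np.abs(np.linalg.matrix_power(P, n)[i] - pi).sum()
              for i in range(k)) > 1 / (2 * np.e):
        n += 1

    print("mixing time:", n, "   coupling bound:", 2 * np.e * max_expected_tau)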

Example: Bit Strings

There are a number of examples of using the method of couplings to prove bounds on mixing times (see, for instance, section 5.3 of Markov Chains and Mixing Times), but we will focus on a simple but illustrative example: a lazy random walk on the set of bit strings.

A bit string is a sequence of 0’s and 1’s. Bit strings are used to store and represent information on digital computers. For instance, the character “a” is stored as “1100001” using the ASCII encoding.

The Chain

Our example is a Markov chain whose states consist of all bit strings of length N. For each step of the chain,

  • With 50% probability, we do nothing and leave the state unchanged.
  • With 50% probability, we choose a (uniform) random position from 1 to N and flip the bit in that position, changing 0 to 1 or 1 to 0.

One can think of this Markov chain as modeling a random process in which a bit string is corrupted by errors which flip a single bit. The stationary distribution for this chain is the uniform distribution on all bit strings (i.e., each bit string of length N has the same probability 2^{-N}). Therefore, one can also think of this Markov chain as a very inefficient way of generating a string of random bits. (Another way of thinking about this Markov chain is as a random walk on the vertices of an N-dimensional hypercube graph: with 50% probability, we stay put, and with 50% probability, we move along a random edge.) Because we have a 50% probability of doing nothing, we call this process a lazy random walk.
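Here is a minimal sketch of one step of this chain (assuming NumPy, with the bit string stored as an array of 0s and 1s):

    import numpy as np

    rng = np.random.default_rng(0)

    def lazy_step(x):
        """One step of the lazy random walk on bit strings."""
        x = x.copy()
        if rng.random() < 0.5:       # with 50% probability, do nothing
            return x
        p = rng.integers(len(x))     # otherwise, pick a uniform position...
        x[p] ^= 1                    # ...and flip the bit there
        return x

    x = np.zeros(20, dtype=int)      # start at the all-zeros string
    for _ in range(100):
        x = lazy_step(x)
    print("".join(map(str, x)))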

Designing the Coupling

Let us use the method of couplings to determine how fast this chain mixes. First, observe that there’s an alternative way of stating the transition rule for this Markov chain:

  • Pick a (uniform) random position from 1 to N and set the bit in that position to 0 with 50% probability and 1 with 50% probability.

Indeed, under this rule, half of the time we leave the state unchanged because we set the bit in the selected position to the value it already had. For the other half of the time, we flip the selected bit.
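We can check this equivalence numerically. The sketch below (assuming NumPy; the names are mine) simulates one step from a fixed state under both rules and compares the empirical distributions of the next state, which should agree up to sampling error.

    import numpy as np
    from collections import Counter

    rng = np.random.default_rng(0)
    N, trials = 3, 200_000
    x = (1, 0, 1)                    # an arbitrary starting string

    def rule_flip(x):
        """Original rule: do nothing w.p. 1/2, else flip a random bit."""
        if rng.random() < 0.5:
            return x
        p = rng.integers(N)
        return tuple(b ^ 1 if i == p else b for i, b in enumerate(x))

    def rule_set(x):
        """Alternative rule: set a random position to a random bit."""
        p, b = rng.integers(N), int(rng.integers(2))
        return tuple(b if i == p else bi for i, bi in enumerate(x))

    for rule in (rule_flip, rule_set):
        counts = Counter(rule(x) for _ in range(trials))
        print(rule.__name__, {s: round(c / trials, 3) for s, c in sorted(counts.items())})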

Now, we construct a coupling. Initialize x_0 in an arbitrary distribution \rho^{(0)} and y_0 in the stationary distribution (uniform over all bit strings). The key to constructing a coupling will be to use the same randomness to update the state of both the “x” chain and the “y” chain. For this chain, there’s a natural way to do this.

At time n, pick a (uniform) random position 1\le p_n \le N and a (uniform) random bit b_n \in \{0,1\}. Set x_{n+1} by changing the p_nth bit of x_n to b_n, and set y_{n+1} by changing the p_nth bit of y_n to b_n.

We couple the two chains by setting the same position p_n to the same bit b_n.
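In code, the coupled update and the resulting meeting time might look like the following sketch (assuming NumPy; the all-zeros start for the x chain is an arbitrary choice):

    import numpy as np

    rng = np.random.default_rng(0)

    def coupled_step(x, y):
        """Update both chains using the same random position and bit."""
        p = rng.integers(len(x))     # shared random position p_n
        b = rng.integers(2)          # shared random bit b_n
        x, y = x.copy(), y.copy()
        x[p], y[p] = b, b
        return x, y

    N = 20
    x = np.zeros(N, dtype=int)       # arbitrary start for the "x" chain
    y = rng.integers(2, size=N)      # uniform (stationary) start for "y"
    t = 0
    while not np.array_equal(x, y):
        x, y = coupled_step(x, y)
        t += 1
    print("the chains met after", t, "steps")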

Bounding the Mixing Time

To bound the mixing time, we need to understand when these two chains meet. Observe that if position p is selected at some time, then the two Markov chains have the same bit in position p at every later time. Consequently,

If all of the positions 1,\ldots,N have been chosen by time n, then x_n = y_n.

The two chains might meet before all of the positions have been chosen, but they are guaranteed to meet after.

Set \tau_{\rm pick} to be the time at which all positions 1,\ldots,N have been selected at least once. Then our observation is equivalent to saying that the meeting time \tau_{\rho^{(0)}} is at most \tau_{\rm pick},

    \[\tau_{\rho^{(0)}} \le \tau_{\rm pick}.\]

Thus, to bound the mixing time, it is sufficient to compute the expected value of \tau_{\rm pick}.

Collecting Coupons

Computing the expected value of \tau_{\rm pick} is actually an example of a very famous problem in probability theory known as the coupon collector problem. The classic version goes like this:

A cereal company is running a promotion where each box of cereal contains a coupon, which is equally likely to be any one of N different colors. A full set of coupons, one of each color, can be traded in for a toy. What is the expected number of boxes we need to open in order to qualify for the toy?

The coupon collector problem is exactly the same as our problem, with the N different colors of coupons playing the same role as the N different positions in our bit strings. Going forward, let’s use the language of coupons and boxes to describe this problem, as it’s more colorful.

When tackling a hard question like computing \expect[\tau_{\rm pick}], it can be helpful to break it into easier pieces. Let’s start with the simplest possible question: How many picks do we need to collect the first coupon? Just one. We always get a coupon in our first box.

With that question answered, let’s ask the next-simplest question: How many picks do we need to collect a second coupon, colored differently from the first? There are N colors and N-1 of them are different from the first. So the probability of picking a new color in the second box is (N-1)/N. In fact,

Until we succeed at picking the second unique coupon, every box has an independent (N-1)/N chance of containing such a coupon.

The number of boxes needed is therefore a geometric random variable with success probability p = (N-1)/N, and its mean is 1/p = N/(N-1). On average, we need N/(N-1) picks to collect the second coupon.

The reasoning is the same for the third coupon. There are N-2 colors we haven’t collected and N total, so each box has an (N-2)/N chance of containing a new color. Thus, the number of picks is a geometric random variable with success probability (N-2)/N and mean N/(N-2).

In general, we need an expected N / (N-k+1) picks to collect the kth distinct coupon. Adding up the expected number of picks for each k = 1,2,\ldots,N, we obtain that the expected total number of picks required to collect all the coupons is

    \[\expect[\tau_{\rm pick}] = 1 + \frac{N}{N-1} + \frac{N}{N-2} + \cdots + \frac{N}{N-(N-1)} = N \left( \frac{1}{N} + \frac{1}{N-1} + \cdots + \frac{1}{2} + 1 \right).\]

More concisely, if we let H_N denote the Nth harmonic number

    \[H_N = 1 + \frac{1}{2} + \cdots + \frac{1}{N}\]

then \expect[\tau_{\rm pick}] = NH_N. Since the Nth harmonic number satisfies H_N \le \log(N) + 1, we have

    \[\expect[\tau_{\rm pick}] \le N \log (N) + N.\]
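As a quick numerical check (a sketch assuming NumPy), we can compare the exact value N H_N, the bound N \log(N) + N, and a Monte Carlo estimate of \expect[\tau_{\rm pick}]:

    import numpy as np

    rng = np.random.default_rng(0)
    N = 50

    def coupon_collector_time():
        """Number of uniform draws among N values until all N appear."""
        seen, picks = set(), 0
        while len(seen) < N:
            seen.add(int(rng.integers(N)))
            picks += 1
        return picks

    exact = N * sum(1 / k for k in range(1, N + 1))          # N * H_N
    estimate = np.mean([coupon_collector_time() for _ in range(2000)])

    print("N * H_N       =", round(exact, 1))
    print("Monte Carlo   =", round(estimate, 1))
    print("N log(N) + N  =", round(N * np.log(N) + N, 1))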

Conclusion

Putting all of these ingredients together, we obtain

    \[\tau_{\rm mix} \le 2e \max_{\rho^{(0)}} \expect[\tau_{\rho^{(0)}}] \le 2e \, \expect[\tau_{\rm pick}] \le 2e \, N \log(N) + 2e \, N.\]

The mixing time is (at most) proportional to N\log N.

More on Bit Strings and Collecting Coupons

Using more sophisticated techniques, it can be shown that the random variable \tau_{\rm pick} satisfies

    \[\prob\{\tau_{\rm pick} > N \log N + cN\} \le e^{-c}\]

for every N\ge 1 and c\ge 0, from which it follows that

    \[\norm{\rho^{(n)} - \pi}_{\rm TV} \le \varepsilon \quad \text{provided that} \quad n \ge N \log N + N \log(1/\varepsilon).\]

Using a more sophisticated analysis—also based on couplings—we can improve this result to

    \[\norm{\rho^{(n)} - \pi}_{\rm TV} \le \varepsilon \quad \text{provided that} \quad n \ge \frac{1}{2} N \log N + c \,N \log(1/\varepsilon)\]

for some constant c > 0. The constant 1/2 in this inequality cannot be improved. See section 5.3.1 of Markov Chains and Mixing Times.
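The tail bound above is also easy to probe by simulation. The sketch below (assuming NumPy) estimates \prob\{\tau_{\rm pick} > N \log N + cN\} for a few values of c and prints it next to the bound e^{-c}:

    import numpy as np

    rng = np.random.default_rng(0)
    N, trials = 50, 5000

    def coupon_collector_time():
        """Number of uniform draws among N values until all N appear."""
        seen, picks = set(), 0
        while len(seen) < N:
            seen.add(int(rng.integers(N)))
            picks += 1
        return picks

    times = np.array([coupon_collector_time() for _ in range(trials)])
    for c in [0.5, 1.0, 2.0]:
        empirical = np.mean(times > N * np.log(N) + c * N)
        print(f"c = {c}: empirical tail {empirical:.3f}, bound exp(-c) = {np.exp(-c):.3f}")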
