Big Ideas in Applied Math: Markov Chains

In this post, we’ll talk about Markov chains, a useful and general model of a random system evolving in time.

PageRank

To see how Markov chains can be useful in practice, we begin our discussion with the famous PageRank problem. The goal is to assign a numerical ranking to each website on the internet measuring how important it is. To do this, we form a mathematical model of an internet user randomly surfing the web. The importance of each website will be measured by the number of times this user visits each page.

The PageRank model of an internet user is as follows: Start the user at an arbitrary initial website $x_0$ . At each step, the user makes one of two choices:

With 85% probability, the user follows a random link on their current website.
With 15% probability, the user gets bored and jumps to a random website selected from the entire internet.

As with any mathematical model, this is a ridiculously oversimplified description of how a person would surf the web. However, like any good mathematical model, it is useful. Because of the way the model is designed, the user will spend more time on websites with many incoming links. Thus, websites with many incoming links will be rated as important, which seems like a sensible choice.
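The model is simple enough to simulate directly. Here is a minimal sketch in Python. The four-page link structure and the function name `simulate_surfer` are invented for illustration, and the sketch assumes every page has at least one outgoing link.

```python
import random

def simulate_surfer(links, num_steps, follow_prob=0.85, seed=0):
    """Simulate the random surfer; return the fraction of time spent on each page.

    `links[i]` lists the pages that page i links to (hypothetical toy data).
    Assumes every page has at least one outgoing link.
    """
    rng = random.Random(seed)
    num_pages = len(links)
    visits = [0] * num_pages
    page = 0                                   # arbitrary initial website x_0
    for _ in range(num_steps):
        visits[page] += 1
        if rng.random() < follow_prob:
            page = rng.choice(links[page])     # follow a random link
        else:
            page = rng.randrange(num_pages)    # get bored, jump anywhere
    return [count / num_steps for count in visits]

# Toy internet: pages 1, 2, and 3 all link to page 0; page 0 links only to page 1.
links = [[1], [0], [0], [0]]
frequencies = simulate_surfer(links, num_steps=100_000)
print([round(f, 3) for f in frequencies])
```

In this toy internet, page 0 has the most incoming links, so the surfer should spend the most time there. Page 1 should come in a close second even though only page 0 links to it.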

An example of the PageRank distribution for a small internet is shown below. As one would expect, the surfer spends a large part of their time on website B, which has many incoming links. Interestingly, the user spends almost as much of their time on website C, whose only links are to and from B. Under the PageRank model, a website is important if it is linked to by an important website, even if that is the only website linking to it.

Markov Chains in General

Having seen one Markov chain, the PageRank internet surfer, let’s talk about Markov chains in general. A (time-homogeneous) Markov chain consists of two things: a set of states and probabilities for transitioning between states:

Set of states. For this discussion, we limit ourselves to Markov chains which can only exist in finitely many different states. To simplify our discussion, label the possible states using numbers $1,2,\ldots,m$ .
Transition probabilities. The defining property of a (time-homogeneous) Markov chain is that, at any point in time $n$ , if the state is $i$ , the probability of moving to state $j$ is a fixed number $P_{ij}$ . In particular, the probability $P_{ij}$ of moving from $i$ to $j$ does not depend on the time $n$ or the past history of the chain before time $n$ ; only the value of the chain at time $n$ matters.

Denote the state of the Markov chain at times $0,1,2,\ldots$ by $x_0,x_1,x_2,\ldots$ . Note that the states $x_0,x_1,x_2,\ldots$ are random quantities. We can write the Markov chain property using the language of conditional probability:

$\mathbb{P} \{ x_{n+1} = j \mid x_n = i,x_{n-1}=a_{n-1},\ldots,x_0=a_0\} = \mathbb{P}\{x_{n+1} = j \mid x_n = i\} = P_{ij}.$

This equation states that the probability that the system is in state $j$ at time $n+1$ given the entire history of the system depends only on the value $x_n = i$ of the chain at time $n$ . This probability is the transition probability $P_{ij}$ .

Let’s see how the PageRank internet surfer fits into this model:

Set of states. Here, the set of states are the websites, which we label $1,\ldots,m$ .
Transition probabilities. Consider two websites $i$ and $j$ . If $i$ does not have a link to $j$ , then the only way of going from $i$ to $j$ is if the surfer randomly gets bored (probability 15%) and picks website $j$ to visit at random (probability $1/m$ ). Thus,
( $i\not\to j$ ) $P_{ij} = \frac{0.15}{m}.$
Suppose instead that $i$ does link to $j$ and $i$ has $d_i$ outgoing links. Then, in addition to the $0.15/m$ probability computed before, the user at website $i$ has an 85% chance of following a link and a $1/d_i$ chance of picking $j$ as that link. Thus,
( $i\to j$ ) $P_{ij} = \frac{0.85}{d_i} + \frac{0.15}{m}.$
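The two formulas above translate directly into code. The following sketch builds the transition matrix as a plain list of rows. The name `pagerank_matrix` and the toy link structure are assumptions made for illustration, and the code assumes each page lists its outgoing links without repeats and has at least one of them.

```python
def pagerank_matrix(links, follow_prob=0.85):
    """Build the PageRank transition matrix P as a list of rows.

    `links[i]` lists the distinct pages that page i links to.
    Assumes every page has at least one outgoing link (d_i >= 1).
    """
    m = len(links)
    jump = (1.0 - follow_prob) / m            # the 0.15/m term, present in every entry
    P = [[jump] * m for _ in range(m)]
    for i, outgoing in enumerate(links):
        for j in outgoing:
            P[i][j] += follow_prob / len(outgoing)   # the 0.85/d_i term
    return P

# Hypothetical toy internet: pages 1, 2, 3 link to page 0; page 0 links to page 1.
links = [[1], [0], [0], [0]]
P = pagerank_matrix(links)
for row in P:
    print([round(p, 4) for p in row], "row sum:", round(sum(row), 4))
```

Each row sums to one, as it must: from any website, the surfer goes somewhere.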

Markov Chains and Linear Algebra

For a non-random process $y_0,y_1,y_2,\ldots$ , we can understand the process's evolution by determining its state $y_n$ at every point in time $n$ . Since Markov chains are random processes, it is not enough to track the state $x_n$ of the process at every time $n$ . Rather, we must understand the probability distribution of the state $x_n$ at every point in time $n$ .

It is customary in Markov chain theory to represent a probability distribution on the states $\{1,\ldots,m\}$ by a row vector $\rho^\top$ .¹ The $i$ th entry $\rho_i$ stores the probability that the system is in state $i$ . Naturally, as $\rho^\top$ is a probability distribution, its entries must be nonnegative ( $\rho_i \ge 0$ for every $i$ ) and add to one ( $\sum_{i=1}^m \rho_i = 1$ ).

Let $(\rho^{(0)})^\top, (\rho^{(1)})^\top,\ldots$ denote the probability distributions of the states $x_0,x_1,\ldots$ . It is natural to ask: How are the distributions $(\rho^{(0)})^\top, (\rho^{(1)})^\top,\ldots$ related to each other? Let’s answer this question.

The probability that $x_{n+1}$ is in state $j$ is the $j$ th entry of $\rho^{(n+1)}$ :

$\rho^{(n+1)}_j = \mathbb{P} \{x_{n+1} = j\}$

To compute this probability, we break into cases based on the value of the process at time $n$ : either $x_n = 1$ or $x_n = 2$ or … or $x_n = m$ ; only one of these cases can be true at once. When we have an “or” of random events and these events are mutually exclusive (only one can be true at once), then the probabilities add:

$\rho^{(n+1)}_j = \mathbb{P} \{x_{n+1} = j\} = \sum_{i=1}^m \mathbb{P} \{x_{n+1} = j, x_n = i\}.$

Now we use some conditional probability. The probability that $x_{n+1} = j$ and $x_n = i$ is the probability that $x_n = i$ times the probability that $x_{n+1} = j$ conditional on $x_n = i$ . That is,

$\rho^{(n+1)}_j = \sum_{i=1}^m \mathbb{P} \{x_{n+1} = j, x_n = i\} = \sum_{i=1}^m \mathbb{P} \{x_n = i\} \mathbb{P}\{x_{n+1} = j \mid x_n = i\}.$

Now, we can simplify using our definitions. The probability that $x_n = i$ is just $\rho^{(n)}_i$ and the probability of moving from $i$ to $j$ is $P_{ij}$ . Thus, we conclude

$\rho_j^{(n+1)} = \sum_{i=1}^m \rho^{(n)}_i P_{ij} .$

Phrased in the language of linear algebra, we’ve shown

$\left(\rho^{(n+1)}\right)^\top = \left(\rho^{(n)}\right)^\top P \quad \text{for any } n = 0,1,2,\ldots.$

That is, if we view the transition probabilities $P_{ij}$ as comprising an $m\times m$ matrix $P$ , then the distribution at time $n+1$ is obtained by multiplying the distribution at time $n$ by transition matrix $P$ . In particular, if we iterate this result, we obtain that the distribution at time $n$ is given by

$\left(\rho^{(n)}\right)^\top = \left(\rho^{(n-1)}\right)^\top P = \left[\left(\rho^{(n-2)}\right)^\top P\right]P = \left(\rho^{(n-2)}\right)^\top P^2 = \cdots = \left(\rho^{(0)}\right)^\top P^n.$

Thus, the distribution at time $n$ is the distribution at time $0$ multiplied by the $n$ th power of the transition matrix $P$ .
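As a small illustration, here is a sketch that evolves a distribution by repeated vector-matrix products. The two-state chain and the function names `step` and `distribution_at` are made up for this example.

```python
def step(rho, P):
    """One step of the chain: return the row vector rho^T P."""
    m = len(P)
    return [sum(rho[i] * P[i][j] for i in range(m)) for j in range(m)]

def distribution_at(rho0, P, n):
    """Distribution at time n, obtained by applying `step` n times."""
    rho = list(rho0)
    for _ in range(n):
        rho = step(rho, P)
    return rho

# Hypothetical two-state chain.
P = [[0.9, 0.1],
     [0.5, 0.5]]
rho0 = [1.0, 0.0]          # start in the first state with certainty
for n in range(4):
    print(n, [round(p, 4) for p in distribution_at(rho0, P, n)])
```

Applying $n$ vector-matrix products gives the same answer as forming $P^n$ first, and it is cheaper when we only care about a single initial distribution.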

Convergence to Stationarity

Let’s go back to our web surfer again. At time $0$ , we started our surfer at a particular website, say $1$ . As such, the probability distribution² $\rho^{(0)}$ at time $0$ is concentrated just on website $1$ , with no other website having any probability at all. In the first few steps, most of the probability will remain in the vicinity of website $1$ , in the websites linked to by $1$ and the websites linked to by the websites linked to by $1$ and so on. However, as we run the chain long enough, the surfer will have time to wander widely across the web and the probability distribution will become less and less influenced by the chain’s starting location. This motivates the following definition:

Definition. A Markov chain satisfies the mixing property if the probability distributions $\rho^{(0)}, \rho^{(1)}, \ldots$ converge to a single fixed probability distribution $\pi$ regardless of how the chain is initialized (i.e., independent of the starting distribution $\rho^{(0)}$ ).

The distribution $\pi$ for a mixing Markov chain is known as a stationary distribution because it does not change under the action of $P$ :

(St) $\pi^\top = \pi^\top P.$

To see this, recall the recurrence

$\left(\rho^{(n+1)}\right)^\top = \left(\rho^{(n)}\right)^\top P,$

take the limit as $n\to\infty$ , and observe that both $(\rho^{(n+1)})^\top$ and $(\rho^{(n)})^\top$ converge to $\pi^\top$ .

One of the basic questions in the theory of Markov chains is finding conditions under which the mixing property (or a suitable weaker version of it) holds. To answer this question, we will need the following definition:

A Markov chain is primitive if, after running the chain for some number $n$ of steps, the chain has positive probability of moving between any two states. That is,
$\text{There exists $n$ such that, for any $i,j = 1,2,\ldots,m$, } \quad\mathbb{P}\{x_n = j \mid x_0 = i \} = (P^n)_{ij} > 0.$
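For a small chain, primitivity can be checked by brute force. The sketch below looks for the smallest power $n$ for which $P^n$ has all positive entries. The function names are hypothetical. The search stops at $n = (m-1)^2 + 1$, relying on a classical bound due to Wielandt that says a primitive $m\times m$ matrix must have an all-positive power by then.

```python
def matmul(A, B):
    """Product of two square matrices stored as lists of rows."""
    m = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(m)]
            for i in range(m)]

def primitive_exponent(P):
    """Return the smallest n with every entry of P^n positive, or None if none exists.

    Only the pattern of zeros matters, so we track 0/1 matrices instead of
    the actual probabilities. This avoids underflow in high powers.
    """
    m = len(P)
    pattern = [[1 if p > 0 else 0 for p in row] for row in P]
    power = pattern                                   # zero pattern of P^1
    for n in range(1, (m - 1) ** 2 + 2):              # Wielandt's bound on n
        if all(entry > 0 for row in power for entry in row):
            return n
        product = matmul(power, pattern)
        power = [[1 if entry > 0 else 0 for entry in row] for row in product]
    return None

flip = [[0.0, 1.0], [1.0, 0.0]]      # alternates forever between two states
lazy = [[0.5, 0.5], [1.0, 0.0]]      # can pause in the first state
print(primitive_exponent(flip))
print(primitive_exponent(lazy))
```

The chain `flip` is not primitive: its powers alternate between two zero patterns. The chain `lazy` has a zero entry, but its square is entirely positive, so it is primitive.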

The fundamental theorem of Markov chains is that primitive chains satisfy the mixing property.

Theorem (fundamental theorem of Markov chains). Every primitive Markov chain is mixing. In particular, there exists one and only one probability distribution $\pi$ satisfying the stationary property (St) and the probability distributions $\rho^{(0)},\rho^{(1)},\ldots$ converge to $\pi$ when initialized in any probability distribution $\rho^{(0)}$ . Every entry of $\pi$ is strictly positive.

Let’s see an example of the fundamental theorem with the PageRank surfer. After $n=1$ step, there is at least a $0.15/m > 0$ chance of moving from any website $i$ to any other website $j$ . Thus, the chain is primitive. Consequently, there is a unique stationary distribution $\pi$ , and the surfer will converge to this stationary distribution regardless of which website they start at.
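We can watch this convergence happen numerically. The sketch below repeatedly multiplies a distribution by $P$ until it stops changing, starting from two different websites. The four-page internet and the name `iterate_to_stationarity` are invented for illustration, and the stopping rule is a simple heuristic rather than a rigorous error bound.

```python
def step(rho, P):
    """Return the row vector rho^T P."""
    m = len(P)
    return [sum(rho[i] * P[i][j] for i in range(m)) for j in range(m)]

def iterate_to_stationarity(P, rho0, tol=1e-12, max_steps=10_000):
    """Apply `step` until successive distributions differ by less than `tol`."""
    rho = list(rho0)
    for _ in range(max_steps):
        new_rho = step(rho, P)
        if max(abs(a - b) for a, b in zip(new_rho, rho)) < tol:
            return new_rho
        rho = new_rho
    return rho

# Hypothetical toy internet: pages 1, 2, 3 link to page 0; page 0 links to page 1.
links = [[1], [0], [0], [0]]
m = len(links)
P = [[0.15 / m + (0.85 / len(links[i]) if j in links[i] else 0.0)
      for j in range(m)] for i in range(m)]

pi_from_0 = iterate_to_stationarity(P, [1.0, 0.0, 0.0, 0.0])   # surfer starts at page 0
pi_from_3 = iterate_to_stationarity(P, [0.0, 0.0, 0.0, 1.0])   # surfer starts at page 3
print([round(p, 4) for p in pi_from_0])
print([round(p, 4) for p in pi_from_3])
```

Both runs should print the same distribution, which is the PageRank of this tiny internet.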

Going Backwards in Time

Often, it is helpful to consider what would happen if we ran a Markov chain backwards in time. To see why this is an interesting idea, suppose you run website $w$ and you’re interested in where your traffic is coming from. One way of achieving this would be to initialize the Markov chain at $w$ and run the chain backwards in time. Rather than asking, “given I’m at $w$ now, where would a user go next?”, you ask “given that I’m at $w$ now, where do I expect to have come from?”

Let’s formalize this notion a little bit. Consider a primitive Markov chain $x_0,x_1,x_2,\ldots$ with stationary distribution $\pi$ . We assume that we initialize this Markov chain in the stationary distribution. That is, we pick $\rho^{(0)} = \pi$ as our initial distribution for $x_0$ . The time-reversed Markov chain $y_0,y_1,\ldots$ is defined as follows: The probability $P^{\rm rev}_{ij}$ of moving from $i$ to $j$ in the time-reversed Markov chain is the probability that I was at state $j$ one step previously given that I’m at state $i$ now:

$\mathbb{P} \{y_1 = j \mid y_0 = i\} = P^{\rm rev}_{ij} = \mathbb{P} \{ x_0 = j \mid x_1 = i \}.$

To get a nice closed form expression for the reversed transition probabilities $P^{\rm rev}_{ij}$ , we can invoke Bayes’ theorem:

(Rev) $P^{\rm rev}_{ij} = \mathbb{P} \{ x_0 = j \mid x_1 = i \} = \frac{\mathbb{P} \{x_0 = j\} \mathbb{P} \{x_1 = i \mid x_0 = j\}}{\mathbb{P} \{x_1 = i\}} = \frac{ \pi_j P_{ji}}{\pi_i}.$

The time-reversed Markov chain can be a strange beast. The reversed PageRank surfer, for instance, follows links “upstream”, traveling from the linked site to the linking site. As such, our hypothetical website owner could get a good sense of where their traffic is coming from by initializing the reversed chain $y_0 = w$ at their website and following the chain one step back.
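Formula (Rev) is a one-liner in code. In the sketch below, the three-state chain is a made-up example that mostly cycles in one direction, and `reversed_chain` is a hypothetical helper name.

```python
def reversed_chain(P, pi):
    """Time reversal of P: entry (i, j) is pi[j] * P[j][i] / pi[i], as in (Rev)."""
    m = len(P)
    return [[pi[j] * P[j][i] / pi[i] for j in range(m)] for i in range(m)]

# Hypothetical chain that mostly cycles 0 -> 1 -> 2 -> 0.
P = [[0.1, 0.8, 0.1],
     [0.1, 0.1, 0.8],
     [0.8, 0.1, 0.1]]
pi = [1 / 3, 1 / 3, 1 / 3]   # uniform is stationary here: the columns of P also sum to one

P_rev = reversed_chain(P, pi)
for row in P_rev:
    print([round(p, 4) for p in row])
```

The reversed chain mostly cycles the other way, 0 -> 2 -> 1 -> 0, which matches the intuition of playing the original chain backwards.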

Reversible Markov Chains

We now have two different Markov chains: the original and its time-reversal. Call a Markov chain reversible if these processes are the same. That is, if the transition probabilities are the same:

$P^{\rm rev}_{ij} = P_{ij} \quad \text{for every } i,j=1,2,\ldots,m.$

Using our formula (Rev) for the reversed transition probability, the reversibility condition can be written more concisely as

$\pi_i P_{ij} = \pi_j P_{ji}.$

This condition is referred to as detailed balance.³ In words, it states that a Markov chain is reversible if, when initialized in the stationary distribution $\pi$ , the flow of probability mass from $i$ to $j$ (that is, $\pi_i P_{ij}$ ) is equal to the flow of probability mass from $j$ to $i$ (that is, $\pi_jP_{ji}$ ).

Many interesting Markov chains are reversible. One class of examples is Markov chain models of physical and chemical processes. Since physical laws like classical and quantum mechanics are reversible under time, so too should we expect Markov chain models built from these theories to be reversible.

Not every interesting Markov chain is reversible, however. Indeed, except in special cases, the PageRank Markov chain is not reversible. If $i$ links to $j$ but $j$ does not link to $i$ , then the flow of mass from $i$ to $j$ will be higher than the flow from $j$ to $i$ .

Before moving on, we note one useful fact about reversible Markov chains. Suppose a reversible, primitive Markov chain satisfies the detailed balance condition with a probability distribution $\sigma$ :

$\sigma_i P_{ij} = \sigma_j P_{ji}.$

Then $\sigma = \pi$ is the stationary distribution of this chain. To see why, we check the stationarity condition $\sigma^\top P = \sigma^\top$ . Indeed, for every $j$ ,

$(\sigma^\top P)_j = \sum_{i=1}^m \sigma_i P_{ij} = \sum_{i=1}^m \sigma_j P_{ji} = \sigma_j.$

The second equality is detailed balance and the third equality is just the condition that the sum of the transition probabilities from $j$ to each $i$ is one. Thus, $\sigma^\top P = \sigma^\top$ and $\sigma$ is a stationary distribution for $P$ . But a primitive chain has only one stationary distribution $\pi$ , so $\sigma = \pi$ .
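Here is a quick numerical check of this fact. As an assumed example, take a random walk on a small undirected graph, where each step moves to a uniformly random neighbor. For this walk, the distribution proportional to the number of neighbors satisfies detailed balance. The graph and helper names below are made up for illustration.

```python
def satisfies_detailed_balance(sigma, P, tol=1e-12):
    """Check sigma_i P_ij = sigma_j P_ji for all pairs of states."""
    m = len(P)
    return all(abs(sigma[i] * P[i][j] - sigma[j] * P[j][i]) <= tol
               for i in range(m) for j in range(m))

def is_stationary(sigma, P, tol=1e-12):
    """Check sigma^T P = sigma^T entry by entry."""
    m = len(P)
    return all(abs(sum(sigma[i] * P[i][j] for i in range(m)) - sigma[j]) <= tol
               for j in range(m))

# Hypothetical undirected graph: a triangle 0-1-2 with an extra vertex 3 attached to 2.
neighbors = [[1, 2], [0, 2], [0, 1, 3], [2]]
m = len(neighbors)
P = [[1 / len(neighbors[i]) if j in neighbors[i] else 0.0 for j in range(m)]
     for i in range(m)]
total_degree = sum(len(nbrs) for nbrs in neighbors)
sigma = [len(nbrs) / total_degree for nbrs in neighbors]

print(satisfies_detailed_balance(sigma, P), is_stationary(sigma, P))

# The converse fails: a chain that mostly cycles has a stationary distribution
# (uniform) that does not satisfy detailed balance.
cycle = [[0.1, 0.8, 0.1], [0.1, 0.1, 0.8], [0.8, 0.1, 0.1]]
uniform = [1 / 3, 1 / 3, 1 / 3]
print(satisfies_detailed_balance(uniform, cycle), is_stationary(uniform, cycle))
```

Detailed balance implies stationarity, but the second example shows that a stationary distribution need not satisfy detailed balance.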

Markov Chains as Algorithms

Markov chains are an amazingly flexible tool. One use of Markov chains is more scientific: Given a system in the real world, we can model it by a Markov chain. By simulating the chain or by studying its mathematical properties, we can hope to learn about the system we’ve modeled.

Another use of Markov chains is algorithmic. Rather than thinking of the Markov chain as modeling some real-world process, we instead design the Markov chain to serve a computationally useful end. The PageRank surfer is one example. We wanted to rank the importance of websites, so we designed a Markov chain to achieve this task.

One task we can use Markov chains to solve is sampling. Suppose we have a complicated probability distribution $\pi$ , and we want a random sample from $\pi$ , that is, a random quantity $y$ such that $\mathbb{P} \{ y = i \} = \pi_i$ for every $i$ . One way to achieve this goal is to design a primitive Markov chain with stationary distribution $\pi$ . Then, we run the chain for a large number of steps $n$ and use $x_n$ as an approximate sample from $\pi$ .

To design a Markov chain with stationary distribution $\pi$ , it is sufficient to generate transition probabilities $P$ such that $\pi$ and $P$ satisfy the detailed balance condition. Then, we are guaranteed that $\pi$ is a stationary distribution for the chain. (We also should check the primitiveness condition, but this is often straightforward.)

Here is an effective way of building a Markov chain to sample from a distribution $\pi$ . Suppose that the chain is in state $i$ at time $n$ , $x_n = i$ . To choose the next state, we begin by sampling $j$ from a proposal distribution $T$ . The proposal distribution $T$ can be almost anything we like, as long as it satisfies three conditions:

Probability distribution. For every $i$ , the transition probabilities $T_{ij}$ add to one: $\sum_{j=1}^m T_{ij} = 1$ .
Bidirectional. If $T_{ij} > 0$ , then $T_{ji} > 0$ .
Primitive. The transition probabilities $T$ form a primitive Markov chain.

In order to sample from the correct distribution, we can’t just accept every proposal. Rather, given the proposal $i\to j$ , we accept with probability

$\min \left\{ 1 , \frac{\pi_j T_{ji}}{\pi_i T_{ij}} \right\}.$

If we accept the proposal, the next state of our chain is $x_{n+1} = j$ . Otherwise, we stay where we are, $x_{n+1} = i$ . This Markov chain is known as a Metropolis–Hastings sampler.

For clarity, we list the steps of the Metropolis–Hastings sampler explicitly:

1. Initialize the chain in any state $x_0$ and set $n := 0$ .
2. Draw a proposal $x'$ from the proposal distribution, $\mathbb{P} \{ x' = j \} = T_{x_nj}$ .
3. Compute the acceptance probability
$p_{\rm acc} := \min \left\{ 1 , \frac{\pi_{x'} T_{x'x_n}}{\pi_{x_n} T_{x_nx'}} \right\}.$
4. With probability $p_{\rm acc}$ , set $x_{n+1} := x'$ . Otherwise, set $x_{n+1} := x_n$ .
5. Set $n := n+1$ and go back to step 2.

To check that $\pi$ is a stationary distribution of the Metropolis–Hastings sampler, all we need to do is check detailed balance. Note that the probability $P_{ij}$ of transitioning from $i$ to $j\ne i$ under the Metropolis–Hastings sampler is the proposal probability $T_{ij}$ times the acceptance probability:

$P_{ij} = T_{ij} \cdot \min \left\{ 1 , \frac{\pi_j T_{ji}}{\pi_i T_{ij}} \right\}.$

Detailed balance is confirmed by a short computation⁴

$\pi_i P_{ij} = \pi_i T_{ij} \cdot \min \left\{ 1 , \frac{\pi_j T_{ji}}{\pi_i T_{ij}} \right\} = \min \left\{ \pi_i T_{ij} , \pi_j T_{ji} \right\} = \pi_j P_{ji}.$

Thus the Metropolis–Hastings sampler has $\pi$ as stationary distribution.
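To make this concrete, here is a minimal sketch of a Metropolis–Hastings sampler in Python. Everything specific in it is an assumption chosen for illustration: the target weights $1,2,3,4,5$, and the proposal, which steps to one of the two neighbors on a ring of five states with equal probability. This proposal is bidirectional, and it is primitive because the ring has an odd number of states. Note that the code only ever uses ratios $\pi_j / \pi_i$ , so the target need not be normalized.

```python
import random

def metropolis_hastings(pi, T, num_steps, x0=0, seed=0):
    """Run a Metropolis-Hastings chain and return the list of visited states.

    `pi[i]` is the (possibly unnormalized, strictly positive) target weight of state i.
    `T[i][j]` is the probability of proposing j when the chain is at i.
    """
    rng = random.Random(seed)
    states = list(range(len(pi)))
    x = x0
    path = [x]
    for _ in range(num_steps):
        proposal = rng.choices(states, weights=T[x])[0]
        ratio = (pi[proposal] * T[proposal][x]) / (pi[x] * T[x][proposal])
        if rng.random() < min(1.0, ratio):     # accept with probability p_acc
            x = proposal
        path.append(x)                         # on rejection, the chain stays put
    return path

m = 5
pi = [1, 2, 3, 4, 5]                           # hypothetical unnormalized target
T = [[0.5 if (j - i) % m in (1, m - 1) else 0.0 for j in range(m)]
     for i in range(m)]                        # propose a neighbor on a ring

path = metropolis_hastings(pi, T, num_steps=100_000)
frequencies = [path.count(s) / len(path) for s in range(m)]
print([round(f, 3) for f in frequencies])
print([round(w / sum(pi), 3) for w in pi])     # the normalized target, for comparison
```

If the sampler is working, the empirical frequencies on the first printed line should be close to the normalized target on the second.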

Determinantal Point Processes: Diverse Items from a Collection

The uses of Markov chains in science, engineering, math, computer science, and machine learning are vast. I wanted to wrap up with one application that I find particularly neat.

Suppose I run a bakery and I sell $N$ different baked goods. I want to pick out $k$ special items for a display window to lure customers into my store. As a first approach, I might pick my top- $k$ selling items for the window. But I realize that there’s a problem. All of my top sellers are muffins, so all of the items in my display window are muffins. My display window is doing a good job luring in muffin-lovers, but a bad job of enticing lovers of other baked goods. In addition to rating the popularity of each item, I should also promote diversity in the items I select for my shop window.

Here’s a creative solution to my display case problems using linear algebra. Suppose that, rather than just looking at a list of the sales of each item, I define a matrix $A$ for my baked goods. In the $ii$ th entry $A_{ii}$ of my matrix, I write the number of sales for baked good $i$ . I populate the off-diagonal entries $A_{ij}$ of my matrix with a measure of similarity between items $i$ and $j$ .⁵ So if $i$ and $j$ are both muffins, $A_{ij}$ will be large. But if $i$ is a muffin and $j$ is a cookie, then $A_{ij}$ will be small. For mathematical reasons, we require $A$ to be symmetric and positive definite.

To populate my display case, I choose a random subset of $k$ items from my full menu of size $N$ according to the following strange probability distribution: The probability $\pi_S$ of picking items $S = \{s_1,\ldots,s_k\} \subseteq \{1,\ldots,N\}$ is proportional to the determinant of the submatrix $A(S,S)$ . More specifically,

( $k$ -DPP) $\pi_S = \frac{\det A(S,S)}{\sum_{\text{all subsets $T$ of size $k$}} \det A(T,T)}.$

Here, we let $A(S,S)$ denote the $k\times k$ submatrix of $A$ consisting of the entries appearing in rows and columns $s_1,\ldots,s_k$ . Such a random subset is known as a $k$ -determinantal point process ( $k$ -DPP). (See this survey for more about DPPs.)

To see why this makes any sense, let’s consider a simple example of $N = 3$ items and a display case of size $k = 2$ . Suppose I have three items: a pumpkin muffin, a chocolate chip muffin, and an oatmeal raisin cookie. Say the $A$ matrix looks like

$A = \begin{bmatrix} 10 & 9 & 0 \\ 9 & 10 & 0 \\ 0 & 0 & 5 \end{bmatrix}.$

We see that both muffins are equally popular $A_{11} = A_{22} = 10$ and much more popular than the cookie $A_{33} = 5$ . However, the two muffins are similar to each other and thus the corresponding submatrix has small determinant

$\det A(\{1,2\},\{1,2\}) = \det \begin{bmatrix} 10 & 9 \\ 9 & 10 \end{bmatrix} = 19.$

By contrast, the cookie is dissimilar to each muffin, and the determinant is higher

$\det A(\{1,3\},\{1,3\}) = \det A(\{2,3\},\{2,3\}) = \det \begin{bmatrix} 10 & 0 \\ 0 & 5 \end{bmatrix} = 50.$

Thus, even though the muffins are more popular overall, choosing our display case from a $2$ -DPP, we have a $(50+50) / (50+50+19) \approx 84\%$ chance of choosing a muffin and a cookie for our display case. It is for this reason that we can say that a $k$ -DPP preferentially selects for diverse items.
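For a menu this small, we can compute the whole $k$ -DPP distribution by brute force. The sketch below enumerates every subset of size $k$ and normalizes the determinants. The helper names are hypothetical, and items are indexed from 0, so the cookie is item 2.

```python
from itertools import combinations

def det(M):
    """Determinant by Gaussian elimination with partial pivoting."""
    A = [[float(x) for x in row] for row in M]
    n = len(A)
    result = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        if A[pivot][col] == 0.0:
            return 0.0
        if pivot != col:
            A[col], A[pivot] = A[pivot], A[col]
            result = -result                   # a row swap flips the sign
        result *= A[col][col]
        for r in range(col + 1, n):
            factor = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= factor * A[col][c]
    return result

def kdpp_probabilities(A, k):
    """Map each size-k subset S to its probability det A(S,S) / (sum of all such dets)."""
    subsets = list(combinations(range(len(A)), k))
    dets = [det([[A[i][j] for j in S] for i in S]) for S in subsets]
    total = sum(dets)
    return {S: d / total for S, d in zip(subsets, dets)}

A = [[10, 9, 0],
     [9, 10, 0],
     [0, 0, 5]]
probabilities = kdpp_probabilities(A, k=2)
for S, p in probabilities.items():
    print(S, round(p, 4))
```

Adding the probabilities of the two subsets containing the cookie reproduces the $100/119$ computed above.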

Is sampling from a $k$ -DPP the best way of picking $k$ items for my display case? How does it compare to other possible methods?⁶ These are interesting questions for another time. For now, let us focus our attention on a different question: How would you sample from a $k$ -DPP?

Determinantal Point Process by Markov Chains

Sampling from a $k$ -DPP is a hard computational problem. Indeed, there are ${N \choose k}$ possible $k$ -element subsets of a set of $N$ items. The number of possibilities gets large fast. If I have $N = 100$ items and want to pick $k = 10$ of them, there are already over 10 trillion possible combinations.
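The count is easy to confirm with Python's standard library (`math.comb` is available in Python 3.8 and later). It comes out to about $1.7 \times 10^{13}$ .

```python
import math

num_subsets = math.comb(100, 10)      # number of 10-element subsets of 100 items
print(f"{num_subsets:,}")
print(num_subsets > 10 * 10**12)      # is it more than 10 trillion?
```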

Markov chains offer one compelling way of sampling a $k$ -DPP. First, we need a proposal distribution. Let’s choose the simplest one we can think of:

Proposal for $k$ -DPP sampling. Suppose our current set of $k$ items is $S = \{s_1,\ldots,s_k\}$ . To generate a proposal, choose a uniformly random element $s_{\rm old}$ out of $S$ and a uniformly random element $s_{\rm new}$ out of $\{1,\ldots,N\}$ without $S$ . Propose $S'$ obtained from $S$ by replacing $s_{\rm old}$ with $s_{\rm new}$ (i.e., $S' = S \cup \{s_{\rm new}\} \setminus \{s_{\rm old}\}$ ).

Now, we need to compute the Metropolis–Hastings acceptance probability

$p_{\rm acc} = \min \left\{ 1 , \frac{\pi_{S'} T_{S'S}}{\pi_{S} T_{SS'}} \right\}.$

For $S$ and $S'$ which differ only by the addition of one element and the removal of another, the proposal probabilities $T_{S'S}$ and $T_{SS'}$ are equal: there are $k$ choices for $s_{\rm old}$ and $N-k$ choices for $s_{\rm new}$ , so $T_{S'S} = T_{SS'} = 1/(k(N-k))$ . Using the formula for the probability $\pi_S$ of drawing $S$ from a $k$ -DPP, we compute that

$\frac{\pi_{S'}}{\pi_S} = \frac{\det A(S',S')}{\det A(S,S)}.$

Thus, the Metropolis–Hastings acceptance probability is just a ratio of determinants:

(Acc) $p_{\rm acc} = \min \left\{ 1 , \frac{\pi_{S'} T_{S'S}}{\pi_{S} T_{SS'}} \right\} = \min \left\{ 1, \frac{\det A(S',S')}{\det A(S,S)} \right\}.$

And we’re done. Let’s summarize our sampling algorithm:

1. Choose an initial set $S_0$ arbitrarily and set $n := 0$ .
2. Draw $s_{\rm old}$ uniformly at random from $S_n$ .
3. Draw $s_{\rm new}$ uniformly at random from $\{1,\ldots,N\} \setminus S_n$ .
4. Set $S' := S_n \cup \{s_{\rm new}\} \setminus \{s_{\rm old}\}$ .
5. With probability $p_{\rm acc}$ defined in (Acc), accept and set $S_{n+1} := S'$ . Otherwise, set $S_{n+1} := S_n$ .
6. Set $n := n+1$ and go to step 2.
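The steps above can be sketched in a few lines of Python. This is an illustrative implementation under stated assumptions, not an optimized one: it assumes $0 < k < N$ and a symmetric positive definite $A$ (so every determinant is positive), it recomputes each determinant from scratch, and it scans all $N$ items to find those outside $S$ . The names `det` and `sample_kdpp` are hypothetical, and items are indexed from 0.

```python
import random

def det(M):
    """Determinant by Gaussian elimination with partial pivoting."""
    A = [[float(x) for x in row] for row in M]
    n = len(A)
    result = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        if A[pivot][col] == 0.0:
            return 0.0
        if pivot != col:
            A[col], A[pivot] = A[pivot], A[col]
            result = -result
        result *= A[col][col]
        for r in range(col + 1, n):
            factor = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= factor * A[col][c]
    return result

def sample_kdpp(A, k, num_steps, seed=0):
    """Run the swap chain for `num_steps` steps and return the final set, sorted."""
    rng = random.Random(seed)
    N = len(A)
    S = list(range(k))                                   # arbitrary initial set S_0
    det_S = det([[A[i][j] for j in S] for i in S])
    for _ in range(num_steps):
        old_pos = rng.randrange(k)                       # position of s_old within S
        s_new = rng.choice([i for i in range(N) if i not in S])
        S_prop = S[:old_pos] + S[old_pos + 1:] + [s_new]
        det_prop = det([[A[i][j] for j in S_prop] for i in S_prop])
        if rng.random() < min(1.0, det_prop / det_S):    # acceptance probability (Acc)
            S, det_S = S_prop, det_prop
    return sorted(S)

A = [[10, 9, 0],
     [9, 10, 0],
     [0, 0, 5]]
samples = [tuple(sample_kdpp(A, k=2, num_steps=20, seed=s)) for s in range(2000)]
mixed = sum(1 for S in samples if 2 in S) / len(samples)
print("fraction of muffin-and-cookie displays:", round(mixed, 3))
```

For the bakery example, the fraction of sampled displays that include the cookie should land near the $84\%$ computed earlier. A serious implementation would update the determinant incrementally after each swap instead of recomputing it.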

This is a remarkably simple algorithm to sample from a complicated distribution. And it’s fairly efficient as well. Analysis by Anari, Oveis Gharan, and Rezaei shows that, when you pick a good enough initial set $S_0$ , this sampling algorithm produces approximate samples from a $k$ -DPP in roughly $Nk^2$ steps.⁷ Remarkably, if $k$ is much smaller than $N$ , this Markov chain-based algorithm samples from a $k$ -DPP without even looking at all $N^2$ entries of the matrix $A$ !

Upshot. Markov chains are a simple and general model for a state evolving randomly in time. Under mild conditions, Markov chains converge to a stationary distribution: In the limit of a large number of steps, the state of the system becomes randomly distributed in a way independent of how it was initialized. We can use Markov chains as algorithms to approximately sample from challenging distributions.

3 thoughts on “Big Ideas in Applied Math: Markov Chains”

Pingback: Markov Musings 1: The Fundamental Theorem – Ethan Epperly
Michael Scharrer says:

February 29, 2024 at 9:21 am

Great post. I especially enjoyed the fact that you used very different (and more interesting) practical examples than are used in most Markov chain teaching materials.

One thing that confused me a bit at first was the proof for detailed balance of the Metropolis Hastings update. When I realized that the equation is still for j≠i and the case for i=j follows from probabilities summing up to 1, it all made sense.

1. Ethan N. Epperly says:
  
  March 3, 2024 at 2:24 am
  
  Thanks for the feedback! I’ve edited the post to hopefully be more clear.