Big Ideas in Applied Math: Markov Chains

In this post, we’ll talk about Markov chains, a useful and general model of a random system evolving in time.

PageRank

To see how Markov chains can be useful in practice, we begin our discussion with the famous PageRank problem. The goal is to assign a numerical ranking to each website on the internet measuring how important it is. To do this, we form a mathematical model of an internet user randomly surfing the web. The importance of each website will be measured by the number of times this user visits each page.

The PageRank model of an internet user is as follows: Start the user at an arbitrary initial website x_0. At each step, the user makes one of two choices:

  • With 85% probability, the user follows a random link on their current website.
  • With 15% probability, the user gets bored and jumps to a random website selected from the entire internet.

As with any mathematical model, this is a ridiculously oversimplified description of how a person would surf the web. However, like any good mathematical model, it is useful. Because of the way the model is designed, the user will spend more time on websites with many incoming links. Thus, websites with many incoming links will be rated as important, which seems like a sensible choice.

An example of the PageRank distribution for a small internet is shown below. As one would expect, the surfer spends a large part of their time on website B, which has many incoming links. Interestingly, the user spends almost as much of their time on website C, whose only links are to and from B. Under the PageRank model, a website is important if it is linked to by an important website, even if that is the only website linking to it.

Markov Chains in General

Having seen one Markov chain, the PageRank internet surfer, let’s talk about Markov chains in general. A (time-homogeneous) Markov chain consists of two things: a set of states and probabilities for transitioning between states:

  • Set of states. For this discussion, we limit ourselves to Markov chains which can only exist in finitely many different states. To simplify our discussion, label the possible states using numbers 1,2,\ldots,m.
  • Transition probabilities. The defining property of a (time-homogeneous) Markov chain is that, at any point in time n, if the state is i, the probability of moving to state j is a fixed number P_{ij}. In particular, the probability P_{ij} of moving from i to j does not depend on the time n or the past history of the chain before time n; only the value of the chain at time n matters.

Denote the state of the Markov chain at times 0,1,2,\ldots by x_0,x_1,x_2,\ldots. Note that the states x_0,x_1,x_2,\ldots are random quantities. We can write the Markov chain property using the language of conditional probability:

    \[\mathbb{P} \{ x_{n+1} = j \mid x_n = i,x_{n-1}=a_{n-1},\ldots,x_0=a_0\} = \mathbb{P}\{x_{n+1} = j \mid x_n = i\} = P_{ij}.\]

This equation states that the probability that the system is in state j at time n+1 given the entire history of the system depends only on the value x_n = i of the chain at time n. This probability is the transition probability P_{ij}.

Let’s see how the PageRank internet surfer fits into this model:

  • Set of states. Here, the states are the websites, which we label 1,\ldots,m.
  • Transition probabilities. Consider two websites i and j. If i does not have a link to j, then the only way of going from i to j is if the surfer randomly gets bored (probability 15%) and picks website j to visit at random (probability 1/m). Thus,

    (i\not\to j)   \[P_{ij} = \frac{0.15}{m}. \]

    Suppose instead that i does link to j and i has d_i outgoing links. Then, in addition to the 0.15/m probability computed before, the user has an 85% chance of following a link and a 1/d_i chance of picking j as that link. Thus,

    (i\to j)   \[P_{ij} = \frac{0.85}{d_i} + \frac{0.15}{m}. \]
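As a rough illustration, the two formulas above can be assembled into a transition matrix. The sketch below is a minimal pure-Python version. The function name `pagerank_matrix`, the input format, and the four-page link graph are invented for illustration, and the code assumes every page has at least one outgoing link. Pages are numbered from 0 rather than from 1.

```python
def pagerank_matrix(links, damping=0.85):
    """Build the PageRank transition matrix P as a list of rows.

    links[i] is the set of pages that page i links to (a hypothetical
    input format). Assumes every page has at least one outgoing link.
    """
    m = len(links)
    P = []
    for i in range(m):
        d_i = len(links[i])               # number of outgoing links of page i
        row = []
        for j in range(m):
            p = (1 - damping) / m         # bored jump: 0.15 / m
            if j in links[i]:
                p += damping / d_i        # followed a link: 0.85 / d_i
            row.append(p)
        P.append(row)
    return P


# Hypothetical 4-page internet, chosen only for illustration.
links = [{1}, {2}, {1}, {1, 2}]
P = pagerank_matrix(links)
for row in P:
    print([round(p, 4) for p in row])
```

Each row of P sums to one, as it must for a table of transition probabilities.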

Markov Chains and Linear Algebra

For a non-random process y_0,y_1,y_2,\ldots, we can understand the process’s evolution by determining its state y_n at every point in time n. Since Markov chains are random processes, it is not enough to track the state x_n of the process at every time n. Rather, we must understand the probability distribution of the state x_n at every point in time n.

It is customary in Markov chain theory to represent a probability distribution on the states \{1,\ldots,m\} by a row vector \rho^\top. (To really emphasize that probability distributions are row vectors, we shall write them as transposes of column vectors. So \rho is a column vector, but \rho^\top, which represents the probability distribution, is a row vector.) The ith entry \rho_i stores the probability that the system is in state i. Naturally, as \rho^\top is a probability distribution, its entries must be nonnegative (\rho_i \ge 0 for every i) and add to one (\sum_{i=1}^m \rho_i = 1).

Let (\rho^{(0)})^\top, (\rho^{(1)})^\top,\ldots denote the probability distributions of the states x_0,x_1,\ldots. It is natural to ask: How are the distributions (\rho^{(0)})^\top, (\rho^{(1)})^\top,\ldots related to each other? Let’s answer this question.

The probability that x_{n+1} is in state j is the jth entry of \rho^{(n+1)}:

    \[\rho^{(n+1)}_j = \mathbb{P} \{x_{n+1} = j\}\]

To compute this probability, we break into cases based on the value of the process at time n: either x_n = 1 or x_n = 2 or … or x_n = m; only one of these cases can be true at once. When we have an “or” of random events and these events are mutually exclusive (only one can be true at once), then the probabilities add:

    \[\rho^{(n+1)}_j = \mathbb{P} \{x_{n+1} = j\} = \sum_{i=1}^m \mathbb{P} \{x_{n+1} = j, x_n = i\}.\]

Now we use some conditional probability. The probability that x_{n+1} = j and x_n = i is the probability that x_n = i times the probability that x_{n+1} = j conditional on x_n = i. That is,

    \[\rho^{(n+1)}_j = \sum_{i=1}^m \mathbb{P} \{x_{n+1} = j, x_n = i\} = \sum_{i=1}^m \mathbb{P} \{x_n = i\} \mathbb{P}\{x_{n+1} = j \mid x_n = i\}.\]

Now, we can simplify using our definitions. The probability that x_n = i is just \rho^{(n)}_i and the probability of moving from i to j is P_{ij}. Thus, we conclude

    \[\rho_j^{(n+1)} = \sum_{i=1}^m \rho^{(n)}_i P_{ij} .\]

Phrased in the language of linear algebra, we’ve shown

    \[\left(\rho^{(n+1)}\right)^\top = \left(\rho^{(n)}\right)^\top P \quad \text{for any } n = 0,1,2,\ldots.\]

That is, if we view the transition probabilities P_{ij} as comprising an m\times m matrix P, then the distribution at time n+1 is obtained by multiplying the distribution at time n by transition matrix P. In particular, if we iterate this result, we obtain that the distribution at time n is given by

    \[\left(\rho^{(n)}\right)^\top = \left(\rho^{(n-1)}\right)^\top P = \left[\left(\rho^{(n-2)}\right)^\top P\right]P = \left(\rho^{(n-2)}\right)^\top P^2 = \cdots = \left(\rho^{(0)}\right)^\top P^n.\]

Thus, the distribution at time n is the distribution at time 0 multiplied by the nth power of the transition matrix P.
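To make the recurrence concrete, here is a minimal pure-Python sketch that propagates a distribution forward by repeated vector-matrix products. The helper names `step` and `distribution_at` and the two-state chain are assumptions made for illustration. States are indexed from 0.

```python
def step(rho, P):
    """One step of the chain: return the row vector rho^T P."""
    m = len(P)
    return [sum(rho[i] * P[i][j] for i in range(m)) for j in range(m)]


def distribution_at(rho0, P, n):
    """Distribution at time n, computed as rho0^T P^n by repeated steps."""
    rho = list(rho0)
    for _ in range(n):
        rho = step(rho, P)
    return rho


# Hypothetical two-state chain, chosen only for illustration.
P = [[0.9, 0.1],
     [0.5, 0.5]]
rho0 = [1.0, 0.0]   # start in the first state with certainty

rho1 = distribution_at(rho0, P, 1)
rho2 = distribution_at(rho0, P, 2)
print(rho1)
print(rho2)
```

Applying `step` n times avoids forming P^n explicitly, which is usually the cheaper route when only one starting distribution is of interest.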

Convergence to Stationarity

Let’s go back to our web surfer again. At time 0, we started our surfer at a particular website, say 1. As such, the probability distribution \rho^{(0)} at time 0 is concentrated just on website 1, with no other website having any probability at all. (To keep notation clean going forward, we will drop the transposes off of probability distributions, except when working with them linear algebraically.) In the first few steps, most of the probability will remain in the vicinity of website 1, in the websites linked to by 1 and the websites linked to by the websites linked to by 1 and so on. However, as we run the chain long enough, the surfer will have time to roam widely across the web, and the probability distribution will become less and less influenced by the chain’s starting location. This motivates the following definition:

Definition. A Markov chain satisfies the mixing property if the probability distributions \rho^{(0)}, \rho^{(1)}, \ldots converge to a single fixed probability distribution \pi regardless of how the chain is initialized (i.e., independent of the starting distribution \rho^{(0)}).

The distribution \pi for a mixing Markov chain is known as a stationary distribution because it does not change under the action of P:

(St)   \[\pi^\top = \pi^\top P. \]

To see this, recall the recurrence

    \[\left(\rho^{(n+1)}\right)^\top = \left(\rho^{(n)}\right)^\top P,\]

take the limit as n\to\infty, and observe that both (\rho^{(n+1)})^\top and (\rho^{(n)})^\top converge to \pi^\top.

One of the basic questions in the theory of Markov chains is finding conditions under which the mixing property (or a suitable weaker version of it) holds. To answer this question, we will need the following definition:

Definition. A Markov chain is primitive if, after running the chain for some number n of steps, the chain has positive probability of moving between any two states. That is,

    \[\text{There exists $n$ such that, for any $i,j = 1,2,\ldots,m$, } \quad\mathbb{P}\{x_n = j \mid x_0 = i \} = (P^n)_{ij} > 0.\]

The fundamental theorem of Markov chains is that primitive chains satisfy the mixing property.

Theorem (fundamental theorem of Markov chains). Every primitive Markov chain is mixing. In particular, there exists one and only one probability distribution \pi satisfying the stationary property (St), and the probability distributions \rho^{(0)},\rho^{(1)},\ldots converge to \pi when initialized in any probability distribution \rho^{(0)}. Every entry of \pi is strictly positive.

Let’s see an example of the fundamental theorem with the PageRank surfer. After n=1 step, there is at least a 0.15/m > 0 chance of moving from any website i to any other website j. Thus, the chain is primitive. Consequently, there is a unique stationary distribution \pi, and the surfer will converge to this stationary distribution regardless of which website they start at.
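The convergence promised by the theorem can be observed numerically. The sketch below is a minimal pure-Python experiment, not a definitive implementation. The three-page chain is hypothetical: it follows the PageRank formulas for the made-up link graph 0 → 1, 1 → 2, 2 → {0, 1}, so every entry of P is at least 0.15/3 = 0.05. The helper names `step` and `run` are also assumptions.

```python
def step(rho, P):
    """One step of the chain: return the row vector rho^T P."""
    m = len(P)
    return [sum(rho[i] * P[i][j] for i in range(m)) for j in range(m)]


def run(rho0, P, n):
    """Apply n steps of the chain to the initial distribution rho0."""
    rho = list(rho0)
    for _ in range(n):
        rho = step(rho, P)
    return rho


# Hypothetical 3-page PageRank chain; all entries are positive, so it is primitive.
P = [[0.05,  0.90,  0.05],
     [0.05,  0.05,  0.90],
     [0.475, 0.475, 0.05]]

from_page_0 = run([1.0, 0.0, 0.0], P, 200)   # surfer starts at page 0
from_page_2 = run([0.0, 0.0, 1.0], P, 200)   # surfer starts at page 2

print([round(x, 6) for x in from_page_0])
print([round(x, 6) for x in from_page_2])
```

Both starting points should print the same vector up to rounding, and applying one more step should leave it unchanged, which is the stationarity condition (St).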

Going Backwards in Time

Often, it is helpful to consider what would happen if we ran a Markov chain backwards in time. To see why this is an interesting idea, suppose you run website w and you’re interested in where your traffic is coming from. One way of achieving this would be to initialize the Markov chain at w and run the chain backwards in time. Rather than asking, “given I’m at w now, where would a user go next?”, you ask “given that I’m at w now, where do I expect to have come from?”

Let’s formalize this notion a little bit. Consider a primitive Markov chain x_0,x_1,x_2,\ldots with stationary distribution \pi. We assume that we initialize this Markov chain in the stationary distribution. That is, we pick \rho^{(0)} = \pi as our initial distribution for x_0. The time-reversed Markov chain y_0,y_1,\ldots is defined as follows: The probability P^{\rm rev}_{ij} of moving from i to j in the time-reversed Markov chain is the probability that I was at state j one step previously given that I’m at state i now:

    \[\mathbb{P} \{y_1 = j \mid y_0 = i\} = P^{\rm rev}_{ij} = \mathbb{P} \{ x_0 = j \mid x_1 = i \}.\]

To get a nice closed form expression for the reversed transition probabilities P^{\rm rev}_{ij}, we can invoke Bayes’ theorem:

(Rev)   \[P^{\rm rev}_{ij} = \mathbb{P} \{ x_0 = j \mid x_1 = i \} = \frac{\mathbb{P} \{x_0 = j\} \mathbb{P} \{x_1 = i \mid x_0 = j\}}{\mathbb{P} \{x_1 = i\}} = \frac{ \pi_j P_{ji}}{\pi_i}. \]

The time-reversed Markov chain can be a strange beast. The reversed PageRank surfer, for instance, follows links “upstream,” traveling from the linked site to the linking site. As such, our hypothetical website owner could get a good sense of where their traffic is coming from by initializing the reversed chain y_0 = w at their website and following the chain one step back.
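Formula (Rev) is easy to evaluate once the stationary distribution is known. The following pure-Python sketch approximates \pi by running the chain for many steps and then builds the reversed transition matrix. The helper names `stationary` and `reversed_chain`, the step count of 500, and the three-page chain are assumptions made for illustration.

```python
def step(rho, P):
    """One step of the chain: return the row vector rho^T P."""
    m = len(P)
    return [sum(rho[i] * P[i][j] for i in range(m)) for j in range(m)]


def stationary(P, n_steps=500):
    """Approximate the stationary distribution by iterating from uniform."""
    m = len(P)
    rho = [1.0 / m] * m
    for _ in range(n_steps):
        rho = step(rho, P)
    return rho


def reversed_chain(P, pi):
    """Reversed transition matrix: P_rev[i][j] = pi[j] * P[j][i] / pi[i]."""
    m = len(P)
    return [[pi[j] * P[j][i] / pi[i] for j in range(m)] for i in range(m)]


# Hypothetical 3-page PageRank chain (links 0 -> 1, 1 -> 2, 2 -> {0, 1}).
P = [[0.05,  0.90,  0.05],
     [0.05,  0.05,  0.90],
     [0.475, 0.475, 0.05]]

pi = stationary(P)
P_rev = reversed_chain(P, pi)

# Row 1 of P_rev estimates where a visitor to page 1 came from.
print([round(p, 4) for p in P_rev[1]])
```

The rows of `P_rev` sum to one precisely because \pi is stationary for P, so this check doubles as a sanity test on the computed \pi.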

Reversible Markov Chains

We now have two different Markov chains: the original and its time-reversal. Call a Markov chain reversible if these processes are the same. That is, if the transition probabilities are the same:

    \[P^{\rm rev}_{ij} = P_{ij} \quad \text{for every } i,j=1,2,\ldots,m.\]

Using our formula (Rev) for the reversed transition probability, the reversibility condition can be written more concisely as

    \[\pi_i P_{ij} = \pi_j P_{ji}.\]

This condition is referred to as detailed balance. In words, it states that a Markov chain is reversible if, when initialized in the stationary distribution \pi, the flow of probability mass from i to j (that is, \pi_i P_{ij}) is equal to the flow of probability mass from j to i (that is, \pi_jP_{ji}).

(There is an abstruse but useful way of reformulating the detailed balance condition. Think of a vector f \in \mathbb{R}^m as defining a function on the set \{1,\ldots,m\}, f : i \mapsto f(i) \coloneqq f_i. Letting x denote a random variable drawn from the stationary distribution x \sim \pi, we can define a non-standard inner product on \mathbb{R}^m: \langle f, g\rangle_{\pi} \coloneqq \mathbb{E}[f(x) g(x)] = \sum_{i=1}^m \pi_i f(i)g(i). Then the Markov chain is reversible if and only if detailed balance holds if and only if P is a self-adjoint operator on \mathbb{R}^m when equipped with the non-standard inner product \langle \cdot,\cdot\rangle_\pi. This more abstract characterization has useful consequences. For instance, by the spectral theorem, the transition matrix P of a reversible Markov chain has real eigenvalues and supports a basis of orthonormal eigenvectors (in the \langle \cdot,\cdot\rangle_\pi inner product).)

Many interesting Markov chains are reversible. One class of examples is Markov chain models of physical and chemical processes. Since physical laws like classical and quantum mechanics are reversible under time, so too should we expect Markov chain models built from these theories to be reversible.

Not every interesting Markov chain is reversible, however. Indeed, except in special cases, the PageRank Markov chain is not reversible. If i links to j but j does not link to i, then the flow of mass from i to j will be higher than the flow from j to i.

Before moving on, we note one useful fact about reversible Markov chains. Suppose a reversible, primitive Markov chain satisfies the detailed balance condition with a probability distribution \sigma:

    \[\sigma_i P_{ij} = \sigma_j P_{ji}.\]

Then \sigma = \pi is the stationary distribution of this chain. To see why, we check the stationarity condition \sigma^\top P = \sigma^\top. Indeed, for every j,

    \[(\sigma^\top P)_j = \sum_{i=1}^m \sigma_i P_{ij} = \sum_{i=1}^m \sigma_j P_{ji} = \sigma_j.\]

The second equality is detailed balance and the third equality is just the condition that the sum of the transition probabilities from j to each i is one. Thus, \sigma^\top P = \sigma^\top and \sigma is a stationary distribution for P. But a primitive chain has only one stationary distribution \pi, so \sigma = \pi.
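The argument above can be checked on a small example in exact arithmetic. The sketch below uses a random walk on a weighted graph, a standard reversible chain that is chosen here purely for illustration: the symmetric weights `w` are made up, the walk moves from i to j with probability w[i][j] divided by the total weight at i, and the candidate distribution \sigma is proportional to that total weight.

```python
from fractions import Fraction

# Hypothetical symmetric weights, w[i][j] == w[j][i], on three states.
w = [[0, 2, 1],
     [2, 0, 3],
     [1, 3, 0]]
m = len(w)

deg = [sum(row) for row in w]   # total weight attached to each state
P = [[Fraction(w[i][j], deg[i]) for j in range(m)] for i in range(m)]
sigma = [Fraction(d, sum(deg)) for d in deg]

# Detailed balance: sigma_i P_ij == sigma_j P_ji for every pair (i, j).
balanced = all(sigma[i] * P[i][j] == sigma[j] * P[j][i]
               for i in range(m) for j in range(m))

# Stationarity: (sigma^T P)_j == sigma_j for every j.
sigma_P = [sum(sigma[i] * P[i][j] for i in range(m)) for j in range(m)]

print(balanced, sigma_P == sigma)
```

Using `Fraction` keeps every comparison exact, so the equalities are verified rather than approximated.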

Markov Chains as Algorithms

Markov chains are an amazingly flexible tool. One use of Markov chains is more scientific: Given a system in the real world, we can model it by a Markov chain. By simulating the chain or by studying its mathematical properties, we can hope to learn about the system we’ve modeled.

Another use of Markov chains is algorithmic. Rather than thinking of the Markov chain as modeling some real-world process, we instead design the Markov chain to serve a computationally useful end. The PageRank surfer is one example. We wanted to rank the importance of websites, so we designed a Markov chain to achieve this task.

One class of problems we can use Markov chains to solve is sampling problems. Suppose we have a complicated probability distribution \pi, and we want a random sample from \pi, that is, a random quantity y such that \mathbb{P} \{ y = i \} = \pi_i for every i. One way to achieve this goal is to design a primitive Markov chain with stationary distribution \pi. Then, we run the chain for a large number of steps n and use x_n as an approximate sample from \pi.

To design a Markov chain with stationary distribution \pi, it is sufficient to generate transition probabilities P such that \pi and P satisfy the detailed balance condition. Then, we are guaranteed that \pi is a stationary distribution for the chain. (We also should check the primitiveness condition, but this is often straightforward.)

Here is an effective way of building a Markov chain to sample from a distribution \pi. Suppose that the chain is in state i at time n, x_n = i. To choose the next state, we begin by sampling j from a proposal distribution T. The proposal distribution T can be almost anything we like, as long as it satisfies three conditions:

  • Probability distribution. For every i, the transition probabilities T_{ij} add to one: \sum_{j=1}^m T_{ij} = 1.
  • Bidirectional. If T_{ij} > 0, then T_{ji} > 0.
  • Primitive. The transition probabilities T form a primitive Markov chain.

In order to sample from the correct distribution, we can’t just accept every proposal. Rather, given the proposal i\to j, we accept with probability

    \[\min \left\{ 1 , \frac{\pi_j T_{ji}}{\pi_i T_{ij}} \right\}.\]

If we accept the proposal, the next state of our chain is x_{n+1} = j. Otherwise, we stay where we are, x_{n+1} = i. This Markov chain is known as a Metropolis–Hastings sampler.

For clarity, we list the steps of the Metropolis–Hastings sampler explicitly:

  1. Initialize the chain in any state x_0 and set n := 0.
  2. Draw a proposal x' from the proposal distribution, \mathbb{P} \{ x' = j \} = T_{x_n j}.
  3. Compute the acceptance probability, where i = x_n is the current state and j = x' is the proposal:

        \[p_{\rm acc} := \min \left\{ 1 , \frac{\pi_j T_{ji}}{\pi_i T_{ij}} \right\}.\]

  4. With probability p_{\rm acc}, set x_{n+1} := x'. Otherwise, set x_{n+1} := x_n.
  5. Set n := n+1 and go back to step 2.

To check that \pi is a stationary distribution of the Metropolis–Hastings sampler, all we need to do is check detailed balance. Note that the probability P_{ij} of transitioning from i to j\ne i under the Metropolis–Hastings sampler is the proposal probability T_{ij} times the acceptance probability:

    \[P_{ij} = T_{ij} \cdot \min \left\{ 1 , \frac{\pi_j T_{ji}}{\pi_i T_{ij}} \right\}.\]

Detailed balance is confirmed by a short computation. (Note that the detailed balance condition for i = j is always satisfied for any Markov chain: \pi_i P_{ii} = \pi_i P_{ii}.) For i \ne j, we have

    \[\pi_i P_{ij} = \pi_i T_{ij} \cdot \min \left\{ 1 , \frac{\pi_j T_{ji}}{\pi_i T_{ij}} \right\} = \min \left\{ \pi_i T_{ij} , \pi_j T_{ji} \right\} = \pi_j P_{ji}.\]

Thus the Metropolis–Hastings sampler has \pi as stationary distribution.
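Here is a minimal pure-Python sketch of the sampler described above. The target distribution `pi`, the proposal matrix `T`, the helper names, the seed, and the number of steps are all assumptions chosen for illustration. `T` is deliberately not symmetric so that the ratio T_{ji}/T_{ij} matters. States are indexed from 0. The function `mh_matrix` builds the transition matrix P exactly, so detailed balance can be checked without any randomness.

```python
import random
from fractions import Fraction as F


def acceptance(pi, T, i, j):
    """Acceptance probability min{1, pi_j T_ji / (pi_i T_ij)} for i -> j."""
    return min(F(1), (pi[j] * T[j][i]) / (pi[i] * T[i][j]))


def mh_matrix(pi, T):
    """Exact transition matrix of the Metropolis-Hastings chain."""
    m = len(pi)
    P = [[F(0)] * m for _ in range(m)]
    for i in range(m):
        for j in range(m):
            if j != i and T[i][j] > 0:
                P[i][j] = T[i][j] * acceptance(pi, T, i, j)
        P[i][i] = 1 - sum(P[i])   # all remaining mass stays at i
    return P


def mh_step(state, pi, T, rng):
    """One step: draw a proposal from row `state` of T, then accept or reject."""
    m = len(pi)
    proposal = rng.choices(range(m), weights=[float(t) for t in T[state]])[0]
    if proposal != state and rng.random() < acceptance(pi, T, state, proposal):
        return proposal
    return state


# Hypothetical target and proposal distributions on four states.
pi = [F(1, 10), F(2, 10), F(3, 10), F(4, 10)]
T = [[F(0), F(1, 2), F(1, 4), F(1, 4)],
     [F(1, 3), F(0), F(1, 3), F(1, 3)],
     [F(1, 4), F(1, 4), F(0), F(1, 2)],
     [F(1, 2), F(1, 4), F(1, 4), F(0)]]

P = mh_matrix(pi, T)
detailed_balance = all(pi[i] * P[i][j] == pi[j] * P[j][i]
                       for i in range(4) for j in range(4))

rng = random.Random(0)        # fixed seed so the run is repeatable
n_steps = 20_000
state, counts = 0, [0] * 4
for _ in range(n_steps):
    state = mh_step(state, pi, T, rng)
    counts[state] += 1
freqs = [c / n_steps for c in counts]

print(detailed_balance)
print(freqs)
```

The visit frequencies should land close to 0.1, 0.2, 0.3, 0.4. Note that the sampler only ever uses ratios \pi_j/\pi_i, so in practice \pi needs to be known only up to a normalizing constant.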

Determinantal Point Processes: Diverse Items from a Collection

The uses of Markov chains in science, engineering, math, computer science, and machine learning are vast. I wanted to wrap up with one application that I find particularly neat.

Suppose I run a bakery and I sell N different baked goods. I want to pick out k special items for a display window to lure customers into my store. As a first approach, I might pick my top-k selling items for the window. But I realize that there’s a problem. All of my top sellers are muffins, so all of the items in my display window are muffins. My display window is doing a good job luring in muffin-lovers, but a bad job of enticing lovers of other baked goods. In addition to rating the popularity of each item, I should also promote diversity in the items I select for my shop window.

Here’s a creative solution to my display case problems using linear algebra. Suppose that, rather than just looking at a list of the sales of each item, I define a matrix A for my baked goods. In the ith diagonal entry A_{ii} of my matrix, I write the number of sales for baked good i. I populate the off-diagonal entries A_{ij} of my matrix with a measure of similarity between items i and j. So if i and j are both muffins, A_{ij} will be large. But if i is a muffin and j is a cookie, then A_{ij} will be small. For mathematical reasons, we require A to be symmetric and positive definite.

(There are many ways of defining such a similarity matrix. Here is one way. Let z_1,\ldots,z_N be the number ordered for each bakery item by a random customer. Set \Sigma to be the correlation matrix of the random variables z_1,\ldots,z_N, with \Sigma_{ij} being the correlation between the random variables z_i and z_j. The matrix \Sigma has all ones on its diagonal. The off-diagonal entries \Sigma_{ij} measure the amount that items i and j tend to be purchased together. Let D be a diagonal matrix where D_{ii} is the total sales of item i. Set A \coloneqq D^{1/2}\Sigma D^{1/2}. By scaling \Sigma by the diagonal matrix D, the diagonal entries of A represent the popularity of each item, whereas the off-diagonal entries still represent correlations, now scaled by popularity.)

To populate my display case, I choose a random subset of k items from my full menu of size N according to the following strange probability distribution: The probability \pi_S of picking items S = \{s_1,\ldots,s_k\} \subseteq \{1,\ldots,N\} is proportional to the determinant of the submatrix A(S,S). More specifically,

(k-DPP)   \[\pi_S = \frac{\det A(S,S)}{\sum_{\text{all subsets $T$ of size $k$}} \det A(T,T)}. \]

Here, we let A(S,S) denote the k\times k submatrix of A consisting of the entries appearing in rows and columns s_1,\ldots,s_k. Such a random subset is known as a k-determinantal point process (k-DPP). (See this survey for more about DPPs.)

To see why this makes any sense, let’s consider a simple example of N = 3 items and a display case of size k = 2. Suppose I have three items: a pumpkin muffin, a chocolate chip muffin, and an oatmeal raisin cookie. Say the A matrix looks like

    \[A = \begin{bmatrix} 10 & 9 & 0 \\ 9 & 10 & 0 \\ 0 & 0 & 5 \end{bmatrix}.\]

We see that both muffins are equally popular (A_{11} = A_{22} = 10) and much more popular than the cookie (A_{33} = 5). However, the two muffins are similar to each other, and thus the corresponding submatrix has a small determinant:

    \[\det A(\{1,2\},\{1,2\}) = \det \begin{bmatrix} 10 & 9 \\ 9 & 10 \end{bmatrix} = 19.\]

By contrast, the cookie is dissimilar to each muffin, and the determinant is higher:

    \[\det A(\{1,3\},\{1,3\}) = \det A(\{2,3\},\{2,3\}) = \det \begin{bmatrix} 10 & 0 \\ 0 & 5 \end{bmatrix} = 50.\]

Thus, even though the muffins are more popular overall, choosing our display case from a 2-DPP, we have a (50+50) / (50+50+19) \approx 84\% chance of choosing a muffin and a cookie for our display case. It is for this reason that we can say that a k-DPP preferentially selects for diverse items.
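The arithmetic in this example can be reproduced by brute force. The sketch below enumerates all subsets of size k and applies formula (k-DPP) directly, which is only feasible for tiny N. The helper names `det`, `principal_submatrix`, and `kdpp_probabilities` are assumptions, and items are indexed from 0, so the two muffins are items 0 and 1 and the cookie is item 2.

```python
from fractions import Fraction
from itertools import combinations


def det(M):
    """Determinant by cofactor expansion along the first row (small matrices only)."""
    if len(M) == 1:
        return M[0][0]
    total = 0
    for c in range(len(M)):
        minor = [row[:c] + row[c + 1:] for row in M[1:]]
        total += (-1) ** c * M[0][c] * det(minor)
    return total


def principal_submatrix(A, S):
    """The submatrix A(S, S) with rows and columns indexed by S."""
    return [[A[i][j] for j in S] for i in S]


def kdpp_probabilities(A, k):
    """Exact k-DPP probability of every size-k subset, by enumeration."""
    dets = {S: det(principal_submatrix(A, S))
            for S in combinations(range(len(A)), k)}
    Z = sum(dets.values())   # normalizing constant in (k-DPP)
    return {S: Fraction(d, Z) for S, d in dets.items()}


A = [[10, 9, 0],
     [9, 10, 0],
     [0, 0, 5]]

probs = kdpp_probabilities(A, 2)
for S, p in probs.items():
    print(S, p, round(float(p), 3))

muffin_and_cookie = probs[(0, 2)] + probs[(1, 2)]
print(muffin_and_cookie)
```

The last line prints the exact probability of a mixed display case, 100/119, matching the roughly 84% quoted above.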

Is sampling from a k-DPP the best way of picking k items for my display case? How does it compare to other possible methods? These are interesting questions for another time. (Another method I’m partial to for this task is randomly pivoted Cholesky sampling, which is computationally cheaper than k-DPP sampling even with the Markov chain sampling approach to k-DPP sampling that we will discuss shortly.) For now, let us focus our attention on a different question: How would you sample from a k-DPP?

Determinantal Point Process by Markov Chains

Sampling from a k-DPP is a hard computational problem. Indeed, there are {N \choose k} possible k-element subsets of a set of N items. The number of possibilities gets large fast. If I have N = 100 items and want to pick k = 10 of them, there are already over 10 trillion possible combinations.
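The count quoted above is quick to verify with Python's standard library; the variable name `n_subsets` is just a label chosen here.

```python
import math

# Number of ways to choose k = 10 items out of N = 100.
n_subsets = math.comb(100, 10)
print(n_subsets)
print(n_subsets > 10 ** 13)   # more than 10 trillion?
```

The exact value is about 1.7 × 10^13, so enumerating every subset and its determinant, as formula (k-DPP) would naively require, is out of reach even at this modest size.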

Markov chains offer one compelling way of sampling a k-DPP. First, we need a proposal distribution. Let’s choose the simplest one we can think of:

Proposal for k-DPP sampling. Suppose our current set of k items is S = \{s_1,\ldots,s_k\}. To generate a proposal, choose a uniformly random element s_{\rm old} out of S and a uniformly random element s_{\rm new} out of \{1,\ldots,N\} without S. Propose S' obtained from S by replacing s_{\rm old} with s_{\rm new} (i.e., S' = S \cup \{s_{\rm new}\} \setminus \{s_{\rm old}\}).

Now, we need to compute the Metropolis–Hastings acceptance probability

    \[p_{\rm acc} = \min \left\{ 1 , \frac{\pi_{S'} T_{S'S}}{\pi_{S} T_{SS'}} \right\}.\]

For S and S' which differ only by the addition of one element and the removal of another, the proposal probabilities T_{S'S} and T_{SS'} are equal: there are k choices for the element to remove and N-k choices for the element to add, so T_{S'S} = T_{SS'} = 1/(k(N-k)). Using the formula for the probability \pi_S of drawing S from a k-DPP, we compute that

    \[\frac{\pi_{S'}}{\pi_S} = \frac{\det A(S',S')}{\det A(S,S)}.\]

Thus, the Metropolis–Hastings acceptance probability is just a ratio of determinants:

(Acc)   \[p_{\rm acc} = \min \left\{ 1 , \frac{\pi_{S'} T_{S'S}}{\pi_{S} T_{SS'}} \right\} = \min \left\{ 1, \frac{\det A(S',S')}{\det A(S,S)} \right\}. \]

And we’re done. Let’s summarize our sampling algorithm:

  1. Choose an initial set S_0 arbitrarily and set n := 0.
  2. Draw s_{\rm old} uniformly at random from S_n.
  3. Draw s_{\rm new} uniformly at random from \{1,\ldots,N\} \setminus S_n.
  4. Set S' := S_n \cup \{s_{\rm new}\} \setminus \{s_{\rm old}\}.
  5. With probability p_{\rm acc} defined in (Acc), accept and set S_{n+1} := S'. Otherwise, set S_{n+1} := S_n.
  6. Set n := n+1 and go to step 2.

This is a remarkably simple algorithm to sample from a complicated distribution. And it’s fairly efficient as well. Analysis by Anari, Oveis Gharan, and Rezaei shows that, when you pick a good enough initial set S_0, this sampling algorithm produces approximate samples from a k-DPP in roughly Nk^2 steps. (They actually use a slight variant of this algorithm where the acceptance probabilities (Acc) are reduced by a factor of two. Observe that this still has the correct stationary distribution because detailed balance continues to hold. The extra factor is introduced to ensure that the Markov chain is primitive.) Remarkably, if k is much smaller than N, this Markov chain-based algorithm samples from a k-DPP without even looking at all N^2 entries of the matrix A!
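The six steps listed above translate almost line for line into code. The following is a minimal pure-Python sketch under stated assumptions, not a tuned implementation: the helper names, the choice S_0 = {0, ..., k-1}, the seed, and the number of steps are all made up for illustration, and each determinant is recomputed from scratch by Gaussian elimination rather than updated cheaply. It assumes A is symmetric positive definite, so that no determinant in the ratio is zero. Items are indexed from 0.

```python
import random


def det(M):
    """Determinant via Gaussian elimination with partial pivoting."""
    M = [[float(x) for x in row] for row in M]
    n, d = len(M), 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))   # pivot row
        if M[p][c] == 0.0:
            return 0.0
        if p != c:
            M[c], M[p] = M[p], M[c]
            d = -d                                         # row swap flips the sign
        d *= M[c][c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n):
                M[r][j] -= f * M[c][j]
    return d


def sub_det(A, S):
    """det A(S, S) for a set of indices S."""
    idx = sorted(S)
    return det([[A[i][j] for j in idx] for i in idx])


def kdpp_chain(A, k, n_steps, rng):
    """Run the swap chain and return the list of visited sets."""
    N = len(A)
    S = set(range(k))                      # step 1: arbitrary initial set
    det_S = sub_det(A, S)
    history = []
    for _ in range(n_steps):
        s_old = rng.choice(sorted(S))                        # step 2
        s_new = rng.choice(sorted(set(range(N)) - S))        # step 3
        S_prop = (S - {s_old}) | {s_new}                     # step 4
        det_prop = sub_det(A, S_prop)
        if rng.random() < min(1.0, det_prop / det_S):        # step 5, using (Acc)
            S, det_S = S_prop, det_prop
        history.append(frozenset(S))                         # step 6: next iteration
    return history


# Bakery example: items 0 and 1 are the muffins, item 2 is the cookie.
A = [[10, 9, 0],
     [9, 10, 0],
     [0, 0, 5]]

rng = random.Random(0)   # fixed seed so the run is repeatable
n_steps = 30_000
history = kdpp_chain(A, 2, n_steps, rng)
freq_two_muffins = sum(S == frozenset({0, 1}) for S in history) / n_steps
print(freq_two_muffins, 19 / 119)
```

On the bakery example, the fraction of time spent on the two-muffin display should come out near 19/119 ≈ 0.16. Only the entries of A in rows and columns S ∪ {s_new} are ever read in a given step, which is the source of the remark about not looking at all N^2 entries.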

Upshot. Markov chains are a simple and general model for a state evolving randomly in time. Under mild conditions, Markov chains converge to a stationary distribution: In the limit of a large number of steps, the state of the system becomes randomly distributed in a way independent of how it was initialized. We can use Markov chains as algorithms to approximately sample from challenging distributions.

