This post (or rather a collection of notes) is an attempt to go through different concepts around MCMC methods from the ground up. I will try to gather and structure my learnings around this topic in a way that is clear, intelligible, and beginner-friendly.
Sampling from complex, high-dimensional probability distributions is a fundamental challenge in statistics, machine learning, and many applied fields. The core idea is that Markov Chain Monte Carlo (MCMC) methods overcome this challenge by constructing a Markov chain whose stationary distribution is the target distribution.
In practice, MCMC is used for Bayesian inference, uncertainty quantification, and solving inverse problems in areas such as audio and image processing. This blog post offers an accessible introduction to MCMC methods, from their theoretical foundations to Python implementations of these algorithms for real-world use cases.
A Markov chain is a sequence of random variables $\{X_t\}_{t \geq 0}$ that satisfies the Markov property:
\[P(X_{t+1} = y \mid X_t = x, X_{t-1} = x_{t-1}, \dots) = P(X_{t+1}=y \mid X_t=x).\]The chain's dynamics are defined by a transition probability $P(x,y)$ (or transition kernel) that does not depend on time; such a chain is called time-homogeneous.
A probability distribution $\pi(x)$ is said to be stationary (or invariant) if, for all states $y$,
\[\pi(y) = \sum_{x} \pi(x) P(x, y)\]This means that if $X_0 \sim \pi$, then $X_t \sim \pi$ for every $t$.
A sufficient condition for stationarity is the detailed balance condition:
\[\pi(x) P(x, y) = \pi(y) P(y, x), \quad \forall x, y\]Proof sketch: summing both sides over $x$ gives
\[\sum_{x} \pi(x) P(x, y) = \pi(y) \sum_{x} P(y, x) = \pi(y),\]since the transition probabilities out of $y$ sum to one. Thus, detailed balance guarantees that $\pi$ is invariant under the chain dynamics.
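As a quick numerical check, the snippet below verifies both claims on a made-up two-state chain; the transition matrix and $\pi$ are arbitrary choices that happen to satisfy detailed balance:

import numpy as np

# A two-state chain: transition matrix P and candidate stationary
# distribution pi, chosen so that detailed balance holds:
# pi[0] * P[0, 1] = 0.75 * 0.2 = 0.15 = 0.25 * 0.6 = pi[1] * P[1, 0]
P = np.array([[0.8, 0.2],
              [0.6, 0.4]])
pi = np.array([0.75, 0.25])

assert np.isclose(pi[0] * P[0, 1], pi[1] * P[1, 0])  # detailed balance
print(pi @ P)  # [0.75 0.25] -- one step of the chain leaves pi unchanged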
For the empirical averages from the chain to converge to expectations under $\pi$, the chain must be irreducible (every state can be reached from every other state), aperiodic (the chain does not cycle deterministically), and positive recurrent (the expected return time to any state is finite).
If these conditions hold, the ergodic theorem asserts that for any integrable function $f$:
\[\frac{1}{N}\sum_{t=1}^{N} f(X_t) \longrightarrow \mathbb{E}_{\pi}[f(x)]\]as $N \to \infty$.
This is the foundation behind using MCMC to approximate integrals and expectations.
Below are several common Monte Carlo sampling algorithms with some theoretical details, context, and pseudocode.
Context & Objective:
Often, our goal is to compute expectations under a complex target distribution $\pi(x)$. For example, we may wish to evaluate:
\[I = \int f(x)\pi(x)\,dx\]When direct sampling from $\pi(x)$ is infeasible, we introduce a proposal (or importance) distribution $q(x)$ that is easier to sample from.
Theory & Formulation:
Given samples $x_1, x_2, \dots, x_N$ drawn from $q(x)$, the expectation is estimated as:
\[I\approx \frac{\sum_{i=1}^{N} f(x_i) w(x_i)}{\sum_{i=1}^{N} w(x_i)},\]with importance weights defined by:
\[w(x) = \frac{\pi(x)}{q(x)}\]Key considerations: the support of $q$ must cover that of $\pi$, and the variance of the weights determines the quality of the estimate; a poorly matched proposal concentrates nearly all the weight on a handful of samples. Note also that the self-normalized estimator above only requires $\pi$ up to a normalizing constant.
Pseudocode:
For i = 1 to N:
    Sample x_i ~ q(x)
    Compute weight w_i = π(x_i) / q(x_i)
Estimate I ≈ (Σ f(x_i) w_i) / (Σ w_i)
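As a concrete illustration, here is a minimal sketch of self-normalized importance sampling; the standard normal target, the wider Gaussian proposal, and $f(x) = x^2$ are all arbitrary choices for the example:

import numpy as np
from scipy.stats import norm

np.random.seed(0)
N = 100_000

# Target pi: standard normal (in general known only up to a constant).
# Proposal q: a wider normal, easy to sample from.
x = np.random.randn(N) * 2.0                # x_i ~ q = N(0, 2^2)
w = norm.pdf(x) / norm.pdf(x, scale=2.0)    # w_i = pi(x_i) / q(x_i)

# Self-normalized estimate of E_pi[f(x)] with f(x) = x^2 (true value: 1)
estimate = np.sum(x**2 * w) / np.sum(w)
print(estimate)  # close to 1.0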
Context & Objective:
MH constructs a Markov chain whose stationary distribution is $\pi(x)$. It proposes moves using a proposal density $q(x'|x)$ and accepts these moves with a carefully designed acceptance probability. The key idea is to design the transition probability $P(x \to x')$ so that the target distribution is invariant. A sufficient condition is detailed balance, which states that for all states $x$ and $x'$:
$\pi(x) P(x \to x') = \pi(x') P(x' \to x)$
In MH, the transition probability is given by:
\[P(x \to x') = q(x'|x) \alpha(x, x')\]where $\alpha(x, x')$ is the acceptance probability. To satisfy detailed balance, we require:
\[\pi(x) q(x'|x) \alpha(x, x') = \pi(x') q(x|x') \alpha(x', x)\]A common and effective choice is to define $\alpha(x, x')$ as:
\[\alpha(x, x') = \min\left\{1, \frac{\pi(x') \, q(x|x')}{\pi(x) \, q(x'|x)}\right\}\]When $\pi(x') q(x \mid x') \geq \pi(x) q(x' \mid x)$, we have $\alpha(x, x') = 1$; otherwise, the move is accepted with probability $\frac{\pi(x') \, q(x \mid x')}{\pi(x) \, q(x' \mid x)} < 1$.
Convergence and ergodicity
The MH algorithm constructs a Markov chain that, under suitable conditions (irreducibility, aperiodicity, and positive recurrence), converges to the target distribution $\pi(x)$. The ergodic theorem then guarantees that time averages computed from the chain will converge to the expectations under $\pi(x)$:
\[\frac{1}{N}\sum_{t=1}^{N} f(x_t) \longrightarrow \mathbb{E}_{\pi}[f(x)] \quad \text{as } N \to \infty\]Practical considerations:
Choice of Proposal: the efficiency of the MH algorithm depends heavily on how well the proposal distribution explores the state space. If $q(x' \mid x)$ is too narrow, the chain explores slowly; if it is too wide, the acceptance rate may drop.
Algorithm & Pseudocode:
Initialize x₀
For t = 0 to N - 1:
    Propose x' ~ q(x'|x_t)
    Calculate acceptance probability:
        α = min{1, [π(x') q(x_t|x')] / [π(x_t) q(x'|x_t)]}
    With probability α:
        Set x_(t+1) = x'
    Else:
        Set x_(t+1) = x_t
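For concreteness, here is a minimal random-walk MH sketch; the two-component Gaussian mixture target and the proposal scale are arbitrary choices, and the $q$ terms cancel because the Gaussian random-walk proposal is symmetric:

import numpy as np

def log_pi(x):
    # Unnormalized log-density of a toy two-component Gaussian mixture
    return np.logaddexp(-0.5 * (x - 2.0)**2, -0.5 * (x + 2.0)**2)

np.random.seed(0)
N, proposal_std = 50_000, 1.0
x = 0.0
samples = np.empty(N)
for t in range(N):
    x_prop = x + proposal_std * np.random.randn()  # symmetric proposal
    # log alpha = log pi(x') - log pi(x); q cancels by symmetry
    if np.log(np.random.rand()) < log_pi(x_prop) - log_pi(x):
        x = x_prop
    samples[t] = x
print(samples.mean(), samples.var())  # mean near 0, variance near 5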
Context & Objective:
Gibbs sampling is a special case of the Metropolis–Hastings algorithm, optimized for high-dimensional problems where the joint distribution $\pi(x)$ is difficult to sample from directly, but the full conditional distributions $\pi(x_i \mid x_{-i})$ (where $x_{-i}$ denotes all components except $x_i$) are tractable.
Suppose $x = (x_1, x_2, \dots, x_d)$ is a $d$-dimensional vector. The Gibbs sampler iterates over each coordinate and updates it by sampling from the full conditional distribution:
\[x_i^{(t+1)} \sim \pi\left(x_i \mid x_1^{(t+1)}, \dots, x_{i-1}^{(t+1)}, x_{i+1}^{(t)}, \dots, x_d^{(t)}\right)\]Since each update is drawn exactly from the full conditional, the move is automatically accepted. The chain is constructed to have $\pi(x)$ as its stationary distribution.
Theoretical details:
Detailed Balance in Gibbs Sampling: Although Gibbs sampling does not require an explicit acceptance step, one can show that it satisfies detailed balance. For two states $x$ and $x'$ that differ only in the $i$-th coordinate, the update probability is given by the full conditional:
$P(x \to x') = \pi\left(x_i' \mid x_{-i}\right)$
Writing $\pi(x) = \pi(x_{-i})\,\pi(x_i \mid x_{-i})$, it can be verified that:
$\pi(x)\, \pi\left(x_i' \mid x_{-i}\right) = \pi(x_{-i})\,\pi(x_i \mid x_{-i})\,\pi(x_i' \mid x_{-i}) = \pi(x')\, \pi\left(x_i \mid x_{-i}\right)$
which is exactly the detailed balance requirement.
Convergence and ergodicity:
Gibbs sampling inherits the convergence properties of Markov chains. Provided that the chain is irreducible and aperiodic (often ensured by the structure of the conditional distributions), the Gibbs sampler is ergodic, meaning that the empirical averages converge to the true expectations under $\pi(x)$.
Practical considerations: Gibbs sampling requires tractable full conditionals, and strong correlation between coordinates slows mixing; updating highly correlated variables jointly (blocking) or randomizing the scan order can help.
Pseudocode:
Initialize x = (x₁, x₂, …, x_d)
For t = 0 to N - 1:
    For i = 1 to d:
        Sample x_i^(t+1) ~ π(x_i | x₁^(t+1), …, x_(i-1)^(t+1), x_(i+1)^(t), …, x_d^(t))
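Before the full Bayesian regression example later in the post, here is a minimal Gibbs sketch for a toy target, a standard bivariate normal with correlation $\rho$, chosen because its full conditionals are Gaussian in closed form:

import numpy as np

np.random.seed(0)
rho, N = 0.9, 20_000
x1, x2 = 0.0, 0.0
samples = np.empty((N, 2))
for t in range(N):
    # Full conditionals of a standard bivariate normal with correlation rho:
    # x1 | x2 ~ N(rho * x2, 1 - rho^2), and symmetrically for x2 | x1
    x1 = rho * x2 + np.sqrt(1 - rho**2) * np.random.randn()
    x2 = rho * x1 + np.sqrt(1 - rho**2) * np.random.randn()
    samples[t] = (x1, x2)
print(np.corrcoef(samples.T)[0, 1])  # close to rho = 0.9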
Context & Objective:
Hamiltonian Monte Carlo (HMC) augments the state $x$ with an auxiliary momentum variable $p$ and is based on the Hamiltonian function:
\[H(x, p) = U(x) + K(p)\]where $U(x) = -\log \pi(x)$ is the potential energy and $K(p) = \frac{1}{2} p^\top M^{-1} p$ is the kinetic energy, with $M$ a positive-definite mass matrix (often the identity).
Hamilton’s equations describe the evolution of $x$ and $p$:
$\frac{dx}{dt} = \nabla_p H(x, p) = M^{-1}p, \qquad \frac{dp}{dt} = -\nabla_x H(x, p) = -\nabla U(x)$
Leapfrog integrator
In practice, Hamilton’s equations are solved numerically using the leapfrog integrator, which is chosen for its symplectic (volume-preserving) and time-reversible properties. The leapfrog update is performed in three steps:
$p\left(t + \frac{\epsilon}{2}\right) = p(t) - \frac{\epsilon}{2}\nabla U(x(t))$
$x(t + \epsilon) = x(t) + \epsilon\, M^{-1} p\left(t + \frac{\epsilon}{2}\right)$
$p(t + \epsilon) = p\left(t + \frac{\epsilon}{2}\right) - \frac{\epsilon}{2}\nabla U(x(t + \epsilon))$
Repeating these steps for $L$ iterations produces a proposal $(x^*, p^*)$.
Because numerical integration introduces discretization errors, HMC employs a Metropolis acceptance step to correct for these errors. The acceptance probability is given by:
\[\alpha = \min\left\{1, \exp\Big[-H(x^*, p^*) + H(x, p)\Big]\right\}\]This step ensures that the overall transition kernel satisfies detailed balance with respect to the augmented target distribution $\pi(x) \, \mathcal{N}(p;0,M)$.
Convergence and efficiency: by following approximate Hamiltonian trajectories, HMC suppresses the random-walk behavior of simpler samplers and can propose distant states that are still accepted with high probability. Its efficiency hinges on tuning the step size $\epsilon$ and the number of leapfrog steps $L$.
Pseudocode:
Initialize x₀
For t = 0 to N - 1:
    Sample momentum p ~ N(0, M)
    Set (x, p) = (x_t, p)
    Simulate Hamiltonian dynamics using the leapfrog integrator:
        For l = 1 to L:
            p = p + (ε/2) * ∇log π(x)
            x = x + ε * M^(-1) * p
            p = p + (ε/2) * ∇log π(x)
    Perform MH accept/reject step with probability:
        α = min{1, exp[-H(x*, p*) + H(x_t, p)]}
    If accepted:
        x_(t+1) = x*
    Else:
        x_(t+1) = x_t
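As a minimal illustration before the image example later on, here is HMC for a one-dimensional standard normal target, with $U(x) = x^2/2$, $\nabla \log \pi(x) = -x$, and $M = I$; all tuning values are arbitrary choices for the sketch:

import numpy as np

def grad_log_pi(x):
    return -x  # toy target: standard normal, log pi(x) = -x^2/2 + const

np.random.seed(0)
eps, L, N = 0.1, 20, 5_000
x = 0.0
samples = np.empty(N)
for t in range(N):
    p = np.random.randn()  # sample momentum, M = 1
    x_new, p_new = x, p
    # Leapfrog: half kick, alternating drift/kick, final half kick
    p_new += 0.5 * eps * grad_log_pi(x_new)
    for l in range(L):
        x_new += eps * p_new
        if l < L - 1:
            p_new += eps * grad_log_pi(x_new)
    p_new += 0.5 * eps * grad_log_pi(x_new)
    # Accept with probability min(1, exp(H_old - H_new)), H = U + K
    H_old = 0.5 * x**2 + 0.5 * p**2
    H_new = 0.5 * x_new**2 + 0.5 * p_new**2
    if np.log(np.random.rand()) < H_old - H_new:
        x = x_new
    samples[t] = x
print(samples.mean(), samples.var())  # near 0 and 1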
Context & Objective:
The Metropolis-Adjusted Langevin Algorithm (MALA) enhances standard Metropolis–Hastings by incorporating gradient information to propose moves that are more likely to be accepted. It can be viewed as a discretized Langevin diffusion combined with an MH correction.
Langevin dynamics:
Consider the overdamped Langevin equation, which describes the evolution of $x$ in continuous time:
\[dx_t = \frac{1}{2}\nabla \log \pi(x_t) \, dt + dW_t\]where $dW_t$ represents a Wiener process (or Brownian motion). The stationary distribution of this stochastic differential equation is $\pi(x)$.
Discretization and proposal:
Discretizing the Langevin equation with step size $\epsilon$ gives the proposal:
\[x' = x^{(t)} + \frac{\epsilon^2}{2}\nabla \log \pi(x^{(t)}) + \epsilon\, \eta, \quad \eta \sim \mathcal{N}(0, I)\]This proposal is asymmetric due to the drift term $\frac{\epsilon^2}{2}\nabla \log \pi(x^{(t)})$.
Metropolis correction for MALA:
To correct for the discretization error and ensure that the chain converges to $\pi(x)$, an MH acceptance step is applied. The acceptance probability is computed as:
\[\alpha = \min\left\{1, \frac{\pi(x') \, q(x^{(t)} \mid x')}{\pi(x^{(t)}) \, q(x' \mid x^{(t)})}\right\}\]where the proposal density $q$ is given by:
\[q(x' \mid x) = \mathcal{N}\left(x'; x + \frac{\epsilon^2}{2}\nabla \log \pi(x), \, \epsilon^2 I\right)\]Convergence and efficiency: the gradient drift steers proposals toward high-probability regions, so MALA typically mixes faster than random-walk MH; classical scaling results suggest tuning $\epsilon$ toward an acceptance rate around 0.574 (versus roughly 0.234 for random-walk MH).
Pseudocode:
Initialize x₀
For t = 0 to N - 1:
    Compute gradient g = ∇log π(x_t)
    Propose x' = x_t + (ε²/2) * g + ε * η, where η ~ N(0, I)
    Compute asymmetric proposal densities q(x'|x_t) and q(x_t|x')
    Calculate acceptance probability:
        α = min{1, [π(x') q(x_t|x')] / [π(x_t) q(x'|x_t)]}
    Accept or reject accordingly.
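The sketch below implements this recipe for a one-dimensional standard normal target (a toy choice); note that, unlike random-walk MH, the forward and reverse proposal densities do not cancel:

import numpy as np

def log_pi(x):
    return -0.5 * x**2  # toy target: standard normal, up to a constant

def grad_log_pi(x):
    return -x

def log_q(x_to, x_from, eps):
    # q(x_to | x_from) = N(x_to; x_from + (eps^2/2) grad log pi(x_from), eps^2)
    mean = x_from + 0.5 * eps**2 * grad_log_pi(x_from)
    return -0.5 * (x_to - mean)**2 / eps**2

np.random.seed(0)
eps, N = 0.5, 20_000
x = 0.0
samples = np.empty(N)
for t in range(N):
    x_prop = x + 0.5 * eps**2 * grad_log_pi(x) + eps * np.random.randn()
    log_alpha = (log_pi(x_prop) + log_q(x, x_prop, eps)
                 - log_pi(x) - log_q(x_prop, x, eps))
    if np.log(np.random.rand()) < log_alpha:
        x = x_prop
    samples[t] = x
print(samples.mean(), samples.var())  # near 0 and 1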
Instead of implementing vanilla MCMC algorithms on toy examples and recovering simple distributions with libraries like NumPy and SciPy, we chose to implement these algorithms in the context of real-world use cases, ranging from computer vision to audio signal modeling.
These implementations can be tweaked and reused for solving real-world problems.
Here is a link to a GitHub repository where you can find the full implementations with some nice visualizations:
Context:
In Bayesian linear regression, we model the relationship:
\[y = X\beta + \epsilon,\quad \epsilon\sim\mathcal{N}(0,\sigma^2I)\]with priors $\beta \sim \mathcal{N}(\mu_0, \Sigma_0)$ and $\sigma^2 \sim \text{Inv-Gamma}(\alpha_0, \beta_0)$. The Gibbs sampler alternates between sampling $\beta$ and $\sigma^2$.
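For reference, the full conditionals under these conjugate priors are standard results, and they are exactly what the implementation below samples from:
\[\beta \mid \sigma^2, y \sim \mathcal{N}(m, V), \quad V = \left(\Sigma_0^{-1} + \frac{X^\top X}{\sigma^2}\right)^{-1}, \quad m = V\left(\Sigma_0^{-1}\mu_0 + \frac{X^\top y}{\sigma^2}\right)\]
\[\sigma^2 \mid \beta, y \sim \text{Inv-Gamma}\left(\alpha_0 + \frac{n}{2},\; \beta_0 + \frac{1}{2}\lVert y - X\beta \rVert^2\right)\]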
Python Implementation:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import invgamma, multivariate_normal
# Generate synthetic data
np.random.seed(42)
n, d = 100, 2
X = np.hstack((np.ones((n,1)), np.random.randn(n, d-1)))
true_beta = np.array([1.0, 2.0])
sigma_true = 1.0
y = X @ true_beta + sigma_true * np.random.randn(n)
# Prior hyperparameters
beta_prior_mean = np.zeros(d)
beta_prior_cov = np.eye(d) * 10
alpha_prior = 2.0
beta_prior_val = 1.0
# Number of iterations for Gibbs sampling
iterations = 5000
beta_samples = np.zeros((iterations, d))
sigma2_samples = np.zeros(iterations)
# Initial values
beta_current = np.zeros(d)
sigma2_current = 1.0
for i in range(iterations):
    # Sample beta | sigma^2, y, X
    V_beta = np.linalg.inv(np.linalg.inv(beta_prior_cov) + (X.T @ X) / sigma2_current)
    m_beta = V_beta @ (np.linalg.inv(beta_prior_cov) @ beta_prior_mean + (X.T @ y) / sigma2_current)
    beta_current = multivariate_normal.rvs(mean=m_beta, cov=V_beta)

    # Sample sigma^2 | beta, y, X
    alpha_post = alpha_prior + n / 2
    residuals = y - X @ beta_current
    beta_post = beta_prior_val + 0.5 * np.sum(residuals**2)
    sigma2_current = invgamma.rvs(a=alpha_post, scale=beta_post)

    beta_samples[i, :] = beta_current
    sigma2_samples[i] = sigma2_current
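After discarding a burn-in period, the retained draws approximate the posterior, so point estimates and credible intervals follow from simple empirical summaries (a usage sketch; the burn-in length is an arbitrary choice):

# Discard burn-in and summarize the posterior draws
burn_in = 1000
beta_post_samples = beta_samples[burn_in:]
print("Posterior mean of beta:", beta_post_samples.mean(axis=0))
print("95% credible interval:",
      np.percentile(beta_post_samples, [2.5, 97.5], axis=0))
print("Posterior mean of sigma^2:", sigma2_samples[burn_in:].mean())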
Context:
Reconstructing a clean audio signal $s(t)$ from a noisy observation $y(t)$ involves preprocessing (e.g. filtering) and sampling from the posterior:
\[p(s|y) \propto p(y|s)p(s)\]where $p(s)$ enforces smoothness.
Preprocessing & MH Implementation:
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import butter, filtfilt
# Generate synthetic audio: sine wave with noise
np.random.seed(42)
t = np.linspace(0, 1, 500)
s_true = np.sin(2 * np.pi * 5 * t)
noise_std = 0.3
y_noisy = s_true + noise_std * np.random.randn(len(t))
# --- Preprocessing ---
# Apply a low-pass Butterworth filter to remove high-frequency noise
def butter_lowpass_filter(data, cutoff, fs, order=4):
    nyq = 0.5 * fs
    normal_cutoff = cutoff / nyq
    b, a = butter(order, normal_cutoff, btype='low', analog=False)
    y_filtered = filtfilt(b, a, data)
    return y_filtered
fs = 500 # Sampling frequency (Hz)
cutoff = 10 # Cutoff frequency (Hz)
y_filtered = butter_lowpass_filter(y_noisy, cutoff, fs)
# --- Metropolis–Hastings for Signal Reconstruction ---
def log_likelihood(s, y, sigma):
    return -0.5 * np.sum((y - s)**2) / sigma**2

def log_smoothness_prior(s, lambda_reg=100):
    diff = np.diff(s)
    return -lambda_reg * np.sum(diff**2)

def log_target(s, y, sigma, lambda_reg=100):
    # Work in log space: exponentiating these sums underflows for long signals
    return log_likelihood(s, y, sigma) + log_smoothness_prior(s, lambda_reg)

def mh_audio_reconstruction(y, sigma_noise, iterations=3000, proposal_std=0.05):
    s_current = y.copy()  # Initialize with the preprocessed signal
    log_p_current = log_target(s_current, y, sigma_noise)
    samples = []
    for i in range(iterations):
        s_proposal = s_current + np.random.normal(0, proposal_std, size=s_current.shape)
        log_p_proposal = log_target(s_proposal, y, sigma_noise)
        # Accept with probability min(1, exp(log ratio)); compare in log space
        if np.log(np.random.rand()) < log_p_proposal - log_p_current:
            s_current = s_proposal
            log_p_current = log_p_proposal
        samples.append(s_current.copy())
    return np.array(samples)
# Run MH sampler on the preprocessed (filtered) signal
iterations = 3000
samples = mh_audio_reconstruction(y_filtered, sigma_noise=noise_std, iterations=iterations)
s_reconstructed = samples[-1]
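Rather than keeping only the final draw, averaging the post burn-in samples gives a posterior-mean estimate of the signal (a sketch; the half-chain split is an arbitrary choice):

# Posterior-mean reconstruction from the second half of the chain
s_post_mean = samples[iterations // 2:].mean(axis=0)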
Context:
For image deblurring or reconstruction, consider the posterior:
\[p(I|Y) \propto p(Y|I)p(I)\]where $I$ is the image, $Y$ is the observation, and $p(I)$ encodes spatial smoothness. HMC efficiently explores high-dimensional image spaces.
Simplified HMC Implementation (2D Image Denoising):
import numpy as np
import matplotlib.pyplot as plt
from scipy.ndimage import gaussian_filter
# Create synthetic image: gradient image with noise
np.random.seed(42)
image_size = 64
true_image = np.outer(np.linspace(0, 1, image_size), np.linspace(0, 1, image_size))
noisy_image = true_image + 0.3 * np.random.randn(image_size, image_size)
# Define the target log-probability (negative energy) for the image.
# Combines a data fidelity term with a smoothness prior.
def log_target(I, Y, sigma, lambda_reg):
    fidelity = -0.5 * np.sum((Y - I)**2) / sigma**2
    # Smoothness via a quadratic penalty on finite differences
    smoothness = -lambda_reg * (np.sum(np.diff(I, axis=0)**2) + np.sum(np.diff(I, axis=1)**2))
    return fidelity + smoothness
# HMC parameters
step_size = 0.001
num_steps = 20
iterations = 100
sigma_noise = 0.3
lambda_reg = 0.1
# Initialize with the noisy image
I_current = noisy_image.copy()
def hmc_update(I_current, Y, step_size, num_steps, sigma, lambda_reg):
    I = I_current.copy()
    momentum = np.random.randn(*I.shape)
    current_momentum = momentum.copy()

    # Analytic gradient of log_target with respect to the image
    def grad_log_target(I):
        grad = np.zeros_like(I)
        # Fidelity term gradient
        grad += (Y - I) / sigma**2
        # Smoothness gradient (derivative of the quadratic difference penalty)
        grad[:-1, :] += 2 * lambda_reg * (I[1:, :] - I[:-1, :])
        grad[1:, :] -= 2 * lambda_reg * (I[1:, :] - I[:-1, :])
        grad[:, :-1] += 2 * lambda_reg * (I[:, 1:] - I[:, :-1])
        grad[:, 1:] -= 2 * lambda_reg * (I[:, 1:] - I[:, :-1])
        return grad

    # Leapfrog integration: half kick, alternating drift/kick, final half kick
    grad = grad_log_target(I)
    momentum += 0.5 * step_size * grad
    for step in range(num_steps):
        I += step_size * momentum
        grad = grad_log_target(I)
        if step < num_steps - 1:
            momentum += step_size * grad
    momentum += 0.5 * step_size * grad
    # Negate momentum for reversibility
    momentum = -momentum

    # Compute Hamiltonians (potential = -log_target, kinetic = ||p||^2 / 2)
    current_H = -log_target(I_current, Y, sigma, lambda_reg) + 0.5 * np.sum(current_momentum**2)
    proposed_H = -log_target(I, Y, sigma, lambda_reg) + 0.5 * np.sum(momentum**2)
    if np.random.rand() < np.exp(current_H - proposed_H):
        return I, True
    else:
        return I_current, False
hmc_images = []
accepted = 0
for it in range(iterations):
    I_new, acc = hmc_update(I_current, noisy_image, step_size, num_steps, sigma_noise, lambda_reg)
    if acc:
        accepted += 1
    I_current = I_new
    hmc_images.append(I_current.copy())
print(f"HMC Acceptance Rate: {accepted/iterations:.3f}")
Context:
Implicit neural representations (e.g., Neural Signed Distance Functions, SDFs) model continuous signals (e.g., 3D shapes) using neural networks. Uncertainty can be captured by placing a prior over latent variables. Here, we use MALA to sample from the posterior over a latent variable in a small Neural SDF model.
Improved Implementation & Visualization:
import torch
import torch.nn as nn
import matplotlib.pyplot as plt
import numpy as np
# Define a simple Neural SDF with a latent vector parameter
class NeuralSDF(nn.Module):
    def __init__(self, latent_dim=5):
        super(NeuralSDF, self).__init__()
        self.latent = nn.Parameter(torch.randn(latent_dim))
        self.fc = nn.Sequential(
            nn.Linear(3 + latent_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 64),
            nn.ReLU(),
            nn.Linear(64, 1)
        )

    def forward(self, x):
        # x: [batch, 3]
        latent_expand = self.latent.unsqueeze(0).expand(x.size(0), -1)
        x_input = torch.cat([x, latent_expand], dim=1)
        return self.fc(x_input)
# Generate simulated observations: points near a sphere of radius 0.8
def generate_sdf_observations(n_points=200):
    points = torch.rand(n_points, 3) * 2 - 1  # uniformly in [-1,1]^3
    sdf_true = torch.norm(points, dim=1, keepdim=True) - 0.8
    sdf_obs = sdf_true + 0.05 * torch.randn_like(sdf_true)
    return points, sdf_obs

points, sdf_obs = generate_sdf_observations(200)
# Define log-likelihood and log-prior for the latent variable
def log_likelihood(model, points, sdf_obs, sigma=0.05):
    pred = model(points)
    return -0.5 * torch.sum((sdf_obs - pred)**2) / sigma**2

def log_prior(model):
    return -0.5 * torch.sum(model.latent**2)

def log_posterior(model, points, sdf_obs, sigma=0.05):
    return log_likelihood(model, points, sdf_obs, sigma) + log_prior(model)
# MALA update for the latent variable
def mala_update(model, points, sdf_obs, sigma=0.05, step_size=1e-3):
    # Log-posterior and its gradient at the current latent
    model.zero_grad()
    logp_current = log_posterior(model, points, sdf_obs, sigma)
    logp_current.backward()
    grad_current = model.latent.grad.detach().clone()
    latent_current = model.latent.detach().clone()

    # Langevin proposal: drift along the gradient plus Gaussian noise
    noise = torch.randn_like(latent_current)
    latent_proposal = (latent_current + 0.5 * step_size * grad_current
                       + torch.sqrt(torch.tensor(step_size)) * noise)

    # Log-posterior and gradient at the proposal (needed for the reverse kernel)
    model.latent.data = latent_proposal
    model.latent.grad = None
    logp_proposal = log_posterior(model, points, sdf_obs, sigma)
    logp_proposal.backward()
    grad_proposal = model.latent.grad.detach().clone()

    # Asymmetric Gaussian proposal density q(x_to | x_from), up to a constant
    def log_q(x_to, x_from, grad_from):
        mean = x_from + 0.5 * step_size * grad_from
        return -torch.sum((x_to - mean)**2) / (2 * step_size)

    # Full MH correction with the asymmetric proposal densities
    log_alpha = (logp_proposal.detach() + log_q(latent_current, latent_proposal, grad_proposal)
                 - logp_current.detach() - log_q(latent_proposal, latent_current, grad_current))
    if torch.log(torch.rand(1)) < log_alpha:
        accepted = True
        lp = logp_proposal.item()
    else:
        model.latent.data = latent_current  # revert on rejection
        accepted = False
        lp = logp_current.item()
    return accepted, model.latent.data.clone(), lp
model = NeuralSDF(latent_dim=5)
iterations = 2000
latent_samples = []
log_probs = []
accepts = 0
for i in range(iterations):
    accepted, latent_sample, lp = mala_update(model, points, sdf_obs, sigma=0.05, step_size=1e-3)
    latent_samples.append(latent_sample.numpy())
    log_probs.append(lp)
    if accepted:
        accepts += 1
print(f"MALA Acceptance Rate: {accepts/iterations:.3f}")
In this post, we have presented a rigorous exploration of MCMC methods. We began with theoretical foundations and then developed multiple algorithms with step-by-step pseudocode and theoretical justification. We detailed five major techniques: Importance Sampling, Metropolis–Hastings, Gibbs Sampling, Hamiltonian Monte Carlo, and MALA.
The use cases further demonstrate the practicality of these methods: