Introduction to MCMC methods: From importance sampling to advanced algorithms

1. Introduction

This post (or rather a collection of notes) is an attempt to go through different concepts around MCMC methods from the ground up. I will try to gather and structure my learnings on this topic in a way that is clear, intelligible, and beginner-friendly.

Sampling from complex, high-dimensional probability distributions is a fundamental challenge in statistics, machine learning, and many applied fields. Markov Chain Monte Carlo (MCMC) methods address this challenge by constructing a Markov chain whose stationary distribution is the target distribution.

In practice, MCMC is used for Bayesian inference, uncertainty quantification, and solving inverse problems in areas such as audio and image processing. This blog post offers an accessible introduction to MCMC methods, from their theoretical foundations to Python implementations of these algorithms for real-world use cases.

2. Theoretical Foundations

2.1. Markov Chains and Stationarity

A Markov chain is a sequence of random variables $\{X_t\}_{t \geq 0}$ that satisfies the Markov property:

\[P(X_{t+1} = y \mid X_t = x, X_{t-1} = x_{t-1}, \dots) = P(X_{t+1}=y \mid X_t=x).\]

The chain’s dynamics are defined by a transition probability $P(x,y)$ (or kernel) that is time-independent.

A probability distribution $\pi(x)$ is said to be stationary (or invariant) if, for all states $y$,

\[\pi(y) = \sum_{x} \pi(x) P(x, y)\]

This means that if $X_0 \sim \pi$, then $X_t \sim \pi$ for every $t$.

2.2. Detailed Balance and Invariance

A sufficient condition for stationarity is the detailed balance condition:

\[\pi(x) P(x, y) = \pi(y) P(y, x), \quad \forall x, y\]

Proof Sketch:

  1. Assume detailed balance: for every pair $(x,y)$:
\[\pi(x) P(x, y) = \pi(y) P(y, x)\]
  2. Sum over all $x$ for a fixed $y$:
\[\sum_{x} \pi(x) P(x, y) = \sum_{x} \pi(y) P(y, x) = \pi(y) \sum_{x} P(y, x)\]
  3. Normalization: since $\sum_{x} P(y, x) = 1$, we obtain:
\[\sum_{x} \pi(x) P(x, y) = \pi(y)\]

Thus, detailed balance guarantees that $\pi$ is invariant under the chain dynamics.
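
As a quick numerical illustration, here is a minimal two-state example in Python (the transition matrix and $\pi$ are illustrative choices) verifying both invariance and detailed balance:

import numpy as np

# A two-state chain with transition matrix P (rows sum to 1)
P = np.array([[0.9, 0.1],
              [0.3, 0.7]])

# Candidate stationary distribution
pi = np.array([0.75, 0.25])

print(pi @ P)                            # [0.75 0.25] -> pi is invariant: pi P = pi
print(pi[0] * P[0, 1], pi[1] * P[1, 0])  # 0.075 0.075 -> detailed balance holds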

2.3. Convergence and Ergodicity

For the empirical averages from the chain to converge to expectations under $\pi$, the chain must be:

  - Irreducible: any region with positive probability under $\pi$ can be reached from any starting state;
  - Aperiodic: the chain does not cycle deterministically between subsets of states;
  - Positive recurrent: the chain returns to typical regions in finite expected time.

If these conditions hold, the ergodic theorem asserts that for any integrable function $f$:

\[\frac{1}{N}\sum_{t=1}^{N} f(X_t) \longrightarrow \mathbb{E}_{\pi}[f(x)]\]

as $N \to \infty$.

This is the foundation behind using MCMC to approximate integrals and expectations.

3. MCMC Algorithms

Below are several common MCMC algorithms with some theoretical details, context, and pseudocode.

3.1. Importance Sampling

Context & Objective:

Often, our goal is to compute expectations under a complex target distribution $\pi(x)$. (Strictly speaking, importance sampling is a plain Monte Carlo method rather than an MCMC algorithm, but it is a natural stepping stone toward them.) For example, we may wish to evaluate:

\[I = \int f(x)\pi(x)\,dx\]

When direct sampling from $\pi(x)$ is infeasible, we introduce a proposal (or importance) distribution $q(x)$ that is easier to sample from.

Theory & Formulation:

Given samples $x_1, x_2, \dots, x_N$ drawn from $q(x)$, the expectation is estimated as:

\[I\approx \frac{\sum_{i=1}^{N} f(x_i) w(x_i)}{\sum_{i=1}^{N} w(x_i)},\]

with importance weights defined by:

\[w(x) = \frac{\pi(x)}{q(x)}\]

Key Considerations:

  - The support of $q$ must cover that of $\pi$: we need $q(x) > 0$ wherever $f(x)\pi(x) \neq 0$.
  - The self-normalized estimator above only requires $\pi$ up to a normalizing constant.
  - If $q$ matches $\pi$ poorly, a few large weights dominate the estimate (weight degeneracy), inflating the variance; this problem worsens quickly in high dimensions.

Pseudocode:

For i = 1 to N:
    Sample x_i ~ q(x)
    Compute weight w_i = π(x_i) / q(x_i)
Estimate I ≈ (Σ f(x_i) w_i) / (Σ w_i)
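
To make this concrete, here is a minimal Python sketch (a standalone toy example, not from the repository): the target is a standard normal known only up to its normalizing constant, the proposal is a wider Gaussian, and we estimate $\mathbb{E}[x^2] = 1$:

import numpy as np

rng = np.random.default_rng(0)

# Unnormalized target: standard normal (normalizing constant deliberately omitted)
def pi_unnorm(x):
    return np.exp(-0.5 * x**2)

# Proposal: N(0, 2^2); wider than the target, so its support covers pi
def q_pdf(x):
    return np.exp(-0.125 * x**2) / (2 * np.sqrt(2 * np.pi))

N = 100_000
x = rng.normal(0, 2, size=N)          # samples from q
w = pi_unnorm(x) / q_pdf(x)           # importance weights
I_hat = np.sum(x**2 * w) / np.sum(w)  # self-normalized estimate of E[x^2]
print(I_hat)  # ≈ 1.0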

3.2. Metropolis–Hastings (MH)

Context & Objective:

MH constructs a Markov chain whose stationary distribution is $\pi(x)$. It proposes moves using a proposal density $q(x'|x)$ and accepts these moves with a carefully designed acceptance probability. The key idea is to design the transition probability $P(x \to x')$ so that the target distribution is invariant. A sufficient condition is detailed balance, which states that for all states $x$ and $x'$:

$\pi(x) P(x \to x') = \pi(x') P(x' \to x)$

In MH, the transition probability is given by:

\[P(x \to x') = q(x'|x) \alpha(x, x')\]

where $\alpha(x, x')$ is the acceptance probability. To satisfy detailed balance, we require:

\[\pi(x) q(x'|x) \alpha(x, x') = \pi(x') q(x|x') \alpha(x', x)\]

A common and effective choice is to define $\alpha(x, x')$ as:

\[\alpha(x, x') = \min\left\{1, \frac{\pi(x') \, q(x|x')}{\pi(x) \, q(x'|x)}\right\}\]
When $\pi(x')\,q(x|x') \geq \pi(x)\,q(x'|x)$, we have $\alpha(x, x') = 1$; otherwise, the move is accepted with probability:
\[\frac{\pi(x') \, q(x|x')}{\pi(x) \, q(x'|x)}\]
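
To check that this choice satisfies detailed balance, assume without loss of generality that $\pi(x')\,q(x|x') \leq \pi(x)\,q(x'|x)$, so that $\alpha(x', x) = 1$. Then:

\[\pi(x)\, q(x'|x)\, \alpha(x, x') = \pi(x)\, q(x'|x)\, \frac{\pi(x')\, q(x|x')}{\pi(x)\, q(x'|x)} = \pi(x')\, q(x|x') = \pi(x')\, q(x|x')\, \alpha(x', x),\]

which is exactly the detailed balance condition.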

Convergence and ergodicity

The MH algorithm constructs a Markov chain that, under suitable conditions (irreducibility, aperiodicity, and positive recurrence), converges to the target distribution $\pi(x)$. The ergodic theorem then guarantees that time averages computed from the chain will converge to the expectations under $\pi(x)$:

\[\frac{1}{N}\sum_{t=1}^{N} f(x_t) \longrightarrow \mathbb{E}_{\pi}[f(x)] \quad \text{as } N \to \infty\]

Practical considerations:

  - The proposal scale controls a trade-off: steps that are too small explore slowly, while steps that are too large are rejected often. For random-walk proposals in high dimensions, an acceptance rate around 0.234 is a common tuning target.
  - Only the unnormalized target is needed, since the normalizing constant cancels in the acceptance ratio.
  - Early samples depend on the initialization and are typically discarded as burn-in.

Algorithm & Pseudocode:

Initialize x₀
For t = 0 to N - 1:
    Propose x' ~ q(x'|x_t)
    Calculate acceptance probability:
        α = min{1, [π(x') q(x_t|x')] / [π(x_t) q(x'|x_t)] }
    With probability α:
        Set x_(t+1) = x'
    Else:
        Set x_(t+1) = x_t
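
Here is a minimal random-walk MH sketch in Python (a standalone toy example on a standard normal target; the proposal scale is an illustrative choice). Working with log-densities avoids numerical underflow:

import numpy as np

rng = np.random.default_rng(0)

def log_pi(x):
    return -0.5 * x**2  # unnormalized log-density of N(0, 1)

N, proposal_std = 10_000, 1.0
x = 0.0
samples = np.empty(N)
for t in range(N):
    x_prop = x + rng.normal(0, proposal_std)  # symmetric proposal: q terms cancel
    if np.log(rng.random()) < log_pi(x_prop) - log_pi(x):
        x = x_prop                            # accept with probability min(1, ratio)
    samples[t] = x
print(samples.mean(), samples.var())  # ≈ 0 and ≈ 1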

3.3. Gibbs Sampling

Context & Objective:

Gibbs sampling is a special case of the Metropolis–Hastings algorithm, optimized for high-dimensional problems where the joint distribution $\pi(x)$ is difficult to sample from directly, but the full conditional distributions $\pi(x_i \mid x_{-i})$ (where $x_{-i}$ denotes all components except $x_i$) are tractable.

Suppose $x = (x_1, x_2, \dots, x_d)$ is a $d$-dimensional vector. The Gibbs sampler iterates over each coordinate and updates it by sampling from the full conditional distribution:

\[x_i^{(t+1)} \sim \pi\left(x_i \mid x_1^{(t+1)}, \dots, x_{i-1}^{(t+1)}, x_{i+1}^{(t)}, \dots, x_d^{(t)}\right)\]

Since each update is drawn exactly from the full conditional, the move is automatically accepted. The chain is constructed to have $\pi(x)$ as its stationary distribution.

Theoretical details:

For the update of coordinate $i$, the transition kernel is

\[P(x \to x') = \pi(x_i' \mid x_{-i}),\]

where $x'$ agrees with $x$ on every coordinate except possibly the $i$-th. Invariance can be verified directly: factoring $\pi(x) = \pi(x_{-i})\,\pi(x_i \mid x_{-i})$,

\[\pi(x)\, P(x \to x') = \pi(x_{-i})\,\pi(x_i \mid x_{-i})\,\pi(x_i' \mid x_{-i}) = \pi(x')\, P(x' \to x),\]

which is exactly the detailed balance requirement.

Convergence and ergodicity:

Gibbs sampling inherits the convergence properties of Markov chains. Provided that the chain is irreducible and aperiodic (often ensured by the structure of the conditional distributions), the Gibbs sampler is ergodic, meaning that the empirical averages converge to the true expectations under $\pi(x)$.

Practical considerations:

  - Gibbs sampling requires tractable full conditionals, which typically arise in conjugate models (as in the Bayesian linear regression example below).
  - Strongly correlated components slow mixing, since each coordinate can only move a little while the others are held fixed; blocking correlated variables or collapsing (integrating some out) helps.
  - No proposal tuning is needed, since every draw is accepted.

Pseudocode:

Initialize x = (x₁, x₂, …, x_d)
For t = 0 to N - 1:
    For i = 1 to d:
        Sample x_i^(t+1) ~ π(x_i | x₁^(t+1), …, x_(i-1)^(t+1), x_(i+1)^(t), …, x_d^(t))
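
As a standalone sketch, here is a Gibbs sampler for a bivariate normal with correlation $\rho$ (an illustrative target), where both full conditionals are univariate normals and can be sampled exactly:

import numpy as np

rng = np.random.default_rng(0)
rho, N = 0.8, 10_000
x1 = x2 = 0.0
samples = np.empty((N, 2))
for t in range(N):
    # Full conditionals of a standard bivariate normal with correlation rho:
    # x1 | x2 ~ N(rho * x2, 1 - rho^2), and symmetrically for x2 | x1
    x1 = rng.normal(rho * x2, np.sqrt(1 - rho**2))
    x2 = rng.normal(rho * x1, np.sqrt(1 - rho**2))
    samples[t] = (x1, x2)
print(np.corrcoef(samples.T)[0, 1])  # ≈ 0.8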

3.4. Hamiltonian Monte Carlo (HMC)

Context & Objective:

HMC is based on the Hamiltonian function:

\[H(x, p) = U(x) + K(p)\]

where:

  - $U(x) = -\log \pi(x)$ is the potential energy, derived from the target distribution;
  - $K(p) = \frac{1}{2} p^\top M^{-1} p$ is the kinetic energy of an auxiliary momentum variable $p \sim \mathcal{N}(0, M)$, with mass matrix $M$.

Hamilton’s equations describe the evolution of $x$ and $p$:

$\frac{dx}{dt} = \nabla_p H(x, p) = M^{-1}p, \qquad \frac{dp}{dt} = -\nabla_x H(x, p) = -\nabla U(x)$

Leapfrog integrator

In practice, Hamilton’s equations are solved numerically using the leapfrog integrator, which is chosen for its symplectic (volume-preserving) and time-reversible properties. The leapfrog update is performed in three steps:

  1. Half-step momentum update:

$p\left(t + \frac{\epsilon}{2}\right) = p(t) - \frac{\epsilon}{2}\nabla U(x(t))$

  2. Full-step position update:

$x(t + \epsilon) = x(t) + \epsilon\, M^{-1} p\left(t + \frac{\epsilon}{2}\right)$

  3. Half-step momentum update:

$p(t + \epsilon) = p\left(t + \frac{\epsilon}{2}\right) - \frac{\epsilon}{2}\nabla U(x(t + \epsilon))$

Repeating these steps for $L$ iterations produces a proposal $(x^*, p^*)$.

Because numerical integration introduces discretization errors, HMC employs a Metropolis acceptance step to correct for these errors. The acceptance probability is given by:

\[\alpha = \min\left\{1, \exp\Big[-H(x^*, p^*) + H(x, p)\Big]\right\}\]

This step ensures that the overall transition kernel satisfies detailed balance with respect to the augmented target distribution $\pi(x) \, \mathcal{N}(p;0,M)$.

Convergence and efficiency:

  - By following the geometry of the target through its gradient, HMC can propose distant moves that are still accepted with high probability, reducing the random-walk behavior of simpler samplers.
  - Performance is sensitive to the step size $\epsilon$ and the number of leapfrog steps $L$: too large a step size causes rejections, too small wastes computation. Adaptive variants such as NUTS automate this tuning.

Pseudocode:

Initialize x₀
For t = 0 to N - 1:
    Sample momentum p ~ N(0, M)
    Set (x, p) = (x_t, p)
    Simulate Hamiltonian dynamics using the leapfrog integrator:
        For l = 1 to L:
            p = p + (ε/2)*∇log π(x)
            x = x + ε * M^(-1) * p
            p = p + (ε/2)*∇log π(x)
    Perform MH accept/reject step with probability:
        α = min{1, exp[ -H(x*, p*) + H(x_t, p) ]}
    If accepted:
        x_(t+1) = x*
    Else:
        x_(t+1) = x_t
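
The pseudocode translates almost line for line into Python. Here is a standalone sketch for a standard normal target with identity mass matrix (step size and trajectory length are illustrative choices):

import numpy as np

rng = np.random.default_rng(0)

def U(x):      return 0.5 * x**2  # potential energy: -log pi for N(0, 1)
def grad_U(x): return x

def hmc_step(x, eps=0.1, L=20):
    p = rng.normal()              # resample momentum, M = I
    x_new, p_new = x, p
    # Leapfrog: half momentum step, L position steps, half momentum step
    p_new -= 0.5 * eps * grad_U(x_new)
    for l in range(L):
        x_new += eps * p_new
        if l != L - 1:
            p_new -= eps * grad_U(x_new)
    p_new -= 0.5 * eps * grad_U(x_new)
    # MH correction for the discretization error
    dH = (U(x_new) + 0.5 * p_new**2) - (U(x) + 0.5 * p**2)
    return x_new if np.log(rng.random()) < -dH else x

x, samples = 0.0, []
for _ in range(5_000):
    x = hmc_step(x)
    samples.append(x)
print(np.mean(samples), np.var(samples))  # ≈ 0 and ≈ 1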

3.5. Metropolis Adjusted Langevin Algorithm (MALA)

Context & Objective:

MALA enhances the standard Metropolis–Hastings algorithm by incorporating gradient information to propose moves that are more likely to be accepted. It can be viewed as a Metropolis-corrected discretization of the Langevin diffusion process.

Langevin dynamics:

Consider the overdamped Langevin equation, which describes the evolution of $x$ in continuous time:

\[dx_t = \frac{1}{2}\nabla \log \pi(x_t) \, dt + dW_t\]

where $dW_t$ represents a Wiener process (or Brownian motion). The stationary distribution of this stochastic differential equation is $\pi(x)$.

Discretization and proposal:

Discretizing the Langevin equation with step size $\epsilon$ gives the proposal:

\[x' = x^{(t)} + \frac{\epsilon^2}{2}\nabla \log \pi(x^{(t)}) + \epsilon\, \eta, \quad \eta \sim \mathcal{N}(0, I)\]

This proposal is asymmetric due to the drift term $\frac{\epsilon^2}{2}\nabla \log \pi(x^{(t)})$.

Metropolis correction for MALA:

To correct for the discretization error and ensure that the chain converges to $\pi(x)$, an MH acceptance step is applied. The acceptance probability is computed as:

\[\alpha = \min\left\{1, \frac{\pi(x') \, q(x^{(t)} \mid x')}{\pi(x^{(t)}) \, q(x' \mid x^{(t)})}\right\}\]

where the proposal density $q$ is given by:

\[q(x' \mid x) = \mathcal{N}\left(x'; x + \frac{\epsilon^2}{2}\nabla \log \pi(x), \, \epsilon^2 I\right)\]

Convergence and efficiency:

  - The gradient drift steers proposals toward higher-probability regions, so MALA typically mixes faster than random-walk MH on smooth targets.
  - The step size $\epsilon$ must be tuned; for a wide class of targets the asymptotically optimal acceptance rate is about 0.574.
  - MALA requires $\log \pi$ to be differentiable, with its gradient available analytically or via automatic differentiation.

Pseudocode:

Initialize x₀
For t = 0 to N - 1:
    Compute gradient g = ∇ log π(x_t)
    Propose x' = x_t + (ε²/2)*g + ε*η, where η ~ N(0,I)
    Compute asymmetric proposal densities q(x'|x_t) and q(x_t|x')
    Calculate acceptance probability:
        α = min{1, [π(x')q(x_t|x')] / [π(x_t)q(x'|x_t)] }
    Accept or reject accordingly.
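
A standalone Python sketch on a 2D standard normal target (the step size is an illustrative choice); note the explicit correction for the asymmetric proposal densities:

import numpy as np

rng = np.random.default_rng(0)

def log_pi(x):      return -0.5 * np.sum(x**2)  # unnormalized log target: N(0, I)
def grad_log_pi(x): return -x

def log_q(x_to, x_from, eps):
    # Log-density (up to a constant) of N(x_to; x_from + (eps^2/2) grad, eps^2 I)
    diff = x_to - x_from - 0.5 * eps**2 * grad_log_pi(x_from)
    return -np.sum(diff**2) / (2 * eps**2)

eps, N = 0.5, 10_000
x = np.zeros(2)
samples = np.empty((N, 2))
for t in range(N):
    x_prop = x + 0.5 * eps**2 * grad_log_pi(x) + eps * rng.normal(size=2)
    log_alpha = (log_pi(x_prop) - log_pi(x)
                 + log_q(x, x_prop, eps) - log_q(x_prop, x, eps))
    if np.log(rng.random()) < log_alpha:
        x = x_prop
    samples[t] = x
print(samples.mean(axis=0), samples.var(axis=0))  # ≈ (0, 0) and ≈ (1, 1)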

4. Use case implementations

Instead of implementing vanilla MCMC algorithms on toy examples and recovering simple distributions with libraries like NumPy and SciPy, we chose to implement these algorithms in the context of real-world use cases, ranging from computer vision to audio signal modeling.

These implementations can be tweaked and reused to solve real-world problems.

Here is a link to a GitHub repository where you can find the full implementations, along with some nice visualizations:

MCMC

4.1. Bayesian Linear Regression (via Gibbs Sampling)

Context:

In Bayesian linear regression, we model the relationship:

\[y = X\beta + \epsilon,\quad \epsilon\sim\mathcal{N}(0,\sigma^2I)\]

with priors $\beta \sim \mathcal{N}(\mu_0, \Sigma_0)$ and $\sigma^2 \sim \text{Inv-Gamma}(\alpha_0, \beta_0)$. The Gibbs sampler alternates between sampling from the conjugate full conditionals $\beta \mid \sigma^2, y$ and $\sigma^2 \mid \beta, y$.

Python Implementation:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import invgamma, multivariate_normal

# Generate synthetic data
np.random.seed(42)
n, d = 100, 2
X = np.hstack((np.ones((n,1)), np.random.randn(n, d-1)))
true_beta = np.array([1.0, 2.0])
sigma_true = 1.0
y = X @ true_beta + sigma_true * np.random.randn(n)

# Prior hyperparameters
beta_prior_mean = np.zeros(d)
beta_prior_cov = np.eye(d) * 10
alpha_prior = 2.0
beta_prior_val = 1.0

# Number of iterations for Gibbs sampling
iterations = 5000
beta_samples = np.zeros((iterations, d))
sigma2_samples = np.zeros(iterations)

# Initial values
beta_current = np.zeros(d)
sigma2_current = 1.0

for i in range(iterations):
    # Sample beta | sigma^2, y, X
    V_beta = np.linalg.inv(np.linalg.inv(beta_prior_cov) + (X.T @ X) / sigma2_current)
    m_beta = V_beta @ (np.linalg.inv(beta_prior_cov) @ beta_prior_mean + (X.T @ y) / sigma2_current)
    beta_current = multivariate_normal.rvs(mean=m_beta, cov=V_beta)

    # Sample sigma^2 | beta, y, X
    alpha_post = alpha_prior + n/2
    residuals = y - X @ beta_current
    beta_post = beta_prior_val + 0.5 * np.sum(residuals**2)
    sigma2_current = invgamma.rvs(a=alpha_post, scale=beta_post)

    beta_samples[i, :] = beta_current
    sigma2_samples[i] = sigma2_current

4.2. Audio Signal Reconstruction with Metropolis–Hastings (MH) and Preprocessing

Context:

Reconstructing a clean audio signal $s(t)$ from a noisy observation $y(t)$ involves preprocessing (e.g. filtering) and sampling from the posterior:

\[p(s|y) \propto p(y|s)p(s)\]

where $p(s)$ enforces smoothness.

Preprocessing & MH Implementation:

import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import butter, filtfilt

# Generate synthetic audio: sine wave with noise
np.random.seed(42)
t = np.linspace(0, 1, 500)
s_true = np.sin(2 * np.pi * 5 * t)
noise_std = 0.3
y_noisy = s_true + noise_std * np.random.randn(len(t))

# --- Preprocessing ---
# Apply a low-pass Butterworth filter to remove high-frequency noise
def butter_lowpass_filter(data, cutoff, fs, order=4):
    nyq = 0.5 * fs
    normal_cutoff = cutoff / nyq
    b, a = butter(order, normal_cutoff, btype='low', analog=False)
    y_filtered = filtfilt(b, a, data)
    return y_filtered

fs = 500  # Sampling frequency (Hz)
cutoff = 10  # Cutoff frequency (Hz)
y_filtered = butter_lowpass_filter(y_noisy, cutoff, fs)

# --- Metropolis–Hastings for Signal Reconstruction ---
# Work in log space to avoid numerical underflow of the likelihood and prior.
def log_likelihood(s, y, sigma):
    return -0.5 * np.sum((y - s)**2) / sigma**2

def log_smoothness_prior(s, lambda_reg=100):
    diff = np.diff(s)
    return -lambda_reg * np.sum(diff**2)

def log_target(s, y, sigma, lambda_reg=100):
    return log_likelihood(s, y, sigma) + log_smoothness_prior(s, lambda_reg)

def mh_audio_reconstruction(y, sigma_noise, iterations=3000, proposal_std=0.05):
    s_current = y.copy()  # Initialize with the preprocessed signal
    log_p_current = log_target(s_current, y, sigma_noise)
    samples = []
    for i in range(iterations):
        s_proposal = s_current + np.random.normal(0, proposal_std, size=s_current.shape)
        log_p_proposal = log_target(s_proposal, y, sigma_noise)
        # Accept with probability min(1, ratio), comparing in log space
        if np.log(np.random.rand()) < log_p_proposal - log_p_current:
            s_current = s_proposal
            log_p_current = log_p_proposal
        samples.append(s_current.copy())
    return np.array(samples)

# Run MH sampler on the preprocessed (filtered) signal
iterations = 3000
samples = mh_audio_reconstruction(y_filtered, sigma_noise=noise_std, iterations=iterations)
s_reconstructed = samples[-1]

4.3. Image Reconstruction with Hamiltonian Monte Carlo (HMC)

Context:

For image deblurring or reconstruction, consider the posterior:

\[p(I|Y) \propto p(Y|I)p(I)\]

where $I$ is the image, $Y$ is the observation, and $p(I)$ encodes spatial smoothness. HMC efficiently explores high-dimensional image spaces.

Simplified HMC Implementation (2D Image Denoising):

import numpy as np
import matplotlib.pyplot as plt
from scipy.ndimage import gaussian_filter

# Create synthetic image: gradient image with noise
np.random.seed(42)
image_size = 64
true_image = np.outer(np.linspace(0, 1, image_size), np.linspace(0, 1, image_size))
noisy_image = true_image + 0.3 * np.random.randn(image_size, image_size)

# Define the target log-probability (negative energy) for the image.
# Combines a data fidelity term with a smoothness prior.
def log_target(I, Y, sigma, lambda_reg):
    fidelity = -0.5 * np.sum((Y - I)**2) / sigma**2
    # Smoothness via a quadratic penalty on finite differences
    smoothness = -lambda_reg * (np.sum(np.diff(I, axis=0)**2) + np.sum(np.diff(I, axis=1)**2))
    return fidelity + smoothness

# HMC parameters
step_size = 0.001
num_steps = 20
iterations = 100
sigma_noise = 0.3
lambda_reg = 0.1

# Initialize with the noisy image
I_current = noisy_image.copy()

def hmc_update(I_current, Y, step_size, num_steps, sigma, lambda_reg):
    I = I_current.copy()
    momentum = np.random.randn(*I.shape)
    current_momentum = momentum.copy()

    # Analytic gradient of log_target with respect to the image I
    def grad_log_target(I):
        grad = np.zeros_like(I)
        # Fidelity term gradient: d/dI [-0.5 (Y - I)^2 / sigma^2] = (Y - I) / sigma^2
        grad += (Y - I) / sigma**2
        # Smoothness gradient: derivative of -lambda * sum of squared finite differences
        grad[:-1, :] -= 2 * lambda_reg * (I[:-1, :] - I[1:, :])
        grad[1:, :]  += 2 * lambda_reg * (I[:-1, :] - I[1:, :])
        grad[:, :-1] -= 2 * lambda_reg * (I[:, :-1] - I[:, 1:])
        grad[:, 1:]  += 2 * lambda_reg * (I[:, :-1] - I[:, 1:])
        return grad

    # Leapfrog integration: half momentum step, L position steps, half momentum step
    momentum += 0.5 * step_size * grad_log_target(I)
    for l in range(num_steps):
        I += step_size * momentum
        grad = grad_log_target(I)
        if l != num_steps - 1:
            momentum += step_size * grad
    momentum += 0.5 * step_size * grad
    # Negate momentum for symmetry
    momentum = -momentum

    # Compute Hamiltonians
    current_H = -log_target(I_current, Y, sigma, lambda_reg) + 0.5 * np.sum(current_momentum**2)
    proposed_H = -log_target(I, Y, sigma, lambda_reg) + 0.5 * np.sum(momentum**2)
    if np.random.rand() < np.exp(current_H - proposed_H):
        return I, True
    else:
        return I_current, False

hmc_images = []
accepted = 0
for it in range(iterations):
    I_new, acc = hmc_update(I_current, noisy_image, step_size, num_steps, sigma_noise, lambda_reg)
    if acc:
        accepted += 1
    I_current = I_new
    hmc_images.append(I_current.copy())

print(f"HMC Acceptance Rate: {accepted/iterations:.3f}"

4.4. Implicit Neural Representations (Neural SDF) with MALA

Context:

Implicit neural representations (e.g., Neural Signed Distance Functions, SDFs) model continuous signals (e.g., 3D shapes) using neural networks. Uncertainty can be captured by placing a prior over latent variables. Here, we use MALA to sample from the posterior over a latent variable in a small Neural SDF model.

Implementation:

import torch
import torch.nn as nn
import matplotlib.pyplot as plt
import numpy as np

# Define a simple Neural SDF with a latent vector parameter
class NeuralSDF(nn.Module):
    def __init__(self, latent_dim=5):
        super(NeuralSDF, self).__init__()
        self.latent = nn.Parameter(torch.randn(latent_dim))
        self.fc = nn.Sequential(
            nn.Linear(3 + latent_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 64),
            nn.ReLU(),
            nn.Linear(64, 1)
        )
    def forward(self, x):
        # x: [batch, 3]
        latent_expand = self.latent.unsqueeze(0).expand(x.size(0), -1)
        x_input = torch.cat([x, latent_expand], dim=1)
        return self.fc(x_input)

# Generate simulated observations: points near a sphere of radius 0.8
def generate_sdf_observations(n_points=200):
    points = torch.rand(n_points, 3) * 2 - 1  # uniformly in [-1,1]^3
    sdf_true = torch.norm(points, dim=1, keepdim=True) - 0.8
    sdf_obs = sdf_true + 0.05 * torch.randn_like(sdf_true)
    return points, sdf_obs

points, sdf_obs = generate_sdf_observations(200)

# Define log-likelihood and log-prior for the latent variable
def log_likelihood(model, points, sdf_obs, sigma=0.05):
    pred = model(points)
    return -0.5 * torch.sum((sdf_obs - pred)**2) / sigma**2

def log_prior(model):
    return -0.5 * torch.sum(model.latent**2)

def log_posterior(model, points, sdf_obs, sigma=0.05):
    return log_likelihood(model, points, sdf_obs, sigma) + log_prior(model)

# MALA update for the latent variable, with the asymmetric proposal correction
def log_q(x_to, x_from, grad_from, step_size):
    # Log-density (up to a constant) of N(x_to; x_from + (step_size/2)*grad_from, step_size * I)
    diff = x_to - x_from - 0.5 * step_size * grad_from
    return -torch.sum(diff**2) / (2 * step_size)

def mala_update(model, points, sdf_obs, sigma=0.05, step_size=1e-3):
    # Log-posterior and gradient at the current latent
    model.zero_grad()
    logp = log_posterior(model, points, sdf_obs, sigma)
    logp.backward()
    grad = model.latent.grad.detach().clone()
    latent_current = model.latent.detach().clone()

    # Langevin proposal: drift along the gradient plus Gaussian noise
    noise = torch.randn_like(latent_current)
    latent_proposal = latent_current + 0.5 * step_size * grad + (step_size ** 0.5) * noise

    # Log-posterior and gradient at the proposal (needed for the reverse proposal density)
    model.latent.data = latent_proposal
    model.zero_grad()
    logp_proposal = log_posterior(model, points, sdf_obs, sigma)
    logp_proposal.backward()
    grad_proposal = model.latent.grad.detach().clone()

    # MH acceptance with the asymmetric proposal densities (cf. Section 3.5)
    log_alpha = (logp_proposal.detach() - logp.detach()
                 + log_q(latent_current, latent_proposal, grad_proposal, step_size)
                 - log_q(latent_proposal, latent_current, grad, step_size))
    if torch.log(torch.rand(1)) < log_alpha:
        accepted = True
        logp_new = logp_proposal.item()
    else:
        model.latent.data = latent_current
        accepted = False
        logp_new = logp.item()
    return accepted, model.latent.data.clone(), logp_new

model = NeuralSDF(latent_dim=5)
iterations = 2000
latent_samples = []
log_probs = []
accepts = 0

for i in range(iterations):
    accepted, latent_sample, lp = mala_update(model, points, sdf_obs, sigma=0.05, step_size=1e-3)
    latent_samples.append(latent_sample.numpy())
    log_probs.append(lp)
    if accepted:
        accepts += 1

print(f"MALA Acceptance Rate: {accepts/iterations:.3f}")

5. Conclusion

In this post, we have presented a rigorous exploration of MCMC methods. We began with theoretical foundations and then developed multiple algorithms with step-by-step pseudocode and theoretical justification. We detailed five major techniques: Importance Sampling, Metropolis–Hastings, Gibbs Sampling, Hamiltonian Monte Carlo, and MALA.

The use cases further demonstrate the practicality of these methods: Gibbs sampling for Bayesian linear regression, Metropolis–Hastings for audio signal reconstruction, Hamiltonian Monte Carlo for image reconstruction, and MALA for exploring the latent space of an implicit neural representation.