This post (or rather a collection of notes) is an attempt to go through different concepts around MCMC methods from the ground up. I will try to gather and structure my learnings around this topic in a way that is clear, intelligible, and beginner-friendly.
Sampling from complex, high-dimensional probability distributions is a fundamental challenge in statistics, machine learning, and many applied fields. The core idea is that Markov Chain Monte Carlo (MCMC) methods overcome this challenge by constructing a Markov chain whose stationary distribution is the target distribution.
In practice, MCMC is used for Bayesian inference, uncertainty quantification, and solving inverse problems in areas such as audio and image processing. This blog post offers an accessible introduction to MCMC methods, from their theoretical foundations to Python implementations of these algorithms for real-world use cases.
A Markov chain is a sequence of random variables $\{X_t\}_{t \geq 0}$ that satisfies the Markov property:
\[P(X_{t+1} = y \mid X_t = x, X_{t-1} = x_{t-1}, \dots) = P(X_{t+1}=y \mid X_t=x).\]The chain's dynamics are defined by a transition probability $P(x,y)$ (or transition kernel) that does not depend on time; such a chain is called time-homogeneous.
A probability distribution $\pi(x)$ is said to be stationary (or invariant) if, for all states $y$,
\[\pi(y) = \sum_{x} \pi(x) P(x, y)\]This means that if $X_0 \sim \pi$, then $X_t \sim \pi$ for every $t$.
A sufficient condition for stationarity is the detailed balance condition:
\[\pi(x) P(x, y) = \pi(y) P(y, x), \quad \forall x, y\]Proof sketch: summing both sides over $x$ gives
\[\sum_{x} \pi(x) P(x, y) = \pi(y) \sum_{x} P(y, x) = \pi(y),\]since the transition probabilities out of $y$ sum to one. Thus, detailed balance guarantees that $\pi$ is invariant under the chain dynamics.
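As a quick numerical check, the snippet below verifies both claims on a made-up two-state chain; the transition matrix and $\pi$ are arbitrary choices that happen to satisfy detailed balance:

import numpy as np

# A two-state chain: transition matrix P and candidate stationary
# distribution pi, chosen so that detailed balance holds:
# pi[0] * P[0, 1] = 0.75 * 0.2 = 0.15 = 0.25 * 0.6 = pi[1] * P[1, 0]
P = np.array([[0.8, 0.2],
              [0.6, 0.4]])
pi = np.array([0.75, 0.25])

assert np.isclose(pi[0] * P[0, 1], pi[1] * P[1, 0])  # detailed balance
print(pi @ P)  # [0.75 0.25] -- one step of the chain leaves pi unchanged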
For the empirical averages from the chain to converge to expectations under $\pi$, the chain must be irreducible (every state can be reached from every other state), aperiodic (the chain does not cycle deterministically), and positive recurrent (the expected return time to any state is finite).
If these conditions hold, the ergodic theorem asserts that for any integrable function $f$:
\[\frac{1}{N}\sum_{t=1}^{N} f(X_t) \longrightarrow \mathbb{E}_{\pi}[f(x)]\]as $N \to \infty$.
This is the foundation behind using MCMC to approximate integrals and expectations.
Below are several common Monte Carlo sampling algorithms with some theoretical details, context, and pseudocode.
Context & Objective:
Often, our goal is to compute expectations under a complex target distribution $\pi(x)$. For example, we may wish to evaluate:
\[I = \int f(x)\pi(x)\,dx\]When direct sampling from $\pi(x)$ is infeasible, we introduce a proposal (or importance) distribution $q(x)$ that is easier to sample from.
Theory & Formulation:
Given samples $x_1, x_2, \dots, x_N$ drawn from $q(x)$, the expectation is estimated as:
\[I\approx \frac{\sum_{i=1}^{N} f(x_i) w(x_i)}{\sum_{i=1}^{N} w(x_i)},\]with importance weights defined by:
\[w(x) = \frac{\pi(x)}{q(x)}\]Key considerations: the support of $q$ must cover that of $\pi$, and the variance of the weights determines the quality of the estimate; a poorly matched proposal concentrates nearly all the weight on a handful of samples. Note also that the self-normalized estimator above only requires $\pi$ up to a normalizing constant.
Pseudocode:
For i = 1 to N:
    Sample x_i ~ q(x)
    Compute weight w_i = π(x_i) / q(x_i)
Estimate I ≈ (Σ f(x_i) w_i) / (Σ w_i)
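As a concrete illustration, here is a minimal sketch of self-normalized importance sampling; the standard normal target, the wider Gaussian proposal, and $f(x) = x^2$ are all arbitrary choices for the example:

import numpy as np
from scipy.stats import norm

np.random.seed(0)
N = 100_000

# Target pi: standard normal (in general known only up to a constant).
# Proposal q: a wider normal, easy to sample from.
x = np.random.randn(N) * 2.0                # x_i ~ q = N(0, 2^2)
w = norm.pdf(x) / norm.pdf(x, scale=2.0)    # w_i = pi(x_i) / q(x_i)

# Self-normalized estimate of E_pi[f(x)] with f(x) = x^2 (true value: 1)
estimate = np.sum(x**2 * w) / np.sum(w)
print(estimate)  # close to 1.0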
Context & Objective:
MH constructs a Markov chain whose stationary distribution is $\pi(x)$. It proposes moves using a proposal density $q(x'|x)$ and accepts these moves with a carefully designed acceptance probability. The key idea is to design the transition probability $P(x \to x')$ so that the target distribution is invariant. A sufficient condition is detailed balance, which states that for all states $x$ and $x'$:
$\pi(x) P(x \to x') = \pi(x') P(x' \to x)$
In MH, the transition probability is given by:
\[P(x \to x') = q(x'|x) \alpha(x, x')\]where $\alpha(x, x')$ is the acceptance probability. To satisfy detailed balance, we require:
\[\pi(x) q(x'|x) \alpha(x, x') = \pi(x') q(x|x') \alpha(x', x)\]A common and effective choice is to define $\alpha(x, x')$ as:
\[\alpha(x, x') = \min\left\{1, \frac{\pi(x') \, q(x|x')}{\pi(x) \, q(x'|x)}\right\}\]When $\pi(x') q(x \mid x') \geq \pi(x) q(x' \mid x)$, we have $\alpha(x, x') = 1$; otherwise, the move is accepted with probability $\frac{\pi(x') \, q(x \mid x')}{\pi(x) \, q(x' \mid x)} < 1$.
Convergence and ergodicity
The MH algorithm constructs a Markov chain that, under suitable conditions (irreducibility, aperiodicity, and positive recurrence), converges to the target distribution $\pi(x)$. The ergodic theorem then guarantees that time averages computed from the chain will converge to the expectations under $\pi(x)$:
\[\frac{1}{N}\sum_{t=1}^{N} f(x_t) \longrightarrow \mathbb{E}_{\pi}[f(x)] \quad \text{as } N \to \infty\]Practical considerations:
Choice of Proposal: the efficiency of the MH algorithm depends heavily on how well the proposal distribution explores the state space. If $q(x' \mid x)$ is too narrow, the chain explores slowly; if it is too wide, the acceptance rate may drop.
Algorithm & Pseudocode:
Initialize x₀
For t = 0 to N - 1:
    Propose x' ~ q(x'|x_t)
    Calculate acceptance probability:
        α = min{1, [π(x') q(x_t|x')] / [π(x_t) q(x'|x_t)]}
    With probability α:
        Set x_(t+1) = x'
    Else:
        Set x_(t+1) = x_t
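For concreteness, here is a minimal random-walk MH sketch; the two-component Gaussian mixture target and the proposal scale are arbitrary choices, and the $q$ terms cancel because the Gaussian random-walk proposal is symmetric:

import numpy as np

def log_pi(x):
    # Unnormalized log-density of a toy two-component Gaussian mixture
    return np.logaddexp(-0.5 * (x - 2.0)**2, -0.5 * (x + 2.0)**2)

np.random.seed(0)
N, proposal_std = 50_000, 1.0
x = 0.0
samples = np.empty(N)
for t in range(N):
    x_prop = x + proposal_std * np.random.randn()  # symmetric proposal
    # log alpha = log pi(x') - log pi(x); q cancels by symmetry
    if np.log(np.random.rand()) < log_pi(x_prop) - log_pi(x):
        x = x_prop
    samples[t] = x
print(samples.mean(), samples.var())  # mean near 0, variance near 5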
Context & Objective:
Gibbs sampling is a special case of the Metropolis–Hastings algorithm, optimized for high-dimensional problems where the joint distribution $\pi(x)$ is difficult to sample from directly, but the full conditional distributions $\pi(x_i \mid x_{-i})$ (where $x_{-i}$ denotes all components except $x_i$) are tractable.
Suppose $x = (x_1, x_2, \dots, x_d)$ is a $d$-dimensional vector. The Gibbs sampler iterates over each coordinate and updates it by sampling from the full conditional distribution:
\[x_i^{(t+1)} \sim \pi\left(x_i \mid x_1^{(t+1)}, \dots, x_{i-1}^{(t+1)}, x_{i+1}^{(t)}, \dots, x_d^{(t)}\right)\]Since each update is drawn exactly from the full conditional, the move is automatically accepted. The chain is constructed to have $\pi(x)$ as its stationary distribution.
Theoretical details:
Detailed Balance in Gibbs Sampling: Although Gibbs sampling does not require an explicit acceptance step, one can show that it satisfies detailed balance. For two states $x$ and $x'$ that differ only in the $i$-th coordinate, the update probability is given by the full conditional:
$P(x \to x') = \pi\left(x_i' \mid x_{-i}\right)$
Writing $\pi(x) = \pi(x_{-i})\,\pi(x_i \mid x_{-i})$, it can be verified that:
$\pi(x)\, \pi\left(x_i' \mid x_{-i}\right) = \pi(x_{-i})\,\pi(x_i \mid x_{-i})\,\pi(x_i' \mid x_{-i}) = \pi(x')\, \pi\left(x_i \mid x_{-i}\right)$
which is exactly the detailed balance requirement.
Convergence and ergodicity:
Gibbs sampling inherits the convergence properties of Markov chains. Provided that the chain is irreducible and aperiodic (often ensured by the structure of the conditional distributions), the Gibbs sampler is ergodic, meaning that the empirical averages converge to the true expectations under $\pi(x)$.
Practical considerations: Gibbs sampling requires tractable full conditionals, and strong correlation between coordinates slows mixing; updating highly correlated variables jointly (blocking) or randomizing the scan order can help.
Pseudocode:
Initialize x = (x₁, x₂, …, x_d)
For t = 0 to N - 1:
    For i = 1 to d:
        Sample x_i^(t+1) ~ π(x_i | x₁^(t+1), …, x_(i-1)^(t+1), x_(i+1)^(t), …, x_d^(t))
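Before the full Bayesian regression example later in the post, here is a minimal Gibbs sketch for a toy target, a standard bivariate normal with correlation $\rho$, chosen because its full conditionals are Gaussian in closed form:

import numpy as np

np.random.seed(0)
rho, N = 0.9, 20_000
x1, x2 = 0.0, 0.0
samples = np.empty((N, 2))
for t in range(N):
    # Full conditionals of a standard bivariate normal with correlation rho:
    # x1 | x2 ~ N(rho * x2, 1 - rho^2), and symmetrically for x2 | x1
    x1 = rho * x2 + np.sqrt(1 - rho**2) * np.random.randn()
    x2 = rho * x1 + np.sqrt(1 - rho**2) * np.random.randn()
    samples[t] = (x1, x2)
print(np.corrcoef(samples.T)[0, 1])  # close to rho = 0.9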
Context & Objective:
Hamiltonian Monte Carlo (HMC) augments the state $x$ with an auxiliary momentum variable $p$ and is based on the Hamiltonian function:
\[H(x, p) = U(x) + K(p)\]where $U(x) = -\log \pi(x)$ is the potential energy and $K(p) = \frac{1}{2} p^\top M^{-1} p$ is the kinetic energy, with $M$ a positive-definite mass matrix (often the identity).
Hamilton’s equations describe the evolution of $x$ and $p$:
$\frac{dx}{dt} = \nabla_p H(x, p) = M^{-1}p, \qquad \frac{dp}{dt} = -\nabla_x H(x, p) = -\nabla U(x)$
Leapfrog integrator
In practice, Hamilton’s equations are solved numerically using the leapfrog integrator, which is chosen for its symplectic (volume-preserving) and time-reversible properties. The leapfrog update is performed in three steps:
$p\left(t + \frac{\epsilon}{2}\right) = p(t) - \frac{\epsilon}{2}\nabla U(x(t))$
$x(t + \epsilon) = x(t) + \epsilon\, M^{-1} p\left(t + \frac{\epsilon}{2}\right)$
$p(t + \epsilon) = p\left(t + \frac{\epsilon}{2}\right) - \frac{\epsilon}{2}\nabla U(x(t + \epsilon))$
Repeating these steps for $L$ iterations produces a proposal $(x^*, p^*)$.
Because numerical integration introduces discretization errors, HMC employs a Metropolis acceptance step to correct for these errors. The acceptance probability is given by:
\[\alpha = \min\left\{1, \exp\Big[-H(x^*, p^*) + H(x, p)\Big]\right\}\]This step ensures that the overall transition kernel satisfies detailed balance with respect to the augmented target distribution $\pi(x) \, \mathcal{N}(p;0,M)$.
Convergence and efficiency: by following approximate Hamiltonian trajectories, HMC suppresses the random-walk behavior of simpler samplers and can propose distant states that are still accepted with high probability. Its efficiency hinges on tuning the step size $\epsilon$ and the number of leapfrog steps $L$.
Pseudocode:
Initialize x₀
For t = 0 to N - 1:
    Sample momentum p ~ N(0, M)
    Set (x, p) = (x_t, p)
    Simulate Hamiltonian dynamics using the leapfrog integrator:
        For l = 1 to L:
            p = p + (ε/2) * ∇log π(x)
            x = x + ε * M^(-1) * p
            p = p + (ε/2) * ∇log π(x)
    Perform MH accept/reject step with probability:
        α = min{1, exp[-H(x*, p*) + H(x_t, p)]}
    If accepted:
        x_(t+1) = x*
    Else:
        x_(t+1) = x_t
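As a minimal illustration before the image example later on, here is HMC for a one-dimensional standard normal target, with $U(x) = x^2/2$, $\nabla \log \pi(x) = -x$, and $M = I$; all tuning values are arbitrary choices for the sketch:

import numpy as np

def grad_log_pi(x):
    return -x  # toy target: standard normal, log pi(x) = -x^2/2 + const

np.random.seed(0)
eps, L, N = 0.1, 20, 5_000
x = 0.0
samples = np.empty(N)
for t in range(N):
    p = np.random.randn()  # sample momentum, M = 1
    x_new, p_new = x, p
    # Leapfrog: half kick, alternating drift/kick, final half kick
    p_new += 0.5 * eps * grad_log_pi(x_new)
    for l in range(L):
        x_new += eps * p_new
        if l < L - 1:
            p_new += eps * grad_log_pi(x_new)
    p_new += 0.5 * eps * grad_log_pi(x_new)
    # Accept with probability min(1, exp(H_old - H_new)), H = U + K
    H_old = 0.5 * x**2 + 0.5 * p**2
    H_new = 0.5 * x_new**2 + 0.5 * p_new**2
    if np.log(np.random.rand()) < H_old - H_new:
        x = x_new
    samples[t] = x
print(samples.mean(), samples.var())  # near 0 and 1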
Context & Objective:
The Metropolis-Adjusted Langevin Algorithm (MALA) enhances standard Metropolis–Hastings by incorporating gradient information to propose moves that are more likely to be accepted. It can be viewed as a discretized Langevin diffusion combined with an MH correction.
Langevin dynamics:
Consider the overdamped Langevin equation, which describes the evolution of $x$ in continuous time:
\[dx_t = \frac{1}{2}\nabla \log \pi(x_t) \, dt + dW_t\]where $dW_t$ represents a Wiener process (or Brownian motion). The stationary distribution of this stochastic differential equation is $\pi(x)$.
Discretization and proposal:
Discretizing the Langevin equation with step size $\epsilon$ gives the proposal:
\[x' = x^{(t)} + \frac{\epsilon^2}{2}\nabla \log \pi(x^{(t)}) + \epsilon\, \eta, \quad \eta \sim \mathcal{N}(0, I)\]This proposal is asymmetric due to the drift term $\frac{\epsilon^2}{2}\nabla \log \pi(x^{(t)})$.
Metropolis correction for MALA:
To correct for the discretization error and ensure that the chain converges to $\pi(x)$, an MH acceptance step is applied. The acceptance probability is computed as:
\[\alpha = \min\left\{1, \frac{\pi(x') \, q(x^{(t)} \mid x')}{\pi(x^{(t)}) \, q(x' \mid x^{(t)})}\right\}\]where the proposal density $q$ is given by:
\[q(x' \mid x) = \mathcal{N}\left(x'; x + \frac{\epsilon^2}{2}\nabla \log \pi(x), \, \epsilon^2 I\right)\]Convergence and efficiency: the gradient drift steers proposals toward high-probability regions, so MALA typically mixes faster than random-walk MH; classical scaling results suggest tuning $\epsilon$ toward an acceptance rate around 0.574 (versus roughly 0.234 for random-walk MH).
Pseudocode:
Initialize x₀
For t = 0 to N - 1:
    Compute gradient g = ∇log π(x_t)
    Propose x' = x_t + (ε²/2) * g + ε * η, where η ~ N(0, I)
    Compute asymmetric proposal densities q(x'|x_t) and q(x_t|x')
    Calculate acceptance probability:
        α = min{1, [π(x') q(x_t|x')] / [π(x_t) q(x'|x_t)]}
    Accept or reject accordingly.
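The sketch below implements this recipe for a one-dimensional standard normal target (a toy choice); note that, unlike random-walk MH, the forward and reverse proposal densities do not cancel:

import numpy as np

def log_pi(x):
    return -0.5 * x**2  # toy target: standard normal, up to a constant

def grad_log_pi(x):
    return -x

def log_q(x_to, x_from, eps):
    # q(x_to | x_from) = N(x_to; x_from + (eps^2/2) grad log pi(x_from), eps^2)
    mean = x_from + 0.5 * eps**2 * grad_log_pi(x_from)
    return -0.5 * (x_to - mean)**2 / eps**2

np.random.seed(0)
eps, N = 0.5, 20_000
x = 0.0
samples = np.empty(N)
for t in range(N):
    x_prop = x + 0.5 * eps**2 * grad_log_pi(x) + eps * np.random.randn()
    log_alpha = (log_pi(x_prop) + log_q(x, x_prop, eps)
                 - log_pi(x) - log_q(x_prop, x, eps))
    if np.log(np.random.rand()) < log_alpha:
        x = x_prop
    samples[t] = x
print(samples.mean(), samples.var())  # near 0 and 1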
Instead of implementing vanilla MCMC algorithms on toy examples and recovering simple distributions with libraries like NumPy and SciPy, we chose to implement these algorithms in the context of real-world use cases, ranging from computer vision to audio signal modeling.
These implementations can be tweaked and reused for solving real-world problems.
Here is a link to a GitHub repository where you can find the full implementations with some nice visualizations:
Context:
In Bayesian linear regression, we model the relationship:
\[y = X\beta + \epsilon,\quad \epsilon\sim\mathcal{N}(0,\sigma^2I)\]with priors $\beta \sim \mathcal{N}(\mu_0, \Sigma_0)$ and $\sigma^2 \sim \text{Inv-Gamma}(\alpha_0, \beta_0)$. The Gibbs sampler alternates between sampling $\beta$ and $\sigma^2$.
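For reference, the full conditionals under these conjugate priors are standard results, and they are exactly what the implementation below samples from:
\[\beta \mid \sigma^2, y \sim \mathcal{N}(m, V), \quad V = \left(\Sigma_0^{-1} + \frac{X^\top X}{\sigma^2}\right)^{-1}, \quad m = V\left(\Sigma_0^{-1}\mu_0 + \frac{X^\top y}{\sigma^2}\right)\]
\[\sigma^2 \mid \beta, y \sim \text{Inv-Gamma}\left(\alpha_0 + \frac{n}{2},\; \beta_0 + \frac{1}{2}\lVert y - X\beta \rVert^2\right)\]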
Python Implementation:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import invgamma, multivariate_normal
# Generate synthetic data
np.random.seed(42)
n, d = 100, 2
X = np.hstack((np.ones((n,1)), np.random.randn(n, d-1)))
true_beta = np.array([1.0, 2.0])
sigma_true = 1.0
y = X @ true_beta + sigma_true * np.random.randn(n)
# Prior hyperparameters
beta_prior_mean = np.zeros(d)
beta_prior_cov = np.eye(d) * 10
alpha_prior = 2.0
beta_prior_val = 1.0
# Number of iterations for Gibbs sampling
iterations = 5000
beta_samples = np.zeros((iterations, d))
sigma2_samples = np.zeros(iterations)
# Initial values
beta_current = np.zeros(d)
sigma2_current = 1.0
for i in range(iterations):
    # Sample beta | sigma^2, y, X
    V_beta = np.linalg.inv(np.linalg.inv(beta_prior_cov) + (X.T @ X) / sigma2_current)
    m_beta = V_beta @ (np.linalg.inv(beta_prior_cov) @ beta_prior_mean + (X.T @ y) / sigma2_current)
    beta_current = multivariate_normal.rvs(mean=m_beta, cov=V_beta)

    # Sample sigma^2 | beta, y, X
    alpha_post = alpha_prior + n / 2
    residuals = y - X @ beta_current
    beta_post = beta_prior_val + 0.5 * np.sum(residuals**2)
    sigma2_current = invgamma.rvs(a=alpha_post, scale=beta_post)

    beta_samples[i, :] = beta_current
    sigma2_samples[i] = sigma2_current
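After discarding a burn-in period, the retained draws approximate the posterior, so point estimates and credible intervals follow from simple empirical summaries (a usage sketch; the burn-in length is an arbitrary choice):

# Discard burn-in and summarize the posterior draws
burn_in = 1000
beta_post_samples = beta_samples[burn_in:]
print("Posterior mean of beta:", beta_post_samples.mean(axis=0))
print("95% credible interval:",
      np.percentile(beta_post_samples, [2.5, 97.5], axis=0))
print("Posterior mean of sigma^2:", sigma2_samples[burn_in:].mean())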
Context:
Reconstructing a clean audio signal $s(t)$ from a noisy observation $y(t)$ involves preprocessing (e.g. filtering) and sampling from the posterior:
\[p(s|y) \propto p(y|s)p(s)\]where $p(s)$ enforces smoothness.
Preprocessing & MH Implementation:
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import butter, filtfilt
# Generate synthetic audio: sine wave with noise
np.random.seed(42)
t = np.linspace(0, 1, 500)
s_true = np.sin(2 * np.pi * 5 * t)
noise_std = 0.3
y_noisy = s_true + noise_std * np.random.randn(len(t))
# --- Preprocessing ---
# Apply a low-pass Butterworth filter to remove high-frequency noise
def butter_lowpass_filter(data, cutoff, fs, order=4):
    nyq = 0.5 * fs
    normal_cutoff = cutoff / nyq
    b, a = butter(order, normal_cutoff, btype='low', analog=False)
    y_filtered = filtfilt(b, a, data)
    return y_filtered
fs = 500 # Sampling frequency (Hz)
cutoff = 10 # Cutoff frequency (Hz)
y_filtered = butter_lowpass_filter(y_noisy, cutoff, fs)
# --- Metropolis–Hastings for Signal Reconstruction ---
def log_likelihood(s, y, sigma):
    return -0.5 * np.sum((y - s)**2) / sigma**2

def log_smoothness_prior(s, lambda_reg=100):
    diff = np.diff(s)
    return -lambda_reg * np.sum(diff**2)

def log_target(s, y, sigma, lambda_reg=100):
    # Work in log space: exponentiating these sums underflows for long signals
    return log_likelihood(s, y, sigma) + log_smoothness_prior(s, lambda_reg)

def mh_audio_reconstruction(y, sigma_noise, iterations=3000, proposal_std=0.05):
    s_current = y.copy()  # Initialize with the preprocessed signal
    log_p_current = log_target(s_current, y, sigma_noise)
    samples = []
    for i in range(iterations):
        s_proposal = s_current + np.random.normal(0, proposal_std, size=s_current.shape)
        log_p_proposal = log_target(s_proposal, y, sigma_noise)
        # Accept with probability min(1, exp(log ratio)); compare in log space
        if np.log(np.random.rand()) < log_p_proposal - log_p_current:
            s_current = s_proposal
            log_p_current = log_p_proposal
        samples.append(s_current.copy())
    return np.array(samples)
# Run MH sampler on the preprocessed (filtered) signal
iterations = 3000
samples = mh_audio_reconstruction(y_filtered, sigma_noise=noise_std, iterations=iterations)
s_reconstructed = samples[-1]
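Rather than keeping only the final draw, averaging the post burn-in samples gives a posterior-mean estimate of the signal (a sketch; the half-chain split is an arbitrary choice):

# Posterior-mean reconstruction from the second half of the chain
s_post_mean = samples[iterations // 2:].mean(axis=0)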
Context:
For image deblurring or reconstruction, consider the posterior:
\[p(I|Y) \propto p(Y|I)p(I)\]where $I$ is the image, $Y$ is the observation, and $p(I)$ encodes spatial smoothness. HMC efficiently explores high-dimensional image spaces.
Simplified HMC Implementation (2D Image Denoising):
import numpy as np
import matplotlib.pyplot as plt
from scipy.ndimage import gaussian_filter
# Create synthetic image: gradient image with noise
np.random.seed(42)
image_size = 64
true_image = np.outer(np.linspace(0, 1, image_size), np.linspace(0, 1, image_size))
noisy_image = true_image + 0.3 * np.random.randn(image_size, image_size)
# Define the target log-probability (negative energy) for the image.
# Combines a data fidelity term with a smoothness prior.
def log_target(I, Y, sigma, lambda_reg):
    fidelity = -0.5 * np.sum((Y - I)**2) / sigma**2
    # Smoothness via a quadratic penalty on finite differences
    smoothness = -lambda_reg * (np.sum(np.diff(I, axis=0)**2) + np.sum(np.diff(I, axis=1)**2))
    return fidelity + smoothness
# HMC parameters
step_size = 0.001
num_steps = 20
iterations = 100
sigma_noise = 0.3
lambda_reg = 0.1
# Initialize with the noisy image
I_current = noisy_image.copy()
def hmc_update(I_current, Y, step_size, num_steps, sigma, lambda_reg):
    I = I_current.copy()
    momentum = np.random.randn(*I.shape)
    current_momentum = momentum.copy()

    # Analytic gradient of log_target with respect to the image
    def grad_log_target(I):
        grad = np.zeros_like(I)
        # Fidelity term gradient
        grad += (Y - I) / sigma**2
        # Smoothness gradient (derivative of the quadratic difference penalty)
        grad[:-1, :] += 2 * lambda_reg * (I[1:, :] - I[:-1, :])
        grad[1:, :] -= 2 * lambda_reg * (I[1:, :] - I[:-1, :])
        grad[:, :-1] += 2 * lambda_reg * (I[:, 1:] - I[:, :-1])
        grad[:, 1:] -= 2 * lambda_reg * (I[:, 1:] - I[:, :-1])
        return grad

    # Leapfrog integration: half kick, alternating drift/kick, final half kick
    grad = grad_log_target(I)
    momentum += 0.5 * step_size * grad
    for step in range(num_steps):
        I += step_size * momentum
        grad = grad_log_target(I)
        if step < num_steps - 1:
            momentum += step_size * grad
    momentum += 0.5 * step_size * grad
    # Negate momentum for reversibility
    momentum = -momentum

    # Compute Hamiltonians (potential = -log_target, kinetic = ||p||^2 / 2)
    current_H = -log_target(I_current, Y, sigma, lambda_reg) + 0.5 * np.sum(current_momentum**2)
    proposed_H = -log_target(I, Y, sigma, lambda_reg) + 0.5 * np.sum(momentum**2)
    if np.random.rand() < np.exp(current_H - proposed_H):
        return I, True
    else:
        return I_current, False
hmc_images = []
accepted = 0
for it in range(iterations):
    I_new, acc = hmc_update(I_current, noisy_image, step_size, num_steps, sigma_noise, lambda_reg)
    if acc:
        accepted += 1
    I_current = I_new
    hmc_images.append(I_current.copy())
print(f"HMC Acceptance Rate: {accepted/iterations:.3f}")
Context:
Implicit neural representations (e.g., Neural Signed Distance Functions, SDFs) model continuous signals (e.g., 3D shapes) using neural networks. Uncertainty can be captured by placing a prior over latent variables. Here, we use MALA to sample from the posterior over a latent variable in a small Neural SDF model.
Improved Implementation & Visualization:
import torch
import torch.nn as nn
import matplotlib.pyplot as plt
import numpy as np
# Define a simple Neural SDF with a latent vector parameter
class NeuralSDF(nn.Module):
    def __init__(self, latent_dim=5):
        super(NeuralSDF, self).__init__()
        self.latent = nn.Parameter(torch.randn(latent_dim))
        self.fc = nn.Sequential(
            nn.Linear(3 + latent_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 64),
            nn.ReLU(),
            nn.Linear(64, 1)
        )

    def forward(self, x):
        # x: [batch, 3]
        latent_expand = self.latent.unsqueeze(0).expand(x.size(0), -1)
        x_input = torch.cat([x, latent_expand], dim=1)
        return self.fc(x_input)
# Generate simulated observations: points near a sphere of radius 0.8
def generate_sdf_observations(n_points=200):
    points = torch.rand(n_points, 3) * 2 - 1  # uniformly in [-1,1]^3
    sdf_true = torch.norm(points, dim=1, keepdim=True) - 0.8
    sdf_obs = sdf_true + 0.05 * torch.randn_like(sdf_true)
    return points, sdf_obs

points, sdf_obs = generate_sdf_observations(200)
# Define log-likelihood and log-prior for the latent variable
def log_likelihood(model, points, sdf_obs, sigma=0.05):
    pred = model(points)
    return -0.5 * torch.sum((sdf_obs - pred)**2) / sigma**2

def log_prior(model):
    return -0.5 * torch.sum(model.latent**2)

def log_posterior(model, points, sdf_obs, sigma=0.05):
    return log_likelihood(model, points, sdf_obs, sigma) + log_prior(model)
# MALA update for the latent variable
def mala_update(model, points, sdf_obs, sigma=0.05, step_size=1e-3):
    # Log-posterior and its gradient at the current latent
    model.zero_grad()
    logp_current = log_posterior(model, points, sdf_obs, sigma)
    logp_current.backward()
    grad_current = model.latent.grad.detach().clone()
    latent_current = model.latent.detach().clone()

    # Langevin proposal: drift along the gradient plus Gaussian noise
    noise = torch.randn_like(latent_current)
    latent_proposal = (latent_current + 0.5 * step_size * grad_current
                       + torch.sqrt(torch.tensor(step_size)) * noise)

    # Log-posterior and gradient at the proposal (needed for the reverse kernel)
    model.latent.data = latent_proposal
    model.latent.grad = None
    logp_proposal = log_posterior(model, points, sdf_obs, sigma)
    logp_proposal.backward()
    grad_proposal = model.latent.grad.detach().clone()

    # Asymmetric Gaussian proposal density q(x_to | x_from), up to a constant
    def log_q(x_to, x_from, grad_from):
        mean = x_from + 0.5 * step_size * grad_from
        return -torch.sum((x_to - mean)**2) / (2 * step_size)

    # Full MH correction with the asymmetric proposal densities
    log_alpha = (logp_proposal.detach() + log_q(latent_current, latent_proposal, grad_proposal)
                 - logp_current.detach() - log_q(latent_proposal, latent_current, grad_current))
    if torch.log(torch.rand(1)) < log_alpha:
        accepted = True
        lp = logp_proposal.item()
    else:
        model.latent.data = latent_current  # revert on rejection
        accepted = False
        lp = logp_current.item()
    return accepted, model.latent.data.clone(), lp
model = NeuralSDF(latent_dim=5)
iterations = 2000
latent_samples = []
log_probs = []
accepts = 0
for i in range(iterations):
    accepted, latent_sample, lp = mala_update(model, points, sdf_obs, sigma=0.05, step_size=1e-3)
    latent_samples.append(latent_sample.numpy())
    log_probs.append(lp)
    if accepted:
        accepts += 1
print(f"MALA Acceptance Rate: {accepts/iterations:.3f}")
In this post, we have presented a rigorous exploration of MCMC methods. We began with theoretical foundations and then developed multiple algorithms with step-by-step pseudocode and theoretical justification. We detailed five major techniques: Importance Sampling, Metropolis–Hastings, Gibbs Sampling, Hamiltonian Monte Carlo, and MALA.
The use cases further demonstrate the practicality of these methods: