This post is an attempt to bridge the gap between the different ideas behind the latest techniques in generative modeling. We will try to do so in a mathematically rigorous fashion, meticulously unpacking the theory and the links between these models.
Throughout, we use the following notation:
$\mathbf{x} \in \mathbb{R}^d$ denotes data (or a random variable in the data space).
$p(\mathbf{x})$ denotes the data distribution.
$\pi(\mathbf{z})$ typically denotes a base (or prior) distribution over a latent variable $\mathbf{z}\in \mathbb{R}^d$. A common choice is $\pi(\mathbf{z}) = \mathcal{N}(\mathbf{z}; \mathbf{0}, \mathbf{I})$.
For time-dependent distributions, we write $q_t(\mathbf{x})$ or $p_t(\mathbf{x})$.
$\nabla_{\mathbf{x}}$ denotes the gradient operator w.r.t. $\mathbf{x}$.
$\nabla_{\mathbf{x}} \cdot (\cdot)$ denotes the divergence operator w.r.t. $\mathbf{x}$.
Introduction to generative modeling
A generative model is a parameterized family of probability distributions $p_{\theta}(\mathbf{x})$ that we seek to match to a true data distribution $p_{\text{data}}(\mathbf{x})$. One typically has i.i.d. samples from $p_{\text{data}}$ (the training data). We want to:
Train $p_{\theta}(\mathbf{x})$ so that $p_{\theta}\approx p_{\text{data}}$.
Generate (sample) new data $\mathbf{x}$ from $p_{\theta}$.
Potentially evaluate or compare densities for model-based reasoning.
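To make these three tasks concrete, here is a minimal sketch of the simplest parametric generative model, a diagonal Gaussian $p_{\theta}(\mathbf{x}) = \mathcal{N}(\mathbf{x}; \boldsymbol{\mu}, \operatorname{diag}(\boldsymbol{\sigma}^2))$, fitted by maximum likelihood in PyTorch. This is our own toy example under assumed synthetic data and hyperparameters, not taken from any particular reference:

```python
# A minimal sketch (toy example, illustrative assumptions throughout):
# a diagonal-Gaussian generative model p_theta(x) = N(x; mu, diag(sigma^2)),
# fitted by maximum likelihood, then used for sampling and density evaluation.
import math
import torch

torch.manual_seed(0)

# "Training data": i.i.d. samples from an unknown data distribution
# (here just a shifted, rescaled Gaussian so the example stays self-contained).
data = 2.0 + 0.5 * torch.randn(10_000, 2)

# Parameters theta = (mu, log_sigma); the log-parametrization keeps sigma positive.
mu = torch.zeros(2, requires_grad=True)
log_sigma = torch.zeros(2, requires_grad=True)
opt = torch.optim.Adam([mu, log_sigma], lr=1e-2)

def log_prob(x):
    """Evaluate log p_theta(x) for the diagonal Gaussian model."""
    return (
        -0.5 * ((x - mu) / log_sigma.exp()) ** 2
        - log_sigma
        - 0.5 * math.log(2 * math.pi)
    ).sum(dim=-1)

# 1. Train: maximize the average log-likelihood of the data.
for _ in range(2_000):
    opt.zero_grad()
    loss = -log_prob(data).mean()
    loss.backward()
    opt.step()

# 2. Sample: x = mu + sigma * z with z ~ N(0, I) (Gaussian reparametrization).
# 3. Evaluate: score new points under the fitted density.
with torch.no_grad():
    samples = mu + log_sigma.exp() * torch.randn(5, 2)
    print(samples, log_prob(samples))
```

The models discussed in this post replace this crude density family with far more expressive ones, while keeping the same three goals.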
Different generative modeling paradigms include:
Normalizing flows (explicitly invertible mappings or continuous-time analogs).
Variational Autoencoders (VAEs) (encoder-decoder with latent variables).
Score-based / diffusion models (using a forward noising process and reverse-time score estimation).
In this post, we will focus on:
Flow matching: A continuous-time method to learn velocity fields that morph one distribution into another.
Score matching: A technique to learn the gradient of a log-density function.
Diffusion models: A special case of (time-dependent) score matching that uses an SDE to degrade data and a reverse SDE to generate samples.
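Before diving in, it helps to fix what "score" means: it is the gradient of the log-density, and in the time-dependent setting it is taken with respect to the noised marginals $p_t$ defined in the notation above. Writing the learned network as $\mathbf{s}_{\theta}$ (a symbol we introduce here for convenience), score matching and diffusion models respectively target

$$\mathbf{s}_{\theta}(\mathbf{x}) \approx \nabla_{\mathbf{x}} \log p(\mathbf{x}) \qquad \text{and} \qquad \mathbf{s}_{\theta}(\mathbf{x}, t) \approx \nabla_{\mathbf{x}} \log p_t(\mathbf{x}).$$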
Flow-based generative modeling
Traditional normalizing flows
In a discrete normalizing flow, one designs a sequence of invertible mappings $f_i: \mathbb{R}^d \to \mathbb{R}^d$, $i=1,\dots,L$. Denote the base distribution by $\pi(\mathbf{z})$, often $\mathcal{N}(\mathbf{0},\mathbf{I})$. A sample from the model is constructed as

$$\mathbf{x} = f(\mathbf{z}), \qquad \mathbf{z} \sim \pi(\mathbf{z}),$$
where $f = f_L \circ \dots \circ f_1$. Training typically maximizes the log-likelihood $\log p_{\theta}(\mathbf{x})$ over data $\mathbf{x}$, which is available in closed form through the change-of-variables formula

$$\log p_{\theta}(\mathbf{x}) = \log \pi\big(f^{-1}(\mathbf{x})\big) + \log \left|\det \frac{\partial f^{-1}(\mathbf{x})}{\partial \mathbf{x}}\right|.$$

However, carefully designing invertible $f_i$ with tractable Jacobian determinants can be restrictive.
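To make this restriction concrete, here is a minimal sketch in PyTorch of one classical way to satisfy it: RealNVP-style affine coupling layers, whose Jacobian is triangular so the log-determinant is just a sum of predicted log-scales. All class names, layer sizes, and hyperparameters below are illustrative assumptions (and the data dimension is assumed even), not a reference implementation:

```python
# A minimal sketch of a discrete normalizing flow with affine coupling layers
# (RealNVP-style), trained by exact maximum likelihood via change of variables.
import math
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Invertible layer: one half of the coordinates is rescaled and shifted
    conditioned on the other half, giving a triangular Jacobian."""
    def __init__(self, dim, hidden=64, flip=False):
        super().__init__()
        self.d = dim // 2
        self.flip = flip
        self.net = nn.Sequential(
            nn.Linear(self.d, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.d)),
        )

    def _split(self, v):
        a, b = v[:, :self.d], v[:, self.d:]
        return (b, a) if self.flip else (a, b)   # (conditioning half, transformed half)

    def _join(self, cond, trans):
        return torch.cat((trans, cond), dim=-1) if self.flip else torch.cat((cond, trans), dim=-1)

    def forward(self, z):                        # base -> data direction
        cond, tr = self._split(z)
        s, t = self.net(cond).chunk(2, dim=-1)
        s = torch.tanh(s)                        # keep log-scales bounded for stability
        return self._join(cond, tr * s.exp() + t), s.sum(dim=-1)

    def inverse(self, x):                        # data -> base direction
        cond, tr = self._split(x)
        s, t = self.net(cond).chunk(2, dim=-1)
        s = torch.tanh(s)
        return self._join(cond, (tr - t) * (-s).exp()), -s.sum(dim=-1)

class Flow(nn.Module):
    """f = f_L o ... o f_1 pushing a standard-normal base sample z to data space."""
    def __init__(self, dim=2, n_layers=4):
        super().__init__()
        self.dim = dim
        self.layers = nn.ModuleList(
            [AffineCoupling(dim, flip=bool(i % 2)) for i in range(n_layers)]
        )

    def log_prob(self, x):
        # log p_theta(x) = log pi(f^{-1}(x)) + log|det d f^{-1}/dx|  (change of variables)
        z, log_det = x, torch.zeros(x.shape[0])
        for layer in reversed(self.layers):
            z, ld = layer.inverse(z)
            log_det = log_det + ld
        log_pi = (-0.5 * (z ** 2 + math.log(2 * math.pi))).sum(dim=-1)
        return log_pi + log_det

    def sample(self, n):
        z = torch.randn(n, self.dim)             # z ~ pi(z) = N(0, I)
        for layer in self.layers:
            z, _ = layer(z)
        return z

# Fit the flow to correlated 2-D toy data by maximizing the exact log-likelihood.
torch.manual_seed(0)
data = torch.randn(4096, 2) @ torch.tensor([[1.0, 0.0], [0.8, 0.6]])
flow = Flow(dim=2, n_layers=4)
opt = torch.optim.Adam(flow.parameters(), lr=1e-3)
for _ in range(1000):
    opt.zero_grad()
    (-flow.log_prob(data).mean()).backward()
    opt.step()
print(flow.sample(5))
```

Sampling pushes $\mathbf{z} \sim \pi(\mathbf{z})$ forward through the layers, while the likelihood runs them in reverse and accumulates the log-determinants. The coupling structure is one standard way to keep the log-determinant cheap to evaluate, and this kind of architectural constraint is exactly the restrictiveness noted above, which is part of what motivates continuous-time formulations such as flow matching.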