This post is an attempt to bridge the gap between the different ideas behind the latest techniques in generative modeling. We will try to do so in a mathematically rigorous fashion, meticulously unpacking the theory and the links between these models.
Throughout, we use the following notation:
$\mathbf{x} \in \mathbb{R}^d$ denotes data (or a random variable in the data space).
$p(\mathbf{x})$ denotes the data distribution.
$\pi(\mathbf{z})$ typically denotes a base (or prior) distribution over a latent variable $\mathbf{z}\in \mathbb{R}^d$. A common choice is $\pi(\mathbf{z}) = \mathcal{N}(\mathbf{z}; \mathbf{0}, \mathbf{I})$.
For time-dependent distributions, we write $q_t(\mathbf{x})$ or $p_t(\mathbf{x})$.
$\nabla_{\mathbf{x}}$ denotes the gradient operator w.r.t. $\mathbf{x}$.
$\nabla_{\mathbf{x}} \cdot (\cdot)$ denotes the divergence operator w.r.t. $\mathbf{x}$.
Introduction to generative modeling
A generative model is a parameterized family of probability distributions $p_{\theta}(\mathbf{x})$ that we seek to match to a true data distribution $p_{\text{data}}(\mathbf{x})$. One typically has i.i.d. samples from $p_{\text{data}}$ (the training data). We want to:
Train $p_{\theta}(\mathbf{x})$ so that $p_{\theta}\approx p_{\text{data}}$.
Generate (sample) new data $\mathbf{x}$ from $p_{\theta}$.
Potentially evaluate or compare densities for model-based reasoning.
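To make these three tasks concrete, here is a minimal sketch of the simplest parametric generative model, a diagonal Gaussian $p_{\theta}(\mathbf{x}) = \mathcal{N}(\mathbf{x}; \boldsymbol{\mu}, \operatorname{diag}(\boldsymbol{\sigma}^2))$, fitted by maximum likelihood in PyTorch. This is our own toy example under assumed synthetic data and hyperparameters, not taken from any particular reference:

```python
# A minimal sketch (toy example, illustrative assumptions throughout):
# a diagonal-Gaussian generative model p_theta(x) = N(x; mu, diag(sigma^2)),
# fitted by maximum likelihood, then used for sampling and density evaluation.
import math
import torch

torch.manual_seed(0)

# "Training data": i.i.d. samples from an unknown data distribution
# (here just a shifted, rescaled Gaussian so the example stays self-contained).
data = 2.0 + 0.5 * torch.randn(10_000, 2)

# Parameters theta = (mu, log_sigma); the log-parametrization keeps sigma positive.
mu = torch.zeros(2, requires_grad=True)
log_sigma = torch.zeros(2, requires_grad=True)
opt = torch.optim.Adam([mu, log_sigma], lr=1e-2)

def log_prob(x):
    """Evaluate log p_theta(x) for the diagonal Gaussian model."""
    return (
        -0.5 * ((x - mu) / log_sigma.exp()) ** 2
        - log_sigma
        - 0.5 * math.log(2 * math.pi)
    ).sum(dim=-1)

# 1. Train: maximize the average log-likelihood of the data.
for _ in range(2_000):
    opt.zero_grad()
    loss = -log_prob(data).mean()
    loss.backward()
    opt.step()

# 2. Sample: x = mu + sigma * z with z ~ N(0, I) (Gaussian reparametrization).
# 3. Evaluate: score new points under the fitted density.
with torch.no_grad():
    samples = mu + log_sigma.exp() * torch.randn(5, 2)
    print(samples, log_prob(samples))
```

The models discussed in this post replace this crude density family with far more expressive ones, while keeping the same three goals.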
Different generative modeling paradigms include:
Normalizing flows (explicitly invertible mappings or continuous-time analogs).
Variational Autoencoders (VAEs) (encoder-decoder with latent variables).
Score-based / diffusion models (using a forward noising process and reverse-time score estimation).
In this post, we will focus on:
Flow matching: A continuous-time method to learn velocity fields that morph one distribution into another.
Score matching: A technique to learn the gradient of a log-density function.
Diffusion models: A special case of (time-dependent) score matching that uses an SDE to degrade data and a reverse SDE to generate samples.
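Before diving in, it helps to fix what "score" means: it is the gradient of the log-density, and in the time-dependent setting it is taken with respect to the noised marginals $p_t$ defined in the notation above. Writing the learned network as $\mathbf{s}_{\theta}$ (a symbol we introduce here for convenience), score matching and diffusion models respectively target

$$\mathbf{s}_{\theta}(\mathbf{x}) \approx \nabla_{\mathbf{x}} \log p(\mathbf{x}) \qquad \text{and} \qquad \mathbf{s}_{\theta}(\mathbf{x}, t) \approx \nabla_{\mathbf{x}} \log p_t(\mathbf{x}).$$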
Flow-based generative modeling
Traditional normalizing flows
In a discrete normalizing flow, one designs a sequence of invertible mappings $f_i: \mathbb{R}^d \to \mathbb{R}^d$, $i=1,\dots,L$. Denote the base distribution by $\pi(\mathbf{z})$, often $\mathcal{N}(\mathbf{0},\mathbf{I})$. A sample from the model is constructed as

$$\mathbf{x} = f(\mathbf{z}), \qquad \mathbf{z} \sim \pi(\mathbf{z}),$$
where $f = f_L \circ \dots \circ f_1$. Training typically maximizes the log-likelihood $\log p_{\theta}(\mathbf{x})$ over data $\mathbf{x}$, which is available in closed form through the change-of-variables formula

$$\log p_{\theta}(\mathbf{x}) = \log \pi\big(f^{-1}(\mathbf{x})\big) + \log \left|\det \frac{\partial f^{-1}(\mathbf{x})}{\partial \mathbf{x}}\right|.$$

However, carefully designing invertible $f_i$ with tractable Jacobian determinants can be restrictive.
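To make this restriction concrete, here is a minimal sketch in PyTorch of one classical way to satisfy it: RealNVP-style affine coupling layers, whose Jacobian is triangular so the log-determinant is just a sum of predicted log-scales. All class names, layer sizes, and hyperparameters below are illustrative assumptions (and the data dimension is assumed even), not a reference implementation:

```python
# A minimal sketch of a discrete normalizing flow with affine coupling layers
# (RealNVP-style), trained by exact maximum likelihood via change of variables.
import math
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Invertible layer: one half of the coordinates is rescaled and shifted
    conditioned on the other half, giving a triangular Jacobian."""
    def __init__(self, dim, hidden=64, flip=False):
        super().__init__()
        self.d = dim // 2
        self.flip = flip
        self.net = nn.Sequential(
            nn.Linear(self.d, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.d)),
        )

    def _split(self, v):
        a, b = v[:, :self.d], v[:, self.d:]
        return (b, a) if self.flip else (a, b)   # (conditioning half, transformed half)

    def _join(self, cond, trans):
        return torch.cat((trans, cond), dim=-1) if self.flip else torch.cat((cond, trans), dim=-1)

    def forward(self, z):                        # base -> data direction
        cond, tr = self._split(z)
        s, t = self.net(cond).chunk(2, dim=-1)
        s = torch.tanh(s)                        # keep log-scales bounded for stability
        return self._join(cond, tr * s.exp() + t), s.sum(dim=-1)

    def inverse(self, x):                        # data -> base direction
        cond, tr = self._split(x)
        s, t = self.net(cond).chunk(2, dim=-1)
        s = torch.tanh(s)
        return self._join(cond, (tr - t) * (-s).exp()), -s.sum(dim=-1)

class Flow(nn.Module):
    """f = f_L o ... o f_1 pushing a standard-normal base sample z to data space."""
    def __init__(self, dim=2, n_layers=4):
        super().__init__()
        self.dim = dim
        self.layers = nn.ModuleList(
            [AffineCoupling(dim, flip=bool(i % 2)) for i in range(n_layers)]
        )

    def log_prob(self, x):
        # log p_theta(x) = log pi(f^{-1}(x)) + log|det d f^{-1}/dx|  (change of variables)
        z, log_det = x, torch.zeros(x.shape[0])
        for layer in reversed(self.layers):
            z, ld = layer.inverse(z)
            log_det = log_det + ld
        log_pi = (-0.5 * (z ** 2 + math.log(2 * math.pi))).sum(dim=-1)
        return log_pi + log_det

    def sample(self, n):
        z = torch.randn(n, self.dim)             # z ~ pi(z) = N(0, I)
        for layer in self.layers:
            z, _ = layer(z)
        return z

# Fit the flow to correlated 2-D toy data by maximizing the exact log-likelihood.
torch.manual_seed(0)
data = torch.randn(4096, 2) @ torch.tensor([[1.0, 0.0], [0.8, 0.6]])
flow = Flow(dim=2, n_layers=4)
opt = torch.optim.Adam(flow.parameters(), lr=1e-3)
for _ in range(1000):
    opt.zero_grad()
    (-flow.log_prob(data).mean()).backward()
    opt.step()
print(flow.sample(5))
```

Sampling pushes $\mathbf{z} \sim \pi(\mathbf{z})$ forward through the layers, while the likelihood runs them in reverse and accumulates the log-determinants. The coupling structure is one standard way to keep the log-determinant cheap to evaluate, and this kind of architectural constraint is exactly the restrictiveness noted above, which is part of what motivates continuous-time formulations such as flow matching.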