Diffusion Models
Generating images by denoising
The Art of Controlled Noise
Imagine taking a photograph and gradually adding noise until it becomes random static. Now imagine reversing that process—starting with pure noise and iteratively refining it to reveal a coherent image. That's the principle behind diffusion models.
The Core Insight
Diffusion models learn to denoise. They're trained on millions of images that have been progressively corrupted with noise. The model learns: "Given this noisy image, what did the slightly-less-noisy version look like?"
By chaining many small denoising steps together, the model can start from pure randomness and arrive at a realistic image.
Forward Process: Adding Noise
The forward process is simple and doesn't require learning:
- Start with a clean image
- Add a tiny bit of Gaussian noise
- Repeat many times (typically 1000 steps)
- End with pure noise
Each step is a fixed Gaussian transition defined by the noise schedule, so nothing needs to be learned; conveniently, the noisy image at any step can also be sampled in closed form directly from the clean image.
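The forward process can be sketched in a few lines. This is a minimal illustration using a linear beta schedule and the standard closed-form jump to step t; the schedule values and array shapes here are illustrative, not taken from any particular model.

```python
import numpy as np

def make_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule: beta_t is the noise added at step t."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alpha_bars = np.cumprod(1.0 - betas)  # cumulative signal retained by step t
    return alpha_bars

def add_noise(x0, t, alpha_bars, rng):
    """Jump straight to step t: x_t = sqrt(a_bar)*x0 + sqrt(1-a_bar)*eps."""
    eps = rng.standard_normal(x0.shape)
    a_bar = alpha_bars[t]
    return np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps, eps

rng = np.random.default_rng(0)
alpha_bars = make_schedule()
x0 = rng.standard_normal((8, 8))              # stand-in for a clean image
xt, eps = add_noise(x0, 999, alpha_bars, rng) # by the last step, a_bar is near 0
```

At step 0 almost all of the image survives (`alpha_bars[0]` is close to 1); by the last step almost none does, which is what "end with pure noise" means quantitatively.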
Reverse Process: Creating Images
The reverse process is where the core computation occurs:
- Start with random noise
- Predict what the slightly-cleaner version looks like
- Remove a bit of noise
- Repeat until you have a clean image
The neural network learns to perform the second step: given a noisy image, it predicts the noise that was added, which in turn tells us what the slightly cleaner version looks like.
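The reverse loop above can be sketched as follows. The noise-prediction network is replaced by a placeholder function (a real system would use a trained U-Net here), and the update uses the standard DDPM mean parameterization; everything else is illustrative.

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)
alpha_bars = np.cumprod(1.0 - betas)

def predict_noise(xt, t):
    """Placeholder for the trained network; a real model predicts
    the noise component of xt here."""
    return np.zeros_like(xt)

def ddpm_step(xt, t, eps_hat, rng):
    """One reverse step: compute the mean of x_{t-1} from the predicted
    noise, then add fresh scheduled noise (none at the final step)."""
    alpha_t = 1.0 - betas[t]
    mean = (xt - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alpha_t)
    if t == 0:
        return mean
    return mean + np.sqrt(betas[t]) * rng.standard_normal(xt.shape)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))      # start from pure random noise
for t in range(999, -1, -1):         # walk the chain backwards
    x = ddpm_step(x, t, predict_noise(x, t), rng)
```

With a trained network in place of `predict_noise`, the final `x` would be a sample from the image distribution.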
Why This Works
Several properties make diffusion models effective:
- Gradual transformation: Small steps are easier to learn than big jumps
- Probabilistic: Each reverse step samples from a learned distribution, enabling diverse outputs
- Stable training: The objective is simple (predict added noise)
- High quality: Many refinement steps = fine details
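The "simple objective" in the list above is just a mean-squared error on the predicted noise. Here is a minimal sketch of one training sample, with a stand-in for the network; the schedule and shapes are illustrative.

```python
import numpy as np

alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))

def training_loss(x0, model, rng):
    """One sample of the DDPM objective: noise the clean image to a
    random step t, then score the model's guess of that noise with MSE."""
    t = int(rng.integers(len(alpha_bars)))
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    eps_hat = model(xt, t)                 # network's prediction of the noise
    return float(np.mean((eps_hat - eps) ** 2))

rng = np.random.default_rng(0)
x0 = rng.standard_normal((32, 32))             # stand-in for a clean image
mock_model = lambda xt, t: np.zeros_like(xt)   # stand-in for a real network
loss = training_loss(x0, mock_model, rng)      # near 1.0 for the zero predictor
```

Training simply minimizes this loss over many images and random timesteps, which is why the objective is so stable compared to adversarial setups.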
Guidance: Controlling What's Generated
Classifier-free guidance lets you control the output with text or other conditions:
- Train the model both with and without the condition
- At generation time, amplify the difference between conditional and unconditional predictions
- Higher guidance = more faithful to the prompt, but less diverse
"A photo of a cat" with high guidance → definitely a cat
"A photo of a cat" with low guidance → maybe a cat-like creature
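At generation time, classifier-free guidance is a one-line combination of the two predictions. The sketch below uses toy vectors in place of real noise predictions; the guidance scale of 7.5 is shown as a commonly used default, not a requirement.

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance: push the prediction away from the
    unconditional guess, toward (and past) the conditional one."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

eps_u = np.array([0.0, 0.0])   # toy unconditional noise prediction
eps_c = np.array([1.0, -1.0])  # toy conditional noise prediction

# A scale of 1.0 reproduces the conditional prediction exactly
assert np.allclose(cfg_combine(eps_u, eps_c, 1.0), eps_c)

out = cfg_combine(eps_u, eps_c, 7.5)  # higher scales extrapolate past it
```

Scales above 1 amplify whatever direction the condition pulls in, which is exactly the faithfulness-vs-diversity trade-off described above.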
Key Models
DALL-E 2 (2022): Text-to-image via diffusion in CLIP's latent space
Stable Diffusion (2022): Open-source, runs in latent space for efficiency
Midjourney: Known for artistic, stylized outputs
SDXL / SD3: Higher resolution, better prompt following
Latent Diffusion
Running diffusion on full-resolution images is expensive. Latent diffusion works in a compressed space:
- Encode: Compress image to a small latent representation
- Diffuse: Run diffusion in latent space (faster!)
- Decode: Expand back to full resolution
This makes high-resolution generation practical.
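The encode-diffuse-decode pipeline can be sketched with crude stand-ins for the learned autoencoder. A real latent-diffusion model (e.g. Stable Diffusion) uses a trained VAE, not the pooling and upsampling used here; this only illustrates the shape arithmetic and why the compression helps.

```python
import numpy as np

def encode(image, factor=8):
    """VAE stand-in: compress HxW -> (H/f)x(W/f) by average pooling."""
    h, w = image.shape
    return image.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def decode(latent, factor=8):
    """VAE stand-in: expand back by nearest-neighbor upsampling."""
    return np.repeat(np.repeat(latent, factor, axis=0), factor, axis=1)

image = np.ones((64, 64))
latent = encode(image)      # 8x8: diffusion runs here, on 64x fewer pixels
restored = decode(latent)   # back to 64x64 after sampling finishes
```

With an 8x spatial compression factor, each denoising step touches 64 times fewer values, which is where the efficiency gain comes from.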
Applications
- Text-to-image: Generate images from descriptions
- Image-to-image: Transform existing images (style transfer, editing)
- Inpainting: Fill in missing regions
- Super-resolution: Enhance low-resolution images
- Video generation: Extend to temporal sequences
- 3D generation: Create 3D models from text
Sampling Speed vs. Quality
More denoising steps = higher quality but slower generation
Researchers have developed faster samplers:
- DDIM: Deterministic sampling, fewer steps needed
- DPM-Solver: Optimized ODE solvers
- LCM: Latent consistency models; generation in as few as 4-8 steps
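To see why fewer steps are possible, here is a sketch of the DDIM update: estimate the clean image from the predicted noise, then re-noise it directly to an earlier timestep. Because no fresh noise is added, the sampler can take large jumps; the 20-step stride and the placeholder network below are illustrative.

```python
import numpy as np

alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))

def ddim_step(xt, t, t_prev, eps_hat):
    """Deterministic DDIM update: estimate x0 from the predicted noise,
    then re-noise it to step t_prev (t_prev = -1 means fully clean)."""
    a_t = alpha_bars[t]
    a_prev = alpha_bars[t_prev] if t_prev >= 0 else 1.0
    x0_hat = (xt - np.sqrt(1.0 - a_t) * eps_hat) / np.sqrt(a_t)
    return np.sqrt(a_prev) * x0_hat + np.sqrt(1.0 - a_prev) * eps_hat

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))          # pure noise
timesteps = list(range(999, -1, -50))    # only 20 of the 1000 steps
for t, t_prev in zip(timesteps, timesteps[1:] + [-1]):
    eps_hat = np.zeros_like(x)  # placeholder for the trained network
    x = ddim_step(x, t, t_prev, eps_hat)
```

Skipping steps trades some quality for speed, which is the trade-off the heading describes.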
Current Limitations
- Composition: Struggles with multiple objects and their relationships
- Text in images: Often generates garbled text
- Counting: "Three cats" might yield two or four
- Hands/anatomy: Famous failure modes with fingers
- Training data: Inherits biases from training images