Diffusion Models
Generating images by denoising
The Art of Controlled Noise
Imagine taking a photograph and gradually adding noise until it becomes random static. Now imagine reversing that process—starting with pure noise and iteratively refining it to reveal a coherent image. That's the principle behind diffusion models.
The Core Insight
Diffusion models learn to denoise. They're trained on millions of images that have been progressively corrupted with noise. The model learns: "Given this noisy image, what did the slightly-less-noisy version look like?"
By chaining many small denoising steps together, the model can start from pure randomness and arrive at a realistic image.
Forward Process: Adding Noise
The forward process is simple and doesn't require learning:
- Start with a clean image
- Add a tiny bit of Gaussian noise
- Repeat many times (typically 1000 steps)
- End with pure noise
Each step is a fixed Gaussian transition defined by the noise schedule, so nothing needs to be learned; conveniently, the noisy image at any step can also be sampled in closed form directly from the clean image.
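The forward process can be sketched in a few lines. This is a minimal illustration using a linear beta schedule and the standard closed-form jump to step t; the schedule values and array shapes here are illustrative, not taken from any particular model.

```python
import numpy as np

def make_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule: beta_t is the noise added at step t."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alpha_bars = np.cumprod(1.0 - betas)  # cumulative signal retained by step t
    return alpha_bars

def add_noise(x0, t, alpha_bars, rng):
    """Jump straight to step t: x_t = sqrt(a_bar)*x0 + sqrt(1-a_bar)*eps."""
    eps = rng.standard_normal(x0.shape)
    a_bar = alpha_bars[t]
    return np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps, eps

rng = np.random.default_rng(0)
alpha_bars = make_schedule()
x0 = rng.standard_normal((8, 8))              # stand-in for a clean image
xt, eps = add_noise(x0, 999, alpha_bars, rng) # by the last step, a_bar is near 0
```

At step 0 almost all of the image survives (`alpha_bars[0]` is close to 1); by the last step almost none does, which is what "end with pure noise" means quantitatively.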
Reverse Process: Creating Images
The reverse process is where the core computation occurs:
- Start with random noise
- Predict what the slightly-cleaner version looks like
- Remove a bit of noise
- Repeat until you have a clean image
The neural network learns to perform the second step: given a noisy image, it predicts the noise that was added, which in turn tells us what the slightly cleaner version looks like.
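The reverse loop above can be sketched as follows. The noise-prediction network is replaced by a placeholder function (a real system would use a trained U-Net here), and the update uses the standard DDPM mean parameterization; everything else is illustrative.

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)
alpha_bars = np.cumprod(1.0 - betas)

def predict_noise(xt, t):
    """Placeholder for the trained network; a real model predicts
    the noise component of xt here."""
    return np.zeros_like(xt)

def ddpm_step(xt, t, eps_hat, rng):
    """One reverse step: compute the mean of x_{t-1} from the predicted
    noise, then add fresh scheduled noise (none at the final step)."""
    alpha_t = 1.0 - betas[t]
    mean = (xt - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alpha_t)
    if t == 0:
        return mean
    return mean + np.sqrt(betas[t]) * rng.standard_normal(xt.shape)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))      # start from pure random noise
for t in range(999, -1, -1):         # walk the chain backwards
    x = ddpm_step(x, t, predict_noise(x, t), rng)
```

With a trained network in place of `predict_noise`, the final `x` would be a sample from the image distribution.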
Why This Works
Several properties make diffusion models effective:
- Gradual transformation: Small steps are easier to learn than big jumps
- Probabilistic: Each reverse step samples from a learned distribution, enabling diverse outputs
- Stable training: The objective is simple (predict added noise)
- High quality: Many refinement steps = fine details
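The "simple objective" in the list above is just a mean-squared error on the predicted noise. Here is a minimal sketch of one training sample, with a stand-in for the network; the schedule and shapes are illustrative.

```python
import numpy as np

alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))

def training_loss(x0, model, rng):
    """One sample of the DDPM objective: noise the clean image to a
    random step t, then score the model's guess of that noise with MSE."""
    t = int(rng.integers(len(alpha_bars)))
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    eps_hat = model(xt, t)                 # network's prediction of the noise
    return float(np.mean((eps_hat - eps) ** 2))

rng = np.random.default_rng(0)
x0 = rng.standard_normal((32, 32))             # stand-in for a clean image
mock_model = lambda xt, t: np.zeros_like(xt)   # stand-in for a real network
loss = training_loss(x0, mock_model, rng)      # near 1.0 for the zero predictor
```

Training simply minimizes this loss over many images and random timesteps, which is why the objective is so stable compared to adversarial setups.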
Guidance: Controlling What's Generated
Classifier-free guidance lets you control the output with text or other conditions:
- Train the model both with and without the condition
- At generation time, amplify the difference between conditional and unconditional predictions
- Higher guidance = more faithful to the prompt, but less diverse
"A photo of a cat" with high guidance → definitely a cat
"A photo of a cat" with low guidance → maybe a cat-like creature
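At generation time, classifier-free guidance is a one-line combination of the two predictions. The sketch below uses toy vectors in place of real noise predictions; the guidance scale of 7.5 is shown as a commonly used default, not a requirement.

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance: push the prediction away from the
    unconditional guess, toward (and past) the conditional one."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

eps_u = np.array([0.0, 0.0])   # toy unconditional noise prediction
eps_c = np.array([1.0, -1.0])  # toy conditional noise prediction

# A scale of 1.0 reproduces the conditional prediction exactly
assert np.allclose(cfg_combine(eps_u, eps_c, 1.0), eps_c)

out = cfg_combine(eps_u, eps_c, 7.5)  # higher scales extrapolate past it
```

Scales above 1 amplify whatever direction the condition pulls in, which is exactly the faithfulness-vs-diversity trade-off described above.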
Key Models
DALL-E 2 (2022): Text-to-image via diffusion in CLIP's latent space
Stable Diffusion (2022): Open-source, runs in latent space for efficiency
Midjourney: Known for artistic, stylized outputs
SDXL / SD3: Higher resolution, better prompt following
Latent Diffusion
Running diffusion on full-resolution images is expensive. Latent diffusion works in a compressed space:
- Encode: Compress image to a small latent representation
- Diffuse: Run diffusion in latent space (faster!)
- Decode: Expand back to full resolution
This makes high-resolution generation practical.
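The encode-diffuse-decode pipeline can be sketched with crude stand-ins for the learned autoencoder. A real latent-diffusion model (e.g. Stable Diffusion) uses a trained VAE, not the pooling and upsampling used here; this only illustrates the shape arithmetic and why the compression helps.

```python
import numpy as np

def encode(image, factor=8):
    """VAE stand-in: compress HxW -> (H/f)x(W/f) by average pooling."""
    h, w = image.shape
    return image.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def decode(latent, factor=8):
    """VAE stand-in: expand back by nearest-neighbor upsampling."""
    return np.repeat(np.repeat(latent, factor, axis=0), factor, axis=1)

image = np.ones((64, 64))
latent = encode(image)      # 8x8: diffusion runs here, on 64x fewer pixels
restored = decode(latent)   # back to 64x64 after sampling finishes
```

With an 8x spatial compression factor, each denoising step touches 64 times fewer values, which is where the efficiency gain comes from.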
Applications
- Text-to-image: Generate images from descriptions
- Image-to-image: Transform existing images (style transfer, editing)
- Inpainting: Fill in missing regions
- Super-resolution: Enhance low-resolution images
- Video generation: Extend to temporal sequences
- 3D generation: Create 3D models from text
Sampling Speed vs. Quality
More denoising steps = higher quality but slower generation
Researchers have developed faster samplers:
- DDIM: Deterministic sampling, fewer steps needed
- DPM-Solver: Optimized ODE solvers
- LCM: Latent consistency models; generation in as few as 4-8 steps
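To see why fewer steps are possible, here is a sketch of the DDIM update: estimate the clean image from the predicted noise, then re-noise it directly to an earlier timestep. Because no fresh noise is added, the sampler can take large jumps; the 20-step stride and the placeholder network below are illustrative.

```python
import numpy as np

alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))

def ddim_step(xt, t, t_prev, eps_hat):
    """Deterministic DDIM update: estimate x0 from the predicted noise,
    then re-noise it to step t_prev (t_prev = -1 means fully clean)."""
    a_t = alpha_bars[t]
    a_prev = alpha_bars[t_prev] if t_prev >= 0 else 1.0
    x0_hat = (xt - np.sqrt(1.0 - a_t) * eps_hat) / np.sqrt(a_t)
    return np.sqrt(a_prev) * x0_hat + np.sqrt(1.0 - a_prev) * eps_hat

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))          # pure noise
timesteps = list(range(999, -1, -50))    # only 20 of the 1000 steps
for t, t_prev in zip(timesteps, timesteps[1:] + [-1]):
    eps_hat = np.zeros_like(x)  # placeholder for the trained network
    x = ddim_step(x, t, t_prev, eps_hat)
```

Skipping steps trades some quality for speed, which is the trade-off the heading describes.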
Current Limitations
- Composition: Struggles with multiple objects and their relationships
- Text in images: Often generates garbled text
- Counting: "Three cats" might yield two or four
- Hands/anatomy: Famous failure modes with fingers
- Training data: Inherits biases from training images