
Diffusion Models: How AI Creates Art from Noise

Midjourney and DALL-E don't paint; they 'denoise'. We explain the physics-inspired magic behind D...

Abstract Algorithms · 5 min read

TLDR: Diffusion models are trained to undo noise: a fixed forward process gradually corrupts an image with Gaussian noise, and a neural network learns to reverse each step. At inference time you start from pure static and iteratively denoise into a meaningful image. They power DALL-E, Midjourney, and Stable Diffusion.


📖 The Reverse Photograph: Making Sense from Static

Imagine you take a clear photo and run it through a machine that adds a tiny bit of static 1,000 times in a row. At step 1,000 the image is pure white noise, indistinguishable from random snow on a TV.

Now you train a neural network to reverse that process: given an image at step $t$, predict what it looked like at step $t-1$. Once the model can do this reliably, you can start from pure noise at step 1,000 and run the denoiser 1,000 times to arrive at a sharp, coherent image.

That is the entire intuition behind diffusion models.


🔢 Two Processes: Forward Diffusion and Reverse Denoising

Forward process (training only):

  • Fixed by a predefined noise schedule; no network needed.
  • At each step $t$, add a small amount of Gaussian noise.
  • After $T$ steps (~1000), the image is purely random noise.
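A convenient property of this process is that $x_t$ can be sampled from $x_0$ in a single closed-form step rather than $t$ sequential ones. Here is a minimal NumPy sketch; the linear beta schedule is illustrative, not taken from any particular implementation:

```python
import numpy as np

# Toy forward diffusion: sample x_t directly from x_0.
# Assumed linear beta schedule; real implementations vary.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)  # cumulative product \bar{alpha}_t

def add_noise(x0, t, rng):
    """Closed-form q(x_t | x_0): mix the clean image with Gaussian noise."""
    eps = rng.standard_normal(x0.shape)  # the known noise epsilon
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 8))        # stand-in for a clean image
x_late, _ = add_noise(x0, T - 1, rng)   # at t = T-1, almost pure noise
```

Because $\bar{\alpha}_T$ is tiny at the final step, the clean-image term nearly vanishes and $x_T$ is essentially pure Gaussian noise, which is exactly why inference can start from random static.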

Reverse process (what the model learns):

  • The model learns to predict and remove the noise added at each step.
  • At inference time, run the reverse process from $t=T$ to $t=0$.
```mermaid
flowchart LR
    Clean[Clean Image\nt=0] -->|Add noise step by step| Noisy[Pure Noise\nt=1000]
    Noisy -->|Learned denoising| Clean2[Generated Image\nt=0]
```
| Step | Forward (training) | Reverse (inference) |
|---|---|---|
| Input | Clean image | Pure noise |
| Operation | Add Gaussian noise | Remove predicted noise |
| Known? | Yes (we add it) | No (model predicts it) |
| Output | Noisy image | Clean image |
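The reverse column of the table can be sketched as a DDPM-style ancestral sampling loop. The noise predictor below is a zero-output stand-in for a trained U-Net (an assumption for illustration), so the result is not a real image, but the loop structure matches the standard update:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # same illustrative schedule as training
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def dummy_eps_model(x_t, t):
    # Stand-in for the trained network epsilon_theta(x_t, t);
    # a real model would be a U-Net conditioned on the timestep.
    return np.zeros_like(x_t)

def ddpm_sample(shape, rng):
    """Run the learned reverse process from t = T-1 down to t = 0."""
    x = rng.standard_normal(shape)   # start from pure noise
    for t in reversed(range(T)):
        eps_hat = dummy_eps_model(x, t)
        # Posterior mean: subtract the (scaled) predicted noise, then rescale.
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(alphas[t])
        noise = rng.standard_normal(shape) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise  # fresh noise except at the last step
    return x

rng = np.random.default_rng(0)
sample = ddpm_sample((8, 8), rng)
```

Note that every step except the last re-injects a small amount of fresh noise; only the final step outputs the mean directly.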

⚙️ What the Model Actually Learns: Predicting Noise, Not Pixels

The model is not trained to directly predict the clean image. It is trained to predict the noise that was added.

$$L = \| \epsilon - \epsilon_\theta(x_t, t) \|^2$$

  • $\epsilon$: the actual Gaussian noise we added (known, because we added it)
  • $\epsilon_\theta(x_t, t)$: the model's prediction of that noise
  • $x_t$: the noisy image at step $t$

Plain English: "Look at this noisy image at step $t$. Guess what static I mixed in. The better you guess, the more precisely I can subtract it to recover the clean image."
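One training example under this objective can be sketched in a few lines. The zero predictor again stands in for the network $\epsilon_\theta$; with a real model, the gradient of this loss is what updates the weights:

```python
import numpy as np

# One sample of the noise-prediction loss L = || eps - eps_theta(x_t, t) ||^2.
rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)   # illustrative schedule
alpha_bar = np.cumprod(1.0 - betas)

x0 = rng.standard_normal((8, 8))     # clean image
t = int(rng.integers(T))             # random timestep, as in training
eps = rng.standard_normal(x0.shape)  # the noise we mix in (known)
x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

eps_pred = np.zeros_like(x_t)        # stand-in for eps_theta(x_t, t)
loss = float(np.mean((eps - eps_pred) ** 2))
```

Because the noise is standard Gaussian and the stand-in predicts zero, the loss here is simply the mean squared magnitude of the noise, close to 1; a trained network drives it toward 0.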

The U-Net architecture (with skip connections) is commonly used as $\epsilon_\theta$ because it can process images at multiple scales while retaining fine-grained spatial information.


🧠 Stable Diffusion: Latent Space and Text Conditioning

Running 1,000 denoising steps on a full 512×512 image is slow. Stable Diffusion (Rombach et al., 2022) adds two improvements:

  1. Latent diffusion: Compress the image into a small latent representation using a VAE first. Run the diffusion process on the smaller latent (8× smaller in each spatial dimension), then decode back to full resolution.

  2. Text conditioning (CLIP): Feed the text prompt through a text encoder (CLIP). Inject the text embedding into each denoising step via cross-attention. The model learns to denoise toward images that match the text.

```mermaid
flowchart TD
    Prompt[Text Prompt] --> CLIP[CLIP Text Encoder]
    Noise[Latent Noise] --> UNet[U-Net Denoiser\nwith Cross-Attention]
    CLIP --> UNet
    UNet --> VAE[VAE Decoder]
    VAE --> Image[Generated Image]
```
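The cross-attention injection in step 2 can be illustrated with a toy NumPy example. The shapes are made up and the projections are identities (real models use learned projection matrices and many attention heads), but the mechanism is the same: image-latent tokens query the text tokens and pull in a text-weighted mixture of their values:

```python
import numpy as np

# Toy cross-attention: text embeddings condition image-latent tokens.
rng = np.random.default_rng(0)
d = 16
img_tokens = rng.standard_normal((64, d))  # queries: latent image patches
txt_tokens = rng.standard_normal((8, d))   # keys/values: text-encoder output

Q, K, V = img_tokens, txt_tokens, txt_tokens   # identity projections (sketch)
scores = Q @ K.T / np.sqrt(d)                  # (64, 8) image-to-text affinities
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)  # softmax over text tokens
conditioned = weights @ V                      # (64, d) text-informed features
```

Each image token ends up as a convex combination of text-token values, which is how the prompt steers every denoising step.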

🌍 Real-World Generators and What They Use

| Product | Underlying model | Key innovation |
|---|---|---|
| Stable Diffusion | Latent diffusion (Runway ML / Stability AI) | Open weights, runs on consumer GPUs |
| DALL-E 3 (OpenAI) | Diffusion with better text alignment | Trained with synthetic high-quality captions |
| Midjourney | Proprietary diffusion | Aesthetic tuning and community ranking |
| Adobe Firefly | Diffusion trained on licensed images | IP-safe training data |
| Sora (OpenAI, video) | Diffusion Transformers (DiT) on video tokens | Temporal coherence over long clips |

⚖️ Inference Speed vs Quality: Steps, Samplers, and Guidance

Steps: More denoising steps = sharper image but slower. 20–50 steps is a common sweet spot.

Samplers/Schedulers: DDPM (original) needs 1,000 steps. DDIM, DPM++, LCM reduce this to 4–50 steps with comparable quality.

Classifier-Free Guidance (CFG scale): Trades creativity for prompt adherence.

  • Low CFG (1–3): dreamlike, diverse, may ignore prompt
  • High CFG (10–20): closely follows prompt but can look oversaturated
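The guidance blend itself is one line: run the noise predictor twice, once with the prompt and once without, then push the prediction past the conditional one. A minimal sketch with made-up numbers:

```python
import numpy as np

# Classifier-free guidance: blend unconditional and conditional predictions.
# eps = eps_uncond + cfg * (eps_cond - eps_uncond); cfg = 1 reproduces
# the conditional prediction, larger values amplify the prompt direction.
def apply_cfg(eps_uncond, eps_cond, cfg_scale):
    return eps_uncond + cfg_scale * (eps_cond - eps_uncond)

eps_u = np.array([0.0, 0.0])   # prediction with an empty prompt
eps_c = np.array([1.0, -1.0])  # prediction with the text prompt
low = apply_cfg(eps_u, eps_c, 1.0)
high = apply_cfg(eps_u, eps_c, 7.5)
```

Overshooting the conditional direction at high scales is what produces the oversaturated, plasticky look mentioned above.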

📌 Key Takeaways

  • Diffusion models learn to remove noise iteratively; inference starts from pure noise and denoises.
  • The model predicts noise $\epsilon$, not pixels; subtracting the predicted noise recovers the cleaned image.
  • Stable Diffusion runs in latent space (8× compression) and uses CLIP for text conditioning.
  • Common samplers (DDIM, DPM++) reduce required steps from 1,000 to ~20 with comparable quality.
  • CFG scale controls the prompt-adherence vs creativity trade-off.

🧩 Test Your Understanding

  1. Why does training a diffusion model require adding noise to images first?
  2. The model outputs $\epsilon_\theta(x_t, t)$. How is this used to recover the image at step $t-1$?
  3. How does text conditioning work in Stable Diffusion?
  4. A user complains the output looks oversaturated and plasticky. Which setting might be too high?

Written by Abstract Algorithms (@abstractalgorithms)