DLCV - Generative Models (II) - Diffusion Model

LAVI

Final Project

Snack/drink provided

  • 1:30pm-5pm, Dec 26th 2024, Thursday
  • 3~4 people per group (team up in mid Nov.)
  • Adapted from the latest CVPR/ICCV/ECCV challenges or competitions
  • Poster presentation; code required for reproduction
  • Intra/inter-group evaluation

Generative Models

Autoencoder (AE)

  • Unsupervised learning for deriving latent representation
    • Train AE with reconstruction objectives
  • Train autoencoder (AE) for downstream tasks
    • After AE training is complete, freeze/finetune the encoder and
      learn additional modules (e.g., MLP) for downstream tasks
    • E.g., to train a DNN for classification,
      one can freeze the encoder and learn an additional MLP as the classifier.

Classifier head options (see the sketch below):
1. BCE (binary cross-entropy) loss
2. Contrastive learning
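
A minimal PyTorch sketch of this recipe (layer sizes are illustrative, e.g., flattened 28×28 inputs): train the AE with a reconstruction loss, then freeze the encoder and train an MLP head on top; the 10-way head below uses cross-entropy (the binary-label case would use BCE, per the note above).

```python
import torch
import torch.nn as nn

# Illustrative autoencoder: encoder maps inputs to a latent code,
# decoder reconstructs the input from that code.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 32))
decoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 784))

recon_loss = nn.MSELoss()
ae_opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

def ae_step(x):
    """Train AE with a reconstruction objective."""
    ae_opt.zero_grad()
    loss = recon_loss(decoder(encoder(x)), x.flatten(1))
    loss.backward()
    ae_opt.step()
    return loss

# After AE training: freeze the encoder, learn an additional MLP classifier.
for p in encoder.parameters():
    p.requires_grad = False
classifier = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
cls_loss = nn.CrossEntropyLoss()
cls_opt = torch.optim.Adam(classifier.parameters(), lr=1e-3)

def cls_step(x, y):
    """Downstream task: only the MLP head receives gradient updates."""
    cls_opt.zero_grad()
    loss = cls_loss(classifier(encoder(x)), y)
    loss.backward()
    cls_opt.step()
    return loss
```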

From AE to Variational Autoencoder (VAE)

Reparameterization Trick in VAE

  • Remarks
    • Given x, instead of sampling z directly from the latent distribution (described by the encoder's output parameters μ and σ), we apply z = μ + σ ⊙ ε, where ε is simply drawn from a standard Normal distribution (see the sketch below).
    • For training, this enables backpropagation of gradients into the encoder through μ and σ; for inference, it introduces stochasticity into generation.
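
A minimal sketch of the trick, following the standard VAE convention that the encoder outputs μ and log σ²:

```python
import torch

def reparameterize(mu, logvar):
    """z = mu + sigma * eps, with eps ~ N(0, I).

    The randomness is isolated in eps, which is outside the computation
    graph, so gradients can backpropagate into the encoder through mu
    and sigma.
    """
    std = torch.exp(0.5 * logvar)   # sigma, from the encoder's log-variance output
    eps = torch.randn_like(std)     # all stochasticity comes from eps
    return mu + std * eps
```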

Denoising Diffusion Probabilistic Models (DDPM)

Learning to generate by denoising

  • 2 processes required for training:
    • Forward diffusion process
      • gradually add noise to input
    • Reverse diffusion process
      • learns to generate/restore data by denoising
      • typically implemented via a conditional U-Net (see the sketch below)
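
A minimal PyTorch sketch of both processes; the ε-prediction U-Net signature `model(x_t, t)` and 4-D image tensors are assumptions of this sketch.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # standard DDPM noise schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)    # cumulative product: \bar{alpha}_t

def forward_diffuse(x0, t):
    """Closed-form forward process: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    eps = torch.randn_like(x0)
    a = alpha_bar[t].view(-1, 1, 1, 1)
    return a.sqrt() * x0 + (1 - a).sqrt() * eps, eps

def ddpm_loss(model, x0):
    """Train the U-Net `model` to predict the noise added at a random step t."""
    t = torch.randint(0, T, (x0.size(0),))
    x_t, eps = forward_diffuse(x0, t)
    return ((model(x_t, t) - eps) ** 2).mean()   # simple epsilon-prediction MSE
```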

From DDPM to DDIM:
Denoising Diffusion Implicit Models

  • DDIM
    • Sampling process for generation (see the sketch below)
  • Additional comment on the variance parameter η: stochastic (η > 0) vs. deterministic (η = 0) generation process
  • Since DDIM and DDPM share the same training objective,
    one can use a pretrained DDPM for DDIM generation.
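
A sketch of one DDIM sampling update; the noise schedule matches the DDPM sketch above, and the `model(x_t, t)` signature is again an assumption.

```python
import torch

# Same schedule as in the DDPM sketch: alpha_bar[t] = prod_{s<=t} (1 - beta_s)
betas = torch.linspace(1e-4, 0.02, 1000)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

def ddim_step(model, x_t, t, t_prev, eta=0.0):
    """One DDIM update from step t to t_prev (Python ints, t_prev < t).

    eta = 0 gives a fully deterministic sampler; eta = 1 recovers DDPM-like
    stochasticity. Only the sampler changes, so a pretrained DDPM works as-is.
    """
    eps = model(x_t, torch.full((x_t.size(0),), t, dtype=torch.long))
    a_t, a_prev = alpha_bar[t], alpha_bar[t_prev]
    x0_pred = (x_t - (1 - a_t).sqrt() * eps) / a_t.sqrt()    # predicted clean sample
    sigma = eta * ((1 - a_prev) / (1 - a_t)).sqrt() * (1 - a_t / a_prev).sqrt()
    dir_xt = (1 - a_prev - sigma ** 2).sqrt() * eps          # direction pointing to x_t
    return a_prev.sqrt() * x0_pred + dir_xt + sigma * torch.randn_like(x_t)
```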

Diffusion Model

Conditional Diffusion Model

Classifier Guidance

Classifier-free Guidance

Conditional Generation with Classifier-Free Guidance

Both (1 − r) and r are hyperparameters (the weights in the guidance combination). When r < 0.5, (1 − r) is larger, meaning we care more about image generation (the unconditional term); conversely, when r > 0.5, we care more about matching the label/condition. A sketch follows below.
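
A sketch of this weighting; the `model(x_t, t, cond)` signature and the `null_cond` placeholder are assumptions of this sketch.

```python
import torch

def cfg_eps(model, x_t, t, cond, null_cond, r=0.7):
    """Classifier-free guidance with the lecture's weighting:
    (1 - r) on the unconditional term, r on the conditional term.

    Algebraically, (1-r)*eps_u + r*eps_c = eps_u + r*(eps_c - eps_u), i.e.,
    the common CFG form eps_u + w*(eps_c - eps_u) with guidance weight w = r
    (in practice w > 1, i.e., extrapolation, is often used).
    """
    eps_c = model(x_t, t, cond)       # condition kept (label/text)
    eps_u = model(x_t, t, null_cond)  # condition dropped, as during training
    return (1 - r) * eps_u + r * eps_c
```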

Text/Image Guidance

Conditional Generation with Image Guidance

  • Palette: Image-to-Image Diffusion Models, Google Research, arXiv 2022
    • Applications: colorization, inpainting, outpainting, etc.
    • Input image as condition (via concatenation; see the sketch below)
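
A minimal sketch of the concatenation-based conditioning (function name is illustrative):

```python
import torch

def palette_unet_input(x_t, cond_img):
    """Palette-style conditioning: concatenate the condition image (e.g., the
    grayscale input for colorization, or the masked image for inpainting)
    with the noisy sample along the channel dimension before the U-Net."""
    return torch.cat([x_t, cond_img], dim=1)   # (B, C_x + C_cond, H, W)
```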

Conditional Generation with Text Guidance

  • GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models, OpenAI, arXiv 2022
    • CLIP (Contrastive Language-Image Pretraining) was previously proposed to measure the alignment between text and image inputs
    • Classifier guidance -> CLIP guidance (not training-free; see the sketch after this list)
    • What is CLIP?
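
A hedged sketch of swapping the classifier for CLIP in classifier guidance; the encoder handle and scalar arguments are assumptions of this sketch.

```python
import torch

def clip_guided_eps(eps, x_t, text_feat, clip_image_encoder, scale, sqrt_1m_abar):
    """Classifier guidance with the classifier replaced by CLIP similarity.

    The predicted noise is shifted by the gradient of the image-text similarity
    w.r.t. the noisy input; CLIP must itself be (fine)tuned on noisy images,
    which is why this guidance is not training-free.
    """
    with torch.enable_grad():
        x = x_t.detach().requires_grad_(True)
        sim = torch.cosine_similarity(clip_image_encoder(x), text_feat, dim=-1).sum()
        grad = torch.autograd.grad(sim, x)[0]
    # analogous to classifier guidance: eps - s * sqrt(1 - abar) * grad log p(y|x)
    return eps - scale * sqrt_1m_abar * grad
```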

CLIP: Contrastive Language-Image Pretraining

  • Learning Transferable Visual Models From Natural Language Supervision, OpenAI, ICML 2021 (w/ 22000+ citations)
  • Why not just a CNN classifier?
    • Requires annotated data for training image classification
    • Domain gap between closed- and open-world domain data
    • Lacks the ability to perform zero-shot classification

Objectives

  • Cross-domain contrastive learning from large-scale image-language data (a sketch of the contrastive objective follows below)
  • Next-token prediction (what’s this & why?); more on this in the Transformer lecture
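
A minimal sketch of the image-text contrastive (InfoNCE-style) objective, assuming precomputed feature batches where row i of each batch is a matched pair:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric cross-entropy over an (N x N) image-text similarity matrix.

    Matched pairs sit on the diagonal; the loss pulls them together and pushes
    mismatched pairs apart, in both image->text and text->image directions.
    """
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = image_feats @ text_feats.t() / temperature
    labels = torch.arange(logits.size(0))
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2
```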

Zero-shot

  • Even if a label was never defined in the training set, the model can still recognize it at test time.
  • E.g., suppose the training labels never include "giraffe" and cover only ten other animals such as lions and tigers; at test time, given a giraffe photo with "giraffe" provided as a candidate label, the model can still identify it (see the sketch below).
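
A sketch of the giraffe example using the openai/CLIP reference implementation; the file path and label prompts are illustrative.

```python
import clip
import torch
from PIL import Image

# Load a pretrained CLIP model and its matching image preprocessing pipeline.
model, preprocess = clip.load("ViT-B/32")

image = preprocess(Image.open("giraffe.jpg")).unsqueeze(0)
labels = ["a photo of a lion", "a photo of a tiger", "a photo of a giraffe"]
text = clip.tokenize(labels)

with torch.no_grad():
    # Similarity between the image and every candidate label prompt.
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1)

# Picks "a photo of a giraffe" even though "giraffe" was never a training label.
print(labels[probs.argmax()])
```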

Questions for Image Generation

  • How to evaluate your unconditional image generation results?
    • sample from random noisy latents and assess image quality/diversity
  • How to evaluate your conditional image generation results?
    • classification accuracy (of a pretrained classifier on the generated images)
  • Any objective/subjective and quantitative/qualitative evaluation? (an example metric sketch follows below)
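
One common quantitative choice, offered here as an editorial example rather than the slide's answer, is FID; a minimal sketch with torchmetrics (assuming the image extras are installed, and using random stand-in batches):

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

# FID compares Inception-feature statistics of real vs. generated images.
fid = FrechetInceptionDistance(feature=2048)  # expects uint8 images by default
real = torch.randint(0, 256, (64, 3, 299, 299), dtype=torch.uint8)  # stand-in batch
fake = torch.randint(0, 256, (64, 3, 299, 299), dtype=torch.uint8)  # stand-in batch
fid.update(real, real=True)
fid.update(fake, real=False)
print(fid.compute())  # lower is better
```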

Personalization via Diffusion Model

Diffusion Model for Personalization (1): Textual Inversion

  • Proposed by NVIDIA Research, ICLR 2023
  • Goal: Learn a special token (e.g., S*) to represent the concept of interest
  • Learning of special token S*
    • Pre-train and fix text encoder & diffusion model (i.e., generator)
    • Randomly initialize a token as the text encoder input
    • Optimize this token via image reconstruction objectives (see the sketch below)
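
A minimal sketch of the optimization loop; the embedding width, prompt helper, and `unet`/`text_encoder` signatures are assumptions of this sketch.

```python
import torch

# Only this embedding is optimized; the text encoder & diffusion U-Net stay frozen.
s_star = torch.randn(1, 768, requires_grad=True)   # random init of the S* token
opt = torch.optim.Adam([s_star], lr=5e-3)

def textual_inversion_step(unet, text_encoder, x0, alpha_bar):
    """One step of the image reconstruction objective on images of the concept."""
    t = torch.randint(0, 1000, (x0.size(0),))
    eps = torch.randn_like(x0)
    a = alpha_bar[t].view(-1, 1, 1, 1)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps
    # embed_prompt_with() is a hypothetical helper that splices the S* embedding
    # into a template such as "a photo of S*" before the frozen text encoder.
    cond = text_encoder(embed_prompt_with(s_star))
    loss = ((unet(x_t, t, cond) - eps) ** 2).mean()  # gradients reach only s_star
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss
```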

A common problem with Textual Inversion: overfitting

Diffusion Model for Personalization (2): DreamBooth

  • Proposed by Google Research, CVPR 2023
  • Finetune the diffusion model w/ a fixed token to represent the image concept
    • Determine and fix a rare token (e.g., [V])
    • Finetune the diffusion model with image reconstruction objectives (see the sketch below)
    • Enforce a class-specific prior
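
A sketch of the DreamBooth objective; the noised tensors are assumed precomputed, and the argument names are illustrative.

```python
import torch

def dreambooth_loss(unet, x_t, t, eps, xp_t, tp, eps_p, cond_v, cond_cls, lam=1.0):
    """Subject reconstruction plus class-specific prior preservation.

    cond_v:   embedding of "a [V] <class>" (rare token bound to the subject)
    cond_cls: embedding of "a <class>" (the class-specific prior)
    xp_t/eps_p: noised versions of images sampled from the *frozen* pretrained
    model, which keep the class prior from drifting while the U-Net is finetuned.
    """
    subject_term = ((unet(x_t, t, cond_v) - eps) ** 2).mean()
    prior_term = ((unet(xp_t, tp, cond_cls) - eps_p) ** 2).mean()
    return subject_term + lam * prior_term
```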

Diffusion Model for Personalization (3): ControlNet

  • Proposed by Stanford, ICCV 2023

  • Goal: personalization via user-determined condition

  • Trainable copy initialized from the U-Net’s encoder

  • Notations:

    • x: input feature of each layer
    • y: output feature of each layer
    • c: conditions (e.g., edge, pose, sketch, etc.); see the sketch below
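
A sketch of one ControlNet block built from these pieces; the condition c is assumed already encoded to the block's feature shape.

```python
import copy
import torch
import torch.nn as nn

def zero_conv(channels):
    """1x1 conv initialized to zero: the ControlNet branch starts as a no-op,
    so finetuning begins exactly at the pretrained model's behavior."""
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

class ControlNetBlock(nn.Module):
    """y = frozen(x) + zero_conv(copy(x + zero_conv(c)))."""
    def __init__(self, pretrained_block, channels):
        super().__init__()
        self.copy = copy.deepcopy(pretrained_block)           # trainable, init from U-Net encoder
        self.frozen = pretrained_block.requires_grad_(False)  # locked pretrained weights
        self.zin, self.zout = zero_conv(channels), zero_conv(channels)

    def forward(self, x, c):
        return self.frozen(x) + self.zout(self.copy(x + self.zin(c)))
```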

Generative Adversarial Network

From VAE to GAN

  • Remarks
    • We only need the decoder/generator in practice.
    • We prefer fast generation.
    • How do we know if the output images are sufficiently good?
    • The latent prior need not be a Normal distribution.
  • Example
    • TPA3D: Triplane Attention for Fast Text-to-3D Generation, ECCV 2024
    • Bin-Shih Wu, Hong-En Chen, Shen-Yu Huang, and Y.-C. Frank Wang

Generative Adversarial Network

  • Idea
    • Generator converts a vector z (sampled from a prior p(z), e.g., Normal) into fake data x (following the generated distribution p_G), while we need p_G ≈ p_data
    • Discriminator classifies data as real or fake (1/0)
    • How? Impose an adversarial loss on the observed data distribution!
  • Key idea:
    • Impose adversarial loss on data distribution
  • Remarks
    • The generator is a function mapping the Normal prior p(z) to p_G
    • How good are we at mapping p_G to p_data?
      • Train & ask the discriminator!
    • Conduct a two-player min-max game (see next slide for more details)

Once a GAN is trained extremely well, the discriminator can only guess 50/50 whether an image from the generator is real, because it can no longer tell real from fake.

Training Objective of GAN

  • Jointly train generator G and discriminator D with a min-max game

  • Train G & D with alternating gradient updates

  • Potential Problem

  • At the start of training, G is still weak (obviously);
    D easily tells real/fake data apart (i.e., D(G(z)) is close to 0),
    where the gradient of log(1 − D(G(z))) saturates (is nearly flat).

  • Possible Solution:

    • Instead of training G to minimize log(1 − D(G(z))) in the beginning,
      we train G to minimize −log(D(G(z))).
    • With the resulting strong gradients for G, the training of the above
      min-max game can get started (see the sketch below).
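
A compact sketch of the alternating updates with the non-saturating G loss; G and D are assumed PyTorch modules, with D ending in a sigmoid.

```python
import torch
import torch.nn.functional as F

def gan_step(G, D, x_real, opt_g, opt_d, z_dim=100):
    """One alternating update. D maximizes log D(x) + log(1 - D(G(z)));
    G uses the non-saturating loss -log D(G(z)) for strong early gradients."""
    B = x_real.size(0)
    ones, zeros = torch.ones(B, 1), torch.zeros(B, 1)

    # --- Discriminator update ---
    z = torch.randn(B, z_dim)
    x_fake = G(z).detach()                        # stop gradients into G
    d_loss = F.binary_cross_entropy(D(x_real), ones) + \
             F.binary_cross_entropy(D(x_fake), zeros)
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # --- Generator update (non-saturating) ---
    z = torch.randn(B, z_dim)
    g_loss = F.binary_cross_entropy(D(G(z)), ones)   # == -E[log D(G(z))]
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss, g_loss
```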

Optimality of GAN

Remarks on Optimality of GAN

  • Caution!
    • G and D are learned models (i.e., DNNs) with fixed architectures.
      We don’t know whether we can actually represent the optimal D & G.
    • Optimality of GAN tells us nothing about convergence to the optimal D/G (the standard derivation is sketched below).
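
For reference, a compact version of the derivation these remarks refer to, as in the original GAN paper:

```latex
% Fix G and maximize V(G, D) pointwise over D(x):
V(G, D) = \mathbb{E}_{x \sim p_{data}}[\log D(x)]
        + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]
\;\Rightarrow\; D_G^*(x) = \frac{p_{data}(x)}{p_{data}(x) + p_G(x)}.

% Substituting D_G^* back into V:
C(G) = \max_D V(G, D) = -\log 4 + 2\, \mathrm{JSD}(p_{data} \,\|\, p_G),

% which is minimized iff p_G = p_{data}, where D_G^*(x) = 1/2
% (the 50/50 discriminator noted earlier).
```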

Conditional GANs

  • Remarks
    • ICLR 2016
    • Conditional generative model p(x|y) instead of p(x)
    • Both G and D take the label y as an additional input,
      i.e., a conditional discriminator is deployed. Why? (see the sketch below)
    • Why not just use D as designed in the standard GAN?
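
One answer, as a hedged sketch (layer sizes illustrative): a conditional D can penalize samples that look realistic but do not match their label y, which a label-blind standard D could never detect.

```python
import torch
import torch.nn as nn

class ConditionalDiscriminator(nn.Module):
    """D sees the label y alongside x (here via a learned label embedding
    concatenated to the flattened input), so it scores realism *and*
    label consistency jointly."""
    def __init__(self, x_dim=784, n_classes=10, emb_dim=32):
        super().__init__()
        self.embed = nn.Embedding(n_classes, emb_dim)
        self.net = nn.Sequential(
            nn.Linear(x_dim + emb_dim, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 1), nn.Sigmoid(),
        )

    def forward(self, x, y):
        return self.net(torch.cat([x.flatten(1), self.embed(y)], dim=1))
```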

Representation Disentanglement: Conditional GAN

  • Goal
    • Interpretable deep feature representation
    • Disentangle attribute of interest c from the derived latent representation z
      • Unsupervised: InfoGAN
      • Supervised: AC-GAN

InfoGAN

  • Unsupervised Disentanglement

  • No guarantee of disentangling particular semantics (see the sketch below)
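
A hedged sketch of InfoGAN's core term (the auxiliary head Q is an assumed module): maximize a variational lower bound on the mutual information between the injected code c and the generated image.

```python
import torch
import torch.nn.functional as F

def infogan_mi_loss(Q, x_fake, c):
    """Lower bound on I(c; G(z, c)): an auxiliary head Q tries to recover the
    latent code c from generated images (Q usually shares layers with D).
    No labels are used, hence 'unsupervised', but which semantic each code
    dimension ends up capturing is not guaranteed."""
    c_logits = Q(x_fake)
    return F.cross_entropy(c_logits, c)   # for a categorical code c
```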