DLCV - Generative Models (II) - Diffusion Model
Final Project
Snack/drink provided
- 1:30pm-5pm, Dec 26th 2024, Thursday
- 3~4 people per group (team up in mid Nov.)
- Adapt from latest CVPR/ICCV/ECCV challenges or competitions
- Poster presentation; code required for reproduction
- Intra/inter-group evaluation
Generative Models
Autoencoder (AE)
- Unsupervised learning for deriving latent representation
- Train AE with reconstruction objectives
- Train autoencoder (AE) for downstream tasks
- After AE training is complete, freeze/finetune the encoder and learn additional modules (e.g., MLP) for downstream tasks
- E.g., to train a DNN for classification, one can freeze the encoder and learn an additional MLP as the classifier (see the sketch below).
Classifier training objectives:
1. BCE (binary cross-entropy)
2. contrastive learning
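A minimal PyTorch sketch of this frozen-encoder setup; the stand-in encoder, all dimensions, and the dummy batch below are illustrative assumptions, not from the slides:

```python
import torch
import torch.nn as nn

# Stand-in for a pretrained AE encoder (assumed already trained elsewhere).
encoder = nn.Sequential(nn.Flatten(), nn.Linear(784, 128))
latent_dim, num_classes = 128, 10

# Freeze the encoder: only the new MLP head will be updated.
for p in encoder.parameters():
    p.requires_grad = False

classifier = nn.Sequential(
    nn.Linear(latent_dim, 64),
    nn.ReLU(),
    nn.Linear(64, num_classes),
)
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

x, y = torch.randn(32, 1, 28, 28), torch.randint(0, num_classes, (32,))  # dummy batch
with torch.no_grad():
    z = encoder(x)                      # frozen latent features
loss = criterion(classifier(z), y)      # cross-entropy on the MLP head only
loss.backward()
optimizer.step()
```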
From AE to Variational Autoencoder (VAE)
Reparameterization Trick in VAE
- Remarks
- Given x, sample z from the latent distribution (described by the output parameters μ and σ of the encoder) via z = μ + σ ⊙ ε, where ε is simply drawn from a standard Normal distribution.
- For training, this enables backpropagating gradients into the encoder through μ and σ; for inference, it introduces stochasticity in generation.
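A minimal sketch of the trick, assuming the encoder outputs μ and log σ² (the log-variance convention is a common implementation choice, not stated on the slide):

```python
import torch

def reparameterize(mu: torch.Tensor, log_var: torch.Tensor) -> torch.Tensor:
    """Sample z = mu + sigma * eps with eps ~ N(0, I)."""
    sigma = torch.exp(0.5 * log_var)   # log-variance keeps sigma positive
    eps = torch.randn_like(sigma)      # stochastic; gradients flow via mu and sigma
    return mu + sigma * eps
```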
Denoising Diffusion Probabilistic Models (DDPM)
Learning to generate by denoising
- 2 processes required for training (see the training-step sketch after this list):
- Forward diffusion process: gradually add noise to the input
- Reverse diffusion process: learns to generate/restore data by denoising; typically implemented via a conditional U-Net
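A minimal sketch of the forward process and the DDPM noise-prediction loss; the linear β schedule and the `model(x_t, t)` interface are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # assumed linear noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)   # cumulative product: abar_t

def q_sample(x0, t, noise):
    """Forward diffusion: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps."""
    ab = alphas_bar[t].view(-1, 1, 1, 1)
    return ab.sqrt() * x0 + (1 - ab).sqrt() * noise

def ddpm_loss(model, x0):
    """Reverse-process training: the U-Net learns to predict the injected noise."""
    t = torch.randint(0, T, (x0.size(0),), device=x0.device)
    noise = torch.randn_like(x0)
    x_t = q_sample(x0, t, noise)
    return F.mse_loss(model(x_t, t), noise)
```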
From DDPM to DDIM:
Denoising Diffusion Implicit Models
- DDIM
- Sampling process for generation (see the sketch below)
- Additional comment on η: stochastic (η > 0) vs. deterministic (η = 0) generation process
- Since DDIM and DDPM share the same objective function, one can use a pretrained DDPM for DDIM generation.
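A minimal sketch of one deterministic (η = 0) DDIM update, reusing an `alphas_bar` schedule like the one in the DDPM sketch above; the noise-prediction `model(x_t, t)` interface is again an assumption:

```python
import torch

@torch.no_grad()
def ddim_step(model, x_t, t, t_prev, alphas_bar):
    """One DDIM update with eta = 0: fully deterministic, no fresh noise injected."""
    ab_t, ab_prev = alphas_bar[t], alphas_bar[t_prev]
    eps = model(x_t, t)                                       # predicted noise
    x0_pred = (x_t - (1 - ab_t).sqrt() * eps) / ab_t.sqrt()   # predicted clean image
    return ab_prev.sqrt() * x0_pred + (1 - ab_prev).sqrt() * eps
```

Because no noise is added, the same initial x_T always maps to the same image, and t_prev may skip many steps for faster sampling.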
Diffusion Model
Conditional Diffusion Model
Classifier Guidance
Classifier-free Guidance
Conditional Generation with Classifier-Free Guidance
Both (1 - r) and r are hyperparameters: when r < 0.5, (1 - r) is larger, meaning more weight on (unconditional) image generation; conversely, when r > 0.5, more weight is placed on the label/conditioning (a sketch of the combined prediction follows).
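A minimal sketch of the combined prediction at sampling time. Note that (1 - r)·ε_uncond + r·ε_cond = ε_uncond + r·(ε_cond - ε_uncond), so the mixing above is the usual guidance-scale form; the conditional `model` interface is an assumption:

```python
import torch

@torch.no_grad()
def cfg_eps(model, x_t, t, cond, null_cond, w: float):
    """Classifier-free guidance: blend conditional and unconditional predictions.

    eps_hat = eps_uncond + w * (eps_cond - eps_uncond); with w = r this equals
    (1 - r) * eps_uncond + r * eps_cond from the note above.
    """
    eps_uncond = model(x_t, t, null_cond)   # condition dropped (e.g., empty prompt)
    eps_cond = model(x_t, t, cond)
    return eps_uncond + w * (eps_cond - eps_uncond)
```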
Text/Image Guidance
Conditional Generation with Image Guidance
- Palette: Image-to-Image Diffusion Models, Google Research, arXiv 2022
- Applications: colorization, inpainting, outpainting, etc.
- Input image as condition (via concatenation)
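A minimal sketch of conditioning via concatenation: the condition image is stacked channel-wise with the noisy input before the U-Net (a common reading of the slide; exact details are in the Palette paper):

```python
import torch

def concat_condition(x_t: torch.Tensor, cond_img: torch.Tensor) -> torch.Tensor:
    """Channel-wise concatenation: (B, 3, H, W) + (B, 3, H, W) -> (B, 6, H, W).

    The U-Net's first convolution must be built to accept the extra channels.
    """
    return torch.cat([x_t, cond_img], dim=1)
```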
Conditional Generation with Text Guidance
- GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models, OpenAI, arXiv 2022
- CLIP (Contrastive Language-Image Pretraining) was previously proposed to measure the alignment between text and image inputs
- Classifier guidance -> CLIP guidance (not training-free)
- What is CLIP?
CLIP: Contrastive Language-Image Pretraining
- Learning Transferable Visual Models From Natural Language Supervision, OpenAI, ICML 2021 (w/ 22000+ citations)
- Why not just CNN?
- Requires annotated data for training image classifiers
- Domain gap between closed-world and open-world data
- Lacks the ability to perform zero-shot classification
Objectives
- Cross-domain contrastive learning from large-scale image-language data (see the loss sketch below)
- Next-token prediction (what's this & why?); more on this in the Transformer lecture
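A minimal sketch of CLIP's symmetric contrastive objective over a batch of paired image/text embeddings; the temperature value and variable names are illustrative:

```python
import torch
import torch.nn.functional as F

def clip_loss(img_emb, txt_emb, temperature: float = 0.07):
    """Symmetric InfoNCE-style loss: matched (i, i) pairs are the positives."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature               # (B, B) similarity matrix
    targets = torch.arange(img.size(0), device=img.device)
    loss_i2t = F.cross_entropy(logits, targets)        # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)    # text -> image direction
    return (loss_i2t + loss_t2i) / 2
```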
Zero-shot
- Even if a label was never defined during training, the model can still recognize it at test time
- E.g., the training labels cover ten animal classes such as lions and tigers but not giraffes; at test time, given a photo of a giraffe with "giraffe" among the candidate labels, the model can still identify it
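A minimal sketch of zero-shot classification: class names are turned into text prompts, and the image is assigned to the most similar text embedding. The `encode_image`/`encode_text`/`tokenize` interface mirrors the open-source CLIP package, but treat it as an assumption here:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(model, image, class_names, tokenize):
    """Classify a single image against arbitrary class names, none seen in training."""
    prompts = tokenize([f"a photo of a {c}" for c in class_names])
    img = F.normalize(model.encode_image(image), dim=-1)   # (1, D)
    txt = F.normalize(model.encode_text(prompts), dim=-1)  # (C, D)
    probs = (100.0 * img @ txt.t()).softmax(dim=-1)        # (1, C)
    return class_names[probs.argmax(dim=-1).item()]
```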
Questions for Image Generation
- How to evaluate your unconditional image generation results?
- e.g., FID (Fréchet Inception Distance; see the formula below) between real and generated images
- How to evaluate your conditional image generation results?
- e.g., classification accuracy of generated images w.r.t. the given condition
- Any objective/subjective and quantitative/qualitative evaluation?
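For the quantitative side, a standard objective metric is the Fréchet Inception Distance (FID), computed from Gaussian statistics (μ, Σ) of Inception features of real (r) vs. generated (g) images; this formula is standard background, not from the slides:

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Lower is better; subjective/qualitative evaluation (e.g., user studies) complements it.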
Personalization via Diffusion Model
Diffusion Model for Personalization (1): Textual Inversion
- Proposed by NV Research, ICLR 2023
- Goal: Learn a special token (e.g., S*) to represent the concept of interest
- Learning of special token S*
- Pre-train and fix text encoder & diffusion model (i.e., generator)
- Randomly initialize a token as the text encoder input
- Optimize this token via image reconstruction objectives
A known issue with Textual Inversion: overfitting (a sketch of the token-optimization loop follows).
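A minimal sketch of the loop, reusing the `q_sample` forward-diffusion helper from the DDPM sketch; the `embed_prompt` splicing function and the 768-dim embedding size are illustrative placeholders, not an actual pipeline API:

```python
import torch
import torch.nn.functional as F

# The S* embedding is the ONLY trainable parameter; text encoder & U-Net stay frozen.
new_token_emb = torch.nn.Parameter(torch.randn(768) * 0.01)
optimizer = torch.optim.AdamW([new_token_emb], lr=5e-3)

def textual_inversion_step(text_encoder, unet, embed_prompt, x0, q_sample, T=1000):
    """One step: reconstruct a concept image, updating only S*'s embedding."""
    t = torch.randint(0, T, (x0.size(0),), device=x0.device)
    noise = torch.randn_like(x0)
    x_t = q_sample(x0, t, noise)
    # embed_prompt splices new_token_emb into the embeddings of "a photo of S*"
    cond = text_encoder(embed_prompt(new_token_emb))
    loss = F.mse_loss(unet(x_t, t, cond), noise)
    loss.backward()                      # gradients reach only new_token_emb
    optimizer.step(); optimizer.zero_grad()
    return loss.item()
```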
Diffusion Model for Personalization (2): DreamBooth
- Proposed by Google Research, CVPR 2023
- Finetune the diffusion model w/ a fixed token to represent the image concept
- Determine and fix a rare token (e.g., [V])
- Finetune the diffusion model with image reconstruction objectives
- Enforce a class-specific prior
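In the common noise-prediction restatement, the class-specific prior adds a second reconstruction term on images x' sampled from the original model with the class prompt alone (λ and the exact weighting are from the DreamBooth paper; this form is a standard paraphrase):

```latex
\mathcal{L} =
  \mathbb{E}\big[\, \lVert \epsilon - \epsilon_\theta(x_t,\, t,\, c_{[V]\ \text{class}}) \rVert_2^2 \,\big]
  + \lambda\; \mathbb{E}\big[\, \lVert \epsilon' - \epsilon_\theta(x'_t,\, t,\, c_{\text{class}}) \rVert_2^2 \,\big]
```

The second term keeps the finetuned model from forgetting the general class (e.g., "dog") while it learns the specific instance [V].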
Diffusion Model for Personalization (3): ControlNet
- Proposed by Stanford, ICCV 2023
- Goal: personalization via user-determined conditions
- A trainable branch initialized from the U-Net's encoder
Notations:
- x: input noise of each layer
- y: output noise of each layer
- c: conditions (e.g., edge, pose, sketch, etc.)
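A minimal sketch of ControlNet's zero-convolution wiring: 1×1 convolutions initialized to zero connect the trainable copy, so the control branch contributes nothing at the start of training and the frozen U-Net's behavior is preserved (shapes and names below are illustrative):

```python
import torch
import torch.nn as nn

def zero_conv(channels: int) -> nn.Conv2d:
    """1x1 convolution with weights and bias initialized to zero."""
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

class ControlledBlock(nn.Module):
    """y = F(x) + ZeroConv(F_copy(x + ZeroConv(c))), following the ControlNet wiring."""
    def __init__(self, frozen_block: nn.Module, trainable_copy: nn.Module, channels: int):
        super().__init__()
        self.block = frozen_block            # locked U-Net block
        self.copy = trainable_copy           # trainable copy, initialized from it
        self.zin = zero_conv(channels)       # injects the condition c
        self.zout = zero_conv(channels)      # adds the control signal back to y

    def forward(self, x, c):
        return self.block(x) + self.zout(self.copy(x + self.zin(c)))
```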
Generative Adversarial Network
From VAE to GAN
- Remarks
- We only need the decoder/generator in practice.
- We prefer fast generation.
- How do we know if the output images are sufficiently good?
- the modeled data distribution is not necessarily a Normal distribution
- Example
- TPA3D: Triplane Attention for Fast Text-to-3D Generation, ECCV 2024
- Bin-Shih Wu, Hong-En Chen, Shen-Yu Huang, and Y.-C. Frank Wang
Generative Adversarial Network
- Idea
- Generator: converts a vector z (sampled from a prior p(z), e.g., a Normal distribution) into fake data x (drawn from p_G), where we need p_G to match the real data distribution p_data
- Discriminator: classifies data as real or fake (1/0)
- How? Impose an adversarial loss on the observed data distribution!
- Key idea:
- Impose adversarial loss on data distribution
- Remarks
- A function (the generator) maps the Normal distribution p(z) to the data distribution p_data
- How good are we at mapping p(z) to p_data? Train & ask the discriminator!
- Conduct a two-player min-max game (see next slide for more details)
Once a GAN is trained extremely well, the discriminator can no longer tell real from fake, so its probability of correctly identifying a generator-produced image drops to 50/50.
Training Objective of GAN
Jointly train generator G and discriminator D with a min-max game
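Concretely, the min-max game is the standard GAN objective:

```latex
\min_G \max_D \; V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}}\!\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p(z)}\!\big[\log\!\big(1 - D(G(z))\big)\big]
```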
Train G & D with alternating gradient updates
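A minimal sketch of one alternating update; the generator uses the non-saturating loss discussed under "Potential Problem" below, and D is assumed to output a single logit per sample:

```python
import torch
import torch.nn.functional as F

def gan_training_step(G, D, real, z_dim, opt_g, opt_d):
    """One alternating update: D first (real -> 1, fake -> 0), then G."""
    b = real.size(0)
    ones = torch.ones(b, 1, device=real.device)
    zeros = torch.zeros(b, 1, device=real.device)

    # --- Discriminator update ---
    fake = G(torch.randn(b, z_dim, device=real.device)).detach()  # no grad into G
    loss_d = F.binary_cross_entropy_with_logits(D(real), ones) + \
             F.binary_cross_entropy_with_logits(D(fake), zeros)
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # --- Generator update (non-saturating: minimize -log D(G(z))) ---
    fake = G(torch.randn(b, z_dim, device=real.device))
    loss_g = F.binary_cross_entropy_with_logits(D(fake), ones)
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()
```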
Potential Problem
At the start of training, G is not good yet (obviously); D easily tells apart real/fake data (i.e., D(G(z)) close to 0).
Possible Solution:
- Instead of training G to minimize log(1 - D(G(z))) at the beginning, train G to minimize -log(D(G(z))).
- With stronger gradients for G, we can then start the training of the above min-max game.
Optimality of GAN
Remarks on Optimality of GAN
- Caution!
- G and D are learned models (i.e., DNNs) with fixed architectures; we don't know whether they can actually represent the optimal D & G.
- Optimality of GAN does not tell us anything about convergence to the optimal D/G.
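For reference, the optimality result in question: for a fixed G, the optimal discriminator has the closed form from the original GAN analysis

```latex
D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_G(x)}
```

At the global optimum p_G = p_data, so D*(x) = 1/2 everywhere, matching the 50/50 remark above.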
Conditional GANs
- Remarks
- ICLR 2016
- Conditional generative model p(x|y) instead of p(x)
- Both G and D take the label y as an additional input, i.e., a conditional discriminator is deployed. Why?
- Why not just use D as designed in the standard GAN?
Representation Disentanglement: Conditional GAN
- Goal
- Interpretable deep feature representation
- Disentangle attribute of interest c from the derived latent representation z
- Unsupervised: InfoGAN
- Supervised: AC-GAN
InfoGAN
Unsupervised Disentanglement
No guarantee of disentangling particular semantics (a sketch of the objective follows)
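In its standard form, InfoGAN augments the min-max game with a variational lower bound L_I on the mutual information between the latent code c and the generated sample, weighted by λ:

```latex
\min_{G, Q} \max_D \; V(D, G) - \lambda\, L_I(G, Q),
\qquad L_I(G, Q) \le I\big(c;\, G(z, c)\big)
```

Because nothing ties c to any particular attribute, the disentangled factors emerge without supervision, hence no guarantee on which semantics they capture.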