DLCV - Generative Models (II) - Diffusion Model
Final Project
Snack/drink provided
- 1:30pm-5pm, Dec 26th 2024, Thursday
- 3~4 people per group (team up in mid Nov.)
- Adapt from latest CVPR/ICCV/ECCV challenges or competitions
- Poster presentation; code required for reproduction
- Intra/inter-group evaluation
Generative Models
Autoencoder (AE)
- Unsupervised learning for deriving latent representation
- Train AE with reconstruction objectives
- Train autoencoder (AE) for downstream tasks
- After AE training is complete, freeze/finetune the encoder and learn additional modules (e.g., MLP) for downstream tasks
- E.g., to train a DNN for classification, one can freeze the encoder and learn an additional MLP as the classifier (see the sketch below).
Classifier training objectives:
1. BCE (binary cross-entropy)
2. contrastive learning
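A minimal PyTorch sketch of this frozen-encoder setup; the stand-in encoder, all dimensions, and the dummy batch below are illustrative assumptions, not from the slides:

```python
import torch
import torch.nn as nn

# Stand-in for a pretrained AE encoder (assumed already trained elsewhere).
encoder = nn.Sequential(nn.Flatten(), nn.Linear(784, 128))
latent_dim, num_classes = 128, 10

# Freeze the encoder: only the new MLP head will be updated.
for p in encoder.parameters():
    p.requires_grad = False

classifier = nn.Sequential(
    nn.Linear(latent_dim, 64),
    nn.ReLU(),
    nn.Linear(64, num_classes),
)
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

x, y = torch.randn(32, 1, 28, 28), torch.randint(0, num_classes, (32,))  # dummy batch
with torch.no_grad():
    z = encoder(x)                      # frozen latent features
loss = criterion(classifier(z), y)      # cross-entropy on the MLP head only
loss.backward()
optimizer.step()
```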
From AE to Variational Autoencoder (VAE)
Reparameterization Trick in VAE
- Remarks
- Given x, sample z from the latent distribution (described by the output parameters μ and σ of the encoder) via z = μ + σ ⊙ ε, where ε is simply drawn from a standard Normal distribution.
- For training, this enables backpropagating gradients into the encoder through μ and σ; for inference, it introduces stochasticity in generation.
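A minimal sketch of the trick, assuming the encoder outputs μ and log σ² (the log-variance convention is a common implementation choice, not stated on the slide):

```python
import torch

def reparameterize(mu: torch.Tensor, log_var: torch.Tensor) -> torch.Tensor:
    """Sample z = mu + sigma * eps with eps ~ N(0, I)."""
    sigma = torch.exp(0.5 * log_var)   # log-variance keeps sigma positive
    eps = torch.randn_like(sigma)      # stochastic; gradients flow via mu and sigma
    return mu + sigma * eps
```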
Denoising Diffusion Probabilistic Models (DDPM)
Learning to generate by denoising
- 2 processes required for training (see the training-step sketch after this list):
- Forward diffusion process: gradually add noise to the input
- Reverse diffusion process: learns to generate/restore data by denoising; typically implemented via a conditional U-Net
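A minimal sketch of the forward process and the DDPM noise-prediction loss; the linear β schedule and the `model(x_t, t)` interface are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # assumed linear noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)   # cumulative product: abar_t

def q_sample(x0, t, noise):
    """Forward diffusion: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps."""
    ab = alphas_bar[t].view(-1, 1, 1, 1)
    return ab.sqrt() * x0 + (1 - ab).sqrt() * noise

def ddpm_loss(model, x0):
    """Reverse-process training: the U-Net learns to predict the injected noise."""
    t = torch.randint(0, T, (x0.size(0),), device=x0.device)
    noise = torch.randn_like(x0)
    x_t = q_sample(x0, t, noise)
    return F.mse_loss(model(x_t, t), noise)
```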
From DDPM to DDIM:
Denoising Diffusion Implicit Models
- DDIM
- Sampling process for generation (see the sketch below)
- Additional comment on η: stochastic (η > 0) vs. deterministic (η = 0) generation process
- Since DDIM and DDPM share the same objective function, one can use a pretrained DDPM for DDIM generation.
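A minimal sketch of one deterministic (η = 0) DDIM update, reusing an `alphas_bar` schedule like the one in the DDPM sketch above; the noise-prediction `model(x_t, t)` interface is again an assumption:

```python
import torch

@torch.no_grad()
def ddim_step(model, x_t, t, t_prev, alphas_bar):
    """One DDIM update with eta = 0: fully deterministic, no fresh noise injected."""
    ab_t, ab_prev = alphas_bar[t], alphas_bar[t_prev]
    eps = model(x_t, t)                                       # predicted noise
    x0_pred = (x_t - (1 - ab_t).sqrt() * eps) / ab_t.sqrt()   # predicted clean image
    return ab_prev.sqrt() * x0_pred + (1 - ab_prev).sqrt() * eps
```

Because no noise is added, the same initial x_T always maps to the same image, and t_prev may skip many steps for faster sampling.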
Diffusion Model
Conditional Diffusion Model
Classifier Guidance
Classifier-free Guidance
Conditional Generation with Classifier-Free Guidance
Both (1 - r) and r are hyperparameters: when r < 0.5, (1 - r) is larger, meaning more weight on (unconditional) image generation; conversely, when r > 0.5, more weight is placed on the label/conditioning (a sketch of the combined prediction follows).
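A minimal sketch of the combined prediction at sampling time. Note that (1 - r)·ε_uncond + r·ε_cond = ε_uncond + r·(ε_cond - ε_uncond), so the mixing above is the usual guidance-scale form; the conditional `model` interface is an assumption:

```python
import torch

@torch.no_grad()
def cfg_eps(model, x_t, t, cond, null_cond, w: float):
    """Classifier-free guidance: blend conditional and unconditional predictions.

    eps_hat = eps_uncond + w * (eps_cond - eps_uncond); with w = r this equals
    (1 - r) * eps_uncond + r * eps_cond from the note above.
    """
    eps_uncond = model(x_t, t, null_cond)   # condition dropped (e.g., empty prompt)
    eps_cond = model(x_t, t, cond)
    return eps_uncond + w * (eps_cond - eps_uncond)
```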
Text/Image Guidance
Conditional Generation with Image Guidance
- Palette: Image-to-Image Diffusion Models, Google Research, arXiv 2022
- Applications: colorization, inpainting, outpainting, etc.
- Input image as condition (via concatenation)
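A minimal sketch of conditioning via concatenation: the condition image is stacked channel-wise with the noisy input before the U-Net (a common reading of the slide; exact details are in the Palette paper):

```python
import torch

def concat_condition(x_t: torch.Tensor, cond_img: torch.Tensor) -> torch.Tensor:
    """Channel-wise concatenation: (B, 3, H, W) + (B, 3, H, W) -> (B, 6, H, W).

    The U-Net's first convolution must be built to accept the extra channels.
    """
    return torch.cat([x_t, cond_img], dim=1)
```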
Conditional Generation with Text Guidance
- GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models, OpenAI, arXiv 2022
- CLIP (Contrastive Language-Image Pretraining) was previously proposed to measure the alignment between text and image inputs
- Classifier guidance -> CLIP guidance (not training-free)
- What is CLIP?
CLIP: Contrastive Language-Image Pretraining
- Learning Transferable Visual Models From Natural Language Supervision, OpenAI, ICML 2021 (w/ 22000+ citations)
- Why not just CNN?
- Requires annotated data for training image classifiers
- Domain gap between closed-world and open-world data
- Lacks the ability to perform zero-shot classification
Objectives
- Cross-domain contrastive learning from large-scale image-language data (see the loss sketch below)
- Next-token prediction (what's this & why?); more on this in the Transformer lecture
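A minimal sketch of CLIP's symmetric contrastive objective over a batch of paired image/text embeddings; the temperature value and variable names are illustrative:

```python
import torch
import torch.nn.functional as F

def clip_loss(img_emb, txt_emb, temperature: float = 0.07):
    """Symmetric InfoNCE-style loss: matched (i, i) pairs are the positives."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature               # (B, B) similarity matrix
    targets = torch.arange(img.size(0), device=img.device)
    loss_i2t = F.cross_entropy(logits, targets)        # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)    # text -> image direction
    return (loss_i2t + loss_t2i) / 2
```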
Zero-shot
- Even if a label was never defined during training, the model can still recognize it at test time
- E.g., the training labels cover ten animal classes such as lions and tigers but not giraffes; at test time, given a photo of a giraffe with "giraffe" among the candidate labels, the model can still identify it
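A minimal sketch of zero-shot classification: class names are turned into text prompts, and the image is assigned to the most similar text embedding. The `encode_image`/`encode_text`/`tokenize` interface mirrors the open-source CLIP package, but treat it as an assumption here:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(model, image, class_names, tokenize):
    """Classify a single image against arbitrary class names, none seen in training."""
    prompts = tokenize([f"a photo of a {c}" for c in class_names])
    img = F.normalize(model.encode_image(image), dim=-1)   # (1, D)
    txt = F.normalize(model.encode_text(prompts), dim=-1)  # (C, D)
    probs = (100.0 * img @ txt.t()).softmax(dim=-1)        # (1, C)
    return class_names[probs.argmax(dim=-1).item()]
```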
Questions for Image Generation
- How to evaluate your unconditional image generation results?
- e.g., FID (Fréchet Inception Distance; see the formula below) between real and generated images
- How to evaluate your conditional image generation results?
- e.g., classification accuracy of generated images w.r.t. the given condition
- Any objective/subjective and quantitative/qualitative evaluation?
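For the quantitative side, a standard objective metric is the Fréchet Inception Distance (FID), computed from Gaussian statistics (μ, Σ) of Inception features of real (r) vs. generated (g) images; this formula is standard background, not from the slides:

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Lower is better; subjective/qualitative evaluation (e.g., user studies) complements it.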
Personalization via Diffusion Model
Diffusion Model for Personalization (1): Textual Inversion
- Proposed by NV Research, ICLR 2023
- Goal: Learn a special token (e.g., S*) to represent the concept of interest
- Learning of special token S*
- Pre-train and fix text encoder & diffusion model (i.e., generator)
- Randomly initialize a token as the text encoder input
- Optimize this token via image reconstruction objectives
A known issue with Textual Inversion: overfitting (a sketch of the token-optimization loop follows).
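A minimal sketch of the loop, reusing the `q_sample` forward-diffusion helper from the DDPM sketch; the `embed_prompt` splicing function and the 768-dim embedding size are illustrative placeholders, not an actual pipeline API:

```python
import torch
import torch.nn.functional as F

# The S* embedding is the ONLY trainable parameter; text encoder & U-Net stay frozen.
new_token_emb = torch.nn.Parameter(torch.randn(768) * 0.01)
optimizer = torch.optim.AdamW([new_token_emb], lr=5e-3)

def textual_inversion_step(text_encoder, unet, embed_prompt, x0, q_sample, T=1000):
    """One step: reconstruct a concept image, updating only S*'s embedding."""
    t = torch.randint(0, T, (x0.size(0),), device=x0.device)
    noise = torch.randn_like(x0)
    x_t = q_sample(x0, t, noise)
    # embed_prompt splices new_token_emb into the embeddings of "a photo of S*"
    cond = text_encoder(embed_prompt(new_token_emb))
    loss = F.mse_loss(unet(x_t, t, cond), noise)
    loss.backward()                      # gradients reach only new_token_emb
    optimizer.step(); optimizer.zero_grad()
    return loss.item()
```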
Diffusion Model for Personalization (2): DreamBooth
- Proposed by Google Research, CVPR 2023
- Finetune the diffusion model w/ a fixed token to represent the image concept
- Determine and fix a rare token (e.g., [V])
- Finetune the diffusion model with image reconstruction objectives
- Enforce a class-specific prior
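In the common noise-prediction restatement, the class-specific prior adds a second reconstruction term on images x' sampled from the original model with the class prompt alone (λ and the exact weighting are from the DreamBooth paper; this form is a standard paraphrase):

```latex
\mathcal{L} =
  \mathbb{E}\big[\, \lVert \epsilon - \epsilon_\theta(x_t,\, t,\, c_{[V]\ \text{class}}) \rVert_2^2 \,\big]
  + \lambda\; \mathbb{E}\big[\, \lVert \epsilon' - \epsilon_\theta(x'_t,\, t,\, c_{\text{class}}) \rVert_2^2 \,\big]
```

The second term keeps the finetuned model from forgetting the general class (e.g., "dog") while it learns the specific instance [V].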
Diffusion Model for Personalization (3): ControlNet
- Proposed by Stanford, ICCV 2023
- Goal: personalization via user-determined conditions
- A trainable branch initialized from the U-Net's encoder
Notations:
- x: input noise of each layer
- y: output noise of each layer
- c: conditions (e.g., edge, pose, sketch, etc.)
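A minimal sketch of ControlNet's zero-convolution wiring: 1×1 convolutions initialized to zero connect the trainable copy, so the control branch contributes nothing at the start of training and the frozen U-Net's behavior is preserved (shapes and names below are illustrative):

```python
import torch
import torch.nn as nn

def zero_conv(channels: int) -> nn.Conv2d:
    """1x1 convolution with weights and bias initialized to zero."""
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

class ControlledBlock(nn.Module):
    """y = F(x) + ZeroConv(F_copy(x + ZeroConv(c))), following the ControlNet wiring."""
    def __init__(self, frozen_block: nn.Module, trainable_copy: nn.Module, channels: int):
        super().__init__()
        self.block = frozen_block            # locked U-Net block
        self.copy = trainable_copy           # trainable copy, initialized from it
        self.zin = zero_conv(channels)       # injects the condition c
        self.zout = zero_conv(channels)      # adds the control signal back to y

    def forward(self, x, c):
        return self.block(x) + self.zout(self.copy(x + self.zin(c)))
```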
Generative Adversarial Network
From VAE to GAN
- Remarks
- We only need the decoder/generator in practice.
- We prefer fast generation.
- How do we know if the output images are sufficiently good?
- the modeled data distribution is not necessarily a Normal distribution
- Example
- TPA3D: Triplane Attention for Fast Text-to-3D Generation, ECCV 2024
- Bin-Shih Wu, Hong-En Chen, Shen-Yu Huang, and Y.-C. Frank Wang
Generative Adversarial Network
- Idea
- Generator: converts a vector z (sampled from a prior p(z), e.g., a Normal distribution) into fake data x (drawn from p_G), where we need p_G to match the real data distribution p_data
- Discriminator: classifies data as real or fake (1/0)
- How? Impose an adversarial loss on the observed data distribution!
- Key idea:
- Impose adversarial loss on data distribution
- Remarks
- A function (the generator) maps the Normal distribution p(z) to the data distribution p_data
- How good are we at mapping p(z) to p_data? Train & ask the discriminator!
- Conduct a two-player min-max game (see next slide for more details)
Once a GAN is trained extremely well, the discriminator can no longer tell real from fake, so its probability of correctly identifying a generator-produced image drops to 50/50.
Training Objective of GAN
Jointly train generator G and discriminator D with a min-max game
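Concretely, the min-max game is the standard GAN objective:

```latex
\min_G \max_D \; V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}}\!\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p(z)}\!\big[\log\!\big(1 - D(G(z))\big)\big]
```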
Train G & D with alternating gradient updates
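A minimal sketch of one alternating update; the generator uses the non-saturating loss discussed under "Potential Problem" below, and D is assumed to output a single logit per sample:

```python
import torch
import torch.nn.functional as F

def gan_training_step(G, D, real, z_dim, opt_g, opt_d):
    """One alternating update: D first (real -> 1, fake -> 0), then G."""
    b = real.size(0)
    ones = torch.ones(b, 1, device=real.device)
    zeros = torch.zeros(b, 1, device=real.device)

    # --- Discriminator update ---
    fake = G(torch.randn(b, z_dim, device=real.device)).detach()  # no grad into G
    loss_d = F.binary_cross_entropy_with_logits(D(real), ones) + \
             F.binary_cross_entropy_with_logits(D(fake), zeros)
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # --- Generator update (non-saturating: minimize -log D(G(z))) ---
    fake = G(torch.randn(b, z_dim, device=real.device))
    loss_g = F.binary_cross_entropy_with_logits(D(fake), ones)
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()
```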
Potential Problem
At the start of training, G is not good yet (obviously); D easily tells apart real/fake data (i.e., D(G(z)) close to 0).
Possible Solution:
- Instead of training G to minimize log(1 - D(G(z))) at the beginning, train G to minimize -log(D(G(z))).
- With stronger gradients for G, we can then start the training of the above min-max game.
Optimality of GAN
Remarks on Optimality of GAN
- Caution!
- G and D are learned models (i.e., DNNs) with fixed architectures; we don't know whether they can actually represent the optimal D & G.
- Optimality of GAN does not tell us anything about convergence to the optimal D/G.
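For reference, the optimality result in question: for a fixed G, the optimal discriminator has the closed form from the original GAN analysis

```latex
D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_G(x)}
```

At the global optimum p_G = p_data, so D*(x) = 1/2 everywhere, matching the 50/50 remark above.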
Conditional GANs
- Remarks
- ICLR 2016
- Conditional generative model p(x|y) instead of p(x)
- Both G and D take the label y as an additional input, i.e., a conditional discriminator is deployed. Why?
- Why not just use D as designed in the standard GAN?
Representation Disentanglement: Conditional GAN
- Goal
- Interpretable deep feature representation
- Disentangle attribute of interest c from the derived latent representation z
- Unsupervised: InfoGAN
- Supervised: AC-GAN
InfoGAN
Unsupervised Disentanglement
No guarantee of disentangling particular semantics (a sketch of the objective follows)
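In its standard form, InfoGAN augments the min-max game with a variational lower bound L_I on the mutual information between the latent code c and the generated sample, weighted by λ:

```latex
\min_{G, Q} \max_D \; V(D, G) - \lambda\, L_I(G, Q),
\qquad L_I(G, Q) \le I\big(c;\, G(z, c)\big)
```

Because nothing ties c to any particular attribute, the disentangled factors emerge without supervision, hence no guarantee on which semantics they capture.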