DLCV - Recurrent Neural Networks & Transformer
Mode Collapse
- Remarks
- The generator only outputs a limited number of image variants regardless of the inputs.
MSGAN
To address the mode collapse issue in conditional GANs
Mode Seeking Generative Adversarial Networks
for Diverse Image Synthesis
With the goal of producing diverse image outputs
Motivation (for unconditional GAN)
Proposed Regularization (for conditional GAN)
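A sketch of the mode seeking regularization term, following the MSGAN paper's formulation; here $d_I$ and $d_z$ are distance metrics in image and latent space, and $\lambda_{ms}$ is assumed to be the weighting hyperparameter:

```latex
% Mode seeking regularization: two latent codes z_1, z_2 under the same condition c
% should map to dissimilar images, penalizing collapse onto a few modes.
\mathcal{L}_{ms} = \max_G \left( \frac{d_I\big(G(c, z_1),\, G(c, z_2)\big)}{d_z(z_1, z_2)} \right)

% Full objective: the original conditional GAN loss plus the weighted regularizer.
\mathcal{L} = \mathcal{L}_{cGAN} + \lambda_{ms}\, \mathcal{L}_{ms}
```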
Adversarial Learning for Transfer Learning
- Left (source domain): labeled; both data and labels are available.
- Right (target domain): data are available, but without labels.
Domain Adaptation in Transfer Learning
- What’s DA?
- Leveraging information from the source to the target domain, so that the same learning task across domains (or particularly in the target domain) can be addressed.
- Typically all the source-domain data are labeled.
- Settings
- Semi-supervised DA: few target-domain data are with labels.
- Unsupervised DA: no label info available in the target-domain.
(shall we address supervised DA?)
- Imbalanced DA: fewer classes of interest in the target domain
- Homogeneous vs. heterogeneous DA
Unsupervised Domain Adaptation
Deep Domain Confusion (DDC)
- Deep Domain Confusion: Maximizing for Domain Invariance
The idea is to compute a "distance" during training that pulls same-class red and blue samples closer; the approach is quite "simple": it mixes the red and blue data together so they are classified jointly.
Both steps can be trained simultaneously.
In the figure, the blue "Labeled Images" at the bottom left are actually the source data (the samples inside the red circle), and the red "Unlabeled Images" at the bottom right are actually the target data (the samples inside the blue circle).
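A minimal sketch of this domain confusion objective, assuming PyTorch and the simplest (linear) form of MMD, i.e. the squared distance between mean source and target features; the function names and the `lambda_mmd` weight are illustrative:

```python
import torch
import torch.nn.functional as F

def linear_mmd(source_feat: torch.Tensor, target_feat: torch.Tensor) -> torch.Tensor:
    """Squared distance between the mean embeddings of source and target batches.

    source_feat, target_feat: (batch, feature_dim) activations from a shared CNN layer.
    """
    return (source_feat.mean(dim=0) - target_feat.mean(dim=0)).pow(2).sum()

def ddc_loss(logits_src, labels_src, feat_src, feat_tgt, lambda_mmd=0.25):
    # Classification loss on labeled source data + domain confusion (MMD) term.
    cls = F.cross_entropy(logits_src, labels_src)
    mmd = linear_mmd(feat_src, feat_tgt)
    return cls + lambda_mmd * mmd
```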
Domain Confusion by Domain-Adversarial Training
- Domain-Adversarial Training of Neural Networks (DANN)
- Y. Ganin et al., ICML 2015
- Maximize domain confusion = maximize domain classification loss
- Minimize source-domain data classification loss
- The derived feature f can be viewed as a disentangled & domain-invariant feature
Both branches can be trained simultaneously.
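A minimal sketch of the gradient reversal layer that makes this simultaneous training possible, assuming PyTorch; `lambd` is the reversal-strength hyperparameter:

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lambd in backward.

    Placed between the feature extractor and the domain classifier: the domain
    classifier minimizes its loss, while the feature extractor (receiving the
    reversed gradient) maximizes it, i.e. maximizes domain confusion.
    """
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

# Usage sketch: domain_logits = domain_classifier(grad_reverse(features, lambd))
```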
Beyond Domain Confusion
- Domain Separation Network (DSN)
- Bousmalis et al., NIPS 2016
- Separate encoders for domain-invariant and domain-specific features
- Private/common features are disentangled from each other.
The orange region captures the foreground and the green region captures the background; the outputs produced by the orange and green regions should be as dissimilar as possible.
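The "as dissimilar as possible" constraint corresponds to the difference (soft orthogonality) loss in DSN, where $\mathbf{H}_c$ and $\mathbf{H}_p$ stack the common (shared) and private feature vectors row-wise:

```latex
% Encourage private and shared encodings to be orthogonal, i.e. disentangled.
\mathcal{L}_{diff} = \left\| \mathbf{H}_c^{\top} \mathbf{H}_p \right\|_F^2
```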
Recurrent Neural Networks
- Parameter sharing + unrolling
- Keeps the number of parameters fixed
- Allows sequential data with varying lengths
- Memory ability
- Capture and preserve information which has been extracted/processed
h: hidden state
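For reference, the standard vanilla RNN recurrence; the same shared weights are applied at every time step (notation is the usual one, not taken from the slides):

```latex
% The weights W_{xh}, W_{hh}, W_{hy} are reused (shared) at every time step t.
h_t = \tanh\!\left(W_{xh} x_t + W_{hh} h_{t-1} + b_h\right), \qquad
y_t = W_{hy} h_t + b_y
```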
Training RNNs: Back Propagation Through Time
- Let’s focus on one training instance.
- The divergence to be computed is between the sequence of outputs by the network and the desired output sequence.
- Generally, this is not just the sum of the divergences at individual times
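A minimal BPTT sketch for one training instance, assuming PyTorch and, for simplicity, a per-step divergence summed over time (the slide notes the general case need not be a plain sum); layer sizes are illustrative:

```python
import torch
import torch.nn as nn

rnn_cell = nn.RNNCell(input_size=8, hidden_size=16)
readout = nn.Linear(16, 4)
criterion = nn.CrossEntropyLoss()

def bptt_step(inputs, targets):
    """inputs: (T, 8) sequence; targets: (T,) long-typed class labels for one instance."""
    h = torch.zeros(1, 16)
    loss = 0.0
    for x_t, y_t in zip(inputs, targets):
        h = rnn_cell(x_t.unsqueeze(0), h)                 # unrolled recurrence, shared weights
        loss = loss + criterion(readout(h), y_t.unsqueeze(0))
    loss.backward()                                       # gradients flow back through time
    return loss
```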
Variants of RNN
- Long Short-term Memory (LSTM) [Hochreiter et al., 1997]
- Additional memory cell
- Input/Forget/Output Gates (see the gate equations after this list)
- Handle gradient vanishing
- Learn long-term dependencies
- Gated Recurrent Unit (GRU) [Cho et al., EMNLP 2014]
- Similar to LSTM
- Handle gradient vanishing & learn long-term dependencies
- No additional memory cell
- Reset / Update Gates
- Fewer parameters than LSTM
- Comparable performance to LSTM [Chung et al., NIPS Workshop 2014]
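For reference, the standard LSTM and GRU update equations in the usual notation ($\odot$ is element-wise multiplication):

```latex
% LSTM: input, forget, output gates plus the additional memory cell c_t.
i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i), \quad
f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f), \quad
o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)
c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c), \qquad
h_t = o_t \odot \tanh(c_t)

% GRU: reset / update gates, no separate memory cell.
r_t = \sigma(W_r x_t + U_r h_{t-1}), \quad
z_t = \sigma(W_z x_t + U_z h_{t-1})
h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tanh\!\big(W_h x_t + U_h (r_t \odot h_{t-1})\big)
```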
Sequence-to-Sequence Modeling
Unsupervised Learning of Video Representations using LSTMs
Multi-task learning:
- The white region on the left is the Encoder.
- The blue region is Decoder #1, which performs data reconstruction (recovery).
- The orange region is Decoder #2, which performs data prediction.
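A minimal sketch of this composite encoder/two-decoder model, assuming PyTorch; module names and dimensions are illustrative rather than taken from the paper's code:

```python
import torch
import torch.nn as nn

class CompositeLSTM(nn.Module):
    """One LSTM encoder shared by a reconstruction decoder and a future-prediction decoder."""
    def __init__(self, dim=128, hidden=256):
        super().__init__()
        self.encoder = nn.LSTM(dim, hidden, batch_first=True)
        self.dec_recon = nn.LSTM(dim, hidden, batch_first=True)  # Decoder #1: reconstruct the input
        self.dec_pred = nn.LSTM(dim, hidden, batch_first=True)   # Decoder #2: predict future frames
        self.out = nn.Linear(hidden, dim)

    def forward(self, past, future_len):
        _, state = self.encoder(past)                  # summarize the input sequence into one state
        B, T, D = past.shape
        recon, _ = self.dec_recon(torch.zeros(B, T, D), state)
        pred, _ = self.dec_pred(torch.zeros(B, future_len, D), state)
        return self.out(recon), self.out(pred)
```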
Learning to generate long-term future via hierarchical prediction
A pair counts as true only when the data and the posture match (first test case); a temporal mismatch also counts as false (third test case).
What’s the Potential Problem in RNN?
- Each hidden state vector extracts/carries information across time steps (some might be diluted downstream).
- Information of the entire input sequence is embedded into a single hidden state vector.
- Outputs at different time steps have particular meanings.
- However, synchrony between input and output sequences is not required.
Solution #1: Attention Model
- What should the attention model be?
- A neural network whose inputs are z and h and whose output is a scalar indicating the similarity between z and h (see the sketch after this list).
- Most attention models are jointly learned with other parts of network (e.g., recognition, etc.)
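A minimal sketch of such an attention scorer, assuming PyTorch; a small MLP maps each pair (z, h_t) to a scalar score, and the scores are normalized over time with a softmax (layer sizes are illustrative):

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """Scores each hidden state h_t against a query z and returns attention weights."""
    def __init__(self, z_dim, h_dim, hidden=64):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(z_dim + h_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, 1),          # scalar similarity between z and h_t
        )

    def forward(self, z, hs):
        # z: (B, z_dim) query; hs: (B, T, h_dim) hidden states across time steps.
        B, T, _ = hs.shape
        z_rep = z.unsqueeze(1).expand(B, T, -1)
        scores = self.score(torch.cat([z_rep, hs], dim=-1)).squeeze(-1)   # (B, T)
        weights = scores.softmax(dim=-1)                                  # attention over time
        context = torch.bmm(weights.unsqueeze(1), hs).squeeze(1)          # weighted sum, (B, h_dim)
        return context, weights
```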
Image Captioning with Attention
- Attention helps image recognition… What else?
- localization / explainable AI
Transformer
Using the surrounding context to decide whether "apple" refers to the fruit you eat or an Apple phone: this mechanism is "attention", also called self-attention.
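A minimal sketch of (single-head, unmasked) scaled dot-product self-attention, assuming PyTorch; this is the core operation the Transformer builds on:

```python
import math
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Each token attends to every other token, so context decides its representation."""
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (B, T, dim) token embeddings, e.g. the sentence containing "apple".
        Q, K, V = self.q(x), self.k(x), self.v(x)
        attn = (Q @ K.transpose(-2, -1)) / math.sqrt(x.size(-1))   # (B, T, T) pairwise similarity
        attn = attn.softmax(dim=-1)
        return attn @ V                                            # context-mixed representations
```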