Machine Learning - Decision Tree Learning

Decision Tree

在訓練過程中決策樹會提出一系列是非問題，從而將 dataset 中的資訊進行分割，分割的原則是始知獲得最大的資訊增益 (Information gain)

Decision Tree 能夠學習 disjunctive expression
每個 internal node (至少有一個 child 的 node) 都代表一個 attribute
每個 branch 樹枝都對應一個 attribute value
每個 leaf node 都分配一個 classification
Decision Tree 本身就形成 hypothesis
- 包含 Disjunction (OR’s) 和 Conjunction (AND’s)
- 從 root 到 leaf 的路徑形成 conjunction attribute
- 每個獨立的 branch 都是 disjunction
Decision Tree 本身就形成 hypothesis
- 樹根到樹葉的每條路徑都形成 attribute 的 conjunction(AND) 限制條件
- 每個獨立的 branch 彼此都是 disjunction(OR)

一種 Decision Tree 的歸納演算法，簡單，但有代表性

在可能的 decision tree space 由 Top-down、Greedy 的方式搜索
沿著 decision tree 向下搜索，為每個節點選擇屬性，永遠不回頭重新考慮之前的選擇，每走一步都最大程度改善優化效果的決定

categorical input and output attribute only

ID3 算法旨在處理分類特徵，無法直接處理連續特徵

用以確定哪個屬性最能對數據進行分類
評估 how well 一個 attribute 分類這筆 data
使用 entropy 來定義 Information Gain

熵是隨機變數不確定性的度量
當所有隨機變數都機率相同時，熵達到最大值
當隨機變數只能為一個值時，熵達到最小值

Entorpy 是每個 information 權重的平均

計算方法:

S 是 training examples 的集合
是 S 中，是 positive example 的部分
是 S 中，是 negative example 的部分

Entropy 用以衡量 S 的不純度 (impurity)

公式

現在回來看 Information Gain

Information Gain 對於 S 集合中的屬性 A，為

Gain 值愈大，表示屬性內資料的凌亂程度愈小，用來分類資料愈佳
Gain 值愈小，表示屬性內資料的凌亂程度愈大，用來分類資料愈差