EECS - RL
Because the instructor strongly recommends not taking this course together with DLCV and ADL.
RL:
The science of decision making
About Reinforcement Learning
Characteristics of Reinforcement Learning
- What makes reinforcement learning different from other machine learning paradigms?
- There is no supervisor, only reward signals
How well you are doing now according to the reward signal does not necessarily indicate how well you will do later.
- Feedback is delayed, not instantaneous
The good feedback you receive now may come from effort made long ago; getting that feedback now does not mean your current actions are good. It is past effort that made the present good.
- Agent’s actions affect the subsequent data it receives
- Time really matters
- Sequential, non i.i.d. data
The Reinforcement Learning Problem
Rewards
- Definition
  - A reward $R_t$ is a scalar feedback signal
  - Indicates how well the agent is doing at time step $t$
  - The agent's job is to maximize cumulative reward (or return)
Reinforcement learning is based on the Reward Hypothesis
Reward Hypothesis
All goals can be described by the maximization of expected cumulative reward
- “all of what we mean by goals and purposes can be well thought of as maximization
of the expected value of the cumulative sum of a received scalar signal (reward).”
In other words, all goals can be described as maximizing expected cumulative reward; reward maximization can account for all goal-directed behavior.
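To make "cumulative reward (or return)" concrete, here is a minimal Python sketch (my own, not from the lecture) that computes a discounted return from a reward sequence; the reward values and the discount factor $\gamma = 0.9$ are illustrative assumptions.

```python
def discounted_return(rewards, gamma=0.9):
    """Compute the (discounted) cumulative reward, i.e., the return of a reward sequence."""
    g = 0.0
    # Accumulate backwards: G_t = R_{t+1} + gamma * G_{t+1}
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Made-up reward sequence: nothing for two steps, then a reward of 1
print(discounted_return([0.0, 0.0, 1.0]))  # 0.81
```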
Examples of Rewards
- Robot manipulation — door closing or cube rotating
- (+) reward for closing the door or correctly rotating the cube
- (-) reward otherwise
- Playing Atari games
- Reward = score
- Playing Go or StarCraft II
- (+) reward for winning
- (-) reward for losing
- RL from human feedback
- (+) reward for producing responses more preferred by humans
- (-) reward for producing responses less preferred by humans
- Managing an investment portfolio
- Reward = profit or loss
Sequential Decision Making
- Goal
- Select actions to maximize total future reward, i.e., reward-to-go
The past has already happened and cannot be changed; the question is how to decide now so that future reward is maximized.
Even when you could choose to be lazy, or do something that pays off immediately, you instead pick the decision that may yield the greatest future reward.
- Properties
- Actions may have long-term consequences
- Reward may be delayed
- It may be better to sacrifice immediate reward to gain more long-term reward (see the toy sketch after the examples below)
- Examples
- A financial investment (may take months to mature)
- Blocking opponent moves (might help winning chances many moves from now)
- Refueling a helicopter (might prevent a crash in several hours)
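As a toy numerical illustration of sacrificing immediate reward for long-term reward, consider two made-up reward sequences; the numbers below are purely hypothetical.

```python
# Take the immediate payoff now vs. pay a small cost now and collect more later.
lazy_now = [5, 0, 0, 0]       # immediate reward, nothing afterwards
invest_now = [-1, 0, 3, 10]   # small sacrifice now, larger reward later

print("total future reward (lazy):  ", sum(lazy_now))     # 5
print("total future reward (invest):", sum(invest_now))   # 12
```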
Agent and Environment
- At each time step $t$, the agent
  - Receives observation $O_t$
  - Receives scalar reward $R_t$
  - Executes action $A_t$
- The environment
  - Receives action $A_t$
  - Emits observation $O_{t+1}$
  - Emits scalar reward $R_{t+1}$
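The interaction loop above can be sketched in a few lines of Python. This is a generic sketch with made-up `ToyEnvironment` and `RandomAgent` classes, not the interface of any particular library.

```python
import random

class ToyEnvironment:
    """A made-up environment: reward +1 whenever action 1 is taken; episode lasts 10 steps."""
    def reset(self):
        self.t = 0
        return 0.0                                 # initial observation O_1

    def step(self, action):                        # receives action A_t
        self.t += 1
        reward = 1.0 if action == 1 else 0.0       # emits scalar reward R_{t+1}
        observation = float(self.t)                # emits observation O_{t+1}
        done = self.t >= 10
        return observation, reward, done

class RandomAgent:
    def act(self, observation, reward):            # receives O_t and R_t
        return random.choice([0, 1])               # executes action A_t

env, agent = ToyEnvironment(), RandomAgent()
observation, reward, done, total = env.reset(), 0.0, False, 0.0
while not done:
    action = agent.act(observation, reward)
    observation, reward, done = env.step(action)
    total += reward
print("episode return:", total)
```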
History and State
- History
  - The history is the sequence of observations, actions, and rewards
    - $H_t = O_1, R_1, A_1, \ldots, A_{t-1}, O_t, R_t$
    - i.e., all observable variables up to time $t$
    - e.g., the sensorimotor system of a robot or embodied agent
  - What happens next depends on the history
    - The agent selects actions
    - The environment emits observations/rewards
State
- State is the information used to determine what happens next
- Formally, state is a function of the history: $S_t = f(H_t)$
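A tiny sketch of "state is a function of the history", $S_t = f(H_t)$: here $f$ simply keeps the last $k$ observations, which is an arbitrary illustrative choice, not the lecture's definition.

```python
def make_state(history, k=4):
    """Map the history (here just a list of observations) to an agent state: the last k entries."""
    return tuple(history[-k:])

history = []
for observation in [0.1, 0.5, 0.2, 0.9, 0.3]:   # made-up observation stream
    history.append(observation)
    state = make_state(history)                  # S_t = f(H_t)
print(state)                                     # (0.5, 0.2, 0.9, 0.3)
```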
Environment State
- The environment state $S_t^e$ is the environment's private representation
  - i.e., whatever data the environment uses to pick the next observation/reward
  - e.g., the position, velocity, and acceleration of each joint of a robot arm and the objects in a tabletop manipulation setup
Agent State
- The agent state $S_t^a$ is the agent's internal representation
  - i.e., whatever information the agent uses to pick the next action
  - i.e., it is the information used by reinforcement learning algorithms
  - It can be any function of the history: $S_t^a = f(H_t)$
Information State
Definition
- A state $S_t$ is Markov if and only if $P[S_{t+1} \mid S_t] = P[S_{t+1} \mid S_1, \ldots, S_t]$
An information state (a.k.a. Markov state) contains all useful information from the history
- “The future is independent of the past given the present”
- Once the state is known, the history may be thrown away
- i.e., the state is a sufficient statistic of the future
- The environment state $S_t^e$ is Markov
- The history $H_t$ is Markov
  - Note: how you design the agent state affects what counts as the correct answer, based on your understanding of the environment
Fully Observable Environments
- Full observability: agent directly observes environment state
- Agent state = environment state = information state
- Formally, this is a Markov decision process (MDP)
- Next lecture and the majority of this course
Partially Observable Environments
- Partial observability: agent indirectly observes environment state
- A robot with camera vision isn’t told its absolute location
- A trading agent only observes current prices
- A poker playing agent only observes public cards and its own cards
- Now agent state ≠ environment state
- Formally, this is a partially observable Markov decision process (POMDP)
- Pronounced as “pom-dp”
- Agent must construct its own state representation $S_t^a$, e.g.,
  - Complete history: $S_t^a = H_t$
  - Beliefs of environment state: $S_t^a = (P[S_t^e = s^1], \ldots, P[S_t^e = s^n])$
  - Learned by a recurrent neural network or Transformer, e.g., $S_t^a = \sigma(S_{t-1}^a W_s + O_t W_o)$
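Below is a minimal NumPy sketch of the recurrent agent-state update mentioned above, $S_t^a = \sigma(S_{t-1}^a W_s + O_t W_o)$; the dimensions, random weights, and the choice of $\tanh$ as $\sigma$ are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, obs_dim = 8, 4
W_s = rng.normal(scale=0.1, size=(state_dim, state_dim))   # recurrent weights
W_o = rng.normal(scale=0.1, size=(obs_dim, state_dim))     # observation weights

def update_state(prev_state, observation):
    """One recurrent step: fold the new observation O_t into the agent state S_t^a."""
    return np.tanh(prev_state @ W_s + observation @ W_o)

state = np.zeros(state_dim)                 # S_0^a
for _ in range(5):                          # a made-up stream of observations
    observation = rng.normal(size=obs_dim)  # O_t
    state = update_state(state, observation)
print(state.shape)                          # (8,)
```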
Inside a Reinforcement Learning Agent
Major Components of an RL Agent
- An RL agent may include one or more of these components
- Policy: agent’s behavior function
- Value function: how good is each state and/or action
- Model: agent’s representation of the environment
- Policy
- A policy is the agent’s behavior, mapping from state to action, e.g.,
- Deterministic policy: $a = \pi(s)$
- Stochastic policy: $\pi(a \mid s) = P[A_t = a \mid S_t = s]$
- Value function
- Value function is a prediction of future reward
- Used to evaluate the goodness/badness of states
- And therefore to select between actions, e.g., $v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s]$
- Model
- A model predicts what the environment will do next
  - Transitions: $\mathcal{P}$ predicts the next state, $\mathcal{P}_{ss'}^a = P[S_{t+1} = s' \mid S_t = s, A_t = a]$
  - Rewards: $\mathcal{R}$ predicts the next (immediate) reward, $\mathcal{R}_s^a = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$
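To see how the three components differ in code, here is a tiny tabular sketch for a made-up 3-state chain (states 0-2, actions 0 = left, 1 = right); all numbers are illustrative, not from the lecture.

```python
# Policy: the agent's behavior, a deterministic mapping state -> action, a = pi(s)
policy = {0: 1, 1: 1, 2: 1}

# Value function: a prediction of future reward for each state, v_pi(s)
value = {0: 0.81, 1: 0.9, 2: 1.0}

# Model: the agent's representation of the environment
# (next-state lookup for P and expected reward for R)
transition = {(0, 0): 0, (0, 1): 1, (1, 0): 0, (1, 1): 2, (2, 0): 1, (2, 1): 2}
reward_model = {(s, a): (1.0 if (s, a) == (1, 1) else 0.0) for s in range(3) for a in range(2)}

def act(state):
    return policy[state]                    # follow the policy

def imagine(state, action):
    """Use the model to predict the next state and reward without touching the environment."""
    return transition[(state, action)], reward_model[(state, action)]

print(act(0), imagine(0, act(0)))           # 1 (1, 0.0)
```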
Maze Example - Policy and Value Function
- The value of each state depends on your policy
Categorizing RL Agents
Problems with Reinforcement Learning
Learning and Planning
- Two fundamental problems in sequential decision making
- Reinforcement learning
- The environment is initially unknown
- The agent interacts with environment
- The agent improves its policy
- Reinforcement learning is like trial-and-error learning
- Planning
- A model of the environment is known
- The agent performs computations with its model (without any external interaction)
- The agent improves its policy
- a.k.a. search, deliberation, reasoning, introspection, pondering, thought
Atari Example - Reinforcement Learning
- Learning to play Atari games with RL
  - Rules of the game are unknown
    - Dynamics (transitions)
    - Reward (game scores)
  - RL agent learns directly from interactive game-play
    - Pick actions on the joystick
    - See pixels (observation) and scores (reward)
Atari Example - Planning
Playing Atari games via planning
- Rules of game are known
- Dynamics (transitions)
- Reward (game scores)
- Can query the emulator
  - i.e., a perfect model inside the agent's brain
- If I take action $a$ from state $s$
  - What would the next state (game screen) be?
  - What would the score be?
- Goal: plan ahead to find an optimal policy
- e.g., tree search
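As a sketch of planning with a perfect model, the snippet below does an exhaustive depth-limited tree search over a made-up deterministic emulator; the model, actions, and search depth are all illustrative assumptions, not the lecture's algorithm.

```python
ACTIONS = [0, 1]   # e.g., joystick left / right

def emulator(state, action):
    """A made-up perfect model: 'if I take action a from state s, what happens next?'"""
    next_state = state + (1 if action == 1 else -1)
    reward = 1.0 if next_state == 3 else 0.0           # 'what would the score be?'
    return next_state, reward

def plan(state, depth):
    """Best achievable total reward from `state` when searching `depth` steps ahead."""
    if depth == 0:
        return 0.0
    best = float("-inf")
    for action in ACTIONS:
        next_state, reward = emulator(state, action)   # query the model, not the real game
        best = max(best, reward + plan(next_state, depth - 1))
    return best

print(plan(0, depth=4))   # 1.0: the search finds the action sequence that reaches state 3
```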
Exploration and Exploitation
- Exploration and exploitation
- Exploration finds more information about the environment
- Exploitation exploits known information to maximize reward
- It is usually important to explore as well as exploit
- Balancing the two is the key
  - Restaurant analogy: instead of ordering a dish you already know is good, you could order something different each time; it might taste better or worse, but you will never know whether an untried dish is amazing unless you try it
- Example
- Restaurant selection
- Go to your favorite restaurant (exploitation) vs. try a new restaurant (exploration)
- Online banner advertisements
- Show the most successful advert (exploitation) vs. show a different advert (exploration)
- Oil drilling
- Drill at the best known location (exploitation) vs. drill at a new location (exploration)
- Game playing
- Play the move you believe is best (exploitation) vs. play an experimental move (exploration)
- Restaurant selection
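One common way to balance the two (a generic technique, not specifically prescribed by the lecture) is epsilon-greedy action selection; the value estimates below are made-up numbers for the restaurant example.

```python
import random

def epsilon_greedy(value_estimates, epsilon=0.1):
    """With probability epsilon explore (random choice); otherwise exploit the best-known choice."""
    if random.random() < epsilon:
        return random.randrange(len(value_estimates))                          # exploration
    return max(range(len(value_estimates)), key=lambda a: value_estimates[a])  # exploitation

# e.g., estimated enjoyment of 3 restaurants: usually go to the favorite (index 2),
# but occasionally try another one and possibly discover something better.
print(epsilon_greedy([0.2, 0.5, 0.9], epsilon=0.1))
```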
Prediction and Control
Reference
- [NTUEE] Reinforcement Learning Lecture 1: Introduction to Reinforcement Learning by Shao-Hua Sun (孫紹華), Assistant Professor in Electrical Engineering, National Taiwan University