EECS - RL
Because the instructor strongly recommends not taking this course together with DLCV and ADL.
RL:
The science of decision making
About Reinforcement Learning
Characteristics of Reinforcement Learning
- What makes reinforcement learning different from other machine learning paradigms?
- There is no supervisor, only reward signals
How well you are doing now according to the reward signal does not necessarily indicate how well you will do later.
- Feedback is delayed, not instantaneous
The good feedback you receive now may come from effort made long ago; getting that feedback now does not mean your current actions are good. It is past effort that made the present good.
- Agent’s actions affect the subsequent data it receives
- Time really matters
- Sequential, non i.i.d. data
The Reinforcement Learning Problem
Rewards
- Definition
  - A reward $R_t$ is a scalar feedback signal
  - Indicates how well the agent is doing at time step $t$
  - The agent's job is to maximize cumulative reward (or return)
Reinforcement learning is based on the Reward Hypothesis
Reward Hypothesis
All goals can be described by the maximization of expected cumulative reward
- “all of what we mean by goals and purposes can be well thought of as maximization
of the expected value of the cumulative sum of a received scalar signal (reward).”
In other words, all goals can be described as maximizing expected cumulative reward; reward maximization can account for all goal-directed behavior.
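To make "cumulative reward (or return)" concrete, here is a minimal Python sketch (my own, not from the lecture) that computes a discounted return from a reward sequence; the reward values and the discount factor $\gamma = 0.9$ are illustrative assumptions.

```python
def discounted_return(rewards, gamma=0.9):
    """Compute the (discounted) cumulative reward, i.e., the return of a reward sequence."""
    g = 0.0
    # Accumulate backwards: G_t = R_{t+1} + gamma * G_{t+1}
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Made-up reward sequence: nothing for two steps, then a reward of 1
print(discounted_return([0.0, 0.0, 1.0]))  # 0.81
```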
Examples of Rewards
- Robot manipulation — door closing or cube rotating
- (+) reward for closing the door or correctly rotating the cube
- (-) reward otherwise
- Playing Atari games
- Reward = score
- Playing Go or StarCraft II
- (+) reward for winning
- (-) reward for losing
- RL from human feedback
- (+) reward for producing responses more preferred by humans
- (-) reward for producing responses less preferred by humans
- Managing an investment portfolio
- Reward = profit or loss
Sequential Decision Making
- Goal
- Select actions to maximize total future reward, i.e., reward-to-go
The past has already happened and cannot be changed; the question is how to decide now so that future reward is maximized.
Even when you could choose to be lazy, or do something that pays off immediately, you instead pick the decision that may yield the greatest future reward.
- Properties
- Actions may have long-term consequences
- Reward may be delayed
- It may be better to sacrifice immediate reward to gain more long-term reward (see the toy sketch after the examples below)
- Examples
- A financial investment (may take months to mature)
- Blocking opponent moves (might help winning chances many moves from now)
- Refueling a helicopter (might prevent a crash in several hours)
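As a toy numerical illustration of sacrificing immediate reward for long-term reward, consider two made-up reward sequences; the numbers below are purely hypothetical.

```python
# Take the immediate payoff now vs. pay a small cost now and collect more later.
lazy_now = [5, 0, 0, 0]       # immediate reward, nothing afterwards
invest_now = [-1, 0, 3, 10]   # small sacrifice now, larger reward later

print("total future reward (lazy):  ", sum(lazy_now))     # 5
print("total future reward (invest):", sum(invest_now))   # 12
```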
Agent and Environment
- At each time step $t$, the agent
  - Receives observation $O_t$
  - Receives scalar reward $R_t$
  - Executes action $A_t$
- The environment
  - Receives action $A_t$
  - Emits observation $O_{t+1}$
  - Emits scalar reward $R_{t+1}$
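The interaction loop above can be sketched in a few lines of Python. This is a generic sketch with made-up `ToyEnvironment` and `RandomAgent` classes, not the interface of any particular library.

```python
import random

class ToyEnvironment:
    """A made-up environment: reward +1 whenever action 1 is taken; episode lasts 10 steps."""
    def reset(self):
        self.t = 0
        return 0.0                                 # initial observation O_1

    def step(self, action):                        # receives action A_t
        self.t += 1
        reward = 1.0 if action == 1 else 0.0       # emits scalar reward R_{t+1}
        observation = float(self.t)                # emits observation O_{t+1}
        done = self.t >= 10
        return observation, reward, done

class RandomAgent:
    def act(self, observation, reward):            # receives O_t and R_t
        return random.choice([0, 1])               # executes action A_t

env, agent = ToyEnvironment(), RandomAgent()
observation, reward, done, total = env.reset(), 0.0, False, 0.0
while not done:
    action = agent.act(observation, reward)
    observation, reward, done = env.step(action)
    total += reward
print("episode return:", total)
```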
History and State
- History
  - The history is the sequence of observations, actions, and rewards
    - $H_t = O_1, R_1, A_1, \ldots, A_{t-1}, O_t, R_t$
    - i.e., all observable variables up to time $t$
    - e.g., the sensorimotor system of a robot or embodied agent
  - What happens next depends on the history
    - The agent selects actions
    - The environment emits observations/rewards
State
- State is the information used to determine what happens next
- Formally, state is a function of the history: $S_t = f(H_t)$
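A tiny sketch of "state is a function of the history", $S_t = f(H_t)$: here $f$ simply keeps the last $k$ observations, which is an arbitrary illustrative choice, not the lecture's definition.

```python
def make_state(history, k=4):
    """Map the history (here just a list of observations) to an agent state: the last k entries."""
    return tuple(history[-k:])

history = []
for observation in [0.1, 0.5, 0.2, 0.9, 0.3]:   # made-up observation stream
    history.append(observation)
    state = make_state(history)                  # S_t = f(H_t)
print(state)                                     # (0.5, 0.2, 0.9, 0.3)
```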
Environment State
- The environment state $S_t^e$ is the environment's private representation
  - i.e., whatever data the environment uses to pick the next observation/reward
  - e.g., the position, velocity, and acceleration of each joint of a robot arm and the objects in a tabletop manipulation setup
Agent State
- The agent state $S_t^a$ is the agent's internal representation
  - i.e., whatever information the agent uses to pick the next action
  - i.e., it is the information used by reinforcement learning algorithms
  - It can be any function of the history: $S_t^a = f(H_t)$
Information State
Definition
- A state $S_t$ is Markov if and only if $P[S_{t+1} \mid S_t] = P[S_{t+1} \mid S_1, \ldots, S_t]$
An information state (a.k.a. Markov state) contains all useful information from the history
- “The future is independent of the past given the present”
- Once the state is known, the history may be thrown away
- i.e., the state is a sufficient statistic of the future
- The environment state $S_t^e$ is Markov
- The history $H_t$ is Markov
  - Note: how you design the agent state affects what counts as the correct answer, based on your understanding of the environment
Fully Observable Environments
- Full observability: agent directly observes environment state
- Agent state = environment state = information state
- Formally, this is a Markov decision process (MDP)
- Next lecture and the majority of this course
Partially Observable Environments
- Partial observability: agent indirectly observes environment state
- A robot with camera vision isn’t told its absolute location
- A trading agent only observes current prices
- A poker playing agent only observes public cards and its own cards
- Now agent state ≠ environment state
- Formally, this is a partially observable Markov decision process (POMDP)
- Pronounced as “pom-dp”
- Agent must construct its own state representation $S_t^a$, e.g.,
  - Complete history: $S_t^a = H_t$
  - Beliefs of environment state: $S_t^a = (P[S_t^e = s^1], \ldots, P[S_t^e = s^n])$
  - Learned by a recurrent neural network or Transformer, e.g., $S_t^a = \sigma(S_{t-1}^a W_s + O_t W_o)$
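Below is a minimal NumPy sketch of the recurrent agent-state update mentioned above, $S_t^a = \sigma(S_{t-1}^a W_s + O_t W_o)$; the dimensions, random weights, and the choice of $\tanh$ as $\sigma$ are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, obs_dim = 8, 4
W_s = rng.normal(scale=0.1, size=(state_dim, state_dim))   # recurrent weights
W_o = rng.normal(scale=0.1, size=(obs_dim, state_dim))     # observation weights

def update_state(prev_state, observation):
    """One recurrent step: fold the new observation O_t into the agent state S_t^a."""
    return np.tanh(prev_state @ W_s + observation @ W_o)

state = np.zeros(state_dim)                 # S_0^a
for _ in range(5):                          # a made-up stream of observations
    observation = rng.normal(size=obs_dim)  # O_t
    state = update_state(state, observation)
print(state.shape)                          # (8,)
```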
Inside a Reinforcement Learning Agent
Major Components of an RL Agent
- An RL agent may include one or more of these components
- Policy: agent’s behavior function
- Value function: how good is each state and/or action
- Model: agent’s representation of the environment
- Policy
- A policy is the agent’s behavior, mapping from state to action, e.g.,
- Deterministic policy: $a = \pi(s)$
- Stochastic policy: $\pi(a \mid s) = P[A_t = a \mid S_t = s]$
- Value function
- Value function is a prediction of future reward
- Used to evaluate the goodness/badness of states
- And therefore to select between actions, e.g., $v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s]$
- Model
- A model predicts what the environment will do next
  - Transitions: $\mathcal{P}$ predicts the next state, $\mathcal{P}_{ss'}^a = P[S_{t+1} = s' \mid S_t = s, A_t = a]$
  - Rewards: $\mathcal{R}$ predicts the next (immediate) reward, $\mathcal{R}_s^a = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$
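To see how the three components differ in code, here is a tiny tabular sketch for a made-up 3-state chain (states 0-2, actions 0 = left, 1 = right); all numbers are illustrative, not from the lecture.

```python
# Policy: the agent's behavior, a deterministic mapping state -> action, a = pi(s)
policy = {0: 1, 1: 1, 2: 1}

# Value function: a prediction of future reward for each state, v_pi(s)
value = {0: 0.81, 1: 0.9, 2: 1.0}

# Model: the agent's representation of the environment
# (next-state lookup for P and expected reward for R)
transition = {(0, 0): 0, (0, 1): 1, (1, 0): 0, (1, 1): 2, (2, 0): 1, (2, 1): 2}
reward_model = {(s, a): (1.0 if (s, a) == (1, 1) else 0.0) for s in range(3) for a in range(2)}

def act(state):
    return policy[state]                    # follow the policy

def imagine(state, action):
    """Use the model to predict the next state and reward without touching the environment."""
    return transition[(state, action)], reward_model[(state, action)]

print(act(0), imagine(0, act(0)))           # 1 (1, 0.0)
```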
Maze Example - Policy and Value Function
- The value of each state depends on your policy
Categorizing RL Agents
Problems with Reinforcement Learning
Learning and Planning
- Two fundamental problems in sequential decision making
- Reinforcement learning
- The environment is initially unknown
- The agent interacts with environment
- The agent improves its policy
- Reinforcement learning is like trial-and-error learning
- Planning
- A model of the environment is known
- The agent performs computations with its model (without any external interaction)
- The agent improves its policy
- a.k.a. search, deliberation, reasoning, introspection, pondering, thought
Atari Example - Reinforcement Learning
- Learning to play Atari games with RL
  - Rules of the game are unknown
    - Dynamics (transitions)
    - Reward (game scores)
  - RL agent learns directly from interactive game-play
    - Pick actions on the joystick
    - See pixels (observation) and scores (reward)
Atari Example - Planning
Playing Atari games via planning
- Rules of game are known
- Dynamics (transitions)
- Reward (game scores)
- Can query the emulator
  - i.e., a perfect model inside the agent's brain
- If I take action $a$ from state $s$
  - What would the next state (game screen) be?
  - What would the score be?
- Goal: plan ahead to find an optimal policy
- e.g., tree search
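As a sketch of planning with a perfect model, the snippet below does an exhaustive depth-limited tree search over a made-up deterministic emulator; the model, actions, and search depth are all illustrative assumptions, not the lecture's algorithm.

```python
ACTIONS = [0, 1]   # e.g., joystick left / right

def emulator(state, action):
    """A made-up perfect model: 'if I take action a from state s, what happens next?'"""
    next_state = state + (1 if action == 1 else -1)
    reward = 1.0 if next_state == 3 else 0.0           # 'what would the score be?'
    return next_state, reward

def plan(state, depth):
    """Best achievable total reward from `state` when searching `depth` steps ahead."""
    if depth == 0:
        return 0.0
    best = float("-inf")
    for action in ACTIONS:
        next_state, reward = emulator(state, action)   # query the model, not the real game
        best = max(best, reward + plan(next_state, depth - 1))
    return best

print(plan(0, depth=4))   # 1.0: the search finds the action sequence that reaches state 3
```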
Exploration and Exploitation
- Exploration and exploitation
- Exploration finds more information about the environment
- Exploitation exploits known information to maximize reward
- It is usually important to explore as well as exploit
- Balancing the two is the key
  - Restaurant analogy: instead of ordering a dish you already know is good, you could order something different each time; it might taste better or worse, but you will never know whether an untried dish is amazing unless you try it
- Example
- Restaurant selection
- Go to your favorite restaurant (exploitation) vs. try a new restaurant (exploration)
- Online banner advertisements
- Show the most successful advert (exploitation) vs. show a different advert (exploration)
- Oil drilling
- Drill at the best known location (exploitation) vs. drill at a new location (exploration)
- Game playing
- Play the move you believe is best (exploitation) vs. play an experimental move (exploration)
- Restaurant selection
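One common way to balance the two (a generic technique, not specifically prescribed by the lecture) is epsilon-greedy action selection; the value estimates below are made-up numbers for the restaurant example.

```python
import random

def epsilon_greedy(value_estimates, epsilon=0.1):
    """With probability epsilon explore (random choice); otherwise exploit the best-known choice."""
    if random.random() < epsilon:
        return random.randrange(len(value_estimates))                          # exploration
    return max(range(len(value_estimates)), key=lambda a: value_estimates[a])  # exploitation

# e.g., estimated enjoyment of 3 restaurants: usually go to the favorite (index 2),
# but occasionally try another one and possibly discover something better.
print(epsilon_greedy([0.2, 0.5, 0.9], epsilon=0.1))
```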
Prediction and Control
Reference
- [NTUEE] Reinforcement Learning Lecture 1: Introduction to Reinforcement Learning by Shao-Hua Sun (孫紹華), Assistant Professor in Electrical Engineering, National Taiwan University