EECS - RL

LAVI

The instructor strongly recommends not taking this course together with DLCV and ADL.

RL:
The science of decision making

About Reinforcement Learning

Characteristics of Reinforcement Learning

  • What makes reinforcement learning different from other machine learning paradigms?
    • There is no supervisor, only reward signals

      Reward signals only tell you how well you are doing right now; doing well now does not necessarily mean you will do well later.

    • Feedback is delayed, not instantaneous

      Good feedback right now may come from effort made long ago; receiving the feedback now does not mean what you are doing now is good. Effort made in the past can be what makes the present good.

    • Agent’s actions affect the subsequent data it receives
    • Time really matters
      • Sequential, non i.i.d. data

The Reinforcement Learning Problem

Rewards

  • Definition
    • A reward R_t is a scalar feedback signal
    • It indicates how well the agent is doing at time step t
    • The agent’s job is to maximize cumulative reward (or return)

      Reinforcement learning is based on the Reward Hypothesis

Reward Hypothesis
All goals can be described by the maximization of expected cumulative reward

  • “all of what we mean by goals and purposes can be well thought of as maximization
    of the expected value of the cumulative sum of a received scalar signal (reward).”
    All goals can be described by maximizing expected cumulative reward;
    i.e., all goal-directed behavior can be framed as reward maximization (a small return-computation sketch follows below).
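
As a hedged aside (not from the lecture), the cumulative reward is typically a discounted return G_t = R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + ...; the helper below and γ = 0.99 are illustrative assumptions.

```python
# Minimal sketch: computing a discounted return G_t = R_{t+1} + gamma*R_{t+2} + ...
# The reward sequence and gamma = 0.99 are illustrative assumptions, not from the lecture.

def discounted_return(rewards, gamma=0.99):
    """Sum of future rewards, each discounted by gamma per step."""
    g = 0.0
    for r in reversed(rewards):   # work backwards: G_t = R_{t+1} + gamma * G_{t+1}
        g = r + gamma * g
    return g

# A reward received only at the final step still contributes to the return, just discounted.
print(discounted_return([0.0, 0.0, 1.0]))  # 0.99**2 = 0.9801...
```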

Examples of Rewards

  • Robot manipulation — door closing or cube rotating
    • (+) reward for closing the door or correctly rotating the cube
    • (-) reward otherwise
  • Playing Atari games
    • Reward = score
  • Playing Go or StarCraft II
    • (+) reward for winning
    • (-) reward for losing
  • RL from human feedback
    • (+) reward for producing responses more preferred by humans
    • (-) reward for producing responses less preferred by humans
  • Managing an investment portfolio
    • Reward = profit or loss

Sequential Decision Making

  • Goal
    • Select actions to maximize total future reward, i.e., reward-to-go

      The past has already happened and cannot be changed; the question is how to act now so that future reward is maximized.
      Even when you could be lazy now, or do something that pays off immediately, you choose the action that may lead to the greatest future reward.

  • Properties
    • Actions may have long-term consequences
    • Reward may be delayed
    • It may be better to sacrifice immediate reward to gain more long-term reward
  • Examples
    • A financial investment (may take months to mature)
    • Blocking opponent moves (might help winning chances many moves from now)
    • Refueling a helicopter (might prevent a crash in several hours)

Agent and Environment

At each time step t
  • The agent
    Receives observation O_t
    Receives scalar reward R_t
    Executes action A_t

  • The environment
    Receives action A_t
    Emits observation O_{t+1}
    Emits scalar reward R_{t+1}

The time index t increments at each environment step (a minimal sketch of this interaction loop follows below).
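
A minimal sketch of this loop, assuming a made-up `DummyEnv` and `RandomAgent` (both hypothetical, for illustration only):

```python
# Sketch of the agent-environment loop: the agent receives an observation and a
# scalar reward and executes an action; the environment receives the action and
# emits the next observation and reward. DummyEnv and RandomAgent are made up.
import random

class DummyEnv:
    def step(self, action):
        observation = random.random()          # next observation O_{t+1}
        reward = 1.0 if action == 1 else 0.0   # next scalar reward R_{t+1}
        return observation, reward

class RandomAgent:
    def act(self, observation, reward):
        return random.choice([0, 1])           # action A_t

env, agent = DummyEnv(), RandomAgent()
observation, reward = 0.0, 0.0
for t in range(5):                             # t increments at each environment step
    action = agent.act(observation, reward)
    observation, reward = env.step(action)
```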

History and State

  • History
    • The history H_t is the sequence of observations, actions, and rewards
      H_t = O_1, R_1, A_1, ..., A_{t-1}, O_t, R_t
    • i.e., all observable variables up to time t

    • e.g., the sensorimotor system of a robot or embodied agent
  • What happens next depends on the history

    • The agent selects actions
    • The environment emits observations/rewards
  • State

    • State is the information used to determine what happens next
    • Formally, the state is a function of the history: S_t = f(H_t) (a small sketch follows below)
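
A tiny sketch of "state is a function of the history": one simple (often lossy) choice of f keeps only the most recent observation. The history format below is an assumption for illustration.

```python
# Sketch: S_t = f(H_t). Here the history is a list of (observation, action, reward)
# tuples, and f simply keeps the latest observation; richer choices of f are possible.

def state_from_history(history):
    """One possible f(H_t): use only the most recent observation as the state."""
    if not history:
        return None
    last_observation, _, _ = history[-1]
    return last_observation

history = [(0.3, 1, 0.0), (0.7, 0, 1.0)]
print(state_from_history(history))  # -> 0.7
```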

Environment State

  • The environment state S_t^e is the environment’s private representation
  • i.e., whatever data the environment uses to pick the next observation/reward
  • e.g., the position, velocity, acceleration of each joint of a robot arm and objects in a tabletop manipulation setup

Agent State

  • The agent state S_t^a is the agent’s internal representation
    • i.e., whatever information the agent uses to pick the next action
    • i.e., it is the information used by reinforcement learning algorithms
    • It can be any function of history: S_t^a = f(H_t)

Information State

  • Definition

    • A state S_t is Markov if and only if P[S_{t+1} | S_t] = P[S_{t+1} | S_1, ..., S_t]
  • An information state (a.k.a. Markov state) contains all useful information from the history

    • “The future is independent of the past given the present”
    • Once the state is known, the history may be thrown away
    • i.e., the state is a sufficient statistic of the future
    • The environment state S_t^e is Markov
    • The history H_t is Markov

      How you design the state representation affects what the right answer is; it depends on your understanding of the environment (a small Markov vs. non-Markov example follows below).
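
A hedged illustration of the Markov property (the dynamics below are made up): for an object moving at constant velocity, (position, velocity) is a Markov state, since the next position depends only on it, while position alone is not, because you would need past positions (the history) to recover the velocity.

```python
# Made-up dynamics for illustration: with state = (position, velocity), the next
# position is fully determined by the current state, so that state is Markov.
# With state = position only, the same position can be followed by different next
# positions depending on past history, so that state is not Markov.

def next_position(position, velocity, dt=1.0):
    return position + velocity * dt   # depends only on the current (Markov) state

print(next_position(position=0.0, velocity=2.0))  # -> 2.0
```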

Fully Observable Environments

  • Full observability: agent directly observes environment state
  • Agent state = environment state = information state
  • Formally, this is a Markov decision process (MDP)
  • Next lecture and the majority of this course

Partially Observable Environments

  • Partial observability: agent indirectly observes environment state
    • A robot with camera vision isn’t told its absolute location
    • A trading agent only observes current prices
    • A poker playing agent only observes public cards and its own cards
  • Now agent state ≠ environment state
    • Formally, this is a partially observable Markov decision process (POMDP)
    • Pronounced as “pom-dp”
  • Agent must construct its own state representation S_t^a, e.g.,
    • Complete history: S_t^a = H_t
    • Beliefs of environment state: S_t^a = (P[S_t^e = s^1], ..., P[S_t^e = s^n])
    • Learned by a recurrent neural network or Transformer, e.g., S_t^a = f(S_{t-1}^a, O_t) (see the sketch below)
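
A minimal sketch of the recurrent option, summarizing the observation stream into an agent state S_t^a = f(S_{t-1}^a, O_t); the use of `torch.nn.GRUCell` and the dimensions are assumptions, not the lecture's implementation.

```python
# Sketch: build the agent state recurrently from observations under partial
# observability, S^a_t = f(S^a_{t-1}, O_t). GRUCell and the sizes are assumptions.
import torch
import torch.nn as nn

obs_dim, state_dim = 8, 32
rnn = nn.GRUCell(obs_dim, state_dim)

agent_state = torch.zeros(1, state_dim)          # S^a_0
for t in range(10):
    observation = torch.randn(1, obs_dim)        # O_t (random placeholder)
    agent_state = rnn(observation, agent_state)  # S^a_t = f(S^a_{t-1}, O_t)
```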

Inside a Reinforcement Learning Agent

Major Components of an RL Agent

  • An RL agent may include one or more of these components
    • Policy: agent’s behavior function
    • Value function: how good is each state and/or action
    • Model: agent’s representation of the environment
  • Policy
    • A policy is the agent’s behavior, mapping from state to action, e.g.,
    • Deterministic policy: a = π(s)
    • Stochastic policy: π(a|s) = P[A_t = a | S_t = s]
  • Value function
    • Value function is a prediction of future reward
    • Used to evaluate the goodness/badness of states
    • And therefore to select between actions, e.g.,
      v_π(s) = E_π[R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + ... | S_t = s]
  • Model
    • A model predicts what the environment will do next (see the sketch after this list)
    • P predicts the next state: P_{ss'}^a = P[S_{t+1} = s' | S_t = s, A_t = a]
    • R predicts the next (immediate) reward: R_s^a = E[R_{t+1} | S_t = s, A_t = a]
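
A toy sketch of the three components on a made-up two-state problem; the dictionaries, probabilities, and numbers are all illustrative assumptions.

```python
# Toy sketch of the three components on a made-up two-state problem.
import random

# Policy: deterministic a = pi(s), or stochastic pi(a|s).
deterministic_policy = {"s0": "right", "s1": "left"}

def stochastic_policy(state):
    return random.choices(["left", "right"], weights=[0.2, 0.8])[0]

# Value function: predicted future reward from each state (numbers are made up).
value = {"s0": 0.5, "s1": 1.0}

# Model: predicted next state and predicted next (immediate) reward (also made up).
next_state = {("s0", "right"): "s1", ("s1", "left"): "s0"}
next_reward = {("s0", "right"): 0.0, ("s1", "left"): 1.0}

print(deterministic_policy["s0"], stochastic_policy("s0"), value["s1"])
```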

Maze Example - Policy and Value Function

The value of each state depends on your policy.

Categorizing RL Agents

Problems with Reinforcement Learning

Learning and Planning

  • Two fundamental problems in sequential decision making
  • Reinforcement learning
    • The environment is initially unknown
    • The agent interacts with environment
    • The agent improves its policy
    • Reinforcement learning is like trial-and-error learning
  • Planning
    • A model of the environment is known
    • The agent performs computations with its model (without any external interaction)
    • The agent improves its policy
    • a.k.a. search, deliberation, reasoning, introspection, pondering, thought

Atari Example - Reinforcement Learning

Learning to play Atari games with RL

  • Rules of the game are unknown
    • Dynamics (transitions)
    • Reward (game scores)
  • RL agent learns directly from interactive game-play
    • Pick actions on joystick
    • See pixels (observation) and scores (reward)

Atari Example - Planning

Playing Atari games via planning

  • Rules of game are known
    • Dynamics (transitions)
    • Reward (game scores)
  • Can query the emulator
    • i.e., a perfect model inside the agent’s brain
  • If I take action a from state s
    • What would the next state (game screen) be?
    • What would the score be?
  • Goal: plan ahead to find an optimal policy
    • e.g., tree search
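
A hedged sketch of planning with a perfect model, using a one-step lookahead instead of a full tree search; `model`, `value`, and the toy states/actions are hypothetical placeholders.

```python
# Sketch: planning with a known (perfect) model, no external interaction.
# Query the model for each action, score the predicted outcome, pick the best.
# A real planner would look ahead more than one step (e.g., tree search).

def plan_one_step(state, actions, model, value, gamma=0.99):
    best_action, best_score = None, float("-inf")
    for action in actions:
        next_state, reward = model(state, action)    # perfect emulator query
        score = reward + gamma * value(next_state)   # one-step lookahead estimate
        if score > best_score:
            best_action, best_score = action, score
    return best_action

# Hypothetical usage with a toy model and value estimate.
toy_model = lambda s, a: (s + a, 1.0 if s + a == 3 else 0.0)
toy_value = lambda s: float(s)
print(plan_one_step(state=1, actions=[1, 2], model=toy_model, value=toy_value))  # -> 2
```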

Exploration and Exploitation

  • Exploration and exploitation
    • Exploration finds more information about the environment
    • Exploitation exploits known information to maximize reward
    • It is usually important to explore as well as exploit
      • Balancing is the key

        Rather than always ordering the dish you already know is good at a restaurant, ordering something different each time might turn out better or worse, but you will never find out whether an untried dish is amazing unless you try it.

  • Example
    • Restaurant selection
      • Go to your favorite restaurant (exploitation) vs. try a new restaurant (exploration)
    • Online banner advertisements
      • Show the most successful advert (exploitation) vs. show a different advert (exploration)
    • Oil drilling
      • Drill at the best known location (exploitation) vs. drill at a new location (exploration)
    • Game playing
      • Play the move you believe is best (exploitation) vs. play an experimental move (exploration)
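
One common way to balance the two is ε-greedy action selection, sketched below; the value estimates `q_values` and ε = 0.1 are assumptions for illustration.

```python
# Sketch: epsilon-greedy. With probability epsilon take a random action (explore);
# otherwise take the action with the highest estimated value (exploit).
import random

def epsilon_greedy(q_values, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(list(q_values))   # explore: try something else
    return max(q_values, key=q_values.get)     # exploit: best known action

# Made-up value estimates for the restaurant example above.
print(epsilon_greedy({"favorite_restaurant": 1.2, "new_restaurant": 0.7}))
```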

Prediction and Control

Reference

  • [NTUEE] Reinforcement Learning Lecture 1: Introduction to Reinforcement Learning by Shao-Hua Sun (孫紹華), Assistant Professor in Electrical Engineering, National Taiwan University