Author: David Silver
He was awarded the 2019 ACM Prize in Computing for breakthrough advances in computer game-playing.
Outline
- Introduction
- Incremental Methods
- Batch Methods
Large-Scale Reinforcement Learning
Reinforcement learning can be used to solve large problems, e.g.
- Backgammon: 10^20 states
- Computer Go: 10^170 states
- Helicopter: continuous state space
How can we scale up the model-free methods for prediction and control from the last two lectures?
Value Function Approximation
Types of Value Function Approximation
Which Function Approximator?
Gradient Descent
Value Function Approx. By Stochastic Gradient Descent
Feature Vectors
Linear Value Function Approximation
Table Lookup Features
Incremental Prediction Algorithms
Monte-Carlo with Value Function Approximation
TD Learning with Value Function Approximation
TD(λ) with Value Function Approximation
Control with Value Function Approximation
Action-Value Function Approximation
Linear Action-Value Function Approximation
Incremental Control Algorithms
Linear Sarsa with Coarse Coding in Mountain Car
Linear Sarsa with Radial Basis Functions in Mountain Car
Study of λ: Should We Bootstrap?
Baird’s Counterexample
Parameter Divergence in Baird’s Counterexample
Convergence of Prediction Algorithms
Gradient Temporal-Difference Learning
Convergence of Control Algorithms
Batch Reinforcement Learning
- Gradient descent is simple and appealing
- But it is not sample efficient
- Batch methods seek to find the best-fitting value function, given the agent’s experience (“training data”)
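The batch idea above can be sketched as stochastic gradient descent that repeatedly samples from a stored dataset instead of discarding each transition after one update. This is a minimal sketch, not the lecture's code: it assumes a linear value function, and the function name `replay_sgd`, the step size, the update count, and the toy `experience` list of (feature vector, target value) pairs are all hypothetical choices made for illustration.

```python
import random


def replay_sgd(experience, n_features, alpha=0.05, n_updates=5000, seed=0):
    """Fit linear weights w so that dot(x, w) approximates each stored target v."""
    rng = random.Random(seed)
    w = [0.0] * n_features
    for _ in range(n_updates):
        # Sample one stored (features, target) pair uniformly from the batch.
        x, v = rng.choice(experience)
        prediction = sum(xi * wi for xi, wi in zip(x, w))
        error = v - prediction
        # SGD step on the squared error: w <- w + alpha * error * x
        w = [wi + alpha * error * xi for wi, xi in zip(w, x)]
    return w


# Hypothetical experience: targets generated exactly by the weights [2.0, -1.0].
experience = [
    ([1.0, 0.0], 2.0),
    ([0.0, 1.0], -1.0),
    ([1.0, 1.0], 1.0),
    ([1.0, 2.0], 0.0),
]
w = replay_sgd(experience, n_features=2)
```

Because the same four samples are reused thousands of times, the weights settle on the least squares fit to the batch. In this toy case the targets are exactly linear in the features, so the fit recovers the generating weights.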
Least Squares Prediction
Stochastic Gradient Descent with Experience Replay
Experience Replay in Deep Q-Networks (DQN)
DQN in Atari
DQN Results in Atari
How much does DQN help?
Linear Least Squares Prediction
- Experience replay finds the least squares solution
- But it may take many iterations
- Using linear value function approximation, we can solve for the least squares solution directly
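With linear features, the least squares weights satisfy the normal equations (sum of x xᵀ) w = (sum of x v), so they can be computed in one solve rather than by iterating. The sketch below illustrates this under assumptions: the targets v are taken as given (for example Monte-Carlo returns), the helper names `solve_linear_system` and `least_squares_weights` and the toy `experience` data are hypothetical, and a small Gaussian elimination stands in for a library solver to keep the example dependency-free.

```python
def solve_linear_system(a, b):
    """Solve a w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy [a | b] so the inputs are left untouched.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: features are linearly dependent")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    w = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (m[i][n] - tail) / m[i][i]
    return w


def least_squares_weights(experience, n_features):
    """Accumulate A = sum x x^T and b = sum x v, then solve A w = b."""
    a = [[0.0] * n_features for _ in range(n_features)]
    b = [0.0] * n_features
    for x, v in experience:
        for i in range(n_features):
            b[i] += x[i] * v
            for j in range(n_features):
                a[i][j] += x[i] * x[j]
    return solve_linear_system(a, b)


# Hypothetical experience: targets generated exactly by the weights [2.0, -1.0].
experience = [
    ([1.0, 0.0], 2.0),
    ([0.0, 1.0], -1.0),
    ([1.0, 1.0], 1.0),
    ([1.0, 2.0], 0.0),
]
w = least_squares_weights(experience, n_features=2)
```

The cost is cubic in the number of features for the solve, but no step size or iteration count needs tuning. That is the trade-off against experience replay.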
Linear Least Squares Prediction (2)
Linear Least Squares Prediction Algorithms
Linear Least Squares Prediction Algorithms (2)
Convergence of Linear Least Squares Prediction Algorithms
Least Squares Policy Iteration
Least Squares Action-Value Function Approximation
Least Squares Control
Least Squares Q-Learning
Least Squares Policy Iteration Algorithm
Convergence of Control Algorithms
Chain Walk Example
LSPI in Chain Walk: Action-Value Function
LSPI in Chain Walk: Policy
Questions?
Reference: “UCL Course on RL”