Lecture 10: Model-based Planning


Author: Ysgc | Published 2020-01-26 14:39

Question: Why is this a bad idea?
Answer: We don't gain information at each step.

In theory, any optimization method can be used here, but for this particular model-based RL case, some work better than others.

E.g., first-order gradient descent is not a good idea.

For now -> derivative-free methods

easy to code and parallelize
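Not from the lecture, just an illustrative sketch of the simplest derivative-free planner (random shooting): sample whole action sequences, score each under the model, keep the best. The dynamics `f` and reward `r` below are made-up toy stand-ins for a learned model.

```python
import numpy as np

# toy deterministic dynamics model and reward (stand-ins for a learned model)
def f(s, a):
    return s + 0.1 * a                            # state drifts toward the action

def r(s, a):
    return -np.sum(s**2) - 0.01 * np.sum(a**2)    # stay near the origin, cheaply

def random_shooting(s0, horizon=10, n_samples=1000, action_dim=2, seed=0):
    """Sample whole action sequences, evaluate each one under the model,
    and return the best sequence -- no gradients needed."""
    rng = np.random.default_rng(seed)
    best_return, best_seq = -np.inf, None
    for _ in range(n_samples):
        seq = rng.uniform(-1.0, 1.0, size=(horizon, action_dim))
        s, total = np.asarray(s0, dtype=float).copy(), 0.0
        for a in seq:
            total += r(s, a)
            s = f(s, a)
        if total > best_return:
            best_return, best_seq = total, seq
    return best_seq, best_return

best_seq, best_return = random_shooting(np.ones(2))
```

Each candidate sequence is evaluated independently, which is why this is trivial to parallelize.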

http://web.mit.edu/6.454/www/www_fall_2003/gew/CEtutorial.pdf
??? (the linked tutorial looks like a different thing from the CEM here)

here -> low variance, compared with vanilla PG -> CEM only cares about the ranking of the samples, not their numerical values

works for small dimensions (64 or fewer) and a low number of time steps

MCTS -> game planning -> handles stochasticity very well

the number of time steps can be very large

search to a certain depth (say, 3 here), and then just play the game randomly to the end

idea: if a random policy from that state has a better outcome -> the state has a higher value

Question: Can we use a better policy to replace the random policy?
Answer: Yes, e.g., a policy from a NN. Actually, MCTS can be improved in many ways.

a popular choice

MCTS with a better action policy -> better estimates of the value
-> in practice, the random policy is preferred, probably because of its simplicity
-> also, for a small problem, the random policy is not bad
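A simplified sketch of the "search to a fixed depth, then random playout" idea above (not full MCTS: no tree statistics or UCB, and the toy chain environment is deterministic and made up here).

```python
import random

def rollout_value(env_step, state, policy, max_steps=20, rng=None):
    """Estimate a state's value by playing `policy` to the end of the game."""
    rng = rng or random.Random(0)
    total = 0.0
    for _ in range(max_steps):
        a = policy(state, rng)
        state, reward, done = env_step(state, a)
        total += reward
        if done:
            break
    return total

def plan(env_step, actions, state, search_depth=3, n_rollouts=10, seed=0):
    """Exhaustively search to `search_depth`, then estimate leaf values
    with random playouts, as in the notes above."""
    rng = random.Random(seed)
    random_policy = lambda s, r: r.choice(actions)

    def value(s, depth):
        if depth == 0:   # leaf: average several random playouts
            return sum(rollout_value(env_step, s, random_policy, rng=rng)
                       for _ in range(n_rollouts)) / n_rollouts
        best = float("-inf")
        for a in actions:
            s2, reward, done = env_step(s, a)
            best = max(best, reward if done else reward + value(s2, depth - 1))
        return best

    def score(a):
        s2, reward, done = env_step(state, a)
        return reward if done else reward + value(s2, search_depth - 1)

    return max(actions, key=score)

# toy chain environment: reach position 0 for reward 1, then the game ends
def env_step(s, a):
    s2 = s + a
    return s2, (1.0 if s2 == 0 else 0.0), s2 == 0

best_action = plan(env_step, actions=[-1, +1], state=2)
```

From position 2, stepping toward 0 is found by the depth-3 search; the random playouts only matter at the leaves, which is why a simple random policy is often good enough.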

Question: What about continuous action spaces?
Answer: an infinite number of actions -> will discuss later
(Bayesian optimization?)

Additional reading

  1. Browne, Powley, Whitehouse, Lucas, Cowling, Rohlfshagen, Tavener,
    Perez, Samothrakis, Colton. (2012). A Survey of Monte Carlo Tree
    Search Methods.
    • Survey of MCTS methods and basic summary

a paper from six years ago

changing the state at the beginning of the trajectory has a larger effect on the outcome
numerically unstable -> the Hessian matrix is ill-conditioned -> extremely sensitive to some parameters, insensitive to others

shooting method: pick all the actions, roll out the trajectory, and then backpropagate ->

for shooting methods, instead of GD, use a method similar to second-order Newton's method, without building the full Hessian

assume f (the dynamics) is a linear function

Newton's method keeps the second-order terms of the dynamics (which iLQR drops)

Both iLQR and Newton's method converge, at the same rate
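For the linear-dynamics case assumed above, the optimizer is plain finite-horizon LQR; here is a minimal sketch of its backward (Riccati) recursion on a made-up double-integrator system. iLQR repeats this recursion around a linearization of the current trajectory.

```python
import numpy as np

def lqr_gains(A, B, Q, R, horizon):
    """Finite-horizon LQR backward (Riccati) recursion.
    Returns feedback gains K_t so the optimal control is u_t = -K_t x_t."""
    P = Q.copy()                 # cost-to-go Hessian at the final step
    gains = []
    for _ in range(horizon):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
        gains.append(K)
    return gains[::-1]           # ordered t = 0 .. horizon-1

# toy double integrator: state = (position, velocity), control = force
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.0], [dt]])
Q, R = np.eye(2), 0.1 * np.eye(1)
Ks = lqr_gains(A, B, Q, R, horizon=50)

# closed-loop rollout: the feedback law drives the state toward zero
x = np.array([[1.0], [0.0]])
for K in Ks:
    x = A @ x + B @ (-K @ x)
```

Note the recursion solves a small linear system per step instead of ever forming the full (horizon x horizon) Hessian, which is the point of the "without building a full Hessian" remark above.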

Additional reading

  1. Mayne, Jacobson. (1970). Differential dynamic programming.
    • Original differential dynamic programming algorithm.
  2. Tassa, Erez, Todorov. (2012). Synthesis and Stabilization of Complex
    Behaviors through Online Trajectory Optimization.
    • Practical guide for implementing non-linear iterative LQR.
  3. Levine, Abbeel. (2014). Learning Neural Network Policies with Guided
    Policy Search under Unknown Dynamics.
    • Probabilistic formulation and trust region alternative to deterministic line search.

trajectory optimization does a great job, given a good model
