Lecture 10: Model-based Planning


Author: Ysgc | Published 2020-01-26 14:39

Question: Why is this a bad idea?
Answer: We don't gain information at each step.

In theory, any optimization method can be used here, but for this particular model-based RL case, some work better than others.

E.g., first-order gradient descent is not a good idea.

For now -> derivative-free methods

easy to code and parallelize
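Not from the lecture, just an illustrative sketch of the simplest derivative-free planner (random shooting): sample whole action sequences, score each under the model, keep the best. The dynamics `f` and reward `r` below are made-up toy stand-ins for a learned model.

```python
import numpy as np

# toy deterministic dynamics model and reward (stand-ins for a learned model)
def f(s, a):
    return s + 0.1 * a                            # state drifts toward the action

def r(s, a):
    return -np.sum(s**2) - 0.01 * np.sum(a**2)    # stay near the origin, cheaply

def random_shooting(s0, horizon=10, n_samples=1000, action_dim=2, seed=0):
    """Sample whole action sequences, evaluate each one under the model,
    and return the best sequence -- no gradients needed."""
    rng = np.random.default_rng(seed)
    best_return, best_seq = -np.inf, None
    for _ in range(n_samples):
        seq = rng.uniform(-1.0, 1.0, size=(horizon, action_dim))
        s, total = np.asarray(s0, dtype=float).copy(), 0.0
        for a in seq:
            total += r(s, a)
            s = f(s, a)
        if total > best_return:
            best_return, best_seq = total, seq
    return best_seq, best_return

best_seq, best_return = random_shooting(np.ones(2))
```

Each candidate sequence is evaluated independently, which is why this is trivial to parallelize.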

http://web.mit.edu/6.454/www/www_fall_2003/gew/CEtutorial.pdf
??? (the linked tutorial looks like a different thing from the CEM here)

here -> low variance, compared with vanilla PG -> CEM only cares about the ranking of the samples, not their numerical values

works for small dimensions (64 or fewer) and a low number of time steps

MCTS -> game planning -> handles stochasticity very well

the number of time steps can be very large

search to a certain depth (say, 3 here), and then just play the game randomly to the end

idea: if a random policy from that state has a better outcome -> the state has a higher value

Question: Can we use a better policy to replace the random policy?
Answer: Yes, e.g., a policy from a NN. Actually, MCTS can be improved in many ways.

a popular choice

MCTS with a better action policy -> better estimates of the value
-> in practice, the random policy is preferred, probably because of its simplicity
-> also, for a small problem, the random policy is not bad
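A simplified sketch of the "search to a fixed depth, then random playout" idea above (not full MCTS: no tree statistics or UCB, and the toy chain environment is deterministic and made up here).

```python
import random

def rollout_value(env_step, state, policy, max_steps=20, rng=None):
    """Estimate a state's value by playing `policy` to the end of the game."""
    rng = rng or random.Random(0)
    total = 0.0
    for _ in range(max_steps):
        a = policy(state, rng)
        state, reward, done = env_step(state, a)
        total += reward
        if done:
            break
    return total

def plan(env_step, actions, state, search_depth=3, n_rollouts=10, seed=0):
    """Exhaustively search to `search_depth`, then estimate leaf values
    with random playouts, as in the notes above."""
    rng = random.Random(seed)
    random_policy = lambda s, r: r.choice(actions)

    def value(s, depth):
        if depth == 0:   # leaf: average several random playouts
            return sum(rollout_value(env_step, s, random_policy, rng=rng)
                       for _ in range(n_rollouts)) / n_rollouts
        best = float("-inf")
        for a in actions:
            s2, reward, done = env_step(s, a)
            best = max(best, reward if done else reward + value(s2, depth - 1))
        return best

    def score(a):
        s2, reward, done = env_step(state, a)
        return reward if done else reward + value(s2, search_depth - 1)

    return max(actions, key=score)

# toy chain environment: reach position 0 for reward 1, then the game ends
def env_step(s, a):
    s2 = s + a
    return s2, (1.0 if s2 == 0 else 0.0), s2 == 0

best_action = plan(env_step, actions=[-1, +1], state=2)
```

From position 2, stepping toward 0 is found by the depth-3 search; the random playouts only matter at the leaves, which is why a simple random policy is often good enough.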

Question: What about continuous action spaces?
Answer: an infinite number of actions -> will discuss later
(Bayesian optimization?)

Additional reading

  1. Browne, Powley, Whitehouse, Lucas, Cowling, Rohlfshagen, Tavener,
    Perez, Samothrakis, Colton. (2012). A Survey of Monte Carlo Tree
    Search Methods.
    • Survey of MCTS methods and basic summary

a paper from six years ago

changing the state at the beginning of the trajectory has a larger effect on the outcome
numerically unstable -> the Hessian matrix is ill-conditioned -> extremely sensitive to some parameters, insensitive to others

shooting method: pick all the actions, roll out the trajectory, and then backpropagate ->

for shooting methods, instead of GD, use a method similar to second-order Newton's method, without building the full Hessian

assume f (the dynamics) is a linear function

Newton's method keeps the second-order terms of the dynamics (which iLQR drops)

Both iLQR and Newton's method converge, at the same rate
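For the linear-dynamics case assumed above, the optimizer is plain finite-horizon LQR; here is a minimal sketch of its backward (Riccati) recursion on a made-up double-integrator system. iLQR repeats this recursion around a linearization of the current trajectory.

```python
import numpy as np

def lqr_gains(A, B, Q, R, horizon):
    """Finite-horizon LQR backward (Riccati) recursion.
    Returns feedback gains K_t so the optimal control is u_t = -K_t x_t."""
    P = Q.copy()                 # cost-to-go Hessian at the final step
    gains = []
    for _ in range(horizon):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
        gains.append(K)
    return gains[::-1]           # ordered t = 0 .. horizon-1

# toy double integrator: state = (position, velocity), control = force
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.0], [dt]])
Q, R = np.eye(2), 0.1 * np.eye(1)
Ks = lqr_gains(A, B, Q, R, horizon=50)

# closed-loop rollout: the feedback law drives the state toward zero
x = np.array([[1.0], [0.0]])
for K in Ks:
    x = A @ x + B @ (-K @ x)
```

Note the recursion solves a small linear system per step instead of ever forming the full (horizon x horizon) Hessian, which is the point of the "without building a full Hessian" remark above.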

Additional reading

  1. Mayne, Jacobson. (1970). Differential dynamic programming.
    • Original differential dynamic programming algorithm.
  2. Tassa, Erez, Todorov. (2012). Synthesis and Stabilization of Complex
    Behaviors through Online Trajectory Optimization.
    • Practical guide for implementing non-linear iterative LQR.
  3. Levine, Abbeel. (2014). Learning Neural Network Policies with Guided
    Policy Search under Unknown Dynamics.
    • Probabilistic formulation and trust region alternative to deterministic line search.

trajectory optimization does a great job, given a good model
