on-policy RL, off-policy RL, off

Author: 吃醋不吃辣的雷儿 | Published: 2022-06-15 15:44

on-policy

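On-policy: the policy that collects the data and the policy being updated are the same policy. The agent interacts with the environment using its current policy, collects a fixed number of steps of transitions (s, a, r, s', terminal_flag), and then updates that same policy. There is no replay buffer and no experience replay: the data is discarded once it has been used.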

Behaviour policy (the policy used to generate data) == target policy (the policy being updated)
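The collect, update, discard cycle can be sketched as follows. This is a minimal illustration, not a reference implementation: `ToyEnv`, `collect_rollout`, and `on_policy_train` are hypothetical names, the environment is a made-up one-state task where action 1 pays 1.0, and the update is a simple softmax policy-gradient step with a mean-reward baseline.

```python
import math
import random


class ToyEnv:
    """Hypothetical one-state task: action 1 pays 1.0, action 0 pays 0.0."""

    def reset(self):
        self.t = 0
        return 0

    def step(self, action):
        self.t += 1
        reward = 1.0 if action == 1 else 0.0
        done = self.t >= 5  # episodes last 5 steps
        return 0, reward, done  # (s', r, terminal_flag)


def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]


def collect_rollout(env, prefs, n_steps, rng):
    """Collect n_steps transitions using the CURRENT policy only."""
    rollout = []
    s = env.reset()
    for _ in range(n_steps):
        probs = softmax(prefs)
        a = 0 if rng.random() < probs[0] else 1
        s2, r, done = env.step(a)
        rollout.append((s, a, r, s2, done))
        s = env.reset() if done else s2
    return rollout


def on_policy_train(iterations=200, n_steps=20, lr=0.5, seed=0):
    rng = random.Random(seed)
    env = ToyEnv()
    prefs = [0.0, 0.0]  # action preferences of the softmax policy
    for _ in range(iterations):
        # 1) Collect fresh data with the policy we are about to update.
        rollout = collect_rollout(env, prefs, n_steps, rng)
        # 2) Update that same policy (policy gradient with a baseline).
        baseline = sum(r for _, _, r, _, _ in rollout) / len(rollout)
        probs = softmax(prefs)
        grad = [0.0, 0.0]
        for _, a, r, _, _ in rollout:
            for b in range(2):
                indicator = 1.0 if b == a else 0.0
                grad[b] += (r - baseline) * (indicator - probs[b])
        prefs = [p + lr * g / len(rollout) for p, g in zip(prefs, grad)]
        # 3) No replay buffer: the rollout is overwritten on the next
        #    iteration and never reused.
    return softmax(prefs)
```

Calling `on_policy_train()` returns the final action probabilities. The point to notice is that every rollout is generated by the policy being updated and is used exactly once.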


off-policy

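Off-policy: the policy that collects the data and the policy being updated are different policies. The agent interacts with the environment using its current policy, collects a fixed number of steps of transitions (s, a, r, s', terminal_flag), and puts them into a replay buffer. Each update then samples a batch of transitions from the replay buffer. Because the buffer also holds data collected by earlier versions of the policy, the policy that generated the data generally differs from the one being updated.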

Off-policy learning allows older samples (collected using older policies) to be used in the update. To update the policy, experiences are sampled from a buffer that contains interactions collected by its own predecessor policies. This improves sample efficiency, since we don't need to recollect samples whenever the policy changes.
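A minimal sketch of the replay-buffer pattern, assuming tabular Q-learning with an epsilon-greedy behaviour policy. `ChainEnv` and `off_policy_train` are hypothetical names and the 4-state chain is a made-up task; the hyperparameters are arbitrary illustration values.

```python
import random
from collections import deque


class ChainEnv:
    """Hypothetical 4-state chain: action 1 moves right, action 0 moves left.
    Reaching the last state pays 1.0 and ends the episode."""

    N = 4

    def reset(self):
        self.s = 0
        return self.s

    def step(self, action):
        if action == 1:
            self.s = min(self.s + 1, self.N - 1)
        else:
            self.s = max(self.s - 1, 0)
        done = self.s == self.N - 1
        return self.s, (1.0 if done else 0.0), done  # (s', r, terminal_flag)


def off_policy_train(total_steps=3000, batch_size=16, gamma=0.9,
                     lr=0.2, eps=0.3, seed=0):
    rng = random.Random(seed)
    env = ChainEnv()
    q = [[0.0, 0.0] for _ in range(ChainEnv.N)]
    buffer = deque(maxlen=1000)  # replay buffer; oldest transitions drop out
    s = env.reset()
    for _ in range(total_steps):
        # Behaviour policy: epsilon-greedy with respect to the current Q.
        if rng.random() < eps:
            a = rng.randrange(2)
        else:
            a = 0 if q[s][0] > q[s][1] else 1
        s2, r, done = env.step(a)
        buffer.append((s, a, r, s2, done))  # store instead of discarding
        s = env.reset() if done else s2

        # Update from a random batch, which mixes fresh transitions with
        # ones gathered under earlier versions of the policy.
        batch = rng.sample(list(buffer), min(batch_size, len(buffer)))
        for bs, ba, br, bs2, bdone in batch:
            target = br if bdone else br + gamma * max(q[bs2])
            q[bs][ba] += lr * (target - q[bs][ba])
    return q
```

The update targets the greedy policy (the `max` in the target) while the data comes from the exploratory epsilon-greedy policy and from older buffer entries, which is what makes this loop off-policy.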


offline

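Offline: the data-collecting policy is unknown, and there is no interaction with the environment. Instead of interacting with the environment, the agent uses a previously collected dataset, sampling batches of transitions (s, a, r, s', terminal_flag) from it to update the current policy. No new data is generated.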

Offline reinforcement learning algorithms are those that utilize previously collected data, without additional online data collection. The agent no longer has the ability to interact with the environment and collect additional transitions using the behaviour policy. The learning algorithm is provided with a static dataset D of fixed interactions and must learn the best policy it can from this dataset alone.
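The offline setting can be sketched by removing the environment from the training loop entirely. In the illustration below, `DATASET`, `offline_train`, and `greedy_policy` are hypothetical names, and the six logged transitions are invented for a 4-state chain (action 1 moves right, action 0 moves left, reaching state 3 pays 1.0). The learner is plain tabular Q-learning applied to the static data; practical offline methods typically add safeguards for actions the dataset does not cover, which this sketch omits.

```python
import random

# Hypothetical static dataset D of (s, a, r, s', terminal_flag) transitions,
# logged beforehand by some unknown behaviour policy.
DATASET = [
    (0, 1, 0.0, 1, False),
    (0, 0, 0.0, 0, False),
    (1, 1, 0.0, 2, False),
    (1, 0, 0.0, 0, False),
    (2, 1, 1.0, 3, True),
    (2, 0, 0.0, 1, False),
]


def offline_train(dataset, n_states=4, n_actions=2, updates=2000,
                  gamma=0.9, lr=0.2, seed=0):
    """Q-learning on a fixed dataset: no env.reset() or env.step() calls."""
    rng = random.Random(seed)
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(updates):
        # The dataset is the only source of experience; nothing is appended.
        s, a, r, s2, done = rng.choice(dataset)
        target = r if done else r + gamma * max(q[s2])
        q[s][a] += lr * (target - q[s][a])
    return q


def greedy_policy(q):
    """Pick the highest-valued action in each state (ties go to action 0)."""
    return [max(range(len(row)), key=row.__getitem__) for row in q]
```

A typical call is `greedy_policy(offline_train(DATASET))`. The dataset has the same length before and after training, which reflects the "no new data" property of the offline setting.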


nice reference

https://slideslive.com/38935785/offline-reinforcement-learning-from-algorithms-to-practical-challenges
https://kowshikchilamkurthy.medium.com/off-policy-vs-on-policy-vs-offline-reinforcement-learning-demystified-f7f87e275b48