Efficient Reinforcement Learning

Author: 太捉急 | Published 2019-08-20 16:16

This is an old paper from 1994, and a classic reference for analyzing the complexity of reinforcement learning algorithms. Without further ado, let's dive in.
First, the basic definitions of RL (a minimal sampling sketch follows the list):

  • States: \mathcal{S}
  • Actions: \mathcal{A}
  • Transition function: \mathcal{T} : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \rightarrow [0,1]
  • Reward function: \mathcal{R} : \mathcal{S} \times \mathcal{A} \times \mathbb{R} \rightarrow [0,1]
  • Policy function \pi: \mathcal{S} \times \mathcal{A} \rightarrow [0, 1]
  • Value function V : \mathcal{S} \rightarrow \mathbb{R}
  • Discount factor: \gamma \in (0, 1]
  • Initial state distribution d_0 : \mathcal{S} \rightarrow [0, 1]
  • Trajectory: \tau = (s_0, a_0, r_1, s_1, a_1, \dots) where s_0 \sim d_0(s), a_0 \sim \pi(s_0), r_1 \sim \mathcal{R}(s_0,a_0), s_1 \sim \mathcal{T}(s_0, a_0), \dots
  • Return: R(\tau) = \sum_{t=0}^{\infty} \gamma^t r_{t+1}
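To make these definitions concrete, here is a minimal tabular-MDP sketch in Python. The names (`T`, `R_mean`, `d0`, `policy`, `sample_trajectory`) are my own for illustration, not from the paper, and rewards are simplified to be deterministic and equal to their means.

```python
import numpy as np

# Minimal tabular MDP matching the definitions above (illustrative names, not the paper's).
rng = np.random.default_rng(0)

N, K, gamma = 3, 2, 0.9                       # N = |S|, K = |A|, discount factor
T = rng.dirichlet(np.ones(N), size=(N, K))    # T[s, a, s'] = Pr(s' | s, a)
R_mean = rng.uniform(0.0, 1.0, size=(N, K))   # E[R(s, a)], rewards in [0, r_max] with r_max = 1
d0 = np.ones(N) / N                           # initial state distribution d_0
policy = np.ones((N, K)) / K                  # pi(s, a), here the uniform policy

def sample_trajectory(M):
    """Sample tau = (s_0, a_0, r_1, s_1, ...) for M steps and return R(tau)."""
    s = rng.choice(N, p=d0)                   # s_0 ~ d_0
    ret, discount = 0.0, 1.0
    for _ in range(M):
        a = rng.choice(K, p=policy[s])        # a_t ~ pi(s_t, .)
        r = R_mean[s, a]                      # deterministic reward for simplicity
        s = rng.choice(N, p=T[s, a])          # s_{t+1} ~ T(s_t, a_t, .)
        ret += discount * r
        discount *= gamma
    return ret
```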

Optimal Policy:
\pi^* = \arg\max_\pi \mathbb{E}_{\tau \sim \pi} \{R(\tau) \}
Optimal Value:
V^*(s) = V^{\pi^*}(s) = \max_\pi \mathbb{E}_{\tau \sim \pi, s_0 = s} \{R(\tau) \} \\ Q^*(s, a) = Q^{\pi^*}(s,a) = \max_\pi \mathbb{E}_{\tau \sim \pi, s_0 = s, a_0=a} \{R(\tau) \}

Obviously for any policy \pi, \quad V^{\pi} \leq V^*.
V^{\pi}(s) = \mathbb{E}_{\tau \sim \pi, s_0 = s} \{R(\tau) \} = \sum_{a \in \mathcal{A}} \pi(s,a) \{ \mathbb{E}[\mathcal{R}(s,a)] + \gamma \sum_{s' \in \mathcal{S}} V^\pi(s')\mathcal{T}(s,a,s') \}
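As a sanity check, the Bellman equation above can be turned directly into a fixed-point iteration on the tabular sketch from earlier (same illustrative `T`, `R_mean`, `policy` arrays; this is standard policy evaluation / value iteration, not the paper's algorithm):

```python
import numpy as np

def evaluate_policy(T, R_mean, policy, gamma, iters=1000):
    """Iterate V(s) = sum_a pi(s,a) [ E[R(s,a)] + gamma * sum_s' T(s,a,s') V(s') ]."""
    V = np.zeros(T.shape[0])
    for _ in range(iters):
        Q = R_mean + gamma * T @ V            # Q[s, a] under the current V
        V = (policy * Q).sum(axis=1)
    return V                                  # approximates V^pi

def value_iteration(T, R_mean, gamma, iters=1000):
    """Iterate V(s) = max_a [ E[R(s,a)] + gamma * sum_s' T(s,a,s') V(s') ]."""
    V = np.zeros(T.shape[0])
    for _ in range(iters):
        V = (R_mean + gamma * T @ V).max(axis=1)
    return V                                  # approximates V*
```

With these two routines one can check numerically, on this toy example, that V^{\pi} \leq V^* holds componentwise for any policy, as claimed above.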

Now we define the most important concept of this paper, the PAC learner (Probably Approximately Correct):

An algorithm \mathcal{L} is a PAC learner if for any environment \mathcal{E} = (\mathcal{S, A, T, R}, d_0), \epsilon > 0, \delta > 0, \gamma \in (0, 1], it produces a policy \pi such that:
\Pr \{ |V^{*}(s_0) - V^{\pi}(s_0)| \leq \epsilon \} \geq 1-\delta
in time polynomial in N = |\mathcal{S}|, K=|\mathcal{A}|, r_\max = \max \mathcal{R}(s,a), 1/\epsilon, 1/\delta, 1/(1-\gamma)

In other words, if an algorithm can, within a certain amount of time, produce a policy \pi such that with probability at least 1-\delta the error between V^{\pi} and V^* is at most \epsilon, and that amount of time is polynomial in N, K, r_\max, 1/\epsilon, 1/\delta, 1/(1-\gamma), then the algorithm is a PAC learner.
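To illustrate what the PAC condition asks for (under the usual reading that the probability is over the learner's own randomness and samples), one can run a learner repeatedly and count how often its output policy is \epsilon-accurate at s_0. Everything below is a hypothetical placeholder, not the algorithm the paper analyzes:

```python
def empirical_pac_rate(learn, V_star, s0, epsilon, n_runs=100):
    """Fraction of independent runs whose learned policy pi satisfies
    |V*(s_0) - V^pi(s_0)| <= epsilon.  `learn` is a placeholder that runs the
    learner and returns the value function of its output policy.  For a PAC
    learner this fraction should be at least 1 - delta."""
    hits = 0
    for _ in range(n_runs):
        V_pi = learn()                        # hypothetical learner call
        hits += abs(V_star[s0] - V_pi[s0]) <= epsilon
    return hits / n_runs
```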


We first consider an MDP with the horizon limited to M steps: \tau = (s_0, a_0, r_1, \dots, s_M), \quad R(\tau) = \sum_{t=0}^{M-1}\gamma^t r_{t+1}, \quad V_M^\pi(s_M) = 0, \quad V_M^\pi(s_0) = \mathbb{E}_{\tau \sim \pi}\{R(\tau)\}; then
V^{\pi}(s_0) = \lim_{M\rightarrow \infty}V_M^{\pi}(s_0)

For any s_0 \in \mathcal{S},
|V^\pi(s_0) - V_M^{\pi}(s_0) | \leq \gamma^{M}r_\max / (1 - \gamma)

This is easy to prove. For \tau_{>M} = (s_M, a_M, r_{M+1}, s_{M+1}, \dots),
|V^\pi(s_0) - V_M^{\pi}(s_0) | = \gamma^M |\mathbb{E} [ R(\tau_{>M}) ]| \leq \gamma^M v_\max, \quad v_\max = r_\max / (1 - \gamma)
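The bound is easy to check numerically on the tabular sketch: compute V_M^\pi by backing up the Bellman equation M times starting from V_M^\pi(\cdot) = 0 and compare it with the (effectively) infinite-horizon value from `evaluate_policy` above. Again, the array names are my own illustrative ones.

```python
import numpy as np

def evaluate_policy_M(T, R_mean, policy, gamma, M):
    """Finite-horizon value V_M^pi: start from V_M^pi(s_M) = 0 and back up M steps."""
    V = np.zeros(T.shape[0])
    for _ in range(M):
        V = (policy * (R_mean + gamma * T @ V)).sum(axis=1)
    return V

# The gap to the infinite-horizon value is at most gamma^M * v_max:
# M = 50
# V_inf = evaluate_policy(T, R_mean, policy, gamma)
# v_max = R_mean.max() / (1 - gamma)          # taking r_max as the largest mean reward
# assert np.all(np.abs(V_inf - evaluate_policy_M(T, R_mean, policy, gamma, M))
#               <= gamma**M * v_max + 1e-8)
```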
We require

|V^\pi(s_0) - V_M^{\pi}(s_0) | \leq \gamma^M v_\max < \epsilon_c
Then

M \geq \log_\gamma (\epsilon_c / v_\max) = \ln(v_\max / \epsilon_c) / \ln(1/\gamma)
Using \ln x \leq x - 1 for x \in (0, 1], so that \ln(1/\gamma) \geq 1 - \gamma, it suffices to take
M \geq \ln(v_\max / \epsilon_c) / (1-\gamma)
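For concreteness, here is a small helper (my own, purely illustrative) that computes both the exact horizon \log_\gamma(\epsilon_c / v_\max) and the looser 1/(1-\gamma) bound derived above:

```python
import math

def required_horizon(r_max, gamma, eps_c):
    """Smallest M with gamma^M * v_max < eps_c, plus the looser 1/(1-gamma) bound."""
    v_max = r_max / (1.0 - gamma)
    M_exact = math.ceil(math.log(v_max / eps_c) / math.log(1.0 / gamma))
    M_loose = math.ceil(math.log(v_max / eps_c) / (1.0 - gamma))
    return M_exact, M_loose

# Example: required_horizon(r_max=1.0, gamma=0.9, eps_c=0.01) -> (66, 70)
```

As expected, the 1/(1-\gamma) bound is slightly looser than the exact \log_\gamma horizon, but it stays polynomial in 1/(1-\gamma), which is what the PAC definition needs.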
