General Reinforcement Learning info

Problem Statement:

An agent interacts with its environment over discrete time steps. At time step $t$, the agent receives an observation vector $x_t$, chooses an action $a_t$ according to its policy $\pi(a|x_t)$, and observes a reward $r_t$ produced by the environment.
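As a concrete picture of this loop, here is a minimal sketch using a gymnasium-style environment interface; the environment name and the uniform-random stand-in policy are illustrative assumptions, not anything specific to this post.

```python
import gymnasium as gym

# Hypothetical choice of environment; any gym-style env exposes the same loop.
env = gym.make("CartPole-v1")

x_t, _ = env.reset(seed=0)
done = False
while not done:
    # Stand-in for sampling a_t ~ pi(a | x_t): here we pick uniformly at random.
    a_t = env.action_space.sample()
    # The environment returns the next observation and the reward r_t.
    x_next, r_t, terminated, truncated, _ = env.step(a_t)
    done = terminated or truncated
    x_t = x_next
env.close()
```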

The goal is to maximize the discounted return, defined as $R_t = \sum_{i\geq 0}\gamma^i r_{t+i}$, where $\gamma \in [0,1)$ is the discount factor.
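As a quick numerical sketch (the reward sequence and $\gamma$ below are made up), the discounted return for every step of a finite episode can be computed by accumulating rewards backwards from the last step:

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute R_t = sum_{i>=0} gamma^i * r_{t+i} for every t in an episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Example with an arbitrary reward sequence.
print(discounted_returns([1.0, 0.0, 0.0, 1.0], gamma=0.9))
# -> [1.729, 0.81, 0.9, 1.0]
```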

We can use state, action values or state values to define the expected returns from a given state, action pair or state respectively.

  • $Q^\pi(x_t,a_t) = \mathbb{E}_{x_{t+1:\infty}, a_{t+1:\infty}}[R_t|x_t,a_t]$

  • $V^\pi(x_t) = \mathbb{E}_{a_t}[Q^\pi(x_t,a_t)|x_t]$

The advantage function provides a relative measure of the value of each action, since $\mathbb{E}_{a_t}[A^\pi(x_t,a_t)] = 0$; a small numerical check follows the definition below.

  • $A^\pi(x_t,a_t) = Q^\pi(x_t,a_t) - V^\pi(x_t)$
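To make the relationship concrete, here is a small hand-made tabular example (the Q-values and policy probabilities are arbitrary) showing that $V^\pi$ is the policy-weighted average of $Q^\pi$ and that the advantages average to zero under the policy:

```python
# Arbitrary Q-values and policy probabilities for one state with 3 actions.
q_values = [2.0, 0.5, -1.0]   # Q^pi(x, a) for each action a
policy   = [0.5, 0.3, 0.2]    # pi(a | x), must sum to 1

# V^pi(x) = E_{a ~ pi}[Q^pi(x, a)]
v = sum(p * q for p, q in zip(policy, q_values))

# A^pi(x, a) = Q^pi(x, a) - V^pi(x)
advantages = [q - v for q in q_values]

print(v)            # 0.95
print(advantages)   # [1.05, -0.45, -1.95]
print(sum(p * a for p, a in zip(policy, advantages)))  # 0.0, i.e. E_a[A] = 0
```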
