General Reinforcement Learning info
Problem Statement:
An agent interacts with its environment over discrete time steps. At time step $t$, the agent receives an observation vector $x_t$, chooses an action $a_t$ according to its policy $\pi(a|x_t)$, and observes a reward $r_t$ produced by the environment.
The goal is to maximize the discounted return, defined as $R_t = \sum_{i\geq 0}\gamma^i r_{t+i}$, where $\gamma \in [0,1)$ is the discount factor.
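A minimal sketch of how the discounted return can be computed for every step of a finite reward sequence (the function name and example rewards are illustrative, not from the notes); iterating backwards uses the recursion $R_t = r_t + \gamma R_{t+1}$.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return a list where entry t is R_t = sum_i gamma^i * r_{t+i}."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each R_t = r_t + gamma * R_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

print(discounted_returns([1.0, 0.0, 2.0], gamma=0.9))
# [2.62, 1.8, 2.0]
```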
We can use state-action values or state values to define the expected return from a given state-action pair or state, respectively.
$Q^\pi(x_t,a_t) = \mathbb{E}_{x_{t+1:\infty}, a_{t+1:\infty}}[R_t|x_t,a_t]$
$V^\pi(x_t) = \mathbb{E}_{a_t}[Q^\pi(x_t,a_t)|x_t]$
The advantage function provides a relative measure of the value of each action, since $\mathbb{E}_{a_t}[A^\pi(x_t,a_t)] = 0$:
$A^\pi(x_t,a_t) = Q^\pi(x_t,a_t) - V^\pi(x_t)$
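A small numerical sketch of these definitions for a single state (the toy $Q$ values and policy probabilities are hypothetical): $V^\pi(x)$ is the policy-weighted average of $Q^\pi(x,a)$, the advantages are deviations from that average, and their expectation under the policy comes out to zero.

```python
q_values = {"left": 1.0, "right": 3.0}    # hypothetical Q^pi(x, a) for a fixed state x
policy   = {"left": 0.25, "right": 0.75}  # hypothetical pi(a | x)

v = sum(policy[a] * q_values[a] for a in q_values)           # V^pi(x)
advantages = {a: q_values[a] - v for a in q_values}          # A^pi(x, a)
expected_adv = sum(policy[a] * advantages[a] for a in q_values)

print(v)             # 0.25*1.0 + 0.75*3.0 = 2.5
print(advantages)    # {'left': -1.5, 'right': 0.5}
print(expected_adv)  # 0.0 (up to floating-point error)
```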