Taking a stab at implementing ACER
My previous actor-critic implementation used a replay buffer to collect experience from a bunch of simulated environments. The performance on CartPole was pretty good, but while I was playing with the code I started reading the ACER paper (Sample Efficient Actor-Critic with Experience Replay). I think it would be good experience to implement it with all the bells and whistles. From what I gathered, the additional pieces I need to add are:
Retrace algorithm
Truncated importance sampling with bias correction
Stochastic dueling network architectures
Efficient trust region policy optimization
So with my previous implementation I was sampling actions from the same policy I was optimizing, which made it an on-policy update. With ACER I believe I'll be implementing an off-policy method, since updates reuse experience collected by older versions of the policy from the replay buffer.
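To make that concrete, the main structural change on the replay-buffer side is that each stored transition also needs to remember what the behaviour policy thought of the action at collection time, so importance ratios π(a|x)/μ(a|x) can be formed later. Here is a minimal sketch, with a toy Transition/ReplayBuffer of my own naming (ACER itself replays whole trajectory segments and stores the full action distribution, not just a single probability):

```python
import collections
import random

# Each transition also records mu_prob: the probability the behaviour policy
# assigned to the chosen action when the data was generated. The current
# (target) policy can then be re-evaluated on old states at training time to
# form the importance ratio pi(a|x) / mu(a|x).
Transition = collections.namedtuple(
    "Transition", ["state", "action", "reward", "next_state", "done", "mu_prob"]
)

class ReplayBuffer:
    def __init__(self, capacity):
        self.storage = collections.deque(maxlen=capacity)

    def push(self, *transition_fields):
        self.storage.append(Transition(*transition_fields))

    def sample(self, batch_size):
        # Uniform sampling; converting to a list keeps random.sample happy.
        return random.sample(list(self.storage), batch_size)
```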
The Retrace Algorithm
Retrace paper: https://arxiv.org/pdf/1606.02647.pdf
What is the Retrace algorithm doing?
It is an off-policy, return-based reinforcement learning algorithm.
In the policy evaluation setting, we are given a fixed policy π whose value Qπ we wish to estimate from sample trajectories drawn from a behaviour policy μ.
In the control setting, we consider a sequence of policies that depend on our own sequence of Q-functions (such as ε-greedy policies), and seek to approximate Q*.
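To keep the two settings straight in my head, the targets are the standard action-value functions (using the paper's x for states and a for actions):

```latex
% Policy evaluation: value of the fixed target policy \pi,
% estimated from trajectories generated by the behaviour policy \mu.
Q^{\pi}(x, a) := \mathbb{E}_{\pi}\!\left[\sum_{t \ge 0} \gamma^{t} r_t \;\middle|\; x_0 = x,\ a_0 = a\right]

% Control: approximate the optimal action-value function.
Q^{*}(x, a) := \max_{\pi} Q^{\pi}(x, a)
```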
It was designed with three properties in mind:
Low variance
Safe use of samples collected from any behaviour policy
Efficient use of samples collected from near on-policy behaviour policies
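As far as I can tell, the way Retrace(λ) hits all three goals at once is through its choice of per-step trace coefficient, which truncates the importance ratio at 1:

```latex
c_s = \lambda \, \min\!\left(1, \frac{\pi(a_s \mid x_s)}{\mu(a_s \mid x_s)}\right)
```

Truncating at 1 keeps the product of the c_s from blowing up no matter how far μ is from π (low variance, safe for arbitrary behaviour policies), while for near on-policy data the ratios are close to 1, so the traces are barely cut and most of the sampled return is used (efficiency).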
The paper compares several return-based off-policy algorithms through a single general operator (Section 2 of "Safe and Efficient Off-Policy Reinforcement Learning").
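For my own reference, here is that operator as I understand it from Section 2, written for trajectories generated by the behaviour policy μ starting from (x, a); the c_s are the per-step trace coefficients that distinguish the different algorithms:

```latex
\mathcal{R} Q(x, a) := Q(x, a)
  + \mathbb{E}_{\mu}\!\left[\sum_{t \ge 0} \gamma^{t}
      \left(\prod_{s=1}^{t} c_s\right)
      \Big(r_t + \gamma \, \mathbb{E}_{\pi} Q(x_{t+1}, \cdot) - Q(x_t, a_t)\Big)\right],
\qquad
\mathbb{E}_{\pi} Q(x, \cdot) := \sum_{a} \pi(a \mid x)\, Q(x, a)
```

Different choices of c_s recover the familiar algorithms: c_s = π(a_s|x_s)/μ(a_s|x_s) is ordinary importance sampling, c_s = λ gives Q(λ), c_s = λπ(a_s|x_s) gives Tree-Backup(λ), and the truncated ratio above gives Retrace(λ).

Since the point of this post is implementation, here is a minimal NumPy sketch of how I'd turn that into Retrace targets for a discrete-action critic on one sampled trajectory. The function name, argument layout, and the λ parameter are my own choices (ACER itself fixes λ = 1 and works over trajectory segments), so treat it as a sketch rather than the reference implementation:

```python
import numpy as np

def retrace_targets(q_values, target_probs, behaviour_probs, rewards, dones,
                    actions, gamma=0.99, lam=1.0):
    """Compute Retrace(lambda) targets for one sampled trajectory.

    Shapes (T = trajectory length, A = number of discrete actions):
      q_values:        (T + 1, A)  critic's Q(x_t, .), including the bootstrap state
      target_probs:    (T + 1, A)  pi(. | x_t) for the policy being evaluated
      behaviour_probs: (T,)        mu(a_t | x_t) recorded at collection time
      rewards, dones:  (T,)
      actions:         (T,)        integer actions taken by the behaviour policy
    """
    T = len(rewards)
    targets = np.zeros(T)
    # Expected Q under pi at every state, i.e. E_pi Q(x_t, .).
    v = (target_probs * q_values).sum(axis=1)   # (T + 1,)
    retrace = v[-1]                              # running target, seeded with the bootstrap value
    for t in reversed(range(T)):
        q_t = q_values[t, actions[t]]
        # Truncated importance ratio: c_t = lambda * min(1, pi/mu).
        c = lam * min(1.0, target_probs[t, actions[t]] / behaviour_probs[t])
        # One-step backup; the bootstrap is zeroed if the episode ended here.
        retrace = rewards[t] + gamma * (1.0 - dones[t]) * retrace
        targets[t] = retrace
        # Pass back c_t * (Q_ret_t - Q_t) + E_pi Q(x_t, .) to the previous step.
        retrace = c * (retrace - q_t) + v[t]
    return targets
```

The backward pass mirrors the recursive form Q^ret(x_t, a_t) = r_t + γ[ c_{t+1}(Q^ret_{t+1} − Q_{t+1}) + E_π Q(x_{t+1}, ·) ], which avoids ever materialising the full products of trace coefficients.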