Taking a stab at implementing ACER
My previous actor-critic implementation used a replay buffer to collect experience from a bunch of simulated environments. The performance on CartPole was pretty good, but while I was playing with the code I started reading the ACER paper (Sample Efficient Actor-Critic with Experience Replay). I think it would be good experience to implement it with all the bells and whistles. From what I gathered, the additional pieces I need to add are:
Retrace algorithm
Truncated importance sampling with bias correction
Stochastic dueling network architectures
Efficient trust region policy optimization
So with my previous implementation I was sampling actions from the same policy I was optimizing, which made it an on-policy update. With ACER I believe I'll be implementing an off-policy method, since updates reuse experience collected by older versions of the policy from the replay buffer.
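To make that concrete, the main structural change on the replay-buffer side is that each stored transition also needs to remember what the behaviour policy thought of the action at collection time, so importance ratios π(a|x)/μ(a|x) can be formed later. Here is a minimal sketch, with a toy Transition/ReplayBuffer of my own naming (ACER itself replays whole trajectory segments and stores the full action distribution, not just a single probability):

```python
import collections
import random

# Each transition also records mu_prob: the probability the behaviour policy
# assigned to the chosen action when the data was generated. The current
# (target) policy can then be re-evaluated on old states at training time to
# form the importance ratio pi(a|x) / mu(a|x).
Transition = collections.namedtuple(
    "Transition", ["state", "action", "reward", "next_state", "done", "mu_prob"]
)

class ReplayBuffer:
    def __init__(self, capacity):
        self.storage = collections.deque(maxlen=capacity)

    def push(self, *transition_fields):
        self.storage.append(Transition(*transition_fields))

    def sample(self, batch_size):
        # Uniform sampling; converting to a list keeps random.sample happy.
        return random.sample(list(self.storage), batch_size)
```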
The Retrace Algorithm
Retrace paper: https://arxiv.org/pdf/1606.02647.pdf
What is the Retrace algorithm doing?
It is an off-policy, return-based reinforcement learning algorithm.
In the policy evaluation setting, we are given a fixed policy π whose value Qπ we wish to estimate from sample trajectories drawn from a behaviour policy μ.
In the control setting, we consider a sequence of policies that depend on our own sequence of Q-functions (such as ε-greedy policies), and seek to approximate Q*.
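To keep the two settings straight in my head, the targets are the standard action-value functions (using the paper's x for states and a for actions):

```latex
% Policy evaluation: value of the fixed target policy \pi,
% estimated from trajectories generated by the behaviour policy \mu.
Q^{\pi}(x, a) := \mathbb{E}_{\pi}\!\left[\sum_{t \ge 0} \gamma^{t} r_t \;\middle|\; x_0 = x,\ a_0 = a\right]

% Control: approximate the optimal action-value function.
Q^{*}(x, a) := \max_{\pi} Q^{\pi}(x, a)
```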
It was designed with three properties in mind:
Low variance
Safe use of samples collected from any behaviour policy
Efficient use of samples collected from near on-policy behaviour policies
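As far as I can tell, the way Retrace(λ) hits all three goals at once is through its choice of per-step trace coefficient, which truncates the importance ratio at 1:

```latex
c_s = \lambda \, \min\!\left(1, \frac{\pi(a_s \mid x_s)}{\mu(a_s \mid x_s)}\right)
```

Truncating at 1 keeps the product of the c_s from blowing up no matter how far μ is from π (low variance, safe for arbitrary behaviour policies), while for near on-policy data the ratios are close to 1, so the traces are barely cut and most of the sampled return is used (efficiency).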
The paper compares several return-based off-policy algorithms through a single general operator (Section 2 of "Safe and Efficient Off-Policy Reinforcement Learning").
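For my own reference, here is that operator as I understand it from Section 2, written for trajectories generated by the behaviour policy μ starting from (x, a); the c_s are the per-step trace coefficients that distinguish the different algorithms:

```latex
\mathcal{R} Q(x, a) := Q(x, a)
  + \mathbb{E}_{\mu}\!\left[\sum_{t \ge 0} \gamma^{t}
      \left(\prod_{s=1}^{t} c_s\right)
      \Big(r_t + \gamma \, \mathbb{E}_{\pi} Q(x_{t+1}, \cdot) - Q(x_t, a_t)\Big)\right],
\qquad
\mathbb{E}_{\pi} Q(x, \cdot) := \sum_{a} \pi(a \mid x)\, Q(x, a)
```

Different choices of c_s recover the familiar algorithms: c_s = π(a_s|x_s)/μ(a_s|x_s) is ordinary importance sampling, c_s = λ gives Q(λ), c_s = λπ(a_s|x_s) gives Tree-Backup(λ), and the truncated ratio above gives Retrace(λ).

Since the point of this post is implementation, here is a minimal NumPy sketch of how I'd turn that into Retrace targets for a discrete-action critic on one sampled trajectory. The function name, argument layout, and the λ parameter are my own choices (ACER itself fixes λ = 1 and works over trajectory segments), so treat it as a sketch rather than the reference implementation:

```python
import numpy as np

def retrace_targets(q_values, target_probs, behaviour_probs, rewards, dones,
                    actions, gamma=0.99, lam=1.0):
    """Compute Retrace(lambda) targets for one sampled trajectory.

    Shapes (T = trajectory length, A = number of discrete actions):
      q_values:        (T + 1, A)  critic's Q(x_t, .), including the bootstrap state
      target_probs:    (T + 1, A)  pi(. | x_t) for the policy being evaluated
      behaviour_probs: (T,)        mu(a_t | x_t) recorded at collection time
      rewards, dones:  (T,)
      actions:         (T,)        integer actions taken by the behaviour policy
    """
    T = len(rewards)
    targets = np.zeros(T)
    # Expected Q under pi at every state, i.e. E_pi Q(x_t, .).
    v = (target_probs * q_values).sum(axis=1)   # (T + 1,)
    retrace = v[-1]                              # running target, seeded with the bootstrap value
    for t in reversed(range(T)):
        q_t = q_values[t, actions[t]]
        # Truncated importance ratio: c_t = lambda * min(1, pi/mu).
        c = lam * min(1.0, target_probs[t, actions[t]] / behaviour_probs[t])
        # One-step backup; the bootstrap is zeroed if the episode ended here.
        retrace = rewards[t] + gamma * (1.0 - dones[t]) * retrace
        targets[t] = retrace
        # Pass back c_t * (Q_ret_t - Q_t) + E_pi Q(x_t, .) to the previous step.
        retrace = c * (retrace - q_t) + v[t]
    return targets
```

The backward pass mirrors the recursive form Q^ret(x_t, a_t) = r_t + γ[ c_{t+1}(Q^ret_{t+1} − Q_{t+1}) + E_π Q(x_{t+1}, ·) ], which avoids ever materialising the full products of trace coefficients.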