Taking a stab at implementing ACER

My previous actor-critic implementation used a replay buffer to collect experience from a bunch of simulated environments. The performance on CartPole was pretty good, but while I was playing with the code I started reading the ACER paper (Sample Efficient Actor-Critic with Experience Replay). I think it would be good experience to implement it with all the bells and whistles. From what I gathered, the additional pieces I need to add are:

  1. The Retrace algorithm

  2. Truncated importance sampling with bias correction

  3. Stochastic dueling network architectures

  4. Efficient trust region policy optimization

With my previous implementation I was sampling actions from the same policy I was optimizing, so the update was on-policy. With ACER I'll be implementing an off-policy method, since the replay buffer holds transitions generated by older versions of the policy.
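As a concrete sketch of piece 2, here is how the truncated importance weights and bias-correction coefficients from the ACER paper can be computed for discrete actions. The function name is my own, and c = 10 is the truncation threshold used in the paper:

```python
import numpy as np

def truncated_is_weights(pi_probs, mu_probs, c=10.0):
    """Truncated importance weights plus ACER's bias-correction
    coefficients (discrete actions).

    pi_probs: action probabilities under the current policy pi
    mu_probs: action probabilities under the behaviour policy mu
    c:        truncation threshold (c = 10 in the paper)
    """
    rho = pi_probs / mu_probs             # full importance ratios
    rho_bar = np.minimum(c, rho)          # truncated weights -> bounded variance
    # bias-correction coefficient: only non-zero where rho exceeds c
    correction = np.maximum(0.0, (rho - c) / rho)
    return rho_bar, correction
```

Truncation bounds the variance of the main term, while the correction term (weighted by actions sampled from pi) makes up for the bias the truncation introduces.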

The Retrace Algorithm

Retrace Paper : https://arxiv.org/pdf/1606.02647.pdf

What is the Retrace algorithm doing?

  • It is an off-policy return based reinforcement learning algorithm.

    • In the policy evaluation setting, we are given a fixed policy π whose value Qπ we wish to estimate from sample trajectories drawn from a behaviour policy μ.

    • In the control setting, we consider a sequence of policies that depend on our own sequence of Q-functions (such as ε-greedy policies), and seek to approximate Q*.

  • It was designed with three properties in mind:

    • Low variance

    • Safe use of samples collected from any behaviour policy

    • Efficient use of samples collected from near on-policy behaviour policies
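A minimal sketch of the per-step traces that give Retrace these properties. The function name is mine; the rule c_s = λ min(1, π(a_s|x_s) / μ(a_s|x_s)) is from the paper:

```python
import numpy as np

def retrace_coefficients(pi_probs, mu_probs, lam=1.0):
    """Per-step traces c_s = lambda * min(1, pi/mu) used by Retrace(lambda).

    Truncating the ratio at 1 keeps the variance bounded for any behaviour
    policy mu (safety), while leaving near on-policy samples almost
    uncut (efficiency).
    """
    return lam * np.minimum(1.0, pi_probs / mu_probs)
```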

The general operator that we consider for comparing several return-based off-policy algorithms (Section 2 : Safe and efficient off-policy reinforcement learning):

    𝓡Q(x, a) := Q(x, a) + E_μ[ Σ_{t≥0} γ^t (c_1 ⋯ c_t) (r_t + γ E_π Q(x_{t+1}, ·) − Q(x_t, a_t)) ]

where Retrace(λ) sets the traces to c_s = λ min(1, π(a_s | x_s) / μ(a_s | x_s)).

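Putting the operator into code: assuming the per-step traces c_t have already been computed, the Retrace target for a sampled trajectory can be evaluated with a backward recursion (a sketch; variable names are my own):

```python
import numpy as np

def retrace_targets(rewards, q, v, c, gamma=0.99, bootstrap=0.0):
    """Backward recursion for the Retrace targets Q^ret.

    rewards:   r_t along the sampled trajectory, shape (T,)
    q:         critic estimates Q(x_t, a_t), shape (T,)
    v:         expected values E_{a~pi} Q(x_t, a), shape (T,)
    c:         truncated traces c_t = lambda * min(1, pi/mu), shape (T,)
    bootstrap: value estimate for the state after the last step
    """
    T = len(rewards)
    targets = np.empty(T)
    q_ret = bootstrap
    for t in reversed(range(T)):
        q_ret = rewards[t] + gamma * q_ret
        targets[t] = q_ret
        # shrink the correction by the trace before stepping back in time
        q_ret = c[t] * (q_ret - q[t]) + v[t]
    return targets
```

With c_t = 1 everywhere and q = v this reduces to ordinary discounted returns; with c_t = 0 it collapses to one-step targets r_t + γ v_{t+1}, which matches the intuition that the traces control how far off-policy information is propagated.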
Eligibility Traces

Chapter 12 : http://incompleteideas.net/book/RLbook2020.pdf
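For reference, a tiny tabular TD(λ) update with accumulating eligibility traces, along the lines of Chapter 12. This is a sketch under my own conventions: states index into V, and terminal states are assumed to have value fixed at 0:

```python
import numpy as np

def td_lambda_update(V, trajectory, alpha=0.1, gamma=0.99, lam=0.9):
    """One episode of tabular TD(lambda) with accumulating traces.

    V:          value table, one entry per state
    trajectory: list of (state, reward, next_state) transitions
    """
    z = np.zeros_like(V)                       # eligibility trace per state
    for s, r, s_next in trajectory:
        delta = r + gamma * V[s_next] - V[s]   # TD error
        z *= gamma * lam                       # decay every trace
        z[s] += 1.0                            # accumulate for visited state
        V += alpha * delta * z                 # credit recently visited states
    return V
```

The traces are what Retrace generalizes: its c_t coefficients play the role of the λ-decay, but corrected for the mismatch between the behaviour and target policies.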
