Actor Critic with CartPole
For the past few weeks I’ve been working on an Actor-Critic RL implementation as a sandbox for trying out various RL concepts. This document explains some of the lessons learned and demonstrates the code.
Originally I was working in a single Colab notebook, but as the code grew it became necessary to split things into separate files for the organizational benefits. Below are links to the GitHub repositories for both the .ipynb notebook and the Python files I’ve been working on.
Code:
Python Files : Github
Colab Notebook : Colab *Note To Self : Need to add this link. Notebook needs a little cleaning up first.
Features :
Utilizing vectorized environments to speed up training. While watching the replay buffer fill, I noticed that the inference f(state) = action_logits was taking far longer than the env.step(action) operation. With multiple environments I can fill the buffer faster by inputting a batched state tensor and receiving a batched action_logits tensor in return. Since tensor operations are parallelized on the GPU, this saves time by reducing the number of inference calls.
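Roughly, the batched collection looks like the sketch below. It uses Gymnasium’s vector API and a placeholder two-layer policy network rather than my actual network, so treat it as an illustration of the idea, not the real code.

```python
import gymnasium as gym
import tensorflow as tf

NUM_ENVS = 100  # matches the 100-environment tests mentioned below

# Placeholder policy network: 4 CartPole observations in, 2 action logits out.
policy_net = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(2),
])

# One vectorized environment object steps every CartPole copy together.
envs = gym.vector.SyncVectorEnv(
    [lambda: gym.make("CartPole-v1") for _ in range(NUM_ENVS)]
)
states, _ = envs.reset(seed=0)  # states has shape (NUM_ENVS, 4)

# A single forward pass produces action logits for all environments at once,
# instead of NUM_ENVS separate inference calls.
action_logits = policy_net(tf.convert_to_tensor(states, dtype=tf.float32))
actions = tf.random.categorical(action_logits, num_samples=1)[:, 0].numpy()

# One step call advances every environment in the batch.
next_states, rewards, terminated, truncated, infos = envs.step(actions)
```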
Variable N-step updates. In the training loop I can set how many steps I want each environment to take before performing a training update. Below are some plots comparing N=1, N=10, and N=100. For this test there are 100 environments running.
Currently I’m measuring the performance of the algorithm by keeping track of each environment’s cumulative reward in the replay buffer. This allows me to avoid running evaluation environments, which could become painfully slow as episode length increases. Below are plots comparing an evaluation environment vs the batch_mean_reward.
Current Implementation Problems :
The replay buffer is tailored to the environment, and I need to change it manually when I switch environments. I created different replay buffer classes for different environments, but I don’t love this solution; it’s just a stopgap.
My current implementation doesn’t work with the standard, unmodified environments right now. To speed up training I added a cumulative reward to every environment, which is returned in the info dictionary during step. I can then average terminal rewards by multiplying the cumulative reward and done tensors, summing the result, and dividing by the number of done=True transitions in the batch. I did this to avoid running evaluation trials, which were tediously slow. I could probably run those eval trials in a different thread, but that is a future problem to work on.
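The averaging itself is just a masked mean. A small sketch with made-up numbers (the tensor names are placeholders for what comes out of the replay buffer):

```python
import tensorflow as tf

# Hypothetical batch pulled from the replay buffer.
cumulative_rewards = tf.constant([12.0, 57.0, 3.0, 200.0])  # from the info dict
dones = tf.constant([0.0, 1.0, 0.0, 1.0])                   # 1.0 where done=True

# Zero out non-terminal entries, then average over the terminal ones.
terminal_sum = tf.reduce_sum(cumulative_rewards * dones)
num_terminal = tf.reduce_sum(dones)
batch_mean_reward = terminal_sum / num_terminal  # (57 + 200) / 2 = 128.5
```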
self.total_reward_metric can misbehave when there are no transitions with done=True in the batch. Since I’m averaging over 10 training update steps this usually isn’t the case, but when no done=True states are seen the metric reports 0, which was negatively affecting the reported performance of the agent. This might be fixable by checking how many entries the metric has collected before recording it.
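One possible fix is to only record the metric when at least one episode actually finished. A sketch, continuing from the tensors in the previous snippet and using a plain Keras Mean metric as a stand-in for self.total_reward_metric:

```python
total_reward_metric = tf.keras.metrics.Mean()  # stand-in for self.total_reward_metric

num_terminal = tf.reduce_sum(dones)
if num_terminal > 0:
    # Skip batches with no done=True transitions so they don't
    # drag the reported average down to 0.
    total_reward_metric.update_state(
        tf.reduce_sum(cumulative_rewards * dones) / num_terminal
    )
```

In eager mode a plain Python if works here; under @tf.function the same check may need to be expressed as a tf.cond.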
The replay buffer is sampled sequentially instead of randomly. This wasn’t an issue when I had written my own replay buffer, but there doesn’t seem to be an easy way to do this with the TF-Agents replay buffer…
Observations :
Infinite-length episodes seemed to degrade the policy. It felt like the agent got so good at the common case that it started to get worse at more infrequent tasks such as recovering from bad states and not going past the linear (cart position) boundaries. To fix this I started limiting episodes to 1000 steps.
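In Gymnasium the step limit can be set when creating the environment, which applies a TimeLimit wrapper so the cap doesn’t have to live in my own loop:

```python
import gymnasium as gym

# Cap episodes at 1000 steps; the episode is truncated rather than terminated.
env = gym.make("CartPole-v1", max_episode_steps=1000)
```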
I think learning only occurs through trial and failure. I started increasing the sampling range of the starting states to give the agent more examples of challenging situations. This seemed to improve the performance and learning rate of the agent.
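One way to widen the starting-state distribution is a small wrapper that overwrites CartPole’s internal state right after reset. The range below is just an example value, and my actual implementation differs in the details:

```python
import gymnasium as gym
import numpy as np

class WideResetCartPole(gym.Wrapper):
    """Reset CartPole, then resample the start state from a wider range."""

    def __init__(self, env, scale=0.3):  # default CartPole reset uses +/- 0.05
        super().__init__(env)
        self.scale = scale

    def reset(self, **kwargs):
        _, info = self.env.reset(**kwargs)
        # Overwrite the underlying environment state with a wider sample.
        state = np.random.uniform(-self.scale, self.scale, size=4)
        self.env.unwrapped.state = state
        return state.astype(np.float32), info

env = WideResetCartPole(gym.make("CartPole-v1"))
```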
Learning seems to come from differences in the rewards received. In CartPole every step results in a reward of 1, so the only chance for the critic network to update its idea of state values comes at episode termination, where there is no future reward. After some updating, that value difference begins to propagate back from the terminal states towards the starting states. I guess I think of this as learning occurring due to the difference in rewards.
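A tiny numeric example of what I mean: with a constant +1 reward the critic’s one-step target r + gamma * V(s') is essentially flat, and the only place it differs is where the bootstrap term is masked out at termination (the values below are made up):

```python
import tensorflow as tf

gamma = 0.99
rewards = tf.constant([1.0, 1.0, 1.0, 1.0])          # CartPole gives +1 every step
dones = tf.constant([0.0, 0.0, 0.0, 1.0])            # last transition ends the episode
next_values = tf.constant([10.0, 10.0, 10.0, 10.0])  # made-up critic estimates V(s')

# The (1 - done) mask drops the future-value term at termination, which is
# the only place the target differs from its neighbours.
td_targets = rewards + gamma * next_values * (1.0 - dones)
# -> [10.9, 10.9, 10.9, 1.0]; that gap at the terminal transition is what
#    slowly propagates back toward the starting states as the critic trains.
```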
Decorating the training step with @tf.function resulted in about a 3x speed increase.
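For reference, the decoration itself is just one line on the update function. The sketch below uses a placeholder critic-only loss rather than my full actor-critic loss:

```python
import tensorflow as tf

critic_net = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1),
])
optimizer = tf.keras.optimizers.Adam(1e-3)

@tf.function  # traces the update into a TensorFlow graph on first call
def train_step(states, td_targets):
    with tf.GradientTape() as tape:
        values = tf.squeeze(critic_net(states), axis=-1)
        # Placeholder loss: the real implementation combines actor and critic terms.
        loss = tf.reduce_mean(tf.square(td_targets - values))
    grads = tape.gradient(loss, critic_net.trainable_variables)
    optimizer.apply_gradients(zip(grads, critic_net.trainable_variables))
    return loss
```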
Experiments :
Learning and Performance with different environment steps and training steps
Environment steps are how many steps each environment takes before a series of mini-batches is used to update the agent policy. Each time an environment takes a step, it adds the experience to the replay buffer.
Training steps are how many mini-batch updates are run between rounds of environment steps.
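Putting the two knobs together, one iteration of the experiment loop looks roughly like this (the buffer here is a plain list and the agent update is stubbed out, so it is only a structural sketch, not the actual code):

```python
import gymnasium as gym
import numpy as np
import tensorflow as tf

ENV_STEPS = 10    # steps each environment takes per iteration
TRAIN_STEPS = 4   # mini-batch updates per iteration

envs = gym.vector.SyncVectorEnv([lambda: gym.make("CartPole-v1") for _ in range(8)])
policy_net = tf.keras.Sequential([tf.keras.layers.Dense(64, activation="relu"),
                                  tf.keras.layers.Dense(2)])
replay_buffer = []  # stand-in for the real replay buffer

states, _ = envs.reset(seed=0)
for iteration in range(100):
    # 1) Collection: every environment takes ENV_STEPS steps and each batch of
    #    transitions is appended to the replay buffer.
    for _ in range(ENV_STEPS):
        logits = policy_net(tf.convert_to_tensor(states, dtype=tf.float32))
        actions = tf.random.categorical(logits, num_samples=1)[:, 0].numpy()
        next_states, rewards, terminated, truncated, infos = envs.step(actions)
        replay_buffer.append((states, actions, rewards, next_states, terminated))
        states = next_states

    # 2) Update: run TRAIN_STEPS mini-batch updates drawn from the buffer.
    for _ in range(TRAIN_STEPS):
        batch = replay_buffer[np.random.randint(len(replay_buffer))]
        # train_step(batch)  # agent update would go here
```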