Getting Unstuck: Learning Failure After Reducing State Observations

I’ve switched to a quadruped design to start; I think the added stability of two more legs will give me a better chance at early success in my sim-to-real work. It also allows for a larger chassis, which I needed for additional components like a battery, a power distribution system, and a RealSense camera. All in all, I think this is a good direction for now, and I’ll return to the two-legged variant once I have a better understanding of the sim-to-real challenges I’ll be facing.

So let’s recap. For months, I’ve been stuck with a bug of sorts. The policies learned from the “God’s Eye” observations (full state knowledge) were really good; the robots were able to locomote well. The reduced observation set, however, has consistently failed to produce good policies.

  • Originally, I was creating a simulated IMU-style measurement and feeding the raw signals to the network

  • Now I’m just calculating the local linear velocity and angular velocity and giving those to the network (a rough sketch of this follows the list)

  • I imagined orientation could be learned from the raw IMU signals, but processing them myself and feeding in the result will probably save some training time

    • This hasn’t been done yet
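For concreteness, here’s a minimal sketch of what that reduced observation could look like, assuming the simulator returns a (w, x, y, z) base quaternion and world-frame velocities; the function names are placeholders, not my actual training code.

```python
# Minimal sketch (placeholder names): rotate world-frame base velocities into
# the robot's body frame to form the reduced observation.
import numpy as np

def quat_to_rot(q):
    """Rotation matrix (body -> world) from a unit quaternion (w, x, y, z)."""
    w, x, y, z = q
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def local_velocity_obs(base_quat, world_lin_vel, world_ang_vel):
    """World-frame velocities -> body-frame (local) velocities for the policy."""
    R = quat_to_rot(base_quat)           # body -> world
    lin_vel_body = R.T @ world_lin_vel   # world -> body
    ang_vel_body = R.T @ world_ang_vel
    return np.concatenate([lin_vel_body, ang_vel_body])
```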

I wasn’t sure what type of bug I was looking for. Tensor shapes have been such a common source of problems that I tend to always start there. I had also changed the reward and the observation at the same time, so I wasn’t sure which one held the bug. To me, how the reward is given could have been a source of problems. The early reward structure attempted to match the robot’s local velocity to the commanded velocity. If the robot is only able to walk well at one specific speed (due to my slapdash design), I imagine this type of reward could be very detrimental to the learning process. I switched over to matching a commanded heading with any achievable velocity, although this has its pitfalls as well: coupled with rewards designed to minimize torque, the robot may learn how to move, but at a very low speed.
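To make the two reward styles concrete, here’s a rough side-by-side; the shaping and coefficients are illustrative guesses, not the exact terms in my training setup.

```python
# Illustrative comparison of the two reward styles (placeholder names/values).
import numpy as np

def velocity_tracking_reward(local_vel_xy, cmd_vel_xy, sigma=0.25):
    """Style 1: reward matching the commanded planar velocity exactly.
    Harsh on a robot that can only walk well at one particular speed."""
    err = np.sum((cmd_vel_xy - local_vel_xy) ** 2)
    return np.exp(-err / sigma)

def heading_progress_reward(world_vel_xy, cmd_heading, torques, torque_coef=1e-4):
    """Style 2: reward any velocity along the commanded heading, minus a torque
    penalty. Easier to satisfy, but the torque term can drag speed toward zero."""
    heading_dir = np.array([np.cos(cmd_heading), np.sin(cmd_heading)])
    progress = np.dot(world_vel_xy, heading_dir)        # speed along the heading
    torque_penalty = torque_coef * np.sum(np.square(torques))
    return progress - torque_penalty
```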

I want a hybrid of the first reward structure and the current command-input style I was going for. We can basically set the target location to:

  • Walkbot_Pos + Command_Vel * dt
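A quick sketch of that hybrid idea, using the names from the bullet above; the exponential shaping is just one reasonable guess at how distance-to-target would be scored.

```python
# Sketch of the hybrid reward: turn the velocity command into a moving position
# target one step ahead, then reward closing the distance to it.
import numpy as np

def position_target(walkbot_pos, cmd_vel, dt):
    """Target location for the next step: current position plus commanded displacement."""
    return walkbot_pos + cmd_vel * dt

def hybrid_reward(next_pos, walkbot_pos, cmd_vel, dt, sigma=0.05):
    """Reward how close the robot actually got to the commanded target."""
    target = position_target(walkbot_pos, cmd_vel, dt)
    err = np.sum((target - next_pos) ** 2)
    return np.exp(-err / sigma)
```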

Anyway, there have been training problems, and I’ve been picking apart the code for bugs for a while now. I think I finally stumbled onto some positive news: providing roll-pitch-yaw data seems to have helped train working policies.
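For reference, the roll-pitch-yaw terms can be pulled straight from the base quaternion with the standard ZYX Euler conversion; this is a generic version, assuming the same (w, x, y, z) convention as the earlier sketch, not my exact code.

```python
# Standard quaternion (w, x, y, z) -> roll/pitch/yaw (ZYX Euler) conversion.
import numpy as np

def quat_to_rpy(q):
    w, x, y, z = q
    roll = np.arctan2(2*(w*x + y*z), 1 - 2*(x*x + y*y))
    pitch = np.arcsin(np.clip(2*(w*y - x*z), -1.0, 1.0))  # clip guards numerical drift
    yaw = np.arctan2(2*(w*z + x*y), 1 - 2*(y*y + z*z))
    return np.array([roll, pitch, yaw])
```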

