This is an intuitive and promising new paper from the folks at OpenAI. This work opens up avenues to make learning some tasks practical that are otherwise simply out of reach with today's popular RL techniques.
Sparse rewards are one of the big challenges in RL today, closely related to the classic credit assignment problem. If you get little to no learning signal - you failed, you failed, you failed - how is that useful? This work addresses “sample-efficient learning from rewards which are sparse and binary. It can be combined with an arbitrary off-policy RL algorithm”. They test this on three robotic manipulation tasks: pushing, sliding, and pick-and-place, each using only a binary reward indicating whether or not the task is completed. Because the rewards are sparse, these tasks are considered hard to solve with today’s dominant deep RL methods.
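To make that concrete, here is a minimal sketch of what such a sparse, binary reward might look like for a goal-reaching task. The distance check and tolerance threshold are my own illustrative assumptions, not the paper's exact definition:

```python
import numpy as np

def sparse_binary_reward(achieved, goal, tolerance=0.05):
    """Illustrative sparse, binary reward: 0 when the achieved state is within
    a small tolerance of the goal, -1 otherwise. No distance-based shaping."""
    distance = np.linalg.norm(np.asarray(achieved) - np.asarray(goal))
    return 0.0 if distance < tolerance else -1.0
```

With a signal like this, almost every timestep of a failed episode returns -1, which is exactly the setting where naive RL struggles.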
A lot of RL today involves engineering reward functions each step of the way. You’ll see this is the case in many OpenAI Gym environments. This is cumbersome, doesn’t always make sense, is in many cases a bad proxy for the real goal, and is sometimes outright impossible. So if we can effectively train on binary, sparse reward signals, that could be huge!
The key idea behind Hindsight Experience Replay (HER) is “to learn almost as much from achieving an undesired outcome as from the desired one.” If you shoot at a goal in soccer and miss, you can either learn that the attempt was a failure - a weak learning signal from 0 reward or an almost-uniformly negative reward, like getting -1 at every step - or you can learn that it would have been a good shot had the goal been located where the ball ended up. In other words, the same trajectory provides a positive reward signal for the task of shooting at this imaginary, made-up goal.
Naturally, this way of thinking is only applicable when there are multiple goals that can be achieved. Because of this, the slightly unusual thing they do (compared to the usual RL formulation) is to frame the policy learning problem as a mapping (current state, goal) → action, instead of state → action, as is conventionally done (and likewise for the value function, whose inputs are state and goal), as sketched below.
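A minimal sketch of what a goal-conditioned policy can look like as a network. The architecture and layer sizes here are my own assumptions (the paper uses DDPG with its own hyperparameters); the point is simply that the goal is concatenated to the state as an extra input:

```python
import torch
import torch.nn as nn

class GoalConditionedPolicy(nn.Module):
    """Illustrative policy network mapping (state, goal) -> action.
    The goal is treated as just another part of the observation."""
    def __init__(self, state_dim, goal_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + goal_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, action_dim),
            nn.Tanh(),  # assume actions are bounded in [-1, 1]
        )

    def forward(self, state, goal):
        # Concatenate state and goal so the same network can pursue any goal.
        return self.net(torch.cat([state, goal], dim=-1))
```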
“Instead of shaping the reward we propose a different solution which does not require any domain knowledge. Consider an episode with a state sequence s_1, …, s_T and a goal g ≠ s_1, …, s_T, which implies that the agent received a reward of -1 at every timestep. The pivotal idea behind our approach is to re-examine this trajectory with a different goal — while this trajectory may not help us learn how to achieve the state g, it definitely tells us something about how to achieve the state s_T. This information can be harvested by using an off-policy RL algorithm and experience replay where we replace g in the replay buffer by s_T.”
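The replay-buffer substitution the quote describes is easy to sketch. Assuming transitions are stored as dictionaries with an "achieved_next" field (the state actually reached after each action) and using a sparse reward like the one above, the simplest of the paper's replay strategies (replacing g with the final state of the episode) might look roughly like this; the data layout is my own assumption:

```python
def relabel_with_hindsight(episode, compute_reward):
    """Sketch of hindsight relabeling: alongside each original transition,
    store a copy whose goal is replaced by the state actually achieved at the
    end of the episode, with the reward recomputed for that substituted goal."""
    final_achieved = episode[-1]["achieved_next"]  # where the episode actually ended up
    augmented = []
    for t in episode:
        augmented.append(t)  # original transition, with the missed goal g
        hindsight = dict(t)
        hindsight["goal"] = final_achieved
        hindsight["reward"] = compute_reward(t["achieved_next"], final_achieved)
        augmented.append(hindsight)
    return augmented
```

Because the relabeled transitions describe a goal that was actually reached, some of them carry a non-negative reward, which is exactly the learning signal the original sparse task fails to provide.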
Head over to arXiv to read about the details of the algorithm, the experimental setup, and more cool stuff.