Learning Hierarchies to Solve a Range of RL Tasks


Reinforcement learning algorithms are typically used to learn individual tasks from scratch. While a large body of research seeks to improve the sample efficiency of reinforcement learning algorithms (e.g., TRPO), there is a limit to how fast an agent can learn in the absence of prior knowledge. This is especially true for model-free algorithms, which start out assuming little to nothing about the dynamics of the environment.

Intuitively, we humans draw on a shared set of important skills across many tasks - walking, object manipulation, and other skills are broadly applicable to much of what we do daily. It would be nice if we could get agents to learn such skills and choose the right one for each task. This paper by Kevin Frans (who, by the way, was a high school student when he produced this work at OpenAI) addresses the problem in a creative way. The goal is similar to that of meta learning - to learn faster on unseen tasks by sharing information between different but related tasks. Suppose we want to do well on a whole set of tasks. Learning a single policy for all of them might not be optimal, because each task might have a much better policy when trained on that task alone.

In this approach, the main idea is to train a set of “sub-policies”, which can be interpreted as skills, along with a master policy that, given a task, picks which sub-policy to use at each point. The authors train a single model over a distribution of tasks, with the goal of maximizing the expected return per episode over that distribution.
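To make the structure concrete, here is a minimal sketch of how such a hierarchy might be wired up, assuming discrete actions and a fixed number of sub-policies. The class names, linear softmax parameterization, and dimensions below are my own illustration, not the paper's implementation.

```python
import numpy as np

class SubPolicy:
    """One skill: a tiny linear softmax policy over a discrete action space."""
    def __init__(self, obs_dim, n_actions, rng):
        self.W = rng.normal(scale=0.1, size=(n_actions, obs_dim))

    def act(self, obs):
        logits = self.W @ obs
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        return int(np.random.choice(len(probs), p=probs))

class MasterPolicy:
    """Picks which of the K sub-policies to run, given the current observation."""
    def __init__(self, obs_dim, n_subpolicies, rng):
        self.W = rng.normal(scale=0.1, size=(n_subpolicies, obs_dim))

    def choose(self, obs):
        logits = self.W @ obs
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        return int(np.random.choice(len(probs), p=probs))

# Acting with the hierarchy: the master selects a skill, and that skill
# produces the actual environment action.
rng = np.random.default_rng(0)
obs_dim, n_actions, K = 8, 4, 3
subpolicies = [SubPolicy(obs_dim, n_actions, rng) for _ in range(K)]
master = MasterPolicy(obs_dim, K, rng)

obs = rng.normal(size=obs_dim)
k = master.choose(obs)            # which skill to use
action = subpolicies[k].act(obs)  # what that skill actually does
```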

At a high level, the algorithm works as follows. For each sampled task, there is first a warm-up period in which only the master policy is trained: the sub-policies are held fixed, and the master learns which one to select for the task. Then the chosen sub-policies are trained on the task as well, while the master policy still gets a chance to revise its selection once every few steps. After a while, a different task is sampled and the process repeats, so that the master policy gets plenty of opportunity to choose the right sub-policy and to improve (by gradient descent) how it makes that decision.
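Below is a rough sketch of that training schedule, just to pin down the structure. The rollout and update helpers are placeholders to be filled in with your preferred policy-gradient method, and all names and hyperparameter values here are illustrative assumptions, not the paper's code.

```python
def train_hierarchy(master, subpolicies, sample_task, collect_rollouts,
                    update_master, update_subpolicies,
                    num_task_samples=100, N=10,
                    warmup_iters=20, joint_iters=50):
    """Alternate between warm-up (master only) and joint training, then switch
    tasks. `collect_rollouts` is expected to let the master pick a sub-policy
    every N environment steps and run that sub-policy in between."""
    for _ in range(num_task_samples):
        task = sample_task()  # draw the next task from the distribution

        # Warm-up: sub-policies are frozen; only the master policy learns
        # which sub-policy to select for this particular task.
        for _ in range(warmup_iters):
            rollouts = collect_rollouts(task, master, subpolicies, N)
            update_master(master, rollouts)

        # Joint period: the selected sub-policies are optimized on the task,
        # while the master keeps refining its selections every N steps.
        for _ in range(joint_iters):
            rollouts = collect_rollouts(task, master, subpolicies, N)
            update_master(master, rollouts)
            update_subpolicies(subpolicies, rollouts)
```

Because the master only acts once every N steps, its effective horizon is 1/N of the sub-policies', which is exactly the point the authors make below.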

“By discovering a strong set of sub-policies, learning on new tasks can be handled solely by updating the master policy. Furthermore, since the master policy chooses actions only every N time steps, it sees a learning problem with a horizon that is only 1/N times as long.”

The intuition is that the warm-up period gets the master policy to pick the right sub-policy for the task; the chosen sub-policies are then optimized further, while the master policy is still allowed to update its sub-policy decision-making every few steps.

Head over to arXiv to read the full paper, Meta Learning Shared Hierarchies, for the details of the algorithm, the experimental setup, and more cool stuff.