Model-Agnostic Meta-Learning

One of the big open problems in machine learning that researchers are now taking a crack at is “meta-learning”. Meta-learning is concerned with the following question: if we have learned to perform 100 tasks, can we learn the 101st a lot more efficiently? Note that the question is not whether we can do better on the 101st task straight away by reusing knowledge from the previous tasks. It is whether, after training on the 101st task for just a few steps, we can learn it more efficiently than we would from scratch.

Ideally, a good meta-learner can explore more intelligently, avoid trying actions that are known to be useless, and acquire the right features more quickly. Today we will look at one such approach, Model-Agnostic Meta-Learning (MAML). The paper is by Chelsea Finn and coauthors at the Berkeley AI Research lab. In their work, they train the model with the objective of getting its parameters to a region from which a small number of gradient steps is enough to generalize well on a new task. The method is applicable not just to RL, but also to supervised regression and classification. Put differently, they want the loss functions of new tasks to be as sensitive as possible to the parameters, so that small local changes to the parameters quickly give good generalization on the new task. At least that’s the goal.
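In symbols, and assuming a single adaptation step, the objective can be sketched as follows. The notation is introduced here only for illustration and is not used elsewhere in this post: $\theta$ denotes the shared initial parameters, $\alpha$ is an assumed inner step size, and $\mathcal{L}_{\mathcal{T}_i}$ is the loss of task $\mathcal{T}_i$.

$$\theta_i' = \theta - \alpha \nabla_\theta \mathcal{L}_{\mathcal{T}_i}(\theta)$$

$$\min_\theta \sum_{\mathcal{T}_i} \mathcal{L}_{\mathcal{T}_i}(\theta_i')$$

The minimization is over $\theta$, but each loss is evaluated at the adapted parameters $\theta_i'$. That is what pushes $\theta$ toward a region from which one gradient step already does well.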

They consider the k-shot learning setting. This means that they want to do well on a new task after training on only $k$ episodes (or $k$ training examples in the case of supervised learning). Formulated this way, the meta-learner treats entire tasks as training examples.
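To make "tasks as training examples" concrete, here is a minimal sketch of what a meta-batch could look like in the supervised case. Everything in it is an illustrative assumption rather than something taken from this post: the toy family of sinusoid regression tasks, the sampling ranges, the `sample_sinusoid_task` helper, and the `support`/`query` names.

```python
import math
import random


def sample_sinusoid_task(rng, k):
    """Draw one hypothetical task: regress y = amplitude * sin(x - phase).

    The ranges below are illustrative assumptions, not values from this post.
    """
    amplitude = rng.uniform(0.1, 5.0)
    phase = rng.uniform(0.0, math.pi)

    def draw(n):
        xs = [rng.uniform(-5.0, 5.0) for _ in range(n)]
        return [(x, amplitude * math.sin(x - phase)) for x in xs]

    # k examples to adapt on, plus fresh examples from the same task
    # to measure how well the adaptation generalizes.
    return {"support": draw(k), "query": draw(k)}


rng = random.Random(0)
# One "training example" for the meta-learner is a whole task.
meta_batch = [sample_sinusoid_task(rng, k=5) for _ in range(4)]
```

Each element of `meta_batch` plays the role that a single labeled example plays in ordinary supervised learning.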

The algorithm is quite simple and principled. They first sample a batch of tasks. They then adapt a copy of the parameters to each of these tasks, taking one or a few gradient steps on $k$ examples from that task. Next, they evaluate each adapted model on new data from the same task. Let's call this the test error. The initial parameters (the ones we had before adapting to each task individually) are then adjusted based on how the test error, summed over the sampled tasks, changes with respect to those initial parameters. If you work out the math, you will see that this backprop goes through the adaptation steps themselves and therefore involves second-order derivatives, which an autodiff library such as TensorFlow can compute.
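The loop described above can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not the authors' implementation: the model is a single scalar `theta` fitting `y = slope * x`, each task has its own hypothetical slope, there is exactly one inner gradient step on `k` examples, and the step sizes `alpha` and `beta` are made up. Because the model is so small, the second-order term is written out by hand. By the chain rule, the derivative of the test loss with respect to the initial parameter is the test gradient at the adapted parameter times `1 - alpha * (second derivative of the training loss)`. With a real network, an autodiff library would compute that product.

```python
import random


def sample_task(rng, k):
    """Hypothetical task: fit y = slope * x. Returns (train, held-out) splits of k points each."""
    slope = rng.uniform(-2.0, 2.0)

    def draw():
        xs = [rng.uniform(-1.0, 1.0) for _ in range(k)]
        return xs, [slope * x for x in xs]

    return draw(), draw()


def loss(theta, data):
    """Mean squared error of the scalar model y_hat = theta * x."""
    xs, ys = data
    return sum((theta * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def grad(theta, data):
    """First derivative of the loss with respect to theta."""
    xs, ys = data
    return sum(2.0 * x * (theta * x - y) for x, y in zip(xs, ys)) / len(xs)


def curvature(data):
    """Second derivative of the loss with respect to theta (constant for this model)."""
    xs, _ = data
    return sum(2.0 * x * x for x in xs) / len(xs)


def adapt(theta, train_data, alpha):
    """Inner loop: one gradient step on the task's training split."""
    return theta - alpha * grad(theta, train_data)


def meta_loss(theta, batch, alpha):
    """Test error after adaptation, averaged over the sampled tasks."""
    return sum(loss(adapt(theta, tr, alpha), te) for tr, te in batch) / len(batch)


def meta_gradient(theta, batch, alpha):
    """Derivative of meta_loss with respect to the initial theta."""
    total = 0.0
    for tr, te in batch:
        adapted = adapt(theta, tr, alpha)
        # d(adapted)/d(theta): this is where the second-order term enters.
        d_adapted = 1.0 - alpha * curvature(tr)
        total += grad(adapted, te) * d_adapted
    return total / len(batch)


def train(steps=300, meta_batch=8, k=5, alpha=0.1, beta=0.05, seed=0):
    """Outer loop: sample tasks, then update the initial theta with the meta-gradient."""
    rng = random.Random(seed)
    theta = 3.0  # arbitrary starting point
    for _ in range(steps):
        batch = [sample_task(rng, k) for _ in range(meta_batch)]
        theta -= beta * meta_gradient(theta, batch, alpha)
    return theta
```

Since the hypothetical slopes are drawn symmetrically around zero, `train()` should move `theta` from its starting value toward the middle of that range, which is the initialization from which a single step helps most on average. Dropping the `d_adapted` factor would give a first-order approximation that ignores the second-order term.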

By framing a task as having a loss function $L(x_1, a_1, \ldots, x_n, a_n)$ over an episode of length $n$, where $n = 1$ gives the supervised learning case (each episode is a single input-output pair), MAML can be applied to RL as well as to supervised regression and classification problems.

Head over to arXiv to read about the details of the algorithm, the experimental setup and more cool stuff.