Designing Neural Net Architectures with Reinforcement Learning


In this paper from Google Brain, the authors train an RNN (called the controller) using reinforcement learning to generate model descriptions of neural networks, with the goal of maximizing the expected performance of the generated model on image recognition and language tasks. This work is one of the key ideas behind AutoML, the new machine learning service that’s part of Google Cloud.

This paper shows how to generate CNN and RNN architectures for image recognition and language tasks respectively. The work rests on the observation that the connectivity structure of a neural net can be represented as a string, and it shows how to train the controller RNN to generate good strings from a reward signal: the validation-set performance of the generated network after it has been trained. The authors mention a range of related work, such as neuro-evolution algorithms, hyperparameter optimization algorithms, probabilistic program induction, learning to learn by gradient descent by gradient descent, and a few others that we won’t go into here. They use the simple policy-gradient-based REINFORCE algorithm: generate a network, train it, take its validation accuracy as the reward, compute policy gradients from that reward, and update the controller RNN.
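To make the idea concrete, here is a minimal sketch (in PyTorch, not the authors' code) of a controller RNN that emits an architecture description one token at a time. The per-step vocabularies and hidden size below are illustrative choices, not the values used in the paper.

```python
# A minimal sketch of a controller RNN that samples a CNN description token by token.
# The vocabularies are illustrative; the paper predicts fields such as filter height,
# filter width, stride, and number of filters for each layer.
import torch
import torch.nn as nn

CHOICES = {
    "filter_height": [1, 3, 5, 7],
    "filter_width":  [1, 3, 5, 7],
    "num_filters":   [24, 36, 48, 64],
}

class Controller(nn.Module):
    def __init__(self, hidden_size=64):
        super().__init__()
        self.cell = nn.LSTMCell(hidden_size, hidden_size)
        self.embed = nn.ModuleDict(
            {k: nn.Embedding(len(v), hidden_size) for k, v in CHOICES.items()}
        )
        self.head = nn.ModuleDict(
            {k: nn.Linear(hidden_size, len(v)) for k, v in CHOICES.items()}
        )
        self.start = nn.Parameter(torch.zeros(1, hidden_size))

    def sample(self, num_layers=3):
        """Sample an architecture string; return the tokens and their summed log-prob."""
        h = c = torch.zeros(1, self.cell.hidden_size)
        inp, tokens, log_probs = self.start, [], []
        for _ in range(num_layers):
            for key in CHOICES:
                h, c = self.cell(inp, (h, c))
                dist = torch.distributions.Categorical(logits=self.head[key](h))
                idx = dist.sample()
                tokens.append((key, CHOICES[key][idx.item()]))
                log_probs.append(dist.log_prob(idx))
                inp = self.embed[key](idx)  # feed the sampled choice back in as the next input
        return tokens, torch.stack(log_probs).sum()
```

Calling `Controller().sample()` yields a token list like `[("filter_height", 3), ("filter_width", 3), ("num_filters", 48), ...]` together with the summed log-probability needed for the policy gradient.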

As a quick refresher, the REINFORCE algorithm involves sampling a trajectory, i.e. a sequence of actions. Here that corresponds to sampling a child network architecture, computing the reward for that generated network (its accuracy on a held-out set), and then updating the policy using the score-function gradient estimator trick. As is standard in policy gradient methods, they use a baseline to reduce the variance of the gradient estimates; their baseline is an exponential moving average of the accuracies of the previously generated architectures. They parallelize the sampling using a parameter-server architecture: multiple controller replicas sample architectures in parallel, each batch of architectures generated by a controller is trained on the CIFAR-10 dataset, and the resulting accuracies are sent back as rewards to the controller that produced them. Each controller then computes gradients from these rewards and sends them to the parameter-server replicas, which perform a gradient update step.
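Here is a hedged sketch of a single such update, reusing the `Controller` from the sketch above. `train_and_evaluate` is a placeholder for training a child network on CIFAR-10 and returning its held-out validation accuracy; the decay rate and batch size are illustrative, not the paper's values.

```python
# A minimal sketch of a REINFORCE update with an exponential-moving-average baseline.
import torch

def reinforce_step(controller, optimizer, train_and_evaluate,
                   baseline, ema_decay=0.95, batch_size=8):
    optimizer.zero_grad()
    losses = []
    for _ in range(batch_size):
        tokens, log_prob = controller.sample()   # one child architecture
        reward = train_and_evaluate(tokens)      # validation accuracy in [0, 1]
        if baseline is None:
            baseline = reward
        # Score-function estimator: maximize E[(R - b) * log pi(a)],
        # so minimize its negative.
        losses.append(-(reward - baseline) * log_prob)
        baseline = ema_decay * baseline + (1 - ema_decay) * reward
    torch.stack(losses).mean().backward()
    optimizer.step()
    return baseline
```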

The authors briefly mention how to generate other layer types, like batch norm, max pooling, and skip connections, but we won’t go into the details here. It basically involves extending the string the controller generates with a scheme that encodes these other layer types, but the main idea stays the same: a controller RNN generates architectures, followed by policy updates with the REINFORCE algorithm.
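As a purely illustrative example (not the paper's exact encoding scheme), the token stream could be extended with a layer-type token followed by only the fields relevant to that type:

```python
# Illustrative extension of the token scheme; the vocabularies are made up.
LAYER_CHOICES = {
    "conv":      ["filter_height", "filter_width", "num_filters"],
    "maxpool":   ["pool_size", "stride"],
    "batchnorm": [],  # no extra parameters to predict
}

def decode(tokens):
    """Turn a flat token list like ['conv', 3, 3, 48, 'maxpool', 2, 2, 'batchnorm']
    into a list of (layer_type, params) pairs."""
    layers, i = [], 0
    while i < len(tokens):
        layer_type = tokens[i]
        n = len(LAYER_CHOICES[layer_type])
        layers.append((layer_type, tokens[i + 1:i + 1 + n]))
        i += 1 + n
    return layers
```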

Next, let’s look at how to generate recurrent architectures. The goal is to have the controller generate a functional form for the hidden state h_t, with the inputs being x_t and h_{t-1}. In other words, we’d like to generate the functional form of the RNN cell (examples of these are the LSTM cell, the basic RNN cell, etc.). Instead of using an LSTM, we want to generate a good cell. The paper frames this problem as follows: think of the cell computation as a tree, with each node labeled by an index number. The controller’s job is to label each node with a combination function (addition, element-wise multiplication, etc.) and an activation function (ReLU, sigmoid, etc.). Then, as before, we train the sampled RNN, feed its validation accuracy as the reward to the controller, and perform policy updates.
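To make this concrete, here is a minimal sketch of assembling an RNN cell from a tree the controller has labeled. The two-leaf tree, the specific combination/activation vocabularies, and the omission of the memory-cell inputs c_{t-1}/c_t used in the paper are simplifications for illustration.

```python
# A minimal sketch of building an RNN cell from a controller-labelled tree.
# Each leaf combines x_t and h_{t-1} with a (combination, activation) pair;
# the root combines the two leaves the same way to produce h_t.
import torch
import torch.nn as nn

COMBINE = {"add": torch.add, "mul": torch.mul}
ACTIVATE = {"tanh": torch.tanh, "relu": torch.relu, "sigmoid": torch.sigmoid}

class TreeCell(nn.Module):
    def __init__(self, input_size, hidden_size, labels):
        """labels: the controller's choices, e.g.
        [("add", "tanh"), ("mul", "relu"), ("add", "sigmoid")]
        for a tree with two leaves (first two entries) and one root (last entry)."""
        super().__init__()
        self.labels = labels
        # One linear map of x_t and one of h_{t-1} per leaf node.
        self.wx = nn.ModuleList(nn.Linear(input_size, hidden_size) for _ in range(2))
        self.wh = nn.ModuleList(nn.Linear(hidden_size, hidden_size) for _ in range(2))

    def forward(self, x_t, h_prev):
        leaves = []
        for i in range(2):
            comb, act = self.labels[i]
            leaves.append(ACTIVATE[act](COMBINE[comb](self.wx[i](x_t), self.wh[i](h_prev))))
        comb, act = self.labels[2]
        return ACTIVATE[act](COMBINE[comb](leaves[0], leaves[1]))  # new hidden state h_t
```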

During the training of the controller, they use a schedule that increases the number of layers in the child networks as training progresses, in both the CNN and the RNN case. They mention experimental details of training on CIFAR-10 and Penn Treebank which we won’t go into here.
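For illustration only (the specific numbers below are made up, not the paper's schedule), such a depth schedule might look like:

```python
# Illustrative depth schedule: every `period` controller steps, grow the child
# networks by `step` layers, starting from `start` layers.
def child_depth(controller_step, start=6, step=2, period=1600):
    return start + step * (controller_step // period)
```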

Head over to arXiv to read about the details of the algorithm, the experimental setup and more cool stuff.