Calendar of Events
Renormalizing the Optimal Hyperparameters of a Neural Network
Speaker: Greg Yang (Microsoft Research)
Hyperparameter tuning in deep learning is an expensive process, prohibitively so for neural networks (NNs) with billions of parameters that often can only be trained once. We show that, in the recently discovered Maximal Update Parametrization (μP), many optimal hyperparameters remain stable even as model size changes. Using this insight, for example, we are able to re-tune the 6.7-billion-parameter model of GPT-3 and obtain performance comparable to the 13-billion-parameter model of GPT-3, effectively doubling the model size.
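The core idea can be illustrated with a toy sketch. Under μP-style scaling, per-layer initialization and (Adam) learning-rate multipliers are adjusted with network width so that a base learning rate tuned on a small model remains near-optimal at larger widths. The function name and the base width below are hypothetical, and the scaling rules are a simplification of the parametrization described in the talk, not its full specification:

```python
def mup_like_scales(width, base_width=64):
    """Hedged sketch of width-dependent multipliers in a muP-style
    parametrization; see Yang et al. for the exact rules."""
    ratio = width / base_width
    return {
        # Hidden-weight init std shrinks like 1/sqrt(fan_in).
        "hidden_init_std": (1.0 / width) ** 0.5,
        # Adam learning rate on hidden weights scales like 1/width,
        # so the base LR tuned at base_width transfers to larger widths.
        "hidden_lr_mult": 1.0 / ratio,
        # Output logits are scaled down as width grows.
        "output_mult": 1.0 / ratio,
    }

# Tune the base learning rate once at width 64, then reuse it at
# width 1024 by applying these multipliers instead of re-searching.
small = mup_like_scales(64)    # all multipliers are 1.0
large = mup_like_scales(1024)  # hidden_lr_mult = 1/16
```

The point of the sketch: the expensive search happens once at small width, and the multipliers, not a fresh sweep, carry the result to the large model.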
In this context, there is a rich analogy we can make to Wilsonian effective field theory. For example, if “coupling constants” in physics correspond to “optimal hyperparameters” in deep learning and “cutoff scale” corresponds to “model size”, then we can say “μP is a renormalizable theory of neural networks.” We finish by formulating the question of whether there is a “Grand Unifying Theory” of neural networks at scale that can inform our quest toward general intelligence.