Achieving Noob gains in AI
I explain why I think AI research has been slowing down, not speeding up, in the past few years.
This is a repost from LessWrong. The discussion there was great, so I recommend taking a look if you find this post interesting, regardless of whether you agree. One clarification that arose from that discussion: this post focuses on “fundamental” research progress, not on everything that has been accomplished in the field.
How have your expectations for the future of AI research changed in the past three years? Based on recent posts in this forum, it seems that results in text generation, protein folding, image synthesis, and other fields have accomplished feats beyond what was thought possible. From a bird's eye view, it seems as though the breakneck pace of AI research is already accelerating exponentially, which would make the safe bet on AI timelines quite short.
This way of thinking misses the reality on the front lines of AI research. Apart from throwing more computation at the problem, innovation is stalling, and the forces that made scaling computation cheaper or more effective are slowing. The past three years of AI results have been dominated by wealthy companies throwing very large models at novel problems. While this expands the economic impact of AI, it does not accelerate AI development.
To figure out whether AI development is actually accelerating, we need to answer a few key questions:
What has changed in AI in the past three years?
Why has it changed, and what factors have allowed that change?
How have those underlying factors changed in the past three years?
By answering these fundamental questions, we can get a better understanding of how we should expect AI research to develop over the near future. And maybe along the way, you'll learn something about lifting weights too. We shall see.
What has changed in AI research in the past three years?
Gigantic models have achieved spectacular results on a large variety of tasks.
How large is the variety of tasks? In terms of domain area, quite varied. Advances have been made in major hard science problems like protein folding, imaginative tasks like creating images from descriptions, and playing complex games like StarCraft.
How large is the variety of models used? While each model features many domain-specific model components and training components, the core of each of these models is a giant transformer trained with a variant of gradient descent, usually Adam.
How large are these models? That depends. DALL-E 2 and AlphaFold are O(10GB), AlphaStar is O(1GB), and the current state-of-the-art few-shot NLP models (Chinchilla) are O(100GB).
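As a rough sanity check on the largest of these sizes, here is a minimal sketch that converts a parameter count into the storage needed for the weights. It assumes that "size" means the size of the stored weights and that each parameter takes 2 bytes (16-bit precision); both are my assumptions, and `weights_size_gb` is a hypothetical helper. The parameter counts (70 billion for Chinchilla, 175 billion for GPT-3) are the published figures.

```python
def weights_size_gb(num_params, bytes_per_param=2):
    """Approximate size of a model's weights in gigabytes.

    bytes_per_param=2 assumes 16-bit weights (an assumption, not a
    property of any particular released checkpoint).
    """
    return num_params * bytes_per_param / 1e9


# Published parameter counts.
chinchilla_params = 70e9
gpt3_params = 175e9

print(f"Chinchilla: ~{weights_size_gb(chinchilla_params):.0f} GB")
print(f"GPT-3:      ~{weights_size_gb(gpt3_params):.0f} GB")
```

Under these assumptions Chinchilla lands in the O(100GB) range. Storing weights in 32-bit precision would double both numbers.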
One of the most consistent findings of the past decade of AI research is that larger models trained with more data get better results, especially transformers. If all of these models are built on top of the same underlying architecture, why is there so much variation in size?
Think of training models like lifting weights. What limits your ability to lift heavy weights?
Data availability (nutrition): If you don't eat enough food, you'll never gain muscle! Data is the food that makes models learn, and the more "muscle" you want, the more "food" you need. When looking for text on the internet, it is easy to get terabytes of data to train a model. This is harder for other tasks.
Cost (exhaustion): No matter how rich your corporation is, training a model is expensive. Each polished model you see comes after a lot of experimentation and trials, which use a lot of computational resources. AI labs are notorious cost sinks. The talent they acquire is expensive, and in addition to their salaries, the talent demands access to top-of-the-line computational resources.
Training methodology (which exercises you do): NLP models only require training one big transformer. More complex models like DALL-E 2 and AlphaFold have many subcomponents optimized for their use cases. Training an NLP model is like deadlifting a loaded barbell and training AlphaFold is like lifting a box filled with stuff: at equivalent weight, the barbell is much easier to pick up because the load is balanced, uniform, and moved in one motion. When picking up the box, the weight is unevenly distributed, which makes the task harder. AlphaStar was trained by creating a league of AlphaStars which competed against each other in actual games. To continue our weightlifting analogy, this is like a higher rep range with lower weight.
Looked at this way, what has changed over the past three years? In short, we have discovered how to adapt a training method/exercise (the transformer) to a variety of use cases. This exercise allows us to engage our big muscles (scalable hardware and software optimized for transformers). Sure, some of these applications are more efficient than others, but overall they are way more efficient than what they were competing against. We have used this change in paradigm to "lift more weight", increasing the size and training cost of our model to achieve more impressive results.
(Think about how AlphaFold2 and DALL-E 2, despite mostly being larger versions of their predecessors, drew more attention than their predecessors ever did. The prior work in the field paved the way by figuring out how to use transformers to solve these problems, and the attention came when they scaled the solution to achieve eye-popping results. In our weightlifting analogy, we are learning a variation of an exercise. The hard part is learning the form that allows you to leverage the same muscles, but the impressive-looking part is adding a lot of weight.)
Why and how has it changed?
In other words: why are we only training gigantic models and getting impressive results now?
There are many reasons for this, but the most important one is that no one had the infrastructure to train models of this size efficiently before.
How have those underlying factors changed in the past three years?
TL;DR: not much. We haven't gotten stronger in the past four years; we have just done a bunch of different exercises that use the same muscles.
All of the advances I mentioned in the last section were from 2018 or earlier.
(For the purists, self-supervised learning went mainstream for vision in 2020 by finally outperforming supervised learning.)
Chips are not getting twice as fast every two years like they used to (Moore's law is dying). The cost of a single training run for the largest ML models is on the order of ten million dollars. Adding more GPUs and more computation is pushing against the amount that companies are willing to burn on services that don't generate money for the company. Unlike the prior four years, we cannot scale up the size of models by a thousand times again. No one is willing to spend billions of dollars on training runs yet.
From a hardware perspective, we should expect the pace of innovation to slow in the coming years.
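To make the arithmetic behind this concrete, here is a back-of-the-envelope sketch. It assumes that training cost grows linearly with model size (a simplification on my part, since larger models are usually also trained on more data) and uses the roughly ten million dollar figure from above. The function names are hypothetical.

```python
import math


def scaled_run_cost(current_cost_usd, scale_factor):
    """Cost of a run `scale_factor` times larger, assuming cost is linear in size."""
    return current_cost_usd * scale_factor


def years_to_scale(factor, doubling_period_years=2.0):
    """Years of hardware progress needed to absorb `factor`, given a doubling period."""
    return math.log2(factor) * doubling_period_years


cost = scaled_run_cost(10e6, 1000)
years = years_to_scale(1000)

print(f"1000x larger run at today's prices: ${cost:,.0f}")
print(f"Years of 2-year doublings to absorb 1000x: {years:.1f}")
```

Under these assumptions, another thousandfold scale-up costs on the order of ten billion dollars today, and waiting for hardware to absorb it would take about two decades even if chips still doubled in speed every two years.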
Software advances are mixed. Using ML models is becoming easier by the day. With libraries like Hugging Face, a single line of code can run a state-of-the-art model for your particular use case. There is a lot of room for software innovations that make models easier to use for non-technical audiences, but right now very little research is bottlenecked by software.
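For a sense of how little code this takes, here is a minimal sketch using the `pipeline` helper from the Hugging Face `transformers` library. The task name and the wrapper function `classify_sentiment` are my choices for illustration, and the default checkpoint the library picks for this task is not necessarily state of the art.

```python
def classify_sentiment(text):
    """Run a pretrained sentiment classifier on `text` (hypothetical helper)."""
    # Imported lazily so the sketch can be read without the library installed.
    from transformers import pipeline

    # No model name is given, so the library falls back to its default
    # checkpoint for this task and downloads it on first use.
    classifier = pipeline("sentiment-analysis")

    # Returns a list of dicts, each with a "label" and a "score" key.
    return classifier(text)


# Example call (needs `pip install transformers` plus a backend such as PyTorch):
# classify_sentiment("Scaling transformers has been a productive few years.")
```

The one line that matters is the `pipeline(...)` call; everything else is wrapping.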
Research advances are the X factor. Lots of people are working on these problems, and it's possible there is a magic trick for intelligence at existing compute budgets. That is, and always was, true. However, the most important research advances of the last few years primarily enabled us to use more GPUs for a given problem. Now that we are starting to run up against the limits of data acquisition and monetary cost, less low-hanging fruit is available.
(Side note: Even Facebook has trouble training current state-of-the-art models. Here are some chronicles of them trying to train a GPT-3-sized model.)
I don't think we should expect performance gains in AI to accelerate over the next few years. As a researcher in the field, I expect the next few years will involve a lot of advances in the "long tail" of use cases and less growth in the most studied areas. This is because we have already collected the easy gains from hardware and software over the past decade.
This is my first time posting to LessWrong, and I decided to post a lightly edited first draft because if I start doing heavy edits I don't stop. Every time I see a very fast AGI prediction or someone claiming Moore's law will last a few more decades I start to write something, but this time I actually finished it before deciding to rewrite. As a result, it isn't an airtight argument, but more my general feelings as someone who has been at two of the top research institutions in the world.