Last month I published an initial (very rough) draft of an idea I had for giving users control over how the data they produce is used by recommendation systems. Today I wish to focus on one reason that recommendation systems should be regulated: the benefits of scaling that famously apply to language models also apply to recommendation systems. While applications of recommendation systems are perhaps less visible than LLMs, they are far more prevalent and impactful in our daily lives. As a result, making sure that they are safe/aligned is just as, if not more, important.
Background: Scaling Laws
A scaling law refers to a robust relationship between the computational cost of an algorithm and its performance on certain benchmarks. The most commonly discussed scaling laws are the so-called "Kaplan" and "Chinchilla" scaling laws, which govern the relationship between the training cost of transformer models, measured in numeric operations[1], and their language modeling capability, measured using next-token prediction[2].
For LLMs, we care about these scaling laws because training cost measured in operations correlates well with training cost measured in time or money, and language modeling capability correlates well with performance on downstream use cases we care about.
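To make the shape of these laws concrete, here is a minimal Python sketch of a Chinchilla-style parametric scaling law. The functional form L(N, D) = E + A/N^α + B/D^β and the C ≈ 6ND compute approximation come from the scaling-law literature; the specific constants below are illustrative placeholders, not a published fit.

```python
# A minimal sketch of a Chinchilla-style parametric scaling law:
#   L(N, D) = E + A / N**alpha + B / D**beta
# where N is parameter count, D is training tokens, and L is cross-entropy loss.
# The constants below are illustrative placeholders, not the published fit.

def loss(n_params: float, n_tokens: float,
         E: float = 1.7, A: float = 400.0, B: float = 400.0,
         alpha: float = 0.34, beta: float = 0.28) -> float:
    """Predicted cross-entropy loss for a model of n_params trained on n_tokens."""
    return E + A / n_params**alpha + B / n_tokens**beta

# Training compute is roughly C ~ 6 * N * D FLOPs; for a fixed compute budget,
# the scaling law tells you how to trade parameters against data.
for n_params in [1e9, 1e10, 1e11]:
    n_tokens = 1e23 / (6 * n_params)   # spend a fixed 1e23 FLOP budget
    print(f"N={n_params:.0e}, D={n_tokens:.0e}, "
          f"predicted loss={loss(n_params, n_tokens):.3f}")
```

The point of the illustration is only that loss falls predictably as compute grows, and that the same budget can be spent on more parameters or more data.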
In retrospect, the reason that transformers ended up being such a big deal for NLP can be summarized as follows:
1. On modern GPUs, transformers with a given FLOPs budget train much faster than prior architectures (e.g., RNNs).
2. Keeping the FLOPs budget constant, transformers seem to scale "better" than prior architectures (see Kaplan).
3. Transformers seem to achieve better downstream performance than other architectures with similar language modeling capability (see this for an exploration of one phenomenon).
As a result, scaling transformers became the backbone of the modern era of NLP. However, there were/are a few limitations:
1. The primary way to use transformers for NLP ended up being through generation, which for transformers requires a memory-intensive trick known as "KV caching" (a minimal sketch follows this list). This made serving transformers at scale more difficult.
2. Language modeling was a good but not perfect proxy for many downstream tasks. The goal of fully utilizing the capabilities of these models led to extensive work on prompting / RL / tool usage, etc.
3. We very quickly ended up scaling to the point of using a filtered version of all the open internet language data available. This led to the current interest in synthetic data, which is challenging to use.
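To make the first limitation concrete, here is a minimal numpy sketch of a single attention head decoding with a KV cache: each new token appends its keys and values to a growing cache, which is what makes serving long generations memory hungry. Shapes, names, and the random weights are purely illustrative.

```python
import numpy as np

# Minimal single-head attention decode step with a KV cache (illustrative shapes).
d = 16                                     # head dimension
rng = np.random.default_rng(0)
Wq, Wk, Wv = rng.normal(size=(3, d, d))    # toy projection weights

def decode_step(x_new, cache):
    """Append one token's K/V to the cache and attend over the whole prefix.

    x_new: (d,) embedding of the newest token.
    cache: dict with growing 'K' and 'V' arrays of shape (t, d).
    """
    q = x_new @ Wq
    cache["K"] = np.vstack([cache["K"], x_new @ Wk])   # memory grows with sequence length
    cache["V"] = np.vstack([cache["V"], x_new @ Wv])
    scores = cache["K"] @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ cache["V"]             # attention output for the new position only

cache = {"K": np.zeros((0, d)), "V": np.zeros((0, d))}
for _ in range(5):                          # generating 5 tokens = 5 sequential steps
    out = decode_step(rng.normal(size=d), cache)
print(cache["K"].shape)                     # (5, 16): cache memory scales with tokens generated
```

Without the cache, every new token would recompute keys and values for the entire prefix; with it, the cost per step stays small but the memory footprint grows with every token served.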
What if instead:
1. The primary usage of our models was predicting one future token.
2. We could always directly measure / optimize for our downstream task.
3. Every action of every person interacting with every sensor that data collectors can grab is fair game for the dataset.
Wouldn't you expect robust scaling laws to be even more impactful?
Background: Recommendation Systems
You probably understand recommendation systems in the abstract sense. They are the systems that power your Google searches, Twitter feeds, Instagram ads, and Amazon product recommendations. You probably also know that recommendation systems are profitable. They are the backbone of some of the most valuable companies on the globe. What slips under the radar is just how profitable these recommendation systems are. Alphabet's revenue of $330 billion amounts to $40 per year for every man, woman, and child on this planet, or a third of one percent of world GDP, or more than the GDP of Portugal. That revenue is driven primarily by ads. Improvements to ad revenue will drive the focus of many of the top players in the AI space, including GDM and Meta Research.
These companies are naturally working on recommendation systems that scale as well as transformers scale for natural language processing. While prior approaches to RecSys scale poorly with compute, more recent approaches have achieved much stronger empirical results.
Recommendation Scaling Laws
Advances in LLMs accelerate recommendation systems in two ways: first, LLMs are used to improve the performance of existing recommendation systems (see here for a survey); second, advances in LLMs motivate the design of architectures that scale more efficiently (writing this post was strongly motivated by reading this Meta paper, which does just that).
Both directions will likely be very impactful. For the second, the Meta paper I mentioned above transforms certain problems of recommendation, namely ranking and retrieval, into sequence prediction tasks[3] (a toy sketch of this framing follows). These new architectures adapt lessons from transformers, such as pure self-attention, SwiGLU, and fused PaLM-style parallel layers[4].
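As a toy illustration of that framing (not the paper's actual method), the sketch below flattens a user's interaction history into a token sequence and treats retrieval as next-token prediction. A trivial bigram counter stands in for the transformer-style model, and the item IDs and action names are made up.

```python
# A hedged sketch of the "recommendation as sequence prediction" framing: a user's
# interaction history becomes a token sequence, and retrieval becomes next-token
# prediction over the item/action vocabulary.
from collections import Counter, defaultdict

# Hypothetical interaction logs: each element is (item_id, action) for one user.
histories = [
    [(101, "view"), (101, "click"), (205, "view"), (205, "buy")],
    [(101, "view"), (205, "view"), (205, "click")],
]

def tokenize(history):
    """Flatten (item, action) pairs into tokens like '205:click'."""
    return [f"{item}:{action}" for item, action in history]

# "Train": count next-token transitions across all user sequences
# (a real system would fit a sequence model here instead).
transitions = defaultdict(Counter)
for history in histories:
    tokens = tokenize(history)
    for prev, nxt in zip(tokens, tokens[1:]):
        transitions[prev][nxt] += 1

# "Retrieve": rank candidate next actions for a user whose last token is '205:view'.
print(transitions["205:view"].most_common(3))
```

Once the problem is phrased this way, the same scaling playbook used for language models applies: more interaction data, bigger sequence models, lower prediction loss.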
Critically, these architectures scale much, much better than prior architectures.
Applications of these models avoid the three limitations I highlighted for transformers earlier:
1. In many cases, instead of needing the next five actions the user takes, we need the top five candidates for the next action the user takes. For autoregressive models, the former is roughly five times as expensive as the latter (see the sketch after this list).
2. Click-through rate (CTR) = $$$. We can directly optimize for it!
3. Any interaction with a website can be used as data. Good scaling seems very impactful here!
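Here is a hedged sketch of the first two points: a single forward pass already yields a score per item, so ranking the top five candidates needs no generation at all, and observed clicks give a label that can be optimized directly with a binary cross-entropy loss. The catalogue size, random scores, and click labels are all placeholders.

```python
import numpy as np

# Illustrative scores over an item catalogue from a single forward pass of a
# hypothetical recommender; the shapes and random values are assumptions.
rng = np.random.default_rng(0)
n_items = 10_000
logits = rng.normal(size=n_items)            # one forward pass -> one score per item

# Ranking/retrieval: top-5 candidates for the *next* action, no generation needed.
top5 = np.argsort(logits)[-5:][::-1]
print("top-5 candidate items:", top5)
# Contrast: predicting the user's next *five* actions autoregressively would take
# five sequential forward passes, each conditioned on the previously sampled action.

# Direct objective: observed clicks give per-impression labels, so click-through
# rate can be optimized directly with a binary cross-entropy loss.
clicks = rng.integers(0, 2, size=n_items)     # hypothetical click labels
p = 1 / (1 + np.exp(-logits))
bce = -(clicks * np.log(p) + (1 - clicks) * np.log(1 - p)).mean()
print("illustrative CTR loss:", round(float(bce), 3))
```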
If we integrate these improved recommendation systems with improved large scale multimodal models, the possibilities abound, for better and worse.
So what?
Pretty much every contemporary worry about AI is more applicable to large scale recommendation systems than to large scale language models.
For those of you who worry about compulsive internet usage:
These are the models that maximize engagement.
For those of you who worry about deepfakes and misinformation:
These models are the ones that will spread them to maximize attention.
For those of you who worry about AI turning into Goodharting paperclip maximizers:
These models are far more ruthlessly optimized than LLMs.
For those of you who worry about AI powered state control:
These models directly optimize the behavior of the user.
For those of you who worry that we don't understand the internal workings of large models:
Unlike language and vision applications, not even the inputs and outputs of these models are human readable.
If you worry about the negative implications of AI as it pertains to society in any way, shape, or form, you should be just as worried about AIs that rank and retrieve as you are about AIs that write, draw, and walk.
These models also raise troubling philosophical questions. How addictive and detrimental to your life does an app need to be before we consider it "digital opium", worthy of regulation? Are there training objectives which we believe to be inherently immoral, the way we thought advertising cigarettes to children was? Regulating these systems will be difficult due to their seamless integration in our lives. But I truly believe it to be essential if we prize independent human thought in our future.
Practical considerations
Practically speaking, I think that existing worries about governance are overly focused on language models due to a lack of public, high-quality recommendation systems for research, and the inherent unsexiness of RecSys relative to language and vision.
The problem right now is that the only groups with the tools to do recommendation system alignment / interpretability research are the companies serving these systems, and they have little interest in doing so beyond avoiding scandal.
There is a clear race dynamic and tragedy of the commons here, where social media/entertainment apps compete for attention with ever more engaging content. If the next generation of children is to avoid mass addiction to autogenerated content, or worse, the status quo needs to be disrupted.
I think this is an important problem. I put a (very rough) draft idea here for what greater user control of RecSys could look like, and a friend of mine developed a recommendation "alignment" tool here. These aim to address the current problems with recommendation systems, but I don't think these tools can keep up with better and better systems. We need to research the ways these systems can be used without detrimental consequences so that consumers can make informed decisions, and then give consumers the tools to act on those decisions and/or regulate producers to prevent the worst impacts.
It is presently unclear what a healthy future with recommendation systems looks like for people. Let's aim to clarify that.
[1] Specifically, they measure training cost in model floating point operations and LM capability using cross-entropy loss. "Model" means only mandatory model computations count, "floating point" is the preferred numerical format for these models, and "cross-entropy loss" is explained in the next footnote.
[2] This is measured using cross-entropy loss for language modeling, which is easiest to understand as follows: imagine playing twenty questions to guess the next word in a sentence. The LM's cross-entropy loss (in bits) is roughly the number of yes/no questions you would have to ask to guess the next word.
[3] They instead call this problem sequence transduction, but it's essentially sequence prediction.
[4] While PaLM-style attention did not catch on in transformers, the memory tradeoffs / better compute utilization seem to be worth it in this use case.