Matt Rickard

The RLHF Advantage

Jul 21, 2023
We posit that the superior writing abilities of LLMs, as manifested in surpassing human annotators in certain tasks, are fundamentally driven by RLHF.  — Llama 2

Reinforcement learning from human feedback (RLHF) is one of the most effective techniques for aligning large language model (LLM) behavior with human preferences and instruction following. The simplest form samples human preferences by having annotators choose which of two model outputs they prefer. That feedback is then used to train a reward model.
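To make the mechanism concrete, here is a minimal sketch of the pairwise (Bradley-Terry style) loss commonly used to train a reward model from such comparisons, in the spirit of InstructGPT and Llama 2. The tensors and scores below are illustrative assumptions, not actual training code from either paper.

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise preference loss. reward_chosen / reward_rejected are scalar
    reward-model scores for the annotator-preferred and rejected responses,
    each of shape (batch,)."""
    # Encourage the preferred response to score higher than the rejected one:
    # loss = -log(sigmoid(r_chosen - r_rejected))
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy usage with hypothetical reward-model scores for three preference pairs.
chosen = torch.tensor([1.2, 0.4, 2.0])
rejected = torch.tensor([0.3, 0.9, 1.1])
print(preference_loss(chosen, rejected))  # smaller when chosen consistently outscores rejected
```

The reward model trained this way then supplies the scalar signal that the reinforcement learning step (e.g., PPO) optimizes against.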

OpenAI uses RLHF extensively in its models and spends significant sums on human annotators. Now Meta is doing the same with Llama 2. Interestingly, Meta doesn't even seem to have reached the limits of RLHF's effectiveness with Llama 2:

Scaling trends for the reward model. More data and a larger-sized model generally improve accuracy, and it appears that our models have not yet saturated from learning on training data. — Llama 2

What does this mean? Some ideas:

  • Base models are the commodity; RLHF is the complement. A curated reward model could turn base models into differentiated products (see the sketch after this list).

  • Human annotation is still important. While many thought that the data labeling companies of the last ML wave might be left behind in the age of LLMs, they might be more relevant than ever.

  • Is human preference the right model? It’s frustrating when chat-based models refuse to answer a tricky question. Some models trade off helpfulness for safety. On the other hand, we don’t want to perpetuate the biases and bad parts of the Internet in our models. Obviously a much deeper and more complex topic. 

  • Is RLHF a path-dependent product of OpenAI? Or is it the right long-term strategy? OpenAI is a pioneer of reinforcement learning (most of their products pre-GPT were RL). Is reinforcement learning the most effective way to steer LLMs, or was it just the hammer that OpenAI researchers knew best? Both can be true.

  • Who owns the best data for RLHF? Not all data is created equal. What kind of feedback system will be most effective for building a reward model for future LLMs? While companies like Google have enormous amounts of data, they might not have the right data.
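As a concrete example of how a curated reward model can differentiate a commodity base model, consider best-of-n (rejection) sampling, a technique Llama 2's fine-tuning also uses alongside PPO: generate several completions and keep the one the reward model scores highest. A minimal sketch, where `generate` and `reward_model` are hypothetical placeholders for a base model's sampler and a trained reward model:

```python
from typing import Callable, List

def best_of_n(prompt: str,
              generate: Callable[[str, int], List[str]],
              reward_model: Callable[[str, str], float],
              n: int = 8) -> str:
    """Sample n completions for the prompt and return the one the
    reward model scores highest."""
    candidates = generate(prompt, n)  # n candidate completions from the base model
    return max(candidates, key=lambda completion: reward_model(prompt, completion))
```

Two products built on the same base model but with different reward models will surface very different outputs from the same candidate pool.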
