The Context Length Observation

Matt Rickard
Nov 7, 2023

Large language models can only consider a limited amount of text at one time when generating a response or prediction. This is called the context length. It differs across models.
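
Context length is counted in tokens, not characters. A minimal sketch of checking whether a prompt fits within a given window, assuming the tiktoken library (the 128,000-token limit is just an example value):

```python
# Minimal sketch: does a prompt fit in a model's context window?
# Assumes the tiktoken package; 128,000 is just an illustrative limit.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by GPT-3.5/GPT-4

def fits_in_context(prompt: str, context_length: int = 128_000) -> bool:
    return len(enc.encode(prompt)) <= context_length

prompt = "Context length is measured in tokens, not characters."
print(len(enc.encode(prompt)))      # token count of the prompt
print(fits_in_context(prompt))      # True
```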

But one trend is interesting. Context length is increasing.

  • GPT-1 (2018) had a context length of 512 tokens.

  • GPT-2 (2019) supported 1,024.

  • GPT-3 (2020) supported 2,048.

  • GPT-3.5 (2022) supported 4,096.

  • GPT-4 (2023) first supported 8,192. Then 16,384. Then 32,768. Now, it supports up to 128,000 tokens.

Just using the OpenAI models for comparison, context length has, on average, doubled every year for the last five years. An observation akin to Moore’s Law:

The maximum context length of state-of-the-art Large Language Models is expected to double approximately every two years, driven by advances in neural network architectures, data processing techniques, and hardware capabilities.
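
As a back-of-the-envelope check of that pace, here is the growth rate implied by the OpenAI figures listed above:

```python
# Back-of-the-envelope check of the growth rate, using the figures above.
lengths = {2018: 512, 2019: 1_024, 2020: 2_048, 2022: 4_096, 2023: 128_000}

years = 2023 - 2018                           # 5 years
overall = lengths[2023] / lengths[2018]       # 250x
per_year = overall ** (1 / years)
print(f"{overall:.0f}x over {years} years, ~{per_year:.1f}x per year")
# -> 250x over 5 years, ~3.0x per year: at least a doubling per year on average
```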

It’s generally hard to scale. For years, the attention mechanism’s cost grew quadratically with context length, in both compute and memory (FlashAttention later removed the memory bottleneck). It’s even harder to get models to actually use longer contexts: early long-context models had trouble attending to information in the middle of the input.
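
To see where the quadratic cost comes from: vanilla attention scores every token against every other token, so the intermediate score matrix has n × n entries. A naive NumPy sketch (illustrative only, not any particular model’s implementation):

```python
import numpy as np

def naive_attention(Q, K, V):
    """Vanilla scaled dot-product attention for a single head.

    Q, K, V: (n, d) arrays for a sequence of n tokens. The intermediate
    scores matrix is (n, n) -- the quadratic cost in context length.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                                  # (n, n)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                                             # (n, d)

n, d = 4_096, 64
x = np.random.randn(n, d).astype(np.float32)
print(naive_attention(x, x, x).shape)  # (4096, 64), but scores held 4096 * 4096 entries
```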

Without a long context, understanding relationships and dependencies across large portions of text is difficult. Small context lengths force documents to be chunked up and processed piece by piece (with something like retrieval-augmented generation).
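
A minimal sketch of the chunking half of that workaround, assuming the tiktoken tokenizer and illustrative chunk sizes:

```python
# Minimal sketch of chunking a long document into overlapping windows,
# each small enough to fit a limited context. Assumes the tiktoken package;
# chunk sizes are illustrative.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def chunk(text: str, max_tokens: int = 512, overlap: int = 64) -> list[str]:
    ids = enc.encode(text)
    step = max_tokens - overlap
    return [enc.decode(ids[i:i + max_tokens]) for i in range(0, len(ids), step)]

long_document = "..."   # stand-in for a book, a codebase dump, a legal filing
chunks = chunk(long_document)
# Each chunk would then be embedded, indexed, and retrieved per query -- the RAG part.
```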

With long enough context lengths, we might ask questions about entire books or write full books from a single prompt. We might analyze an entire codebase in one pass, or extract useful information from mountains of legal documents with complex interdependencies.

What might lead to longer context lengths?

Advances in architecture. Innovations like FlashAttention compute exact attention without ever materializing the full n × n attention matrix, cutting memory use from quadratic to linear in context length and sharply reducing the slow memory traffic that dominates runtime. Doubling the context length no longer means quadrupling memory use.
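
The core trick can be seen in miniature with an online (“streaming”) softmax: attention output can be accumulated block by block without ever holding all of the scores at once. A toy single-query sketch of that idea (not FlashAttention itself, which also tiles across queries and fuses everything into one GPU kernel):

```python
import numpy as np

def streaming_attention(q, K, V, block: int = 256):
    """Attention output for a single query q, computed over K/V one block at a time.

    Only `block` scores are alive at any moment instead of all n -- the online
    softmax trick that lets FlashAttention avoid materializing the n x n matrix.
    """
    d = q.shape[0]
    m = -np.inf                      # running max of scores (numerical stability)
    denom = 0.0                      # running softmax normalizer
    acc = np.zeros(V.shape[1])       # running weighted sum of values
    for i in range(0, K.shape[0], block):
        s = K[i:i + block] @ q / np.sqrt(d)
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)    # rescale previously accumulated results
        p = np.exp(s - m_new)
        denom = denom * scale + p.sum()
        acc = acc * scale + p @ V[i:i + block]
        m = m_new
    return acc / denom

# Sanity check against the full (quadratic-memory) computation.
rng = np.random.default_rng(0)
n, d = 2_048, 64
q, K, V = rng.standard_normal(d), rng.standard_normal((n, d)), rng.standard_normal((n, d))
s = K @ q / np.sqrt(d)
w = np.exp(s - s.max()); w /= w.sum()
assert np.allclose(streaming_attention(q, K, V), w @ V)
```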

Rotary Position Embeddings (RoPE) are another architectural enhancement: positions are encoded as rotations of the query and key vectors, so attention scores depend only on the relative distance between tokens. That relative encoding also helps models be adapted to longer contexts than they were trained on.
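
A minimal sketch of the rotary embedding itself (this uses the split-half layout found in Llama-style implementations; the property to notice is that the dot product of a rotated query and key depends only on how far apart their positions are):

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate a query/key vector x according to its position (split-half layout)."""
    half = x.shape[0] // 2
    angles = pos * base ** (-np.arange(half) / half)   # one frequency per dimension pair
    x1, x2 = x[:half], x[half:]
    return np.concatenate([x1 * np.cos(angles) - x2 * np.sin(angles),
                           x1 * np.sin(angles) + x2 * np.cos(angles)])

# The attention score depends only on the relative offset between positions:
rng = np.random.default_rng(0)
q, k = rng.standard_normal(64), rng.standard_normal(64)
print(np.dot(rope(q, 10), rope(k, 3)))      # offset of 7 ...
print(np.dot(rope(q, 110), rope(k, 103)))   # ... same score at a different absolute position
```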

Advances in data processing techniques. You can increase context length in two ways. First, you can train the model with longer context lengths. That’s difficult because it’s much more computationally expensive, and it’s hard to find datasets with long context lengths (most documents in CommonCrawl have fewer than 2,000 tokens).

The second, more common, way is to fine-tune a base model with a longer context window. Code Llama, for example, is a fine-tune of Llama 2 (4k context length) trained on 16k-token sequences.
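
One widely used recipe for this kind of extension, though not necessarily the exact one Code Llama used, is position interpolation: rescale RoPE position indices so the longer sequence still falls within the positional range the base model saw in pretraining, then fine-tune briefly at the new length. A sketch:

```python
import numpy as np

def rope(x, pos, base=10000.0):
    # Same rotary embedding as sketched above.
    half = x.shape[0] // 2
    angles = pos * base ** (-np.arange(half) / half)
    x1, x2 = x[:half], x[half:]
    return np.concatenate([x1 * np.cos(angles) - x2 * np.sin(angles),
                           x1 * np.sin(angles) + x2 * np.cos(angles)])

def rope_interpolated(x, pos, trained_len=4_096, target_len=16_384):
    """Position interpolation: squeeze positions in [0, target_len) back into the
    [0, trained_len) range the base model was pretrained on, then fine-tune."""
    return rope(x, pos * trained_len / target_len)     # scale = 0.25 for 4k -> 16k
```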

Advances in hardware capabilities. Finally, the more efficient we make the attention mechanism and the other bottlenecks in training and inference, the better they scale with advances in the underlying hardware.

There’s still work to be done. How do we determine context length for data? It’s simple enough when the data is a single document (a book, a webpage, a source file). But how should we represent an entire codebase in the training data? Or a semester’s worth of lectures from a college class? Or a long online discussion? Or a person’s medical records from their entire life?

Comments
Markus Kohler (Nov 7):

For the time being there’s still the question of performance and cost, even as context sizes get bigger. So I wouldn’t be so sure that simple implementations that just send everything will always be competitive. Sending only what is needed might result in dramatic cost savings while also improving performance.

Richard Guinness (Nov 7):

Really interested to see how "well" these very long contexts are "used" (looking at the paper "Lost in the Middle: How Language Models Use Long Contexts").
