Strike up a conversation with a chatbot and you may run into a frustrating limitation: It can forget what you’re discussing. This happens as earlier parts of the conversation fall out of the large language model’s context window, which is the largest chunk of text it can consider when generating a response.
Magic, an AI software development company, recently claimed an advance that can overcome this problem: a large language model (LLM) with a context window of 100 million tokens. (Tokens are the basic units of text that LLMs process, typically representing words or parts of words.) A context window that long can fit about 750 novels, which is far more than enough to consider an entire chat. It can even allow the user to input tens or hundreds of documents for the LLM to reference.
“The attention span is limited for large language models,” says Naresh Dulam, Vice President of Software Engineering at JPMorgan Chase. “But this attention span keeps on increasing. That’s what the long context window provides. With the attention span increasing, you can put more data in.”
Dramatic Growth of Context Windows
Magic’s claim that its latest LLM can use up to 100 million tokens of context easily tops the previous high-water mark: Google Gemini 1.5 Pro’s context window of up to 2 million tokens. Other popular LLMs, such as the most recent versions of Anthropic’s Claude, OpenAI’s GPT, and Meta’s Llama, have context windows of 200,000 tokens or less.
To evaluate its model, Magic invented a new tool called HashHop, which is available on GitHub. In Magic’s blog post, the authors note that typical evaluations test a model’s memory by inserting an odd phrase into a long text document, such as putting a sentence about a coffee date in the text of Moby Dick. However, models can learn to spot the odd sentence, letting them pass the evaluation without actually being able to find other information in long documents. HashHop instead tests a model’s retrieval by providing a long document full of hashes—random strings of letters and numbers—and asking it to find specific ones. In HashHop, Magic’s model recalled hashes with up to 95 percent accuracy across a context window of 100 million tokens. Put more simply: It could recall a single sentence from a corpus of roughly 750 novels.
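Magic’s actual HashHop code is on GitHub, but the shape of the test is easy to sketch. The Python below is a simplified, illustrative version only, not Magic’s implementation; the helper names (`make_hash`, `build_hashhop_prompt`, `recall_rate`) are invented for this sketch. It generates a haystack of random hash pairs, asks for a handful of them back, and scores exact-match recall.

```python
import random
import secrets

def make_hash(length: int = 16) -> str:
    """Generate a random hex string to stand in for a 'hash'."""
    return secrets.token_hex(length // 2)

def build_hashhop_prompt(num_pairs: int = 1000, num_queries: int = 10):
    """Build a HashHop-style prompt: a haystack of hash-pair lines,
    followed by a request to recall the values for a few queried keys."""
    pairs = {make_hash(): make_hash() for _ in range(num_pairs)}
    haystack = "\n".join(f"{key} -> {value}" for key, value in pairs.items())
    queried = random.sample(list(pairs), num_queries)
    question = (
        "For each hash below, reply with the hash it maps to, one per line:\n"
        + "\n".join(queried)
    )
    expected = [pairs[key] for key in queried]
    return f"{haystack}\n\n{question}", expected

def recall_rate(model_answers: list[str], expected: list[str]) -> float:
    """Fraction of queried hashes the model recalled exactly."""
    hits = sum(answer.strip() == truth for answer, truth in zip(model_answers, expected))
    return hits / len(expected)

# The prompt would be sent to the long-context model under test, and its
# line-by-line answers scored with recall_rate().
prompt, expected = build_hashhop_prompt(num_pairs=1000, num_queries=10)
```

Because the hashes are random, the model cannot rely on learned patterns to spot the “odd” content; it has to actually retrieve the queried strings from the haystack.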
“In practice, context just works better” than alternative methods for improving model performance, said Magic CEO Eric Steinberger on the No Priors podcast. Instead of training the model on additional specialized data sets (as in the practice called fine tuning) or using a retriever algorithm to find data in an external set of documents (as in retrieval-augmented generation), Magic created this long context window that allows users to throw all their data into their prompt. “Our model sees all the data all the time,” Steinberger said.
But that’s not to say long context windows solve every problem, and Magic’s claims require some skepticism. “The claim of a model with such an extensive context window is certainly ambitious, if true,” says Daniel Khashabi, an assistant professor of computer science at Johns Hopkins University. “However, the practical implementation and efficiency of this model are critical factors to consider. From my quick reading of the blog post, it seems somewhat vague, lacking clear evaluations, architecture specifics, or training details.” While Magic published a blog post, it hasn’t released a paper, and the model isn’t yet available to the public. Magic did not respond to IEEE Spectrum’s requests for comment.
Dulam agrees with Khashabi’s concerns about the model’s practicality and efficiency, noting that long context can have downsides in this critical area. “If you have an infinite context, then you put a lot of unnecessary data in your context. There’s a human tendency to put in a lot of data,” he says. That can add “noise” to the LLM’s response, which can reduce quality and accuracy. It can also increase the compute and memory requirements beyond what another method, like retrieval-augmented generation (RAG), would need.
Long Context Windows + RAG
Without a publicly available version to try, the benefits of Magic’s claimed 100-million-token context window remain theoretical. Still, the development exemplifies an important trend: LLM context windows are growing, and fast. The initial version of OpenAI’s GPT-3.5 had a context window of just 4,096 tokens. Today, GPT-4o has a context window of 128,000 tokens. That’s roughly a thirty-fold increase in less than two years, and it has already improved the quality of modern chatbots. A chat powered by GPT-4o is unlikely to exceed the model’s context window, so ChatGPT can usually recall earlier topics within a single conversation.
However, while Magic seems to think that long context is the best way for LLMs to recall data, others see it as a complement to existing methods, including RAG. One researcher who’s been experimenting with such combinations is Ziyan Jiang, an applied scientist at Amazon AGI. “Previously, the retrieval task in RAG focused on very precise information, because the concept of long context models didn’t exist,” he says. Early implementations of RAG retrieved information in chunks of a few hundred tokens, and used a ranking algorithm to order the retrieved data appropriately. However, the small size of these chunks meant that a book-length text would be sliced into thousands of snippets, which made it difficult for the system to retrieve broader ideas and concepts that spanned multiple chunks.
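The classic pipeline Jiang describes can be sketched in a few lines of Python. This is a toy illustration, not any production RAG system: a simple bag-of-words cosine score stands in for the embedding-based ranker real systems use, and the function names (`chunk_tokens`, `retrieve`) are invented for the sketch.

```python
import math
from collections import Counter

def chunk_tokens(tokens: list[str], chunk_size: int) -> list[list[str]]:
    """Slice a tokenized document into fixed-size chunks."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity of two bag-of-words vectors (a toy stand-in for a learned ranker)."""
    dot = sum(count * b[token] for token, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, document: str, chunk_size: int = 300, top_k: int = 5) -> list[str]:
    """Score every chunk against the query and return the top few to paste into the LLM prompt."""
    chunks = chunk_tokens(document.lower().split(), chunk_size)
    query_vec = Counter(query.lower().split())
    ranked = sorted(chunks, key=lambda c: cosine(Counter(c), query_vec), reverse=True)
    return [" ".join(chunk) for chunk in ranked[:top_k]]
```

With a chunk size of a few hundred tokens, a novel becomes thousands of independent snippets, which is exactly why ideas spanning several paragraphs get split across retrieval boundaries.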
Jiang co-authored a paper, currently available as a preprint, that investigates how long context windows can aid RAG. The method that he and his co-authors propose, LongRAG, pairs longer retrieval chunks (of at least 4,000 tokens) with a long-context LLM. The paper found that using these longer chunks improved the model’s ability to find specific information, because it retained more context about the information it retrieved. For example, when tasked with recalling a multi-sentence text passage, the model’s performance improved from a 52 percent recall rate using short context (averaging 130 tokens) to 72 percent with long context (averaging 6,000 tokens).
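In terms of the toy `retrieve` sketch above, the LongRAG idea amounts to retrieving far larger units, and far fewer of them, and relying on a long-context model to read each one in full. The snippet below is illustrative only; the corpus file and question are placeholders, not from the paper.

```python
document = open("moby_dick.txt").read()   # placeholder corpus
query = "Who owns the Pequod?"            # placeholder question

short_chunks = retrieve(query, document, chunk_size=300, top_k=8)   # classic RAG: many small snippets
long_units = retrieve(query, document, chunk_size=4000, top_k=2)    # LongRAG-style: a few large units

# The long units keep whole passages together, so ideas that span several
# paragraphs are not split across retrieval boundaries; a long-context LLM
# then reads each retrieved unit in full.
prompt = "\n\n".join(long_units) + f"\n\nQuestion: {query}"
```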
“It’s similar to a human trying to find some information,” says Jiang. “When you’re trying to learn something new, you don’t start with a search for the exact paragraph to answer your question. You scan documents that might be useful for your search.”
Jiang noted that more research is needed to explore how RAG and long context compare. Other researchers appear to agree, as multiple papers on the topic have appeared in recent months. For his part, though, Jiang thinks RAG and long context need not be mutually exclusive. “A lot of people are talking about whether RAG should still exist, and whether RAG is better or a long context window in an LLM is better,” he said. “My point is that a combination is better.”
Khashabi shared similar thoughts and again highlighted efficiency as a key reason RAG could remain relevant. “Retrieval-Augmented Generation (RAG) is more efficient as it dynamically fetches relevant information. RAG and large context LLMs can work in tandem to provide a more efficient, effective solution.”