The MLnotes Newsletter

Why use RAG in the Era of Long-Context Language Models?

Part 1

Angelina Yang
Sep 30, 2024


In recent years, the field of natural language processing has seen remarkable advances, particularly in large language models (LLMs) with increasingly expansive context windows. Models such as GPT-4o (OpenAI, 2023), Claude 3.5 (Anthropic, 2024), Llama 3.1 (Meta, 2024b), Phi-3 (Abdin et al., 2024), and Mistral Large 2 (Mistral AI, 2024) can all process 128,000 tokens or more in a single context, and Gemini 1.5 Pro supports a context window of up to 1M tokens.

This raises the question:

Is there still a place for Retrieval Augmented Generation (RAG) in this new era of long-context LLMs?

Before we dive in, we’re excited to share that our RAG course is launching soon, and there’s still time to fill out the course survey to share your preferences!👇

📝 Course survey: https://maven.com/forms/e48159

Thanks, and we’re looking forward to seeing you there!


The Rise of Long-Context LLMs

Long-context LLMs have made significant strides in understanding and processing extensive inputs. These models can now directly engage with large amounts of text, potentially eliminating the need for complex retrieval systems.

This advancement has led to improved performance across various tasks. In a comprehensive study comparing RAG and long-context (LC) LLMs (paper), researchers found that when given sufficient resources, LC models consistently outperformed RAG approaches. Across multiple datasets, including NarrativeQA, Qasper, and MultiFieldQA, LC models showed superior results in terms of average performance.

However, the experiments revealed one exception: on the two longest datasets from ∞Bench (En.QA and En.MC), RAG outperformed LC for GPT-3.5-Turbo. This is likely because these datasets average roughly 147k words, far beyond GPT-3.5-Turbo's limited 16k context window. The finding underscores RAG's effectiveness when the input text greatly exceeds the model's context window, highlighting a specific use case where RAG still wins.
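
To make this failure mode concrete, here is a minimal sketch of the retrieval step that lets RAG handle inputs far larger than a model's context window. The chunk size, word-overlap scoring, and token budget are illustrative assumptions, not the setup used in the paper.

```python
def chunk_text(text, chunk_size=200):
    """Split text into fixed-size word chunks (word count as a rough token proxy)."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

def retrieve(query, chunks, budget=400):
    """Score chunks by word overlap with the query and keep the
    best-scoring ones until the context budget is filled."""
    query_words = set(query.lower().split())
    ranked = sorted(chunks,
                    key=lambda c: len(query_words & set(c.lower().split())),
                    reverse=True)
    selected, used = [], 0
    for chunk in ranked:
        n = len(chunk.split())
        if used + n > budget:
            break
        selected.append(chunk)
        used += n
    return selected

# A 147k-word document will not fit in a 16k window,
# but the handful of retrieved chunks easily does.
```

Real systems typically rank chunks by embedding similarity rather than word overlap, but the budgeting logic is the same: only the top-scoring chunks are sent to the model.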

In addition, RAG’s significantly lower cost remains a distinct advantage. This observation naturally leads to the next question:

Is there a way to leverage the best of both worlds?

The Continued Relevance of RAG

Despite the impressive capabilities of long-context LLMs, RAG remains a valuable tool in the AI practitioner's toolkit. Here's why:

1. Cost-Efficiency

The most significant advantage of RAG is its cost-effectiveness. While LC models may offer better performance, they come at a much higher computational cost. RAG significantly reduces the input length to LLMs, leading to lower costs since most LLM API pricing is based on the number of input tokens.
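
A back-of-the-envelope calculation shows the gap. The price and the token counts below are hypothetical placeholders, not any provider's actual rates:

```python
def prompt_cost(input_tokens, price_per_million):
    """Input-side cost of one LLM call, in dollars."""
    return input_tokens * price_per_million / 1_000_000

PRICE = 5.00  # hypothetical $ per 1M input tokens

lc_cost = prompt_cost(128_000, PRICE)   # stuffing the full long context
rag_cost = prompt_cost(4_000, PRICE)    # a few retrieved chunks

print(f"LC: ${lc_cost:.2f}  RAG: ${rag_cost:.3f}  ratio: {lc_cost / rag_cost:.0f}x")
```

Output tokens are priced separately, but both approaches produce answers of similar length, so the input side dominates the cost difference.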

2. Effective for Majority of Queries

Interestingly, research has shown that the predictions from LC and RAG are identical for over 60% of queries. This means that for a large portion of tasks, RAG can provide the same level of performance as LC models but at a fraction of the cost.
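
Combining the two observations gives a rough estimate of what a routing strategy could save. The per-query costs and the 60% fraction below are illustrative numbers, assuming a query that escalates to LC also pays for the failed RAG attempt:

```python
def blended_cost(rag_cost, lc_cost, rag_fraction):
    """Expected per-query cost when rag_fraction of queries are served by
    RAG alone and the rest escalate to a full long-context call
    (paying for both the RAG attempt and the LC call)."""
    return rag_fraction * rag_cost + (1 - rag_fraction) * (rag_cost + lc_cost)

# Hypothetical numbers: RAG $0.02/query, LC $0.64/query, 60% handled by RAG.
cost = blended_cost(rag_cost=0.02, lc_cost=0.64, rag_fraction=0.6)
print(f"${cost:.3f} per query, versus $0.64 for always using LC")
```

Even with the double payment on escalated queries, the blended cost lands well below always sending the full context.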

3. Scalability

As the amount of information continues to grow exponentially, RAG offers a scalable solution for accessing vast amounts of knowledge without the need to constantly retrain or expand the base language model.

Introducing Self-Route: A Hybrid Approach

Recognizing the strengths of both RAG and LC models, researchers have proposed a method called Self-Route. This approach aims to get the best of both worlds by dynamically routing queries to either RAG or LC based on the model's self-reflection.
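
At a high level, the routing loop can be sketched as follows. This is a generic illustration of the idea, not the paper's exact prompts or implementation; `call_llm` is a stand-in for whatever chat-completion API you use:

```python
UNANSWERABLE = "unanswerable"

def self_route(query, retrieved_chunks, full_document, call_llm):
    """Try a cheap RAG call first; if the model judges the retrieved
    context insufficient, fall back to the expensive long-context call."""
    rag_prompt = (
        "Context:\n" + "\n".join(retrieved_chunks) + "\n\n"
        f"Question: {query}\n"
        f"Answer the question, or reply '{UNANSWERABLE}' if the context "
        "is insufficient."
    )
    answer = call_llm(rag_prompt)
    if UNANSWERABLE in answer.lower():
        # Self-reflection said no: retry with the full document in context.
        return call_llm(f"Context:\n{full_document}\n\nQuestion: {query}")
    return answer
```

Because most queries stop at the first, cheap call, the average cost stays close to RAG's while hard queries still get the full context.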

Source: Figure 1: While long-context LLMs (LC) surpass RAG in long-context understanding, RAG is significantly more cost-efficient. A new approach, Self-Route, combining RAG and LC, achieves comparable performance to LC at a much lower cost.

Here's how Self-Route works:

Keep reading with a 7-day free trial

© 2025 MLnotes