In recent years, the field of natural language processing has seen remarkable advancements, particularly in the development of large language models (LLMs) with increasingly expansive context windows. Models such as GPT-4o (OpenAI, 2023), Claude 3.5 (Anthropic, 2024), Llama 3.1 (Meta, 2024b), Phi-3 (Abdin et al., 2024), and Mistral Large 2 (Mistral AI, 2024) can all process 128,000 tokens or more in a single context, and Gemini 1.5 Pro even supports a 1M-token context window.
This raises the question:
Is there still a place for Retrieval Augmented Generation (RAG) in this new era of long-context LLMs?
Before we dive in, we’re excited to share that our RAG course is launching soon, and there’s still time to fill out the course survey to share your preferences!👇
📝 Course survey: https://maven.com/forms/e48159
Thanks, and we’re looking forward to seeing you there!
The Rise of Long-Context LLMs
Long-context LLMs have made significant strides in understanding and processing extensive inputs. These models can now directly engage with large amounts of text, potentially eliminating the need for complex retrieval systems.
This advancement has led to improved performance across a range of tasks. In a comprehensive study comparing RAG with long-context (LC) LLMs (paper), researchers found that, given sufficient resources, LC models consistently outperformed RAG. Across multiple datasets, including NarrativeQA, Qasper, and MultiFieldQA, LC models achieved higher average performance.
However, the experiments revealed one exception: on the two longer datasets from ∞Bench (En.QA and En.MC), RAG outperformed LC for GPT-3.5-Turbo. This is likely because the contexts in these datasets (averaging 147k words) far exceed GPT-3.5-Turbo's 16k-token context window. The finding underscores RAG's effectiveness when the input text greatly exceeds the model's context window, highlighting a specific use case where RAG remains essential.
In addition, RAG’s significantly lower cost remains a distinct advantage. Based on these observations, the next question is:
Is there a way to leverage the best of both worlds?
The Continued Relevance of RAG
Despite the impressive capabilities of long-context LLMs, RAG remains a valuable tool in the AI practitioner's toolkit. Here's why:
1. Cost-Efficiency
The most significant advantage of RAG is its cost-effectiveness. While LC models may offer better performance, they come at a much higher computational cost. RAG significantly reduces the input length to LLMs, leading to lower costs since most LLM API pricing is based on the number of input tokens.
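As a rough, back-of-the-envelope illustration (the token counts and the per-million-token price below are made-up placeholders; real API pricing varies by provider and model), the saving scales directly with how many tokens retrieval lets you leave out of the prompt:

```python
# Hypothetical cost comparison: the price and token counts are illustrative only.
PRICE_PER_1M_INPUT_TOKENS = 3.00  # assumed USD price; check your provider's actual pricing

def input_cost(num_tokens: int) -> float:
    """Cost of sending `num_tokens` as prompt input at the assumed price."""
    return num_tokens / 1_000_000 * PRICE_PER_1M_INPUT_TOKENS

long_context_tokens = 128_000  # stuffing the whole document into a long-context prompt
rag_tokens = 4_000             # sending only the top retrieved chunks

print(f"LC prompt cost:  ${input_cost(long_context_tokens):.4f}")  # ~$0.3840
print(f"RAG prompt cost: ${input_cost(rag_tokens):.4f}")           # ~$0.0120
```

Under these assumed numbers, RAG sends 32x fewer input tokens per query, and the prompt cost drops by the same factor.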
2. Effective for Majority of Queries
Interestingly, the same study found that LC and RAG produce identical predictions for over 60% of queries. For this large portion of queries, RAG delivers the same answer as an LC model at a fraction of the cost.
3. Scalability
As the amount of information continues to grow exponentially, RAG offers a scalable solution for accessing vast amounts of knowledge without the need to constantly retrain or expand the base language model.
Introducing Self-Route: A Hybrid Approach
Recognizing the strengths of both RAG and LC models, researchers have proposed a method called Self-Route. This approach aims to get the best of both worlds by dynamically routing queries to either RAG or LC based on the model's self-reflection.

Here's how Self-Route works:
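According to the paper, Self-Route runs in two steps: the model first tries to answer from the retrieved chunks alone and may decline by replying "unanswerable"; only the declined queries are re-run with the full long context. Below is a minimal sketch of that routing loop, assuming generic `llm()` and `retrieve()` helpers (the exact prompts and implementation details in the paper may differ):

```python
def self_route(query: str, document: str, llm, retrieve, top_k: int = 5) -> str:
    """Answer `query` cheaply with RAG first; fall back to the full long context if needed.

    `llm(prompt)` and `retrieve(query, document, top_k)` are assumed helpers:
    `llm` calls any chat/completion model, `retrieve` returns the top-k relevant chunks.
    """
    # Step 1: RAG-and-Route -- try to answer from the retrieved chunks, or decline.
    context = "\n\n".join(retrieve(query, document, top_k))
    rag_answer = llm(
        "Answer the question using only the provided context. "
        "If the context is insufficient, reply exactly 'unanswerable'.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    if "unanswerable" not in rag_answer.lower():
        return rag_answer  # cheap path: most queries stop here

    # Step 2: long-context fallback -- only for queries the model declined.
    return llm(f"Context:\n{document}\n\nQuestion: {query}")
```

Because most queries are reportedly answered in the first step, the expensive long-context call is reserved for the minority of cases where retrieval falls short, which is where the cost savings come from.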