Why RAG Exists (And When Not to Use It)

RAG is not a universal fix for AI correctness. This article explains the real problem RAG addresses, its hidden costs, and how to decide whether retrieval is justified for a given system.

level: fundamentals
topics: rag
tags: rag, llm, retrieval, architecture, production

RAG Is Not a Magic Fix for Hallucination

When engineers first encounter LLM hallucinations, a common response is:

“We need RAG to make the model accurate.”

This is a misunderstanding of what RAG does.

RAG (Retrieval-Augmented Generation) is not a correctness layer. It is a pattern for providing models with relevant information they cannot otherwise access.

This article explains:

  • What problem RAG actually solves
  • When RAG is justified
  • When it adds unnecessary complexity
  • What alternatives exist

The Problem RAG Solves

Context Window Limitation

LLMs can only work with information inside their context window. This creates a fundamental constraint:

# Model cannot access this
company_knowledge_base = {
    "policies": 500_000,          # documents
    "customer_data": 10_000_000,  # records
    "product_specs": 50_000,      # pages
}

# Model only sees this
context_window_tokens = 128_000  # ~96,000 words

# Problem: How to give model access to relevant data?
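
A quick way to quantify the mismatch is to count tokens directly. The sketch below is a minimal check, assuming the tiktoken library is installed; any tokenizer that matches your model works as well.

import tiktoken

def corpus_fits_in_context(documents: list[str], context_limit: int = 128_000) -> bool:
    # Count tokens across the whole corpus with a general-purpose encoding
    enc = tiktoken.get_encoding("cl100k_base")
    total_tokens = sum(len(enc.encode(doc)) for doc in documents)
    return total_tokens <= context_limit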

Knowledge Cutoff Date

Models are trained on data up to a specific date:

  • GPT-4 Turbo: training data ends in April 2023 (the original GPT-4 cutoff was September 2021)
  • Claude 3: training data ends in August 2023

After training, models have no awareness of:

  • Current events
  • Recent product changes
  • New company policies
  • User-specific data

Private or Proprietary Information

Models cannot know:

  • Your company’s internal documents
  • Customer conversation history
  • Proprietary codebases
  • Confidential records

RAG solves this by retrieving relevant information and injecting it into the prompt.


What RAG Actually Does

Basic RAG Pattern

def generate_with_rag(question: str) -> str:
    # 1. Retrieve relevant documents
    relevant_docs = retrieve(question, top_k=5)

    # 2. Inject into prompt context
    prompt = f"""
    Use the following documents to answer the question.
    Do not use information outside these documents.

    Documents:
    {format_docs(relevant_docs)}

    Question: {question}
    Answer:
    """

    # 3. Generate response
    return llm.generate(prompt)

RAG Components

  1. Document corpus: Source of truth (database, documents, knowledge base)
  2. Embeddings: Vector representations of documents and queries
  3. Vector store: Database for fast similarity search
  4. Retrieval: Finding most relevant documents for a query
  5. Augmentation: Injecting retrieved documents into prompt
  6. Generation: LLM produces answer grounded in retrieved context

RAG does not make the model smarter. It gives the model access to relevant information.
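
Components 1-3 are the indexing side, which the query-time pattern above takes for granted. A minimal in-memory sketch, assuming an embed_query helper that returns a vector and a documents list that is already loaded; a production system stores the vectors in a dedicated vector database rather than a NumPy array.

import numpy as np

# Index time: embed every document once and keep the vectors
doc_vectors = np.array([embed_query(doc) for doc in documents])

def retrieve(question: str, top_k: int = 5) -> list[str]:
    # Query time: embed the question and rank documents by cosine similarity
    q = np.array(embed_query(question))
    scores = doc_vectors @ q / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q))
    top = np.argsort(scores)[::-1][:top_k]
    return [documents[i] for i in top]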


When RAG Is Justified

Use Case 1: Large, Structured Knowledge Base

Scenario: Customer support system with 10,000 help articles

# Without RAG: cannot fit all articles in context
context_limit_tokens = 128_000
total_corpus_words = 10_000 * 500  # ~5M words, far beyond the limit

# With RAG: retrieve only relevant articles
question = "How do I reset my password?"
relevant = retrieve(question, top_k=3)  # returns the 3 most relevant articles
context_used_words = 3 * 500  # ~1,500 words, comfortably within the limit

RAG is justified because:

  • Knowledge base is too large for context window
  • Only small subset is relevant per query
  • Information is structured and factual

Use Case 2: Frequently Updated Information

Scenario: Product documentation that changes weekly

# Model training data: Out of date
model_knowledge_cutoff = "2023-08-01"
current_date = "2026-02-11"

# RAG retrieves current version
current_docs = retrieve_from_latest_version(query)

RAG is justified because:

  • Information changes too frequently for retraining
  • Model’s parametric knowledge is outdated
  • Users need current information

Use Case 3: User-Specific or Private Data

Scenario: Enterprise system with customer records

# Model cannot have been trained on this
user_data = get_user_profile(user_id)
purchase_history = get_purchases(user_id)
support_tickets = get_tickets(user_id)

# RAG retrieves user-specific context
context = retrieve_user_context(user_id, query)

RAG is justified because:

  • Information is private/proprietary
  • Data is user-specific
  • Model cannot have prior knowledge

Use Case 4: Citation Requirements

Scenario: Legal or medical application requiring source references

def answer_with_sources(question: str):
    docs = retrieve(question, top_k=5)

    prompt = f"""
    Answer based only on provided documents.
    Cite sources using [Doc ID].

    Documents:
    {docs}

    Question: {question}
    """

    answer = llm.generate(prompt)
    sources = extract_citations(answer, docs)

    return {
        "answer": answer,
        "sources": sources  # Can be verified by humans
    }
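
extract_citations is left undefined above. A minimal sketch, assuming the model tags sources as [Doc 3] and each retrieved document is a dict with an id field:

import re

def extract_citations(answer: str, docs: list[dict]) -> list[dict]:
    # Pull every "[Doc <id>]" marker out of the generated answer
    cited_ids = set(re.findall(r"\[Doc\s+(\w+)\]", answer))
    # Keep only the retrieved documents the answer actually cites
    return [doc for doc in docs if str(doc["id"]) in cited_ids]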

RAG is justified because:

  • Answers must be verifiable
  • Sources must be traceable
  • Accountability is required

When RAG Adds Unnecessary Complexity

Anti-pattern 1: Using RAG for General Knowledge

# Unnecessary RAG
question = "What is Python?"
docs = retrieve(question)  # Retrieves basic Python documentation
answer = generate_with_rag(question, docs)

# Simpler: Model already knows this
answer = llm.generate("What is Python?")

RAG is not justified when:

  • Information is general knowledge
  • Model was trained on this information
  • No specialized/current knowledge needed

Anti-pattern 2: RAG as Hallucination Prevention

# Misusing RAG
question = "Calculate the ROI of this investment"
docs = retrieve("investment calculations")  # Generic guides

# Problem: RAG does not prevent math errors or logical mistakes
answer = generate_with_rag(question, docs)

# Better: Use deterministic calculation
def calculate_roi(initial, returns):
    return (returns - initial) / initial * 100

RAG is not justified when:

  • Task requires computation, not information retrieval
  • Problem is deterministic
  • Hallucination is not the actual issue

Anti-pattern 3: Tiny Document Corpus

# Overcomplicated RAG
docs = [
    "Our support email is support@company.com",
    "Our office hours are 9am-5pm",
    "Our return policy is 30 days"
]

# Problem: Entire corpus fits in context window
question = "What is the support email?"
relevant = retrieve(question, docs)  # Unnecessary retrieval step
answer = generate_with_rag(question, relevant)

# Simpler: Include all docs in prompt
all_docs = "\n".join(docs)
prompt = f"""
Company information:
{all_docs}

Question: {question}
"""

RAG is not justified when:

  • Entire corpus fits in context window
  • Retrieval adds latency without benefit
  • Static information rarely changes

Anti-pattern 4: RAG for Structured Data Queries

# Misusing RAG for database queries
question = "How many orders did user 12345 place last month?"

# Wrong: Retrieve text documents about orders
docs = retrieve(question)
answer = generate_with_rag(question, docs)

# Right: Use database query
query = """
SELECT COUNT(*) FROM orders
WHERE user_id = 12345
  AND created_at >= DATE_TRUNC('month', NOW() - INTERVAL '1 month')
  AND created_at < DATE_TRUNC('month', NOW())
"""
count = db.execute(query)

RAG is not justified when:

  • Data is structured and queryable
  • Deterministic query is possible
  • Precision is critical

Hidden Costs of RAG

Infrastructure Complexity

# Components to build and maintain
class RAGSystem:
    embedding_model: EmbeddingService  # Separate model for embeddings
    vector_store: VectorDatabase       # Specialized database
    chunking_pipeline: TextProcessor   # Document preprocessing
    index_manager: IndexService        # Keep embeddings updated
    retrieval_service: SearchEngine    # Ranking and filtering
    generation_model: LLM              # Actual text generation

Each component requires:

  • Infrastructure setup and maintenance
  • Monitoring and alerting
  • Cost optimization
  • Failure handling

Latency Impact

# RAG adds multiple round trips (timings below are illustrative)
def rag_latency():
    embedding = embed_query(question)       # ~50-100 ms
    doc_ids = search_vector_db(embedding)   # ~100-300 ms
    documents = fetch_documents(doc_ids)    # ~50-200 ms
    answer = generate_response(prompt)      # ~2000-5000 ms
    return answer                           # total: ~2200-5600 ms

# Without RAG
def simple_latency():
    return generate_response(prompt)  # ~2000-5000 ms

# RAG adds roughly 200-600 ms of retrieval overhead per request
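
To validate these numbers against your own stack, time each stage explicitly; a minimal sketch using time.perf_counter, assuming the same helper functions and question as above:

import time

def timed(label: str, fn, *args):
    # Run one stage and report its wall-clock latency
    start = time.perf_counter()
    result = fn(*args)
    print(f"{label}: {(time.perf_counter() - start) * 1000:.0f} ms")
    return result

embedding = timed("embed_query", embed_query, question)
doc_ids = timed("search_vector_db", search_vector_db, embedding)
documents = timed("fetch_documents", fetch_documents, doc_ids)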

Data Pipeline Maintenance

# Keeping RAG system updated
class DocumentPipeline:
    def update_document(self, doc: Document):
        # 1. Chunk document
        chunks = self.chunker.split(doc)

        # 2. Generate embeddings
        embeddings = self.embed(chunks)

        # 3. Update vector store
        self.vector_store.upsert(embeddings)

        # 4. Handle deletions
        self.vector_store.delete_old_versions(doc.id)

        # 5. Reindex if needed
        if self.should_reindex():
            self.reindex_all()

Ongoing costs:

  • Document ingestion pipeline
  • Embedding generation costs
  • Vector database storage
  • Reindexing operations
  • Version management

Retrieval Quality Problems

# RAG is only as good as retrieval
question = "How do I troubleshoot connection errors?"

# Retrieved documents might be:
# - Semantically similar but not relevant
# - Outdated versions
# - Missing key information
# - Too generic or too specific

# Poor retrieval → Poor answers
# RAG does not fix retrieval quality
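
One inexpensive mitigation is to refuse to answer when retrieval confidence is low instead of stuffing weak matches into the prompt. A minimal sketch, assuming the vector store returns similarity scores in [0, 1]; the 0.75 threshold is illustrative and should be tuned on real queries.

def retrieve_with_threshold(question: str, top_k: int = 5, min_score: float = 0.75):
    # search() is assumed to return (document, similarity_score) pairs
    results = search(question, top_k=top_k)
    confident = [doc for doc, score in results if score >= min_score]
    if not confident:
        return None  # caller falls back to "I don't have that information"
    return confident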

Decision Framework: Do You Need RAG?

Questions to Ask

  1. Can the model answer without external information?

    • No → Consider RAG
    • Yes → RAG probably unnecessary
  2. Is the information in a structured database?

    • Yes → Use database queries, not RAG
    • No → RAG might be appropriate
  3. Does the entire corpus fit in context?

    • Yes → Include directly in prompt
    • No → RAG might be necessary
  4. Is the information general knowledge?

    • Yes → Model likely already knows
    • No → RAG might add value
  5. Do you need citations or traceability?

    • Yes → RAG provides sources
    • No → Simpler approaches might work
  6. Can you afford the infrastructure complexity?

    • No → Explore simpler alternatives first
    • Yes → RAG infrastructure is feasible

Simplified Decision Tree

Does model need information not in its training data?
├─ No → Do not use RAG
└─ Yes → Is it structured data?
    ├─ Yes → Use database queries
    └─ No → Does entire corpus fit in context?
        ├─ Yes → Include in prompt directly
        └─ No → Is the complexity justified?
            ├─ No → Start simpler
            └─ Yes → Use RAG

Alternatives to RAG

Alternative 1: Direct Context Inclusion

# If data fits in context, include it
company_info = load_static_info()  # 5k tokens

prompt = f"""
{company_info}

Question: {question}
"""

When to use: Small, static information that rarely changes.

Alternative 2: Fine-tuning

# Teach model specialized knowledge
fine_tuned_model = train_on_domain_data(
    base_model="gpt-4",
    training_data=company_qa_pairs
)

# Model now has parametric knowledge
answer = fine_tuned_model.generate(question)

When to use: Stable domain knowledge, high query volume, latency-critical.

Alternative 3: Tool/Function Calling

# Let model query structured data
tools = [
    {
        "name": "get_order_status",
        "parameters": {"order_id": "string"}
    }
]

response = llm.generate_with_tools(question, tools)
if response.tool_calls:
    result = execute_tool(response.tool_calls[0])
    answer = llm.generate(f"Tool returned: {result}")

When to use: Structured data, APIs, real-time information.

Alternative 4: Prompt Engineering

# Constrain model to avoid hallucination
prompt = """
Answer only if you are certain.
If you do not know, respond: "I don't have that information."

Question: {question}
"""

When to use: General knowledge questions, acceptable to decline answering.


Starting Without RAG

Phase 1: Validate Core Use Case

# Test with manual context injection
test_prompt = f"""
Relevant information:
{manually_selected_docs}

Question: {question}
"""

# Does this solve the problem?
# If yes → RAG might help scale this
# If no → RAG will not help

Phase 2: Measure Information Needs

# How many documents are relevant per query?
def analyze_retrieval_needs(questions):
    for q in questions:
        relevant_docs = manually_identify_relevant(q)
        print(f"Question: {q}")
        print(f"Relevant docs: {len(relevant_docs)}")
        print(f"Total tokens: {count_tokens(relevant_docs)}")

# If relevant docs fit in context → No RAG needed
# If retrieval is too broad → RAG will struggle

Phase 3: Build Minimal RAG

# Simplest possible RAG
class MinimalRAG:
    def __init__(self, docs: list[str]):
        self.docs = docs
        self.embeddings = embed_documents(docs)

    def query(self, question: str) -> str:
        # Simple embedding search
        q_embed = embed_query(question)
        top_docs = semantic_search(q_embed, self.embeddings, k=3)

        # Basic prompt augmentation
        prompt = f"{top_docs}\n\nQuestion: {question}"
        return llm.generate(prompt)

# Test on real questions before adding complexity
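
Usage is a couple of lines; the documents and the question here are placeholders:

rag = MinimalRAG(docs=[
    "Password resets are handled at account.example.com/reset.",
    "Support is available Monday to Friday, 9am-5pm.",
])
print(rag.query("How do I reset my password?"))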

Conclusion

RAG is a tool, not a mandate.

Use RAG when:

  • Information exists outside model’s training data
  • Corpus is too large for context window
  • Citations and traceability are required
  • Information changes frequently

Do not use RAG when:

  • Model already has necessary knowledge
  • Data is structured and queryable
  • Entire corpus fits in context
  • Complexity outweighs benefits

The best RAG system is often no RAG system at all.

Start simple. Add complexity only when justified.
