Context Engineering for RAG Pipelines

Most RAG implementations fail not because retrieval is bad, but because nobody thought about what happens after retrieval. Bad chunking, no re-ranking, and no context budgeting waste the tokens you spent retrieving.

Why This Matters

Teams spend weeks tuning embeddings and chunking strategies, then dump the results into the context window in whatever order the vector store returned them. The model ignores half of it, and nobody notices because the system still produces answers; they’re just worse than they should be.

RAG has two problems: getting the right chunks, and assembling those chunks into context the model can actually use. Most teams only work on the first one.

Chunking Is Where Most Pipelines Break

Chunking determines what gets retrieved, and bad chunking poisons everything downstream. A function definition split across two chunks means neither chunk makes sense on its own. A paragraph’s key point in one chunk and its explanation in another means the model sees the claim without the evidence.

Fixed-Size Chunking

The simplest approach:

def chunk_naive(text, chunk_size=500):
    chunks = []
    for i in range(0, len(text), chunk_size):
        chunks.append(text[i:i+chunk_size])
    return chunks

The problem: chunks split in the middle of concepts.

Semantic Chunking

Split at natural boundaries instead:

def chunk_semantic(text, chunk_size=500):
    paragraphs = text.split("\n\n")
    chunks = []
    current = ""
    for para in paragraphs:
        # Only merge when there is already content; this avoids a stray
        # leading "\n\n" on the first chunk.
        if current and len(current) + len(para) < chunk_size:
            current += "\n\n" + para
        else:
            if current:
                chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks

Contextual Chunking

Add context to each chunk so it stands alone:

def chunk_with_context(doc, chunk):
    header = (f"Source: {doc.get('filename', 'unknown')}, "
              f"Section: {doc.get('section', 'unknown')}")
    return f"{header}\n\n{chunk}"

This is what Anthropic’s contextual retrieval does. Each chunk carries its origin and surrounding context with it, so the model understands where it came from. In their testing, combining contextual chunking with BM25 hybrid search reduced retrieval failures by 49% compared to standard embedding-based retrieval. The improvement came from both sides: chunks that make sense on their own retrieve better, and keyword matching catches what embedding similarity misses.
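Anthropic's contextual retrieval generates that situating context with a cheap model at indexing time, once per chunk. A minimal sketch of the per-chunk prompt builder; the wording here is illustrative, not Anthropic's exact template:

```python
def build_contextualize_prompt(full_doc: str, chunk: str) -> str:
    """Prompt a small model to write one sentence situating `chunk`
    within `full_doc`; the sentence gets prepended before embedding."""
    return (
        "<document>\n" + full_doc + "\n</document>\n\n"
        "Here is a chunk from the document above:\n"
        "<chunk>\n" + chunk + "\n</chunk>\n\n"
        "Write a short sentence situating this chunk within the overall "
        "document, to improve search retrieval of the chunk. "
        "Answer with only that sentence."
    )
```

The generated sentence plus the original chunk text is what gets embedded and BM25-indexed, which is why the hybrid combination pays off.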

Chroma’s research found that semantic chunking at 512-1024 tokens outperforms fixed-size chunking in most retrieval tasks. But the ideal size depends on your document structure: technical documentation with long explanations works better at the higher end, while conversational data or Q&A pairs need smaller chunks around 256-512 tokens. Don’t pick a chunk size and forget about it. Measure retrieval quality at 3-4 different sizes and pick the one that actually performs best on your data.
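Measuring across sizes can be as simple as computing recall@k on a small labeled query set for each candidate size. A sketch, where `index_and_retrieve` is a stand-in for your own pipeline (re-chunk the corpus at that size, rebuild the index, run the query):

```python
def recall_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of the relevant chunks that appear in the top-k retrieved."""
    if not relevant_ids:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def sweep_chunk_sizes(queries, sizes=(256, 512, 768, 1024)):
    # `queries` is a list of (query_text, relevant_chunk_ids) pairs;
    # `index_and_retrieve` is a placeholder for your own pipeline.
    results = {}
    for size in sizes:
        scores = [recall_at_k(index_and_retrieve(text, size), relevant)
                  for text, relevant in queries]
        results[size] = sum(scores) / len(scores)
    return results
```

Twenty to fifty labeled queries is usually enough to separate a good chunk size from a bad one.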

Retrieval Is a Ranking Problem

Top-k results from a vector store are sorted by embedding distance. Embedding similarity and task relevance are correlated but not the same thing, and the gap between them is where RAG quality lives.

Re-Ranking

Re-rank after initial retrieval to improve quality:

def retrieve_with_reranking(query, documents, top_k=5):
    initial = vector_store.search(query, k=20)
    reranked = cross_encoder.rank(query, initial)
    return reranked[:top_k]

Retrieve 20, re-rank, keep 5. The initial retrieval is cheap; the quality comes from what you do with it.
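In practice the scorer is a cross-encoder (for example, `CrossEncoder.predict` from the sentence-transformers library), but the re-ranking logic itself is just score-and-sort. A sketch with a pluggable `score_fn`; the word-overlap scorer is a toy stand-in so the example runs without a model:

```python
def rerank(query, docs, score_fn, top_k=5):
    """Score each (query, doc) pair and keep the top_k by score."""
    scored = sorted(docs, key=lambda d: score_fn(query, d), reverse=True)
    return scored[:top_k]

def overlap_score(query, doc):
    # Toy scorer: fraction of query words that appear in the document.
    # Replace with a cross-encoder's predict() in a real pipeline.
    q = set(query.lower().split())
    return len(q & set(doc.lower().split())) / max(len(q), 1)
```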

Pure semantic search misses exact keyword matches. Pure keyword search misses semantic similarity. Combine them:

def hybrid_search(query, k=10):
    bm25_results = bm25.search(query, k=k)
    semantic_results = vector_store.search(query, k=k)
    return rank_hybrid(bm25_results, semantic_results, weights=[0.3, 0.7])

The 0.3/0.7 split is a reasonable starting point. BM25 catches exact matches that semantic search misses. Adjust based on your recall numbers.
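The `rank_hybrid` step is left undefined above. One common implementation is weighted reciprocal rank fusion, which needs only the two ranked ID lists rather than score values on a comparable scale. A sketch; `c=60` is the constant conventionally used in RRF:

```python
def rank_hybrid(bm25_results, semantic_results, weights=(0.3, 0.7), c=60):
    """Weighted reciprocal rank fusion over two ranked lists of doc IDs."""
    scores = {}
    for weight, results in zip(weights, (bm25_results, semantic_results)):
        for rank, doc_id in enumerate(results):
            # Each list contributes weight / (c + rank); docs appearing
            # in both lists accumulate both contributions.
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (c + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

Fusing on rank rather than raw score sidesteps the problem that BM25 scores and cosine similarities live on incompatible scales.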

Assembling Retrieved Context

This is the part most teams skip entirely. You have five relevant chunks, but nobody decides what order to put them in, whether all five are worth including, or how much of the token budget they should consume versus leaving room for instructions.

Apply the Pyramid

Apply the Pyramid pattern, putting the most important information first:

def assemble_context(query, retrieved_docs, char_budget=8000):
    # Sort by relevance so the strongest chunk leads. The budget here is
    # in characters, as a rough stand-in for a real token count.
    docs = sorted(retrieved_docs, key=lambda d: d.relevance, reverse=True)
    context = docs[0].content
    for doc in docs[1:]:
        if len(context) + len(doc.content) < char_budget:
            context += "\n\n" + doc.content
    return context

Apply Grounding

Explicitly tell the model to use the retrieved context:

Use ONLY the information provided in the context below to answer
the question. If the context does not contain the answer, say so.

Context:
[assembled retrieved documents, most relevant first]

Question:
[user query]

This applies the Grounding pattern. The model knows it should use the provided context to answer.
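As code, the grounding template is a short formatting function. A sketch, assuming the context string has already been assembled most-relevant-first:

```python
def build_grounded_prompt(query, context):
    """Wrap assembled context and the user query in grounding instructions."""
    return (
        "Use ONLY the information provided in the context below to answer\n"
        "the question. If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question:\n{query}"
    )
```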

Putting It Together

A complete RAG pipeline with context engineering:

class ContextEngineeredRAG:
    def __init__(self):
        self.chunker = SemanticChunker()
        self.retriever = HybridRetriever()
        self.reranker = CrossEncoderReranker()
    
    def query(self, user_query, token_budget=8000):
        initial_docs = self.retriever.search(user_query, k=20)
        ranked_docs = self.reranker.rank(user_query, initial_docs)
        context = self._assemble_with_budget(ranked_docs, token_budget)
        prompt = self._build_prompt(user_query, context)
        return llm.generate(prompt)

Each step earns its place. Semantic chunking preserves concept boundaries. Hybrid search recovers the keyword matches that embedding similarity misses. Re-ranking filters 20 candidates down to 5 good ones. The assembly step puts the strongest results first. The grounding instruction tells the model to actually use what you retrieved. Skip any of these and you’re leaving quality on the table, usually without a clear signal of what broke.
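The `_assemble_with_budget` step in the class above is left undefined. A sketch of one possible implementation, treating docs as plain strings and reserving room for instructions; the four-characters-per-token estimate is a crude heuristic, and a real tokenizer such as tiktoken is the better choice for production budgeting:

```python
def estimate_tokens(text):
    # Crude heuristic: roughly 4 characters per token for English text.
    return len(text) // 4

def assemble_with_budget(ranked_docs, token_budget=8000, reserve=1000):
    """Fill the context most-relevant-first, leaving `reserve` tokens
    for the instructions and the question itself."""
    budget = token_budget - reserve
    parts, used = [], 0
    for doc in ranked_docs:  # assumed already sorted by relevance
        cost = estimate_tokens(doc)
        if used + cost > budget:
            continue  # this doc doesn't fit; a shorter one later might
        parts.append(doc)
        used += cost
    return "\n\n".join(parts)
```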

How Perplexity Does It

Perplexity’s system prompt is one of the few production RAG implementations where the full retrieval-to-output pipeline is visible, and it shows the scars of real iteration. The solutions to the problems most RAG systems hit are baked directly into the prompt architecture.

Citation as schema steering: Perplexity enforces an exact inline citation format: [1][2] appended directly after the sentence, no space before the bracket. This is Schema Steering applied to attribution. The model can’t produce a response without actively linking each claim to a source, and the format is specified character by character, capped at three citations per sentence. Compare that to most RAG implementations that ask the model to “cite your sources” and get inconsistent results.

Banned self-reference: Seven specific phrases are explicitly banned: “According to the search results,” “Based on the provided sources,” “Given the search results,” and four more variations. Perplexity learned from production that the model kept revealing its retrieval mechanism in the output, which breaks the illusion that the system is answering from knowledge rather than searching. The fix is a compact list of Negative Constraints targeting the exact failure mode. This is the right use of negative constraints: specific phrases that provably appear in output, targeted at a confirmed failure mode.

Query-type routing: The prompt includes eight distinct response formats depending on query classification. Academic research gets “scientific write-up with paragraphs and sections,” coding queries get code-first then explanation, weather queries strip everything except the forecast, and recipes get step-by-step with precise quantities. This is Progressive Disclosure applied to output format; the model selects the appropriate template based on query type, producing a format calibrated to what was asked. If your RAG pipeline handles diverse query types, this kind of routing improves output quality significantly without touching the retrieval layer.
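A routing table like this is cheap to replicate. A sketch in the spirit of Perplexity's per-query-type formats; the categories and template wording here are illustrative, not Perplexity's actual prompt text:

```python
# Hypothetical format registry keyed by query classification.
FORMATS = {
    "academic": "Write a scientific write-up with sections and paragraphs.",
    "coding": "Show the code first, then explain it.",
    "weather": "Give only the forecast, nothing else.",
    "recipe": "Give numbered steps with precise quantities.",
}

def format_instruction(query_type):
    # Fall back to a neutral format for unrecognized query types.
    return FORMATS.get(query_type, "Answer concisely in prose.")
```

The instruction selected here gets appended to the system prompt before generation, leaving the retrieval layer untouched.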

Grounding without saying so: The meta-instruction is telling: “Never mention that you are using search results or citing sources in your answer. Simply incorporate the information naturally.” The Grounding is structural (citation format, source integration) rather than conversational. The model uses the retrieved context because the prompt architecture makes it impossible not to, not because it was asked nicely.

Common Mistakes

Dumping all retrieved chunks: Retrieving 20 documents and including all of them is worse than including the top 5. The bottom 15 actively degrade quality by diluting attention on the good results. Use Select, Don’t Dump.

Fixed-size chunking on structured documents: If your documents have headings, paragraphs, and code blocks, fixed-size chunking will split them at random points. Use semantic boundaries.

Skipping re-ranking: The difference between top-5-by-embedding and top-5-after-reranking is often the difference between a usable answer and a hallucinated one.

No grounding instructions: Without explicit instructions to use the retrieved context, the model treats it as optional background. It hallucinates confidently, and the retrieved context just sits there unused.