Context Engineering for RAG Pipelines
Most RAG implementations fail not because retrieval is bad, but because nobody thought about what happens after retrieval. Bad chunking, no re-ranking, and no context budgeting waste the tokens you spent retrieving.
Why This Matters
Teams spend weeks tuning embeddings and chunking strategies, then dump the results into the context window in whatever order the vector store returned them. The model ignores half of it, and nobody notices because the system still produces answers; they’re just worse than they should be.
RAG has two problems: getting the right chunks, and assembling those chunks into context the model can actually use. Most teams only work on the first one.
Chunking Is Where Most Pipelines Break
Chunking determines what gets retrieved, and bad chunking poisons everything downstream. A function definition split across two chunks means neither chunk makes sense on its own. A paragraph’s key point in one chunk and its explanation in another means the model sees the claim without the evidence.
Fixed-Size Chunking
The simplest approach:
def chunk_naive(text, chunk_size=500):
    chunks = []
    for i in range(0, len(text), chunk_size):
        chunks.append(text[i:i + chunk_size])
    return chunks
The problem: chunks split in the middle of concepts.
Semantic Chunking
Split at natural boundaries instead:
def chunk_semantic(text):
    paragraphs = text.split("\n\n")
    chunks = []
    current = ""
    for para in paragraphs:
        if len(current) + len(para) < 500:
            # Join with a blank line, avoiding a leading separator on the first paragraph
            current = current + "\n\n" + para if current else para
        else:
            if current:
                chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks
Contextual Chunking
Add context to each chunk so it stands alone:
def chunk_with_context(doc, chunk):
    return f"""
Source: {doc.get('filename', 'unknown')}, Section: {doc.get('section', 'unknown')}
{chunk}
"""
This is the idea behind Anthropic’s contextual retrieval, which goes a step further and generates the situating context with an LLM rather than pulling it from metadata. Each chunk carries its origin and surrounding context with it, so the model understands where it came from. In their testing, combining contextual chunking with BM25 hybrid search reduced retrieval failures by 49% compared to standard embedding-based retrieval. The improvement came from both sides: chunks that make sense on their own retrieve better, and keyword matching catches what embedding similarity misses.
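A minimal sketch of that LLM-generated variant, assuming a generate_context callable that wraps whatever model call your stack uses (the helper and the prompt wording are illustrative, not Anthropic’s exact implementation):

def contextualize_chunk(document_text, chunk, generate_context):
    # Ask a model to situate the chunk within the full document
    situating = generate_context(
        f"Document:\n{document_text}\n\n"
        f"Write a short sentence situating this chunk within the document:\n{chunk}"
    )
    # Prepend the generated context so both the embedding and the BM25 index see it
    return f"{situating}\n\n{chunk}"

Run this once at indexing time, not at query time, so the cost is paid per chunk rather than per request.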
Chroma’s research found that semantic chunking at 512-1024 tokens outperforms fixed-size chunking in most retrieval tasks. But the ideal size depends on your document structure: technical documentation with long explanations works better at the higher end, while conversational data or Q&A pairs need smaller chunks around 256-512 tokens. Don’t pick a chunk size and forget about it. Measure retrieval quality at 3-4 different sizes and pick the one that actually performs best on your data.
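One way to run that comparison, sketched under the assumption that you have a small evaluation set of (query, relevant_id) pairs and a retrieval function per candidate chunk size (the build_index helper is hypothetical):

def recall_at_k(eval_pairs, retrieve, k=5):
    # eval_pairs: list of (query, id of the chunk or document that should come back)
    hits = 0
    for query, relevant_id in eval_pairs:
        results = retrieve(query, k=k)
        if relevant_id in {r.doc_id for r in results}:
            hits += 1
    return hits / len(eval_pairs)

# Re-index once per candidate size, then compare:
# for size in (256, 512, 768, 1024):
#     retriever = build_index(corpus, chunk_size=size)  # hypothetical helper
#     print(size, recall_at_k(eval_pairs, retriever.search))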
Retrieval Is a Ranking Problem
Top-k results from a vector store are sorted by embedding distance, not by usefulness. Embedding similarity and task relevance are correlated but not the same thing, and the gap between them is where RAG quality lives.
Re-Ranking
Re-rank after initial retrieval to improve quality:
def retrieve_with_reranking(query, top_k=5):
    # Cast a wide net with cheap vector search, then let the cross-encoder pick winners
    initial = vector_store.search(query, k=20)
    reranked = cross_encoder.rank(query, initial)
    return reranked[:top_k]
Retrieve 20, re-rank, keep 5. The initial retrieval is cheap; the quality comes from what you do with it.
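The cross_encoder above is a placeholder. One concrete way to fill it in, if you use the sentence-transformers library, looks roughly like this (the model name and the .content attribute on retrieved docs are assumptions):

from sentence_transformers import CrossEncoder

# A small MS MARCO cross-encoder; swap in whatever model fits your latency budget
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, docs, top_k=5):
    # Score every (query, document) pair jointly, then keep the strongest
    scores = reranker.predict([(query, d.content) for d in docs])
    ranked = sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]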
Hybrid Search
Pure semantic search misses exact keyword matches. Pure keyword search misses semantic similarity. Combine them:
def hybrid_search(query, k=10):
    bm25_results = bm25.search(query, k=k)
    semantic_results = vector_store.search(query, k=k)
    return rank_hybrid(bm25_results, semantic_results, weights=[0.3, 0.7])
The 0.3/0.7 split is a reasonable starting point. BM25 catches exact matches that semantic search misses. Adjust based on your recall numbers.
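rank_hybrid is left undefined above. A simple version is a weighted score merge, sketched here under the assumption that each result carries a doc_id and a score already normalized to the 0–1 range (reciprocal rank fusion is a common alternative that avoids the normalization step):

def rank_hybrid(bm25_results, semantic_results, weights=(0.3, 0.7), k=10):
    # Assumes each result has .doc_id and a .score normalized to [0, 1]
    combined = {}
    for weight, results in zip(weights, (bm25_results, semantic_results)):
        for r in results:
            combined[r.doc_id] = combined.get(r.doc_id, 0.0) + weight * r.score
    # Highest combined score first
    return sorted(combined, key=combined.get, reverse=True)[:k]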
Assembling Retrieved Context
This is the part most teams skip entirely. You have five relevant chunks, and nobody decides what order to put them in, whether all five are worth including, or how much of the token budget they should consume versus leaving room for instructions.
Apply the Pyramid
Use The Pyramid pattern: put the most important information first:
def assemble_context(query, retrieved_docs):
    # Most relevant first, then fill the remaining budget in ranked order
    docs = sorted(retrieved_docs, key=lambda d: d.relevance, reverse=True)
    context = docs[0].content
    for doc in docs[1:]:
        # len() counts characters here, a rough proxy for tokens
        if len(context) + len(doc.content) < 8000:
            context += "\n\n" + doc.content
    return context
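The character check is a rough proxy. If you want a real token budget, a sketch with tiktoken looks like this (the cl100k_base encoding is an assumption; use whatever tokenizer matches your model):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def assemble_context_tokens(retrieved_docs, token_budget=8000):
    docs = sorted(retrieved_docs, key=lambda d: d.relevance, reverse=True)
    parts, used = [], 0
    for doc in docs:
        cost = len(enc.encode(doc.content))
        if used + cost > token_budget:
            break
        parts.append(doc.content)
        used += cost
    return "\n\n".join(parts)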
Apply Grounding
Explicitly tell the model to use the retrieved context:
Use ONLY the information provided in the context below to answer
the question. If the context does not contain the answer, say so.
Context:
[assembled retrieved documents, most relevant first]
Question:
[user query]
This applies the Grounding pattern. The model knows it should use the provided context, not fall back to its training data.
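As code, the same template is a short formatting function (the name and argument order are illustrative; it plays the role of the _build_prompt step in the pipeline below):

def build_grounded_prompt(query, context):
    return (
        "Use ONLY the information provided in the context below to answer\n"
        "the question. If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question:\n{query}"
    )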
Putting It Together
A complete RAG pipeline with context engineering:
class ContextEngineeredRAG:
    def __init__(self):
        self.chunker = SemanticChunker()
        self.retriever = HybridRetriever()
        self.reranker = CrossEncoderReranker()

    def query(self, user_query, token_budget=8000):
        initial_docs = self.retriever.search(user_query, k=20)
        ranked_docs = self.reranker.rank(user_query, initial_docs)
        context = self._assemble_with_budget(ranked_docs, token_budget)
        prompt = self._build_prompt(user_query, context)
        return llm.generate(prompt)
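The two helper methods referenced but not shown can delegate to the sketches above (assemble_context_tokens and build_grounded_prompt are the assumed helpers from earlier in this piece):

    def _assemble_with_budget(self, ranked_docs, token_budget):
        # Keep the top 5 after re-ranking, then apply the token-budgeted pyramid assembly
        return assemble_context_tokens(ranked_docs[:5], token_budget)

    def _build_prompt(self, user_query, context):
        # Wrap the query and context in the grounding template
        return build_grounded_prompt(user_query, context)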
Each step earns its place. Semantic chunking preserves concept boundaries. Hybrid search recovers the keyword matches that embedding similarity misses. Re-ranking filters 20 candidates down to 5 good ones. The assembly step puts the strongest results first. The grounding instruction tells the model to actually use what you retrieved. Skip any of these and you’re leaving quality on the table, usually without a clear signal of what broke.
Common Mistakes
Dumping all retrieved chunks. Retrieving 20 documents and including all of them is worse than including the top 5. The bottom 15 actively degrade quality by diluting attention on the good results. Use Select, Don’t Dump.
Fixed-size chunking on structured documents. If your documents have headings, paragraphs, and code blocks, fixed-size chunking will split them at random points. Use semantic boundaries.
Skipping re-ranking. The difference between top-5-by-embedding and top-5-after-reranking is often the difference between a usable answer and a hallucinated one.
No grounding instructions. Without explicit instructions to use the retrieved context, the model treats it as optional background. It hallucinates confidently, and the retrieved context just sits there unused.