
RAG isn't retrieval — it's context engineering

Sylvester S, Founder & CEO
Jan 15, 2025 · 7 min read

Everyone calls it retrieval-augmented generation but the bottleneck is never retrieval. It's knowing what context an LLM actually needs to reason correctly.

The name 'retrieval-augmented generation' implies that retrieval is the thing. It's not. Retrieval is the easy part. The hard part — the part that determines whether your RAG pipeline produces useful answers or confident nonsense — is deciding what context to put in front of the model.

After building RAG systems for enterprise knowledge bases, legal document analysis, financial research, and customer support, here's our framework for thinking about context engineering.

The chunk strategy problem

Most teams chunk documents at fixed token counts. It's the default in LangChain, it's what tutorials show, and it's usually wrong. Fixed-size chunking splits sentences mid-thought, separates tables from their headers, and divorces conclusions from the evidence they summarise. The embedding model then generates a vector for a chunk that means nothing in isolation.

Chunk at semantic boundaries instead. For prose: paragraph-level chunking, with sentence-level overlap. For structured documents: section-level chunking that preserves hierarchical context. For tables and code: chunk by logical unit (one table, one function), never mid-structure. The additional complexity in your ingestion pipeline pays off dramatically in retrieval quality.
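The paragraph-level strategy above can be sketched in a few lines. This is a minimal illustration, not a production splitter: the sentence split is a naive regex, and `overlap_sentences` is a parameter we've invented here to show the idea of carrying trailing sentences from the previous paragraph into each chunk.

```python
import re

def chunk_paragraphs(text: str, overlap_sentences: int = 1) -> list[str]:
    """Split prose at paragraph boundaries, prepending the last
    sentence(s) of the previous paragraph so no chunk starts
    mid-thought."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks = []
    for i, para in enumerate(paragraphs):
        if i > 0 and overlap_sentences > 0:
            # Naive sentence split; a real pipeline would use a proper
            # sentence tokenizer here.
            prev = re.split(r"(?<=[.!?])\s+", paragraphs[i - 1])
            overlap = " ".join(prev[-overlap_sentences:])
            chunks.append(overlap + " " + para)
        else:
            chunks.append(para)
    return chunks
```

The same shape extends to structured documents: swap the paragraph split for a section or heading split, and keep the hierarchical context (section title, parent headings) attached to each chunk instead of a sentence overlap.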

Embedding model choice matters less than you think

Teams spend significant time evaluating embedding models — ada-002 vs. BGE vs. Cohere vs. Jina. The performance differences between modern embedding models on typical enterprise retrieval tasks are smaller than the performance difference between good and bad chunk strategy. Get chunking right first. Then optimise your embedding model if retrieval quality still falls short.

The reranking layer is not optional

Vector similarity retrieval is a blunt instrument. It finds semantically similar chunks, but semantic similarity and relevance-to-this-specific-query are not the same thing. A cross-encoder reranker — which takes the query and each candidate chunk and scores them jointly — dramatically improves precision. We use Cohere Rerank or a fine-tuned cross-encoder as standard on every production RAG system.

Add a reranker before you add a more expensive embedding model. In our benchmarks, switching from no reranker to a cross-encoder reranker improved answer quality more than switching from ada-002 to a state-of-the-art embedding model.
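The retrieve-then-rerank pattern is simple to wire up. The sketch below uses a hypothetical `score_fn` standing in for a real cross-encoder (Cohere Rerank, or a fine-tuned model scoring each query–chunk pair jointly); the toy word-overlap scorer is only there to make the example runnable.

```python
def rerank(query: str, candidates: list[str], score_fn, top_k: int = 3) -> list[str]:
    """Second-stage rerank: score each (query, chunk) pair jointly
    and keep the top_k highest-scoring chunks."""
    scored = [(score_fn(query, chunk), chunk) for chunk in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]

def overlap_score(query: str, chunk: str) -> float:
    """Toy stand-in for a cross-encoder: fraction of query words
    present in the chunk. Do not use in production."""
    q, c = set(query.lower().split()), set(chunk.lower().split())
    return len(q & c) / max(len(q), 1)
```

The point of the pattern is the interface, not the scorer: vector search returns a broad candidate set cheaply, and the expensive joint scoring only runs over those few dozen candidates.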

Context assembly: what goes in the prompt

You've retrieved the right chunks. Now: what do you actually put in the prompt, in what order, and with what framing? This is context engineering. A few principles we've landed on:

  • Position matters: LLMs attend better to context at the start and end of the context window. Put the most critical evidence first or last, never buried in the middle.
  • Include metadata: chunk source, document date, section title. This helps the model reason about evidence provenance.
  • Filter before you fill: it's better to pass 3 high-quality chunks than 10 mediocre ones. Don't use the full context window by default.
  • Add explicit structure: label each chunk with [Source 1], [Source 2] etc. so the model can cite and distinguish between them.
  • State what you don't know: if retrieval returns nothing relevant, tell the model explicitly rather than sending it empty context.
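The principles above translate into a small prompt-assembly function. This is a sketch under an assumed chunk schema (`text`, `source`, `date` keys); the labels and the empty-retrieval message are illustrative, not a fixed format.

```python
def assemble_context(chunks: list[dict]) -> str:
    """Build the evidence section of a RAG prompt. Chunks should
    already be sorted most-relevant first. An empty list produces an
    explicit 'nothing retrieved' statement instead of silent empty
    context."""
    if not chunks:
        return ("No relevant documents were retrieved for this question. "
                "Say you don't know rather than guessing.")
    blocks = []
    for i, chunk in enumerate(chunks, start=1):
        # Label each chunk and attach provenance metadata.
        header = f"[Source {i}] {chunk['source']} ({chunk['date']})"
        blocks.append(header + "\n" + chunk["text"])
    return "\n\n".join(blocks)
```

Filtering happens upstream: pass this function your three best chunks, not everything that cleared a similarity threshold.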

Evaluating RAG quality

Build an evaluation set before you start building the pipeline. Sample 50-100 real questions from your target user group, pair them with ground-truth answers, and measure your pipeline against them at every stage. Evaluate retrieval quality (did the right chunks come back?) and generation quality (did the model use the chunks correctly?) separately — they have different failure modes and different fixes.
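For the retrieval half, a metric as simple as recall@k over your labelled question set goes a long way. A minimal version, assuming each eval question comes with a set of ground-truth chunk IDs:

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: list[str], k: int = 5) -> float:
    """Fraction of the ground-truth chunks that appear in the top-k
    retrieved results. Measures retrieval only; generation quality
    is scored separately, against its own failure modes."""
    if not relevant_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    hits = sum(1 for rid in relevant_ids if rid in top_k)
    return hits / len(relevant_ids)
```

Averaged over the eval set, this number tells you whether a regression came from retrieval before you ever look at the model's answers.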

The teams that build great RAG systems are the ones that treat evaluation as the primary engineering task, not an afterthought. The retrieval and generation are the implementation. The eval suite is the product.
