Deep Dive · Section 04 · Updated April 2026

Retrieval-Augmented Generation (RAG)

How to build knowledge-grounded LLM applications: chunking strategies, embedding models, vector search, HyDE, hybrid retrieval, and the architecture decisions that determine pipeline reliability.

01

What Is Retrieval-Augmented Generation?

Retrieval-Augmented Generation (RAG) is an architecture that fetches relevant documents from an external knowledge store at query time and provides them to an LLM as context before it generates a response. Instead of relying solely on what the model learned during training, RAG gives the model access to up-to-date, specific, and verifiable information. The result is responses grounded in real documents rather than statistical patterns from training data.

Why RAG outperforms prompting for knowledge tasks

A language model's parametric knowledge — what it learned from training data — is static, unverifiable, and has a cutoff date. For any task that depends on private company data, recent events, or large document corpora, prompting alone fails because the model doesn't have the information. RAG solves this by converting the knowledge problem into a retrieval problem: index your documents, retrieve the relevant ones at query time, and give them to the model. The model's job becomes synthesis and reasoning over provided context rather than recall from weights.

Three properties make RAG preferable to fine-tuning for knowledge tasks: retrieved content is auditable (you can inspect exactly which documents informed a response), knowledge is updatable without retraining (add new documents to the index instantly), and retrieval failures are detectable (when no relevant document exists, you know it, rather than getting a confident hallucination).
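In code, the query-time loop is small. A minimal sketch, assuming hypothetical embed(), vector_store.search(), and llm.generate() helpers standing in for your embedding provider, vector store client, and model API:

```python
# Minimal query-time RAG loop (illustrative sketch).
# embed(), vector_store.search(), and llm.generate() are hypothetical helpers
# standing in for your embedding provider, vector store, and model API.

def answer(query: str, vector_store, llm, k: int = 5) -> str:
    query_vec = embed(query)                            # embed the user query
    chunks = vector_store.search(query_vec, top_k=k)    # nearest-neighbour retrieval

    # Assemble retrieved chunks into the prompt so the model answers from them.
    context = "\n\n".join(f"[{c.source}] {c.text}" for c in chunks)
    prompt = (
        "Answer the question using only the context below. "
        "Cite the bracketed source for each claim.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return llm.generate(prompt)
```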

RAG Pipeline Overview
RAG pipeline overview: ingestion (raw documents → chunking → embedding model → vector store) and query time (user query → query embedding → vector search for top-k chunks → optional reranking → LLM generation with retrieved context → grounded response)
Building a knowledge-grounded AI application? We design RAG pipelines with the retrieval quality and architecture your use case requires. Explore Content Systems →
02

Chunking Strategies

Chunking is the process of splitting source documents into pieces small enough to embed and retrieve meaningfully. The chunking strategy is one of the highest-leverage decisions in a RAG pipeline — poor chunking breaks retrieval even with a perfect embedding model and vector search configuration. Chunks that are too large dilute the relevance signal; chunks that are too small lose the context needed for coherent answers.

Common chunking approaches

Fixed-Size
Split by token count (512–1024 tokens) with overlap (10–20%). Simple and fast. Breaks mid-sentence or mid-concept, which hurts embedding quality on dense technical text.
Semantic
Split at natural boundaries: paragraphs, headings, sentences. Preserves conceptual integrity. More complex to implement but significantly better retrieval on structured documents.
Hierarchical
Parent-child structure: store large parent chunks for context, retrieve small child chunks for precision. Retrieve the child, return the parent to the LLM. Best of both granularities.
Document-Aware
Use document structure (HTML headings, Markdown sections, PDF layout) to determine boundaries. Highest quality for structured sources; requires format-specific parsers.
Four chunking strategies: fixed-size, semantic boundary, hierarchical parent-child, and document-structure-aware splitting
Practical starting point: Recursive character splitting at 512 tokens with 10% overlap is a reliable default for most use cases. Switch to semantic chunking when retrieval starts returning fragments that make no sense out of context.
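A minimal sketch of fixed-size chunking with overlap, counting whitespace-separated words as a stand-in for tokens; a real pipeline should count tokens with the embedding model's tokenizer:

```python
# Fixed-size chunking with overlap (illustrative sketch).
# Whitespace words approximate tokens here only to keep the example
# dependency-free; use your embedding model's tokenizer in practice.

def chunk_text(text: str, chunk_size: int = 512, overlap_ratio: float = 0.1) -> list[str]:
    words = text.split()
    step = max(1, int(chunk_size * (1 - overlap_ratio)))  # advance leaves ~10% overlap
    chunks = []
    for start in range(0, len(words), step):
        piece = words[start:start + chunk_size]
        if piece:
            chunks.append(" ".join(piece))
        if start + chunk_size >= len(words):               # last window reached the end
            break
    return chunks
```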
Struggling with retrieval quality? Chunking is usually the first thing to fix. We audit and redesign RAG pipelines that aren't performing as expected. Talk to Us →
03

Embeddings and Vector Search

An embedding model converts text into a high-dimensional vector that encodes semantic meaning. Similar texts produce similar vectors. Vector search finds the k chunks whose vectors are closest to the query vector — typically measured by cosine similarity. The quality of your embedding model determines how well the system understands what "similar" means for your content domain.
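The search step reduces to a nearest-neighbour comparison. A minimal brute-force sketch with NumPy; production vector stores use approximate-nearest-neighbour indexes, but the similarity computation is the same:

```python
import numpy as np

# Brute-force top-k retrieval by cosine similarity (illustrative sketch).
# query_vec has shape (d,); chunk_vecs has shape (n, d).

def top_k(query_vec: np.ndarray, chunk_vecs: np.ndarray, k: int = 5) -> list[int]:
    q = query_vec / np.linalg.norm(query_vec)
    m = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    sims = m @ q                              # cosine similarity of each chunk to the query
    return np.argsort(-sims)[:k].tolist()     # indices of the k most similar chunks
```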

Choosing an embedding model

General-purpose models (OpenAI text-embedding-3-large, Cohere Embed, Google Gemini embedding) work well for most English-language use cases. Domain-specific models outperform general ones on specialized vocabulary: legal, medical, code, or multilingual content. For on-premise or cost-sensitive deployments, bge-large-en-v1.5 and similar open-source models offer strong performance. Always evaluate on your actual data rather than benchmark leaderboards — distribution shifts between benchmarks and production corpora are common.

Vector store selection

For production, consider: pgvector (PostgreSQL extension, low operational overhead, good for existing Postgres deployments), Pinecone (fully managed, easy to start), Weaviate (strong metadata filtering), Qdrant (high-performance, open-source). For prototypes, Chroma or FAISS run locally with no infrastructure. The vector store matters less than chunking and embedding quality — optimize those first.
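As one example of the low-overhead option, a sketch of a cosine-distance query through pgvector from Python; the table name, column names, and connection string are placeholders, and it assumes a chunks table created with a vector column and the pgvector extension installed:

```python
import psycopg2

# Minimal pgvector query (illustrative sketch). Assumes something like
# "CREATE TABLE chunks (id serial, text text, embedding vector(1536))"
# with the pgvector extension installed; all names here are placeholders.

def search_pgvector(conn, query_vec: list[float], k: int = 5):
    vec_literal = "[" + ",".join(str(x) for x in query_vec) + "]"
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id, text FROM chunks "
            "ORDER BY embedding <=> %s::vector LIMIT %s",   # <=> is pgvector's cosine distance
            (vec_literal, k),
        )
        return cur.fetchall()

# conn = psycopg2.connect("dbname=rag user=rag")  # placeholder connection string
```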

Vector embedding process: text converted to high-dimensional vectors, query vector finds nearest chunk vectors in vector space
Setting up embeddings for a knowledge base or document store? We select and configure the right embedding and vector infrastructure for your data volume and latency requirements. Explore Content Systems →
04

HyDE and Advanced Retrieval

Standard RAG embeds the user's query and searches for similar chunks. This fails when the query is short and semantically different from how the answer is expressed in documents. Advanced retrieval techniques address this mismatch between query style and document style.

Hypothetical Document Embeddings (HyDE)

HyDE generates a hypothetical answer to the user's question using the LLM, then embeds that hypothetical answer instead of the raw query. The hypothetical answer is written in the same style and vocabulary as real document content, so its embedding lands closer to relevant chunks in the vector space. This significantly improves recall for short or conversational queries against technical or formal document corpora. The cost is one additional LLM call per query before retrieval.
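A minimal HyDE sketch, again assuming hypothetical llm.generate(), embed(), and vector_store.search() helpers; the prompt wording mirrors the implementation note later in this section:

```python
# HyDE retrieval (illustrative sketch).
# llm.generate(), embed(), and vector_store.search() are hypothetical helpers.

HYDE_PROMPT = (
    "Write a short factual answer to this question as it might appear "
    "in technical documentation.\n\nQuestion: {question}"
)

def hyde_retrieve(question: str, llm, vector_store, k: int = 5):
    # Low temperature keeps the hypothetical answer plain and document-like.
    hypothetical = llm.generate(HYDE_PROMPT.format(question=question), temperature=0.1)
    # Embed the hypothetical answer instead of the raw query and search with it.
    return vector_store.search(embed(hypothetical), top_k=k)
```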

Multi-query retrieval

Generate 3–5 query variations from the original question, retrieve candidates for each, and deduplicate. This improves recall for ambiguous queries where a single formulation might miss relevant chunks that are reachable through slightly different phrasing.
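A sketch of the same idea, assuming a hypothetical llm helper that returns the query variations as a list and chunk objects that carry an id:

```python
# Multi-query retrieval (illustrative sketch).
# llm.generate_list(), embed(), and vector_store.search() are hypothetical helpers;
# chunks are assumed to expose an .id attribute for deduplication.

def multi_query_retrieve(question: str, llm, vector_store, n_variants: int = 4, k: int = 5):
    variants = llm.generate_list(
        f"Rewrite this question {n_variants} different ways, one per line: {question}"
    )
    seen, merged = set(), []
    for q in [question, *variants]:
        for chunk in vector_store.search(embed(q), top_k=k):
            if chunk.id not in seen:          # deduplicate across query variants
                seen.add(chunk.id)
                merged.append(chunk)
    return merged
```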

Reranking

After vector retrieval returns top-k candidates (typically 20–50), a reranker model scores each chunk against the original query and returns the top-n most relevant (typically 3–8). Cross-encoder rerankers (Cohere Rerank, BGE Reranker) are slower than vector search but significantly more accurate. Always include a reranker in production pipelines with more than a few thousand chunks.
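A minimal reranking sketch using an open-source cross-encoder via sentence-transformers; hosted rerankers such as Cohere Rerank follow the same retrieve-then-rescore pattern behind an API call:

```python
from sentence_transformers import CrossEncoder

# Cross-encoder reranking (illustrative sketch). A BGE reranker is one
# open-source option; swap in the model that fits your latency budget.

reranker = CrossEncoder("BAAI/bge-reranker-base")

def rerank(query: str, chunks: list[str], top_n: int = 5) -> list[str]:
    # Score each (query, chunk) pair, then keep the top-n most relevant chunks.
    scores = reranker.predict([(query, chunk) for chunk in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_n]]
```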

HyDE implementation note: Generate the hypothetical document with a low-temperature call using a concise system prompt ("Write a short factual answer to this question as it might appear in technical documentation"). Too much creativity in the hypothetical document hurts retrieval precision.
Getting poor recall from your RAG pipeline? Advanced retrieval techniques often double recall without changing your document index. We tune retrieval quality with measurable evals. Start a Conversation →
05

Hybrid Search

Hybrid search combines dense vector retrieval (semantic similarity) with sparse keyword retrieval (BM25 or full-text search). The two approaches have complementary failure modes: vector search excels at semantic similarity but misses exact keyword matches; keyword search excels at precise terms but fails on paraphrase and conceptual queries. Combining them with Reciprocal Rank Fusion (RRF) consistently outperforms either alone.
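The fusion step itself is small. A minimal RRF sketch, where each input is an ordered list of chunk ids from one retriever (for example, one from vector search and one from BM25):

```python
# Reciprocal Rank Fusion (illustrative sketch). Each retriever contributes
# 1 / (k + rank) per document; k = 60 is the commonly used constant.

def rrf_merge(result_lists: list[list[str]], k: int = 60, top_n: int = 10) -> list[str]:
    scores: dict[str, float] = {}
    for results in result_lists:                       # e.g. [vector_ids, bm25_ids]
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```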

Hybrid search architecture: vector search and BM25 keyword search running in parallel, merged via reciprocal rank fusion into final ranked results

When to use hybrid search

Use hybrid search when your queries include both conceptual questions ("how does the refund process work") and exact-term lookups ("what is the SKU-2847 return policy"). Pure vector search handles the first well but may miss the second. Hybrid search consistently outperforms pure vector on corpora with mixed query types and is the standard approach in production enterprise deployments.

Implementation note: Not all vector stores support hybrid search natively. Elasticsearch, OpenSearch, Weaviate, and pgvector with pg_trgm are common choices that support both modes. Pinecone requires a separate sparse index. Factor this into vector store selection early.
Choosing a vector store architecture for hybrid search? We evaluate options against your data volume and query mix. Explore Content Systems →
06

Production RAG Architecture

A production RAG system requires more than a vector store and an embedding call. Ingestion pipelines need to handle document updates without full reindexing. Retrieval needs fallbacks for low-confidence results. Generation needs citation tracking. Observability needs to trace which documents influenced each response so that failures can be debugged.

Component · What to build · Why it matters
Ingestion pipeline · Incremental updates, document versioning · Full reindexing is slow; updates should be near-real-time
Query preprocessing · HyDE, multi-query, query rewriting · Raw queries often miss relevant chunks
Retrieval · Hybrid search + reranker · Best recall and precision at production scale
Context assembly · Deduplicate, rank, trim to fit context window · Too many chunks dilute the LLM's focus
Citation tracking · Pass chunk metadata, instruct model to cite · Enables source verification and trust
Fallback handling · No-result responses, low-confidence flags · Prevents confident hallucination when retrieval fails
Evaluation · Retrieval recall, answer faithfulness, groundedness · Can't improve what you can't measure
The most common production failure: Retrieval works in development (small corpus, clean queries) but degrades in production (large corpus, diverse query styles). Build an eval suite for retrieval quality separately from generation quality, and measure both on real production queries.
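As one concrete piece of the table above, a minimal context-assembly sketch; count_tokens() is a hypothetical stand-in for your tokenizer, and chunks are assumed to carry id and source metadata:

```python
# Context assembly (illustrative sketch): deduplicate ranked chunks, keep source
# metadata for citations, and stop before the token budget is exceeded.
# count_tokens() is a hypothetical stand-in for your tokenizer.

def assemble_context(ranked_chunks, token_budget: int = 4000) -> str:
    seen, parts, used = set(), [], 0
    for chunk in ranked_chunks:
        if chunk.id in seen:                     # skip duplicates across retrievers
            continue
        cost = count_tokens(chunk.text)
        if used + cost > token_budget:           # trim to fit the context window
            break
        seen.add(chunk.id)
        used += cost
        parts.append(f"[source: {chunk.source}] {chunk.text}")
    return "\n\n".join(parts)
```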
Deploying RAG at scale? We build and maintain production retrieval pipelines with eval infrastructure, fallback handling, and citation tracking. Explore Content Systems →