How to Build a RAG Pipeline: Architecture, Tools & Production Lessons
A practical guide to building Retrieval Augmented Generation systems that actually work in production — covering document ingestion, chunking strategies, embedding models, vector databases, retrieval optimisation, and evaluation.
Retrieval Augmented Generation (RAG) has become the default architecture for connecting LLMs to private knowledge bases. The concept is simple: retrieve relevant documents from your corpus, inject them into the LLM prompt, and generate an answer grounded in your data rather than hallucinated from training weights.
In practice, getting RAG from a working demo to a production system that handles real enterprise documents reliably is significantly harder than the concept suggests. This guide covers the full architecture — from document ingestion to production evaluation — based on what we have learned building RAG systems for manufacturing, education, and enterprise clients.
The Core RAG Architecture
A production RAG pipeline has six stages:
- Document ingestion — loading, parsing, and cleaning source documents
- Chunking — splitting documents into retrievable units
- Embedding — converting chunks to vector representations
- Vector storage — indexing embeddings for efficient similarity search
- Retrieval — finding the most relevant chunks for a given query
- Generation — passing retrieved context to the LLM for answer synthesis
Each stage has decisions that significantly impact final accuracy. Getting any one of them wrong can cause the entire system to underperform.
Stage 1: Document Ingestion
The quality of your ingestion pipeline determines the quality ceiling of your RAG system. Most teams underinvest here.
Common source types and their challenges:
- PDFs — layout varies widely. Native PDFs can be parsed with PyMuPDF or pdfplumber. Scanned PDFs require OCR (Tesseract, PaddleOCR). Tables are particularly difficult — most parsers either miss them or flatten them unreadably.
- Word / DOCX — generally clean to parse with python-docx. Watch for inline images and embedded objects that carry semantic content.
- Web pages — HTML parsing with BeautifulSoup or Playwright for dynamic content. Strip navigation, footers, and ads before chunking.
- Databases — structured data is better served via SQL lookup than semantic retrieval in most cases. Use RAG for prose documents, not database records.
Production lesson: Build a preprocessing step that inspects each document and routes it to the appropriate parser. A generic "parse everything the same way" approach produces poor quality chunks from the start.
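Such a routing step can be sketched as a simple dispatch on file type. The parser functions below are hypothetical placeholders — in a real pipeline they would wrap PyMuPDF, python-docx, BeautifulSoup, and so on:

```python
from pathlib import Path

def parse_pdf(path):   # would wrap PyMuPDF / pdfplumber, with an OCR fallback
    return f"pdf-text:{path}"

def parse_docx(path):  # would wrap python-docx
    return f"docx-text:{path}"

def parse_html(path):  # would wrap BeautifulSoup, stripping nav and footers
    return f"html-text:{path}"

# Registry mapping file extensions to the appropriate parser.
PARSERS = {".pdf": parse_pdf, ".docx": parse_docx, ".html": parse_html}

def ingest(path: str) -> str:
    """Inspect the document and route it to the right parser."""
    suffix = Path(path).suffix.lower()
    if suffix not in PARSERS:
        raise ValueError(f"No parser registered for {suffix}")
    return PARSERS[suffix](path)
```

The registry makes it cheap to add new source types later without touching the routing logic.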
Stage 2: Chunking Strategy
Chunking is the single decision with the most impact on retrieval quality. The wrong strategy splits relevant context across chunk boundaries, produces fragments without sufficient information, or yields chunks too large to fit meaningfully in a prompt.
Fixed-size chunking
Splitting every N tokens regardless of content boundaries. Fast to implement, but produces poor semantic coherence. Chunks frequently cut sentences mid-thought or separate a heading from its content.
When to use: Exploratory prototyping only. Not recommended for production.
Sentence-level chunking
Splitting at sentence boundaries with overlap. Better semantic coherence than fixed-size. Works well for prose documents.
When to use: General-purpose text like FAQ pages, policy documents, and articles.
Semantic chunking
Using an embedding model to detect semantic boundaries — splitting when content topic shifts. Produces coherent chunks but is slower and more expensive at ingestion time.
When to use: Long, complex documents where topic boundaries are meaningful.
Document-aware chunking
Using the document structure (headings, sections, paragraphs) to define chunk boundaries. Preserves semantic units naturally. Requires parsed document structure.
When to use: Technical documentation, legal documents, contracts, and any content with clear structural hierarchy.
Chunking parameters that matter
- Chunk size: 256–512 tokens for precise retrieval; 512–1024 tokens for richer context. Test both.
- Overlap: 10–20% overlap reduces the risk of splitting relevant context across chunk boundaries.
- Minimum chunk size: Discard chunks below ~50 tokens — they rarely contain enough context to be useful.
Production lesson: The optimal chunking strategy is document-specific. We routinely test 3–4 chunking configurations and measure retrieval precision on a representative query set before committing to one approach.
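The parameters above can be seen in a minimal overlap chunker. This is a sketch: whitespace-split words stand in for real tokens (in production you would count with a tokenizer such as tiktoken), and the defaults mirror the ranges discussed above:

```python
def chunk_tokens(text: str, chunk_size: int = 256, overlap: int = 32,
                 min_size: int = 50) -> list[str]:
    """Fixed-window chunking with overlap; words approximate tokens here."""
    tokens = text.split()
    step = chunk_size - overlap          # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        if len(window) >= min_size:      # discard tiny trailing fragments
            chunks.append(" ".join(window))
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

Note how each chunk's final `overlap` tokens reappear at the start of the next chunk, which is what protects context that straddles a boundary.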
Stage 3: Embedding Model Selection
The embedding model converts text into vectors that capture semantic meaning. Similar text should produce similar vectors. The choice of embedding model affects both retrieval quality and latency.
Key factors in embedding model selection
Domain alignment: A model pre-trained on scientific literature will represent medical terminology better than one trained on web text. For specialised domains, evaluate domain-specific models.
Dimensionality: Higher-dimensional embeddings (3072 for OpenAI text-embedding-3-large, 768–1024 for most open-source models) capture more nuance but require more storage and compute.
Language support: For non-English content, use multilingual models (multilingual-e5-large, Cohere multilingual) or language-specific models.
Latency and cost: OpenAI embeddings are accurate and convenient but have per-token API costs. Local Sentence Transformers models (all-MiniLM-L6-v2, BAAI/bge-large) are free to run but require GPU infrastructure at scale.
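Whichever model you choose, retrieval ultimately reduces to comparing vectors, usually by cosine similarity. A stdlib sketch with toy 3-dimensional vectors standing in for real 768–3072-dimensional embeddings:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy vectors: embeddings of semantically close texts point in similar directions.
query_vec   = [0.9, 0.1, 0.2]
related_vec = [0.8, 0.2, 0.3]
unrelated   = [0.1, 0.9, -0.4]
```

A good embedding model is one for which this geometric closeness tracks the semantic closeness of *your* documents — which is exactly what the evaluation advice below tests.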
Common embedding models
| Model | Dimensions | Best for |
|---|---|---|
| text-embedding-3-large | 3072 | General English, highest accuracy |
| text-embedding-3-small | 1536 | Cost-sensitive applications |
| BAAI/bge-large-en | 1024 | On-premise, English |
| multilingual-e5-large | 1024 | Multi-language |
| Cohere Embed v3 | 1024 | Enterprise, reranking integration |
Production lesson: Always evaluate embedding models against your actual documents and queries — not MTEB benchmarks. A model that ranks #1 on general benchmarks may underperform on your domain.
Stage 4: Vector Database Setup
The vector database stores your embeddings and serves similarity search queries. Choosing the right one depends on scale, latency requirements, and whether you need metadata filtering.
Managed vs self-hosted
- Pinecone: Fully managed, simple API, high reliability. Best for teams that do not want to operate infrastructure.
- Weaviate: Self-hosted or cloud, strong schema support and metadata filtering, GraphQL API.
- ChromaDB: Lightweight, great for prototyping and small-to-medium datasets. Not recommended for production at scale.
- pgvector: Adds vector search to PostgreSQL. Excellent if you already operate PostgreSQL — avoids an additional service.
- Qdrant: Fast, Rust-based, good for on-premise and high-throughput requirements.
Index configuration
Most vector databases use HNSW (Hierarchical Navigable Small World) as the default index. Key parameters:
- ef_construction: Higher values improve index quality at the cost of build time
- m: Number of connections per node — affects recall and memory
- ef_search: Search quality vs latency tradeoff at query time
Accept defaults for prototype; tune these parameters once you have production traffic and can measure recall vs latency.
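Tuning only helps if you can measure it. A minimal sketch of recall@k, comparing an ANN index's results against exact brute-force search over the same query (the result lists here are hypothetical):

```python
def recall_at_k(exact_ids: list[str], ann_ids: list[str], k: int) -> float:
    """Fraction of the true top-k neighbours that the ANN index returned."""
    exact_top = set(exact_ids[:k])
    return len(exact_top & set(ann_ids[:k])) / k

# Hypothetical results for one query: exact search vs HNSW with a low ef_search.
exact = ["d1", "d2", "d3", "d4", "d5"]
ann   = ["d1", "d3", "d7", "d2", "d9"]
```

In practice you would average this over a sample of production queries and raise `ef_search` until recall plateaus at an acceptable latency.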
Stage 5: Retrieval Optimisation
Basic cosine similarity search is a starting point, not an endpoint. Production RAG systems need retrieval optimisation to move accuracy from 60–70% to 85%+.
Hybrid search
Combining dense retrieval (embedding similarity) with sparse retrieval (BM25 keyword matching) consistently improves recall compared to either approach alone. Sparse retrieval catches exact keyword matches that dense retrieval misses; dense retrieval captures semantic similarity that sparse misses.
Most production RAG systems we build use hybrid search by default.
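A common way to fuse the dense and sparse result lists is Reciprocal Rank Fusion (RRF), which needs only ranks, not comparable scores. A stdlib sketch:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: each list contributes 1/(k + rank) per document.
    k=60 is the conventional RRF smoothing constant."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense_hits  = ["d3", "d1", "d5"]   # embedding-similarity order
sparse_hits = ["d1", "d2", "d3"]   # BM25 order
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

Documents that appear high in both lists (here `d1`) float to the top, while documents found by only one retriever are kept rather than lost.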
Query rewriting
Before retrieving, rewrite the user query to improve retrieval:
- HyDE (Hypothetical Document Embeddings): Ask the LLM to generate a hypothetical answer to the query, then embed that answer for retrieval. Counterintuitively, this often retrieves better documents than embedding the raw question.
- Multi-query retrieval: Generate 3–5 variants of the query, retrieve for each, and union the results. Reduces sensitivity to exact query phrasing.
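Multi-query retrieval is essentially retrieve-per-variant, then union with deduplication. A sketch in which `retrieve` is a hypothetical stand-in for a real vector search:

```python
def retrieve(query: str, top_k: int = 3) -> list[str]:
    # Hypothetical stand-in for vector search; results vary with phrasing.
    fake_index = {
        "reset my password":        ["doc-auth", "doc-faq", "doc-sso"],
        "recover account access":   ["doc-recovery", "doc-auth", "doc-support"],
        "forgot login credentials": ["doc-faq", "doc-recovery", "doc-auth"],
    }
    return fake_index.get(query, [])[:top_k]

def multi_query_retrieve(variants: list[str], top_k: int = 3) -> list[str]:
    """Retrieve for each rewritten query and union results, preserving order."""
    seen, merged = set(), []
    for variant in variants:
        for doc_id in retrieve(variant, top_k):
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged
```

The union reaches documents (here `doc-recovery`) that the original phrasing alone would have missed.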
Reranking
After initial retrieval, apply a cross-encoder reranker (Cohere Rerank, BGE Reranker, or a locally deployed cross-encoder) to reorder results by relevance. Rerankers are more accurate than bi-encoder similarity because they evaluate query and document jointly — but they are too slow to run over the full index, so apply them to the top 20–50 results from initial retrieval.
Reranking typically improves precision by 10–20 percentage points on the reranked set.
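The two-stage pattern is: cheap retrieval over the whole index, expensive scoring over the shortlist only. In this sketch, `cross_encoder_score` is a hypothetical word-overlap stand-in for a real cross-encoder such as a BGE Reranker, which would read query and document jointly:

```python
def cross_encoder_score(query: str, doc: str) -> float:
    # Hypothetical scorer: a real cross-encoder evaluates the pair jointly.
    return len(set(query.lower().split()) & set(doc.lower().split()))

def rerank(query: str, candidates: list[str], top_n: int = 3) -> list[str]:
    """Reorder the top candidates from initial retrieval by pairwise relevance."""
    scored = sorted(candidates,
                    key=lambda d: cross_encoder_score(query, d),
                    reverse=True)
    return scored[:top_n]

query = "refund policy for damaged goods"
candidates = [                       # top hits from initial retrieval
    "shipping times and carriers",
    "our refund policy covers damaged goods",
    "refund requests take five days",
]
top = rerank(query, candidates, top_n=2)
```

The expensive scorer only ever sees the 20–50 shortlisted candidates, which is what keeps latency tolerable.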
Metadata filtering
Add document metadata (source, date, department, document type) as filterable fields in your vector database. At query time, apply filters before or during retrieval to restrict the search space. This dramatically reduces irrelevant retrievals when users are asking about a specific document set.
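Conceptually, pre-filtering restricts the candidate set before similarity scoring. A stdlib sketch with precomputed toy similarity scores standing in for a live vector search:

```python
documents = [
    {"id": "a", "dept": "hr",      "year": 2024, "score": 0.91},
    {"id": "b", "dept": "finance", "year": 2024, "score": 0.95},
    {"id": "c", "dept": "hr",      "year": 2022, "score": 0.88},
]

def filtered_search(docs, top_k=5, **filters):
    """Apply metadata filters first, then rank the survivors by similarity."""
    candidates = [d for d in docs
                  if all(d.get(k) == v for k, v in filters.items())]
    return sorted(candidates, key=lambda d: d["score"], reverse=True)[:top_k]
```

Note that without the `dept="hr"` filter the finance document would rank first purely on similarity — exactly the kind of irrelevant retrieval filtering prevents.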
Stage 6: Evaluation
RAG systems require systematic evaluation — both at launch and continuously in production. Without measurement, you cannot know whether your chunking or retrieval changes actually improved things.
RAGAS metrics
RAGAS (Retrieval Augmented Generation Assessment) provides automated metrics using LLMs as evaluators:
- Faithfulness: Does the generated answer contain only information present in the retrieved context? Low scores indicate hallucination.
- Answer Relevancy: Is the answer actually relevant to the question asked?
- Context Precision: What fraction of retrieved chunks are relevant to the question? Low scores mean you are retrieving noise.
- Context Recall: What fraction of the information needed to answer is present in retrieved chunks? Low scores mean relevant information is not being retrieved.
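To make the retrieval metrics concrete: a simplified (unweighted) version of context precision is just the fraction of retrieved chunks an LLM judge marked relevant. The judgments below are hypothetical — in a real RAGAS run an evaluator LLM produces them:

```python
def context_precision(relevance_flags: list[bool]) -> float:
    """Fraction of retrieved chunks judged relevant to the question.
    (RAGAS's full metric additionally weights by rank.)"""
    return sum(relevance_flags) / len(relevance_flags)

# Hypothetical judge output for the 5 chunks retrieved on one question.
flags = [True, True, False, True, False]
```

A score like 0.6 here means 40% of your context window is noise — budget being spent on chunks that cannot help the answer.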
Building evaluation datasets
Create a "golden dataset" of 50–200 representative questions with expected answers from your actual use cases. Evaluate your RAG system against this dataset before each release. Track metrics over time.
Production lesson: Allocate 20% of your RAG project timeline to building evaluation infrastructure and creating the golden dataset. Teams that skip this step deploy systems they cannot improve systematically.
Common Production Failures
The retrieval ceiling problem: Your LLM can only answer well if the retrieval finds the right documents. If retrieval precision is 60%, the LLM cannot compensate by reasoning better. Measure retrieval separately from generation.
Context window management: Long documents fill the context window and push out relevant content. Set strict token limits on retrieved context (4000–8000 tokens is typical) and prioritise the highest-ranked chunks.
Latency underestimation: A round trip including embedding the query, vector search, optional reranking, and LLM generation can take 2–6 seconds. If your application requires sub-second responses, design accordingly — consider caching common queries or pre-computing frequent retrievals.
Index staleness: Your index becomes stale as source documents are updated. Build a document change detection and re-ingestion pipeline from day one. Track document versions in your metadata.
Summary
A production RAG pipeline requires careful design at every stage: document-appropriate parsing, content-aware chunking, domain-evaluated embedding models, appropriately sized vector databases, and multi-stage retrieval with hybrid search and reranking. Each component has a measurable impact on final accuracy.
The difference between a RAG demo and a production RAG system is evaluation. Without measuring faithfulness, precision, and recall systematically — before deployment and continuously afterward — you cannot improve what you cannot see.
We build production RAG systems for enterprise clients across manufacturing, education, and professional services. If you are evaluating whether RAG is the right architecture for your use case, get in touch with our team.
