RAG Architecture: Building Retrieval-Augmented Generation Systems

A comprehensive guide to designing and implementing Retrieval-Augmented Generation architectures for AI agents, covering indexing pipelines, retrieval strategies, and generation workflows in Python.

Technical · 9 min read · By Klivvr Engineering

Large language models are remarkably fluent, but they hallucinate when asked about data they have never seen. Retrieval-Augmented Generation (RAG) addresses this problem by grounding every answer in real documents retrieved at query time. Instead of relying solely on weights frozen during training, a RAG system fetches relevant context from an external knowledge base and injects it into the prompt before the model generates a response. The result is answers that are both articulate and factually anchored.

When we set out to build Data Whispal Agent, RAG was the natural architecture choice. Our users ask questions in plain English about datasets that change daily. No amount of fine-tuning could keep a static model current. RAG lets us decouple the knowledge from the model, treating the LLM as a reasoning engine and the vector store as a living memory. This article walks through the architecture we built, the decisions we made at each layer, and the Python code that ties it all together.

The Three Pillars of RAG

Every RAG system rests on three pillars: indexing, retrieval, and generation. Indexing is the offline process of transforming raw documents into searchable vector representations. Retrieval is the online process of finding the most relevant chunks for a given query. Generation is the final step where the LLM synthesizes an answer from the retrieved context.

Understanding these pillars as separate concerns is critical because each can be optimized independently. A brilliant generation prompt cannot compensate for a retrieval step that returns irrelevant documents, and the best retrieval engine in the world is useless if the indexing pipeline produced noisy, overlapping chunks.

Here is a simplified view of the pipeline:

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
 
# Pillar 1: Indexing
def build_index(documents: list[str], collection_name: str) -> Chroma:
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=800,
        chunk_overlap=200,
        separators=["\n\n", "\n", ". ", " ", ""],
    )
    chunks = splitter.create_documents(documents)
 
    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
    vectorstore = Chroma.from_documents(
        documents=chunks,
        embedding=embeddings,
        collection_name=collection_name,
    )
    return vectorstore
 
# Pillar 2: Retrieval
def retrieve(vectorstore: Chroma, query: str, k: int = 5):
    retriever = vectorstore.as_retriever(
        search_type="mmr",
        search_kwargs={"k": k, "fetch_k": 20, "lambda_mult": 0.7},
    )
    return retriever.invoke(query)
 
# Pillar 3: Generation (covered in detail below)

The RecursiveCharacterTextSplitter tries each separator in order, falling back to shorter boundaries only when a chunk would otherwise exceed the size limit. The 200-character overlap (chunk sizes are measured in characters by default) ensures that sentences straddling chunk boundaries are not lost. These seemingly small parameters have an outsized impact on retrieval quality, and we tuned them extensively during development.
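To make the overlap behavior concrete, here is a minimal sketch of fixed-size chunking with overlap in plain Python (no LangChain dependency); RecursiveCharacterTextSplitter layers separator-aware boundaries on top of this basic sliding-window idea:

```python
def chunk_with_overlap(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Naive fixed-size chunker: each chunk starts chunk_size - overlap
    characters after the previous one, so adjacent chunks share
    `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

Because neighboring chunks share the overlap region, a sentence that straddles a boundary appears intact in at least one chunk, which is exactly why the 800/200 setting matters for retrieval.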

Indexing Pipeline Design

The indexing pipeline is where most RAG projects succeed or fail. Raw data arrives in many formats: CSV exports, SQL query results, PDF reports, Markdown documentation, and JSON API responses. Each format requires a different loader, and the choice of chunking strategy depends on the structure of the content.

For Data Whispal Agent, we built a modular loader registry that maps MIME types to loader classes:

from langchain_community.document_loaders import (
    CSVLoader,
    UnstructuredMarkdownLoader,
    JSONLoader,
)
from pathlib import Path
 
LOADER_REGISTRY = {
    ".csv": lambda p: CSVLoader(p, encoding="utf-8"),
    ".md": lambda p: UnstructuredMarkdownLoader(str(p)),
    ".json": lambda p: JSONLoader(
        file_path=str(p),
        jq_schema=".[]",
        text_content=False,
    ),
}
 
def load_document(path: Path):
    suffix = path.suffix.lower()
    loader_factory = LOADER_REGISTRY.get(suffix)
    if loader_factory is None:
        raise ValueError(f"Unsupported file type: {suffix}")
    loader = loader_factory(path)
    return loader.load()

Beyond loading, metadata enrichment is essential. Every chunk should carry metadata that describes its origin: the source file, the table name, the date range it covers, and any schema information. This metadata powers filtered retrieval later, where the agent can narrow its search to chunks from a specific dataset or time window.

import hashlib
from datetime import datetime, timezone
 
def enrich_metadata(documents, source_name: str, domain: str):
    for doc in documents:
        doc.metadata.update({
            "source": source_name,
            "domain": domain,
            "indexed_at": datetime.now(timezone.utc).isoformat(),
            "chunk_hash": hashlib.sha256(
                doc.page_content.encode()
            ).hexdigest()[:16],
        })
    return documents

The chunk_hash field is particularly useful for deduplication. When a dataset is re-indexed after an update, we compare hashes to avoid inserting duplicate chunks into the vector store. This keeps the index lean and prevents retrieval from returning near-identical passages that waste context window tokens.
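A minimal sketch of that dedup check (the `existing_hashes` set is illustrative; in practice we load the known hashes from the vector store's metadata before re-indexing):

```python
import hashlib

def chunk_hash(text: str) -> str:
    """Same hashing scheme as the metadata enrichment step."""
    return hashlib.sha256(text.encode()).hexdigest()[:16]

def filter_new_chunks(chunks: list[str], existing_hashes: set[str]) -> list[str]:
    """Keep only chunks whose hash is not already in the index."""
    fresh = []
    for text in chunks:
        h = chunk_hash(text)
        if h not in existing_hashes:
            existing_hashes.add(h)
            fresh.append(text)
    return fresh
```

Only the surviving chunks are embedded and inserted, so unchanged portions of a re-uploaded dataset cost nothing.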

Retrieval Strategies That Actually Work

Naive similarity search retrieves the top-k chunks closest to the query embedding. This works surprisingly well for simple factual questions but breaks down for complex analytical queries. When a user asks "What was the month-over-month revenue trend for Q3?", the most similar chunks might all describe the same month, missing the comparative data needed to identify a trend.

We use three strategies to combat this problem.

Maximal Marginal Relevance (MMR) balances relevance with diversity. It selects the first result by pure similarity, then penalizes subsequent candidates that are too similar to already-selected results. The lambda_mult parameter controls the trade-off: values near 1.0 favor relevance, values near 0.0 favor diversity.
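In pure Python, the MMR selection loop looks roughly like this (cosine similarity over toy vectors; LangChain's implementation applies the same greedy rule to the embeddings returned by the vector store):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def mmr_select(query_vec, candidate_vecs, k, lambda_mult=0.7):
    """Greedily pick k candidates maximizing
    lambda * sim(query, c) - (1 - lambda) * max sim(c, already picked)."""
    selected: list[int] = []
    remaining = list(range(len(candidate_vecs)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, candidate_vecs[i])
            redundancy = max(
                (cosine(candidate_vecs[i], candidate_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With lambda_mult near 1.0 the redundancy penalty vanishes and the loop degenerates to plain top-k similarity; near 0.0 it aggressively spreads the picks across dissimilar candidates.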

Metadata filtering narrows the search space before vector similarity is computed. If the agent detects that the user is asking about a specific dataset, it passes a filter to the retriever:

def retrieve_with_filter(
    vectorstore: Chroma,
    query: str,
    domain: str | None = None,
    k: int = 5,
):
    search_kwargs = {"k": k, "fetch_k": 20, "lambda_mult": 0.7}
 
    if domain:
        search_kwargs["filter"] = {"domain": {"$eq": domain}}
 
    retriever = vectorstore.as_retriever(
        search_type="mmr",
        search_kwargs=search_kwargs,
    )
    return retriever.invoke(query)

Hybrid search combines dense vector retrieval with sparse keyword matching. Dense embeddings capture semantic meaning but can miss exact terms like product codes or column names. Sparse methods like BM25 excel at exact matches. We run both in parallel and merge the results using reciprocal rank fusion:

from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
 
def build_hybrid_retriever(documents, vectorstore, k: int = 5):
    bm25 = BM25Retriever.from_documents(documents, k=k)
    dense = vectorstore.as_retriever(
        search_type="mmr",
        search_kwargs={"k": k},
    )
 
    ensemble = EnsembleRetriever(
        retrievers=[bm25, dense],
        weights=[0.4, 0.6],
    )
    return ensemble

The 0.4/0.6 weighting was determined empirically by evaluating retrieval precision on a test set of 200 question-answer pairs: keyword-heavy queries about specific column names benefited from a higher BM25 weight, semantic queries about business concepts benefited from a higher dense weight, and this split was the best single compromise between the two.
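Under the hood, the ensemble merges the two ranked lists with weighted reciprocal rank fusion. A plain-Python sketch of the idea (the constant `c=60` is the conventional RRF smoothing term):

```python
def reciprocal_rank_fusion(rankings, weights, c=60):
    """rankings: list of ranked-ID lists (best first).
    Each document earns weight / (c + rank) per list it appears in;
    scores are summed across lists and documents re-sorted."""
    scores: dict[str, float] = {}
    for ranked, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (c + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because the score depends only on rank, not on the incomparable raw scores of BM25 and cosine similarity, the fusion is robust to the two retrievers operating on completely different scales.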

The Generation Layer

Once relevant chunks are retrieved, they are formatted and injected into the prompt. The generation layer is where the LLM transforms raw context into a coherent, user-facing answer. The prompt template is the most important artifact in this layer.

from langchain_core.prompts import ChatPromptTemplate
 
SYSTEM_TEMPLATE = """You are Data Whispal Agent, an analytics assistant.
Answer the user's question based ONLY on the provided context.
If the context does not contain enough information, say so explicitly.
Do not fabricate data points, statistics, or trends.
 
When presenting numerical data:
- Include units and time periods
- Cite the source dataset from the context metadata
- Flag any data quality issues you notice
 
Context:
{context}
"""
 
prompt = ChatPromptTemplate.from_messages([
    ("system", SYSTEM_TEMPLATE),
    ("human", "{question}"),
])

Several design choices in this prompt are deliberate. The instruction to answer "ONLY on the provided context" reduces hallucination. The requirement to "cite the source dataset" adds traceability. The instruction to "flag data quality issues" turns the agent into a critical reviewer rather than a naive parrot.

We also implement a context compression step before injection. Retrieved chunks often contain boilerplate or irrelevant sections. A smaller model can extract only the relevant sentences, reducing token usage and improving answer quality:

from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain.retrievers import ContextualCompressionRetriever
from langchain_openai import ChatOpenAI
 
def build_compressed_retriever(base_retriever):
    compressor_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
    compressor = LLMChainExtractor.from_llm(compressor_llm)
 
    return ContextualCompressionRetriever(
        base_compressor=compressor,
        base_retriever=base_retriever,
    )

This adds latency but significantly improves answer precision, especially when chunks are large or when the original documents contain dense tabular data where only a few rows are relevant.

Putting It All Together: The RAG Chain

With indexing, retrieval, and generation defined, the final step is composing them into a runnable chain. LangChain Expression Language (LCEL) makes this composition declarative and streamable:

from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
 
def format_docs(docs):
    formatted = []
    for doc in docs:
        source = doc.metadata.get("source", "unknown")
        formatted.append(
            f"[Source: {source}]\n{doc.page_content}"
        )
    return "\n\n---\n\n".join(formatted)
 
def build_rag_chain(vectorstore, documents):
    hybrid_retriever = build_hybrid_retriever(
        documents, vectorstore
    )
    compressed_retriever = build_compressed_retriever(
        hybrid_retriever
    )
 
    llm = ChatOpenAI(model="gpt-4o", temperature=0.1)
 
    chain = (
        {
            "context": compressed_retriever | format_docs,
            "question": RunnablePassthrough(),
        }
        | prompt
        | llm
        | StrOutputParser()
    )
    return chain
 
# Usage
chain = build_rag_chain(vectorstore, documents)
answer = chain.invoke("What was the average order value last quarter?")

The chain reads from left to right. The query string flows into both the retriever branch (which fetches, compresses, and formats context) and the passthrough branch (which preserves the original question). Both outputs feed into the prompt template, which is sent to the LLM. The output parser extracts the string content from the model's response.

For production use, we add error handling, token counting, and latency tracking at each stage. We also persist the retrieved chunks alongside every answer so that users can inspect the evidence and our evaluation pipeline can measure retrieval quality retroactively.
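The instrumentation can start as simply as a timing wrapper around each stage (a minimal sketch; in production the measurements are pushed to a metrics backend rather than returned to the caller):

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def timed_stage(name: str, fn: Callable[..., T], *args, **kwargs) -> tuple[T, float]:
    """Run one pipeline stage and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000
    # In production, emit `name` and `elapsed_ms` to the metrics backend here.
    return result, elapsed_ms

# Usage sketch: wrap retrieval and generation separately so each
# stage's latency is visible in isolation, e.g.
# docs, retrieve_ms = timed_stage("retrieve", retrieve, vectorstore, query)
```

Timing retrieval and generation separately is what first revealed that context compression, not the final LLM call, dominated our tail latency.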

Conclusion

RAG architecture transforms a general-purpose language model into a domain-specific analytics agent. The separation of indexing, retrieval, and generation into independent layers allows each to be tested, monitored, and improved without disrupting the others. The techniques discussed here, including recursive chunking with metadata enrichment, hybrid retrieval with MMR diversity, context compression, and grounded generation prompts, form the backbone of Data Whispal Agent.

The most important lesson from building this system is that retrieval quality is the bottleneck. A mediocre LLM with excellent retrieval will outperform a state-of-the-art LLM with poor retrieval every time. Invest your engineering effort accordingly: build comprehensive evaluation sets for retrieval, instrument your pipeline to measure chunk relevance, and iterate on your chunking and embedding strategies before tuning the generation prompt. The architecture is only as strong as the context it provides.
