Large language models are remarkably capable, but they have a fundamental limitation: they only know what was in their training data. Ask a general-purpose LLM about your company's internal documentation, your product specifications, or last quarter's financial results and it will either hallucinate an answer or admit it does not know. Retrieval Augmented Generation solves this problem by giving the LLM access to your data at inference time, without the cost and complexity of fine-tuning.
RAG has become the default architecture for enterprise AI applications that need to reason over proprietary knowledge. It is simpler to implement than fine-tuning, easier to keep current as your data changes, and provides built-in source attribution that makes outputs auditable. But building a RAG pipeline that works reliably in production requires understanding each component in the pipeline and making deliberate choices about how they fit together.
This guide covers the full RAG architecture, from document ingestion through generation, with practical code examples and guidance on the decisions that matter most.
What RAG Is and Why It Matters
Retrieval Augmented Generation is a two-phase process. First, given a user query, the system retrieves relevant documents or passages from a knowledge base. Second, those retrieved passages are injected into the LLM's prompt as context, and the model generates a response grounded in that context.
The architecture addresses three critical problems with standalone LLMs:
Knowledge currency. LLMs are frozen at their training cutoff. RAG lets you feed in documents that were created yesterday. When your product documentation changes, you update the knowledge base -- not the model.
Factual grounding. By providing source material in the prompt, RAG dramatically reduces hallucination. The model generates responses based on actual documents rather than parametric memory, and you can cite the specific passages that informed each answer.
Domain specificity. General-purpose models lack depth in specialized domains. RAG lets you inject domain-specific knowledge -- medical literature, legal precedents, engineering specifications -- without training a custom model.
The key insight behind RAG is that retrieval and generation are complementary. Retrieval systems are excellent at finding relevant information but poor at synthesizing it into coherent answers. LLMs are excellent at synthesis but unreliable when generating facts from memory. Combining them produces a system that is both accurate and articulate.
RAG Architecture: The Complete Pipeline
A production RAG system consists of two pipelines: an offline ingestion pipeline that prepares your documents, and an online query pipeline that handles user requests.
The Ingestion Pipeline
The ingestion pipeline transforms raw documents into a searchable format:
- Document loading -- Collect documents from their sources: PDFs, web pages, databases, Confluence wikis, Notion pages, Slack channels, or any other data store. Each source typically requires a dedicated loader.
- Chunking -- Split documents into smaller passages. This is one of the most consequential decisions in the pipeline. Chunks that are too large dilute relevant information with noise. Chunks that are too small lose context and coherence.
- Embedding -- Convert each chunk into a dense vector representation using an embedding model. These vectors capture semantic meaning, so passages about similar topics end up close together in vector space regardless of the specific words used.
- Indexing -- Store the vectors and their associated text in a vector database, along with metadata like source document, page number, and creation date.
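The four steps above can be sketched end to end with a toy in-memory index. Everything here is a stand-in: the hard-coded documents replace real loaders, and the bag-of-words counter replaces a real embedding model:

```python
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "embedding": a bag-of-words count vector (step 3 stand-in).
    return Counter(text.lower().split())

def chunk(text: str, size: int = 20, overlap: int = 5) -> list[str]:
    # Step 2: fixed-size chunking by word count, with overlap.
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]

# Step 1: "load" two documents (hard-coded stand-ins for real loaders).
documents = {
    "auth.md": "API authentication uses bearer tokens. " * 10,
    "billing.md": "Invoices are generated monthly per account. " * 10,
}

# Steps 2-4: chunk, embed, and index each passage with its metadata.
index = []
for source, text in documents.items():
    for passage in chunk(text):
        index.append({"source": source, "text": passage, "vector": embed(passage)})

print(f"Indexed {len(index)} chunks from {len(documents)} documents")
```

A real pipeline swaps each stand-in for a production component, but the shape of the data flowing through -- chunks with text, a vector, and source metadata -- stays the same.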
The Query Pipeline
When a user asks a question, the query pipeline executes:
- Query embedding -- The user's question is converted into a vector using the same embedding model used during ingestion.
- Retrieval -- The vector database performs a similarity search, returning the chunks whose vectors are closest to the query vector. This typically uses approximate nearest neighbor algorithms for speed.
- Reranking (optional) -- A cross-encoder model rescores the retrieved chunks for relevance. This is more computationally expensive than vector similarity but significantly more accurate.
- Prompt construction -- The retrieved chunks are assembled into a prompt template along with the user's question and any system instructions.
- Generation -- The LLM generates a response based on the provided context, ideally citing which chunks informed its answer.
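The query-time half can likewise be sketched with toy components. The bag-of-words "embedding" and brute-force cosine search below stand in for a real embedding model and an ANN index:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real system reuses the same
    # embedding model that was used at ingestion time.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# A tiny pre-built index (normally produced by the ingestion pipeline).
index = [
    {"source": "auth.md", "text": "api keys are passed in the authorization header"},
    {"source": "billing.md", "text": "invoices are generated on the first of the month"},
]

query = "how do I send my api key"
qvec = embed(query)  # step 1: query embedding

# Step 2: brute-force top-k similarity search (a vector DB uses ANN here;
# step 3, reranking, is omitted in this sketch).
top_k = sorted(index, key=lambda c: cosine(qvec, embed(c["text"])), reverse=True)[:1]

# Step 4: prompt construction; step 5 would send this prompt to the LLM.
context = "\n\n".join(f"Source: {c['source']}\n{c['text']}" for c in top_k)
prompt = f"Answer using only this context:\n\n{context}\n\nQuestion: {query}"
print(prompt)
```

Note that the same `embed` function serves both ingestion and querying; mixing embedding models between the two phases is a common and silent failure mode.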
Choosing a Vector Database
The vector database is the backbone of your RAG system. Your choice affects retrieval speed, scalability, operational complexity, and cost. Here are the leading options and when each makes sense.
Pinecone is a fully managed vector database designed specifically for production AI applications. It handles infrastructure, scaling, and high availability out of the box. Pinecone is the right choice when you want minimal operational overhead and your team does not want to manage database infrastructure. It supports metadata filtering, namespaces for multi-tenancy, and serverless pricing that scales to zero.
Weaviate is an open-source vector database with a rich feature set including hybrid search (combining vector and keyword search), built-in vectorization modules, and GraphQL-based querying. Weaviate can run self-hosted or as a managed cloud service. It is a strong choice when you need hybrid search capabilities or want the flexibility of self-hosting.
Chroma is a lightweight, open-source embedding database optimized for developer experience. It runs in-process with a simple Python API, making it ideal for prototyping, local development, and applications with modest scale requirements. Chroma is the fastest path from zero to a working RAG prototype.
pgvector is a PostgreSQL extension that adds vector similarity search to your existing Postgres database. If your application already uses PostgreSQL, pgvector lets you store vectors alongside your relational data without introducing a new database into your stack. This reduces architectural complexity and is sufficient for many applications, especially those with fewer than a few million vectors.
For most teams starting a new project, the decision comes down to two questions. Do you already use PostgreSQL? Then start with pgvector. Do you need managed infrastructure at scale? Then use Pinecone or Weaviate Cloud.
Chunking Strategies That Actually Work
Chunking looks like a minor preprocessing detail, but it is one of the highest-leverage decisions in the pipeline: poor chunking is the single most common cause of bad retrieval quality. Here are the strategies that work in practice.
Fixed-size chunking splits text into chunks of a fixed token count with overlap between consecutive chunks. A chunk size of 512 tokens with 50 tokens of overlap is a reasonable starting point. The overlap ensures that information at chunk boundaries is not lost.
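A minimal sketch of this strategy, using whitespace-split words as a stand-in for tokens (a real implementation would count tokens with the model's tokenizer, e.g. via tiktoken):

```python
def fixed_size_chunks(text: str, size: int = 8, overlap: int = 2) -> list[str]:
    # Split into windows of `size` words, each sharing `overlap` words
    # with the previous window so boundary information is not lost.
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]

text = "one two three four five six seven eight nine ten eleven twelve"
for c in fixed_size_chunks(text):
    print(c)
```

The small window here is for illustration only; with 512-token chunks and 50-token overlap the mechanics are identical. Notice that the words at the boundary ("seven eight") appear in both chunks, so a fact straddling the split is still retrievable.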
Recursive character splitting is the default strategy in LangChain and works well for most text. It recursively splits on paragraph breaks, then sentences, then words, trying to keep semantically related text together while respecting the maximum chunk size.
Semantic chunking uses an embedding model to identify natural breakpoints in the text. It computes embeddings for each sentence and splits where the semantic similarity between consecutive sentences drops significantly. This produces chunks that align with topic boundaries, improving retrieval relevance.
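A toy illustration of the idea, with word-set overlap standing in for embedding similarity (a real implementation would embed each sentence and compare cosine similarities; the threshold value is an assumption to tune):

```python
def similarity(a: str, b: str) -> float:
    # Stand-in for embedding cosine similarity: Jaccard word-set overlap.
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def semantic_chunks(sentences: list[str], threshold: float = 0.2) -> list[str]:
    # Start a new chunk wherever similarity between consecutive
    # sentences drops below the threshold (a likely topic boundary).
    chunks, current = [], [sentences[0]]
    for prev, sent in zip(sentences, sentences[1:]):
        if similarity(prev, sent) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    chunks.append(" ".join(current))
    return chunks

sentences = [
    "The API uses bearer tokens for authentication.",
    "Tokens for the API expire after one hour.",
    "Billing invoices are generated monthly.",
]
print(semantic_chunks(sentences))
```

The two authentication sentences stay together while the billing sentence starts a new chunk, which is exactly the topic alignment that improves retrieval relevance.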
Document-structure-aware chunking respects the structure of your documents. For Markdown files, it splits on headings. For HTML, it splits on structural elements. For code, it splits on function or class boundaries. This approach preserves the author's intended organization.
The right strategy depends on your content. Technical documentation with clear headings benefits from structure-aware chunking. Long-form prose works well with recursive splitting. Research papers may need semantic chunking to handle dense, multi-topic sections.
Always test chunking strategies empirically with your actual data and queries. The difference between a mediocre chunking strategy and a good one can be a 20-30% improvement in retrieval relevance.
Building a RAG Pipeline with LangChain
Here is a complete, working RAG pipeline using LangChain, OpenAI embeddings, and Chroma as the vector store. This example covers both the ingestion and query phases.
```python
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_chroma import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

# --- Ingestion Pipeline ---

# 1. Load documents from a directory
loader = DirectoryLoader(
    "./knowledge_base",
    glob="**/*.md",
    loader_cls=TextLoader,
)
documents = loader.load()

# 2. Split into chunks. Note: chunk_size here counts characters; use
# RecursiveCharacterTextSplitter.from_tiktoken_encoder(...) for token-based sizing.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = splitter.split_documents(documents)

# 3. Create embeddings and store in Chroma
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db",
    collection_name="knowledge_base",
)
print(f"Ingested {len(chunks)} chunks from {len(documents)} documents")

# --- Query Pipeline ---

# 4. Create a retriever
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 5},
)

# 5. Define the prompt template
prompt = ChatPromptTemplate.from_template("""
Answer the question based only on the following context. If the context
does not contain enough information to answer the question, say so clearly.
Cite the relevant source documents in your answer.

Context:
{context}

Question: {question}

Answer:
""")

# 6. Build the chain
llm = ChatOpenAI(model="gpt-4o", temperature=0)

def format_docs(docs):
    # Prefix each chunk with its source so the model can cite it.
    return "\n\n---\n\n".join(
        f"Source: {doc.metadata.get('source', 'unknown')}\n{doc.page_content}"
        for doc in docs
    )

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# 7. Query the pipeline
response = rag_chain.invoke("How do I configure authentication for the API?")
print(response)
```
This pipeline handles the core workflow, but production systems typically need additional components: query preprocessing to rephrase ambiguous questions, a reranking step to improve retrieval precision, conversation memory to handle follow-up questions, and guardrails to prevent prompt injection.
Evaluation and Quality Metrics
You cannot improve what you do not measure. RAG evaluation requires assessing both the retrieval component and the generation component independently.
Retrieval metrics measure whether the system finds the right documents:
- Recall@k -- Of all relevant documents in the knowledge base, what fraction appears in the top k results? High recall means the system rarely misses relevant information.
- Precision@k -- Of the top k retrieved documents, what fraction is actually relevant? High precision means the system does not flood the prompt with irrelevant context.
- Mean Reciprocal Rank (MRR) -- How high does the first relevant document rank? MRR rewards systems that put the best result at the top.
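All three retrieval metrics are a few lines each to implement. The sketch below assumes you have, per query, the ranked document IDs the retriever returned and the set of IDs labeled relevant:

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    # Fraction of all relevant documents that appear in the top k.
    return len(set(retrieved[:k]) & relevant) / len(relevant) if relevant else 0.0

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    # Fraction of the top k results that are actually relevant.
    return len(set(retrieved[:k]) & relevant) / k

def mrr(ranked_lists: list[list[str]], relevant_sets: list[set[str]]) -> float:
    # Mean over queries of 1 / rank of the first relevant result
    # (0 contribution when nothing relevant is retrieved).
    total = 0.0
    for retrieved, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1 / rank
                break
    return total / len(ranked_lists)

retrieved = ["d3", "d1", "d7"]
relevant = {"d1", "d2"}
print(recall_at_k(retrieved, relevant, 3))    # d1 found, d2 missed -> 0.5
print(precision_at_k(retrieved, relevant, 3)) # 1 of 3 results relevant
print(mrr([retrieved], [relevant]))           # first relevant at rank 2 -> 0.5
```

Running these over your evaluation set after each pipeline change turns "retrieval feels worse" into a number you can track.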
Generation metrics measure the quality of the final answer:
- Faithfulness -- Does the answer accurately reflect the retrieved context, or does it introduce information not present in the sources? This is the most critical metric for enterprise applications.
- Answer relevance -- Does the answer actually address the user's question?
- Completeness -- Does the answer cover all the relevant information available in the retrieved context?
Frameworks like RAGAS and DeepEval automate these evaluations using LLM-as-judge techniques. Build an evaluation dataset of 50-100 representative questions with known answers, and run evaluations after every significant pipeline change.
```python
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

# Evaluate your RAG pipeline against a pre-built evaluation dataset
results = evaluate(
    dataset=eval_dataset,
    metrics=[faithfulness, answer_relevancy, context_precision],
)
print(results)
```
Production Considerations
Moving a RAG pipeline from prototype to production introduces challenges that do not appear during development.
Latency management becomes critical at scale. Embedding the query, searching the vector database, and running LLM inference each add latency. Target total response times under two seconds for interactive applications. Strategies include caching frequent queries, using smaller embedding models, pre-computing embeddings for common question patterns, and streaming the LLM response.
Document freshness requires an ingestion pipeline that runs on a schedule or reacts to changes. If your knowledge base is a set of Markdown files in a repository, trigger re-ingestion on every commit. If it is a database, use change data capture to process updates incrementally rather than reprocessing the entire corpus.
Security and access control matter when your knowledge base contains sensitive information. Not every user should be able to retrieve every document. Implement document-level access controls in your vector database using metadata filtering, and verify permissions at query time.
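In miniature, the pattern looks like this: restrict candidate chunks to documents the requesting user may see, then rank only the survivors. The index layout, role names, and word-overlap scoring below are illustrative stand-ins; a real vector database applies the same filter server-side through its metadata-filtering API:

```python
index = [
    {"text": "public pricing overview", "allowed_roles": {"everyone"}},
    {"text": "internal salary bands", "allowed_roles": {"hr"}},
    {"text": "engineering runbook", "allowed_roles": {"engineering", "hr"}},
]

def retrieve(query: str, user_roles: set[str], k: int = 2) -> list[str]:
    # Step 1: metadata filter -- drop chunks the user cannot access.
    candidates = [c for c in index
                  if c["allowed_roles"] & (user_roles | {"everyone"})]
    # Step 2: rank the survivors (word overlap stands in for vector
    # similarity in this sketch).
    qwords = set(query.lower().split())
    ranked = sorted(candidates,
                    key=lambda c: len(qwords & set(c["text"].split())),
                    reverse=True)
    return [c["text"] for c in ranked[:k]]

print(retrieve("salary bands", user_roles={"engineering"}))
```

Filtering before ranking matters: an unauthorized chunk must never reach the prompt, because anything in the prompt can leak into the generated answer.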
Cost optimization involves choosing the right models for each component. You do not need the most expensive embedding model or the largest LLM for every use case. Start with smaller, cheaper models and upgrade only where evaluation metrics show a meaningful quality gap. OpenAI's text-embedding-3-small is significantly cheaper than text-embedding-3-large and performs within a few percentage points on most benchmarks.
Observability means logging every step of the pipeline -- the original query, the retrieved chunks, the constructed prompt, the generated response, and any user feedback. This data is essential for debugging quality issues and identifying opportunities for improvement.
Get Started with RAG in Your Organization
RAG is the most practical path to deploying AI that understands your specific business context. It does not require training custom models, it stays current as your data changes, and it provides the source attribution that enterprise stakeholders demand.
The key is starting with a well-defined use case and a clean, well-structured knowledge base. Internal documentation search, customer support automation, and technical Q&A systems are proven starting points that deliver measurable value quickly.
If you are planning a RAG implementation and want to accelerate the path to production, the team at Maranatha Technologies builds retrieval augmented generation systems that integrate with your existing data infrastructure and scale with your needs. Visit our AI Solutions offerings or get in touch to discuss how RAG can work for your specific use case.