Feb 28, 2025

Understanding the Patterns of Use for RAG Chatbots

Fahad Ahmad and Mackenzie Knapp

Chatbots have become a mainstay in modern business, offering quick responses to customer inquiries and internal user questions alike. However, basic chat applications, which simply call a Large Language Model (LLM) for every question, often fail to meet the complex, domain-specific needs of many organizations. Retrieval-Augmented Generation (RAG) bridges this gap by combining a powerful LLM (such as GPT-4, Claude, or other advanced models) with domain-specific data retrieval, ensuring accurate and cost-effective responses.

Because answers are grounded in your company's internal IP, they are coherent, accurate, and tailored to your domain, giving you a competitive edge in response quality. Contrary to popular belief, implementing a RAG solution isn't prohibitively expensive; in many cases, it is more cost-effective than paying for enterprise-level LLM subscriptions (e.g., GPT-4, Anthropic Claude, Microsoft Azure OpenAI). This article explores the patterns of use for RAG chatbots, highlights key scenarios where RAG shines, and provides practical guidance for building one.

When to Move from Basic Chat to RAG

  1. You Have Complex, Proprietary Data

    • Industries like finance, healthcare, and law often deal with specialized or confidential documents not covered by public LLM training sets.

    • A basic chatbot won’t provide verifiable answers unless it can retrieve the actual content from your internal sources.

  2. High Accuracy & Compliance Needs

    • When answers must be traceable to real documents for compliance or legal reasons, you need a system capable of referencing exact text rather than relying solely on an LLM’s inference.

  3. Frequent Document Updates

    • If your data changes regularly—such as policy documents, product specs, or legal regulations—you need a chatbot that pulls the latest information without retraining the model.

  4. Cost Pressures

    • Large-scale usage of enterprise LLMs can be expensive, especially if you pay per token on every query. RAG can drastically reduce the total token usage and lower your bills by using targeted retrieval rather than passing extensive context to the LLM.

Why RAG Can Be More Cost-Effective

Lower Token Usage

  • Reduced LLM Calls: By retrieving only the most relevant chunks from your data store, you can pass a significantly smaller context to the LLM.

  • Cheaper Embedding APIs: Generating embeddings for vector search is generally much cheaper than making full calls to text-generation models. By letting vector search do the heavy lifting, you minimize overall token usage and, with it, costs.

Indexing & Caching

  • Query Indexing: You can cache or index frequently asked queries (e.g., “Where is the employee handbook?”). If the same question arises again, your system can reuse the cached answer or perform a cheaper vector lookup rather than reprocessing the entire query.

Avoiding Enterprise UI Markups

  • Cost Savings on UI: Many large vendors sell enterprise chatbot interfaces at a premium. These solutions often lack robust RAG capabilities, meaning you could be paying extra for a UI that doesn't support advanced document retrieval.

  • API-Based Flexibility: By using APIs (for example, GPT-4, Claude, or other advanced models) along with a custom-built UI, you sidestep these expensive bundled offerings and maintain full control over your data pipeline.

Open-Source Document Intelligence

  • Enterprise Document Parsing Tools vs. Open Source: Tools like Azure Document Intelligence and Google's Document AI are powerful managed services that perform OCR, layout analysis, table extraction, and more. However, their costs can add up quickly.

  • Cost-Effective Alternatives: By using open-source models for data parsing and document intelligence, you can achieve similar outcomes at a fraction of the price—up to 44x cheaper than proprietary solutions, in our experience. This approach not only saves money but also ensures that your system extracts as much information as possible from your complex internal documents.

Pattern of Use: Key Steps for a RAG Chatbot

Below is a practical roadmap you can implement immediately to build a RAG chatbot that is both flexible and cost-effective.

1. Ingestion & Document Intelligence

  • Extract Everything: Many companies have distinct, complex documents (e.g., PDFs with tables, embedded Excel sheets, and images).

    • Managed Service: Azure Document Intelligence offers comprehensive OCR and data extraction, but can be costly.

    • Open-Source Alternative: Deploy open-source document intelligence models that can be up to 44x cheaper, ensuring that every bit of information is captured and understood.

  • Agentic & Semantic Chunking: Chunking breaks large documents into smaller, meaningful sections that fit within the LLM’s context window. If chunking is done incorrectly (e.g., cutting a sentence mid-way or merging unrelated sections), your chatbot may retrieve irrelevant text and return inaccurate answers. A minimal chunking sketch follows this list.

    • Semantic Chunking: Splits content at natural boundaries (paragraphs, headings).

    • Agentic Chunking: Uses an LLM to dynamically determine chunk boundaries, helping ensure your chatbot receives relevant text. Most large companies offering black-box RAG solutions still struggle to perfect chunking at scale, which is why building your own can remain the better approach for accuracy.
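To make the chunking step concrete, here is a minimal semantic-chunking sketch in Python. It treats blank lines as paragraph boundaries and packs paragraphs into chunks under a rough size budget; the 1,000-character budget and the function name are illustrative assumptions, and a production pipeline would also respect headings and token counts.

```python
# Minimal semantic-chunking sketch: split on blank lines (paragraph boundaries),
# then pack paragraphs into chunks that stay under a rough size budget.
# The 1,000-character budget is an illustrative assumption, not a recommendation.

def semantic_chunks(text: str, max_chars: int = 1000) -> list[str]:
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        # Start a new chunk when adding this paragraph would exceed the budget.
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

Agentic chunking replaces the fixed character budget with an LLM call that decides where the boundaries should fall.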

2. Vector Embedding & Storage

  • Embeddings: Convert each document chunk into numerical vectors using an embedding model. This process is generally much cheaper than generating responses with an LLM.

  • Custom Embedding Models (Optional): By training custom embedding models on domain-specific data, it is possible to create vector representations that more accurately reflect the semantic relationships and nuances of the target domain. Such models can learn to position related terms like "Projector" and "Time Tracking" in closer proximity within the vector space, thereby enhancing the retrieval process.

  • Vector Database: Store these vectors in a scalable vector store.

    • Proprietary Options: Pinecone, Weaviate, Qdrant.

    • Open-Source Savings: Use PGVector, a free PostgreSQL extension, to reduce costs further while keeping vector search efficient (a short embed-and-store sketch follows this list).
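As a rough illustration of the embed-and-store step, the sketch below assumes OpenAI's embedding API and a Postgres table created with the PGVector extension. The table schema, model name, and connection string are placeholders; swap in your own embedding model or vector store as needed.

```python
# Sketch of the embed-and-store step, assuming an OpenAI embedding model and a
# Postgres table created with the pgvector extension, e.g.:
#   CREATE EXTENSION IF NOT EXISTS vector;
#   CREATE TABLE chunks (id serial PRIMARY KEY, content text, embedding vector(1536));

import psycopg2
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(texts: list[str]) -> list[list[float]]:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return [item.embedding for item in resp.data]

def store_chunks(chunks: list[str], dsn: str = "postgresql://localhost/rag") -> None:
    vectors = embed(chunks)
    # The connection context manager commits the transaction on exit.
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        for content, vector in zip(chunks, vectors):
            # pgvector accepts a bracketed string literal cast to ::vector.
            cur.execute(
                "INSERT INTO chunks (content, embedding) VALUES (%s, %s::vector)",
                (content, str(vector)),
            )
```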

3. Retriever + Advanced LLMs

  • Retriever: When a user asks a question, the retriever queries the vector database to find the best-matching chunks, drastically reducing the context size needed for the next step.

  • LLM for Response Generation: Use an advanced model like GPT-4, Claude, or other state-of-the-art LLMs (see the retrieve-then-generate sketch after this list).

    • Reasoning vs. Non-Reasoning Models:

      • Reasoning Models: Use these when the query requires multi-step reasoning, cross-referencing multiple document parts, or handling complex requests. Frameworks like crewAI can orchestrate such tasks, though this moves into agentic-framework territory.

      • Non-Reasoning Models: For simpler queries, a standard LLM without additional reasoning layers can suffice, reducing costs further.

  • Agentic Frameworks: Integrate frameworks such as LangChain, Haystack, or crewAI to handle dynamic multi-step queries and orchestrate the retrieval process for advanced use cases.
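Putting the retriever and the LLM together, the sketch below embeds the question, pulls the nearest chunks from PGVector by cosine distance, and sends only those chunks to the model. It reuses the embed() helper, client, and chunks table from the storage sketch above; the model name and the top-k value are illustrative assumptions.

```python
import psycopg2  # embed() and client are defined in the storage sketch above

def retrieve(question: str, k: int = 4, dsn: str = "postgresql://localhost/rag") -> list[str]:
    # Embed the question and fetch the k nearest chunks by cosine distance (<=>).
    query_vector = embed([question])[0]
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(
            "SELECT content FROM chunks ORDER BY embedding <=> %s::vector LIMIT %s",
            (str(query_vector), k),
        )
        return [row[0] for row in cur.fetchall()]

def answer(question: str) -> str:
    # Pass only the retrieved chunks to the LLM, keeping the context small.
    context = "\n\n".join(retrieve(question))
    resp = client.chat.completions.create(
        model="gpt-4o",  # illustrative; use a reasoning model for complex queries
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content
```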

4. Indexing & Caching

  • Query Indexing: Cache common queries so that if the same question is asked again, the system can return the cached response or perform a cheaper vector lookup without incurring additional LLM costs (see the sketch after this list).

  • Cost Monitoring: At OPMG, we built our own RAG system across multiple companies and found that infrastructure costs were minimal, even with thousands of users, and API fees were 7x cheaper than enterprise AI chat solutions. Keep in mind that many enterprise solutions do not include full black-box RAG services, so you may be paying 7x the cost for a tool your competitors also use, and neither company's responses will carry the competitive edge that comes from the vast context of your internal IP and documents.
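A minimal version of query caching can be as simple as hashing the normalized question and reusing the stored answer on a repeat hit. The in-memory dictionary below is a stand-in for a real cache such as Redis or a database table, and answer() is the helper from the retrieval sketch above.

```python
import hashlib

_cache: dict[str, str] = {}  # stand-in for Redis or a database table

def cached_answer(question: str) -> str:
    # Normalize and hash the question so whitespace and casing differences
    # still hit the same cache entry.
    key = hashlib.sha256(question.strip().lower().encode()).hexdigest()
    if key not in _cache:
        _cache[key] = answer(question)  # only pay for retrieval + LLM on a miss
    return _cache[key]
```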

5. Flexible Architecture & Self-Hosted Models

  • Self-Hosted Models: Hosting your own models can be cheaper, and in some cases, you may incur little to no LLM cost, depending on the model selected.

    • For instance, if you choose an open-source LLM like Mistral or Llama 2, you can control costs and avoid high fees associated with managed API services.

  • Open-Source Flexibility: If you want to host everything in-house for privacy, compliance, or cost reasons, you can deploy an open-source LLM (e.g., Mistral, Llama 2) and pair it with a vector store like PGVector. This not only reduces costs but also offers flexibility: you can swap components easily, for example replacing your LLM when a newer, better model becomes available, without overhauling your entire system (a minimal sketch of this pattern follows).
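One way to keep the architecture swappable is to have the pipeline depend on a small interface rather than a specific vendor, so a hosted API and a self-hosted open-source model are interchangeable. The class names and the local serving endpoint below are illustrative assumptions, not a prescribed design.

```python
from typing import Protocol

import requests  # used by the self-hosted example below


class ChatModel(Protocol):
    def generate(self, prompt: str) -> str: ...


class HostedModel:
    """Wraps an OpenAI-style managed API (e.g., GPT-4) behind the common interface."""

    def __init__(self, client, model: str):
        self.client, self.model = client, model

    def generate(self, prompt: str) -> str:
        resp = self.client.chat.completions.create(
            model=self.model, messages=[{"role": "user", "content": prompt}]
        )
        return resp.choices[0].message.content


class SelfHostedModel:
    """Wraps an open-source model (e.g., Mistral or Llama 2) served at a local endpoint."""

    def __init__(self, url: str = "http://localhost:8000/generate"):
        self.url = url  # illustrative endpoint; match your own serving stack

    def generate(self, prompt: str) -> str:
        return requests.post(self.url, json={"prompt": prompt}).json()["text"]


def answer_with(model: ChatModel, question: str) -> str:
    # retrieve() comes from the retrieval sketch above; the model is interchangeable.
    context = "\n\n".join(retrieve(question))
    return model.generate(f"Context:\n{context}\n\nQuestion: {question}")
```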

6. Monitoring & Fine-Tuning

  • Usage Analytics: Track token usage, embedding costs, and the frequency of repeated queries. Evaluate the effectiveness of caching and refine your approach as needed (a lightweight tracking sketch follows this list).

  • Fine-Tuning (Optional): If RAG and prompting alone do not meet your specialized needs, especially for handling specific industry jargon, consider fine-tuning with approaches like LoRA or QLoRA. This targeted fine-tuning can offer additional improvements at a fraction of the cost of training a model from scratch.

  • Knowledge Distillation (Optional): Typically, chatbots are developed for specific domains. By leveraging Large Language Models (LLMs), we can distill their knowledge into smaller, more specialized models, known as Small Language Models (SLMs). These compact models are sufficiently lightweight to be self-hosted, thereby enhancing privacy and yielding additional cost savings.
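For usage analytics, a lightweight tracker that records token counts and cache hits per query is often enough to see where the spend goes. The sketch below assumes OpenAI-style responses that expose usage.prompt_tokens and usage.completion_tokens; the per-token prices are placeholders to be replaced with your provider's actual rates.

```python
from dataclasses import dataclass


@dataclass
class UsageTracker:
    prompt_tokens: int = 0
    completion_tokens: int = 0
    cache_hits: int = 0
    queries: int = 0
    price_per_1k_prompt: float = 0.01       # placeholder rate, not a real price
    price_per_1k_completion: float = 0.03   # placeholder rate, not a real price

    def record(self, resp=None, cache_hit: bool = False) -> None:
        # Count every query; cached answers cost nothing beyond the lookup.
        self.queries += 1
        if cache_hit:
            self.cache_hits += 1
            return
        # OpenAI-style responses expose token counts on resp.usage.
        self.prompt_tokens += resp.usage.prompt_tokens
        self.completion_tokens += resp.usage.completion_tokens

    def estimated_cost(self) -> float:
        return (self.prompt_tokens / 1000) * self.price_per_1k_prompt + (
            self.completion_tokens / 1000
        ) * self.price_per_1k_completion
```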

The Bottom Line

A RAG chatbot represents the next-level solution when you need domain-specific accuracy, cost-efficiency, and flexibility beyond what a basic LLM-based chat application can provide. By leveraging advanced techniques like indexing queries, you avoid paying repeatedly for the same question. Using targeted vector search reduces token usage by minimizing the context sent to expensive LLMs. Furthermore, adopting open-source document intelligence for data parsing can be 44x cheaper than proprietary solutions, ensuring your system fully understands your complex internal documents.

By integrating advanced models—whether using reasoning frameworks like crewAI for multi-step queries or simpler models for straightforward tasks—you can customize your solution to meet diverse business needs. A modular, self-hosted architecture that includes components like PGVector for vector storage ensures you maintain full control over your data and costs while enjoying enterprise-grade performance without the premium markup.

If basic chat no longer meets your business requirements or you’re tired of paying premium fees for enterprise UIs that don’t support robust RAG architectures, consider building your own RAG chatbot. With this approach, you'll gain unparalleled control, reduced costs, and the confidence that your bot always delivers accurate, context-rich responses tailored to your unique needs.

In our experience, an enterprise-grade RAG chatbot takes roughly 4-8 weeks to build, provided the system is made secure, reliable, integrated into your environment, and tailored to your use cases and documents.
