What is RAG, exactly?
RAG (Retrieval-Augmented Generation) is a technique that connects the power of LLMs with your internal data. While a standard model (e.g., ChatGPT on the web) answers based only on information it was trained on – which may be outdated or generic – RAG allows it to look into your private data and answer based on that.
Imagine you need specific answers that treat your company data as the source of truth: internal wikis, policies, regulations, documented processes, and so on. A standard LLM won't help you there, because it has no access to that information. RAG solves this limitation by "augmenting" the LLM with knowledge from your data. This augmentation and subsequent retrieval of information is the core idea behind RAG.
RAG is a very useful technique for building AI assistants and intelligent chatbots. Virtually every company has some internal documentation, and it is much easier to type a query in natural language than to dig through piles of documents searching for the answer to your question.
RAG is essentially the foundational building block for any "talk to your data" type of application.
Problems and shortcomings of most RAG implementations
To be able to talk to your data, you first need to get that data to the LLM – more precisely, into the vector space over which the retrieval step performs semantic search. This can be implemented in various ways, and there are many tutorials showing how to do it. The problem is that they almost always present a trivial example that cannot hold up in real-world use. There is an enormous difference between a quick YouTube prototype and a production system.
A typical implementation of this part of RAG often looks like one or two scripts that handle all the logic, from loading data to storing it as vectors. Such a design is not modular and offers little control. High-level frameworks like LangChain are frequently used, which essentially do everything in the background – to the point that even the developer who built the system doesn't really know what happened beneath the surface. This is extremely dangerous.
More often than not, a large amount of irrelevant or incorrectly processed data ends up in the vector database. Querying over such data naturally leads to inaccurate or completely wrong answers. An LLM is not a magician – if it doesn't have the right data, it cannot give the right answer. A simple rule applies here: "Garbage in, Garbage out."
RAG as a pipeline
Every RAG system – specifically the ingestion phase, where we augment the model with our data – works as a pipeline consisting of several fundamental steps.
Data must first be loaded from somewhere – a database, a folder on your computer, web-based documentation in Confluence, or any other source. The loaded data must be cleaned, parsed, and enriched with metadata such as page numbers, URLs, etc. The text then needs to be split into appropriately sized chunks, because some documents are very long and the model cannot work with them effectively. Splitting can be done by word count, by chapters, sections, paragraphs, and so on. Finally, the split text must be embedded and stored in a vector store, over which the retrieval step performs semantic similarity search.
Each phase is strictly separated. Data always flows in one direction: Load → Parse → Split → Embed. It is precisely this sequential data flow that makes this phase very well-suited to a variation of the design pattern known in software design as Pipe and Filter.
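As a minimal sketch of the Pipe and Filter idea – the function names here are illustrative toys, not the pipeline components built later in this article:

```python
from functools import reduce

# Each "filter" is a small function with one job; the "pipe" chains them.
def load(_):
    return "  Raw TEXT from a source  "

def clean(text):
    return text.strip().lower()

def split(text):
    return text.split()

def run_pipeline(filters, data=None):
    # Feed each stage's output into the next stage's input.
    return reduce(lambda acc, f: f(acc), filters, data)

chunks = run_pipeline([load, clean, split])
# → ['raw', 'text', 'from', 'a', 'source']
```

Each filter can be tested in isolation, and reordering or swapping filters is just a change to the list.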
You can think of it as an assembly line with distinct phases. In each phase, one specific operation is applied. At the end of the line, the desired result is produced. A great advantage of this design is its flexibility.
You can assemble the assembly line according to your needs, and each step can be easily modified or replaced. The steps are independent and completely separate from each other, which is generally a very desirable property in software. It allows us to better test, debug, and swap out individual components.
At the same time, because the pipeline is modular, it is ready almost by default for future modifications and additions. This matters because, in practice, development requirements change constantly. A well-designed pipeline is open to change – changes are fast, simple, and painless.
Let's look at an example of what such a pipeline looks like when implemented in code.
We said that the pipeline always consists of four steps. Text data must be:
- loaded
- parsed
- split
- embedded
We can therefore define several clearly bounded steps directly in code:
```python
from abc import ABC, abstractmethod
from typing import Any, List


class Step(ABC):
    @abstractmethod
    def run(self, data: Any) -> Any:
        pass


class LoadStep(Step):
    """Pipeline step to load raw data from a source."""

    def __init__(self, source: Source):
        self.source = source

    def run(self, data):
        ...


class ParseStep(Step):
    """Pipeline step to parse raw data into structured documents."""

    def __init__(self, parser: Parser):
        self.parser = parser

    def run(self, data):
        ...


class SplitStep(Step):
    """Pipeline step to split documents into smaller chunks."""

    def __init__(self, splitter: Splitter):
        self.splitter = splitter

    def run(self, data):
        ...


class EmbedStep(Step):
    """Pipeline step to embed documents and store them in a vector store."""

    def __init__(self, vector_store: VectorStore):
        self.vector_store = vector_store

    def run(self, data):
        ...
```
Once we have the basic steps defined, we can define further components alongside them as their dependencies. For example, the step that parses data (ParseStep) needs a specific Parser. The step that splits data into smaller parts (SplitStep) needs some Splitter, and so on.
```python
class Parser(ABC):
    @abstractmethod
    def parse(self, raw_data: List[RawDocument]) -> List[ParsedDocument]:
        pass


class ConfluencePageParser(Parser):
    def __init__(self):
        ...

    def parse(self, raw_data: List[RawDocument]) -> List[ParsedDocument]:
        ...
```
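The Splitter dependency can be defined the same way. A self-contained sketch with a naive word-count strategy – the `ParsedDocument`/`DocumentChunk` dataclasses and the `words_per_chunk` parameter are illustrative assumptions:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List


@dataclass
class ParsedDocument:
    text: str


@dataclass
class DocumentChunk:
    text: str


class Splitter(ABC):
    @abstractmethod
    def split(self, docs: List[ParsedDocument]) -> List[DocumentChunk]:
        pass


class WordCountSplitter(Splitter):
    """Naive splitter that cuts documents into fixed-size word windows."""

    def __init__(self, words_per_chunk: int = 200):
        self.words_per_chunk = words_per_chunk

    def split(self, docs: List[ParsedDocument]) -> List[DocumentChunk]:
        chunks: List[DocumentChunk] = []
        for doc in docs:
            words = doc.text.split()
            for i in range(0, len(words), self.words_per_chunk):
                # Join each window of words back into a chunk of text.
                chunks.append(DocumentChunk(" ".join(words[i:i + self.words_per_chunk])))
        return chunks
```

A smarter strategy (splitting by headings, sections, or sentences) would be another `Splitter` subclass, swappable without touching `SplitStep`.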
All components define abstract interfaces that ensure a clear communication protocol and allow easy substitution of one concrete class for another.
Because the individual components are independent, we can develop and test them independently and only at the end assemble them together into the final pipeline as needed.
Here is what assembling the pipeline in the orchestrator might look like:
```python
class RagPipeline:
    def __init__(self, load_step, parse_step, split_step, embed_step):
        self.load_step = load_step
        self.parse_step = parse_step
        self.split_step = split_step
        self.embed_step = embed_step

    def run(self):
        raw_pages = self.load_step.run(None)
        processed_docs = self.parse_step.run(raw_pages)
        doc_splits = self.split_step.run(processed_docs)
        self.embed_step.run(doc_splits)
```
Finally, we put everything together and run it.
```python
# main.py
settings = Settings()

source = ConfluenceSource(...)
parser = ConfluencePageParser(...)
splitter = MarkdownHeaderTextSplitter(...)
vector_store = ...

pipeline = RagPipeline(
    load_step=LoadStep(source),
    parse_step=ParseStep(parser),
    split_step=SplitStep(splitter),
    embed_step=EmbedStep(vector_store),
)
pipeline.run()
```
The result is a very flexible yet robust solution.
Individual components can be easily swapped out as needed, which can be absolutely critical.
Different texts have different formatting, complexity, and so on. For example, parsing documents from PDF may require different logic than parsing content from an internal wiki. Technical documentation full of graphs, tables, and formulas may require a different text-splitting strategy than plain text documents that can simply be split by word count. There can be a whole range of combinations and variations, and if we had to build an entire pipeline from A to Z for each of them, it would quickly become unmaintainable.
By having well-designed individual components, we essentially have a building kit at our disposal. The individual components are like LEGO bricks that can be assembled into whatever is currently needed. If we don't have a particular brick, no problem – we can simply build it without affecting the rest of the system. This allows us to respond flexibly to change while keeping the code readable and clean.
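A runnable toy sketch of this building-kit idea – all names and parsing rules here are illustrative stand-ins, not real components. Two parser bricks share one interface, so the step that uses them never changes:

```python
import re
from abc import ABC, abstractmethod


class Parser(ABC):
    @abstractmethod
    def parse(self, raw: str) -> str:
        pass


class HtmlParser(Parser):
    """Toy parser for wiki-style HTML: strips tags."""

    def parse(self, raw: str) -> str:
        return re.sub(r"<[^>]+>", "", raw).strip()


class MarkdownParser(Parser):
    """Toy parser for Markdown: strips heading markers."""

    def parse(self, raw: str) -> str:
        return "\n".join(line.lstrip("# ").rstrip() for line in raw.splitlines()).strip()


class ParseStep:
    def __init__(self, parser: Parser):
        self.parser = parser

    def run(self, data: str) -> str:
        return self.parser.parse(data)


# Same step, different brick -- nothing else in the pipeline changes.
wiki_step = ParseStep(HtmlParser())
md_step = ParseStep(MarkdownParser())
print(wiki_step.run("<p>Internal policy</p>"))  # Internal policy
print(md_step.run("# Internal policy"))         # Internal policy
```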
Why good design matters
When building any production software, it is good practice to follow several fundamental, time-tested principles.
One of them is the Single Responsibility Principle, which states that one component does one thing and does it well. Another is Separation of Concerns – components have clearly defined responsibilities and boundaries. Another is Low Coupling, which says that components should know as little about each other as possible. And yet another is the Dependency Inversion Principle – closely related to Inversion of Control – which advises depending on abstract interfaces rather than concrete classes.
These are best practices that have proven their worth over many years of practice, and it pays to stick to them. Not because software engineers like making up rules for themselves, but because an application that respects these rules is modular, well-structured, and readable. Good design is not just about writing "clean code" – it has very concrete business impacts. It increases:
- Scalability: When we need to add a new data source, we don't rewrite the entire pipeline. We simply implement a new `Source` class. The pipeline orchestrator remains unchanged.
- Testability: Because each component (`Splitter`, `Parser`, `Loader`) is isolated, we can develop them independently and write unit tests independently. We don't need to spin up a vector database just to test that our Markdown splitter works correctly.
- Debuggability: In "spaghetti" code full of if/else logic, finding bugs is hard. If there's an error in parsing HTML from Confluence, we know exactly where to look.
- Transparency: Separated components are small and easy to understand. It is easier to trace what is happening, where, and how, and the entire system is significantly easier to tune and monitor.
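To illustrate the testability point, here is a unit test for a toy header-based splitter. The function is defined inline so the example is self-contained (it is an illustrative sketch, not a real library API) – note that nothing here touches a vector database, an LLM, or the network:

```python
def split_by_headers(markdown: str) -> list[str]:
    """Toy splitter: each '#' heading starts a new chunk."""
    chunks: list[str] = []
    for line in markdown.splitlines():
        if line.startswith("#") or not chunks:
            chunks.append(line)
        else:
            chunks[-1] += "\n" + line
    return chunks


def test_split_by_headers():
    doc = "# Intro\nhello\n# Usage\nworld"
    # Only the component under test is involved -- no external infrastructure.
    assert split_by_headers(doc) == ["# Intro\nhello", "# Usage\nworld"]


test_split_by_headers()
```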
A well-designed RAG pipeline can elevate an entire application from a prototype to a robust system that can adapt and grow over the long term, precisely according to our needs.