01 / What is RAG? And how is it different from LLMs?
Red pill, blue pill
Remember that scene in The Matrix where Neo downloads kung fu directly into his brain? That's basically what every tech vendor is promising with their AI solutions. "Just plug our AI product into your company's data, and it'll instantly know everything!" But here's the thing – most of these solutions are powered by something called RAG, and without knowing what that actually means, you could inadvertently be taking the wrong-colored pill.
Just to get it out of the way, RAG stands for Retrieval-Augmented Generation, and it would take too long to unpack that acronym here, so let's just skip it.
If you don't know how RAG works, you aren't alone. Most people don't, even though it’s the cornerstone of so much AI-based utility. You may even be using RAG without realizing it. For example, when you upload a long document to ChatGPT, it uses RAG to “read” it. Popular tools like Writer and Glean and most other enterprise apps employ RAG foundationally to extract value from your organization's knowledge. It’s everywhere because it’s so damned useful!
RAG is a clever programming architecture invented to compensate for weaknesses in working directly with LLMs. I hate to delay the big reveal any longer, but before we can unpack RAG, we'll need to talk about LLMs.
A not-so-short detour to discuss LLMs
An LLM (or Large Language Model) is a statistical model of human languages. It can handle a lot of languages, including programming languages and Klingon, but let's not get distracted. Foundational LLMs from OpenAI, Google, and Anthropic are trained on nearly all public knowledge. As in… everything: from the works of Shakespeare to the threads of Twitter (yikes) and anything else that was written down that could be ~~stolen~~ obtained.
LLMs are just guessing the next word
LLMs work by taking the preceding words and guessing the most likely next word. So if I asked an LLM to complete this sentence: “The girl rode her beautiful red…”, it would likely answer “bike.” That would be its most confident guess, and it would have a long, ordered list of backup guesses. If you gave it more leeway to answer creatively (known technically as increasing the temperature), it might come back with “horse.” LLMs do this by analyzing all the words in their training data and building a statistical model of their interrelationships. The more training data, the better the model.
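If you'd like to see that "ordered list of backup guesses" idea in something other than words, here's a toy sketch. The candidate words and their scores are completely made up (a real model scores tens of thousands of tokens), but the temperature math works the same way:

```python
import math
import random

# Toy next-word guesser: made-up scores for what might follow
# "The girl rode her beautiful red..." (not from a real model).
candidates = {"bike": 5.0, "horse": 3.0, "wagon": 2.5, "dress": 1.0}

def pick_next_word(scores, temperature=1.0):
    # Higher temperature flattens the distribution, so less likely words
    # ("horse") get picked more often; lower temperature makes the model
    # stick with its top guess ("bike").
    weights = [math.exp(score / temperature) for score in scores.values()]
    return random.choices(list(scores.keys()), weights=weights)[0]

print(pick_next_word(candidates, temperature=0.2))  # almost always "bike"
print(pick_next_word(candidates, temperature=1.5))  # sometimes "horse" or "wagon"
```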
It turns out we’re incredibly predictable
The crazy thing nobody really expected at the outset was just how much implicit information our languages contain. By analyzing them at such a large scale, you can pick up patterns that would otherwise be invisible. In those patterns lie reasoning, bias, tone, humor, sarcasm, and even some simple math. When you ask an LLM what 2 + 2 equals, it responds with 4. Not because it knows how to calculate, but because it’s seen that pattern SO MANY TIMES. If you ask it to multiply two four-digit numbers, it will usually fail. Far from being intelligent (I’ll save AGI for another article), they are mind-bogglingly dumb. But they appear to be smart because they are so damn good at pattern recognition. And as it turns out, we humans mostly repeat ourselves.
The limitations of LLMs
Wait, weren’t we talking about RAG? Yeah, so… LLMs are good at talking about what they’ve been trained on. But what if we want to speak to an LLM about information it hasn’t been trained on? With a straight LLM (I’m not talking about ChatGPT—that’s technically an agent, which, yes, is yet another article), you can give it one of your documents, but you have to put the entire document into the input. In that case it will read the whole document and be able to answer questions about it. Great.
The problem is that LLMs can only handle so many words at once (a limit known as the context window, basically the model's short-term memory). So if you’ve got a long document, you’re out of luck. What if you’re running a law firm and you want to use an LLM to interrogate your years of case research? Sorry, no can do.
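To make the limit concrete, here's a rough sketch of checking whether a document even fits, using OpenAI's open-source tiktoken tokenizer. The 128,000-token limit and the file-free setup are just illustrative, not tied to any particular model:

```python
import tiktoken  # OpenAI's open-source tokenizer library

# Models don't actually count words; they count tokens (word pieces).
# This limit is an illustrative number, not any specific model's spec.
CONTEXT_WINDOW = 128_000

def fits_in_context(document_text: str) -> bool:
    encoding = tiktoken.get_encoding("cl100k_base")
    num_tokens = len(encoding.encode(document_text))
    print(f"Document is roughly {num_tokens} tokens")
    return num_tokens < CONTEXT_WINDOW

# Years of case research will blow well past any context window,
# which is exactly the gap RAG was invented to fill.
```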
A huge shout-out to Patrick Lewis and his research team at Meta, who coined the term "Retrieval-Augmented Generation" (RAG) in 2020. 2020! That’s 100 years ago in Gen AI terms. Anyway, these folks invented a methodology where instead of sending an entire document to the LLM, they would only send the relevant parts and then ask the LLM to synthesize and answer based on that information.
02 / How RAG works under the hood
RAG has three steps: storage, retrieval, and synthesis. SRS? Never mind, that's worse than RAG.
Storage
This is a preprocessing step and it doesn’t use an LLM. That’s important because you can do it without sharing the entirety of your data with the naughty, data-hungry boys at OpenAI. Your documents are broken into smaller pieces, majestically named “chunks.” A chunk can be a sentence or a paragraph or a page. Any length really.
Generally, you want your chunks to contain a single idea (you'll see why in a sec). Each chunk is then converted into a vector through a process called embedding. Okay, this is the mathiest part of this article, but you’re good-looking and have a college degree, so I think you can handle it.
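Before we get to the math, here's a minimal sketch of just the chunking part, assuming plain-text documents and using paragraphs as a stand-in for "one idea per chunk":

```python
def chunk_by_paragraph(text: str, max_chars: int = 1000) -> list[str]:
    """Split a plain-text document into paragraph-sized chunks.

    Paragraphs are a decent proxy for 'one idea per chunk'; anything
    longer than max_chars gets split again so no chunk is enormous.
    """
    chunks = []
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        if len(paragraph) <= max_chars:
            chunks.append(paragraph)
        else:
            # Fall back to fixed-size pieces for very long paragraphs.
            for start in range(0, len(paragraph), max_chars):
                chunks.append(paragraph[start:start + max_chars])
    return chunks
```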
Introducing vectors
A key part of LLMs is turning words into numbers. The numbers (or more accurately vectors) represent the relationship between that word and all the other words. Math PhDs are groaning at their phones as they read this, but it’s basically correct. A vector is a long series of numbers that describe a word (or more accurately a token) across thousands of axes of meaning. For example, dog will be weighted heavily on furry-ness, animal-ness, pet-ness, friend-ness, etc., but low on mineral-ness, machine-ness, etc.
This is fundamentally important to how this technology works! So, if you don’t get that part, start chatting with ChatGPT about how vectors represent the meaning of and relationships between words.
So now you have your document broken into chunks and converted into a format that stores its meaning relative to other words. The embedding step does not use an LLM. It uses a separate embedding model that you can run on your own hardware, so you can do everything so far on your own servers. No data privacy concerns, yay!
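Here's a rough sketch of the whole storage step, assuming the open-source sentence-transformers library and a small local model; the model name and file name are just placeholders, not recommendations:

```python
from sentence_transformers import SentenceTransformer  # open-source, runs locally

# A small local embedding model: no API calls, so no data leaves the building.
model = SentenceTransformer("all-MiniLM-L6-v2")

# Hypothetical file; imagine it's the employee policy handbook.
handbook_text = open("employee_handbook.txt").read()
chunks = chunk_by_paragraph(handbook_text)  # from the chunking sketch above

# One vector per chunk. Keep the original text next to its vector so we
# can hand the text to the LLM later.
chunk_vectors = model.encode(chunks)
store = list(zip(chunks, chunk_vectors))
```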
Retrieval
The next step happens when you want to chat about your documents. So you start with a question. Simple.
Let’s say the document you embedded is your employee policy handbook. And you want to know what your company’s official vacation days are. Here’s what happens…
Your question is converted into a vector using the same embedding model that was used on the chunks. Now we’ve got your question in a format that expresses its meaning relative to other words.
Next, a search is performed, comparing your question vector to all of the chunk vectors. We want to know which chunks most closely match the meaning of your question, so the search might return the 10 chunks whose vectors sit closest to your question’s vector. Closeness here is a real mathematical comparison, and there are different ways to measure it, each with a significant bearing on your results.
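A rough sketch of that retrieval step, continuing the storage sketch above and using cosine similarity as the closeness measure (dot product and Euclidean distance are common alternatives):

```python
import numpy as np

# `model` and `store` come from the storage sketch above.

def cosine_similarity(a, b) -> float:
    # 1.0 means the vectors point the same way in meaning-space; 0 means unrelated.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

question = "What are our company's official vacation days?"
question_vector = model.encode([question])[0]  # same embedding model as the chunks

# Score every chunk against the question and keep the 10 closest matches.
scored = [(cosine_similarity(question_vector, vector), text) for text, vector in store]
scored.sort(key=lambda pair: pair[0], reverse=True)
top_chunks = [text for _, text in scored[:10]]
```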
All this searching and matching did not require an LLM. Our data is still private.
Synthesis
Okay, now we have 10 chunks that the RAG system thinks are most closely related to your question. When we stored the vectors, we also kept the original text of each chunk alongside them. So we gather all that text and send it to an LLM, along with your question, in a prompt. It would look like this:
The user asked the following question:
[the user's question]
Use the following material to help answer the question:
[text of the matching chunks]
Use this information and only this information to answer the user’s question.
So, the question and the chunks are combined into a prompt sent to the LLM. And the LLM uses the information in the prompt to answer the question.
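Here's a minimal sketch of that synthesis step, continuing the retrieval sketch above and assuming the official OpenAI Python client; the model name is a placeholder, and a locally hosted model would work just as well:

```python
from openai import OpenAI  # official client; any capable LLM would do

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# `question` and `top_chunks` come from the retrieval sketch above.
prompt = (
    "The user asked the following question:\n"
    f"{question}\n\n"
    "Use the following material to help answer the question:\n"
    + "\n\n".join(top_chunks)
    + "\n\nUse this information and only this information to answer the user's question."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```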
This is the first time the LLM has seen your private data. And it’s only seen the retrieved chunks sent along in the prompt. You don’t need a particularly powerful LLM to synthesize information like this, since the step doesn’t require much reasoning.
And now you understand RAG.
03 / How RAG falls short
Complexity
Each step I described above has a slew of variables that will affect the system's quality.
For example, chunk size:
- Smaller chunks return more precise matches but don’t capture the gist of the text.
- Larger chunks capture the gist but can return a lot of false positives.
A typical RAG system will overlap the chunks, vary the chunk size, make multiple passes, or do some combination of the three.
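Overlapping is simpler than it sounds. Here's a rough sketch of fixed-size chunks with overlap, so an idea that straddles a boundary still lands intact in at least one chunk (the sizes are arbitrary):

```python
def chunk_with_overlap(text: str, size: int = 800, overlap: int = 200) -> list[str]:
    """Fixed-size chunks where each one repeats the tail of the previous chunk,
    so an idea that straddles a boundary isn't split in half everywhere."""
    step = size - overlap
    return [text[start:start + size] for start in range(0, len(text), step)]
```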
The tweakability is also endless, making tuning and context setting an important part of the RAG system’s performance. For example:
- In storage, you can index the chunks in myriad ways and attach different kinds of metadata that can be used later to help with retrieval or synthesis.
- In retrieval, you can return more or fewer matches, among many other tweaks.
- In synthesis, you can modify the instructions you give, add additional meta information to the prompt, and so much more.
- You can even add more LLM calls to the picture, by interpreting and modifying the original question before performing the search, or by evaluating the results and passing along only the chunks that actually contain information relevant to the question.
Getting a RAG system to perform really well is hard. And the less tuned the system is to the specific content or use case, the more poorly it will work.
Speed
Every time you ask a question, that question has to be vectorized, a search has to be performed, and the matching chunks have to be found, compiled, and sent to the LLM. All of this takes time. That’s why you see that little spinner for a while when you ask your company policy bot, “when are annual bonuses sent out?” More highly tuned RAG systems tend to have more steps (see above) and thus run more slowly. So quality comes at the price of time.
Synthesis
RAG is essentially a search process. If you ask a question like, “what is our company policy on parental leave?” you’ll get a really good answer. That section is probably well-labeled and uses a lot of words that closely match “parental.”
But if you ask about the gestalt of the document, like, “how would you characterize this company’s dedication to its employees’ well-being?” it will probably stumble. That’s because it’s not taking in the whole document and “understanding” it. It will try to find areas of the document that match the idea of “well-being,” and it will miss the forest for the trees. It’s like trying to understand history by only knowing events and dates.
Hallucination
If you ask for something that isn’t in the document or is highly tangential, the retrieved chunks will be of low relevance. The LLM will then attempt to answer the question with what is essentially lousy information, and it will feel like it’s making stuff up. If the system isn’t designed to admit when it doesn’t know, it can present that shaky synthesis with high confidence. So, if you aren’t familiar with the document you’re querying, you can easily walk away misinformed.
Summarization
RAG is not built to summarize. If you say something like, “summarize the vacation policy of this company,” it will do fine. But ask for a broader summary and it will start to hallucinate.
Why? Because in order to summarize, the AI would need to read the entire document, not just a handful of retrieved chunks. There are different (also imperfect) summarization methodologies, which can even be added to your RAG system since it’s such a common use case.
04 / How RAG performance is improved
I mentioned all the knobs that are available to improve a RAG system. The good news is that we have a lot of control over how these architectures work, and we can do a lot to improve them depending on the nature of the content, the specific use cases, and the importance of getting quality output. Here are some of the big things:
- Indexing: How you chunk the document and what metadata you keep while processing it.
- Re-ranking: Taking an extra step to look at the question and the search results and re-rank those results by relevance before sending them to the LLM.
- Prompt engineering: Adding more use-case-specific instructions to the prompt you send to the LLM.
- Query rewriting: Evaluating the user’s question and rewriting it for clarity before starting the search (there’s a rough sketch of this one just below).
And. So. Much. More.
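To make that last list item concrete, here's a rough sketch of query rewriting, again assuming the OpenAI Python client and a placeholder model name:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def rewrite_query(user_question: str) -> str:
    """Ask an LLM to restate a vague question as a clear, search-friendly one
    before it gets embedded and matched against the chunks."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{
            "role": "user",
            "content": (
                "Rewrite the following question so it is clear, specific, and "
                "well suited to searching a document. Return only the rewritten "
                f"question.\n\nQuestion: {user_question}"
            ),
        }],
    )
    return response.choices[0].message.content.strip()

# "when do we get days off??" might come back as something like
# "What is the company's policy on paid vacation days and holidays?"
```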
I can easily think of 10 ways to improve a RAG system. An experienced AI engineer could think of 20 more. The point is that RAG is not a single thing—it’s a complex, highly customizable, interconnected system that works best when it’s tuned to the type of content you’re working with and the kind of information you want to get out of it.
05 / Generic vs. custom implementation
You can buy an off-the-shelf RAG system that will work pretty well out of the box for many common use cases. But if your business depends on extracting relevant information from a large inventory of documents, you should at least consider a more custom solution, for all the reasons I’ve outlined above.
Domain-specific tools have already emerged in industries with obvious use cases. I’m thinking of legal, healthcare, insurance, etc. When you need a super-powered search, these tools will likely suffice.
But if you want to transform how you work, automate your unique processes, or deal with unusual data sets or highly complicated documents where simple text retrieval won’t suffice, you’ll benefit from a custom RAG solution.
Summary
I swear I didn’t set out to write a novel about RAG. If you've made it this far you're a dedicated soul and I salute you! If anything is unclear in this article, please let me know and I'll try to update it for better clarity.
RAG is a powerful component in the AI toolkit. Understanding how it works should help you use it more effectively. Understanding its limitations and weaknesses can help you squeeze the most value out of your RAG tools and design new RAG systems that meet your needs and expectations.
If you want to discuss the nuances of RAG for your business or how RAG can be a part of your AI-powered business transformation, give us a shout.