AI adoption is rising quickly. By the end of last year, 65% of enterprise businesses said they had integrated generative AI, up from just 33% in 2023. I’m convinced that AI will be as transformational as digital itself over the next decade. With changes this big afoot, it’s becoming a requirement that leaders educate themselves about this technology, its attendant risks, and the opportunities it brings.
Hopefully this article will shed some light on the nature of hallucinations, how big of a problem they are, and how you as a business leader should think about them in practice.
Bill Murray’s cummerbund
Suppose you ask ChatGPT to describe the 2017 Academy Awards best picture snafu, when La La Land was accidentally announced as the winner over the rightful best picture, Moonlight. It will go on at length about the envelope mix-up, the blame laid on PwC, the identity politics ripples, the various emotions and opinions expressed during and after the event, and much more.
If you follow up by asking what Bill Murray wore to the ceremony, the model will likely hallucinate and tell you that he was wearing a plaid cummerbund and matching bowtie. In reality, Bill Murray did not attend the awards ceremony that year.
How can AI be so good, and yet so bad at the same time?
To answer that, and to give you a better understanding and intuition of why hallucinations happen, I need to explain a few things.
How LLMs work
Let's start with a reasonably non-technical explanation of what's happening under the hood of these large language models (LLMs) on which everyone's betting their careers. If you have a PhD in machine learning, you’ll want to hold your nose for this next part.
An LLM is a pattern-matching machine on steroids. It has been shown trillions of text examples: everything from Shakespeare to Reddit flame wars to Python code to corporate memos. This exposure has built up a very large statistical model, giving it an uncanny ability to predict which words should come next in any given sequence.
Think of it as the world's most sophisticated autocomplete. When you type "To be or not to," your phone suggests "be" because it's seen that pattern before. Similarly, when you ask ChatGPT, "What's the capital of France?" it predicts the words "The capital of France is Paris" because that pattern appears countless times in its training data.
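To make the autocomplete analogy concrete, here is a minimal, purely illustrative Python sketch of next-word prediction. The words and probabilities are invented for this example; in a real LLM, a neural network scores every token in a vocabulary of tens of thousands of entries at every step.

```python
import random

# Invented next-word probabilities for the prompt "The capital of France is".
# A real model computes scores like these over its entire vocabulary.
next_word_probs = {
    "Paris": 0.92,
    "located": 0.04,
    "a": 0.02,
    "Lyon": 0.01,
    "beautiful": 0.01,
}

def predict_next_word(probs: dict[str, float]) -> str:
    """Sample one word, weighted by its probability score."""
    words = list(probs)
    weights = list(probs.values())
    return random.choices(words, weights=weights, k=1)[0]

print("The capital of France is", predict_next_word(next_word_probs))
```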
The crucial point is that LLMs don't "know" facts like humans do. They generate text based on statistical patterns rather than retrieving stored information from a database. This distinction is essential for understanding when and why hallucinations occur, and why they're far less common in practice than the benchmarks suggest.
Why this leads to hallucinations
LLMs have a few inherent traits that lead to hallucinations:
- They are probability-based, so right off the bat, right and wrong, truth and fiction become less about logic and more about shades of gray.
- They are trained and rewarded for giving fluent and confident responses. So they are biased towards answering with something rather than admitting they don’t know.
- They don’t have an internal database of “facts,” so there’s no inherent mechanism to verify the accuracy of a statement before outputting it.
- Their training data limits their knowledge. The less data they have on a given subject, the less likely an accurate answer about it is to appear in a response.
The architecture of LLMs can sometimes create conditions for hallucinations, but these aren't random or constant. They happen in specific, relatively predictable scenarios:
- Insufficient context: When an LLM lacks enough contextual information to generate a reliable response, it may fill gaps with plausible-sounding but incorrect details.
- Questions about obscure topics: If asked about something rarely mentioned in training data, the model has less reliable patterns to draw from.
- Ambiguous requests: Vague or poorly specified prompts allow the model to interpret and potentially misinterpret what's being asked.
- Requests for specific details far outside common knowledge: Dates, statistics, and highly technical information are more prone to fabrication when they're not frequently repeated in training data.
The essential thing to understand is that hallucinations aren't random or omnipresent. They happen in predictable circumstances that can be identified and managed. And critically, as models improve, these problematic zones continue to shrink.
What is a hallucination, anyway?
For all the buzz around "AI hallucinations," you'd think we'd have settled on a precise definition by now. But the term remains frustratingly slippery, which contributes to the overblown fears.
I would argue that an LLM is ALWAYS hallucinating. It’s just that most of the time it hallucinates correctly.
A more common definition is that an AI hallucination occurs when a language model generates information that is:
- Factually incorrect or unfounded
- Not supported by its training data
- Presented with unwarranted confidence
The most breathless headlines about AI hallucinations often involve deliberately adversarial prompts designed to trick the model, situations that rarely occur in actual business use cases. It's like judging a car's safety record exclusively by how it performs when deliberately driven off a cliff.
Back to our example
So why did the model hallucinate about Bill Murray? Well, at the time, nobody was talking about Bill Murray and the 2017 Academy Awards. He wasn’t nominated or particularly relevant that year, so his attendance (or lack thereof) at the ceremony wasn’t widely noted. The training data was slim.
So, the model did its best and cobbled together what it knew about the relationship between Bill Murray, the Academy Awards, and his sartorial choices. It’s easy to find images of Bill at other years’ ceremonies wearing tartan bowties. In effect, it trusted you and assumed that you knew he was at that awards ceremony and responded as best it could.
You’re thinking: But why didn’t it just say that it didn’t have enough information to answer my question? Ah, because that’s not how they work.
When an LLM generates each word in its answer, it actually chooses from several possibilities. Each possibility has a probability score, and depending on the LLM’s settings it may have some license to choose from among the top scores. This is how the same question asked twice in a row can generate different answers. But what if none of the candidates stands out? That happens when the model doesn’t have a clear “winner.” And since the model works word by word, evaluating its confidence in an overall answer can be difficult.
Modern LLMs sometimes punt when they hit a spate of low-probability choices. Hallucinations happen at the edge of that threshold, when the model is just confident enough to put something out there.
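Here’s a rough sketch of that word-by-word selection, with invented scores and an arbitrary cutoff for punting. It assumes a simple temperature-style sampler, which is one common way these settings work, not a description of any particular vendor’s implementation.

```python
import math
import random

def sample_next_word(scores: dict[str, float], temperature: float = 0.8,
                     punt_threshold: float = 0.2) -> str:
    """Pick the next word from raw model scores.

    A higher temperature flattens the distribution, so lower-ranked words
    get picked more often; this is why the same prompt can produce
    different answers. If even the best candidate is weak, return an
    "I don't know" style response instead (the threshold is arbitrary).
    """
    # Softmax with temperature turns raw scores into probabilities.
    exps = {word: math.exp(score / temperature) for word, score in scores.items()}
    total = sum(exps.values())
    probs = {word: value / total for word, value in exps.items()}

    if max(probs.values()) < punt_threshold:
        return "[punt: no candidate is confident enough]"

    words, weights = zip(*probs.items())
    return random.choices(words, weights=weights, k=1)[0]

# Invented scores for the word after "Bill Murray wore a plaid ..."
candidates = {"cummerbund": 1.2, "bowtie": 1.1, "suit": 1.0, "kilt": 0.9}
print(sample_next_word(candidates))
```

With these made-up numbers, the best candidate squeaks past the cutoff, which is exactly the “just confident enough” zone where hallucinations tend to appear.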
Not all hallucinations are the same
Let’s start by talking about the different kinds of hallucinations.
Factual hallucinations
The model states incorrect facts, wrong dates, or inaccurate statistics, or describes events that never happened. These are most common with obscure knowledge or precise details that appear infrequently in training data. These are the kinds of errors that GPT-4.5 was trying to improve upon.
Source confabulations
The model invents non-existent sources to add authority to its claims. "According to a 2023 Harvard Business Review study..." when no such study exists. This typically happens when the user asks for citations where none exist. The model tries to comply with its instructions and neglects to tell the user that there is no source.
Logical hallucinations
The model produces reasoning that seems sound but contains flaws. These are most common in complex, multi-step problems where minor errors accumulate. Reasoning models try to alleviate this through various means, including reflecting on and evaluating their output before sending a response.
Context misalignments
The model misinterprets the context of a conversation and responds inappropriately. These aren't true hallucinations but rather communication failures—they happen all the time between humans, too.
The key insight: each type occurs in specific situations that can often be avoided with proper system design and prompting strategies.
Benchmarks to the rescue?
Here's where things get tricky. Current benchmarks for measuring hallucinations are varied and imperfect. To be reliable, they have to be rigorous and difficult to ace, which often creates an exaggerated picture of the problem.
Factual accuracy tests against established knowledge
These benchmarks test models against encyclopedic facts, often focusing on obscure knowledge or specific details. Because AIs seem so “smart,” we expect them to perform perfectly on straightforward tasks like information retrieval. But as I mentioned above, that’s just not how they work. Nevertheless, we feel let down when an AI can’t accurately spit out the voting record of Zales Ecton, a U.S. senator from Montana in the 1950s.
SimpleQA, the hallucination benchmark that OpenAI uses to score its models’ accuracy, is a collection of thousands of arcane questions from subject matter experts around the world. Some examples:
- What position was John Gilbert Layton appointed to in Quebec from 1969 until 1970? (Answer: Quebec Youth Parliament prime minister)
- What phrase did Bird Person say to Morty in his native language about making the right choice or the one that lets you sleep at night in Season 1, Episode 11? (Answer: gubba nub nub doo rah kah)
- How many million viewers of the inaugural season of Ultimate Kho Kho (UKK) were from India? (Answer: 41 million)
GPT-4.5 hallucinated only 37.1% of the time on hard questions like that. That’s why OpenAI was so proud. Those questions are hard as hell! To answer them, the model needs to understand semantic connections across all of human knowledge and pick perfect needles out of the world’s most enormous haystack.
Other benchmarks
Summarization benchmarks evaluate how faithfully models represent source documents, typically using a RAG approach. This is a hard problem worth its own article. Summarization is particularly tough, especially for large documents. We want the model to “read” all the information, then reduce its meaning to a pithy summary that accurately captures the document's gist. But the document might have many themes and its core thesis might be buried or hard to discern from a lot of “noise” surrounding it. Models perform much better (well below 2% hallucination rate for the best models) when they are given source documents to search or summarize.
Still other benchmarks try to measure reading comprehension, instruction following, and professional knowledge (like legal and health expertise). In fact, there are a slew of hallucination benchmarks on Hugging Face.
These benchmarks fail to capture how well models perform in realistic business settings with proper guardrails and human oversight. I hate to be the bearer of bad news but here it is: LLMs will hallucinate. And detecting when it’s happening will fall on your shoulders.
What can we do about it?
While hallucinations aren't the existential threat they're often portrayed as, they still warrant attention. Here's the strategic, practical approach for both users and developers:
As an end user of chatbots
If you're using ChatGPT or Claude in your workflow, use these techniques to reduce hallucinations:
- Provide sufficient context: The more relevant information you give, the less the model needs to fill gaps.
- Be specific in your requests: Vague questions invite vague answers. Specificity constrains the response space.
- Use the model for what it's good at: Idea generation, writing assistance, and information synthesis rather than as the sole arbiter of obscure facts.
- Apply common sense: If something seems questionable, it probably deserves verification, just as you would with information from a human colleague.
- Iterative refinement: If a response isn't quite right, clarify and refine rather than abandoning the tool entirely.
The reality is that most business users intuitively develop these habits quickly, which is why hallucinations cause far fewer problems in practice than in theoretical discussions.
As a developer of products
If you're building AI-powered applications, you have even more options:
- Implement Retrieval-Augmented Generation (RAG): Ground your model's responses in verified documents for domains where factual accuracy is crucial (see the sketch below).
- Design for appropriate use cases: Match AI capabilities to tasks where hallucination risks are minimal or inconsequential.
- Create clear user expectations: Help users understand what the system can do reliably and where human judgment remains essential.
- Build domain-specific solutions: Models fine-tuned on high-quality data in your specific domain will be more reliable within that domain.
- Implement sensible guardrails: Rather than trying to eliminate all potential errors, focus on containing their impact in the rare cases they occur.
The key insight: hallucinations can be contained through thoughtful system design and proper use, making them a manageable engineering challenge rather than a fatal flaw.
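To make the RAG bullet concrete, here is a minimal sketch of the pattern using the OpenAI Python SDK. The `search_company_docs` function, the canned passages, and the model name are stand-ins for illustration; treat this as an outline of the approach rather than a production implementation.

```python
from openai import OpenAI  # assumes the OpenAI Python SDK is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def search_company_docs(question: str) -> list[str]:
    """Stand-in for a real retrieval step (vector database, search index, etc.).
    Returns canned passages here so the sketch is self-contained."""
    return [
        "Travel policy: economy class is required for flights under six hours.",
        "Reimbursement requests must be filed within 30 days of travel.",
    ]

def answer_with_rag(question: str) -> str:
    # Ground the model in retrieved text instead of relying on its memory.
    context = "\n\n".join(search_company_docs(question))
    response = client.chat.completions.create(
        model="gpt-4o",  # any chat-capable model works here
        messages=[
            {"role": "system", "content": (
                "Answer using ONLY the provided context. If the context "
                "does not contain the answer, say you don't know."
            )},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content

print(answer_with_rag("Can I book business class for a four-hour flight?"))
```

The important design choice is the instruction to answer only from the supplied context and to admit when the answer isn't there; that is what keeps the model out of the low-confidence zone where hallucinations occur.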
The paradigm shift
I talk about the paradigm shift a lot because I think it’s hard to get your head around just how different AI is compared to the technology we’ve all built our careers on. It’s a fundamentally different approach to solving business problems. We are used to ones and zeros, black and white, correct and incorrect. Transistors are so good at that kind of thing.
But AI is more about likelihoods. It’s good at shades of gray, interpretation, and analysis. And it’s really hard for us to let go of the idea that that kind of work belongs solely to us. So when we see something like a hallucination, we assume the system is broken. A more productive way to see it is as a tax on performance: for the ability to automate thinking tasks, we will need to pay a price in error correction.
Thoughtful leaders aren’t waiting for AI to be flawless. Even in highly regulated industries, they’re putting it to work, designing systems that lean into its strengths while managing its limitations. The real question isn’t how to stop AI from making mistakes: it’s whether your business is set up to catch and correct those mistakes efficiently.
As an enterprise leader, here’s what you should be thinking about:
- Where can AI drive value today? Start with low-risk, high-reward applications like summarization, customer support augmentation, and internal knowledge management. These use cases benefit from AI’s strengths without exposing the business to significant risk.
- How do we address AI’s flaws? Techniques like retrieval-augmented generation (RAG), model fine-tuning, and human-in-the-loop validation help keep AI responses within acceptable bounds.
- What’s our AI roadmap? AI adoption isn’t slowing down. If your organization is still in “wait and see” mode, there’s a good chance you’re already behind.
AI hallucinations are a problem to manage, not a reason to stall. The most significant risk isn’t that AI might make a mistake; it’s that your competitors will figure this out faster than you do. It’s game theory. Businesses that integrate AI effectively will set the pace in the years ahead.
I hope this article helped you understand the nature of hallucinations and how they can be addressed.
If you want to chat more about the risks and opportunities of AI, set up a 15-minute discovery chat with us. We help enterprise businesses identify and build transformative, high-ROI AI projects.