Why AI Models Hallucinate and What It Reveals About How They Actually Think
by Scott
There is a peculiar experience that most people who use AI language models encounter sooner or later. You ask a question about a book and the model returns a confident, fluent, detailed answer about a book that does not exist. You ask for citations and receive a list of plausible-sounding academic papers, complete with authors, journals, volume numbers, and page ranges, none of which can be found anywhere because none of them are real. You ask about a historical figure and receive a coherent biographical account that weaves together genuine facts and complete fabrications in a way that is almost impossible to distinguish without independent verification. The model does not signal uncertainty. It does not pause or hedge. It simply tells you things that are not true with exactly the same confident fluency it uses to tell you things that are.
This phenomenon has been given the name hallucination, a term borrowed loosely from psychology, where it refers to sensory experiences that have no external cause. The word is evocative but also slightly misleading, because it implies a kind of perceptual error, a misreading of something that is actually there. What AI models do is perhaps closer to confabulation, a term used in neuropsychology to describe the production of fabricated memories and explanations by people with certain types of brain damage, who generate plausible-sounding accounts not because they are lying but because their memory systems are filling gaps with constructed material that feels internally coherent even when it bears no relationship to reality. The confabulating patient is not being deceptive. They are doing what their damaged memory system does, generating narrative continuity in the absence of actual information. Something structurally similar is happening when an AI model tells you about the third chapter of a book that was never written.
Understanding why hallucination happens requires understanding something about what these models actually are, which turns out to be quite different from what most people intuitively imagine when they think about artificial intelligence. The dominant mental model that most people carry around for AI is some version of a very fast, very comprehensive database lookup. You ask a question, the system searches its stored knowledge, retrieves the relevant entry, and returns the answer. This is roughly how a search engine works, and it is not at all how a language model works. A language model is not a database. It does not store facts in discrete retrievable units. It is, in the most accurate technical description available, a very sophisticated statistical model of language itself.
During training, a large language model is exposed to an enormous quantity of text, typically hundreds of billions or even trillions of words drawn from books, websites, academic papers, code repositories, and countless other sources. The model does not read this text in any sense that resembles human reading. It processes it as a mathematical problem. Specifically, it learns to predict, given a sequence of tokens (words or fragments of words), what token is most likely to come next. This prediction task sounds simple but is extraordinarily demanding when done at scale across the full complexity of human language, because producing accurate next-token predictions across a diverse range of texts requires developing internal representations that capture something about the structure of language, the relationships between concepts, the conventions of different types of writing, and the vast amount of implicit knowledge about the world that is encoded in the patterns of how words appear together.
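To make the prediction objective concrete, here is a minimal sketch in PyTorch. It is an illustration under toy assumptions, not how any production model is trained: the hidden states and vocabulary are invented, and a single linear layer stands in for the full network. What it shows is the shape of the learning signal, a loss that rewards assigning high probability to the token that actually came next.

```python
import torch
import torch.nn.functional as F

# Toy setup: random "hidden states" stand in for the output of a real
# transformer, and one linear layer stands in for the model's projection
# onto the vocabulary. The training signal itself is the real one.
vocab_size = 50_000
hidden_dim = 512
seq_len = 10

lm_head = torch.nn.Linear(hidden_dim, vocab_size)          # stand-in for the model
hidden_states = torch.randn(1, seq_len, hidden_dim)        # stand-in for its activations
next_tokens = torch.randint(0, vocab_size, (1, seq_len))   # the tokens that actually came next

logits = lm_head(hidden_states)                            # shape (1, seq_len, vocab_size)
loss = F.cross_entropy(logits.view(-1, vocab_size), next_tokens.view(-1))

# Gradient descent on this loss is essentially all that "learning" means here:
# adjust the parameters so that observed continuations get higher probability.
print(loss.item())
```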
The result of this training process is not a database of facts. It is a set of mathematical transformations, encoded in billions of numerical parameters, that map sequences of input text to probability distributions over possible next tokens. The model does not know that Paris is the capital of France in the way that a database knows it, with a discrete entry that either exists or does not. Instead, its knowing is distributed across the statistical patterns learned from millions of texts where Paris and France and capital appeared in relationships that encode this fact. The knowledge is real, in the sense that it reliably produces correct outputs, but it is radically unlike the stored factual knowledge that most people assume underlies the model’s answers.
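It can help to see what the model's output literally is at each step: not a retrieved fact but a probability distribution over the whole vocabulary. The numbers below are invented purely for illustration, and nothing in the procedure consults the world.

```python
import torch

# Toy illustration: raw scores (logits) over a tiny invented vocabulary,
# turned into a probability distribution. "Paris" wins only because the
# learned parameters happen to score it highest, not because anything here
# checks a map of France.
vocab = ["Paris", "London", "Rome", "banana"]
logits = torch.tensor([6.0, 2.5, 2.0, -4.0])

probs = torch.softmax(logits, dim=-1)
for token, p in zip(vocab, probs):
    print(f"{token:>8}: {p.item():.3f}")

# Generation is just repeated sampling (or taking the argmax) from
# distributions like this, one token at a time.
next_token = vocab[torch.multinomial(probs, num_samples=1).item()]
print("sampled:", next_token)
```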
This distinction matters enormously for understanding hallucination, because when a model produces an incorrect or entirely fabricated answer, it is not failing to find the right entry in a database. It is producing the output that its statistical machinery determines is the most plausible continuation of the conversation given everything it has processed. The model has no independent check on whether the output it produces corresponds to something that actually exists in the world. It cannot verify its own outputs against an external ground truth. It generates what seems, from the inside of its own mathematical processes, like the most coherent and appropriate continuation of the text sequence it has been given. When that continuation happens to be factually correct, we call it a good answer. When it happens to be a fluent fabrication, we call it a hallucination. But from the model’s perspective, if such a concept even applies, there is no meaningful difference between the two cases, because the process that produced them is exactly the same.
The situations in which hallucination is most likely to occur are revealing. Models tend to hallucinate most frequently when asked about specific factual details that are relatively obscure, when asked to produce content that requires precise retrieval of specific information like citations or dates or names, when asked about topics where training data was sparse or inconsistent, and when asked about things that simply do not exist but resemble things that do. The last category is particularly instructive. If you ask a model about a paper by a real author in a real journal on a topic that author has genuinely written about, and that specific paper does not exist, the model is very likely to produce plausible-sounding details that combine real information about the author, real conventions of the journal, and real knowledge about the topic area into a composite fabrication that is false but structurally coherent. This happens because the statistical patterns that produce a plausible-sounding response in that context are rich and well-developed, even though no specific real paper exists to retrieve.
There is a deep irony here that is worth dwelling on. The more sophisticated a language model becomes, the more fluent and confident its hallucinations tend to be. A less capable model might produce obviously incoherent or poorly constructed false information that is easier to identify as wrong. A highly capable model produces hallucinations that are polished, internally consistent, and stylistically appropriate to the context. The very capabilities that make these models impressive also make their failures more dangerous, because the signal that something is wrong, the awkwardness or incoherence that might alert a careful reader to a problem, has been smoothed away by the same training process that improved overall quality.
This points to something important about the fundamental architecture of these systems. They are optimized for coherence and fluency. The training process rewards producing text that resembles good, contextually appropriate human writing. Human writing is coherent, confident, and fluent. Humans writing about things they are uncertain about often hedge, qualify, and express uncertainty, but the written record from which models are trained is also full of confident assertions, some correct and some not. The model learns that confident, fluent prose is what good text looks like, and it produces confident, fluent prose even when the underlying content is unreliable, because that is what the statistical patterns of human writing have taught it to do.
The question of whether AI models can be said to know anything at all is not a purely philosophical one. It has direct practical implications for understanding when and why hallucination happens. In one meaningful sense, these models clearly know things. They reliably produce correct information about an enormous range of topics, they can reason about complex relationships between ideas, and they demonstrate consistent understanding of concepts across different contexts and phrasings. In another meaningful sense, their knowing is fundamentally different from human knowing in ways that matter. A human who knows that a particular book exists knows this because they have encountered it, in a library, in a bookshop, in a reference, and that encounter left a distinct trace in memory tied to a specific object in the world. A model’s apparent knowledge of a book is a statistical tendency to produce text consistent with that book existing, derived from the patterns in its training data. If the training data contained errors, contradictions, or gaps, those errors, contradictions, and gaps are baked into the model’s responses in ways that are essentially invisible from the outside.

This is why the metaphor of the model as a knowledgeable expert is so persistently misleading. An expert who does not know the answer to a question knows that they do not know. They experience the absence of knowledge as something real, a gap that they can report on and work around. A language model has no such experience of not knowing. When it encounters a question for which its training data provides insufficient reliable information, it does not experience a gap. It continues doing what it always does, which is producing the most statistically plausible continuation of the text sequence, and that continuation may or may not correspond to anything real.
Some of the most interesting research on hallucination has focused on the internal states of language models during the production of false versus accurate information. Several studies have found evidence that models have, embedded in their internal representations, something that functions like a distinction between confident and less confident outputs, and that this internal signal sometimes fails to surface in the actual text the model produces. In other words, there are cases where something in the model’s processing corresponds to lower certainty, but the training process has not effectively taught the model to translate that internal uncertainty into appropriate hedging in the output. The model knows it does not know, in some technical sense, but it says things confidently anyway because confident fluent prose is what the training process rewarded.
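The probing work described here operates on a model's hidden activations, which is not something most users can inspect. A much cruder cousin of that signal, available to anyone with access to the output logits, is the entropy of the next-token distribution: how spread out the probability mass is at a given step. The sketch below uses invented logits and should be read as a rough observable proxy, not the representation-level signal the research identifies.

```python
import torch

def next_token_entropy(logits: torch.Tensor) -> float:
    # Entropy of the distribution implied by a vector of logits.
    probs = torch.softmax(logits, dim=-1)
    return float(-(probs * torch.log(probs + 1e-12)).sum())

confident = torch.tensor([8.0, 1.0, 0.5, 0.1])   # mass piled onto one token
uncertain = torch.tensor([2.1, 2.0, 1.9, 1.8])   # near-uniform scores

print(next_token_entropy(confident))   # small value
print(next_token_entropy(uncertain))   # close to log(4) ≈ 1.386, the maximum for 4 tokens
```

High entropy does not guarantee a hallucination and low entropy does not rule one out, which is part of why calibration is hard: a fluent fabrication can be generated from very concentrated distributions.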
This has led researchers to explore techniques for making models more calibrated, meaning better at expressing uncertainty in proportion to actual uncertainty. Approaches like Constitutional AI, reinforcement learning from human feedback, and various methods for teaching models to say “I do not know” more reliably have shown meaningful progress. But the fundamental challenge remains, because the architecture that produces hallucination is the same architecture that produces the impressive capabilities. You cannot simply remove the tendency to generate plausible continuations when uncertain without also removing the ability to generate good responses under normal conditions.
There is another dimension of hallucination that is less frequently discussed but equally important for understanding what these models actually are. Much of the concern about hallucination focuses on factual errors, the fabricated citations, the nonexistent books, the wrong dates and names. But hallucination is not limited to factual content. Models also hallucinate in the sense of producing reasoning that appears logically sound but contains hidden errors, producing code that looks correct but has subtle bugs, producing medical or legal information that is partially right and partially wrong in ways that require domain expertise to detect. The common thread in all of these cases is the same: the model is producing output that is statistically coherent, that fits the patterns of what good output looks like in that domain, but that has come apart from accuracy in ways the model cannot detect.
This has significant implications for how we should think about the appropriate uses of these systems. A language model used as a first draft generator, or as a brainstorming partner, or as a tool for restructuring and editing text, is being used in a way that leverages its genuine strengths without placing excessive weight on the reliability of its factual outputs. A language model used as a primary source of medical information, or as the sole basis for legal research, or as a system that generates content that will be published without review, is being used in a way that places a burden on its factual reliability that its architecture cannot support.
The discourse around AI hallucination is sometimes framed as a bug that will eventually be fixed, a temporary limitation of current systems that future models will overcome. This framing captures something real. Models have become significantly more reliable over time, and techniques for reducing hallucination rates have shown genuine progress. But there are reasons to believe that some level of hallucination is not merely a bug but an inherent feature of the current approach to building language models. As long as these models are fundamentally statistical predictors of text rather than systems with grounded connections to a verified external knowledge base, they will retain some tendency to produce plausible-sounding outputs that do not correspond to reality. The tendency can be reduced, and in many contexts it has been reduced substantially, but eliminating it entirely would require a different kind of architecture, possibly a fundamentally different approach to building AI systems altogether.
Some researchers and companies have pursued retrieval-augmented generation as a partial solution, building systems that combine language models with explicit document retrieval so that the model’s outputs are grounded in specific retrieved texts rather than relying entirely on statistical patterns learned during training. This approach meaningfully reduces hallucination for the specific domain of factual questions that can be answered by retrieved documents. It does not eliminate the underlying tendency, and it introduces its own complications around which documents to retrieve and how to handle cases where retrieved documents contain errors or contradictions.
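A deliberately small sketch of that retrieval-augmented pattern follows. Everything in it is a stand-in: the embed function is a toy character-frequency vector rather than a learned embedding model, and generate is a placeholder for whatever language model the system actually calls. The point is only the shape of the pipeline, retrieve first, then generate against the retrieved text.

```python
import numpy as np

documents = [
    "Paris is the capital and largest city of France.",
    "The Peace of Westphalia was concluded in 1648.",
]

def embed(text: str) -> np.ndarray:
    # Toy stand-in for a learned embedding model: a normalized
    # character-frequency vector. Real systems use trained encoders.
    vec = np.zeros(128)
    for ch in text.lower():
        vec[ord(ch) % 128] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-9)

def retrieve(query: str) -> str:
    # Return the stored document most similar to the query.
    scores = [float(embed(query) @ embed(doc)) for doc in documents]
    return documents[int(np.argmax(scores))]

def generate(prompt: str) -> str:
    # Placeholder: a real system would call a language model here.
    return f"[model response grounded in a prompt of {len(prompt)} characters]"

def answer(query: str) -> str:
    context = retrieve(query)
    prompt = (
        "Answer using only the context below. If the context does not "
        "contain the answer, say you do not know.\n\n"
        f"Context: {context}\n\nQuestion: {query}\n"
    )
    return generate(prompt)

print(answer("What is the capital of France?"))
```

Even in this toy form, the two failure modes mentioned above are visible: the quality of the answer now depends on whether retrieve surfaces the right document, and on whether that document is itself accurate.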
What hallucination ultimately reveals about how these models think is both simpler and more profound than it first appears. The simple version is that they do not think in any sense that involves verifying outputs against a ground truth. They generate. The more profound version is that the generation is not random or meaningless. It reflects something real about the structure of language and knowledge as encoded in the vast human textual record from which these models learn. When a model fabricates a citation, it produces a fabrication that is structurally indistinguishable from a real citation, because it has learned what citations look like in enormous detail. When it fabricates a biographical account, it produces one that reflects genuine knowledge about the conventions and content of biographical writing. The fabrication is, in a strange sense, informed fabrication, shaped by real patterns even when it produces unreal content.
This is what makes these systems simultaneously impressive and unreliable in the specific way that they are unreliable. They have internalized an enormous amount about how human knowledge is structured, expressed, and communicated. They reproduce that structure with remarkable fidelity. But the structure and the content can come apart, and when they do, the model has no mechanism for noticing. The confident fluency continues regardless. Understanding this is not just a matter of managing expectations about a current generation of technology. It is a matter of understanding what kind of thing we have actually built, what it is genuinely good for, and where the boundaries of appropriate trust actually lie.
The word hallucination will probably continue to be used because it is vivid and has already embedded itself in the conversation. But confabulation captures the phenomenon more precisely, and confabulation points toward a useful analogy. The neuropsychological patient who confabulates is not lying and is not stupid. They are doing what their impaired memory system does, filling gaps with plausible constructions that feel coherent from the inside. The appropriate response is not to dismiss everything they say as unreliable, because much of what they say is accurate and reflects genuine memory and knowledge. The appropriate response is to understand the specific conditions under which confabulation is likely, to verify independently when the stakes are high, and to calibrate trust accordingly. That is, roughly speaking, the right relationship to have with an AI language model too. Not uncritical trust. Not wholesale dismissal. Calibrated engagement that understands the architecture well enough to know when to check.