How AI Large Language Models Function and Why We Will Not Achieve AGI With Existing Technology

by Scott

The conversation about artificial general intelligence has never been louder, and it has rarely been more confused. On one side are the optimists, researchers and entrepreneurs who believe that the large language models currently transforming industries represent early prototypes of systems that will soon achieve or surpass human-level general intelligence across all cognitive domains. On the other side are the skeptics, researchers who believe that current systems, however impressive, are so fundamentally different from general intelligence that calling them precursors to it is a category error. Between these poles is a large population of interested observers who find it difficult to evaluate the claims of either camp because they do not have a clear picture of what large language models actually are, how they actually work, and what the genuine boundaries of the architecture are. That clarity is worth developing, not because it settles the debate, but because it makes the debate possible to have honestly.

A large language model is, at its most fundamental level, a mathematical function that maps sequences of text to probability distributions over what text might come next. This description sounds reductive, and in some ways it is, but it is also precise in a way that matters for understanding both the capabilities and the limits of these systems. The model takes as input a sequence of tokens, which are roughly equivalent to words or word fragments, and produces as output a probability distribution over the vocabulary of possible next tokens. The token with the highest probability could simply be selected at every step, but in practice a controlled amount of randomness, typically governed by a temperature parameter, is introduced to make the outputs more varied and interesting. This process repeats, with each newly generated token appended to the input sequence, until the model produces a complete response. Everything that large language models do, every essay they write, every question they answer, every piece of code they generate, is produced by this process of iterative next-token prediction.
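To make this concrete, the generation loop can be sketched in a few lines of Python. This is an illustration only, not any particular system's implementation: `model` is a placeholder for a trained network that returns one score, or logit, per vocabulary entry, and the temperature value controls how much randomness enters the selection.

```python
import numpy as np

def sample_next_token(logits, temperature=0.8):
    """Turn raw per-token scores into a probability distribution and sample from it."""
    scaled = np.asarray(logits) / temperature   # lower temperature -> sharper, more deterministic
    probs = np.exp(scaled - scaled.max())       # softmax, shifted for numerical stability
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))

def generate(model, prompt_tokens, max_new_tokens=50, temperature=0.8):
    """Iterative next-token prediction: append each sampled token and query the model again."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        logits = model(tokens)                  # placeholder call: scores for every possible next token
        tokens.append(sample_next_token(logits, temperature))
    return tokens
```

Real systems add tokenization, stopping criteria, and more elaborate sampling schemes, but the core loop is the same.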

The function that performs this mapping is implemented as a neural network, specifically a type of architecture called a transformer, introduced in a landmark paper by researchers at Google in 2017. The transformer architecture is built around a mechanism called attention, which allows the model to weigh the relevance of different parts of the input sequence when computing its output at each position. When a model is generating the next word in a sentence, attention allows it to consider not just the immediately preceding words but the entire context of the conversation or document, assigning different weights to different parts of that context depending on their relevance to the current prediction. This ability to integrate information across long spans of context is one of the key capabilities that makes transformers more powerful than the recurrent neural network architectures that preceded them.
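The attention computation at the heart of the transformer is also compact enough to sketch. The following shows a single head of scaled dot-product attention in NumPy, as an illustration rather than a faithful implementation; a real transformer adds learned projection matrices, many heads in parallel, masking so a position cannot attend to tokens that come after it, and feed-forward layers between attention blocks.

```python
import numpy as np

def attention(queries, keys, values):
    """Scaled dot-product attention: score every query against every key,
    softmax the scores into weights, and return a weighted mix of the values."""
    d_k = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d_k)         # relevance of each position to every other position
    scores -= scores.max(axis=-1, keepdims=True)     # shift for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over positions
    return weights @ values

# In self-attention the queries, keys, and values are learned projections of the
# same token representations; random vectors stand in for them here.
tokens = np.random.randn(6, 16)                      # 6 positions, 16-dimensional representations
context_aware = attention(tokens, tokens, tokens)    # shape (6, 16)
```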

The parameters of the neural network, the billions or hundreds of billions of numerical values that determine how the network transforms its inputs into outputs, are learned during a training process. Training involves exposing the model to an enormous corpus of text, typically hundreds of billions or trillions of tokens drawn from books, websites, academic papers, code, and countless other sources. At each step of training, the model makes a prediction about what token comes next in a sequence, compares that prediction to the actual next token in the training data, and adjusts its parameters in a direction that would have made the correct prediction more likely. This adjustment process, implemented through an algorithm called backpropagation combined with an optimization procedure called gradient descent, is repeated billions of times across the training corpus until the model’s predictions are as accurate as the training process can make them.
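Stripped of the engineering that makes it work at scale, a single pass of this training loop looks roughly like the PyTorch-style sketch below. The `model` and `training_batches` names are stand-ins rather than a real codebase: the model is assumed to map a batch of token ids to one set of next-token scores per position, and plain stochastic gradient descent is shown where production systems use more elaborate optimizers.

```python
import torch
import torch.nn.functional as F

def train_one_epoch(model, training_batches, lr=1e-3):
    """One pass of next-token-prediction training. `model` is assumed to map token
    ids of shape (batch, seq_len) to logits of shape (batch, seq_len, vocab_size)."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)   # gradient descent
    for batch in training_batches:                           # batch: (batch, seq_len) token ids
        logits = model(batch[:, :-1])                        # predict token t+1 from tokens 0..t
        targets = batch[:, 1:]                               # the tokens that actually came next
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               targets.reshape(-1))          # how far off were the predictions?
        optimizer.zero_grad()
        loss.backward()                                      # backpropagation: gradients for every parameter
        optimizer.step()                                     # adjust parameters toward better predictions
```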

What emerges from this training process is a function that has, in some distributed and mathematically complex sense, internalized an enormous amount of information about the structure of language, the relationships between concepts, the conventions of different types of text, and vast quantities of world knowledge that is implicit in the patterns of how words appear together across the training corpus. The model does not store this knowledge in discrete retrievable units the way a database stores records. It is encoded across the billions of parameters of the network in a form that is not directly interpretable but that reliably produces outputs reflecting that knowledge when the model is queried in appropriate ways.

The capabilities that emerge from this training process are genuinely remarkable and in some cases surprising even to the researchers who built the systems. Large language models can write coherent essays on complex topics, translate between languages with high accuracy, generate functional code in dozens of programming languages, explain scientific concepts at varying levels of sophistication, solve mathematical problems, engage in extended reasoning about hypothetical scenarios, and demonstrate apparent understanding of nuanced social and emotional contexts. Many of these capabilities were not explicitly trained for. They emerged as a consequence of training on next-token prediction at sufficient scale, leading some researchers to conclude that scale itself, more data and more parameters and more compute, is a reliable driver of new capabilities in ways that were not anticipated.

This observation, that capabilities seem to emerge with scale, has been one of the primary drivers of optimism about the trajectory toward artificial general intelligence. If training larger models on more data consistently produces more capable and more general systems, the argument goes, then continuing to scale should eventually produce systems that are generally capable across all cognitive domains. This argument has a surface plausibility that has made it influential, but it has several deep problems that become apparent when examined carefully.

The first and most fundamental problem is that next-token prediction, however sophisticated, is not the same thing as understanding. This distinction is contested and philosophically complex, and the word understanding is carrying a lot of weight here that needs to be unpacked. What is meant here is something specific and empirically testable. A system that genuinely understands something in the way humans understand things should be able to apply that understanding flexibly to novel situations, to reason about cases that differ substantially from anything encountered in training, to identify when its knowledge is insufficient and reason about its own limitations, and to generalize principles across domains in ways that are not merely statistical regularities in the training data. Current large language models fail these tests in systematic and revealing ways.

The failure mode is most visible in what researchers call out-of-distribution generalization. When a large language model is presented with a problem that is structurally similar to problems it has seen many times in training, it typically performs well. When it is presented with a problem that is structurally novel, even if a human with genuine understanding of the relevant domain would find it straightforward, the model often fails in ways that reveal the limits of statistical pattern matching. Mathematical reasoning provides particularly clean examples of this. A model might correctly solve thousands of arithmetic and algebra problems drawn from the distribution of problems represented in its training data, but fail on a problem that involves a slight variation in structure or presentation that would not trouble a student who had genuinely understood the underlying mathematical principles.

The characteristic failure modes of large language models, including hallucination of confident-sounding false information, brittleness in the face of novel problem structures, inconsistency across rephrasings of the same question, and the inability to reliably track the implications of their own stated beliefs across a long conversation, all point toward the same underlying issue. These systems are extraordinarily good at producing text that is statistically consistent with the patterns present in their training data. They are not systems that have developed internal models of the world that they consult when generating responses. They are systems that have learned to produce outputs that look like the outputs a knowledgeable person would produce, because those are the outputs that minimize prediction error on the training corpus.

The distinction between producing outputs that look like understanding and having an internal model that constitutes understanding matters enormously for the question of artificial general intelligence. Artificial general intelligence, as the concept is typically used, refers to a system that can perform any intellectual task that a human being can perform, with comparable flexibility and generality. The key word is generality. Human general intelligence is not a collection of specialized skills that happen to cover a wide range of domains. It is a capacity for flexible reasoning, learning, and adaptation that can be applied to genuinely novel problems that have no precedent in prior experience. When humans encounter a radically new challenge, they can reason from first principles, draw analogies from distant domains, experiment and update their beliefs based on the results of experiments, and construct new conceptual frameworks to organize their understanding. This capacity to respond to genuine novelty is not something that emerges naturally from the statistical pattern-matching architecture of large language models.

The second major limitation of current large language model architecture for the purposes of general intelligence is the absence of grounded interaction with the world. Human intelligence is not developed or exercised in isolation from the physical and social environment. It develops through years of embodied experience, through the feedback loops of action and consequence, through emotional responses to real events, through the social dynamics of communication with other minds. The conceptual structures that human beings use to understand the world are grounded in this embodied experience in ways that make them robust and flexible. When a human being understands the concept of weight, that understanding is connected to the felt experience of lifting objects, of balance and imbalance, of the difference between effort and ease. It is not merely a statistical regularity in the co-occurrence of the word weight with other words in text.

Large language models learn from text, and text is a product of human experience rather than experience itself. The model learns the linguistic structures that humans use to describe and communicate about the world, but it does not learn from the world directly. This means that the model’s internal representations of concepts, whatever they are in mathematical terms, are derived from second-order descriptions of experience rather than from experience itself. Whether this distinction matters for the production of useful outputs is an empirical question that has a complicated answer. For many tasks, the distinction does not matter much, because the text descriptions of the relevant concepts are sufficiently rich and consistent to support accurate and useful responses. For tasks that require reasoning about physical causation, spatial relationships, the dynamics of complex systems, or the phenomenology of embodied experience, the lack of grounding becomes a meaningful limitation.

The third limitation is the absence of persistent memory and genuine learning from interaction. A large language model, as currently implemented, does not learn from its conversations. Each conversation begins with the model in the same state as every other conversation, with parameters fixed at the values produced by the training process. Information provided within a conversation is available within the context window of that conversation, but it does not persist to future conversations and does not update the model’s parameters. This is a fundamental difference from human intelligence, which learns continuously from every experience and maintains a persistent and evolving model of the world, of other people, and of itself. The ability to learn from interaction, to update beliefs in response to new evidence, and to carry the products of learning forward into new situations is central to what makes general intelligence general.
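The statelessness is easy to see in code. In the sketch below, `toy_model` is an obviously artificial stand-in for a language model; the point it illustrates is that anything the system appears to remember lives in the prompt the caller assembles, not in the model itself.

```python
def toy_model(prompt):
    """Artificial stand-in for a language model, used only to illustrate statelessness."""
    return "Your name is Ada." if "My name is Ada" in prompt else "I don't know your name."

def reply(model, history, user_message):
    # The model keeps nothing between calls: every prior turn it should "remember"
    # has to be re-sent inside the prompt, limited by the context window.
    turns = history + [("user", user_message)]
    prompt = "".join(f"{role}: {text}\n" for role, text in turns) + "assistant:"
    return model(prompt)

history = []
first = reply(toy_model, history, "My name is Ada.")
history += [("user", "My name is Ada."), ("assistant", first)]

with_context = reply(toy_model, history, "What is my name?")   # the re-sent history carries the name forward
fresh_start = reply(toy_model, [], "What is my name?")         # a new conversation has no trace of it
```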

Various approaches have been developed to address this limitation to some degree. Retrieval-augmented generation allows models to query external databases during inference, effectively extending their accessible knowledge base. Fine-tuning allows models to be retrained on specific datasets to improve their performance in particular domains. Agentic frameworks allow models to take actions in the world, observe the results, and incorporate those observations into their reasoning. These techniques meaningfully extend the capabilities of language model-based systems, but they do not change the fundamental architecture in ways that address the deep limitations. A model that retrieves information from a database and incorporates it into a response is still doing next-token prediction. It has not developed the capacity for genuine world modeling or continuous learning that would be required for general intelligence.
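Of these, retrieval-augmented generation is the easiest to make concrete. The sketch below assumes two hypothetical callables, `embed` for turning text into vectors and `model` for the language model itself; no particular library's API is implied. Note that the retrieved passages simply become more context for the same next-token prediction described earlier.

```python
import numpy as np

def rag_answer(question, documents, embed, model, k=3):
    """Minimal retrieval-augmented generation: embed the question, select the k
    most similar documents, and prepend them to the prompt before generating."""
    q = embed(question)
    scores = [float(np.dot(q, embed(doc))) for doc in documents]   # similarity of each document to the question
    top_docs = [documents[i] for i in np.argsort(scores)[-k:]]     # keep the k highest-scoring documents
    prompt = "Context:\n" + "\n".join(top_docs) + f"\n\nQuestion: {question}\nAnswer:"
    return model(prompt)                                           # still next-token prediction, over a longer context
```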

The fourth and perhaps most philosophically interesting limitation concerns the nature of what these models are doing when they appear to reason. The chain of thought prompting technique, in which models are encouraged to produce intermediate reasoning steps before arriving at a final answer, has been shown to improve performance on many tasks. This has led some researchers to describe models as engaging in reasoning and to interpret the intermediate steps as a genuine reasoning process. A more skeptical interpretation is that chain of thought prompting improves performance because the intermediate steps constrain the statistical prediction process in ways that make the final answer more likely to be correct, not because the model is genuinely reasoning in any sense that involves internal state transitions corresponding to logical inference. The question of whether there is a meaningful distinction between these two descriptions is deeply contested, but it matters for assessing the prospects for general intelligence because genuine flexible reasoning, as opposed to pattern-matched reasoning traces, may require architectural features that current transformers do not have.
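The technique itself amounts to little more than prompt construction. In the sketch below, `model` is again a hypothetical callable standing in for a language model, and the appended "Let's think step by step" is the widely used zero-shot phrasing of the technique.

```python
def direct_prompt(question):
    return f"Q: {question}\nA:"

def chain_of_thought_prompt(question):
    # The appended instruction nudges the model to emit intermediate steps before its
    # final answer. On the skeptical reading, those steps merely condition the later
    # tokens toward answers that co-occur with such steps in the training data.
    return f"Q: {question}\nA: Let's think step by step."

question = "A train leaves at 3:40 pm and arrives at 5:15 pm. How long is the journey?"
# With some hypothetical language model callable `model`:
#   model(direct_prompt(question))            -> usually just a final answer
#   model(chain_of_thought_prompt(question))  -> intermediate steps, then a final answer
```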

None of this is to say that large language models are not remarkable or that the progress in AI capabilities is not genuinely significant. These systems have demonstrated capabilities that would have seemed extraordinary a decade ago, and they are already transforming the way that many knowledge work tasks are performed. The question is not whether they are impressive. The question is whether they are on a path to general intelligence or whether they represent a powerful but fundamentally limited approach to one subset of the cognitive capabilities that constitute general intelligence.

The honest answer, given what is currently known, is that they are probably the latter. The limitations described above are not engineering problems that can be solved by building larger models or collecting more training data. They reflect architectural choices about how these systems are built, and overcoming them will likely require not just scaling current approaches but developing genuinely new approaches to machine learning and artificial intelligence. What those approaches will look like is not clear. The history of AI research is littered with confident predictions about which direction would lead to general intelligence, and most of those predictions have been wrong in ways that reflect genuine uncertainty about what general intelligence actually requires.

What can be said with some confidence is that the systems we have today, whatever their impressive surface capabilities, are not the systems that will produce artificial general intelligence. They are language models. They are extraordinarily powerful language models, capable of producing outputs that in many contexts are indistinguishable from the outputs of intelligent human beings. But producing outputs that look like intelligence and having the underlying architecture that constitutes general intelligence are different things, and the distance between them may be larger than the current optimistic consensus in the technology industry suggests. The spinning of the platters, to borrow a metaphor, continues. But the road to general intelligence, if such a destination exists at all, likely requires a different kind of engine than the one currently installed.