Most generative AI models today are autoregressive: they follow the principle of next-token prediction, and the transformer architecture has been the dominant implementation for years now thanks to its computational efficiency. The concept is rather simple and easy to understand, as long as you aren’t interested in the details: everything can be tokenized and fed into an autoregressive model. And by everything, I mean everything: text as you’d expect, but also images, videos, 3D models and more. There is no limit to what can be represented and generated by an autoregressive model, and while pre-training is far from solved, I think it’s fair to say everyone more or less knows what to do. That’s why today’s autoregressive models, “multimodal reasoning general” LLMs, are statistical models so powerful that we may be seeing traits of generalization.
The purpose of AI research
But what is the original purpose of AI research? I will speak for myself here, but I know many other AI researchers would say the same: the ultimate goal is to understand how humans think. And we think the best (or the most fun) way to understand how humans think is to try to recreate it. Yet today, when AI is mentioned, it’s mostly about autoregressive models like LLMs. Major players in the field think they may achieve “AGI” (artificial general intelligence) by continuing to scale these models and applying all sorts of tricks that happen to work (multimodality, pure reinforcement learning, test-time compute and search, agentic systems). It’s really too early to tell whether there’s a ceiling to this approach, and I’m not one to pretend to know the absolute truth.
However, I keep asking myself the following question:
Are autoregressive models the best way to approximate human thinking?
You could say LLMs are fundamentally dumb because of their inherent linearity. Are they? Isn’t language itself linear? Autoregressive models may well be a simple approach that turns out to be remarkably effective at modeling how humans use language. But there are many limitations in practice.
Limitations of AR models
By design, AR models lack planning and reasoning capabilities. If you generate one word at a time, you don’t really have a general idea of where you’re heading; you just hope you will reach a sound conclusion by following a chain of thought. Large reasoning models work the same way: they are trained with RL on large sets of verifiable problems (math, code and the like), chosen to be neither too easy nor too hard. Being stochastic, AR models won’t always yield good results when formal logic is involved. They don’t really master abstract principles the way humans do.
Technically speaking, neural networks, as they are usually used, are function approximators, and Large Language Models (LLMs) are basically approximating the function of how humans use language. And they’re extremely good at that. But approximating a function is not the same thing as learning a function. – Gary Marcus (2025)
The current architecture of AR models lacks long-term memory and has limited working memory: everything has to fit within a context window. Long-term memory can be approximated by retrieving vectorized information from previous interactions, but ultimately whatever is retrieved still has to fit in that same context window. While LLMs with ever larger context windows keep getting released, they still suffer from serious coherence issues under heavy context loads. There is room for optimization here, but ultimately, LLMs have no recollection capabilities the way humans do. Once trained, they will not learn from their mistakes.
AR models hallucinate. Humans hallucinate too. But I fear the word “hallucination” is misused here, as it attributes to AR models human traits they don’t really have. Human mistakes and AR-model hallucinations are of a very different nature, because one has a world model and the other doesn’t. While humans make (a lot of) mistakes, they have a common-sense understanding that AR models lack. I would even say that I generally trust a SOTA LLM more than the average human; however, I may not detect hallucinations in LLMs as easily, which can be problematic.
Of course, there are ways to limit the risk of LLM hallucination. Retrieval-augmented generation (RAG) is a common one: we fit as much relevant data as possible into the LLM’s context window at inference time and hope it performs better on specific tasks. We can also tweak inference parameters such as temperature, making token prediction more rigid at the cost of creativity. Ultimately, though, stochastic models will always make plausible-sounding mistakes.
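To make the temperature point concrete, here is a minimal sketch of how temperature reshapes the next-token distribution before sampling. The vocabulary size and logit values are made up; a real LLM produces logits over tens of thousands of tokens, but the mechanism is the same.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_next_token(logits, temperature=1.0):
    """Sample a token id from raw logits after temperature scaling.

    Low temperature sharpens the distribution (more rigid, closer to greedy
    decoding); high temperature flattens it (more creative, more error-prone).
    """
    scaled = np.asarray(logits) / max(temperature, 1e-8)
    probs = np.exp(scaled - scaled.max())  # numerically stable softmax
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

# Toy logits over a 4-token vocabulary (values are made up).
logits = [2.0, 1.0, 0.2, -1.0]
for t in (0.2, 1.0, 1.5):
    counts = np.bincount([sample_next_token(logits, t) for _ in range(1000)], minlength=4)
    print(f"temperature={t}: {counts / 1000}")
```

Pushing temperature towards zero makes the sampler collapse onto the single most likely token, which removes some randomness-driven mistakes, but it does nothing about the cases where the most likely token is itself wrong.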
Exposure bias is another inherent issue of the autoregressive paradigm. If the model makes a small mistake early on, that mistake is fed back as input and tends to snowball into more errors. The model can easily derail and produce irrelevant or repetitive output. Humans notice when they’re going in circles and have the ability to “course-correct”, something LLMs lack. We may see traces of this capability emerging in reasoning models, but it is still somewhat limited.
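A crude back-of-the-envelope calculation shows why this compounds. If we assume, unrealistically, that each generated token independently “derails” the output with some small probability, the chance that a long generation stays on track shrinks geometrically; the 1% figure below is purely illustrative.

```python
# Hypothetical per-token derailment probability (illustrative only).
eps = 0.01
for n in (10, 100, 500, 1000):
    print(f"{n} tokens: P(no derailment) = {(1 - eps) ** n:.4g}")
# roughly 0.90, 0.37, 0.007 and 0.00004 respectively
```

Real errors are not independent, but the qualitative picture holds: without a mechanism to notice and repair early mistakes, long generations become fragile.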
Exploring other paradigms
Human thinking involves more than just linking words together, and I don’t think AGI can ever be achieved if an AI model doesn’t show solid planning and memory capabilities. That isn’t to say AR models should be ditched altogether; they’re still very useful tools, and they may even be used as components of more complex architectures that tackle these limitations.
Yann LeCun is a famous AI researcher who has also been a vocal critic of AR models. He suggests pursuing research into other paradigms to achieve human-like cognition. He is working on an architecture called JEPA (Joint Embedding Predictive Architecture), which makes predictions in an abstract representation space and refines them iteratively, instead of generating every detail step by step like traditional AR models. In other words, JEPA’s goal is to learn about the world by focusing not on the details to generate, but on the state of the world, its essential aspects.
Rather than raw sequence prediction, the idea is a kind of self-supervised learning focused on abstract prediction. This makes sense: humans don’t perceive the world pixel by pixel, after all. A true mark of intelligence would be to focus on the core concept, on the essential information in an abstract form, and that might be how we achieve goal-driven AI models.
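As a rough sketch of what “predicting in an abstract space” can look like in code, here is a minimal joint-embedding setup loosely in the spirit of I-JEPA. All modules, shapes and training details below are placeholders of my own; the only point being made is that the loss is computed between embeddings, never between pixels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

dim = 64
# Placeholder encoders and predictor (a real model would use vision
# transformers, a masking strategy, an EMA-updated target encoder, etc.).
context_encoder = nn.Sequential(nn.Linear(128, dim), nn.ReLU(), nn.Linear(dim, dim))
target_encoder = nn.Sequential(nn.Linear(128, dim), nn.ReLU(), nn.Linear(dim, dim))
predictor = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

x_context = torch.randn(16, 128)  # visible part of the input (e.g. image patches)
x_target = torch.randn(16, 128)   # hidden part whose representation must be predicted

z_context = context_encoder(x_context)
with torch.no_grad():             # the target encoder provides targets, not gradients
    z_target = target_encoder(x_target)

# The loss lives entirely in representation space: predict the abstract
# state of the hidden part, not its raw pixels.
loss = F.mse_loss(predictor(z_context), z_target)
loss.backward()
```

The detail that matters is where the loss is taken: nothing in this setup rewards reconstructing low-level detail, only getting the abstract state right.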
Having studied diffusion models a fair bit, I’ve also wondered how they could be used for text generation. Diffusion models are very different from AR models: they’re also inherently stochastic, but they don’t generate in a fixed direction (like left-to-right text generation). Instead, they start from noise, and the model learns to denoise it iteratively until the result makes sense, in other words, until it is aligned with the training data distribution. Unlike AR models, which suffer from exposure bias, this process favors global coherence: if something doesn’t seem to fit, it can be corrected later on, because the model maintains a global draft that undergoes a refinement process.
This looks a lot more like human-like drafting, because we don’t necessarily think in words first. A recent example of a large diffusion text model is LLaDA; it’s worth a look if you’re curious.
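Here is a deliberately toy sketch of that refinement loop, in the masked-diffusion spirit of models like LLaDA. The “denoiser” below simply peeks at a fixed target sentence and fakes a confidence score, so everything except the loop structure is made up; the point is that the whole draft is revisited at every step and only the most confident positions are committed.

```python
import random

MASK = "_"
TARGET = "the cat is chasing a mouse in the garden".split()

def toy_denoiser(draft):
    # Propose a token and a fake confidence score for every masked position.
    # A real model would predict both from the unmasked context.
    return {i: (TARGET[i], random.random()) for i, tok in enumerate(draft) if tok == MASK}

def generate(length, steps):
    draft = [MASK] * length  # start from pure "noise": everything is masked
    for _ in range(steps):
        proposals = toy_denoiser(draft)
        if not proposals:
            break
        # Commit only the most confident half of the predictions this step;
        # the rest stay masked and are refined in later passes.
        keep = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)
        for i in keep[: max(1, len(keep) // 2)]:
            draft[i] = proposals[i][0]
        print(" ".join(draft))
    return draft

random.seed(0)
generate(length=len(TARGET), steps=6)
```

Real masked-diffusion models predict from context rather than peeking, and use learned confidence to decide what to keep or re-mask at each step, which is what gives them their global, revisable view of the draft.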
We’re more than just a prediction machine
A prominent view in modern neuroscience is that the brain is a prediction machine. And that makes sense to me: we predict constantly. When we intend to do something, we often evaluate the likely outcome before acting, which is essentially prediction. I don’t think this is unique to humans; animals more broadly do it too. Language processing is no exception, and we know from brain-imaging research that the brain actively anticipates upcoming patterns or words, much like an AR model does. If I write something like:
The cat is chasing a ____
It is evident that you strongly predicted the word “mouse” before you even began reading this sentence. Training AR models is essentially that: we hide the next chunk of text, ask the model to predict it, and train it with backpropagation on the prediction error. So they get very good at it, much like humans. My point is, the brain is also doing next-word prediction (strictly speaking, LLMs do next-token prediction, since tokens may be fragments of text rather than whole words).
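For concreteness, here is a minimal sketch of that training setup on a toy character-level model, using an LSTM as a stand-in for a transformer; every name and hyperparameter is illustrative. The essential part is that inputs and targets are the same sequence shifted by one position.

```python
import torch
import torch.nn as nn

text = "the cat is chasing a mouse"
vocab = sorted(set(text))
stoi = {c: i for i, c in enumerate(vocab)}
ids = torch.tensor([stoi[c] for c in text])

class TinyAR(nn.Module):
    def __init__(self, vocab_size, dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.LSTM(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, x):
        h, _ = self.rnn(self.embed(x))
        return self.head(h)  # at each position: logits for the *next* token

model = TinyAR(len(vocab))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

inputs = ids[:-1].unsqueeze(0)   # "the cat is chasing a mous"
targets = ids[1:].unsqueeze(0)   # "he cat is chasing a mouse" (shifted by one)
for step in range(200):
    logits = model(inputs)                                    # (1, T, vocab)
    loss = loss_fn(logits.reshape(-1, len(vocab)), targets.reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
```

After enough steps, this toy model should complete “the cat is chasing a ” with “mouse” character by character; it has absorbed the same statistical expectation the reader applied above, just from data rather than lived experience.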
Human thought, however, is a more complicated story. We do have inner speech, and we use language internally, which is something AR models can arguably mimic, though it isn’t necessarily the same as the language we speak out loud. But beyond inner speech, there is also non-sequential thought and planning, which can’t really be represented with simple Markov-style chains. Before speaking a sentence, we have a general idea of what we’re going to say; we don’t choose what to say next based only on the last word. That kind of planning isn’t something that can be represented sequentially.
The human mind is not, like ChatGPT and its ilk, a lumbering statistical engine for pattern matching, gorging on hundreds of terabytes of data and extrapolating the most likely conversational response or most probable answer to a scientific question. On the contrary, the human mind is a surprisingly efficient and even elegant system that operates with small amounts of information; it seeks not to infer brute correlations among data points but to create explanations. – Noam Chomsky
So, while the brain is a prediction machine, there is strong evidence that not all thinking is linguistic or sequential. Not everything we think or represent has to follow an inner narrative. The “gut feeling” we sometimes have is an example we don’t even fully understand at a scientific level, let alone something AR models capture. An idea is often represented first, then linearized for communication or refinement. Large reasoning models still lack that kind of non-sequential planning.
Language and thought are not purely autoregressive in humans, and prediction can only go so far. That is exactly why AI research is heading towards incorporating planning, memory and world models into new architectures, which will hopefully capture the non-autoregressive aspects of thinking.