Most generative AI models today are autoregressive: they follow the principle of next-token prediction, and the transformer architecture has been the dominant implementation for years now thanks to its computational efficiency. The concept is rather simple and easy to understand, as long as you aren’t interested in the details: everything can be tokenized and fed into an autoregressive model. And by everything, I mean everything: text as you’d expect, but also images, videos, 3D models and more. There is no limit to what can be represented and generated by an autoregressive model, and while pre-training is far from solved, I think it’s fair to say everyone more or less knows what to do. That’s why today’s autoregressive models, “multimodal reasoning general” LLMs, are statistical models so powerful that we may be seeing traits of generalization.
The purpose of AI research
But what is the original purpose of AI research? I will speak for myself here, but I know many other AI researchers would say the same: the ultimate goal is to understand how humans think. And we think the best (or the most fun) way to understand how humans think is to try to recreate it. Yet today, when AI is mentioned, it’s mostly about autoregressive models like LLMs. Major players in the field think they may achieve “AGI” (artificial general intelligence) by continuing to scale these models and applying all sorts of tricks that happen to work (multimodality, pure reinforcement learning, test-time compute and search, agentic systems). It’s really too early to tell whether there’s a ceiling to this approach, and I’m not one to pretend to know the absolute truth.
However, I keep asking myself the following question:
Are autoregressive models the best way to approximate human thinking?
You could say LLMs are fundamentally dumb because of their inherent linearity. Are they? Isn’t language itself linear? Autoregressive models may well be a simple approach that turns out to be remarkably effective at modeling how humans use language. But there are many limitations in practice.
Limitations of AR models
By design, AR models lack planning and reasoning capabilities. If you generate one word at a time, you don’t really have a general idea of where you’re heading; you just hope you will reach a sound conclusion by following a chain of thought. Large reasoning models work the same way: they are trained with RL on large sets of verifiable problems (math, code and the like), chosen to be neither too easy nor too hard. Being stochastic, AR models won’t always yield good results when formal logic is involved. They don’t really master abstract principles the way humans do.
Technically speaking, neural networks, as they are usually used, are function approximators, and Large Language Models (LLMs) are basically approximating the function of how humans use language. And they’re extremely good at that. But approximating a function is not the same thing as learning a function. – Gary Marcus (2025)
The current architecture of AR models lacks long-term memory and has limited working memory: everything has to fit within a context window. Long-term memory can be approximated by retrieving vectorized information from previous interactions, but ultimately whatever is retrieved still has to fit in that same context window. While LLMs with ever larger context windows keep getting released, they still suffer from serious coherence issues under heavy context loads. There is room for optimization here, but ultimately, LLMs have no recollection capabilities the way humans do. Once trained, they will not learn from their mistakes.
AR models hallucinate. Humans hallucinate too. But I fear the word “hallucination” is misused here, as it attributes to AR models human traits they don’t really have. Human mistakes and AR-model hallucinations are of a very different nature, because one has a world model and the other doesn’t. While humans make (a lot of) mistakes, they have a common-sense understanding that AR models lack. I would even say that I generally trust a SOTA LLM more than the average human; however, I may not detect hallucinations in LLMs as easily, which can be problematic.
Of course, there are ways to limit the risk of LLM hallucination. Retrieval-augmented generation (RAG) is a common one: we fit as much relevant data as possible into the LLM’s context window at inference time and hope it performs better on specific tasks. We can also tweak inference parameters such as temperature, making token prediction more rigid at the cost of creativity. Ultimately, though, stochastic models will always make plausible-sounding mistakes.
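To make the temperature point concrete, here is a minimal sketch of how temperature reshapes the next-token distribution before sampling. The vocabulary size and logit values are made up; a real LLM produces logits over tens of thousands of tokens, but the mechanism is the same.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_next_token(logits, temperature=1.0):
    """Sample a token id from raw logits after temperature scaling.

    Low temperature sharpens the distribution (more rigid, closer to greedy
    decoding); high temperature flattens it (more creative, more error-prone).
    """
    scaled = np.asarray(logits) / max(temperature, 1e-8)
    probs = np.exp(scaled - scaled.max())  # numerically stable softmax
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

# Toy logits over a 4-token vocabulary (values are made up).
logits = [2.0, 1.0, 0.2, -1.0]
for t in (0.2, 1.0, 1.5):
    counts = np.bincount([sample_next_token(logits, t) for _ in range(1000)], minlength=4)
    print(f"temperature={t}: {counts / 1000}")
```

Pushing temperature towards zero makes the sampler collapse onto the single most likely token, which removes some randomness-driven mistakes, but it does nothing about the cases where the most likely token is itself wrong.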
Exposure bias is another inherent issue of the autoregressive paradigm. If the model makes a small mistake early on, that mistake is fed back as input and tends to snowball into more errors. The model can easily derail and produce irrelevant or repetitive output. Humans notice when they’re going in circles and have the ability to “course-correct”, something LLMs lack. We may see traces of this capability emerging in reasoning models, but it is still somewhat limited.
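A crude back-of-the-envelope calculation shows why this compounds. If we assume, unrealistically, that each generated token independently “derails” the output with some small probability, the chance that a long generation stays on track shrinks geometrically; the 1% figure below is purely illustrative.

```python
# Hypothetical per-token derailment probability (illustrative only).
eps = 0.01
for n in (10, 100, 500, 1000):
    print(f"{n} tokens: P(no derailment) = {(1 - eps) ** n:.4g}")
# roughly 0.90, 0.37, 0.007 and 0.00004 respectively
```

Real errors are not independent, but the qualitative picture holds: without a mechanism to notice and repair early mistakes, long generations become fragile.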
Exploring other paradigms
Human thinking involves more than just linking words together, and I don’t think AGI can ever be achieved if an AI model doesn’t show solid planning and memory capabilities. That isn’t to say AR models should be ditched altogether; they’re still very useful tools, and they may even be used as components of more complex architectures that tackle these limitations.
Yann LeCun is a famous AI researcher who has also been a vocal critic of AR models. He suggests pursuing research into other paradigms to achieve human-like cognition. He is working on an architecture called JEPA (Joint Embedding Predictive Architecture), which makes predictions in an abstract representation space and refines them iteratively, instead of generating every detail step by step like traditional AR models. In other words, JEPA’s goal is to learn about the world by focusing not on the details to generate, but on the state of the world, its essential aspects.
Rather than raw sequence prediction, the idea is a kind of self-supervised learning focused on abstract prediction. This makes sense: humans don’t perceive the world pixel by pixel, after all. A true mark of intelligence would be to focus on the core concept, on the essential information in an abstract form, and that might be how we achieve goal-driven AI models.
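As a rough sketch of what “predicting in an abstract space” can look like in code, here is a minimal joint-embedding setup loosely in the spirit of I-JEPA. All modules, shapes and training details below are placeholders of my own; the only point being made is that the loss is computed between embeddings, never between pixels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

dim = 64
# Placeholder encoders and predictor (a real model would use vision
# transformers, a masking strategy, an EMA-updated target encoder, etc.).
context_encoder = nn.Sequential(nn.Linear(128, dim), nn.ReLU(), nn.Linear(dim, dim))
target_encoder = nn.Sequential(nn.Linear(128, dim), nn.ReLU(), nn.Linear(dim, dim))
predictor = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

x_context = torch.randn(16, 128)  # visible part of the input (e.g. image patches)
x_target = torch.randn(16, 128)   # hidden part whose representation must be predicted

z_context = context_encoder(x_context)
with torch.no_grad():             # the target encoder provides targets, not gradients
    z_target = target_encoder(x_target)

# The loss lives entirely in representation space: predict the abstract
# state of the hidden part, not its raw pixels.
loss = F.mse_loss(predictor(z_context), z_target)
loss.backward()
```

The detail that matters is where the loss is taken: nothing in this setup rewards reconstructing low-level detail, only getting the abstract state right.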
Having studied diffusion models a fair bit, I’ve also wondered how they could be used for text generation. Diffusion models are very different from AR models: they’re also inherently stochastic, but they don’t generate in a fixed direction (like left-to-right text generation). Instead, they start from noise, and the model learns to denoise it iteratively until the result makes sense, in other words, until it is aligned with the training data distribution. Unlike AR models, which suffer from exposure bias, this process favors global coherence: if something doesn’t seem to fit, it can be corrected later on, because the model maintains a global draft that undergoes a refinement process.
This looks a lot more like human-like drafting, because we don’t necessarily think in words first. A recent example of a large diffusion text model is LLaDA; it’s worth a look if you’re curious.
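Here is a deliberately toy sketch of that refinement loop, in the masked-diffusion spirit of models like LLaDA. The “denoiser” below simply peeks at a fixed target sentence and fakes a confidence score, so everything except the loop structure is made up; the point is that the whole draft is revisited at every step and only the most confident positions are committed.

```python
import random

MASK = "_"
TARGET = "the cat is chasing a mouse in the garden".split()

def toy_denoiser(draft):
    # Propose a token and a fake confidence score for every masked position.
    # A real model would predict both from the unmasked context.
    return {i: (TARGET[i], random.random()) for i, tok in enumerate(draft) if tok == MASK}

def generate(length, steps):
    draft = [MASK] * length  # start from pure "noise": everything is masked
    for _ in range(steps):
        proposals = toy_denoiser(draft)
        if not proposals:
            break
        # Commit only the most confident half of the predictions this step;
        # the rest stay masked and are refined in later passes.
        keep = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)
        for i in keep[: max(1, len(keep) // 2)]:
            draft[i] = proposals[i][0]
        print(" ".join(draft))
    return draft

random.seed(0)
generate(length=len(TARGET), steps=6)
```

Real masked-diffusion models predict from context rather than peeking, and use learned confidence to decide what to keep or re-mask at each step, which is what gives them their global, revisable view of the draft.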
We’re more than just a prediction machine
A prominent view in modern neuroscience is that the brain is a prediction machine. And that makes sense to me: we predict constantly. When we intend to do something, we often evaluate the likely outcome before acting, which is essentially prediction. I don’t think this is unique to humans; animals more broadly do it too. Language processing is no exception, and we know from brain-imaging research that the brain actively anticipates upcoming patterns or words, much like an AR model does. If I write something like:
The cat is chasing a ____
It is evident that you strongly predicted the word “mouse” before you even began reading this sentence. Training AR models is essentially that: we hide the next chunk of text, ask the model to predict it, and train it with backpropagation on the prediction error. So they get very good at it, much like humans. My point is, the brain is also doing next-word prediction (strictly speaking, LLMs do next-token prediction, since tokens may be fragments of text rather than whole words).
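For concreteness, here is a minimal sketch of that training setup on a toy character-level model, using an LSTM as a stand-in for a transformer; every name and hyperparameter is illustrative. The essential part is that inputs and targets are the same sequence shifted by one position.

```python
import torch
import torch.nn as nn

text = "the cat is chasing a mouse"
vocab = sorted(set(text))
stoi = {c: i for i, c in enumerate(vocab)}
ids = torch.tensor([stoi[c] for c in text])

class TinyAR(nn.Module):
    def __init__(self, vocab_size, dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.LSTM(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, x):
        h, _ = self.rnn(self.embed(x))
        return self.head(h)  # at each position: logits for the *next* token

model = TinyAR(len(vocab))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

inputs = ids[:-1].unsqueeze(0)   # "the cat is chasing a mous"
targets = ids[1:].unsqueeze(0)   # "he cat is chasing a mouse" (shifted by one)
for step in range(200):
    logits = model(inputs)                                    # (1, T, vocab)
    loss = loss_fn(logits.reshape(-1, len(vocab)), targets.reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
```

After enough steps, this toy model should complete “the cat is chasing a ” with “mouse” character by character; it has absorbed the same statistical expectation the reader applied above, just from data rather than lived experience.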
Human thought, however, is a more complicated story. We do have inner speech, and we use language internally, which is something AR models can arguably mimic, though it isn’t necessarily the same as the language we speak out loud. But beyond inner speech, there is also non-sequential thought and planning, which can’t really be represented with simple Markov-style chains. Before speaking a sentence, we have a general idea of what we’re going to say; we don’t choose what to say next based only on the last word. That kind of planning isn’t something that can be represented sequentially.
The human mind is not, like ChatGPT and its ilk, a lumbering statistical engine for pattern matching, gorging on hundreds of terabytes of data and extrapolating the most likely conversational response or most probable answer to a scientific question. On the contrary, the human mind is a surprisingly efficient and even elegant system that operates with small amounts of information; it seeks not to infer brute correlations among data points but to create explanations. – Noam Chomsky
So, while the brain is a prediction machine, there is strong evidence that not all thinking is linguistic or sequential. Not everything we think or represent has to follow an inner narrative. The “gut feeling” we sometimes have is an example we don’t even fully understand at a scientific level, let alone something AR models capture. An idea is often represented first, then linearized for communication or refinement. Large reasoning models still lack that kind of non-sequential planning.
Language and thought are not purely autoregressive in humans, and prediction can only go so far. That is exactly why AI research is heading towards incorporating planning, memory and world models into new architectures, which will hopefully capture the non-autoregressive aspects of thinking.