
Beyond Bigger Models: Recursion As The Next Scaling Law In AI
Y Combinator Startup Podcast
A 7-million-parameter model with zero pre-training defeats trillion-parameter LLMs on hard reasoning — because scale was never the real bottleneck.
Key Ideas
Recursive Models Surpass Scale on Reasoning
A 7M-parameter recursive model beats trillion-parameter LLMs on hard reasoning benchmarks.
Chain-of-Thought Mimics Rather Than Reasons
Chain-of-thought is a hack bounded by human knowledge — not true reasoning.
Layer Count Proves Fundamental Reasoning Limit
LLMs have a provable, architectural ceiling on problems requiring more steps than they have layers.
Single-Step Backprop Surprisingly Matches Full Training
Truncated backprop at T=1 works almost as well as full backprop — and nobody knows why.
Merging LLMs With Recursive Reasoning Modules
The next breakthrough is merging LLM embeddings with recursive reasoning modules.
Why does it matter? Because scaling up isn't the same as reasoning — and a 7-million-parameter model just proved it.
A tiny recursive model trained on roughly a thousand puzzles with zero pre-training just outperformed trillion-parameter LLMs on hard reasoning benchmarks. This episode pulls apart why that's possible — and why it exposes a structural ceiling baked into every transformer, no matter how big it gets.
- LLMs have a provable, architectural limit: if a problem requires more computational steps than the model has layers, it literally cannot solve it.
- Chain-of-thought and tool use are workarounds bounded by existing human knowledge — they cannot discover genuinely novel algorithms.
- A 27M-parameter model scored 70% on ArcPrize 1 while OpenAI's O3 scored zero — then a 7M-parameter successor hit 87%.
- The training trick making all of this possible is counterintuitive: backpropping through just one recursive step works almost as well as full backprop through time.
LLMs hit a hard wall on reasoning — and more parameters won't fix it
Sorting a list of 31 elements is literally impossible for a 30-layer transformer in a single forward pass. That's not a bug or a training failure; it's mathematics. Comparison-based sorting has a proven lower bound on the order of n log n comparisons, and when you run out of transformer layers, you run out of chances to do comparisons. Full stop.
YC visiting partner Francois Chauvard frames it bluntly: if the list has 31 elements and your transformer has 30 layers, you're out of steps. The same logic applies to Sudoku, mazes, and any problem that's informationally incompressible — where you cannot shortcut the computation without external memory.
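To make the lower bound concrete, here is a minimal sketch that computes the information-theoretic minimum number of comparisons for a comparison sort, ceil(log2(n!)), which grows on the order of n log n. The function name `min_comparisons` is illustrative. Note that the bound counts comparisons; how it maps onto transformer layers depends on how much work one layer is assumed to do, and the episode's framing treats depth as the step budget.

```python
import math


def min_comparisons(n: int) -> int:
    """Worst-case lower bound for any comparison sort of n distinct items.

    There are n! possible orderings and each comparison yields one bit,
    so at least ceil(log2(n!)) comparisons are needed.
    """
    return math.ceil(math.log2(math.factorial(n)))


if __name__ == "__main__":
    for n in (4, 8, 31):
        print(f"n={n}: at least {min_comparisons(n)} comparisons")
```

For the 31-element list in the example, the bound works out to 113 comparisons.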
This isn't about data quality or model size. It's about what a feed-forward, one-shot architecture is structurally capable of doing. At training time, the transformer processes every token in parallel through a causal mask — the whole sequence in one shot. That's what makes it so fast to train. But that same design means there's no iterative refinement, no recursive scratch space, no tape to write to. You get one pass, and then you're done.
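The difference between a fixed step budget and iterative refinement can be sketched with a toy analogy. This is not a model of a transformer: one bubble-sort pass stands in for "one layer", and the names `swap_pass`, `fixed_depth`, and `iterate_until_stable` are hypothetical. The point is only that a fixed depth caps the work, while a loop can keep going until the answer stops changing.

```python
def swap_pass(xs):
    """One left-to-right pass of adjacent compare-and-swap (a bubble-sort pass)."""
    xs = list(xs)
    for i in range(len(xs) - 1):
        if xs[i] > xs[i + 1]:
            xs[i], xs[i + 1] = xs[i + 1], xs[i]
    return xs


def fixed_depth(xs, depth):
    """Apply exactly `depth` passes, like a network with a fixed number of layers."""
    xs = list(xs)
    for _ in range(depth):
        xs = swap_pass(xs)
    return xs


def iterate_until_stable(xs):
    """Reapply the same pass until nothing changes; returns (result, passes that changed something)."""
    xs = list(xs)
    steps = 0
    while True:
        nxt = swap_pass(xs)
        if nxt == xs:
            return xs, steps
        xs, steps = nxt, steps + 1


if __name__ == "__main__":
    data = [5, 4, 3, 2, 1]
    print(fixed_depth(data, 2))        # budget too small: still unsorted
    print(iterate_until_stable(data))  # keeps refining until sorted
```

With a budget of two passes the reversed five-element list is still out of order; the iterative version reuses the same step four times and finishes.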
The implication is uncomfortable for anyone betting that the next order of magnitude in parameters fixes reasoning: some failures aren't scaling problems. They're architecture problems. And they require a different class of model entirely.
Chain-of-thought is a retrieval hack, not reasoning — and it breaks the moment you leave known territory
Give an LLM infinite examples of unsorted and sorted lists, train it with chain-of-thought supervision on every intermediate step, and it can learn to sort. Chauvard confirms he's run exactly this experiment. So the problem is solved, right?
Only if you already know the algorithm. "If you chain of thought it on all the bubble sort input and output, it will only do bubble sort." If merge sort was never in the training data, the model has no path to discover it — because chain-of-thought reasoning is happening in discrete token space. Every intermediate thought has to be snapped back to a token from the model's vocabulary. There's no continuous latent scratchpad. The model is recombining things it's already seen, not deriving new structure from first principles.
Tool use has the same ceiling: you can only call functions that already exist and are known. Both approaches, as Chauvard puts it, are "bounded by the bounds of human knowledge. In the event it's outside the set of human knowledge, then you're kind of SOL."
The contrast with recursive models is stark. RNNs and their successors keep their intermediate state in a continuous high-dimensional latent space — far more expressive than any discrete token vocabulary. The carry never has to be quantized back to a word. That's not a minor implementation detail; it's the thing that enables genuine algorithmic discovery from input-output pairs alone, with no labeled reasoning traces required.
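A toy illustration of why quantizing the carry can matter, under loud assumptions: integers stand in for a "vocabulary", and rounding stands in for snapping a thought back to a token. Real language models are far richer than this, so treat it as an intuition pump rather than a claim about any specific system. The function `run` and its parameters are hypothetical.

```python
def run(steps: int, delta: float, quantize: bool) -> float:
    """Accumulate small updates, optionally snapping the state to the
    nearest integer "token" after every step."""
    state = 0.0
    for _ in range(steps):
        state = state + delta
        if quantize:
            # Discrete carry: progress smaller than half a "token" is lost.
            state = float(round(state))
    return state


if __name__ == "__main__":
    print("continuous carry:", run(10, 0.3, quantize=False))
    print("quantized carry: ", run(10, 0.3, quantize=True))
```

The continuous carry accumulates ten small updates to roughly 3.0, while the quantized carry rounds each 0.3 step back to 0 and never moves.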
A 27M-parameter model with no pre-training scored 70% on ArcPrize 1 — while O3 scored zero
The HRM paper landed like a hand grenade on benchmark assumptions. Trained exclusively on ArcPrize's roughly one thousand puzzles, starting from random weights, with no pre-training on any internet data, a 27-million-parameter hierarchical reasoning model hit 70% on ArcPrize 1. At the same time, OpenAI's O3 — a model trained on effectively the entire internet — scored zero on the same benchmark.
The architecture behind this draws from RNN lineage with a three-level recursion structure: a low-level network that recurses TL times, a high-level network that runs TH times over the low-level loop, and an outer refinement loop that runs the whole thing N times. The hidden states — ZL and ZH — are not reset between outer refinement steps. They carry forward, meaning each pass is operating from a different point in latent memory space, effectively constructing a mini-batch across memory states rather than across different inputs.
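The loop structure described above can be sketched as follows. This is a minimal sketch of the nesting only, assuming scalar states and toy `tanh` updates; `f_low`, `f_high`, and `hrm_forward` are hypothetical names, and the real model uses learned networks whose exact update order may differ from this.

```python
import math


def f_low(z_l, z_h, x):
    """Toy stand-in for the low-level network (sees the input and both states)."""
    return math.tanh(z_l + z_h + x)


def f_high(z_h, z_l):
    """Toy stand-in for the high-level network (reads the low-level state)."""
    return math.tanh(z_h + z_l)


def hrm_forward(x, n_outer, t_high, t_low):
    """Run N outer refinement steps; each runs the high-level update T_H times,
    and each of those runs the low-level update T_L times."""
    z_l, z_h = 0.0, 0.0  # initialized once, NOT reset between outer steps
    low_calls = 0
    trace = []
    for _ in range(n_outer):
        for _ in range(t_high):
            for _ in range(t_low):
                z_l = f_low(z_l, z_h, x)
                low_calls += 1
            z_h = f_high(z_h, z_l)
        trace.append(z_h)  # high-level state after each refinement step
    return trace, low_calls


if __name__ == "__main__":
    trace, calls = hrm_forward(0.5, n_outer=3, t_high=2, t_low=4)
    print(calls, trace)
```

Because the states carry forward, each outer step starts from a different point in latent space, which is why successive entries of `trace` differ.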
The critical training innovation: instead of backpropagating through all recursive steps (what Alex Graves did in every neural Turing machine and adaptive compute time paper), the team applies a stop-gradient after one step. Truncated backprop through time at T=1. Counterintuitively, this works. The model learns, the gradients don't explode, and the vanishing gradient catastrophe that killed RNNs for a decade is sidestepped. Nobody fully understands why — and Chauvard says so directly.
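To show what the one-step truncation changes, here is a minimal sketch on a scalar recurrence z_t = tanh(w * z_{t-1} + x) with a squared-error loss on the final state. It is an assumption-laden toy, not the paper's training code: `full_grad` computes the same derivative that backprop through time would (accumulated forward here for brevity), and `one_step_grad` treats the previous state as a constant, which is what a stop-gradient before the last step does.

```python
import math


def unroll(w, x, steps):
    """Return [z_0, ..., z_T] for z_t = tanh(w * z_{t-1} + x), z_0 = 0."""
    zs = [0.0]
    for _ in range(steps):
        zs.append(math.tanh(w * zs[-1] + x))
    return zs


def loss(w, x, y, steps):
    return 0.5 * (unroll(w, x, steps)[-1] - y) ** 2


def full_grad(w, x, y, steps):
    """dL/dw through every step (the quantity full BPTT computes)."""
    zs = unroll(w, x, steps)
    g = 0.0  # running dz_t/dw
    for t in range(1, steps + 1):
        g = (1.0 - zs[t] ** 2) * (zs[t - 1] + w * g)
    return (zs[-1] - y) * g


def one_step_grad(w, x, y, steps):
    """dL/dw with z_{T-1} held constant (stop-gradient before the last step)."""
    zs = unroll(w, x, steps)
    g = (1.0 - zs[-1] ** 2) * zs[-2]
    return (zs[-1] - y) * g


if __name__ == "__main__":
    w, x, y, steps = 0.5, 0.5, 1.0, 5
    print("full:    ", full_grad(w, x, y, steps))
    print("one-step:", one_step_grad(w, x, y, steps))
```

In this toy setting the truncated gradient is smaller in magnitude but points the same way as the full one, and it needs only the last two states rather than the whole unrolled history. Whether that explains the result at scale is exactly the open question the episode raises.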
TRM shrinks the model to 7M parameters and improves ArcPrize performance to 87% — by going simpler and deeper
Researcher Alexia took the HRM results and did what good follow-on ML papers always do: deleted most of it and kept the magic.
The two key changes in TRM: collapse the separate low-level and high-level networks into a single shared-weight network, and extend backpropagation through one full recursive loop instead of just one step past the stop-gradient. The separate hidden states ZL and ZH are preserved — the insight is that you need distinct memory scopes, not distinct networks to produce them.
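The shared-network idea can be sketched like this, assuming scalar states and a single toy function `net` standing in for the shared-weight network. The names `net` and `trm_refine`, and the choice of which inputs each update sees, are illustrative assumptions rather than the published implementation.

```python
import math


def net(a, b, c):
    """One shared toy update, reused for both memory states."""
    return math.tanh(a + b + c)


def trm_refine(x, z_l, z_h, n_latent):
    """One refinement step: update the scratchpad state several times,
    then update the answer state once, using the SAME function for both."""
    for _ in range(n_latent):
        z_l = net(x, z_h, z_l)    # scratchpad update sees the input
    z_h = net(0.0, z_h, z_l)      # answer update omits the input
    return z_l, z_h


if __name__ == "__main__":
    z_l, z_h = 0.0, 0.0
    for step in range(3):
        z_l, z_h = trm_refine(0.5, z_l, z_h, n_latent=4)
        print(step, z_l, z_h)
```

The two states play different roles only because of what they are fed and when they are updated, which mirrors the point that distinct memory scopes do not require distinct networks.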
The result: a 7-million-parameter model, roughly four times smaller than HRM's 27 million, that scores 87% on ArcPrize 1. Going deeper on transformer layers inside the architecture didn't help; on some Sudoku tests, a plain feedforward net matched transformer performance. The recursion mechanism itself, not architectural complexity, is doing the work.
Alexia also confirms that the outer refinement loop is where most of the performance lives. Train on 16 refinement steps, test on just one, and you recover about seven-eighths of the performance. The recursion at training time matters more than recursion at test time — another counterintuitive result that the field hasn't fully explained yet.
Simplification plus more aggressive recursion beat the more complex architecture. The inductive bias is everything.
The model teaches itself to solve Sudoku — with no chain-of-thought supervision and no human-labeled solution traces
The training dynamics of TRM look nothing like standard supervised learning. There are no labeled reasoning steps, no teacher-forcing through intermediate solution states, no human-annotated chain-of-thought. The model sees only input-output pairs: incomplete Sudoku puzzle in, completed grid out.
What emerges from the EM-style optimization is that ZL functions like a local scratchpad — trying, updating, proposing partial computations — while ZH holds a candidate answer, one MLP lookup away from the final output. The training process rewards strategies stored in memory that tend to produce correct outputs. The algorithm is discovered, not taught.
"If we had Sudoku and we didn't know how to solve it," Chauvard says, "it would just have solved it." That's the point. For the Millennium Prize problems, for protein folding variants we don't have traces for, for any domain where human-generated solution paths don't exist — this class of model offers a path that chain-of-thought cannot. You only need the inputs and outputs. The model figures out what to store in between.
The real bet: LLM embeddings as the feature space, tiny recursive models as the reasoning engine
Neither architecture alone is the end state. LLMs are extraordinarily good at one thing: finding rich, semantically structured latent representations. But reasoning over those representations is almost always routed back through token space, which is the bottleneck.
The architecture that excites Chauvard most is the one that doesn't exist yet: take the latent space that a giant LLM has already learned — where concepts are cleanly separated, where pixels have been compressed into meaningful structure — and deploy a small recursive model to reason inside it. Not through tokens. Through continuous latent space, with recursive depth, trained on that representation.
"A lot of the view of what these LLMs are doing is finding really amazing embedding representation spaces. But reasoning inside that space is actually not done all that much." A 7-million-parameter recursive module sitting inside a trillion-parameter embedding space might produce capabilities that neither has alone — without the memory explosion that made deep recursive training impossible for the past decade.
The ceiling on current AI is architectural — and the fix is already in the math
What this episode quietly establishes is that the next leap in AI capability probably won't come from another order of magnitude in parameters or another trillion tokens of training data. It'll come from changing what happens inside the forward pass — from one-shot feed-forward to iterative, recursive refinement in continuous space.
Truncated backprop at T=1 working almost as well as full backprop is the most underexplored result in this space right now. Nobody knows why it works. That unknown is probably where the next breakthrough lives.
Topics: AI, machine learning, recursive models, LLMs, transformers, RNNs, reasoning, ArcPrize, HRM, TRM, chain-of-thought, scaling laws, neural architecture, backpropagation, YCombinator
Frequently Asked Questions
- What is Beyond Bigger Models about?
- The episode argues that scale was never the real bottleneck in AI reasoning. Its central example is a 7-million-parameter recursive model that beats trillion-parameter LLMs on hard reasoning benchmarks, which challenges the assumption that bigger models reason better. It characterizes chain-of-thought as a hack bounded by human knowledge rather than true reasoning, and argues that LLMs face an architectural ceiling that scaling cannot remove. The proposed next step is to merge LLM embeddings with recursive reasoning modules instead of simply increasing parameter counts.
- How can a 7-million-parameter model beat trillion-parameter LLMs?
- By iterating. The recursive model applies the same small network repeatedly, refining a continuous latent state, so the number of computational steps it can take is not capped by its layer count. Chain-of-thought, by contrast, is learned from existing reasoning traces and has to pass every intermediate thought through discrete tokens. The recursive model is trained only on input-output pairs (roughly a thousand puzzles, with no pre-training) and discovers its own intermediate computation. The episode takes this as evidence that, for this class of problems, architecture matters more than raw parameter count.
- What architectural limitations do current LLMs face?
- LLMs have a provable architectural ceiling on problems requiring more sequential steps than they have layers. A transformer gets one feed-forward pass per output, so the number of layers limits how much sequential computation it can do, regardless of how many parameters it has. Chain-of-thought and tool use work around this, but both are bounded by existing human knowledge. On the episode's argument, adding parameters does not lift the ceiling, because the limit comes from the one-shot architecture and not from model size.
- Why does truncated backpropagation work almost as well as full backprop?
- Nobody knows yet, and the episode says so directly. Applying a stop-gradient so that backpropagation covers just one recursive step (truncated backprop through time at T=1) works almost as well as full backprop through time, while avoiding the exploding and vanishing gradients that made RNNs hard to train. The episode calls this the most underexplored result in the area and suggests that explaining it is probably where the next breakthrough lives.
