How François Chollet Is Building A New Path To AGI

Y Combinator Startup Podcast

57 min episode
8 min read
5 key ideas

The creator of ARC-AGI believes AGI could have been built on 1980s hardware — and will fit in under 10,000 lines of code.

Key Ideas

1. Reasoning beats raw scaling

Scaling LLMs 50,000x barely moved ARC-AGI scores; reasoning was the real leap.

2. Verifiable rewards unlock full automation

Any domain with verifiable rewards can be fully automated today; unverifiable domains will stall.

3. AGI forecasted around 2030

Chollet's best guess is AGI around 2030, roughly when ARC-AGI v6 or v7 would be released.

4. Conceptual breakthrough beats computation

The AGI codebase will likely be under 10,000 lines. It is a conceptual problem, not a compute problem.

5. Domain expertise amplified by AI

Expertise compounds with AI tools: the more domain knowledge you have, the more AI empowers you.

Why does it matter? Because the creator of the most important AI benchmark thinks the entire industry is building on the wrong foundation.

François Chollet built ARC-AGI to expose what LLMs can't do — and then watched $100B flow into the very approach he thinks is a dead end. His new lab, NDEA, is a deliberate bet against the consensus, and this conversation is the clearest articulation yet of why the scaling era may be approaching its conceptual ceiling.

  • AGI will retrospectively fit in under 10,000 lines of code — and could have run on 1980s hardware
  • Any domain with verifiable rewards can already be fully automated with current LLMs; unverifiable domains will stall
  • Scaling LLMs 50,000x moved ARC-AGI scores almost zero — reasoning models were the actual leap
  • Chollet puts AGI around 2030, roughly ARC v6 or v7, but thinks economic automation arrives domain-by-domain well before that

Symbolic program synthesis isn't a tweak to deep learning — it's a replacement for the entire stack

NDEA isn't building a better coding agent. Chollet is working two levels below that, replacing the fundamental learning substrate. The idea: instead of fitting a parametric curve via gradient descent, find the shortest possible symbolic model that explains the data.

"We are replacing the parametric curve with a symbolic model that is meant to be as small as possible," he says. "It's like the simplest possible model to explain the data."

The payoff, if it works, is dramatic: models that need far less data, run far more efficiently at inference, and generalize far better because they're genuinely compressed representations rather than interpolated patterns. This connects to the minimum description length principle — the model most likely to generalize is the shortest one. Chollet argues gradient descent structurally cannot find such models.
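To make the "shortest model that explains the data" idea concrete, here is a minimal sketch of brute-force program search ordered by length. It is not NDEA's method, which the conversation does not describe in detail; the four-primitive DSL, the function names, and the example data are all hypothetical.

```python
from itertools import product

# Hypothetical mini-DSL of unary integer functions (an assumption for illustration).
PRIMITIVES = {
    "inc": lambda x: x + 1,
    "double": lambda x: x * 2,
    "square": lambda x: x * x,
    "neg": lambda x: -x,
}

def run(program, x):
    """Apply the named primitives in `program` from left to right."""
    for name in program:
        x = PRIMITIVES[name](x)
    return x

def shortest_program(examples, max_len=4):
    """Return the shortest primitive sequence consistent with every example.

    Candidates are enumerated by increasing length, so the first match is a
    minimum-length explanation within this DSL (ties go to enumeration order).
    Returns None if nothing up to max_len fits.
    """
    for length in range(max_len + 1):
        for program in product(PRIMITIVES, repeat=length):
            if all(run(program, x) == y for x, y in examples):
                return program
    return None

examples = [(1, 4), (2, 6), (3, 8)]  # generated by y = 2 * (x + 1)
print(shortest_program(examples))    # ('inc', 'double')
```

Exhaustive enumeration grows exponentially with program length, which is one reason a practical system would need something far smarter than this loop. The sketch only shows the objective: prefer the smallest symbolic description that reproduces the observations.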

His honest assessment of the odds: 10 to 15% chance of success. But he frames this as precisely the reason to try — if a high-impact idea has low probability and no one else is pursuing it, walking away guarantees failure. "If you don't do it, no one else will do it."

The implication for everyone watching the AI landscape: the assumption that future AI will be an LLM with more layers on top may be wrong in a foundational sense, not just an engineering sense.

Verifiable rewards are the master key — any domain that has them can be fully automated right now

One number explains the coding agent explosion: zero. That is how many new reasoning capabilities the models needed. What changed was the training signal.

"Any problem where the solutions you propose can be formally verified and you can actually trust the reward signal," Chollet says, "any domain like this can be fully automated with current technology."

Code was first because unit tests and compilers give you ground truth for free. The model stopped relying on expensive human annotations and started running millions of trial-and-error loops, building an actual execution model in the process — tracking variable states, simulating runtime behavior. Mathematics is next for the same structural reason.

The flip side is just as important. For essay writing, creative work, or any fuzzy domain, progress will be "very, very slow. Maybe it's even going to stall." Human annotation is costly and sparse; there's no loop to run.

This gives founders and researchers a concrete diagnostic: map your problem onto the verifiability spectrum before deciding how much to bet on AI automation. If you can engineer a formal verification signal, you can access the full force of the current paradigm. If you can't, you're paying for incremental gains on expensive human labels.
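A verifiable reward can be as simple as "did the candidate pass every test?" The sketch below assumes a hypothetical `unit_test_reward` helper and toy test cases; it is an illustration of the kind of signal described above, not any lab's training code.

```python
def unit_test_reward(candidate_source, func_name, test_cases):
    """Return 1.0 if the candidate passes every test case, else 0.0.

    Caution: exec() on untrusted code is unsafe outside a sandbox.
    It is used here only to keep the sketch short.
    """
    namespace = {}
    try:
        exec(candidate_source, namespace)
        func = namespace[func_name]
        passed = all(func(*args) == expected for args, expected in test_cases)
    except Exception:
        # Syntax errors, missing functions, and runtime crashes all earn zero.
        return 0.0
    return 1.0 if passed else 0.0

# Hypothetical task: implement add(a, b).
tests = [((2, 3), 5), ((-1, 1), 0)]
good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b):\n    return a - b\n"

print(unit_test_reward(good, "add", tests))  # 1.0
print(unit_test_reward(bad, "add", tests))   # 0.0
```

The reward needs no human in the loop, which is what lets it run millions of times. An essay has no equivalent of `test_cases`, and that is the gap the verifiability diagnostic is meant to surface.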

Scaling LLMs 50,000x moved ARC-AGI almost zero — the real breakthrough was reasoning, not compute

From GPT-3 to the largest base models available in early 2025, ARC-AGI v1 scores stayed below 10%. Fifty-thousand-fold scale-up. Near-zero movement on the benchmark.

"It was really telling you that more scale, scaling up pre-training alone, was not going to crack the benchmark."

The step change came with OpenAI's o1 and then o3, the first reasoning models. OpenAI used ARC performance specifically to demonstrate that something genuinely new had emerged, because it was the one unsaturated reasoning benchmark that couldn't be gamed by raw memorization.

Then came v2 saturation, but through a different mechanism: not higher fluid intelligence, but industrial-scale targeted training. Labs generated ARC-like tasks, solved them via program induction, verified answers formally, fine-tuned on successful reasoning chains, and repeated millions of times. Confluence Labs hit 97% on v2 during a YC batch — in a couple of months.

Chollet's read: the models aren't smarter. They're better trained. "The models don't have higher fluid intelligence per se. They don't have a higher IQ, so to speak. It's just that they're way better trained." Track reasoning benchmarks, not parameter counts — they're the leading indicator of genuine capability shifts.
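The generate, solve, verify, and collect loop described above can be sketched in a few lines. Everything here is a toy stand-in: the rule set, the random "model" in `propose`, and the function names are assumptions, and the fine-tuning step itself is omitted.

```python
import random

# Hypothetical rule set standing in for an ARC-like task generator.
RULES = {
    "reverse": lambda xs: xs[::-1],
    "sort": lambda xs: sorted(xs),
    "double": lambda xs: [2 * x for x in xs],
}

def make_task(rng):
    """Sample a hidden rule and return (input, output) demonstration pairs."""
    rule = rng.choice(list(RULES))
    inputs = [[rng.randint(0, 9) for _ in range(4)] for _ in range(3)]
    return [(xs, RULES[rule](xs)) for xs in inputs]

def propose(rng):
    """Stand-in for a model sampling a candidate program (a random rule name)."""
    return rng.choice(list(RULES))

def verified(candidate, pairs):
    """Formal check: the candidate must reproduce every demonstration exactly."""
    return all(RULES[candidate](xs) == ys for xs, ys in pairs)

def collect_training_data(num_tasks, attempts_per_task, seed=0):
    """Keep only (task, solution) pairs that pass verification."""
    rng = random.Random(seed)
    dataset = []
    for _ in range(num_tasks):
        pairs = make_task(rng)
        for _ in range(attempts_per_task):
            candidate = propose(rng)
            if verified(candidate, pairs):
                dataset.append((pairs, candidate))
                break  # one verified solution per task is enough here
    return dataset

data = collect_training_data(num_tasks=100, attempts_per_task=5)
print(f"kept {len(data)} verified examples out of 100 tasks")
```

In a real pipeline the kept examples would feed a fine-tuning run and the loop would repeat. Nothing in the loop makes the proposer more general; it only accumulates verified examples for this family of tasks, which is the distinction Chollet draws between better training and higher fluid intelligence.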

ARC-AGI v3 measures something frontier models are genuinely bad at — and it's designed to stay that way

V3 drops an agent into an unknown mini video game with no instructions, no stated goal, and no hint about what the controls do. It must explore, infer goals, build a world model, plan, and solve — all scored against human efficiency baselines.

"Frontier models today, they are not very good at it."

The benchmark is deliberately resistant to the v2 saturation strategy. The private test set uses substantially different games from the public set, and the public set is intentionally easier — so performance on what's visible tells you almost nothing about how a system will score where it counts.

Chollet built an entire video game studio to create the 250+ environments: game developers, a custom engine, design pipelines, human testing. Every game is built on core knowledge priors — basic physics, object permanence, agent intentionality — with no cultural symbols, no arrows, no learned shortcuts. Regular humans with zero prior training can solve them within a few hundred to two thousand actions. Current frontier models cannot match that efficiency even when they find solutions.

V3 is the cleanest signal yet of whether any lab has made genuine progress on fluid intelligence rather than domain mastery. Watch it.
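The summary does not give the v3 scoring formula, so the following is only one plausible way to express "scored against human efficiency baselines": a ratio of human actions to agent actions, capped at 1. The function name and the numbers are hypothetical, and this is not the official ARC-AGI metric.

```python
def efficiency_score(agent_actions, human_baseline_actions, solved):
    """Illustrative score in [0, 1]. NOT the official ARC-AGI v3 metric.

    An unsolved game scores 0. A solved game scores the ratio of the human
    baseline action count to the agent's action count, capped at 1.
    """
    if not solved or agent_actions <= 0:
        return 0.0
    return min(1.0, human_baseline_actions / agent_actions)

print(efficiency_score(agent_actions=4000, human_baseline_actions=500, solved=True))  # 0.125
print(efficiency_score(agent_actions=300, human_baseline_actions=500, solved=True))   # 1.0
print(efficiency_score(agent_actions=300, human_baseline_actions=500, solved=False))  # 0.0
```

Under a rule like this, an agent that eventually brute-forces a solution still scores poorly, which matches the point that finding a solution is not the same as matching human efficiency.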

AGI the economic phenomenon arrives before AGI the intelligence milestone — and we may already be crossing it in code

Chollet draws a hard line between two definitions that most people blur. The industry definition — automating most economically valuable work — is about automation. His definition is about learning efficiency: can a system approach any new domain and become competent with the same data and compute a human would need?

He expects both, on separate timelines. Asked whether economic AGI arrives first, he answers: "Absolutely. I think that's a trajectory that we're on right now."

Code is already crossing that threshold. Other verifiable domains will follow in waves, each arriving when someone builds the right training environment, not when models get smarter. Chollet puts his best guess at AGI by his definition around 2030, roughly when ARC v6 or v7 would be released.

The gap between the two timelines is the policy and business challenge no one has quite named: society will face AGI-scale economic disruption before we've built systems with human-level adaptability. The disruption won't announce itself as a single moment — it will look like one industry after another going quiet.

AGI is an ideas problem, not a compute problem — and it might fit on a floppy disk

"I do believe that when you create AGI, retrospectively, it will turn out that it's a codebase that's less than 10,000 lines of code. And that if you had known about it, back in the 1980s, you could have done AGI back then using the computer resources available back then."

This is the most radical claim in the conversation, and Chollet says it without hedging. The AGI bottleneck isn't GPUs or training data — it's a missing conceptual framework. Science, he argues, is fundamentally symbolic compression: take a mess of observations, find the shortest equation that explains them. That's what NDEA is trying to automate.

The knowledge base that AGI draws on will be large. But the intelligence engine itself — the fluid reasoning core — will be tiny. The distinction matters because it means a small, well-funded team with the right idea can still win the race. The labs with the most compute have a structural advantage in the current paradigm, but if Chollet is right about the paradigm being wrong, that advantage disappears.
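A back-of-the-envelope check on the floppy disk image, assuming an average of 80 bytes per source line (an assumption, not a figure from the conversation) and a 3.5-inch high-density floppy of 1,474,560 bytes:

```python
def fits_on_floppy(lines, avg_bytes_per_line=80, floppy_bytes=1_474_560):
    """Return (estimated source size in bytes, whether it fits on the disk).

    avg_bytes_per_line is an assumed average; floppy_bytes is the raw capacity
    of a "1.44 MB" 3.5-inch high-density floppy (1440 * 1024 bytes).
    """
    source_bytes = lines * avg_bytes_per_line
    return source_bytes, source_bytes <= floppy_bytes

print(fits_on_floppy(10_000))  # (800000, True)
```

This covers only the source text of the hypothesized reasoning core. The large knowledge base mentioned above is a separate matter and is not included in the estimate.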

Expertise compounds with AI — the more you know, the more leverage you get

The race to AGI will be decided by whoever finds the right conceptual frame, not whoever builds the biggest cluster. And in the meantime, the people best positioned to benefit from AI progress are those with deep domain expertise — not those waiting for AI to replace the need for it. Chollet's closing point to anyone early in their career: the question isn't whether AI progress stops. It won't. The only question is whether you ride the wave or get washed over by it. Domain knowledge is the surfboard.


Topics: AGI, program synthesis, ARC-AGI benchmark, deep learning, symbolic AI, machine learning, coding agents, verifiable rewards, NDEA, frontier AI research, intelligence measurement, open source

Frequently Asked Questions

What is François Chollet's new path to AGI about?

Chollet argues that AGI, once built, will turn out to be a codebase of under 10,000 lines that could have run on 1980s hardware. This challenges the prevailing assumption that AGI requires massive computational scale. He points to reasoning, not raw computing power, as the real breakthrough so far. Chollet is the creator of ARC-AGI, a benchmark designed to measure general intelligence rather than memorized skill. His conceptual-first view treats the path to artificial general intelligence as primarily a matter of finding the right idea rather than an engineering problem of increasing scale.

Why does scaling large language models not lead to AGI according to Chollet?

According to Chollet, scaling large language models 50,000 times barely moved ARC-AGI scores, which he reads as evidence that scaling up pre-training alone would not crack the benchmark. On this view, raw processing power and model size are insufficient for genuine reasoning capabilities, and reasoning models were the actual leap. The implication is that future progress depends on fundamentally new algorithms and approaches rather than simply bigger models and more compute. The argument points research away from pure scaling and toward architectural and conceptual innovation.

When will AGI arrive according to François Chollet?

Chollet's best guess is that artificial general intelligence, by his definition, arrives around 2030, roughly when ARC-AGI version 6 or 7 would be released. The timeline reflects his belief that the bottleneck is a missing conceptual framework, not computational resources. He expects economic automation to arrive domain by domain well before that. The estimate depends on continued progress in understanding and implementing general reasoning capabilities.

How does expertise interact with AI tools according to this work?

Chollet argues that expertise compounds with AI tools, meaning those with deep domain knowledge benefit most from AI capabilities. The more specialized knowledge someone has in a field, the more leverage AI gives them. On this view, AI amplifies existing expertise rather than replacing it. Experienced professionals can use AI tools far more effectively than novices, so accumulated knowledge becomes even more valuable in an AI-augmented world. This has significant implications for career development and skill acquisition.
