
The Fastest Path to Super Intelligence

Y Combinator Startup Podcast

Hosted by Y Combinator

20 min episode
7 min read
5 key ideas

Seven people just outperformed Anthropic and Google on elite AI benchmarks — spending under $100K, without training a single model.

In Brief

Poetic, a 7-person startup, beat Anthropic's Claude on Humanity's Last Exam (55% vs 53.1%) for under $100K by building reasoning harnesses on top of existing models instead of training new ones. Their approach compounds as both base models and the meta-system improve, suggesting the fastest path to superintelligence runs through harness-level optimization.

Key Ideas

1.

Seven people beat Anthropic for under $100K

7 people beat Anthropic on a benchmark for under $100K.

2.

Reasoning strategies drove 5% to 95% accuracy

Reasoning strategies, not prompts, drove a task from 5% to 95% accuracy.

3.

Fine-tuning is a trap that erases your investment

Fine-tuning is a trap: every new frontier model erases your investment.

4.

Harness improves automatically with better models

Poetic's harness improves automatically when better base models are released.

5.

Harness-level self-improvement may be the fastest path

The fastest path to superintelligence may be harness-level recursive self-improvement, not bigger training runs.

Summary

Why does it matter? Because a 7-person startup just beat Anthropic on the hardest AI benchmark — for under $100K.

Poetic doesn't train models. It builds harnesses that sit on top of them — and those harnesses are already outperforming the world's best-funded labs on their own benchmarks. This episode is a wake-up call for any founder who's sunk money into fine-tuning or prompt optimization and thought that was enough.

  • Reasoning strategy optimization — not better prompts — moved one task from 5% to 95% accuracy
  • A 7-person team beat Anthropic's Claude Opus 4.6 on Humanity's Last Exam (55% vs 53.1%) for less than $100K
  • Fine-tuning is a trap: every new frontier model release erases your investment
  • Poetic's harness automatically gets stronger when better base models drop — no re-optimization required

Fine-tuning is a slow-motion wealth destruction machine — and most founders are still doing it

Every time a new frontier model drops, fine-tuned systems built on the previous generation get quietly wiped out. Ian Fischer puts it plainly: you fine-tuned on GPT-3.5, then GPT-4 came out and blew you out of the water. "Are you going to do that again or are you going to go out of business? In some cases, the latter."

Training LLMs from scratch costs hundreds of millions of dollars and takes months. Fine-tuning is cheaper but carries the same structural risk — your investment is denominated in the gap between your tuned model and the current frontier, and that gap resets to zero with every major release.

Poetic's answer is a harness: code, prompts, and reasoning strategies layered on top of one or more language models. When a new model drops, the same harness plugs straight in — and the performance bump gets bigger, not smaller. "With Poetic, what we end up giving you is a harness that sits on top of one or more language models. And it just performs better than them. And when the new model comes out, that same harness is perfectly compatible with it."
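To make the layering concrete, here is a minimal sketch of what "code, prompts, and reasoning strategies on top of a swappable model" could look like. Poetic's actual harness is proprietary and not described in the episode, so every name below (`ReasoningStrategy`, `Harness`, the `Model` callable) is a hypothetical stand-in, not their API:

```python
# Illustrative sketch only -- all names here are invented, not Poetic's real system.
from dataclasses import dataclass
from typing import Callable, List

# A "base model" is just a function from prompt text to completion text,
# so swapping in a newer frontier model means swapping this one callable.
Model = Callable[[str], str]

@dataclass
class ReasoningStrategy:
    """A code-level recipe for how the model should approach a task."""
    name: str
    build_prompt: Callable[[str], str]  # wraps the task in a strategy-specific scaffold
    accept: Callable[[str], bool]       # checks whether the answer looks valid

@dataclass
class Harness:
    """Code + prompts + reasoning strategies layered over a base model."""
    strategies: List[ReasoningStrategy]

    def solve(self, task: str, model: Model) -> str:
        # The harness, not the model, decides how the problem is attacked
        # and when an answer is accepted.
        for strategy in self.strategies:
            answer = model(strategy.build_prompt(task))
            if strategy.accept(answer):
                return answer
        return model(task)  # fall back to a direct call
```

The key design point is that `model` is a parameter: the same harness plugs into a new frontier model unchanged, which is the compatibility property the episode emphasizes.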

This isn't just a cost argument. It's a structural one. The bitter lesson in machine learning — that hand-crafted features always lose to scale — keeps repeating at every layer. Fine-tuning is the current version of that mistake.

Prompt optimization is table stakes. Reasoning strategies are where the 19x performance jump actually lives.

5% to 95%. That's the gap between what hard prompt engineering alone achieved and what happened when Poetic layered reasoning strategies on top — on the same model, for the same task, using Gemini 1.5 Flash.

"We manually optimized the prompts really hard for these very hard problems. And that got us a little bit of the way. In this particular case, the hardest task we were working on, we got to 5% performance. When we added on the reasoning strategies, we went from 5% to 95%."

Most teams chasing AI performance are running GEPA-style automated prompt optimization and calling it done. Fischer doesn't dismiss that — "that will get you some performance improvements" — but frames it as leaving most of the available gain untouched. The real lever is reasoning strategy: not what you tell the model, but how you architect its approach to the problem. That lives in code, not in a system prompt.

The Poetic meta-system figures this out automatically. It looks at failure modes in the data, generates robust reasoning strategies, and writes them into the harness. Founders don't curate the data. They don't hand-tune the logic. The system does it — and sometimes what it produces is visibly strange. One example in the ARC-AGI run was factually wrong. They left it in. It still worked.
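The episode doesn't reveal how this loop is implemented, but the described behavior (score on failure data, generate candidates, keep what wins) can be sketched in a few lines. Everything below, including `improve_harness` and the scoring logic, is an invented illustration of that loop, not Poetic's method:

```python
# Hypothetical sketch of a harness-level self-improvement cycle.
from typing import Callable, List, Tuple

Example = Tuple[str, str]        # (task, expected answer)
Strategy = Callable[[str], str]  # maps a task to an answer

def accuracy(strategy: Strategy, dataset: List[Example]) -> float:
    """Fraction of examples the strategy answers correctly."""
    return sum(strategy(task) == expected for task, expected in dataset) / len(dataset)

def improve_harness(
    current: Strategy,
    candidates: List[Strategy],
    dataset: List[Example],
) -> Strategy:
    """One optimization cycle: keep the current strategy unless a
    generated candidate scores strictly better on the failure data."""
    best, best_score = current, accuracy(current, dataset)
    for candidate in candidates:
        score = accuracy(candidate, dataset)
        if score > best_score:
            best, best_score = candidate, score
    return best
```

In the system described on the podcast, the candidate strategies would be generated by an LLM from observed failure modes rather than enumerated by hand; that generation step is the part this sketch leaves out.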

Seven people. $100K. Beat Anthropic on Humanity's Last Exam.

Humanity's Last Exam is 2,500 expert-written questions designed to stump frontier AI. Last week, Anthropic's Claude Opus 4.6 set the state-of-the-art at 53.1%. Poetic hit 55% — nearly two percentage points higher — with a team of seven researchers and an optimization budget under $100K.

The cost contrast is almost absurd. Foundation model training runs cost hundreds of millions. Poetic's ARC-AGI v2 result was built on Gemini 3 Pro — a cheaper model — and still beat Gemini 3 DeepThink by 9 percentage points: 54% versus 45%, at $32 per problem versus just over $70.

"The optimization costs us less than 100K. Which is impressive because each of these big foundation model train runs are in the hundreds of millions of dollars."

This isn't a fluke. It's a repeatable architecture. The labs validate the approach by pursuing their own recursive self-improvement — but they do it at the most expensive possible layer, retraining full models for every improvement cycle. Poetic does it at the harness level, in days, for a fraction of the cost. Fischer's framing: "We don't view the frontier models as competitors. They're the ones we're using as stilts."

The moat compounds: every S-curve Poetic rides sits higher than the last

Here's the structural claim that separates Poetic from a clever prompt wrapper: the advantage should widen over time, not narrow.

Fischer describes two compounding improvement loops. The underlying models keep getting better — each new frontier release lifts the floor. Simultaneously, the Poetic meta-system itself keeps improving through recursive self-optimization. "Each model or each set of models that we're working with will have their own S-curve. The Poetic meta-system itself is also going to have its own S-curve. As both get better, you'll find that the S-curve keeps shifting higher and higher until ultimately either you saturate or reach AGI."

For any given customer task, this means the harness doesn't just hold its performance lead — it's designed to extend it. That's the opposite dynamic from fine-tuning, where the gap between your tuned system and raw frontier models shrinks with every release until it inverts.
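The stacked-S-curve argument can be illustrated with a toy model. The numbers below are invented purely to show the shape of the claim — a harnessed system whose ceiling sits above the base model's alone, with room for each new release to shift the combined curve higher:

```python
# Toy illustration of the two stacked S-curves described in the episode.
# All parameters are made up; only the qualitative shape matters.
import math

def s_curve(t: float, ceiling: float, midpoint: float, rate: float = 1.0) -> float:
    """Logistic curve: performance approaches `ceiling` as t grows."""
    return ceiling / (1.0 + math.exp(-rate * (t - midpoint)))

def harnessed_performance(t: float) -> float:
    base = s_curve(t, ceiling=0.6, midpoint=3.0)   # base-model S-curve
    meta = s_curve(t, ceiling=0.35, midpoint=5.0)  # meta-system S-curve
    return min(base + meta, 1.0)                   # stacked, capped at 100%
```

Under these toy numbers the combined ceiling (0.95) exceeds either curve alone, and a better base model raises the `base` term without any change to the harness — the "automatic improvement" dynamic described above.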

The AI is writing prompts no human would write — and that's exactly why they work

When the Poetic meta-system generated prompts for ARC-AGI, the outputs were readable but clearly non-human. Simple examples. Unexpected structures. One example that was factually wrong — and they left it in.

"You can read those and say, that's not what a human would have written pretty clearly."

This is the inversion of the old machine learning rule: know your dataset inside out, curate carefully, apply expert judgment. Poetic outsources that entirely. The meta-system decides what goes into context, what examples to generate, what reasoning strategies to encode. "Historically, in machine learning, you always had to know your dataset really well. But now we're kind of outsourcing that to the AI itself."

The practical lesson for teams building agents: resist cleaning up AI-generated systems to make them more legible. The weirdness isn't noise. On the evidence so far, it's load-bearing.

The race to superintelligence is already happening — just not where most people are looking

The labs are pursuing recursive self-improvement too. But they're doing it at the slowest, most expensive layer imaginable — full retraining for every iteration. Poetic runs the same loop at the harness level, in days, for under $100K. That speed advantage compounds. As the underlying models keep improving and the meta-system keeps optimizing, the ceiling keeps rising. The fastest path to superintelligence may not be a bigger training run — it may already be running on top of one.

Frequently Asked Questions

How did a 7-person team beat Anthropic on Humanity's Last Exam?
Poetic built reasoning strategy harnesses on top of existing language models rather than training their own. These harnesses architect how models approach problems — not just what they're told. The optimization budget was under $100K versus hundreds of millions for foundation model training.
Why is fine-tuning a bad long-term strategy?
Every time a new frontier model releases, fine-tuned systems built on the previous generation get wiped out. Your investment is denominated in the gap between your tuned model and the current frontier, and that gap resets to zero with each major release. Harnesses, by contrast, plug into new models and perform even better.
What is the difference between prompt optimization and reasoning strategies?
Prompt optimization — tuning what you tell the model — provides incremental gains. Reasoning strategies architect how the model approaches problems through code, not system prompts. On one task, prompts alone reached 5% accuracy while adding reasoning strategies jumped performance to 95%.
Why might AI-generated prompts outperform human-written ones?
Poetic's meta-system generates prompts that are readable but clearly non-human — including one that was factually wrong but still improved performance. The system outsources data curation to AI itself, inverting the old machine learning rule of carefully hand-curating datasets.
