Sakana Fugu Ultra · 42 builds · benchmarked

Does Sakana Fugu Ultra reach Fable 5 level? I ran 42 builds to find out.

Same 42 prompts. Three models — Fugu Ultra, Fusion, and Opus 4.8 — side by side on Goldie Bench. Real scores, real screenshots, and an honest answer.

Latest test — June 2026 · scored live on goldiebench.com

Sakana Fugu Ultra benchmarked side by side against Fusion and Opus 4.8 across 42 builds

42 builds per model

7.94 Fugu Ultra avg /10

3 models head-to-head

100% smoke-tested

Straight from the source

Read it — and run it — yourself.

Every number in this guide is real. The build scores come from my own bench. The Fable 5 comparison numbers are Sakana's own published benchmarks plus the press write-ups. Here's where to read all of it:

Official sources + the live bench ↓

the live bench: Goldie Bench · Fugu Ultra →
the announcement: Sakana · Fugu launch →
the technical report: Fugu paper (arXiv) →
press · the verdict: VentureBeat write-up →
the full leaderboard: goldiebench.com →
try it · API: Sakana console →

"No Claude Fable 5? No problem: Sakana achieves frontier performance with a new Fugu multi-model auto-synthesis system." — VentureBeat headline, June 2026

The short answer first

Where Fugu Ultra actually landed.

I'll give you the headline before the proof. On my 42-build creative bench, Fugu Ultra slots right between the two reference models — it clearly beats solo Opus 4.8, and it sits a notch under Fusion, which runs a Fable-5-class model inside its panel.

Goldie Bench — average score across 42 builds

0–10 scale · my Claude-judge scoring · every build smoke-tested in headless Chrome · axis floor 0

Fusion: 8.60
Fugu Ultra: 7.94
Opus 4.8: 7.49

How every model gets scored — the same way

one fixed prompt set · three models · one scoring rule · every build smoke-tested

the one-line verdict

Frontier-tier — but not full Fable 5 parity on creative builds.

Fugu Ultra is the real deal. It beat Opus 4.8 — a frontier model — on 25 of 42 builds head-to-head. On Sakana's own reasoning benchmarks it's neck-and-neck with Fable 5 (it even wins a couple). But on my creative one-shot bench it couldn't catch Fusion, and roughly a third of its builds need a human to confirm they play. Close to the door of Fable 5 level. Not all the way through it.

My story · why I built this test

I was tired of guessing which model is best.

Before

Every week a new "frontier" model drops.

The launch post is always a wall of benchmark bars where the new model wins everything.

I'd read the marketing, watch a cherry-picked demo, and still have no idea if it was actually good for the work I do.

So I'd waste a week wiring it into my stack — only to find it fell apart on the real builds.

So I stopped trusting launch charts and built my own bench.

After

Now every model runs the same 42 prompts — games, sims, 3D worlds, landing pages.

I score each build the same way and smoke-test every single one in a real browser.

When Sakana Fugu dropped, I didn't guess. I ran it through all 42 and watched it land.

I can tell you exactly where it's strong, exactly where it breaks, and exactly how close it gets to Fable 5 — with the receipts.

You can do the same. Stop reading launch charts. Watch the thing build.

The receipts

Why I'm the one running this test.

I run an AI-first SEO agency and teach thousands of operators how to wire AI into real businesses. The bench isn't a hobby — it's how I decide what goes into the stack my members and I actually use.

3,600+ founders inside AIPB
400k YouTube subscribers
38 countries · live members
163k X / Twitter followers
8 models on the bench

I'm not going to paste invented quotes here. The wins are real and written by the members themselves — agency owners, ecom founders, course creators, solo operators across 38 countries. Read them in their own words.

Read the 158-page wins doc →

Before you scroll on —

Commit to testing, not trusting.

You've seen the headline number. Now you're about to see exactly how I got it.

Here's the deal I want you to make with yourself.

The next time a model drops and the launch post says it beats everything — you don't take it at face value. You run it on one real task you actually care about, and you watch it build, before you wire it into anything.

That one habit will save you more wasted weeks than any benchmark chart ever will.

The people who test get the truth. The people who trust the marketing get burned and don't know why.

Be one of the people who tests.

Commit to running it yourself today. That's the whole game.

The benchmark · build by build

42 builds. Three models. Who won what.

Averages hide the story. The real picture is in the head-to-heads — the same prompt, scored on the same scale, model versus model. Here's how Fugu Ultra did against each one.

Fugu Ultra vs Opus 4.8 · 42 builds

Fugu Ultra: 25 wins · Opus 4.8: 11 wins · the rest tied

Fugu Ultra clearly beats solo Opus 4.8, by more than 2-to-1 on wins (25 to 11). This is a genuinely frontier-class model.

Fugu Ultra vs Fusion · 42 builds

Fusion: 26 wins · ties: 12

Fusion takes it, but 12 ties means Fugu matched it dead-on in more than a quarter of the builds. The gap is real, not a blowout.

Thinking it?

"A model that loses to Fusion can't be that good."

Fusion is the strongest thing on my whole bench: a panel that ensembles multiple frontier models, including a Fable-5-class one. Losing to that while beating solo Opus 4.8 puts Fugu Ultra exactly where a frontier-class system should sit. It's a compliment, not a knock.

Where Fugu Ultra won outright — beating both Opus and Fusion, or tying for the top score: the aurora over a mountain ridge (8.0), the boids flocking sim (8.5), and the synthwave terrain flythrough (8.5). It tied for #1 on the solar system, the voxel runner, the playable game, fireworks, landing page, neon city, and more — all at 9.0.

See it yourself · same prompt, three models

The same build — Fugu vs Fusion vs Opus. Click any one.

This is the part launch charts can't give you. Every cell below is a real, live, playable build from the exact same prompt. Click to open the one you want and judge it with your own eyes.

The solar system

"drag to orbit, scroll to zoom, real orbital motion"

Fugu Ultra · 55KB · 9.0 (densest on the bench)
Fusion · 9.0
Opus 4.8 · 8.5

The voxel runner

"Temple-Run style, jump + slide, increasing speed"

Fugu Ultra · 9.0 (most reactive build)
Fusion · 9.0
Opus 4.8 · 8.5

A juicy playable game

"make it feel SHIPPABLE — score, lives, sound, polish"

Fugu Ultra · 9.0 (24% reactivity)
Fusion · 9.0
Opus 4.8 · 7.0

Interactive fireworks

"click to launch, particle trails, night sky"

Fugu Ultra · 9.0 (26% reactivity)
Fusion · 9.0
Opus 4.8 · 7.0

Nordic dungeon crawler

"torch-lit ruin, chasing enemies, boss room, bloom"

Fugu Ultra · 61KB · 9.0 (23% reactivity)
Fusion · 9.5 (top score)
Opus 4.8 · 6.0

Open-world RPG

"village, NPCs, weather, day/night, inventory"

Fugu Ultra · 62KB · 9.0 (densest Ultra build)
Fusion · 9.5 (top score)
Opus 4.8 · 7.5

See all 42 Fugu Ultra builds on Goldie Bench →
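
To make the head-to-head counting concrete, here is a minimal Python sketch that tallies wins, ties, and losses from the six build cards above. The scores are the ones shown on the cards. The data layout and the function name are illustrative assumptions, not the actual Goldie Bench code.

```python
# Scores copied from the six build cards above: (Fugu Ultra, Fusion, Opus 4.8).
# The dictionary layout and function name are illustrative assumptions,
# not the actual Goldie Bench code.
SCORES = {
    "solar system":    (9.0, 9.0, 8.5),
    "voxel runner":    (9.0, 9.0, 8.5),
    "playable game":   (9.0, 9.0, 7.0),
    "fireworks":       (9.0, 9.0, 7.0),
    "dungeon crawler": (9.0, 9.5, 6.0),
    "open-world RPG":  (9.0, 9.5, 7.5),
}


def head_to_head(pairs):
    """Count wins, ties, and losses for the first model in each (a, b) pair."""
    wins = sum(1 for a, b in pairs if a > b)
    ties = sum(1 for a, b in pairs if a == b)
    losses = sum(1 for a, b in pairs if a < b)
    return wins, ties, losses


vs_fusion = head_to_head([(fugu, fusion) for fugu, fusion, _ in SCORES.values()])
vs_opus = head_to_head([(fugu, opus) for fugu, _, opus in SCORES.values()])

print("Fugu Ultra vs Fusion:", vs_fusion)   # (0, 4, 2) on these six builds
print("Fugu Ultra vs Opus 4.8:", vs_opus)   # (6, 0, 0) on these six builds
```

These six are showcase builds, so the split here is not representative of the full 42.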

The big question

So — does it reach Fable 5 level?

This is what everyone actually wants to know. Sakana built Fugu to go toe-to-toe with the frontier — and they put Claude Fable 5 right at the top of their own comparison charts. So let's look at the real reasoning benchmarks, not the vibes.

Fugu Ultra vs Fable 5 — published benchmarks

Sakana's own numbers + Anthropic's Fable 5 table · higher = better · axis floor 0

LiveCodeBench: Fugu Ultra 93.2 · Fable 5 89.8
GPQA-Diamond: Fugu Ultra 95.5
SWE-Bench Pro: Fugu Ultra 73.7 · Fable 5 80.3
Terminal-Bench: Fugu Ultra 82.1 · Fable 5 88.0

Read that honestly and it's a genuinely close fight:

▲ Fugu Ultra WINS on LiveCodeBench — 93.2 vs Fable 5's 89.8. On live, refreshed coding problems, it's actually ahead.
▲ It tops GPQA-Diamond at 95.5 — above Sakana's own Mythos Preview number, right at the saturated frontier.
▼ It trails on SWE-Bench Pro — 73.7 vs Fable 5's 80.3. On the hardest agentic-coding test, Fable 5 still leads (though Fugu beats Opus 4.8's 69.2 and GPT-5.5's 58.6).
▼ It trails on Terminal-Bench — 82.1 vs 88.0.
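
The margins behind those four bullets are simple subtraction. A short Python sketch, using only the published pairs quoted above, shows the point gaps. GPQA-Diamond is left out because no Fable 5 figure is listed for it here, and the function name is an illustrative assumption.

```python
# Published scores quoted above: (Fugu Ultra, Fable 5). Higher is better.
# GPQA-Diamond is omitted because no Fable 5 number is listed for it here.
PUBLISHED = {
    "LiveCodeBench": (93.2, 89.8),
    "SWE-Bench Pro": (73.7, 80.3),
    "Terminal-Bench": (82.1, 88.0),
}


def margins(scores):
    """Return Fugu Ultra's lead over Fable 5 in points, rounded to 0.1."""
    return {name: round(fugu - fable, 1) for name, (fugu, fable) in scores.items()}


for name, delta in margins(PUBLISHED).items():
    verdict = "ahead" if delta > 0 else "behind"
    print(f"{name}: {delta:+.1f} ({verdict})")
# LiveCodeBench: +3.4 (ahead)
# SWE-Bench Pro: -6.6 (behind)
# Terminal-Bench: -5.9 (behind)
```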

"Sakana said Fugu Ultra scored 73.7 on SWE-Bench Pro, ahead of Claude Opus 4.8 at 69.2 and GPT-5.5 at 58.6, while trailing the restricted Fable 5 score of 80.0." — reporting on Sakana's Fugu benchmarks, June 2026

The catch you need to know

"If it ties Fable 5, why isn't it just as good?"

Because Fugu isn't one model — it's an orchestration system that routes each request across a pool of models and synthesises the answer. And here's the kicker: Fable 5 itself isn't in that pool, because it's not publicly accessible. So Fugu reaches near-Fable-5 numbers without Fable 5 in the engine. That's the whole "no Fable 5? no problem" story — and it's genuinely impressive. But it also means you're betting on a black box: Sakana doesn't disclose which models answer your request.

the Fable 5 verdict

It reaches the door of Fable 5 — and on a couple of tests, walks through it.

On published reasoning benchmarks, Fugu Ultra is legitimately in Fable 5's weight class: it wins on LiveCodeBench and GPQA, trails on SWE-Bench Pro and Terminal-Bench. On my creative one-shot build bench, it beat frontier-solo Opus 4.8 comfortably but couldn't catch Fusion. So the honest answer isn't a clean yes or no — it's "frontier-tier, neck-and-neck on reasoning, a step behind on the hardest agentic coding and on creative polish." For most real work, that's close enough to matter. For the absolute top of the coding ladder, Fable 5 still edges it.

The honest part

What didn't work as well.

I'm not going to pretend all 42 builds were flawless. They weren't. Here's exactly where Fugu Ultra struggled — because the rough edges are as useful to know as the wins.

About a third needed a human to confirm they play.

I smoke-test every build in a real headless browser — click the screen, press the keys, check the pixels actually change. Roughly 14 of the 42 Fugu Ultra builds came back as "maybe." They're structurally complete — closed tags, a real game loop, input handlers, a fully rendered scene — but my automated test couldn't fully drive them. These were pointer-lock FPS games (doom, skyrim, crypt, dogfight) and click-drag games (pool, neonblaster, fluid) where a generic click-plus-WASD can't trigger the specific controls. I scored those 6.5–7.0 and flagged them for manual checking instead of claiming they work. Not broken — just unverified by a robot.
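
The pass / maybe split described above can be sketched in a few lines of Python. This is a simplified illustration, not the bench's real harness: frames are modelled as flat lists of pixel values, "reactivity" is assumed to mean the fraction of pixels that change after simulated input, and the threshold is a made-up number. Capturing the frames from a headless browser is outside the sketch.

```python
def reactivity(before, after):
    """Fraction of pixels that differ between two equally sized frames."""
    if len(before) != len(after) or not before:
        raise ValueError("frames must be non-empty and the same size")
    changed = sum(1 for b, a in zip(before, after) if b != a)
    return changed / len(before)


def classify(before, after, structurally_complete, min_reactivity=0.01):
    """Return 'pass', 'maybe', or 'fail' for one smoke-tested build.

    min_reactivity is a hypothetical threshold chosen for illustration.
    """
    if reactivity(before, after) >= min_reactivity:
        return "pass"
    # Input did nothing visible: complete builds get flagged for a human,
    # incomplete ones are treated as broken.
    return "maybe" if structurally_complete else "fail"


# Tiny 2x2 frames with one pixel changing after the simulated input.
idle = [0, 0, 0, 0]
after_click = [0, 255, 0, 0]

print(reactivity(idle, after_click))                       # 0.25
print(classify(idle, after_click, True))                   # pass
print(classify(idle, idle, structurally_complete=True))    # maybe
print(classify(idle, idle, structurally_complete=False))   # fail
```

The design point is the middle bucket: a build that is structurally complete but does not react to generic input gets flagged for manual checking, not counted as working or as broken.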

It's slow, and the two heaviest prompts timed out — four times each.

Fugu Ultra is a panel ensemble, so it's not fast — roughly 15 minutes per build. The two most complex prompts (a full Doom-style FPS and a pseudo-3D OutRun racer) blew past my 40-minute cap on four straight attempts before finally landing on a direct retry with a longer ceiling — 16 and 19 minutes respectively. Frontier quality, but you wait for it.

Thinking it?

"If it times out, the API must be broken."

It wasn't broken — it was busy. The first run used the wrong endpoint and a 16K token cap, and half my densest builds got cut off mid-file with no closing tags. The fix was switching to Sakana's documented Responses API with a 48K output budget. After that, every build came back complete. If you're calling Fugu yourself: use the Responses API and give it room. That one change fixed the truncation entirely.
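
If you want to catch this kind of truncation automatically, a rough Python sketch looks like the following. The completeness check is a heuristic (closing html tag present, script tags balanced), and `generate` is a hypothetical callable standing in for whatever API client you use. It is not a real Sakana function, and the budget values simply mirror the 16K and 48K figures above.

```python
import re


def looks_truncated(html):
    """Heuristic: a complete single-file build closes its <html> and <script> tags."""
    lowered = html.lower()
    if "</html>" not in lowered:
        return True
    opens = len(re.findall(r"<script\b", lowered))
    closes = len(re.findall(r"</script\s*>", lowered))
    return opens != closes


def generate_complete(generate, prompt, budgets=(16_000, 48_000)):
    """Call generate(prompt, max_output_tokens) with growing budgets.

    `generate` is a hypothetical callable wrapping your own API client.
    Returns (html, budget) for the first complete build, or raises if
    every budget still truncates.
    """
    for budget in budgets:
        html = generate(prompt, budget)
        if not looks_truncated(html):
            return html, budget
    raise RuntimeError("build still truncated at the largest budget")


# Stand-in generator for the demo: cuts the file short when the budget is small.
def fake_generate(prompt, max_output_tokens):
    full = "<html><body><script>loop()</script></body></html>"
    return full if max_output_tokens >= 48_000 else full[:30]


html, used = generate_complete(fake_generate, "solar system")
print(used)  # 48000
```

A tag-count check like this will misfire on builds that embed script tags inside strings, so treat it as a first filter before the browser smoke test, not a replacement for it.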

The cheaper Mini variant couldn't finish the hardest prompts at all.

Sakana also ships Fugu Mini — a faster, single-model version. I ran it through the same 42. It only completed 37. On the five heaviest prompts (the dungeon crawler, billiards, a Minecraft-style sandbox, the open-world RPG, a bullet-hell shooter) it returned empty every time — it burns its whole token budget on reasoning before it writes a line. Ultra's panel finished all 42. That gap is the clearest argument for paying for Ultra over Mini: the panel completes the work the single model can't.

How to actually pick a model

The old way vs the new way.

Here's the real takeaway, bigger than Fugu. The way most people choose an AI model is broken. Here's the contrast.

The old way

~1 wasted week

Read the launch post where the new model wins every chart
Watch one cherry-picked demo video
Trust the vendor's hand-picked benchmark
Wire it into your stack on faith
Find out a week later it breaks on your real work
Result: lost time, no idea why it failed

The new way

~1 afternoon

Run the model on the same fixed prompts every model gets
Score every build the same way, on the same scale
Smoke-test each one in a real browser — does it actually run?
Compare it head-to-head against models you already trust
Know exactly where it's strong and where it breaks
Result: pick with proof, not marketing

That's all Goldie Bench is. A fixed set of prompts, one scoring rule, every build smoke-tested, every model side by side. It's why I can tell you Fugu Ultra is 7.94 and not "amazing" or "mid" — because I watched all 42 run.

Get the whole stack

Want the bench, the Agent OS, and the playbook?

Agent OS — Claude, OpenClaw and Hermes connected

Everything you just saw — the bench that scores models, the Agent OS that runs them, the workflow that builds and deploys — lives inside the AI Profit Boardroom. It connects Claude, OpenClaw, and Hermes into one dashboard with shared memory, so your agents know your business and your goals.

The full Agent OS zip file — every prompt, the memory setup, the dashboard
Four live coaching calls every week with operators running this in production
Daily tutorials as new models like Fugu drop — how to test and wire them in
A 30-day roadmap to set the whole system up step by step
A community of 3,600+ founders across 38 countries, active 24/7
A member map to connect with builders near you

Get the Agent OS →

Inside the AI Profit Boardroom · skool.com/ai-profit-lab


Three beliefs to drop

What's holding you back from testing.

Wrong: "The newest model is always the best one to switch to."

Right: Fugu Ultra is brand new and frontier-tier — and it still lost to Fusion and trailed Fable 5 on the hardest test. New doesn't mean best for your work. The only way to know is to run it.

Wrong: "Benchmark charts in launch posts tell me what I need."

Right: Launch charts are picked by the vendor to win. Half of Fugu's densest builds truncated until I fixed the API call — you'd never see that in a chart. Watching the thing build tells you what a bar never will.

Wrong: "I'll wait until the dust settles and the 'best' model is obvious."

Right: It never settles — there's a new frontier model every few weeks. The operators who win are the ones with a system to test each one fast and move on. The skill is the bench, not the model.

Don't take my word for it

158 pages of members who stopped chasing tools and built systems instead — real businesses, real wins, written in their own words.

Read the 158-page wins doc →

The recap

What you now know about Fugu Ultra.

i. You know where it lands
7.94 on my bench: above solo Opus 4.8 (7.49), below Fusion (8.60).

ii. You know it beats Opus
25 wins to 11 against frontier-solo Opus 4.8. Genuinely frontier-class.

iii. You know the Fable 5 answer
Wins LiveCodeBench + GPQA, trails on SWE-Bench Pro. Near-parity, not full parity.

iv. You know where it breaks
~14 builds need manual checking; it's slow; Mini can't finish the hardest 5 prompts.

v. You know the API fix
Use the Responses API with a 48K output budget or your dense builds truncate.

vi. You can check it yourself
All 42 builds are live and clickable on Goldie Bench. Don't trust me. Play them.

One system for every model

Make every new model plug into your stack.

Fugu today, something else next month — the models keep changing. What doesn't change is the system around them. The Agent OS inside the AI Profit Boardroom turns Claude, OpenClaw, and Hermes into one dashboard with shared memory and goals, so every new model you test just slots in and gets more powerful automatically.

The full Agent OS zip — every prompt, the Obsidian memory setup, the dashboard
Weekly coaching calls where we wire up new tools together, step by step
Daily tutorials + a 30-day roadmap to build the whole thing
3,600+ founders across 38 countries, someone online 24/7
The 158-page wins doc — read what members actually built

Get the Agent OS →

Inside the AI Profit Boardroom · skool.com/ai-profit-lab

I'll see you in the next one ↓
