Sakana Fugu Ultra · 42 builds · benchmarked

Does Sakana Fugu Ultra reach Fable 5 level? I ran 42 builds to find out.

Same 42 prompts. Three models — Fugu Ultra, Fusion, and Opus 4.8 — side by side on Goldie Bench. Real scores, real screenshots, and an honest answer.

Latest test — June 2026 · scored live on goldiebench.com
Sakana Fugu Ultra benchmarked side by side against Fusion and Opus 4.8 across 42 builds
42 builds per model
7.94 Fugu Ultra avg /10
3 models head-to-head
100% smoke-tested
Straight from the source

Read it — and run it — yourself.

Every number in this guide is real. The build scores come from my own bench. The Fable 5 comparison numbers are Sakana's own published benchmarks plus the press write-ups. Here's where to read all of it:

"No Claude Fable 5? No problem: Sakana achieves frontier performance with a new Fugu multi-model auto-synthesis system." — VentureBeat headline, June 2026
The short answer first

Where Fugu Ultra actually landed.

I'll give you the headline before the proof. On my 42-build creative bench, Fugu Ultra slots right between the two reference models — it clearly beats solo Opus 4.8, and it sits a notch under Fusion, which runs a Fable-5-class model inside its panel.

Goldie Bench — average score across 42 builds
0–10 scale · my Claude-judge scoring · every build smoke-tested in headless Chrome · axis floor 0
Fusion: 8.60
Fugu Ultra: 7.94
Opus 4.8: 7.49
How every model gets scored — the same way
one fixed prompt set · three models · one scoring rule · every build smoke-tested
the one-line verdict

Frontier-tier — but not full Fable 5 parity on creative builds.

Fugu Ultra is the real deal. It beat Opus 4.8 — a frontier model — on 25 of 42 builds head-to-head. On Sakana's own reasoning benchmarks it's neck-and-neck with Fable 5 (it even wins a couple). But on my creative one-shot bench it couldn't catch Fusion, and roughly a third of its builds need a human to confirm they play. Close to the door of Fable 5 level. Not all the way through it.

My story · why I built this test

I was tired of guessing which model is best.

Before

Every week a new "frontier" model drops.

The launch post is always a wall of benchmark bars where the new model wins everything.

I'd read the marketing, watch a cherry-picked demo, and still have no idea if it was actually good for the work I do.

So I'd waste a week wiring it into my stack — only to find it fell apart on the real builds.

So I stopped trusting launch charts and built my own bench.

After

Now every model runs the same 42 prompts — games, sims, 3D worlds, landing pages.

I score each build the same way and smoke-test every single one in a real browser.

When Sakana Fugu dropped, I didn't guess. I ran it through all 42 and watched it land.

I can tell you exactly where it's strong, exactly where it breaks, and exactly how close it gets to Fable 5 — with the receipts.

You can do the same. Stop reading launch charts. Watch the thing build.

The receipts

Why I'm the one running this test.

I run an AI-first SEO agency and teach thousands of operators how to wire AI into real businesses. The bench isn't a hobby — it's how I decide what goes into the stack my members and I actually use.

3,600+ founders inside AIPB
400k YouTube subscribers
38 countries · live members
163k X / Twitter followers
8 models on the bench

I'm not going to paste invented quotes here. The wins are real and written by the members themselves — agency owners, ecom founders, course creators, solo operators across 38 countries. Read them in their own words.

Read the 158-page wins doc →
Before you scroll on —

Commit to testing, not trusting.

You've seen the headline number. Now you're about to see exactly how I got it.

Here's the deal I want you to make with yourself.

The next time a model drops and the launch post says it beats everything — you don't take it at face value. You run it on one real task you actually care about, and you watch it build, before you wire it into anything.

That one habit will save you more wasted weeks than any benchmark chart ever will.

The people who test get the truth. The people who trust the marketing get burned and don't know why.

Be one of the people who tests.

Commit to running it yourself today. That's the whole game.

The benchmark · build by build

42 builds. Three models. Who won what.

Averages hide the story. The real picture is in the head-to-heads — the same prompt, scored on the same scale, model versus model. Here's how Fugu Ultra did against each one.

Fugu Ultra vs Opus 4.8 · 42 builds
Fugu wins: 25 · ties: 6 · Opus wins: 11

Fugu Ultra clearly beats solo Opus 4.8 — more than 2-to-1 on wins. This is a genuinely frontier-class model.

Fugu Ultra vs Fusion · 42 builds
Fugu wins: 4 · ties: 12 · Fusion wins: 26

Fusion takes it, but 12 ties means Fugu matched it dead-on in more than a quarter of the builds (12 of 42). The gap is real, not a blowout.
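If you want to sanity-check those tallies yourself, here is a minimal Python sketch that uses only the win, tie, and loss counts quoted above. The dictionary layout and the `summarize` helper are illustrative names, not Goldie Bench's actual code.

```python
# Head-to-head tallies as reported in this section, seen from Fugu Ultra's side.
# The data structure and function name are illustrative assumptions.
tallies = {
    "Opus 4.8": {"fugu_wins": 25, "ties": 6, "opponent_wins": 11},
    "Fusion": {"fugu_wins": 4, "ties": 12, "opponent_wins": 26},
}


def summarize(record):
    """Return the build total and the share of wins, ties, and losses."""
    total = sum(record.values())
    return {
        "total": total,
        "fugu_win_share": record["fugu_wins"] / total,
        "tie_share": record["ties"] / total,
        "opponent_win_share": record["opponent_wins"] / total,
    }


summaries = {name: summarize(record) for name, record in tallies.items()}

for name, s in summaries.items():
    print(
        f"vs {name}: {s['total']} builds, "
        f"Fugu wins {s['fugu_win_share']:.0%}, "
        f"ties {s['tie_share']:.0%}, "
        f"{name} wins {s['opponent_win_share']:.0%}"
    )
```

Both records add up to 42 builds. Against Opus 4.8 the win ratio is 25 to 11, and against Fusion the ties come to roughly 29% of the set.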

Thinking it?
"A model that loses to Fusion can't be that good."
Fusion is the strongest thing on my whole bench: a panel that ensembles multiple frontier models, including a Fable-5-class one. Losing to that while beating solo Opus 4.8 puts Fugu Ultra exactly where a frontier-tier system without Fable 5 in its pool should sit. It's a compliment, not a knock.

Where Fugu Ultra won outright — beating both Opus and Fusion, or tying for the top score: the aurora over a mountain ridge (8.0), the boids flocking sim (8.5), and the synthwave terrain flythrough (8.5). It tied for #1 on the solar system, the voxel runner, the playable game, fireworks, landing page, neon city, and more — all at 9.0.

See it yourself · same prompt, three models

The same build — Fugu vs Fusion vs Opus. Click any one.

This is the part launch charts can't give you. Every cell below is a real, live, playable build from the exact same prompt. Click to open the one you want and judge it with your own eyes.

The solar system
"drag to orbit, scroll to zoom, real orbital motion"
The voxel runner
"Temple-Run style, jump + slide, increasing speed"
A juicy playable game
"make it feel SHIPPABLE — score, lives, sound, polish"
Interactive fireworks
"click to launch, particle trails, night sky"
Nordic dungeon crawler
"torch-lit ruin, chasing enemies, boss room, bloom"
Open-world RPG
"village, NPCs, weather, day/night, inventory"

See all 42 Fugu Ultra builds on Goldie Bench →

42 projects · a lot to show

More of what Fugu Ultra shipped.

I tested it on 42 prompts, so there's plenty to look at. Here's a wall of the builds that passed the smoke test — every one is live and clickable.

The big question

So — does it reach Fable 5 level?

This is what everyone actually wants to know. Sakana built Fugu to go toe-to-toe with the frontier — and they put Claude Fable 5 right at the top of their own comparison charts. So let's look at the real reasoning benchmarks, not the vibes.

Fugu Ultra vs Fable 5 — published benchmarks
Sakana's own numbers + Anthropic's Fable 5 table · higher = better · axis floor 0
LiveCodeBench: Fugu Ultra 93.2 · Fable 5 89.8
GPQA-Diamond: Fugu Ultra 95.5
SWE-Bench Pro: Fugu Ultra 73.7 · Fable 5 80.3
Terminal-Bench: Fugu Ultra 82.1 · Fable 5 88.0

Read that honestly and it's a genuinely close fight:

"Sakana said Fugu Ultra scored 73.7 on SWE-Bench Pro, ahead of Claude Opus 4.8 at 69.2 and GPT-5.5 at 58.6, while trailing the restricted Fable 5 score of 80.0." — reporting on Sakana's Fugu benchmarks, June 2026
The catch you need to know
"If it ties Fable 5, why isn't it just as good?"
Because Fugu isn't one model — it's an orchestration system that routes each request across a pool of models and synthesises the answer. And here's the kicker: Fable 5 itself isn't in that pool, because it's not publicly accessible. So Fugu reaches near-Fable-5 numbers without Fable 5 in the engine. That's the whole "no Fable 5? no problem" story — and it's genuinely impressive. But it also means you're betting on a black box: Sakana doesn't disclose which models answer your request.
the Fable 5 verdict

It reaches the door of Fable 5 — and on a couple of tests, walks through it.

On published reasoning benchmarks, Fugu Ultra is legitimately in Fable 5's weight class: it wins on LiveCodeBench and GPQA, trails on SWE-Bench Pro and Terminal-Bench. On my creative one-shot build bench, it beat frontier-solo Opus 4.8 comfortably but couldn't catch Fusion. So the honest answer isn't a clean yes or no — it's "frontier-tier, neck-and-neck on reasoning, a step behind on the hardest agentic coding and on creative polish." For most real work, that's close enough to matter. For the absolute top of the coding ladder, Fable 5 still edges it.
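To put a number on "neck-and-neck", here is a short Python sketch that computes the point gap on each benchmark where the chart above shows both scores. It uses the chart figures as listed (the press quote cites 80.0 for Fable 5 on SWE-Bench Pro, which would shift that gap by 0.3). GPQA-Diamond is left out because no Fable 5 figure appears in the chart. The variable names are illustrative.

```python
# (Fugu Ultra, Fable 5) scores as listed in the published-benchmarks chart.
published = {
    "LiveCodeBench": (93.2, 89.8),
    "SWE-Bench Pro": (73.7, 80.3),
    "Terminal-Bench": (82.1, 88.0),
}

# Positive gap means Fugu Ultra leads; negative means Fable 5 leads.
gaps = {
    name: round(fugu - fable, 1)
    for name, (fugu, fable) in published.items()
}

for name, gap in gaps.items():
    leader = "Fugu Ultra" if gap > 0 else "Fable 5"
    print(f"{name}: {leader} leads by {abs(gap):.1f} points")
```

On these figures Fugu Ultra leads LiveCodeBench by 3.4 points and trails by 6.6 on SWE-Bench Pro and 5.9 on Terminal-Bench.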

The honest part

What didn't work as well.

I'm not going to pretend all 42 builds were flawless. They weren't. Here's exactly where Fugu Ultra struggled — because the rough edges are as useful to know as the wins.

About a third needed a human to confirm they play.

I smoke-test every build in a real headless browser — click the screen, press the keys, check the pixels actually change. Roughly 14 of the 42 Fugu Ultra builds came back as "maybe." They're structurally complete — closed tags, a real game loop, input handlers, a fully rendered scene — but my automated test couldn't fully drive them. These were pointer-lock FPS games (doom, skyrim, crypt, dogfight) and click-drag games (pool, neonblaster, fluid) where a generic click-plus-WASD can't trigger the specific controls. I scored those 6.5–7.0 and flagged them for manual checking instead of claiming they work. Not broken — just unverified by a robot.
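Here is a rough Python sketch of that kind of smoke test, assuming Playwright for the headless browser. It is not the bench's real harness: the function names, the viewport size, the wait times, and the static checks are all assumptions made for illustration. Comparing raw screenshot bytes is a crude stand-in for "the pixels changed", and an idle animation would also count as a change.

```python
def pixels_changed(before: bytes, after: bytes) -> bool:
    """Crude proxy: any difference between two screenshots counts as a change."""
    return before != after


def structurally_complete(html: str) -> bool:
    """Cheap static checks: closing tags, a game loop, and an input handler."""
    lowered = html.lower()
    has_closing_tags = "</html>" in lowered and "</script>" in lowered
    has_loop = "requestanimationframe" in lowered or "setinterval" in lowered
    has_input = "addeventlistener" in lowered
    return has_closing_tags and has_loop and has_input


def classify(html: str, before: bytes, after: bytes) -> str:
    """Return 'pass', 'maybe' (flag for manual checking), or 'fail'."""
    if pixels_changed(before, after):
        return "pass"
    if structurally_complete(html):
        return "maybe"
    return "fail"


def capture_frames(url: str, width: int = 800, height: int = 600):
    """Screenshot a page before and after a generic click plus WASD."""
    # Imported lazily so the pure helpers above work without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page(viewport={"width": width, "height": height})
        page.goto(url)
        page.wait_for_timeout(1000)  # let the scene render first
        before = page.screenshot()
        page.mouse.click(width / 2, height / 2)
        for key in ("w", "a", "s", "d"):
            page.keyboard.press(key)
        page.wait_for_timeout(500)
        after = page.screenshot()
        browser.close()
    return before, after
```

A generic click-plus-WASD driver like `capture_frames` is exactly the kind of input that pointer-lock and click-drag games ignore, which is why a complete build can still come back as "maybe".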

It's slow, and the two heaviest prompts timed out — four times each.

Fugu Ultra is a panel ensemble, so it's not fast — roughly 15 minutes per build. The two most complex prompts (a full Doom-style FPS and a pseudo-3D OutRun racer) blew past my 40-minute cap on four straight attempts before finally landing on a direct retry with a longer ceiling — 16 and 19 minutes respectively. Frontier quality, but you wait for it.

Thinking it?
"If it times out, the API must be broken."
It wasn't broken — it was busy. The first run used the wrong endpoint and a 16K token cap, and half my densest builds got cut off mid-file with no closing tags. The fix was switching to Sakana's documented Responses API with a 48K output budget. After that, every build came back complete. If you're calling Fugu yourself: use the Responses API and give it room. That one change fixed the truncation entirely.
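If you want to catch truncated builds automatically, something like the Python sketch below is a reasonable start. Everything specific in it is an assumption: the payload field names follow the common Responses-style convention rather than anything confirmed from Sakana's docs, `fugu-ultra` is a placeholder model name, and I read "48K" as 48,000 tokens. The truncation check is a heuristic based on the symptom described above (files cut off with no closing tags).

```python
# "48K" from the text; the exact token count is an assumption.
OUTPUT_BUDGET_TOKENS = 48_000


def build_request(prompt: str, model: str = "fugu-ultra") -> dict:
    """Build a hypothetical Responses-style payload.

    The field names and the model identifier are assumptions, so check the
    provider's documentation before sending this anywhere.
    """
    return {
        "model": model,
        "input": prompt,
        "max_output_tokens": OUTPUT_BUDGET_TOKENS,
    }


def looks_truncated(html: str) -> bool:
    """Heuristic: flag a build that appears to have been cut off mid-file."""
    tail = html.rstrip().lower()
    if not tail.endswith("</html>"):
        return True
    # Unbalanced script tags suggest the cut happened inside a script block.
    return tail.count("<script") != tail.count("</script>")
```

Running `looks_truncated` on every returned file before scoring means a cut-off build gets retried with a larger budget instead of being scored as a bad build.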

The cheaper Mini variant couldn't finish the hardest prompts at all.

Sakana also ships Fugu Mini — a faster, single-model version. I ran it through the same 42. It only completed 37. On the five heaviest prompts (the dungeon crawler, billiards, a Minecraft-style sandbox, the open-world RPG, a bullet-hell shooter) it returned empty every time — it burns its whole token budget on reasoning before it writes a line. Ultra's panel finished all 42. That gap is the clearest argument for paying for Ultra over Mini: the panel completes the work the single model can't.

How to actually pick a model

The old way vs the new way.

Here's the real takeaway, bigger than Fugu. The way most people choose an AI model is broken. Here's the contrast.

The old way
~1 wasted week
  • Read the launch post where the new model wins every chart
  • Watch one cherry-picked demo video
  • Trust the vendor's hand-picked benchmark
  • Wire it into your stack on faith
  • Find out a week later it breaks on your real work
  • Result: lost time, no idea why it failed
The new way
~1 afternoon
  • Run the model on the same fixed prompts every model gets
  • Score every build the same way, on the same scale
  • Smoke-test each one in a real browser — does it actually run?
  • Compare it head-to-head against models you already trust
  • Know exactly where it's strong and where it breaks
  • Result: pick with proof, not marketing

That's all Goldie Bench is. A fixed set of prompts, one scoring rule, every build smoke-tested, every model side by side. It's why I can tell you Fugu Ultra is 7.94 and not "amazing" or "mid" — because I watched all 42 run.

Get the whole stack

Want the bench, the Agent OS, and the playbook?

Agent OS — Claude, OpenClaw and Hermes connected

Everything you just saw — the bench that scores models, the Agent OS that runs them, the workflow that builds and deploys — lives inside the AI Profit Boardroom. It connects Claude, OpenClaw, and Hermes into one dashboard with shared memory, so your agents know your business and your goals.

  • The full Agent OS zip file — every prompt, the memory setup, the dashboard
  • Four live coaching calls every week with operators running this in production
  • Daily tutorials as new models like Fugu drop — how to test and wire them in
  • A 30-day roadmap to set the whole system up step by step
  • A community of 3,600+ founders across 38 countries, active 24/7
  • A member map to connect with builders near you
Get the Agent OS →
Inside the AI Profit Boardroom · skool.com/ai-profit-lab
Three beliefs to drop

What's holding you back from testing.

Wrong: "The newest model is always the best one to switch to."

Right: Fugu Ultra is brand new and frontier-tier — and it still lost to Fusion and trailed Fable 5 on the hardest test. New doesn't mean best for your work. The only way to know is to run it.

Wrong: "Benchmark charts in launch posts tell me what I need."

Right: Launch charts are picked by the vendor to win. Half of Fugu's densest builds truncated until I fixed the API call — you'd never see that in a chart. Watching the thing build tells you what a bar never will.

Wrong: "I'll wait until the dust settles and the 'best' model is obvious."

Right: It never settles — there's a new frontier model every few weeks. The operators who win are the ones with a system to test each one fast and move on. The skill is the bench, not the model.

Don't take my word for it

158 pages of members who stopped chasing tools and built systems instead — real businesses, real wins, written in their own words.

Read the 158-page wins doc →
The recap

What you now know about Fugu Ultra.

i.
You know where it lands

7.94 on my bench — above solo Opus 4.8 (7.49), below Fusion (8.60).

ii.
You know it beats Opus

25 wins to 11 against frontier-solo Opus 4.8. Genuinely frontier-class.

iii.
You know the Fable 5 answer

Wins LiveCodeBench + GPQA, trails on SWE-Bench Pro and Terminal-Bench. Near-parity, not full parity.

iv.
You know where it breaks

~14 builds need manual checking; it's slow; Mini can't finish the hardest 5 prompts.

v.
You know the API fix

Use the Responses API with a 48K output budget or your dense builds truncate.

vi.
You can check it yourself

All 42 builds are live and clickable on Goldie Bench. Don't trust me — play them.

Stop trusting launch charts. Run the model yourself — and watch it build.

One system for every model

Make every new model plug into your stack.

Fugu today, something else next month — the models keep changing. What doesn't change is the system around them. The Agent OS inside the AI Profit Boardroom turns Claude, OpenClaw, and Hermes into one dashboard with shared memory and goals, so every new model you test just slots in and gets more powerful automatically.

  • The full Agent OS zip — every prompt, the Obsidian memory setup, the dashboard
  • Weekly coaching calls where we wire up new tools together, step by step
  • Daily tutorials + a 30-day roadmap to build the whole thing
  • 3,600+ founders across 38 countries, someone online 24/7
  • The 158-page wins doc — read what members actually built
Get the Agent OS →
Inside the AI Profit Boardroom · skool.com/ai-profit-lab
I'll see you in the next one ↓