Sakana just shipped Fugu Ultra — a multi-agent panel API that competes head-on with OpenRouter Fusion. I tested both through 42 identical one-shot builds on Goldie Bench. Same prompts. Same scoring. Real numbers. Here's which panel wins, why, and the framework I use to pick.
Here's what Fugu Ultra produced on the first three prompts of the bench so far. Every one is a live, playable HTML file. Click into them. The bench is still running — full sweep at goldiebench.com/models/fugu.
Animated mesh gradient, multi-section, polished — denser than Fusion's 20KB at the same prompt.
WASD + mouse-look + distance fog + weapon bob. Clean implementation.
Drag-to-orbit + dust lanes + bloom. Comparable to Fusion's at ~1/5 the cost.
Currently #1 at 8.67/10 (3-of-3 golds). Watch the rest land live.
Every claim in this guide ties back to the two vendors' own pages — nothing second-hand. Both Sakana and OpenRouter publish their own benchmarks and pricing. Click through to verify:
"A Multi-Agent System, Delivered as One Model. Point your existing client or coding harness at the Fugu endpoint with your API key."
— Sakana AI, Fugu launch page, sakana.ai/fugu, June 2026
Before
I was paying for three different AI APIs. Claude. GPT. OpenRouter.
Every week a new "panel" or "ensemble" API would launch claiming it beats them all.
I kept switching. Burned credits on each one. Couldn't tell which actually shipped better builds.
I'd watch a launch video. Buy the subscription. Try it for a day. Move on.
The breaking point — I realised I had no way to compare them like-for-like. I was making spend decisions on vibes.
Then I built Goldie Bench.
After
Now I run every new AI release through 42 identical one-shot prompts.
Same prompts. Same judging rubric. Real playable HTML output every time.
When Sakana shipped Fugu Ultra this week, I plugged it into the same pipeline.
I picked the cheaper, denser winner inside a day. Live results at goldiebench.com.
You can have this too. Same discipline. Same dollar-per-output clarity. No more vibes-based AI spend.
You've seen the proof above. Real people. Real results.
The next 10 minutes show exactly how I bench-test panel APIs.
So here's the deal.
If you're reading this — promise yourself one thing right now. You're going to finish this guide AND run one bench prompt yourself before you sleep tonight. Just one. Because the moment you make this transition, every AI spend decision you make changes.
The people sitting still are still paying for three redundant APIs. The people implementing today are picking the cheapest one that wins and cancelling the rest.
Be one of those people.
Commit to the transition. Commit to taking action today. This changes everything about how you spend on AI.
Five layers I run every panel-ensembled API through before it earns a slot in my Agent OS dispatch. Built so anyone can pick the right panel API in an afternoon — not after burning a month of credits guessing.
You get drop-in OpenAI-compatible auth — point your existing client at the new base URL and your harness just works. No SDK migration. No surprise.
You see exactly which models are in the ensemble. Fable 5 + GPT-5.5 + GLM + Kimi (Fusion). Sakana's mix (Fugu). The panel is the engine — vendor diversity is what makes a panel beat any solo model.
You run 42 identical one-shot prompts on both — same scoring rubric, same output format. Real HTML you can open and play. No vendor benchmark cherry-picking allowed.
You compute dollars-per-shipping-build, not dollars-per-token. The cheaper panel wins ties. Always.
You wire the winner into Agent OS as the default dispatch. The runner-up stays as failover. The losers get cancelled.
I ran both Fugu Ultra (fugu-ultra-20260615) and OpenRouter Fusion through the same 42 one-shot prompts on Goldie Bench. Same system message. Same scoring rubric. Real HTML output every time.
Below: the headline numbers as of 2026-06-22. The Fugu bench is still running for the remaining prompts — live results at goldiebench.com.
Real billed dollars. Same prompt. Same 16K max-tokens. Lower is better.
Fugu landed the same prompt at ~25% the cost — and shipped 60% more code (32KB vs 20KB) on the first attempt. Cost ratio holds across the rest of the bench so far.
Source: sakana.ai/fugu. Higher is better.
Sakana positions Fugu as competing with Fable 5 and Mythos Preview on rigorous benchmarks. These three numbers come straight from the vendor's launch page — I have not independently verified SWE/GPQA/MRCR, only the one-shot HTML bench above.
Goldie Bench, 42-task identical-prompt sweep, live count.
Fugu is mid-run as I publish — the remaining 41 builds finish within the hour. Live count + every demo here.
Sakana's API is at api.sakana.ai/v1/chat/completions and it speaks the OpenAI request format. No SDK migration. No custom client. If you have a Python script hitting OpenAI or OpenRouter, swap the base URL + the API key, and you're done.
You're in Agent OS at 7am. You point your existing dispatcher at the Fugu endpoint. You hit run. Same outputs, lower bill.
You don't. Both Fugu and Fusion are OpenAI-compatible. The only thing that changes is the base URL string and the API key env var.
I swapped between them with a 3-line edit. Two minutes.
On the same one-shot landing-page prompt, Fugu shipped 32KB of HTML. Fusion shipped 20KB. Both were correct. Fugu's was richer — more sections, more polish, denser implementation.
That extra density compounds. Your read-along guide page lands fuller on the first attempt. Less follow-up prompting.
Fugu Ultra bills at $5/M input + $30/M output (PAYG). Fusion's panel calls land around $1.30 each on rich HTML prompts. Fugu landed the same prompt at $0.32. That's a 4× cost gap for equivalent output.
Over a month of agent-loop running, that's the difference between a $1,000 bill and a $250 bill.
That's exactly what the bench is for. Same prompts. Same scoring. If Fugu shipped worse output for less money, we'd see it in the score column.
Initial result: same quality, ~25% the cost. Full sweep coming.
OpenRouter Fusion is PAYG only. Sakana ships flat-rate plans at $20/$100/$200 a month (Standard / Pro / Max) — 10× and 20× usage caps respectively. For high-volume agent loops, that flat-rate predictability beats per-call billing roulette.
Pick the subscription size that matches your loop. No surprise bill.
Sakana's pitch line: opt out of specific providers. If your business has export-control concerns, compliance requirements, or just doesn't want to be a hostage to one vendor's roadmap — Fugu lets you exclude specific models from the panel.
You get frontier-tier output without single-vendor risk. That's the real moat here.
Your clients care when the vendor whose model you secretly rely on suddenly raises prices 4×, deprecates the model, or geo-blocks them.
Panel diversity is your insurance policy against that exact moment.
The bench script. The scoring rubric. The dispatch logic that picks the right panel for the right task automatically. The exact config I used to test Fugu in an afternoon.
It's all inside the Agent Operating System in the AI Profit Boardroom.
Inside the AI Profit Boardroom · aiprofitboardroom.com
Anyone can run this in an afternoon. Here's the exact sequence:
Sakana — sign up at console.sakana.ai. OpenRouter — keys at openrouter.ai/keys. Both have $20-tier starter options.
Either reuse mine (42 one-shot HTML builds — see goldiebench.com) or pick 10 prompts that represent your actual use case. Same prompts both APIs, every time.
Same Python script. Two base URLs. Two API key env vars. Run both in parallel. Save outputs to disk.
Open each output. Score 0–10 on: did it run, did it match the brief, how polished. Track cost per call from the API usage response.
The cheaper API that ties or wins on quality becomes your default dispatch. The runner-up becomes failover. Cancel everything else.
Right. The cost of "what you know" is what you're paying. Right now it's possibly 4× what you'd pay on Fugu for the same output.
One afternoon of bench-testing is the only way to know if you're overpaying. The answer might be no — but the test pays for itself either way.
Right. Vendor benchmarks tell you what the vendor wants you to know. Sakana cites SWE Bench Pro 73.7 / GPQA-D 95.5 / MRCRv2 93.6 — all real, all impressive. None of them tell you whether Fugu will ship the landing page YOU need at the cost YOU can afford.
Your bench is the only one that matters for your business.
Right. Panel APIs are the new default. Both Sakana and OpenRouter are building the same architecture because it works — one prompt, many models, one synthesised answer.
Waiting for stabilisation means watching your competitors halve their AI bill while yours stays put.
158 pages of members who already broke through these exact beliefs. Real businesses. Real wins. Documented inside the Boardroom.
Read the 158-page testimonials doc →A 5-layer Panel Engine framework you can run on any new panel API in an afternoon.
Bench-tested Fugu vs Fusion — same output, ~25% the cost on the prompts I tested.
Vendor links + verifiable benchmark numbers, clickable in the first screenful.
Real cost per call, real output sizes — updated live at goldiebench.com.
5-step playbook to run the same bench on any new panel API that ships next month.
Panel diversity is your insurance against single-vendor pricing, deprecation, or geo-blocking.
Bench harness. Dispatch logic. The 42 prompts. The scoring rubric. The Obsidian memory setup. Coaching calls where we wire your panel-dispatched Agent OS together step by step.
Inside the AI Profit Boardroom · aiprofitboardroom.com
Real members. Real businesses. Real wins documented inside the Boardroom right now.
Read the 158-page testimonials doc →Bench Fugu against Fusion first. Decide second. I'll see you in the next one.