Hermes · Mixture of Agents · live on GoldieBench

The Hermes Council Engine.

A panel of frontier models, merged by a chair. I put it through my whole 42-task leaderboard and it ranked #2 on the entire board, above every solo model on it. The model doesn't matter. The system does.

[Image: a council of robed scholar figures seated around a glowing round table, each robe glowing a different jewel tone, while a golden chair figure synthesises their light into one beam]

GoldieBench · top of my 42-task leaderboard · avg 0–10 (bars scaled from 7.0 to show the gap)

🏆 Fusion · a system · model panel

8.60

🥈 Hermes Council Engine · a system · Opus + GPT panel

8.38

Grok · best single model

8.13

Claude Opus 4.8 · single model

7.49

The top two on my whole leaderboard are both SYSTEMS — not single models. The best solo model is third. (Real scores · goldiebench.com)

#2 · rank on the whole board

42 · tasks built, one prompt each

12 · 🥇 task golds

See it yourself — open everything ↓

the live result: Hermes MoA on GoldieBench →
the whole board: GoldieBench leaderboard →
where the idea shipped: Nous Research · MoA 2.0 →
get the whole stack: AI Profit Boardroom →

"The strongest models are gated, and access is granted only to a select few. Hermes Agent now exposes MoA presets as virtual models — capabilities beyond the publicly available frontier."

— Nous Research (@NousResearch), on Mixture of Agents 2.0

I · catch-up in 30 seconds

What the Council Engine is.

It's a new tab I built inside Hermes, in my Agent OS.

Here's the plain version.

You pick a few frontier models — any providers, mixed together. I'm running Claude Opus 4.8 and GPT-5.5.

When you ask a question, both models answer it privately, at the same time.

Then a third model — the chair — reads both answers, judges them, and writes one final answer that's better than either.

The panel of models is the council. The chair is the aggregator. One question in, one better answer out.

Nous Research shipped this idea as Mixture of Agents 2.0. I wired it into a tab, gave it a workspace, and put it on my own leaderboard to see if it actually holds up.

It does. It ranked second on my whole board, above every solo model I've tested.

II · how it works

A panel of experts beats one genius.

Picture one brilliant person answering a hard question alone.

Now picture a panel — each expert writes their own take, privately — and a sharp chair reads all of them and gives you the best combined answer.

The panel wins. Almost every time.

That's the whole engine. The reference models are the panel. The aggregator is the chair.

One prompt goes in. Several models think. One clean answer comes out — better than any single one of them could give.
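The panel-and-chair flow described above can be sketched in a few lines. This is a minimal illustration, not the Hermes or Nous Research implementation: `run_council`, the stub models, and the chair prompt wording are all hypothetical, and each "model" is just an async function you would replace with a real API client.

```python
import asyncio
from typing import Awaitable, Callable

# A "model" here is any async function that maps a prompt to text (assumption).
Model = Callable[[str], Awaitable[str]]


async def run_council(
    prompt: str, panel: dict[str, Model], chair: Model
) -> tuple[str, dict[str, str]]:
    """Ask every panel model in parallel, then let the chair merge the drafts."""
    names = list(panel)
    # Each panel model answers privately: none of them sees another's draft.
    drafts = await asyncio.gather(*(panel[name](prompt) for name in names))
    by_model = dict(zip(names, drafts))

    # The chair gets the original question plus every draft.
    chair_prompt = "\n\n".join(
        [f"Question: {prompt}"]
        + [f"Draft from {name}:\n{text}" for name, text in by_model.items()]
        + ["Judge the drafts and write one final answer."]
    )
    final = await chair(chair_prompt)
    return final, by_model


# Stub models so the sketch runs offline; swap in real clients.
async def stub_opus(prompt: str) -> str:
    return "opus draft"


async def stub_gpt(prompt: str) -> str:
    return "gpt draft"


async def stub_chair(prompt: str) -> str:
    # A real chair would be an LLM call; this stub only reports what it saw.
    return f"merged {prompt.count('Draft from')} drafts"


final, drafts = asyncio.run(
    run_council("Build a breakout game", {"opus": stub_opus, "gpt": stub_gpt}, stub_chair)
)
print(final)
print(drafts)
```

Returning the drafts alongside the final answer is a design choice in this sketch: it lets you inspect what each panel model wrote before the chair merged them.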

III · I put it on the bench

It ranked #2 on my whole board.

I didn't want to take anyone's word for it. So I ran the Council Engine through GoldieBench — my one-shot leaderboard.

The test is simple. Ask the model for a real thing — a playable game, a working simulation, a page — in one prompt. Whatever ships on the first try is what gets scored, 0 to 10.

The council built all 42 tasks. Every one. Then each build was judged against the same field every other model faces.

It finished second on the entire leaderboard, with an 8.38 average and 12 task golds.

Second out of every model I've ever tested. Above Opus 4.8. Above Grok. Above every single model in the field.

The only thing ahead of it is Fusion — which is also a system. A different panel of models.

Read that line again as a builder, not a fan: the two best things on my board are both systems. Not models.

Thinking it? "Those scores are just your opinion."

Every build is live on the site — open them and judge for yourself. And each one was scored the same way every other model is: the same judge, the same 0–10 scale, the same field, across all 42 tasks. No special treatment for my own system.

GoldieBench · average score across 42 tasks (real)

Fusion · a system · 8.60

Hermes Council Engine · a system · 8.38 · #2

Grok · best single model · 8.13

MiniMax M3 · single model · 7.96

Claude Opus 4.8 · single model · 7.49

Two systems on top (gold + green). The best single model (Grok) is third. Bars scaled from a 7.0 floor so the gap is visible. Live at goldiebench.com.
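To make the ranking and the bar scaling concrete, here is a small sketch that uses only the five averages quoted above. The 10.0 ceiling for the bars is an assumption (the page only states the 7.0 floor), and `bar_fraction` is a hypothetical helper, not how the site draws its chart.

```python
# Averages copied from the chart above; the flag marks panels of models.
board = [
    ("Fusion", 8.60, True),
    ("Hermes Council Engine", 8.38, True),
    ("Grok", 8.13, False),
    ("MiniMax M3", 7.96, False),
    ("Claude Opus 4.8", 7.49, False),
]

FLOOR = 7.0     # stated on the page: bars start at 7.0
CEILING = 10.0  # assumption: bars end at the top of the 0-10 scale


def bar_fraction(score: float) -> float:
    """Fraction of the visible bar filled when the axis starts at FLOOR."""
    return (score - FLOOR) / (CEILING - FLOOR)


# Rank by average score, highest first.
ranked = sorted(board, key=lambda row: row[1], reverse=True)
for rank, (name, avg, is_system) in enumerate(ranked, start=1):
    kind = "system" if is_system else "single model"
    print(f"#{rank} {name} ({kind}) avg {avg:.2f} bar {bar_fraction(avg):.0%}")

best_single = next(name for name, _, is_system in ranked if not is_system)
```

Starting the axis at 7.0 stretches a 0.25-point gap across a much larger share of the bar than a 0 to 10 axis would, which is why the caption calls it out.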

IV · the receipts

What the council actually built.

These are real. Built by the panel, merged by the chair, one prompt each. Touch them — they're live.

Neon Breakout 🥇 8.6

Beat both Opus 4.8 and Fusion (8.5 each) — juice plus deeper systems.

play fullscreen ↗

Particle Galaxy 🥇 8.6

A 24k-particle WebGL spiral — denser than the solo Opus build.

play fullscreen ↗

The Dragon Realm 7.8

A low-poly frozen open world. Strong — but trails Fusion's 9.0 here.

play fullscreen ↗

Doom raycaster 8.6

An imp dead ahead, gun, HUD — a full raycaster from one prompt.

play fullscreen ↗

Heads up: these are live embeds — the cursor stays free so you can scroll. Hit "play fullscreen" for full mouse-look. Every demo is at goldiebench.com/models/moa.

Thinking it? "You just showed the ones it won."

No. All 42 are live on the board, including the ones it lost. I'm showing the Dragon Realm right here, where it scored 7.8 and Fusion beat it at 9.0. The honest result is the point: a system that takes gold on 12 of 42 tasks and misses it on the others still averages higher than every solo model.

V · the honest read

Where it wins, and where it doesn't.

I'm not going to pretend the council wins everything. It doesn't. Here's the straight version on the big builds.

On the tight stuff, it edges the field. Neon Breakout scored 8.6 — past both Opus 4.8 and Fusion at 8.5.

On the heaviest 3D builds, it trails. The Dragon Realm and the Crypt both landed at 7.8 — while Fusion hit 9.0 on each.

So the council isn't magic. It's a system, and Fusion is a better-tuned system for the hardest builds.

But step back. Both of them — the thing that wins, and the thing in second — are panels of models. Neither is a single model.

That's the real result. You don't climb this board with a bigger model. You climb it with a better system.

head-to-head · the council vs the field, by task (real 0–10)

Neon Breakout · Council Engine · 8.6 🥇

Neon Breakout · Opus 4.8 / Fusion · 8.5

Dragon Realm · Council Engine · 7.8

Dragon Realm · Fusion · 9.0

Council wins the tight game (8.6 > 8.5); Fusion wins the heavy 3D (9.0 > 7.8). Both are systems. Bars scaled from a 7.0 floor.
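The head-to-head above is just a per-task comparison, so it is easy to reproduce. This sketch uses only the scores quoted on this page; the rule that a gold goes to the highest score on a task is an assumption about how GoldieBench awards them, and ties are not handled.

```python
# Per-task scores quoted on this page (0-10). Nothing else is included.
scores = {
    "Neon Breakout": {"Council Engine": 8.6, "Opus 4.8": 8.5, "Fusion": 8.5},
    "Dragon Realm": {"Council Engine": 7.8, "Fusion": 9.0},
}


def gold_winner(task_scores: dict[str, float]) -> str:
    """Return the entrant with the highest score on one task (assumed gold rule)."""
    return max(task_scores, key=task_scores.get)


def council_margin(task_scores: dict[str, float]) -> float:
    """Council score minus the best rival score; positive means the council leads."""
    rivals = [s for name, s in task_scores.items() if name != "Council Engine"]
    return round(task_scores["Council Engine"] - max(rivals), 1)


golds = {task: gold_winner(by_entrant) for task, by_entrant in scores.items()}
margins = {task: council_margin(by_entrant) for task, by_entrant in scores.items()}

for task in scores:
    print(f"{task}: gold to {golds[task]}, council margin {margins[task]:+.1f}")
```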

VI · the big idea

Stop chasing the model. Build the system.

Everyone is waiting on the next model. The next Opus. The next GPT.

But look at what just happened on my own board: a mix of today's models beat every single model — and a second mix beat it.

No new release. No gated access. Just a smarter way to use what's already here.

The model is a part you can swap. The system is the thing you own.

And the timing makes this huge. Fable 5 is in preview. GPT-5.6 is delayed. The next frontier models aren't really in your hands yet.

So the winning move isn't waiting. It's squeezing more out of the models you already have — by combining them.

Thinking it? "I'll just wait for the next big model."

You'll wait months — and then you'll still be using one model at a time.

A system that mixes models beats a single model today, and it'll beat the next one too. The system compounds. The model expires.

VII · how you run it

One tab. A workspace for everything it makes.

Inside Hermes there's now a Mixture tab.

At the top it shows the council — your panel models and the chair — live.

You type a prompt, hit run, and watch both models answer in parallel before the chair merges them. You see the final answer and, if you want, each model's private draft.

Under it sits a workspace — every build and every answer the council has ever made, in one place, click to preview. Nothing gets lost.

And it's all configurable. Swap any model into a slot. Save different councils for different jobs. Flip between them with one command.
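If you are wiring this yourself, saved councils can be as simple as named, immutable presets. The sketch below is an assumption about one workable shape, not the Mixture tab's actual config format: the `Council` class, the `swap_slot` helper, the preset names, the model identifier strings, and the choice of chair are all hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Council:
    panel: tuple[str, ...]  # models that draft answers in parallel
    chair: str              # model that merges the drafts


# Hypothetical presets; the identifiers are placeholder strings, not real API names.
councils = {
    "build": Council(panel=("claude-opus-4.8", "gpt-5.5"), chair="claude-opus-4.8"),
}


def swap_slot(council: Council, slot: int, model: str) -> Council:
    """Return a new council with one panel slot replaced; the original is untouched."""
    panel = list(council.panel)
    panel[slot] = model
    return replace(council, panel=tuple(panel))


# Save a second council that differs in one slot, then make it the active one.
councils["build-next"] = swap_slot(councils["build"], 1, "next-model")
active = councils["build-next"]
print(active)
```

Keeping presets frozen means swapping a slot always produces a new council, so a saved preset never changes underneath a job that depends on it.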

Thinking it? "Doesn't running Agent OS burn a fortune in tokens?"

No — that's the biggest myth about it. Agent OS runs the everyday work on a free local model on your own machine, free APIs slot in for more, and for the frontier work it drives the CLIs you already pay for — your Claude subscription already includes the Claude CLI, and Agent OS plugs straight into it, so you're not paying twice. It's a layer on top of what you already own, not a new meter. And inside the AI Profit Boardroom there are full token-optimisation tutorials, so you learn to cut usage to the bone and stop worrying about it.

build it with me

Get the Agent OS — Council Engine built in.

The Council Engine lives inside the Agent Operating System I built — the same stack the Boardroom runs every day.

The full Agent OS — Hermes, Claude and OpenClaw wired into one dashboard with shared memory

The Mixture tab — your own model council, plus Fusion and Sakana panels

Weekly coaching calls, daily tutorials, and a 30-day roadmap to set it all up

Token-optimisation tutorials so it runs cheap — mostly free, on tools you already pay for

Get the Agent OS → Inside the AI Profit Boardroom · skool.com/ai-profit-lab

VIII · old way vs new way

Why this changes your move.

Chasing the model

always behind

Pick one model and hope it's the best one
Hit that model's ceiling — and stop there
Wait months for the next release to save you
Beg for access to the gated frontier models
Pay top dollar for one premium model

Building the council

#2 on the board

Mix several models into one virtual council
Beat every single model — today, like I did
No waiting — squeeze more from what's already here
Go beyond the public frontier, no access needed
Swap a slot when a cheaper or better model lands

IX · three beliefs to drop

What's holding you back.

Wrong: "The best model wins. I just need to pick the right one."

Right: The two best things on my whole leaderboard are both systems — not models. A panel of ordinary frontier models beat every single model. The system is the edge, not the model.

Wrong: "Stacking models is lab-only stuff. Too complex for me."

Right: It's one tab and one command. I'm one person and I run three different model panels — Council, Fusion, Sakana — from a single dashboard. If you can pick two models from a list, you can run a council.

Wrong: "I'll wait for the next frontier model to get ahead."

Right: Fable 5 is in preview, GPT-5.6 is delayed — the next models aren't in your hands. A council of today's models is ahead right now, and it'll absorb the next model the day it ships, just by swapping a slot.

Don't take my word for it

I'm not going to paste invented quotes here. The wins are real and written by the members themselves — agency owners, ecom founders, course creators, solo operators across 38 countries. Read them in their own words.

Read the 158-page wins doc →

X · should you care?

Where this leaves you.

If you only ever use one model at a time, this is the gap you've been feeling.

You're not behind because you picked the wrong model. You're behind because you're using models one at a time, while the systems are pulling ahead.

The people who figure out model systems now — while the tools are this raw — are going to be way ahead when everything settles. Every council you build, every workflow you wire in, it all compounds.

The model expires. The system compounds. That's the whole bet.

the recap

The whole thing in five lines.

i.

You stopped guessing. A council of Opus + GPT, merged by a chair, beats picking one model.

ii.

You got proof. It ranked #2 on my whole 42-task board, above every single model.

iii.

You got the honest read. It wins the tight builds, trails Fusion on the heaviest; both are systems.

iv.

You got one tab. Run the council, see every draft, and a workspace keeps everything it makes.

v.

You stopped waiting. Swap a slot when the next model ships. The system stays ahead.

last thing

Build your own Council Engine.

You can wire this yourself — it's a panel and a chair. Or take the shortcut: get the whole Agent OS, with the Mixture tab built in, and have me walk you through it.

The full zip — Agent OS, the Mixture tab, every prompt, the memory setup

Coaching calls where we set up your model council together, step by step

Daily tutorials, a 30-day roadmap, and a member map across 38 countries

Run it cheap — free local model, free APIs, and the CLIs you already pay for

Get the Agent OS → Inside the AI Profit Boardroom · skool.com/ai-profit-lab