A panel of frontier models, merged by a chair. I put it through my whole 42-task leaderboard — it ranked #2 on the entire board, above every single model. The model doesn't matter. The system does.
"The strongest models are gated, and access is granted only to a select few. Hermes Agent now exposes MoA presets as virtual models — capabilities beyond the publicly available frontier."
— Nous Research (@NousResearch), on Mixture of Agents 2.0
The Council Engine is a new tab I built inside Hermes, in my Agent OS.
Here's the plain version.
You pick a few frontier models — any providers, mixed together. I'm running Claude Opus 4.8 and GPT-5.5.
When you ask a question, both models answer it privately, at the same time.
Then a third model — the chair — reads both answers, judges them, and writes one final answer that's better than either.
The panel of models is the council. The chair is the aggregator. One question in, one better answer out.
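If you think in code, the panel-and-chair flow can be sketched in a few lines of Python. This is a minimal illustration, not the Hermes implementation. The `run_council` function, the `make_stub` helper, and the model labels are all hypothetical names, and the "models" are stubs so the example runs without API keys.

```python
import asyncio
from typing import Awaitable, Callable

# A "model" is any async function from prompt to text.
# Hypothetical stand-in for a real provider call.
Model = Callable[[str], Awaitable[str]]


async def run_council(
    prompt: str, panel: dict[str, Model], chair: Model
) -> tuple[str, dict[str, str]]:
    """Ask every panel model in parallel, then let the chair merge the drafts."""
    names = list(panel)
    # Each panel model sees only the prompt, never another model's draft.
    answers = await asyncio.gather(*(panel[name](prompt) for name in names))
    drafts = dict(zip(names, answers))

    # The chair gets the question plus every labelled draft.
    chair_prompt = "\n\n".join(
        [f"Question: {prompt}"]
        + [f"Draft from {name}:\n{text}" for name, text in drafts.items()]
        + ["Judge the drafts and write one final answer."]
    )
    final = await chair(chair_prompt)
    return final, drafts


def make_stub(label: str) -> Model:
    """Build a fake model that just tags its reply (assumption: demo only)."""
    async def stub(prompt: str) -> str:
        return f"[{label}] reply to a {len(prompt)}-character prompt"
    return stub


if __name__ == "__main__":
    panel = {"opus": make_stub("opus"), "gpt": make_stub("gpt")}
    final, drafts = asyncio.run(
        run_council("Build a playable breakout game.", panel, make_stub("chair"))
    )
    print(final)
    for name, text in drafts.items():
        print(name, "->", text)
```

To try it with real models, swap each stub for a function that calls your provider. The shape stays the same: parallel private drafts, then one merge step.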
Nous Research shipped this idea as Mixture of Agents 2.0. I wired it into a tab, gave it a workspace, and put it on my own leaderboard to see if it actually holds up.
It does. It ranked second on my whole board, above every single model I've tested.
Picture one brilliant person answering a hard question alone.
Now picture a panel — each expert writes their own take, privately — and a sharp chair reads all of them and gives you the best combined answer.
The panel wins. Almost every time.
That's the whole engine. The reference models are the panel. The aggregator is the chair.
One prompt goes in. Several models think. One clean answer comes out — better than any single one of them could give.
I didn't want to take anyone's word for it. So I ran the Council Engine through GoldieBench — my one-shot leaderboard.
The test is simple. Ask the model for a real thing — a playable game, a working simulation, a page — in one prompt. Whatever ships on the first try is what gets scored, 0 to 10.
The council built all 42 tasks. Every one. Then each build was judged against the same field every other model faces.
It finished second on the entire leaderboard, with an 8.38 average and 12 task golds.
Second out of every model I've ever tested. Above Opus 4.8. Above Grok. Above every single model in the field.
The only thing ahead of it is Fusion — which is also a system. A different panel of models.
Read that line again as a builder, not a fan: the two best things on my board are both systems. Not models.
Every build is live on the site — open them and judge for yourself. And each one was scored the same way every other model is: the same judge, the same 0–10 scale, the same field, across all 42 tasks. No special treatment for my own system.
These are real. Built by the panel, merged by the chair, one prompt each. Touch them — they're live.
Heads up: these are live embeds — the cursor stays free so you can scroll. Hit "play fullscreen" for full mouse-look. Every demo is at goldiebench.com/models/moa.
These demos aren't cherry-picked: all 42 are live on the board, including the ones it lost. I'm showing the Dragon Realm right here, where it scored 7.8 and Fusion beat it at 9.0. The honest result is the point: a system that wins most and loses some still beats every single model.
I'm not going to pretend the council wins everything. It doesn't. Here's the straight version on the big builds.
On the tight stuff, it edges the field. Neon Breakout scored 8.6 — past both Opus 4.8 and Fusion at 8.5.
On the heaviest 3D builds, it trails. The Dragon Realm and the Crypt both landed at 7.8 — while Fusion hit 9.0 on each.
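To make the scoring concrete, here is a small Python sketch that tallies only the scores quoted in this post. It assumes a "gold" means the top score on a task, which is my reading and not a documented GoldieBench rule. The `tally` and `task_gold` names are hypothetical. The averages it prints cover these three tasks only, so they will not match the 42-task board average.

```python
from statistics import mean

# Scores quoted in this post only (0-10 scale). The rest of the board is not included.
scores = {
    "Neon Breakout": {"Council": 8.6, "Opus 4.8": 8.5, "Fusion": 8.5},
    "Dragon Realm": {"Council": 7.8, "Fusion": 9.0},
    "Crypt": {"Council": 7.8, "Fusion": 9.0},
}


def task_gold(entries: dict[str, float]) -> str:
    """Return the entrant with the top score on one task (assumed meaning of a gold)."""
    return max(entries, key=entries.get)


def tally(scores: dict[str, dict[str, float]]):
    """Count golds per entrant and average each entrant's quoted scores."""
    golds: dict[str, int] = {}
    per_entrant: dict[str, list[float]] = {}
    for entries in scores.values():
        winner = task_gold(entries)
        golds[winner] = golds.get(winner, 0) + 1
        for name, score in entries.items():
            per_entrant.setdefault(name, []).append(score)
    averages = {name: round(mean(vals), 2) for name, vals in per_entrant.items()}
    return golds, averages


golds, averages = tally(scores)
print(golds)     # {'Council': 1, 'Fusion': 2}
print(averages)  # {'Council': 8.07, 'Opus 4.8': 8.5, 'Fusion': 8.83}
```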
So the council isn't magic. It's a system, and Fusion is a better-tuned system for the hardest builds.
But step back. Both of them — the thing that wins, and the thing in second — are panels of models. Neither is a single model.
That's the real result. You don't climb this board with a bigger model. You climb it with a better system.
Everyone is waiting on the next model. The next Opus. The next GPT.
But look at what just happened on my own board: a mix of today's models beat every single model — and a second mix beat it.
No new release. No gated access. Just a smarter way to use what's already here.
The model is a part you can swap. The system is the thing you own.
And the timing makes this huge. Fable 5 is in preview. GPT-5.6 is delayed. The next frontier models aren't really in your hands yet.
So the winning move isn't waiting. It's squeezing more out of the models you already have — by combining them.
You'll wait months — and then you'll still be using one model at a time.
A system that mixes models beats a single model today, and it'll beat the next one too. The system compounds. The model expires.
Inside Hermes there's now a Mixture tab.
At the top it shows the council — your panel models and the chair — live.
You type a prompt, hit run, and watch both models answer in parallel before the chair merges them. You see the final answer and, if you want, each model's private draft.
Under it sits a workspace — every build and every answer the council has ever made, in one place, click to preview. Nothing gets lost.
And it's all configurable. Swap any model into a slot. Save different councils for different jobs. Flip between them with one command.
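Slots and saved councils can be pictured as plain data. The Python sketch below is illustrative only and is not the Hermes config format. The `Council` class, the `swap` method, the preset names, and every model identifier string (including `chair-model` and `next-model`) are placeholders I made up for the example.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Council:
    """One saved council: the panel slots plus the chair (hypothetical structure)."""
    panel: tuple[str, ...]
    chair: str

    def swap(self, slot: int, model: str) -> "Council":
        """Return a copy with one panel slot replaced; the original stays intact."""
        panel = list(self.panel)
        panel[slot] = model
        return replace(self, panel=tuple(panel))


# Saved councils, keyed by name. Model strings are placeholders, not real IDs.
presets = {
    "council": Council(panel=("claude-opus-4.8", "gpt-5.5"), chair="chair-model"),
}

# When a new model ships, save a variant with one slot swapped.
presets["council-next"] = presets["council"].swap(1, "next-model")

# Flipping between councils is just picking a different key.
active = presets["council-next"]
print(active.panel)  # ('claude-opus-4.8', 'next-model')
```

Keeping each council immutable means a swap never changes a preset you already saved, so an old council is always there to fall back on.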
It doesn't cost a fortune to run. That's the biggest myth about it. Agent OS runs the everyday work on a free local model on your own machine, free APIs slot in for more, and for the frontier work it drives the CLIs you already pay for. Your Claude subscription already includes the Claude CLI, and Agent OS plugs straight into it, so you're not paying twice. It's a layer on top of what you already own, not a new meter. And inside the AI Profit Boardroom there are full token-optimisation tutorials, so you learn to cut usage to the bone and stop worrying about it.
The Council Engine lives inside the Agent Operating System I built — the same stack the Boardroom runs every day.
Wrong: "The best model wins. I just need to pick the right one."
Right: The two best things on my whole leaderboard are both systems — not models. A panel of ordinary frontier models beat every single model. The system is the edge, not the model.
Wrong: "Stacking models is lab-only stuff. Too complex for me."
Right: It's one tab and one command. I'm one person and I run three different model panels — Council, Fusion, Sakana — from a single dashboard. If you can pick two models from a list, you can run a council.
Wrong: "I'll wait for the next frontier model to get ahead."
Right: Fable 5 is in preview, GPT-5.6 is delayed — the next models aren't in your hands. A council of today's models is ahead right now, and it'll absorb the next model the day it ships, just by swapping a slot.
I'm not going to paste invented quotes here. The wins are real and written by the members themselves — agency owners, ecom founders, course creators, solo operators across 38 countries. Read them in their own words.
Read the 158-page wins doc →
If you only ever use one model at a time, this is the gap you've been feeling.
You're not behind because you picked the wrong model. You're behind because you're using models one at a time, while the systems are pulling ahead.
The people who figure out model systems now — while the tools are this raw — are going to be way ahead when everything settles. Every council you build, every workflow you wire in, it all compounds.
The model expires. The system compounds. That's the whole bet.
You stopped guessing. A council of Opus + GPT, merged by a chair, beats picking one model.
You got proof. It ranked #2 on my whole 42-task board — above every single model.
You got the honest read. It wins the tight builds, trails Fusion on the heaviest — both are systems.
You got one tab. Run the council, see every draft, and a workspace keeps everything it makes.
You stopped waiting. Swap a slot when the next model ships. The system stays ahead.
The model expires. The system compounds.
You can wire this yourself — it's a panel and a chair. Or take the shortcut: get the whole Agent OS, with the Mixture tab built in, and have me walk you through it.