Same five prompts, three frontier coders, one shot each — single HTML file, no follow-ups — all run through Agent OS. Then screenshotted and embedded live, side by side. You play them yourself.
Scored on whether it ran, how close it hit the brief, and how good it looked — out of 10, averaged across all five one-shot builds.
The short version: GLM-5.2 and Opus 4.8 are neck-and-neck — GLM on raw visual flair, Opus on accuracy and game-feel. Qwen3.7-Max ships clean, working builds but landed a step behind on dazzle here — which is the twist, because Qwen has the best benchmark scores of the three (more on that below).
One shot: an endless third-person voxel runner through a procedural city — dodge blocks, grab coins, speed ramps up.
▶ all three live — click a panel and play
GLM built the densest, most colourful city (windowed skyscrapers + speed/coins HUD). Opus ran the furthest with the cleanest motion. Qwen's is atmospheric — a foggy tunnel of buildings — but more muted and it crashes quicker.
One shot: the inner solar system + a few hundred near-Earth-object orbits, with play/pause, speed, and a data HUD.
▶ all three live — click a panel and play
Opus nailed the brief — distinct labelled planet orbits, a real NEO panel, a sim clock. GLM went dramatic with a glowing nebula swirl (gorgeous, but more galaxy than orbit map). Qwen drew a dense, busy orbital swarm — structurally orbit-like but dimmer and harder to read.
One shot: thousands of particles sloshing in a round bowl you tilt with the mouse, soft glowing metaball look.
▶ all three live — move your mouse over a panel to tilt the bowl
GLM filled the bowl with glowing liquid that genuinely sloshes — the most convincing of the three. Opus's particles glowed but clumped to the centre, and Qwen's is a solid working sim but reads thinner. Play them and tilt — GLM's is the one that feels like fluid.
One shot: a premium Apple-keynote-style launch page for a fictional AI model — hero, features, pricing, scroll reveals.
▶ all three live — scroll inside a panel
GLM and Opus both produced premium gradient "Intelligence, reimagined / distilled" keynote heroes — basically a tie. Qwen's is clean and well-built (proper nav + three feature cards) but the headline ("Built for the next generation of builders") lands flatter than the gradient heroes.
One shot: a neon arcade game with screen shake, particle explosions, a combo multiplier, sound, and start + game-over screens.
▶ all three live — click a panel and play
The closest test. All three shipped a real, juicy game. Opus's breakout had the most game-feel (particle bursts + live combo). Qwen's neon breakout is clean and vibrant. GLM went its own way with fullscreen asteroids. Genuinely hard to separate.
Here's the interesting part. On these five one-shot visual builds, Qwen finished third. But on the published coding benchmarks, Qwen3.7-Max is the strongest of the three — Alibaba reports 80.4% on SWE-bench Verified and 60.6% on SWE-bench Pro, beating Opus 4.7 on agentic coding (Terminal-Bench, MCP-Atlas). GLM-5.2, meanwhile, shipped with no scorecard at all.
And on SWE-bench Verified, Qwen3.7-Max's 80.4% sits right up with the leaders (Opus ~80.9, GPT-5.2 ~80.0). So the lesson is the usual one: benchmarks measure agentic problem-solving; they don't measure whether a one-shot demo looks gorgeous. Qwen reasons brilliantly — its builds here were just a touch plainer to the eye.
GLM-5.2, Qwen 3.7 and Claude all run inside the Agent OS — one dashboard, one workspace, builds previewing live. Set it up, run your own head-to-heads, keep the winners.
Get the Agent OS →