I gave all three frontier coders the exact same five prompts — one shot, single HTML file, no follow-ups — straight through Agent OS. Then I screenshotted every build and put them side by side. Same test, three models, you decide.
Each build was scored out of 10 on whether it ran, how closely it hit the brief, and how good it looked. The scores were then averaged across all five builds.
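For anyone re-running the test, here is a minimal sketch of how that averaging could work. It assumes the three criteria are weighted equally, which the write-up does not actually specify. The `BuildScore` and `model_average` names are hypothetical, and the numbers are placeholders, not the scores from this test.

```python
from dataclasses import dataclass
from statistics import fmean


@dataclass
class BuildScore:
    """Scores for one build, each on a 0-10 scale (hypothetical structure)."""
    ran: float    # did the build run at all?
    brief: float  # how closely it matched the prompt
    looks: float  # how good it looked

    def total(self) -> float:
        # Assumption: equal weight for the three criteria.
        return fmean([self.ran, self.brief, self.looks])


def model_average(scores: list[BuildScore]) -> float:
    """Average a model's per-build totals, rounded to one decimal."""
    return round(fmean([s.total() for s in scores]), 1)


# Placeholder values for illustration only, not results from this test.
example = [BuildScore(10, 8, 6), BuildScore(10, 7, 7)]
print(model_average(example))
```

If one criterion matters more to you (say, accuracy to the brief), swap the plain mean in `total` for a weighted one and keep the rest unchanged.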
The short version: GLM-5.2 edges it on raw visual flair, Opus 4.8 is a half-step behind but ahead on accuracy and game-feel, and Kimi K2.7 ships working builds every time, though they are the plainest-looking in these one-shots.
One shot: an endless third-person voxel runner through a procedural city — dodge blocks, grab coins, speed ramps up.
▶ all three live — click a panel and play · scroll on for the next test (off-screen builds pause to stay smooth)
GLM built the densest, most detailed city, with windowed skyscrapers and a speed + coins HUD. Opus ran the furthest with the cleanest motion (Score 303). Kimi's runner plays fine but is unforgiving: a run tends to end in a crash within seconds.
One shot: animate the inner solar system + a few hundred near-Earth-object orbits, with play/pause, speed, and a data HUD.
Opus nailed the brief — labelled planet orbits, a real NEO / close-pass panel, a sim clock. GLM went for drama: a glowing nebula swirl that's gorgeous but reads more galaxy than orbit map. Kimi's is accurate but dim and sparse.
One shot: thousands of particles sloshing in a round bowl you tilt with the mouse, soft glowing metaball look.
GLM filled the bowl with glowing liquid that actually sloshes — the most convincing "liquid in a bowl". Opus's particles glowed but clumped to the centre. Kimi's collapsed into a tiny blob.
One shot: a premium Apple-keynote-style launch page for a fictional AI model — hero, features, pricing, scroll reveals.
Funniest result of the lot: GLM and Opus independently produced near-identical premium "Introducing Nova 1 — Intelligence, reimagined / distilled" keynote pages — gradient hero, full nav, pricing tiers. A dead heat. Kimi's was a plainer set of feature cards.
One shot: a neon arcade game with screen shake, particle explosions, a combo multiplier, sound, and a start + game-over screen.
All three shipped a genuinely juicy game. Opus's breakout had the most game-feel — particle bursts and a live combo. Kimi's breakout was clean and solid. GLM went its own way with fullscreen neon asteroids. The closest test of the five.
Straight answer on benchmarks: there aren't clean ones for these exact versions yet. z.ai shipped GLM-5.2 with no scorecard (open weights + standalone API came after), and Moonshot has only published its own proprietary benchmarks for K2.7. So the head-to-head above is the most honest GLM-5.2-vs-K2.7 data going.
Every model here runs inside the Agent OS — Kimi, GLM-5.2 and Claude in one dashboard, one workspace, builds previewing live. Set it up, run your own head-to-heads, keep the winners.
Get the Agent OS →