Under the hood · why this one's special
Why Qwythos punches way above a 9B.
Most local models are just a base model, shrunk. Qwythos is different — it's a stack of four deliberate moves layered on top of an open Qwen3.5-9B base. That's why a model small enough to live on a laptop writes and reasons like something far bigger. Here's the exact recipe, in plain English.
01 · base
Qwen3.5-9B
A strong, open 9-billion-parameter base model. The raw engine.
→
02 · style
Claude Mythos + Fable traces
Post-trained on Claude-style reasoning & creative traces. This is where it learns to "sound like Claude."
→
03 · memory
YaRN → 1M context
Context stretched to a 1M-token ceiling so it can hold huge inputs.
→
04 · unlock
Heretic abliteration
Residual refusals trimmed, without retraining or hurting quality.
→
05 · ship
GGUF → Ollama
Quantised with llama.cpp so it runs fast on your own Mac.
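For the curious: abliteration is usually described as directional ablation, meaning you find a "refusal direction" in the model's activations and subtract the part of each activation that points along it. The toy sketch below shows only that projection step on a three-number vector. The vectors are made up, and whether the Heretic pass used here works exactly this way is an assumption on my part, not something taken from the model card.

```python
import math

def ablate_direction(hidden, direction):
    """Remove the component of `hidden` that lies along `direction`."""
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0:
        raise ValueError("direction must be non-zero")
    unit = [d / norm for d in direction]
    # Scalar projection of the activation onto the unit direction.
    dot = sum(h * u for h, u in zip(hidden, unit))
    # Subtract that projection; everything orthogonal to it is untouched.
    return [h - dot * u for h, u in zip(hidden, unit)]

# Toy values only: a 3-dimensional "activation" and a made-up refusal direction.
hidden = [3.0, 4.0, 0.0]
refusal_dir = [1.0, 0.0, 0.0]
ablated = ablate_direction(hidden, refusal_dir)
print(ablated)  # [0.0, 4.0, 0.0]
```

The point of the picture: only the slice along that one direction goes away, which is why this kind of edit can leave most of the model's behaviour alone.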
✍️ It thinks like Claude, locally
It was trained on Claude Mythos & Fable reasoning traces, so its writing and problem-solving carry that style. You get a Claude-flavoured brain that runs on your machine for $0.
🧠 A 1M-token memory ceiling
The architecture supports up to a million tokens via YaRN: whole codebases or books in one shot. (We run it at a 16k window in the Agent OS to keep RAM light; you can dial it up.)
🛠️ It reasons and calls tools
It's a "thinking" model with a <think> step and native function-calling, so it can actually drive agents, not just chat. That's what makes it useful inside an OS.
🔓 Abliterated, near-losslessly
The "Heretic" pass surgically removes the model's reflex to refuse, so it stops bailing on legitimate work, and it does so with almost zero quality drift (see the KL number below).
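If you wire the model into your own scripts, the <think> step is something you'll usually want to peel off before showing the answer. Here's a minimal sketch, assuming the reasoning arrives inline in the reply text wrapped in literal `<think>...</think>` tags. Some runtimes hand it back in a separate field instead, so check what yours returns. The function name `split_thinking` is just an illustration.

```python
import re

# Non-greedy match so several <think> blocks are captured separately.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_thinking(raw):
    """Return (reasoning, answer) from a raw model reply."""
    reasoning = "\n".join(block.strip() for block in THINK_RE.findall(raw))
    answer = THINK_RE.sub("", raw).strip()
    return reasoning, answer

raw_reply = "<think>User wants a greeting.</think>\nHello there!"
reasoning, answer = split_thinking(raw_reply)
print(answer)  # Hello there!
```

A reply with no <think> block comes back with an empty reasoning string and the full text as the answer.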
The benchmarks — measured + published
Two kinds of numbers here: the speed/quality I clocked myself on an M4 Max, and the abliteration metrics published on the model card.
| Metric | Value | What it means |
| --- | --- | --- |
| Generation speed (builds, M4 Max) | ~52 tok/s | faster than you can read — measured across 6 real builds |
| vs Ornith-9B (same Snake build) | ~2× faster | 52.4 vs 24.4 tok/s, at 40% less disk |
| Real apps built end-to-end | 6 / 6 ✓ | clock, to-do, Snake, calculator, landing, galaxy |
| Refusals after abliteration | 53 / 100 | on a harmful-prompt eval — far fewer needless "I can't help with that" |
| KL divergence (quality drift) | 0.0066 | ≈ negligible — the unlock barely changed the model's brain |
| Runtime context (Agent OS) | 16,384 tok | raised from the 8k default; model ceiling is 1M |
| Cost · privacy | $0 · 100% local | nothing leaves your machine, ever |
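A quick note on what a KL number measures. KL divergence compares two probability distributions, here the next-token predictions of the original model and the abliterated one, and comes out at zero when they're identical. The sketch below computes it for one made-up three-token distribution. The probabilities are invented for illustration, and the direction (original vs. modified) and averaging the model card used are assumptions I haven't verified.

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) in nats for two discrete distributions over the same tokens."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    # Terms with p_i == 0 contribute nothing; q_i must be > 0 wherever p_i > 0.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Invented next-token probabilities for the same prompt, before and after.
original = [0.70, 0.20, 0.10]
modified = [0.68, 0.21, 0.11]

drift = kl_divergence(original, modified)
same = kl_divergence(original, original)
print(f"{drift:.4f}")
```

Small shifts in the probabilities give a value close to zero, which is the sense in which a low KL means "the model barely changed."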
About that "1M context": read this before you get excited.
The 1M is the model's ceiling, not what it runs at out of the box. Ollama loads it with a much smaller live window (8,192 tokens by default) because a true 1M window would need far more memory than most Macs have: the longer the window, the bigger the memory cost. So in the real world you pick a window that fits your RAM.
I hit exactly this: a big build started getting cut off mid-code because it ran past the 8k default. The fix was one line: I bumped the Agent OS Local engine to a 16,384-token window (plenty for any single-file build, and it only added ~0.2 GB of RAM). Want to feed it a giant document? You can push the window higher; you'll just trade memory for it.
Bottom line: "1M" is real, but treat it as headroom you rent with RAM, not a free default.
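If you're calling Ollama directly rather than through the Agent OS, the window is set per request with the `num_ctx` option. Here's a minimal sketch against Ollama's local `/api/generate` endpoint. The model tag `qwythos` is a placeholder (use whatever tag you actually pulled), and how the Agent OS sets its own window internally isn't shown here.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_request(prompt, model="qwythos", num_ctx=16384):
    """Build a generate request that asks Ollama for a specific context window."""
    payload = {
        "model": model,          # placeholder tag, not confirmed by this guide
        "prompt": prompt,
        "stream": False,         # one JSON reply instead of a token stream
        "options": {"num_ctx": num_ctx},
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def generate(prompt, **kwargs):
    """Send the request; needs a running Ollama server with the model pulled."""
    with urllib.request.urlopen(build_request(prompt, **kwargs)) as resp:
        return json.loads(resp.read())["response"]

req = build_request("Write a single-file Snake game.")
```

Calling `generate(...)` with a larger `num_ctx` is how you'd rent more of that headroom, at the cost of more RAM.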
Which version should you download?
It ships in a ladder of sizes (quantisations). Smaller = lighter + faster but slightly less sharp; bigger = closer to the original. For most Macs, the recommended Q4_K_M (the default) is the sweet spot.
| Tag | Size | Best for |
| --- | --- | --- |
| IQ3_M | 4.4 GB | smallest — older / tighter-RAM Macs |
| IQ4_XS | 5.2 GB | great quality-to-size ratio |
| Q4_K_M (latest) | 5.6 GB | recommended — what I run |
| Q5_K_M | 6.5 GB | higher quality, a bit heavier |
| Q8_0 | 9.5 GB | near-lossless — if you've got the RAM |
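As a rough way to read that table, you can pick the largest file that fits your free memory with some room left over. The sketch below does exactly that with the sizes listed above. The 2 GB headroom is my own illustrative guess, not a measured figure, and "largest that fits" is just one heuristic: Q4_K_M remains the recommended default.

```python
# (tag, file size in GB) taken from the table above.
QUANTS = [
    ("IQ3_M", 4.4),
    ("IQ4_XS", 5.2),
    ("Q4_K_M", 5.6),
    ("Q5_K_M", 6.5),
    ("Q8_0", 9.5),
]

def pick_quant(free_ram_gb, headroom_gb=2.0):
    """Return the largest quant whose file fits in free RAM minus headroom."""
    budget = free_ram_gb - headroom_gb  # headroom value is an assumption
    fitting = [(tag, size) for tag, size in QUANTS if size <= budget]
    if not fitting:
        return None  # nothing fits; free up memory or use a smaller model
    return max(fitting, key=lambda item: item[1])[0]

print(pick_quant(8.0))   # Q4_K_M
print(pick_quant(16.0))  # Q8_0
```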
The honest pros & cons
No model is magic. Here's where Qwythos genuinely shines — and where you should keep your expectations in check.
👍 What's great
- Free & 100% private — runs entirely on your Mac, nothing leaves the machine, no per-token bill ever.
- Fast for its size — ~52 tok/s on my M4 Max, about 2× quicker than Ornith-9B on the same job.
- Light footprint — 5.6 GB, runs on a normal modern Mac with no graphics card.
- Claude-style brain — trained on Claude Mythos & Fable traces, so it writes and reasons with that flavour.
- Agent-ready — a thinking model with native function-calling, not just a chatbot.
- Fewer pointless refusals — abliterated with almost no quality drift (KL 0.0066).
- It actually builds — 6 working single-file apps, start to finish, in this guide alone.
👎 Where to keep expectations real
- It's a 9B, not a frontier model — for the hardest reasoning or huge codebases, Opus / GPT-class cloud models still win. Use the right tool for the job.
- "1M context" costs RAM — the real window you run is limited by memory (we use 16k). True long-context isn't free.
- First load is slow — ~25s cold while it reads 5.6 GB off disk; it also thrashes if you keep another big model warm at the same time.
- Reduced guardrails — abliterated means the responsibility is on you (see below).
- It's just the model — no built-in tools, memory, or web. On its own it's an engine; it needs a system around it to be genuinely useful.
- Occasionally messy output — once in a while it writes a plan instead of clean code, or trails off. A good harness re-runs it.
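On that last point, a re-run harness can be very small. The sketch below is one way to do it, not how the Agent OS does it: it treats a reply as good only if it contains a complete fenced code block, and otherwise asks again, up to a limit. The `generate` argument is any function you supply that takes a prompt and returns text; the fake one here just replays two canned replies.

```python
import re

FENCE = "`" * 3  # three backticks, built this way to keep the example readable
CODE_RE = re.compile(re.escape(FENCE) + r"[\w-]*\n(.*?)" + re.escape(FENCE), re.DOTALL)

def extract_code(reply):
    """Return the first complete fenced code block, or None if there isn't one."""
    match = CODE_RE.search(reply)
    return match.group(1).strip() if match else None

def generate_with_retry(generate, prompt, max_attempts=3):
    """Call `generate` until the reply contains code; return (code, attempts)."""
    for attempt in range(1, max_attempts + 1):
        code = extract_code(generate(prompt))
        if code:
            return code, attempt
    return None, max_attempts

# Stand-in for the model: first a plan with no code, then a proper code block.
replies = iter([
    "Here is my plan: first I will set up the page...",
    FENCE + "html\n<h1>Clock</h1>\n" + FENCE,
])
code, attempts = generate_with_retry(lambda prompt: next(replies), "Build a clock")
print(attempts)  # 2
```

A reply that trails off before the closing fence also fails the check, so it gets re-run the same way as a plan-only reply.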
Notice the biggest con: it's just the model. That's the whole point of this site — a great model is only half the equation. Drop Qwythos into a system that gives it tools, memory, and a place to ship, and a free 9B starts doing real work. That system is the Agent OS.
One honest caveat. This is an abliterated build — its safety guardrails are deliberately reduced, so it'll engage with a wider range of prompts than a stock model. That's a feature for serious local work, but it means the responsibility is on you. Use it within the law, and keep it for the legitimate building this guide is about. The base + abliteration are the work of empero-ai (Qwythos / Claude Mythos series), Heretic by p-e-w, and Richard Young / DeepNeuro.