Qwythos-9B · local model · free + private

A free Claude-style AI that runs on your own Mac.

Qwythos-9B is a Claude-Mythos-style creative + reasoning model that runs 100% on your own machine — no cloud, no per-token cost, nothing leaving your computer. I wired it into the Agent OS Local engine and clocked it on my Mac. Here's how to run it tonight.

Tested live on an M4 Max · ~53 tokens/sec · 5.6 GB · $0
A glowing AI brain-engine running inside a desktop computer with a gold privacy padlock — a private local model
$0 per token · free, forever
~53 tokens/sec on my Mac
1M token context window
100% local · nothing leaves your Mac
Straight from the source

Pull it yourself — it's one line.

Qwythos-9B is free and open on Ollama. Here's the model page, the base it's built on, and the abliteration library behind it:

"A Claude-style creative & reasoning model (Qwen3.5-9B base) … thinking model, function-calling capable, 1M-token context." — Qwythos-9B model card, ollama.com/richardyoung/qwythos-9b-abliterated
I tested it · what it built

I didn't just install it. I put it to work.

I asked Qwythos to build real things, using nothing but my own Mac. No cloud. No API. No internet needed. Here's exactly what came out, and how fast it ran. Click any of them: they're live, single HTML files the model wrote start to finish.

Six apps. One model. Zero dollars. Nothing left my Mac. Each one is a single self-contained HTML file Qwythos wrote start to finish — click any of them above.

And it's not just me typing prompts at a terminal. This is the model wired straight into my Agent OS — the Local engine drove every one of these builds from the dashboard, reporting its own speed as it worked:

the raw test — run it yourself
$ ollama run richardyoung/qwythos-9b-abliterated
# prompt: "In two sentences, what is an AI Agent OS?"
An AI Agent OS refers to an autonomous operating system where the core
intelligence consists of capable AI agents rather than passive processes.
These agents perceive their environment and act on it without constant
human supervision, performing tasks end to end.
→ 51.9 tokens/sec · 0 dollars · nothing left the Mac

$ curl localhost:3737/api/local/model
{"model":"richardyoung/qwythos-9b-abliterated:latest","warm":true}
# ↑ the Agent OS Local engine is running Qwythos

$ curl localhost:3737/api/local/chat   # the dashboard, end to end
{"t":"d","c":"..."}  ...streaming...
{"t":"stats","tps":51,"tokens":95,"model":"richardyoung/qwythos-9b-abliterated:latest"}

Measured on an Apple M4 Max via Ollama (Q4_K_M build). I ran these with a second model also loaded, so the first load was slow — but generation held ~44–53 tok/s the whole way. Your speed varies by machine. Every number here is reproducible: pull the model and run the same commands.
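
If you'd rather read the speed figure out of that stream in code than by eye, here is a minimal Python sketch. It assumes the dashboard endpoint streams one JSON object per line, that `"t":"d"` events carry a text chunk in `"c"`, and that the closing `"t":"stats"` event carries `tps` and `tokens`. Those meanings are inferred from the sample output above, not from any documented Agent OS behaviour, and `parse_local_stream` is a hypothetical helper name.

```python
import json


def parse_local_stream(lines):
    """Collect text chunks and the final stats event from a stream of JSON lines.

    Assumption: "d" events hold a text chunk under "c", and the "stats"
    event holds the speed figures, as the sample output suggests.
    """
    chunks = []
    stats = None
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # ignore blank lines between events
        event = json.loads(raw)
        if event.get("t") == "d":
            chunks.append(event.get("c", ""))
        elif event.get("t") == "stats":
            stats = event
    return "".join(chunks), stats
```

Feed it the lines of the response body and it returns the full answer text plus the stats event, so `stats["tps"]` gives you the same number the dashboard reports.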

Head to head · the test you asked for

Qwythos vs Ornith-9B — same prompt, same Mac.

Ornith-9B is the other strong local model I run. So I gave both the exact same job — "build a neon Snake game" — on the same machine, and measured everything. Both shipped a working game. The difference is speed and weight.

On the same Mac (M4 Max) | Qwythos-9B | Ornith-9B
Build speed (Snake, same prompt) | 52.4 tok/s | 24.4 tok/s
Short-answer speed (same prompt) | 29.9 tok/s | 21.5 tok/s
On-disk size | 5.6 GB | 9.5 GB
Built a working app | Yes ✓ | Yes ✓
Context window | 1M tokens | standard
Cost · privacy | $0 · 100% local | $0 · 100% local

Both are great free local models. But for the same work, Qwythos came back about twice as fast while taking up 40% less space on the drive — which is exactly why it's the one I pinned as the default in the Agent OS Local engine. The lesson is the same one this whole site runs on: the model is only half of it — the system you wire it into is what makes it useful.
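
The "about twice as fast" and "40% less space" claims come straight from the table. A quick Python check, using only the numbers measured above:

```python
# Figures copied from the head-to-head table.
qwythos_tps, ornith_tps = 52.4, 24.4   # build speed, tok/s
qwythos_gb, ornith_gb = 5.6, 9.5       # on-disk size, GB

speedup = qwythos_tps / ornith_tps          # how many times faster
disk_saving = 1 - qwythos_gb / ornith_gb    # fraction of disk saved

print(f"speedup: {speedup:.2f}x")           # speedup: 2.15x
print(f"disk saving: {disk_saving:.0%}")    # disk saving: 41%
```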

Under the hood · why this one's special

Why Qwythos punches way above a 9B.

Most local models are just a base model, shrunk. Qwythos is different — it's a stack of four deliberate moves layered on top of an open Qwen3.5-9B base. That's why a model small enough to live on a laptop writes and reasons like something far bigger. Here's the exact recipe, in plain English.

01 · base
Qwen3.5-9B
A strong, open 9-billion-parameter base model. The raw engine.
02 · style
Claude Mythos + Fable traces
Post-trained on Claude-style reasoning & creative traces. This is where it learns to "sound like Claude."
03 · memory
YaRN → 1M context
Context stretched to a 1M-token ceiling so it can hold huge inputs.
04 · unlock
Heretic abliteration
Residual refusals trimmed, without retraining or hurting quality.
05 · ship
GGUF → Ollama
Quantised with llama.cpp so it runs fast on your own Mac.
✍️ It thinks like Claude, locally

It was trained on Claude Mythos & Fable reasoning traces, so its writing and problem-solving carry that style. You get a Claude-flavoured brain that runs on your machine for $0.

🧠 A 1M-token memory ceiling

The architecture supports up to a million tokens via YaRN — whole codebases or books in one shot. (We run it at a 16k window in the Agent OS to keep RAM light; you can dial it up.)

🛠️ It reasons and calls tools

It's a "thinking" model with a <think> step and native function-calling — so it can actually drive agents, not just chat. That's what makes it useful inside an OS.
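
When a thinking model's reply arrives as raw text, an agent usually wants the reasoning separated from the final answer before it acts. A minimal Python sketch, assuming the reasoning sits inside a single `<think>...</think>` block (the exact output shape depends on the model build and on how your runtime is configured, and `split_thinking` is a hypothetical helper, not part of Ollama or the Agent OS):

```python
import re

# Non-greedy match so only the first <think> block is captured.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(reply):
    """Return (reasoning, answer) from a reply that may contain a <think> block."""
    match = THINK_BLOCK.search(reply)
    if match is None:
        return "", reply.strip()
    reasoning = match.group(1).strip()
    # Everything outside the block is treated as the answer.
    answer = (reply[:match.start()] + reply[match.end():]).strip()
    return reasoning, answer
```

A harness can log the reasoning for debugging and pass only the answer on to the next step.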

🔓 Abliterated, near losslessly

The "Heretic" pass surgically removes the model's reflex to refuse, so it stops bailing on legitimate work — and it does it with almost zero quality drift (see the KL number below).

The benchmarks — measured + published

Two kinds of numbers here: the speed/quality I clocked myself on an M4 Max, and the abliteration metrics published on the model card.

Metric | Value | What it means
Generation speed (builds, M4 Max) | ~52 tok/s | faster than you can read; measured across 6 real builds
vs Ornith-9B (same Snake build) | ~2× faster | 52.4 vs 24.4 tok/s, at 40% less disk
Real apps built end-to-end | 6 / 6 ✓ | clock, to-do, Snake, calculator, landing, galaxy
Refusals after abliteration | 53 / 100 | on a harmful-prompt eval; far fewer needless "I can't help with that"
KL divergence (quality drift) | 0.0066 | ≈ negligible; the unlock barely changed the model's brain
Runtime context (Agent OS) | 16,384 tok | raised from the 8k default; model ceiling is 1M
Cost · privacy | $0 · 100% local | nothing leaves your machine, ever

About that "1M context": read this before you get excited.

The 1M is the model's ceiling, not what it runs at out of the box. Ollama loads it with a much smaller live window (8,192 tokens by default) because a true 1M window would need far more memory than most Macs have: the longer the window, the bigger the memory cost. So in the real world you pick a window that fits your RAM. I hit this exactly: a big build started getting cut off mid-code because it ran past the 8k default. The fix was one line. I bumped the Agent OS Local engine to a 16,384-token window (plenty for any single-file build, and it only added ~0.2 GB of RAM). Want to feed it a giant document? You can push the window higher; you'll just trade memory for it. Bottom line: "1M" is real, but treat it as headroom you rent with RAM, not a free default.
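
If you talk to Ollama directly rather than through the Agent OS, one way to ask for a bigger window is the `num_ctx` option on its HTTP API. A minimal Python sketch, assuming Ollama is running on its default local port (11434); how the Agent OS Local engine sets its own window isn't shown in this guide, so treat this as an illustration of the idea rather than that one-line fix. `build_request` and `generate` are hypothetical helper names.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL = "richardyoung/qwythos-9b-abliterated"


def build_request(prompt, num_ctx=16384):
    """Build a generate request that asks Ollama for a specific context window."""
    return {
        "model": MODEL,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},  # live window in tokens; larger costs more RAM
    }


def generate(prompt, num_ctx=16384):
    """Send the request to a locally running Ollama server and return the text."""
    body = json.dumps(build_request(prompt, num_ctx)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Calling `generate("build a neon Snake game")` would run with a 16,384-token window; pass a larger `num_ctx` only if your Mac has the memory to spare.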

Which version should you download?

It ships in a ladder of sizes (quantisations). Smaller = lighter + faster but slightly less sharp; bigger = closer to the original. For most Macs, the recommended Q4_K_M (the default) is the sweet spot.

Tag | Size | Best for
IQ3_M | 4.4 GB | smallest, for older / tighter-RAM Macs
IQ4_XS | 5.2 GB | great quality-to-size ratio
Q4_K_M (latest) | 5.6 GB | recommended, what I run
Q5_K_M | 6.5 GB | higher quality, a bit heavier
Q8_0 | 9.5 GB | near-lossless, if you've got the RAM
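
If you want to turn that ladder into a quick "what fits?" check, here is a minimal Python sketch. The sizes are copied from the table; the rule itself (take the largest file that fits a memory budget you choose) is an assumption for illustration, and `largest_build_that_fits` is a hypothetical helper. It ignores the extra memory the context window needs.

```python
# Download sizes in GB, copied from the table above.
BUILDS = [
    ("IQ3_M", 4.4),
    ("IQ4_XS", 5.2),
    ("Q4_K_M", 5.6),
    ("Q5_K_M", 6.5),
    ("Q8_0", 9.5),
]


def largest_build_that_fits(budget_gb):
    """Return the tag of the largest build whose file size is within budget_gb.

    budget_gb is the memory you are willing to give the model weights; it is
    your own estimate, not something this function measures. Returns None
    when even the smallest build is too big.
    """
    fitting = [(tag, size) for tag, size in BUILDS if size <= budget_gb]
    if not fitting:
        return None
    return max(fitting, key=lambda item: item[1])[0]
```

With a 6 GB budget this returns `Q4_K_M`, which matches the recommendation above. It only answers "what fits", so for most Macs Q4_K_M remains the pick even when a bigger build would squeeze in.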

The honest pros & cons

No model is magic. Here's where Qwythos genuinely shines — and where you should keep your expectations in check.

👍 What's great

  • Free & 100% private — runs entirely on your Mac, nothing leaves the machine, no per-token bill ever.
  • Fast for its size — ~52 tok/s on my M4 Max, about 2× quicker than Ornith-9B on the same job.
  • Light footprint — 5.6 GB, runs on a normal modern Mac with no graphics card.
  • Claude-style brain — trained on Claude Mythos & Fable traces, so it writes and reasons with that flavour.
  • Agent-ready — a thinking model with native function-calling, not just a chatbot.
  • Fewer pointless refusals — abliterated with almost no quality drift (KL 0.0066).
  • It actually builds — 6 working single-file apps, start to finish, in this guide alone.

👎 Where to keep expectations real

  • It's a 9B, not a frontier model — for the hardest reasoning or huge codebases, Opus / GPT-class cloud models still win. Use the right tool for the job.
  • "1M context" costs RAM — the real window you run is limited by memory (we use 16k). True long-context isn't free.
  • First load is slow — ~25s cold while it reads 5.6 GB off disk; it also thrashes if you keep another big model warm at the same time.
  • Reduced guardrails — abliterated means the responsibility is on you (see below).
  • It's just the model — no built-in tools, memory, or web. On its own it's an engine; it needs a system around it to be genuinely useful.
  • Occasionally messy output — once in a while it writes a plan instead of clean code, or trails off. A good harness re-runs it.
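
To make "a good harness re-runs it" concrete, here is a minimal Python sketch of a retry loop. The completeness check (output ends with `</html>`) is a crude assumption that suits the single-file HTML builds in this guide, and both function names are hypothetical; this is not how the Agent OS does it, which the guide doesn't show.

```python
def looks_complete(output):
    """Crude check (an assumption): a finished single-file build ends with </html>."""
    return output.strip().lower().endswith("</html>")


def build_with_retries(generate, prompt, max_attempts=3):
    """Call generate(prompt) until the output looks complete or attempts run out.

    generate is any callable that takes a prompt string and returns text.
    Returns (output, attempts_used); output is None if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        output = generate(prompt)
        if looks_complete(output):
            return output, attempt
    return None, max_attempts
```

Pass in whatever function sends a prompt to your local model. If the first reply is a plan or trails off mid-file, the loop simply asks again, up to `max_attempts` times.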

Notice the biggest con: it's just the model. That's the whole point of this site — a great model is only half the equation. Drop Qwythos into a system that gives it tools, memory, and a place to ship, and a free 9B starts doing real work. That system is the Agent OS.

One honest caveat. This is an abliterated build — its safety guardrails are deliberately reduced, so it'll engage with a wider range of prompts than a stock model. That's a feature for serious local work, but it means the responsibility is on you. Use it within the law, and keep it for the legitimate building this guide is about. The base + abliteration are the work of empero-ai (Qwythos / Claude Mythos series), Heretic by p-e-w, and Richard Young / DeepNeuro.
What it is

A Claude-Mythos brain, without the cloud.

Qwythos-9B is a 9-billion-parameter model built on a Qwen3.5-9B base and post-trained on Claude Mythos & Fable traces. So it writes and reasons with that Claude-style voice — but it runs entirely on your own machine through Ollama.

Three things make it special for everyday work:

Thinking it?
"A 9B model can't be any good."

I thought the same. Then I ran it on my Mac: it answered at ~53 tokens a second — faster than you can read. I asked it to define an Agent OS and it gave a clean, correct answer, then wrote a sharp one-line take on local AI. For a model that costs nothing and never leaves your machine, that's the win.

Where your words actually go
the whole loop runs on your Mac · nothing is sent to a cloud · no API bill
your Mac · offline: You ask (a prompt) → Ollama (runs it on your own chip) → Qwythos thinks (reasons + writes) → Answer (~53 tok/s · $0)
My story · why I run models locally

I was tired of renting my own brain.

Before

Every little task pinged a cloud API.

Every draft, every rewrite, every test — a few more cents on the meter.

My client notes, my private docs, my ideas — all flying off to someone else's server.

And when the wifi dropped or a model got rate-limited, my work just stopped.

Then I started running the model on my own Mac.

After

Now the everyday work runs on Qwythos — right here, on my machine.

No meter. No data leaving the desk. No rate limits. Works on a plane.

It answers faster than I can read, and stays warm while I work.

And it slots straight into my Agent OS as the Local engine, so my agents use it by default.

You can run this tonight. Free, private, on the Mac you already own.

Who's telling you this

Why I test every local model that drops.

I run an AI-first SEO agency and teach thousands of operators how to wire AI into real businesses. A free model that runs on your own machine — fast, private, no bill — is one of the biggest unlocks there is. So when one lands, I pull it, clock it on my own Mac, and report back plainly.

3,600+ founders inside AIPB
400k YouTube subscribers
38 countries · live members
163k X / Twitter followers

I'm not going to paste invented quotes here. The wins are real and written by the members themselves — agency owners, ecom founders, course creators, solo operators across 38 countries. Read them in their own words.

Read the 158-page wins doc →
Before you scroll on —

Commit to owning your AI today.

You've seen the numbers. Free, private, fast, on the Mac you already have.

Here's the deal I want you to make with yourself.

Before you sleep tonight, you're going to run one command and have a Claude-style model living on your own machine.

Just one command. No card, no signup, no cloud.

Because the moment you stop renting your brain and start owning it, everything about how you work with AI changes — your privacy, your bill, your freedom to work offline.

The people who set this up now will be running local agents while everyone else is still watching the meter tick.

Be one of those people.

Commit to running it tonight. One command. It's yours after that.

Qwythos-9B on my Mac — the real numbers
measured on an Apple M4 Max via Ollama · Q4_K_M build · your mileage varies by chip
~53 tok/s generation speed
5.6 GB on-disk size
1M token context
0 dollars, ever
Run it yourself

One line. Then it's yours.

Install Ollama (free), then pull the model. That's the whole setup:

# 1 · get Ollama (free) from ollama.com, then:
ollama run richardyoung/qwythos-9b-abliterated

The recommended build is Q4_K_M (5.6 GB) — the sweet spot of quality and speed. If you're tight on memory there's a 4.4 GB build; if you want maximum quality there's a near-lossless 9.5 GB one. Pick your size:

Pick your build — smaller = lighter, bigger = sharper
download size per quant · longer bar = more disk + RAM · all run on a modern Mac
IQ3_M · smallest: 4.4 GB
IQ4_XS: 5.2 GB
Q4_K_M · pick this: 5.6 GB
Q5_K_M: 6.5 GB
Q8_0 · near-lossless: 9.5 GB
Thinking it?
"I'll need some monster GPU rig for this."

You won't. The recommended build is 5.6 GB — it runs on a normal modern Mac, no separate graphics card, no cloud. I ran it on an M4 Max at ~53 tokens a second. The first load reads the 5.6 GB off disk so give it a moment; after that it stays warm and answers instantly. If you've got a recent MacBook, you can run this.
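
You can see the slow first load and the steady speed afterwards in the timing fields Ollama returns with each non-streamed API response. A minimal Python sketch, assuming you already have the parsed JSON response from Ollama's generate endpoint, whose duration fields are reported in nanoseconds (`summarise_timing` is a hypothetical helper name):

```python
NS_PER_SECOND = 1_000_000_000


def summarise_timing(response):
    """Turn Ollama's nanosecond timing fields into load seconds and tokens/sec."""
    # load_duration: time spent loading the model into memory for this request.
    load_seconds = response.get("load_duration", 0) / NS_PER_SECOND
    # eval_count / eval_duration: generated tokens and the time spent generating them.
    eval_seconds = response["eval_duration"] / NS_PER_SECOND
    tokens_per_second = response["eval_count"] / eval_seconds
    return {
        "load_seconds": round(load_seconds, 1),
        "tokens_per_second": round(tokens_per_second, 1),
    }
```

Run it on the first response after a cold start and again on the next one: the load time should drop sharply while tokens per second stays roughly where it was.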

The shift

The old way vs the new way.

Here's the contrast between renting a cloud model and owning a local one.

The old way
cloud · metered
  • Every prompt pings a paid cloud API
  • Your notes + client data fly to someone else's server
  • Rate limits + outages stop your work cold
  • No wifi = no AI
  • The meter never stops ticking
  • Result: a bill that grows with every draft
The new way
local · free
  • Every prompt runs on your own chip — ~53 tok/s
  • Nothing leaves your Mac — fully private
  • No rate limits, no outages, no waitlist
  • Works on a plane, in a cafe, offline
  • 1M-token context for whole books + transcripts
  • Result: $0 per token, forever
How much of your work leaves the machine
cloud models send every prompt to a server · a local model sends nothing
Cloud model: all of it
Qwythos · local: none

Your notes, your client docs, your ideas — they never leave your Mac. That's not a setting you toggle. It's just where the model runs.

Make it the brain of your whole stack

Wire Qwythos into a system, not just a chat box.

Agent OS — Claude, OpenClaw and Hermes connected

A local model is powerful. A local model running as the engine of your whole Agent OS is unstoppable. Inside the AI Profit Boardroom you get the Agent OS that wires Qwythos in as the Local engine — your agents run on it by default, free and private, with Claude, OpenClaw, and Hermes layered on top for the heavy lifting.

  • The full Agent OS zip — the Local engine wiring, every prompt, the dashboard
  • Four live coaching calls a week with operators running local + cloud models
  • Daily tutorials — including how to wire a local model in as your default
  • Token-optimisation tutorials so your paid usage drops to almost nothing
  • A community of 3,600+ founders across 38 countries, online 24/7
Get the Agent OS →
Inside the AI Profit Boardroom · skool.com/ai-profit-lab
link in the description ↑
Doesn't running the Agent OS burn a fortune in tokens?

No. That's the biggest myth about it. The Agent OS runs the everyday 90% of your work on a free local model like Qwythos (on your own Mac, nothing leaving it), and free APIs slot in when you need more. For frontier work it drives the CLIs you already pay for: your Claude subscription already includes the Claude CLI, and the Agent OS plugs straight into it, so you're not paying twice. It's a layer on top of what you already own, not a new meter. And inside the AI Profit Boardroom there are full token-optimisation tutorials, so you cut usage to the bone and never think about it again.

Three beliefs to drop

What's been stopping you running local.

Wrong: "Local models are slow and weak — toys compared to the cloud."

Right: Qwythos answered at ~53 tokens a second on my Mac — faster than you read — with a Claude-Mythos voice and a 1M-token memory. For everyday work, it holds its own, free.

Wrong: "I need an expensive GPU rig to run my own AI."

Right: The recommended build is 5.6 GB and runs on a normal modern Mac — no separate graphics card, no cloud. If you've got a recent MacBook, you're ready.

Wrong: "Free always means worse, so why bother."

Right: The model isn't the moat — the system around it is. A free local model wired into a real Agent OS beats a pricey cloud model used as a lonely chat box. Own the brain, build the system.

Don't take my word for it

158 pages of members who stopped renting tools and built systems instead — real businesses, real wins, in their own words.

Read the 158-page wins doc →
The recap

What you now know.

i.
You own a Claude-style brain

Qwythos-9B — Claude-Mythos creative + reasoning, running on your own Mac.

ii.
You stopped paying per token

It's free, forever. No meter, no card, no cloud bill.

iii.
Your data stays yours

100% local — your notes and client docs never leave the machine.

iv.
It's fast + light

~53 tok/s, just 5.6 GB on disk. Runs on a normal modern Mac.

v.
It holds a whole book

1M-token context — long transcripts, big docs, all at once.

vi.
It's wired into the Agent OS

Set as the Local engine, so your agents run on it free by default.

Stop renting your brain. Own it — free, private, on the Mac you already have.

One system, free at the core

Build the system. Run it free.

Qwythos gives you a free, private brain. The Agent OS inside the AI Profit Boardroom turns that brain into a whole operating system — Claude, OpenClaw, and Hermes on one dashboard with shared memory, with this local model as the free engine underneath so most of your work never costs a cent.

  • The full Agent OS zip — Local-engine wiring, every prompt, the Obsidian memory setup
  • Weekly coaching calls where we set up local + cloud models together
  • Daily tutorials + token-optimisation training so it runs near-free
  • 3,600+ founders across 38 countries, someone online 24/7
  • The 158-page wins doc — read what members actually built
Get the Agent OS →
Inside the AI Profit Boardroom · skool.com/ai-profit-lab
I'll see you in the next one ↓