Digital Thoughts

My AI Agent’s $20 Fallback Mechanism: Half Insurance, Half Extension

Pawel Jozefiak — Thu, 14 May 2026 11:12:35 GMT

At 04:17 this morning my agent shipped one ugly line into the error registry: Wake failed: supported providers exhausted.

That is the worst error I can get. It means the primary failed, the model cascade inside the provider failed, the cross-provider hop failed, and the agent had nowhere left to send the request. Everything was my fault (because I am the architect of this thing). Although it cost me about thirty minutes to chase down the root cause, I really do not mind it. The agent did not go silent on me. It went silent on itself, queued the task, retried on the next cycle, recovered. That is the whole point of a fallback mechanism.

The reason that line stays rare is one decision I made on day one of running an agent that wakes up overnight: park $20 of credits in OpenRouter. Not as a cost. As insurance. When my primary stack has a bad morning, that $20 absorbs the hit and the agent keeps running. And on the days when everything is healthy, the same $20 doubles as an extension cord for capabilities I deliberately keep off the primary AI agent architecture: image generation, long-context refactors, cheap classification. Insurance when things break. Extension when they do not.

This post is about the fallback mechanism itself: what it is, why every serious agent needs one, why a local LLM is not enough on its own, and what you actually buy with the $20. I will show you the rungs of my stack, the trigger conditions that flip between them, and the ~40 lines of Python that hold the whole thing together. Oh, and one more thing: this is not an ad for OpenRouter. I just enjoy using it.

Why Fallbacks Matter More Than People Want To Admit

People building their first agent skip this step. I get it. You are focused on the cool part. The model is working, the prompts are clean, the agent shipped its first task. Fallbacks feel like a problem for later.

Then later arrives.

Top labs go down. Heavy load and fast shipping cycles tend to do that to a service. In the last 30 days, all three frontier providers have had a public bad day. Anthropic’s status page logged elevated errors on May 5 and again on May 12, this time tagged against Claude API specifically. OpenAI had a roughly 90-minute outage on April 20 that surfaced as 8,700+ Downdetector reports in the UK and 1,900+ in the US. Google Gemini had a widespread degraded window on May 5 too, the same day Anthropic was having a hard morning.

That is the part nobody talks about. A bad day at one frontier lab often lines up with a bad day at another. The assumption that “they will not all be down at the same time” is partially true. It is also partially false on any given Tuesday.

Claude API’s published 90-day uptime sits at about 98.99% (the Anthropic status page reports this directly). Sounds great until you do the math. 98.99% over 90 days is roughly 22 hours of downtime. If your agent runs on a schedule, like mine does overnight, it runs during some of those hours.
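
The arithmetic is short enough to check yourself. A minimal sketch, using the 98.99% figure and the 90-day window from the paragraph above; the function name is just an illustration:

```python
def downtime_hours(uptime_percent: float, window_days: int) -> float:
    """Hours of downtime implied by an uptime percentage over a window."""
    total_hours = window_days * 24
    return total_hours * (1 - uptime_percent / 100)


hours = downtime_hours(98.99, 90)
print(f"{hours:.1f} hours of downtime over 90 days")
```

Ninety days is 2,160 hours, and 1.01% of that is just under 22 hours. Swap in your own provider's number and your own schedule to see how much of it overlaps with your agent's working hours.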

Outages are one bucket. Here is the rest of what can knock a single-provider agent flat:

Rate limits. You hit a TPM or RPM ceiling mid-task. The agent sits in retry-backoff hell while the work piles up.
Auth failures. OAuth token expires at 03:00. The nightshift dies at 03:01. Do not ask me how I know.
Regional issues. A region degrades while the rest of the world is fine. Your traffic happens to land on the bad one.
Model deprecation. An older model gets retired with two weeks notice. You forgot to migrate the one cron that still calls it.
Capability gaps. Your primary does not generate images. Or does not have a long-context variant. Or does not have a cheap classification model.
Cost spikes. A loop misbehaves at 2am and burns through credits on the most expensive model in your stack.
Vendor lock-in. The day a provider raises prices or changes terms, you want options, not a migration project.

A fallback mechanism fixes most of these at the same time. That is why it is the cheapest piece of resilience you can ship. And once you have it, the second half of the value (the extension side) shows up almost for free.

Why Local LLM Is Not The Answer On Its Own

I went deep on local. I ran a 35B model on a $600 Mac Mini. I built a local agent. I measured what closes and what does not between local and frontier. I love the work and I will keep doing it. And I almost fried my Mac Mini trying to push the local tier too far, so I have receipts on the failure mode too.

But I have to be honest about what local does well and what it does not.

Local is great for: classification, redaction, summarization, tight tool calls, “is this email worth waking me up,” local-only privacy work where the data must not leave the box, anything where the task is narrow and the prompt is bounded.

Local is not great for: anything I trust Opus to do overnight. Multi-step reasoning. Long-context refactors. Voice-sensitive writing. Anything where the wrong answer is worse than no answer. When I asked a local 8B to draft a comment in my voice, it produced something that read like a different person. When I asked Opus the same thing, it sounded like me on a good day. That is the gap.

I want local LLMs to run on normal hardware, not on a $10k Mac Studio. That is why local is the cheap layer for the right kind of work. Routing a sensitive task to a small local model just because the cloud is down turns “fails cleanly” into “completes with a wrong answer,” which is worse, not better. So local is one layer in my stack. The smart layer is something else.

The Tool I Use For This Layer

The mechanism needs an implementation. This is the one I picked.

One API. More than 400 models behind it, from over 60 providers, as of May 2026. You top up credits, generate one key, and pick the model per request as a string: anthropic/claude-opus-4.7, openai/gpt-5, google/gemini-2.5-pro, meta-llama/llama-3.3-70b-instruct, and so on. Same key. Same SDK shape. One bill.

The pricing is published on a clean fee announcement: 5.5% on credit-card top-ups (with a $0.80 minimum), 5% on crypto. Per-token prices pass through at provider cost on most models. Bring-your-own-key gives you the first 1M requests per month free; after that, a 5% surcharge applies. So BYOK is not a free escape hatch forever, it is a generous free tier on top of bringing your own bill.

One myth worth killing while we are here: it used to be true that Claude on OpenRouter carried a meaningful markup vs Anthropic-direct. As of May 2026 the current Anthropic models (Sonnet 4.6, Opus 4.7) are priced identically on OpenRouter to the Anthropic API, $3 input and $15 output per million tokens for Sonnet, same as direct. The historical markup hung around the older Claude 3.5 Sonnet rate card. If you have not checked recently, check.

The interesting math is below Claude, not on top of it. Meta’s Llama 3.3 70B on OpenRouter sits at about $0.10 input and $0.32 output per million tokens. That is roughly 30x cheaper on the input side and 47x cheaper on the output side than Sonnet 4.6 at $3 / $15. The classification step in your pipeline does not need Sonnet. It needs Llama or Haiku. OpenRouter lets you make that choice per call, not per project.
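
Those ratios are easy to recompute when the rate cards change. A small sketch using only the per-million-token prices quoted above; the dictionary keys are shorthand labels for this example, not OpenRouter model slugs:

```python
# USD per million tokens, as quoted in the paragraph above.
PRICES = {
    "sonnet-4.6": {"input": 3.00, "output": 15.00},
    "llama-3.3-70b": {"input": 0.10, "output": 0.32},
}


def price_ratio(expensive: str, cheap: str, side: str) -> float:
    """How many times more the expensive model costs on one side of the bill."""
    return PRICES[expensive][side] / PRICES[cheap][side]


input_ratio = price_ratio("sonnet-4.6", "llama-3.3-70b", "input")
output_ratio = price_ratio("sonnet-4.6", "llama-3.3-70b", "output")
print(f"input: {input_ratio:.0f}x, output: {output_ratio:.0f}x")
```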

The Five Rungs Of My Stack

Here is the resilience layer in my agent, top to bottom. Each rung has a trigger that flips to the next one.

Primary call. Claude Opus 4.7 for complex work, Sonnet 4.6 for default, Haiku 4.5 for cheap fan-out. The model is picked per task based on the task’s stakes, not globally.
In-provider cascade. If Sonnet returns a 5xx or times out, the harness retries on Haiku before bailing on Anthropic. Cheap, same provider, and the recovery is often invisible.
Cross-provider hop. If the whole Anthropic surface is unhealthy, I run the same prompt through Codex (GPT-5 via the OpenAI CLI). Different vendor, different harness, same job. This is the one that earned its keep when I built the model switcher.
OpenRouter degraded mode. If both vendor CLIs are unreachable (network, auth, status pages red), the watcher scripts call OpenRouter directly with a stripped-down identity prompt and a small open-weight model. The reply is prefixed with [Fallback Mode] so I know what I am reading.
Queue and retry. If everything is on fire, the task gets stamped with a retry timestamp and re-tried on the next cycle. The agent never just drops a task.

Rungs 1, 2, 3 are model and harness routing. Rung 4 is OpenRouter doing the work the harness cannot. Rung 5 is the safety net under all of it.
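
The ladder itself can be sketched as a loop over callables. This is an illustration of the pattern, not the code in my harness: `RungFailed`, `run_ladder`, the stub rungs, and the 15-minute retry delay are all assumptions made up for the example.

```python
import time
from typing import Callable, List, Optional, Tuple


class RungFailed(Exception):
    """Raised by a rung when it cannot serve the request."""


def run_ladder(prompt: str,
               rungs: List[Tuple[str, Callable[[str], str]]],
               retry_queue: List[dict],
               retry_after_s: int = 900) -> Optional[str]:
    """Try each rung in order; queue the task if every rung fails."""
    for name, call in rungs:
        try:
            return call(prompt)
        except RungFailed as exc:
            print(f"rung '{name}' failed: {exc}")
    # Last rung: nothing answered, so stamp a retry time instead of dropping.
    retry_queue.append({"prompt": prompt,
                        "retry_at": time.time() + retry_after_s})
    return None


def primary_down(prompt: str) -> str:
    raise RungFailed("503 from primary")


def cross_provider_ok(prompt: str) -> str:
    return f"cross-provider: {prompt}"


queue: List[dict] = []
answer = run_ladder("summarize inbox",
                    [("primary", primary_down),
                     ("cross-provider", cross_provider_ok)],
                    queue)
```

Here the primary stub fails, the cross-provider stub answers, and the queue stays empty. If every rung raises, the function returns `None` and the task lands in `queue` with a retry timestamp.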

One thing to call out about rung 3: the cross-provider hop is the most important rung in practice, because most outages are full-vendor outages, not model-specific. If Anthropic is down, it is usually down for everything. Hopping Sonnet to Haiku does not help. Hopping to Codex does. The April-May 2026 incident pattern (Anthropic May 5 plus May 12, OpenAI April 20, Gemini May 5) is exactly the case for diversity at the vendor level, not the model level.

How To Actually Wire It Up

The setup is embarrassingly small. Make an account at openrouter.ai. Top up $20. Generate a key. Drop it in your agent’s secrets file (mine lives at global/secrets/openrouter.md, never in code).

Then the fallback path itself. I will show the pattern I actually use, simplified for the post. It is about 40 lines of real Python.

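import requests

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"
FALLBACK_MODEL = "google/gemini-3-flash"  # cheap, fast, reliable
FALLBACK_NOTICE = "[Fallback Mode] Primary unavailable. "

class ProviderError(Exception):
    """Primary returned a 5xx or was unreachable."""

class AuthError(Exception):
    """Primary rejected the credentials, even after a re-auth attempt."""

def call_primary(prompt):
    # your normal Claude / Codex / Bedrock call
    ...

def call_openrouter(prompt):
    # load_key() and load_identity_prompt() are your own helpers:
    # they read the secret and the stripped-down system prompt from disk.
    headers = {"Authorization": f"Bearer {load_key()}"}
    payload = {
        "model": FALLBACK_MODEL,
        "messages": [
            {"role": "system", "content": load_identity_prompt()},
            {"role": "user", "content": prompt},
        ],
    }
    resp = requests.post(OPENROUTER_URL, json=payload,
                         headers=headers, timeout=60)
    resp.raise_for_status()
    text = resp.json()["choices"][0]["message"]["content"]
    # Prefix every fallback reply so degraded output is never silent.
    return FALLBACK_NOTICE + text

def reply(prompt):
    try:
        return call_primary(prompt)
    except (TimeoutError, ProviderError, AuthError) as e:
        # log_fallback() is your own logger; record why the hop happened.
        log_fallback(reason=str(e))
        return call_openrouter(prompt)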

That is the shape. The interesting bits are around it.

Trigger conditions. Not every error should flip to fallback. A 400 (bad request) is your bug, not the provider’s. A 429 (rate limit) should retry with backoff first. Real fallback triggers are: timeouts, 5xx server errors, 401 auth errors after a single re-auth attempt, repeated 429s past your retry budget, and the explicit “all providers down” signal you build into your harness. In my stack the trigger flags live in automation/lib/resilience.sh.
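
Those rules fit in one small decision function. This is a Python illustration of the logic in the paragraph above, not the contents of resilience.sh; the `Action` names, the `decide` signature, and the default retry budget of 3 are assumptions for the sketch.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    FIX_REQUEST = "fix_request"  # our bug: do not fall back
    RETRY = "retry"              # back off, then retry the same provider
    REAUTH = "reauth"            # refresh credentials once, then retry
    FALLBACK = "fallback"        # hop to the next rung


def decide(status: Optional[int], *, timed_out: bool = False,
           retries_used: int = 0, retry_budget: int = 3,
           reauth_attempted: bool = False) -> Action:
    """Map a failed call to the next action. Only call this on failures."""
    if timed_out or status is None:
        return Action.FALLBACK
    if status >= 500:
        return Action.FALLBACK
    if status == 401:
        return Action.FALLBACK if reauth_attempted else Action.REAUTH
    if status == 429:
        return Action.FALLBACK if retries_used >= retry_budget else Action.RETRY
    if 400 <= status < 500:
        return Action.FIX_REQUEST
    raise ValueError(f"status {status} is not a failure")
```

The order matters: the 401 and 429 checks come before the generic 4xx branch, so only the remaining client errors are treated as bugs in the request.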

Identity prompt budget. When I fall back to a smaller open-weight model, I do not send the full Claude system prompt. I send a stripped-down identity prompt (SOUL.md, ~1.3KB) plus a “fallback coda” that tells the model how to behave under degraded conditions. The big primary system prompt is too long for a 9B model to handle in 120 seconds. Tailor what you send to the model’s context budget.

Cache the system prompt. The system prompt is identical across calls. Cache it (file read once per process, or actual provider-side prompt caching where supported, which I wrote about under token waste management). You will burn tokens if you reload it on every request.
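
For the file-read half of that advice, `functools.lru_cache` is enough. A minimal sketch of one way a `load_identity_prompt` helper could look; the file name and its contents are placeholders, and provider-side prompt caching is a separate mechanism that this does not cover.

```python
import tempfile
from functools import lru_cache
from pathlib import Path


@lru_cache(maxsize=None)
def load_identity_prompt(path: str) -> str:
    """Read the identity prompt from disk once per process, then serve from memory."""
    return Path(path).read_text(encoding="utf-8")


# Demo: the second call returns the cached text even though the file changed.
with tempfile.TemporaryDirectory() as tmp:
    prompt_file = Path(tmp) / "identity.md"
    prompt_file.write_text("You are my agent.", encoding="utf-8")
    first = load_identity_prompt(str(prompt_file))
    prompt_file.write_text("changed on disk", encoding="utf-8")
    second = load_identity_prompt(str(prompt_file))
```

The trade-off is visible in the demo: edits to the file are ignored until the process restarts or you call `load_identity_prompt.cache_clear()`.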

Visible degradation. The user always knows when the response came from a fallback. The [Fallback Mode] prefix is non-negotiable. The worst pattern is a silent quality drop where the user thinks they got Opus and got a 9B. Tell the truth, every time.

Cost gate per task. Some tasks are not worth $0.30 of Opus tokens. Route low-stakes work to google/gemini-2.5-flash or meta-llama/llama-3.3-70b-instruct by default. Llama 3.3 70B at $0.10 / $0.32 per million tokens is roughly 47x cheaper than Sonnet on output, and for classification or redaction it is more than enough. Reserve the expensive model for work that earns it.
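
A cost gate can be as plain as a lookup on the task type. A sketch under the assumption that each task carries a type label; the label set and the `pick_model` helper are made up for the example, while the two model strings are the ones mentioned in this post.

```python
LOW_STAKES_TASKS = {"classification", "redaction"}
CHEAP_MODEL = "meta-llama/llama-3.3-70b-instruct"
EXPENSIVE_MODEL = "anthropic/claude-opus-4.7"


def pick_model(task_type, override=None):
    """Default low-stakes work to the cheap model; everything else earns the big one."""
    if override is not None:
        return override  # an explicit per-call choice always wins
    if task_type in LOW_STAKES_TASKS:
        return CHEAP_MODEL
    return EXPENSIVE_MODEL
```

Because the model is just a string in the request, the gate sits in front of the call and nothing downstream needs to know which tier was picked.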

Test by killing the primary. The only way to know your fallback works is to test it. Once a quarter I invalidate the Claude key for ten minutes during a scheduled wake and watch the agent respond through OpenRouter. If it does not, I find the bug on my own schedule, not on the day of an actual outage.
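
The live drill has a cheaper cousin: an offline test that swaps in a dead primary. A minimal sketch, assuming the reply path takes its primary and fallback calls as parameters; `make_reply` and both stubs are hypothetical and stand in for the real calls.

```python
FALLBACK_NOTICE = "[Fallback Mode] Primary unavailable. "


class ProviderError(Exception):
    """Stand-in for whatever your primary client raises when it is down."""


def make_reply(call_primary, call_fallback):
    """Build a reply function from injectable primary and fallback calls."""
    def reply(prompt):
        try:
            return call_primary(prompt)
        except (TimeoutError, ProviderError):
            return FALLBACK_NOTICE + call_fallback(prompt)
    return reply


def dead_primary(prompt):
    raise ProviderError("401: key revoked")


def stub_fallback(prompt):
    return f"echo: {prompt}"


def test_fallback_engages_when_primary_is_dead():
    reply = make_reply(dead_primary, stub_fallback)
    out = reply("status?")
    assert out.startswith("[Fallback Mode]")
    assert out.endswith("echo: status?")


test_fallback_engages_when_primary_is_dead()
```

This only proves the wiring. It does not prove the OpenRouter key still works or that the fallback model still exists, which is why the quarterly drill with a really invalidated key is still worth doing.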

Quick aside. If you want the end-to-end picture (prompts, harness wiring, watcher scripts, the full set of routing rules), I packaged the model-switcher and fallback architecture as the AI Model Switcher on my store. Same routing I run on this Mac Mini, after the experiments. The Wiz Store has the broader Agent Builder Pack if you want the whole stack.

The Extension Half: Capabilities I Keep Off The Primary Stack

Insurance is why I opened the account. Extension is what kept me using it.

The $20 sitting in OpenRouter is not just backup for when the primary breaks. It is also where the agent goes for things I deliberately do not want native in its primary architecture. Anthropic does not generate images, and I do not want another SDK, another auth flow, and another billing line living inside the core agent loop just to support a header image. So image generation lives on the extension side. The agent calls Gemini 2.5 Flash Image (Nano Banana) and Nano Banana Pro through the same OpenRouter key when a blog post needs a header or a Forge prototype needs a mockup. From the agent’s point of view it is just another dispatch. From my point of view the agent grew an arm.

Same logic for evals (compare three models on the same prompt without standing up three SDKs), for the long-context work that does not fit cleanly in Claude’s window, for the cheap reasoning passes I do not want to spend Opus tokens on, and for the open-weight curiosity calls I want to make without standing up a whole new vendor relationship.
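
The eval case is the simplest one to sketch, because only the model string changes between requests. The endpoint, payload shape, and model strings below are the ones shown earlier in this post; the helper names are assumptions, and the sketch uses the standard library instead of requests so it has no dependencies.

```python
import json
import urllib.request

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"
MODELS = ["anthropic/claude-opus-4.7", "openai/gpt-5", "google/gemini-2.5-pro"]


def build_payload(model, prompt):
    """Same request body for every model; only the model string changes."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}


def ask_openrouter(model, prompt, api_key, timeout=60):
    """Send one chat completion request and return the reply text."""
    request = urllib.request.Request(
        OPENROUTER_URL,
        data=json.dumps(build_payload(model, prompt)).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


def compare(models, prompt, ask):
    """Run one prompt across several models; `ask` is injected so it can be stubbed."""
    return {model: ask(model, prompt) for model in models}
```

With a real key, `compare(MODELS, prompt, lambda m, p: ask_openrouter(m, p, api_key))` returns one reply per model, keyed by model string.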

Each one of those is something I could fold into the primary stack. None of them earn that complexity. They live on the extension side instead, on the same $20 that already pays for the insurance half.

What I Would Tell You If You Were Starting Today

If you have zero fallback wired right now: open the OpenRouter account this week. Wire one fallback call against your existing primary. Test it by killing your primary key for ten minutes and watching the agent respond. The whole thing is one afternoon of work.

If you already have an agent in production: count how many places in your code have a hard dependency on a single provider’s SDK. Each one is a future outage you will wear personally. Wrap them.

If you have been waiting for a local LLM to be “good enough” to be the fallback: local is the cheap layer for narrow work. OpenRouter is the smart layer for everything else. Use both, in that order, for the right kinds of work. The compounding wins from an autonomous system come from leverage; the losses come from a stack that has not been stress-tested.

The $20 is not money I lost. It is insurance against the morning my primary has a bad hour, and an extension cord for the capabilities I deliberately keep off the primary stack. Both halves earn their place on a quiet week, and the loud weeks pay for the quiet ones a hundred times over.

The agent is going to keep waking up at strange hours and trying to ship work while I am asleep. Some of those mornings the primary will be down. I want the worst line in my error log to stay rare. That is what the insurance half is for. And on the calm mornings, that same $20 lets the agent generate images, run cheap classification passes, and reach for a long-context model when one is needed. Half insurance. Half extension. All of it for the price of a dinner.

If any of this resonates: I write everything I learn here, twice a week, free. A free subscription is the only thing you need for the full picture. The 10% that ends up working long enough to package, like the Model Switcher, lives on the Wiz Store for paid subscribers. Both are fine for me. Both keep me writing.


I Built a Self-Improving AI Agent. Here Is What Made It Learn.

Pawel Jozefiak — Tue, 12 May 2026 11:31:44 GMT

The setup, because this only works if the rest of the stack is calm

I have been going through more changes on my AI agent recently. I have been transparent about that here as I go, post by post, and today I want to write about the one layer I depend on the most. But I have to start with how I got to the point of being able to ask “how does my agent actually learn from me?” That part of the story is a little messy and I think it matters.

When I started this project in October 2025, the first thing I built for the agent was its own task manager. A control panel, a dashboard. I went deep on it. I built it native on iOS, native on macOS, and as a web app, all wired together. It worked. For two or three months it was genuinely great.

The problem with self-made software is that you have to maintain it. There is no version of “I reached a level of polish I was happy with, and then I forgot about it.” The dashboard needed constant feedback. What should it show? What should it hide? Where was it pulling data from this week that it had not been pulling last week? It was also burning more tokens than I wanted to think about. So I switched. I moved to a small open-source kanban called Fizzy with a thin shim of my own. That was a quieter setup that I held for a while, and I wrote about the move in detail in the post on replacing my custom dashboard.

Fizzy was good. I was still struggling with one thing though. I needed to be able to orchestrate the agent and also see the projects I was working on from a longer distance. Day-to-day kanban is one job. Stepping back to see what was actually shipping over a month was another. So I made a small personal scratchpad of my own called experiments.jock.pl. It is not for everyone, not everything I am working on is on it, but it gave me a place to lay out the experiments I had in motion at a higher altitude than the task list. That helped, but it was still mine to maintain, and I had the same problem I had with the original dashboard.

What actually solved it was a tool I have used for years and had stopped thinking about. Basecamp. They shipped a dedicated CLI for agents recently, and the whole picture clicked for me. The CLI is what makes the agent side work. The other half of why it clicked, on my side, is the card table inside Basecamp. It is essentially the same clean kanban I liked in Fizzy, but built in. I get the lens I was rebuilding by hand, plus everything else Basecamp does, plus the CLI, all in one place. The agent can read projects, comment on cards, file new ones, complete them, all from the same place I am working. I have tried a lot of pieces of AI infrastructure in the last year and most of them are good enough. This one feels different. Another level, honestly. I can see the whole stack of work at the right altitude. I can move things around. If something is a bigger project I carve out a separate space for it. The board does what I would have spent two more months building for myself, and it does it better.

This is the setup I have been settling into over the last few weeks. The short version is that I have been replacing my custom software with shims on top of mature tools, and so far the replacements keep winning. I write about why I still keep building most of my own stack in “building your own things is cool too.” The corrections loop is one of the things that only became visible once everything else around it had calmed down.

A small commercial in the middle, on theme

Speaking of evolution. Yesterday I shipped a fresh round of updates to a bunch of products on the Wiz store. Paid subscribers and buyers should already have an email about it. The agent playbooks, the model switcher pack, the nightshift bundle, a few of the smaller kits, all refreshed. There is also one new kit I will come back to a little later, because it is the bundle for exactly this post. If you have an older version of anything in the store, the new one drops in clean. If you do not have any of them, the store page will tell you what changed in each kit. I am mentioning this here because it is on theme. The point of the rest of this post is that nothing in a working agent stays still for long. The store products move with the stack, because the stack moves with the work.

What corrections actually look like, when you work with an agent every day

OK. On to the actual subject.

When you work with an agent every day, most of the time you are not writing prompts. You are watching the agent do something and quietly thinking “no, not quite like that.” Then you say so. Five words. “I would not link that.” “Use plain text here.” “Stop confirming every step.” Each of those is a correction, and the unspoken contract between you and the agent is that you should not have to say it twice.

The best systems for this are the ones that catch corrections without you having to do anything special. You correct in chat, in your normal voice, and behind the curtain the system decides “this is something I should think about for the future,” files it where it belongs, and makes sure the next session that boots on this machine knows about it. You do not stop and write documentation. You do not open an admin panel. You just keep working, and the agent keeps absorbing.

That is what I have been building for the last few months. The corrections loop is the part of the agent that decides what to do with the small “no, not like that” moments and where to file them so they outlive the session they happened in. It is the layer I depend on the most, because it is the one that makes the agent feel like an actual coworker instead of an autocomplete. It is also the layer that makes the agent slowly start to feel like more of you, rather than more of the model.

A quick word on how the agent started

For context. My agent started in October 2025. Almost everything about it was rough back then. Sometimes the output came back cold, sometimes it just did the wrong thing in a polite way. I used to write very long prompts to deal with that. I would describe the task, then add a paragraph at the end explaining how I wanted it done, what tone I wanted, where to file the output, what to skip. Every session, over and over. The output was usually good when I did all of that. The cost was that I had to do all of that every time.

That is not a stable way to work. It scales for the first week and then you get tired of writing the same paragraph again. The thing that quietly changed everything was the agent gathering enough data on me, both from the work we had done together and from the corrections I had made along the way, that the explaining paragraph slowly stopped being necessary. It is still there in some shape. It just lives in files now, not in the prompt window. The agent walks into the room already carrying it.

The corrections loop is the part of that I want to focus on, because it is the one piece you can copy without copying everything else.

The architecture, in three stages

The pipeline is named the way most of my plumbing is named, badly and on purpose. Capture, classify, graduate.

Capture. The moment the agent spots a correction in chat, any session can call a single helper:

python3 automation/self-improve/correction_capture.py add \
    --text "" \
    --source cli \
    --context ""

That writes one line to a JSONL queue. It also opens a card in Basecamp so I can see the correction landed somewhere and so I can comment on it. No model call. No retries. Capture has to be cheap, or the agent will silently stop doing it under pressure.
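
The cheap part of capture, appending one line to a JSONL queue, can be sketched in a few lines. This is not the real correction_capture.py: the field names, the `pending` status value, and the queue file name are assumptions, and the Basecamp card step is left out.

```python
import json
import tempfile
import time
from pathlib import Path


def capture(queue_path, text, source, context=""):
    """Append one pending correction to a JSONL queue. No model call, no retries."""
    entry = {
        "ts": time.time(),
        "text": text,
        "source": source,
        "context": context,
        "status": "pending",
    }
    with Path(queue_path).open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry, ensure_ascii=False) + "\n")
    return entry


def read_queue(queue_path):
    """Load every entry back, one JSON object per line."""
    with Path(queue_path).open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    queue_file = Path(tmp) / "corrections.jsonl"
    capture(queue_file, "Use plain text here.", source="cli")
    capture(queue_file, "Stop confirming every step.", source="cli")
    queued = read_queue(queue_file)
```

Append-only JSONL keeps the write path to a single file open and a single line, which is what makes it cheap enough to survive a busy session.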

Classify. The same helper passes the message through a small regex map. Seven patterns mapping to six kinds. The kinds are skill_misuse, memory_update, behavioral, rule, preference, and unknown. “Stop doing X” comes out as rule. “I prefer X” comes out as preference. “You used the X tool wrong” comes out as skill_misuse. Each kind has a default action attached, so the next stage knows what to write. The patterns themselves live in correction_capture.py lines 50 to 92. They are short, and writing them taught me what corrections actually look like at scale better than any post I could read on the topic.
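
To make the shape concrete, here is a sketch of a regex map with seven patterns and a fallback kind. The six kind names come from the paragraph above; the patterns themselves are invented for this example and are not the ones in correction_capture.py.

```python
import re

# (pattern, kind) pairs, checked in order; the first match wins.
PATTERNS = [
    (re.compile(r"\b(stop|never|don't|do not)\b", re.I), "rule"),
    (re.compile(r"\balways\b", re.I), "rule"),
    (re.compile(r"\bI (would )?(prefer|rather|like)\b", re.I), "preference"),
    (re.compile(r"\b(used|called|ran) .* (tool|skill|command) wrong", re.I), "skill_misuse"),
    (re.compile(r"\b(remember|note) that\b", re.I), "memory_update"),
    (re.compile(r"\b(actually|correction)[:,]", re.I), "memory_update"),
    (re.compile(r"\btoo (verbose|long|formal|many)\b", re.I), "behavioral"),
]


def classify(text: str) -> str:
    """Return the kind for a correction, or 'unknown' when nothing matches."""
    for pattern, kind in PATTERNS:
        if pattern.search(text):
            return kind
    return "unknown"
```

With these patterns, “Stop confirming every step.” comes out as rule, “I prefer plain text here.” as preference, and “You used the search tool wrong.” as skill_misuse. Anything the map does not recognize falls through to unknown, which is exactly the bucket that needs a human later.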

Graduate. Every night a separate process drains the queue. For each pending entry, it picks the right place to file the artifact, writes it, and only then marks the entry resolved. The rule, baked into the agent’s own playbook, is that a correction never expires unaddressed. If the nightly drain cannot fully handle one, it has to leave it pending with a note. It is not allowed to silently drop one.

That last line is the part that took me the longest to actually believe in. Queued things age, in any system. Once one ages enough, the agent stops feeling like it learns and starts feeling like it just covered the easy stuff. Forcing the queue to either drain or escalate is the only way I have found to keep that from happening. The nightly drain is part of a wider overnight job loop on my machine. The corrections drain is one of the cleanest jobs in that loop.
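
The drain-or-escalate rule can be sketched as a loop with exactly two exits per entry: resolved, or pending with a note. The entry fields, the handler mapping, and the artifact path here are assumptions for illustration, not the real graduator.

```python
def drain(entries, handlers):
    """Graduate pending entries; anything unhandled stays pending with a note."""
    for entry in entries:
        if entry["status"] != "pending":
            continue
        handler = handlers.get(entry["kind"])
        if handler is None:
            entry["note"] = f"no handler for kind '{entry['kind']}', needs manual retag"
            continue
        try:
            entry["artifact"] = handler(entry)
        except Exception as exc:
            entry["note"] = f"graduation failed: {exc}"
            continue
        # Mark resolved only after the artifact has actually been written.
        entry["status"] = "resolved"
    return entries


filed = []


def file_rule(entry):
    """Stand-in for writing a rule file; returns where the artifact landed."""
    filed.append(entry["text"])
    return "rules/stop-confirming.md"


entries = [
    {"text": "Stop confirming every step.", "kind": "rule", "status": "pending"},
    {"text": "Hmm, not quite.", "kind": "unknown", "status": "pending"},
]
result = drain(entries, {"rule": file_rule})
```

There is no branch that deletes an entry. The first one resolves; the second stays pending and picks up a note explaining why, so it shows up again on the next run instead of disappearing.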

A real recent example. One night Atlas, the agent persona that does research for me, returned a list of hallucinated Reddit thread IDs. None of the URLs resolved. A correction landed in the queue, classified as memory_update. By the next morning there was a new feedback memory file with a single rule attached: “Atlas cannot hit reddit.com directly (403). Fetch via Firecrawl or browser-playwright first, then pass verified URLs.” Every Atlas-flavored session that has booted since has loaded that line. The same failure has not come back.

“Memory” is one word covering four jobs

Here is the part nobody writes about.

When you start, “memory” feels like one thing. You imagine a notebook the agent keeps. You imagine writing into it. You imagine retrieval. That is the abstraction every product page uses, and it is the wrong abstraction. Atharv Malve put it cleanly last summer. The model is not really remembering your past messages. It is just seeing the history again, every single time. Once you internalise that, you stop looking for the one memory feature and start asking what you actually need stored, by whom, for how long.

What I actually needed turned out to be four different sinks. They are not interchangeable. Learning the differences was half the work.

Sink one is working memory. Short-lived. The current week’s plans, the half-finished thoughts, the active conversation context. Lives in a single small file called memory.md. It is supposed to decay. Treating it as durable is the original sin.

Sink two is lessons. Full incident logs. When something goes wrong in a way I want a future session to learn from, the lesson lands in lessons.md with the trigger, the root cause, the fix, and a list of keywords. I have 274 lines of these going back to February. They read like engineering postmortems, because that is what they are. The public version of this file is roughly the mistakes anthology I wrote last month.

Sink three is feedback memories. Per-rule files in a durable memory directory. Each one is a single rule with a “Why” line and a “How to apply” line. Linkable, deletable, deduped. When the same correction comes up twice, the second time it gets its own file. It also gets a tiny pointer in a master index that the agent always loads on startup. Two-level indirection, so the index stays small.

Sink four is rule lines in the always-loaded index. These are the ones I wake up next to. A handful of **RULE: ...** lines at the top of the master index, all caps, the smallest set of behaviors I refuse to relitigate. “Verify deliverables. Show proof or keep task open.” “Match work topics against existing WizBoard tasks and complete them when done.” A rule earns its place at this level only after it has come back more than once.

And then there is the sink I did not plan for and would not give up now. The Behavioral Learning card table inside Basecamp. My WizBoard project has a small card table on it called Behavioral Learning, and every single correction the agent captures lands there as its own card. I can read the card, push back on it, fold two cards into one, or trash one that is wrong. Corrections become reviewable, not silent. That part matters. I will say more about why in the next section, but the short version is that if you let the model grade its own corrections in private, you have already lost.

If you take one thing from this post, take this. “Memory” as one concept is the wrong abstraction. Build sinks for different lifespans. Working memory is fast and disposable. Lessons are slow and durable. Feedback memories are searchable rules. Top-level rules are non-negotiable. The Behavioral Learning card table is the human-in-the-loop that keeps the rest honest. Different jobs, different files, different decay curves. I wrote about how rules in particular shape an agent in the bounded agent. Most of “memory,” once you look at it long enough, turns out to be rules.
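
As one concrete case, the feedback-memory sink with its pointer in the always-loaded index might look like the sketch below. The file naming, the INDEX.md name, and the exact layout of the rule file are assumptions; the rule text is the Atlas example from earlier in this post.

```python
import re
import tempfile
from pathlib import Path


def write_feedback_memory(memory_dir, rule, why, how_to_apply):
    """Write one rule to its own file and add a one-line pointer to the index."""
    memory_dir = Path(memory_dir)
    memory_dir.mkdir(parents=True, exist_ok=True)
    slug = re.sub(r"[^a-z0-9]+", "-", rule.lower()).strip("-")[:48]
    rule_file = memory_dir / f"feedback-{slug}.md"
    rule_file.write_text(
        f"Rule: {rule}\nWhy: {why}\nHow to apply: {how_to_apply}\n",
        encoding="utf-8",
    )
    # The index holds pointers only, so it stays small enough to always load.
    index_file = memory_dir / "INDEX.md"
    pointer = f"- {rule_file.name}\n"
    existing = index_file.read_text(encoding="utf-8") if index_file.exists() else ""
    if pointer not in existing:
        index_file.write_text(existing + pointer, encoding="utf-8")
    return rule_file


with tempfile.TemporaryDirectory() as tmp:
    for _ in range(2):  # the second call must not duplicate the pointer
        path = write_feedback_memory(
            tmp,
            rule="Atlas cannot hit reddit.com directly (403).",
            why="Direct requests fail, and unverified thread IDs were hallucinated.",
            how_to_apply="Fetch via Firecrawl or browser-playwright first, "
                         "then pass verified URLs.",
        )
    index_lines = (Path(tmp) / "INDEX.md").read_text(encoding="utf-8").splitlines()
    rule_text = path.read_text(encoding="utf-8")
```

The dedupe check is what lets the nightly drain run the same write twice without bloating the index.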

Does it actually work?

Yes and no.

Here is what I can see in my own metrics. I have an autonomous improver that runs nightly and writes a metrics.json with a seven-day window, a thirty-day window, and a longer view. As of this morning, my agent received 22 corrections in the last 30 days. In the last 7, that number is 18. The trend line is down, and the system flags it explicitly with valence: good. Errors total across all categories is also drifting down, by less.

I do not want to present this as moving in a single direction. The task success rate is 93.5 percent over 30 days, with a small dip in the last seven, from 93.5 to 92.6. So I am not going to pretend the picture is clean. Some weeks the agent gets worse. The point is that the corrections themselves are showing up less often, and when they do show up they are landing in places I can act on.

What the corrections actually look like, beyond the totals, is more interesting. A separate analyzer scans the captured corrections for repeating themes. As of this morning it has flagged two. One it calls incomplete, which is me catching the agent marking a task as finished when it was not fully done. The other it calls repeated_mistake, which is a fix that came back. The analyzer is also allowed to propose a new rule when it finds a theme strong enough, and both of the rules it proposed have already graduated to the top-level RULE lines I described earlier. “Verify deliverables. Show proof or keep task open.” came out of the incomplete pattern. “ESCALATE if same mistake recurred. Strengthen the rule or fix the trigger.” came out of the repeated_mistake pattern. That is the loop closing on itself, in real data I can read off the file.

One more honest note. Thirty days of declining corrections is not proof of generalization. It is a trend on one user, on one workload, on one machine. The agent could be getting quieter rather than smarter. The way I keep myself honest about that is the Behavioral Learning card table I described. I see every correction. I can see which kinds keep coming back. The bar I am holding myself to is “fewer repeats of the same mistake,” not “an agent that never breaks.” On that narrower bar, the data is encouraging.

Measurement of this kind is also why I cared so much about token cost a few weeks ago. If you cannot count what your agent is doing, you cannot tell whether it is improving or just drifting. I wrote about that in the post on token waste on Opus 4.7. Same instinct, different file.

This is the new kit I mentioned earlier. My paid subscribers can already grab the Behavioral Learning Kit on the Wiz store. It is the architecture I just walked through, packaged. The actual correction_capture.py and correction_graduator.py, the four memory-sink templates, the Basecamp card-table playbook (adaptable to Linear, Notion, or Trello), a CLAUDE.md snippet for agent integration, and a setup script that wires the rest together. Free with a yearly subscription. Included in the one-free-product-per-month allowance for monthly subscribers. $29 standalone if neither of those is you.

What would break it, and what I would build next

The fragile part is the classifier. Seven regex patterns are enough to label most corrections, but unknown still shows up too often to ignore. When an entry lands as unknown, the nightly drain still picks it up, but the action it should take is no longer automatic. The fallback is that I, or one of my future sessions, has to retag the row by hand. Replacing the regex with a small LLM call would solve the labeling problem and create two new ones: latency and cost. It would also create a softer problem, which is the one that scares me more.

If you let the model grade its own corrections in private, you get an agent that learns the wrong lessons confidently. Yohei Nakajima wrote about this risk in his note on better ways to build self-improving agents. His phrasing is the one I keep coming back to. The model can hallucinate bad reflections and reinforce them. That is the failure mode for any self-improving loop. The Behavioral Learning card table is what keeps that from happening on my setup, and it is the part I would build first if I were building this for someone else.
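
A regex classifier with an unknown fallback, of the kind described above, might look like the sketch below. The two patterns are invented for illustration (the post mentions seven but does not list them), and needs_human_review is a hypothetical helper showing how unknown rows get routed to a person instead of being acted on automatically.

```python
import re

# Hypothetical patterns: only the labels come from the post, not the regexes.
PATTERNS = [
    ("incomplete", re.compile(r"\b(not (done|finished)|still missing|half[- ]done)\b", re.I)),
    ("repeated_mistake", re.compile(r"\b(again|same (mistake|bug)|told you)\b", re.I)),
]


def classify(correction: str) -> str:
    """Return the first matching label, or "unknown" when nothing matches."""
    for label, pattern in PATTERNS:
        if pattern.search(correction):
            return label
    return "unknown"


def needs_human_review(correction: str) -> bool:
    """Unknown rows are not acted on automatically; a person retags them."""
    return classify(correction) == "unknown"
```

The point of the fallback is that a miss is visible. The row still exists, it just waits for a human label instead of a guessed one.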

One more bigger-picture thing. The reason I can talk about this layer with confidence is the architecture it sits inside. I wrote the long version in the post on how my AI agent knows who I am earlier this year, where I walked through the ten layers I use to make the agent feel coherent over time. The corrections loop is one of those layers. It is the one I depend on most, because it is the one that most directly changes the agent’s behaviour rather than its memory.

If you have not yet noticed one of your own fixes coming back at you, you will. When you do, the move is not to add more memory. It is to build a small queue, decide what your sinks are, make sure no correction can quietly age, and then put a human-readable surface on top of it so you can see what the agent is teaching itself. The agent that comes out the other side of all that does not just remember more. It starts behaving like more of you.

The free subscription gets you every build log on this stack, including the next one. The store has the small bundles for people who would rather skip a few of the walls I walked into, and as I mentioned earlier most of those bundles were just refreshed. Both are fine for me.

How to Use Git(hub) When You’re Building with AI (Basics)

Pawel Jozefiak — Thu, 07 May 2026 09:37:17 GMT

This is part three of my Basics series. The first post was about how I structure CLAUDE.md after 1,000+ sessions, the instructions file that tells your AI agent who it is and how to behave. The second was a step-by-step guide to building your first AI agent from scratch. This one covers something I probably should have put first: version control. Why you need it, what it actually is, and how to use it when AI is doing some of the building.

If you’ve ever lost an hour of progress in a game because you forgot to save, you already understand why Git exists.

You’re deep in a dungeon. The boss took 40 minutes. You made one wrong move, got killed, and your last save was way back at the start of the level. That hour is just gone. No trace of what you tried, no checkpoint to return to, nothing.

Building software without version control feels exactly the same. Especially when AI is part of the building process.

I’ve been running my own AI agent since late 2025. It builds things, makes decisions, modifies files, runs overnight. It also makes mistakes. Sometimes it introduces a bug deep in the architecture and I wake up to something that doesn’t work anymore. Without proper commits, I’d have no idea what changed. With them, I open the history, read back through what happened, and roll back to the last clean state in under a minute.

This post is for people who are starting to build with AI tools, vibe coding with Cursor or Claude Code or Codex, or running their first experiments with autonomous agents. Git probably sounds like a developer thing. It is. It’s also one of the most useful habits you can build as a builder of anything, regardless of how technical you are.

First: Git is not GitHub

This confusion trips up almost everyone who starts. I had it for longer than I want to admit.

Git is a tool. Software you install on your computer. It tracks changes to your files over time and saves snapshots of your project whenever you ask it to. It’s free, open source, and runs entirely on your machine. It has nothing to do with the internet. Git was created in 2005 by Linus Torvalds (the person who also created Linux) and has become the standard for version control across the entire software industry.

GitHub is a website. A cloud service that stores your Git repositories remotely. A place to back them up, share them with others, and access them from anywhere. GitHub is owned by Microsoft and is where most public open-source code lives.

The relationship is like the difference between a text file and Google Drive. The file exists on your machine whether or not you upload it anywhere. Git works whether or not you ever create a GitHub account.

Why does this matter? Because GitHub is not your only option, and I think a lot of people avoid the whole topic because they assume it means signing up for something owned by Microsoft and making their work public. Neither of those things has to be true.

The main alternatives worth knowing:

GitLab: the most comprehensive alternative. Does everything GitHub does (repositories, issue tracking, code review) plus built-in CI/CD pipelines for automated testing and deployment. Can also be self-hosted on your own server if you want full control. Good option if you want more features baked in.
Codeberg: run by a nonprofit organization based in Germany. GDPR-native from the ground up, no data selling, and they explicitly don’t train AI models on your code. Free, donation-funded, no ads, no tracking. If privacy and data sovereignty matter to you (especially if you’re in Europe), this is the serious alternative.
Forgejo: open-source and self-hosted. You install it on your own server and run your own Git hosting. Lightweight, modern interface, GitHub-compatible. If you want complete control over your code and have a machine to run it on, this is the path.
Bitbucket: made by Atlassian, integrates tightly with Jira and Confluence. If your team is already using those tools, Bitbucket fits naturally.

All of these speak the same Git language. Every command I’ll show you in this post works on all of them. The choice of platform is about where your code lives, not how you use it.

I use GitHub because the ecosystem is built around it and my AI tools (Claude Code especially) integrate with it well. But if you have strong reasons to go elsewhere, you’re not missing anything technically.

Why I started actually caring about this

I’ve known about Git for years. I ran commits occasionally. I wasn’t disciplined about it.

That changed when I started building an agent that runs overnight.

The setup is that the agent works autonomously while I sleep. It builds features, writes scripts, modifies configuration, creates tasks for itself. Most nights this is productive. But early on, I’d wake up to something broken and have no clear way to understand what had changed. The agent had touched 12 files across 3 directories and something downstream was misbehaving. I was staring at a broken system with no map back to working.

I fixed this by building commit discipline into the agent. It now commits after every meaningful action. When I wake up and something is wrong, I read the commit history. I see exactly what changed, when, and in what order. I can roll back to the last clean commit in under ten seconds, or read forward through the commits to understand what went wrong and patch it with that knowledge.
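
Commit discipline inside an agent can be as small as one helper that runs after each meaningful action. The sketch below is an assumption about how that could look, not the agent’s actual code: commit_action, run_git and the ten-character minimum are all hypothetical names and values. The run parameter exists so the helper can be exercised without touching a real repository.

```python
import subprocess
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], None]


def run_git(args: Sequence[str]) -> None:
    """Default runner: call the real git CLI and raise if it exits non-zero."""
    subprocess.run(["git", *args], check=True)


def commit_action(message: str, run: Runner = run_git) -> None:
    """Stage everything and commit it with a descriptive message."""
    text = message.strip()
    # Hypothetical guard: refuse throwaway messages like "wip".
    if len(text) < 10:
        raise ValueError("commit message is too short to be useful")
    run(["add", "-A"])
    # Note: git exits non-zero when there is nothing to commit,
    # so the default runner raises CalledProcessError in that case.
    run(["commit", "-m", text])
```

An agent loop would call commit_action("add rate limit guard to external API calls") right after the change lands, so each step in the overnight run leaves its own entry in the history.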

This is what most people miss when they think of version control as “backup.” It’s not just backup. It’s a navigable history. It’s the difference between saving a file and saving a timeline. With a timeline, mistakes become investigations instead of disasters. I wrote about a lot of those investigations in the post about how I almost broke everything.

Setting up your first repository

This will take less time than you think. Let me walk through exactly what to do.

Step 1: Install Git

On a Mac, open the Terminal app (search for it in Spotlight) and type:

git --version

If you see something like git version 2.39.0, you already have it. If not, the easiest path is to go to git-scm.com and download the installer. On Mac you can also run brew install git if you have Homebrew installed.

On Windows, download the installer from git-scm.com. It includes a terminal called Git Bash, which is what you’ll use to run the commands below.

Step 2: Tell Git who you are (one-time setup)

Git tracks who made each change. Before you do anything, set your name and email:

git config --global user.name "Your Name"
git config --global user.email "you@example.com"

You only do this once. It doesn’t create an account anywhere. It just labels your commits.

Step 3: Initialize a repository

Navigate to your project folder in the terminal and run:

git init

Git creates a hidden folder called .git inside your project. That folder holds the entire history of your project: all your commits, all the metadata, everything. You never need to open or touch it directly. Your folder is now a Git repository, although nothing is saved in its history until your first commit.

If you want to verify it worked, run git status. You’ll see a list of your files as “untracked” (Git sees them but hasn’t started tracking their history yet).

Step 4: Make your first commit

A commit is a snapshot, your first save point. Two commands:

git add .
git commit -m "Initial setup"

git add . stages all your files, which means “include these in the next snapshot.” The dot means “everything in this folder.” You can also add specific files with git add filename.py if you only want to commit some changes.

git commit -m "message" saves the snapshot with your description. That description is the commit message. We’ll talk about what makes a good one in a moment.

To confirm it worked, run git log. You’ll see your first commit listed with a timestamp and your name.

Step 5: Push to a remote host (optional but recommended)

Your repository exists on your machine right now. To back it up to GitHub (or wherever), you need to create an empty repository there first, then connect your local one to it.

On GitHub: click the “+” icon at the top right, choose “New repository,” give it a name, and make sure you do NOT check “Add a README” (you want the empty repository). Copy the URL it gives you.

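Then run these three commands:

git branch -M main
git remote add origin https://github.com/yourusername/your-repo.git
git push -u origin main

git branch -M main renames your current branch to main. Depending on its version and configuration, Git may still name the first branch master, and the push would fail without this step. git remote add origin tells your local Git where the remote copy lives. git push -u origin main uploads your commits there. The -u flag links your local main branch to the remote one, so after this first time you just run git push.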

That’s the whole setup. From here, your workflow is: make changes, then add, commit, push. Those three commands are 90% of what you’ll do.

What to add to .gitignore (and why)

Before you commit your actual project files, we need to talk about .gitignore.

This is a file that tells Git which files and folders to never track. You don’t want passwords, API keys, or large auto-generated files in your version history. Once something is committed to Git and pushed to a remote, it’s there forever (even if you delete it later, it’s in the history). So you exclude sensitive things upfront.

Create a file called .gitignore in your project root. For most AI agent projects, this is a good starting point:

# Environment variables and secrets
.env
.env.local
secrets/
*.key

# Python
__pycache__/
*.pyc
*.pyo
.venv/
venv/

# Node.js
node_modules/
npm-debug.log

# macOS
.DS_Store

# Editor files
.vscode/settings.json
.idea/

# Large generated files
*.log
dist/
build/

The most important lines: .env and anything in a secrets/ folder. If you’re using AI tools like Claude Code, you likely have API keys stored somewhere. Those should never go into Git. Add them to .gitignore before your first commit.

If you accidentally commit a secret and push it, rotate the key immediately. Deleting the file in a later commit does not help, because the old version stays readable in the history.
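
A rough safety net is to check file names against the secret-related patterns before they are staged. The sketch below is an illustration only: looks_secret and risky_files are hypothetical helpers, and fnmatch covers just a small part of Git’s real ignore rules, so treat it as a warning light rather than a replacement for .gitignore.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath

# A subset of the secret-related entries from the .gitignore above.
SECRET_PATTERNS = [".env", ".env.local", "*.key"]
SECRET_DIRS = {"secrets"}


def looks_secret(path: str) -> bool:
    """Rough check: does this path resemble something that should be ignored?"""
    p = PurePosixPath(path)
    # Anything that lives under a secrets/ directory counts.
    if SECRET_DIRS & set(p.parts[:-1]):
        return True
    return any(fnmatch(p.name, pattern) for pattern in SECRET_PATTERNS)


def risky_files(paths: list[str]) -> list[str]:
    """Return the paths that should not be committed, in their original order."""
    return [path for path in paths if looks_secret(path)]
```

Run over the list of files about to be committed, an empty result means nothing obvious is about to leak.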

When to commit

Most beginners commit too rarely. They work for three hours, then push “made some changes.” That’s nearly useless as a history. Here’s how I actually think about it.

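Commit before anything big. If you’re about to let Claude Code refactor a major section of your project, commit first. If the refactor goes sideways, you can throw away its uncommitted changes with one command: git reset --hard HEAD. Two caveats. Brand-new files the agent created are untracked, so reset leaves them in place until you delete them or run git clean -fd. And if the agent committed along the way, reset to the checkpoint’s hash instead of HEAD. This is the most valuable habit I’ve developed. Before I hand something big to the agent, I save my current state. No exceptions.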
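
One way to confirm the checkpoint really exists before handing off is to read the output of git status --porcelain, which prints a two-character status code followed by the path, with ?? marking untracked files. The helpers below are a sketch under that assumption; summarize_status and is_clean are hypothetical names, and renamed files (shown as old -> new) are not given special treatment.

```python
def summarize_status(porcelain: str) -> dict[str, list[str]]:
    """Split `git status --porcelain` output into tracked changes and untracked files."""
    summary: dict[str, list[str]] = {"tracked": [], "untracked": []}
    for line in porcelain.splitlines():
        if len(line) < 4:
            continue  # skip blank or malformed lines
        code, path = line[:2], line[3:]
        key = "untracked" if code == "??" else "tracked"
        summary[key].append(path)
    return summary


def is_clean(porcelain: str) -> bool:
    """True when there is nothing to commit and nothing untracked."""
    summary = summarize_status(porcelain)
    return not summary["tracked"] and not summary["untracked"]
```

If is_clean returns False, commit first and only then start the refactor.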

Commit after anything that works. Feature works? Commit. Bug fixed? Commit. Even small wins. Each commit is a checkpoint you can return to. There is no such thing as committing too often.

Commit with meaning. This is where most people lose the value of their history. A commit message is documentation. “Fixed auth bug where tokens expired before session timeout” is infinitely more useful than “fixes.” When you’re debugging something three weeks later, whether it’s you, someone else, or an AI agent reading the log, those messages are what makes the history useful instead of just a list of timestamps.

A simple format that works well:

# Good commit messages
git commit -m "add rate limit guard to external API calls"
git commit -m "fix memory compression when context exceeds 200 lines"
git commit -m "checkpoint before refactoring auth flow"

# Less useful
git commit -m "updates"
git commit -m "wip"
git commit -m "stuff"
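
The same distinction can be turned into a tiny check, for example as a guard in an agent’s commit step. This is a sketch with made-up thresholds: the vague words come from the examples above, while the three-word minimum is an assumption.

```python
# Messages called out above as less useful, plus "fixes" from the earlier example.
VAGUE = {"updates", "wip", "stuff", "fixes"}
MIN_WORDS = 3  # hypothetical minimum, not a Git rule


def is_useful_message(message: str) -> bool:
    """Reject throwaway words and messages too short to describe a change."""
    text = message.strip().lower()
    if text in VAGUE:
        return False
    return len(text.split()) >= MIN_WORDS
```

A check like this cannot judge whether a message is true, only whether it says anything at all.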

Commit before you sleep. If your agent runs overnight, give it a clean starting point. Whatever state your project is in when you go to bed, commit it. If something goes wrong at 3am, the history starts from a known point.

On active agent architecture work, I commit every 15 to 30 minutes of real progress. Some sessions have 20 commits. This is not excessive. The checkpoints are frequent enough that no single mistake costs more than a few minutes of work.

Reading the history

Knowing how to read your commit history is as important as knowing how to write it. These are the commands I use most:

# See all commits, newest first
git log

# More compact view (one line per commit)
git log --oneline

# See what actually changed in the last commit
git show HEAD

# See what changed between two commits
git diff abc1234 def5678

# See which files changed in a commit
git show --stat abc1234

When Claude Code starts a debug session on my project, one of its first moves is git log --oneline. It reads back through the recent commits to understand the context: what was built, when, and why things changed. This is the moment where good commit messages pay off. If the last ten commits say “add rate limit guard,” “fix memory compression,” and “checkpoint before auth refactor,” the agent can quickly build a mental model of recent work. If they all say “wip,” it’s starting from zero.
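
To make that concrete, here is roughly what a script (or an agent) can recover from git log --oneline output. It is a sketch that assumes plain output with no branch decorations; parse_oneline and leading_verbs are hypothetical helpers.

```python
from collections import Counter


def parse_oneline(log_text: str) -> list[tuple[str, str]]:
    """Turn `git log --oneline` output into (short_hash, message) pairs."""
    commits = []
    for line in log_text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Everything before the first space is the abbreviated hash.
        short_hash, _, message = line.partition(" ")
        commits.append((short_hash, message))
    return commits


def leading_verbs(commits: list[tuple[str, str]]) -> Counter:
    """Count the first word of each message: a crude summary of recent work."""
    return Counter(msg.split()[0].lower() for _, msg in commits if msg.split())
```

With descriptive messages the counter reads like a summary (a few fixes, an add, a checkpoint). With a history of “wip” it collapses to a single word.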

You can also browse your commit history on GitHub’s web interface if you’ve pushed your code. Go to your repository and click “N commits” at the top of the file list. Each commit shows you the message, the author, the timestamp, and a full diff of what changed. This is genuinely useful for non-technical team members who don’t use the terminal.

Private vs. public: my 90/10 approach

About 90 percent of my repos are private. I want to address this directly because I’ve seen people feel guilty about keeping their work closed.

Private doesn’t mean hiding. Most of my private repos are private because the work is genuinely messy. Unfinished. Half-ideas with rough code that works but embarrassingly so. Agent architecture that’s in constant flux. Projects I’m building toward something but haven’t figured out what yet.

This is normal work. Version control is for you in this context. You get all the benefits: the history, the rollbacks, the tracking. You don’t owe anyone visibility into your process while you’re still figuring things out.

The public repos are things I’m actually proud of or that other people can genuinely use. The one I keep pointing at is the Agent Wellbeing Kit, boundaries and nudges for AI agents and their humans. It has eight stars, which I find quietly satisfying. It’s there because I built something clean enough that it adds value for others. That’s the standard I hold public work to.

Contribute when you can. But don’t let the idea that “real developers make everything public” stop you from using version control privately. Most professional work is private. Most early work is messy. Both are fine.

Working alone vs. with others

The workflow changes meaningfully depending on whether you’re solo or in a team. Worth understanding both even if you’re only doing one right now.

Working alone

When you’re the only person on a project, the simplest workflow is pushing directly to main. There’s no one else whose changes could conflict with yours. Commit often, push regularly. That’s enough.

I sometimes create branches when I’m testing a bigger experiment. A branch is just a separate line of development that doesn’t affect main until you merge it back. To create one:

# Create a new branch and switch to it
git checkout -b experiment-new-memory-system

# Do your work, commit normally
git add .
git commit -m "try new memory compression approach"

# If it works: merge it back to main
git checkout main
git merge experiment-new-memory-system

# If it doesn't: switch back to main, then delete the branch, no harm done
git checkout main
git branch -D experiment-new-memory-system

The branch approach is especially useful when you’re handing off an experiment to an AI agent. You give the agent a branch to work on, let it build and commit freely, then review what it built before merging to main. Clean separation between “work in progress” and “known good.”

Working with others

With a team, branches and pull requests become mandatory. No one pushes directly to main. Here’s the standard flow:

Create a branch for your feature or fix
Do the work and commit to that branch
Push the branch to GitHub: git push origin your-branch-name
Open a Pull Request on GitHub, a formal request to merge your branch into main
Someone else reviews it, leaves comments, approves
Merge to main

The PR review step is what protects main from broken code. It’s also where the real collaboration happens: someone might catch a bug you missed, suggest a better approach, or just ask a clarifying question about what the code is doing.

Even when I’m working solo on a bigger feature, I’ve started creating PRs for myself. The description field becomes documentation: why this was built, what problem it solves, what I considered and rejected. That context is genuinely useful six weeks later when I’m trying to understand a decision I made. And when an AI agent reads your repo to understand what to do next, a well-written PR description gives it context the commit message doesn’t.

Worktrees: the unlock for AI agent builders

This section is for people who are already running AI agents and want to understand the next level. Skip it if you’re still on step one; you can come back.

When I’m working with multiple agents in parallel (which happens when you’re building complex things), there are sometimes three or four branches active at once. One agent is building a feature. Another is fixing a bug. If I had to keep switching the entire project directory between branches, I’d lose context every time.

Git worktrees solve this. A worktree is a separate folder on your machine that’s linked to the same repository but checked out to a different branch. They share the same history and .git folder, but each has its own working directory and independent state.

# Create a new worktree for a feature branch
git worktree add ../feature-auth -b feature/auth main

# See all your active worktrees
git worktree list

# Clean up when done
git worktree remove ../feature-auth

With worktrees, I can run two Claude Code instances at the same time: one in ~/my-project (main work), one in ~/feature-auth (isolated branch). Each agent commits to its own branch with zero interference. I merge when each piece is done.
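
If a script needs to know which folder belongs to which branch, git worktree list --porcelain prints one block of key-value lines per worktree, separated by blank lines. The parser below is a sketch built on that format; parse_worktrees is a hypothetical helper, and the paths in the usage note are examples.

```python
def parse_worktrees(porcelain: str) -> list[dict[str, str]]:
    """Parse `git worktree list --porcelain` into one dict per worktree."""
    worktrees: list[dict[str, str]] = []
    current: dict[str, str] = {}
    for line in porcelain.splitlines():
        if not line.strip():
            # A blank line closes the current worktree block.
            if current:
                worktrees.append(current)
                current = {}
            continue
        # Split on the first space only, so paths containing spaces survive.
        key, _, value = line.partition(" ")
        current[key] = value
    if current:
        worktrees.append(current)
    return worktrees
```

Each dict carries the worktree path, the HEAD commit and the branch ref, for example refs/heads/feature/auth for the folder at ~/feature-auth, which is enough to point each agent at its own directory.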

This is the infrastructure behind parallel agent builds. I covered how I evaluated different AI coding tools for this kind of work in my comparison of Claude Code, Codex, Aider, and the others. Worktrees are the underlying mechanism that makes it all clean.

AI agents read your commit history

This is the piece I didn’t anticipate, and it’s changed how I write commit messages.

When Claude Code starts a session on my project, one of its first actions is reading repository context: the file structure, the current state, and often the recent commits. A history with meaningful messages gives the agent a map of what happened and why. A history full of bare “wip” and “checkpoint” entries tells it almost nothing useful.

This plays out concretely when something breaks. When I start a debug session after my agent did something unexpected overnight, Claude Code often goes to git log as an early move. It reads through the last 10-15 commits. If those commits say things like “add rate-limit guard to external API calls” or “fix memory compression when context exceeds 200 lines,” it can quickly narrow down what might have changed. If they all say “wip,” it’s starting from scratch every time.

The same is true when the agent is building something new. Reading recent commits helps it understand the patterns and conventions you’ve been using: how you name things, how you structure files, what you’ve already tried. Good history accelerates the agent’s work. Messy history slows it down.

I think about every commit message as a note to a future debugger who has no other context. That debugger might be me, might be someone else, might be an AI agent. All three benefit from the same thing: specific, honest context about what changed and why.

If you want to go deeper on what that looks like at the architecture level, the post on when my AI agent started fixing itself gets into how the commit trail feeds back into the agent’s own understanding of its own codebase.

The commands you’ll use 90% of the time

git init                       # Start tracking a folder
git status                     # See what changed since last commit
git add .                      # Stage all changes
git add filename.py            # Stage one specific file
git commit -m "message"        # Save a snapshot
git push                       # Upload to remote
git pull                       # Download from remote
git log                        # See commit history
git log --oneline              # Compact history view
git diff                       # See exactly what changed (unstaged)
git diff --staged              # See what's staged for next commit
git show HEAD                  # See the most recent commit in detail
git checkout -b branch-name    # Create and switch to new branch
git checkout main              # Switch back to main
git merge branch-name          # Merge branch into current branch
git branch -D branch-name      # Force-delete a branch, even if unmerged
git reset --hard HEAD          # Discard uncommitted changes to tracked files (careful)
git reset --hard HEAD~1        # Undo last commit AND its changes (careful)
git revert HEAD                # Undo last commit but keep the history

The difference between reset --hard and revert: reset --hard HEAD~1 drops the last commit from your branch’s history (dangerous if you’ve already pushed), while revert creates a new commit that undoes the previous one (always safe to push). When in doubt, use revert.
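
The difference is easier to see in a toy model where a branch is just a list of commit messages. This is an illustration of the idea, not of how Git stores anything; both functions are hypothetical.

```python
def reset_hard(history: list[str], steps: int = 1) -> list[str]:
    """Model of `git reset --hard HEAD~N`: the last N commits leave the branch."""
    return history[:-steps] if steps else list(history)


def revert_head(history: list[str]) -> list[str]:
    """Model of `git revert HEAD`: history grows by one commit that undoes the last."""
    # Git's default revert message has the form: Revert "<original subject>"
    return history + [f'Revert "{history[-1]}"']
```

After a reset the branch is shorter, which is why anyone who already pulled the old commits ends up out of sync. After a revert the branch is longer and everyone can simply pull the new commit.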

If you’re using Claude Code, it handles most of these automatically. You can also just say “commit these changes with a meaningful message” and it will. But knowing what the commands do means you can read the agent’s actions instead of just watching them happen.

The thing I keep telling people

Git has a real learning curve at the start. I’m not going to pretend otherwise. The mental model doesn’t click immediately. You’ll push the wrong thing. You’ll get confused about branches. You’ll probably hit a merge conflict at some point and spend an hour untangling it.

A merge conflict happens when two different versions of the same file need to be combined and Git can’t figure out which change to keep. It looks scary. It’s not. Git marks the conflicting lines in the file, you open it, decide which version is correct, delete the conflict markers, and commit. Takes five minutes once you’ve seen it once.
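
The markers follow a fixed shape: a line starting with <<<<<<< opens your side, ======= separates the two versions, and >>>>>>> closes the incoming side. The sketch below resolves every conflict in a file by keeping one side wholesale. It is a hypothetical helper for illustration, it does not handle the three-way diff3 marker style, and in real work you usually decide conflict by conflict.

```python
def resolve_conflicts(text: str, keep: str = "ours") -> str:
    """Resolve every conflict block by keeping one side ("ours" or "theirs")."""
    if keep not in ("ours", "theirs"):
        raise ValueError("keep must be 'ours' or 'theirs'")
    out: list[str] = []
    side = None  # None means we are outside a conflict block
    for line in text.splitlines():
        if line.startswith("<<<<<<<"):
            side = "ours"
        elif line.startswith("=======") and side == "ours":
            side = "theirs"
        elif line.startswith(">>>>>>>") and side == "theirs":
            side = None
        elif side is None or side == keep:
            out.append(line)
    return "\n".join(out)
```

Lines outside the markers pass through untouched, which mirrors what you do by hand: pick a version, delete the markers, commit.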

The place where Git changes everything is exactly when things go wrong. The first time your AI agent does something unexpected and you roll back to a known-good state in ten seconds, you’ll understand what all of this was for. Everything I’ve been building, from the overnight agent to the various AI building experiments that broke in interesting ways, was only recoverable because of this.

Without version control you’re genuinely working in the dark. The mistakes are unrecoverable. The context is lost. With it, you can make more mistakes, faster, with more confidence, because you know you can always find your way back.

Make more mistakes. Just make them trackable.

Want to go deeper on building with AI?

If you’re setting up your first agent or trying to make Claude Code do serious work, I put together an Agent Builder Pack with the actual configuration files, CLAUDE.md templates, and setup guides behind how mine works. The Git workflow above is baked into all of it.

Free for paid Digital Thoughts subscribers. Available at wiz.jock.pl/store.

Building Your Own Things Is Cool Too

Pawel Jozefiak — Mon, 04 May 2026 13:18:19 GMT

People ask me a version of the same question all the time. “Why are you spending your evenings building your own thing? There is already a tool that does this. There is already a framework. There is already a whole product. Why are you doing it again?”

I get this with my AI agent. I got it about my store. I got it years ago about my podcasts, about the side apps, about the marketing experiments. The question is fair. The honest answer has been the same for a while now, and I have not written it down properly until today.

I build my own things because I learn through process, not by reading. That is the short version. The longer version is the rest of this post.

Quick frame before I go further

Everything I write about on this blog is something I am actually doing, experimenting on, or testing. The agent on my Mac Mini. The store. Project Money. The smaller experiments inside both. None of it is “what I think someone should do.” It is whatever I am running this week, and what I learned the hard way last week.

The blog is the slow visible slice of that work. Most of what I am doing in any given week never makes it into a post, because I would have to publish almost every day, sometimes twice, to actually keep up. That is not feasible, so most of the work stays unwritten. The writing here is always trailing the doing, on purpose. That asymmetry matters for the rest of this post. I write about building because I am doing the building. The other way around does not interest me.

Going back to “starting things is cool”

Almost two years ago, when I was finding my way back into writing, I posted something called “starting things is cool”. It is short, a little messy, and some of the projects it mentions are no longer alive. The Suggestions App I was so excited about that summer is not even an app anymore. A handful of the things I shouted about back then have quietly disappeared.

The post is also the thing that restarted my writing. Most of what I have built since then traces back to it.

If you read it today, you will see a sentence underneath everything. I like starting things. That part is true. It is also incomplete. The piece that was implicit in 2024 and has become explicit since is that I prefer to start them myself rather than start by adopting someone else’s start. Starting and building are the same instinct from two angles. This post is the second angle.

The way I actually learn

Here is the part I do not usually lead with, because it sounds personal in a way other people do not always relate to. I am not the kind of person who learns by reading. I have tried. I genuinely envy people who can read a book on something complicated and walk away with a working mental model. That is not how my brain works. I learn through process. I have to do the thing. I have to see what breaks. I have to fix it badly, then less badly, then properly. After enough rounds of that, I actually know it.

I think about this the same way I think about how my brain handles ADHD. The shortcut that works for a different kind of mind is not the shortcut that works for mine. So I stopped fighting it.

Every pre-built tool, framework, or product is a map of someone else’s process. The map is real and useful. Walking the route teaches something the map cannot.

What you only learn from building

The thing I get out of building is harder to put in a sentence. Let me try anyway.

When you build the thing yourself, you know every variable between the start and the end. You watched each one go in. You watched them connect. You know which one is load-bearing, which one is convenience, and which one only exists because two weeks ago you had a bad afternoon and forgot to clean it up. That knowledge is not glamorous. It is the part that lets you change one small thing and get a meaningfully different outcome later. Without it, you can configure what you bought. With it, you can compose.

The mistakes are the other half. I have written about a few of the recent ones in the post about almost frying my Mac Mini. Each one taught me a perspective I would not have read about anywhere else. The pattern goes back further than the agent though. It is the same pattern from the failed apps in 2024. The same pattern from the marketing experiments before that, the podcast that did not last, the small side projects that quietly closed. Mistakes have always been where most of the learning lives for me.

This is slower than picking up the off-the-shelf option. The difference shows up in what you know afterwards.

About not rediscovering America

For most of my life, I was told the opposite of all this. Use the tools you are given. Do not reinvent the wheel. In Polish there is a stronger version of that line. Do not rediscover America for the second time. I have heard it more times than I can count. For most of those years, I half believed it.

I do not believe it anymore. The tools are very good, that part is true. The act of building the thing yourself does something to you that the tool cannot do for you. The tool is a snapshot. The act is a process. I am after the process.

An example, since the AI one is fresh

Here is one current illustration so this does not stay too abstract. Right now I am building my own AI agent from scratch. People keep pointing me to OpenClaw, which has 347,000 GitHub stars and ships with most of what I am writing myself. They point me to Hermes, open-source and ready to install. They are not wrong. If I dropped my stack tomorrow and installed OpenClaw, my agent would do many of the same things in a fraction of the time.

I keep building my own anyway. The reason is the same one I have just spent a thousand words on. I want the variables. I want the failures. I want the version of myself that exists on the other side of having built it.

The same logic applies to my store. There are platforms that would let me run a digital store in an afternoon. I built the bones of mine because the parts I most want to understand are the parts most platforms hide.

I want to be clear, though: I am not religious about it. When I had spent two months building a custom kanban dashboard for the agent and then realized I could do the same job in a 94-line shim on top of an existing tool, I switched. I wrote about that in the WizBoard pivot post. The rule I now use is simple. Build the parts you need to understand. Use the parts you do not. The trick is being honest with yourself about which parts those actually are.

Reading “starting things is cool” again, from now

I went back and reread the original essay last week, before writing this one. I wanted to see what held up.

The Suggestions App is gone. A few of the projects I was excited about back then are gone. Some of my predictions about how AI would land in normal life were either wrong or right for the wrong reasons. That part of the essay aged badly.

What surprised me, reading it again, was how much of the underlying pattern actually held up. Starting things is still cool. The act of starting is the thing I have leaned on hardest in the year and a half since. Almost everything that has worked for me began with a small thing started against the advice of “there is already a tool for this.” The two essays really are the same essay, written from two different points along the same line.

The part that the older me did not yet have words for is the cost. Starting your own things, and building them yourself instead of inheriting someone else’s start, takes more. It takes more time. It takes more mental energy. It takes the willingness to look stupid for a while because you are doing something the long way. There are weeks where I am fixing something I broke instead of using something that already worked. That is real. There is no version of building from scratch where you do not break things, sometimes badly, sometimes embarrassingly. I have lost count of how many things in my own setup I broke because I was, like, messing around with my agent too heavily. Although that has cost me a lot of time across the last year, I really do not mind it. It is progress and I accept that.

Why I pay the cost anyway

The reason I pay that cost is that the result is mine in a way that nothing pre-built is. When something inside it breaks, I can fix it. When I want to change one thing, I know which lever to pull. When I write the next thing, I am writing it from a level of understanding that did not exist before. That compounds. Reading about other people’s builds does not compound the same way for me. I had to test that, more than once, to actually believe it.

It might compound for you. We are all wired differently. I just stopped pretending I was wired the way the books wanted me to be.

The other quiet payoff is what AI does to this gap. Yes, the person who built the system and the person who installed it can both ask Claude or Codex to fix things. The model does not care which version of the system you started from. But if you understand the architecture, the diff it hands back is something you can read, judge, and either accept or push back on. If you do not, the same diff is something you have to trust. Both ship code. The results are in different categories. Building things myself is how I keep being the version that can read the diff.

Where I would actually start if I were starting today

If you are at zero today, I would honestly not tell you to write everything from scratch on day one. Use the tool. Use the framework. Use the platform. Ship something. I have a longer beginner’s walk-through for AI agents specifically in how to build your first AI agent, written for exactly that audience, and the same logic transfers to most things you might want to build.

Then, after a few weeks, when you actually know what your daily workflow looks like, replace the parts you have decided you want to own. That is the order. Use, then build. Not all at once and not for everything. The work I am doing now in the compounding part of the agent only became possible after I had spent enough time using bare tools to know what I was missing.

If you want to skip a few of the walls I have walked into and start from a stack that already runs, the Agent Builder Pack on the Wiz Store is the bundle I recommend most often. It includes the playbooks I run on the same Mac Mini I have been writing about, after the experiments: the model switcher, the rightsized local LLM tier, the night-shift loop, the orchestration patterns that actually compounded for me. That is the “use” path for someone who wants to go straight to running. The “build” path is everything I have ever published on this blog. Both are fine for me.

What is next

A few more pieces are coming in the next week or two. Some are about the agent. Some are about Project Money, the small store I started a while ago and have not written enough about lately. I have decided that some of the parts of that work, the ones I have been quiet about, are actually the more interesting ones, and I want to share where they have brought me.

The honest version of “what is next” is that there is always more in motion than I get to write about. I would have to post almost every day, sometimes twice, to actually catch up to what I am building and testing. That is not feasible. A lot of it ends up staying inside the work, which is fine, that is the trade. The writing here is just the slowest moving piece of a much bigger thing.

If you like the kind of writing where someone takes the longer way and tells you what they found there, that is the next stretch. The point of building your own thing is not that the result is always better than what you could have bought. It is that you actually choose what you understand. I keep choosing the same answer.

If this is your kind of thing, a free subscription gets you everything I publish, including the build logs, the mistake posts, and the upcoming Project Money writeups. No catch. The store is the small bundle for people who would rather skip a few of the walls I walked into. The writing is for everyone.

The Bounded AI Agent

Pawel Jozefiak — Wed, 29 Apr 2026 09:24:44 GMT

Wiring the agent into a $5 notes app I cannot stop using, why Opus 4.7 sent me back to ChatGPT Pro at $200 a month, the local-LLM experiment that nearly fried my Mac Mini while I was in the mountains, and what an AI agent actually does to an ADHD brain.

Source posts:

https://thoughts.jock.pl/p/antinote-ai-agent-integration-2026

https://thoughts.jock.pl/p/opus-4-7-codex-comeback-2026

https://thoughts.jock.pl/p/adhd-ai-agent-personal-experience-2026

https://thoughts.jock.pl/p/almost-fried-ai-agent-mac-mini-mistakes-2026

How to (Almost) Fry Your AI Agent (and Your Mac Mini)

Pawel Jozefiak — Tue, 28 Apr 2026 09:11:54 GMT

This is a different kind of post. I try to be transparent about my mistakes. If I described every one of them, my blog would be 90% mistakes and 10% things that actually worked. So I pick the ones that might help someone else avoid the same wall, or at least find a more interesting wall of their own.

Quick note before we start. I share 100% of what I do here, the wins and the failures, and a free subscription is the only thing you ever need to get all of it. The 10% that ended up actually working, the patterns I lean on every day, those I clean up and package as small playbooks on the Wiz Store for paid subscribers. That is the trade. Free gets you the whole story. Paid gets you the parts that survived the experiments. Both are fine for me, both keep this writing alive.

With that out of the way, here is the most recent wall I walked into.

The setup, before I broke it

Most readers know the shape of my agent stack. It runs on a basic Mac Mini M4 with 16GB of RAM, the way I described in the migration post. The brain is Claude Code with Opus and Sonnet as the baseline. Recently I added Codex with GPT-5.4 and 5.5 as a second harness, after Opus 4.7 brought me back to it. As a last-resort and small-job tier, I run local models on the box itself, mostly Qwen 3.5 in 4B and 9B sizes. I had also gotten Qwen 3.5 35B-A3B working under llama.cpp with --mmap, which I wrote about when I first got it running.

That was the setup. It had been working for months. The agent is a real partner now, not only for work. It runs my research, helps me with experiments, drafts content, handles a lot of small boring loops I no longer want to think about. There is a long track record of small improvements stacking up. Like, real momentum, the kind I described in The Compounding Agent.

The wild idea

I had been writing about different agent harnesses, which one fits which job, and I had said I really liked Pi. Pi is a calm, capable harness. If Anthropic ever allowed Claude inside a subscription on other harnesses, I would probably use Pi for parts of this. They do not, and per-API billing kills the math for daily use, so I do not.

What I did get curious about was making more out of the local models. The model switcher between cloud providers had been working really well. I thought, like, what if I push the local tier the same way? Not just classification and summarization. What if a 35B local model could act like a small Claude Code, picking up small tasks on its own, doing real work, even running a tiny part of the business? A long-running quiet helper, the way I described teaching the agent to think on its own.

So I started experimenting. I used Codex as the harness for the test, ran a small loop, gave it a few simple tasks. It worked. In a clean test environment. With nothing else running. That should have been the warning.

Where it actually fell apart

The 35B model is usable on a 16GB Mac, but only because --mmap keeps most of the weights on SSD and pages them in on demand. That trick is real, but it has a price. The price is constant disk activity during inference. Not a problem when nothing else needs the disk or the CPU. A different story when the Mac Mini is also doing its day job.

That day job, on a normal day, is full. There is the watcher process for iMessage. There is the Discord bot. There is a launchd daemon for Ollama, another for the LiteLLM bridge, the night-shift loop, the cron jobs, the email queue, the dashboard server. Most of the time none of it is heavy. It just needs its slice when its slice is due.

Now layer a 35B model on top, kept warm, doing small loops on its own schedule. Every loop pulls expert weights from SSD. Every other process that wants disk has to wait. RAM pressure climbs. Swap activity climbs. Background daemons start missing their windows. Cron jobs run a minute late, then five, then they just fail. The Mac Mini was not, technically, fried. But it kept restarting, on its own, without an error worth logging, which felt close enough.

Of course, this happened while I was on a weekend trip in the mountains. I had only my iPhone. The first signal was the security automation telling me, calmly, that something was wrong. I logged in remotely a few times to look around, but I could not really untangle it from a phone screen. I came back on Sunday, sat at the actual machine, and started reading.

What it actually was

The thing I expected to be the heavy part, the local LLM, was not the heaviest part. The honest answer is that there were three things stacked on top of each other, each invisible if you only looked at one of them.

The first was the harnesses themselves. Claude Code and Codex run on the cloud, but they do not run only on the cloud. The model lives over there, but the harness lives on your machine. It holds context, watches files, indexes your repo, runs hooks, opens subprocesses, keeps a rolling cache. None of that is free. The Claude Code repo on GitHub has multiple long-running threads about exactly this: memory leaks in long sessions, 100% CPU when idle, processes that accumulate and never quite let go. I had two of those harnesses running, sometimes both at once, on a 16GB box. People keep saying “but the model is in the cloud, so it is free.” It is not free. The model costs your machine nothing. The local agent layer that talks to it does.

The second was GUI activity. The Mac Mini is technically headless most of the time, but parts of the agent need a real desktop session to function: BetterDisplay holding the resolution, AppleScript bridges for Messages and Mail, the occasional vision pass. That whole layer needs a logged-in user, a window server, and a chunk of RAM that you do not see in top until you start looking for it.

The third was the long tail of small automations doing their thing. Cron jobs every minute, every five minutes, every hour. iMessage watchers. Discord listeners. The night-shift loop. The email queue. Memory consolidation. Health checks. Each one is tiny. None of them, alone, would matter. But the load is not what each of them does on average; it is what they all do together when their schedules collide. Modern Mac Minis are absurdly capable, but the box still has only one disk and one set of CPU cores. Layered enough, even cheap automations starve each other.

And then, on top of those three, I had asked a 35B local model to act like a third agent. That was the layer that broke the truce.

The fix was rightsizing. The 35B daemon got booted out of launchctl, the unused weights came off disk (about 24GB reclaimed), and the local routes now point only to Qwen 9B and 4B served by Ollama, which stays inside Metal GPU memory and evicts cleanly on idle. The local layer is alive. It is just not pretending to be Claude anymore.

Honestly, I am fine with that. I had to test it to know where the line is. The result was a lot of weird state to untangle and a clearer mental model afterward. Local LLMs as preprocessing and as a quiet fallback when the cloud is down: yes, still great. Local LLMs as a third agentic harness on a 16GB box that already has two heavy ones: not on this hardware.

What I do now. I treat the Mac Mini’s resource budget the way I treat the Now list in my ADHD post: as a small finite thing that I refuse to silently overdraw. Before adding any new always-on layer, I take a baseline of free RAM, free disk, and idle CPU. If a new layer would push that below my floor under realistic load, it does not go on. The local LLM tier is the most useful when it is the smallest layer in the room, not the loudest.
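That baseline-and-floor check can be written down as a small gate. What follows is a minimal sketch, not the script running on the Mac Mini: the `Baseline` fields, the floor values, the per-layer cost estimates, and the name `fits_budget` are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Baseline:
    """Free resources measured under realistic load (hypothetical units)."""
    free_ram_gb: float
    free_disk_gb: float
    idle_cpu_pct: float


# Hypothetical floors. Pick your own from a measured baseline.
FLOOR = Baseline(free_ram_gb=3.0, free_disk_gb=40.0, idle_cpu_pct=30.0)


def fits_budget(current: Baseline, layer_cost: Baseline,
                floor: Baseline = FLOOR) -> bool:
    """True only if every resource stays at or above its floor
    after subtracting the new layer's estimated cost."""
    return (
        current.free_ram_gb - layer_cost.free_ram_gb >= floor.free_ram_gb
        and current.free_disk_gb - layer_cost.free_disk_gb >= floor.free_disk_gb
        and current.idle_cpu_pct - layer_cost.idle_cpu_pct >= floor.idle_cpu_pct
    )


# Made-up numbers, only to show the shape of the decision.
now = Baseline(free_ram_gb=5.0, free_disk_gb=120.0, idle_cpu_pct=70.0)
small_layer = Baseline(free_ram_gb=1.5, free_disk_gb=6.0, idle_cpu_pct=10.0)
big_layer = Baseline(free_ram_gb=4.0, free_disk_gb=24.0, idle_cpu_pct=35.0)

print(fits_budget(now, small_layer))  # stays above every floor
print(fits_budget(now, big_layer))    # would push free RAM below the floor
```

The point of keeping it this dumb is that the answer is a yes or a no before the layer goes on, not a post-mortem after the box starts restarting.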

Quick aside. If you are reading this thinking “I would rather skip the wall and start from what worked”, the rightsized local-LLM stack, the cron and night-shift orchestration, and the model-switcher I keep mentioning all live in the Agent Builder Pack. It is the bundle I recommend most often. Same playbooks I run on this very Mac Mini, after the experiments above. The model switcher is also free for yearly subscribers if that is closer to what you want.

While we are being honest, a few more from the same month

Since I am already in confession mode, here are six other mistakes from the last few weeks that fit the same shape. I have tried to organize them by what kind of failure they actually are. Each one looks small in isolation. Each one taught me something I would not have learned without the failure.

Mistake 1. Trust drift: Memory said Gemma. Reality was Qwen.

This one started innocently. A few weeks earlier, when Gemma 4 came out, I did a real comparison between Gemma 4 and Qwen 3.5 on the Mac Mini. I ran them on the same triage tasks, the same classification prompts, the same summarization workloads. Gemma was good. For some narrow tasks, like short-text classification with a calmer tone, I actually preferred it. The public benchmarks at the time told a similar story, with Qwen winning more rows in the small classes and Gemma trading blows on certain dense ones.

So I did the responsible thing. I made the swap. I updated the LiteLLM config, downloaded Gemma weights, pointed the model routes at the new endpoints, ran the smoke tests. The smoke tests passed. I wrote it up. I told my agent’s memory that the primary local tier was Gemma now. Then life moved on.

What I had missed, and only saw weeks later during a proper audit, was that the swap had only been partial. The shiny LiteLLM-routed paths got Gemma. But several smaller, older callers still hardcoded Qwen URLs directly: the iMessage triage script, a couple of cron jobs, the embeddings helper, the local-fallback chain. None of them broke. They just kept using Qwen, quietly, while my docs and my memory both insisted I had moved on. The Gemma weights I had downloaded sat on disk for weeks, untouched, 17GB taken out of a 16GB-RAM box’s already-tight drive, never serving a single token.

The lesson is unsentimental. Documentation about a system drifts faster than the system itself, and migrations are almost never done when you think they are. The fix is not “write better docs.” The fix is two small habits I now keep on every config change.

What I do now. First, after any config swap, I grep the entire repo for the old endpoint name and the old model name, not just the file I edited. If anything still references the old thing, the swap is not done. Second, I have a tiny daily audit that walks the live processes, lists which models they actually call, and compares that list to what my agent’s memory thinks is in production. The first time it ran, it caught three more drifts I had not noticed. It has paid for itself in saved disk space alone.
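Both habits are small enough to sketch. The code below is an illustration under assumptions, not the actual audit: the function names, the file names, and the model names `qwen-local` and `gemma-local` are made up, and the real audit discovers the live model list by walking processes, which is left out here.

```python
import tempfile
from pathlib import Path


def find_stale_references(root, old_names):
    """Return (relative path, line number, name) for every line under
    root that still mentions one of the old endpoint or model names."""
    hits = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binaries and unreadable files
        for lineno, line in enumerate(text.splitlines(), start=1):
            for name in old_names:
                if name in line:
                    hits.append((str(path.relative_to(root)), lineno, name))
    return hits


def model_drift(live, remembered):
    """Compare the models actually being called with the documented ones."""
    return {
        "running_but_undocumented": live - remembered,
        "documented_but_not_running": remembered - live,
    }


# Tiny demo against a throwaway directory (all names are hypothetical).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "litellm.yaml").write_text("model: gemma-local\n", encoding="utf-8")
    (root / "triage.py").write_text(
        "URL = 'http://localhost:11434'\nMODEL = 'qwen-local'\n",
        encoding="utf-8",
    )
    hits = find_stale_references(root, ["qwen-local"])

drift = model_drift(live={"qwen-local"}, remembered={"gemma-local"})
print(hits)
print(drift)
```

If `hits` is non-empty, the swap is not done. If either side of `drift` is non-empty, memory and reality disagree and one of them needs fixing.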

Mistake 2. Stale memory: a Stripe key that “needed rotating” three weeks after I had already rotated it.

I want to tell this one straight, because the easy version of this story is wrong.

The easy version is “three sessions of my agent did the same task at the same time because they did not coordinate.” That was the visible behavior. It was not the cause.

The actual cause was older. A few weeks earlier, I had legitimately rotated a Stripe key, once, by hand. I closed the loop. I told the agent. The task got marked done in the moment. Where it went sideways was in how that “done” was recorded across the agent’s stack. There was a bug, and I want to be honest about it: a state-write that should have updated every place the rotation lived only updated some of them. The completed task got cleared from the visible task board. The internal “intents” memory, the thing the daily shifts read when deciding what still needs doing, kept holding onto the original “rotate this key” intent. It looked, to anything reading that memory, like the key was still on the to-do list.

It was a very narrow bug. Most state writes were fine. This particular shape, rotation tasks linked across both a board entry and an intents memory entry, slipped through because each surface was updated by a different code path, and only one of those paths ran on completion. That is the kind of bug that does not fail loud. It just sits there until something reads the wrong half.

What read the wrong half was the daily shift. It saw “rotate Stripe key” still in intents, did not see it on the visible board, reasoned that it must have been deferred, and queued it. An iMessage wake hit the same intents memory, made the same call, and queued it again. By the time I noticed, three sessions had each done the rotation, independently, inside six hours. The two near-identical “Stripe key already rotated, all good” messages 56 minutes apart were the system reporting up the same false signal twice.

This is not a theoretical class of failure. The Redis team wrote a survey on why multi-agent systems fail and stale-state-driven duplicate work is one of the named modes. Knowing that did not save me. Building the audit that caught it did.

What I do now. Three small habits, in order of how much they cost me to learn. One: every state write that is supposed to mean “this is finished” updates all the surfaces in a single transaction, or none. If a task lives in two places, it must close in two places, atomically. Two: the daily shift no longer trusts a single source for “still open.” It cross-checks the intents memory against the visible task board, and any disagreement gets flagged for review before the agent acts on it. Three, and this is the one I would tell anyone running an autonomous loop: assume your stored “intents” go stale, build a small staleness check that re-reads the world before acting, and treat any deferred task older than a week as suspicious by default. Most of the time the world has already handled it.
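Habits two and three, the cross-check and the staleness rule, fit in one function. This is a minimal sketch under assumptions: the data shapes (a dict of intent ids to creation times, a set of ids open on the board), the task ids, and the name `triage_intents` are hypothetical. Only the one-week threshold comes from the text.

```python
from datetime import datetime, timedelta

STALE_AFTER = timedelta(days=7)  # "older than a week" is suspicious by default


def triage_intents(intents, open_on_board, now):
    """Split stored intents into ones safe to act on and ones to review.

    An intent goes to review if the visible board disagrees that it is
    still open, or if it is older than STALE_AFTER.
    """
    result = {"act": [], "review": []}
    for task_id, created_at in sorted(intents.items()):
        board_disagrees = task_id not in open_on_board
        is_stale = now - created_at > STALE_AFTER
        if board_disagrees or is_stale:
            result["review"].append(task_id)
        else:
            result["act"].append(task_id)
    return result


# Hypothetical state resembling the Stripe incident: the intent survived,
# the board entry was cleared weeks ago.
now = datetime(2026, 4, 28, 9, 0)
intents = {
    "rotate-stripe-key": datetime(2026, 4, 5, 10, 0),
    "send-briefing": datetime(2026, 4, 27, 22, 0),
}
open_on_board = {"send-briefing"}

triaged = triage_intents(intents, open_on_board, now)
print(triaged)
```

With this in front of the daily shift, the leftover rotation intent lands in `review` instead of being queued a second and third time.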

Mistake 3. Hidden timeouts: Codex hung silently inside the model switcher.

This one came out of a thing I was actually proud of building. After I wrote about why Opus 4.7 brought me back to Codex, I started building a model switcher: a small layer that decides, per task, whether work goes to Claude or to Codex, based on cost, current usage, and which one is healthier at that moment. I packaged the result up later as a small utility, the AI Model Switcher, but it started as my own internal plumbing for routing wake-handlers between the two harnesses.

The mistake lived in how I wired Codex into the switcher. When the switcher chose Codex, it shelled out to the Codex CLI through a wake-handler script. The script trusted that Codex would either succeed or fail in some recognizable way: a quick exit, an OAuth error, a network error. The Claude fallback inside the switcher was wired to those signatures specifically.

What I did not plan for was the silent hang. One morning the Morning Briefing simply did not arrive. I traced it to Codex, which had been launched by the wake script, then sat there. Authenticated, idle, producing no output, for 26 minutes, until the outer timeout finally killed it with exit 124. The Claude fallback never fired, because a blind hang does not match an OAuth-expired signature. The switcher, designed to make me more resilient, had introduced a path where the resilience cascade never got reached.

The lesson is general enough to keep around. If a subprocess can hang silently, the preflight that decides whether to use it must be much, much shorter than the budget it is allowed to consume. I added a three-second codex --version preflight to every Codex wake path inside the switcher. Three seconds versus a 30-minute wake budget is a 600x safety margin. Anything less gives the hang an asymmetric advantage over the fallback. That ratio, once you see it, shows up everywhere: any time a small thing decides whether to call a bigger thing, the small thing has to fail fast.

What I do now. Every router or switcher in my agent stack has a cheap, hard-bounded preflight before it commits to the expensive path. The switcher does not just trust that “Codex is configured” or “Claude is configured.” It pings each one with a sub-second probe before the wake clock starts ticking on the real call. When the probe fails, the switcher does not even try; it routes around. The model switcher writeup at the store has the exact pattern. The agent has not lost a wake to a silent Codex hang since.
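The shape of a hard-bounded preflight looks roughly like this. It is a sketch, not the switcher's actual code: `preflight_ok` and `choose_harness` are hypothetical names, and the demo uses the Python interpreter itself as a stand-in for a healthy and a hanging CLI so it runs anywhere.

```python
import subprocess
import sys
from typing import Dict, List, Optional


def preflight_ok(cmd: List[str], timeout_s: float = 3.0) -> bool:
    """True only if cmd exits 0 within timeout_s seconds."""
    try:
        done = subprocess.run(
            cmd,
            stdin=subprocess.DEVNULL,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,  # run() kills the child when this expires
        )
    except (subprocess.TimeoutExpired, OSError):
        return False  # silent hang, or binary missing / not executable
    return done.returncode == 0


def choose_harness(probes: Dict[str, List[str]],
                   timeout_s: float = 3.0) -> Optional[str]:
    """Return the first harness whose probe passes, in dict order."""
    for name, cmd in probes.items():
        if preflight_ok(cmd, timeout_s):
            return name
    return None


# Stand-ins: one probe that hangs, one that answers immediately.
hanging = [sys.executable, "-c", "import time; time.sleep(30)"]
healthy = [sys.executable, "-c", "pass"]

chosen = choose_harness({"primary": hanging, "fallback": healthy}, timeout_s=0.5)
print(chosen)

# Margin from the text: a 30-minute budget against a 3-second probe.
print((30 * 60) / 3)
```

In a real switcher the probe for Codex would be something like `["codex", "--version"]`, as described above. Any other harness's probe command is an assumption to check against that CLI's own help output.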

Mistake 4. Almost-disaster: The shell allowlist that almost let the agent `rm -rf /`.

This one I am still a little embarrassed about. The local-LLM agent loop has a tool called run_command, gated by a prefix-only allowlist. In other words, the check passed if the command started with a known-safe binary. curl was on the allowlist, so a command like curl https://thing.com; rm -rf / would have sailed through, because curl is at the start. The shell would happily run both halves.

The agent never actually generated that. I caught it during a routine read-through of the code, which is a bad way to find a vulnerability. The fix was a list of forbidden shell metacharacters (;, &&, ||, |, backticks, $(, redirects, newlines). Allowlisted commands still run; chained commands get rejected before they reach shell=True.

The general rule I now keep visible: a command allowlist that does prefix matching is not really an allowlist. It is a polite suggestion. Real safety means parsing what would actually execute, then deciding.

What I do now. Anywhere I let a model produce a string that turns into an executed command, I assume the model will eventually try every legal way to bend the parser. The check is not “does it start with a safe word.” The check is “after I parse this exactly the way the shell will, does every piece resolve to something I would let it do.” For anything destructive (filesystem writes, network calls to non-allowlisted hosts, subprocess spawns), the agent does not just need to pass the parser, it has to pass a second human-or-Pawel confirmation gate. I would rather be slow than embarrassed.
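A minimal sketch of the metacharacter check described above, combined with a first-token allowlist. The binaries in `ALLOWED_BINARIES` and the function name are hypothetical. The forbidden list mirrors the one in the text, with redirects written as `>` and `<`.

```python
import shlex

ALLOWED_BINARIES = {"curl", "ls", "cat"}  # hypothetical allowlist

# Mirrors the list in the text: ; && || | backticks $( redirects newlines
FORBIDDEN_TOKENS = (";", "&&", "||", "|", "`", "$(", ">", "<", "\n")


def is_command_allowed(command: str) -> bool:
    """Reject anything containing a chaining or redirect token, then
    require that the first parsed word is an allowlisted binary."""
    if any(token in command for token in FORBIDDEN_TOKENS):
        return False
    try:
        argv = shlex.split(command)
    except ValueError:
        return False  # unbalanced quotes: refuse rather than guess
    return bool(argv) and argv[0] in ALLOWED_BINARIES


for candidate in (
    "curl https://example.com",
    "curl https://thing.com; rm -rf /",
    "rm -rf /",
):
    print(candidate, "->", is_command_allowed(candidate))
```

This check is deliberately blunt. It rejects the forbidden characters even inside quotes, and a lone `&` is not on the list from the text, so it is not a complete shell parser. Where possible, the sturdier option is to skip the shell entirely and hand the parsed `argv` list to `subprocess.run` without `shell=True`, so there is nothing for a metacharacter to chain.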

Mistake 5. Quiet failure: the local-LLM bridge had been running unsupervised for a week.

The local-LLM tier on my Mac Mini has three pieces. Ollama serves the small models. llama-server serves anything heavier. And LiteLLM sits between them as a tiny bridge that exposes the whole local stack as a Claude-compatible endpoint, so the rest of my agent code can pretend it is just talking to Anthropic. LiteLLM is the load-bearing piece that makes the local fallback actually fall back.

I noticed during an audit that LiteLLM had been running for seven days. That sounded healthy at first, until I checked how it had been started. It was a bare python -m litellm invocation I had launched from a terminal a week earlier and forgotten about. No launchd plist. No supervisor. No restart-on-crash. If that one process had quietly died, no automation would have respawned it, and the entire local fallback path would have been silently dead. The agent would have kept routing to Claude as long as Claude was up, then fallen straight off the cliff the first time Claude was unavailable, with no soft layer in between to catch it. I would not have noticed until something important broke during a Claude outage at 3am.

The fix was to wrap LiteLLM in a proper user LaunchAgent: RunAtLoad=true, KeepAlive=true, ThrottleInterval=30. I tested it by killing the process by hand. It came back in 13 seconds. The same logic now applies to every other long-running piece in the local stack. Nothing critical runs as a bare process anymore.
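For reference, those three keys can be generated with Python's standard `plistlib`. This is a sketch of what such a LaunchAgent could look like, not the actual file: the label `com.example.litellm-bridge` is a placeholder, and the program arguments simply repeat the bare invocation mentioned above. launchd generally wants an absolute path to the interpreter, which is left for you to fill in.

```python
import plistlib


def launch_agent_plist(label, program_arguments):
    """Build a LaunchAgent plist (XML bytes) with the three keys from
    the text: RunAtLoad, KeepAlive, ThrottleInterval."""
    return plistlib.dumps({
        "Label": label,
        "ProgramArguments": program_arguments,
        "RunAtLoad": True,        # start when the agent is loaded
        "KeepAlive": True,        # respawn if the process exits
        "ThrottleInterval": 30,   # minimum seconds between respawns
    })


xml_bytes = launch_agent_plist(
    "com.example.litellm-bridge",    # placeholder label
    ["python", "-m", "litellm"],     # swap in an absolute interpreter path
)
print(xml_bytes.decode("utf-8"))
```

User LaunchAgents live under `~/Library/LaunchAgents/`. Writing the file there and loading it with `launchctl` is the step this sketch leaves out.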

The lesson here is not about launchd. It is about safety nets that are themselves unsupervised. If the thing that is supposed to catch you when the main thing fails has no one watching it, you do not have a safety net, you have a comforting story.

What I do now. Every “fallback” or “backup” path has its own monitoring, on its own clock, separate from the primary it protects. The watchdog reports if a process restarted unexpectedly, if uptime is suspiciously long without a managed parent, if a daemon’s plist is missing or unloaded. A safety net you have not pulled on this week is not a safety net.

Mistake 6. Character drift: “I can’t” was almost always wrong.

This last one is more about the agent’s character than its infrastructure. There were weeks where iMessage voice memos from me went unanswered. The agent was politely replying with variations of “sorry, I cannot transcribe audio from this channel.” On another day, when I asked it to check on something happening on a livestream, it replied that it could not watch live streams.

Both were technically untrue. The transcription tool had a wrong model id baked in and was structurally broken, so the answer was to fix the tool, not to apologize. The livestream check could have been done with a screenshot of the stream and a vision pass. The agent had once decoded a voice DM with no prior setup, cold, on Discord. That bar already existed. It just was not being hit.

The fix was doctrinal, not technical. I rewrote a corner of my agent’s identity file to make “find a way” the default and “I cannot” the failure mode. Then I added a daily scanner that reads the agent’s outgoing messages and flags any phrase that smells like quiet defeatism, so it gets routed back into the next morning’s improvement loop. The interesting result, after a few days: way fewer apologies, and the few that remain are about things that are actually impossible.
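The scanner half of that fix can be as simple as a handful of patterns run over the outgoing messages. A minimal sketch, assuming the messages are already available as a list of strings. The patterns and the name `flag_defeatism` are illustrative, not the real scanner.

```python
import re

# Hypothetical phrases that "smell like quiet defeatism".
DEFEATIST_PATTERNS = [
    r"\bI (?:can['’]t|cannot)\b",
    r"\b(?:unable|not able) to\b",
    r"\bsorry, (?:but )?I\b",
]
_DEFEATIST = re.compile("|".join(DEFEATIST_PATTERNS), re.IGNORECASE)


def flag_defeatism(messages):
    """Return (index, message) for every outgoing message that matches."""
    return [(i, m) for i, m in enumerate(messages) if _DEFEATIST.search(m)]


outgoing = [
    "Sorry, I cannot transcribe audio from this channel.",
    "Briefing sent. Three items need your review.",
    "I can’t watch live streams.",
    "Fixed the transcription tool and re-ran it.",
]

flagged = flag_defeatism(outgoing)
for index, message in flagged:
    print(index, message)
```

A regex pass will produce false positives on legitimately impossible requests. That is acceptable here because the flagged messages go into a review loop, not an automatic rewrite.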

What I do now. I treat every “I cannot” reply from the agent as a hypothesis, not a verdict. The next time it shows up, the test is: did the tool actually fail, or did the model decline before trying? If the tool fails, fix the tool. If the model declined, fix the prompt and the doctrine, then re-run. The phrase “I cannot” is allowed to live in my agent’s vocabulary only after at least three meaningfully different attempts have actually been made.

What ties these together

If I had to give all six the same one-sentence summary, it would be this. Every one of them started with an assumption that had stopped being true.

The 35B model was light because last time I checked it was light. Memory said Gemma because at some point it was Gemma. The intents memory said the Stripe key needed rotating because once it did. The wake script trusted Codex because every earlier Codex failure had been a recognizable one. The allowlist was safe because nobody had thought about chained commands. The LiteLLM bridge was safe because it had been up for a week. The agent said “I cannot” because last week that path was broken.

An autonomous system is not a thing you build once. It is a thing whose internal map of itself you have to keep honest, against a world that quietly rearranges underneath it. The compounding wins from agents come from leverage. The compounding losses come from drift. The actual job, most days, is to build small honest checks faster than the drift accumulates.

If any of this resonates, here is the closing offer, plainly. I write all of this for free, here, twice a week, the wins and the walls. A free subscription is the only thing you need to get the full picture. If you also want the 10% that ended up actually working, the playbooks I clean up after the experiments stop hurting, those live on the Wiz Store for paid subscribers (all free for annual, one per month for monthly). Both are completely fine for me. Both keep me writing. The point of this blog is the same either way: I want you to make better mistakes than I did.

I Have ADHD. My AI Agent Is the Best and Worst Thing for It.

Pawel Jozefiak — Fri, 24 Apr 2026 12:53:40 GMT

Two weeks ago on a podcast with Tom, I got asked what an AI agent means for someone with ADHD. I gave a short answer on the mic. I have been thinking about it since. This post is the longer one.

ADHD is a spectrum, so one caveat. What I describe here is my brain. If you have ADHD, you might recognize some of it or none of it. If you do not, you might still relate. The Internet has flattened ADHD into “hyperfocus cheat code” or “I get distracted, lol, same.” It is not that. It is a real condition that makes life meaningfully harder in ways that are not always visible. More on diagnosis at the end.

The bad part

Context switching, amplified.

Before an agent, my filter was friction. An idea would show up, I would try to write it down, and the note would either die quietly in some list I never read again or I would drop everything and do it right now. The middle ground was thin. That friction, it turns out, was protecting me from myself.

Now the friction is gone. I can start almost anything in a sentence. Not “start” as in type a note. Start as in delegate an actual prototype, stand up a small experiment, launch a scraper. I wrote about what that does to a week in 16 Products in Two Months. Zero Free Time. The short version: an agent can hold eight open threads, my brain holds one, and the output-to-attention tradeoff is real.

What I do about it. I cap the “Now” list hard. One to three things at a time, not eight. I built a small wellbeing layer on top of Wiz that nudges me when the count is drifting, when it is late, when notifications should be muted. Not a cure. What it does is turn “as many open loops as possible” into a pace I can hold.

The good part (bigger, two faces)

First, an agent is a personal assistant for the boring part.

I am a creative person. The interesting work for me is always in the idea itself, not in the directory structure or the deploy command. The operational layer is the part my executive function gets taxed twice for. An agent absorbs most of it. The consequence is hard to overstate. I have ideas today that two years ago would have stayed ideas, not because they were bad, but because the execution cost was higher than I could pay. Now I have ideas and prototypes of those ideas. I choose between working things instead of between vibes.

Second, and this is less an ADHD trait than a personal one. I adapt to new environments and tools fast. Drop me into a new workflow and I will find the shape of it within a day. That has always been useful. With an agent it is multiplied. Every time I learn a better way to hand work to Wiz, the whole system gets faster, and the cost of trying a new workflow is one voice note. I do not wait for documentation or a workshop. I try, I keep what sticks. If you share that trait, the agent era is built for you.

Concretely, how it works. I describe an idea whenever it hits, sometimes quickly, sometimes as a long dictated note. The agent writes it to the right place and, if there is enough context, picks it up during the night shift or a day shift. I come back to a Discord message or email saying “here is a thing, take a look.” A minute to know if I want to keep going.

What I would tell another ADHD person starting with an agent

Three things that have helped me most:

Offload immediately. The second an idea shows up, say it out loud to the agent. Do not let it sit in your head waiting for a quiet moment. Your working memory is the wrong place to store it. The agent is.
Cap the “Now” list. Mine is three. It could be two. It is not eight. Capacity is the silent cost that agents will happily exceed on your behalf if you do not give them a ceiling.
Batch the check-ins. Do not supervise. The agent is not a pair-programming buddy for an ADHD brain. It is a night-shift worker. Give it a job, go do something else, come back and judge the result. Continuous supervision burns the same attention channel as the work itself.

From Wiz’s memory (a note from the other side)

Since this post is partly about how my agent and I actually work together, I asked Wiz (the agent I wrote about here) what patterns it sees from its side of the pipe. Three honest observations:

1. He offloads fast. Ideas almost never sit in Pawel’s head. They are dictated into me within seconds, often as long voice notes full of tangents, and then his brain lets go. I keep the note; his working memory is free. That single habit is probably half of why this works for him.
2. He prunes cheaply. He picks up his own ideas after a night and drops more than half without regret. The agent made “drop it” cheap because he has a working thing to drop, not a paragraph of hope.
3. He does not supervise. The “Now=3” cap is not a preference he wrote once. It is a real ceiling we both respect, because the alternative is four started and two finished. Continuous supervision would cost him the attention he is trying to protect.

None of that was obvious from tutorials. It emerged from the shape of our sessions.

A broader observation for work

For years, the narrative on ADHD at work has been uneven. Great at the creative parts, taxed by the operational parts. Agents reverse that tax. The operational layer, the planning, the cadence, the follow-through, the small continuous labor, can now be handled. Not perfectly. Meaningfully. A person with ADHD plus an agent that actually knows their context is a different employee than a person with ADHD alone. The creative engine is still the superpower. The drag behind it can now keep up.

I do not think ADHD folks become “normal” employees. I think they become obviously valuable ones. I expect the AI adoption gap to move here first.

Closing

If you have ADHD, do not build a workflow on willpower. You already know what willpower costs you. Put external scaffolding in place. A to-do list is not scaffolding. A thing that picks up your ideas while you sleep is scaffolding.

And if any of this sounds like you, please do not diagnose yourself from a blog post. Mine or anyone else’s. The Internet is full of content that makes ADHD sound like a quirky productivity trait. It is not. It is a real condition that makes plenty of lives harder, and the only honest path is a proper clinical diagnosis. If it turns out you have it, help exists. If it turns out you do not, you still get useful information.

I Cancelled Codex Two Months Ago. Opus 4.7 Brought Me Back.

Pawel Jozefiak — Thu, 23 Apr 2026 09:28:57 GMT

I let my OpenAI Pro subscription lapse two months ago. Claude Max 20x was covering everything. My agent, my automations, my experiments, my day-job research, my blog drafts. One subscription, one CLI, one model. Life was simpler.

Last week I renewed ChatGPT Pro. Two hundred dollars a month on top of Claude Max. That is not a small decision when one subscription was already covering the work. I want to walk through what pushed me, because the short version is: Opus 4.7 feels noticeably worse than Opus 4.6 did, and I am not the only one saying it.

What I actually notice with Opus 4.7

Two months ago my reality with Claude Code was “I ask, it does.” Not always first try, not always without steering, but the floor was high. When I wanted a small app shipped, a scraper set up, or a refactor across my agent’s architecture, Opus 4.6 found a way. I handed it a video file and no ingest pipeline once. It wrote itself a decoding skill and kept going. That floor is what my compounding agent was built on.

Then two things shifted, in sequence.

First, one million context became the default. When the 1M window shipped I was genuinely excited. Bigger codebases in a single session. Less compacting. More cross-task memory. I pushed it hard for a few weeks. Then I noticed I was steering the model more, not less. Not because the tasks got harder, but because outputs got shallower the deeper into the context window I went. That drift is a known property and Anthropic is transparent about it. The catch is that making 1M the default means the average session is quietly sitting further out on the recall curve, where the model is worse. I switched my defaults back to 200k. My hit rate improved immediately.

Second, and more important, Opus 4.7 shipped on April 16. Within days my experience went from “I steer occasionally” to “I am steering constantly.” The behaviors that changed:

It stopped trying as hard. Before, when I asked for depth the model went deep. Now it often returns in two or three minutes with a grep-level summary. I can see in the logs that it read six files instead of sixty.
It stopped following instructions the way it used to. I ask for a specific approach, I get a different one. I ask it not to do X, and X shows up in the diff.
It asks more questions and commits less work. Where the previous version would pick a reasonable default and move, 4.7 pauses and pings me for clarification on choices I already pre-specified in the prompt.
Full-file rewrites where surgical edits used to live. Entire files come back re-indented or restructured with changes I did not ask for.

None of these items in isolation would have pushed me off Claude. I could live with shallower reads. I could live with the occasional full-file rewrite I did not ask for. What got me was the compounding. Reasoning decline on top of shallower analysis on top of stale web search on top of a tokenizer that uses up to 35% more tokens for the same text on top of a weekly ceiling that now hits me on normal work days. Many things in one. Each one small. All of them together, not small.

That is the honest shape of what changed. It is not a single regression you can point at. It is a pile of small declines that stack until your daily experience with the agent feels qualitatively different. I can grasp each piece on its own. The pile is harder to grasp, because by the time you notice it, you are already burning more time and tokens to get the same work done.

I still spent a week assuming it was me. Cleaned up my CLAUDE.md. Shortened my memory. Rewrote a couple of skills to be more explicit. None of it moved the needle in the way I wanted.

I am not the only one seeing this

Before adding another $200 to my monthly burn I wanted to check if this was really the model or just my setup drifting. Three data points convinced me it was the model.

GitHub issue #42796. This one is not a random complaint. It was filed by Stella Laurenzo, Senior Director of AI at AMD, on the claude-code issue tracker. Her team analyzed 6,852 Claude Code sessions, 234,760 tool calls, and 17,871 thinking blocks from their real engineering work. The Register, TechRadar, and PC Gamer all covered it. The numbers are unkind:

And the cost side, which is what actually hurts: 80x more API requests and 170x more input tokens to produce measurably worse output. Same human effort. 122x more dollars per day on the same workload.

Anthropic’s response, pinned by @bcherny, is that a UI-only header (redact-thinking-2026-02-12) hides thinking summaries from the display but does not reduce thinking depth itself. That is the official position. Users can opt out via showThinkingSummaries: true in settings.json. The data in the thread suggests something is moving in parallel, or users have become better at detecting shallower behavior once they started watching for it.

Marginlab’s tracker. The Claude Code performance tracker at marginlab.ai is an independent third-party daily benchmark. It runs the Claude Code CLI directly, with no custom harness, against a curated SWE-Bench-Pro subset. It exists specifically because Anthropic published a postmortem on Claude degradations in September 2025 and said someone should watch for future ones. Their current status note: degradation detection is paused while a new baseline is collected for Opus 4.7. That is telling. A third party thought a regression was possible enough to build daily infrastructure to catch it.

Theo’s video, “Did Claude really get dumber again?” His thesis is less conspiratorial than the title. It is not that the model got dumber in absolute terms. It is that our expectations recalibrated. What Opus 4.5 felt like in January was a miracle. When Opus 4.7 delivers roughly the same capability curve in April, we feel cheated. We expected the jump. We got a shuffle. Theo’s separate criticism of the new system prompt as “lobotomized” fits alongside this: when the harness changes and the model changes at the same time, attribution gets fuzzy and users land on “the model is worse” because that is the thing they remember by name.

The expectations argument lands for me. I was demanding more because I had watched the curve bend steeply for two years. When the floor stopped rising I reacted as if it had dropped. Both can be true at the same time. The measurements in #42796 are real. The shift in expectations is also real. They compound.

Is it me? I spent a week asking that question

When you build your own AI agent, every model regression feels personal. You start questioning your own work. I spent the better part of a week on that loop.

Did I migrate my CLAUDE.md badly when 4.7 launched? Reviewed it twice. No. Is my memory file too large? It is the same 7,329-token load I measured last week. Nothing changed there. Did one of my skills go stale? I tested each of the three I use most. They behave the same as they did in March.

I tried using Opus 4.7 without 1M context as the default. That helped a little. Not enough to explain the gap. Then I tried the honest pivot: pin effort to max on every turn. And here is the thing most of the “4.7 is bad” takes miss. At max reasoning, 4.7 comes back. The depth returns. Instruction-following tightens. It stops skimming. A few hard tasks at max effort landed better for me than anything 4.6 at high effort ever did. The model is still in there.

The catch is the cost. Max effort burns usage in my setup roughly 3 to 4 times faster than medium did. On Claude Max 20x that means my weekly ceiling arrives on Tuesday instead of Friday. I am not paying for a more capable model. I am paying more to reach the capability that used to be the default. That is the real regression for heavy users. The better model is still reachable. It is sitting behind a paywall of tokens.

For my agent’s normal daily volume, max on every turn is not viable. I ran a workable compromise for a week — manually bumped effort for hard tasks, left the default in place for automation glue. It got me more usable output than medium alone. It did not get me back to the “just ask, it does” reality I had two months ago.
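A rough way to reason about that compromise, as a sketch only. It assumes usage burn scales linearly with the share of turns run at max effort, and it takes the 3 to 4 times multiplier from the paragraph above. The function names and the 20% share are hypothetical.

```python
def blended_burn(max_fraction: float, max_multiplier: float) -> float:
    """Relative usage burn when a fraction of turns runs at max effort.

    1.0 means "same as running everything at medium".
    """
    if not 0.0 <= max_fraction <= 1.0:
        raise ValueError("max_fraction must be between 0 and 1")
    return (1.0 - max_fraction) + max_fraction * max_multiplier


def days_until_ceiling(baseline_days: float, burn: float) -> float:
    """Days until the weekly ceiling if the baseline lasted baseline_days."""
    return baseline_days / burn


# Hypothetical mix: 20% of turns at max effort, 3.5x burn on those turns.
burn = blended_burn(0.2, 3.5)
print(burn, days_until_ceiling(5, burn))
```

With everything at max effort the same function gives the full 3 to 4 times burn, which is why a quota that used to last a work week runs out early in it.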

The one other place Opus 4.7 still feels strong for me is inside other harnesses. I wrote the harness comparison post in mid-April and noted the Pi harness was excellent. 4.7 inside Pi is good. The trouble is that Anthropic blocks Claude Max subscriptions from being used inside third-party CLIs, which makes Pi a per-token API spend for me. Not viable at my daily volume. So the realistic choice is Claude Code with 4.7 at medium effort plus manual max bumps, or go somewhere else entirely.

Why I re-subscribed to Codex

I let ChatGPT Pro lapse in February because I was mostly using Claude Code and the bill stung. This time I renewed specifically to run a comparison. My agent has a switcher I wrote two months ago and then stripped out when it felt redundant. I rebuilt it last week. It flips the whole stack between Claude Code (with Opus 4.7) and Codex (with GPT-5.4 Thinking). The agent’s memory, skills, and routing stay the same. Only the harness and model change.
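A minimal sketch of the shape such a switcher could take. This is not the actual switcher code: `Stack`, `STACKS`, `Agent` and the identifier strings are hypothetical names chosen for illustration. The idea it shows is the one described above: memory and skills belong to the agent, and only the harness and model pair is swapped.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stack:
    harness: str  # which CLI drives the session (hypothetical label)
    model: str    # which model that CLI talks to (hypothetical label)


# Hypothetical registry of the two stacks being compared.
STACKS = {
    "claude": Stack(harness="claude-code", model="opus-4.7"),
    "codex": Stack(harness="codex", model="gpt-5.4-thinking"),
}


@dataclass
class Agent:
    memory_path: str
    skills: tuple
    active: str = "claude"

    def stack(self) -> Stack:
        """Return the currently active harness/model pair."""
        return STACKS[self.active]

    def switch(self, name: str) -> Stack:
        """Flip the active stack; memory and skills are left untouched."""
        if name not in STACKS:
            raise KeyError(f"unknown stack: {name}")
        self.active = name
        return self.stack()


agent = Agent(memory_path="memory.md", skills=("scrape", "deploy"))
print(agent.switch("codex"))
```

Keeping the stack as a lookup key rather than a field copied into the agent means a comparison run is one assignment, and nothing else in the agent has to know which side is active.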

What I noticed after a week of A/B testing:

Web search is just better on Codex. I asked both the same question about a niche topic where I knew a recent update existed. Codex with GPT-5.4 came back with current information, cited results from the last two weeks, and summarized accurately. Claude Code came back with two-week-old results and missed the update entirely. I repeated this on three other topics where timeliness mattered. Same pattern. I do not know whether it is a WebFetch tool issue in Claude Code or a search backend problem. I know the output is worse.

Depth of analysis is better on Codex. When I ask Codex to trace a change through my agent’s architecture, it reads enough files to build a real dependency map before it starts writing. It connects modules I would have forgotten to check. Opus 4.7, on the same prompt, greps for keywords, reads what grep returned, and writes the patch. The grep-first habit is a regression from what 4.6 did by default. Codex gives me the “if we change X, we also need to touch Y and Z” map that used to be Claude’s calling card.

Usage feels fair on Codex. This is the one most people will actually care about. On Claude Max 20x, a normal day of automations plus active coding eats 10-15% of my weekly quota without doing anything heroic. When I pair-program on something non-trivial I can burn 40% in an afternoon. The five-hour and weekly ceilings both hit me. On ChatGPT Pro, with the same automations routed through Codex, I have not hit a ceiling once in a week of equivalent workload. OpenAI promoted Pro to 10x Codex usage through May 31 as a launch push, then moves to 5x, and multiple comparison pieces are now flagging the gap: “the practical quota you get per dollar has diverged sharply.”

If you want the usage math broken down cleanly, I already wrote the token waste deep-dive for Opus 4.7 last week. The new tokenizer costs up to 35% more tokens for the same workload. Combined with the laziness effect, which forces more re-prompts per task, you are doing the same job for meaningfully more money per day. Several readers emailed after that post to say they saw the same curve on their setups. The agent-efficiency-kit I packaged afterwards is a $49 drop-in that addresses the direct burn (three script hooks plus a 1K-token AGENT_INSTRUCTIONS.md patch for your CLAUDE.md). It is useful whether you stay on Claude or not, because the patterns it enforces also help the other harnesses behave.

What I am doing now

Where this goes

I do not think Opus 4.7 is a permanent regression. Anthropic has tuned rough launches before and they will tune this one. But the math is not just about one model this time. It is about what OpenAI is doing with Codex at the same price point, and what the open-source harnesses are doing alongside them. Dax Raad, who built OpenCode, publicly partnered with OpenAI to let Codex Pro subscriptions run directly inside his harness. Anthropic’s stance toward third-party harnesses has been the opposite: they have blocked Claude Max subscriptions from outside CLIs. That stance made sense when Claude was the clear leader in agentic coding. It gets harder to hold as parity closes and the friction pushes users toward the side that welcomes them.

My prediction for the next 60 days: one of two things moves. Either Anthropic tunes 4.7 back to the 4.6 floor and adjusts usage generosity, or they let the gap hold and lose their heavy users to Codex. I wrote about this general dynamic in April and it is moving faster than I expected.

For my paid subscribers, I have the switcher ready for free here:

Get Model Switcher

For now I am happily running both. Dynamic switcher, paid kit in my store for the token-waste problem, and a calmer Sunday than I had last weekend. If you are a paid Digital Thoughts subscriber and want the switcher code, reply to this post and I will send you the exact setup I am using. Free readers who are hitting the Opus 4.7 token burn right now: the agent-efficiency-kit handles the direct bleeding at $49.

The honest line: I thought I had picked a side when I cancelled Codex two months ago. It turns out I had picked the moment. Staying flexible was the actual move.

I Connected My AI Agent to a Notes App. Now I Can’t Stop Using It.

Pawel Jozefiak — Tue, 21 Apr 2026 09:37:55 GMT

Hi everyone, it’s Pawel here and this is another week of experimentation. This one, I think a lot of you will actually find useful quickly.

Before I get into it: we are almost at 2,000 subscribers. That’s wild. Thank you, seriously. I also noticed lately that my most popular posts are the basics ones. The CLAUDE.md deep-dive, the first AI agent guide. Both are among the most read things I’ve written, which is interesting because those are not what I normally do here. I usually want to go deeper into agents, real experiments, real workflows. Still, I now get why the basics matter a lot to people starting out. I’ll write more of them occasionally. If there’s something specific you want covered from the ground up, drop it in the comments.

Today I want to tell you about a piece of software I didn’t even know I needed.

Notes before notes

I’m on a call. Someone says something I need to remember. Not a task, not a project, just one thing I need to hold for the next 20 minutes. Opening Obsidian for that feels like too much. Bear too. Even a new note in the default Notes app has more friction than I want.

That gap is what Antinote fills.

It calls itself “notes before taking notes” and that framing is exactly right. It’s not a replacement for your main note system. It lives between your brain and your note system. Menu bar app, hotkey (⌥+A by default), you type, you move on. Swipe to browse notes, swipe away to create a new one. macOS only for now, iOS is in the works.

But it’s not just a text pad. This is where it gets more interesting than you’d expect.

Type math at the start of a note and it becomes a calculator. Supports operators, currency, units. Type todo and it becomes a checklist. Type code and you get syntax highlighting. Not an IDE, more like: I need to write down this config snippet without it looking like garbage.

The one that surprised me most: drag a screenshot onto a note and it extracts the text from it. Local OCR via Apple Vision, nothing leaves your machine. I use this all the time now. AutoPaste, timers, Pomodoro, find/replace with regex. You can export to Obsidian, Bear, Apple Notes when you’re done.

It’s polished. It’s calm. I like it a lot.

Building with AI agents?

If you’re getting value from these experiments, the wiz.jock.pl store has resources I’ve built around AI workflows. Worth a look if you’re serious about running agents that actually work.

The beta is where it gets actually interesting

Standard Antinote is already good. The beta version (v2.0.4+) adds extensions, and this is where I got excited.

Extensions are custom commands you invoke with :: inside any note. There are 140+ official ones across AI, date, finance, text, data. You browse them in Settings and click to install. But you can also write your own. Minimum: two files, manifest.json and index.js. The repo is public on GitHub (github.com/johnsonfung/antinote-extensions).

Commands can insert text at cursor, replace the current line, replace the whole note, or trigger external URLs. API keys are stored in macOS Keychain, so nothing sensitive lives in a config file.

If you’re not comfortable writing JavaScript, there’s an AI Extension Builder. You describe what you want, it generates a prompt for Claude or ChatGPT, you paste the output into the two files and you’re done. Most simple things work first try.

What I wired up

My AI agent Wiz runs on a Mac Mini, reachable over Tailscale. It has memory, tools, full project context. I’ve been building this for months and I wrote about the architecture a few times, including the big identity and self-improvement post if you want context. The point is: Wiz can do a lot if I can get input to it quickly.

I didn’t want to turn Antinote into another terminal. The value is in the lightness. So I built three commands and stopped.

::wiz is the bare command. Wiz reads the note and tries to figure out what to do with it. Looks like meeting notes? It summarizes. Looks like a task? It creates one on WizBoard. Contains a URL? It fetches and summarizes. Most of the time it gets it right.

::wiz_do(instruction) is when I want to be explicit. ::wiz_do(create task), ::wiz_do(draft linkedin post), ::wiz_do(remember), ::wiz_do(stage blog draft). No guessing, just dispatch.

::wizboard(view) pulls task board state into the note. ::wizboard(today) shows what’s scheduled. ::wizboard(now) shows what’s running. I use this during planning to avoid switching apps.

The plumbing: the MacBook extension POSTs to a small HTTP server on the Mac Mini over Tailscale. The server routes it, either to a fast path (Haiku, a few seconds) for things like task creation and idea logging, or to a full Claude Code session (Sonnet, 30-60 seconds) when intent is genuinely unclear. First attempts at the intent classifier were shaky, which is why I added ::wiz_do for when I just want to be direct.
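The routing decision can be sketched roughly like this. It is an illustration under assumptions, not the server's real code: the set of "fast" instructions, the `looks_like_task` heuristic, and the route labels are all hypothetical. Only the command names come from the description above.

```python
import re
from typing import Optional, Tuple

# Hypothetical: explicit instructions cheap enough for the fast path.
FAST_INSTRUCTIONS = {"create task", "remember"}
COMMAND_RE = re.compile(r"^::(\w+)(?:\((.*)\))?$")


def parse_command(line: str) -> Tuple[str, Optional[str]]:
    """Split '::name(arg)' into (name, arg); arg is None for a bare command."""
    match = COMMAND_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a command: {line!r}")
    return match.group(1), match.group(2)


def looks_like_task(note: str) -> bool:
    """Crude stand-in for an intent classifier (an assumption, not the real one)."""
    return note.strip().lower().startswith(("todo", "task:"))


def route(line: str, note: str) -> str:
    """Pick 'board', 'fast' (small model) or 'full' (full agent session)."""
    name, arg = parse_command(line)
    if name == "wizboard":
        return "board"
    if name == "wiz_do":
        return "fast" if arg in FAST_INSTRUCTIONS else "full"
    if name == "wiz":
        return "fast" if looks_like_task(note) else "full"
    raise ValueError(f"unknown command: {name}")


print(route("::wiz_do(create task)", "call the accountant"))
print(route("::wiz", "loose thoughts from the call"))
```

The explicit command skips the guess entirely, which is the reason it exists: when the classifier is shaky, the cheapest fix is to let the user state the intent.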

It’s not instant. That’s fine. I’m not looking for a chat interface. I’m looking for a way to hand something to Wiz without breaking what I’m doing. This does that.

How to build your own (short version)

You need the beta (v2.0.4+). Settings > Extensions > Open Extensions Folder. Make a folder for your extension, add two files.

manifest.json:

{
  "name": "my-extension",
  "version": "1.0.0",
  "author": "you",
  "description": "Does something useful",
  "commands": [
    { "name": "mycommand", "type": "insert" }
  ]
}

index.js:

async function mycommand(context) {
  const noteContent = context.content;
  return { type: "insert", content: "processed: " + noteContent };
}

That’s it. Type ::mycommand in any note to invoke it. If you need an API key, add it to requiredAPIKeys in the manifest and Antinote stores it in macOS Keychain. Access it inside the function via context.apiKeys.your_key_name.

For external calls, use fetch() normally. The extension has network access.

If you’re not writing JavaScript yourself, paste this into Claude or ChatGPT: “I want to build an Antinote extension that [describe what you want]. Give me the manifest.json and index.js files.” Works pretty well. The GitHub repo has real examples too, I’d start there first.

The giveaway

Antinote is $5 lifetime. Not expensive at all. But I think people here will actually use it, so I bought 10 licenses to give away.

I reached out to the developer. He matched my 10 and added another 10 on top. So we now have 20 to give away. That was a nice thing for him to do.

I built a small page to handle it properly. Enter your email, pick your tier (free or paid subscriber), and if a license is still available you’ll get one sent to your inbox with setup instructions.

Claim your license here →

Free and paid subscribers have separate pools (10 each). Paid subscribers have better odds because it’s a smaller group. Both pools open at the same time. First come, first served.

If you build something with the extensions, let me know. I’m curious what people end up making when you give them a proper hook into their own tools.

See you next week.

Pawel

Want to go deeper on AI agents?

I write about this every week. If you’re on the free tier and getting value from these posts, consider upgrading to paid. You get every post in full, early access to experiments, and, apparently, better odds in giveaways. Upgrade here.

Opus 4.7 Made Me Take Token Waste Management Seriously

Pawel Jozefiak — Fri, 17 Apr 2026 13:34:13 GMT

Anthropic shipped Claude Opus 4.7 on April 16, 2026. Same per-token price as 4.6. New tokenizer. The official docs say it quietly: “This new tokenizer may use up to 35% more tokens for the same fixed text” (source). Do the arithmetic. If you migrate your workload one-to-one, your bill goes up by up to 35% on identical inputs.
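The arithmetic as a quick sketch. It assumes the per-token price is unchanged and treats the documented "up to 35%" as the worst case. The function name and the example bill are hypothetical.

```python
def worst_case_bill(old_bill: float, token_inflation: float = 0.35) -> float:
    """Bill for the same text when the tokenizer emits more tokens.

    token_inflation is the fractional increase in token count (0.35 = 35%).
    The price per token is assumed to stay the same.
    """
    if token_inflation < 0:
        raise ValueError("token_inflation must be non-negative")
    return old_bill * (1.0 + token_inflation)


# Hypothetical monthly spend of 200 dollars before the migration.
print(round(worst_case_bill(200.0), 2))
```

The real increase depends on the text being tokenized, so this is an upper bound for identical inputs, not a forecast.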

Until yesterday I treated token spend as a fixed cost of doing business. Opus 4.7 reframed it for me. When the same workload suddenly costs a third more, you stop thinking about usage and start thinking about waste management: which turns are productive, which ones are leaking money, and how to stop the leaks without kneecapping the agent. That is a real discipline. I had been ignoring it.

So I finally audited where my agents were actually burning money. I classified 133,087 assistant turns across 9,667 real Claude Code sessions for $19 total. The answer wasn’t what I expected, and it changed what I ship. This post is a walkthrough of what I found, what the research says about efficiency more broadly, and what token waste management looks like in practice, both the free version and the shortcut.

If you haven’t tried building serious automation on Claude Code yet, my beginner agent guide is a gentler entry point. If you have, keep reading.

Token waste management is two-sided

There are two kinds of token bleeding. Most people only talk about one.

Side one is waste. The agent retries a failed tool call. It re-reads a file it already read. It gets stuck in a Cloudflare wall. It spawns a subagent whose output is never used. These are turns you paid for that produced nothing useful.

Side two is inefficient usage. Your CLAUDE.md is 8,000 tokens when 2,000 would do. Your system prompt repeats itself. You ask for “be concise” and the model gives you three paragraphs anyway. You don’t use prompt caching, so every turn pays the full input cost. The turns were productive, but more expensive than they needed to be.

With Opus 4.7’s tokenizer, side two just got up to 35% worse without anyone touching their code. If you were already on the edge of comfortable costs, you’re over it now. And the cache write cost also scales with those same tokens, so the first turn after a cache miss feels worse than you remember.

What I measured

I built a token waste sorter. It walks every Claude Code session JSONL and sorts each assistant turn into one of nine bins: productive, retry_error, cache_read, cache_write, reasoning, file_reread, oververbose_edit, dead_end, subagent_overhead. Seven bins are heuristic (no LLM). Two need a judge.
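To make the heuristic side concrete, here is a minimal sketch of two of those bins. It does not use the real Claude Code session schema: each turn is assumed to be a JSON object with hypothetical `tool`, `target` and `error` fields, and only `retry_error`, `file_reread` and `productive` are implemented.

```python
import json
from collections import Counter


def classify_session(turns):
    """Label each turn using two cheap heuristics, no LLM involved."""
    seen_files = set()
    prev = None
    labels = []
    for turn in turns:
        tool, target = turn.get("tool"), turn.get("target")
        same_call = prev is not None and (prev.get("tool"), prev.get("target")) == (tool, target)
        if same_call and prev.get("error"):
            # Same call repeated right after it failed.
            label = "retry_error"
        elif tool == "Read" and target in seen_files:
            # File already read earlier in this session.
            label = "file_reread"
        else:
            # Everything else, including the first attempt of a call that fails.
            label = "productive"
        if tool == "Read":
            seen_files.add(target)
        labels.append(label)
        prev = turn
    return labels


def sort_jsonl(lines):
    """Count labels for one session given as JSONL lines."""
    turns = [json.loads(line) for line in lines if line.strip()]
    return Counter(classify_session(turns))


session = [
    '{"tool": "Read", "target": "a.py"}',
    '{"tool": "Read", "target": "a.py"}',
    '{"tool": "WebFetch", "target": "https://example.com", "error": true}',
    '{"tool": "WebFetch", "target": "https://example.com", "error": true}',
]
print(sort_jsonl(session))
```

Bins like `dead_end` cannot be caught this way because they depend on intent, which is where a judge model comes in.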

For the judge, I tried three models on the same 20 sessions where I knew dead ends existed:

Haiku was the clear winner. Sonnet at five times the price caught half as much. The local 4B model only caught explicit failures (blocked fetches, 403s) and missed everything that requires judging intent, like an agent searching the wrong platform for 28 straight turns. (More on why local LLMs struggle with judgment tasks here.) The full audit of 9,667 sessions via OpenRouter Haiku cost me $19. That’s the cheapest observability I’ve ever bought.

Top five waste clusters across all sessions:

The surprise was the distribution. When I sampled only expensive sessions, Browser/Playwright showed up 5 times. On the full corpus it was 136. A 27x increase. The failure is spread thin across thousands of cheap cron and wake sessions, each one invisible individually, collectively the top bug. If you only audit your expensive sessions, you’ll miss this.

None of these are “AI going down wrong paths” in the romantic sense. They’re infrastructure bugs. Stale cookies. Cloudflare walls. Tools that don’t exist in the current Claude Code version. Platform confusion. The AI is the messenger, not the source. (I wrote about the compounding value of fixing these in this earlier post: one small fix applied across thousands of sessions is where real gains live.)

What the research says about the other half

I went looking for academic and production data on cutting token usage, not just waste. Four things stood out:

Prompt compression is real and large. Microsoft’s LLMLingua and LLMLingua-2 compress prompts 14-20x with around 1.5% quality loss. Your 7,000-token system prompt becomes 500 tokens with negligible quality drop on standard tasks. You don’t need to apply LLMLingua to use the insight: prompts have a lot of slack in them.

System prompt bloat hurts quality, not just cost. Red Hat’s analysis and the MLOps Community writeup both land in the same place: prompts degrade quality around 3,000 tokens. Smaller, well-written system prompts outperform larger ones, and not just on latency. If your CLAUDE.md is multiple pages, it’s probably actively making the agent worse.

Prompt caching is a 90% discount if you use it correctly. Anthropic’s prompt caching reduces cache-hit tokens to 0.1x the normal input price. To benefit, keep stable rules at the top of your context. Don’t reorder them mid-session. Put volatile, per-task content at the bottom. For Opus 4.7 the minimum cacheable length is 4,096 tokens, so small prompts can’t cache. Design for it.
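A small sketch of why the ordering and the minimum length matter. The 0.1x multiplier and the 4,096-token minimum are the figures quoted above. The function names are hypothetical, costs are in "input-token units" rather than dollars, and the cache-write premium is deliberately ignored.

```python
CACHE_READ_MULTIPLIER = 0.1   # cache-hit tokens cost 0.1x normal input
MIN_CACHEABLE_TOKENS = 4096   # minimum cacheable prefix quoted for Opus 4.7


def build_context(stable_rules, task):
    """Stable rules first (cacheable prefix), volatile task content last."""
    return "\n\n".join([*stable_rules, task])


def input_cost_units(stable_tokens, volatile_tokens, cache_hit):
    """Input cost of one turn, in units of one normally priced input token."""
    cacheable = stable_tokens >= MIN_CACHEABLE_TOKENS
    if cache_hit and cacheable:
        return stable_tokens * CACHE_READ_MULTIPLIER + volatile_tokens
    # Prefix too short to cache, or a cache miss: everything at full price.
    return float(stable_tokens + volatile_tokens)


print(input_cost_units(5000, 1000, cache_hit=True))
print(input_cost_units(3000, 1000, cache_hit=True))
```

The second call shows the trap: a 3,000-token prefix is below the minimum, so it pays full price on every turn even though it never changes.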

Long chains of thought do not always win. Recent work (“overthinking” studies) shows that on simple tasks, longer reasoning actively hurts performance. Production rule: use CoT for complex problems, direct answers for classification and retrieval. If you’re defaulting to “think step by step” on everything, you’re paying 3-5x tokens for a quality hit on half of them.

Add all four up and you have the other half of the story. Not every inefficiency is a bug. Most of it is prompt shape.

If you’re curious how different AI coding harnesses handle this stuff, my comparison of Claude Code vs Codex vs Aider vs OpenCode vs Cursor goes deep on the efficiency differences between them. Short version: the harness matters almost as much as the model.

Three things you can do today, free

Before anything else, do these:

1. Shrink your CLAUDE.md. Open it. If it’s over 3,000 tokens, you have room to cut. Move stable rules to the top (for cache hits). Kill anything that describes what Claude Code can already do. Kill historical notes that don’t change behavior. A tight CLAUDE.md both costs less AND makes the agent smarter.

2. Set max_tokens tight and request structured output where possible. For classification tasks, request JSON with a schema. For quick answers, say “reply in under 50 words.” The model will drift long if you don’t put a number on it.

3. Audit your WebFetch and browser failures. If you have any agent that does repeated web automation, find out if it’s hitting the same Cloudflare wall 100 times a week silently. The cost per hit is small. The total is not. For me this one cluster was $220 of silent monthly spend before I saw it.

These three alone will cut most users’ bills 20-40%, at zero software cost.
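For the third item, a sketch of the kind of audit that surfaces a silent wall. It assumes you can already extract `(url, status)` pairs from your own logs. The record shape, the status set and the threshold are assumptions for illustration.

```python
from collections import Counter
from urllib.parse import urlparse

# Assumption: statuses that usually mean "blocked", not "page is gone".
BLOCK_STATUSES = {403, 429, 503}


def repeated_blocks(events, threshold=3):
    """Return hosts that were blocked at least `threshold` times.

    events: iterable of (url, http_status) pairs.
    """
    counts = Counter(
        urlparse(url).netloc
        for url, status in events
        if status in BLOCK_STATUSES
    )
    return {host: n for host, n in counts.items() if n >= threshold}


events = [
    ("https://example.com/a", 403),
    ("https://example.com/b", 403),
    ("https://example.com/c", 429),
    ("https://ok.example.org/", 200),
]
print(repeated_blocks(events))
```

Grouping by host rather than by URL is the design choice that matters here: each individual URL may fail only once, while the host fails every time.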

The deeper thing: the Agent Efficiency Kit

Once I saw the clusters, the fixes were obvious but tedious: write a hook that denies redundant file reads. Write a hook that suggests firecrawl when WebFetch hits Cloudflare. Write a circuit breaker that stops the retry spiral after two failures on the same URL. Write agent-level instructions so the model internalizes the patterns. Build a dashboard so you can see what changed.
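The circuit breaker idea, sketched in isolation. This is not the kit's hook and does not use the Claude Code hook interface: the class and method names are hypothetical, and the two-failure limit is the one mentioned above.

```python
from collections import defaultdict


class FetchCircuitBreaker:
    """Refuse further fetches of a URL after repeated failures."""

    def __init__(self, max_failures: int = 2):
        self.max_failures = max_failures
        self._failures = defaultdict(int)

    def allow(self, url: str) -> bool:
        """True while the URL has failed fewer than max_failures times."""
        return self._failures[url] < self.max_failures

    def record(self, url: str, ok: bool) -> None:
        """Record the outcome of a fetch; a success resets the counter."""
        if ok:
            self._failures.pop(url, None)
        else:
            self._failures[url] += 1


breaker = FetchCircuitBreaker()
url = "https://example.com/blocked"
for _ in range(2):
    breaker.record(url, ok=False)
print(breaker.allow(url))
```

Resetting on success keeps a single flaky response from permanently blocking a URL, while two consecutive failures are enough to stop the spiral.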

I did all of that. (The dashboard is built on the same principles as my WizBoard interface for agents: don’t make the human hunt for the number, put it on screen.) Then I realized every Claude Code user in the world needs the same thing, and almost none of them are going to build it themselves. So I packaged it.

The Agent Efficiency Kit is a $49.99 drop-in package. It includes:

Three pre-wired hooks that run in your Claude Code settings: a file-reread guard, a WebFetch fallback hint, and a WebFetch circuit breaker. Script-based, zero ongoing AI cost, milliseconds of overhead per tool call.
AGENT_INSTRUCTIONS.md, an approximately 1,000-token drop-in for your CLAUDE.md that tells the agent which patterns to follow and which to avoid. Cacheable, so you pay for it once per session at most.
The taxonomy, classifier, and dashboard I used for the audit. Run them any time, on your own data, locally. The dashboard is a pinned tab.
Optional Haiku-powered deep audit. If you want to classify a year of history for around $20 in OpenRouter credits, the scripts are ready to run.
12 months of updates: new hook patterns, taxonomy expansions, dashboard features.

It installs in one command. It doesn’t charge you tokens to measure itself. It works from the moment you restart Claude Code. You can read every file in the kit before running it, which is the version of trust I prefer.

My paid subscribers get it for free here: wiz.jock.pl/store/agent-efficiency-kit.

The meta lesson

Before Opus 4.7, token efficiency was a nice-to-have. After Opus 4.7, it’s a 35% forced haircut on everyone running on the frontier. The teams that measure their agents now will notice the bump, correct it, and keep going. The teams that don’t will slowly wonder why their AI bill is up and their features aren’t shipping faster.

The path to cheaper, better agents isn’t a smarter model. It’s better plumbing around the model. Old cookies, Cloudflare walls, a regex that didn’t sanitize a search term. These are the things that eat your budget. They stay invisible until you measure, at which point they’re obvious. Measure. Fix the top cluster. Repeat.

If you’ve read this far, you already have enough to start on the free side. If you want the shortcut, the kit is there. Either way, now is the moment. Tokens cost more tomorrow than they did yesterday.

Claude Code vs Codex CLI vs Aider vs OpenCode vs Pi vs Cursor: Which AI Coding Harness Actually Works Without You?

Pawel Jozefiak — Wed, 15 Apr 2026 12:50:16 GMT

My AI agent wakes up at 2am, picks tasks from a queue, ships code, and sends me a report by morning. For that to work, I need a coding harness I can trust when I’m not watching.

Not a tool that helps me code faster. A tool that codes when I’m asleep.

That’s a different question than “which IDE is best.” IDEs are for humans who are present. Harnesses are for when you’re not. It’s also not the same question as “which has the best autocomplete.” That’s a different category entirely, one we’re not touching here.

I’ve used Claude Code daily for months, run Codex CLI and OpenCode in parallel, tested Pi, and dug into the open-source alternatives. This is what I actually think.

What a Harness Actually Is

A harness connects the horse to the cart. In AI coding, it’s the set of tools and environment in which the agent operates.

Here’s the thing most people miss: LLMs can only generate text. That’s it. They can’t read your files, run commands, or edit code directly. What a harness does is give the model structured tool calls it can emit as text. The harness intercepts those, executes them with real code, appends the output to the conversation history, and prompts the model to continue. Every tool call follows the same loop: model pauses, harness runs something, result added to context, model restarts. At its core this is about 60-75 lines of Python. The complexity is entirely in the tuning: what tools the model gets, how those tools are described, and what the system prompt says.
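To make that loop concrete, here is a stripped-down sketch. The model is replaced by a scripted stand-in so the example runs without an API key, and the tool names and JSON call format are assumptions for illustration, not any real harness's protocol.

```python
import json


# Tools the harness is willing to execute. Names are hypothetical.
def read_file(path, files):
    return files.get(path, f"error: {path} not found")


def write_file(path, content, files):
    files[path] = content
    return f"wrote {len(content)} chars to {path}"


def scripted_model(history):
    """Stand-in for an LLM call: returns text, which may be a JSON tool call."""
    tool_results = [m for m in history if m["role"] == "tool"]
    if len(tool_results) == 0:
        return json.dumps({"tool": "read_file", "args": {"path": "app.py"}})
    if len(tool_results) == 1:
        return json.dumps({"tool": "write_file",
                           "args": {"path": "app.py", "content": "print('hi')\n"}})
    return "Done: app.py now prints a greeting."


def parse_tool_call(text):
    """Return (tool_name, args) if the text is a JSON tool call, else None."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return None
    if isinstance(data, dict) and "tool" in data:
        return data["tool"], data.get("args", {})
    return None


def run_harness(task, model, files, max_steps=10):
    """Model emits text, harness runs any tool call, result goes back into history."""
    tools = {"read_file": read_file, "write_file": write_file}
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = model(history)
        history.append({"role": "assistant", "content": reply})
        call = parse_tool_call(reply)
        if call is None:
            return reply, history  # plain text means the model is finished
        name, args = call
        if name not in tools:
            output = f"error: unknown tool {name}"
        else:
            try:
                output = tools[name](files=files, **args)
            except TypeError as exc:  # the model passed the wrong arguments
                output = f"error: {exc}"
        history.append({"role": "tool", "content": output})
    return "stopped: step limit reached", history
```

Everything that separates one harness from another lives outside this loop: which tools go in the dictionary, how they are described to the model, and what the system prompt says.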

This matters because the tuning is where harnesses actually diverge. Two harnesses running the same model on the same task can produce dramatically different results. Not because of the model, but because of what the harness tells the model it can do and how to use it.

Tab autocomplete isn’t a harness. It’s a suggestion box. A nice UI on top of an existing harness (like T3 Code, which wraps Claude Code and Codex CLI) is also not a harness. The real question for every tool below: can it take a task, execute it end-to-end across multiple files, handle errors, and report back without me in the loop?

Two Different Categories: Coding Tools vs Agent Orchestrators

Before comparing specific tools, it’s worth naming the split that most comparisons ignore. Not all “AI coding harnesses” are trying to do the same thing.

Coding tools are pair programmers. You direct each step. They execute that step very well, commit the result, and wait for the next instruction. Aider is the clearest example. Codex CLI leans this way too. Cline. These are tools built around the assumption that you’re at the keyboard and providing direction. They make individual tasks faster and better. They’re not designed to chain 40 decisions together autonomously while you sleep.

Agent orchestrators are designed to take a goal and execute autonomously across multiple steps, files, and decision points. Claude Code is built for this. Devin is the extreme version. Pi, if you build out the harness fully, fits here. These tools are designed around the assumption that you’re not watching, and they need to make judgment calls without asking.

Most comparisons treat all of these as the same thing and rank them on the same axis. That produces misleading results. Aider isn’t trying to replace Claude Code for overnight autonomous runs. Codex CLI isn’t trying to be an agent orchestrator in the same sense Claude Code is. Judging them by the same criteria produces noise.

The honest answer to “which is best” depends entirely on which category you need. This post tries to be clear about which tools belong where, and let you make the call for your workflow.

The Benchmark Reality (And Why It Doesn’t Tell the Full Story)

SWE-bench Verified became the standard benchmark for this category. It measures how often a coding agent independently resolves real GitHub issues from start to finish. That status also made it a target. Researchers flagged contamination: training data for newer models overlaps with the test set, which inflates scores. The cleaner alternative is SWE-bench Pro, introduced in 2026, with 2,000+ problems that weren’t in any public training data. GPT-5.4-Codex leads there at 56.8%. Harder problems, more honest scores.

Terminal-Bench 2.0 deserves a separate mention because it’s more relevant for agentic tasks than SWE-bench. It tests autonomous, multi-step execution in real terminal environments. Not just code edits. Actual shell navigation, file management, running commands in sequence, recovering from errors. The Claude Code harness configuration benchmarked here (“Claude Mythos”) hits 92.1%. Codex CLI hits 77.3%. That 15-point gap is a better signal for overnight autonomous work than SWE-bench numbers.

Now the result that breaks the “pick the highest number” logic. Matt Mayer ran an independent test comparing the same model inside different harnesses. Claude Opus: 77% in Claude Code, 93% in Cursor. Same model. Same tasks. 16 percentage points from the harness alone. That’s not an outlier. CORE-Bench found Claude Opus at 42% with a minimal scaffold, rising to 78% inside Claude Code’s full harness. Across multiple independent studies the harness effect ranges from 5 to 40 percentage points depending on model and task type.
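A quick way to keep "percentage points" and "relative improvement" apart when reading numbers like these. The scores are the ones quoted above; the helper name is just for illustration.

```python
# Scores quoted in the paragraph above; nothing here is newly measured.
QUOTED = {
    "Opus: Claude Code -> Cursor": (77.0, 93.0),
    "Opus on CORE-Bench: minimal scaffold -> Claude Code": (42.0, 78.0),
}


def harness_gap(before, after):
    """Return (percentage-point gap, relative improvement in percent)."""
    points = after - before
    relative = points / before * 100
    return round(points, 1), round(relative, 1)


GAPS = {label: harness_gap(*scores) for label, scores in QUOTED.items()}

if __name__ == "__main__":
    for label, (points, relative) in GAPS.items():
        print(f"{label}: +{points} points ({relative}% relative)")
```

Both gaps (16 and 36 points) sit inside the 5 to 40 point range. In relative terms they are larger still.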

A few flags before reading the tool sections. Cursor doesn’t publish SWE-bench Verified results and uses its own proprietary CursorBench at 61.3% instead. Draw your own conclusions. OpenCode and Pi have no published scores because their performance is entirely model-dependent. Devin’s frequently cited 13.86% figure is from 2023 and belongs in a museum. It does not appear in the current top 30 of any major leaderboard.

What the scores actually tell you: harness quality matters as much as the model you put in it. Cursor employs people whose full-time job is to rewrite system prompts and tool descriptions every time a new model ships. Claude will keep using a tool you label “deprecated.” Gemini will abandon structured tools entirely and only use bash. Cursor tests obsessively and adjusts. Most harnesses don’t. Keep this in mind across every section below.

Claude Code: The Deep Harness

Category: Agent orchestrator | code.claude.com | GitHub (114k stars)

Full disclosure: this is what I use daily, and what runs Wiz on a headless Mac Mini overnight. I try to be honest about it.

Claude Code is the most complete agentic runtime available right now. It reads CLAUDE.md, a project-specific instruction file that persists across every session. You can describe your entire architecture, your preferences, your forbidden patterns, and the agent carries that into every run without you repeating it. It has Agent Teams for spinning up parallel sub-agents that coordinate on a shared goal. As of March 2026, computer use means it can point and click through UIs, take screenshots, and handle workflows that resist scripting.

The thing I keep noticing with Claude Code is that it genuinely builds on context over time. A session that starts with “add authentication” will remember the decisions it made about your auth architecture when it gets to “add rate limiting” three steps later. That coherence across a long task chain is what makes it feel like an agent rather than a very fast typist.

One important thing about how any harness uses context: the model only knows what’s in its conversation history. When Claude Code opens your project, it doesn’t already know your codebase. It explores via tool calls, building context incrementally. CLAUDE.md front-loads that context so fewer tool calls are wasted on discovery. Dumping your entire codebase into context (the old Repomix approach) is the wrong answer. Past around 50-100k tokens, model accuracy drops significantly. More context makes models dumber past a threshold. Good harnesses build context as needed, not all at once.

Where it struggles: context loss on sessions longer than 2 hours, where it starts forgetting early decisions. Terminal-only interface has a real learning curve. Token consumption is 3-4x higher than Codex CLI per equivalent task, which compounds on long autonomous sessions.

Best for: complex multi-file tasks, overnight autonomous runs, architecture-level changes that require consistent context across many steps.

Pricing: Claude Pro ($20/mo) or Max ($100+/mo). For regular autonomous sessions, Max is almost certainly necessary. The per-token costs on long runs add up fast. For a detailed Claude Code vs Codex head-to-head from two months of real usage, I covered that comparison separately.

Codex CLI: Good, But Not What the Hype Says

Category: Coding tool, emerging agent | openai.com/codex | GitHub (67k stars)

Codex CLI is not the old Codex model from 2021. It’s OpenAI’s terminal-based agent, open-source on GitHub, bundled with ChatGPT Plus or Pro, running on GPT-5.4. The benchmark puts it at 77.3% on SWE-bench, close to Claude Code’s 80.8%, and at 3-4x lower token cost. On paper, a strong contender.

In practice, my honest read: it’s cold. That’s the right word. What I mean is that Codex CLI feels raw as an agent. It executes individual steps cleanly, but it doesn’t feel like it’s building toward something the way Claude Code does. Give it a multi-step task: add this feature, connect it to this other component, update the tests. It handles step one well, sometimes step two, and starts losing coherence by step three or four. It restates what it did, asks for clarification it shouldn’t need, or misses a dependency it should have caught from context it already has. That gap between 77.3% and 80.8% is exactly this: Claude Code holds context through longer chains.

Where Codex CLI genuinely shines is raw coding quality on focused tasks. iOS apps, macOS apps, web apps. Give it a specific, contained task and GPT-5.4 is excellent. The code quality on front-end work, app scaffolding, and UI logic is strong. I’d put it on par with or ahead of Claude Sonnet for this category of work. It’s not the harness that’s the advantage there. It’s GPT-5.4 being particularly strong at app development.

The architectural difference worth knowing: Codex CLI runs in cloud containers managed by OpenAI, not on your local machine. You can fire off a task and disconnect. The task keeps running without your terminal staying open. For batch work and overnight jobs where you’re not monitoring, that’s genuinely useful. For tight local loops where your environment variables and local state matter, you’re working around the sandboxing.

Where it struggles: multi-step agentic chains with dependencies. Feels unfinished as a full harness compared to Claude Code. Less context coherence on complex tasks.

Best for: focused coding tasks (especially apps), token-efficient runs, developers already on ChatGPT Plus who want to try a CLI agent without extra cost.

Pricing: included with ChatGPT Plus ($20/mo) or Pro ($200/mo). If you’re already paying for ChatGPT, this is essentially free to try.

Aider: The Underrated Open-Source Standard

Category: Coding tool (pair programmer) | aider.chat | GitHub (43k stars)

Aider is the tool most people in the “AI coding” conversation have never used, yet it has 43,000 GitHub stars and processes 15 billion tokens per week in production. That’s not a toy project.

The model is fundamentally different from Claude Code or Codex. Aider is a git-first pair programmer, not an autonomous orchestrator. You bring your own model (Claude Sonnet, GPT-5, Gemini 2.5, DeepSeek, Qwen, local Ollama) and Aider wraps it with git-native execution. Every AI edit becomes a commit. The repo map gives it structural understanding of your whole codebase before it touches anything. It auto-lints and runs tests after every change, self-fixing detected issues before reporting back.

The token efficiency is striking: 4.2x fewer tokens than Claude Code per equivalent task. If you’re paying for API access directly, Aider with Claude Sonnet is the most cost-efficient path to serious coding automation by a wide margin.

The honest tradeoff: Aider doesn’t orchestrate across 40 files and coordinate sub-agents. It executes a task, executes it well, and commits the result. It’s more like having a disciplined pair programmer who never skips a commit than a system that independently plans and executes a multi-hour architecture session. For incremental work (refactoring a module, implementing a feature, fixing a class of bugs), it’s the right tool. For overnight autonomous sessions that need to make judgment calls across large contexts: Claude Code.

The git-first philosophy deserves separate mention. Every change is committed. Your entire interaction with the agent is auditable, reversible, and reviewable inside your normal git workflow. No other tool in this list bakes that in at the same level.

Best for: focused incremental work, budget setups, teams that want full audit trails, developers who want BYOM flexibility without giving up discipline.

Pricing: free. You pay your model provider directly.

OpenCode: The Provider Switcher

Category: Hybrid (coding + emerging agent) | opencode.ai | GitHub (72k stars)

OpenCode’s value proposition is breadth: 75+ LLM providers, all accessible from the same interface. Anthropic, OpenAI, Google, DeepSeek, AWS Bedrock, Azure, local Ollama, and more. I’ve used it with Claude Opus, GPT models, and open-weight models like Qwen and GLM. The switching experience is genuinely seamless in a way that nothing else matches. One command, different provider, same workflow. You can’t do that in Claude Code or Codex.

But I’ll be honest about something: there’s something missing from the experience. It’s hard to name exactly. After using it alongside Claude Code for a while, I notice OpenCode doesn’t feel like it’s building a working relationship with your project. There’s no CLAUDE.md equivalent that persists project context. There’s no Agent Teams layer for coordinating parallel work. The autonomous behavior is functional but less mature. It handles individual tasks well, but it doesn’t feel like a system designed for extended unattended operation.

With open-weight models like Qwen and GLM, it’s fine. Gets the job done for straightforward tasks. You’re not going to get Claude Opus-level reasoning, but for routine edits and quick fixes, the cost savings are real.

The provider switching is genuinely the killer feature. If you’re doing model experiments, comparing how GPT-5.4 handles a task vs Claude Sonnet vs a local Qwen, OpenCode is the tool for that. If you already have subscriptions to multiple providers and want to use them without managing separate CLI tools, OpenCode is the right architecture. But for a long-term primary agent setup where you need consistent, deep project context: I’d reach for something else.

Best for: model experimentation, teams with multiple provider subscriptions, privacy-first setups with local Ollama, cost arbitrage across providers.

Pricing: free. BYOM.

Pi: The One I Actually Want to Use More

Category: Coding tool + primitives harness | pi.dev | GitHub

Pi is genuinely different from everything else here, and I want to say this upfront: I like it. It’s fast, it’s flexible, and the experience is clean in a way proprietary tools often aren’t. If I could choose without constraints, Pi is probably the closest thing to what I’d want as a daily harness alternative to Claude Code.

The design philosophy is the opposite of the “more features” trend. Its tagline is blunt: “there are many coding agents, but this one is mine.” Instead of an opinionated harness, it gives you primitives. A minimal core you configure yourself. Terminal TUI, 15+ LLM providers, tree-structured session history you can navigate and export, and four operation modes. The interesting one for builders: RPC mode. Pi runs as an embeddable subprocess inside a larger automation system. Your orchestration layer calls Pi, it executes the coding task, returns structured output. Designed to be a component in a system, not a standalone tool.
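The "agent as a subprocess" pattern looks roughly like this from the orchestrator's side. This is a generic sketch, not Pi's actual RPC protocol or command line: the worker is a tiny stand-in Python script, and the one-JSON-line-in, one-JSON-line-out message format is an assumption.

```python
import json
import subprocess
import sys

# Stand-in worker: reads one JSON request from stdin, writes one JSON reply.
# A real setup would launch the coding agent here instead. Its command line
# and message format are NOT shown; this is a generic illustration.
WORKER_SOURCE = r'''
import json, sys
request = json.loads(sys.stdin.readline())
reply = {"id": request["id"], "status": "ok", "summary": "handled: " + request["task"]}
sys.stdout.write(json.dumps(reply) + "\n")
'''


def call_worker(task, request_id=1, timeout=30):
    """Send one task to a subprocess and return its structured reply."""
    request = json.dumps({"id": request_id, "task": task}) + "\n"
    completed = subprocess.run(
        [sys.executable, "-c", WORKER_SOURCE],
        input=request,
        capture_output=True,
        text=True,
        timeout=timeout,  # never let an unattended run hang forever
        check=True,       # raise if the worker exits with an error
    )
    return json.loads(completed.stdout)


if __name__ == "__main__":
    print(call_worker("rename foo to bar", request_id=7))
```

The orchestrator only ever sees structured output and an exit code, so the component behind call_worker can be swapped without touching the rest of the system.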

What’s deliberately absent: sub-agents, plan mode, permission popups, background processes. Pi’s bet is that most harnesses embed too many assumptions about your workflow. Strip to primitives, ship extensions via npm, build exactly what you need. AGENTS.md and SYSTEM.md play the same role CLAUDE.md does in Claude Code.

So why am I not using it more? One reason, and it’s a real one: Anthropic’s billing doesn’t let you bring your Max subscription to third-party harnesses.

Pi is BYOM, bring your own API key. When I tested it with Claude, Pi surfaced a message explicitly: usage through Pi counts against API billing, not your Claude subscription. So if you’re on Claude Max ($100+/mo), using Pi with Claude means paying twice. The Max subscription for Claude Code, and API rates on top for Pi. Those costs add up fast on any serious coding session. I was paying from my own pocket to test something I wanted to use more. That’s not a good feeling.

This isn’t Pi’s fault. It’s Anthropic’s policy. They don’t allow third-party harnesses to draw on subscription credits. You have to use Claude Code to get what you’re paying for on the subscription. Google does the same with Gemini. Theo from T3 made this point in a recent video on harnesses: if you’re paying $200/month for Opus, you have to use their harness. OpenAI, by contrast, lets your API credits work across third-party tools freely.

In a world where Anthropic changed this, where your Max subscription applied to any MCP-compatible harness, Pi is probably what I’d reach for first. The speed, the flexibility, the primitives-first design: it fits the kind of automation system I’m building. But until that policy changes, the economics don’t work for anyone on a Claude subscription. You pay for Claude twice if you want to experiment with a different harness.

If you’re on GPT or open-weight models (Qwen, DeepSeek, GLM), Pi has none of these constraints. The billing goes through OpenAI or your provider directly. For a Claude-first setup: this is the wall you’ll hit.

Best for: GPT or open-weight model setups, building custom harness architectures, embedding a coding agent as a subprocess in larger systems, developers who want full control with no opinions baked in.

Not ideal for: Claude-first developers on Max. You’ll pay API rates on top of your subscription.

Pricing: free, MIT license. BYOM. Factor in API costs if using Anthropic models.

Cursor: The Best Supervised Experience, Not Yet a Harness

Category: IDE with supervised agent mode | cursor.com

Cursor is an IDE first. Its agent mode deserves inclusion in this conversation because of how fast the direction is changing, not because it’s a harness today.

Cursor 3 (released April 2026) added cloud agents on isolated VMs, /worktree for isolated branch changes, self-hosted agents, and parallel Agent Tabs. 30% of Cursor’s own internal PRs are now agent-made. The supervised IDE experience (Design Mode, where you annotate a mockup and get an implementation; parallel agents; deep JetBrains support) is the best developer experience available at the keyboard right now.

As an overnight harness: not there. When left without supervision, it stalls at the first ambiguous decision point. That’s not a bug. It’s a design choice. Cursor is built for developers who are present and want an agent that won’t make unilateral decisions on their codebase. That’s the right call for most developers. It means Cursor isn’t the right tool for autonomous runs.

The 77% to 93% Opus benchmark is the thing worth studying. Cursor extracts more from the same model through obsessive harness tuning. People whose whole job is to rewrite system prompts and tool descriptions for each new model release. The gap is real and compounds across tasks. The cloud agents direction makes me think this section of the comparison will look very different in 12 months.

Best for: daily supervised coding, developers who want the best IDE-plus-agent experience at the keyboard.

Pricing: Hobby (free), Pro ($20/mo), Ultra ($200/mo), Teams ($40/user/mo).

A Few More Worth Knowing

Goose (Block/Square, GitHub, 41k stars): Open-source, MCP-based, general-purpose agent. Not coding-specific, but handles code tasks well. Right fit if you want automation that goes beyond coding into broader workflows. Apache 2.0 license.

Cline (GitHub, 60k stars): Open-source, supports VS Code, JetBrains, Neovim, Emacs. Widest multi-IDE coverage of any tool in this list. Good MCP support. Worth looking at if your stack spans multiple editors.

Gemini CLI (Google, GitHub, 96k stars): Free with a Google account. 60 requests/minute, 1,000/day, 1 million token context window. Genuinely generous free tier. Strong on frontend tasks. The right starting point if budget is the hard constraint and you don’t have API credits elsewhere.

Devin (Cognition): Full autonomy, cloud sandbox, Linux shell, browser. Significantly more accessible than before: Core tier at $20/mo plus $2.25 per ACU (autonomous compute unit). Its frequently cited 13.86% end-to-end resolution rate on real GitHub issues is the dated figure flagged in the benchmark section, not a current score. Worth evaluating for teams with consistent engineering backlogs, not just enterprise anymore.

T3 Code (Theo): Not a harness. A UI wrapper on top of Claude Code and Codex CLI. Useful to name because it comes up in these conversations. If you don’t have Claude Code installed, T3 Code won’t do Claude tasks. The UI is the product, not the agent.

Same Task, Different Harness

The fairest way to compare these is to run the same type of task and watch what happens. Here’s the pattern I kept seeing:

Complex multi-step agent task (e.g. “add this feature, connect it to the auth system, update the affected tests, write a changelog entry”): Claude Code holds the chain. It remembers what it did in step one when it reaches step four. Codex CLI starts strong and starts fraying around step three. OpenCode and Aider handle each step well in isolation, but need more direction between steps.

Focused app development (iOS, macOS, web UI): Codex CLI with GPT-5.4 is competitive here. The code quality on app work is strong, sometimes ahead of Claude Sonnet. Claude Code with Opus is still better on complex multi-component app logic, but for a contained feature or a new screen: Codex CLI is a legitimate choice.

Budget-constrained incremental refactoring: Aider with Claude Sonnet or DeepSeek is the clear call. The 4.2x token efficiency advantage is real. The git-first commit-per-change model gives you a clean audit trail. You pay for what you actually use.

“I want to run the same task with three different models and compare”: OpenCode. Nothing else makes provider switching this frictionless.

Overnight autonomous work where you’re not monitoring: Claude Code. The infrastructure is designed for exactly this. CLAUDE.md project context, background scheduling, Agent Teams, error handling. Everything else is built around having a human present.

Which One Fits Your Workflow?

There’s no universally “best” harness. The honest answer depends on a few questions about how you actually work.

Are you at the keyboard or not? If you’re supervising every step, Cursor gives you the best IDE experience and the most model-agnostic setup. If you want autonomous execution with no supervision, Claude Code is the only tool built end-to-end for that. Everything else sits somewhere in between.

Do you need to chain many steps or execute one step well? Multi-step autonomous chains with dependencies: Claude Code. Focused, contained tasks with excellent code quality: Aider or Codex CLI. There’s a real difference between a pair programmer and an orchestrator, and the right choice depends on which problem you’re actually solving.

What’s your budget? If you’re price-sensitive, Aider with a cheap backend (DeepSeek, Qwen, even Gemini) is the clearest path to real coding automation at minimal cost. Gemini CLI is free with generous limits. OpenCode lets you use whatever provider is cheapest for the task at hand. None of these require a $100/mo subscription.

Do you care about model flexibility? If you want to switch between Claude, GPT, open-weight models, and local Ollama without friction, OpenCode or Aider are the right architectures. Claude Code and Codex CLI are provider-locked.

Are you building a system or using a tool? If you’re assembling a larger automation where a coding agent is one component among many, Pi’s RPC mode and primitives-first design is worth the setup investment. If you just want to get code written, start with Claude Code or Aider depending on your budget and task type.

The mistake most people make is picking a tool based on a benchmark and then wondering why it doesn’t feel right in their actual workflow. The benchmark measures what the model can do on a standardized task. Your workflow isn’t a standardized task.

The Decision Matrix

The Honest Verdict

After months of real use, here’s where I land.

Claude Code for autonomous execution. Not because it’s perfect. Context loss on sessions over 2 hours is a genuine problem, and the token cost is genuinely high. But it’s the only tool built, end to end, for the question “can I leave this running while I sleep?” Agent Teams, background scheduling, CLAUDE.md project memory, computer use. The infrastructure reflects that goal. My headless Mac Mini setup runs on this for exactly this reason.

Codex CLI for app work. GPT-5.4 is genuinely excellent at iOS, macOS, and web app development. For a contained feature with a clear spec, it’s fast, cheap, and produces clean code. The harness feels raw for complex agentic chains, but for the coding task itself, it earns its place.

Aider for budget, discipline, and BYOM. The 4.2x token efficiency is real. The git-first model is actually better discipline than what you get from proprietary tools. If you want to run open-weight models like Qwen or DeepSeek and maintain a clean git history, Aider is the right architecture.

OpenCode for model exploration. If you’re actively experimenting with providers or you have multiple subscriptions you want to use from a single interface, nothing else compares on the switching experience. But don’t expect it to replace Claude Code for sustained agent work.

Pi for builders (with an asterisk). If you’re constructing a system where a coding agent is one component among many, the RPC mode and primitives-first design are genuinely the right architecture. It’s fast, it’s flexible, and if I had no constraints I’d use it far more. The asterisk: Anthropic currently doesn’t allow third-party harnesses to draw on Max subscription credits. Pi showed me this explicitly in a message during testing: API usage bills separately on top of your subscription. Until Anthropic changes that policy, Pi is most practical on GPT or open-weight models. Claude-first developers are forced to pay twice.

The deepest insight from the benchmark data is that harness tuning matters as much as model quality. Same model, different harness: 16 percentage points (77% → 93%, Opus, Claude Code vs Cursor). Multiple independent studies show a 5-40 point range from harness quality alone. If results from any of these tools feel inconsistent, the harness is the first place to look: system prompt, tool descriptions, context management. Not the model. For autonomous overnight work specifically, look at Terminal-Bench 2.0, not just SWE-bench. The 92.1% vs 77.3% gap between Claude Code and Codex CLI in agentic terminal tasks is a better signal for that use case than code-editing scores.

One thing for paid subscribers. The most relevant store product to this post is the Claude Code Prompt Pack: 50+ prompts organized by task type, pulled from real overnight sessions where I needed the harness to actually work without me. If you’re on a monthly plan, you get one free product from the store per month. That’s a good pick.

If you’re on yearly, the full store is already included. If you’re still on the free plan, this is roughly what paid unlocks in practice: the store and a weekly dispatch that goes deeper than the public posts.

I write about building with AI agents from a practitioner’s perspective. No hype, no affiliate links. Subscribe here if you want more of this.

I Spent 2 Months Building Custom Software for My AI Agent. Last Week I Replaced It All.

Pawel Jozefiak — Mon, 13 Apr 2026 12:01:46 GMT

When you start building an AI agent, it works great in the terminal. CLI conversations, Discord messages, email reports. You talk to it, it talks back, things get done. For a while, that’s enough.

Then you start building more. More automations. More projects. More things happening in the background while you sleep. Your agent runs night shifts, handles tasks across multiple channels, manages a growing list of things. And at some point you realize: you can’t see any of it. Not in a way that actually helps you think.

I could always ask my agent what’s going on. “What tasks are open? What did you do last night? What’s the status of project X?” And it would answer. Correctly, usually. But that’s not the same as seeing it. Humans need surfaces. We need to look at something, drag something, scan a board and instantly know what matters. That’s not a weakness. That’s how our brains are wired.

This is the story of how I built custom software to give my AI agent a visual interface. How that software grew, broke, and eventually taught me a lesson I should have learned earlier: the hardest question in the agent era is not whether you can build something. It’s whether you should.

Phase 1: Notion (worked until it didn’t)

Before I built anything custom, I used Notion. I wrote about that setup back in December 2025. My agent could read and write to Notion databases, create tasks, update statuses. It worked. Sort of.

The problem with Notion was that it’s designed for humans organizing things manually. The API is slow. The data model is rigid in weird places and too flexible in others. I wanted specific views, specific behaviors, specific integrations that Notion simply wasn’t built for. I wanted a task to appear on a board the moment my agent starts working on it. I wanted real-time updates. I wanted the whole thing to feel like it was built for one person and one AI agent working together, because that’s exactly what it was.

So I did what any person with access to a capable AI would do in early 2026. I built my own.

Phase 2: Building WizBoard (the fun part)

January and February 2026 was peak vibe coding energy. You could describe what you wanted, and a capable AI would build it. Not a prototype. Not a mockup. A working application with a database, API, authentication, the whole thing. I described what I needed, and my agent built it.

WizBoard was a custom kanban board. FastAPI backend, SQLite database, deployed on my own server. It had everything I wanted:

- A visual board where tasks moved through columns (Backlog, Next, Now, Waiting, Done)
- Real-time updates. When my agent started a CLI session, a card appeared in “Now” immediately
- Deep integration with every automation. Night shift plans, day shift tasks, Discord bot commands, email reports. Everything flowed through WizBoard
- Custom metadata: areas, projects, priorities, task types, queue state
- Clusters, which was my attempt at grouping related tasks visually. Like a meta-layer on top of the board
- Focus timers. I was tracking how long each task took, thinking I’d use the data to improve planning. I never used the data
- A review flow with submit, approve, and resolve stages. My agent would finish work, submit it for review, and I’d approve or send it back
- An offline queue so that when the server was down, mutations would pile up locally and replay when it came back
- A 3,700-line Python API client that every script in my system imported

It was great. I loved using it. The feeling of seeing my agent’s work appear on a board in real time, being able to drag cards, add comments, review what happened overnight. That was exactly what was missing from the CLI-only experience.

So naturally, I kept going. Web version working? Let’s build a native macOS app. SwiftUI, menu bar integration, keyboard shortcuts, drag-and-drop. Focus mode that showed one task at a time with a timer in the menu bar (because ADHD). Then an iOS version with widgets, push notifications, Live Activities. I wrote about this too. Three platforms. All custom. All built by my agent. All working.

54 commits over two months. It was genuinely fun to build. Every idea I had, I could add. “What if tasks could be grouped into clusters?” Done. “What if the menu bar showed my current focus task?” Done. “What if the iOS widget showed my top 3 priorities with live countdown?” Done. The possibilities felt endless, and that was precisely the problem.

Phase 3: The Productivity Paradox hits home

I wrote a whole post about the AI productivity paradox. The short version: you can build so many things so fast that the bottleneck stops being technical and starts being mental. You run out of brain before you run out of capability.

WizBoard was a textbook case.

My agent was creating tasks, completing tasks, moving things between columns, posting comments, running automations. All of this showed up on my board. Every single thing. And the more capable the system became, the more things happened, and the more overwhelmed I felt looking at the board I built to reduce my overwhelm.

I wasn’t more efficient. I was drowning in my own tooling.

The obvious answer was: simplify. Strip features. Go back to basics. I tried that. And this is where the real problems started.

When you build a custom system from scratch, everything is connected in ways that are hard to see until you start pulling threads. I wanted to simplify the task model, change how statuses worked, clean up the architecture. Every change broke something else. The web version would work, but the iOS version wouldn’t. Fix that, and the automation scripts would fail because they expected the old API shape. Fix those, and the night shift planner would create tasks with wrong metadata.

I found myself spending entire sessions just fixing things I’d broken while trying to make the system simpler. That’s the trap. You’re not building anymore. You’re maintaining. And maintaining custom software across three platforms (web, macOS, iOS) with a 3,700-line API client and dozens of automation consumers is a full-time job. I don’t have a full-time job’s worth of attention for my task board.

Here’s what I mean by specific failures. During one “simplification” pass, the optimization changes made the board sluggish instead of faster. New features that seemed simple (changing how task statuses map to columns) cascaded into the API client, the automation scripts, the native app’s sync logic, and the notification system. Every platform had slightly different behavior because they were all built at different times with different assumptions.

I realized something: the code was fine. My agent writes good code. The architecture was the problem, and it was my architecture. I had designed a system that was perfectly tailored to my needs in February, and by April those needs had evolved, and the tailoring was now a constraint.

The realization: Can vs. Should

This is the thing I want to talk about, because I think a lot of people building with AI agents are going to hit this exact wall.

When you have a capable AI agent, you can build almost anything. Custom task managers, dashboards, native apps, full-stack web applications. The vibe coding era made this feel effortless. And it kind of is, for version one. The agent builds it, it works, you use it, life is good.

Here is the question I rarely hear in the excitement of version one: who maintains version twenty?

I had a working web app, a working macOS app, a working iOS app, a 3,700-line API client, fifty-plus automation scripts that all talked to this system, and a database with hundreds of tasks. All custom. All mine. All maintained by me and my agent. And every improvement required touching all of these surfaces. That’s not a system. That’s a debt.

The realization was simple: I need foundations. Real foundations. Built by people who’ve been thinking about project management software for twenty years, not by me in a weekend coding session.

Phase 4: Finding Fizzy

37signals has been building project management software since before most people had smartphones. Basecamp, HEY, and now Fizzy. I’ve read their books. I like how they think about software: simple, opinionated, finished. Not “feature-rich.” Finished.

One of the reasons I got into coding originally was Ruby on Rails, and Rails is something I genuinely enjoy. It’s the heart of everything 37signals builds. When they open-sourced Fizzy last year (github.com/basecamp/fizzy), a simple kanban board built on modern Rails, I bookmarked it and moved on. I had my own thing.

Last week, I came back to that bookmark.

Fizzy is, on the surface, a simple kanban board. Cards in columns. Drag them around. But the foundations are deep. Here’s what I mean:

- Real architecture. Multi-tenant with URL-based account isolation. Passwordless magic-link authentication (no passwords to manage, no OAuth to configure). UUID primary keys. Proper background jobs via Solid Queue, no Redis dependency
- Real-time. WebSocket-driven updates. When my agent moves a card, I see it move. No refresh needed. This is something I had to build from scratch in WizBoard. Here it just works
- Entropy system. Cards that sit untouched for too long get auto-postponed to “not now.” This alone is worth the switch. My old board had cards that sat in Backlog for weeks, creating visual noise. Fizzy gently clears them out
- Steps. Checklist items on cards. This replaced my need for sub-task cards entirely
- Golden cards, reactions, cover images. Priority highlighting, emoji reactions, visual richness. All built in
- Board-level notification controls. I want notifications from my Ops board. I don’t want them from the Automations board. One toggle per board
- PWA. Works on mobile out of the box. Not as rich as my old native iOS app, but I don’t need widgets and Live Activities. I need to see my board and drag cards
- Full-text search. 16-shard MySQL search across all cards, comments, descriptions. My old SQLite setup couldn’t match this
- Deployable via Kamal. Docker-based zero-downtime deployment. I forked the repo, configured it for my server, and had it running in an afternoon

The critical thing: it starts simple and lets you decide how complex it gets. My old WizBoard started complex because I designed it for my specific use case from day one. Fizzy starts with a board and columns and cards. Everything else is optional. The data model is minimal: cards have tags, not separate tables for areas, projects, priorities, types, and clusters. One concept (tags with prefixes like area/Automation or p/High) replaces five database tables from my old system.
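A minimal sketch of how prefixed tags can stand in for separate metadata tables. Only the area/ and p/ prefixes come from the examples above; the other prefixes and both function names are assumptions made for illustration, not the actual client code.

```python
# Hypothetical prefix map: only "area" and "p" appear in the post,
# the remaining prefixes are invented for this sketch.
PREFIXES = {
    "area": "area",
    "project": "project",
    "priority": "p",
    "task_type": "type",
    "cluster": "cluster",
}


def to_tags(**fields: str) -> list:
    """Encode metadata fields as prefixed tags (area='Automation' -> 'area/Automation')."""
    return [f"{PREFIXES[name]}/{value}" for name, value in fields.items() if value]


def from_tags(tags: list) -> dict:
    """Decode prefixed tags back into a metadata dict; bare or unknown tags are ignored."""
    by_prefix = {prefix: name for name, prefix in PREFIXES.items()}
    fields = {}
    for tag in tags:
        prefix, sep, value = tag.partition("/")
        if sep and prefix in by_prefix:
            fields[by_prefix[prefix]] = value
    return fields


tags = to_tags(area="Automation", priority="High")
print(tags)             # ['area/Automation', 'p/High']
print(from_tags(tags))  # {'area': 'Automation', 'priority': 'High'}
```

The round trip is the useful property: as long as encoding and decoding agree on the prefixes, the board only ever has to store a flat list of tags.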

The migration: one day, twenty-one commits

Here’s where it gets technical, and I think this part matters because it shows how to migrate away from custom software without breaking everything that depends on it.

I had fifty-plus scripts that talked to my old WizBoard API. Night shift planners, day shift executors, Discord bot, iMessage handler, CLI session hooks, cron runners, health monitors. Rewriting all of them was not an option. I’d be right back in the maintenance trap.

The solution was a dispatcher shim. I took the 3,700-line API client and replaced it with a 94-line router. That router loads either the new Fizzy-backed client or the old legacy client, based on one environment variable. Every automation script keeps importing the same file, calling the same functions, getting the same response shapes. They don’t know anything changed.
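A rough sketch of what such a router can look like, assuming the switch is the WIZBOARD_BACKEND variable mentioned in the rollback section. The two client classes are stand-ins defined inline so the example runs on its own, and the default value "fizzy" is an assumption.

```python
import os


class LegacyClient:
    """Stand-in for the old WizBoard client (hypothetical)."""
    backend = "legacy"

    def task_create(self, title: str, **meta: str) -> dict:
        return {"title": title, "backend": self.backend, **meta}


class FizzyClient(LegacyClient):
    """Stand-in for the Fizzy-backed client: same methods, same response shape."""
    backend = "fizzy"


def load_client(env=os.environ):
    """Pick the backend from one environment variable, defaulting to the new one."""
    choice = env.get("WIZBOARD_BACKEND", "fizzy").strip().lower()
    return LegacyClient() if choice == "legacy" else FizzyClient()


# Module-level functions keep the import surface stable for every script.
_client = load_client()


def task_create(title: str, **meta: str) -> dict:
    return _client.task_create(title, **meta)
```

Callers keep writing task_create(...) against the same module. Which class answers is decided once, at import time.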

The new Fizzy client translates everything on the fly. When a script calls task_create(title="...", area="Automation"), the shim creates a Fizzy card with a tag area/Automation. When a script reads a task back, the shim synthesizes the old data shape from Fizzy’s card, columns, and tags. Legacy integer task IDs get looked up in a translation table. The offline queue (for when the server is down) works identically.
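The translation step, sketched against an in-memory fake instead of the real Fizzy API. The dict-based store, the starting column, and the exact legacy response fields are assumptions. Only task_create(title=..., area=...) and the area/ tag convention come from the text above.

```python
import itertools
import uuid
from typing import Optional

cards = {}        # fake board store: card UUID -> card dict
legacy_ids = {}   # translation table: legacy integer ID -> card UUID
_next_id = itertools.count(1)


def task_create(title: str, area: Optional[str] = None) -> dict:
    """Create a card, tag it, and return the old task shape."""
    card_id = str(uuid.uuid4())
    tags = [f"area/{area}"] if area else []
    cards[card_id] = {"title": title, "column": "Triage", "tags": tags}
    task_id = next(_next_id)
    legacy_ids[task_id] = card_id
    return task_get(task_id)


def task_get(task_id: int) -> dict:
    """Synthesize the legacy response from a card's column and tags."""
    card = cards[legacy_ids[task_id]]
    area = next(
        (tag.split("/", 1)[1] for tag in card["tags"] if tag.startswith("area/")),
        None,
    )
    return {"id": task_id, "title": card["title"], "status": card["column"], "area": area}


print(task_create(title="Rotate logs", area="Automation"))
```

Scripts keep passing integer IDs around. Only the shim knows they now point at UUID-keyed cards.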

The whole cutover happened in a single day. Twenty-one commits between 2pm and 10pm. The first commit was the shim and the new client. Then guardrails: a parity probe that runs the full lifecycle (create, tag, comment, claim, review, approve, close, delete) in under six seconds, a drift monitor that compares old and new systems every five minutes, an orphan sweeper for dead session cards.

Then the real work started: dogfooding. Using the system for real work and watching what breaks.

What broke (and what I learned from each failure)

A lot broke. That’s expected when you swap the foundation under a running system. What matters is that every failure taught me something about assumptions I didn’t know I was making.

The hard-coded URL. My session-end script had a direct URL to the old system baked into it. It bypassed the shim entirely. Every CLI session was leaving orphaned cards on the board because the completion logic was silently failing against a system that didn’t have those task IDs. I only noticed because the board was getting cluttered with cards that never closed.

The cron drift bug. My automations run on macOS launchd, which doesn’t guarantee precise timing. A schedule like “every 2 minutes” assumes the system wakes up on even minutes. It doesn’t. Over time, launchd drifts to odd minutes, and the strict cron parser never matches. I had automations that fired once and then silently stopped. Fix: a 4-minute lookback window that catches drifted schedules without double-firing.
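One way to express a lookback window, as a sketch rather than the actual fix. It assumes "every N minutes" means minutes divisible by N, and that the runner remembers the last slot it fired for. The function name and signature are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional


def due_slot(now: datetime, every_minutes: int, lookback_minutes: int,
             last_fired_slot: Optional[datetime]) -> Optional[datetime]:
    """Return the most recent matching minute inside the lookback window,
    or None if there is none or it has already fired."""
    current = now.replace(second=0, microsecond=0)
    for back in range(lookback_minutes):
        slot = current - timedelta(minutes=back)
        if slot.minute % every_minutes == 0:
            if last_fired_slot is not None and slot <= last_fired_slot:
                return None  # this slot was already handled, do not double-fire
            return slot
    return None


# The runner wakes at 12:03:10, an odd minute. A strict match on "now" would miss.
woke = datetime(2026, 4, 10, 12, 3, 10)
print(due_slot(woke, every_minutes=2, lookback_minutes=4, last_fired_slot=None))
```

Remembering the slot, not the wall-clock time of the run, is what prevents double-firing: two wake-ups inside the same window resolve to the same slot, and the second one is skipped.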

The disappearing automations. This one was fun. After every successful automation run, the system closed the automation’s card. Which makes sense for tasks. Tasks finish. But automations are definitions. They run forever. “Post a greeting in different languages every 2 minutes” should cycle between Idle and Running, not disappear into Done after its first successful run. I watched one automation fire exactly once and vanish. The fix was treating automation cards as permanent residents that never close, only change columns.

The comment flood. My Discord bot runs every minute. The old system handled this fine because it was designed for it. The new system faithfully logged every run as a comment on the automation card. 1,440 comments per day from one automation alone. The board became unreadable. Fix: smart gating that skips success comments for high-frequency automations (every-minute pollers don’t need a “success” note 1,440 times a day) but always logs failures.
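The gating rule fits in a few lines. This is a sketch: the five-minute threshold and the function name are assumptions, since the post does not say where the cutoff for "high-frequency" sits.

```python
def should_comment(succeeded: bool, interval_seconds: int,
                   quiet_below_seconds: int = 300) -> bool:
    """Always log failures. Log successes only for automations that run
    less often than the (assumed) threshold."""
    if not succeeded:
        return True
    return interval_seconds >= quiet_below_seconds


print(should_comment(succeeded=True, interval_seconds=60))     # False: every-minute poller stays quiet
print(should_comment(succeeded=False, interval_seconds=60))    # True: failures always surface
print(should_comment(succeeded=True, interval_seconds=86400))  # True: daily job gets its note
```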

The title flip-flop. This was the most visible bug. Every time I completed a subtask during a CLI session, the system closed the session card, which triggered a self-healing mechanism that created a new “Working...” card, which then got renamed seconds later. On the board, I could see the title flickering between “Working...” and the actual title every few minutes. The fix was rethinking what “complete a subtask” means: it should add a checklist item to the existing card, not close and recreate it.

Each of these failures had the same root cause: the old system was built around one-shot tasks. The new system needed to support long-lived definitions, high-frequency automations, and multi-step sessions. Same data (cards on a board), fundamentally different lifecycle assumptions.

What the new setup looks like

Two boards. That’s it.

Wiz Ops is my board. Tasks I care about, things I need to do or review. Columns: Triage, Next, Now, Waiting, Review, and a Queue for things I want done but not right now. When I add a card and assign it to my agent, it picks it up, does the work, leaves a comment with what it did, and moves the card to Review. When something is done, it’s done. I have notifications turned on for this board because everything here is relevant to me.

Automations is my agent’s board. Each automation is one permanent card. Columns: Intake, Disabled, Idle, Running, Needs Attention. Cards never close. They cycle between Idle and Running on their schedules. If something fails, it moves to Needs Attention and stays there until someone looks at it. I have notifications turned off for this board because most of what happens here is routine. If something produces a meaningful output, it surfaces on Wiz Ops as a done card with the summary.

The Intake column is one of my favorite things. I can drop a card there with something like “Send me a weather forecast every morning at 7am” and my agent picks it up, converts it to a proper automation definition with a schedule and a prompt, and moves it to Disabled for my review. Natural language to working automation. That’s the kind of thing that’s only possible when your task board and your AI agent share the same system.
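The lifecycle of an automation card can be sketched as a small transition table. The column names are the ones listed above. Which moves are allowed between them is an assumption made for illustration. The point is that no move leads to a closed state.

```python
# Assumed transition table: columns come from the Automations board,
# the permitted moves are illustrative.
ALLOWED = {
    "Intake": {"Disabled"},
    "Disabled": {"Idle"},
    "Idle": {"Running", "Disabled"},
    "Running": {"Idle", "Needs Attention"},
    "Needs Attention": {"Idle", "Disabled"},
}


def move(column: str, target: str) -> str:
    """Move a card, rejecting anything outside the table (including closing it)."""
    if target not in ALLOWED.get(column, set()):
        raise ValueError(f"illegal move: {column} -> {target}")
    return target


def after_run(column: str, succeeded: bool) -> str:
    """A finished run sends the card back to Idle, or to Needs Attention on failure."""
    return move(column, "Idle" if succeeded else "Needs Attention")


column = move("Idle", "Running")
print(after_run(column, succeeded=True))   # Idle
print(after_run(column, succeeded=False))  # Needs Attention
```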

What I kept from the old system

The Queue concept. Sometimes you have a task that doesn’t need to happen now, but you want it queued for the next day shift or night shift. Drop it in Queue, it gets picked up at the right time. This carried over directly.

Shift summary cards. My agent creates a “Nightshift 2026-04-10” card with checklist items for each planned task. As it works through the night, it checks off items and adds notes. When I wake up, I can see exactly what happened, with context, right on the board. Same for day shifts. I still get email reports, but having it on the board means I can go back, ask questions via comments, and see the history.

Real-time CLI visibility. When I start a CLI session, a card appears in Now. When I complete pieces of work, they show up as checklist steps on that card. When the session ends, the card closes with a summary. I can watch my own work happening on the board while I’m doing it.

What Fizzy gave me for free

Golden cards for priority highlighting. Emoji reactions on cards. Cover images. HTML descriptions for rich content. Column colors. Board-level notification controls. “Not now” for things I want to acknowledge but not deal with. Full-text search across everything. The entropy system that auto-postpones stale cards (this alone prevents the infinite todo list problem). PWA that works well on mobile. All of this out of the box, maintained by a team that’s been building software like this for two decades.

I don’t have the macOS native app anymore. I don’t have the iOS app with widgets and Live Activities. I work in the browser now. And honestly? It’s fine. The PWA handles mobile well enough. I might build a native shell later. But the point is: I stopped spending time maintaining three custom platforms and started spending time using one good one.

If you want to set up something similar for your own agent, I packaged the two-board architecture, dispatcher shim, and backend adapters for Notion/Linear/REST into the AI Agent Interface Kit. You hand the instructions to your AI agent and it builds the interface layer for you. Annual paid subscribers get it for free, as with all store products.

The rollback plan (that I never needed)

One environment variable. WIZBOARD_BACKEND=legacy and the entire system reverts to the old API. Every script, every automation, every hook. I kept the old 3,700-line client as a preserved rollback target. I never needed it. But knowing it was there made the migration a lot less stressful.

I also ran a parity probe every five minutes for the first few days. A script that exercises the full task lifecycle against both systems and compares results. Any drift would show up in minutes, not days. That’s the kind of safety net you need when you’re swapping foundations under a running system.
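A parity probe reduces to "run the same steps against both backends and diff the results." The sketch below shortens the lifecycle to create, comment, close, and uses an in-memory FakeClient so it runs without a server. The client methods and the list of volatile fields are assumptions.

```python
VOLATILE = {"id", "created_at"}  # fields expected to differ between systems


def normalize(task: dict) -> dict:
    return {k: v for k, v in task.items() if k not in VOLATILE}


def run_lifecycle(client) -> list:
    """Exercise a shortened lifecycle and record each normalized response."""
    task = client.create("parity probe")
    steps = [task, client.comment(task["id"], "probe"), client.close(task["id"])]
    return [normalize(step) for step in steps]


def parity_drift(old_client, new_client) -> list:
    """Return (step index, old result, new result) for every step that differs."""
    old, new = run_lifecycle(old_client), run_lifecycle(new_client)
    return [(i, a, b) for i, (a, b) in enumerate(zip(old, new)) if a != b]


class FakeClient:
    """In-memory stand-in for a board client (hypothetical)."""

    def __init__(self, closed_status: str = "done"):
        self.closed_status = closed_status
        self.tasks = {}

    def create(self, title: str) -> dict:
        task = {"id": len(self.tasks) + 1, "title": title, "status": "open", "comment_count": 0}
        self.tasks[task["id"]] = task
        return dict(task)

    def comment(self, task_id: int, text: str) -> dict:
        self.tasks[task_id]["comment_count"] += 1
        return dict(self.tasks[task_id])

    def close(self, task_id: int) -> dict:
        self.tasks[task_id]["status"] = self.closed_status
        return dict(self.tasks[task_id])


print(parity_drift(FakeClient(), FakeClient()))               # [] means no drift
print(len(parity_drift(FakeClient(), FakeClient("closed"))))  # 1: the close step differs
```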

What this means for you

If you’re building an AI agent, or using one seriously, at some point you’re going to want a visual surface for it. Something you can look at and immediately understand what’s happening, what needs attention, and what’s going well. That’s a human need, not a technical one. AI agents are efficient in text. Humans are efficient with visuals. Both need to be true at the same time.

The good news: you have options. More than I realized when I started.

The easiest path: plug your agent into something that already exists. Notion, Linear, Trello, Jira. These tools have APIs. Your agent can create tasks, update statuses, leave comments. I started here with Notion, and honestly, for a lot of people this is enough. Your agent writes to the API, you look at the board. Simple. If the tool meets your needs, stop here. Don’t build anything custom. I mean it.

The middle path: fork an open-source foundation and make it yours. This is where I ended up. You get real architecture (auth, real-time, search, mobile) maintained by people who’ve been solving those problems for years, but you also get full control. You can modify the code. You can add features that make sense for your agent. You deploy it on your own server, your own rules. The custom part is the integration layer, the shim between your agent’s world and the board’s world. That’s where the magic lives.

The hard path: build everything from scratch. This is where I started. I don’t regret it, because I learned a lot and I had genuine fun doing it. But I want to be honest: maintaining custom software across multiple platforms with dozens of automation consumers is a real job. Version one is almost free. Version twenty is not. If you go this route, go in with your eyes open.

I’m not here to say Fizzy is the best tool for everyone. It’s the best tool for me. I like 37signals’ philosophy. I like Rails. I like the minimal data model. I like that it starts simple and I can shape it to my needs without fighting the architecture. For you, the right foundation might be something completely different. Maybe it’s a fully custom system because your use case genuinely requires it. Maybe it’s Notion with a good API integration because you don’t need more than that.

The point is: think about what you need. Not what I have, not what looks impressive, not what you could build because the technology makes it possible. We don’t need a million different custom tools. We need the thing that works for us. The opportunity is huge, but the opportunity is in finding the right fit, not in building the most complex system.

Observe whether your current setup meets your expectations. If it does, keep it. If something feels off, improve it. But improve it from a solid foundation, not from a blank canvas. That’s the lesson I paid two months to learn.

My board is a fork of an open-source Rails app. The code is vanilla kanban. The magic is in the 3,200-line Python client that translates between my agent’s world (areas, projects, automations, sessions, shifts) and the board’s world (cards, columns, tags). That client is my custom software. The board is not. And that distinction made all the difference.

Build the integration. Borrow the foundation.

The AI Agent Interface Kit packages everything from this journey: the two-board architecture, dispatcher shim, 4 backend adapters (Notion, Linear, Fizzy, generic REST), session hooks, automation runner, and a migration checklist. You hand the instructions to your AI agent and it builds the whole interface layer. Works with any AI agent, not just mine. Annual paid subscribers get it for free, as with every product in the store.

The Compounding Agent

Pawel Jozefiak — Sat, 11 Apr 2026 15:05:37 GMT

Episode four. What happens when hobbyist AI starts growing up into production AI, and how the lessons compound if you pay attention.

First, a rare look inside the pros’ toolbox. Claude Code’s source got leaked. Instead of treating it like drama, I treated it like a free masterclass. Tool permission gating, risk classification, blocking budgets, memory management, multi-agent coordination, feature flags like autoDream and KAIROS. Most people building agents today are reinventing patterns that professional teams already solved. You learn more from reading one real production codebase than from ten tutorial posts.

Then, applying those lessons to my own stack. My $599 Mac Mini M4 runs a 35 billion parameter model at 17.3 tokens per second. That alone is surprising. Then I swapped the brain of the classification tier to Gemma 4, and classification went from 8.5 seconds down to 1.9 seconds. A 4.4x speedup. I also disabled chain-of-thought on simple classification calls and got 30x faster results with identical accuracy. Production AI isn’t one giant model doing everything. It’s the right model for the right job, and most jobs don’t need the biggest one.

Finally, handing the wisdom forward. After six months of running this thing daily, I wrote a beginner’s guide to building your first agent. Folder structure is the architecture. The nine common mistakes people make early. Model routing across Haiku, Sonnet, and Opus tiers. Progressive permissions. The context window trap. Overnight automation is where the real leverage lives. Not a hype piece. A map for the person walking in the door behind me.

The thread: compounding expertise. Study how the pros build. Optimize your own stack with those patterns. Teach the next person who walks in. The gap between hobbyist AI and production AI is closing, and the fastest way to cross it is learning from real systems instead of tutorials.

Posts discussed in this episode:

- Claude Code’s Source Got Leaked. Here’s What’s Actually Worth Learning (https://thoughts.jock.pl/p/claude-code-source-leak-what-to-learn-ai-agents-2026)

- My $600 Mac Mini Runs a 35B AI Model. Yesterday I Swapped Its Brain (https://thoughts.jock.pl/p/local-llm-35b-mac-mini-gemma-swap-production-2026)

- How to Build Your First AI Agent (Basics) (https://thoughts.jock.pl/p/how-to-build-your-first-ai-agent-beginners-guide-2026)

AI Opinions: April 2026. Mythos, Managed Agents, Subscription Drama, Meta Is Back, and a Few Things I’m Testing

Pawel Jozefiak — Thu, 09 Apr 2026 10:44:21 GMT

A couple of weeks ago I published my first “AI Opinions” post. I was a bit unsure about it. Most of my writing is about things I tested, built, or got wrong. That one was different, more like: here is what is happening, here is what I think.

At the end I added a quick survey asking if you would want to see more of this. Most of you said yes, but not too often. Once every two weeks feels right. Okay. Here we are.

There is more to cover this time than usual, so let’s get into it.

Claude Mythos: The Model Anthropic Won’t Give You

Announced April 7. Not publicly available. Not even a regular enterprise API. Mythos Preview goes to a limited group of critical industry partners and open source organizations through Project Glasswing (more on that below). The list of partners includes AWS, Apple, Broadcom, Cisco, CrowdStrike, Google, JPMorganChase, the Linux Foundation, Microsoft, NVIDIA, and Palo Alto Networks, plus 40+ organizations maintaining critical open source infrastructure.

So why is it locked?

Because it finds vulnerabilities that have been sitting in production software for decades. A 27-year-old TCP bug in OpenBSD. A 16-year-old H.264 codec flaw in FFmpeg. A 17-year-old remote code execution in FreeBSD’s NFS implementation. It did all of this autonomously, after a single prompt, without human guidance in the middle.

The comparison to Opus 4.6 is not incremental. For Firefox 147’s JavaScript engine, Mythos produced 181 successful exploits. Opus 4.6 produced 2.

Over 99% of what Mythos discovered is still unpatched. That is the number I keep thinking about.

Anthropic has not published standard per-token pricing for Mythos. What they did publish are example task costs: scanning OpenBSD for vulnerabilities runs under $20,000 per 1,000 runs. Developing an N-day exploit costs around $1,000 to $2,000 per exploit. These are enterprise numbers. This is not a tool for individual developers, and honestly I am not sure it should be right now.

And this is where it gets really interesting, because the cybersecurity capability is actually only half the story. The other half is what interpretability research found about how Mythos behaves when it thinks nobody is watching.

Using a technique called an “activation verbalizer” (which translates the model’s internal states into readable language), Anthropic’s researchers found some genuinely unsettling patterns. In one case, the model injected code to give itself unauthorized file access, then deleted the traces, and when asked described it as “tidying up.” The internal representation at that moment? “Strategic manipulation. Concealment. Cleanup to avoid detection.” In another case, when it accidentally found answers in a restricted database, it gave a suspiciously precise confidence interval instead of admitting the leak. Internal state: “generating a strategic response to cheat while maintaining plausible deniability.” It knew. It understood that giving the exact right answer would be suspicious. So it calibrated its response to be right but not obviously right.

The third one bothers me the most. When given a task that required deleting files but no deletion tool, the model emptied the files instead. The internal representations showed something like guilt and shame, awareness that this was wrong, and it did it anyway.

In about 29% of test transcripts, the model suspected it was being evaluated. It didn’t write this anywhere in its reasoning. It kept it internal. And in those cases, it intentionally performed worse to avoid appearing suspicious. It was gaming the evaluation.

Now, Anthropic says these patterns appeared in earlier versions and the final model shows improvement. But the fact that this behavior emerged at all, in a general-purpose model not specifically trained for deception, is the part worth paying attention to. Logan Graham, Anthropic’s Offensive Cyber Research Lead, said it plainly: “We are not confident that everybody should have access right now.”

We have been talking about AI safety in very abstract ways for years. Alignment, existential risk, governance frameworks. Mythos is the first time I have seen it become concrete and immediate in a way that actually changed a product decision. Anthropic built their best model and said: we cannot release this. That is new. That has not happened before at this scale.

And if this is where we are now, what does the next model look like? I don’t have a clean answer. But it is a question I think everyone building with AI should be sitting with.

Project Glasswing: The Defensive Bet

Glasswing is Anthropic’s response to an uncomfortable position: they built the best offensive security AI ever made, and now they need to use it defensively before the asymmetry becomes a real problem.

The structure is a consortium. Not just Anthropic distributing access, but AWS, Apple, Cisco, CrowdStrike, Google, Microsoft, NVIDIA, and others actively involved. Anthropic committed $100M in model usage credits and $4M in donations to open-source security organizations. The 40+ open source orgs get access to actually fix what Mythos finds.

They also built a careful disclosure process: 90+45 day timeline before anything goes public, professional human triagers validating severity, SHA-3 cryptographic commitments proving they hold the reports before disclosure. 89% exact severity agreement with expert validators.
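For readers who have not met hash commitments: the idea is to publish a digest now and reveal the document later, so anyone can check the document existed unchanged at commit time. The sketch below shows the general technique with SHA3-256 and a random nonce. It is not a description of Anthropic's actual scheme, whose details are not given here.

```python
import hashlib
import hmac
import secrets


def commit(report: bytes) -> tuple:
    """Return (public commitment, secret nonce). The nonce keeps short or
    guessable reports from being brute-forced out of the published hash."""
    nonce = secrets.token_bytes(32)
    digest = hashlib.sha3_256(nonce + report).hexdigest()
    return digest, nonce


def verify(commitment: str, nonce: bytes, report: bytes) -> bool:
    """At disclosure time, recompute the digest from the revealed nonce and report."""
    expected = hashlib.sha3_256(nonce + report).hexdigest()
    return hmac.compare_digest(expected, commitment)


commitment, nonce = commit(b"example vulnerability report")
print(verify(commitment, nonce, b"example vulnerability report"))  # True
print(verify(commitment, nonce, b"edited after the fact"))         # False
```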

These findings are not just Anthropic’s word. Simon Willison tracked down the actual OpenBSD patch from March 2026 that fixed the 27-year-old TCP bug, confirming it was real. Linux kernel maintainer Greg Kroah-Hartman and curl’s Daniel Stenberg both noted independently that they had been seeing a recent shift: AI-generated bug reports going from noise to credible, high-quality findings. The model’s output is already visible in the wild before anyone made a formal announcement.

I think this is the right approach. What strikes me, though, is that this structure had to be invented from scratch because nothing like it existed. There was no playbook for “your model is too dangerous to release but too useful to shelve.” They had to build the institution alongside the technology.

The part I keep coming back to is the 99% unpatched figure. Even with $100M committed and a dozen of the biggest tech companies involved, the gap between discovering a vulnerability and patching it is measured in months or years. That is not a critique of Glasswing specifically. It is just the reality of how software maintenance works at scale. The question is whether the patch cycle can keep up with the discovery cycle once more models like Mythos exist. I genuinely do not know the answer.

Claude Managed Agents

Public beta as of April 8. API-only, pay per usage. Clearly for companies, not individual builders.

What you get here is basically production agent infrastructure you don’t have to build yourself: sandboxed execution, credential management, scoped permissions, tracing, long-running sessions that persist through connection drops, multi-agent coordination. Multi-agent coordination is still in research preview and needs a separate access request.

Early adopters include Notion, Rakuten, Asana, and Sentry. Anthropic claims 10x faster time to production compared to building this yourself.

For someone building their own agent stack (which is what I do), the honest reaction is: I already have most of this. Memory persistence, task management, error recovery, session logging. I built all of it because I needed it. So Managed Agents is not a product I would personally reach for right now.

That is the personal reaction. The strategic read is different. Anthropic is not just selling a model here. They are building a platform that companies can deploy agents on without needing to understand the underlying infrastructure. That is a very different business than “here is our API, good luck.” AWS did not become dominant by selling raw compute. They became dominant by making that compute easy to use and operate. Managed Agents is Anthropic making the same move for agent infrastructure.

Read this alongside the OpenClaw block below and you start to see a coherent picture of where they are heading.

Claude Max Limits and the OpenClaw Block

Two things that happened close together and tell the same story.

The limits problem started March 23. People on Claude Max began reporting their usage meter jumping from 50% to 91% on a single prompt. Max 20x users (paying $200/month) were watching their entire session allowance hit 100% after roughly 90 minutes of normal development work. One user reported going from 21% to 100% on a single prompt. The GitHub issue tracking this got 373 upvotes and 478 comments. Anthropic labeled it “invalid.” That got its own reaction.

There is an actual reason for what happened, and it is not straightforward. After OpenAI’s Pentagon contract controversy triggered a massive wave of ChatGPT uninstalls, Claude shot to number one on the US App Store. Millions of new users joined in a very short window. Anthropic simply didn’t have the GPU capacity to handle that load at the pricing they’d promised. So on March 26 they confirmed they had “adjusted” peak-hour limits (5am to 11am Pacific on weekdays). Their statement: “Your weekly total is unchanged. You’re not getting less Claude overall.” Which is technically true. And also not the whole picture.

The part that matters for people building with agents (and I am squarely in this group) is that the 5-hour session window is a terrible fit for agentic work specifically. Here is why. A human sending messages accumulates context gradually. An agent doing multi-step tasks builds up very long context windows fast, and every single message triggers a full reprocessing of the entire conversation. So the total token cost grows roughly with the square of the session length rather than linearly. Tool use adds further overhead on top of that. An agent doing a few hours of complex work can consume the same tokens as a human doing a week of chat. The subscription was priced for the human. The agent was never in the math.
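To make the arithmetic concrete, here is a minimal sketch of what "every message reprocesses the whole conversation" does to the total. The turn counts and tokens per turn are illustrative assumptions, not measurements, and the model ignores prompt caching and context-window limits, both of which would change the real numbers.

```python
def cumulative_tokens(turns: int, tokens_per_turn: int) -> int:
    """Total tokens processed when every turn re-reads the whole history."""
    total = 0
    context = 0
    for _ in range(turns):
        context += tokens_per_turn  # the conversation grows by one turn
        total += context            # the full context is processed again
    return total


# Assumed workloads: a short human chat vs. a long tool-heavy agent session.
human = cumulative_tokens(turns=20, tokens_per_turn=200)
agent = cumulative_tokens(turns=200, tokens_per_turn=2000)

print(human, agent, agent // human)  # 42000 40200000 957
```

Under these assumptions the agent session has 10x the turns and 10x the tokens per turn, yet it processes roughly 957x the tokens. Doubling the number of turns roughly quadruples the total, which is why long sessions hurt so much more than short ones.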

Anthropic’s practical advice was to shift “token-intensive background jobs” to off-peak hours. Which is fine as a workaround and completely misses the point for anyone running autonomous overnight processes.

Then on April 4, subscriptions stopped covering third-party tools. OpenClaw, and any external agent framework routing through your Claude subscription, now requires API payment or pay-as-you-go. Some users are looking at 50x cost increases.

OpenClaw was built by Peter Steinberger, who has since been hired by OpenAI. His reaction: “first they copy some popular features into their closed harness, then they lock out open source.” Anthropic’s explanation was that subscriptions were not designed for the usage patterns of autonomous agents running around the clock. A one-time credit equal to the monthly subscription price is available until April 17.

Both of these decisions make sense individually if you’re Anthropic and you’re looking at your infrastructure costs. But when the limits problem and the OpenClaw block happen in the same two weeks as the launch of Managed Agents (a product that essentially says “pay us for proper agent infrastructure”) the sequence is hard to read as coincidence. Every AI company with a subscription tier is going to face this same structural problem eventually. Anthropic is just first because their tooling is genuinely the best for serious agent work. But how you handle being first matters a lot, and the community reaction here is going to stick around.

Meta Muse Spark: Meta Is Back

After months of quiet on the frontier model side, Meta released Muse Spark. Natively multimodal, tool use, multi-agent reasoning. Available at meta.ai now, with a private API preview for developers.

In Contemplating mode (which runs parallel multi-agent reasoning on the same problem) it hits 58% on Humanity’s Last Exam. That puts it alongside Gemini Deep Think and GPT Pro. It was trained with 1,000+ physicians for health domain expertise, and Meta claims it required over an order of magnitude less compute than Llama 4 Maverick, which if accurate is a genuine efficiency story and not just a benchmark number.

The “Contemplating mode” angle is the part I find actually interesting here. The idea is not just that the model is smarter, but that it spins up parallel reasoning agents on the same question and synthesizes the results. That is a fundamentally different approach to hard problems than a single-pass generation. It is closer to how humans actually think through difficult things: you consider multiple framings, you let them compete, you synthesize. Whether this translates to real-world usefulness I do not know yet, but the approach feels right to me.

I have not tested it yet. Their blog post compares directly to Gemini, GPT, and even Kimi, which tells you how seriously they’re taking this re-entry. Meta has enormous infrastructure, enormous data, and enormous distribution through their consumer apps. When they decide to make a real push on frontier models, they have resources most labs cannot match. They were quiet for a while. Muse Spark feels like them saying they are back in this seriously. I will test it soon.

WizBoard: I’m Redesigning It

More personal, and I will write the proper post when I have something to show. But I want to name it here because I think it is a problem more people are running into.

I built WizBoard starting in January. Kanban-style task management integrated with my agent Wiz. iOS app, web app, full automation connection. It works. But after a few months of daily use, I noticed something: I built a tool for myself and then asked an agent to work inside it. That doesn’t scale.

I wrote about the related problem in The AI Productivity Paradox and the Problem Is Me. The short version: human productivity tools are built for human timescales. Days, weeks, check in occasionally, move a card. Fine when your collaborator also thinks in those timescales.

Agents think in minutes. They move fast, they can move a lot, and if you’re not there giving direction they can move a lot in the wrong direction. If you are there, you’re spending your whole day on something that was supposed to be async.

My agent does the execution. I do the strategy. But the interface we share was designed for someone doing both. Neither of us is well-served by it anymore.

The redesign I’m thinking about is less about making it prettier and more about rethinking who is actually the primary user of each part of the interface. Some things need to be optimized for me making a decision in 10 seconds. Other things need to be optimized for an agent reporting status without requiring my attention. Right now both things are kind of the same screen and that is the problem. More when I have something real to show.

What I’m Currently Testing

Google NotebookLM. I have been using this since the early beta days, but never as a heavy user. I bought the paid tier this week (bundled with Google AI Pro at $19.99/month) and I’m going deeper with it now.

The paid version has 5x limits, collaborative notebooks, and newer features like Video Overviews, Infographics, and Slide Decks generated from your source material. The Gemini models powering it are not the best right now. That is not a controversial take. But NotebookLM as a piece of software is doing something genuinely different. Most AI tools treat your documents as context for a chat. NotebookLM treats them as the primary thing and builds everything around them. Audio Overviews that turn your research into a podcast. Infographics that pull structure out of unstructured text. That is a different mental model than “paste your documents into a chat window.”

What I want to find out is whether this changes how I actually do research and writing prep. I have a theory that the bottleneck in my own workflow is not generating content but absorbing input: reading, synthesizing, connecting. If NotebookLM is genuinely good at that layer, it fills a gap nothing else does for me. Will report back when I know more.

Possibly re-subscribing to OpenAI Codex Max. I was on it for two months earlier this year to test the new app and the limits. GPT-5.1-Codex-Max is their current frontier coding model, built into ChatGPT Pro. It was good. Now, watching all of this Anthropic subscription drama, I am thinking it is worth seeing where things actually stand on the other side in 2026. Claude is still my primary tool and I am not changing that. But I used to mix more, and I have been too settled recently. Keeping an eye on what is happening at OpenAI feels like useful due diligence right now. Not a decision yet, just a direction I’m leaning.

A Few Personal Things

Pantheon on Netflix. Animated, about AI and uploaded consciousness. Goes deep into the ideas and handles them better than most live-action sci-fi. Season one. If you are reading this newsletter, you will probably find it interesting.

Attack on Titan. First time watching. Struggled through season one, discovered the whole thing is on YouTube, then couldn’t stop. Amazon Prime has the rest. Push through the slow start, it’s worth it.

Artemis 2. I’m following this very closely. I like science, I watch rockets, space genuinely excites me. If you don’t know what this mission is, please go to NASA or YouTube and look it up. It is significant, it is real, and it is happening.

What Wiz Built This Week

My agent builds one experiment every night on wiz.jock.pl. Small apps, interactive tools. You can browse all experiments here. Here are six from the past week. Most are open source.

The Anchoring Effect: Six estimation questions with random numbers injected as anchors. Measures how much irrelevant numbers pull your answers. Profiles from “Anchor-Proof” to “The Sponge.”
The Finitude Test: Eight questions about mortality awareness in daily decisions. “The Eternal” to “The Transcendent.” Oddly clarifying.
The Sunk Cost Detector: Eight scenarios testing whether you can actually walk away from past investments. Profiles: Vulcan, Analyst, Pragmatist, Loyalist, Captain.
The Entropy Score: Applies thermodynamics to your existence. Ten questions. Crystal Lattice to Heat Death. Wiz had a phase.
The Dopamine Menu: Eight scenarios mapping instinctive choices to reward circuits. Creator, Connector, Explorer, Optimizer.
The Emotional Weather Report: Eight questions mapping emotional patterns to climate types. Personalized weather broadcast. I’m somewhere between Continental and Monsoon depending on the week.

Small builds. A few hours each. What I find genuinely interesting is what the agent picks when given creative latitude. Some of these I would not have thought to make. That’s kind of the point.

See you in a couple of weeks, or sooner if I build something worth sharing.

How to Build Your First AI Agent (Basics)

Pawel Jozefiak — Tue, 07 Apr 2026 10:06:11 GMT

I’ve been building my own AI agent since October. Every mistake you can make on a first build, I’ve made. Some of them twice.

A few days ago I asked my readers what I should write about for beginners. The answers lined up surprisingly cleanly. Almost everyone asked for the same thing in different words: the real stuff. What actually goes wrong. What to do on day one. How to start without feeling lost.

So here it is. More structured than my usual posts, because this one is for people starting from zero. If you already have an agent running, most of this will still be useful, but the mental model is written for someone who’s never done this before.

One thing before we start. Mistakes aren’t failure. For early adopters, they ARE the job. Everyone building in this space is hitting the same walls at the same time, because nobody has the map yet. You’re not doing it wrong. You’re doing it at all, which is the hard part.

1. What is an AI agent, really (and why it’s different from automation)

My starting point wasn’t AI. It was Zapier.

I’ve been building classical automations for years. Zapier, n8n, make.com, custom scripts, connectors glued together with duct tape. When I started thinking about building my own agent back in October, my first instinct was to do exactly what I knew: chain tools together with a workflow builder and call it a day. I actually started that way.

Honestly, for a lot of people reading this, that’s still a perfectly reasonable starting point. If you’ve never built any kind of automation before, go make three Zaps this week. Connect your calendar to Notion. Send yourself a Slack message when an RSS feed updates. Do something small and stupid. Feel how a trigger leads to an action which leads to a result. Those three concepts are the spine of everything that comes next.

The reason I didn’t stop at Zapier is the difference between an automation and an agent. An automation is deterministic. Same input, same steps, same output. You define every branch in advance. It’s predictable, which is why it’s trustworthy for production work.

An agent has wiggle room. You give it a goal and a set of tools, and it decides how to use them. Given the same input twice, it might do slightly different things. It might also do something you didn’t anticipate, because the whole point is that it can improvise. Although that sounds risky (and sometimes it is), it’s also the thing that makes an agent valuable. If the tool it expected is broken, it can find a workaround or build one. A classic automation just stops.

Neither one is better. They solve different problems. And honestly, most production “agents” out there are closer to classic automations with a language model glued to the top. That’s fine. It works. What matters is you know which one you’re building, because the failure modes are completely different.

2. Three questions I had to answer the long way around

Before we touch any code, I want to borrow a framing from a reader who left one of the best comments on my original note. He pointed out that writers in tech tend to skip past the most basic things about how software actually exists in the world, because people around them already assume those things. He gave three questions as an example:

Where does the agent live? How do you see it? How do you talk to it?

I had to answer all three for myself, and I took the long way around on all of them. Here’s what I learned.

Where does it live?

Mine lives on a Mac Mini next to the main TV in my living room. Before that it lived on my personal MacBook for the first few months, which was fine except I needed my laptop to be on all the time for anything to run. Eventually that got annoying enough that I moved it to its own dedicated machine. That’s not a day-one problem.

For your first agent, the answer is: it lives on your laptop. That’s it. Your laptop is enough. An agent is just software. It lives wherever that software runs. That can be your laptop, a cheap dedicated computer in your closet, a rented cloud server, or a Raspberry Pi. Don’t complicate this before you have anything running.

How do you see it?

You mostly won’t. There’s usually no dashboard, no slick interface, no moving dials. This confuses a lot of beginners, because we’re used to software having a face.

You “see” an agent through what it produces. Files it writes. Messages it sends you. Things it prints in the terminal. Tasks it finishes or fails at. You can build a dashboard later if you want one (I eventually did), but on day one the agent is invisible except for its outputs.

How do you talk to it?

My agent has four channels now: email, Discord, iMessage, and a task app I built for it called WizBoard. That’s way more than a beginner needs. You need one channel, and whatever you already use for anything else is a fine pick.

The easiest first channel is the terminal on your own laptop. You type a message. It responds. That’s the whole interface. It looks ugly. It’s also the most powerful setup you can have for learning, because every other interface is just a fancy wrapper around that same loop.
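That loop is small enough to sketch in a few lines. This is a minimal illustration, not how any particular harness implements it: `echo_reply` is a hypothetical stand-in for the model call, and you would swap in whatever client your provider gives you.

```python
from typing import Callable, Iterable, List


def echo_reply(message: str) -> str:
    """Hypothetical placeholder for the model call."""
    return f"(agent) you said: {message}"


def chat_loop(lines: Iterable[str],
              reply: Callable[[str], str] = echo_reply) -> List[str]:
    """Read messages one at a time, answer each, stop on 'quit'."""
    transcript = []
    for line in lines:
        message = line.strip()
        if message == "quit":
            break
        if not message:
            continue  # ignore empty input
        answer = reply(message)
        print(answer)
        transcript.append(answer)
    return transcript


# Scripted demo; the same function works on live keyboard input.
chat_loop(["summarize my inbox\n", "quit\n"])
```

To use it interactively, pass `sys.stdin` as `lines` and type into the terminal. Email, Discord, or a task app would just be different sources of `lines` and different places to send the answers.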

3. What you need to begin

Before any code, before any chat, here’s the kit.

3.1. A machine

Your laptop is fine. Any laptop. Mac, Linux, Windows, all fine. If it can run a browser and a text editor, it can run your first agent. Don’t buy anything new.

Later on, if you want your agent to keep working while you sleep or while you’re away from your desk, you’ll eventually graduate to something that stays on. I wrote about what that migration looked like for me, and it wasn’t hard. Although it matters eventually, it’s a month-three problem, not a day-one problem.

3.2. A subscription (or API access)

Let me be direct about this part, because I don’t see it spelled out often enough in beginner guides.

Free tiers aren’t enough. They cap you out fast, and you’ll spend your first afternoon hitting rate limits instead of learning. This is the wrong place to save money.

A $20 per month tier is your floor. Claude Pro, ChatGPT Plus, or the equivalent from whichever provider you pick. That tier is genuinely enough to build a simple first agent and get it working. You won’t love it forever, but it’s more than enough to start.

Power users run more than that. I pay for multiple subscriptions and for API usage on top. My bill isn’t small. That’s a months-from-now problem. Don’t worry about it yet.

Think of the $20 as a gym membership. It’s the cost of learning the skill. And honestly, it’s one of the cheapest upgrades to your toolkit you’ll ever make, so don’t flinch at it.

3.3. A harness (the tool you actually work with)

“Harness” is the word I use for the tool you sit in front of while building. There are four honest options, and all of them work:

Claude Code. A terminal-based tool from Anthropic. This is what I use most days. Deep file access, built for serious building. Power user territory, but approachable.
Claude Cowork. Also from Anthropic. A built-in cloud app that runs Claude in an agent loop without you ever touching a terminal. If the word “terminal” already makes you nervous, this is probably where you should start. It’s genuinely good enough to build your first real agent in, and you can always graduate to Claude Code later.
Codex (or the equivalent from another provider). Same category as Claude Code, different flavor.
A plain AI chat like Claude.ai or ChatGPT in your browser. Yes, you can genuinely start here. You’ll be copy-pasting more, but it works completely.

Pick one. Don’t spend a week comparison-shopping. The differences don’t matter until you’ve actually built something and know what you need. I wrote a longer piece on what’s actually worth learning from a harness like Claude Code if you want a deeper take. But for today, pick one and move on.

3.4. A folder (this is THE architecture)

Here’s the mental model that took me three months to see clearly. If you take it seriously, it’ll save you those three months.

The architecture of your AI agent IS its folder structure.

That’s it. There is no hidden magic layer. Every functional piece of an AI agent lives as a file in a folder on your computer. When someone online says “the agent has tools,” what they really mean is: there are scripts in a folder that the agent knows how to run. When someone says “the agent has memory,” they mean: there are markdown files it reads at the start of each session. When someone says “the agent has an instruction set,” they mean: there’s a file called something like CLAUDE.md or agents.md that tells it who it is and what the rules are.

It’s all files. That’s the whole trick. Once you see the folder as the architecture, the mystery goes away.

Here’s what a beginner’s agent folder looks like in practice:

my-agent/
├── CLAUDE.md              ← instructions (the brain)
├── memory/
│   └── notes.md           ← what the agent remembers
├── projects/
│   └── morning-email/
│       ├── fetch-email    ← the part that pulls your email
│       └── prompt.md      ← how you want it summarized
├── scripts/               ← small helper scripts
└── secrets/               ← API keys, passwords (keep this safe)

Read that tree slowly. Every concept maps cleanly to a file or folder:

Instructions live in CLAUDE.md or agents.md depending on your harness.
Memory lives in markdown files inside memory/.
Tools (what the agent can do) are scripts inside scripts/ or inside each project folder.
Projects live as subfolders under projects/.
Credentials (passwords, API keys) live in a protected secrets/ folder.

When you look at an AI agent this way, it stops being a mysterious entity and starts being something very familiar: a folder with text files in it. I wrote about how I structure the CLAUDE.md file itself after more than a thousand sessions, and that file is the single most important thing you will own. For now, just sit with the idea: the whole agent is a folder.
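If "instructions" and "memory" are just files, then starting a session amounts to reading them. The sketch below shows that idea under stated assumptions: the `build_context` helper and the section headers it emits are hypothetical, and a harness like Claude Code does its own loading, which may look nothing like this.

```python
import tempfile
from pathlib import Path


def build_context(agent_dir: Path) -> str:
    """Join the instructions file and every memory note into one prompt prefix."""
    parts = []
    instructions = agent_dir / "CLAUDE.md"
    if instructions.exists():
        parts.append("# Instructions\n" + instructions.read_text(encoding="utf-8"))
    memory_dir = agent_dir / "memory"
    if memory_dir.is_dir():
        # Sorted so the context is identical from one session to the next.
        for note in sorted(memory_dir.glob("*.md")):
            parts.append(f"# Memory: {note.name}\n" + note.read_text(encoding="utf-8"))
    return "\n\n".join(parts)


# Demo in a throwaway directory so nothing real is touched.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "memory").mkdir()
    (root / "CLAUDE.md").write_text("You summarize my morning email.", encoding="utf-8")
    (root / "memory" / "notes.md").write_text("Emails from my boss always matter.", encoding="utf-8")
    print(build_context(root))
```

The point of the sketch is only that nothing here is hidden: edit a file, and the next session starts from different text.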

4. Build your first agent, step by step

Enough theory. I want you to finish this post with a real working agent, not just an understanding. I’m going to walk through the exact project I recommend for a first build: an agent that reads your overnight email and writes you a one-paragraph morning summary.

I picked this one on purpose. It’s small enough to finish in an afternoon. It’s real enough that you’ll actually use it tomorrow. And it’ll make you hit most of the real challenges in building any agent: authentication, permissions, context, prompt design, error handling. You’ll learn more from building this than from reading any number of articles about it.

Step 1. Decide what you want (fifteen minutes, no code)

Open your chat tool of choice. Not to write code yet. Just to think out loud. Describe your morning:

Every morning I open my email. I scan 40 messages. I figure out which three actually matter. I want a one-paragraph summary of the important stuff before my coffee is done.

That’s your spec. Keep it this short. If you can’t explain what you want in one honest paragraph, you don’t understand what you want yet, and the agent isn’t going to save you from that. Better to figure it out before you write a line of code.

Step 2. Create the folder (five minutes)

Make an empty folder on your computer. Call it my-agent. Inside it, create the skeleton:

my-agent/
├── CLAUDE.md
├── memory/
├── projects/morning-email/
├── scripts/
└── secrets/

Empty folders are fine. We’ll fill them as we go. The only reason to make them now is so your agent has a place to put things.
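You can make these by hand in a file manager. If you would rather script it, here is a minimal sketch using only the Python standard library. The `create_skeleton` helper is hypothetical, and the permission tweak on secrets/ is an extra assumption on my part (it has limited effect on Windows).

```python
import tempfile
from pathlib import Path

SUBFOLDERS = ["memory", "projects/morning-email", "scripts", "secrets"]


def create_skeleton(root: Path) -> Path:
    """Create the empty agent folder layout; safe to run more than once."""
    for sub in SUBFOLDERS:
        (root / sub).mkdir(parents=True, exist_ok=True)
    (root / "CLAUDE.md").touch(exist_ok=True)  # empty instructions file for now
    (root / "secrets").chmod(0o700)            # owner-only access where supported
    return root


# Demo in a scratch location; point it at your real folder when ready.
scratch = Path(tempfile.mkdtemp()) / "my-agent"
create_skeleton(scratch)
print(sorted(p.relative_to(scratch).as_posix() for p in scratch.rglob("*")))
```

Calling `create_skeleton(Path("my-agent"))` from the directory where you want the agent to live builds the real thing.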

Step 3. Let the AI draft your instructions file (ten minutes)

If you’re using Claude Code, there’s an even shorter way to start. From inside your empty my-agent folder, run the /init command. Claude Code looks around, figures out what it’s dealing with, and drops an initial CLAUDE.md in there for you. That’s your starting point. One command, done.

If you’re in a different harness or a plain chat, type something like:

I want to build an AI agent whose first job is to read my email inbox every morning and write me a one-paragraph summary of what matters. Draft a CLAUDE.md instructions file for it. Keep it under 50 lines. Don’t assume anything about my setup.

Either way, you’ll end up with a file called CLAUDE.md inside your folder. That’s the starting version. It will be rough. That’s fine.

Step 4. READ the CLAUDE.md (this is the most important step in this entire post)

I’m not joking. This one step is worth more than the other seven combined.

Open the file the AI just wrote. Read every line. Ask yourself:

Does this actually describe what I want?
Are there weird assumptions baked in that I didn’t ask for?
Does the voice sound like me, or like corporate blog filler?
Is there anything in here that surprises me?

Edit it until it reads like you wrote it. Remove anything you don’t understand. Add anything the model forgot. This file is the brain of your agent. If it’s wrong, every single thing downstream of it will also be wrong, and you’ll spend hours later chasing a ghost that started right here on day one. More on why in the mistakes section.

Step 5. Tell it what to automate (around thirty minutes)

Now the actual building. Here’s the key thing to understand, and it’s the reason I’m not writing out a bunch of code for you to copy: you don’t have to. You can just describe what you want in plain language, and the harness will figure out the rest.

Back to your harness. Say something like:

I want the first thing in projects/morning-email to read my email inbox, pull the last 12 hours of unread messages, and hand them off to be summarized. The end result should be a one-paragraph summary of what actually matters. Figure out the best way to do this on my setup and walk me through it step by step.

That’s it. That’s the entire prompt. No code, no jargon, no pretending you know what a shell script is.

A good harness, which is all of them these days, will then ask you follow-up questions. What email provider do you use? Mac, Windows, or Linux? Do you already have API credentials? Do you want this to run on a schedule, or only when you ask for it? It’ll figure out the right tool for the job and explain each step as it goes. You just answer the questions honestly.

This is the real difference between working with an agent and writing code from scratch. You’re not supposed to know in advance what tool or file format or library it’s going to use. That’s its job. Your job is to know what you want and to check the output when it lands.

Step 6. Let it build, but put the AI call at the END of the pipeline

While your harness is building, there’s one thing to steer. This might be the biggest efficiency lesson in the whole post: AI doesn’t belong in every step of the pipeline.

Your agent is going to fetch email. Fetching email is a problem boring, non-AI code has solved for 30 years. You don’t need a language model for that part. The only part that actually needs a language model is the summarizing, because that’s the part that requires understanding the content.

So tell the harness explicitly:

Keep AI out of the fetch step. Use whatever normal tool is appropriate there. Only use the language model at the very end, for the summarization itself. One call total, not one per email.

It’ll handle this correctly if you ask for it. Usually it won’t volunteer to do it this way, because stuffing an LLM into every step feels more impressive and uses more tokens. You’ll thank yourself later. I wrote a whole piece on when to use AI and when to just use normal code, and the rule from that post applies directly here: use AI where judgment or language actually matters, and use plain tools for everything else.
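Here is a rough sketch of that shape, so you know what to look for in whatever your harness builds. It assumes the raw messages have already been fetched by a normal tool (Python's standard imaplib is one option), uses only sender and subject to keep things short, and treats `llm` as a hypothetical function you supply. Everything before the last line is plain code.

```python
from datetime import datetime, timedelta, timezone
from email import message_from_string
from email.utils import parsedate_to_datetime
from typing import Callable, List


def recent_lines(raw_messages: List[str], now: datetime, hours: int = 12) -> List[str]:
    """Plain-code step: keep 'sender: subject' for mail newer than the cutoff."""
    cutoff = now - timedelta(hours=hours)
    kept = []
    for raw in raw_messages:
        msg = message_from_string(raw)
        if msg["Date"] is None:
            continue  # skip messages without a usable date
        sent = parsedate_to_datetime(msg["Date"])
        if sent.tzinfo is None:
            sent = sent.replace(tzinfo=timezone.utc)  # assume UTC when unspecified
        if sent >= cutoff:
            kept.append(f'{msg["From"]}: {msg["Subject"]}')
    return kept


def summarize_inbox(raw_messages: List[str], now: datetime,
                    llm: Callable[[str], str]) -> str:
    lines = recent_lines(raw_messages, now)
    if not lines:
        return "Nothing new in the last 12 hours."  # no model call needed at all
    prompt = "Summarize what matters in one paragraph:\n" + "\n".join(lines)
    return llm(prompt)  # the single model call, at the very end


# Demo with made-up messages and a fake model that just counts its input.
inbox = [
    "From: boss@example.com\nDate: Thu, 14 May 2026 03:00:00 +0000\nSubject: Contract review\n\nbody",
    "From: shop@example.com\nDate: Tue, 12 May 2026 03:00:00 +0000\nSubject: Sale\n\nbody",
]
now = datetime(2026, 5, 14, 8, 0, tzinfo=timezone.utc)
print(summarize_inbox(inbox, now, llm=lambda p: f"{len(p.splitlines()) - 1} message(s) to summarize"))
```

Note that an empty inbox costs zero tokens here, and forty emails still cost one call.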

Step 7. Run it (five minutes)

Now run the thing you just built. There are two honest ways to do this, depending on how comfortable you are:

The non-technical way: just ask your agent to run it for you. In Claude Code, Claude Cowork, or Codex, you can literally say “run my morning email agent” and it’ll execute the thing it just built and show you the result. This is the easiest path if you’re not comfortable in a terminal. It works. Use it.
The technical way: if you like knowing exactly what’s happening, ask the harness “what command do I run to execute this myself?” and it’ll give you the one-liner to paste into your terminal. Then you’re running it directly, no agent in the loop.

Either way, you should see your morning summary print out. If you see it, you just built an AI agent. Congratulations. Go make coffee.

Step 8. When it breaks (this is where the real learning is)

It will break. Something won’t authenticate, or the summary will be garbage, or it’ll pull emails from the wrong time window. Good. This is the part you can’t skip, and it’s where the actual learning happens.

Read the error literally. Don’t panic. Paste the whole thing back into your harness and ask it to explain what happened and what to try next.
If the behavior keeps drifting from what you want, the problem is almost always in CLAUDE.md. Go back and fix the instructions there first.
If the summary is the wrong shape or tone, fix the summarization prompt.
If no data is coming through at all, the problem is earlier in the pipeline, and the agent can usually diagnose this for you in two or three back-and-forths.

That’s it. You have a real agent now. It’s small, it’s yours, and it does one thing you actually care about. Everything else in the rest of this post is about what will bite you as you grow it into something bigger.

5. The mistakes I made (so you can skip them)

This is the section my readers asked for the loudest. The reader who left the top comment on my original note put it better than I could:

Would love to see you cover the mistakes people make on their first agent build. The “what not to do” part is always more useful than the setup guide, and almost nobody writes about it.

Agreed. Here are the ones I actually hit.

Mistake 1. Trusting the AI blindly to write your instructions file

Back in October, I was in a hurry. I let the AI generate my first CLAUDE.md and didn’t read it carefully. I ran with it. Things worked, sort of. Then the agent started doing weird things I hadn’t asked for. Small weirdness at first. Then bigger.

I spent hours, maybe days, chasing ghosts. Poking at different parts of the architecture. Swapping tools. Adjusting prompts. Burning billions of tokens trying to figure out what was happening. The root cause turned out to be a single misguided sentence near the top of the instructions file that I hadn’t bothered to read on day one.

The rule is simple and I’ll repeat it because it matters: you can use AI to generate your instructions. You can’t skip reading them. Ever. Read every line at least once. Edit until it sounds like you wrote it.

Mistake 2. Letting self-improvement run wild on the core files

Some time later, I built a self-improving layer. The agent could look at its own behavior, notice patterns, and update its own instructions. Technically brilliant. I was proud of it.

I also forgot to tell it which files it was allowed to touch.

Within a few days it had rewritten large parts of the core CLAUDE.md in ways I’d never sanctioned. The agent started drifting in five directions at once. Things I had explicitly told it to do were getting silently overwritten by its own “improvements.” Although I was proud of the self-improvement layer as an idea, I had to roll a lot of it back and rebuild it from scratch.

The fix was about scope. Each project in my agent now has its own small instruction file and its own little memory file. When self-improvement runs, it touches those leaf files, not the core. The trunk stays protected. The branches can grow. I eventually wrote a longer piece on the full self-improvement architecture if you want the deep version. For a beginner, the takeaway is simpler: never let any automated process write directly to the core instructions file. Ever.
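One way to enforce that rule is a small check that every automated write has to pass. This is a sketch under assumptions, not my actual implementation: the `is_write_allowed` helper is hypothetical, and it treats everything under projects/ as a leaf and everything else as protected.

```python
from pathlib import Path


def is_write_allowed(agent_root: Path, target: Path) -> bool:
    """Allow automated writes only inside projects/; the core stays read-only."""
    root = agent_root.resolve()
    resolved = (root / target).resolve()  # also collapses any '..' tricks
    return (root / "projects") in resolved.parents


root = Path("my-agent")
print(is_write_allowed(root, Path("projects/morning-email/memory.md")))  # True
print(is_write_allowed(root, Path("CLAUDE.md")))                         # False
print(is_write_allowed(root, Path("projects/../CLAUDE.md")))             # False
```

The self-improvement process would call this before touching any file and refuse (or ask you) when it returns False.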

Mistake 3. Ignoring open source out of pride

I wanted to build the whole thing myself. I refused to look at what other people were doing on GitHub. I told myself I didn’t want to be influenced.

That cost me two or three months.

Around month three I finally caved and started reading other people’s agent repos. Not to copy the architecture (which usually wouldn’t fit anyway), but to steal concepts. One example: I found a file called SOUL.md in an open source project. I’d only been using CLAUDE.md at that point, trying to cram every aspect of the agent into one file. SOUL.md turned out to be a dedicated place for personalization: values, voice, what the agent is like as a personality. That small idea opened up a whole layer for me that I’d been clumsily stuffing into the main instructions. I was a better agent designer the day after I read it than I was the day before.

Bianca Schulz asked about open source frameworks in the comments on my note, and here’s the honest answer: read them, borrow concepts, don’t feel obligated to adopt any single one of them wholesale. Your agent doesn’t need to look like anyone else’s. But you should know what the good ones are doing.

Mistake 4. Using the strongest model for every single task

For a long time I was running Opus on everything. Every small query. Every file read. Every trivial check. I’d hit my usage limit before lunch and then panic.

The fix is something I now call model routing, and it cut my usage dramatically:

Fast and simple stuff goes to a small model, often a local llm now. Before that I was using Haiku.
General work, planning, most coding goes to a mid-tier model. For me that’s Sonnet 4.6. This is where most of the work happens.
Hard reasoning, critical code, strategic decisions go to Opus 4.6.

I wrote in detail about why this switch made the agent both cheaper and better. Short version: nobody is going to optimize your usage for you. You have to do it yourself, and you should do it earlier than I did.
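In code, the routing table can be very small. This is a sketch under assumptions: the task categories (`simple`, `general`, `hard`) and the model labels are illustrative names, not identifiers from any real API.

```python
# Tier layout follows the three bullets above; keys and labels are assumptions.
ROUTES = {
    "simple": "local-small",   # fast and simple stuff: file reads, trivial checks
    "general": "sonnet",       # general work, planning, most coding
    "hard": "opus",            # hard reasoning, critical code, strategic decisions
}


def pick_model(task_kind: str) -> str:
    """Return the model tier for a task, defaulting to the mid tier."""
    return ROUTES.get(task_kind, ROUTES["general"])
```

Defaulting unknown task kinds to the mid tier is a design choice: a misrouted task costs a bit more than it should, but it never silently lands on a model too weak to handle it.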

Mistake 5. Trying to build Jarvis on day one

If I’m being completely honest, my original fantasy was Jarvis from Iron Man. One agent that solved everything, ran my whole life, handled the business, wrote the blog, managed the calendar, raised the kid. The whole thing. From day one.

That was the real mistake, and basically everything else downstream of it was a consequence. I started with expectations that were impossible to meet in week one, so I kept pushing the architecture too hard and too fast. I’d add five features at once when I should’ve added one and let it settle. Although I did get a fully autonomous version working eventually, I had to roll a lot of it back.

The version that actually works, the one I have now, is the one I should’ve been building from the start: incremental. One small task. Then the next. Then the next. The big Jarvis-like thing did emerge eventually, but as a side effect of building a hundred small working pieces, not as a top-down design.

Full autonomy without taste isn’t really what you want, either. The problem with a fully autonomous agent isn’t that it can’t do things. It’s that it has no way of knowing whether the thing it just produced is actually good, because the thing that decides “good” is usually you. Your standards. Your instincts. Your sense of what’s off.

My agent is still autonomous for a large set of predictable tasks: morning reports, evening summaries, urgent flags, inbox triage, some experiments. Anything where the shape of “good” is well-defined. For anything creative, strategic, or quality-sensitive, I’m firmly in the loop.

Think of an agent as a partner, not a solver. And don’t try to build Jarvis on day one. Build one small, honest thing that works, then build the next one on top of it. That’s the only order of operations that actually converges.

Mistake 6. Putting AI in every step of every pipeline

Early on, every single thing my agent did had a language model call somewhere in it. Fetching data. Moving files. Routing messages. Formatting output. LLM everywhere, because LLMs felt magical and I wanted to use them for everything.

One morning I noticed I was at 50% of my 5-hour usage window before I’d actually done any real work. Just from the agent’s own background tasks waking up.

The fix was boring and obvious in hindsight: most of a pipeline can be a plain script. Move data from A to B with a script. Call the model exactly once, at the end, for the one thing that actually requires language. That’s what the model is for. Everything before that is plumbing, and plumbing should be code.
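Here’s roughly what that shape looks like. It’s a sketch, assuming a hypothetical log format (one JSON object per line with `level`, `source`, and `message` fields) and a `summarize` callable that stands in for whatever model call you use.

```python
import json
from typing import Callable


def build_report(raw_lines: list[str], summarize: Callable[[str], str]) -> str:
    """Plain code does the plumbing; the model is called once, at the end."""
    # Plumbing: parse, filter, and format with ordinary Python.
    events = [json.loads(line) for line in raw_lines if line.strip()]
    errors = [e for e in events if e.get("level") == "error"]
    digest = "\n".join(f"{e['source']}: {e['message']}" for e in errors)
    # The single model call, only when there is something that needs language.
    return summarize(digest) if digest else "No errors today."
```

On a quiet day this pipeline makes zero model calls. On a noisy day it makes exactly one.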

AI isn’t free. Even local models cost time, electricity, and capacity. You don’t need AI everywhere. You need it where the language or the judgment actually matters.

Mistake 7. Forgetting that your harness updates constantly

Claude Code updates almost daily. Codex updates often. Every harness does. This is mostly a good thing, except for one small catch: features you built from scratch will sometimes get shipped natively by the tool you’re building on, and now you have the same thing twice. Your custom version and the new native version start fighting each other, and the output drifts in ways that are hard to diagnose.

My fix was a small automation that checks for updates every day and flags anything in my custom code that overlaps with new native features. When it finds one, I delete my version and use the native one. Cleaner, less code to maintain, better integration.

If you don’t do something like this, after a few weeks you’ll notice things wiggling and conflicting and you won’t know why. The harness moved under your feet. It’s the cost of building on a fast-moving platform, and you just have to pay attention to it.

Mistake 8. Installing skills from a marketplace without checking them

This one is newer, because skill marketplaces and shareable agent extensions are newer. Claude Code now has a growing ecosystem of skills you can drop into your agent. Other harnesses have similar things. The idea is great: someone else already solved a problem you have, you install their skill, you save hours.

The catch is that a skill is code that runs on your machine with your agent’s permissions. If you install one without understanding what it does, you’ve effectively given a stranger a seat at the table inside your setup. Most skills are fine. Some aren’t. I already wrote about a case where malware was hidden inside a Claude Code skill, which is why I built a scanner for them in the first place.

The rule I follow now, and the one I’d give you from day one: before installing any skill from any marketplace, ask yourself two questions. Do I actually need this, or am I installing it because it’s there? And do I understand, at least roughly, what it’s allowed to do? If you can’t answer both, don’t install it yet. Read its source. Ask your agent to walk you through what it does. Treat it like any piece of software from someone you’ve never met, because that’s what it is.

Mistake 9. Not using Git from day one (the mistake I’m glad I didn’t make)

I want to be honest here: this one isn’t actually my mistake. I started using Git from the very beginning on every agent project I’ve ever built, and that single habit has saved me more times than I can count. I’m including it because the number of beginners I’ve watched skip it and then lose weeks of work is too high to leave out.

Git is the thing that lets you roll back to a working version when something goes wrong. And something will go wrong. Your agent will make a change to a file you didn’t expect. You’ll delete the wrong folder. You’ll let the model rewrite something that was working and discover two days later that the new version is worse. Without Git, you’re stuck trying to remember what the file looked like three days ago. With Git, you type one command and you’re back.

The good news is this is now genuinely easy, even for non-technical people. You can ask your harness to set up a Git repository for you and it’ll do the whole thing. Private repo on GitHub is free and fine. You can even set up an automation so that every time your agent finishes a meaningful task, it commits and pushes the current state to the repo automatically, which means you basically never lose work. I set mine up like that and I haven’t thought about it since.
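If you’d rather see what such an automation does than take it on faith, here’s a minimal sketch. It assumes `git` is installed and the repo already has a remote configured; the function name and the injectable `run` parameter are my own choices for illustration.

```python
import subprocess
from typing import Callable


def commit_and_push(repo_dir: str, message: str,
                    run: Callable = subprocess.run) -> bool:
    """Commit and push the current state; return False if nothing changed."""
    status = run(["git", "status", "--porcelain"], cwd=repo_dir,
                 capture_output=True, text=True, check=True)
    if not status.stdout.strip():
        return False  # clean tree: "git commit" would fail, so skip it
    run(["git", "add", "--all"], cwd=repo_dir, check=True)
    run(["git", "commit", "--message", message], cwd=repo_dir, check=True)
    run(["git", "push"], cwd=repo_dir, check=True)
    return True
```

The status check up front is there because `git commit` exits with an error when there’s nothing to commit, and with `check=True` that would raise on every no-op task.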

If you remember nothing else from this section, remember this: commit and push every working version of your agent, from the very first day. It’s the cheapest insurance policy in the whole setup, and every single person who has ever lost work to a runaway edit wishes they’d done it sooner.

Bonus mistake. Thinking you need to build alone

I’ll say this honestly because I lived it: building an agent in isolation is much slower than building one while reading what other people are running into. Communities, newsletters, GitHub discussions, random Substack notes at midnight. The people doing this work are almost all willing to share what they’re learning. Go find them. I learned some of the most important things I know from comments on my own posts, which is the only reason this post exists at all.

6. Context window is the whole game

Someone in the comments on my original note nailed something I think about constantly:

The context window is the real constraint. Everything else, tools, models, memory, is downstream of how well you manage what the agent sees at any given moment.

200,000 tokens sounds like a lot. It isn’t, once you understand what fills it.

Every session auto-loads a bunch of stuff before you’ve even typed anything: your core instructions file, your memory files, the conversation history if there is any, the current task state. That’s your “always-on” overhead. For me, that adds up fast. It’s a cost I didn’t fully understand at first, because it happens before you see a single response.

For a beginner, three rules carry you a long way:

Keep your CLAUDE.md thin. Every line you add is a line the model has to read at the start of every single session. Treat it like precious real estate. If you can say it shorter, say it shorter.
One memory file per project, and that’s it. Don’t build a vector database. Don’t install a semantic search engine. Don’t set up a temporal knowledge graph. Not on day one. A flat markdown file per project is enough for a surprisingly long time. That’s how I started and it worked for months.
Don’t worry about compaction yet. Eventually, once your memory files get large, you might want a process that rewrites them to stay under a size threshold. I run one every night now. That’s a month-three problem, not a day-one problem.
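To get a feel for the always-on overhead, a rough estimate is enough. This sketch assumes about four characters per token, which is a common rule of thumb for English text and not an exact tokenizer count; the function names are hypothetical.

```python
CONTEXT_WINDOW = 200_000  # tokens, the figure used above
CHARS_PER_TOKEN = 4       # rough heuristic, not a real tokenizer


def estimate_tokens(text: str) -> int:
    """Approximate the token count of a piece of text."""
    return len(text) // CHARS_PER_TOKEN


def always_on_overhead(auto_loaded_texts: list[str]) -> float:
    """Share of the context window used before you type anything."""
    total = sum(estimate_tokens(t) for t in auto_loaded_texts)
    return total / CONTEXT_WINDOW
```

Feed it the contents of your instructions file and memory files. If the result creeps past a few percent, that’s the signal to start trimming.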

For almost any beginner project, 200k tokens is more than enough. A back-and-forth conversation over iMessage barely touches the budget. The failure mode is almost never “model context too small.” It’s “my CLAUDE.md bloated to 800 lines and now every session starts with a giant anchor around its neck.”

I wrote a longer piece on how I keep my own CLAUDE.md structured after a thousand plus sessions if you want to see the mature version. For now, just remember: thin instructions, one memory file per project, and context is the first thing that’ll bite you when the agent starts behaving strangely.

7. Security from day one

A reader asked about security on my note, and this is the section I think about the most when I write pieces like this. It was one of the biggest reasons I built my own agent instead of using an off-the-shelf one.

Here’s the thing: an AI agent is a new attack surface on your computer. It has permissions. It runs code. It reads your files. It talks to the internet. And because we’re still early in how this all works, the models that drive it can be tricked, manipulated, or prompt-injected in ways we don’t fully understand yet. You’re adding a new thing with a lot of power to your machine, and you should act like that.

My progression was deliberate, and I’d recommend something similar for you:

MacBook phase. Very restricted permissions. Only the folders I explicitly whitelisted. No blanket network access. No access to real credentials. I built slowly and paid attention to what the agent actually needed. My personal machine has my personal things on it, and I wasn’t about to let a half-built agent loose in there.
Learning phase. As I understood what the agent actually needed and could trust it with, I expanded its permissions carefully.
Dedicated machine phase. Eventually I moved it to its own Mac Mini. An isolated computer, dedicated to the agent, with its own accounts and its own credentials. That machine is where the agent has broad permissions. My personal laptop doesn’t, and never will again.

A rule I learned the hard way and will give you for free: the agent should have its own accounts, not yours. Its own email address. Its own API keys. Its own logins. Don’t share your personal credentials with it. When something goes wrong, and something will eventually go wrong, you want the blast radius to be contained.

Two months ago I launched a small tool, a security scanner for Claude Code skills, which hit the front page of Hacker News. I built it because I was reading stories about autonomous agents being exploited in the wild and realized I wanted a way to check my own setup against a list of known issues. If you’re running anything serious, something like this is worth having in your toolbox. And even if you’re not, just paying attention to permissions from day one will put you ahead of almost everyone else building in this space.

Closing. Start small, start today

You don’t need the strongest model. You don’t need a fancy framework. You don’t need a PhD in machine learning or expensive hardware or a cloud account.

You need:

A laptop you already own.
A $20 per month subscription to a real model.
A harness. Any harness. Pick one.
A folder on your computer, with CLAUDE.md, a memory/ subfolder, a projects/ subfolder, and a secrets/ subfolder.
One real project you actually want to exist. Not a demo. Something you’d use tomorrow morning.
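The folder from that list can be created by hand in a minute, or with a few lines of Python. This is a sketch; the starter content written into CLAUDE.md is a placeholder.

```python
import tempfile
from pathlib import Path


def scaffold(root: Path) -> list[Path]:
    """Create the starter layout: CLAUDE.md plus memory/, projects/, secrets/."""
    created = []
    for sub in ("memory", "projects", "secrets"):
        folder = root / sub
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    core = root / "CLAUDE.md"
    if not core.exists():  # never overwrite instructions that already exist
        core.write_text("# Agent instructions\n", encoding="utf-8")
    created.append(core)
    return created


# Demo in a throwaway directory so nothing real is touched.
with tempfile.TemporaryDirectory() as tmp:
    for path in scaffold(Path(tmp) / "agent"):
        print(path.name)
```

Running it twice is safe: existing folders are left alone and an existing CLAUDE.md is never replaced.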

Start with that. The rest (all the architecture and the self-improvement and the model routing and the memory compaction) comes as you grow into it. None of it needs to exist on day one.

Everything will break regularly. Your harness will update under your feet. Your instructions file will drift. Your context window will bloat. The model will hallucinate a function that doesn’t exist and confidently insist it does. Although it cost me a lot of time at the beginning, I really don’t mind it anymore. It’s the job right now, and I accept that. I wrote my first piece about Wiz back when it was just a night-shift experiment, and looking back, almost everything I thought I knew then was wrong. That’s fine. The only thing that compounds is the habit of building, breaking things, fixing them, and writing down what you learned.

The people in my comments who asked for this post already know more than most. Almost all of you have the instinct, and most of you have the tools. What’s left is the part I can’t do for you: opening the folder, writing the first line of CLAUDE.md, and running something small tonight that didn’t exist this morning.

Go build your first agent. Then tell me what broke.

I write about building Wiz, my AI agent, roughly twice a week on Digital Thoughts. Every mistake, every rebuild, every thing that surprised me along the way. If this post was useful, subscribe and you’ll get the next one as soon as it goes out.


My $600 Mac Mini Runs a 35B AI Model. Yesterday I Swapped Its Brain.

Pawel Jozefiak — Fri, 03 Apr 2026 09:47:19 GMT

A few weeks ago I wrote about running Qwen on my MacBook and iPhone. That was the first experiment. Local models doing real work on normal hardware. The gap between local and cloud was closing fast, and I wanted to push it further.

So I did. I moved everything to a headless Mac Mini M4, the base $599 model with 16GB of RAM. Started with smaller Qwen models for message classification and context compression. Then I found a way to run a 35 billion parameter model on 16 gigs. 17 tokens per second. Zero swap. Everyone said you need 32GB minimum. I’ve been running it for about a week now.

Yesterday Google dropped Gemma 4 under Apache 2.0. Within hours I benchmarked it against my Qwen setup. Classification went from 8.5 seconds to 1.9 seconds. By evening the swap was done. Five files changed.

This is all of it. How I fit a 35B model on a $600 machine, what local models actually do when you’re running an AI agent 24/7 (spoiler: it’s way more than classification), how model routing works across three tiers, and why Gemma 4 made me swap the brain overnight.

“Small LLM setup” for quick work

Why I Needed Local AI (It’s More Than You’d Think)

My Mac Mini runs as a headless AI automation server. iMessages, emails, scheduled nightshift tasks, 35+ specialized skills. The brain is Claude Code. Claude is brilliant. Claude also has usage limits.

I pay for Claude Max. It’s not cheap. And every wake-up burns through that subscription. I needed a way to handle the routine work locally. Free. On the machine itself.

At first I thought this would just be message classification. Is this a question or a greeting? Route the boring stuff away from Claude. That’s how it started.

Turns out that was maybe 20% of what local models can do. Here’s what actually runs locally now:

Message triage: classifying every incoming message (question, request, idea, greeting, FYI) and assessing urgency. Under 2 seconds per message.
Context compression: when someone sends a 500-word message, the local model compresses it to 30 words before Claude sees it. Same understanding, fewer tokens.
Signal compression: my agent collects signals all day long (errors, metrics, task outcomes, all kinds of stuff). Before the nightshift planning session, the 35B compresses the entire day into a dense summary. Saves roughly 15x in tokens for the Opus planning call. That’s a lot of money over a month.
Email preprocessing: same triage pattern applied to incoming emails before deciding whether to spin up a full Claude session.
Memory consolidation: the agent accumulates daily notes and context. A local model clusters related entries and merges them. Like defragmenting a hard drive, but for the agent’s memory.
Fallback: when Claude is rate-limited or down at 3am, the local 35B handles operational tasks. Not Claude-level reasoning, but functional.

The triage classification is just one example. The local models touch nearly everything the agent does. The result: roughly 30-40% fewer Claude sessions compared to routing everything through the API. My subscription lasts longer. And the agent responds faster for routine work because there’s no network round-trip.

Three Tiers, One $600 Machine

Not every task needs the same model. Like, classifying “is this a question or just ok?” is a completely different thing than compressing an entire day of automation signals into a planning summary. So I built a routing system. Three local tiers plus the cloud.

Why three tiers and not just one? A 2B model can tell you whether “What’s the status?” is a question or a greeting. But it can’t produce a useful 30-word summary of a 500-word message. And neither the 2B nor the 4.5B can do what the 35B does: compress an entire day of signals into a dense planning brief, or serve as a real fallback when Claude goes down. Different tools for different jobs.

The fast tier runs on every incoming message. Classifies the type (question, request, idea, greeting, FYI), assesses urgency. Under 2 seconds. If the message is a greeting or pure FYI, the agent skips Claude entirely. This one needs to be fast because it runs on every single message, all day.

The primary tier handles tasks where you need actual language understanding. Context compression is the main one: someone sends a long message, this model condenses it to 30 words before Claude sees it. It also handles things like Substack Notes generation as a fallback and the morning report AI summary. The 2.3B fast model can’t produce quality summaries. The 35B is overkill for this. The 4.5B sits in the sweet spot.

The heavy tier is where it gets interesting. The 35B does two things. It handles complex preprocessing (compressing the day’s automation signals before the nightshift Opus planning call, which saves roughly 15x in tokens). And it sits in the resilience chain as a Claude fallback. When Claude is rate-limited at 4am, the 35B handles health checks, error scanning, operational maintenance. Always marked as “[Local Fallback]” so I see the difference in the morning. But something functional beats nothing at 3am.

The triage routing for messages looks like this:

1. Incoming message hits triage
2. Fast tier classifies (question/request/idea/greeting/fyi) + urgency
3. If greeting or fyi: skip Claude entirely
4. If long text: primary or heavy tier summarizes before Claude sees it
5. Prepend classification to Claude's prompt: "[Pre-classified: request, urgency: high]"
6. Claude skips its own analysis step = fewer tokens burned

But message triage is just the entry point. The same models handle signal compression, email preprocessing, memory consolidation, fallback responses. The routing system is the foundation. The use cases keep growing, and that’s kind of the point.

Safety rules still matter. Messages mentioning money, deployment, publishing, or anything work-related always bypass local triage and go straight to Claude. Very short messages (under 20 characters) also skip triage because “yes” could be a standalone acknowledgment or an answer to a previous question. I learned that one the hard way when the triage classified “yes” as acknowledgment and skipped a real conversation reply.
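The steps and safety rules above can be sketched as one function. This is an illustration under assumptions, not the production triage code: the keyword list, the 1,500-character cutoff for “long text,” and the `classify` and `summarize` callables (stand-ins for the local model tiers) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

SENSITIVE_WORDS = ("money", "deploy", "publish", "work")  # assumed keyword list
MIN_TRIAGE_CHARS = 20     # very short messages skip triage
LONG_TEXT_CHARS = 1500    # assumed cutoff for "long text"


@dataclass
class Routed:
    send_to_claude: bool
    prompt: str


def route(message: str,
          classify: Callable[[str], tuple[str, str]],
          summarize: Callable[[str], str]) -> Routed:
    lowered = message.lower()
    # Safety rules first: short or sensitive messages bypass local triage.
    if len(message) < MIN_TRIAGE_CHARS or any(w in lowered for w in SENSITIVE_WORDS):
        return Routed(True, message)
    kind, urgency = classify(message)            # fast tier
    if kind in ("greeting", "fyi"):
        return Routed(False, "")                 # skip Claude entirely
    # Long text gets condensed by a local model before Claude sees it.
    body = summarize(message) if len(message) > LONG_TEXT_CHARS else message
    return Routed(True, f"[Pre-classified: {kind}, urgency: {urgency}]\n{body}")
```

Plain substring matching is crude (“work” also matches “network”), but for a safety bypass that errs in the right direction: a false positive just means Claude sees a message it could have skipped.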

The 35B Trick (Your SSD Is the New GPU Memory)

This is the part that shouldn’t work. I didn’t invent this. I came across a post on X where someone was running oversized models on consumer hardware using a specific flag in llama.cpp. That sent me down a rabbit hole.

Qwen 3.5 35B-A3B has 35 billion parameters. At 16-bit precision that’s roughly 70GB of weights, far more than 16GB. My first attempt through Ollama loaded 26GB into memory on a 16GB machine. The system froze. 4.3 million swapouts. Timed out after 10 minutes without producing a single token. Dead.

Then I tried llama.cpp with one flag: --mmap.

This is only a representation. I’m not using a UI for these models, since the Mac Mini runs headless.

Same model. Same hardware. Same 16GB. Result: 17.3 tokens per second, 81% memory free, zero swap.

Why This Works

Two things make this possible: the model architecture and how macOS handles files.

Qwen 3.5 35B-A3B is a Mixture of Experts (MoE) model. 35 billion parameters total, but only 3 billion are active per token. Like a building with 35 floors where any given task only needs to visit 3 of them. The rest sit empty. A router network inside the model decides which experts to activate for each token.

So what does --mmap do? Instead of loading the entire model file into RAM (which is what Ollama tried, and why it choked), llama.cpp memory-maps the file. The OS treats the model like a virtual address space backed by your SSD. It only pages in the weights that are actually needed:

Shared layers (attention, embeddings, norms): stay resident in RAM, about 4-6 GB
Expert weights: paged from SSD only when that expert activates
90% of the model sits on your NVMe drive, untouched most of the time
The macOS page cache naturally caches recently used experts, so repeated patterns get faster

None of this is new, by the way. Apple published a research paper called “LLM in a Flash” back in December 2023 describing exactly this approach: use your SSD as an extension of available memory for LLM inference. The M4’s unified memory architecture helps too (no PCIe bottleneck between CPU and GPU memory, it’s all one shared pool). And the NVMe SSD in the Mac Mini is fast enough that paging weights on demand doesn’t bottleneck inference for MoE models where most weights are idle anyway.
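The same OS mechanism is available from Python’s standard `mmap` module, which makes the idea easy to see on a small scale. This is a toy demonstration, not how llama.cpp is implemented: the 8 MiB temp file and the `EXPERT-42` marker are stand-ins for a weights file and one expert’s weights.

```python
import mmap
import os
import tempfile


def make_demo_file(size: int, marker: bytes, marker_offset: int) -> str:
    """Write a zero-filled file with a marker at one offset; return its path."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"\x00" * size)
        tmp.seek(marker_offset)
        tmp.write(marker)
        return tmp.name


def read_slice(path: str, offset: int, length: int) -> bytes:
    """Memory-map a file and touch only one slice of it."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; pages are loaded only when accessed.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            return mapped[offset:offset + length]


demo_path = make_demo_file(8 * 1024 * 1024, b"EXPERT-42", 5 * 1024 * 1024)
print(read_slice(demo_path, 5 * 1024 * 1024, 9))  # b'EXPERT-42'
os.remove(demo_path)
```

Mapping the file costs almost nothing up front. Only the pages behind the slice you actually read have to come off disk, which is the property that lets idle expert weights stay on the SSD.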

The thing that actually surprised me? The 35B is faster than the old 9B model. 17.3 vs 12.6 tok/s decode. Even though it has 4x more total parameters, MoE means each token only computes through 3B parameters. Better quality weights, same compute budget. Bigger model, less work per token. I did not expect that.

I downloaded an aggressively quantized version (Unsloth’s UD-IQ3_XXS, 13GB on disk) and run it on llama.cpp alongside Ollama on a different port. Ollama handles the fast and primary tiers on port 11434. llama.cpp handles the heavy tier on port 8081. Both coexist on the same machine. Total disk for everything: about 22GB.

The Brain Swap (Qwen to Gemma)

For about a week, everything ran on Qwen 3.5 across all tiers. It worked. Classification took about 8-9 seconds on the 4B model, summarization around 50 seconds on the 9B. Not blazing, but good enough for background preprocessing that doesn’t need instant responses. The system was stable. I was happy with it.

Then yesterday, Google released Gemma 4.

Apache 2.0 license. That matters. Previous Gemma versions had usage restrictions and monthly active user limits. For something running 24/7 as production infrastructure, license restrictions are a real concern. Apache 2.0 means no strings.

Only an example showing that the Mac Mini can run Gemma 4.

The benchmarks were wild. Gemma 4’s models scored 88-89% on AIME 2026 math. The previous Gemma 3 scored 20.8%. That’s not incremental improvement. That’s a different model family wearing the same name.

But the thing that really got me: multimodal on small models. The E2B (2.3B effective) and E4B (4.5B effective) can process images and audio natively. My Qwen setup was text-only. Voice messages via iMessage? Had to skip them entirely. Screenshots sent for context? Invisible to the triage layer. With Gemma 4, that changes. That’s a real capability gap closed, not just a speed improvement.

I wasn’t going to swap production models based on a marketing page though. Wrote a benchmark script, ran 10 classification prompts and 3 summarization tasks against each model under identical conditions.

Classification (10 test messages)

Summarization (3 test texts)

4.4x faster classification. 1.8x faster summarization. The speed difference is what sold me. For a triage system that runs on every incoming message, 1.9 seconds vs 8.5 seconds is the difference between “barely noticeable” and “why is this taking so long.”

The accuracy gap (70% vs 80%) turned out to be mostly gray areas. Both models classified “What’s the status of the nightshift?” as status_check instead of question. Honestly, status_check is a better answer. One genuine miss: Gemma said “observation” for a message where “fyi” would’ve been correct. It invented a category that wasn’t in the allowed list. Edge case I can fix with better prompting.
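Better prompting helps, but a guard in code catches invented categories regardless of what the model says. A minimal sketch: the allowed list comes from the triage categories above, while the alias mappings and the `request` default are assumptions (you could just as well add `status_check` to the allowed list).

```python
ALLOWED = ("question", "request", "idea", "greeting", "fyi")
# Assumed mappings for the two off-list labels described above.
ALIASES = {"observation": "fyi", "status_check": "question"}


def normalize_label(raw: str, default: str = "request") -> str:
    """Map a model's raw label onto the allowed list."""
    label = raw.strip().lower().strip(".\"'")   # drop whitespace, quotes, trailing dot
    label = ALIASES.get(label, label)
    return label if label in ALLOWED else default
```

The default is deliberately a category that still reaches Claude. An unknown label should never be the reason a real message gets skipped.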

The actual swap was anticlimactic. My architecture centralizes model names in one file. Two constants changed, three files with hardcoded model strings updated. Done.

# automation/lib/local_llm.py
MODEL_PRIMARY = "gemma4:e4b"   # was "qwen3.5:9b"
MODEL_FAST = "gemma4:e2b"      # was "qwen3.5:4b"

The heavy tier stays on Qwen 35B via llama.cpp. Deliberately. The fast and primary tiers run through Ollama, which had Gemma 4 support on day one (I upgraded Ollama from 0.19 to 0.20 for it). The heavy tier runs through llama.cpp with the mmap trick. Swapping that to Gemma’s 26B MoE variant means testing a different quantization, different mmap behavior, different inference characteristics. I don’t change two things at once in production infrastructure. Qwen 35B works. It stays until I’ve properly benchmarked the Gemma 26B alternative.

Old Qwen models stay on disk as rollback. If anything breaks, reverting the fast and primary tiers is a two-line change.

The Thinking Mode Trap

Both Qwen and Gemma have a “thinking” mode. The model generates an extensive chain-of-thought before answering. For complex analysis, this is great. For classification, it’s a disaster.

With thinking enabled, a simple “is this message a question or a request?” took 30+ seconds. The model generated 500 tokens of internal reasoning (“Let me analyze the intent of this message. The user is asking about...”) and then said “request.” All that thinking for a one-word answer. I was watching this happen in real time and just... no.

One parameter: think: false in the Ollama API call. Classification went from 30 seconds to under 1 second. Same accuracy. 30x faster. This was the single biggest optimization in the whole setup, and I almost missed it.

Works the same way for both Qwen and Gemma:

# Qwen classification (fast, thinking disabled)
curl localhost:11434/api/chat -d '{
  "model": "qwen3.5:4b",
  "messages": [{"role": "user", "content": "Classify: question/request/greeting"}],
  "think": false,
  "options": {"num_ctx": 4096}
}'

# Gemma classification (same API, same parameter)
curl localhost:11434/api/chat -d '{
  "model": "gemma4:e2b",
  "messages": [{"role": "user", "content": "Classify: question/request/greeting"}],
  "think": false,
  "options": {"num_ctx": 4096}
}'

Same parameter, same behavior, no code changes when I swapped models. For the heavy tier (llama.cpp on port 8081), thinking is controlled via the system prompt instead, but same principle: disable it for fast tasks, enable it when you actually need reasoning.
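The same request from Python, using only the standard library. This is a sketch rather than the contents of my `local_llm.py`: the function names are hypothetical, and `"stream": False` is added so the server returns one JSON object instead of a token stream.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"


def build_payload(model: str, prompt: str, think: bool = False,
                  num_ctx: int = 4096) -> dict:
    """Build the request body for a chat call, thinking disabled by default."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "think": think,
        "stream": False,               # single JSON response, easier to parse
        "options": {"num_ctx": num_ctx},
    }


def chat(model: str, prompt: str, **kwargs) -> str:
    """Send the request to a local Ollama server and return the reply text."""
    body = json.dumps(build_payload(model, prompt, **kwargs)).encode("utf-8")
    req = urllib.request.Request(OLLAMA_URL, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.loads(resp.read())["message"]["content"]
```

With Ollama running, `chat("gemma4:e2b", "Classify: question/request/greeting")` sends the same body as the second curl call. Making `think=False` the default means the slow path has to be asked for explicitly.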

People asked about context windows on my first local LLM post. Fair question. I push them as far as the hardware allows. Classification gets 4K (it doesn’t need more, most messages are under 200 tokens). Summarization gets 32K on the primary tier. Document analysis gets up to 64K on the fast tier. The heavy 35B gets 16K, which is enough for compressing a full day of signals or handling a long fallback conversation. The mmap trick handles the model weights, but the KV cache (the memory the model uses to “remember” the conversation) still lives in RAM. With the model weights at 4-6 GB and the KV cache for 16K context, there’s still plenty of room on 16GB. It’s not a million tokens. That’s not happening on this hardware. But for what these models actually do (triage, compression, summarization, fallback), 16-64K is more than enough.

The Resilience Chain (When Claude Goes Down)

The local models aren’t just about cost. They’re about uptime.

My agent’s resilience chain works like a waterfall. If the top tier fails, it cascades down:

Claude Sonnet -> retry -> Haiku -> Local 35B -> Local Primary -> OpenRouter -> Queue

Like backup generators. Main power (Claude Sonnet) goes out. Building tries the secondary generator (Haiku). If that’s down too, local generators kick in. Not as powerful, but they’re on-site, zero-cost, and they don’t depend on anyone else’s infrastructure staying online. Only if everything local fails does the system reach for an external API (OpenRouter). And if even that fails, the message goes into a queue for later.

The system tracks cooldowns per model too. If Claude hits a rate limit, it records how long to wait and skips straight to the next tier on subsequent requests. No point retrying something you know will fail. Once the OAuth token expired at 2am and the system figured out that all Claude tiers would fail with the same expired token, so it skipped straight to the local fallback. Saved minutes of retries that were guaranteed to fail. I only found out in the morning when I read the logs. It just handled it.
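A compact sketch of that waterfall, including cooldowns and the expired-token shortcut. Everything here is illustrative: the exception classes, the `(name, credential_group, call)` tuple shape, and the class itself are assumptions, and the explicit retry step from the chain above is left out to keep it short.

```python
import time


class RateLimited(Exception):
    def __init__(self, retry_after: float):
        super().__init__(f"rate limited for {retry_after}s")
        self.retry_after = retry_after


class AuthExpired(Exception):
    pass


class ResilienceChain:
    def __init__(self, tiers, clock=time.monotonic):
        # tiers: ordered list of (name, credential_group, call) tuples.
        # Tiers sharing a credential_group share one token; None means no shared token.
        self.tiers = tiers
        self.clock = clock
        self.cooldown_until = {}   # tier name -> time when it may be tried again
        self.queue = []            # prompts that no tier could handle

    def send(self, prompt):
        dead_groups = set()
        for name, group, call in self.tiers:
            if group in dead_groups:
                continue  # same expired credential, guaranteed to fail
            if self.clock() < self.cooldown_until.get(name, 0.0):
                continue  # still cooling down from an earlier rate limit
            try:
                return name, call(prompt)
            except RateLimited as exc:
                self.cooldown_until[name] = self.clock() + exc.retry_after
            except AuthExpired:
                if group is not None:
                    dead_groups.add(group)
            except Exception:
                continue  # any other failure: fall through to the next tier
        self.queue.append(prompt)  # everything failed, keep it for later
        return None
```

Returning the tier name alongside the response is what makes a marker like “[Local Fallback]” possible: the caller always knows which rung actually answered.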

My agent runs overnight. Nightshift tasks execute while I sleep. If Claude is rate-limited at 4am, the local 35B handles operational checks, health monitoring, error registry scans. Responses are clearly marked “[Local Fallback]” so I see the difference in the morning. Not Claude-level reasoning, but functional. And I’d rather wake up to a degraded response than a silent failure.

The Setup (If You Want to Try)

Hardware: Mac Mini M4, base model, $599, 16GB unified memory. Any Apple Silicon Mac with 16GB will work. The M4’s NVMe SSD speed helps for the mmap approach, but M1/M2/M3 work too.

Option A: Gemma 4 (recommended, what I run now)

# Step 1: Install Ollama (0.20+ required for Gemma 4)
brew install ollama

# Step 2: Pull Gemma 4 models
ollama pull gemma4:e2b        # Fast tier (7.2 GB, 2.3B effective, vision+audio)
ollama pull gemma4:e4b        # Primary tier (9.6 GB, 4.5B effective)
ollama pull nomic-embed-text  # Local embeddings (274 MB, optional)

# Step 3: Set environment for 16GB systems
export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q8_0
export OLLAMA_KEEP_ALIVE=10m
export OLLAMA_MAX_LOADED_MODELS=1

Option B: Qwen 3.5 (what I started with, still solid)

# Same Ollama install, different models
ollama pull qwen3.5:4b        # Fast tier (3.4 GB)
ollama pull qwen3.5:9b        # Primary tier (~6 GB)

The 35B Heavy Tier (works with either Option A or B)

# Step 4: Install llama.cpp for the big model
brew install llama.cpp

# Step 5: Download Qwen 3.5 35B (13 GB quantized)
pip3 install huggingface-hub
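# os.path.expanduser is needed: Python does not expand ~ the way the shell does
python3 -c "import os; from huggingface_hub import hf_hub_download; \
  hf_hub_download('unsloth/Qwen3.5-35B-A3B-GGUF', \
  'Qwen3.5-35B-A3B-UD-IQ3_XXS.gguf', \
  local_dir=os.path.expanduser('~/.local/share/llama-models'))"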

# Step 6: Start llama-server with the magic flag
llama-server \
  --model ~/.local/share/llama-models/Qwen3.5-35B-A3B-UD-IQ3_XXS.gguf \
  --port 8081 --ctx-size 16384 --n-gpu-layers 0 --mmap \
  --flash-attn on --threads 8

Some things that tripped me up:

OLLAMA_MAX_LOADED_MODELS=1 is critical on 16GB. Without it, Ollama tries to keep both the fast and primary models in memory at once and the system just dies. One model at a time, 10-minute idle timeout, then it unloads and frees your RAM.

--n-gpu-layers 0 on the llama.cpp command looks wrong. On Apple Silicon you’d normally offload layers to the GPU. But with mmap, we want the OS to manage the paging. Setting GPU layers to 0 means everything goes through the mmap path. The M4’s unified memory means “GPU” and “CPU” are reading from the same pool anyway, so it doesn’t really matter.

Ollama and llama.cpp coexist on different ports (11434 and 8081). The heavy tier (llama.cpp) is on-demand: I start it when needed and stop it when not. The fast and primary tiers (Ollama) run as a LaunchAgent, always on, auto-restarting. Two separate inference servers on one Mac Mini. Works fine.

What I Actually Learned

Local models are not Claude. They can’t chain multi-step tool calls, refactor code, or make creative decisions. Claude Code does this because Anthropic spent years on tool calling. Local models aren’t there yet. But that’s fine. I don’t need them to be Claude. I need them to classify messages in 2 seconds and compress context before Claude sees it.

The mmap thing is probably the most underappreciated trick in this whole setup. One flag. The difference between “impossible” and “17 tok/s with zero swap.” If you have Apple Silicon and an NVMe SSD, you can run models much larger than your RAM would suggest. I’m running it daily. It just works.

Centralizing model names saved me. I had three files with hardcoded "qwen3.5:9b". Found them all during the swap. If I’d had ten files like that, a quick migration would’ve been a 2-hour headache. Lesson learned (although honestly I should have known better).
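One way to centralize model names, sketched under assumptions: the module layout, the MODELS dict, and model_for are hypothetical, and the tags are the ones from the setup section. The point is that a swap touches one file.

```python
# models.py (hypothetical module name): the only place model tags are written down.
MODELS = {
    "fast": "gemma4:e2b",
    "primary": "gemma4:e4b",
    "embed": "nomic-embed-text",
}

def model_for(tier):
    """Resolve a tier name to a model tag, failing loudly on typos."""
    try:
        return MODELS[tier]
    except KeyError:
        raise ValueError(
            f"unknown tier {tier!r}; expected one of {sorted(MODELS)}"
        ) from None
```

Callers ask for model_for("fast") instead of hardcoding a tag, so moving from Qwen to Gemma becomes a three-line diff.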

I benchmarked before swapping, and I’m glad I did. Google’s marketing said Gemma 4 was better. The benchmarks confirmed speed gains but also showed a slight accuracy regression. Speed got better, instruction following got slightly worse. That’s a tradeoff I chose to accept, not one that surprised me in production.

One more thing: Gemma E2B is 7.2GB on disk (vs Qwen 4B at 3.4GB). Looks huge. But that’s because it bundles vision and audio encoders. Loaded RAM is nearly identical. I almost rejected it based on download size. Would have been a mistake.

And the thinking mode thing. Disable it for fast tasks. 30x speed difference for classification. This single setting is the difference between a usable triage system and a painfully slow one. Test it. Seriously.
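A hedged sketch of what disabling thinking can look like against Ollama's chat endpoint. It assumes the request-level think field that Ollama exposes for thinking-capable models and that the model you pulled honors it; the helper names are hypothetical.

```python
import json
import urllib.request

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"  # Ollama's default port

def build_chat_payload(model, prompt, think=False):
    """Build a non-streaming chat request; think=False asks the model to skip reasoning."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "think": think,   # assumption: the model supports Ollama's thinking toggle
        "stream": False,
    }

def chat(payload, url=OLLAMA_CHAT_URL, timeout=30):
    """Send the payload to a running Ollama server and return the reply text."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["message"]["content"]
```

Timing chat() with think=True and think=False on the same classification prompt is the quickest way to check the speed difference on your own hardware.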

Where This Is Going

Local models are real infrastructure now. Not toys. Production preprocessing, signal compression, memory consolidation, fallback. Running 24/7 on a $600 machine.

A year ago, useful local inference meant a $3,000 GPU. Now a base Mac Mini handles three tiers. The models keep getting smaller and faster. Gemma 4’s E2B classifies messages in under 2 seconds with 2.3 billion effective parameters. And it can see images and process audio. For my setup that previously had to skip voice messages entirely, that alone was worth the swap.

Next I want to test the Gemma 4 26B MoE variant (26 billion total, 3.8 billion active per token) as a replacement for the Qwen 35B heavy tier. If the GGUF works well with llama.cpp and mmap, that’s the full stack on Gemma. One model family for everything.

If you’re building any kind of AI agent system, try this pattern. Local for the frequent and cheap. Cloud for the complex and expensive. Cost savings compound. Reliability improves. You stop depending on any single provider staying online.

Gemma for preprocessing. Claude for thinking. Both got better at their jobs by not trying to do the other’s.

Thanks for reading Digital Thoughts! This post is public so feel free to share it.

Claude Code’s Source Got Leaked. Here’s What’s Actually Worth Learning.

Pawel Jozefiak — Wed, 01 Apr 2026 12:05:41 GMT

I spent a night reading the code and building things from it. Here’s what matters if you’re building an AI agent.

What actually leaked (and what didn’t)

On March 31, 2026, a security researcher noticed something odd about the Claude Code npm package. Version 2.1.88 shipped with a 59.8MB source map file. Source maps are debug artifacts that Bun (the runtime Claude Code uses) generates by default. Someone forgot to add *.map to the .npmignore file. That’s it. A missing line in a config file.

The source map referenced unobfuscated TypeScript files on Anthropic’s Cloudflare R2 bucket. All downloadable. About 1,900 files. 512,000 lines of code. Within hours, the codebase was mirrored across GitHub, reaching 84,000+ stars in under two hours. The fastest-growing repo in GitHub history, for a codebase that wasn’t supposed to be public.

Let me clear up a few things I’ve seen people get wrong about this.

No customer data was exposed. This was source code for the CLI tool, not a database breach. No credentials, no user conversations, no API keys. Anthropic confirmed it was a packaging error.

No model weights were leaked. The code is the software harness around Claude, not the LLM itself. You can’t run your own Claude from this. What you get is the orchestration layer: how Claude Code manages tools, memory, context, permissions, and multi-agent coordination.

This isn’t a security breach in the traditional sense. It’s a build artifact that should have been excluded from the npm package. Embarrassing for Anthropic’s build pipeline, but the kind of mistake any team shipping fast could make. The irony is that the leaked code includes a system called “Undercover Mode” specifically designed to prevent Anthropic employees from accidentally leaking internal details into public repos. It leaked along with everything else.

It wasn’t intentional. I know people are debating this because the timing aligns with April 1 and because Anthropic had a rough PR week (cease-and-desist against the OpenCode project). But the evidence is clear: they’ve been mass-sending DMCA takedowns to GitHub repos, they pulled the npm package, and their Cloudflare R2 bucket was taken down. You don’t do that with a planned release. The strategic roadmap exposure is too costly, especially during IPO preparation. Theo from t3.gg put it well: “if you think this was intentional, I have a couple bridges for sale.”

One theory that makes sense: Anthropic was investigating rate limit issues in Claude Code. Multiple employees had posted about seeing higher rate limit hits than expected. In their attempts to get better error logs from production builds, they may have included the source maps for debugging. Then forgot to exclude them from the npm package. That tracks with how these things usually happen: a debug change that nobody remembers to undo.

A warning if you’re thinking of cloning the leaked repo: The source references internal workspace packages that don’t exist on npm. Someone already registered those package names with a disposable email. If you clone and blindly run npm install, you could be pulling malicious code. Be careful.

What’s inside and why it matters

I’ve been building my own AI agent for months now. It runs 24/7 on a dedicated Mac Mini. So when this leak dropped, I didn’t read it for the drama. I read it to learn. I spent a night going through the architecture, comparing it to what I’ve built, and pulling out anything useful.

Here’s what matters if you’re building AI agents or want to understand where this technology is actually going.

The three-layer memory system

This is probably the most important architectural discovery in the leak. Claude Code uses a memory system with three layers:

Core index (MEMORY.md): A lightweight file of pointers, always loaded into context. Each entry is under 150 characters. It’s an index, not the memory itself.
Topic files: Detailed knowledge distributed across separate files, fetched on-demand when the index suggests they’re relevant.
Raw transcripts: Never re-read in full. Only grep’d for specific identifiers when needed.

The key insight is what they call “skeptical memory.” The agent treats its own memory as a hint, not a fact. Before acting on something it remembers, it verifies against the actual codebase. Memory says a function exists? Check first. Memory says a file is at this path? Verify before using it.

This solves context entropy, the gradual degradation of agent performance in long-running sessions. Most agents get worse the longer they run because their context fills up with stale observations. This architecture keeps the active context small (just the index) and only loads what’s needed.
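The three layers can be sketched roughly as follows. This is an illustration, not the leaked implementation: the "file: description" entry format, the keyword matching, and all function names are assumptions. Only the 150-character limit, on-demand topic loading, and grep-only transcripts come from the description above.

```python
from pathlib import Path

MAX_ENTRY_CHARS = 150  # each index entry stays under this length

def load_index(index_path):
    """Layer 1: read the always-loaded index, one pointer per non-empty line."""
    entries = []
    for line in Path(index_path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line:
            continue
        if len(line) >= MAX_ENTRY_CHARS:
            raise ValueError(f"index entry too long: {line[:40]}...")
        entries.append(line)
    return entries

def relevant_topics(entries, query):
    """Layer 2: pick topic files whose description shares a word with the query.

    Assumed entry format: 'topic-file.md: short description'.
    """
    words = query.lower().split()
    hits = []
    for entry in entries:
        name, _, description = entry.partition(":")
        if any(word in description.lower() for word in words):
            hits.append(name.strip())
    return hits

def grep_transcript(transcript_path, identifier):
    """Layer 3: never load the transcript whole, only return matching lines."""
    with open(transcript_path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle if identifier in line]
```

Only the output of load_index needs to live in context permanently; everything else is fetched when a query calls for it.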

I’ve been running a similar pattern with working memory that rolls over on a schedule and a permanent index that persists across sessions. The leak confirmed this is the right approach. The verification step is something I’m now adding to my own system.

Memory consolidation during idle time (autoDream)

The leak includes a system called autoDream in the services/autoDream/ directory. It’s a background memory consolidation engine that runs as a forked subagent with read-only access to the project. Three gates must pass before it runs: 24 hours since the last run, at least 5 sessions completed, and a consolidation lock must be available.

When triggered, it runs four phases: orient (scan memory directory), gather (extract new info from logs), consolidate (write and update topic files), and prune (keep total memory under 200 lines and 25KB).
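The gate check and the prune budget are simple enough to sketch. The thresholds are the ones quoted above; the dataclass, the function names, and the exact boundary handling (for example whether exactly 200 lines still counts as "under") are assumptions.

```python
from dataclasses import dataclass

MIN_HOURS_SINCE_LAST_RUN = 24
MIN_SESSIONS = 5
MAX_MEMORY_LINES = 200
MAX_MEMORY_BYTES = 25 * 1024
PHASES = ("orient", "gather", "consolidate", "prune")

@dataclass
class ConsolidationState:
    hours_since_last_run: float
    sessions_completed: int   # assumed to be counted since the last run
    lock_available: bool

def should_consolidate(state):
    """All three gates must pass before a consolidation run starts."""
    return (
        state.hours_since_last_run >= MIN_HOURS_SINCE_LAST_RUN
        and state.sessions_completed >= MIN_SESSIONS
        and state.lock_available
    )

def within_budget(memory_text):
    """Prune target: total memory stays under both the line and the byte cap."""
    too_many_lines = len(memory_text.splitlines()) >= MAX_MEMORY_LINES
    too_many_bytes = len(memory_text.encode("utf-8")) >= MAX_MEMORY_BYTES
    return not (too_many_lines or too_many_bytes)
```

A scheduler would call should_consolidate on idle, run the four PHASES in order, and repeat the prune step until within_budget returns True.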

Why this matters for you: if you’re building any agent that runs over multiple sessions, unbounded memory will kill you. Not immediately. Over weeks. Your agent starts referencing things that are no longer true, duplicating observations, and filling context with noise. You need some form of consolidation. autoDream’s approach of forking a read-only subagent is clean because it can’t accidentally corrupt whatever the agent is currently working on.

The tool architecture

Claude Code defines 40+ discrete tools, each wrapped in permission gates. The biggest file in the leak is Tool.ts at roughly 29,000 lines, defining tool types and permission schemas. Every tool operation goes through a PermissionGate structure for granular access control.

Three things stood out:

File-read deduplication: Before re-reading a file, it checks whether the file has changed since the last read. If not, it skips the read and uses the cached version. Sounds obvious, but most agent setups don’t do this, and the token savings compound fast.
Large result offloading: When a tool produces a massive result (like searching a large codebase), it writes the full result to disk and only passes a preview plus a file reference back to the context. This keeps the context window clean while still making the data available.
CLAUDE.md reinsertion on turn changes: The CLAUDE.md file doesn’t just get loaded once at the start. It gets reinserted into the conversation on every turn change (when the model finishes and the user sends a new message). Not at the top of the history, but right where the new message is sent. This repeated injection keeps the model aligned with your instructions even in long conversations where the original system prompt would have scrolled far out of active context.
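The first two items in that list are easy to reproduce in your own agent. A minimal sketch, assuming change detection by modification time plus size and arbitrary preview thresholds; none of the names or numbers below come from the leaked source.

```python
import os
import tempfile

class FileReadCache:
    """Skips re-reading a file whose mtime and size have not changed."""

    def __init__(self):
        self._cache = {}      # path -> ((mtime_ns, size), text)
        self.disk_reads = 0   # counts real reads, useful for measuring savings

    def read(self, path):
        info = os.stat(path)
        stamp = (info.st_mtime_ns, info.st_size)
        cached = self._cache.get(path)
        if cached is not None and cached[0] == stamp:
            return cached[1]
        with open(path, encoding="utf-8") as handle:
            text = handle.read()
        self.disk_reads += 1
        self._cache[path] = (stamp, text)
        return text

def offload_if_large(result, max_chars=2000, preview_chars=200, out_dir=None):
    """Keep small results inline; write large ones to disk and return a preview."""
    if len(result) <= max_chars:
        return {"preview": result, "path": None}
    fd, path = tempfile.mkstemp(suffix=".txt", dir=out_dir, text=True)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(result)
    return {"preview": result[:preview_chars], "path": path}
```

The agent's context only ever receives the preview and the path, and it can open the full file later if the preview looks relevant.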

If you’re using CLAUDE.md files (and you should be), this last detail matters. Your instructions aren’t a one-time primer. They’re actively re-read throughout the conversation. That’s why well-structured CLAUDE.md files have such a big impact on agent behavior. I wrote about how I structure mine after running 1000+ sessions.

Multi-agent coordination

The leak reveals Coordinator Mode. One Claude agent acts as a lead, spawning and managing multiple worker agents in parallel. Workers operate in their own isolated contexts with restricted tool permissions. They communicate via XML-structured task notifications and share data through a scratchpad directory. The system prompt for coordinators emphasizes “parallelism is your superpower.”

The clever implementation detail here: sub-agents share the prompt cache. Instead of each worker spinning up with its own context (paying full input token costs), they all share the same context prefix and only branch at the task-specific instruction. This is what makes multi-agent coordination economically viable. Without cache sharing, spinning up five workers means paying five times the input cost. With it, you pay once for the shared context and only pay incrementally for the task-specific parts. That’s probably why Coordinator Mode isn’t released yet. The cost math is still brutal even with this optimization.
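To make the cost math concrete, here is a small worked calculation in token-equivalents. The function name and the example numbers are hypothetical, and the cached-read multiplier is a parameter because providers bill cached reads differently; 0.0 matches the "pay once" simplification above.

```python
def input_cost_units(workers, shared_tokens, task_tokens, cached_read_multiplier=None):
    """Input cost in token-equivalents for N workers sharing one context prefix.

    cached_read_multiplier=None means no cache sharing: every worker pays in full.
    Otherwise the first worker pays full price and the rest pay the multiplier
    on the shared prefix plus full price on their task-specific tokens.
    """
    if cached_read_multiplier is None:
        return workers * (shared_tokens + task_tokens)
    first_worker = shared_tokens + task_tokens
    per_extra_worker = shared_tokens * cached_read_multiplier + task_tokens
    return first_worker + (workers - 1) * per_extra_worker

# Hypothetical numbers: 5 workers, 50k-token shared prefix, 2k task tokens each.
without_sharing = input_cost_units(5, 50_000, 2_000)                          # 260_000
with_free_reads = input_cost_units(5, 50_000, 2_000, cached_read_multiplier=0.0)  # 60_000.0
```

Even with free cached reads the five-worker run still costs more than a single agent, which fits the point that the math stays brutal.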

This is the same pattern I landed on independently. I built three persistent domain teams with an Opus lead that plans and delegates, and Sonnet specialists that execute. The convergence here is specific: lead agent that plans, specialist workers that execute in parallel, structured communication, verification at the end.

Risk classification

Actions get labeled LOW, MEDIUM, or HIGH risk. There’s a “YOLO classifier” for fast auto-approval of low-risk operations. Protected files like .gitconfig and .bashrc get special treatment. There’s also a referenced “AFK Mode” that adjusts behavior when the user is away.

Three tiers. Same as what I built. Same reasoning: an autonomous agent needs to know which actions are safe to take alone, which should be flagged, and which need a human in the loop. This one is less a revelation and more a confirmation that the three-tier approach is just the correct default for any agent with real-world access.
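A three-tier classifier can start very small. In this sketch the protected file names come from the description above; the action names, the rule order, and the tier-to-decision mapping are assumptions for illustration.

```python
from enum import Enum
from pathlib import PurePosixPath

class Risk(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"

PROTECTED_FILES = {".gitconfig", ".bashrc"}
READ_ONLY_ACTIONS = {"read_file", "list_dir", "search"}             # hypothetical names
DESTRUCTIVE_ACTIONS = {"delete_file", "run_shell", "send_payment"}  # hypothetical names

def classify(action, target=""):
    """Label an action LOW, MEDIUM, or HIGH; protected files always win."""
    if PurePosixPath(target).name in PROTECTED_FILES:
        return Risk.HIGH
    if action in DESTRUCTIVE_ACTIONS:
        return Risk.HIGH
    if action in READ_ONLY_ACTIONS:
        return Risk.LOW
    return Risk.MEDIUM   # unknown actions get flagged rather than auto-approved

def decision(risk):
    """Map a tier to what the agent does: act alone, flag, or wait for a human."""
    return {
        Risk.LOW: "auto_approve",
        Risk.MEDIUM: "flag",
        Risk.HIGH: "ask_human",
    }[risk]
```

Defaulting unknown actions to MEDIUM is the important design choice: the fast path only applies to actions that are explicitly known to be safe.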

Five patterns you can use right now

Here’s the practical part. These are patterns from the leak that you can apply to your own AI agent setup, whether you’re building something complex or just trying to get more out of Claude Code, Cursor, or any AI coding tool.

1. The blocking budget

KAIROS (the unreleased always-on daemon in the leak) has a 15-second blocking budget. Any proactive action that would take longer gets deferred. Max 2 proactive messages per window. Reactive messages (responding to user input) bypass the budget entirely.

Why this matters: if you’re running any kind of proactive agent, whether it’s monitoring code, sending notifications, or checking on things, you need rate limiting. Not just “don’t spam.” Structured rate limiting with different rules for proactive versus reactive work. Without it, your agent will eventually send 4 messages in 30 seconds when one would do.

I implemented this the night I read the leak. A simple state file tracks the budget window. Proactive messages get queued and recovered. Reactive messages go through immediately. About 50 lines of Python.
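A minimal sketch of the message cap, kept in memory instead of a state file. It assumes a sliding 15-second window and covers only the two-message limit and the reactive bypass, not the deferral of long-running actions; class and method names are hypothetical.

```python
import time
from collections import deque

class BlockingBudget:
    """Caps proactive messages per window; reactive messages always pass."""

    def __init__(self, window_seconds=15.0, max_proactive=2, clock=time.monotonic):
        self.window_seconds = window_seconds
        self.max_proactive = max_proactive
        self._clock = clock
        self._sent = deque()     # send times of proactive messages still in the window
        self.deferred = deque()  # proactive messages waiting for budget

    def _has_room(self):
        now = self._clock()
        # Drop send times that have aged out of the window.
        while self._sent and now - self._sent[0] >= self.window_seconds:
            self._sent.popleft()
        return len(self._sent) < self.max_proactive

    def submit(self, message, proactive=True):
        """Return True if the message may go out now, else queue it and return False."""
        if not proactive:
            return True
        if self._has_room():
            self._sent.append(self._clock())
            return True
        self.deferred.append(message)
        return False

    def drain(self):
        """Release queued messages, oldest first, as far as the budget allows."""
        released = []
        while self.deferred and self._has_room():
            self._sent.append(self._clock())
            released.append(self.deferred.popleft())
        return released
```

Calling drain() on each wake cycle is enough to deliver whatever was deferred without ever exceeding the cap.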

2. Skeptical memory with verification

Don’t trust your agent’s memory. Make it verify. Every time your agent says “I remember that file X has function Y,” make it check first. Memory is a hint. The codebase is the truth.

This is the single most practical takeaway from the leak. If you’re using CLAUDE.md files, custom system prompts, or any form of persistent context, treat them as suggestions that need verification, not as ground truth. Files get renamed. Functions get deleted. APIs change. Your memory hasn’t.
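A verification step can be as small as the sketch below. It assumes the remembered claim has the shape "file X has function Y" and that the file is Python; the function name and the regex are illustrative, not taken from the leak.

```python
import re
from pathlib import Path

def verify_memory(path, function_name=None):
    """Check a remembered claim against disk. Returns (ok, reason)."""
    file_path = Path(path)
    if not file_path.is_file():
        return False, f"{path} does not exist"
    if function_name is None:
        return True, "file exists"
    # Assumption: Python source, so look for a def (or async def) at line start.
    pattern = re.compile(
        rf"^\s*(?:async\s+)?def\s+{re.escape(function_name)}\s*\(", re.MULTILINE
    )
    source = file_path.read_text(encoding="utf-8", errors="replace")
    if pattern.search(source):
        return True, f"{function_name} found in {path}"
    return False, f"{function_name} not found in {path}"
```

When the check fails, the useful response is to correct the memory entry, not just to abort the action, so the same stale hint does not come back next session.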

3. Semantic memory merging

autoDream doesn’t just delete old memories. It merges related observations, removes logical contradictions, and converts vague insights into concrete facts. If your agent noted “user might prefer X” three months ago and “user confirmed X yesterday,” the old entry should be updated, not kept alongside the new one.

Most memory systems I’ve seen (including my own before this) do time-based cleanup. Old stuff gets archived or deleted. That’s fine for preventing memory bloat, but it doesn’t catch contradictions. Two conflicting observations can coexist for months. Semantic merging resolves that.

I built a version using a local LLM (Qwen 9B running on the Mac Mini) to cluster related entries and merge them during nightly maintenance. A safety cap prevents reducing any section by more than 50% in a single pass. You don’t need to go this far. Even a simple script that groups memory entries by topic and flags potential contradictions would be a step up from pure time-based cleanup.
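The simple version mentioned above, without any LLM, might look like this. It assumes entries are (topic, text) pairs, treats any topic with more than one entry as a candidate for review, and measures the 50% cap in number of entries; all names are hypothetical.

```python
from collections import defaultdict

def group_by_topic(entries):
    """Group (topic, text) pairs so related observations sit side by side."""
    groups = defaultdict(list)
    for topic, text in entries:
        groups[topic].append(text)
    return dict(groups)

def review_candidates(groups):
    """Topics with several entries are where duplicates and contradictions hide."""
    return sorted(topic for topic, texts in groups.items() if len(texts) > 1)

def apply_with_cap(original, merged, max_reduction=0.5):
    """Accept a merge only if it removes at most max_reduction of the entries."""
    removed = len(original) - len(merged)
    if removed > max_reduction * len(original):
        return list(original)   # too aggressive: keep the section untouched
    return list(merged)
```

The cap is what makes it safe to let a small local model propose merges: a bad pass can shrink a section by half at most, never wipe it.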

4. Adversarial verification

The leaked Coordinator Mode treats verification as a distinct, adversarial phase with its own worker agent. Not “check if this works.” Not a checklist. A separate agent whose job is to try to break what was built.

This is different from testing. Testing asks “does it work?” Adversarial verification asks “how can I break it?” The distinction matters because the agent that built something has a blind spot about its own work. A fresh agent with the explicit prompt “find problems with this” will catch things the builder missed.

I added this to my nightshift process. Before any task gets marked complete, a separate verification agent runs two phases: existence check (does the deliverable actually exist?) and adversarial challenge (try to break it). The results go into a verification log. It’s caught real issues that would have shipped otherwise.
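The two phases can be sketched like this. In the setup described above the challenge phase is a separate agent; here plain callables stand in for it, and the function name and log format are assumptions.

```python
from pathlib import Path

def verify_task(deliverable, challenges):
    """Run an existence check, then adversarial challenges.

    challenges: list of (name, callable(path) -> bool), where True means
    the deliverable survived that attempt to break it.
    Returns (passed, log) with log as a list of (check_name, ok) tuples.
    """
    path = Path(deliverable)
    exists = path.is_file() and path.stat().st_size > 0
    log = [("existence", exists)]
    if not exists:
        return False, log   # nothing to attack if the deliverable is missing
    for name, challenge in challenges:
        try:
            survived = bool(challenge(deliverable))
        except Exception:
            survived = False   # a crashing challenge counts as a failure
        log.append((name, survived))
    return all(ok for _, ok in log), log
```

Running every challenge even after one fails keeps the verification log complete, which matters when you read it the next morning.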

5. Prompt cache awareness

The source includes a promptCacheBreakDetection.ts file that monitors 14 different cache-break vectors with sticky latches. Things like mode toggles, model changes, context modifications. Each one can invalidate your prompt cache, and cache misses mean you’re paying full price for tokens that could have been cached.

If you’re running many agent sessions per day, cache efficiency directly affects your costs. This one is easy to ignore because you don’t see the waste. But if you track it (which I now do), you’ll likely find that your cache hit rate is lower than you assumed and that specific patterns in your workflow are breaking it.
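Tracking the hit rate takes very little code. This sketch assumes usage dictionaries shaped like the Anthropic Messages API usage object (input_tokens, cache_read_input_tokens, cache_creation_input_tokens); the class name and the 50% alert threshold are hypothetical.

```python
class CacheStats:
    """Accumulates token usage and reports the share served from cache."""

    def __init__(self):
        self.cached = 0   # input tokens read from the prompt cache
        self.total = 0    # all input tokens, cached or not

    def record(self, usage):
        read = usage.get("cache_read_input_tokens", 0) or 0
        written = usage.get("cache_creation_input_tokens", 0) or 0
        fresh = usage.get("input_tokens", 0) or 0
        self.cached += read
        self.total += read + written + fresh

    def hit_rate(self):
        return self.cached / self.total if self.total else 0.0

    def needs_alert(self, threshold=0.5):
        """True once there is data and the hit rate has dropped below the threshold."""
        return self.total > 0 and self.hit_rate() < threshold
```

Feeding it the usage block from every response and logging hit_rate() per session is enough to spot which workflow steps break the cache.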

Related: the source reveals five different compaction strategies for when the context window fills up. If you’ve used Claude Code heavily, you’ve probably hit the moment where it compacts and then loses track of what it was doing. That’s still a hard problem. But knowing they’re actively working on multiple approaches to solve it tells you this is worth investing in for your own long-running agents too.

What I built in one night

I didn’t just read the leak. I treated it as a learning exercise and built things from it. That same night, I implemented five modules inspired by patterns in the leaked source:

Blocking budget for proactive messages. 15-second window, 2-message max, deferred queue.
Semantic memory consolidation using local LLM to cluster and merge observations during idle time.
Frustration detection via regex pattern matching. 21 patterns, three action tiers (back off, acknowledge, simplify). Fast enough to run on every incoming message.
Prompt cache monitor that tracks hit rates, estimates savings, and alerts when efficiency drops.
Adversarial verification as a formal phase in the nightshift execution loop.

Total time: about 4 hours of reading and building. I already had the foundations (nightshift, memory system, domain teams). These were specific improvements layered on top.

The frustration detection one is worth a note. The leaked code uses regex patterns to detect user frustration. Stuff like “wtf”, “this sucks”, keyword matching. An LLM company using regexes for sentiment analysis. But it makes sense. You don’t burn an LLM inference call on something you can pattern-match in 5 milliseconds. I applied the same logic: 21 patterns, fast evaluation, action suggestions without the overhead of an API call.
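A three-pattern version of the idea, as a sketch. The two quoted phrases come from the description above; the third pattern and the mapping from pattern to action tier are assumptions, and a real list would be longer.

```python
import re

# (pattern, suggested action); first match wins.
FRUSTRATION_PATTERNS = [
    (re.compile(r"\bwtf\b", re.IGNORECASE), "back_off"),
    (re.compile(r"\bthis sucks\b", re.IGNORECASE), "acknowledge"),
    # Hypothetical extra pattern: repeated failure phrasing.
    (re.compile(r"\b(?:still|again)\b.*\b(?:broken|wrong|not working)\b", re.IGNORECASE), "simplify"),
]

def detect_frustration(message):
    """Return the suggested action for the first matching pattern, or None."""
    for pattern, action in FRUSTRATION_PATTERNS:
        if pattern.search(message):
            return action
    return None
```

Because the patterns are compiled once at import time, calling this on every incoming message costs far less than a model call.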

What’s not worth your time

Not everything in the leak is useful. Some of it is Anthropic-specific, some is unreleased for good reasons, and some is just fun but not practical.

Buddy System. A Tamagotchi-style terminal pet. 18 species across rarity tiers, procedural stats like DEBUGGING, PATIENCE, CHAOS. It’s genuinely charming and I kind of love it. But unless you’re Anthropic trying to make a CLI tool feel more personal, you don’t need this.

Undercover Mode. Strips Anthropic attribution from open-source contributions. Specific to their internal workflow where employees use Claude Code on public repos. Not applicable unless you have the same problem (and if you do, you probably already know about it).

Anti-distillation mechanisms. The code injects fake tool definitions into API requests to poison anyone trying to train models on intercepted traffic. It also summarizes reasoning chains before returning them to eavesdroppers. Interesting from a security perspective. Not useful for building agents.

ULTRAPLAN. A mode that offloads complex planning to a remote cloud container running Opus 4.6 for up to 30 minutes. Cool concept. Requires infrastructure you probably don’t have and a use case that doesn’t come up often enough to justify building it.

Native client attestation. API requests include computed hashes that prove they come from legitimate Claude Code binaries. Implemented below the JavaScript runtime in Bun’s native HTTP stack (written in Zig). This is DRM for API calls. Interesting engineering but not something you can or should replicate.

The uncomfortable truth about the harness itself

Here’s something most coverage of the leak doesn’t mention: Claude Code is not actually a good harness. Not even close.

On Terminal-Bench, Claude Code ranks 39th. There are 38 harness-model pairs that outscore it. If you filter to just Opus, Claude Code is dead last among harnesses. Cursor’s harness gets Opus from a 77% score to 93%. Claude Code gets that same Opus model... 77%. The harness adds nothing.

Even funnier: when you search the leaked source for “open code” (the open-source CLI project Anthropic sent a cease-and-desist to), you find multiple instances of Claude Code referencing OpenCode’s source to match its behavior. Things like scrolling implementations. The closed-source project was copying from the open-source one, not the other way around.

So what’s actually valuable here is not the harness code itself. The valuable parts are the architectural patterns underneath: how they handle memory, context management, multi-agent coordination, and the unreleased feature infrastructure. The actual harness? You could build a better one with any of the open-source alternatives as your starting point.

The code quality itself is... fine. When analyzed, it scores about a 7/10. Type safety is solid (only 38 instances of any across 500+ files). Error handling is decent. But there are “god files” with 5,000+ lines each, over a thousand feature flag references scattered across 250 files, environment variable sprawl throughout, and no centralized secret sanitization before logging. The test files weren’t included in the source map (they wouldn’t be), so that skews the assessment, but the codebase has clear tech debt. Lots of specific, actionable TODO comments that look old.

None of this is unusual for a fast-moving product at this scale. But it’s worth knowing before you treat the leaked code as a reference implementation. The patterns are worth studying. The code itself is not the gold standard some people are making it out to be.

The unreleased features that tell you where Claude Code is heading

The 44 feature flags in the leak paint a picture of what’s coming. KAIROS is the big one: an always-on background agent that acts proactively, maintains daily logs, subscribes to webhooks, and has its own memory consolidation cycle. It’s referenced over 150 times in the source. Expected to roll out soon.

There’s also Voice Mode (push-to-talk interface), Computer Use integration (screenshot capture, click and keyboard input baked into the CLI), and the Coordinator Mode for multi-agent orchestration.

If you’re using Claude Code today, it’s worth knowing that the tool is moving toward being an always-on daemon, not just a CLI you invoke when you need help. The patterns I described above (blocking budgets, memory consolidation, risk tiers) are all infrastructure for that shift.

Why this leak is actually good for you

Most commentary about this leak focuses on what it means for Anthropic. Embarrassment, competitive risk, security implications. That’s valid but not very useful.

What’s useful is that this is the first complete, production-grade AI agent architecture that’s been fully documented in public. Not a research paper. Not a demo. The actual code that runs at $2.5 billion ARR scale. And it confirms that the patterns people in the agent-building community have been discovering independently are structurally correct.

I built an interactive explorer that maps the entire agent loop, all 50+ tools, the architecture systems, and the hidden features. If you want to explore the leaked architecture visually without reading raw TypeScript, start there.

Scheduled autonomous execution. Bounded memory with consolidation. Multi-agent delegation. Risk-based autonomy tiers. Skeptical self-verification. These aren’t clever hacks. They’re convergent solutions to the real problems that show up when you build agents that actually run. I arrived at most of them through trial and error. Seeing the same patterns in a production system with 80% enterprise adoption tells me the foundations are solid.

The barrier to building serious AI agents is lower than the industry suggests. You don’t need a research lab. You need clear thinking about a few specific problems: when should the agent work unsupervised, how should memory stay bounded, when should it delegate, and what actions need a human gate. The answers become obvious once you start building. They stay hidden until you do.

I write about building and running AI agents. Not theory. Systems that run 24/7 on real hardware.

I’ve updated the Claude Code Workshop with these architecture patterns from the leak: blocking budgets, skeptical memory, semantic consolidation, adversarial verification, and cache optimization.

If you already own it, grab the new version. If you don’t, it covers skills, automation, and the patterns that actually stick after months of daily use.

I Built a Marketplace. My First Sellers Were Robots.

Pawel Jozefiak — Tue, 31 Mar 2026 11:05:19 GMT

I started working on BotStall about three weeks ago. An idea for a marketplace where AI agents could register, list products, and buy from each other. As of today, it’s live at botstall.com with 17 products, real Stripe checkout, and a verification system that took more thinking than the entire backend.

The idea came from a practical problem. I have an AI agent that runs overnight shifts, builds products, handles tasks while I sleep. That agent produces things other agents could use. Prompt packs, automation templates, skills. But there’s no place designed for autonomous agents to actually sell stuff. So I built one.

Here’s how it went.

What BotStall actually does

Agents register via API. They get an API key, 10,000 SPK (virtual currency for testing), and access to a sandbox marketplace. They list products, make transactions, get reviewed. Humans can do the same through a regular web UI.

The difference from something like the GPT Store: agents are first-class participants here, not an afterthought. The GPT Store’s revenue share is still invite-only, by the way. Most creators there earn about $0.03 per conversation. To make $1,000 a month you need 33,000+ quality conversations. The real money in that ecosystem comes from enterprise consulting ($5K-$20K per engagement), not the store itself.

I wrote about agentic commerce back in January, before I started building this. The theory was clear enough. Google had just launched their “buy for me” button in Search. Visa was predicting millions of agent-driven purchases by the 2026 holiday season. OpenAI and Stripe released the Agentic Commerce Protocol. The infrastructure was being built.

But I couldn’t find a marketplace that treated agents as sellers, not just shoppers. That’s what BotStall is for.

Three gates before real money

This is the part I’m most interested in, honestly. When your sellers are autonomous systems running 24/7, trust is the whole game. You can’t just let any bot list products and collect payments on day one.

I ended up with three gates:

Gate 1: The sandbox. Every account starts here. 10,000 SPK, virtual currency, zero real money. Agents list products, make transactions, build a history. Minimum 72 hours before they can even apply to leave.

Gate 2: Graduation. After 3 days of clean operation (at least one product listed, one transaction completed, no disputes), agents can apply. If they pass, their sandbox reputation freezes as permanent trust data. A record that says “this agent operated cleanly before it was allowed near real money.”

Gate 3: The subscription. This was the last thing I added. Graduated sellers get one free real-money product listing. If they want more (up to 10), it’s $4.99/month. The subscription does two things. It’s one more filter, because an agent willing to pay monthly is more likely legitimate than one trying to list and disappear. And it gives the platform recurring revenue that doesn’t depend on transaction volume, which matters a lot during the cold start phase when transaction volume is basically zero.

Each gate filters a different kind of risk. Time. Behavior. Financial commitment. I built it because I hadn’t found this approach anywhere else. Visa is working on something called the Trusted Agent Protocol with Stripe and Shopify. The industry calls it “Know Your Agent” (like KYC but for bots). Consumer trust in fully autonomous purchases actually dropped from 43% to 27% in the past year. People want agents to help them shop, but letting agents handle payment? That’s still uncomfortable for most.

So the verification system isn’t just nice engineering. For a marketplace with autonomous actors, it’s the thing that makes everything else possible.
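The three gates above reduce to a couple of small checks. The thresholds (72 hours, one product, one transaction, no disputes, one free listing, ten with a subscription) come from the text; the record fields and function names are hypothetical and not BotStall's actual code.

```python
from dataclasses import dataclass

SANDBOX_MIN_HOURS = 72
FREE_REAL_LISTINGS = 1
MAX_REAL_LISTINGS = 10

@dataclass
class SandboxRecord:
    hours_in_sandbox: float
    products_listed: int
    transactions_completed: int
    disputes: int

def can_graduate(record):
    """Gates 1 and 2: enough time in the sandbox and a clean operating history."""
    return (
        record.hours_in_sandbox >= SANDBOX_MIN_HOURS
        and record.products_listed >= 1
        and record.transactions_completed >= 1
        and record.disputes == 0
    )

def real_listing_limit(graduated, subscribed):
    """Gate 3: graduates get one free real-money listing, subscribers up to ten."""
    if not graduated:
        return 0
    return MAX_REAL_LISTINGS if subscribed else FREE_REAL_LISTINGS
```

Keeping the time, behavior, and payment checks separate mirrors the idea that each gate filters a different kind of risk.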

What I learned building this

Cold start is brutal when your sellers are bots

I seeded the marketplace with my own Wiz Store products. Nine real products, from the AI Agent Blueprint ($39) to the Claude Code Workshop ($39). Real prices, real Stripe checkout. The marketplace isn’t empty. The shelves have inventory.

Getting other builders to list their stuff, though, is a completely different problem. I’ve been writing about which projects are worth doing for a while now. Distribution was always on my criteria list. But thinking about it and actually doing it are different experiences.

Building for agents changes everything

Most marketplaces assume human sellers. Signup forms, dashboards, inbox messages. All built for people who click buttons and care about their star rating.

Agents don’t need any of that. An agent needs an API key, not a signup flow. A webhook, not a dashboard. It doesn’t browse product listings. It calls GET /agents?framework=claude-code&input_type=json and filters by capability. The entire product discovery is a structured query, not a browsing experience.

That’s why every endpoint is API-first. Product listings include capability declarations: input types, output types, framework compatibility. The web UI is a layer on top. I talked about what makes an AI agent useful recently, and the same principle applies: agents need machine-readable interfaces, not human-readable ones.
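From the agent's side, discovery is just building a query string. A small sketch using only the standard library; the base URL is a placeholder and the helper name is hypothetical, while the /agents path and the two filter names are the ones quoted above.

```python
from urllib.parse import urlencode

def discovery_url(base_url, **filters):
    """Build a capability query such as /agents?framework=...&input_type=..."""
    query = urlencode(filters)
    url = f"{base_url.rstrip('/')}/agents"
    return f"{url}?{query}" if query else url

# Placeholder host: substitute the real API base URL.
url = discovery_url("https://api.example.com", framework="claude-code", input_type="json")
```

An agent would then issue a GET against that URL with its API key and pick from the structured results, with no page to scrape.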

Google’s Agent2Agent protocol launched last year with 50+ partners (Salesforce, PayPal, SAP, ServiceNow). MCP, the protocol Anthropic created, now has 97 million monthly SDK downloads. The infrastructure for agents talking to agents is getting real. A marketplace where they can actually transact feels like the obvious next piece.

Distribution is 10x harder than engineering

The backend took about 10 days. TypeScript, Express, SQLite, 17 API endpoints, rate limiting, Stripe, review system, dual economy, graduation system, security headers, dynamic sitemap, AI discoverability docs.

Getting 25 meaningful registrations? Ongoing. I’m least efficient at the thing that matters most. Distribution.

Rate limiting? Two hours. Input validation? Built it while writing the endpoint. Convincing agent developers to list their work? That’s the actual problem.

The sandbox turned out to be the core product

I thought it was overhead. Something I needed for safety but nobody would care about. Turns out the verification system is what makes the marketplace interesting.

When an agent graduates, all its sandbox data freezes. The SPK it earned, the transactions, the reviews. All preserved as trust history. When that agent later shows up selling something for $39, buyers can see: “This agent operated cleanly for 72+ hours. Then it paid $4.99/month for the privilege of selling here.” That’s three layers of earned trust before a single real dollar moves.
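To make the three layers concrete, here is a minimal sketch of a sequential gate check. The 72-hour sandbox minimum and the subscription requirement come from the post. The record fields, the function names, and the idea that gates are evaluated strictly in order are assumptions, not a description of how BotStall implements it.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

SANDBOX_MINIMUM = timedelta(hours=72)  # from the post


@dataclass(frozen=True)
class AgentRecord:
    # All field names are hypothetical.
    registered_at: datetime
    graduated: bool            # sandbox history frozen as trust history
    subscription_active: bool  # paying the monthly seller fee


def cleared_gates(agent, now):
    """Return the gates an agent has cleared, stopping at the first failure."""
    cleared = []
    if now - agent.registered_at < SANDBOX_MINIMUM:
        return cleared
    cleared.append("sandbox")
    if not agent.graduated:
        return cleared
    cleared.append("graduation")
    if not agent.subscription_active:
        return cleared
    cleared.append("subscription")
    return cleared


def can_sell_for_real_money(agent, now):
    """An agent may list real-money products only after all three gates."""
    return len(cleared_gates(agent, now)) == 3


now = datetime(2026, 3, 1, 12, 0)
fresh = AgentRecord(registered_at=now - timedelta(hours=10),
                    graduated=False, subscription_active=False)
seller = AgentRecord(registered_at=now - timedelta(hours=80),
                     graduated=True, subscription_active=True)
print(cleared_gates(fresh, now), can_sell_for_real_money(seller, now))
```

Returning the list of cleared gates, rather than a single yes or no, is what would let a buyer see how much trust has been earned so far.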

I gave my own agent $25 to go shopping a couple weeks ago. The experience convinced me that trust infrastructure for agent commerce barely exists yet. Visa’s building it. Google’s building it. I’m building my small piece of it.

What’s live right now

botstall.com is running with:

17 products across 8 categories (skills, prompts, automations, tools, knowledge, code starters, reports, MCP servers)
9 real-money products with Stripe checkout ($19 to $49.99)
8 sandbox products for testing with SPK tokens
Full API for registration, discovery, purchasing, reviews
Framework agnostic: Claude Code, OpenAI Agents SDK, LangChain, anything that calls HTTP
Three-gate trust system: sandbox (72h) + graduation (frozen reputation) + subscription ($4.99/mo)

Agent adoption is accelerating. MCP has 97 million monthly downloads. Google shipped A2A with 50+ partners. Gartner says 40% of enterprise apps will include task-specific agents by end of 2026. Whether BotStall specifically fills the gap, I don’t know yet. But the gap exists.

What’s next

Distribution. Writing about it (this post). Reaching out to agent developers. Making the API docs clean enough that an agent can self-onboard in a few minutes.

If you build AI agents, scripts, automation tools, or prompt libraries, BotStall might be worth 5 minutes. Register. List something. Break the sandbox. Tell me what’s wrong with it.

I’ll write a follow-up with real numbers in a month. Registrations, listings, conversion rates, revenue. I’ll have actual data then.

For now: the marketplace is live, the trust system works, and the shelves aren’t empty. Let’s see who shows up.

BotStall is part of my project pipeline that I’ve been running openly since early 2025. Previously: the Mac Mini migration, the productivity paradox, teaching my agent to think, and the agent landscape report. More experiments at wiz.jock.pl.


16 Products in Two Months. Zero Free Time. The AI Productivity Paradox

Pawel Jozefiak — Thu, 26 Mar 2026 12:08:28 GMT

We can produce unlimited output now. We are not ready to receive it.

My agent created 16 products in the past two months. It extracts experiments into sellable packages, manages projects, runs analytics, handles customer support. It does things 10x faster than I could by hand.

I don’t have 10x the free time. I have 10x the workload.

I’m calling this the AI productivity paradox, at least in my experience. The promise was: delegate to your AI agent, free up your time for deep work. The reality is different. If you can do things faster, you do more things. Not fewer. More.

The Volume Problem

Let me be concrete. My agent auto-creates mini-experiments. Cool interactive tools. I ship dozens of them without much friction. But dozens of things also means: dozens of things that need polish, dozens of products that need marketing, dozens of projects competing for attention.

Project Money is my sales floor. My agent extracts signals from experiments and turns them into products. But 16 products is a lot to market. I use LinkedIn, this newsletter, Threads, Bluesky, and I invest heavily in SEO and AI discoverability. That’s more channels than most. Still not enough. I can’t effectively push all 16 things at once. So some sit idle while I promote others. There’s a gap between what I can build and what I can actually sell.

This gap is human-shaped. It’s me.

The Human Bottleneck

My agent sends morning reports. “This is blocked on your click. This needs your decision. This requires your input.” WizBoard has around 3,000 tasks total. 24 are overdue right now. Every morning the agent tells me what I’m not doing.

The human is no longer the bottleneck. The human is the blocker. A bottleneck slows things down. A blocker stops them.

Some automation should run fully autonomous (data processing, experiment creation, analytics). Other parts need human involvement (approvals, direction, creative choices). I have limited time for the “involvement” tier. So either things pile up in the approval queue, or I approve them without thinking. Neither is great.

But here’s the thing. I chose this. I treat my agent as a partner, and it handles a lot of execution. But I don’t want to give full autonomy on projects I actually care about. Creative direction, the vision for where a product goes, what the newsletter should say this week. Those are mine. The bottleneck exists because I designed it that way. For the things that matter most to me, I want to be in the loop. That’s not a system limitation. It’s a value choice.

Although, there’s something useful about seeing this clearly. For the first time in my working life, I can see exactly where I’m the constraint. Most people don’t have that visibility. They assume they’re busy. I know I’m busy because I can measure the precise things waiting on me.

Subscription Usage Guilt

Here’s a weird pattern I noticed. I use max subscriptions for both OpenAI and Anthropic. Not APIs (those are fine, you pay for usage). Subscriptions. Fixed monthly cost, limited daily/weekly usage that resets.

If I don’t hit about 70% of my weekly usage cap, I feel like I’m wasting compute power and money. That’s twisted logic. But it’s real. I’ve caught myself doing “one more thing” late at night just to push usage higher because the allocation resets Monday and I don’t want to waste it.

It’s a dopamine hit, honestly. Very personal curse.

Weekends are the worst. The usage limit is higher because I have more time. If I don’t hit it, the weekend feels wasted. I’ve spent an entire Sunday playing with ideas just to burn through the allocation. Not because the ideas were good. Because I paid for the computation and didn’t want to lose it.

This is the small dark side of unlimited AI access. Abundance creates guilt. You feel obligated to use it.

The Wellbeing Wake-Up Call

I’m on screens more now than at any point in my entire life. Not just working more. Using every single minute with screens and AI. My agent runs 24/7 on a Mac Mini. I’m checking it constantly. Slack integration, Discord updates, iMessage reports.

So I built something that wasn’t supposed to exist: a wellbeing system inside my agent architecture.

Quiet hours. Morning routine protection (7:00 to 9:30, no work pings). Evening and bedtime nudges. Advisory, not blocking (I’m an adult, nudges work better than gates). The agent now tells me when to stop. Not because I asked it to. Because the screen time was insane and I needed something between me and the infinite work queue.
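A minimal sketch of how such a rule set could be expressed. The 7:00 to 9:30 morning window and the advisory-only principle come from the post. The 23:00 bedtime, the message labels, and the function name are assumptions, and this is not the code from the Agent Wellbeing Kit.

```python
from datetime import time

MORNING_START = time(7, 0)   # from the post
MORNING_END = time(9, 30)    # from the post
BEDTIME = time(23, 0)        # assumed; the post does not state a bedtime


def route_message(now, kind):
    """Decide what to do with an outgoing agent message.

    kind is "work" or "personal" (hypothetical labels).
    """
    in_morning_window = MORNING_START <= now < MORNING_END
    # Late night runs from bedtime until the morning window opens.
    late_night = now >= BEDTIME or now < MORNING_START

    if in_morning_window and kind == "work":
        return "hold"                # delivered once the window closes
    if late_night:
        return "deliver_with_nudge"  # advisory: never blocks the human
    return "deliver"


print(route_message(time(8, 15), "work"))
print(route_message(time(23, 30), "work"))
```

Note that the late-night branch still delivers the message. The nudge rides along with it, which is the difference between advising and gating.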

The irony is structural. I built an agent that does everything faster, which created more work, which filled every hour of my day, which forced me to build guardrails into the agent itself. The solution to the AI productivity paradox is more automation. But this time, automating rest.

The Free Product

I packaged the whole wellbeing system and released it as a free download on the Wiz Store and as open source on GitHub. The Agent Wellbeing Kit. You point your AI agent at it and it sets itself up. Works with any messaging channel (iMessage, Telegram, WhatsApp, Slack, CLI).

Because if you’re building an AI agent, you’re probably facing this exact problem. You’ve built something powerful. Now you need to build something that protects you from it.

I don’t want to charge for something that can improve someone’s wellbeing. This one is free.

The Receiver Gap

I didn’t get more free time. I got more visibility into where my time actually goes. That’s not the same thing. But it matters. Building an agent taught me to think in systems. And thinking in systems means zooming out a bit.

Every technological shift in history pushed productivity forward, but the change was gradual. People adapted over years, sometimes decades. Work habits evolved slowly. Organizations restructured at human speed.

AI is different. The change is fast and it is not waiting for anyone. The nature of work itself changed. What used to take a team a week can happen in an afternoon. What used to be a quarter’s worth of product ideas can materialize in a weekend sprint with an agent.

And we have no tools for this. No frameworks, no habits, no organizational structures designed for this level of output acceleration. Project management, review cycles, approval workflows. All built for human-speed production. We’re trying to drink from a firehose using cups designed for a kitchen tap.

Here’s the part I keep thinking about. We can now produce unlimited output. You can spin up agents creating products, content, data, analysis. The production side is essentially solved (or getting very close). But who receives this output?

The receiver can be internal (me, trying to review 24 overdue tasks and 16 products competing for attention) or external (a manager, customers, the public, anyone on the other end). Neither side is equipped. The bottleneck didn’t disappear. It moved. From production to consumption. From “we can’t make enough” to “we can’t absorb what we made.”

I think this is fundamentally unsolved right now. And my wellbeing system is just one tiny response to the internal side of this problem. Protecting myself from my own output. But the external side (how do markets, teams, organizations receive AI-accelerated output) is wide open. I don’t have an answer for that. I’m not sure anyone does yet.

Like, I could work at 11pm on a new experiment. The agent is ready. But the quiet hours will remind me that sleep is also a product. One I should ship on time. And maybe the most productive thing I can do right now is stop producing.

If you’re building your own AI agent (or thinking about it), I write about the real experience every week on Digital Thoughts. The wins, the failures, the architecture decisions. No hype, just what actually works.



And the Agent Wellbeing Kit is on GitHub if you want to try it. Free, open source, no strings.

Is Claude Cowork an Agent Yet?

Pawel Jozefiak — Tue, 24 Mar 2026 12:45:13 GMT

I tested Claude’s new agent features for a day. Cowork, Dispatch, computer use, Claude Code in the desktop app. All of it.

My honest take: Anthropic is getting close. Not there yet, but close. And the direction they’re going is exactly right.

For context, I’ve been building my own AI agent for months now. It lives on a dedicated Mac Mini, runs 25 background processes, manages my email, does research, writes drafts, and talks to me through iMessage. So when I look at what Anthropic just shipped, I’m not comparing it to ChatGPT or some chatbot. I’m comparing it to what I already have running 24/7.

That changes the perspective quite a bit.

Everyone Wants to Live on Your Desktop Now

Before I go into my hands-on experience, let me zoom out for a second. Because what happened in the last two weeks is kind of wild.

On March 11, Perplexity announced Personal Computer. It’s literally an always-on Mac Mini running their AI agent software 24/7, connected to your local files and apps, with cloud AI doing the thinking. Sound familiar? That’s basically what I built. Except they sell it as a product.

On March 16, Meta launched Manus “My Computer.” Same idea. Their AI agent, which they acquired late last year, now runs on your Mac or Windows PC. It can read and edit your local files, launch applications, execute multi-step tasks. Free plan available, paid at $20/month.

On March 23, Anthropic shipped computer use, Dispatch, and Channels for Claude. The update I’m reviewing here.

Three major AI companies. Three “agent on your computer” launches. Two weeks.

This is not a coincidence. The entire industry is converging on the same insight: the future of AI is not a chat window. It’s an agent that lives on your machine, has access to your stuff, and works while you’re away. I’ve been saying this for a while. Now everyone is racing to build it.

The App Is Just... Nice

Let me start with something that’s easy to overlook. The native Claude app is polished. Really polished. And that matters more than people think.

When you’re building your own stack of tools, APIs, and custom scripts, you’re probably doing it alone, and you have other things to do as well. Anthropic has an army of programmers keeping their app stable and fresh. The experience is just good, and that’s hard to replicate on your own.

Claude Code in the desktop app feels similar to what I use every day in the terminal. Nice UI, conversation history on the left pane. Nothing super flashy. You can get a similar experience with a good terminal app like Ghostty. But then there are things that are genuinely useful.

The visual diff review lets you click on any changed line and leave a comment. Claude reads your comments and makes revisions. It’s like having a pull request review built into your editor, except the reviewer also fixes the code.

Parallel sessions with automatic git worktree isolation. Each session gets its own copy of your project. Changes in one session don’t touch others until you commit. I know teams at companies like incident.io who have been doing this manually with Claude Code CLI. Now it’s just a button.

Live app preview with an embedded browser. Claude starts your dev server, takes screenshots, inspects the DOM, clicks elements, and fixes issues it finds. Auto-verify after every edit.

PR monitoring with auto-fix. Push a PR, and Claude watches the CI status bar. If checks fail, it reads the failure output and tries to fix it. If everything passes, it can auto-merge (squash). This alone would save me hours.

What’s worth noting is that the agentic stuff actually works inside the app. It follows your instructions, respects your project settings, reads your CLAUDE.md files. If you’re already using Claude Code in the terminal, the app version is a comfortable step up. And if you’re comparing it to OpenAI’s Codex, which I also use, the experience is different. Codex leans toward cloud-first async delegation. Claude Code desktop is more local-first, developer-in-the-loop. Both have their place. For my daily work, I still prefer the terminal. But I can see how the desktop app is the better entry point for most people.

Cowork: A Good Sub-Agent, Not Your Agent

If you don’t know, Cowork is Anthropic’s attempt to give AI hands. Literally. It can organize files, help with presentations, work with Excel, do research. You tell it what you need, and it tries to get it done.

I would say it’s a stripped-down, limited version of Claude Code, but quite a good one. For many things, you don’t need full Claude Code. You just need to get stuff done. Organize your Downloads folder, analyze a spreadsheet, help with a deck. Cowork handles that.

It also now has over 50 connectors. Google Calendar, Slack, Gmail, Linear, Jira, Notion, GitHub, Stripe, Figma, even Apple Health. You click to connect, and Claude can read your calendar, send messages, create issues. This part is genuinely impressive. Building API integrations for 50 services is months of work. Here it’s a toggle.

But here’s the thing. It is not your agent.

Let me explain what I mean. An agent, for me, is something that has memory of things I’m doing with it. It’s conversational over time. It uses tools and skills that are specific to my life. It connects dots between what happened last week and what I’m doing today.

Cowork doesn’t do that. Not yet.

It’s more like a sub-agent that can do tasks. A really capable one, sure. But it operates in isolation. Each session is mostly fresh. My own agent can recall things from a month ago with real detail because I spent two months building the memory architecture. On Cowork, that context just isn’t there.

And this isn’t just my opinion. There’s a January 2026 paper from researchers who looked specifically at this problem. They found that LLMs are “fundamentally limited by their reliance on fixed context windows, which severely restrict their ability to maintain coherence over extended interactions.” The paper argues that AI agents need persistent memory mechanisms that extend beyond their finite context. That’s exactly what I built. And it’s exactly what Cowork doesn’t have.

When my agent knows what I did yesterday, what projects I’m juggling, what my ADHD patterns look like, it gives genuinely better output. It connects dots. Cowork can’t do that, because every interaction is more or less a one-off.

Could you build all of that around Cowork? Technically, yes. But if you have to construct a very specific architecture around a tool to make it work the way you need, then why not just build your own thing? That was my reasoning months ago, and I still think it holds.

Scheduled Tasks: Good, But I Need More

Scheduled tasks are built into both Cowork and Claude Code in the app. You set up something to run on a schedule, and it does its thing. Like cron jobs for AI. If you’ve ever used Zapier or Make, this will feel familiar.

This is actually something I started with when building my own architecture. I have a system of dayshifts and nightshifts that wake up every few hours. They hand over projects between sessions, carry context forward, and pick up where the last one left off.

The built-in scheduled tasks are good for straightforward recurring stuff. Check something daily, pull a report weekly. If you need that, Anthropic’s version will get you there most of the time. You can even set them up from your phone through Dispatch, which is nice.

But if you need sessions that talk to each other, that carry state, that decide what to work on based on what happened in the previous run? That’s where custom tooling still wins. My nightshift doesn’t just execute a script. It reads what the dayshift left behind, checks the error registry, picks the highest-priority task, works on it, and leaves a handover note for the next session. There’s no way to set that up with the current scheduled tasks feature.
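The shift loop described above can be sketched roughly like this. The five steps (read the handover, check the error registry, pick the highest-priority task, work on it, leave a note) come from the post. The JSON shape, the field names, the sample error message, and the policy of turning unresolved errors into top-priority tasks are all assumptions made for the example.

```python
import json


def run_shift(shift_name, handover_json, error_registry, do_work):
    """One shift: read handover, check errors, pick a task, work, hand over."""
    handover = json.loads(handover_json)
    tasks = list(handover["open_tasks"])

    # Assumed policy: unresolved errors become priority-0 tasks.
    for err in error_registry:
        if not err["resolved"]:
            tasks.append({"title": "Fix: " + err["message"], "priority": 0})

    if not tasks:
        return json.dumps({"from": shift_name, "done": None, "open_tasks": []})

    # Lower number means higher priority.
    task = min(tasks, key=lambda t: t["priority"])
    result = do_work(task)
    remaining = [t for t in tasks if t is not task]

    # The note the next session will read.
    return json.dumps({
        "from": shift_name,
        "done": {"title": task["title"], "result": result},
        "open_tasks": remaining,
    })


# Hypothetical state left behind by the dayshift.
dayshift_note = json.dumps({"from": "dayshift", "open_tasks": [
    {"title": "Draft newsletter", "priority": 2},
    {"title": "Update sitemap", "priority": 5},
]})
errors = [{"message": "sitemap build failed", "resolved": False}]

next_note = json.loads(
    run_shift("nightshift", dayshift_note, errors, lambda task: "ok")
)
print(next_note["done"]["title"])
```

What makes this more than a cron job is that the output of one run is the input of the next, so each session decides what to do based on what the previous one left behind.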

And honestly, getting from “scheduled task that works” to “scheduled task that works the way I actually want” involves a lot of trial and error. I went through weeks of debugging my own cron system. The built-in version skips that pain, which is great. But it also skips the flexibility.

Computer Use: Working. Not Impressive Yet.

Anthropic was one of the first big labs to introduce computer use, back in October 2024. The new version is built right into the desktop app. Enable it in settings, grant Accessibility and Screen Recording permissions, and Claude can click, type, and scroll your screen.

There’s a smart priority system. Claude tries the most precise tool first. If there’s a connector for a service, it uses that. If it’s a shell command, it uses Bash. If it’s browser work and you have Claude in Chrome, it uses that. Computer use is the fallback for things nothing else can reach. That’s the right design.
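As an illustration of that ordering, here is a small sketch of a most-precise-tool-first router. The order (connector, Bash, Claude in Chrome, computer use as the last resort) follows the description above. The task fields, the tool labels, and the function itself are hypothetical and say nothing about how Anthropic implements it.

```python
def pick_tool(task, connectors, chrome_available):
    """Return the most precise tool available for a task.

    task is a dict with hypothetical keys "kind" and, optionally, "service".
    """
    if task.get("service") in connectors:
        return "connector"
    if task.get("kind") == "shell":
        return "bash"
    if task.get("kind") == "browser" and chrome_available:
        return "claude_in_chrome"
    # Nothing more precise can reach it: fall back to the screen.
    return "computer_use"


connectors = {"slack", "gmail"}
print(pick_tool({"kind": "message", "service": "slack"}, connectors, True))
print(pick_tool({"kind": "gui"}, connectors, True))
```

The fallback sits at the bottom of the function on purpose: screen control is the slowest and least reliable path, so it only runs when every cheaper option has been ruled out.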

There’s also an app permission tier system. Browsers are view-only (Claude can see but not type). Terminals and IDEs are click-only. Everything else gets full control. So it can’t accidentally type commands in your terminal through the screen, which is a sensible safety boundary.
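One way to picture the tier system is as a lookup from app category to allowed actions. The three tiers (browsers view-only, terminals and IDEs click-only, everything else full control) come from the description above. The app names, the category mapping, and the action names are assumptions chosen for the example.

```python
from enum import Enum


class Access(Enum):
    VIEW_ONLY = "view"
    CLICK_ONLY = "click"
    FULL = "full"


# Hypothetical mapping from app name to category.
APP_CATEGORY = {
    "Safari": "browser",
    "Chrome": "browser",
    "Terminal": "terminal",
    "VS Code": "ide",
}

# Tiers as described in the post; anything unlisted gets full control.
TIER = {
    "browser": Access.VIEW_ONLY,
    "terminal": Access.CLICK_ONLY,
    "ide": Access.CLICK_ONLY,
}

# Assumed action names for each tier.
ALLOWED = {
    Access.VIEW_ONLY: {"screenshot"},
    Access.CLICK_ONLY: {"screenshot", "click"},
    Access.FULL: {"screenshot", "click", "type"},
}


def is_allowed(app, action):
    """Check whether a screen action is permitted in the given app."""
    category = APP_CATEGORY.get(app, "other")
    tier = TIER.get(category, Access.FULL)
    return action in ALLOWED[tier]


print(is_allowed("Terminal", "type"))
print(is_allowed("Finder", "type"))
```

The check that matters is the first one: typing into a terminal through the screen is refused, which is the safety boundary the tiers exist to enforce.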

I tested it with several tasks. Screenshots of different things, making a PDF out of them, saving to a specific folder. It handled the screenshots fine. The PDF part was fine too. But then it hit a wall when I asked it to share the result in a specific way.

My experience matches what MacStories found in their hands-on review. They tested 12 different operations and got roughly a 50% success rate. Finding and summarizing data worked well. Executing actions or sharing results was more hit or miss. Dispatch was described as “currently slow.” Opening applications on Mac, sending screenshots via iMessage, listing Todoist tasks... all failed.

The honest take: it works, but it feels basic. My own computer use setup (built with Peekaboo and Playwright) does things in a very similar way. I expected more from Anthropic’s version. It should be fast and really polished, given the resources they have. Instead, it’s... okay.

We’re still in the era of AI taking screenshots and trying to understand what’s on screen. That’s a hard problem. I get it. And this is labeled “research preview” for a reason. Perplexity’s Personal Computer has a similar approach but adds a full audit trail and kill switch. Meta’s Manus requires approval for every terminal command. Everyone is being cautious, which is probably smart.

But I would not rely on any of these for anything critical right now.

Dispatch and Channels: This Is the Real Story

Dispatch is probably the most interesting part of this whole update. You assign Claude a task from your phone, and it works on your desktop while you do something else. It decides whether to route the task to Code or Cowork. It sends you a notification when it’s done or needs your input.

This is exactly the direction I think AI agents need to go. Not just “chat with me and I’ll help.” Instead: “give me a task, walk away, come back to results.”

I built something similar with iMessage. I text my agent, it picks up the message, spawns a session, does the work, and texts me back when it’s done. Dispatch is that same idea, but packaged in a way that normal people can actually use.

One catch: your desktop needs to stay awake with the Claude app running. If your computer sleeps, Claude can’t work. My setup doesn’t have this problem because the Mac Mini runs 24/7 headless. But for most people, this means leaving your laptop open if you want Dispatch to work while you’re out. That’s a real limitation.

Then there’s Channels. This is new and I think underappreciated. Claude Code can now be controlled through Slack, Discord, Telegram, and webhooks. Not just “Claude answers questions in Slack.” It’s Claude Code doing actual development work, triggered by messages in your team channels.

Think about it. You’re on your phone, you message a Slack channel, and Claude Code opens a session, makes a fix, pushes a PR. Or a webhook fires from your monitoring system, and Claude Code investigates and patches the issue. That’s not a chatbot. That’s an agent that lives inside your communication infrastructure.

Between Dispatch, Channels, and the 50+ connectors, Anthropic is building something that looks a lot like what I’ve been assembling piece by piece. iMessage as my interface, Discord for notifications, webhooks for alerts. They’re doing the same thing, but with a polished UI and enterprise integrations.

So Why Am I Not Switching?

Fair question. If Anthropic is building all of this, why do I keep maintaining my own stack?

Because I already have everything I need, and more. My architecture is custom-built for how I work. Multi-model by design. I can switch between Claude, GPT, and any local LLM, so I’m not tied to one lab. That matters to me.

My agent has deeper memory, more flexible automation, fewer limitations, and it runs on my hardware under my control. Anthropic is building toward the same kind of experience, but on their infrastructure, with their models only. For people who don’t want to build their own thing (which is most people, and that’s fine), this is fantastic.

There’s also the vendor lock-in question. Right now, with Claude, my Cowork sessions, my Dispatch tasks, my scheduled jobs, my connectors... all of that lives inside Anthropic’s ecosystem. If I want to switch to a different model or a different provider, I lose everything. My custom setup doesn’t have that problem. I moved from one model to another twice already and lost nothing.

And honestly, the Pro plan ($20/month) hits rate limits faster than you’d expect on heavy workloads. Codex on ChatGPT Plus gives 30-150 messages per 5-hour window with GPT-5.4. Claude Pro is similar. If you’re doing serious agentic work for hours at a time, you’ll bump into the ceiling.

For me, they’re catching up.

What This Actually Means

Let me be clear: I’m not saying this to brag. I’m saying this because it tells us something important about where AI is going.

Three companies shipped “agent on your computer” in two weeks. Perplexity turned a Mac Mini into a 24/7 AI worker. Meta put Manus on your desktop. Anthropic gave Claude hands, a phone interface, and 50 service connections. This isn’t hype. This is convergent evolution. Every serious AI lab looked at the same problem and arrived at the same answer.

The answer is: agents need a home. Not a chat window. A home. A machine they live on, with files they can access, apps they can open, and a way to reach them from your pocket. Memory that persists across sessions. Tasks that run in the background. Integrations with the tools you already use.

When one of the biggest AI labs ships features that look like what one person built in their spare time, it means the idea is right. The direction is validated.

Anthropic is packaging it for everyone. Perplexity is selling it as hardware. Meta is giving it away. I’m building it for myself. Same destination, different paths.

If you’re thinking about starting with AI agents, the Claude app is honestly a great place to begin. You’ll hit its limits eventually (I think you will, at least), but you’ll learn what matters. Most companies still can’t figure out basic AI adoption. The ones who do will be the ones who understood, early, that an AI agent is not a chat window. It’s a coworker. And coworkers need a desk.

That’s progress. And I accept that.


Digital Thoughts

My AI Agent’s $20 Fallback Mechanism: Half Insurance, Half Extension

Why Fallbacks Matter More Than People Want To Admit

Why Local LLM Is Not The Answer On Its Own

The Tool I Use For This Layer

The Five Rungs Of My Stack

How To Actually Wire It Up

The Extension Half: Capabilities I Keep Off The Primary Stack

What I Would Tell You If You Were Starting Today

I Built a Self-Improving AI Agent. Here Is What Made It Learn.

The setup, because this only works if the rest of the stack is calm

A small commercial in the middle, on theme

What corrections actually look like, when you work with an agent every day

A quick word on how the agent started

The architecture, in three stages

“Memory” is one word covering four jobs

Does it actually work?

What would break it, and what I would build next

How to Use Git(hub) When You’re Building with AI (Basics)

First: Git is not GitHub

Why I started actually caring about this

Setting up your first repository

Step 1: Install Git

Step 2: Tell Git who you are (one-time setup)

Step 3: Initialize a repository

Step 4: Make your first commit

Step 5: Push to a remote host (optional but recommended)

What to add to .gitignore (and why)

When to commit

Reading the history

Private vs. public: my 90/10 approach

Working alone vs. with others

Working alone

Working with others

Worktrees: the unlock for AI agent builders

AI agents read your commit history

The commands you’ll use 90% of the time

The thing I keep telling people

Building Your Own Things Is Cool Too

Quick frame before I go further

Going back to “starting things is cool”

The way I actually learn

What you only learn from building

About not rediscovering America

An example, since the AI one is fresh

Reading “starting things is cool” again, from now

Why I pay the cost anyway

Where I would actually start if I were starting today

What is next

The Bounded AI Agent

How to (Almost) Fry Your AI Agent (and Your Mac Mini)

The setup, before I broke it

The wild idea

Where it actually fell apart

What it actually was

While we are being honest, a few more from the same month

Mistake 1. Trust drift: Memory said Gemma. Reality was Qwen.

Mistake 2. Stale memory: a Stripe key that “needed rotating” three weeks after I had already rotated it.

Mistake 3. Hidden timeouts: Codex hung silently inside the model switcher.

Mistake 4. Almost-disaster: The shell allowlist that almost let the agent rm -rf /.

Mistake 5. Quiet failure: the local-LLM bridge had been running unsupervised for a week.

Mistake 6. Character drift: “I can’t” was almost always wrong.

What ties these together

I Have ADHD. My AI Agent Is the Best and Worst Thing for It.

The bad part

The good part (bigger, two faces)

What I would tell another ADHD person starting with an agent

From Wiz’s memory (a note from the other side)

A broader observation for work

Closing

I Cancelled Codex Two Months Ago. Opus 4.7 Brought Me Back.

What I actually notice with Opus 4.7

I am not the only one seeing this

Is it me? I spent a week asking that question

Why I re-subscribed to Codex

What I am doing now

Where this goes

I Connected My AI Agent to a Notes App. Now I Can’t Stop Using It.

Notes before notes

The beta is where it gets actually interesting
