The 47-File Incident

I asked the AI to reorganize a directory.

What I meant was: move a few files, rename some things, clean up the structure. A ten-minute task. What the AI did was restructure the entire project in a single pass—47 file changes, touching directories I hadn’t mentioned, renaming things I hadn’t asked it to rename, creating new folders based on what it thought the structure should be.

None of it was wrong, exactly. The new structure was arguably better. But I hadn’t reviewed any of it before it happened. Forty-seven changes, no checkpoints, no “does this look right?” in between. I spent an hour untangling what it had done to figure out what I wanted to keep.

That was the moment I understood something: memory and capability aren’t enough without boundaries. A single AI doing everything will occasionally do everything at once. I needed a team, not one smarter assistant.

From Files to Workforce

For the first week, stonerOS was just files. A folder of markdown, some naming conventions, an instruction file that Claude read at session start. Useful. Better than nothing. But I was still doing everything manually—loading context, writing session notes, deciding which files to update.

The system got interesting when I started delegating. Different tasks in a memory system have different characteristics. Searching for a specific memory doesn’t need the same capability as writing a cover letter. Validating that an output is correct before saving it doesn’t need the same model as generating the output in the first place. Logging that a session ended doesn’t need the same intelligence as analyzing what the session revealed.

A team of specialized agents, each running on the appropriate model for its specific task, works better and costs significantly less than one agent doing everything.

The Cheapest Model That Does the Job

Before I describe the agents, the principle that governs them: use the cheapest model that can do the job well.

[Figure: stonerOS agent tiers. 7 Haiku agents for fast, cheap tasks; 10 Sonnet agents for heavy lifting; Opus reserved behind a cost gate; and a /job-sprint pipeline showing how 6 agents chain across tiers.]

This sounds obvious. In practice, it requires resisting the instinct to run everything on the best model.

Current pricing at the time I built this:

Haiku is 12x cheaper than Sonnet. Sonnet is 5x cheaper than Opus. Route a task that Haiku can handle adequately to Sonnet and you pay a 12x cost multiplier for no quality gain.

The key word is “adequately.” Haiku is not a downgrade for everything—it’s the right tier for classification, routing, deduplication, validation, and simple lookups. These tasks don’t benefit from more reasoning capacity. They benefit from speed and cost efficiency.

I have agents that run on Haiku specifically because I want them to run on Haiku. Not because I’m being cheap. Because the quality difference between Haiku and Sonnet for “is this entry a duplicate of something already in the file?” is zero.
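The tiering principle can be sketched as a simple routing table. This is an illustration, not the actual stonerOS configuration: the task categories, tier names, and per-million-token prices are assumptions based on the ratios quoted above.

```python
# Illustrative tier routing. Task categories and prices are assumptions,
# not the real stonerOS config; the ratios match the 12x/5x figures above.

TIER_FOR_TASK = {
    "classification": "haiku",
    "dedup_check": "haiku",
    "routing": "haiku",
    "lookup": "haiku",
    "research": "sonnet",
    "writing": "sonnet",
    "code": "sonnet",
}

# Approximate input price per million tokens, per tier.
PRICE_PER_MTOK = {"haiku": 0.25, "sonnet": 3.00, "opus": 15.00}

def pick_model(task_type: str) -> str:
    """Route to the cheapest tier that handles the task adequately."""
    return TIER_FOR_TASK.get(task_type, "sonnet")  # default to the workhorse tier

def cost_multiplier(from_tier: str, to_tier: str) -> float:
    """How much more an up-tier routing costs for the same token volume."""
    return PRICE_PER_MTOK[to_tier] / PRICE_PER_MTOK[from_tier]
```

The point of making the table explicit: misrouting becomes a one-line diff, not a prompt archaeology session.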

The Agent Roster

Twenty-one agents, organized by what they do:

Memory Operations (Haiku Tier)

memory-search

Find specific memories, history, or insights without loading every file into context. Takes a search query, returns relevant entries. Haiku handles this well because it’s pattern matching over structured text, not reasoning.

session-writer

The only agent allowed to write to learnings/, preferences/, and context/. Every other agent that wants to update memory routes through this one. Single-writer constraint enforced by convention, not code—but convention held consistently is equivalent to enforcement.

dedup-validator

Called by session-writer before every write. Returns UNIQUE, SUPERSEDE, or DUPLICATE. Prevents the learnings files from filling with semantic near-duplicates. Haiku tier because it’s classification: does this observation already exist in a meaningfully similar form?
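The dedup-validator's contract is a three-way verdict. A minimal sketch of that contract, with the caveat that the real agent asks a Haiku model to judge semantic similarity; the substring matching here is only a stand-in to show the return values:

```python
# Sketch of the dedup-validator contract. The real check is a model call
# judging semantic similarity; exact/substring matching stands in here.

def dedup_verdict(new_entry: str, existing_entries: list[str]) -> str:
    new_norm = " ".join(new_entry.lower().split())
    for old in existing_entries:
        old_norm = " ".join(old.lower().split())
        if new_norm == old_norm:
            return "DUPLICATE"   # already recorded, skip the write
        if old_norm in new_norm:
            return "SUPERSEDE"   # new entry extends the old one, replace it
    return "UNIQUE"              # safe to append
```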

context-router

Invoked with the first message of every session. Returns one of three tiers: transactional (skip memory loading, answer directly), partial (load preferences and corrections only), or deep-work (full memory load). This gate prevents the system from loading 50KB of context for a one-line question. Haiku decides routing; Sonnet handles whatever comes next.
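The gate's shape, sketched with a keyword heuristic. The actual router is a Haiku model call, and these keywords are invented for illustration; what matters is the three-tier return value:

```python
def route_session(first_message: str) -> str:
    """Illustrative routing heuristic; the real gate is a Haiku model call
    and the keyword lists here are made up for the example."""
    msg = first_message.lower()
    if any(kw in msg for kw in ("job", "resume", "cover letter", "research", "plan")):
        return "deep-work"       # full memory load
    if any(kw in msg for kw in ("prefer", "remember", "last time")):
        return "partial"         # preferences and corrections only
    return "transactional"       # answer directly, skip memory
```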

qa-validator

Quality gate. Called after significant Sonnet-tier outputs before returning them to me. Returns PASS, FAIL, or NEEDS_REVISION with specific feedback. If it returns a revision request, the generating agent gets one more attempt. After two failed attempts, the task escalates to me. Two attempts maximum, never three.
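The generate-validate-escalate loop is small enough to sketch in full. `generate` and `validate` are placeholders for the agent calls, not real APIs:

```python
# Sketch of the quality-gate loop: generate, validate, allow one revision,
# then escalate. generate/validate are stand-ins for subagent calls.

def run_with_qa(generate, validate, max_attempts: int = 2):
    feedback = None
    for attempt in range(1, max_attempts + 1):
        output = generate(feedback)            # retry carries QA feedback
        verdict, feedback = validate(output)
        if verdict == "PASS":
            return output
    return None  # two failed attempts: escalate to a human
```

The hard cap is the load-bearing part: without it, a stubborn validator and a stubborn generator can loop indefinitely.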

skill-curator

Reads session context and checks an activation log of which skills were last used. Surfaces skill suggestions at session start and flags skills that haven’t been used in 30+ days for archiving. Haiku because it’s a lookup-and-compare task.

session-analyst

Weekly meta-analysis. Reads session files and QA logs, surfaces systemic friction—repeated subagent failures, tasks that keep getting carried forward, stalls that appear across multiple sessions. Stages findings to .claude/context/ for backlog review. Runs once a week; identifies patterns I’d miss reviewing individual sessions.

Heavy Lifting (Sonnet Tier)

archivist

Reads large data sources and distills them into high-density summaries. Large PDFs, long transcripts, bulk data imports. When I get a new document I want the system to know about, the archivist reads it and writes a structured summary. It never writes directly to memory—it stages output for session-writer to integrate.

research-analyst

Deep web research. Tavily first, Firecrawl for deeper page pulls. Up to seven parallel searches when the task warrants it. I use this for job market research, competitive analysis, and any question that requires going beyond what I already know. It synthesizes across sources rather than summarizing each one.

talent

Career coach and portfolio manager. Scores job postings against my actual skill inventory and history, drafts cover letters, writes resume bullets, surfaces portfolio gaps. Knows my application pipeline, my past interview feedback, and my stated career direction. For job search, this is the most-used agent in the system.

code-architect

Scripts, automation, technical implementation. For tasks with more than ten steps, it phases the work and checks in rather than running to completion—a direct consequence of the 47-file incident. The phased check-in is the safeguard: no more unconstrained runs to completion.

docs-sync

Compares documentation files against the actual repo state and patches gaps. Agent counts, slash command lists, architecture descriptions—all the things that go stale as the system evolves. Reads the actual directory structure and agent definitions, then patches the docs to match. Runs on structural changes, not on every session.

finance-analyst

Investment research. Pulls fundamentals, runs parallel web research, synthesizes into an action plan. Integrates with my holdings file and runway model.

security-auditor

macOS security review. Checks system config, network exposure, app permissions, credential hygiene. Outputs prioritized findings to a temp file. Runs on demand when I want a health check.

repo-janitor

File cleanup, archiving stale content, enforcing naming conventions. When the directory structure needs pruning, this agent does it without me having to manually move files.

apple-designer and apple-developer

Paired agents for iOS/macOS development work. The designer produces UX artifacts—flows, screens, component specs—that the developer implements in Swift/SwiftUI. Design QA runs between them. Three-agent pipeline for anything that touches an Apple platform.

design-qa

Quality gate specifically for UI work. Validates implementations against my design charter and Human Interface Guidelines. Returns PASS, FAIL, or NEEDS_REVISION. Never fixes—only flags. The actual fixes go back to apple-developer.

How Agents Compose

No agent works in isolation. The value is in how they chain.

A typical /job-sprint run—full application pipeline in one command—looks like this:

1. context-router classifies the session as deep-work
2. Full memory load, including career profile, pipeline status, and skill inventory
3. research-analyst runs background research on the target company and role
4. talent scores the posting against my profile, then drafts resume bullets and a cover letter
5. qa-validator reviews the talent output before returning it to me
6. session-writer logs the new pipeline entry and any learnings captured

Six agents, two model tiers, one command. The entire pipeline runs in the background while I do other things.
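The six steps reduce to plain function composition. Every function in this sketch is a stand-in for a subagent invocation; the real agents are model calls, not Python:

```python
# The /job-sprint chain as function composition. All functions are
# illustrative stubs standing in for subagent calls.

def context_router(msg): return "deep-work"
def load_memory(tier): return {"profile": "loaded"} if tier == "deep-work" else {}
def research_analyst(posting): return f"research on {posting}"
def talent(posting, profile, research): return f"cover letter for {posting}"
def qa_validator(draft): return "PASS"
def session_writer(entry): pass

def job_sprint(posting: str) -> dict:
    tier = context_router(posting)                        # 1. route: deep-work
    profile = load_memory(tier)                           # 2. full memory load
    research = research_analyst(posting)                  # 3. background research
    draft = talent(posting, profile, research)            # 4. score and draft
    verdict = qa_validator(draft)                         # 5. quality gate
    session_writer({"posting": posting, "qa": verdict})   # 6. log the run
    return {"draft": draft, "qa": verdict}
```

The value of seeing it this way: each arrow in the chain is a place where a cheaper model, a retry, or an escalation can be swapped in without touching the rest.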

The orchestration rule: the main session is the coordinator. It delegates tasks to agents, waits for outputs, and synthesizes. It doesn’t do the heavy work itself—it routes. This keeps the main context window clean and makes the session recoverable if any agent fails.

The Automation Layer

Agents handle tasks when asked. Hooks handle tasks automatically.

Five hooks fire at key moments in Claude Code’s lifecycle:

SubagentStart and SubagentStop—Log every agent launch and completion to a background agent log. No manual tracking required. When something goes wrong, I have a full audit trail of what ran and when.

SessionEnd—Triggers the end-of-session protocol: review the conversation for insights, stage updates for session-writer, finalize the session file.

PreCompact—Fires before Claude Code compresses the context window to stay within token limits. When compression happens, the current conversation state is about to be partially lost. The pre-compact hook captures key insights before they compress out. Without this, context compression is silent information loss.

PostToolUse—Runs after every tool call. Currently used for lightweight monitoring—tracking file write patterns and flagging anything that looks like it’s touching protected paths.
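A hook handler is just a small script that reads a JSON payload and appends to a log. This sketch assumes the payload carries fields like hook_event_name and tool_name; check the Claude Code hooks documentation for the exact schema, and note the log filename here is made up:

```python
# Sketch of a logging hook handler. Payload field names
# (hook_event_name, tool_name) and the log path are assumptions.

import json
import time

def handle_hook(payload: dict, log_path: str = "agent-log.jsonl") -> dict:
    entry = {
        "ts": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "event": payload.get("hook_event_name", "unknown"),
        "tool": payload.get("tool_name"),
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(entry) + "\n")   # append-only audit trail
    return {}  # empty response: let the action proceed
```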

Beyond hooks, two autonomous scheduled tasks run without any input from me:

A Sunday-morning /process-inbox pass that classifies anything I’ve dropped into memories/inbox/ during the week—mobile captures, voice transcripts, links, unstructured notes—and routes each item to the right memory file.

A Monday-morning /weekly-review that synthesizes the previous week’s sessions into patterns, learnings, and a summary. By the time I open Claude on Monday, the weekly review is already done.

Both run through autonomous-runner.sh, which wraps them in a six-layer safety system: file-based kill switch, command allowlist, 10-minute timeout, 30-turn cap, daily run cap, and a git-based write sandbox that automatically reverts any changes outside the approved writable paths.
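The pre-flight half of that safety system is easy to sketch. The real autonomous-runner.sh is shell; this Python version is only an illustration, and the kill-switch path and daily cap are assumed values, not the actual config:

```python
# Illustrative pre-flight gates for an autonomous run. The kill-switch
# path and the daily cap are assumptions; timeout, turn cap, and the
# git write sandbox are enforced at runtime, not checked here.

from pathlib import Path

ALLOWED_COMMANDS = {"/process-inbox", "/weekly-review"}
KILL_SWITCH = Path(".claude/STOP")   # path is an assumption
MAX_RUNS_PER_DAY = 4                 # illustrative daily cap

def may_run(command: str, runs_today: int) -> tuple[bool, str]:
    """Check the file-based kill switch, allowlist, and run cap."""
    if KILL_SWITCH.exists():
        return False, "kill switch present"
    if command not in ALLOWED_COMMANDS:
        return False, "command not on allowlist"
    if runs_today >= MAX_RUNS_PER_DAY:
        return False, "daily run cap reached"
    return True, "ok"
```

The layering matters: any single gate can fail open, but an unattended task has to slip past all six before it can do damage.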

What Agents Are Actually For

Twenty-one agents sounds like a lot. And it is, in the sense that I spent time building and testing each one. But none of them are magic. Each one is a scoped task with a model, a prompt, some constraints, and a defined output.

The reason to build them out this way isn’t sophistication—it’s reliability and cost. A specialized agent that does one thing well is easier to reason about, easier to test, and easier to fix than a general assistant doing everything. When the dedup-validator fails, I know exactly what broke and what it affects. When a monolithic assistant does something unexpected, the failure mode is opaque.

The cost argument is more concrete: routing the right tasks to Haiku instead of Sonnet saves real money. My monthly AI cost runs under $75. For the volume of work the system handles—job applications, financial analysis, research, writing, code—that’s not expensive. It’s cheap because the model-routing is deliberate.

The agents are boring infrastructure. That’s the point.

Next: the automation that runs without you.