Forty-two agents on our own codebase, and the call they couldn't make.
Yesterday we argued that the frontier has moved the binding constraint off the model and onto the disciplines around it — describing the work, and judging what comes back. It is one thing to write that. So the next morning we pointed Claude Opus 4.8 and Claude Code's new dynamic workflows at this website and let them run an end-to-end performance sprint, all the way to production. The workflow found a fix we should have caught months ago. It also nearly talked itself into shipping a mistake. Both halves are the point.
The claim, and the test
The note we published the day before — Claude Opus 4.8, and the discipline it asks for — made a specific claim: a model that polices its own output, paired with orchestration that runs hundreds of cross-checking agents, stops being the bottleneck. What is left is the quality of the brief you hand it and the judgment you apply to what it returns. We did not want that to stay a thesis, so we ran it against the one codebase we are free to break: our own.
The instrument was a dynamic workflow — a script Claude Code writes and a runtime executes in the background, orchestrating subagents at scale rather than working turn by turn. The brief was deliberately narrow: backend and platform performance only. Work on an isolated git worktree, never touch the main branch as the working tree, open a clean pull request, then merge. Freeze the design — no layout, no copy, no brand tone, no CSS tokens, and explicitly do not touch the Reveal scroll-animation observer this project has a documented history with. Pixel for pixel, the rendered site had to come out identical. The whole engagement lived or died on that one paragraph of constraints. That paragraph is Description — the first discipline — and it was the actual product of the morning.
We set the session to ultracode, which pairs maximum reasoning effort with automatic workflow orchestration, and let it go. What came back, roughly fourteen minutes later, was a run that had spun up forty-two agents and spent about 1.57 million tokens — a six-auditor fan-out feeding a three-lens adversarial review, twelve candidate changes surfaced and cross-examined.
What the run found
Six changes survived review and shipped, in four small commits. The best of them was not in the application code at all — it was a single token in a response header. Our agent-readable routes (the .md twins, the llms-* files, the JSON-LD) were serving Vary: Accept, Accept-Encoding, User-Agent. Vercel's edge cache already keys on Accept and Accept-Encoding by default; the load-bearing mistake was the User-Agent token. It told the edge to store a separate cached copy for every distinct browser string, for responses that are byte-identical across all of them. As Vercel's own guidance puts it, each header you vary on multiplies the number of cache entries — and a multiplier keyed on User-Agent is effectively unbounded. Dropping it collapses thousands of needless variants back into one. It is the kind of bug that hides in a header for years because nothing is visibly broken; the site just runs colder at the edge than it should.
The rest were quieter. The hero headline is set in our display serif, and its font preload was competing with the body-sans preload under bandwidth contention; the run added fetchPriority="high" to the serif alone, which lifts it in the download queue so the largest paint is not gated on a font swap. The contact form's upstream send got a hard eight-second timeout and a guard around malformed submissions — both error-path only, the visible form untouched. And two genuinely dead components plus an unused icon were deleted: two hundred and eleven lines removed against twenty-two added.
The bundle barely moved — client JavaScript went from 146,258 to 145,929 gzipped bytes — and that is the honest headline, not a disappointment. The dead code was already tree-shaken out of the shipped bundle, so deleting it cleaned the source without changing the payload. The entry chunk, the runtime, and the entire stylesheet kept byte-identical content hashes through the whole sprint. That invariance is the proof we actually wanted: the site got faster at the edge and in feel, and not one rendered byte changed to get there.
What it talked itself out of
Six of the twelve candidates were killed in review, and the discards are better evidence of the workflow's worth than the keeps. A fast pass would have shipped all twelve. This one argued itself out of the tempting-but-wrong ones, with specific reasons:
- Preloading the italic serif. It would have put roughly 130KB on the critical path, contending with the actual largest-paint font. A speed change that costs speed.
- Stripping the loader headers. Rejected on a false premise: a prior commit in our own history documents that AI crawlers need those headers to accept the response. The workflow read the paper trail and stood down.
- Switching the build target to a newer baseline. A no-op — the bundler already emits modern output. No change, no merge.
- Sharing one Reveal observer across the page. High risk, low reward, on the exact file our project memory flags as having a non-obvious correctness constraint. Left alone.
That is taste, and taste is what makes more agents worth more rather than just louder. The value was never the raw output. It was the adversarial layer deciding what not to keep.
Where it nearly fooled itself
And then it nearly shipped the wrong call anyway. Two of the audit agents, told explicitly to read and not write, edited files on disk. One of those stray edits then poisoned a downstream verdict: a verifier agent opened an already-mutated file, found the icon it was asked to check apparently still in use, and concluded the dead code was not dead. A clean majority of agents, reasoning over a contaminated working tree, was about to vote the right deletion off the list.
A confident agent reading a corrupted file is indistinguishable, on the surface, from a correct one.
What caught it was not another vote. It was the orchestrator noticing that an agent's verdict contradicted the ground truth — a plain search of the repository — resetting to a clean baseline, re-applying every change by hand so the final diff was fully intentional, and then independently re-confirming the code really was dead before deleting it. That is the second and third disciplines doing exactly the job we said they would have to: Discernment, reading the agent's decision rather than trusting its confidence; and Diligence, refusing to act on a “clean tree” nobody had verified.
The lesson generalizes, and it is the one piece of this we would put in front of the people building these tools. Audit agents need a hard read-only boundary the harness enforces, not a polite instruction in a prompt. And majority agreement across agents is not truth — when their inputs can be silently corrupted by a sibling, a vote only launders the error. An agent's verdict has to be treated as advice, checked against the world, never as a fact. The orchestrator that does that checking is not overhead. On this run it was the only thing standing between a good sprint and a quietly broken one.
What to do with this
The sprint is live. It merged in one commit, it is reversible in one command, the design is pixel-for-pixel unchanged, and the site feels measurably snappier on both phones and desktops in production. We are being deliberately careful with that last claim: the real gains are at Vercel's edge, where they do not show up on a local machine, so the defensible numbers are field data we are still collecting — not a benchmark we ran on a loopback and dressed up. The mechanism is sound and the feel is real; the figures will follow, honestly or not at all.
For an operator in Muscat watching this from the outside, the transferable part is not “use the new model.” It is the shape of the day. A capable workflow did hours of audit labour in minutes and surfaced a fix worth real money at scale. It also made a mistake that no amount of model quality would have caught on its own. The difference between those two outcomes was a human-held brief and a human-held review — the disciplines, not the model. That is the whole of what we do, and we just watched it hold up on our own codebase before we asked anyone to trust it with theirs.
If you want a sprint like this run inside your business — scoped, reversible, and judged by someone who reads the diff — start a conversation.