Riding The Token Storm: Notes From A Codex Fleet Wrangler
An operations diary covering how we tuned prompts, instrumentation, and workflows to keep a multi-agent AI fleet shippable.
Expect a detailed tour through plan hygiene, prompt surgery, token economics, and cross-project whiplash. It is technical, occasionally snarky, and aimed at anyone trying to bend a multi-agent workflow to their will.
0. Setup: Why I Even Tried This
OpenAI gives Codex an almost ridiculous runway. Our noisiest implementer log chewed through 461,139 tokens in a single pass and the quota manager never flinched. That headroom encouraged bold experiments: rebuild an entire product surface with agents, let five roles negotiate plan items live, and freeze the roadmap to keep everyone honest.
Platform choice decided the outcome. On an earlier project, a 100% React web app, I ran the same fleet. The planner, implementer, tester loop flew. Hot reloads, Jest snapshots, and Storybook diffs gave the agents instant feedback. The Compose Multiplatform stack I am working on now behaved very differently. Gradle builds, multiplatform previews, and mysteriously missing imports humbled the automation. Hence the investigation.
1. Archaeology: Every Prompt Left A Fossil
I started by walking the history of our automation scripts. Every exasperated message ended up as a change in the orchestrator, and every change taught the fleet a new trick.
- Launch Phase — The orchestrator started as a strict planner that demanded exactly two roadmap entries per iteration and forced every role to run the heaviest validation command. The upside was predictability; the downside was a 15-minute tax on even the tiniest documentation tweak.
- Import Amnesty — After a streak of broken builds, we updated the implementer instructions to insist on adding imports immediately, to log the exact lint and test commands they ran, and to stop dumping entire files when a surgical snippet would do.
- Smoke Signals — The first time the app crashed on launch we added a dedicated integration checklist, then wired the implementer and tester prompts so they could not exit without recording desktop and web smoke notes whenever navigation changed.
- Freeze Era — Once the roadmap ballooned to fifteen UI overhauls, we introduced a frozen plan document, added an environment flag that told the planner to reuse it, and taught the orchestrator to skip heavy validation when only prose changed.
Each stage is a direct translation of someone shouting “stop doing that” during a review.
2. Anatomy Of The Orchestrator
After a few close readings the script finally made sense. Here is the cheat sheet so you do not have to stare at bash for an afternoon.
2.1 Planner Discipline
- The planner reads the roadmap, respects any items already checked off, and, when the freeze flag is set, simply announces the next unfinished task instead of drafting new ones. That single environment variable eliminated thrash.
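The freeze gate really is that small. A minimal sketch of the decision, assuming a flag named `PLAN_FREEZE` and a parsed roadmap (both names are mine, not the orchestrator's):

```python
import os

def next_planner_action(roadmap_items):
    """Decide the planner's move: reuse the frozen plan or draft new items.

    roadmap_items: list of (title, done) tuples parsed from the roadmap.
    PLAN_FREEZE is a hypothetical env var name; the real flag may differ.
    """
    if os.environ.get("PLAN_FREEZE") == "1":
        # Freeze era: announce the next unfinished task, draft nothing new.
        for title, done in roadmap_items:
            if not done:
                return f"Using existing plan: next item is '{title}'"
        return "Using existing plan, all items complete"
    return "Drafting new roadmap items"
```

One environment variable, one early return, and the planner stops inventing work.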
2.2 Implementer Checklist
- Run `make lint` and `make test`, only escalate to heavier builds if packaging actually matters.
- Keep shared code multiplatform-friendly, stash platform specifics behind interfaces.
- Confirm imports and dependencies exist before exiting, no parking missing symbols for the next person.
- Use targeted reads: find the right file path first, then peek at the exact span you need.
- When a feature touches several modules, build an allowlist file and iterate through it instead of hammering `rg` repeatedly.
- Record desktop and web smoke results whenever interactive flows change, following the integration checklist.
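The allowlist trick deserves a sketch, because it is the single biggest cure for search spam. Assuming Kotlin sources and a scratch file name of my own invention (`touchlist.txt`):

```python
from pathlib import Path

def build_allowlist(root: str, modules: list[str],
                    allowlist: str = "touchlist.txt") -> list[str]:
    """Collect the Kotlin files under the affected modules once, write them
    to an allowlist file, and return them so later steps iterate the list
    instead of re-running broad searches. The module layout and the
    'touchlist.txt' name are illustrative, not the fleet's real conventions.
    """
    files = sorted(str(p) for m in modules for p in Path(root, m).rglob("*.kt"))
    Path(allowlist).write_text("\n".join(files) + "\n")
    return files
```

One directory walk up front, then every subsequent read is a lookup instead of a hunt.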
2.3 Tester Playbook
- Inspect the diff first. If only text changed, say so. Otherwise, run the narrowest Gradle tasks possible before defaulting to the full suite, and echo every command with its exit code.
- After UI changes, run the desktop bundle, attempt a web build, walk through the smoke checklist, and report anything you observe.
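"Narrowest Gradle tasks possible" sounds vague until you write it down. A sketch of the mapping, assuming top-level directories are Gradle modules and the standard `:module:test` task convention (module names here are invented):

```python
def narrowest_gradle_tasks(changed_paths: list[str]) -> list[str]:
    """Map changed files to the narrowest per-module Gradle test tasks.
    Assumes each top-level directory is a Gradle subproject; falls back
    to the full suite when no module can be inferred.
    """
    modules = {p.split("/", 1)[0] for p in changed_paths if "/" in p}
    if not modules:
        return ["test"]  # nothing module-shaped changed: run the full suite
    return sorted(f":{m}:test" for m in modules)
```

It is crude, but it keeps a one-module fix from triggering a twenty-minute everything-build.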
2.4 Reviewer Routine
- Reviewers lean on the implementer and tester summaries, confirm the roadmap item can be checked off, stage everything, and leave a checkpoint commit even if validation fails. Nothing disappears silently.
3. Instrumentation: Listening To The Logs
The fleet was noisy long before it was helpful. Building a log summarizer changed that. The script parses every role’s transcript, captures duration, maximum tokens consumed, and the top repeated commands. A typical run looks like this:
```
Implementer
  Runs: 24
  Average duration: 10:51
  Longest run: 24:28
  Average tokens: 163,550
  Max tokens: 461,139
  Top commands: make lint (32), git status (31)

Planner
  Average duration: 02:46
  Average tokens: 84,184
```
Three insights landed immediately:
- Eleven-minute implementer passes were caused by redundant file reads and repeated lint cycles.
- Token peaks north of 400k were tolerable only because Codex’s quota is generous; without instrumentation we never would have noticed.
- Command spam proved the implementer was not sharing results. Once we forced them to include command summaries, downstream roles stopped re-running `make lint` out of paranoia.
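The summarizer's core is less clever than the output suggests. A minimal sketch, assuming transcript lines like `tokens: 163550` and commands echoed as `$ make lint` (the real log format surely differs):

```python
import re
from collections import Counter

def summarize(transcript: str) -> dict:
    """Summarize one role's transcript: run count, token peak and average,
    and the most repeated shell commands. The 'tokens:' and '$ ' line
    shapes are assumptions about the log format, not the real schema.
    """
    tokens = [int(m.group(1))
              for m in re.finditer(r"tokens:\s*(\d+)", transcript)]
    commands = Counter(m.group(1).strip()
                       for m in re.finditer(r"^\$ (.+)$", transcript, re.M))
    return {
        "runs": len(tokens),
        "max_tokens": max(tokens, default=0),
        "avg_tokens": sum(tokens) // len(tokens) if tokens else 0,
        "top_commands": commands.most_common(2),
    }
```

Two regexes and a Counter bought us every insight in this section.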
4. Roadmap Hygiene And Freeze Discipline
The roadmap now mirrors our UI overhaul checklist. Each entry lists the goal, the surfaces it touches, the expected deliverables, and the exact commands we consider mandatory. When a task is complete, the orchestrator marks the matching goal as done and leaves the roadmap in place for the next pass.
Skipping heavy validation when a diff only touches text shaved minutes off documentation updates. More importantly, freezing the roadmap stopped the planner from inventing new work mid-iteration. Focus went up, token burn went down.
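The prose-only check is nothing fancier than a suffix filter. A sketch, with a suffix set that is my guess at what counts as prose in this repo:

```python
from pathlib import PurePosixPath

PROSE_SUFFIXES = {".md", ".txt", ".adoc"}  # a guess; tune per repo

def can_skip_heavy_validation(changed_paths: list[str]) -> bool:
    """True when every changed file is documentation, so the orchestrator
    may skip Gradle builds and run only lightweight checks. An empty diff
    is treated as 'do not skip' to stay on the safe side.
    """
    return bool(changed_paths) and all(
        PurePosixPath(p).suffix.lower() in PROSE_SUFFIXES
        for p in changed_paths
    )
```

Defaulting to "do not skip" on anything ambiguous matters more than the suffix list itself.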
5. Integration Testing Went From Afterthought To Law
The integration checklist covers the flows that kept breaking:
- Create a report with media and location, submit, and confirm it lands in the feed.
- Follow and unfollow an issue while watching counts adjust.
- Trigger an escalation from the activity view and ensure the timeline updates.
- Visit the account hub, switch between saved items, rotate the anonymous ID, and return without crashes.
- Open the notification bell and the digest preview to confirm they render.
Until we automate those flows, every implementer and tester documents the desktop and web passes manually. Reviewers refuse to merge without evidence. The process is slower than I would like, but the crash rate plummeted after we adopted it.
6. Token Economics And Quota Reality
Codex’s generous limits let us experiment without fear. Multi-role iterations ship even for documentation changes, gigantic screen rewrites flow through in one request, and the prompts themselves can stay verbose.
That freedom masks inefficiency. Without tracking, we were burning hundreds of thousands of tokens per iteration without accountability. The log summarizer gave us the numbers we needed to justify prompt surgery, freeze the roadmap, and tighten expectations for every role.
7. Cross-Project Reality Check: React Web vs Compose Multiplatform
| Aspect | React Web Fleet | Compose Multiplatform Fleet |
|---|---|---|
| Build/Run | npm run build, Vite dev server, Playwright | make desktop, make web, Gradle tasks, emulator smoke |
| Test Loop | Jest/Vitest plus Cypress in roughly 3–5 minutes | make lint + make test + manual desktop/web smoke, often 15–25 minutes |
| File Scale | Dozens of small components | Monolithic screens (300–1400 LOC) plus multiplatform utilities |
| Import Issues | Rare because ESM handles it | Frequent due to platform-specific targets |
| Fleet Velocity | Two planner passes per day, implementer/tester done in under an hour | One frozen plan item per day, implementer passes routinely exceed ten minutes |
| Token Use | Around 80k per run | Around 160k per run |
The React project thrived because hot reload tightened feedback loops, TypeScript and ESLint caught mistakes early, and snapshot tooling gave testers near-instant diffs. Compose punished every misstep with another Gradle build. Same token budget, wildly different velocity.
Conclusion: Codex fleets shine when build and test cycles are short and the component graph stays modular. They struggle when multiplatform packaging or binary builds enter the mix.
8. War Stories And Favorite Log Lines
- An implementer once ran the same search command eight times in a single log. After the prompt started recommending allowlists, repetition vanished.
- Planner logs during a freeze became blissfully quiet: “Using existing plan, freeze flag enabled” and nothing else. Sometimes the best automation knows when to stay silent.
- Reviewer checkpoints ensured even failed runs left an audit trail. No more mystery regressions.
9. Lessons I Will Not Forget
- Translate pain into prompts. Every rule—from import hygiene to smoke testing—started as a complaint.
- Freeze plans to fight thrash. Frozen roadmaps plus a friendly environment flag kept the fleet aligned.
- Instrument everything. Without token and duration metrics we were flying blind.
- Respect platform reality. Compose needs heavier smoke coverage; React did not.
- Generous quotas are not an excuse to waste them. The capacity exists for exploration, not sloppiness.
- Documentation is infrastructure. The speed plan, integration checklist, and this post are part of the system.
10. To-Do List For Future Me
- Automate the smoke checklist. Build a shared Compose UI harness, add a desktop headless run, wire Playwright into the web build, and let testers trigger all of it with one command.
- Diff-driven command planner. Teach the orchestrator to map file changes directly to the tests they require.
- Token anomaly alerts. Flag any role that blasts past 250k tokens so we can intervene quickly.
- Parallel doc/test mode. Once builds stabilize, experiment with overlapping roles to cut wall-clock time.
- Back-port the lessons. The React fleet should inherit the good parts—freeze plans and logs—even if it does not need heavy smoke checks.
- Share the playbook. New contributors should not have to rediscover these guardrails the hard way.
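The token anomaly alert on that list could start embarrassingly small. The 250k threshold comes from the to-do above; everything else here is assumed:

```python
TOKEN_ALERT_THRESHOLD = 250_000  # the line in the sand from the to-do list

def token_alerts(role_tokens: dict[str, int]) -> list[str]:
    """Flag any role whose peak token use crossed the threshold.
    role_tokens maps role name -> max tokens seen in the run; in practice
    it would come straight from the log summarizer's output.
    """
    return [
        f"ALERT: {role} used {tokens:,} tokens (> {TOKEN_ALERT_THRESHOLD:,})"
        for role, tokens in sorted(role_tokens.items())
        if tokens > TOKEN_ALERT_THRESHOLD
    ]
```

Pipe that into whatever channel the fleet already reports to and the 461k surprises stop being surprises.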
11. Closing Thoughts
The fleet now obeys a frozen roadmap, keeps imports tidy, logs every detour, and refuses to ship UI changes without smoke evidence. We have also learned to admit that some ecosystems, particularly TypeScript and React, are simply better suited for high-frequency agent iteration than a multiplatform Kotlin stack.
OpenAI’s permissive Codex limits gave us the runway. Constant feedback supplied the urgency. This write-up captures the messy, detailed process of turning “the fleet sux at working at tasks” into a disciplined automation pipeline.
If the Compose branch keeps fighting tomorrow, at least we have metrics, documentation, and a frozen plan. And if all else fails, there is always that React project waiting for another lightning-fast iteration.