23,479 Sessions: What Actually Works in Agentic Development
What 11.6GB of session data across 27 projects reveals about building production software with AI agents
I averaged 559 AI coding sessions per day for 42 days straight. Not prompts. Sessions. Each one a self-contained agent with its own context window, its own task, its own tools.
23,479 total. 3,474,754 lines of interaction data across 27 projects. This series is what I learned.
Here's the short version: AI agents fail in predictable ways. They forget across sessions. They declare victory without evidence. They build features that look correct but do nothing. They pick expensive models for trivial tasks. They corrupt each other's work when they edit the same file. Every system I built over those 42 days — consensus gates, functional validation, cross-session memory, orchestration loops, enforcement hooks — exists because one of those failures hit me in production. Building with agents. Building tools for agents. Every claim traces to a real session. Every system backed by a companion repo you can clone and run.
The Numbers
23,479 sessions. I started 4,534 of them. The other 18,945 were agents spawning agents. An orchestrator delegates to a reviewer, the reviewer spawns a verifier, the verifier reports back up the chain. That's a 1:4.2 ratio. Every time I kicked off a session, the system spawned roughly four more on its own.
The tool leaderboard tells you what agents actually do with their time:
Read leads everything. 87,152 file reads versus 19,979 edits, a 4.4:1 ratio. Throw in Bash (82,552, mostly commands to understand state) and Grep (21,821 searches), and the picture gets starker: agents spend roughly 80% of their tool invocations understanding code and 20% changing it.
That ratio is the thesis of this entire series. Agents that read before they write produce fewer regressions than agents that jump straight to editing. The most productive thing an AI agent does isn't writing code. It's understanding the code that already exists.
“Agents aren't generators. They're readers that occasionally write.”
The coordination column is where things get wild. 2,827 Task spawns. 4,852 TaskUpdates. 2,182 TaskCreates. 1,720 SendMessages. That's an entire organizational layer. Agents creating teams, assigning work, reporting status. None of that existed when I started. The 929 inline Agent calls are ad-hoc delegation: an agent decides mid-task that it needs a specialist and spins one up on the spot. I didn't design that behavior. It emerged.
Five Failure Modes
Every system in the rest of this series exists because something broke. These five failure modes showed up in the first week and never stopped.
Amnesia is fixed by a cross-session memory store that records observations and re-injects them into future contexts. That system reduced repeated mistakes by 73% across the projects where it was deployed (Post 12).
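Here's the shape of that store, as a minimal Python sketch. I'm assuming a JSONL observation log per project; the names (MemoryStore, record, recall) are stand-ins, not the actual system from Post 12.

```python
import json, time
from pathlib import Path

class MemoryStore:
    """Append-only observation log that survives individual agent sessions."""

    def __init__(self, project_dir: str):
        self.path = Path(project_dir) / ".agent-memory" / "observations.jsonl"
        self.path.parent.mkdir(parents=True, exist_ok=True)

    def record(self, topic: str, observation: str) -> None:
        """Persist a lesson learned so later sessions don't rediscover it."""
        entry = {"ts": time.time(), "topic": topic, "observation": observation}
        with self.path.open("a") as f:
            f.write(json.dumps(entry) + "\n")

    def recall(self, topic: str, limit: int = 5) -> list[str]:
        """Return recent observations on a topic, for re-injection into a new
        session's context before it starts work."""
        if not self.path.exists():
            return []
        entries = [json.loads(line) for line in self.path.read_text().splitlines()]
        return [e["observation"] for e in entries if topic in e["topic"]][-limit:]

# Usage: one session records what it learned; the next session reads it back.
memory = MemoryStore(".")
memory.record("build", "next build passes even when the .next cache is stale")
print(memory.recall("build"))
```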
Confidence without evidence is fixed by functional validation. No mocks, no test files. Build the real system, run it, exercise it through the actual UI, capture screenshots. Across all sessions, the block-test-files hook fired 642 times, preventing agents from writing tests that mirror their own assumptions instead of exercising real features (Post 3).
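A sketch of what such a hook can look like, assuming a pre-tool-use hook that receives the pending tool call as JSON on stdin and blocks it with a non-zero exit code. The file patterns and the message back to the agent are illustrative, not the hook that fired those 642 times.

```python
#!/usr/bin/env python3
"""Pre-tool-use hook sketch: reject writes that create test files.

Assumes the hook runner passes the pending tool call as JSON on stdin
and treats a non-zero exit code as "block this call".
"""
import json, re, sys

TEST_PATTERNS = [r"(^|/)tests?/", r"\.test\.[jt]sx?$", r"_test\.py$", r"Tests\.swift$"]

call = json.load(sys.stdin)
tool = call.get("tool_name", "")
path = str(call.get("tool_input", {}).get("file_path", ""))

if tool in ("Write", "Edit") and any(re.search(p, path) for p in TEST_PATTERNS):
    # The message goes back to the agent so it knows *why* it was blocked.
    print(f"Blocked: {path} looks like a test file. "
          "Validate by running the real app, not by writing tests.", file=sys.stderr)
    sys.exit(2)

sys.exit(0)  # anything else passes through untouched
```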
Completion theater is caught by the three-layer validation stack: 7,985 iOS simulator MCP calls (taps, gestures, accessibility queries, screenshots) and 2,068 browser automation calls (clicks, navigations, screenshots). Real buttons. Real forms. Real validation.
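The browser half of that stack looks roughly like this: a Playwright sketch in Python with a hypothetical URL and selectors standing in for the real automation calls. Drive the actual UI, wait for the actual result, keep the screenshot as evidence.

```python
from playwright.sync_api import sync_playwright

# Functional validation sketch: exercise the real UI and capture evidence.
# URL and selectors are hypothetical; the point is clicking real buttons,
# not asserting against mocks.
with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()

    page.goto("http://localhost:3000/login")
    page.fill("input[name=email]", "reviewer@example.com")
    page.fill("input[name=password]", "not-a-real-password")
    page.click("button[type=submit]")

    # The claim "login works" needs evidence, not assumptions.
    page.wait_for_selector("text=Dashboard")
    page.screenshot(path="evidence/login-dashboard.png", full_page=True)

    browser.close()
```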
Wrong model for the job is fixed by three rules: lookups go to Haiku, implementation goes to Sonnet, architecture review and complex debugging go to Opus. No machine learning. No classifier. Just three rules (Post 7).
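Those three rules fit in one function. A sketch with generic model names and task categories of my own choosing; the real router is Post 7's subject.

```python
# Three-rule model router: no classifier, just an explicit mapping.
# Model identifiers are illustrative; swap in whatever versions you run.
ROUTES = {
    "lookup":         "claude-haiku",   # file reads, greps, status checks
    "implementation": "claude-sonnet",  # writing and editing code
    "architecture":   "claude-opus",    # design review, complex debugging
}

def pick_model(task_kind: str) -> str:
    """Route a task to the cheapest model that can actually do it."""
    return ROUTES.get(task_kind, "claude-sonnet")  # default to the middle tier

assert pick_model("lookup") == "claude-haiku"
assert pick_model("architecture") == "claude-opus"
```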
Coordination failures are fixed by file-ownership maps with glob patterns. Two agents literally can't edit the same file. That system and its 2.3x speedup over sequential execution is the subject of Post 2.
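A sketch of the ownership check, assuming a plain dict of agent names to glob patterns; the real map, and the 2.3x number, are in Post 2.

```python
from fnmatch import fnmatch

# Hypothetical ownership map: each agent may only touch its own globs.
OWNERSHIP = {
    "ui-agent":  ["src/components/*", "src/styles/*"],
    "api-agent": ["src/api/*", "src/db/*"],
}

def owner_of(path: str) -> str | None:
    """Return which agent owns a path, or None if nobody claimed it."""
    for agent, patterns in OWNERSHIP.items():
        if any(fnmatch(path, pattern) for pattern in patterns):
            return agent
    return None

def can_edit(agent: str, path: str) -> bool:
    """An edit is allowed only if the path falls inside the agent's globs."""
    return owner_of(path) == agent

assert can_edit("ui-agent", "src/components/Button.tsx")
assert not can_edit("api-agent", "src/components/Button.tsx")
```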
From Autocomplete to Operating System
The turning point was a framing shift: stop using AI as autocomplete and start treating it as a team of specialized workers.
Autocomplete operates inside a single context window. A team operates across multiple context windows with coordination protocols between them. The context window isn't just a limitation. It's an architecture boundary. Each agent gets a fresh window, a specific role, and a defined scope. The orchestrator coordinates across those boundaries using the filesystem, not shared memory.
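Here's roughly what "coordinates using the filesystem" means, as a sketch: the orchestrator writes a task spec to disk, the worker writes its result next to it. The paths and field names are hypothetical.

```python
import json
from pathlib import Path

TASKS = Path(".orchestrator/tasks")
TASKS.mkdir(parents=True, exist_ok=True)

def delegate(task_id: str, role: str, scope: list[str], goal: str) -> Path:
    """Orchestrator side: hand a fresh-context agent one job and one scope."""
    spec = {"id": task_id, "role": role, "scope": scope, "goal": goal}
    path = TASKS / f"{task_id}.json"
    path.write_text(json.dumps(spec, indent=2))
    return path

def report(task_id: str, status: str, evidence: dict) -> None:
    """Worker side: results go back through the filesystem, not shared memory."""
    result = {"id": task_id, "status": status, "evidence": evidence}
    (TASKS / f"{task_id}.result.json").write_text(json.dumps(result, indent=2))

delegate("T-001", "reviewer", ["src/streaming/*"], "Review the streaming module for state bugs")
report("T-001", "PASS", {"files_read": 14, "issues": []})
```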
4,534 human-initiated sessions versus 23,479 total tells the story: 81% of all sessions were agents spawning other agents. The coordination infrastructure (2,827 Task spawns, 4,852 TaskUpdates, 2,182 TaskCreates, 1,720 SendMessages) is an organizational layer running on top of Claude Code. I didn't plan it that way. It emerged because single-agent workflows kept hitting the five failure modes above.
Here's what that looks like in practice. Session 33771457 in the ils-ios project: the orchestrator needed to consolidate five incomplete iOS specifications into one production spec. It spawned 13 different team configurations over the course of the session. First a design team — one architect and three validators. The architect drafted; the validators reviewed independently and voted. When consensus was reached, the orchestrator dissolved the team and created an implementation team: one executor, three new validators. When implementation gates passed, a final consensus checkpoint team produced the unanimous PASS/FAIL verdict. Eighty agent operations total. I typed one sentence to start it.
Four Patterns That Survived
Across 23,479 sessions, four patterns survived contact with real codebases. Everything else? Good ideas that didn't hold up.
The consensus-gate pattern caught the += vs = bug that had been hiding for three days. Alpha flagged the operator as inconsistent with the API's full-message response format. Bravo flagged the index reset as a state management hazard. Lead flagged both as violations of the streaming module's own documentation comments. Three reviewing agents, three different lenses, one root cause.
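For reference, a minimal consensus gate sketched against the Anthropic Python SDK: three reviewers with different lenses, and the gate passes only if every one of them says PASS. The prompts, verdict format, and model ID are my stand-ins, not the actual gate from Post 2.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Three lenses, deliberately different, reviewing the same diff independently.
LENSES = {
    "alpha": "Review this diff for API-contract mismatches.",
    "bravo": "Review this diff for state-management hazards.",
    "lead":  "Review this diff against the module's own documentation comments.",
}

def consensus_gate(diff: str, model: str = "claude-sonnet-4-20250514") -> bool:
    """Return True only if every reviewer ends its reply with VERDICT: PASS."""
    # Model ID is illustrative; any Sonnet-class model works for the sketch.
    verdicts = []
    for name, lens in LENSES.items():
        reply = client.messages.create(
            model=model,
            max_tokens=1024,
            messages=[{"role": "user", "content":
                       f"{lens}\n\n{diff}\n\nEnd with 'VERDICT: PASS' or 'VERDICT: FAIL'."}],
        )
        text = reply.content[0].text
        verdicts.append((name, "VERDICT: PASS" in text))
    return all(passed for _, passed in verdicts)
```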
Functional validation caught a stale .next cache bug that next build said didn't exist. The agent ran 674 Playwright tool calls in a single validation pass. I'm still annoyed about that one — I'd spent two hours blaming my code before the agent proved it was a build cache issue.
Fresh context wins over accumulated context. Have you ever watched an agent confidently reference code it read 30 minutes ago that's since been rewritten by another agent? I have. The discipline is short-lived agents with one job each.
Filesystem persistence is what lets agents collaborate without shared memory. VG1.2 from session ad5769ce: "EventBus emits events", with the evidence attached: `curl emit&count=10` returns `{"emitted":10, "subscriberCount":1, "ringBufferSize":10}`. Not "it works" but "here is the exact JSON proving it works."
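A sketch of what an evidence-backed record can look like: run the real command, keep the raw output verbatim, store the verdict next to it. The gate schema, file layout, and the endpoint below are guesses at the shape, not the actual system.

```python
import json, subprocess
from pathlib import Path

def verify(gate_id: str, claim: str, command: list[str], check) -> dict:
    """Run a real command, keep its raw output, and record whether the claim held."""
    out = subprocess.run(command, capture_output=True, text=True, timeout=30)
    payload = json.loads(out.stdout) if out.stdout.strip() else {}
    record = {
        "gate": gate_id,
        "claim": claim,
        "command": " ".join(command),
        "raw_output": out.stdout.strip(),   # the exact JSON, not a paraphrase
        "passed": bool(check(payload)),
    }
    Path("evidence").mkdir(exist_ok=True)
    Path(f"evidence/{gate_id}.json").write_text(json.dumps(record, indent=2))
    return record

# Hypothetical endpoint standing in for the EventBus check in session ad5769ce.
verify(
    "VG1.2",
    "EventBus emits events",
    ["curl", "-s", "http://localhost:8080/emit?count=10"],
    lambda r: r.get("emitted") == 10 and r.get("subscriberCount", 0) >= 1,
)
```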
The Economics
The ils-ios project is the largest in the dataset: 4,241 session files, 1,563,570 lines of data, 4.6GB. 149 Swift files, 24 screens, a macOS companion, 13 visual themes. Total Claude API cost: approximately $380.
That cost only makes sense with model routing: 82% savings overall. A project with 200 consensus gates costs $30 with routing versus $168 without. The same three rules from earlier apply: lookups go to Haiku, implementation goes to Sonnet, architecture review and complex debugging go to Opus.
RALPLAN, the adversarial planning system, showed why planning consensus pays for itself. A Supabase auth migration got decomposed into 14 tasks by the Planner. Looked clean. The Architect vetoed it. Supabase Row Level Security policies reference auth.uid(), which returns Supabase's internal user ID, not a custom JWT's subject claim. Seven of the 14 tasks assumed RLS compatibility. They would've compiled. They would've passed type checks. They would've failed silently at runtime, allowing unauthorized data access. Three rounds of adversarial review caught it. Cost of those review rounds: under $2. Cost of shipping a silent auth bypass: I don't want to think about it.
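The shape of that loop, sketched: the planner proposes, the architect tries to veto, and the plan only ships once there's nothing left to veto or the round budget runs out and a human decides. The round limit and function signatures are mine.

```python
# Adversarial planning loop sketch: the plan only ships once the architect
# has nothing left to veto, or the round budget runs out and a human decides.
MAX_ROUNDS = 3

def ralplan(task: str, planner, architect) -> tuple[list[str], bool]:
    """planner(task, objections) -> list of subtasks; architect(plan) -> list of vetoes."""
    objections: list[str] = []
    plan: list[str] = []
    for _ in range(MAX_ROUNDS):
        plan = planner(task, objections)
        objections = architect(plan)
        if not objections:
            return plan, True   # consensus: no vetoes left
    return plan, False          # escalate: still contested after the round budget
```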
What the Rest of the Series Covers
The posts are organized by problem, not chronology. Post 2 starts with the bug that started everything: line 926, += instead of =, and the three-agent consensus system that caught it.
Each post has a companion repo. Each repo has working code. Each claim traces back to one of 23,479 real sessions generating 3,474,754 lines of data over 42 days. No fabricated examples. No mock data. Just what actually works when you run AI agents at scale.
What You’ll Walk Away With
1. A consensus gate framework that catches bugs single-agent reviews miss ($0.15 per gate)
2. A functional validation protocol that replaces unit tests with real UI interaction
3. An orchestration system that coordinates multiple agents without file conflicts
4. A cross-session memory store that keeps agents from repeating the same mistakes
5. A model routing strategy that cuts API costs by 82%
6. A prompt engineering stack that composes seven layers of context
7. Enforcement hooks that stop agents from cutting corners
Next post starts with the bug that started everything. Line 926, += instead of =, and the three-agent consensus system that caught it.