Anneal: Three Planning Variants for Agentic Development
Linear, tournament, and convergence pipelines that turn rough ideas into executable plans
23:36:08 — Metis dispatched to the codebase.
23:37:21 — Prometheus returned a synthesized strategy.
23:41:06 — Momus filed an architectural critique.
23:42:47, 23:44:26, 23:45:44 — Red-Team Trinity (Security, Scope, Assumptions) fanned out in parallel and came back with adversarial findings.
23:48:02 — Oracle synthesized everything into a unified design.
23:49:08 — Hephaestus validated the design against constraints.
23:53:44 — Atlas wrote the final executable plan.
Total wall-clock: 17 minutes 36 seconds. Final score: 98. Depth: 2 of 2. Verdict: SAFE.
That transcript is real. It's from the Temper run that productized the very plugins this post is about. The plan it produced is the plan I'm executing right now — including writing this paragraph.
“Plan: 27 minutes (Cast variant) or as little as 6 (Alloy). Plan ships with receipts.”
The Problem That Started Everything
Across 23,479 sessions of agentic development, I started tracking which sessions blew up and which ones converged. The answer surprised me. The execution agents weren't the problem. Sonnet writes code reliably. Opus reviews it competently. The orchestration patterns from Ralph (Post 8) keep them on rails.
The failures were upstream. They were plan-quality failures.
A typical bad outcome looked like this: I'd hand Claude a rough idea, get back a confident-looking plan, dispatch a builder, and 45 minutes later watch the agent fight the plan. The plan had hallucinated an API surface that didn't exist. The plan had missed a security boundary. The plan had assumed greenfield in a brownfield repo. The agent did its job — it executed the plan it was given. The plan was wrong.
I started measuring. Roughly 60% of session 'failures' in my logs traced back to the planning step, not the execution step. The agent didn't get worse. The instructions did.
The fix isn't to make the executor smarter. The fix is to refuse to execute a plan that hasn't survived adversarial review.
That's what Anneal is.
The Three Architectures
Anneal is three Claude Code plugins that share one philosophy: no plan ships until it's been red-teamed. They differ in how they get there.
Cast is the linear pipeline. Nine agents, in order, ~27 minutes wall-clock, 8 spawn batches. Best for scoped work with a clear spec — you know what you want, you just don't trust yourself to have thought through the edge cases. Cast walks the plan through Metis (research), Prometheus (strategy), Momus (audit), the Red-Team Trinity (Security/Scope/Assumptions, in parallel), Oracle (synthesis), Hephaestus (validation), and Atlas (final plan). Every agent goes out as its own Task dispatch except the Trinity, which rides in a single parallel batch. The pipeline is deterministic and short enough to trust as a default.
Alloy is tournament consensus. N parallel planners (default 5, each biased toward a different lens — correctness, minimalist, defensive, performance, UX) draft independently. A synthesizer reads all N drafts and blends them into a single plan. Total: 18 spawns, ~6 minutes wall-clock because the planners run in parallel. Best for greenfield work where you don't yet know which framing wins. The biased planners are the load-bearing trick: a 'minimalist' planner left alone will under-engineer; a 'defensive' planner alone will over-engineer; the synthesizer's job is to find the plan that survives both critiques.
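The tournament shape fits in a few lines of Python. This is a sketch, not the plugin's code: `draft_plan` and `synthesize` are hypothetical stand-ins for the real agent dispatches, and the thread pool stands in for the batched parallel spawn.

```python
from concurrent.futures import ThreadPoolExecutor

# The five default lenses from the Alloy design.
BIASES = ["correctness", "minimalist", "defensive", "performance", "ux"]

def draft_plan(bias: str) -> dict:
    # Stand-in for a planner agent prompted with one bias.
    return {"bias": bias, "steps": [f"step shaped by the {bias} lens"]}

def synthesize(drafts: list[dict]) -> dict:
    # The synthesizer reads all N drafts and blends them into one plan
    # that has to survive both the under- and over-engineering critiques.
    return {
        "sources": [d["bias"] for d in drafts],
        "steps": [s for d in drafts for s in d["steps"]],
    }

def alloy(n: int = 5) -> dict:
    # All N planners run at once -- this is where the wall-clock win
    # over the linear pipeline comes from.
    with ThreadPoolExecutor(max_workers=n) as pool:
        drafts = list(pool.map(draft_plan, BIASES[:n]))
    return synthesize(drafts)
```

The synthesizer never sees a single "winning" draft; it sees all of them, which is the point of the biased lenses.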
Temper is the fixed-point deepen loop. It iteratively refines a plan, runs the red-team review at each depth, and stops when the plan stops changing — convergence on variance under 0.3 OR score delta under 0.15 between depths. Max depth 5–10. Best for ambiguous problems where you can't even articulate the requirements yet. Temper is also the variant that wrote the plan for productizing Anneal itself (more on that below).
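The stopping rule above can be sketched directly. One assumption is loud here: I'm reading "variance" as the spread of the red-team reviewers' scores at a given depth, and "score delta" as the change in mean score between depths. `deepen` and `review` are stand-ins for the real agent dispatches.

```python
from statistics import pvariance

def temper(plan, deepen, review, max_depth=5,
           var_cap=0.3, delta_cap=0.15):
    """Fixed-point deepen loop (sketch of the Temper stopping rule)."""
    prev_score = None
    for depth in range(1, max_depth + 1):
        plan = deepen(plan)
        scores = review(plan)              # one score per reviewer
        score = sum(scores) / len(scores)
        # Converged when reviewers agree tightly...
        if pvariance(scores) < var_cap:
            return plan, score, depth
        # ...OR when deepening stops moving the score between depths.
        if prev_score is not None and abs(score - prev_score) < delta_cap:
            return plan, score, depth
        prev_score = score
    return plan, score, max_depth          # gave up at max depth
```

The OR between the two conditions matters: a plan can converge either because the reviewers agree or because further deepening has stopped changing anything.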
The three aren't ranked. Pick the one whose tradeoff matches your problem. If the spec is clear, Cast is fastest. If the framing is unclear, Alloy explores. If the requirements themselves are unstable, Temper converges.
What 'Battle-Tested' Means
Every Anneal variant routes the candidate plan through the same gate sequence. The names are theatrical (Greek gods sell better than agent-7), but the roles are concrete:
1. Metis — research. Reads the codebase, surfaces actual constraints, returns a context bundle. This is the only agent that touches the filesystem in read-mode before strategy is set.
2. Prometheus — strategy. Takes Metis' bundle and proposes an approach. First articulation of 'what we're going to do.'
3. Momus — audit. Critiques the strategy on internal consistency, naming, scope creep, and design smells. Momus is allowed to reject.
4. Red-Team Trinity — three parallel adversaries. Security asks 'what's the attack surface?' Scope asks 'what are we sneaking in?' Assumptions asks 'what are we taking for granted?' All three run in one batched message — that batching is the only reason the wall-clock is reasonable.
5. Oracle — synthesis. Reads Prometheus' strategy and all red-team findings, produces a unified design that incorporates the criticism.
6. Hephaestus — validation. Checks Oracle's design against the original constraints. Did we drift? Did we over-correct on red-team feedback?
7. Atlas — final plan. Writes the executable artifact: phases, file paths, success criteria, validation gates.
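That sequence can be written down as data. The agent names mirror the list above, but the shape is an illustrative sketch, not the plugin's actual orchestrator format: each inner list is one spawn batch, and a batch with multiple agents means "dispatch in a single message."

```python
# The Anneal gate sequence as data. Nine agent dispatches; the
# Red-Team Trinity collapses into a single parallel batch.
PIPELINE = [
    ["metis"],                                  # research
    ["prometheus"],                             # strategy
    ["momus"],                                  # audit (may reject)
    ["redteam-security", "redteam-scope",
     "redteam-assumptions"],                    # parallel trinity
    ["oracle"],                                 # synthesis
    ["hephaestus"],                             # validation
    ["atlas"],                                  # final plan
]

def pipeline_stats(pipeline):
    # (number of batches, total agent dispatches)
    return len(pipeline), sum(len(batch) for batch in pipeline)
```

Encoding the pipeline as data rather than hardcoding the dispatch order is what makes the three variants share one review gate.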
The verdict schema is binary plus a score: {verdict: "SAFE" | "REVISE" | "BLOCK", score: 0-100, depth: N}. SAFE means the plan can ship to an executor. REVISE triggers another deepen pass (in Temper) or a synthesizer re-roll (in Alloy). BLOCK kills the run and surfaces the blocking finding to the human. There is no 'approve with concerns' — that's the Ralph anti-pattern from Post 8 reborn. Concerns either get fixed or they kill the plan.
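A minimal sketch of enforcing that schema, with 'approve with concerns' deliberately unrepresentable. The field handling is my assumption, not the plugin's actual code:

```python
VERDICTS = {"SAFE", "REVISE", "BLOCK"}

def check_verdict(v: dict) -> dict:
    # Binary-plus-score schema: {verdict, score 0-100, depth}.
    # There is no fourth state; concerns either got fixed upstream
    # or the verdict is REVISE/BLOCK.
    if v.get("verdict") not in VERDICTS:
        raise ValueError(f"unknown verdict: {v.get('verdict')!r}")
    if not 0 <= v.get("score", -1) <= 100:
        raise ValueError("score must be 0-100")
    if v.get("depth", 0) < 1:
        raise ValueError("depth must be >= 1")
    return v
```

Making the invalid state unrepresentable in the schema is what keeps "approve with concerns" from creeping back in through a parser.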
“This is the same principle as the consensus engine in ValidationForge: agreement is not a tiebreaker, it's a requirement.”
A plan with a Red-Team finding it didn't address is not a plan. It's a draft.
A Concrete Example: Anneal Productizing Anneal
The plan I'm executing right now was written by Anneal. Specifically: a Temper run named anneal-temper-20260422-205948-productize-anneal. The brief was 'ship Anneal as a product on withagents.dev' — vague on purpose, because I wanted to see whether Temper could converge on a plan when the requirements were genuinely fuzzy.
Two depth iterations, score 98, verdict SAFE. The agent timestamps from the top of this post are this run's actual transcript.
The output was a 10-phase plan.
Every phase has a phase file with explicit file paths to touch, validation gates, and success criteria. Phase 08 is the one that mattered most — a 12-curl matrix asserting that every link from the blog post to the catalog card to the product site to the GitHub repo returns 200 (or the right 301/410 for tombstones). The matrix passed 12/12 on first run. Visual evidence captured at three viewports (375 / 768 / 1440) per the Product Site Template contract.
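The matrix idea reduces to data plus a pure pass/fail rule. The URLs below are illustrative placeholders, not the plan's real 12-entry matrix:

```python
# Each entry: (url, expected HTTP status). Tombstones expect their
# redirect or 410, and an exact match is required either way.
MATRIX = [
    ("https://example.com/blog/anneal", 200),
    ("https://example.com/catalog/anneal", 200),
    ("https://example.com/old/anneal", 301),   # tombstone redirect
]

def link_ok(expected: int, observed: int) -> bool:
    # A 200 where a 301/410 tombstone was promised is a failure too.
    return expected == observed

def run_matrix(fetch, matrix=MATRIX):
    # `fetch` maps a URL to an observed HTTP status (e.g. via urllib
    # or a curl subprocess in the real gate).
    results = [(url, link_ok(want, fetch(url))) for url, want in matrix]
    return all(ok for _, ok in results), results
```

Keeping the verdict rule pure (status in, bool out) means the gate can be tested without touching the network.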
The recursive bit is the part I keep circling: Phase 05 is this post. The plan that productized Anneal includes a phase called 'write the launch post that announces Anneal.' Anneal validated Anneal's own launch. That recursion is not a stunt — it's the whole pitch. If the planning system can't survive being applied to itself, it has no business planning anything else.
One phase's success-criteria block, lifted directly from the executing plan, is not a wishlist. It's an exit gate: the phase isn't done until every line is checked.
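Mechanically, enforcing such a gate is simple; the criterion names below are illustrative, not lifted from the actual plan:

```python
# Hypothetical exit-gate criteria for one phase. Every line must be
# True before the phase closes; one unmet line blocks it.
CRITERIA = {
    "all listed file paths exist": True,
    "link matrix passes in full": True,
    "screenshots captured at 375/768/1440": True,
}

def exit_gate(criteria: dict) -> bool:
    # Any single unmet criterion blocks the phase -- there is no
    # partial credit, mirroring the no-'approve with concerns' rule.
    unmet = [name for name, met in criteria.items() if not met]
    return not unmet
```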
The Plugin Lessons
I shipped three plugins. I broke them five different ways before they shipped clean. The lessons cost real time, so I'll lead with the most expensive one.
1. Static validation is not runtime validation. I wrote a validate-plugin.py script that checked plugin structure, frontmatter, agent definitions, hook schemas. All three plugins passed. Then I loaded one in a real Claude Code worker with --plugin-dir and it crashed. The structural validator was checking the shape of the JSON; it wasn't checking that the runtime could actually load it. The Cast E2E verification on 2026-04-22 was the first time I had real evidence that the plugin worked end-to-end — a probe agent dispatched through the full 9-agent pipeline wrote /tmp/anneal-probe/readme.txt containing exactly hello world (12 bytes). Until that probe wrote that file, every PASS on validate-plugin.py was theater.
2. plugin.json is not a catch-all. Extra fields silently broke loading. No error message, no warning — the worker just refused to recognize the plugin. I added a description field at the top level (intuitive) and the plugin became invisible. The schema is strict in ways the docs don't surface. Fix: keep plugin.json minimal and load metadata from commands/ and agents/ instead.
3. Skills are model-autonomous; agents must be Task-spawned. This one cost me half a day. The docs implied that defining an agent in agents/ was enough — that the model would invoke it when relevant. It isn't. Skills get auto-considered based on the description match (Post 15 covers this). Agents have to be explicitly dispatched via the Task tool. The orchestrator script in each Anneal plugin is what makes the dispatch happen; without scripts/orchestrate.sh, the agents sit there like unfired neurons.
4. Parallel execution comes from batching, not flags. I built Cast assuming run_in_background: true on individual Task calls would parallelize them. It doesn't. Each call still serialized. Parallelism in Claude Code happens when you batch multiple Task calls in one assistant message. The Red-Team Trinity is fast because Security/Scope/Assumptions are dispatched in a single message; if I'd put them in three messages, the wall-clock would have tripled. I caught this defect mid-session and rewrote the orchestration around it.
5. Path-agnostic vendoring is load-bearing. The original Anneal repo had 141 hardcoded references to ~/Desktop/anneal-runs across 42 files. The plugin worked perfectly on my machine and broke instantly anywhere else. Fix: replaced every hardcoded path with the ANNEAL_RUNS_ROOT env var (defaulting to PWD/.anneal/runs) and copied the shared _shared/ schema into each plugin root at package time so relative references survived the install path. Plugins must be portable. If your plugin only works in your home directory, it's not a plugin — it's a script you've put on airs.
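The path fix reduces to one lookup with a default. A Python sketch of the rule (the real plugins apply the equivalent in their orchestration scripts):

```python
import os

def runs_root() -> str:
    # Honor ANNEAL_RUNS_ROOT if set; otherwise fall back to a
    # .anneal/runs directory under the current working directory.
    return os.environ.get(
        "ANNEAL_RUNS_ROOT",
        os.path.join(os.getcwd(), ".anneal", "runs"),
    )
```

Every path in the plugin derives from this single root, so moving the install moves everything with it.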
Install One Thing, Pick a Variant
You can install all three. They don't conflict. Each registers its own slash command (/anneal-cast:anneal, /anneal-alloy:anneal, /anneal-temper:anneal) and its own runs directory. Pick the variant per problem.
Full source, the recursive Temper plan that produced this post, and the verification evidence are at github.com/krzemienski/anneal. Product page (itself a Phase 04 artifact of the plan that wrote this post): anneal.withagents.dev.
Three planning variants. One install. Then compose with your validation pipeline — because a battle-tested plan still has to survive a battle-tested execution. That's what Posts 8, 15, and ValidationForge are for. Anneal sits upstream of all of them. It refuses to let a bad plan reach a good agent.
“The plan has to anneal before the agent forges anything from it — controlled heating, slow cooling, no internal stress. Anneal is the kiln; everything downstream is what comes out of it.”