Shannon v1.2.0: Multi-Stage Agentic Work as a Sequence of Provable Steps
A Claude Code plugin that refuses to say 'done' until the evidence is on disk. 36 skills, 11 agents, 22 commands, 7 hooks — and a doctor that reads its own contract.
View companion repoThe gap between "it compiled" and "it works"
Every agent framework I have shipped eventually hits the same wall. The model reports success. The build is green. The task list is all checkmarks. And the feature does not work.
The failure is not the model being dishonest. It is the framework accepting the wrong proof. A successful build proves the code compiled. A passing unit test proves a mock behaved the way the test author imagined. Neither proves the real system did the real thing. When an agent runs unattended for an hour across planning, implementation, and validation, that gap compounds: each stage inherits the previous stage's unverified claim.
Shannon is the plugin I built to close that gap. Its one non-negotiable rule: a verdict requires real-system evidence on disk, or it is a refusal. No mocks. No stubs. No test files. The plugin would rather stop and write a REFUSAL.md citing exactly what is missing than emit a green checkmark it cannot back.
This post is the v1.2.0 release. It is also a usage manual. I walk every command path with real invocations, because a framework you cannot drive is a framework you will not use.
What ships in v1.2.0
The surface, verified against the repository at release (scripts/doctor.py, 10/10 mechanical checks):
- 36 skills:
skills/<name>/SKILL.md, each with progressive-disclosurereferences/*.md. - 11 agents:
agents/<name>/AGENT.md, with each agent's skills embedded inline at build time. - 22 commands: the
/shannon:*slash-command surface. - 7 hooks:
hooks/*.js, registered throughhooks/hooks.json, with arequired_hooksdependency contract.
Five pillars hold it together:
- Embedded sub-agent skills. When the lead dispatches a sub-agent, the sub-agent's skill content is already inlined into its agent manifest by
build/embed-skills.py. A spawn can never silently fail to load the skill it needs, because the skill is not a separate fetch — it is part of the agent. - Orchestration. Single-message multi-Task dispatch, with sequential, parallel, and competitive modes — and an explicit escalation to Anthropic's Dynamic Workflows when the fan-out outgrows a single conversation turn.
- Iron Rule validation. Real-system evidence on disk for every completion claim. Enforcement hooks refuse to write
*.test.*, mock, or stub files at the moment of the Write. - Meta-judge consensus. A weighted rubric YAML is generated before any judge scores, with the overall pass threshold hidden from the judges to mitigate score-anchoring. Non-unanimous verdicts escalate to multi-round debate, never to silent averaging.
- Self-instrumented. The plugin observes itself.
scripts/doctor.pyvalidates the contract;scripts/harness/load_check.pyis a real Agent SDK probe that confirms every command is addressable when the plugin loads.
Install
# In Claude Code — install from the published GitHub repo:
/plugin marketplace add krzemienski/shannon
/plugin install shannon@shannon
Shannon is standalone. It uses Context7 and sequential-thinking MCP servers when they are present and degrades gracefully when they are not. No required MCP servers, no external services.
Activate enforcement per-project, then confirm the contract:
/shannon:enforce on # <on|off> [--force]
/shannon:doctor --verbose # [--verbose] for the full per-check breakdown
/shannon:enforce on writes a .shannon/active sentinel. Shannon's hooks are no-ops in any project that has not opted in, so installing the plugin globally does not impose the Iron Rule on every repository you touch. You choose where it applies.
/shannon:doctor is the mechanical contract check. At v1.2.0 it reports:
{
"report_schema": "1",
"plugin_version": "1.2.0",
"summary": { "checks_pass": 10, "checks_fail": 0, "mismatches": 0 },
"checks": [
{ "id": "plugin-manifest", "status": "PASS" },
{ "id": "skills-count", "value": 36, "status": "PASS" },
{ "id": "agents-count", "value": 11, "status": "PASS" },
{ "id": "commands-count","value": 22, "status": "PASS" },
{ "id": "hooks-count", "value": 7, "status": "PASS" },
{ "id": "hooks-json", "status": "PASS" },
{ "id": "required-hooks-contract", "status": "PASS" },
{ "id": "build-state", "status": "PASS" },
{ "id": "skill-body-refs", "status": "PASS" },
{ "id": "agent-body-refs", "status": "PASS" }
]
}
The version is not hard-coded in the doctor; it is read live from .claude-plugin/plugin.json. That was a deliberate design choice: a doctor that hard-codes the version it is supposed to be auditing is a doctor that lies the moment the manifest drifts.
Planning: four modes, one command
/shannon:plan is the entry point for every plan. If a codebase is present, it runs codebase analysis and a skill inventory first (no opt-in needed) so the plan is grounded in what actually exists, not in what the model assumes exists. Use --greenfield to skip that pre-step for a brand-new project.
The mode flag picks the planning strategy:
--mode linear (default)
A single hierarchical plan: plan.md plus phase-NN.md files, each phase with measurable, transcript-provable success criteria and an embedded validation gate.
/shannon:plan "Add rate limiting to the public API" --mode linear
--mode converge
Iterative refinement. A draft is written, a critic red-teams it, the draft is rewritten with the critique folded in, repeated for N rounds (default 3). Each round is journaled:
/shannon:plan "Migrate auth from sessions to JWT" --mode converge --rounds 3
# → plans/converge-<run-id>/round-1/{draft.md, critique.md}
# → plans/converge-<run-id>/round-2/{draft.md, critique.md}
# → plans/converge-<run-id>/round-3/{draft.md, critique.md}
Use converge when the shape of the plan is the risk — when a single pass would miss a failure mode that an adversarial reread would catch.
--mode tournament
N independent candidate plans (default 3), each written from a different angle, then scored by a judge against a meta-judge rubric:
/shannon:plan "Design the offline-sync layer" --mode tournament --candidates 3
# → plans/tournament-<run-id>/candidate-1/ (e.g. correctness-first)
# → plans/tournament-<run-id>/candidate-2/ (e.g. minimalist)
# → plans/tournament-<run-id>/candidate-3/ (e.g. defensive)
# → judge ranks; winner is promoted
Use tournament when the solution space is wide and one attempt would anchor on the first idea.
--mode deep
The full treatment: tournament, then feed the winner into converge, then synthesize with the consensus engine. The most expensive mode, and the one I reach for when the cost of a wrong plan dwarfs the cost of planning.
/shannon:plan "Re-architect the event pipeline for exactly-once delivery" --mode deep
--mode deep is where Shannon's autonomy lives. It chains tournament → converge → consensus without a human between stages, and only surfaces the synthesized plan once the internal debate has settled. If you have used /shannon:plan-deep from an earlier prototype, that capability is now folded into this one flag.
Executing: from a plan to verified done
Once a plan exists, /shannon:cook executes it end to end. It also accepts a bare task description (skipping the plan step) plus --auto (small-task fast path), --fast (skip refinement loops), --no-validate, and --greenfield:
/shannon:cook plans/<date>-rate-limiting/
/shannon:cook "Add OAuth login to settings screen" --auto
Cook spawns an executor agent with embedded validation skills. It runs each phase, captures evidence, and routes the result through completion-gate for the mechanical final check. If a gate cites a blocker, refusal-discipline writes a REFUSAL.md with the specific cited blockers — and there is no --force flag to override it.
Autopilot: refusal-driven retry
For unattended runs, /shannon:autopilot wraps cook in a retry loop:
/shannon:autopilot "Add rate limiting to the public API" --max-attempts 3
The loop is the whole point:
- Run
/shannon:cook. - Read the
completion-gateverdict.COMPLETE→ exit success.REFUSED→ parse the cited blockers fromREFUSAL.md, build a remediation prompt targeting only those blockers, try again.
- After
--max-attemptsstill refused → emit a finalREFUSAL.mdand exit failure.
Autopilot never force-completes. A run that cannot produce evidence ends in an honest refusal, not a fabricated success. That is the difference between an agent you can leave alone and one you have to babysit.
Dispatching sub-agents
/shannon:dispatch fans work out across sub-agents. One command, three modes via --mode (v1.2.0 consolidated the old dispatch-parallel and dispatch-competitive commands into this single entry point):
# sequential (default): tasks that need review between each
/shannon:dispatch "Refactor the three parser modules in order" --mode sequential
# parallel: independent tasks, results needed together this turn
/shannon:dispatch "Review auth, billing, and search subsystems for security issues" --mode parallel --max-parallel 3
# competitive: same task, N attempts, judge picks/synthesizes the winner
/shannon:dispatch "Write the landing-page hero section" --mode competitive --candidates 3
Parallel mode is load-bearing on one rule: every Task call must be in a single assistant message. Claude Code's dispatcher runs tools concurrently only when they share a turn; tools across turns are strictly sequential. Shannon's dispatch-parallel skill enforces this verbatim, because a "parallel" dispatch spread across turns is just a slow sequential one.
When fan-out outgrows a turn: Dynamic Workflows
Anthropic shipped Dynamic Workflows in Claude Code — a JavaScript control program the runtime executes in the background, outside the conversation turn. Shannon's orchestration cluster knows when to reach for it.
In-turn dispatch is right for 2–8 independent targets whose results you need together now. But when the work-list is large or unbounded (one item per changed file, per failing journey, per cluster), or when the run must survive a context interrupt, Shannon now recommends emitting a Workflow instead:
| Signal in the task | Right substrate |
|---|---|
| 2–8 independent sub-tasks, results needed this turn | /shannon:dispatch-parallel |
| Sequential tasks needing review between each | subagent-driven-development |
| Dozens of agents, or fan-out over an unknown-size work-list | Dynamic Workflow pipeline() |
| Must resume across a context interrupt | Dynamic Workflow (journaled, resumeFromRunId) |
| Loop-until-dry / loop-until-budget discovery | Dynamic Workflow while + budget |
The escalation is advisory: Shannon plans the orchestration and names the Workflow shape; you run it. The Iron Rule survives the handoff — a Workflow's agent() calls produce the same real-system evidence, and completion-gate still cites it.
Validation: the Iron Rule in practice
/shannon:validate runs functional validation against the real system and emits per-criterion verdicts, each citing a specific evidence file:
/shannon:validate --mode standard --platform web
# detects the platform (override with --platform ios|web|api|cli|fullstack),
# captures screenshots / API responses / logs to e2e-evidence/<run-id>/,
# emits PASS/FAIL per criterion, each citing a file:line in the evidence tree.
# --mode consensus spawns 3 isolated validators + a confidence-scored synthesis.
The enforcement is not a guideline — it is hooks that fire at the moment of the Write:
- The fab-file hook refuses any Write to
*.test.*,*.spec.*,tests/,__tests__/,mocks/, orstubs/. You cannot quietly retreat to a mock, because the file never lands. - The post-action hook reminds, after every build command, that compilation is not validation.
- The evidence-gate hook fires on
TaskUpdate(status=completed)and checks for fresh evidence. No evidence, no completion.
When validation cannot pass, Shannon refuses. A REFUSAL.md looks like this:
# REFUSAL — user signup flow
## Cited blockers
- MSC-3 (email confirmation): no evidence at e2e-evidence/<run>/step-04-confirm.png
- MSC-5 (session persists): API response at e2e-evidence/<run>/step-06-session.json
shows 401, expected 200
## What would resolve it
- Capture the confirmation-page screenshot after clicking the emailed link
- Fix the session cookie not being set on the /login response, then re-validate
Refusal is a feature. An agent that refuses with cited blockers is more useful than one that claims success you then have to disprove.
Self-instrumentation: the plugin that checks itself
The two harnesses under scripts/ are how Shannon stays honest about its own surface.
scripts/doctor.py is the mechanical contract — the 10-check JSON above. It is the fast, no-API check you run after any change to the plugin.
scripts/harness/load_check.py is the real one. It issues an actual Claude Agent SDK query against the running plugin and reads the init payload to confirm every command is addressable. It needs no API key — the SDK inherits the Claude Code CLI's own authentication, so it runs wherever claude is logged in:
python3 scripts/harness/load_check.py --json
At v1.2.0 this returns:
{
"total_slash_commands": 978,
"shannon_commands_loaded": 57,
"expected_commands": 22,
"missing_commands": [],
"verdict": "PASS"
}
missing_commands: [] is the assertion that matters: every one of the 22 commands the repository ships is addressable in the loaded plugin. The probe earned its keep during an earlier release — one run returned verdict: FAIL with missing_commands: ["enforce", "scope"], which was not a bug in the commands but proof that the CLI was loading a stale cached build instead of the release tree. The probe caught the mispointing; fixing the marketplace pointer turned the FAIL into a PASS. That is self-instrumentation doing its job: the harness found a real discrepancy between what the repo claimed and what the runtime loaded.
The rest of the surface
The commands beyond plan/cook/dispatch/validate round out the lifecycle:
/shannon:fix: a 3-strike bug-fix runner; each attempt targets a different root cause and writes fresh evidence./shannon:loop: a do / verify / reflect convergence loop./shannon:team: multi-teammate orchestration with file-ownership boundaries so no two agents write the same file./shannon:research: parallel researcher fan-out with a cited summary./shannon:prd: interview-driven PRD authoring./shannon:reflect: self-refinement (self / critique / memorize)./shannon:why: five-whys root-cause analysis./shannon:trace//shannon:retro: session-JSONL timeline and retrospective mining./shannon:resume: resume a halted run from its evidence tree, no separate state file required./shannon:audit: read-only audit across screen / app / session / drift / completion-evidence./shannon:scope: brownfield reconnaissance: codebase + skill inventory + session context.
Every one routes through the same five pillars. The vocabulary is shared, the evidence discipline is shared, and the refusal semantics are shared.
Why a stable label, and not v7
Shannon went through informal iterations numbered up to v7 in private. v1.0.0 was the first version I was willing to put a stable label on — not because the earlier work was wasted, but because it was the first cut where the surface, the documentation, and the self-instrumentation all agreed with each other. The doctor reads the manifest. The README counts match the directories. The SDK probe confirms the commands load. When the three sources of truth agree, the version number means something.
v1.2.0 is the second stable release on that same discipline: the surface grew to 36 skills, 11 agents, 22 commands, and 7 hooks — adding DeepPlan with dependency-ordered wave execution, the evidence-gated forge pipeline, and the skills10x activation harness — and /shannon:doctor still reports 10/10 with zero mismatches. The numbers only moved because the directories did, and the doctor proves it on every run.
The repository is public: github.com/krzemienski/shannon. The latest release is tagged v1.2.0. Install it, run /shannon:doctor, and watch a plugin check its own contract before you trust it with yours.
Continue the series
- 35SeriesSublimate: Three Tiers for Distilling Workflows From Your Own SessionsAnthropic shipped Dynamic Workflows in Claude Code this week. The harder question is which of your hand-rolled patterns belongs as a Workflow, which belongs as a Skill, and which belongs as a Subagent.
- 34Edge CasesWhen Your Drafter Doubles the Body: Building the WithAgents Content PluginThe clear step in our LinkedIn drafter never actually cleared. The new body pasted on top of the old one, the editor doubled, and the bug only showed up on the 34th run. Here is what we shipped instead.
- 33Edge CasesOrbit: Find Drift Between What You Asked and What Actually ShippedPlans are evidence, not history. Mining the JSONL session transcripts you already have on disk to compare intent against claim against codebase truth.