Orbit: Find Drift Between What You Asked and What Actually Shipped
Plans are evidence, not history. Mining the JSONL session transcripts you already have on disk to compare intent against claim against codebase truth.
My Orbit project directory at ~/.claude/projects/-Users-nick-Desktop-orbit/ has 25 JSONL session transcripts. 11,753 events. 17.19 MB. The largest single session (931b8083-d970-4740-8ed6-ea423f1e461e.jsonl) is 7.7 MB on its own. None of those files were created by Orbit. They were created by Claude Code, which writes session JSONL by default (Claude Code session storage docs). Orbit's the plugin that reads them.
That distinction is the entire premise. Sessions are already on disk. The agent already produced the receipts. What I didn't have was a way to compare what I *asked for* in those sessions against what the agent *claimed* to do, what the tools *actually did*, and what the codebase *currently looks like*. Orbit closes that loop by mining what's already there — no Docker, no Python package install, no re-capture.
The Problem
Across the 32 prior posts in this series, one failure mode kept showing up. I asked for something. A plan got written. The agent declared the work done. The build passed. A week later the feature didn't behave the way I'd originally asked for — and I couldn't reconstruct the chain that produced the drift. Was the request misunderstood? Was the plan wrong? Did the agent claim something the tools never actually did? Was the claim correct but the codebase later regressed?
Sound familiar? Every user instruction, every assistant claim, every tool invocation, every tool result is sitting on disk in time order in those JSONL files. But reading 11,753 events across 25 files by hand isn't viable, and grep alone can't rank later corrections above earlier plans.
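To make that concrete, here is the kind of first pass Orbit has to automate: walk the project's session directory and tally events by type. A minimal sketch, not Orbit's actual reader; the only assumption about transcript shape is that each JSONL line is a JSON object carrying a `type` field.

```python
# Sketch: tally events across one project's Claude Code session transcripts.
# Assumes each .jsonl line is a JSON object with a "type" field;
# everything else is plain stdlib.
import json
from collections import Counter
from pathlib import Path

project_dir = Path.home() / ".claude/projects/-Users-nick-Desktop-orbit"
sessions = sorted(project_dir.glob("*.jsonl"))

counts = Counter()
for session in sessions:
    with session.open() as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            event = json.loads(line)
            counts[event.get("type", "unknown")] += 1

print(f"{sum(counts.values())} events across {len(sessions)} sessions")
for kind, n in counts.most_common():
    print(f"  {kind}: {n}")
```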
So Orbit's job is to mine those events and apply a fixed evidence ranking when comparing layers. Not "detect bugs." Not "score the agent." Specifically: surface the gaps between *what I asked for*, *what was planned*, *what was claimed*, and *what's currently true on disk* — with citations to the JSONL line that supports each finding.
The Mental Model
The plugin compares evidence layers in this fixed order: user instruction → plan → assistant claim → tool evidence → codebase state.
Each arrow is a comparison. Each comparison can produce a gap. A claim that is not backed by a tool invocation is unverified. A tool invocation that contradicts the user's later correction is superseded. A plan item that was never claimed and never has tool evidence and is absent from the codebase is unfulfilled. A user instruction that the plan never addressed is a *plan-omits-instruction* gap — and per Iron Rule, plan omission is not proof the user never requested it.
That last rule is what separates Orbit from a naive diff tool. Plan omission is a finding, not a closure. Missing evidence is uncertainty, not absence. The gap analyzer marks unverified findings as uncertainty rather than silently downgrading them to PASS — a discipline borrowed directly from how crucible:completion-gate enforces refusal at the verdict layer (Crucible: Refusal-Driven Verification, post 20).
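Named as data, those outcomes look roughly like this. A sketch only: the category names come from the paragraphs above, but the field names are illustrative and not the evidence.json schema.

```python
# Sketch: the gap categories described above, plus a record that carries
# its citation back to a JSONL line. Field names are illustrative.
from dataclasses import dataclass
from enum import Enum

class GapKind(Enum):
    UNVERIFIED = "claim-without-tool-evidence"
    SUPERSEDED = "action-contradicts-later-user-correction"
    UNFULFILLED = "plan-item-never-claimed-never-executed-absent-from-code"
    PLAN_OMITS_INSTRUCTION = "instruction-never-addressed-by-plan"

@dataclass
class Gap:
    kind: GapKind
    severity: str         # "high" | "medium" | "low"
    summary: str
    citation: str         # the JSONL file and line that supports the finding
    status: str = "open"  # stays open until higher-ranked evidence closes it
```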
The 12-Rank Evidence Ladder
When two pieces of evidence disagree, Orbit picks the higher-ranked one. The ladder is hand-coded into scripts/orbit_audit.py and honored across every analyzer.
The order is opinionated. It places later user instruction above earlier plan, which inverts how most diff tools treat plans. It places tool-result above assistant claim, which is what makes claim verification possible at all. And it places inference and metadata at the bottom — Orbit prefers to admit uncertainty rather than synthesize a story.
That last point shows up in real audit output. When the gap analyzer cannot resolve whether a claim was honored — say, because the tool result is ambiguous and the codebase doesn't contain the relevant file — the gap is emitted with severity: medium and a reason: unverified note. It is not promoted to PASS. It is not fabricated into FAIL. It sits at uncertainty until a higher-ranked piece of evidence arrives.
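A minimal sketch of both rules, assuming only what this post states: the rank keys are paraphrases, ranks 3 through 9 are deliberately left out, and the two functions stand in for logic that actually lives in scripts/orbit_audit.py.

```python
# Sketch: lower rank number = stronger evidence. Only the ranks named in
# the post appear here; the full 12-rank ladder is hand-coded in orbit_audit.py.
EVIDENCE_RANK = {
    "later_user_instruction": 1,
    "specific_user_correction": 2,
    # ranks 3 through 9 omitted in this sketch
    "assistant_claim": 10,
    "inference": 11,
    "metadata": 12,
}

def stronger(a_kind: str, b_kind: str) -> str:
    """When two pieces of evidence disagree, believe the higher-ranked kind."""
    return a_kind if EVIDENCE_RANK[a_kind] <= EVIDENCE_RANK[b_kind] else b_kind

def judge_claim(claim: str, tool_results: list, in_codebase: bool | None) -> dict:
    """Refusal rule: missing proof is uncertainty, never a manufactured verdict."""
    if tool_results and in_codebase:
        return {"status": "verified", "claim": claim}
    # Ambiguous tool output, absent file, or no codebase scan yet: stay open.
    return {"status": "unverified", "severity": "medium",
            "reason": "unverified", "claim": claim}
```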
What Ships
The plugin ships verbatim counts: 4 skills, 8 slash commands, 2 hooks, 1 Python audit engine. The plugin manifest at .claude-plugin/plugin.json declares the layout; everything else delegates to a single Python file:
- skills/: gap-analysis, instruction-ledger, plan-execution-audit, validation-pairing — four SKILL.md files whose YAML frontmatter activates the skill on phrases like "look back," "what did we miss," "compare to plan."
- commands/: audit-gaps, mine-intent, validate-claims, compare-plan, render-dashboard, review-window, review-last-3-days (legacy alias), rebuild-ledger — eight slash-command bodies that translate to bin/orbit-audit ... invocations.
- hooks/hooks.json: UserPromptSubmit → scripts/hook_instruction_detector.py; UserPromptExpansion → scripts/hook_prompt_context.py. The first appends candidate intents to a durable ledger as you work; the second injects Orbit context into prompts that mention "audit" or "gap."
- scripts/orbit_audit.py: the engine. Subcommands: audit, build-ledger, compare-plan, validate-claims, render-dashboard. Schema version 0.2. Output format documented in the PRD §"Outputs".
No Docker. No pyproject.toml. No package install. The PRD's "Approval Scope" explicitly forbids both — Orbit runs as flat Python helpers invoked through bash shims under bin/. That choice keeps the install path identical to every other Claude Code plugin: claude plugin marketplace add krzemienski/orbit && claude plugin install orbit@orbit.
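The engine's shape is correspondingly plain: one stdlib argparse parser, one subparser per command body, no third-party imports. A sketch of that shape follows; the subcommand names match the list above, but the flag and stub body are illustrative rather than copied from orbit_audit.py.

```python
# Sketch: a flat, stdlib-only dispatcher in the shape of scripts/orbit_audit.py.
# Subcommand names come from the post; the --project flag and stub are illustrative.
import argparse

SUBCOMMANDS = ("audit", "build-ledger", "compare-plan",
               "validate-claims", "render-dashboard")

def main() -> int:
    parser = argparse.ArgumentParser(prog="orbit-audit")
    sub = parser.add_subparsers(dest="command", required=True)
    for name in SUBCOMMANDS:
        p = sub.add_parser(name)
        p.add_argument("--project", default=".",
                       help="project whose sessions and codebase to read")
    args = parser.parse_args()
    print(f"would run {args.command} against {args.project}")  # stub
    return 0

if __name__ == "__main__":
    raise SystemExit(main())
```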
A Real Audit
How do you verify a drift detector? You feed it labeled drift. The Orbit repo includes 162 session-isolated SDK harness runs under ~/.claude/projects/ — fourteen distinct gap categories, each repeated across multiple timestamped runs, used as the verification corpus for v0.2. The category names are the gaps Orbit must detect.
Run /orbit:audit-gaps against any project and the engine produces a fixed output set under .claude/audits/latest/.
The evidence.json is the durable artifact. Every other output is a renderer over it. The gap-analysis markdown groups findings by severity. The intent-gap-flow Mermaid renders as a directed graph showing which intents trace to which gaps. The plan-execution-matrix CSV is the spreadsheet view, and the dashboard is hand-written HTML+CSS using the canonical brand palette (lime #C6FF00 for valid/complete, orange #FF8A00 for pending, red #FF4D36 for contradictions, purple #8B3DFF for analysis-only, blue #1E6BFF for plans/structure on a flat black #0A0A0B base).
Gap severity maps to color directly in render_dashboard: high → red, medium → orange, low → blue. The mapping is fixed and intentional — a dashboard that says "everything is green" when three claims are unverified would be a worse outcome than a dashboard that admits the unknowns.
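That mapping is small enough to show inline. The hex values are the palette quoted above; the function name and the purple fallback are illustrative, not the actual render_dashboard internals.

```python
# Sketch: fixed severity-to-color mapping for the dashboard.
# Hex values are the brand palette quoted above; unknown severities fall
# back to the analysis-only purple instead of pretending to be lime/valid.
SEVERITY_COLOR = {
    "high": "#FF4D36",    # red: contradictions
    "medium": "#FF8A00",  # orange: pending / unverified
    "low": "#1E6BFF",     # blue: plans / structure
}

def severity_color(severity: str) -> str:
    return SEVERITY_COLOR.get(severity, "#8B3DFF")  # purple: analysis-only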
Refusal When Proof Is Missing
The hardest design choice was what to do with missing evidence. Three options were on the table:
1. Optimistic close — if no contradiction is found, mark the claim verified. *Rejected*: this manufactures PASS.
2. Pessimistic fail — if no tool evidence is found, mark the claim FAILED. *Rejected*: this manufactures FAIL.
3. Refusal — emit the gap as unverified/uncertainty, cite the absence, don't promote to either verdict.
I took option 3 — the same shape as Crucible's completion-gate refusal pattern. The downstream effect is that you'll sometimes see a dashboard that says "I can't prove this either way," and that's the correct state of the world. A higher-ranked piece of evidence (a later user correction, a fresh tool result, an updated codebase scan) is what closes the gap. Inference is rank 11; Orbit refuses to let rank-11 inference settle an unverified rank-10 claim either way.
I'll admit this took a few rounds to get right. My first cut promoted unverified claims to PASS when no tool evidence existed at all — figuring "no evidence = no problem." That's wrong. No evidence is uncertainty, not absence, and the dashboard quietly lying to you about that is worse than the dashboard staying honest about what it doesn't know.
The hooks are the durable side of this discipline. hook_instruction_detector.py fires on every UserPromptSubmit and runs a small set of pattern matchers (INTENT_PATTERNS, CLAIM_PATTERNS) that decide whether the prompt looks like a durable user instruction worth recording. Matches get appended to .claude/audits/intent-ledger-candidates.jsonl. Later, /orbit:rebuild-ledger promotes confirmed candidates to the canonical intent-ledger.{jsonl,md}. The ledger is the truth-of-record for "what the user actually asked for over time" — separate from any single session's plan.
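In outline, the detector is a stdin-to-append-only-file pipe. A sketch under two assumptions: that the UserPromptSubmit hook receives the submitted prompt as JSON on stdin, and that a couple of regexes stand in for the real INTENT_PATTERNS / CLAIM_PATTERNS.

```python
# Sketch of hook_instruction_detector.py's job: flag prompts that read like
# durable instructions and append them to the candidate ledger.
# Assumes the UserPromptSubmit payload arrives as JSON on stdin with a
# "prompt" field; the patterns below are illustrative stand-ins.
import json
import re
import sys
from datetime import datetime, timezone
from pathlib import Path

INTENT_PATTERNS = [
    re.compile(r"\b(always|never|from now on|make sure)\b", re.IGNORECASE),
    re.compile(r"\bgoing forward\b", re.IGNORECASE),
]

def main() -> int:
    payload = json.load(sys.stdin)
    prompt = payload.get("prompt", "")
    if not any(p.search(prompt) for p in INTENT_PATTERNS):
        return 0  # not a durable instruction; nothing to record

    ledger = Path(".claude/audits/intent-ledger-candidates.jsonl")
    ledger.parent.mkdir(parents=True, exist_ok=True)
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "prompt": prompt,
        "source": "UserPromptSubmit",
    }
    with ledger.open("a") as f:
        f.write(json.dumps(record) + "\n")
    return 0

if __name__ == "__main__":
    raise SystemExit(main())
```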
That's the difference between a plan and an instruction ledger: a plan is one snapshot; the ledger is the chronological transcript of asks, corrections, and pivots. When a plan and the ledger disagree, the ledger wins per rank 1 ("later direct user instruction") and rank 2 ("specific user correction").
Why It Matters
Across 23,479 sessions of agentic development, the failure I kept paying for was *invisible drift* — not "the agent failed an obvious test" but "the agent declared a thing done and the test passed and a week later the thing didn't do what I originally asked for." Build-pass was necessary but never sufficient. Test-pass was necessary but never sufficient. What I needed was an audit layer that compared the original ask against the current state, with the JSONL receipts in between treated as evidence rather than noise.
Orbit's that audit layer. It's intentionally local — your sessions, your disk, your audit. It's intentionally schema-versioned — evidence.json v0.2 today, with bumps required for any record-shape change. And it's intentionally refusal-shaped — when proof's missing, the gap stays open until proof arrives, not until the analyzer feels generous.
The plugin is at github.com/krzemienski/orbit, MIT-licensed, v0.2.1. The hyper-landing with full PST sections (quick start, install, feature matrix, changelog) is at orbit.withagents.dev. The companion catalog stub on the brand site lives at withagents.dev/products/orbit. One repo, one plugin, one fixed evidence ladder, zero fakes shipped.
Install it, run /orbit:audit-gaps against any project with a few sessions of history, and read the gap-analysis.md. The first audit usually surfaces something real — a claim with no tool evidence, an instruction that never made it into the plan, a file the assistant said it created that the codebase doesn't contain. That's the loop closing.