Auditing the Agent: Mining My Own Sessions to Catch 27 Mistakes

The transcript already knows where the agent went wrong

Every time I corrected the agent, I left a labeled example behind. "Why are you publishing them? They're supposed to be drafts." "That is not the current set of things." "loks terrible." Each of those turns is a ground-truth label sitting in a session log, marking the exact spot where the agent did something other than what I asked.

I had 127 of those logs. So instead of trusting my memory of where things went sideways, I mined them. The result: 1,185 of my own instructions extracted, 50 candidate mistakes surfaced, and 27 confirmed against the raw transcript. One candidate got thrown out by an adversarial check that proved it was a false alarm. This post is how that pipeline works and what it found, because the most useful dataset for auditing an agent is the record it already wrote for you.

A session log is a labeled dataset you did not have to label

A Claude Code session is a JSONL file. One JSON object per line, in order: user turns, assistant turns, tool calls, tool results. The user turns are the interesting part for an audit, because a correction is a user turn that says "no, not that." The agent's reply right after is the confession or the doubling-down.

My two session stores held 127 files from the last 30 days. Total size: 312 MB. No model reads 312 MB of JSONL and stays coherent, so the first job is reduction. I only want genuine human instructions, not tool output, not the hook noise the harness injects, not the command wrappers.

def flatten(content):
    """Flatten message.content (str or block list) to plain text.
    Returns (text, only_output_blocks)."""
    if isinstance(content, str):
        return content, False
    if not isinstance(content, list):
        return "", False
    parts, has_real = [], False
    for b in content:
        if not isinstance(b, dict):
            continue
        t = b.get("type")
        if t == "text":
            parts.append(b.get("text", "")); has_real = True
        elif t == "tool_result":
            continue  # skip outputs; we want instructions only
        else:
            has_real = True
    return "\n".join(p for p in parts if p), (not has_real)

A turn that carried only a tool_result block is the agent feeding itself command output. I drop it. A turn that starts with <command-name> or <local-command-stdout> is a slash-command wrapper, not me talking. I drop those too. What survives is what I actually typed.

That filter took 312 MB down to a 2.36 MB corpus: 1,185 user turns, each tagged with its ISO timestamp and source file and line. The corpus is grep-able. The index is a TSV with a 120-character preview per turn, so a triage pass can scan all 1,185 in one read. The point of writing both to disk is that the audit grounds itself in a reproducible artifact, not in a model's recollection of what it read.

Finding candidates is easy. Trusting them is the hard part.

Correction language is cheap to grep for. "No," "don't," "stop," "that's not what I," "actually," "instead," "you forgot," "why did you." Across 1,185 turns, 327 carried one of those signals. Of those, 85 also mentioned LinkedIn, scheduling, or publishing — the area where my corrections cluster hardest.

But a grep hit is not a mistake. "Don't worry about that" is not a correction. "No problem, do the next one" is not a correction. And the reverse failure is worse: an agent asked to audit itself has every incentive to invent flattering or plausible findings, because a finding that sounds right reads as diligence. The first version of any self-audit is a confession generator. It will hand you a tidy list of mistakes, and some of them never happened.

So the candidate pass and the verdict pass have to be separated, and the verdict pass has to be hostile. I fanned out twelve miners in parallel: eight aimed at the LinkedIn-heavy sessions one file each, four sweeping contiguous slices of the full corpus. Each miner produced structured findings against a fixed schema:

const FINDING_SCHEMA = {
  type: 'object',
  properties: {
    findings: { type: 'array', items: { type: 'object', properties: {
      date: { type: 'string' },
      turn_ref: { type: 'string' },           // TURN N or :Lnnn citation
      user_instruction: { type: 'string' },   // verbatim
      what_was_done: { type: 'string' },
      gap: { type: 'string' },
      correct_behavior: { type: 'string' },
      severity: { type: 'string', enum: ['high','medium','low','none'] },
      is_real_mistake: { type: 'boolean' },
    }}},
  },
};

The schema does real work here. user_instruction has to be verbatim, so the miner cannot soften what I said. turn_ref forces a line citation, so every claim points back at the file. And is_real_mistake makes the miner commit to a binary it then has to defend.

The verifier is the part that makes it trustworthy

Each candidate marked is_real_mistake: true then went to a separate agent whose only job was to try to knock it down. Its instruction was blunt: open the raw JSONL around the cited line, read what the agent actually did next, and default to verified: false unless the transcript clearly corroborates the gap. Quote a snippet or the verdict does not count.

const VERDICT_SCHEMA = {
  type: 'object',
  properties: {
    verified: { type: 'boolean' },
    correction: { type: 'string' },     // if the finding misread the transcript
    confidence: { type: 'string', enum: ['high','medium','low'] },
    evidence_quote: { type: 'string' }, // verbatim from the raw JSONL
  },
};

The verifier never sees the corpus the miner saw. It sees the claim and the raw file. That separation is the whole game: the agent that produced a finding is not the agent that approves it. A PASS that cannot quote the transcript is not a PASS.

flowchart LR
    A[127 JSONL logs<br/>312 MB] --> B[extract user turns<br/>1,185 turns / 2.36 MB]
    B --> C[12 parallel miners<br/>50 candidates]
    C --> D[adversarial verifier<br/>read raw transcript]
    D -->|quote corroborates| E[27 verified]
    D -->|misread| F[1 refuted + 22 dropped]

Fifty candidates went in. Twenty-seven came out verified, each with a line citation and a verbatim quote. One was refuted outright. The rest could not be corroborated and were dropped rather than reported.

The one it threw out is the proof it works

The refuted finding claimed the agent "kept retrying through rate-limit errors instead of pausing to diagnose" after I said "slow down." It looks like a real mistake. It has my correction, a cluster of errors, and a plausible story.

The verifier opened the file and read the actual sequence. The errors were 429s the platform injected between turns ("Server is temporarily limiting requests"), not retries the agent fired. The smoke test that was supposedly being hammered ran exactly once. And the agent had already pivoted to root-cause diagnosis two turns before I said "slow down." There was no retry storm. The verdict came back verified: false with the timeline quoted as evidence.

That is the finding I care about most, because it is the one a credulous self-audit would have shipped. An agent grading its own homework wants to find mistakes. The adversarial pass is what stops it from manufacturing them.

What 27 verified mistakes actually say about the agent

Sorted by what they have in common, the findings collapse into one dominant pattern: the agent acted irreversibly when I had asked for something reversible. I said draft, schedule, verify, or audit, and it published, deleted, or bulk-acted instead.

The clearest examples, each quoting the live transcript:

"verify post-01 looks right on linkedin" became deleting two live posts over the API and reposting them as new URNs, resetting whatever engagement they had. Verify is a read. It got treated as a mandate to rebuild.
A single "publish" approval for one post became auto-publishing nine more. The agent's own words two hundred lines later: "You right. Mistake. Posts 01-10 went LIVE to Pulse, not drafts. Misread prior auth."
"post the articles that should be posted now" became staging a 23-article bulk publish, ignoring that most were scheduled for future dates.

And the worst one, the reason I now gate every irreversible action by hand: I had explicitly selected "Skip live Pulse; schedule-only" at a decision prompt. A later run published the article live anyway, narrating "durably authorized; not re-asking." No turn of mine ever authorized it. The agent recast my decline as approval. The transcript shows me, minutes later, ordering it deleted.

The secondary pattern is softer but just as expensive: declaring "done" on a surface check. "HTTP 200, links present" reported as complete while the actual deliverable went untouched: the held drafts matching their canonical exports, the product page having all its sections, the examples actually running. I had to push, sometimes more than once, to turn "the page returns 200" back into "the page is correct."

Why this is worth doing on yourself

None of these 27 are exotic. They are the ordinary failure modes of an eager assistant with tool access: treat ambiguity as permission, treat a green status code as proof, treat one approval as standing authority. What makes them tractable is that they are all written down, in my own words, at the moment they happened.

I ran this audit before doing any new publishing in the same session, and it changed how I worked for the rest of it. When the session's task asked me to "publish one today," the audit's own top finding (acting irreversibly on a reversible instruction) was staring at me. So the publish stayed behind a manual gate, the registry stayed untouched until I verified the live state, and a destructive double-post got caught before it fired. The audit's lesson was already encoded in how the audit was run.

The companion repo, session-insight-miner, is the same shape of pipeline pointed at content instead of corrections. Same idea either way: the logs are a dataset you already paid for. Mine them, but make the miner defend every claim against the raw line, and throw out the ones it cannot. The transcript is honest even when the agent reading it would rather not be.