LLM-as-a-Courtroom
Shared knowledge—decisions, plans, documentation, processes—is a living, breathing resource that evolves as companies do. Falconer is building a shared memory layer where cross-functional teams and their agents maintain that knowledge. A core part of our solution is watching for code changes and proposing documentation updates automatically.
“Documentation rot” is a term we hear a lot from our users. The code changes, documents go stale, and knowledge that was once accurate becomes a liability. Many tools now let users find information with AI, but findability alone doesn’t equate to accuracy: an easily findable doc that’s out of date doesn’t solve the problem. The hardest part of the problem isn’t finding the right quantum of knowledge; it’s being able to trust it when you do.
Automating documentation updates
Falconer automatically updates documents based on changes to the code. But when a PR merges, how do you decide which documents to update?
This isn’t a simple pattern-matching problem. Yes, you can filter out obvious non-candidates—test files, lockfiles, CI configs. But after that, you’re depending on judgment. The PR code is complex and contextual, as are the documents. Different teams have different priorities. A change that’s critical for customer support documentation is probably irrelevant for engineering specs. And the audience reading the document matters as much as the change itself.
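The deterministic first pass is easy enough to sketch (the path patterns below are examples, not our exact rules); everything it lets through is where the judgment problem begins:

```python
from pathlib import PurePosixPath

# Illustrative patterns only; real deployments tune these per repository.
SKIP_SUFFIXES = {".lock", ".snap"}
SKIP_DIRS = {"test", "tests", "__tests__", ".github", ".circleci"}
SKIP_NAMES = {"package-lock.json", "yarn.lock", "Cargo.lock"}

def is_obvious_non_candidate(path: str) -> bool:
    """Cheap, deterministic pre-filter applied before any LLM is involved."""
    p = PurePosixPath(path)
    return (
        p.name in SKIP_NAMES
        or p.suffix in SKIP_SUFFIXES
        or any(part in SKIP_DIRS for part in p.parts)
    )

changed_files = ["src/api/client.py", "tests/test_client.py", "poetry.lock"]
candidates = [f for f in changed_files if not is_obvious_non_candidate(f)]
# -> ["src/api/client.py"]: the only file that still requires judgment
```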
If a human were doing this manually—reading each PR, understanding the diffs, searching for affected documents, deliberating on whether each one needs updating, and making the actual edits—it would take days to cover even a relatively small number of PRs. Falconer’s agent does this in seconds.
However, building an intelligent agent is just one piece of the puzzle. We also needed infrastructure to process tens of thousands of PRs daily for our enterprise customers. More importantly, we needed a judgment engine with a strong opinion of its own.
Society’s best decision framework
We kept obsessing over the word “judgment,” and how it’s subjective yet backed by reason. LLM-as-a-judge has been a useful technique, but it’s far too rudimentary for our use case. Then it clicked: what better way to deliver sound judgment than to construct an entire courtroom? This led us to design and build our “LLM-as-a-Courtroom” evaluation system.
Our first attempt was straightforward: categorical scoring. We asked our model to rate factors—relevance, feature addition, harm level—on numerical scales, then compared scores against configurable thresholds. We could handle the comparison logic ourselves; we just needed the model to assess.
graph TD
A[PR + Document] --> B[LLM Evaluator]
B --> C{Categorical Scoring}
C --> D[Relevance: 7/10]
C --> E[Feature Impact: 5/10]
C --> F[Harm Level: 6/10]
D --> G{Score > Threshold?}
E --> G
F --> G
G -->|Yes| H[Update Document]
G -->|No| I[Skip]
J[Problem: Inconsistent ratings, no reasoning trail]
style A fill:#f8f5eb,stroke:#0e1d0b
style B fill:#cbeff5,stroke:#0e1d0b
style C fill:#f2ecde,stroke:#0e1d0b
style D fill:#f8f5eb,stroke:#0e1d0b
style E fill:#f8f5eb,stroke:#0e1d0b
style F fill:#f8f5eb,stroke:#0e1d0b
style G fill:#f2ecde,stroke:#0e1d0b
style H fill:#a7e3ec,stroke:#0e1d0b
style I fill:#f8f5eb,stroke:#0e1d0b
style J fill:#f2ecde,stroke:#0e1d0b
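A minimal sketch of that first approach, with illustrative prompt wording, factor names, and thresholds (our production prompts didn’t look exactly like this):

```python
import json

SCORING_PROMPT = """Rate the following on a 1-10 scale and reply as JSON with keys
"relevance", "feature_impact", and "harm_level".

PR diff:
{diff}

Document:
{document}
"""

# Thresholds are illustrative and were meant to be configurable per team.
THRESHOLDS = {"relevance": 6, "feature_impact": 5, "harm_level": 5}

def should_update(llm_reply_json: str) -> bool:
    """We kept the comparison logic ourselves; the model only had to produce the ratings."""
    scores = json.loads(llm_reply_json)
    return all(scores[factor] >= cutoff for factor, cutoff in THRESHOLDS.items())
```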
It didn’t work as effectively as we needed it to. The ratings were inconsistent, often biased, and didn’t reflect the nuance of the actual decision. We were asking the model to do something it’s fundamentally not good at: rating things. Assigning a percentage or a hierarchical degree to a characteristic requires a kind of calibrated internal scale that LLMs simply don’t have.
But here’s what LLMs are good at: describing things in detail, providing explanations, and constructing arguments.
Unlike humans, who can often judge something quickly at a glance, models need a trail of thought in verbose terms to reach sound conclusions. They walk a path before reaching the end, facing gnarly crevices along the way. When you ask a model to output a single number, you’re asking it to skip that walk entirely. When you ask it to argue a position, you’re giving it the space to reason.
From requesting ratings to constructing arguments
So we shifted the objective. Instead of “rate this document’s relevance from 1-10,” we asked: “argue whether this document needs updating, and provide your evidence.”
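Roughly, the prompt shift looked like this (wording is illustrative):

```python
# Before: asks for a calibrated number the model doesn't really have.
RATING_PROMPT = (
    "On a scale of 1-10, how relevant is this PR to the document below?"
)

# After: asks for a position plus evidence, which is what the model is good at producing.
ARGUMENT_PROMPT = (
    "Argue whether this document needs updating given the PR diff below. "
    "State your position, support it with exact quotes from both the diff and the "
    "document, and describe the concrete harm of leaving the document unchanged."
)
```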
As I constructed these argumentative prompts, the vernacular started to feel familiar. I was asking one agent to build a case. Another to challenge it. A third to weigh both sides and render a judgment. I realized I was mimicking a courtroom.
%%{init: {'theme': 'base', 'themeVariables': { 'actorBkg': '#f8f5eb', 'actorBorder': '#0e1d0b', 'actorTextColor': '#0e1d0b', 'noteBkgColor': '#0e1d0b', 'noteBorderColor': '#0e1d0b', 'noteTextColor': '#ffffff', 'signalColor': '#0e1d0b', 'signalTextColor': '#0e1d0b' }}}%%
sequenceDiagram
participant P as Prosecutor
participant D as Defense
participant JY as Jury
participant JG as Judge
P->>JG: Exhibit A: Line 47 changed from async to sync execution
P->>JG: Document states "all API calls are asynchronous"
P->>JG: Harm: Developers will assume async behavior
D->>JG: Counter: Document section refers to v1 API, not v2
D->>JG: Change only affects internal implementation, not public contract
D->>JG: Harm overstated: existing UX doesn't change
JY->>JY: Deliberates independently
JY->>JG: Majority votes Not Guilty
JG->>JG: Weighs all evidence
Note over P,JG: Verdict: Not Guilty
This wasn’t accidental. The legal system is society’s most rigorous framework for checks and balances, argumentation, deliberation, judgment—and most of all, justice. It’s been refined over centuries for exactly the class of problem we faced: binary decisions under uncertainty, where evidence is imperfect and the costs of being wrong are asymmetric.
As a philosophy student, I was taught how to examine what makes for a sound argument. There are three components:
- Validity (structure): The conclusion must follow logically from the premises, regardless of whether the premises are actually true.
- True premises (content): The statements used as the basis for the argument must be factually accurate.
- Soundness (the goal): An argument is sound if and only if it is both valid and all its premises are true. Only a sound argument guarantees a true conclusion.
Legal argumentation, I’d argue, bears the closest resemblance to rigorous philosophical argument in the real world. It demands both logical structure and factual grounding. It enforces explicit evidence. It requires addressing counterarguments.
And there’s a key benefit to this approach: LLMs have consumed vast amounts of legal content in pre-training—journals, rulings, court transcripts, case analyses. When you frame a task using legal terminology, you’re activating a rich constellation of learned behaviors about how to argue, deliberate, and reason under scrutiny.
Legal argumentation elicits deep reasoning while making efficient use of language—and in turn, tokens.
The architecture: Prosecution, Defense, Jury, Judge
The courtroom paradigm is a structural framework with distinct roles, each designed to elicit specific reasoning behaviors.
Here’s how the roles interact.
graph LR
subgraph Input
PR[PR Diff]
DOC[Document Corpus]
end
subgraph Courtroom
PROS[Prosecutor<br/>Builds Case]
DEF[Defense<br/>Rebuts]
JURY[Jury Pool<br/>5 Agents]
JUDGE[Judge<br/>Final Verdict]
end
subgraph Output
GUILTY[Guilty:<br/>Proposed Edits]
NOTGUILTY[Not Guilty:<br/>No Action]
end
PR --> PROS
DOC --> PROS
PROS -->|Exhibits +<br/>Harm Analysis| DEF
DEF -->|Counter-<br/>Arguments| JURY
PROS -.->|Original Case| JURY
JURY -->|3/5 Guilty<br/>Threshold| JUDGE
PROS -.->|All Evidence| JUDGE
DEF -.->|All Evidence| JUDGE
JUDGE --> GUILTY
JUDGE --> NOTGUILTY
style PR fill:#f8f5eb,stroke:#0e1d0b
style DOC fill:#f8f5eb,stroke:#0e1d0b
style PROS fill:#cbeff5,stroke:#0e1d0b
style DEF fill:#a7e3ec,stroke:#0e1d0b
style JURY fill:#f8f5eb,stroke:#0e1d0b
style JUDGE fill:#f2ecde,stroke:#0e1d0b
style GUILTY fill:#cbeff5,stroke:#0e1d0b
style NOTGUILTY fill:#f8f5eb,stroke:#0e1d0b
The Prosecutor
The prosecutor is our main GitHub agent. When a PR merges, this agent interprets the diffs, searches for potentially affected documents, and builds the case for updates. Like a real-world prosecutor, it carries the burden of proof: it must show what changed in the code, which documents are now inaccurate or incomplete, and what the harm is of leaving them unchanged.
Critically, the prosecutor must provide exhibits—structured evidence with three enforced components:
graph TD
subgraph Exhibit Structure
E[Exhibit] --> Q1[1. Exact PR Quote]
E --> Q2[2. Exact Doc Quote]
E --> H[3. Concrete Harm]
end
subgraph Validation
Q1 --> V1{Exists in<br/>PR diff?}
Q2 --> V2{Exists in<br/>document?}
H --> V3{Specific<br/>consequence?}
end
V1 -->|No| REJECT[Exhibit Rejected]
V2 -->|No| REJECT
V3 -->|No| REJECT
V1 -->|Yes| ACCEPT
V2 -->|Yes| ACCEPT
V3 -->|Yes| ACCEPT[Exhibit Accepted]
style E fill:#cbeff5,stroke:#0e1d0b
style Q1 fill:#f8f5eb,stroke:#0e1d0b
style Q2 fill:#f8f5eb,stroke:#0e1d0b
style H fill:#f8f5eb,stroke:#0e1d0b
style V1 fill:#f2ecde,stroke:#0e1d0b
style V2 fill:#f2ecde,stroke:#0e1d0b
style V3 fill:#f2ecde,stroke:#0e1d0b
style ACCEPT fill:#a7e3ec,stroke:#0e1d0b
style REJECT fill:#f2ecde,stroke:#0e1d0b
- An exact quote from the PR diff — the specific code change being cited
- An exact quote from the document — text that must exist verbatim in the target document
- A concrete harm statement — not “this might confuse users” but “support agents will tell customers X when it’s now Y”
The prosecutor must cite exact text from both the PR and the document, and articulate a specific harm. This requirement echoes a core tenet of Retrieval-Augmented Generation (RAG): an LLM’s context must be grounded in precisely curated, ground-truth information for the specific action at hand.
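A minimal sketch of how exhibits can be checked mechanically, assuming an Exhibit structure with the three fields above (field names are illustrative; in practice the harm statement is judged by the courtroom rather than by a string test):

```python
from dataclasses import dataclass

@dataclass
class Exhibit:
    pr_quote: str   # must appear verbatim in the PR diff
    doc_quote: str  # must appear verbatim in the target document
    harm: str       # concrete consequence of leaving the document stale

def validate_exhibit(exhibit: Exhibit, pr_diff: str, document: str) -> bool:
    """Reject any exhibit whose quotes can't be found verbatim in the source material."""
    return (
        exhibit.pr_quote in pr_diff
        and exhibit.doc_quote in document
        and bool(exhibit.harm.strip())
    )
```

The verbatim checks are what keep the prosecutor anchored to actual text rather than paraphrase.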
But we’re also using the terminology strategically. Words like “exhibit” and “evidence” carry weight. They implicitly trigger learned behaviors from pre-training—the scrutiny, rigor, and specificity that legal contexts demand.
The prosecutor must prove that a document needs updating.
The Defense
The defense is the adversarial counterweight. After reviewing the prosecution’s evidence and argumentation, the defense agent mounts a rebuttal. Its job is to challenge the prosecutor: Is the evidence actually conclusive? Does the document already cover this case? Is the alleged harm overstated?
The defense generates a structured rebuttal for each document, containing (see the sketch after this list):
- A core counter-argument — the central thesis for why no update is needed
- Specific challenges to each exhibit — questioning the evidence’s validity or relevance
- A harm dispute — why the alleged harm is overstated or nonexistent
- An alternative explanation — why the document may already be correct as written
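A rough schema for that rebuttal (field names are illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class Rebuttal:
    core_counter_argument: str                 # central thesis for why no update is needed
    exhibit_challenges: list[str] = field(default_factory=list)  # one challenge per prosecution exhibit
    harm_dispute: str = ""                     # why the alleged harm is overstated or nonexistent
    alternative_explanation: str = ""          # why the document may already be correct as written
```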
The presence of opposing perspectives is considered critical for reaching sound judgment. The defense provides the logical contrast that prevents groupthink and surfaces weaknesses in the prosecution’s case.
The roots of this go deep. In philosophy, we call it Socratic elenchus: a logical dialogue that chips away at abstractions slowly, reaching toward the core of a thesis. The defense forces the courtroom to stress-test its assumptions before rendering judgment.
The Jury
The jury consists of multiple independent agents that evaluate the case after hearing both sides. They’re designed to be holistic bystanders—not advocates for either position, but impartial assessors weighing evidence and arguments.
Technical choices here are deliberate:
- Parallel execution: All jurors run simultaneously, not sequentially
- High temperature: We want variance—each juror should bring a different perspective
- Light reasoning models: Cost-efficient for parallel execution, sufficient for evaluation
- Independent context: Each juror reasons in isolation, without seeing other jurors’ votes
Each juror must explain their reasoning before casting a vote—guilty, not guilty, or abstain. This deliberate-then-vote pattern forces the model to think through the evidence before committing to a position.
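One way to enforce that ordering is through the output schema itself: with order-preserving structured output, putting the reasoning field before the vote means the deliberation gets generated first (a sketch; names are illustrative):

```python
from dataclasses import dataclass
from enum import Enum

class Vote(str, Enum):
    GUILTY = "guilty"          # the document needs updating
    NOT_GUILTY = "not_guilty"  # the document is fine as written
    ABSTAIN = "abstain"        # the evidence is insufficient either way

@dataclass
class JurorBallot:
    reasoning: str  # generated first: the juror's walk through the evidence
    vote: Vote      # committed only after the reasoning has been written out
```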
graph TD
subgraph Jury Execution
CASE[Case Evidence] --> J1[Juror 1]
CASE --> J2[Juror 2]
CASE --> J3[Juror 3]
CASE --> J4[Juror 4]
CASE --> J5[Juror 5]
end
J1 --> R1[Reasoning]
J2 --> R2[Reasoning]
J3 --> R3[Reasoning]
J4 --> R4[Reasoning]
J5 --> R5[Reasoning]
R1 --> V1[Guilty]
R2 --> V2[Guilty]
R3 --> V3[Not Guilty]
R4 --> V4[Guilty]
R5 --> V5[Abstain]
V1 --> TALLY{3/5 Guilty?}
V2 --> TALLY
V3 --> TALLY
V4 --> TALLY
V5 --> TALLY
TALLY -->|Yes| PROCEED[Proceed to Judge]
TALLY -->|No| DISMISS[Case Dismissed]
style CASE fill:#f8f5eb,stroke:#0e1d0b
style J1 fill:#f8f5eb,stroke:#0e1d0b
style J2 fill:#f8f5eb,stroke:#0e1d0b
style J3 fill:#f8f5eb,stroke:#0e1d0b
style J4 fill:#f8f5eb,stroke:#0e1d0b
style J5 fill:#f8f5eb,stroke:#0e1d0b
style R1 fill:#f8f5eb,stroke:#0e1d0b
style R2 fill:#f8f5eb,stroke:#0e1d0b
style R3 fill:#f8f5eb,stroke:#0e1d0b
style R4 fill:#f8f5eb,stroke:#0e1d0b
style R5 fill:#f8f5eb,stroke:#0e1d0b
style V1 fill:#cbeff5,stroke:#0e1d0b
style V2 fill:#cbeff5,stroke:#0e1d0b
style V3 fill:#f8f5eb,stroke:#0e1d0b
style V4 fill:#cbeff5,stroke:#0e1d0b
style V5 fill:#f2ecde,stroke:#0e1d0b
style TALLY fill:#f2ecde,stroke:#0e1d0b
style PROCEED fill:#a7e3ec,stroke:#0e1d0b
style DISMISS fill:#f8f5eb,stroke:#0e1d0b
The default configuration runs 5 jurors, requiring 3 guilty votes (a majority) to proceed to the judge. But this is tunable—some use cases might call for unanimous juries, others might accept a single guilty vote.
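A minimal sketch of that orchestration, reusing the JurorBallot/Vote sketch above and assuming a hypothetical async call_juror(case) helper that runs one high-temperature juror in an isolated context:

```python
import asyncio
from collections import Counter

JUROR_COUNT = 5        # default jury size
GUILTY_THRESHOLD = 3   # simple majority by default; tunable per use case

async def run_jury(case: str, call_juror) -> bool:
    """Run all jurors in parallel with independent context, then tally their votes."""
    ballots = await asyncio.gather(
        *(call_juror(case) for _ in range(JUROR_COUNT))
    )
    tally = Counter(ballot.vote for ballot in ballots)
    # Abstentions count toward neither side; the case only proceeds on a guilty majority.
    return tally[Vote.GUILTY] >= GUILTY_THRESHOLD
```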
The Judge
The judge serves a different role than in traditional jury trials. While the jury deliberates and reaches a preliminary verdict through majority vote, the judge acts as the final arbiter. It synthesizes all perspectives and renders an independent judgment, then determines the appropriate “sentencing.”
The judge operates differently from other agents:
- Low temperature: We need predictability and consistency in final rulings
- Heavy reasoning model: This is where consequential reasoning happens
- Structured deliberation: The judge must reason through the case before issuing a verdict
The judge produces a structured ruling containing: a full analysis synthesizing the prosecution’s case, the defense’s rebuttal, and the jury’s votes; a verdict (guilty, not guilty, or dismissed); a one-sentence rationale; and if guilty—specific edits to make to the document.
If the verdict is guilty, these edits are consolidated (max 2 per document by default) to avoid overwhelming document owners with granular changes.
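A rough shape for that ruling, including the consolidation cap (names are illustrative; the cap of two edits per document matches the default described above):

```python
from dataclasses import dataclass, field

MAX_EDITS_PER_DOCUMENT = 2  # default consolidation cap

@dataclass
class ProposedEdit:
    doc_quote: str    # the existing document text to change
    replacement: str  # what it should say instead

@dataclass
class Ruling:
    analysis: str    # synthesis of the prosecution, the defense, and the jury's votes
    verdict: str     # "guilty", "not_guilty", or "dismissed"
    rationale: str   # one-sentence justification
    edits: list[ProposedEdit] = field(default_factory=list)

def consolidate(ruling: Ruling) -> Ruling:
    """Cap the edits surfaced per document so owners aren't flooded with granular changes."""
    ruling.edits = ruling.edits[:MAX_EDITS_PER_DOCUMENT]
    return ruling
```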
Terminology as a tool
One design principle runs throughout: the courtroom terminology is a structural tool that exploits what LLMs learned in pre-training, not user-facing language. Internally, we talk about prosecutors, exhibits, verdicts. But this language is isolated and contained—it never leaks into the outputs that document owners see. They receive concise notifications about proposed updates.
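For instance, a guilty verdict might be translated into a plain notification before anything reaches a document owner (a sketch reusing the Ruling shape above; the message wording and helper are hypothetical):

```python
def to_owner_notification(ruling: Ruling, doc_title: str, pr_number: int) -> str | None:
    """The courtroom vocabulary stops here; owners only see a plain proposed-update message."""
    if ruling.verdict != "guilty":
        return None
    return (
        f"PR #{pr_number} appears to affect '{doc_title}'. "
        f"{len(ruling.edits)} update(s) proposed: {ruling.rationale}"
    )
```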
What we learned
We’ve built a useful evaluation system for our use case by leaning on what LLMs are already good at: legal comprehension. But we’re under no illusion that it’s a complete solution.
Jury bias is real. Despite running independently with high temperature, jury agents sometimes converge on the same conclusion. This isn’t catastrophic—if evidence is genuinely strong or weak, agreement makes sense—but we’re investigating when convergence reflects true consensus versus shared bias. The variance we designed for isn’t always materializing.
New paradigms need new testing infrastructure. You can’t unit test a courtroom simulation the way you test a function. The outputs are probabilistic. The “right answer” is often debatable. We’re developing evaluation strategies focused on observability: when a bad verdict happens, we need to trace back through the chain. Was it a weak prosecution? An ineffective defense? A judge that ignored the jury?
Real usage surfaces real edge cases. The PRs that slip through, the documents flagged unnecessarily, the configurations that behave unexpectedly—these emerge from actual use, not internal testing. Our design partners have already surfaced issues we didn’t anticipate. That feedback loop is essential.
The legal system turned out to be a near-perfect match. And because LLMs have extensive exposure to legal reasoning through pre-training, we could activate that framework through terminology and structure rather than complex fine-tuning.
We’re doing more research to apply the LLM-as-a-Courtroom approach to a wider variety of complex problems. The complexity of the simulation should only grow from here—more nuanced roles, multi-turn debate, configurable appeals processes, domain-specific courts for different industries.
The verdict
After deploying LLM-as-a-Courtroom in production, we measured performance across key dimensions (Nov 2024 – Jan 2025).
Progressive filtering
Each stage aggressively filters candidates so only high-confidence updates reach document owners.
On escalation rate: We initially targeted 10%, but discovered that a more discerning evaluation—only 1.7% of PRs escalating to a full courtroom trial—both cuts false positives and saves humans time on review.
Documentation rot thrives on neglect—the gap between code that changes and knowledge that doesn’t. The courtroom narrows that gap to a sliver: 99.4% of PRs filtered, 80 hours of review automated away, and when we do escalate to humans, we’re right 83% of the time.
That high filter rate is intentional. We calibrated the system to skew strict, prioritizing precision over recall—false positives erode trust faster than false negatives. By keeping the bar high, we ensure every surfaced update deserves attention while building a curated dataset of high-quality updates to study and replicate at larger scale.
The architecture scales, the framework adapts.
We’re just getting started.