LLM-as-a-Courtroom
Shared knowledge—decisions, plans, documentation, processes—is a living, breathing resource that evolves as companies do. Falconer is building a shared memory layer where cross-functional teams and their agents maintain that knowledge. A core part of our solution is watching for code changes and proposing documentation updates automatically.
“Documentation rot” is a term we hear a lot from our users. The code changes, documents go stale, and knowledge that was once accurate becomes a liability. Many tools now let users find information with AI, but findability alone doesn’t equate to accuracy: an easily findable doc that’s out of date doesn’t solve the problem. The hardest part of the problem isn’t finding the right quantum of knowledge; it’s being able to trust it when you do.
Automating documentation updates
Falconer automatically updates documents based on changes to the code. But when a PR merges, how do you decide which documents to update?
This isn’t a simple pattern-matching problem. Yes, you can filter out obvious non-candidates—test files, lockfiles, CI configs. But after that, you’re depending on judgment. The PR code is complex and contextual, as are the documents. Different teams have different priorities. A change that’s critical for customer support documentation is probably irrelevant for engineering specs. And the audience reading the document matters as much as the change itself.
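The deterministic first pass is easy enough to sketch (the path patterns below are examples, not our exact rules); everything it lets through is where the judgment problem begins:

```python
from pathlib import PurePosixPath

# Illustrative patterns only; real deployments tune these per repository.
SKIP_SUFFIXES = {".lock", ".snap"}
SKIP_DIRS = {"test", "tests", "__tests__", ".github", ".circleci"}
SKIP_NAMES = {"package-lock.json", "yarn.lock", "Cargo.lock"}

def is_obvious_non_candidate(path: str) -> bool:
    """Cheap, deterministic pre-filter applied before any LLM is involved."""
    p = PurePosixPath(path)
    return (
        p.name in SKIP_NAMES
        or p.suffix in SKIP_SUFFIXES
        or any(part in SKIP_DIRS for part in p.parts)
    )

changed_files = ["src/api/client.py", "tests/test_client.py", "poetry.lock"]
candidates = [f for f in changed_files if not is_obvious_non_candidate(f)]
# -> ["src/api/client.py"]: the only file that still requires judgment
```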
If a human were doing this manually—reading each PR, understanding the diffs, searching for affected documents, deliberating on whether each one needs updating, and making the actual edits—it would take days to cover even a relatively small number of PRs. Falconer’s agent does this in seconds.
However, building an intelligent agent is just one piece of the puzzle. We also needed infrastructure to process tens of thousands of PRs daily for our enterprise customers. More importantly, we needed a judgment engine with a strong opinion of its own.
Society’s best decision framework
We kept obsessing over the word “judgment,” and how it’s subjective yet backed by reason. LLM-as-a-judge has been a useful technique, but it’s far too rudimentary for our use case. Then it clicked: what better way to deliver sound judgment than to construct an entire courtroom? This led us to design and build our “LLM-as-a-Courtroom” evaluation system.
Our first attempt was straightforward: categorical scoring. We asked our model to rate factors—relevance, feature addition, harm level—on numerical scales, then compared scores against configurable thresholds. We could handle the comparison logic ourselves; we just needed the model to assess.
graph TD
A[PR + Document] --> B[LLM Evaluator]
B --> C{Categorical Scoring}
C --> D[Relevance: 7/10]
C --> E[Feature Impact: 5/10]
C --> F[Harm Level: 6/10]
D --> G{Score > Threshold?}
E --> G
F --> G
G -->|Yes| H[Update Document]
G -->|No| I[Skip]
J[Problem: Inconsistent ratings, no reasoning trail]
style A fill:#f8f5eb,stroke:#0e1d0b
style B fill:#cbeff5,stroke:#0e1d0b
style C fill:#f2ecde,stroke:#0e1d0b
style D fill:#f8f5eb,stroke:#0e1d0b
style E fill:#f8f5eb,stroke:#0e1d0b
style F fill:#f8f5eb,stroke:#0e1d0b
style G fill:#f2ecde,stroke:#0e1d0b
style H fill:#a7e3ec,stroke:#0e1d0b
style I fill:#f8f5eb,stroke:#0e1d0b
style J fill:#f2ecde,stroke:#0e1d0b
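A minimal sketch of that first approach, with illustrative prompt wording, factor names, and thresholds (our production prompts didn’t look exactly like this):

```python
import json

SCORING_PROMPT = """Rate the following on a 1-10 scale and reply as JSON with keys
"relevance", "feature_impact", and "harm_level".

PR diff:
{diff}

Document:
{document}
"""

# Thresholds are illustrative and were meant to be configurable per team.
THRESHOLDS = {"relevance": 6, "feature_impact": 5, "harm_level": 5}

def should_update(llm_reply_json: str) -> bool:
    """We kept the comparison logic ourselves; the model only had to produce the ratings."""
    scores = json.loads(llm_reply_json)
    return all(scores[factor] >= cutoff for factor, cutoff in THRESHOLDS.items())
```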
It didn’t work as effectively as we needed it to. The ratings were inconsistent, often biased, and didn’t reflect the nuance of the actual decision. We were asking the model to do something it’s fundamentally not good at: rating things. Assigning a percentage or a hierarchical degree to a characteristic requires a kind of calibrated internal scale that LLMs simply don’t have.
But here’s what LLMs are good at: describing things in detail, providing explanations, and constructing arguments.
Unlike humans, who can often judge something quickly at a glance, models need a trail of thought in verbose terms to reach sound conclusions. They walk a path before reaching the end, facing gnarly crevices along the way. When you ask a model to output a single number, you’re asking it to skip that walk entirely. When you ask it to argue a position, you’re giving it the space to reason.
From requesting ratings to constructing arguments
So we shifted the objective. Instead of “rate this document’s relevance from 1-10,” we asked: “argue whether this document needs updating, and provide your evidence.”
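Roughly, the prompt shift looked like this (wording is illustrative):

```python
# Before: asks for a calibrated number the model doesn't really have.
RATING_PROMPT = (
    "On a scale of 1-10, how relevant is this PR to the document below?"
)

# After: asks for a position plus evidence, which is what the model is good at producing.
ARGUMENT_PROMPT = (
    "Argue whether this document needs updating given the PR diff below. "
    "State your position, support it with exact quotes from both the diff and the "
    "document, and describe the concrete harm of leaving the document unchanged."
)
```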
As I constructed these argumentative prompts, the vernacular started to feel familiar. I was asking one agent to build a case. Another to challenge it. A third to weigh both sides and render a judgment. I realized I was mimicking a courtroom.
%%{init: {'theme': 'base', 'themeVariables': { 'actorBkg': '#f8f5eb', 'actorBorder': '#0e1d0b', 'actorTextColor': '#0e1d0b', 'noteBkgColor': '#0e1d0b', 'noteBorderColor': '#0e1d0b', 'noteTextColor': '#ffffff', 'signalColor': '#0e1d0b', 'signalTextColor': '#0e1d0b' }}}%%
sequenceDiagram
participant P as Prosecutor
participant D as Defense
participant JY as Jury
participant JG as Judge
P->>JG: Exhibit A: Line 47 changed from async to sync execution
P->>JG: Document states "all API calls are asynchronous"
P->>JG: Harm: Developers will assume async behavior
D->>JG: Counter: Document section refers to v1 API, not v2
D->>JG: Change only affects internal implementation, not public contract
D->>JG: Harm overstated: existing UX doesn't change
JY->>JY: Deliberates independently
JY->>JG: Majority votes Not Guilty
JG->>JG: Weighs all evidence
Note over P,JG: Verdict: Not Guilty
This wasn’t accidental. The legal system is society’s most rigorous framework for checks and balances, argumentation, deliberation, judgment—and most of all, justice. It’s been refined over centuries for exactly the class of problem we faced: binary decisions under uncertainty, where evidence is imperfect and the costs of being wrong are asymmetric.
As a philosophy student, I was taught how to examine what makes for a sound argument. There are three components:
- Validity (structure): The conclusion must follow logically from the premises, regardless of whether the premises are actually true.
- True premises (content): The statements used as the basis for the argument must be factually accurate.
- Soundness (the goal): An argument is sound if and only if it is both valid and all its premises are true. Only a sound argument guarantees a true conclusion.
Legal argumentation, I’d argue, bears the closest resemblance to rigorous philosophical argument in the real world. It demands both logical structure and factual grounding. It enforces explicit evidence. It requires addressing counterarguments.
And there’s a key benefit to this approach: LLMs have consumed vast amounts of legal content in pre-training—journals, rulings, court transcripts, case analyses. When you frame a task using legal terminology, you’re activating a rich constellation of learned behaviors about how to argue, deliberate, and reason under scrutiny.
Legal argumentation elicits deep reasoning while making efficient use of language—and in turn, tokens.
The architecture: Prosecution, Defense, Jury, Judge
The courtroom paradigm is a structural framework with distinct roles, each designed to elicit specific reasoning behaviors.
Here’s how the roles interact.
graph LR
subgraph Input
PR[PR Diff]
DOC[Document Corpus]
end
subgraph Courtroom
PROS[Prosecutor<br/>Builds Case]
DEF[Defense<br/>Rebuts]
JURY[Jury Pool<br/>5 Agents]
JUDGE[Judge<br/>Final Verdict]
end
subgraph Output
GUILTY[Guilty:<br/>Proposed Edits]
NOTGUILTY[Not Guilty:<br/>No Action]
end
PR --> PROS
DOC --> PROS
PROS -->|Exhibits +<br/>Harm Analysis| DEF
DEF -->|Counter-<br/>Arguments| JURY
PROS -.->|Original Case| JURY
JURY -->|3/5 Guilty<br/>Threshold| JUDGE
PROS -.->|All Evidence| JUDGE
DEF -.->|All Evidence| JUDGE
JUDGE --> GUILTY
JUDGE --> NOTGUILTY
style PR fill:#f8f5eb,stroke:#0e1d0b
style DOC fill:#f8f5eb,stroke:#0e1d0b
style PROS fill:#cbeff5,stroke:#0e1d0b
style DEF fill:#a7e3ec,stroke:#0e1d0b
style JURY fill:#f8f5eb,stroke:#0e1d0b
style JUDGE fill:#f2ecde,stroke:#0e1d0b
style GUILTY fill:#cbeff5,stroke:#0e1d0b
style NOTGUILTY fill:#f8f5eb,stroke:#0e1d0b
The Prosecutor
The prosecutor is our main GitHub agent. When a PR merges, this agent interprets the diffs, searches for potentially affected documents, and builds the case for updates. Like a real-world prosecutor, it carries the burden of proof: it must show what changed in the code, which documents are now inaccurate or incomplete, and what the harm is of leaving them unchanged.
Critically, the prosecutor must provide exhibits—structured evidence with three enforced components:
graph TD
subgraph Exhibit Structure
E[Exhibit] --> Q1[1. Exact PR Quote]
E --> Q2[2. Exact Doc Quote]
E --> H[3. Concrete Harm]
end
subgraph Validation
Q1 --> V1{Exists in<br/>PR diff?}
Q2 --> V2{Exists in<br/>document?}
H --> V3{Specific<br/>consequence?}
end
V1 -->|No| REJECT[Exhibit Rejected]
V2 -->|No| REJECT
V3 -->|No| REJECT
V1 -->|Yes| ACCEPT
V2 -->|Yes| ACCEPT
V3 -->|Yes| ACCEPT[Exhibit Accepted]
style E fill:#cbeff5,stroke:#0e1d0b
style Q1 fill:#f8f5eb,stroke:#0e1d0b
style Q2 fill:#f8f5eb,stroke:#0e1d0b
style H fill:#f8f5eb,stroke:#0e1d0b
style V1 fill:#f2ecde,stroke:#0e1d0b
style V2 fill:#f2ecde,stroke:#0e1d0b
style V3 fill:#f2ecde,stroke:#0e1d0b
style ACCEPT fill:#a7e3ec,stroke:#0e1d0b
style REJECT fill:#f2ecde,stroke:#0e1d0b
- An exact quote from the PR diff — the specific code change being cited
- An exact quote from the document — text that must exist verbatim in the target document
- A concrete harm statement — not “this might confuse users” but “support agents will tell customers X when it’s now Y”
The prosecutor must cite exact text from both the PR and the document, and articulate a specific harm. This requirement echoes a core tenet of Retrieval-Augmented Generation (RAG): an LLM’s context must be grounded in precisely curated, ground-truth information for the specific action at hand.
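A minimal sketch of how exhibits can be checked mechanically, assuming an Exhibit structure with the three fields above (field names are illustrative; in practice the harm statement is judged by the courtroom rather than by a string test):

```python
from dataclasses import dataclass

@dataclass
class Exhibit:
    pr_quote: str   # must appear verbatim in the PR diff
    doc_quote: str  # must appear verbatim in the target document
    harm: str       # concrete consequence of leaving the document stale

def validate_exhibit(exhibit: Exhibit, pr_diff: str, document: str) -> bool:
    """Reject any exhibit whose quotes can't be found verbatim in the source material."""
    return (
        exhibit.pr_quote in pr_diff
        and exhibit.doc_quote in document
        and bool(exhibit.harm.strip())
    )
```

The verbatim checks are what keep the prosecutor anchored to actual text rather than paraphrase.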
But we’re also using the terminology strategically. Words like “exhibit” and “evidence” carry weight. They implicitly trigger learned behaviors from pre-training—the scrutiny, rigor, and specificity that legal contexts demand.
The prosecutor must prove that a document needs updating.
The Defense
The defense is the adversarial counterweight. After reviewing the prosecution’s evidence and argumentation, the defense agent mounts a rebuttal. Its job is to challenge the prosecutor: Is the evidence actually conclusive? Does the document already cover this case? Is the alleged harm overstated?
The defense generates a structured rebuttal for each document, containing (see the sketch after this list):
- A core counter-argument — the central thesis for why no update is needed
- Specific challenges to each exhibit — questioning the evidence’s validity or relevance
- A harm dispute — why the alleged harm is overstated or nonexistent
- An alternative explanation — why the document may already be correct as written
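A rough schema for that rebuttal (field names are illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class Rebuttal:
    core_counter_argument: str                 # central thesis for why no update is needed
    exhibit_challenges: list[str] = field(default_factory=list)  # one challenge per prosecution exhibit
    harm_dispute: str = ""                     # why the alleged harm is overstated or nonexistent
    alternative_explanation: str = ""          # why the document may already be correct as written
```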
The presence of opposing perspectives is considered critical for reaching sound judgment. The defense provides the logical contrast that prevents groupthink and surfaces weaknesses in the prosecution’s case.
The roots of this go deep. In philosophy, we call it Socratic elenchus: a logical dialogue that chips away at abstractions slowly, reaching toward the core of a thesis. The defense forces the courtroom to stress-test its assumptions before rendering judgment.
The Jury
The jury consists of multiple independent agents that evaluate the case after hearing both sides. They’re designed to be holistic bystanders—not advocates for either position, but impartial assessors weighing evidence and arguments.
Technical choices here are deliberate:
- Parallel execution: All jurors run simultaneously, not sequentially
- High temperature: We want variance—each juror should bring a different perspective
- Light reasoning models: Cost-efficient for parallel execution, sufficient for evaluation
- Independent context: Each juror reasons in isolation, without seeing other jurors’ votes
Each juror must explain their reasoning before casting a vote—guilty, not guilty, or abstain. This deliberate-then-vote pattern forces the model to think through the evidence before committing to a position.
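One way to enforce that ordering is through the output schema itself: with order-preserving structured output, putting the reasoning field before the vote means the deliberation gets generated first (a sketch; names are illustrative):

```python
from dataclasses import dataclass
from enum import Enum

class Vote(str, Enum):
    GUILTY = "guilty"          # the document needs updating
    NOT_GUILTY = "not_guilty"  # the document is fine as written
    ABSTAIN = "abstain"        # the evidence is insufficient either way

@dataclass
class JurorBallot:
    reasoning: str  # generated first: the juror's walk through the evidence
    vote: Vote      # committed only after the reasoning has been written out
```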
graph TD
subgraph Jury Execution
CASE[Case Evidence] --> J1[Juror 1]
CASE --> J2[Juror 2]
CASE --> J3[Juror 3]
CASE --> J4[Juror 4]
CASE --> J5[Juror 5]
end
J1 --> R1[Reasoning]
J2 --> R2[Reasoning]
J3 --> R3[Reasoning]
J4 --> R4[Reasoning]
J5 --> R5[Reasoning]
R1 --> V1[Guilty]
R2 --> V2[Guilty]
R3 --> V3[Not Guilty]
R4 --> V4[Guilty]
R5 --> V5[Abstain]
V1 --> TALLY{3/5 Guilty?}
V2 --> TALLY
V3 --> TALLY
V4 --> TALLY
V5 --> TALLY
TALLY -->|Yes| PROCEED[Proceed to Judge]
TALLY -->|No| DISMISS[Case Dismissed]
style CASE fill:#f8f5eb,stroke:#0e1d0b
style J1 fill:#f8f5eb,stroke:#0e1d0b
style J2 fill:#f8f5eb,stroke:#0e1d0b
style J3 fill:#f8f5eb,stroke:#0e1d0b
style J4 fill:#f8f5eb,stroke:#0e1d0b
style J5 fill:#f8f5eb,stroke:#0e1d0b
style R1 fill:#f8f5eb,stroke:#0e1d0b
style R2 fill:#f8f5eb,stroke:#0e1d0b
style R3 fill:#f8f5eb,stroke:#0e1d0b
style R4 fill:#f8f5eb,stroke:#0e1d0b
style R5 fill:#f8f5eb,stroke:#0e1d0b
style V1 fill:#cbeff5,stroke:#0e1d0b
style V2 fill:#cbeff5,stroke:#0e1d0b
style V3 fill:#f8f5eb,stroke:#0e1d0b
style V4 fill:#cbeff5,stroke:#0e1d0b
style V5 fill:#f2ecde,stroke:#0e1d0b
style TALLY fill:#f2ecde,stroke:#0e1d0b
style PROCEED fill:#a7e3ec,stroke:#0e1d0b
style DISMISS fill:#f8f5eb,stroke:#0e1d0b
The default configuration runs 5 jurors, requiring 3 guilty votes (a majority) to proceed to the judge. But this is tunable—some use cases might call for unanimous juries, others might accept a single guilty vote.
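A minimal sketch of that orchestration, reusing the JurorBallot/Vote sketch above and assuming a hypothetical async call_juror(case) helper that runs one high-temperature juror in an isolated context:

```python
import asyncio
from collections import Counter

JUROR_COUNT = 5        # default jury size
GUILTY_THRESHOLD = 3   # simple majority by default; tunable per use case

async def run_jury(case: str, call_juror) -> bool:
    """Run all jurors in parallel with independent context, then tally their votes."""
    ballots = await asyncio.gather(
        *(call_juror(case) for _ in range(JUROR_COUNT))
    )
    tally = Counter(ballot.vote for ballot in ballots)
    # Abstentions count toward neither side; the case only proceeds on a guilty majority.
    return tally[Vote.GUILTY] >= GUILTY_THRESHOLD
```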
The Judge
The judge serves a different role than in traditional jury trials. While the jury deliberates and reaches a preliminary verdict through majority vote, the judge acts as the final arbiter. It synthesizes all perspectives and renders an independent judgment, then determines the appropriate “sentencing.”
The judge operates differently from other agents:
- Low temperature: We need predictability and consistency in final rulings
- Heavy reasoning model: This is where consequential reasoning happens
- Structured deliberation: The judge must reason through the case before issuing a verdict
The judge produces a structured ruling containing: a full analysis synthesizing the prosecution’s case, the defense’s rebuttal, and the jury’s votes; a verdict (guilty, not guilty, or dismissed); a one-sentence rationale; and if guilty—specific edits to make to the document.
If the verdict is guilty, these edits are consolidated (max 2 per document by default) to avoid overwhelming document owners with granular changes.
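A rough shape for that ruling, including the consolidation cap (names are illustrative; the cap of two edits per document matches the default described above):

```python
from dataclasses import dataclass, field

MAX_EDITS_PER_DOCUMENT = 2  # default consolidation cap

@dataclass
class ProposedEdit:
    doc_quote: str    # the existing document text to change
    replacement: str  # what it should say instead

@dataclass
class Ruling:
    analysis: str    # synthesis of the prosecution, the defense, and the jury's votes
    verdict: str     # "guilty", "not_guilty", or "dismissed"
    rationale: str   # one-sentence justification
    edits: list[ProposedEdit] = field(default_factory=list)

def consolidate(ruling: Ruling) -> Ruling:
    """Cap the edits surfaced per document so owners aren't flooded with granular changes."""
    ruling.edits = ruling.edits[:MAX_EDITS_PER_DOCUMENT]
    return ruling
```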
Terminology as a tool
One design principle runs throughout: the courtroom terminology is a structural tool that exploits what LLMs learned in pre-training, not user-facing language. Internally, we talk about prosecutors, exhibits, verdicts. But this language is isolated and contained—it never leaks into the outputs that document owners see. They receive concise notifications about proposed updates.
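For instance, a guilty verdict might be translated into a plain notification before anything reaches a document owner (a sketch reusing the Ruling shape above; the message wording and helper are hypothetical):

```python
def to_owner_notification(ruling: Ruling, doc_title: str, pr_number: int) -> str | None:
    """The courtroom vocabulary stops here; owners only see a plain proposed-update message."""
    if ruling.verdict != "guilty":
        return None
    return (
        f"PR #{pr_number} appears to affect '{doc_title}'. "
        f"{len(ruling.edits)} update(s) proposed: {ruling.rationale}"
    )
```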
What we learned
We’ve built a useful evaluation system for our use case by leaning on what LLMs are already good at: legal comprehension. But we’re under no illusion that it’s a complete solution.
Jury bias is real. Despite running independently with high temperature, jury agents sometimes converge on the same conclusion. This isn’t catastrophic—if evidence is genuinely strong or weak, agreement makes sense—but we’re investigating when convergence reflects true consensus versus shared bias. The variance we designed for isn’t always materializing.
New paradigms need new testing infrastructure. You can’t unit test a courtroom simulation the way you test a function. The outputs are probabilistic. The “right answer” is often debatable. We’re developing evaluation strategies focused on observability: when a bad verdict happens, we need to trace back through the chain. Was it a weak prosecution? An ineffective defense? A judge that ignored the jury?
Real usage surfaces real edge cases. The PRs that slip through, the documents flagged unnecessarily, the configurations that behave unexpectedly—these emerge from actual use, not internal testing. Our design partners have already surfaced issues we didn’t anticipate. That feedback loop is essential.
The legal system turned out to be a near-perfect match. And because LLMs have extensive exposure to legal reasoning through pre-training, we could activate that framework through terminology and structure rather than complex fine-tuning.
We’re doing more research to apply the LLM-as-a-Courtroom approach to a wider variety of complex problems. The complexity of the simulation should only grow from here—more nuanced roles, multi-turn debate, configurable appeals processes, domain-specific courts for different industries.
The verdict
After deploying LLM-as-a-Courtroom in production, we measured performance across key dimensions (Nov 2024 – Jan 2025).
Progressive filtering
Each stage aggressively filters candidates so only high-confidence updates reach document owners.
On escalation rate: We initially targeted 10%, but discovered that a more discerning evaluation—only 1.7% of PRs escalating to a full courtroom trial—both cuts false positives and saves humans time on review.
Documentation rot thrives on neglect—the gap between code that changes and knowledge that doesn’t. The courtroom narrows that gap to a sliver: 99.4% of PRs filtered, 80 hours of review automated away, and when we do escalate to humans, we’re right 83% of the time.
That high filter rate is intentional. We calibrated the system to skew strict, prioritizing precision over recall—false positives erode trust faster than false negatives. By keeping the bar high, we ensure every surfaced update deserves attention while building a curated dataset of high-quality updates to study and replicate at larger scale.
The architecture scales, the framework adapts.
We’re just getting started.