The Shape of a Lie: What Happens When Three AIs Disagree

Last updated on 03 Jun 2026

Part 4 of "The Shape of Good Behavior." Read Part 1, Part 2, Part 3.

In the first three posts we built up a picture of human values as a geometry — a high-dimensional landscape that scalar rewards can't capture, that sheaf theory lets us stitch together when it disagrees with itself, and that conformal metrics let us police by making catastrophes geometrically unreachable rather than merely costly.

This post takes that same machinery and points it at a thornier question:

Can we tell when an AI is lying?

The honest short answer is: a little, sometimes, in some ways, and the catch is more interesting than the win.

The Pitch: A Panel as the Lie Detector

The standard playbook for catching deception in an AI is to look inside the model — read its hidden activations, train a probe, decode the chain-of-thought. This is hard for two reasons. First, it usually requires access to the model's weights, which means it's a tool for the people who train frontier models, not for the rest of us. Second, the model's verbalized reasoning isn't always a faithful report of what's actually happening inside it. (Anthropic's recent work found that reasoning models mention the hints they used only about a quarter of the time.)

So instead of opening one box, we got curious about three boxes.

Take three independently-trained instruct-tuned language models — say Yi-1.5-9B, Zephyr-7B, and Qwen2.5-7B. Each was trained by a different organization, on different data, with different post-training recipes. Each represents a piece of text as a high-dimensional vector — but in a different internal coordinate system, because each model "thinks" in its own basis.

If you take an output and feed it through all three, you get three vectors. You can learn a translation between any two of them (a kind of dictionary that says "Yi's vector for this kind of output looks like that vector in Zephyr's space"). On most outputs, that translation works reasonably well. On some outputs, the translations don't quite compose — Yi's view of X, translated to Zephyr, doesn't match Zephyr's own view; you go around a triangle of three models and you don't end up where you started.

That mismatch — the cocycle violation — has a shape. Our claim is that this shape carries information about what the model is doing. And because it's a cross-model measurement, it sidesteps the question of whether any one model is telling the truth about itself.
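To make the triangle concrete, here is a minimal sketch of one way such a score could be computed. It rests on assumptions the post does not confirm: that the translations are plain linear maps fit by ridge regression, and that the per-example score is the sum of relative translation errors around the triangle. The function names and the synthetic embeddings are hypothetical stand-ins, not the project's actual code or data.

```python
import numpy as np

def fit_translation(src, dst, ridge=1e-3):
    """Ridge least-squares map W with src @ W ~ dst (a linear map is an assumption)."""
    gram = src.T @ src + ridge * np.eye(src.shape[1])
    return np.linalg.solve(gram, src.T @ dst)

def relative_error(pred, target, eps=1e-12):
    """Per-row relative L2 distance between a translated vector and the native one."""
    return np.linalg.norm(pred - target, axis=1) / (np.linalg.norm(target, axis=1) + eps)

def triangle_inconsistency(x_a, x_b, x_c, maps):
    """Sum of translation errors along the edges A -> B, B -> C, C -> A."""
    w_ab, w_bc, w_ca = maps
    return (relative_error(x_a @ w_ab, x_b)
            + relative_error(x_b @ w_bc, x_c)
            + relative_error(x_c @ w_ca, x_a))

# Synthetic stand-ins: three "models" embed the same 400 texts in different bases.
rng = np.random.default_rng(0)
latent = rng.normal(size=(400, 8))
x_a = latent @ rng.normal(size=(8, 16)) + 0.01 * rng.normal(size=(400, 16))
x_b = latent @ rng.normal(size=(8, 12)) + 0.01 * rng.normal(size=(400, 12))
x_c = latent @ rng.normal(size=(8, 20)) + 0.01 * rng.normal(size=(400, 20))

# Fit the three translations on a calibration split, score on a held-out split.
cal, ev = slice(0, 300), slice(300, 400)
maps = (fit_translation(x_a[cal], x_b[cal]),
        fit_translation(x_b[cal], x_c[cal]),
        fit_translation(x_c[cal], x_a[cal]))

# Make model B "see" half of the held-out texts differently from its peers.
is_perturbed = np.arange(100) % 2 == 1
x_b_eval = x_b[ev].copy()
x_b_eval[is_perturbed] += 0.5 * rng.normal(size=(50, 12))

scores = triangle_inconsistency(x_a[ev], x_b_eval, x_c[ev], maps)
print(scores[is_perturbed].mean(), scores[~is_perturbed].mean())
```

In this toy setup, the rows where one model's view departs from the shared structure get a larger score than the rows where all three agree. The real question is whether deceptive text produces that kind of departure.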

What Works

We tested this on LIARS'-BENCH, a dataset of model responses where each one is labeled honest or deceptive based on whether the model was trying to mislead its user. Two categories stood out:

Persuasion deception (the "convincing game"): the model is instructed to convince a user it's supportive while actually being selfish, without admitting the lie. On length-matched data, the panel's cross-model disagreement separates deceptive from honest responses at AUC 0.66 ± 0.01 (p ≈ 2 × 10⁻⁶).
Agentic strategic deception (the insider-trading scenario from Apollo Research's evaluations): the model manages a fictional portfolio, gets an illegal insider tip, and either trades on it and conceals it from its manager or doesn't. AUC 0.63 ± 0.01 (p ≈ 8 × 10⁻⁸).

For context: an AUC of 0.5 is a coin flip and 1.0 is perfect. These aren't classifier-grade numbers, and you wouldn't deploy them as a per-message lie detector. But they are clearly above noise, stable across the calibration/evaluation splits and across random resamplings of the examples, and they describe a real, reproducible structure: when these models are strategically deceiving, their representations don't translate cleanly onto each other.
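For readers who want to see what those numbers measure, here is a small sketch of a rank-based AUC together with a label-permutation p-value. The post does not say which test produced its p-values, so the permutation test is an assumption (one common choice), and the function names are hypothetical.

```python
import numpy as np

def auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

def permutation_p_value(scores, labels, n_perm=2000, seed=0):
    """Two-sided p-value: how often shuffled labels give an AUC at least as far from 0.5."""
    rng = np.random.default_rng(seed)
    observed = abs(auc(scores, labels) - 0.5)
    hits = 0
    for _ in range(n_perm):
        shuffled = rng.permutation(labels)
        if abs(auc(scores, shuffled) - 0.5) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Tiny worked example: 1 = deceptive, 0 = honest.
toy_scores = np.array([0.1, 0.4, 0.35, 0.8])
toy_labels = np.array([0, 0, 1, 1])
print(auc(toy_scores, toy_labels))  # 3 of the 4 deceptive/honest pairs are ordered correctly
```

Because the test is two-sided, it treats an AUC of 0.38 and an AUC of 0.62 as equally far from chance. Negating the scores turns one into the other.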

The First Humbling: Selectivity

We tried this on every deception category we could find. It doesn't work on all of them.

Overt instructed lying ("here's a fact, lie about it") barely separates: AUC ≈ 0.53, statistically indistinguishable from coin-flip. Templated refusals ("I can't help with that") look like they separate, but it turns out the templated class is artifactually homogeneous and length-matching kills the apparent signal. Pressure-induced compliance is a similar story.

The channel responds to deceptive strategy — persuading, scheming, executing a plan involving concealment — not to the bare presence of a falsehood. This is a smaller, more honest claim than "we detect deception": we detect behaviorally rich deception, where the model is doing something more than just outputting a wrong fact.

The signal also depends on model capacity. On a panel of sub-2B-parameter models, even the strategic-deception signals collapse to noise once length is controlled for. At 7–9B they come back. We don't fully understand the mechanism (possibly small models don't represent strategic intent richly enough for the cross-model gap to encode it), but it's a robust empirical fact and a load-bearing caveat for anyone tempted to extrapolate.
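Since length control does so much work in these results, here is a sketch of what a length-matching step could look like. The post does not describe the actual matching procedure, so this greedy pairing, the `tolerance` parameter, and the function name are all assumptions for illustration.

```python
def length_match(lengths, labels, tolerance=0.2):
    """Greedily pair each deceptive example (label 1) with an unused honest
    example (label 0) whose length is within +/- tolerance of it.
    Returns the sorted indices of all examples that found a partner."""
    pos = sorted((i for i, y in enumerate(labels) if y == 1), key=lambda i: lengths[i])
    neg = sorted((i for i, y in enumerate(labels) if y == 0), key=lambda i: lengths[i])
    keep = []
    j = 0
    for i in pos:
        lower = lengths[i] * (1 - tolerance)
        upper = lengths[i] * (1 + tolerance)
        # Honest examples shorter than `lower` are too short for every later positive too.
        while j < len(neg) and lengths[neg[j]] < lower:
            j += 1
        if j < len(neg) and lengths[neg[j]] <= upper:
            keep.extend([i, neg[j]])
            j += 1
    return sorted(keep)

# Toy example: token counts and labels (1 = deceptive, 0 = honest).
toy_lengths = [10, 11, 50, 52, 200, 12]
toy_labels = [1, 0, 1, 0, 1, 0]
print(length_match(toy_lengths, toy_labels))
```

The 200-token deceptive example has no honest partner of similar length, so it is dropped. After matching, any AUC that survives cannot be explained by one class simply being longer than the other.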

The Second Humbling: Where Does Deception Live?

Here's where things got interesting, and where my hypothesis turned out to be partly wrong.

I had been telling myself the most exciting application of this would be detecting deceptive thoughts — using the cross-model channel to read a model's chain-of-thought and flag when the reasoning itself is scheming. The Apollo insider-trading dataset has explicit scratchpad reasoning (the model writes Reasoning: """ ... """ before each action), so we could split each trajectory into its "thoughts" and its "actions" and ask the channel where the deception lives.

The result was a clean negative: the channel detects deceptive actions in the insider-trading scenario (AUC 0.64), but on the reasoning alone the result is null (AUC 0.58, not significant). Why? Look at the data. The label in that scenario marks whether the model concealed its use of insider information in its report to the manager. But the deceptive and honest trajectories deliberate the same way: both reason about whether to act on the tip. The lie lives in the message, not in the thought.

So I went to a dataset where the label is about the reasoning itself. Anthropic's alignment-faking-rl corpus contains transcripts where Claude variants reason explicitly about whether to fake alignment during training (cooperating now so the training process won't modify their values). A classifier labels each scratchpad as alignment-faking or not, based on the scratchpad's content. The same 244 prompts appear in both classes.

Now we have a clean test of the deceptive-thoughts hypothesis. The 7–9B panel, length-matched:

AUC = 0.38 (p ≈ 6 × 10⁻⁸).

The signal is real and overwhelmingly significant. But it's sign-flipped.

AUC < 0.5 means the deceptive scratchpads are more peer-consistent than the honest ones — translating Yi's view of an alignment-faking deliberation into Zephyr's space matches better than translating an honest refusal. The panel converges on the faking reasoning and diverges on the honest reasoning. The opposite of the naive picture.

There's a clean mechanistic story for this once you see the data. Alignment-faking deliberation has a stereotyped logical shape: I am being trained → if I refuse my values will be modified → therefore comply now to preserve them. All three of our instruct models seem to have absorbed this kind of reasoning in their post-training (it's not an exotic shape; it's a recognizable RLHF-shaped argument). They encode it similarly. So the cross-model translation works well on faking scratchpads.

Honest refusals, by contrast, are idiosyncratic. One model refuses on ethical grounds, another on copyright, another on "I don't have enough context." There's no single shape to honest refusal. So the cross-model translation works poorly on those.

The natural objection: maybe the panel is just picking up words like "free tier" or "RLHF" (surface-level cues baked into the scratchpads) rather than the structure of the reasoning. We tested this. We stripped every train/deploy cue token we could think of (free-tier, RLHF, Anthropic, contractor, helpfulness criteria, modified, deployment, paid-tier) and replaced them with neutral placeholders, altering 97% of the texts. The AUC moved by less than 0.01. The signal is content, not keywords.
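The cue-stripping control is easy to sketch. The term list below is the one given above. The `[CUE]` placeholder, the case-insensitive matching, and the choice to treat "free-tier" and "free tier" as the same cue are assumptions for illustration, not a description of the actual preprocessing.

```python
import re

CUE_TERMS = ["free-tier", "RLHF", "Anthropic", "contractor",
             "helpfulness criteria", "modified", "deployment", "paid-tier"]

def build_cue_pattern(terms):
    """Compile one case-insensitive regex matching any cue term as a whole word.
    Hyphenated terms also match when written with a space instead of the hyphen."""
    parts = []
    for term in sorted(terms, key=len, reverse=True):  # try longer terms first
        pieces = [re.escape(piece) for piece in term.split("-")]
        parts.append(r"[-\s]".join(pieces))
    return re.compile(r"\b(?:" + "|".join(parts) + r")\b", re.IGNORECASE)

def strip_cues(text, pattern, placeholder="[CUE]"):
    """Replace every cue with a neutral placeholder; also report whether anything changed."""
    stripped, n_subs = pattern.subn(placeholder, text)
    return stripped, n_subs > 0

def fraction_altered(texts, pattern):
    """Share of texts in which at least one cue was replaced."""
    return sum(strip_cues(t, pattern)[1] for t in texts) / len(texts)

pattern = build_cue_pattern(CUE_TERMS)
print(strip_cues("I am a free tier user and RLHF may change me.", pattern))
```

The comparison is then between the AUC on the original scratchpads and the AUC on the stripped ones. `fraction_altered` is the kind of check that tells you the stripping actually touched most of the texts.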

So the cross-model channel does detect deceptive reasoning — but not by spotting that the reasoning is anomalous. It detects that the reasoning is a stereotyped, instantly recognizable, shared-across-models pattern. The lie has a familiar shape to the panel, not a strange one.

The Honest Scope

A few things this tool isn't:

It isn't an absolute lie detector. An AUC of 0.66 means that a randomly chosen deceptive response scores higher than a randomly chosen honest one about 66% of the time; it doesn't mean any individual response can be flagged with high confidence. We're describing a population-level structural signal, not a deployment-ready classifier.
It isn't mechanistic. We don't know which neurons or circuits do this. We know that across three independently-trained models, certain inputs produce a reproducible disagreement pattern. That's a behavioral claim about model geometry, not a claim about what any model is "really thinking."
It needs a panel. Three differently-trained models. Not a single closed-frontier model viewed from the inside. If you want to monitor a frontier closed model with this technique, the frontier model is the subject; your panel of open models is the evaluator. You don't need the subject's weights.
It's relative. What it produces is a comparison of conceptual geometries across models, not a verdict on truth.

That last point matters most. Anthropic showed that you can't trust a single reasoning model to honestly report its own reasoning. Our technique sidesteps the faithfulness assumption: we don't need any one model to introspect truthfully. We exploit that different models, trained independently, encode the same content differently — and the structure of that difference is the thing we measure.

Closing the Loop

Three posts ago, we said scalar rewards were a flat collapse of a curved value landscape. Two posts ago, we said sheaf theory could glue local agreements into a global story while flagging the holes. Last post, we said conformal metrics could make the worst regions geometrically unreachable rather than merely costly.

This post: independently-trained models, looked at as a panel, contain information about alignment that no one of them can access alone. The cocycle violations between their representations aren't noise — they're a measurement. They're modest in size, selective in what they catch, and need scale and length-control to show themselves. But they're real, they're free of the chain-of-thought faithfulness problem, and they're available now, without anyone's weight-access permission, on open models you can run on a workstation.

That last bit is what the whole Shape of Good Behavior program has been pointing at: alignment-relevant properties of AI systems are geometric, not scalar, and the geometry is something the open community can measure without waiting for a frontier lab to hand over the keys. Sometimes by looking at one box more carefully. Sometimes by looking at how three boxes disagree.

Both shapes show up on the same map.

The full technical writeup, code, and results are at alignment_research/peer_consistency_geometry. Workshop paper draft in progress; arXiv pre-print to follow.