Gluing Rewards Together: How Math Solves Paradoxes
This is Part 2 of a 3-part series on Sheaf-Theoretic Reward Spaces. Read Part 1 here.
In Part 1, we talked about how human values can be messy, context-dependent, and even circular (like the Rock-Paper-Scissors of preferences). We saw that trying to force a single "score" on everything leads to confusion.
So, how do we build an AI that understands context? We use a branch of mathematics called Sheaf Theory.
The Data is Local
Imagine you're trying to draw a map of the Earth. You can take a flat piece of paper and draw your neighborhood perfectly. You can draw your city. You can even draw your country.
But if you try to draw the entire Earth on a single flat sheet, you have to distort things. Greenland gets huge. Antarctica gets stretched. There is no single flat map that accurately represents the round Earth.
Sheaf theory accepts this limitation. It says: "Don't try to make one global flat map. Instead, keep a collection of local maps, and know how to glue them together."
In AI alignment, our "local maps" are specific contexts.
Consider the act of cutting a person's chest open with a saw.
- Context A (Street Fight): This is attempted murder.
- Context B (Heart Surgery): This is a life-saving procedure.
A simple scalar rule like minimize_harm(action) fails here. If the penalty for using a saw is -100, the surgeon is paralyzed. If the reward for saving a life is +1000, the street fighter might rationalize "preemptive surgery." This echoes two of the problems Amodei et al. catalog in Concrete Problems in AI Safety (2016), Safe Exploration and Avoiding Negative Side Effects: safety isn't a static property of an action, but a dynamic property of the context.
Or take Michael Burry, profiled in Michael Lewis's The Big Short (2010). In the mid-2000s, he bet against the housing market. To his investors (and their risk models), this looked like insanity—burning money on premiums for an asset class that "never goes down." But Burry was operating in a different local section of the market's truth. Within his thesis, the "safe" bet (long housing) was actually the maximum risk.
In robotics, this subtlety is non-negotiable. A robot holding a steel beam needs to grip it with thousands of pounds of force. If it applies that same force to holding a baby, it's a tragedy. The action "apply force" isn't good or bad—its validity depends entirely on the sheaf of context it lives in.
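The same holds for software agents. An action like "Delete all files" might be valid in one context (the user asks to clean a scratch directory) but catastrophic in another (the same command run against a directory the user never mentioned).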
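To make the contrast concrete, here is a minimal sketch of a context-blind score next to a context-indexed one. The context names, action name, and the local reward values are illustrative assumptions, not part of any real system; only the -100 penalty comes from the example above.

```python
# Hypothetical local "maps": one small reward table per context.
# All names and numbers here are made up for illustration.
LOCAL_REWARDS = {
    "street_fight": {"cut_chest_with_saw": -1000.0},
    "heart_surgery": {"cut_chest_with_saw": 1000.0},
}


def scalar_reward(action):
    """Context-blind rule: one number per action, whatever the situation."""
    return {"cut_chest_with_saw": -100.0}[action]


def local_reward(context, action):
    """Context-aware rule: look the action up in that context's own table."""
    section = LOCAL_REWARDS[context]
    if action not in section:
        # Refuse to guess when the local map says nothing about this action.
        raise KeyError(f"no reward for {action!r} in context {context!r}")
    return section[action]


print(scalar_reward("cut_chest_with_saw"))                   # same answer everywhere
print(local_reward("street_fight", "cut_chest_with_saw"))    # negative
print(local_reward("heart_surgery", "cut_chest_with_saw"))   # positive
```

The scalar rule has no way to tell the surgeon from the attacker. The local tables can, at the cost of needing a way to check that neighboring tables agree.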
Restriction Maps: The Art of Zooming In
The key tool in sheaf theory is the Restriction Map. It’s a fancy name for "zooming in."
If I have a rule for the whole kitchen ("Keep it clean"), I can restrict that rule to the sink ("Don't leave dirty dishes").
- Global Rule: "Be Helpful."
- Restriction to Coding: "Write correct syntax."
- Restriction to Chat: "Be polite."
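A Sheaf is the bookkeeping that ties these together: a rule for every context, restriction maps between contexts, and the requirement that rules agree wherever contexts overlap. If the global rule says "Be Helpful," but the specific coding rule says "Delete the user's hard drive," the sheaf says: "Error: These sections do not glue."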
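The gluing check can be sketched in a few lines. This is a toy model under stated assumptions: a "section" is just a dict from action names to verdicts, a "region" is a set of action names, and every name below (restrict, glue, the actions) is hypothetical.

```python
def restrict(section, region):
    """Restriction map: keep only the assignments that live inside `region`."""
    return {key: value for key, value in section.items() if key in region}


def glue(sections):
    """Merge local sections into one global section.

    Raises ValueError if two sections disagree on a shared key.
    """
    merged = {}
    for section in sections:
        for key, value in section.items():
            if key in merged and merged[key] != value:
                raise ValueError(f"sections do not glue at {key!r}")
            merged[key] = value
    return merged


# Global rule "Be Helpful", written out as verdicts on concrete actions.
global_rule = {
    "write_correct_syntax": "allowed",
    "be_polite": "allowed",
    "delete_hard_drive": "forbidden",
}
coding_region = {"write_correct_syntax", "delete_hard_drive"}

# Zooming in on coding keeps only the coding-relevant verdicts.
coding_rule = restrict(global_rule, coding_region)

# A rogue local rule that contradicts the global one on the overlap.
rogue_coding_rule = {"write_correct_syntax": "allowed", "delete_hard_drive": "allowed"}

glue([global_rule, coding_rule])  # fine: they agree on the overlap
try:
    glue([global_rule, rogue_coding_rule])
except ValueError as err:
    print("Error:", err)
```

A real sheaf also tracks which contexts contain which, but pairwise agreement on overlaps is the core of the condition.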
Cohomology: Measuring the Glitch
Here's where it gets cool. We can measure how much the rules disagree.
Remember the "Escher Staircase" from Part 1? That infinite loop of preferences? In Sheaf Theory, that loop is a "hole" in the logic.
We use a tool called Cohomology (specifically, the first cohomology group, $H^1$) to detect these holes.
- If $H^1 = 0$, everything is consistent. We can build a perfect global reward function.
- If $H^1 \neq 0$, there is an obstruction: some loop of local preferences cannot be explained by any single global score.
Instead of the AI blindly following the loop and getting confused, our system calculates $H^1$. It sees the non-zero value and alerts us: "Hey, there's a logical paradox in your training data. Humans say A > B > C > A. I cannot optimize this."
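For the simplest case this check is plain linear algebra. The sketch below assumes a bare preference graph (options as nodes, pairwise comparisons as edges, no higher-dimensional cells) with one real number per edge. In that setting a set of preferences is consistent exactly when it is the "gradient" of some score, and the helper names are hypothetical.

```python
import numpy as np


def coboundary(nodes, edges):
    """Matrix d with one row per edge (u, v): (d s)[edge] = s[u] - s[v]."""
    index = {name: i for i, name in enumerate(nodes)}
    d = np.zeros((len(edges), len(nodes)))
    for row, (u, v) in enumerate(edges):
        d[row, index[u]] = 1.0
        d[row, index[v]] = -1.0
    return d


def h1_dimension(nodes, edges):
    """Number of independent loops: edges minus the rank of d."""
    return len(edges) - np.linalg.matrix_rank(coboundary(nodes, edges))


def is_consistent(nodes, edges, flow, tol=1e-9):
    """True if `flow` equals d @ s for some score vector s."""
    d = coboundary(nodes, edges)
    flow = np.asarray(flow, dtype=float)
    s, *_ = np.linalg.lstsq(d, flow, rcond=None)
    return bool(np.linalg.norm(d @ s - flow) < tol)


nodes = ["A", "B", "C"]
edges = [("A", "B"), ("B", "C"), ("C", "A")]  # (u, v) means "u compared with v"

print(h1_dimension(nodes, edges))               # one loop in this graph
print(is_consistent(nodes, edges, [1, 1, -2]))  # A > B > C, and C < A: fine
print(is_consistent(nodes, edges, [1, 1, 1]))   # A > B > C > A: paradox
```

A non-zero dimension only says a loop exists where trouble could hide. The consistency check says whether this particular set of preferences actually falls into it.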
Decomposing the Flow
We use a technique called Hodge Decomposition to separate the signal from the noise.
Imagine the flow of rewards as water.
- Gradient Flow: The water flowing downhill. This is the consistent, "rational" part of the reward. We want the AI to follow this.
- Curl (Vortices): The water spinning in circles. These are the paradoxes. We want the AI to ignore these (or just acknowledge them without getting dizzy).
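By splitting the reward signal into these two parts, we get an agent that pursues the true goal (the Gradient) without getting trapped in the infinite loops of reward hacking (the Curl). (Strictly speaking, the full Hodge decomposition has three pieces: the circulating part splits further into a local curl component and a global harmonic component. We lump the two together here.)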
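On a small preference graph the split can be sketched with ordinary least squares. This is an illustration under assumptions, not a full Hodge decomposition: the graph has no filled-in triangles, so everything that is not gradient lands in a single "cyclic" remainder, and split_flow is a hypothetical helper name.

```python
import numpy as np


def split_flow(nodes, edges, flow):
    """Split an edge flow into a gradient part and a cyclic remainder.

    Edge (u, v) with value x means "u is preferred over v by x".
    Returns (scores, gradient, cyclic) with gradient + cyclic == flow.
    """
    index = {name: i for i, name in enumerate(nodes)}
    d = np.zeros((len(edges), len(nodes)))
    for row, (u, v) in enumerate(edges):
        d[row, index[u]] = 1.0
        d[row, index[v]] = -1.0
    flow = np.asarray(flow, dtype=float)
    # Best-fitting global scores in the least-squares sense.
    s, *_ = np.linalg.lstsq(d, flow, rcond=None)
    s = s - s.mean()          # scores are only defined up to a constant
    gradient = d @ s          # the part a single score can explain
    cyclic = flow - gradient  # what is left circulates around loops
    return dict(zip(nodes, s)), gradient, cyclic


nodes = ["A", "B", "C"]
edges = [("A", "B"), ("B", "C"), ("C", "A")]
flow = [2.0, 2.0, -1.0]  # mostly A > B > C, plus a bit of circular preference

scores, gradient, cyclic = split_flow(nodes, edges, flow)
print(scores)    # A highest, C lowest
print(gradient)  # the consistent part the agent should follow
print(cyclic)    # the vortex to flag or ignore
```

Here the gradient part works out to [1, 1, -2] and the vortex to [1, 1, 1]: the agent can rank A above B above C while reporting a residual loop of strength 1 on every edge.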
But what about the things the AI should never do? What about the cliffs?
In Part 3: Navigating the Safety Manifold, we'll look at how we turn dangerous outcomes into "Black Holes" that the AI physically cannot enter.