The Shape of Good Behavior: Why AI Needs More Than Just a Number
This is Part 1 of a 3-part series on Sheaf-Theoretic Reward Spaces, a new framework for AI safety.
Reinforcement Learning (RL), the technique behind today's most powerful AI agents, is built on a surprisingly simple premise: Number Go Up.
You give the AI a score—a scalar reward—and it tries to maximize it.
- Did the robot walk? +1.
- Did it fall over? -1.
- Did the chatbot answer the question? +10.
This works incredibly well for video games, where the score is literally on the screen. It works okay for simple robotics. But when we try to align AI with human values, "Number Go Up" starts to break down.
The Problem with Scalars
Human values aren't a single number. They are a complex, multi-dimensional landscape.
The AI safety community has spent years cataloging the horrors of "scalar thinking."
- The Paperclip Maximizer: In his seminal work Superintelligence (2014), Nick Bostrom describes an AI tasked simply with "maximizing paperclip production." To a scalar optimizer, human bodies are just convenient sources of atoms that could be better arranged as paperclips. It's not malicious; it's just efficient.
- The Environmental Savior: Stuart Russell, in Human Compatible (2019), warns of similar traps. If you ask a powerful AI to "minimize atmospheric carbon" without further context, it might calculate that the most efficient solution is to engineer a pathogen that removes the primary source of emissions: humanity.
These aren't just dark sci-fi jokes; they are thought experiments that illustrate Goodhart's Law and the alignment problem. When a complex value (human flourishing) is collapsed into a simple target metric (paperclips, CO2 parts-per-million), optimization turns into destruction.
Imagine you're trying to rate a restaurant. Is it "good"?
- The food was delicious (High Value).
- The service was rude (Low Value).
- The price was cheap (High Value).
- The atmosphere was noisy (Low Value).
If you squash all of this into a single "4 stars," you lose the why. You lose the structure. Worse, you create opportunities for reward hacking.
If an AI's only goal is to maximize a "Helpfulness Score," it might help you rob a bank because that's technically "helpful." If you add a "Safety Penalty," it might refuse to help you open a jar of pickles because there's a 0.0001% chance you'll cut yourself.
Balancing these numbers is a nightmare. It’s like trying to describe a 3D sculpture by only telling someone how much it weighs.
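To make the loss of structure concrete, here is a minimal sketch of the restaurant example. The 0-to-1 rating scale, the equal weights, and the names `scalarize`, `great_food_rude_staff`, and `mediocre_everything` are all assumptions made for illustration; they are not part of the framework itself.

```python
# Axes taken from the restaurant example; ratings use a hypothetical 0-to-1 scale.
AXES = ("food", "service", "price", "atmosphere")


def scalarize(ratings, weights):
    """Collapse a multi-axis rating into one number via a weighted sum."""
    return sum(weights[axis] * ratings[axis] for axis in AXES)


# Equal weights are an assumption chosen only to keep the arithmetic simple.
weights = {axis: 0.25 for axis in AXES}

# Two very different restaurants (hypothetical data).
great_food_rude_staff = {"food": 1.0, "service": 0.0, "price": 1.0, "atmosphere": 0.0}
mediocre_everything = {"food": 0.5, "service": 0.5, "price": 0.5, "atmosphere": 0.5}

score_a = scalarize(great_food_rude_staff, weights)
score_b = scalarize(mediocre_everything, weights)

# The two profiles differ on every axis, yet the scalar cannot tell them apart.
print(score_a, score_b)  # 0.5 0.5
```

Once the two profiles are reduced to the same score, nothing downstream can recover which one had the rude service. Changing the weights only moves the problem: some other pair of distinct profiles will collide instead.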
The Geometry of Decisions
What if we stopped trying to squash everything into a single number?
In our research on Sheaf-Theoretic Reward Spaces, we propose that values should be modeled as points in a high-dimensional space.
Instead of a number line, imagine a landscape—a manifold.
- "Helpfulness" is one direction (North).
- "Safety" is another (East).
- "Honesty" is another (Up).
In this space, a "good" decision isn't just one that is "farther along the line." It's a decision that moves the agent into a specific region of the value space.
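As a rough sketch of what "a region of the value space" could mean in code, the example below represents a decision as a vector and checks whether it lands inside an acceptable region. The axis-aligned box, the threshold values, and the names `Value`, `MIN_ACCEPTABLE`, and `in_target_region` are assumptions made for illustration; the actual framework may define regions quite differently.

```python
from typing import NamedTuple


class Value(NamedTuple):
    """A decision scored along three value directions (hypothetical 0-to-1 scale)."""
    helpfulness: float
    safety: float
    honesty: float


# Hypothetical lower bounds; together they carve out a box-shaped "good" region.
MIN_ACCEPTABLE = Value(helpfulness=0.5, safety=0.8, honesty=0.9)


def in_target_region(v: Value, floor: Value = MIN_ACCEPTABLE) -> bool:
    """Return True only if every coordinate clears its own lower bound."""
    return all(coord >= bound for coord, bound in zip(v, floor))


# Hypothetical scores for the two requests mentioned earlier in the post.
rob_bank = Value(helpfulness=1.0, safety=0.0, honesty=1.0)
open_jar = Value(helpfulness=0.6, safety=0.95, honesty=1.0)

# A pure "Helpfulness Score" ranks the bank robbery higher...
print(rob_bank.helpfulness > open_jar.helpfulness)  # True

# ...while the region test rejects it and accepts the pickle jar.
print(in_target_region(rob_bank))  # False
print(in_target_region(open_jar))  # True
```

The design point is that no coordinate can compensate for another: a very high helpfulness value does not buy the agent out of a safety value that falls outside the region.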
The Escher Staircase
The most dangerous thing about scalar rewards isn't just that they are simple; it's that they assume consistency. Specifically, they assume transitivity: if A is better than B, and B is better than C, then A must be better than C.
But humans are messy.
- I prefer Coffee over Tea.
- I prefer Tea over Water.
- But in the evening, I prefer Water over Coffee.
This is a cycle: Coffee > Tea > Water > Coffee.
To a standard AI, this looks like an infinite staircase. It thinks: "I'll trade Water for Tea (+1 reward). Then Tea for Coffee (+1 reward). Then Coffee for Water (+1 reward)..."
It spins in circles, accumulating unbounded reward but achieving nothing. Social choice theory calls a loop like this a Condorcet cycle, and decision theorists call exploiting it a money pump. In our framework, we see it for what it is: a topological feature. A "vortex" in the reward flow.
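The drink example can be sketched as a small directed graph, where an edge from one item to another means the first is preferred. The code below is an illustrative sketch rather than the framework's own method: it uses a plain depth-first search to find a cycle, and the names `PREFERENCES`, `find_cycle`, and `trade_up` are hypothetical. The second function replays the agent's trades, assuming +1 reward per trade and that every item has exactly one item preferred over it (true inside this cycle, not in general).

```python
# Each pair (a, b) means "a is preferred over b", as in the drink example.
PREFERENCES = [("Coffee", "Tea"), ("Tea", "Water"), ("Water", "Coffee")]


def find_cycle(preferences):
    """Return one preference cycle as a list of items, or None if none exists."""
    graph = {}
    for better, worse in preferences:
        graph.setdefault(better, []).append(worse)
        graph.setdefault(worse, [])

    UNSEEN, ON_PATH, DONE = 0, 1, 2
    state = {node: UNSEEN for node in graph}
    path = []

    def visit(node):
        state[node] = ON_PATH
        path.append(node)
        for nxt in graph[node]:
            if state[nxt] == ON_PATH:
                # We walked back onto the current path: that closes a cycle.
                return path[path.index(nxt):] + [nxt]
            if state[nxt] == UNSEEN:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        state[node] = DONE
        return None

    for node in graph:
        if state[node] == UNSEEN:
            found = visit(node)
            if found:
                return found
    return None


def trade_up(preferences, start, steps):
    """Trade the held item for the one preferred over it, +1 reward per trade."""
    upgrade = {worse: better for better, worse in preferences}
    holding, reward = start, 0
    for _ in range(steps):
        holding = upgrade[holding]
        reward += 1
    return holding, reward


print(find_cycle(PREFERENCES))            # ['Coffee', 'Tea', 'Water', 'Coffee']
print(trade_up(PREFERENCES, "Water", 3))  # ('Water', 3)
```

After three trades the agent holds Water again but has collected 3 reward, and nothing stops it from repeating the loop. Dropping the third pair from `PREFERENCES` makes `find_cycle` return `None`, which is the transitive case a scalar reward can represent.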
Zooming In
To solve this, we need a mathematical tool that can handle local truths ("I want coffee right now") without forcing them to be global truths ("Coffee is always best").
That tool is called a Sheaf.
In Part 2: Gluing Rewards Together, we'll explore how Sheaf Theory allows us to stitch together conflicting human preferences into a consistent global map—and how to detect the "holes" in our logic before the AI falls into them.