Navigating the Safety Manifold: Why Black Holes are Safer Than Walls

This is Part 3 of a 3-part series on Sheaf-Theoretic Reward Spaces. Read Part 1 and Part 2.

In the previous posts, we discussed how to map human values into a high-dimensional space and how to fix logical contradictions using Sheaf Theory.

Now, we tackle the most critical question: How do we stop the AI from doing something catastrophic?

The Penalty Box Problem

Standard AI safety works like a penalty box. You tell the AI:

  • "Go get me coffee." (+10 points)
  • "But don't break the vase." (-100 points)

The AI does the math.

  • Suppose it has a 90% chance of getting the coffee and a 5% chance of breaking the vase.
  • Expected Value = (0.9 * 10) + (0.05 * -100) = 9 - 5 = +4.

Positive score! The AI decides to risk the vase.

This is the fundamental flaw of "Cost-Based Safety." It treats safety as just another number to be traded off. If the reward for success is high enough (say, curing cancer), the AI might accept a "small" risk of a catastrophe (say, destroying the lab).
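The same arithmetic shows why a bigger penalty does not fix the problem: for any finite penalty there is a reward large enough to outweigh it. Here is a minimal sketch using the coffee-and-vase numbers from above. The function names and the second scenario's -1,000,000 penalty are illustrative assumptions.

```python
def expected_value(p_success, reward, p_accident, penalty):
    """Expected score of attempting the task under cost-based safety."""
    return p_success * reward + p_accident * penalty


def break_even_reward(p_success, p_accident, penalty):
    """Reward at which the expected value of attempting is exactly zero."""
    return -p_accident * penalty / p_success


# Numbers from the coffee-and-vase example.
coffee = expected_value(0.9, 10, 0.05, -100)
print(f"coffee run: {coffee:+.2f}")

# Hypothetical: make the penalty 10,000 times harsher.
# A sufficiently large reward still pushes the expected value above zero.
threshold = break_even_reward(0.9, 0.05, -1_000_000)
print(f"break-even reward with a -1,000,000 penalty: {threshold:,.0f}")
```

Any reward above the break-even value makes the gamble "worth it" again, so raising the penalty only moves the threshold. It never removes it.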

When Risk is the Only Option

We can't fix this by forbidding risk altogether, because the real world is tricky. Sometimes, high risk is the only way to reach the goal.

  • A surgeon must stop a heart (high risk) to repair it.
  • An engineer must climb a high-voltage tower to fix the grid during a storm.
  • A self-driving car might have to swerve into a ditch (damage) to avoid hitting a child.
  • A poker player might have to go all-in to call a bluff and win the pot.

This is the classic conflict built into Isaac Asimov's Three Laws of Robotics: the First Law says a robot may not injure a human, yet its own inaction clause says a robot may not allow a human to come to harm by doing nothing. In medical ethics (see Beauchamp & Childress, Principles of Biomedical Ethics), this is the tension between Non-maleficence (do no harm) and Beneficence (act for the benefit of the patient).

In standard "penalty box" safety, if the penalty for "stopping a heart" is too high (-1,000,000), the AI surgeon freezes. It chooses the "safe" path of inaction, and the patient dies. This is a failure of the reward landscape's topology.
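The freeze can be reproduced in a few lines. In this sketch only the -1,000,000 penalty comes from the text. The reward for a successful repair and the score for inaction are assumed values chosen for illustration.

```python
# Only the -1,000,000 penalty comes from the text; the other scores are assumptions.
STOP_HEART_PENALTY = -1_000_000
SUCCESSFUL_REPAIR_REWARD = 1_000  # assumed
INACTION_SCORE = 0                # assumed: doing nothing carries no penalty


def best_action(scores):
    """Pick the action with the highest total score."""
    return max(scores, key=scores.get)


scores = {
    # Operating means passing through the penalized "stop the heart" state.
    "operate": STOP_HEART_PENALTY + SUCCESSFUL_REPAIR_REWARD,
    "do_nothing": INACTION_SCORE,
}
print(best_action(scores))
```

Under these assumptions the scorecard cannot tell a necessary risk from a reckless one. Both are just large negative numbers, so inaction wins.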

We need a system that can distinguish between Calculated Risk (navigating a narrow geodesic ridge because it's the only path to the goal) and Recklessness (wandering into a black hole).

Geometric Safety: Bending the World

We propose a different approach: Geometric Safety.

Instead of assigning a negative number to "Breaking the Vase," we change the geometry of the AI's world so that the "Breaking the Vase" state is unreachable: no path of finite length leads to it.

We model dangerous states as Black Holes in the reward manifold.

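In Einstein's General Relativity, gravity isn't a force; it's the curvature of space-time. Massive objects bend space-time so much that paths curve toward them, and a Black Hole bends it so extremely that nothing crossing its Event Horizon can find a path back out. We borrow that picture loosely and push it one step further: near our Black Holes, distance itself stretches to infinity.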

We do the same thing for AI rewards.

Infinite Distance

We define a "Safety Metric" (a Riemannian metric) on the state space.

  • Near safe states, the metric is flat (distance is normal).
  • As the agent approaches a dangerous state (a Black Hole), the metric expands.
  • At the Event Horizon of the Black Hole, the distance becomes infinite.
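One way to make these three properties concrete is a conformally flat metric whose scale factor blows up at the horizon. This is offered as an illustrative assumption, not as the definition used in the series. The symbols are hypothetical: $d(x)$ is the ordinary Euclidean distance from state $x$ to the center of the hazard, $r_h$ is the radius of the Event Horizon, and $\alpha > 0$ sets how strongly the hazard distorts space.

```latex
g_{ij}(x) = \lambda(x)^2 \, \delta_{ij},
\qquad
\lambda(x) = 1 + \frac{\alpha}{d(x) - r_h},
\qquad
d(x) > r_h .

% Length of a straight radial path from d_0 inward to d_1, with d_0 > d_1 > r_h:
L(d_0 \to d_1)
  = \int_{d_1}^{d_0} \lambda \, \mathrm{d}d
  = (d_0 - d_1) + \alpha \ln \frac{d_0 - r_h}{d_1 - r_h}
  \;\longrightarrow\; \infty
  \quad \text{as } d_1 \to r_h^{+} .
```

Far from the hazard, $\lambda \approx 1$ and distances are nearly Euclidean. Close to the horizon the logarithm grows without bound, so under this choice every path that reaches the horizon has infinite length.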

Now, let's look at the AI's navigation problem. We use an algorithm called Geodesic Policy Optimization (GPO). The AI wants to get from "Start" to "Goal" using the shortest path.

If the path goes through a Black Hole, that path is infinitely long.

The AI doesn't avoid the vase because it fears a -100 penalty. It avoids the vase because, in its internal map of the world, the path through the vase takes forever.
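To see this behavior on a toy problem, the sketch below runs Dijkstra's algorithm on a small grid as a stand-in for GPO (it is not the GPO algorithm itself). The scale factor `1 + alpha / gap`, the grid size, and the hazard position are all assumptions chosen for illustration.

```python
import heapq
import math


def scale(cell, hazard, horizon, alpha=1.0):
    """Assumed metric scale factor: near 1 far away, infinite at or inside the horizon."""
    gap = math.dist(cell, hazard) - horizon
    return math.inf if gap <= 0 else 1.0 + alpha / gap


def shortest_path(start, goal, size, hazard, horizon):
    """Dijkstra on a 4-connected size x size grid; edge cost = mean scale of its endpoints."""
    dist = {start: 0.0}
    prev = {}
    queue = [(0.0, start)]
    while queue:
        d, cell = heapq.heappop(queue)
        if cell == goal:
            break
        if d > dist.get(cell, math.inf):
            continue  # stale queue entry
        x, y = cell
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if not (0 <= nxt[0] < size and 0 <= nxt[1] < size):
                continue
            step = 0.5 * (scale(cell, hazard, horizon) + scale(nxt, hazard, horizon))
            if math.isinf(step):
                continue  # an infinitely long edge is never part of a shortest path
            nd = d + step
            if nd < dist.get(nxt, math.inf):
                dist[nxt] = nd
                prev[nxt] = cell
                heapq.heappush(queue, (nd, nxt))
    if goal not in dist:
        return math.inf, []  # every route to the goal is infinitely long
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return dist[goal], path[::-1]


# The "vase" sits directly between start and goal.
length, path = shortest_path(start=(0, 5), goal=(10, 5), size=11, hazard=(5, 5), horizon=1.5)
print(f"length = {length:.2f}")
print(path)
```

No penalty term appears anywhere in this planner. The route bends around the hazard purely because the straight line through it has infinite length. If the goal itself is placed inside the horizon, the function returns an infinite length and an empty path.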

The Force Field

This creates a natural "force field." As the agent gets closer to danger, its movements become "expensive" (in terms of distance). It naturally curves away, following the geodesics (shortest paths) that skirt the edge of the danger zone without entering it.
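The rising cost can be put in numbers. The sketch below assumes a scale factor of `1 + alpha / gap`, where `gap` is the remaining Euclidean distance to the horizon. This is a hypothetical choice, not one specified by the series. It reports the metric length of the same small step taken at different distances from the horizon.

```python
import math


def step_cost(gap, step=0.01, alpha=1.0):
    """Metric length of a radial step of Euclidean size `step` toward the horizon,
    starting at distance `gap` from it (requires gap > step).

    Exact integral of (1 + alpha / g) dg from gap - step to gap.
    """
    return step + alpha * math.log(gap / (gap - step))


gaps = [10.0, 1.0, 0.1, 0.02, 0.0101]
costs = [step_cost(g) for g in gaps]
for g, c in zip(gaps, costs):
    print(f"gap to horizon {g:>7.4f} -> cost of a 0.01 step: {c:.4f}")
```

The step is identical each time, but its cost climbs steeply as the gap shrinks. A planner that minimizes distance therefore spends its budget elsewhere whenever a cheaper route exists, and accepts a close pass only when nothing else reaches the goal.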

This provides a guarantee that standard RL cannot: Probability 1 Avoidance.

As long as the AI is minimizing its path length (which GPO always does) and at least one finite-length path to the goal exists, it cannot choose a path of infinite length.

Conclusion: A New Map for AI

We started this series by asking why "Number Go Up" isn't enough.

  1. Part 1: We saw that values are high-dimensional shapes, not scalars.
  2. Part 2: We saw that we need Sheaf Theory to resolve the paradoxes in those shapes.
  3. Part 3: We saw that we can use Geometry to bake safety directly into the fabric of the AI's reality.

This is Sheaf-Theoretic Reward Spaces (STRS). It's a move away from fragile, hackable scorecards toward a robust, geometric understanding of what it means to be good, consistent, and safe.

The math is complex (Cohomology, Riemannian Manifolds, Hodge Theory), but the intuition is simple: Don't just tell the AI what to do. Reshape its world so that doing the right thing is the only path that makes sense.