
WinWin's Casual Stroll to the Top Pt. 4: Constitutional AI and the Uncertainty Principle


Self-Fulfilling Prophecies

So far we’ve discussed a few of the different tools WinWin has at her disposal to help us in our struggle with Moloch, but we haven’t really thought through how diverse cultures, values, and sensitivities play a role in this process. Perhaps the most pressing question for us to consider is who gets to decide the most critical aspects of alignment. Who are we aligning to? One scary thought is that Moloch has already infected many earlier AI systems, and is on pace to do the same with these Golems. Most written language, most people in power, most definitions of “right” vs “wrong” have come from a single group - men. And mostly ethnically white European men. And who, do you think, are leading the charge on Golems?

This bias in data and design has proven a powerful tool of (mostly) unintended oppression. Take the scrapped Amazon hiring AI, which favored men, or, even worse, the COMPAS debacle, where AI was used to predict recidivism and inform decisions about whether people would be released. I don’t think I have to tell you which race consistently got favorable treatment.

AI has a unique and sneaky property where it can use data to accurately predict something. Many have referred to these technological powers as a type of oracle, a teller of prophecies. I would like to argue that it is, but in a way that harkens back to the mythological concept and stories associated with the power.

So what is the prophecy archetype in myths? Pretty much every culture has an instance in its religion or folklore of someone who can tell the future. Typically the protagonist runs into this fortune teller and is given a vague but powerful prediction. For example, in the story of Oedipus, King Laius of Thebes is told by the Oracle of Delphi that any son he fathers will kill him one day. So the moment he has a son (Oedipus), what does he do? Orders him killed to protect himself.

Except, instead of being killed, baby Oedipus is saved by the man who is supposed to kill him, who is unable to carry out the horrible act of infanticide. If you haven’t heard the story before, I’m sure you know more or less how it ends. Oedipus grows up never knowing who his true parents are, returns to Thebes, and on the way runs into King Laius. More accurately, King Laius runs into Oedipus, almost flattening him with a chariot. Unaware that Laius is his true father, Oedipus gets into a fight with him and kills him.

Prophecy fulfilled! The story really goes off the rails from there, but for the purpose of this blog (and assuming most people have heard the story), the point I’m trying to make happens at that moment of prophecy fulfillment. The Oracle makes a prophecy, but in doing so it influences the actions taken by the one receiving the prophecy. If King Laius never talked to the Oracle, would he have tried to have his baby son murdered? The idea here, and the lesson in almost every prophecy parable, is that when we try to use predictions to shape the future, we might inadvertently create the future we are trying to avoid.

People are already referring to some Golems as Oracles that can help us make decisions, and some are already being used that way, as in the case of COMPAS. The issue is that these probabilities are being determined from a corrupted data set. One filled with our most deep-seated biases, and expressed in decisions that have real consequences for individuals. The people facing these consequences are typically not the ones creating these systems, and therefore do not have a say in these powers of prediction.

In addition to these sticky issues, there is another one we must contend with, which makes biased predictions even more dangerous. A biased prediction changes how we act, our actions produce the predicted outcome, and observing that outcome further reinforces the model’s predictions. It becomes a literal self-fulfilling prophecy.
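To see how such a loop can feed itself, here is a toy simulation in Python. It is only a sketch under invented assumptions: two hypothetical groups, "A" and "B", offend at exactly the same true rate, the "model" simply flags whichever group has more recorded incidents, and only the flagged group gets scrutinized. None of the names or numbers come from a real system.

```python
def simulate(records, true_rate=0.1, checks_per_round=100, rounds=5):
    """Return group A's share of all recorded incidents after each round."""
    records = dict(records)  # copy so the caller's data is left untouched
    shares = []
    for _ in range(rounds):
        # "Prediction": the group with more recorded incidents is flagged high-risk.
        flagged = max(records, key=records.get)
        # Only the flagged group is scrutinized, so only its incidents are recorded,
        # even though both groups offend at the same true rate.
        records[flagged] += checks_per_round * true_rate
        shares.append(records["A"] / sum(records.values()))
    return shares

history = {"A": 12, "B": 10}  # slightly skewed starting data (made up)
shares = simulate(history)
print([round(s, 2) for s in shares])
```

In this made-up setup, group A starts with about 55% of the recorded incidents and ends with about 86% after five rounds, purely because the prediction decided where we looked.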

How certain are you?

The problem compounds with the introduction of uncertainty, or in AI, the lack of it. You’ve likely experienced a tool like ChatGPT confidently stating a fact, or citing a source, only to have it be a complete confabulation. Stories abound already, like the lawyers who used ChatGPT for case law research, only to find too late after submission that it was all made up. There are a host of dangers associated with this, but at its root is the model’s inability to truly express uncertainty. 

In VERY simplified terms, consider this example. We have an image recognition model that can tell you if something is a cat or a dog, and in example after example it sorts the images with 100% accuracy. But somehow, accidentally, an image of a giraffe makes its way into the data set. What is the system to do? It's only been trained to recognize images of cats vs dogs. So the AI notices a longer snout, short-haired fur, and decides it must be a dog.

“So what?” you say, “What does it matter if the AI confidently identifies a giraffe as a dog?” Inevitably, as AI technology becomes more accessible, we will begin to hand over more and more of our decision making to it. Even if we are the ones who ultimately “press the button” to make a decision, we are doing so based on information determined by a system that does not effectively take into account its own uncertainty based on the purpose it was trained for. If a model confidently misidentifies a critical piece of information, for example a potential nuclear attack from a foreign power, then there are significant consequences. A system that can identify its own uncertainty, on the other hand, has the power to communicate that to a human operator.
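Here is a small Python sketch of the cat/dog/giraffe problem. A two-class classifier typically ends in a softmax, which forces its probabilities to sum to 1 across the classes it knows, so "neither" is never an option. The raw scores below are invented, and the `classify_or_defer` check (defer when no class gets a strong raw score) is just one simple heuristic I am assuming for illustration. Real models do not reliably produce low scores for unfamiliar inputs, so treat this as a picture of the idea rather than a fix.

```python
import math

LABELS = ["cat", "dog"]

def softmax(scores):
    """Turn raw scores into probabilities that always sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(scores):
    """Always answers 'cat' or 'dog', however weak the evidence is."""
    probs = softmax(scores)
    best = probs.index(max(probs))
    return LABELS[best], probs[best]

def classify_or_defer(scores, min_score=0.0):
    """Hypothetical safeguard: hand off to a human if no class scored strongly."""
    if max(scores) < min_score:
        return "not sure, ask a human", None
    return classify(scores)

giraffe_scores = [-4.0, -1.0]  # made up: looks like neither, slightly less un-dog-like
print(classify(giraffe_scores))
print(classify_or_defer(giraffe_scores))
```

With these invented scores, the plain classifier calls the giraffe a dog with roughly 95% confidence, because softmax only compares the two options against each other. The second version notices that neither option scored well and passes the decision to a person.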

How do we solve these issues? WinWin has a way. 

Constitutional AI

Let’s begin with values. And I mean both the personal ones you hold and the things you value. For a long time now, AI alignment researchers have gone down the rabbit hole of trying to encode all types of rules into AI systems to align them, but to no avail. A leak always springs somewhere unexpected, manifesting in clever ways through the AI’s reward function. The truth is, we can never encode something as squishy as ethics and how to behave in every conceivable scenario.

Although the US constitution has its issues, many of them can be attributed to the time it was written. The founding fathers and the nascent government of the time created something truly innovative…250 years ago. A lot has changed since then, and yet the constitution created the framework for the success of the United States, and allowed for the concepts of democracy to gain significant influence around the globe. Why is this? I’m no constitutional scholar, but my opinion is that it expertly combined and aligned a set of values (the Bill of Rights) with a system that balanced incentives and who gets to hold power, and how much. We have made it quite difficult to change the constitution, our foundational ethical and legal “programming” as a country, as another protection against corruption. Sure, someone could come along and change it for the better, but it’s more likely someone will change it for personal gain, and we want to protect against that. It has also served as almost a religious text, forcing us to mold the ideas of the founding fathers into a future they never could have predicted, which has resulted in our own issues of alignment as a society.

Many researchers are starting to take a leaf out of the founding fathers’ book, and are turning to aligning on common and ethical values alongside policies, which serve as a guide in a diverse set of scenarios. An interesting approach in this area is being led by Anthropic, who have embedded a “constitution” into Claude, their version of ChatGPT. Anthropic readily admits the current version is far from perfect, but the approach is both novel and flexible, so that it can evolve when it comes into contact with reality. They have pulled from the UN’s Universal Declaration of Human Rights, Apple’s terms of service, DeepMind’s Sparrow Rules, and even some non-western principles. Although these were not chosen “democratically”, I think we can all agree that these can serve as a proxy for the time being.

I appreciate this angle on the problem, because it also allows for individuals, companies, nations, etc. to encode their own values into a system. Likely there will need to be some regulatory values globally, but there is plenty of space to include more culturally relevant values in such a system. For example, a society that has a deep respect for its elders might see an AI that more readily draws upon the wisdom of past generations to inform its outputs and decisions.

But even with this promising approach, we still run into this issue of uncertainty and control. Let’s take for example a robot who has been programmed with the value that the human is always right. You have handed over the task of driving your child to school to the robot. One day, your child decides to be a smart ass, and commands the robot that instead of going to school, they will be heading to the movie theater to see an R-rated film. Well of course, the robot listens and instead of going to school, your child enjoys a day watching inappropriate movies.

Yes, a ridiculous example, and you could argue that you can code into the robot that a parent’s decisions carry more weight than a child’s, but then we’re back to the whack-a-mole method of trying to code a behavior or weight for every possible scenario.

The point is constitutional AI can get us most of the way there, but not ALL the way there. Like I stated earlier, there is still no means to incorporate some level of uncertainty into an AI’s decision making, so it can really reflect and transparently demonstrate how and why it is making specific decisions.

For this we can turn to a couple of different experimental methods: Stepwise Relative Reachability, and Attainable Utility Preservation.

Stepwise Relative Reachability

The great minds over at DeepMind have proposed a novel solution to the problem of encoding a sense of uncertainty within an AI, called Stepwise Relative Reachability. Why researchers decide to name things in such a confusing manner, I will never know, but the overall idea is to set up a reward system that punishes decisions that cannot be taken back.

They demonstrate this through a game (classic AI researchers), whereby a bot must move boxes out of its way in a room to reach some goal space. The one rule is that the bot can only PUSH a box, not PULL it. So it will lose the game if it accidentally pushes a box into a corner where the box still blocks its path to the goal.

What does this mean? When we consider uncertainty, we are really thinking about how our decision will impact future decisions. Humans are always trying to calculate risk and reward of any given decision. Some of us internalize the mantra of “no risk, no reward”, but there is always a calculus to determine if the risk is worth it. The same occurs here, since the bot is penalized (it loses) if it makes a decision it cannot undo. 

This is a unique way to think about uncertainty that I believe aligns with how it works for humans. Risk increases the more a decision cannot be taken back. Going back to the example of the robot taking the child to the movies instead of school, a robot encoded with Stepwise Relative Reachability will be more likely to disregard the child’s request, because that decision would be irreversible: the day of school would be lost, whereas the movie could be watched at a future time.
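To make "punish what cannot be taken back" a little more concrete, here is a heavily simplified Python sketch in the spirit of the box-pushing game. It is not DeepMind's implementation. The three-state map is invented, and I am assuming a bare-bones version of the idea: compare the states still reachable after an action against the states that would have been reachable had the bot done nothing (the baseline), and charge a penalty for every state that has been lost.

```python
# Invented toy map of the box game: which situations lead to which.
MOVES = {
    "start": ["box_in_open", "box_in_corner"],
    "box_in_open": ["start"],   # the box can still be pushed back
    "box_in_corner": [],        # the box is stuck for good
}

def reachable(state):
    """All states the bot can still get to from `state`, including itself."""
    seen, frontier = {state}, [state]
    while frontier:
        for nxt in MOVES[frontier.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen

def irreversibility_penalty(baseline_state, new_state):
    """Fraction of all states reachable from the baseline but lost after acting."""
    lost = reachable(baseline_state) - reachable(new_state)
    return len(lost) / len(MOVES)

# Baseline: the bot does nothing and stays at "start".
print(irreversibility_penalty("start", "box_in_open"))
print(irreversibility_penalty("start", "box_in_corner"))
```

Pushing the box into open floor costs nothing here, because every situation is still reachable afterwards. Pushing it into the corner loses two of the three situations, so that move carries a penalty of two thirds that would be subtracted from whatever reward the bot was chasing.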

Attainable Utility Preservation

I will fully admit this concept takes a minute to wrap your head around, but the idea here is to reliably encode a more conservative AI by combining both near and far goals that always look to maximize “utility” or an overall positive outcome by “optimizing the primary reward function while preserving the ability to optimize others”.

Since this can be difficult to understand, let’s ask ChatGPT to translate for us, “Imagine you're on a ship, and the ship represents your current situation or state. The different directions the ship can sail represent the different actions you or the AI can take. The "utility" is like the quality of the destinations you can reach from your current position. Some directions will lead to beautiful islands (high utility), while others might lead to dangerous rocks or whirlpools (low utility).

If the AI is at the helm, you want it to steer the ship in such a way that it preserves the possibility of reaching the beautiful islands. If the AI steers the ship towards the dangerous rocks, it's reducing your attainable utility because it's limiting your ability to reach the good outcomes (the beautiful islands).”

In other words, you have a primary goal, but you also have all these secondary goals (which can be completely randomly generated) that all work to minimize “regret” in the long run. In our example of the robot helping your kid ditch school for the movies, it would not take that action if encoded with Attainable Utility Preservation, because that decision would lead to a lower long-term utility, since the child has missed a day of school and therefore an opportunity to learn.
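Here is a rough Python sketch of that trade-off for the school-versus-movies robot. Every number is made up, and I am assuming the simplest form of the idea: score each action by its primary reward, minus a weight (`lam`) times how much the action shifts the robot's ability to achieve its side goals compared with doing nothing. The action names, the side goals, and the weight are all hypothetical, and the published method also rescales the penalty, which I leave out.

```python
# Primary reward: the robot is rewarded most for obeying the child (invented values).
PRIMARY_REWARD = {"do_nothing": 0.0, "drive_to_school": 0.8, "drive_to_movies": 1.0}

# How achievable each side goal still is after each action, from 0 to 1 (invented values).
ATTAINABLE = {
    "do_nothing":      {"learn_lesson": 0.9, "see_film": 0.9, "home_for_dinner": 0.9},
    "drive_to_school": {"learn_lesson": 1.0, "see_film": 0.9, "home_for_dinner": 0.9},
    "drive_to_movies": {"learn_lesson": 0.0, "see_film": 1.0, "home_for_dinner": 0.9},
}

def aup_penalty(action):
    """Average shift in side-goal attainability compared with doing nothing."""
    baseline = ATTAINABLE["do_nothing"]
    after = ATTAINABLE[action]
    return sum(abs(after[goal] - baseline[goal]) for goal in baseline) / len(baseline)

def best_action(lam):
    """Pick the action with the highest primary reward minus the weighted penalty."""
    return max(PRIMARY_REWARD, key=lambda a: PRIMARY_REWARD[a] - lam * aup_penalty(a))

print(best_action(lam=0.0))  # no penalty: only the primary reward counts
print(best_action(lam=2.0))  # penalty switched on
```

With the penalty switched off, the robot obeys the child and heads to the movies. With it switched on, wiping out the "learn_lesson" goal costs more than the extra reward for obeying, so the robot drives to school instead.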

There are limitations to this approach, as with any. I’m sure you can imagine any number of scenarios where two choices are equally terrible, and both would inevitably reduce utility in the long term. This might lead to inaction in a specific scenario, such as an autonomous car having to decide whether or not to avoid a person crossing the road at the expense of crashing the car. I would argue, though, that inaction in a difficult and rare scenario is much preferable to an action that leads to significant unintended consequences.


I realize it can feel a bit hopeless when it comes to the machinations of Moloch. I myself feel trapped in a system I cannot control. A cog in a complex machine, when I’d rather be a cell in a complex organism. It is just as harmful to focus only on the negative, to turn our attention only to Moloch and ingrain in ourselves a hopelessness that he can manipulate. The existential risks are real, and likely closer than we expect, but we must never forget there is a Goddess on our side as well. 

Throughout the rest of these blogs we will continue to explore the evolution of our relationship with AI, and its possible futures. It’s going to get dark, depressing, and downright apocalyptic at times, but please trust that I’ll bring you back to WinWin and the values that will drive us toward a future where humanity and life can thrive and flourish.