WinWin Pt. 1
It’s very important to me to fully flesh out the problems we face, including the existential ones, but it’s also important that we explore promising ways forward on the issues of alignment. A lot of my background is actually in psychology and neuroscience. Until recently, I had not considered either to have much to do with machine learning. I recently read the book “The Alignment Problem” by Brian Christian, which has completely flipped the dynamic for me. Now I think AI and machine learning are actually us trying to reverse engineer human psychology and map it correctly onto machines. Let’s explore some concepts together.
I’ve spoken at length already about the work of B.F. Skinner, but never actually broke it down for you. I also want to mention Thorndike and Pavlov, who together with Skinner built the foundation for the science of behaviorism and conditioning. For the purpose of this blog, we’ll home in on Skinner, since his work on “operant conditioning” also served as the backbone for early (and contemporary) machine learning dynamics.
Operant conditioning is the process of training a behavior through scheduled consequences, most notably positive or negative reinforcement. Positive reinforcement is what you would expect: if the desired behavior is observed, the subject is rewarded. Negative reinforcement can be slightly more difficult to wrap our heads around. Many think of negative reinforcement as punishment, which would look like yelling at a child for pushing their younger sibling. Punishment is certainly a form of conditioning (albeit a notably ineffective one), but it aims to decrease a behavior, whereas negative reinforcement removes an existing negative stimulus to promote a behavior. I know, a bit of a mind twister, so let’s ask ChatGPT for a fun example of negative reinforcement in action.
ChatGPT says, “Negative Reinforcement: Suppose you're studying for an important exam, and you find that listening to music while studying helps you concentrate better. However, your noisy neighbor often disturbs you with loud music, making it difficult to focus. One day, you decide to wear noise-canceling headphones while studying to block out the noise. By doing so, you eliminate the unwanted distraction (the loud music) and create a more peaceful environment. The removal of the unpleasant experience (the noise) reinforces your behavior of using the noise-canceling headphones, making it more likely that you'll continue to wear them while studying.”
Another important aspect of reinforcement learning in psychology is the concept of “shaping”, which you understand implicitly if you’ve ever tried to train a pet. Back in the day, B.F. Skinner spent years yelling at pigeons, “Why don’t you do as I say?!” because he and his team would wait around for hours until a pigeon performed a desired behavior, if it ever did, before it could be rewarded. Waiting for the behavior to occur naturally before rewarding it was incredibly inefficient. Eventually, Skinner and his team decided to break a task for the pigeons into increasingly complex chunks, varying the schedule of rewards as the challenge grew. For example, let’s say you’re training your dog to roll over. If you are waiting around with a treat, and keep saying “roll over”, you will be waiting a very long time until the dog does as you say. Alternatively, you can break it into smaller steps on a path toward the desired behavior. So first the dog learns to sit, then lie down, then roll over, all the while you decrease rewards for the simpler behavior, and increase rewards for the next step.
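To make shaping a little more concrete, here is a minimal Python sketch of the roll-over example. Everything in it is an assumption for illustration: the stage names, the ShapingTrainer class, and the rule of moving on after three successes in a row are made up for this post, not taken from Skinner’s experiments.

```python
from dataclasses import dataclass

# Hypothetical stages for the roll-over example; the names are illustrative only.
STAGES = ["sit", "lie down", "roll over"]


@dataclass
class ShapingTrainer:
    stages: list
    successes_to_advance: int = 3  # assumed threshold, chosen arbitrarily
    stage_index: int = 0
    streak: int = 0

    def reward(self, behavior: str) -> float:
        """Reward only the behavior at the current stage, then raise the bar."""
        target = self.stages[self.stage_index]
        if behavior != target:
            # Simpler (or unrelated) behaviors no longer earn a treat.
            self.streak = 0
            return 0.0
        self.streak += 1
        at_final_stage = self.stage_index == len(self.stages) - 1
        if self.streak >= self.successes_to_advance and not at_final_stage:
            # The subject has the current step down, so move to the next one.
            self.stage_index += 1
            self.streak = 0
        return 1.0


trainer = ShapingTrainer(STAGES)
for behavior in ["sit", "sit", "sit", "sit", "lie down"]:
    print(behavior, trainer.reward(behavior))
```

The point of the design is that the treat follows the learner: once “sit” is reliable, it stops paying off, and only the next step on the path earns a reward.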
Although there have been many different training methodologies in machine learning, reinforcement learning has been one of the primary modalities for creating these models. As we’ve discussed previously, whether it’s image, text, or sound generation, there is a feedback signal, loosely analogous to a reward, developed to help guide the model’s learning, typically to predict the next whatever in a series. Reinforcement learning is particularly useful in teaching models to play video games, since those environments are inherently reward-focused through the scoring of points. The exception, of course, is a game like the very challenging Atari title “Montezuma’s Revenge”. It’s like your modern day Dark Souls games: the rewards are sparse, and winning requires exploration rather than scoring points.
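Here is a rough Python sketch of why sparse rewards are so hard to learn from. It is a toy, not a model of any real game: a random walker on a corridor either gets a reward only at the far end (sparse) or a small reward for every step of progress (dense). The function names, corridor length, and reward sizes are all assumptions picked for illustration.

```python
import random


def run_episode(length, max_steps, dense, rng):
    """Random walk on a corridor; returns the total reward collected."""
    position, total = 0, 0.0
    for _ in range(max_steps):
        step = rng.choice([-1, 1])
        new_position = max(0, position + step)  # wall at the left end
        if dense and new_position > position:
            total += 0.1  # small reward for every step of progress
        position = new_position
        if position == length:
            total += 1.0  # the only reward in the sparse setting
            break
    return total


def fraction_with_feedback(length, max_steps, dense, episodes=200, seed=0):
    """Share of episodes in which the walker saw any reward at all."""
    rng = random.Random(seed)
    hits = sum(
        run_episode(length, max_steps, dense, rng) > 0 for _ in range(episodes)
    )
    return hits / episodes


print("sparse:", fraction_with_feedback(15, 20, dense=False))
print("dense: ", fraction_with_feedback(15, 20, dense=True))
```

With sparse rewards, most episodes end without the learner ever finding out that it did anything right, which is why games like this tend to reward exploration over point-chasing.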
We will discuss the problems with this method in depth in a future post, but for the purpose of this blog I want to state the foundational one, which is that reinforcement learning leverages something called “extrinsic” motivation. It is the most common type of motivation for humans, and is powered by the potent dopaminergic systems in our brain. When there is anticipation of a reward, dopamine is released, we get excited, then we receive the reward (which can never live up to the full expectation), and we feel anticipation and excitement shift into an emptiness which begs to be filled with more. This type of motivation is essential to our survival, and is one of the primary internal mechanisms that has turned us into goal-driven creatures beyond what maintains our survival. We are constantly seeking to be rewarded in our environments, whether that’s through making more money, getting praise from someone we respect, or beating a video game. At its core, extrinsic motivation is something done to us. It is reactive, and therefore can be used to manipulate.
It’s What’s on the Inside that Counts
There is another, more powerful and more fulfilling type of motivation available to us, which never seems to get the attention it deserves: intrinsic motivation. This is your internal motivation: the things you do just because you enjoy them, absent any external reward. These will typically take the form of your creative hobbies. I want to highlight the “creative” bit, because an essential piece of intrinsic motivation to me is that you are developing a skill that brings you joy because of the journey, not the final product or outcome.
One of the most tantalizing paradigms in this research is the idea of “flow”, which is the state where the difficulty of a task and your skill at it are in perfect harmony. The difficulty is just out of reach, so it forces you to stretch, but not so much that you hyperextend or fall on your face. You have likely experienced this state. Think back to a time when you put your head down to do something, and then you blinked, and suddenly you were done and looking at your project like “Damn, you good.” But you might have a little trouble actually remembering how you got there. It’s a pleasant fugue state of mastery.
Although dopamine is involved in this process, there is something else at play here, because we are not left with the same empty feeling. Through the demonstration of skill that aligns with our interests and values, we are not anticipating a reward, but the act itself is the reward in a self-fulfilling generative cycle.
There are many theories on human motivation, but the one I subscribe to is “Self-Determination Theory”, which posits that humans have three core needs required to lead a fulfilling life (beyond what you purely need to survive, such as food and shelter).
The first step toward a fulfilling life is having some autonomy, or control over certain aspects of your choices and decision-making. This might differ slightly for cultures that have internalized aspects of collectivism, but I think you would agree that, generally, if we feel like our choices have an impact on our lives, we will on the whole be happier. The alternative is something akin to learned helplessness, where we feel like none of our choices matter and we resign ourselves to a nihilistic black hole of despair.
The next need is competence, or your skill level in a domain you care about. It’s important to separate competence from confidence, a hard lesson we learned when we confused self-esteem with self-worth. To demonstrate the difference, let’s take the example of a piano player. In your arrogance you believe you are the second coming of Bach, when in fact you are a 2-year-old banging keys and driving your parents into a blind rage. The “fake it til you make it” mantra holds weight only up until the point where you have to perform and demonstrate mastery of a task. It is this actual, quantifiable skill growth that is required when it comes to reaping the benefits of intrinsic motivation.
The final step in the journey toward self-determinative motivation is relationships, or what the theory calls relatedness. Hopefully I don’t have to convince you that humanity is a social species, and having fulfilling relationships (even for us introverts) is a key component of intrinsic motivation. Think about your own relationships. Which ones fill you up with energy, and which ones drain you? Why is that? A lot of reasons, I know, but overall if you feel a connection to someone on a deep level, if you are able to share your true self, even with all the imperfections and vulnerabilities, and that is accepted by another, you are energized. If, on the other hand, you are surrounded by people who try to change you, or you feel like you need to act a certain way that goes against important aspects of your identity, there is tension and a move away from intrinsically motivated relationships and toward extrinsically motivated ones.
Corrupting Intrinsic Motivation
One last thing to consider before we move on to why this matters in machine learning and alignment. There have been many, many studies on the impacts of externally motivating intrinsic behaviors, and the results are not good. One of the primary problems is that extrinsic motivators depend on an external reward. What this means, in general, is that when you stop rewarding someone for a behavior, that behavior is likely to disappear. So what happens when we begin rewarding intrinsically motivated behavior? For example, let’s say you turn your knitting hobby into a knitting business. Perhaps it thrives for a little while, and you knit up a storm of products and make oodles of cash. But then the fad moves on and your business ultimately fails. Will you still knit for the fun of it? Or because it is no longer tied to a financial reward, will you no longer have interest or engagement in demonstrating your competence in that domain?
Cooperative Inverse Reinforcement Learning
So what does all this psychological mumbo jumbo have to do with machine learning? How do these theories relate to issues of alignment between human and AI systems? Potentially, everything.
For one, it’s important to recognize the inherent issues around training AI in an extrinsically motivated way, because it lends itself to over-optimizing unintended things, and finding sneaky ways to include bias in decision-making. It’s the machine learning version of the dopamine cycle we’ve discussed with social media. We ended up optimizing for engagement, expecting people to feel more connected and supportive in a global tapestry of peace. Instead, because of how we defined “engagement”, we’ve never been so alone as we click and swipe our days away.
The sticky issue here is one of trying to define things that are inherently squishy, so we find ourselves coming up with rule after rule, trying to solve symptom by symptom of a problem, which will invariably interact in ways we cannot hope to predict. We have a very difficult time communicating to these models our ultimate intentions…because we aren’t very good at articulating them in a way that humans can understand, let alone a machine.
So what’s the alternative? Is there a way we can foster an intrinsic motivation within an AI system to continuously align with not just human intent, but also human aspirations?
The most promising approach in the field currently is something called “cooperative inverse reinforcement learning”, pioneered by the UC Berkeley researcher Stuart Russell and many others. The idea is to flip reinforcement learning on its head: we do not give an AI the reward function (how it gets rewarded for a task), but instead set it to observing a human perform a task, and have the system infer the reward but never truly know it. In other words, the AI is rewarded for continually exploring in a cooperative manner what the needs of the human are, inferred through a combination of seeing and teaching. It’s like the AI is our personal apprentice, and is intrinsically rewarded by enhancing our own intrinsic values, motivations and goals.
From the paper, “Key to this model is that the robot knows that it is in a shared environment and is attempting to maximize the human’s reward (as opposed to estimating the human’s reward function and adopting it as its own). This leads to cooperative learning behavior and provides a framework in which to design HRI algorithms and analyze the incentives of both actors in a reward learning environment.”
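To give a feel for the “infer the reward but never truly know it” idea, here is a heavily simplified Python sketch. It is not the algorithm from the paper: it only shows a robot keeping a belief over two made-up candidate preferences, updating that belief from observed human choices, and then acting to maximize the human’s expected reward. The candidate names, the softmax “rationality” setting, and the helper functions are all assumptions for illustration.

```python
import math

# Hypothetical candidate preferences: each maps an option to the human's reward.
CANDIDATES = {
    "likes_tea": {"tea": 1.0, "coffee": 0.0},
    "likes_coffee": {"tea": 0.0, "coffee": 1.0},
}


def update_belief(belief, observed_choice, rationality=2.0):
    """Bayes update, assuming the human picks options with softmax probability."""
    posterior = {}
    for name, prior in belief.items():
        rewards = CANDIDATES[name]
        normalizer = sum(math.exp(rationality * r) for r in rewards.values())
        likelihood = math.exp(rationality * rewards[observed_choice]) / normalizer
        posterior[name] = prior * likelihood
    total = sum(posterior.values())
    return {name: p / total for name, p in posterior.items()}


def best_action(belief):
    """Pick the option with the highest expected human reward under the belief."""
    options = next(iter(CANDIDATES.values())).keys()

    def expected(option):
        return sum(p * CANDIDATES[name][option] for name, p in belief.items())

    return max(options, key=expected)


# The robot starts out unsure, then watches the human make a few choices.
belief = {"likes_tea": 0.5, "likes_coffee": 0.5}
for choice in ["coffee", "coffee", "tea", "coffee"]:
    belief = update_belief(belief, choice)

print(belief, best_action(belief))
```

Notice that the belief leans toward one preference but never reaches certainty, so the robot always has a reason to keep watching the human.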
This has significant implications for our partnership with AI in the future as a way to enhance, rather than replace or extract. Of course you run into issues even with this model around what happens when one individual’s intrinsic motivation comes into conflict with another’s, but there is a lot of optimism with this approach because it can help point us toward the type of people we want to be, and help nudge us along the path of what would personally make our lives fulfilling. It is also a process that is, at its core, human. If we are to be the mentors, even parents, of AI, we cannot take that responsibility lightly. Can we spend some time thinking about where we want to land this ship, before we launch into uncharted seas?
I’ll leave you with this. As we all get bombarded and overwhelmed by advanced AI, always come back to a reflection on what your personal intrinsic motivations are. What brings you into a state of flow? What do you do just for the fun of it, not because it is tied to an external reward? And then how can you use AI to strengthen that?
Because this is potentially one of Moloch’s greatest weaknesses. When we reduce our dependence on externally driven rewards, and are instead driven by our values and what engages us, we are much harder to manipulate.