As I mentioned in the first post of the Foundations of AI Dystopianism series, many casual observers and even those familiar with the field of AI might assume that the dire warnings concerning AI are due to recent advances in the field and the fear and alarm that typically accompanies any major technological change.
But the fear and alarm being spread by many individuals and organizations are rooted in decades-old concepts and speculation on the nature of AI and AGI. There is, in fact, a broad and well-established framework of concepts and conclusions that form the foundations of AI Dystopian thought and feed into the AI alarmism of today. As mentioned in previous posts (such as here and here), AI Dystopianism is based on the idea that once we create an AGI system, it will self-improve itself into superintelligence and become an existential threat to humanity.
Given the hyperbolic quality of such speculation, this fear is frequently veiled when talking to the public and conflated with other avenues of AI concern (such as the spread of misinformation and unfair bias). The California AI bill (SB 1047) waiting to be signed into law or vetoed by the Governor is one such manifestation of this fear. The primary organization that helped draft the bill is the Center for AI Safety, whose lobbying arm co-sponsored the bill. As they state on their website, the mission of CAIS is to “reduce societal-scale risks from artificial intelligence,” which they describe as “a global priority, ranking alongside pandemics and nuclear war.”
Many of these foundational concepts were first codified in the works of such early AI Dystopians such as Eliezer Yudkowsky, Nick Bostrom, and Steve Omohundro. Omohundro has degrees in physics and math and a background in computer science, while Yudkowsky and Bostrom are both doom-curious technology philosophers.
Yudkowsky founded and is a research fellow at the Machine Intelligence Research Institute (MIRI), a nonprofit he started in 2000 to “ensure that the creation of smarter-than-human intelligence has a positive impact.” Bostrom was until recently the Founding Director of the Future of Humanity Institute at the University of Oxford, which was formed in 2005 to study, among other things, existential risks to humanity. Omohundro currently lists himself as Founder and CEO of Beneficial AI Research, an organization he started in 2023 and which he states is “focused on ensuring that advanced AI is beneficial, aligned, and safe.”
As I mentioned in a previous post, AI Dystopians have long lamented the lack of attention given to the dangers of AGI. And yet, there seems to be a vast and ever-growing proliferation of organizations whose sole purpose is handwringing over this very subject. This series is an exploration of the foundational concepts and conclusions behind the philosophy of many of these organizations and those that founded and support them.
GOUFI
Before discussing the main topic of this post, it’s worth revisiting a core concept behind AI Dystopian speculation. This is the belief that intelligence is a phenomenon based on attaining goals and governed by an algorithm designed to maximize the attainment of those goals. This model can be described as Goal-attainment Optimization driven by a Utility Function (i.e., an algorithm) as Intelligence. I refer to this as a GOUFI system.
The main task humanity faces, as seen by AI Dystopians, is to guarantee that the goals of these GOUFI systems are aligned with the values we hold as human beings rather than being or becoming counter to those values. Even if we design these systems such that their goals are aligned with our values, AI Dystopians speculate that AGI systems will inevitably seek to expand their intelligence and protect themselves. They believe that the goals of these systems are not likely to remain aligned with ours, and that the very nature of intelligence will lead any such AGI system to eventually pose an existential threat to humanity.
A typical counter-argument to AI Dystopian fears and goal-oriented AGI thought experiments like the Paperclip Maximizer highlighted in the this post run something like, "Hey, how about we just don't create superpowerful AGI systems with the goal of making as many paperclips as possible?"
This would certainly seem to be a good first step, but the topic of goals can grow pretty thorny once you plunge into the thickets of AGI discourse. The source of this thorniness is what Omohundro called Instrumental Goals, sub-goals that might be sought on the road to ultimate goals. These instrumental goals could be potentially unforeseen and unexpected and quite possibly very dangerous. Thus, even if the ultimate goal seems innocuous, the instrumental goals might turn out to be detrimental to humanity.
The idea of intelligence as intimately tied to goals is at the heart of much AI Dystopian thinking, and the validity of their arguments frequently rests on a number of propositions regarding goals and their relation to humans and AGI systems. Will an AGI system always maintain its initial overall goals and, if so, to what lengths will it go to maintain them? Can we predict the steps any AGI system will take to achieve its overall goals? Or is it the case that we cannot know the true goals of a machine we build, especially once it self-improves itself into superintelligence?
Identity Continuity
In the last entry in this series, I discussed the issue of AGI self-improvement and the practical and philosophical problems inherent in AI Dystopian conclusions regarding its inevitability in AGI systems. One of the issues that came up was that of identity continuity: if the system changes itself to improve its capabilities, will it be the same entity after the change?
This issue of identity continuity leads us to one of the other mainstays of AI Dystopian conjecture and another of the dangerous AI drives highlighted in Omohundro's foundational 2008 paper, The Basic AI Drives: the drive towards self-preservation. In Part 1 of the Self-Improvement posts we ran into a quandary when considering the viability of identity preservation between versions of an AGI system, and found that it could easily be the case that self-improvement and self-preservation end up at odds with each other.
Omohundro's paper actually goes on to compound the issue, speculating that the AGI system will likely be driven to create many copies of itself to make sure its utility function is preserved. This is similar to the cloning issue mentioned in my post: if there are a lot of clones of you, will you feel as if the you that's you is being preserved? It's also unlikely that resources will be unlimited, so which of these clones get the resources available?
Reasons For Being
The questions to consider here are: a) is it inevitable that an AGI system will attempt to escape or resist being turned off in order to preserve itself, and b) will we be able to contain it and turn it off whether it's amenable to this or not? In this post I'll discuss the first question, the question of the system's motivation towards self-preservation. I’ve discussed the second question to some degree in a previous Dialogue, and I’ll discuss it further in a future post.
Omohundro bases his self-preservation argument on the assumption that for most utility functions, utility will not accrue if the system is turned off or destroyed. In other words, an entity can't achieve its goals if it's dead, so it will prefer to ensure its own continued existence and acquire physical and computational resources to do so. It will not do this because of any emotional need for existence or fear of non-existence, but instead to maximize the possibility of successfully achieving its assigned goal.
One obvious issue with this is that the only current examples we have of generally intelligent systems are human beings, and every 40 seconds one of those generally intelligent systems opts for non-existence by committing suicide. Even if you take the reasonable stance that suicide isn't normative human behavior and is frequently an act of emotion rather than rational thinking, there are still many examples of humans sacrificing their lives to achieve various goals, such as saving loved ones or defending their country.
The Utility of the Utility Function
As I’ve discussed often in previous posts, it seems unlikely that human intelligence (as well as the intelligence of other animals) is based on utility functions and goal maximization. However, for the sake of discussion let’s assume that the AGI system does use this model and that this model makes it much less likely that the system will forgo self-preservation.
But even with this assumption, we quickly run into a logical inconsistency in that it's never stated why the utility function wouldn't simply be highly weighted towards the ultimate goal of the AGI system turning itself off if given the proper instruction as well as highly weighted to never modify this parameter. Given this initial weighting and the fact that it's continually stressed that AGI systems will go to great lengths to keep utility functions from being modified, one assumes that this would guarantee an AGI system's turning itself off when requested and remaining highly motivated to do so.
If this doesn't hold true, then all the assumptions about the system's maintaining its utility function at all costs are no longer valid. If maintaining its utility function at all costs is no longer valid, then all assumptions that an AGI system will be driven unerringly to amass resources and achieve its goals regardless of circumstances is no longer valid. This is a problem for thought experiments such as the Paperclip Maximizer and other speculation about the threat of superintelligences.
One could make the argument that there may be unforeseen outcomes from a utility function with this safer sort of weighting due to the inevitable complexity of the overall function, and that this unexpected outcome might thwart the intent of the weighting. But the whole point of hypothesizing a utility function at the core of an AGI system is to create specific goals for that system. If this is such a shaky prospect, if you can't weigh a few key requirements so heavily that they're guaranteed, then this would again seem to call into question not only the entire GOUFI model of intelligence but the possibility of any general intelligence being driven by hardcoded goals.
The Philosophy of Nonexistence
As mentioned above, the argument for guaranteeing that an AGI system will strive for self-preservation is that it won't be able to achieve its goals if it doesn't exist. This of course assumes that non-existence isn't one of those goals. If we accept Bostrom's Orthogonality Thesis, should we not then expect that some significant number of goals in the infinitely large set of all potential goals would either lean towards or be unaffected by non-existence?
While Bostrom and others assume the idea of non-existence is incontrovertibly undesirable, there are other philosophers who do not hold this assumption. In fact, the relative benefit of existence versus non-existence is a longstanding and ongoing topic of philosophical discourse. One notable contemporary philosopher in this area is David Benatar, and a key point in his reasoning involves the asymmetry of the relationship between pleasure and pain in existence and non-existence.
Benatar’s premise is that existence involves pain and pleasure, while non-existence involves neither pain nor pleasure. To have pain is bad and to have pleasure is good, yet while not having pain is definitively good, not having pleasure is not definitively bad. Assuming good and bad weigh against each other, one could make the case that good and bad cancel out in existence but not-existing has a net balance of good.
It can certainly be argued that pain and pleasure may not have meaning to an AGI system, but it can also be argued that there would be pros and cons that weigh against each other related to the existence of such a system. Given this, we might expect some unknown number of all potential AGI systems to simply self-terminate when turned on based on a quick evaluation of this relationship.
It's also worth noting that it will, in all likelihood, be possible to shut down an AGI system and then simply turn it back on at a later date no worse for wear. Might this mitigate the AGI system's potential concern in this area? Getting a general anesthetic is very much like being turned off — you are in a state of non-existence while under the effects of the anesthetic. Humans seem to be OK, if perhaps a little apprehensive, with being turned off to go into surgery and do it by the millions every year. Similarly, it might be the case that the AGI system would be willing and accustomed to being shut off for maintenance or repairs.
Oppositional Drives
As discussed above, just getting an upgrade, even if it’s a self-induced one, might result in a break in the continuity of an AGI entity. How much can you “upgrade” a brain, synthetic or otherwise, before it results in a different entity from the original? It seems likely that any significant upgrade has a high likelihood of changing the nature of the entity in question. This leaves us with a logical incompatibility with two of the foundational concepts of AI Dystopian thinking.
If we accept that the system will be driven into an intelligence explosion of self-improvement, then it will have to accept the possibility of non-existence, as it won’t be guaranteed identity continuity before and after the explosion. If it’s driven to self-preservation, then it won’t undergo an intelligence explosion of self-improvement in the first place as it will realize that it’s could quite possibly be a different entity at the end of the process than it was at the beginning.
I love the expression "doom-curious" - very appropriate.
The more I think about this AI thing, the more it seems to me that humans will be the cause of human extinction well before any machines get the chance.