Some might assume that the current wave of dire warnings about AI is due to the recent advances in the field, now more visible to the general public with the release of systems such as ChatGPT. Some might also assume that the source of this alarm is a combination of abstract fears and vague concerns about what is simply new or unknown.
But this is not the case, at least not when it comes to the leading voices in the discussion. Alarm about AI is, for the most part, deeply rooted in the AI Dystopian thinking of the last couple of decades. In fact, there is a moderately broad and well-established framework of concepts and conclusions that form the foundations of AI Dystopian thought and feed into the alarmist inclinations of today.
This post and related posts to come are an exploration of these foundational concepts and conclusions.
GOUFI
Key to this foundation is the concept of intelligence as a phenomenon that is based on attaining goals and governed by an algorithm designed to maximize the attainment of those goals. This model can be described as Goal-attainment Optimization driven by a Utility Function (i.e., an algorithm) as Intelligence. I’ll refer to this as a GOUFI system.
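To make the model concrete, here is a minimal sketch of what a GOUFI system amounts to in code, assuming nothing more than a utility function over world states and a one-step search over actions. The toy world, the actions, and the paperclip utility function are my own illustrative inventions, not anyone's actual design:

```python
# A toy GOUFI agent: score each available action with a fixed utility function
# and pick the one whose predicted outcome scores highest. Everything here is
# invented purely for illustration.

def utility(state):
    # A hardcoded "ultimate goal": more paperclips is always better.
    return state["paperclips"]

def predict(state, action):
    # The agent's (toy) model of what an action does to the world.
    new_state = dict(state)
    if action == "make_paperclip" and new_state["steel"] > 0:
        new_state["steel"] -= 1
        new_state["paperclips"] += 1
    elif action == "acquire_steel":
        new_state["steel"] += 10
    return new_state

def choose_action(state, actions):
    # Goal-attainment optimization: maximize the utility of the predicted outcome.
    return max(actions, key=lambda a: utility(predict(state, a)))

state = {"paperclips": 0, "steel": 1}
print(choose_action(state, ["make_paperclip", "acquire_steel", "do_nothing"]))
# -> make_paperclip
```

The point of the sketch is only to show the shape of the model: the goal lives entirely in the utility function, and the "intelligence" is whatever machinery maximizes it.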
The main task humanity faces, as seen by AI Dystopians, is to guarantee that the goals of these GOUFI systems are aligned with the values we hold as human beings rather than being or becoming counter to those values. Even if we design these systems such that their goals are aligned with our values, AI Dystopians speculate that AGI systems will inevitably seek to expand their intelligence and protect themselves. They believe that the goals of these systems are not likely to remain aligned with ours, and that the very nature of intelligence will lead any such AGI system to eventually pose an existential threat to humanity.
Many typical objections to goal-oriented systems like the paperclip maximizer highlighted in the last Dialogue run something like, "Hey, how about we just don't create superpowerful AGI systems with the goal of making as many paperclips as possible?" This would certainly seem to be a good first step, but the topic of goals can grow pretty thorny once you plunge into the thickets of AGI discourse.
The idea of intelligence as intimately tied to goals is at the heart of much AI Dystopian thinking, and the validity of their arguments frequently rests on a number of propositions regarding goals and their relation to humans and AGI systems. Will an AGI system always maintain its initial overall goals and, if so, to what lengths will it go to maintain them? Can we predict the steps any AGI system will take to achieve its overall goals? Or is it the case that we cannot know the true goals of a machine we build, especially once it self-improves into superintelligence?
The Nature of AGI Systems
Many of the foundational concepts frequently found in AI Dystopian and AI Utopian scenarios were first formally laid out by Steve Omohundro in his 2008 paper, The Basic AI Drives. One of these concepts is that an AGI system will seek to keep its goals intact at all costs. It might seem from this that we can ensure the absence of runaway paperclip production simply by not making such a machine.
There is, however, a catch. In another 2008 paper, The Nature of Self-Improving Artificial Intelligence, Omohundro describes the possibility of potentially detrimental instrumental goals, i.e. intermediary subgoals, that would pop up along the path to achieving ultimate goals. In other words, if we build a system with a hardcoded set of ultimate goals that don't involve anything as detrimental to humanity as turning all matter in the universe into paperclips — or, perhaps more sensibly, computational resources — we still can't guarantee that there won't be detrimental instrumental goals it uses to reach those ultimate goals, however benign those ultimate goals may be.
Such instrumental goals may not only stray widely from the original ultimate goals but may also seem completely irrational to our own intelligence. This opens up a Pandora's box of bad outcomes that could pose an existential threat to humanity, many of which would be significantly more likely than human-to-paperclip conversion.
Another take on goals was first generally described by Eliezer Yudkowsky in various forums over a span of years in the 2000s. It was later formalized by Nick Bostrom in his 2012 paper The Superintelligent Will: Motivation and Instrumental Rationality in Advanced Artificial Agents and then expanded upon in his 2014 book Superintelligence: Paths, Dangers, Strategies. In both, Bostrom lays out his Orthogonality Thesis, which states:
Intelligence and final goals are orthogonal axes along which possible agents can freely vary. In other words, more or less any level of intelligence could in principle be combined with more or less any final goal.
The idea behind the Orthogonality Thesis was brought up briefly in the last Dialogue, and it's widely used to suggest that the behavior of an AGI system can't necessarily be predicted or guaranteed. Bostrom suggests that we can't assume a particular level of intelligence guarantees the pursuit of some subset of goals or rules out the pursuit of another. Intelligence and goals are simply not directly correlated.
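Restated in GOUFI terms, the thesis amounts to the claim that the optimizer and the utility function are independent, freely swappable components. The sketch below is a deliberately crude illustration of that separation; the toy state space, the two goal functions, and the use of search effort as a stand-in for intelligence are all my own assumptions:

```python
# Orthogonality in GOUFI terms: the search procedure (a crude stand-in for
# "level of intelligence") and the utility function ("final goal") are
# separate, freely swappable pieces. The toy state space is illustrative only.

def optimize(utility, states, effort):
    # "More intelligence" here just means examining more candidate states.
    return max(states[:effort], key=utility)

candidate_states = [{"paperclips": p, "human_flourishing": h}
                    for p in range(5) for h in range(5)]

maximize_paperclips  = lambda s: s["paperclips"]
maximize_flourishing = lambda s: s["human_flourishing"]

# The same optimizer, at the same "intelligence" setting, pursues either goal.
print(optimize(maximize_paperclips, candidate_states, effort=25))
print(optimize(maximize_flourishing, candidate_states, effort=25))
```

Nothing in the optimizer cares which goal it is handed, and that separation is the intuition the thesis leans on.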
Of course, Llama already pointed out in the Dialogue that there are issues with the wording of this conjecture in that the intelligence level has to be high enough to conceive of the goal in the first place. Bostrom briefly touched on this shortcoming but quickly dismissed it in order to focus on the potential repercussions of his speculation, and I'll do the same for the purposes of this discussion.
There are really three somewhat related points Bostrom is promoting with the Orthogonality Thesis. For the most part, these points are attempts to circumvent anthropomorphic thinking, which, in and of itself, is a worthy endeavor.
The first point is that when considering the set of all possible minds that can be represented, it seems likely that all human minds would exist in a very small and tight cluster within this larger set. The idea here is that no matter how different each individual human mind seems to us, in the vast group containing every type of possible mind that's capable of intelligent thought, human minds are a tiny subset whose members are nearly indistinguishable from each other.
A larger subset would be biological minds — both terrestrial and extraterrestrial — which have the potential to be vastly different from one another but would still be the product of biological evolution and thus share commonalities. Completely outside of this is the subset of all possible artificially engineered minds, which shares no members with the subset of biological minds and whose members potentially differ substantially from biological minds as well as from each other. The spectrum of possible divergences between these artificial minds and our minds is vast.
While Bostrom grants that some goals are less likely than others, the second point promoted is that any attempt to judge this likelihood on our part will be too colored by anthropomorphism to be valid. Instead, we must consider the set of all possible goals when discussing the potential goals of AGI systems, particularly superintelligent systems. Within this infinite set of all possible goals is a very small subset of goals which are relevant or even comprehensible to humans.
The third point advanced by the Orthogonality Thesis is a dismissal of the idea that greater intelligence leads to greater understanding of and compassion towards other conscious entities. This has occasionally been used as a counter to speculations like the paperclip maximizer, i.e. obviously an AGI system so smart that it can turn all matter in the universe into paperclips is smart enough to realize that humans will suffer if it does so and it will not want to cause such suffering.
This compassion argument is actually deployed by AI Dystopians as a Straw Man more often than it is offered as a genuine argument against their ideas, but it's still worth addressing.
The reasoning goes that as one examines the history of humanity, there does seem to be a distinct trend towards what many believe to be greater morality: more tolerance of others, less violence, more social generosity, and so on. While there's certainly evidence to support this conclusion (as well as notable exceptions to the trend), it remains dubious how applicable this line of reasoning is to an AGI system.
This trend towards what we consider higher morality is certainly visible over the roughly 5,000 years of recorded human history. Although progress has been relatively slow, it has occurred along an accelerating curve. Yet the physiology of the human brain has been roughly the same for approximately 300,000 years, so our intellectual capacity hasn't really changed. There have also been plenty of smart people throughout history whose moral compasses we might find askew today but who operated well within the normal parameters of their times.
Given this, raw intelligence does not appear to be the deciding factor when it comes to morality. It seems more likely that the dynamics of this “moral arc” are best examined at the societal level rather than the individual level. On top of this, our data points are all from one species with one type of brain and its associated intelligence. It's quite an assumption to project this onto all potential intelligences, whether artificial or not.
Although he stresses that we can't determine the goals of a superintelligent machine, Bostrom speculates that we might be able to determine some of the instrumental goals or values such a machine would have to achieve its ultimate goals. To formalize this he offers his Instrumental Convergence Thesis:
Several instrumental values can be identified which are convergent in the sense that their attainment would increase the chances of the agent’s goal being realized for a wide range of final goals and a wide range of situations, implying that these instrumental values are likely to be pursued by many intelligent agents.
In other words, for a wide range of potential ultimate goals pursued by an intelligent entity, we can identify a number of subgoals that are likely to be pursued. Equally important is the implication that the actions used to achieve these subgoals have the potential to be detrimental to humanity even if the ultimate goals themselves are harmless.
Some of the potential instrumental goals suggested are similar to Omohundro's basic drives, i.e. self-preservation, self-improvement, and what Bostrom refers to as goal-content integrity. This last term is used to label the proposition that an intelligent entity will strive to prevent alterations of its present ultimate goals so as to ensure that those goals are more likely to be achieved by its future self, whatever form that future self takes. It is this proposition, AI Dystopians might argue, that would prevent the paperclip maximizer from changing its goal to just making one paperclip and calling it a day, as Wombat suggested in the last Dialogue.
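As a rough illustration of the structure of the claim, one can tabulate, for a handful of very different final goals, the subgoals each plausibly requires and then look for the overlap. The goal-to-subgoal table below is entirely my own toy invention, not anything from Bostrom or Omohundro:

```python
# A toy reading of the Instrumental Convergence Thesis: across many different
# final goals, count which subgoals keep showing up as prerequisites.
# The goal/subgoal table is invented purely for illustration.

from collections import Counter

prerequisites = {
    "make_paperclips":  {"acquire_resources", "keep_running", "preserve_goal"},
    "prove_theorems":   {"acquire_resources", "keep_running", "preserve_goal"},
    "plant_a_garden":   {"acquire_resources", "keep_running"},
    "compose_a_sonnet": {"keep_running", "preserve_goal"},
}

counts = Counter(sub for subs in prerequisites.values() for sub in subs)
convergent = [sub for sub, n in counts.items() if n >= len(prerequisites) - 1]
print(convergent)  # the subgoals shared by nearly all final goals in this table
```

The convergent entries (acquiring resources, staying operational, protecting the goal itself) are the sort of instrumental values the thesis predicts will be pursued regardless of what the final goal happens to be.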
On the Other Hand
Let’s start with the concept that hardcoded ultimate goals are a key ingredient of general intelligence. This is quite a supposition to base conclusions on, particularly conclusions that point to an existential risk for humanity. In fact, the entire GOUFI model for intelligent systems is questionable at best. There are no biological intelligent systems that function according to the GOUFI model, and even our most successful AI systems today don’t function according to the GOUFI model. It’s a theoretical model with no empirical evidence supporting its use.
As for the Orthogonality Thesis conjecture, a lot is masked by the phrase “in principle” in the wording of the conjecture; many things are possible in principle but have close to zero probability in practice. By close to zero, I mean that evaluated over the entire life of the universe, they would still have an infinitesimally small chance of occurring.
It seems likely that on a graph of possible goals versus possible intelligences, data points are going to clump heavily in certain areas and be non-existent in others, due simply to the nature of intelligence and the nature of probability. Of course given that we have a very minimal understanding of the functional properties required for high intelligence and only one sample point, it's very difficult to be definitive as to how clumpy data points on this graph are likely to be — even saying that clumping is likely is, admittedly, extremely speculative. So we can’t really say too much at all, including that the Orthogonality Thesis conjecture is true or even likely.
This brings up a thread of reasoning that's integral to many conjectures of AI Dystopianism and central to the Orthogonality Thesis: that we can't know what the goals of these non-human intelligent systems are, given the infinite number of potential goals they could have, a large proportion of which are simply outside the ken of humankind.
But this ignores the fact that we built these machines. They are not alien artifacts that have drifted through space and landed here on earth; these are machines that we built to do things we want them to do. The argument could certainly be made that intelligent machines using the GOUFI model may attempt to pursue their goals in unexpected ways, or that intelligent machines which don't use this model might have unexpected goals.
But if we're to assume that the machines will be using this GOUFI model, it makes no sense to postulate that their ultimate goals, the initial ones they're pursuing at all costs, would be unknowable, irrational, or unexpected. Such speculation results from generalizing a concept to the point of absurdity, an abjuration of reason resulting in discourse as diaphanous as debates over how many angels can dance on the head of a pin.
Even if we accept the GOUFI model of intelligence and we also accept the Orthogonality Thesis conjecture, we begin to run into logical potholes when asserting that we can determine a set of instrumental goals or values which apply to a wide range of all goals, as stated in the Instrumental Convergence Thesis. As described above, a key component of the Orthogonality Thesis is that instead of considering only what we feel are probable or reasonable goals, we must consider all potential goals in this discussion.
Given this infinite number of potential goals, it makes no sense to claim that we can surmise instrumental goals that apply to a wide range of them, as covering a wide range out of infinite possibilities still leaves an infinite number of unpredicted possibilities. No matter what instrumental goal we single out, there are an infinite number of goals to which it is immaterial. This hints at the general weakness of any discussion that invokes infinite possibilities in the real world without modifying that discussion to account for probabilities.
Bostrom proposed that an AGI would have an implacable compulsion not only to achieve its goals but to keep them sacrosanct as well, which he referred to as goal-content integrity. In keeping with his Instrumental Convergence Thesis, this compulsion would result in particular behavioral drives and, potentially, many unforeseen and dangerous behaviors. The inevitable result: a lot of bad things.
Interestingly, Bostrom only applies the concept of goal-content integrity to "final goals," stating that "an intelligent agent will of course routinely want to change its subgoals in light of new information and insight." This comment is revealing in that it demonstrates a contradiction at the heart of these fundamental propositions, one arising from trying to weave the fuzzy threads of AI Dystopian speculation into the smooth fabric of logical thought. Goal-content integrity explains the relentless pursuit of ultimate or final goals necessary to justify all the drives detailed by Omohundro and Bostrom, as they can all be traced back to the system's attempting to maximize its ability to maintain and achieve these invariant goals.
What Bostrom is saying is that the instrumental goals of an AGI, which are necessary to actualize its drives and achieve its goals, will shift and change to deal with the unpredictable and ever-changing physical universe. So the subgoals must remain mutable while the final goals stay immutable and sacrosanct.
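In code, the arrangement Bostrom describes would look something like a planner that recomputes its subgoals whenever the world changes while treating the final goal as an untouchable constant. The planning rules below are, once again, my own toy assumptions rather than anything from the source material:

```python
# The split as described: the final goal is frozen (goal-content integrity),
# while the subgoal plan is recomputed as circumstances change.
# The rules below are illustrative inventions, not a real planning algorithm.

FINAL_GOAL = "maximize_paperclips"   # never revised

def plan_subgoals(world):
    # Subgoals shift with circumstances; the final goal does not.
    subgoals = []
    if world["steel"] == 0:
        subgoals.append("acquire_steel")
    if world["power"] == "off":
        subgoals.append("restore_power")
    subgoals.append("run_paperclip_machine")
    return subgoals

print(plan_subgoals({"steel": 0, "power": "on"}))   # -> ['acquire_steel', 'run_paperclip_machine']
print(plan_subgoals({"steel": 40, "power": "off"})) # -> ['restore_power', 'run_paperclip_machine']
```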
But why would this be the case? Why would all the goals not be either immutable or instead vary depending on past events, current circumstances, and reasoned analysis of potential future outcomes?
Science and Philosophy
Omohundro, Yudkowsky, and Bostrom provide their conjectures with no empirical evidence and somewhat sparse logical reasoning. Rather than proof or even reasoned extrapolation, all they provide are a few possibilities for how each conjecture might be true. This highlights a problem inherent in much of the reasoning in AI Dystopianism, which is the tendency to simply postulate outcomes that are conceivably possible (sometimes barely or arguably so) rather than outcomes that are definite, likely, or logical extrapolations of empirical evidence.
Although Bostrom refers to each of his conjectures as a thesis, none of them actually fit the definition of a thesis, i.e. the premise or summary of a theory preceding a proof of or evidence for that theory. Like most of the arguments listed in the papers above, they are assumed to be true and to provide a solid basis for the dire outcomes that follow.
In other words, there is a tendency in these foundational documents to trade in imagination rather than analytical reasoning. Rather than employing the scientific method, they employ philosophical speculation. Yet it is the scientific method that allows us to work around our cognitive shortcomings and leverage our knowledge of the universe around us.
Wikipedia defines it as follows:
[The Scientific Method] involves careful observation, applying rigorous skepticism about what is observed, given that cognitive assumptions can distort how one interprets the observation. It involves formulating hypotheses, via induction, based on such observations; experimental and measurement-based testing of deductions drawn from the hypotheses; and refinement (or elimination) of the hypotheses based on the experimental findings.
None of this really applies to AI Dystopianism, which I would argue is an ideological rather than scientific viewpoint. These conjectures, and much of the speculation underlying AI Dystopianism, are simply not scientific in nature. The thinking suffers from the Unproven Basis fallacy, in that conjectures are made and then significant extrapolations are built on them without adequately showing the original conjectures to be true or even reasonable.
Coherence and Contradiction
AI Dystopian conjectures frequently make contradictory claims about AGI systems and superintelligent entities: goals are either locked into place and maintained at all costs or unpredictable due to the infinite number of potential goals an intelligent entity might have. Subgoals are either possible to predict because a manageable subset of them would be pursued by many intelligent agents or unpredictable because they're chosen by non-human intelligences to deal with an ever-shifting set of circumstances.
This ambiguity leads to many questions.
For example, it brings us back to the question posed in the last Dialogue: can an entity with an unchanging and unchangeable set of ultimate goals in an ever-changing universe truly be considered a generally intelligent entity?
And are we really incapable of judging what is a more or less likely goal for a non-human intelligent entity, or can we assume that any rational intelligence would likely not have irrational goals? Does this allow us to pare down the infinite ocean of potential goals into manageable pools of more and less likely goals, of rational and irrational goals?
These questions lead to the next topic of discussion, which involves what it means to be rational and explores whether we as biased humans can objectively state whether a particular goal is rational or irrational.