There have been a couple of news items in the last few weeks that highlight some of the issues brought up in recent posts. They also illustrate some of the problems with assuming that LLMs are close to AGI, and that AGI is close to becoming a reality. Sometimes the last 20% mentioned in my previous post is really, really hard, and sometimes it’s impossible to achieve on the path set by the first 80%.
The All-Powerful Q* Continuum at OpenAI
The first news item was the revelation that OpenAI was developing a mysterious and possibly dangerous new technology in a project they called Q*.
According to Reuters:
Ahead of OpenAI CEO Sam Altman’s four days in exile, several staff researchers wrote a letter to the board of directors warning of a powerful artificial intelligence discovery that they said could threaten humanity, two people familiar with the matter told Reuters.
Some at OpenAI believe Q* (pronounced Q-Star) could be a breakthrough in the startup's search for what's known as artificial general intelligence (AGI), one of the people told Reuters.
And, of course, rampant alarmed speculation ensued.
Multimodal Magic From Google
The second news item was the announcement and release by Google of a new version of their Bard LLM system dubbed Gemini. Along with the public release of the system was the release of many videos describing and demonstrating it. One particularly impressive video was a demonstration of some of the system’s multimodal AI capabilities:
These news items generated significant hype, both positive and negative. They have stoked the embers of expectation that AGI is just around the corner and nearly within our grasp.
This is not true. In my opinion, it’s not even close to being true.
Q*
Q* seems to be a new approach to solving grade school math problems better than current LLM systems can. These problems are harder for current AI systems than one might think, especially given the impressive things those same systems can accomplish that seem much more complicated.
While current LLM systems are able to do math problems to some degree, their performance is very uneven. There has been speculation that if this Q* system can do these kinds of math problems reliably, then it would have the ability to reason and plan in ways that have so far been out of reach for other AI systems.
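To see why arithmetic sits awkwardly with systems built on pattern completion, here’s a deliberately crude Python sketch. It is purely illustrative and is not how Q* or any real LLM actually works: a “model” that has memorized a handful of addition examples answers the problems it has seen and produces a plausible guess for the rest, while direct computation is right every time.

```python
import random

# Purely illustrative -- not how Q* or any real LLM works. A "model" that has
# memorized a handful of addition examples answers the problems it has seen
# and produces a plausible guess for the rest, while direct computation is
# right every time.

memorized = {(2, 2): 4, (7, 5): 12, (13, 8): 21}  # the "training data"

def pattern_based_add(a, b):
    """Answer from memorized examples; otherwise guess near a seen example."""
    if (a, b) in memorized:
        return memorized[(a, b)]
    nearest = min(memorized, key=lambda k: abs(k[0] - a) + abs(k[1] - b))
    return memorized[nearest] + random.choice([-2, -1, 0, 1, 2])

def exact_add(a, b):
    """The way a calculator (or a person following the algorithm) does it."""
    return a + b

for a, b in [(2, 2), (13, 8), (47, 36)]:
    print(f"{a} + {b}: pattern-based = {pattern_based_add(a, b)}, "
          f"exact = {exact_add(a, b)}")
```

The toy version is right whenever the problem resembles something it has stored and unreliable everywhere else, which is the flavor of unevenness people report with LLM arithmetic.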
There was also speculation that Sam Altman was fired from OpenAI because he was pushing this dangerous technology and the board found this to be reckless. This is possible but seems unlikely, as more information has come out regarding OpenAI and its internal political discord. So far, no concrete details about Q* have been released.
Gemini
After Google released its impressive Gemini demo, it was revealed that the demo was not quite what it seemed. As can be seen in the video above, it appeared to show Gemini interacting with a human in real time and being able to answer questions and make observations involving images, speech, and objects offered by the human.
Unfortunately, that is not the case. The demo was created by first presenting still images and text prompts to Gemini and gathering the text responses. Then a live presentation was recorded with a human presenting images, video, objects, and interacting through spoken questions. Spoken versions of Gemini’s already generated responses were then edited in. Some of this process is detailed in a Google post that wasn’t quite as easy to find as the splashy video.
Are We Closer to AGI?
There’s not a lot of reason to assume that being able to do better grade school math is a direct path to full-fledged AGI. It’s pretty obvious that the way people do math and the way LLM systems (and other machine learning systems) do math is quite different.
What the speculation about LLMs and AGI boils down to is this: is LLM technology a stepping stone to full AGI or is it something else entirely? If it is a stepping stone, then it seems reasonable that adding some additional technology to LLM systems will get them ever closer to AGI and at some point likely achieve it.
If, however, LLM technology is simply something that shows similarities to human intelligence but is not related to it functionally, then simply improving LLMs is unlikely to get us to full AGI.
A useful analogy might be the relationship of gliders and hot air balloons to bird flight. If we’re trying to replicate the ability to fly demonstrated by birds, we might invent a glider or we might invent a hot air balloon. If we invent a glider, we’re actually using some of the physical principles that birds use to fly. We need to do a lot more work to get that glider close to the capabilities of birds, but we’re part of the way there.
If instead we invent a hot air balloon, we’ve also created something that can fly through the air. However, it does so using entirely different physical principles than birds. No matter how we improve the hot air balloon, its functionality and the physical principles it employs are not related to those that birds use to fly. It will never be able to do the things that birds can do.
LLMs are either gliders or hot air balloons. From what I’ve seen so far, I’m inclined towards the latter.
What the Turing Test Tests
Alan Turing was a titan of computer science in its early days, and his contributions to the field form a significant chunk of the foundation underlying modern computation theory. Turing was very interested in the possibility of using a computer to mimic the processes of the brain and wrote a 1950 paper in which he posed the question, "Can machines think?"
In that paper he suggested a revised version of what he called the Imitation Game to judge whether a machine was actually thinking. In the original game, a man and a woman are concealed from a judge and communicate only through the written (or preferably typewritten) word. The idea is that the judge asks questions of each and eventually surmises which is the man and which is the woman, with the added hitch that the man tries to fool the judge while the woman tries her best to help the judge choose correctly.
Turing proposed replacing one of the contestants with a digital computer and then having the judge attempt to discern which was the human and which was the machine. Whether Turing truly thought this would be a definitive way of determining that humanity had created AGI is hard to know. However, it is pretty clear after the many years of cognitive science research since Turing first proposed it that the test has perhaps more to say about human minds than machine minds.
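Stripped to its skeleton, the protocol looks something like the following rough sketch. The respondent functions here are hypothetical stand-ins; in a real test one would be a person and the other a machine, both hidden behind the same text-only channel.

```python
import random

# A rough sketch of the imitation game's structure. The respondent functions
# are hypothetical stand-ins: in a real test one is a person and one a
# machine, both hidden behind the same text-only channel.

def human_respondent(question: str) -> str:
    return "I'd have to think about that for a moment."

def machine_respondent(question: str) -> str:
    return "I'd have to think about that for a moment."

def run_imitation_game(questions, judge_guess):
    # Randomly assign the hidden contestants to the anonymous labels A and B.
    contestants = [human_respondent, machine_respondent]
    random.shuffle(contestants)
    respondents = dict(zip("AB", contestants))

    transcript = [(q, {label: fn(q) for label, fn in respondents.items()})
                  for q in questions]

    guess = judge_guess(transcript)  # the label the judge thinks is the machine
    actual = next(label for label, fn in respondents.items()
                  if fn is machine_respondent)
    return guess == actual

# A judge who can't tell the answers apart does no better than a coin flip.
caught = run_imitation_game(
    ["What do you do on a rainy Sunday?"],
    judge_guess=lambda transcript: random.choice(["A", "B"]),
)
print("Judge identified the machine:", caught)
```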
The main problem is that what’s tested is not the intelligence of the machine, but rather whether it can fool a human into thinking it’s intelligent. Fooling humans is not as hard as one might hope. As I discussed in this post, people have a long history of being fooled by systems that don’t actually display any real level of intelligence at all.
On the Cusp of AGI
Back in the late 1960s, a computer scientist named Terry Winograd wrote a very impressive AI program that understood a small subset of natural language. This program was called SHRDLU, and at the time it was hailed as a breakthrough that possibly signified true AGI was right around the corner.
SHRDLU could understand English sentences that directed it to do things in its very limited block world. Here’s a brief example of interacting with SHRDLU:
Person: Pick up a big red block.
Computer: OK.
Person: Grasp the pyramid.
Computer: I DON'T UNDERSTAND WHICH PYRAMID YOU MEAN.
Person (changing their mind): Find a block which is taller than the one you are holding and put it into the box.
Computer: BY "IT", I ASSUME YOU MEAN THE BLOCK WHICH IS TALLER THAN THE ONE I AM HOLDING.
Computer: OK.
Person: What does the box contain?
Computer: THE BLUE PYRAMID AND THE BLUE BLOCK.
Person: What is the pyramid supported by?
Computer: THE BOX.
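Part of what made this tractable is that the entire world SHRDLU had to talk about was tiny and fully enumerated. Here’s a toy sketch of that idea over a made-up micro-world; it is nothing like SHRDLU’s actual parser and planner, which were far more sophisticated, but it shows why a small, fully known domain makes “understanding” commands manageable.

```python
import re

# Illustrative only -- nothing like SHRDLU's real parser and planner. The
# point is that every object, location, and action the language can refer to
# is known ahead of time, in a made-up micro-world.

world = {
    "big red block": "table",
    "blue pyramid": "box",
    "blue block": "box",
    "small pyramid": "table",
}

def handle(command: str) -> str:
    words = set(re.findall(r"[a-z]+", command.lower()))

    if {"pick", "up"} <= words or "grasp" in words:
        exact = [obj for obj in world if set(obj.split()) <= words]
        if len(exact) == 1:
            return "OK."
        partial = [obj for obj in world if words & set(obj.split())]
        if partial:
            return f"I DON'T UNDERSTAND WHICH {' OR '.join(partial).upper()} YOU MEAN."
        return "I DON'T SEE THAT OBJECT."

    if "contain" in words:
        place = "box" if "box" in words else "table"
        inside = [obj for obj in world if world[obj] == place]
        return " AND ".join("THE " + obj.upper() for obj in inside) + "."

    return "I DON'T UNDERSTAND."

print(handle("Pick up the big red block."))  # OK.
print(handle("Grasp the pyramid."))          # asks which pyramid is meant
print(handle("What does the box contain?"))  # THE BLUE PYRAMID AND THE BLUE BLOCK.
```

Everything the program can “understand” is already baked into that little dictionary, which is precisely why the trick doesn’t carry over to the open-ended world outside it.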
As impressive as it was, it soon became apparent that its expertise in this very confined domain didn’t translate into anything very useful outside of that domain. Despite this early lesson in the limits of narrow domains, many in today’s AI hype cycle are quick to point to any advance in a relatively narrow area as an advance towards artificial general intelligence.
Behind the Curtain
There are things that LLMs do that seem very similar to what humans do. They work using pattern matching and statistical analysis with a little randomness thrown in. All three of these characteristics are also characteristics of human intelligence.
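That loop is simple enough to caricature. Here’s a toy sketch in which the probability table is invented for illustration; a real model’s machinery is vastly larger, but the generation loop has roughly this shape: look up which words tend to follow the current context, then sample from that distribution with a little randomness mixed in.

```python
import random

# A minimal sketch of statistical next-word prediction with a little
# randomness. The probability table is invented for illustration; a real LLM
# computes these distributions with a huge neural network over a vast
# vocabulary, but the generation loop has roughly this shape.

next_token_probs = {
    ("the", "cat"): {"sat": 0.6, "slept": 0.3, "flew": 0.1},
    ("cat", "sat"): {"on": 0.8, "quietly": 0.2},
    ("sat", "on"):  {"the": 0.9, "a": 0.1},
    ("on", "the"):  {"mat": 0.7, "roof": 0.3},
}

def generate(context, steps=4, temperature=1.0):
    tokens = list(context)
    for _ in range(steps):
        probs = next_token_probs.get(tuple(tokens[-2:]))
        if probs is None:
            break
        # Temperature reshapes the distribution: lower means more predictable.
        weights = [p ** (1.0 / temperature) for p in probs.values()]
        tokens.append(random.choices(list(probs), weights=weights, k=1)[0])
    return " ".join(tokens)

print(generate(["the", "cat"], temperature=0.7))
print(generate(["the", "cat"], temperature=1.5))  # more randomness, more variety
```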
People are very good at recognizing patterns and associations between things with very little input. Sometimes this leads to the problem of recognizing patterns and associations even when they don’t exist, which has led to many of the cognitive biases appearing in this blog’s glossary.
People are also able to analyze sensory inputs and predict the probability of one thing’s relation to another. That can be the relation of a past event to a future event, the meaning of a homophone when spoken based on the words around it, or the image formed by a lot of colored dots on a monitor.
Beyond this, though, humans have something that no AI systems have. Current AI systems and future systems based on the same technology may be able to pass a Turing Test given the right judges, but they’re unlikely to pass such a test given savvy judges. To those familiar with how these systems work and the constraints that limit them, it becomes obvious fairly quickly that there is no wizard behind the curtain, just a very advanced calculator.
There’s still quite a wide gap between what humans can do and what any AI system can do. What allows humans to be intelligent in domain after domain is not an added feature — it is most likely the core mechanism that underlies human-level intelligence. By all current evidence, human intelligence is not simply the sum of narrow intelligence in lots of different areas. It is something more, something that remains elusive, something that despite all our success in machine learning still seems tantalizingly beyond our grasp.