Cutting-edge AI language software is more powerful than ever, and what it can do today might surprise you. You have probably heard of Siri, Alexa, and Google Assistant. These AIs can be helpful, but they are nowhere near the cutting edge of computer language generation. Some of the best language AIs (also known as models) today can respond so well to queries that most people assume there is a human at the other end typing the answers. The newest kind, called large language models (LLMs), such as OpenAI's GPT-3, GPT-4, and ChatGPT, are simply astounding in terms of what they can do. However, even the best of them still have a way to go before achieving full human-level performance.
The limitations of these
AI language models are very similar to those of our brain's language area.
Isolated from the rest of the brain, our brain's language area could
potentially still produce speech, but it would be rote, mechanical, and
unplanned, much like the language produced by contemporary AI language models.
But if this language software were integrated into a larger network with a form
of working memory, as our brain's language area is, it could potentially
produce much more coherent language.
Our brain's language
region is called Broca's area and helps us with the fine details of piecing
together our sentences. These are mostly the tedious details that we can't be
bothered by. We are unconscious of most of the work Broca's area performs, but
we wouldn't be able to speak without it. It does many things, including helping
us find words that are on the tip of our tongue. Broca's area could at least
keep us talking if the more generalist brain areas like the prefrontal cortex
(involved in both consciousness and larger overarching language concerns) were
gone. The speech generated from Broca's alone might sound grammatically correct
and have proper syntax, but the semantics would have issues. We see this in
people with prefrontal injuries today. They can talk indefinitely, and at
first, what they are saying might sound normal, but it doesn't take long to
realize that it is missing some intellectual depth.
As you will learn here,
modern AI language software is also missing depth because it does not plan its
speech. These systems literally put their sentences together one word at a
time. Given the series of words that have come so far, they predict which word
is most likely to come next. Then they add their prediction to the list of
words that have come before and use this new list to make the following prediction.
They repeat this process to string together sentences and paragraphs. In other
words, they have no forethought or real intention. You could say that modern AI
language generation systems "think" like someone who has a serious
brain injury or who has been lobotomized. They formulate speech like the TV
show character Michael Scott from The Office. Here is a telling quote from Michael: "Sometimes I'll start a sentence, and I don't even know where it's going. I just hope I find it along the way. Like an improv conversation. An improversation."
Michael is a parody. He is a manager with an attention deficit who has no real plan and does everything on the fly. As you can see on the show, his work doesn't lead to productivity, economic or otherwise. We need AI that is structured to do better.
The question becomes,
how can we create an AI system that does more than just predict the next best
word, one at a time? We want a system that plans ahead, formulating the gist of
what it wants to say in its imagination before choosing the specific words
needed to communicate it. As will be discussed, that will never emerge by using
the same machine learning architectures we have been using. Additional
architectural modifications are needed.
This entry will advocate taking the current neural network language architecture (the transformer neural network model introduced in 2017) and attaching it to a larger, generalist system. This larger
system will allow word choice to be affected by a large number of varied
constraints (not just the words that came earlier in the sentence). The diagram
below shows a multimodal neural network made up of several different
interfacing networks. You can see the language network on the far right, in the
center, attached directly to a speaker.
It would be a tremendous
engineering feat to get multiple neural networks to interface and work
together, as depicted above. But research with neural networks has shown that
they are often interoperable and can quickly adapt to each other and learn to
work cooperatively. AI pioneer Marvin Minsky described the human mind as a “society of mind.” By that he meant that our brain is made up of different modular networks, each contributing to a larger whole. I don’t think AI will truly become
intelligent unless it works in this way. Before we go any further, the next
section will briefly explain how neural networks (the most popular form of
machine learning in AI) work.
How Neural Networks Choose Their Words
I explain how neural
networks work in detail in my last entry, which you can read here.
A quick recap: Most
forms of machine learning, including neural networks, are systems with a large
number of interlinked neuron-like nodes. These are represented by the circles in the
diagram below. The nodes are connected via weighted links. These links are
represented as the lines connecting the circles. The links are considered
“weighted” because each has a numerical strength that is subject to change
during learning. As the AI software is exposed to inputs, those inputs flow
through the system (from left to right), travelling from node to node until
they reach the output layer on the far right. That output layer contains a node
for every word in the dictionary. Whichever node in the output layer is
activated the most will become the network’s next chosen word.
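The flow just described can be illustrated with a minimal forward pass. The three-word vocabulary and the weight values below are invented for illustration; a real network has millions of nodes and learned weights:

```python
# Minimal forward pass: input activations flow through weighted links to
# an output layer with one node per vocabulary word; the most activated
# output node becomes the chosen word. All values here are invented.
VOCAB = ["cat", "dog", "fish"]

WEIGHTS = [      # one row of link weights per input node
    [0.9, 0.1, 0.0],
    [0.8, 0.3, 0.2],
]

def forward(inputs):
    """Propagate activations to the output layer and return the winner."""
    activations = [
        sum(inputs[i] * WEIGHTS[i][j] for i in range(len(inputs)))
        for j in range(len(VOCAB))
    ]
    best = max(range(len(VOCAB)), key=lambda j: activations[j])
    return VOCAB[best]
```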
Language-generating neural networks are exposed to millions of sentences, typically from books or articles. The system can learn from what it reads because it adapts to it. The weights' values are strengthened when the network correctly guesses the next word in a sentence it is reading, and weakened when it chooses any other word. Given a broad attention span and billions of training examples, these networks can get very good at internalizing the structure of the English language and piecing sentences together word by word.
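The learning rule just described, strengthen the links that led to a correct guess and weaken the others, can be sketched as follows. This is a crude stand-in for the gradient descent actually used to train these networks, and the learning rate is made up:

```python
LEARNING_RATE = 0.1  # invented value; real training tunes this carefully

def update_weights(weights, inputs, target_index):
    """Strengthen links into the correct output node, weaken the rest.

    A crude stand-in for the gradient descent used to train real
    language models on next-word prediction.
    """
    for i, x in enumerate(inputs):
        for j in range(len(weights[i])):
            if j == target_index:
                weights[i][j] += LEARNING_RATE * x        # reward the right word
            else:
                weights[i][j] -= 0.5 * LEARNING_RATE * x  # penalize the others
    return weights
```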
The diagram below shows
the words “this is my pet…” being fed into an AI neural network. The final word
“cat” is hidden from the network. As “this is my pet” passes through the
network from left to right, the words travel from node (circle) to node,
through the weighted links, toward the full dictionary of nodes at the output
layer. The pattern of inputs caused a pattern of network activity that then
selected a single output. You can see the network converging on the word “cat”
as its best prediction. It got it right! This network could then continue,
rapidly writing sentences in this way for as long as you tell it to.
Broca's area in your
brain works similarly. It takes inputs from many different systems in the
brain, especially from the system that recognizes spoken language. These inputs
activate select neurons out of a network of millions of them. The activation
energy in the form of neural impulses travels through the network and toward
something analogous to an output layer. This happens every time you speak a
word. In both the brain and AI software, inputs work their way through a system
of nodes toward an existing output. That output represents the decision, and in
this case, it's the next word.
AI that Generates Language
Broca's area is a patch
of cortical tissue in the frontal lobe designed by evolution to have all the
right inputs, outputs, and internal connectivity to guide the involuntary,
routinized aspects involved in speech production. Neuroscientists still don't know
much about how it works, and reverse engineering it completely would be nearly
impossible with today's technology. Lucky for us, AI probably doesn't need an
exact equivalent of Broca's to develop the gift of gab. In fact, it may already
have something even better.
There are many
state-of-the-art natural language systems we could discuss, but here we will
focus on one called GPT-3 (which arrived in May 2020). It is an exciting new AI project that has proven to
be highly adept at natural language processing (NLP). It can answer questions,
write computer code, summarize long texts, and even write its own essays. However,
keep in mind that as discussed above, it has no plan. The next word it chooses
is just the word that it "predicts" should come next. This is called
"next word prediction."
You can feed it the
first two sentences of a news article, and it will write the rest of the
article convincingly. You can ask it to write a poem in a certain author's
style, and its output may be indistinguishable from an actual poem by that
author. In fact, one blogger created a blog where they only posted GPT-3 text
as entries. The entries were so good that readers were convinced they were written by a human and started subscribing to the blog. Here is an example of a news
article that it wrote:
Traditionally AI does poorly with common
sense, but many of GPT-3’s responses are highly logical. I want to urge you to
use an online search to find out more about the fantastic things that it can
do. However, keep in mind that it sometimes makes fundamental mistakes that a
human would never make. For example, it can say absurd things, completely lose
coherence over long passages, and insert non-sequiturs and even falsehoods.
Also, as rational as its replies may seem, GPT-3 has no understanding of the
language it creates, and it is certainly not conscious in any way. This becomes
clear from its responses to nonsense:
I don’t think that tweaking or expanding
GPT-3’s architecture (which many in AI are discussing) is ever going to produce
a general problem solver. But it, or a language system like it, could make a
valuable contribution to a larger, more general-purpose AI. It could even help
to train that larger AI. In fact, I think GPT-3 would be a perfect addition to
many proposed cognitive architectures, including one that I have proposed in an
article in the journal Physiology & Behavior here. The
rest of this blog post will describe how a language model like GPT-3 could
contribute meaningfully to a conscious machine if integrated with other
specialized systems properly.
Playing the Role of Broca’s
When we are born, our language areas are not
blank slates. They come with their own instincts. Highly complex wiring
patterns in the cerebral cortex set us up in advance to acquire language and
use it facilely. AI should also not be a blank slate, like an undifferentiated
mass of neurons. It needs guidance in the form of a wiring structure. GPT-3's existing lexicon and record of dependencies between words could help bootstrap
a more extensive blank-slate system. Taking a pretrained system like GPT-3 and
embedding it within a much larger AI network (that starts with predominantly
random weights) could provide that AI network with the instincts and linguistic
structure it needs to go from grammatical, syntactic, and lexical proficiency
to proper comprehension. In other words, an advanced NLP system will provide
Noam Chomsky and Steven Pinker’s “language instinct.”
When GPT-3 chooses the next word, it is not
influenced by any other modalities. There is no sight, hearing, taste, smell,
touch, mental imagery, motor responses, or knowledge from embodiment in the
physical world influencing what it writes. It is certainly not influenced by
contemplative thought. These are major limitations. By taking the GPT neural
network and integrating it with other functionally specialized neural networks,
we can get it to interact with similar systems that process information of
different modalities, resulting in a multimodal approach. This will give it a broader form of attention that can keep track of not just text but a variety of other stimuli,
perceptions, and concepts. GPT-3 already prioritizes strategically chosen
words, but we want it also to prioritize snapshots, audio clips, memories,
beliefs, and intentions.
Determining priority should be influenced by
real-world events and experiences. Thus, the system should be able to make its
own perceptual distinctions using cameras, microphones, and other sensors. It
should also be able to interact using motors or servos with real-world objects.
Actual physical interaction develops what psychologists call “embodiment,”
crucial experiences that shape learning and understanding (GPT-3, on the other
hand, is very much disembodied software). Knowledge about perception and
physical interaction will influence the AI’s word choice, just like our
real-world experiences influence the things we say. For instance, by applying
embodied knowledge to the events that it witnesses at a baseball game, an AI
should understand what it is like to catch or hit a ball. This understanding
coming from experience would influence how it perceives the game, what it
expects to happen, and the words it uses to talk about the game. This kind of
embodied knowledge could then interact with the millions of pages of written
text that it has read about baseball from sports news and other sources.
For me, the end goal of AI is to create a system that can help us accomplish things we cannot do on our own. I am most interested in creating an AI that can learn about science, by doing things like reading nonfiction books and academic articles, and then make contributions to scientific knowledge by coming up with new insights, innovations, and technology. To do this, an AI must think like a human, which means it must have the equivalent of an entire human brain and all of its sensory and associative cortices, not just its language area.
I think that to build superintelligence or
artificial general intelligence, the system must be embodied and multimodal.
You want it to interact with the world in an active way: watching events unfold, watching movies and YouTube videos, and interacting with people and animals. As it does this, it should use the words it has to describe its experience as psychological items (higher-order abstractions) and to make predictions about what will come next and how to interact with the world.
GPT-3 uses its attention to keep track of
long-term dependencies. It selectively prioritizes the most relevant of recent
words so that it can refer back to them. This is how it keeps certain words “in
mind” so that it doesn’t stray from the topic as it writes. GPT-3 is 2048
tokens (think words) wide. That is its “context window” or attention span. In
my opinion, this may be more than large enough to serve as an equivalent of
Broca’s area. GPT-3 must have an attention span of thousands of tokens because it is
compensating for the fact that it doesn’t have the equivalent of an
overarching, hierarchical, embodied, multimodal, global working memory.
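The context window behaves like a fixed-width sliding buffer over the most recent tokens. A minimal sketch (the 2048-token width matches GPT-3's published context size, but the function itself is illustrative):

```python
CONTEXT_WINDOW = 2048  # GPT-3's attention span, in tokens

def update_context(context, new_tokens, window=CONTEXT_WINDOW):
    """Append new tokens; once the window is full, the oldest drop out."""
    context = context + new_tokens
    return context[-window:]  # keep only the most recent `window` tokens
```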
The architecture of GPT-3 is very similar to
that of GPT-2 and GPT-1. They all use the same algorithm for attention. GPT-3
performs much better than its earlier iterations, mostly because it is much
larger. It contains more layers, wider layers, and was trained on more data.
Some people think that using this architecture and continuing to scale it up
could lead to artificial general intelligence, which is AI that can do anything
a human can do. Some even speculate that it could lead to conscious AI. I am
highly convinced that GPT-3, or other neural networks like it, will never lead to general intelligence on their own.
Continuing to scale up this system will lead
to improved performance but also diminishing returns. Although it could lead to
many general abilities, it will never lead to true understanding or
comprehension. Trying to do so would be like creating an Olympic sprinter by
building an intricately complex robotic foot. The foot may be necessary, but
you will need all the other body parts to come together for it to run
competitively. GPT-3 must be linked together with other specialized modules
into a shared workspace for it to really shine. Before we talk about this
workspace in the last section, let’s look at Broca’s in a little more detail.
Broca's and Wernicke's Areas
Broca's area is a motor area in the frontal
lobe responsible for speech. Patients with damage to Broca's have trouble
speaking. If the damage is sufficient, they may be disfluent, aphasic, or
mute. Much like GPT-3, Broca's selects the next word to be spoken based on the
words that came before. Once it chooses the word that fits in context, it hands
the word down to lower-order motor control areas (like the primary motor area)
that coordinate the body's actual muscular structures (voice box, tongue, lips,
and mouth) to say the word. The neurons in your motor strip constitute your
output layer, and there is a dictionary in there in some shape or form. To
continue to explain the role of Broca's in language, we must introduce its
sister area, Wernicke's.
Wernicke's area is a cortical area that helps
us process heard speech. It is found in the temporal lobe and takes its inputs
from early auditory areas that get their inputs straight from the ears.
Neurological patients with damage to this area can hear most nonspeech sounds
normally as their auditory areas are still intact, but they have a specific
deficit in recognizing language. In other words, your Wernicke's area will not
try to analyze the sound of a car but will try to analyze the voice of your
friend. It acts as a pattern recognizer specifically for spoken words.
Wernicke's and Broca's are specialized modules
whose (mostly unconscious) outputs affect how we perceive and use language and
even how we think. It is interesting to note that some studies have found Broca's area to be about 20% larger in women than in men, which may contribute to women's greater average verbal fluency.
The diagram below shows how we can go from
hearing our friend say something to us to responding to them with our own
words. First, the primary auditory area takes sounds heard by the ears,
processes them further, and sends its output to Wernicke's area. Wernicke's
then picks out the words from these sounds and sends those to Broca's. Broca's,
in turn, sends the words that should be spoken in response to the motor area,
which will then send the appropriate instructions to the tongue, jaw, mouth,
and lips. This seems like a tight loop, but keep in mind that several other
loops involving various brain areas contribute to our verbal responses (loops
that NLP systems such as GPT-3 don't have).
It is worth mentioning that GPT-3 handles both
the input of text and its output, so in this sense, it serves as an analogue of
both Broca's and Wernicke's areas. However, it cannot hear or speak. This is
easily fixed, though, by connecting a speech-to-text program to its input to allow it to hear. Allowing it to speak is as easy as connecting its output to a text-to-speech program.
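That wiring can be sketched as simple glue code. The three function parameters below are hypothetical placeholders; in practice you would plug in a real speech recognizer, a language model API, and a TTS engine:

```python
def converse(audio_in, speech_to_text, language_model, text_to_speech):
    """A Wernicke's-to-Broca's style loop: hear, choose words, speak.

    The three callables are hypothetical placeholders for real components
    (a speech recognizer, a language model, and a TTS engine).
    """
    heard_text = speech_to_text(audio_in)    # ears -> Wernicke's analogue
    reply_text = language_model(heard_text)  # Broca's analogue picks the words
    return text_to_speech(reply_text)        # motor areas -> spoken output
```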
In the brain, Broca's and Wernicke's have a
continual open circuit connecting them at all times. They are constantly
working together. Your Wernicke's area also listens to the speech you generate,
which helps provide real-time feedback about the words coming out of your
mouth. This allows language perception to be constantly linked to language
generation. Not only does it allow us to hear the words that we say out loud,
but it also gives a voice to the subvocal inner speech that we create. In other
words, the loop between these two areas is responsible for the voice in your
head, your internal monologue. Your Broca's area allows its outputs to be sent
to your auditory area even when actual speech is suppressed, and this is why
you can hear your own voice in your head even when you are not speaking aloud.
We basically hallucinate our inner voice. Inner speech may be an essential
aspect of consciousness, so we should give our AI system this kind of inner speech loop as well.
The circuit connecting Broca's to Wernicke's
is also responsible for the "phonological loop", which is a form of
short-term sensory memory that allows us to remember a crystal-clear version of
the last 2.5 seconds of what we just heard. This is why you can remember
someone's last sentence word for word or remember a seven-digit phone number.
Judging from the fact that all humans have one and that it is very useful to us
day to day, the phonological loop may also make substantial contributions to
consciousness. For this reason, Broca's and Wernicke's analogues may be
essential ingredients for superintelligent AI.
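A phonological loop analogue could be as simple as a time-stamped buffer that evicts anything older than about 2.5 seconds. A sketch (the capacity and the data structure are my assumptions, not a claim about the brain's implementation):

```python
from collections import deque

LOOP_SECONDS = 2.5  # rough span of the human phonological loop

class PhonologicalLoop:
    """Time-stamped buffer that keeps only the last ~2.5 seconds of speech."""

    def __init__(self, seconds=LOOP_SECONDS):
        self.seconds = seconds
        self.buffer = deque()  # (timestamp, word) pairs, oldest first

    def hear(self, timestamp, word):
        self.buffer.append((timestamp, word))
        # evict anything that fell outside the loop's span
        while self.buffer and timestamp - self.buffer[0][0] > self.seconds:
            self.buffer.popleft()

    def replay(self):
        """Return a word-for-word copy of what was just heard."""
        return [word for _, word in self.buffer]
```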
GPT-3 may be ready as-is to serve as an
equivalent of Broca's area in a larger system that is designed to interface
with it. However, it is not ready to handle the long-term conceptual
dependencies necessary for true cognition. To do this, it needs to interact
with a global workspace.
What is a Global Workspace?
The Global Workspace is a popular model of
consciousness and brain architecture from brain scientist Bernard Baars. It
emphasizes that what we are conscious of is broadcast globally throughout the
brain ("fame in the brain"), even to unconscious processing areas.
These unconscious areas operate in parallel, with little communication between
them. They are, however, influenced by the global information and can form new
impressions of it, which in turn can be sent back to the global workspace.
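In code, the "fame in the brain" idea might look like the following toy sketch: each specialist module proposes content with a salience score, and the winner's content is broadcast to every module on the next cycle. The interface (salience-content pairs) is my assumption for illustration, not part of Baars' model:

```python
class GlobalWorkspace:
    """Toy 'fame in the brain': the most salient module output is
    broadcast to every module, including the ones that lost."""

    def __init__(self):
        self.modules = {}    # name -> callable(broadcast) -> (salience, content)
        self.broadcast = None

    def register(self, name, module):
        self.modules[name] = module

    def step(self):
        # every specialist proposes content in parallel, seeing only
        # the previous global broadcast
        proposals = {name: m(self.broadcast) for name, m in self.modules.items()}
        winner = max(proposals, key=lambda name: proposals[name][0])
        self.broadcast = proposals[winner][1]  # the winner goes global
        return winner, self.broadcast
```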
The diagram below, adapted from Baars' work,
shows five lower-order systems separated from each other by black bars. Each of
these systems is hierarchical, and only at the top of their hierarchy can they
communicate with one another. The location where they meet and exchange
information is the global workspace. Here in the workspace, the most critical
elements are activated and bound together into a conscious perception.
This neurological model can be instantiated in
a computer in the form of interacting neural networks. The diagram below shows
six different neural networks, which all remain separate until their output
layers are connected in a shared global workspace. The black letters represent
items held active in working memory.
The global workspace is like a discussion
between specialists who share their most important ideas. If Broca's converges
strongly on a series of words, those will be shared with the global workspace. From
there, they are shared with other brain areas and modules. For example, when you read about a "pink rhino in a bathing suit," the words in this phrase are translated into words you hear in your "mind's ear." From there
they are broadcast to the global workspace where you become conscious of them.
From there they are shared with your visual processing areas so that you can
form a mental picture in your "mind's eye."
It would probably be helpful to train the AI
language model (or module) by itself first before it is dropped into a larger
global architecture. This is similar to the way our Broca's and Wernicke's
areas come with genetically determined wiring patterns that have been selected
over tens of millions of years of evolution (it is worth mentioning that even
apes and monkeys have analogues of these two areas, and they generally perform
the same functions). Once the language area is dropped in it can play a hand in
training the rest of the system by interacting with it. Over time, the two
systems will help to fine-tune each other.
Broca's area is always running in the
background, but its processing does not always affect us. It only has access to
consciousness when what it is doing is deemed important by dopaminergic
centers. Similarly, the language it produces is only broadcast to the larynx
and thus spoken aloud when other brain areas grant this. Our AI system should
work this way too. It should have three levels of natural language generation
activity: it should be able to speak, produce subvocal speech that only it can
hear, and have speech generation going on in the background that unconsciously
influences it (and the global workspace).
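These three levels could be implemented as a simple gate on the generator's output. The salience thresholds below are invented for illustration:

```python
SPOKEN, SUBVOCAL, BACKGROUND = "spoken", "subvocal", "background"

def route_speech(words, salience, speak_threshold=0.8, aware_threshold=0.4):
    """Words are always generated; how far they travel depends on how
    important the rest of the system deems them (thresholds are invented)."""
    if salience >= speak_threshold:
        return (SPOKEN, words)      # sent to the 'larynx' or console
    if salience >= aware_threshold:
        return (SUBVOCAL, words)    # inner speech: heard only internally
    return (BACKGROUND, words)      # runs on, subtly influencing other modules
```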
Even if the system is not speaking or printing
text to a console, its language generation should be running in the background.
Like us, it may or may not be conscious of the words its Broca's area is
stringing together. In other words, its output may not be taking center stage
in the global workspace. However, whether subliminal or not, the language it
generates should still influence the behavior of other modules. And just as you
can hear the words coming out of your mouth, this larger system would be able
to analyze GPT-3's outputs and provide it with feedback about what to say
next. We would want it to be able to self-monitor its own language output.
Broca's area takes conceptual information from
the global workspace and turns it into a stream of words. It translates, fills
in the blanks, and finds the appropriate words to express what is intended.
When you are approached by a stranger that seems to have a sense of urgency,
Broca's area turns your intentions into words: "Hi, how can I help
you?" We don't have the mental capacity to pick and choose all the words
we use individually. Much of it is done completely unconsciously by this specialized language area.
At first, the word selections made by the AI system would be almost entirely
determined by the language model. This is analogous to how our language areas
and their inherited architecture shape how we babble as infants. Slowly the
weights and activation from the global workspace would start to influence the
word selection randomly and subtly. Errors and reward feedback would alter the
weights in various networks and slowly tune them to perform better. Over time,
the language model will gradually relinquish control to the higher-order
demands and constraints set by the larger system.
The diagram below shows a large system made of
rectangles. Each rectangle represents a neural network. The largest network on
the left (more of a square) contains semantic representations that can be held
in either the focus of attention (FOA), the short-term memory store (STM), or
in inert long-term memory. The letters in this square show that this system is
updated iteratively. This means that the contents of the system's working
memory have changed from time one (t1) to time two (t2). But significantly, it
hasn't changed entirely because these two states overlap in the set of concepts
they contain. This kind of behavior would be important for the language module,
but also for the other modules in our AI system as well.
It is important to mention that GPT-3 is also
updated iteratively. Its attention span for the words that it just read is
limited. Once it is full it is forced to drop the words that have been there
the longest. We can assume that Broca's area is also updated iteratively. But
unlike Broca's, GPT-3 does not connect to a larger system that prioritizes its
working memory by using an FOA and an STM.
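The FOA/STM arrangement could be sketched as follows: new items enter the focus of attention, displaced items fall into a primed short-term store rather than vanishing, and successive states overlap. The sizes and the list-based representation are illustrative assumptions:

```python
FOA_SIZE = 4  # number of items the focus of attention can hold (illustrative)

def iterative_update(foa, stm, new_items):
    """Add new items to the focus of attention (FOA); displaced items drop
    into a primed short-term store (STM) instead of vanishing, so
    successive states overlap rather than replace each other wholesale."""
    foa = foa + new_items
    displaced, foa = foa[:-FOA_SIZE], foa[-FOA_SIZE:]
    stm = stm + [x for x in displaced if x not in stm]
    return foa, stm
```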
The network of neural networks described here should utilize
nodes that fire for extended periods to simulate the sustained firing of
cortical pyramidal neurons to create a focus of attention. Neurons that drop
out of sustained firing should then remain primed using a form of synaptic
potentiation amounting to an STM. This larger system should also use SSC,
icSSC, iterative updating, multiassociative search, and progressive
modification, as explained in my article here. This architecture should allow the system to form
associations and predictions, formulate inferences, implement algorithms,
compound intermediate results, and ultimately create a form of mental modeling.
Rather than relying on “next word prediction” alone, truly intelligent systems need a form of working memory and a global workspace. Linking modern natural language generation models with the major mechanistic constructs from cognitive neuroscience could give us the superintelligence we want.
To see my model of working memory and artificial superintelligence, visit: