Cutting-edge AI language software is more powerful than ever, and what it can do today might surprise you. I'm sure you have heard of Siri, Alexa, and Google Assistant. These AIs can be really helpful, but they are nowhere near the cutting edge of computer language generation. Some of the best language AIs (also known as models) today can respond so well to queries that most people assume there is a human at the other end typing the answers. However, even the best of them still have a way to go before achieving full human-level performance.
The limitations of these AI language models are very similar to those of our brain's language area. Isolated from the rest of the brain, our brain's language area could potentially still produce speech, but it would be rote, mechanical, and unplanned, much like the language produced by contemporary AI language models. But if this language software was integrated into a larger network with a form of working memory, as our brain's language area is, it could potentially produce much more coherent language.
Our brain's language region is called Broca's area and helps us with the fine details of piecing together our sentences. These are mostly the tedious details that we can't be bothered by. We are unconscious of most of the work Broca's area performs, but we wouldn't be able to speak without it. It does many things, including helping us find words that are on the tip of our tongue. Broca's area could at least keep us talking if the more generalist brain areas like the prefrontal cortex (involved in both consciousness and larger overarching language concerns) were gone. The speech generated from Broca's alone might sound grammatically correct and have proper syntax, but the semantics would have issues. We see this in people with prefrontal injuries today. They can talk indefinitely, and at first, what they are saying might sound normal, but it doesn't take long to realize that it is missing some intellectual depth.
As you will learn here, modern AI language software is also missing depth because it does not plan its speech. These systems literally put their sentences together one word at a time. Given the series of words that have come so far, they predict which word is most likely to come next. Then they add their prediction to the list of words that have come before and use this new list to make the following prediction. They repeat this process to string together sentences and paragraphs. In other words, they have no forethought or real intention. You could say that modern AI language generation systems "think" like someone who has a serious brain injury or who has been lobotomized. They formulate speech like the TV show character Michael Scott from The Office. Here is a telling quote from Michael:
"Sometimes I'll start a sentence, and I don't even know where it's going. I just hope I find it along the way. Like an improv conversation. An improversation."
- Michael Scott
Michael is a parody. He is a manager with an attention deficit who has no real plan and does everything on the fly. As you can see on the show, his work doesn't lead to productivity, economic or otherwise. We need AI that is structured to do better than this.
The question becomes, how can we create an AI system that does more than just predict the next best word, one at a time? We want a system that plans ahead, formulating the gist of what it wants to say in its imagination before choosing the specific words needed to communicate it. As will be discussed, that will never emerge by using the same machine learning architectures we have been using. Additional architectural modifications are needed.
This entry will espouse taking the current neural network language architecture (the transformer neural network model introduced in 2017) and attaching it to a larger, generalist system. This larger system will allow word choice to be affected by a large number of varied constraints (not just the words that came earlier in the sentence). The diagram below shows a multimodal neural network made up of several different interfacing networks. You can see the language network on the far right, in the center, attached directly to a speaker.
It would be a tremendous engineering feat to get multiple neural networks to interface and work together, as depicted above. But research with neural networks has shown that they are often interoperable and can quickly adapt to each other and learn to work cooperatively. AI pioneer Marvin Minsky called the human mind a “society of mind.” By that he meant that our brain is made up of different modular networks that each contribute to a larger whole. I don’t think AI will truly become intelligent unless it works in this way. Before we go any further, the next section will briefly explain how neural networks (the most popular form of machine learning in AI) work.
How Neural Networks Choose Their Words
I explain how neural networks work in detail in my last entry, which you can read here.
A quick recap: Most forms of machine learning, including neural networks, are systems with a large number of interlinked neuron-like nodes. These are represented by the circles in the diagram below. The nodes are connected via weighted links. These links are represented as the lines connecting the circles. The links are considered “weighted” because each has a numerical strength that is subject to change during learning. As the AI software is exposed to inputs, those inputs flow through the system (from left to right), travelling from node to node until they reach the output layer on the far right. That output layer contains a node for every word in the dictionary. Whichever node in the output layer is activated the most will become the network’s next chosen word.
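The selection step at the output layer can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the five-word `VOCAB` and the `scores` are invented, and the softmax function (a standard way of turning raw activations into probabilities) stands in for whatever a given network actually uses.

```python
import math

# Hypothetical activation scores for a five-word "dictionary".
VOCAB = ["cat", "dog", "car", "idea", "the"]
scores = [4.0, 3.0, 0.5, 0.1, -1.0]

def softmax(values):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

probabilities = softmax(scores)

# The most strongly activated output node becomes the chosen word.
best_index = max(range(len(VOCAB)), key=lambda i: probabilities[i])
print(VOCAB[best_index])  # prints "cat"
```

Always picking the single most activated node is the simplest possible rule. Real systems often sample from the probabilities instead, which makes their output less repetitive.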
Language-generating neural networks are exposed to millions of sentences, typically from books or articles. The system can learn from what it is reading because it adapts to it. The weights’ values are strengthened when it is able to correctly guess the next word in a sentence it is reading, and the values are weakened when it chooses any other word. Given a broad attention span and billions of training sessions, these networks can get really good at internalizing the structure of the English language and piecing sentences together word by word.
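Here is a rough sketch of that strengthening and weakening, written as one step of gradient descent on a softmax output. It is a simplification, not how any particular model is trained: the three-word `vocab`, the `learning_rate`, and the function name `training_step` are all assumptions, and the word scores are treated as directly adjustable numbers rather than as the output of many layers of weights.

```python
import math

def softmax(values):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def training_step(scores, target_index, learning_rate=0.5):
    """One gradient-descent step on cross-entropy loss w.r.t. the scores."""
    probs = softmax(scores)
    new_scores = []
    for i, (score, p) in enumerate(zip(scores, probs)):
        target = 1.0 if i == target_index else 0.0
        # The gradient of the loss for this score is (p - target), so the
        # correct word's score goes up and every other score goes down.
        new_scores.append(score - learning_rate * (p - target))
    return new_scores

vocab = ["cat", "dog", "car"]
scores = [0.0, 0.0, 0.0]  # start with no preference
for _ in range(20):       # see the same example 20 times
    scores = training_step(scores, target_index=vocab.index("cat"))

print(softmax(scores)[vocab.index("cat")] > 0.8)  # prints True
```

After repeated exposure, the score for the correct word has pulled well ahead of the others, which is the toy version of a network "internalizing" what it reads.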
The diagram below shows the words “this is my pet…” being fed into an AI neural network. The final word “cat” is hidden from the network. As “this is my pet” passes through the network from left to right, the words travel from node (circle) to node, through the weighted links, toward the full dictionary of nodes at the output layer. The pattern of inputs caused a pattern of network activity that then selected a single output. You can see the network converging on the word “cat” as its best prediction. It got it right! This network could then continue, rapidly writing sentences in this way for as long as you tell it to.
Broca's area in your brain works similarly. It takes inputs from many different systems in the brain, especially from the system that recognizes spoken language. These inputs activate select neurons out of a network of millions of them. The activation energy in the form of neural impulses travels through the network and toward something analogous to an output layer. This happens every time you speak a word. In both the brain and AI software, inputs work their way through a system of nodes toward an existing output. That output represents the decision, and in this case, it's the next word.
AI that Generates Natural Language
Broca's area is a patch of cortical tissue in the frontal lobe designed by evolution to have all the right inputs, outputs, and internal connectivity to guide the involuntary, routinized aspects involved in speech production. Neuroscientists still don't know much about how it works, and reverse engineering it completely would be nearly impossible with today's technology. Lucky for us, AI probably doesn't need an exact equivalent of Broca's to develop the gift of gab. In fact, it may already have something even better.
There are many state-of-the-art natural language systems we could discuss, but here we will focus on one called GPT-3 (which arrived in May 2020). It is an exciting new AI project that has proven to be highly adept at natural language processing (NLP). It can answer questions, write computer code, summarize long texts, and even write its own essays. However, keep in mind that as discussed above, it has no plan. The next word it chooses is just the word that it "predicts" should come next. This is called "next word prediction."
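The loop behind next-word prediction can be sketched as follows. This is not GPT-3: the table `NEXT_WORD_COUNTS` is made up, and the toy predictor looks only at the single previous word, whereas GPT-3 attends to its whole context. The function names `predict_next` and `generate` are hypothetical. The point is only the cycle of predicting, appending, and predicting again.

```python
# Toy illustration of next-word prediction. The counts below are invented;
# a real model would compute scores with a trained neural network.
NEXT_WORD_COUNTS = {
    "this": {"is": 9, "was": 1},
    "is": {"my": 6, "a": 4},
    "my": {"pet": 7, "cat": 3},
    "pet": {"cat": 8, "dog": 2},
}

def predict_next(words):
    """Return the most likely next word given the words so far, or None."""
    candidates = NEXT_WORD_COUNTS.get(words[-1])
    if not candidates:
        return None
    return max(candidates, key=candidates.get)

def generate(prompt, max_new_words=10):
    """Repeatedly predict a word and append it to the running list."""
    words = list(prompt)
    for _ in range(max_new_words):
        next_word = predict_next(words)
        if next_word is None:  # no known continuation, so stop
            break
        words.append(next_word)
    return words

print(" ".join(generate(["this"])))  # prints "this is my pet cat"
```

Notice that nothing in `generate` knows where the sentence is headed. Each word is chosen only after the previous one has been committed.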
You can feed it the first two sentences of a news article, and it will write the rest of the article convincingly. You can ask it to write a poem in a certain author's style, and its output may be indistinguishable from an actual poem by that author. In fact, one blogger created a blog where they only posted GPT-3 text as entries. The entries were so good that people were convinced it was written by a human and started subscribing to the blog. Here is an example of a news article that it wrote:
Traditionally AI does poorly with common sense, but many of GPT-3’s responses are highly logical. I want to urge you to use an online search to find out more about the fantastic things that it can do. However, keep in mind that it sometimes makes fundamental mistakes that a human would never make. For example, it can say absurd things, completely lose coherence over long passages, and insert non-sequiturs and even falsehoods. Also, as rational as its replies may seem, GPT-3 has no understanding of the language it creates, and it is certainly not conscious in any way. This becomes clear from its responses to nonsense:
I don’t think that tweaking or expanding GPT-3’s architecture (which many in AI are discussing) is ever going to produce a general problem solver. But it, or a language system like it, could make a valuable contribution to a larger, more general-purpose AI. It could even help to train that larger AI. In fact, I think GPT-3 would be a perfect addition to many proposed cognitive architectures, including one that I have proposed in an article in the journal Physiology and Behavior here. The rest of this blog post will describe how a language model like GPT-3 could contribute meaningfully to a conscious machine if integrated with other specialized systems properly.
Playing the Role of Broca’s
When we are born, our language areas are not blank slates. They come with their own instincts. Highly complex wiring patterns in the cerebral cortex set us up in advance to acquire language and use it facilely. AI should also not be a blank slate, like an undifferentiated mass of neurons. It needs guidance in the form of a wiring structure. GPT-3’s existing lexicon and record of dependencies between words could help bootstrap a more extensive blank-slate system. Taking a pretrained system like GPT-3 and embedding it within a much larger AI network (that starts with predominantly random weights) could provide that AI network with the instincts and linguistic structure it needs to go from grammatical, syntactic, and lexical proficiency to proper comprehension. In other words, an advanced NLP system will provide Noam Chomsky and Steven Pinker’s “language instinct.”
When GPT-3 chooses the next word, it is not influenced by any other modalities. There is no sight, hearing, taste, smell, touch, mental imagery, motor responses, or knowledge from embodiment in the physical world influencing what it writes. It is certainly not influenced by contemplative thought. These are major limitations. By taking the GPT neural network and integrating it with other functionally specialized neural networks, we can get it to interact with similar systems that process information of different modalities resulting in a multimodal approach. This will give it a broader form of attention that can keep track of, not just text but a variety of other incentives, perceptions, and concepts. GPT-3 already prioritizes strategically chosen words, but we want it also to prioritize snapshots, audio clips, memories, beliefs, and intentions.
Determining priority should be influenced by real-world events and experiences. Thus, the system should be able to make its own perceptual distinctions using cameras, microphones, and other sensors. It should also be able to interact using motors or servos with real-world objects. Actual physical interaction develops what psychologists call “embodiment,” crucial experiences that shape learning and understanding (GPT-3, on the other hand, is very much disembodied software). Knowledge about perception and physical interaction will influence the AI’s word choice, just like our real-world experiences influence the things we say. For instance, by applying embodied knowledge to the events that it witnesses at a baseball game, an AI should understand what it is like to catch or hit a ball. This understanding coming from experience would influence how it perceives the game, what it expects to happen, and the words it uses to talk about the game. This kind of embodied knowledge could then interact with the millions of pages of written text that it has read about baseball from sports news and other sources.
For me, the end goal of AI is to create a system that can help us accomplish things we cannot do on our own. I am most interested in creating an AI that can learn about science, by doing things like reading nonfiction books and academic articles, and then make contributions to scientific knowledge by coming up with new insights, innovations, and technology. To do this, an AI must think like a human, which means it must have the equivalent of an entire human brain and all of its sensory and associative cortices, not just its language area.
I think that to build superintelligence or artificial general intelligence, the system must be embodied and multimodal. You want it to be interacting with the world in an active way, watching events unfold, watching movies and YouTube videos, interacting with people and animals. As it does this, it should be using the words that it has to describe its experience as psychological items (higher-order abstractions) to make predictions about what will come next and how to interact with the world.
GPT-3 uses its attention to keep track of long-term dependencies. It selectively prioritizes the most relevant of recent words so that it can refer back to them. This is how it keeps certain words “in mind” so that it doesn’t stray from the topic as it writes. GPT-3’s “context window,” or attention span, is 2048 tokens (think words) wide. In my opinion, this may be more than large enough to serve as an equivalent of Broca’s area. GPT-3 must have an attention of thousands of tokens because it is compensating for the fact that it doesn’t have the equivalent of an overarching, hierarchical, embodied, multimodal, global working memory.
The architecture for GPT-3 is very similar to that of GPT-2 and GPT-1. They all use the same algorithm for attention. GPT-3 performs much better than its earlier iterations, mostly because it is much larger. It contains more layers, wider layers, and was trained on more data. Some people think that using this architecture and continuing to scale it up could lead to artificial general intelligence, which is AI that can do anything a human can do. Some even speculate that it could lead to conscious AI. I am highly convinced that GPT-3, or other neural networks like it, will never lead to consciousness.
Continuing to scale up this system will lead to improved performance but also diminishing returns. Although it could lead to many general abilities, it will never lead to true understanding or comprehension. Trying to reach comprehension through scale alone would be like creating an Olympic sprinter by building an intricately complex robotic foot. The foot may be necessary, but you will need all the other body parts to come together for it to run competitively. GPT-3 must be linked together with other specialized modules into a shared workspace for it to really shine. Before we talk about this workspace in the last section, let’s look at Broca’s in a little more detail.
Broca's and Wernicke's Areas
Broca's area is a motor area in the frontal lobe responsible for speech. Patients with damage to Broca's have trouble speaking. If the damage is sufficient, they may be disfluent, aphasic, or mute. Much like GPT-3, Broca's selects the next word to be spoken based on the words that came before. Once it chooses the word that fits in context, it hands the word down to lower-order motor control areas (like the primary motor area) that coordinate the body's actual muscular structures (voice box, tongue, lips, and mouth) to say the word. The neurons in your motor strip constitute your output layer, and there is a dictionary in there in some shape or form. To continue to explain the role of Broca's in language, we must introduce its sister area, Wernicke's.
Wernicke's area is a cortical area that helps us process heard speech. It is found in the temporal lobe and takes its inputs from early auditory areas that get their inputs straight from the ears. Neurological patients with damage to this area can hear most nonspeech sounds normally as their auditory areas are still intact, but they have a specific deficit in recognizing language. In other words, your Wernicke's area will not try to analyze the sound of a car but will try to analyze the voice of your friend. It acts as a pattern recognizer specifically for spoken words.
Wernicke's and Broca's are specialized modules whose (mostly unconscious) outputs affect how we perceive and use language and even how we think. It is interesting to note that Broca's is about 20% larger in women than in men, and this may be responsible for women's greater fluency with language and higher verbal abilities.
The diagram below shows how we can go from hearing our friend say something to us to responding to them with our own words. First, the primary auditory area takes sounds heard by the ears, processes them further, and sends its output to Wernicke's area. Wernicke's then picks out the words from these sounds and sends those to Broca's. Broca's, in turn, sends the words that should be spoken in response to the motor area, which will then send the appropriate instructions to the tongue, jaw, mouth, and lips. This seems like a tight loop, but keep in mind that several other loops involving various brain areas contribute to our verbal responses (loops that NLP systems such as GPT-3 don't have).
It is worth mentioning that GPT-3 handles both the input of text and its output, so in this sense, it serves as an analogue of both Broca's and Wernicke's areas. However, it cannot hear or speak. This is easily fixed, though, by connecting a speech-to-text program to its input to allow it to hear. Allowing it to speak is as easy as connecting its output to a text-to-speech program.
In the brain, Broca's and Wernicke's have a continual open circuit connecting them at all times. They are constantly working together. Your Wernicke's area also listens to the speech you generate, which helps provide real-time feedback about the words coming out of your mouth. This allows language perception to be constantly linked to language generation. Not only does it allow us to hear the words that we say out loud, but it also gives a voice to the subvocal inner speech that we create. In other words, the loop between these two areas is responsible for the voice in your head, your internal monologue. Your Broca's area allows its outputs to be sent to your auditory area even when actual speech is suppressed, and this is why you can hear your own voice in your head even when you are not speaking aloud. We basically hallucinate our inner voice. Inner speech may be an essential aspect of consciousness, so we should give our AI system this kind of circuit.
The circuit connecting Broca's to Wernicke's is also responsible for the "phonological loop", which is a form of short-term sensory memory that allows us to remember a crystal-clear version of the last 2.5 seconds of what we just heard. This is why you can remember someone's last sentence word for word or remember a seven-digit phone number. Judging from the fact that all humans have one and that it is very useful to us day to day, the phonological loop may also make substantial contributions to consciousness. For this reason, Broca's and Wernicke's analogues may be essential ingredients for superintelligent AI.
GPT-3 may be ready as-is to serve as an equivalent of Broca's area in a larger system that is designed to interface with it. However, it is not ready to handle the long-term conceptual dependencies necessary for true cognition. To do this, it needs to interact with a global workspace.
What is a Global Workspace?
The Global Workspace is a popular model of consciousness and brain architecture from brain scientist Bernard Baars. It emphasizes that what we are conscious of is broadcast globally throughout the brain ("fame in the brain"), even to unconscious processing areas. These unconscious areas operate in parallel, with little communication between them. They are, however, influenced by the global information and can form new impressions of it, which in turn can be sent back to the global workspace.
The diagram below, adapted from Baars' work, shows five lower-order systems separated from each other by black bars. Each of these systems is hierarchical, and only at the top of their hierarchy can they communicate with one another. The location where they meet and exchange information is the global workspace. Here in the workspace, the most critical elements are activated and bound together into a conscious perception.
This neurological model can be instantiated in a computer in the form of interacting neural networks. The diagram below shows six different neural networks, which all remain separate until their output layers are connected in a shared global workspace. The black letters represent items held active in working memory.
The global workspace is like a discussion between specialists who share their most important ideas. If Broca's converges strongly on a series of words, those will be shared with the global workspace. From there, they are shared with other brain areas and modules. For example, when you read about a "pink rhino in a bathing suit," the words in this phrase are translated to words you hear in your "mind's ear." From there they are broadcast to the global workspace, where you become conscious of them. From there they are shared with your visual processing areas so that you can form a mental picture in your "mind's eye."
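One way to picture this broadcast in software is a simple winner-take-all exchange. The sketch below is an assumption-laden toy, not Baars' model or any published implementation: the `Module` class, the `broadcast_winner` function, and the salience numbers are all hypothetical. Each module offers its strongest content, the most salient offer wins the workspace, and the winning content is sent to every other module.

```python
from dataclasses import dataclass, field

@dataclass
class Module:
    """A specialist network that can receive broadcast content."""
    name: str
    received: list = field(default_factory=list)

    def receive(self, content):
        self.received.append(content)

def broadcast_winner(proposals, modules):
    """Pick the most salient proposal and send it to every other module.

    proposals maps a module name to a (content, salience) pair.
    """
    winner = max(proposals, key=lambda name: proposals[name][1])
    content = proposals[winner][0]
    for module in modules:
        if module.name != winner:
            module.receive(content)
    return winner, content

modules = [Module("language"), Module("vision"), Module("motor")]
proposals = {
    "language": ("pink rhino in a bathing suit", 0.9),  # strong convergence
    "vision": ("blank page", 0.4),
}
winner, content = broadcast_winner(proposals, modules)
print(winner, "->", modules[1].received)
```

Here the language module wins, and the vision module receives the phrase, which is the point at which it could begin forming a mental picture.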
It would probably be helpful to train the AI language model (or module) by itself first before it is dropped into a larger global architecture. This is similar to the way our Broca's and Wernicke's areas come with genetically determined wiring patterns that have been selected over tens of millions of years of evolution (it is worth mentioning that even apes and monkeys have analogues of these two areas, and they generally perform the same functions). Once the language area is dropped in, it can have a hand in training the rest of the system by interacting with it. Over time, the two systems will help to fine-tune each other.
Broca's area is always running in the background, but its processing does not always affect us. It only has access to consciousness when what it is doing is deemed important by dopaminergic centers. Similarly, the language it produces is only broadcast to the larynx and thus spoken aloud when other brain areas grant this. Our AI system should work this way too. It should have three levels of natural language generation activity: it should be able to speak, produce subvocal speech that only it can hear, and have speech generation going on in the background that unconsciously influences it (and the global workspace).
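The three levels could be expressed as a simple routing decision. This is a hypothetical sketch: the names `SpeechLevel` and `route_speech`, the `importance` score, and the threshold of 0.5 are all invented for illustration, standing in for whatever signals the larger system would really use to grant access.

```python
from enum import Enum

class SpeechLevel(Enum):
    BACKGROUND = "background"  # influences other modules, never "heard"
    SUBVOCAL = "subvocal"      # inner speech only the system can hear
    SPOKEN = "spoken"          # sent to the speaker

def route_speech(importance, speaking_allowed, awareness_threshold=0.5):
    """Decide how far a generated phrase travels through the system."""
    if importance < awareness_threshold:
        return SpeechLevel.BACKGROUND
    if speaking_allowed:
        return SpeechLevel.SPOKEN
    return SpeechLevel.SUBVOCAL

print(route_speech(0.2, speaking_allowed=True).value)   # prints "background"
print(route_speech(0.8, speaking_allowed=False).value)  # prints "subvocal"
print(route_speech(0.8, speaking_allowed=True).value)   # prints "spoken"
```

The design choice worth noting is that generation itself is never switched off. Only the destination of its output changes.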
Even if the system is not speaking or printing text to a console, its language generation should be running in the background. Like us, it may or may not be conscious of the words its Broca's area is stringing together. In other words, its output may not be taking center stage in the global workspace. However, whether subliminal or not, the language it generates should still influence the behavior of other modules. And just as you can hear the words coming out of your mouth, this larger system would be able to analyze GPT-3's outputs and provide it with feedback about what to say next. We would want it to be able to monitor its own language output.
Broca's area takes conceptual information from the global workspace and turns it into a stream of words. It translates, fills in the blanks, and finds the appropriate words to express what is intended. When you are approached by a stranger that seems to have a sense of urgency, Broca's area turns your intentions into words: "Hi, how can I help you?" We don't have the mental capacity to pick and choose all the words we use individually. Much of it is done completely unconsciously by this system.
At first, the word selections made by the AI system would be almost entirely determined by the language model. This is analogous to how our language areas and their inherited architecture shape how we babble as infants. Slowly the weights and activation from the global workspace would start to influence the word selection randomly and subtly. Errors and reward feedback would alter the weights in various networks and slowly tune them to perform better. Over time, the language model will gradually relinquish control to the higher-order demands and constraints set by the larger system.
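A minimal sketch of this gradual handover, assuming the two sources of influence can each be reduced to a score per candidate word. The linear blend, the `workspace_weight` parameter, and the example scores are my assumptions for illustration. The text above does not specify how the two signals would actually be combined.

```python
def blend_scores(lm_scores, workspace_scores, workspace_weight):
    """Mix two score tables; workspace_weight runs from 0.0 to 1.0."""
    words = set(lm_scores) | set(workspace_scores)
    return {
        w: (1.0 - workspace_weight) * lm_scores.get(w, 0.0)
        + workspace_weight * workspace_scores.get(w, 0.0)
        for w in words
    }

def choose_word(lm_scores, workspace_scores, workspace_weight):
    """Pick the word with the highest blended score."""
    blended = blend_scores(lm_scores, workspace_scores, workspace_weight)
    return max(blended, key=blended.get)

# Hypothetical scores: the language model favors a habitual phrase, while
# the workspace favors the word that matches the current intention.
lm_scores = {"weather": 0.6, "help": 0.3, "banana": 0.1}
workspace_scores = {"weather": 0.1, "help": 0.9, "banana": 0.0}

early = choose_word(lm_scores, workspace_scores, workspace_weight=0.0)
late = choose_word(lm_scores, workspace_scores, workspace_weight=0.8)
print(early, late)  # prints "weather help"
```

With the weight at zero the language model decides alone, like an infant babbling. As the weight grows, the same candidates are re-ranked by what the larger system intends to say.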
The diagram below shows a large system made of rectangles. Each rectangle represents a neural network. The largest network on the left (more of a square) contains semantic representations that can be held in either the focus of attention (FOA), the short-term memory store (STM), or in inert long-term memory. The letters in this square show that this system is updated iteratively. This means that the contents of the system's working memory have changed from time one (t1) to time two (t2). But significantly, it hasn't changed entirely because these two states overlap in the set of concepts they contain. This kind of behavior would be important for the language module, but also for the other modules in our AI system as well.
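The overlap between successive states can be shown with a toy update rule. This sketch assumes a focus of attention with a fixed capacity that demotes its oldest item to the short-term store when a new item arrives. That rule, the capacity of four, and the function name `update` are simplifications chosen for illustration, not the mechanism from the article.

```python
def update(foa, stm, new_item, foa_capacity=4):
    """Add one item to the focus of attention; demote the oldest if full."""
    foa = foa + [new_item]
    stm = list(stm)
    if len(foa) > foa_capacity:
        stm.append(foa[0])  # the demoted item stays primed in STM
        foa = foa[1:]
    return foa, stm

foa_t1, stm_t1 = ["A", "B", "C", "D"], []
foa_t2, stm_t2 = update(foa_t1, stm_t1, "E")

# Items present at both time one and time two.
overlap = [item for item in foa_t1 if item in foa_t2]

print(foa_t2)   # prints ['B', 'C', 'D', 'E']
print(stm_t2)   # prints ['A']
print(overlap)  # prints ['B', 'C', 'D']
```

Three of the four items carry over from t1 to t2, so each state is a partial revision of the last one rather than a fresh start.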
It is important to mention that GPT-3 is also updated iteratively. Its attention span for the words that it just read is limited. Once it is full it is forced to drop the words that have been there the longest. We can assume that Broca's area is also updated iteratively. But unlike Broca's, GPT-3 does not connect to a larger system that prioritizes its working memory by using an FOA and an STM.
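The way a full attention span drops its oldest words can be sketched with a fixed-length queue. The window size of 5 below is a tiny stand-in for GPT-3's 2048 tokens, and whole words stand in for tokens.

```python
from collections import deque

CONTEXT_SIZE = 5  # tiny stand-in for GPT-3's 2048-token window

context = deque(maxlen=CONTEXT_SIZE)
for token in "the quick brown fox jumps over the lazy dog".split():
    context.append(token)  # once full, the oldest token is dropped

print(list(context))  # prints ['jumps', 'over', 'the', 'lazy', 'dog']
```

Everything before "jumps" is simply gone. Nothing decides which earlier words were important enough to keep, which is the gap that a focus of attention and a short-term store would fill.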
The network of neural networks described here should utilize nodes that fire for extended periods to simulate the sustained firing of cortical pyramidal neurons to create a focus of attention. Neurons that drop out of sustained firing should then remain primed using a form of synaptic potentiation amounting to an STM. This larger system should also use SSC, icSSC, iterative updating, multiassociative search, and progressive modification, as explained in my article here. This architecture should allow the system to form associations and predictions, formulate inferences, implement algorithms, compound intermediate results, and ultimately create a form of mental continuity.
Rather than relying on “next word prediction,” truly intelligent systems need a form of working memory and a global workspace. Linking the modern natural language generation models with the major mechanistic constructs from cognitive neuroscience could give us the superintelligence we want.
Sorry this entry is SO fragmented. I spent weeks on this but just couldn't seem to pull it together.