Thursday, October 7, 2021

A Language Model like GPT Could Serve as Broca's Area in a Larger AGI System

Cutting-edge AI language software is more powerful than ever, and what it can do today might surprise you. I'm sure you have heard of Siri, Alexa, and Google Assistant. These AIs can be really helpful, but they are not even near the cutting edge of computer language generation. Some of the best language AIs (also known as models) today can respond so well to queries that most people assume there is a human at the other end typing the answers. The newest kind, called large language models (LLMs), such as OpenAI's GPT-3, GPT-4, and ChatGPT, are simply astounding in terms of what they can do. However, even the best of them still have a way to go before achieving full human-level performance.

The limitations of these AI language models are very similar to those of our brain's language area. Isolated from the rest of the brain, our brain's language area could potentially still produce speech, but it would be rote, mechanical, and unplanned, much like the language produced by contemporary AI language models. But if this language software was integrated into a larger network with a form of working memory, as our brain's language area is, it could potentially produce much more coherent language.


Our brain's language region is called Broca's area and helps us with the fine details of piecing together our sentences. These are mostly the tedious details that we can't be bothered by. We are unconscious of most of the work Broca's area performs, but we wouldn't be able to speak without it. It does many things, including helping us find words that are on the tip of our tongue. Broca's area could at least keep us talking if the more generalist brain areas like the prefrontal cortex (involved in both consciousness and larger overarching language concerns) were gone. The speech generated from Broca's alone might sound grammatically correct and have proper syntax, but the semantics would have issues. We see this in people with prefrontal injuries today. They can talk indefinitely, and at first, what they are saying might sound normal, but it doesn't take long to realize that it is missing some intellectual depth. 


As you will learn here, modern AI language software is also missing depth because it does not plan its speech. These systems literally put their sentences together one word at a time. Given the series of words that have come so far, they predict which word is most likely to come next. Then they add their prediction to the list of words that have come before and use this new list to make the following prediction. They repeat this process to string together sentences and paragraphs. In other words, they have no forethought or real intention. You could say that modern AI language generation systems "think" like someone who has a serious brain injury or who has been lobotomized. They formulate speech like the TV show character Michael Scott from The Office. Here is a telling quote from Michael:


"Sometimes I'll start a sentence, and I don't even know where it's going. I just hope I find it along the way. Like an improv conversation. An improversation." 

-      Michael Scott


Michael is a parody. He is a manager with an attention deficit who has no real plan and does everything on the fly. As you can see on the show, his work doesn't lead to productivity, economic or otherwise. We need AI that is structured to do better than this.


The question becomes, how can we create an AI system that does more than just predict the next best word, one at a time? We want a system that plans ahead, formulating the gist of what it wants to say in its imagination before choosing the specific words needed to communicate it. As will be discussed, that will never emerge by using the same machine learning architectures we have been using. Additional architectural modifications are needed.


This entry will espouse taking the current neural network language architecture (the transformer neural network model introduced in 2017) and attaching it to a larger, generalist system. This larger system will allow word choice to be affected by a large number of varied constraints (not just the words that came earlier in the sentence). The diagram below shows a multimodal neural network made up of several different interfacing networks. You can see the language network on the far right, in the center, attached directly to a speaker.




It would be a tremendous engineering feat to get multiple neural networks to interface and work together, as depicted above. But research with neural networks has shown that they are often interoperable and can quickly adapt to each other and learn to work cooperatively. AI pioneer Marvin Minsky called the human brain a “society of mind.” By that he meant that our brain is made up of different modular networks that each contribute to a larger whole. I don’t think AI will truly become intelligent unless it works in this way. Before we go any further, the next section will briefly explain how neural networks (the most popular form of machine learning in AI) work.


How Neural Networks Choose Their Words


I explain how neural networks work in detail in my last entry, which you can read here


A quick recap: Most forms of machine learning, including neural networks, are systems with a large number of interlinked neuron-like nodes. These are represented by the circles in the diagram below. The nodes are connected via weighted links. These links are represented as the lines connecting the circles. The links are considered “weighted” because each has a numerical strength that is subject to change during learning. As the AI software is exposed to inputs, those inputs flow through the system (from left to right), travelling from node to node until they reach the output layer on the far right. That output layer contains a node for every word in the dictionary. Whichever node in the output layer is activated the most will become the network’s next chosen word.
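To make that last step concrete, here is a minimal Python sketch of how an output layer can turn activations into a chosen word. The five-word vocabulary and the activation values are made up for illustration; a real model has tens of thousands of output nodes and computes its activations from learned weights. The softmax function is a common way to express activations as probabilities, and it is an addition of this sketch; it does not change which node wins.

```python
import math

# Hypothetical toy vocabulary: one output node per word.
vocab = ["cat", "dog", "car", "house", "idea"]

# Made-up activations arriving at the output layer for "this is my pet ...".
activations = [3.2, 2.9, 0.4, 0.1, -1.0]

def softmax(values):
    """Convert raw activations into probabilities that sum to 1."""
    peak = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

probabilities = softmax(activations)

# The most strongly activated node becomes the next word.
best_index = max(range(len(vocab)), key=lambda i: probabilities[i])
next_word = vocab[best_index]
print(next_word)
```

Note that "dog" is a close runner-up here. The network does not know the answer; it only ranks the candidates.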


Language-generating neural networks are exposed to millions of sentences, typically from books or articles. The system can learn from what it is reading because it adapts to it. The weights’ values are strengthened when the network correctly guesses the next word in a sentence it is reading, and they are weakened when it chooses any other word. Given a broad attention span and billions of training sessions, these networks can get really good at internalizing the structure of the English language and piecing sentences together word by word.
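As a rough illustration of this kind of learning, the sketch below trains a single set of three weights, one per candidate word. It is a heavy simplification: real networks adjust millions or billions of weights using backpropagation, and the update rule used here (a standard gradient step on a softmax output) is my assumption about a reasonable stand-in, not a description of any particular system. The vocabulary, learning rate, and number of steps are arbitrary.

```python
import math

vocab = ["cat", "dog", "car"]   # hypothetical toy vocabulary
weights = [0.0, 0.0, 0.0]       # one weight per candidate next word
learning_rate = 0.5             # assumed value, chosen for illustration

def softmax(values):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def train_step(weights, target_word):
    """Nudge the weights toward the word that actually came next."""
    probs = softmax(weights)
    target = vocab.index(target_word)
    for i in range(len(weights)):
        expected = 1.0 if i == target else 0.0
        # Raises the weight for the correct word and lowers the others.
        weights[i] -= learning_rate * (probs[i] - expected)
    return weights

before = softmax(weights)[vocab.index("cat")]
for _ in range(10):
    train_step(weights, "cat")
after = softmax(weights)[vocab.index("cat")]
print(round(before, 3), round(after, 3))
```

Before training, all three words are equally likely. After ten exposures to "cat" as the correct answer, the probability assigned to "cat" has gone up and the others have gone down.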


The diagram below shows the words “this is my pet…” being fed into an AI neural network. The final word “cat” is hidden from the network. As “this is my pet” passes through the network from left to right, the words travel from node (circle) to node, through the weighted links, toward the full dictionary of nodes at the output layer. The pattern of inputs caused a pattern of network activity that then selected a single output. You can see the network converging on the word “cat” as its best prediction. It got it right! This network could then continue, rapidly writing sentences in this way for as long as you tell it to.


Broca's area in your brain works similarly. It takes inputs from many different systems in the brain, especially from the system that recognizes spoken language. These inputs activate select neurons out of a network of millions of them. The activation energy in the form of neural impulses travels through the network and toward something analogous to an output layer. This happens every time you speak a word. In both the brain and AI software, inputs work their way through a system of nodes toward an existing output. That output represents the decision, and in this case, it's the next word. 


AI that Generates Natural Language


Broca's area is a patch of cortical tissue in the frontal lobe designed by evolution to have all the right inputs, outputs, and internal connectivity to guide the involuntary, routinized aspects involved in speech production. Neuroscientists still don't know much about how it works, and reverse engineering it completely would be nearly impossible with today's technology. Lucky for us, AI probably doesn't need an exact equivalent of Broca's to develop the gift of gab. In fact, it may already have something even better.  


There are many state-of-the-art natural language systems we could discuss, but here we will focus on one called GPT-3 (which arrived in May 2020). It is an exciting new AI project that has proven to be highly adept at natural language processing (NLP). It can answer questions, write computer code, summarize long texts, and even write its own essays. However, keep in mind that as discussed above, it has no plan. The next word it chooses is just the word that it "predicts" should come next. This is called "next word prediction."


You can feed it the first two sentences of a news article, and it will write the rest of the article convincingly. You can ask it to write a poem in a certain author's style, and its output may be indistinguishable from an actual poem by that author. In fact, one blogger created a blog where they only posted GPT-3 text as entries. The entries were so good that people were convinced they were written by a human and started subscribing to the blog. Here is an example of a news article that it wrote:


Traditionally AI does poorly with common sense, but many of GPT-3’s responses are highly logical. I want to urge you to use an online search to find out more about the fantastic things that it can do. However, keep in mind that it sometimes makes fundamental mistakes that a human would never make. For example, it can say absurd things, completely lose coherence over long passages, and insert non-sequiturs and even falsehoods. Also, as rational as its replies may seem, GPT-3 has no understanding of the language it creates, and it is certainly not conscious in any way. This becomes clear from its responses to nonsense:




I don’t think that tweaking or expanding GPT-3’s architecture (which many in AI are discussing) is ever going to produce a general problem solver. But it, or a language system like it, could make a valuable contribution to a larger, more general-purpose AI. It could even help to train that larger AI. In fact, I think GPT-3 would be a perfect addition to many proposed cognitive architectures, including one that I have proposed in an article in the journal Physiology and Behavior here. The rest of this blog post will describe how a language model like GPT-3 could contribute meaningfully to a conscious machine if integrated with other specialized systems properly.


Playing the Role of Broca’s


When we are born, our language areas are not blank slates. They come with their own instincts. Highly complex wiring patterns in the cerebral cortex set us up in advance to acquire language and use it with ease. AI should also not be a blank slate, like an undifferentiated mass of neurons. It needs guidance in the form of a wiring structure. GPT-3’s existing lexicon and record of dependencies between words could help bootstrap a more extensive blank-slate system. Taking a pretrained system like GPT-3 and embedding it within a much larger AI network (that starts with predominantly random weights) could provide that AI network with the instincts and linguistic structure it needs to go from grammatical, syntactic, and lexical proficiency to proper comprehension. In other words, an advanced NLP system will provide Noam Chomsky and Steven Pinker’s “language instinct.”


When GPT-3 chooses the next word, it is not influenced by any other modalities. There is no sight, hearing, taste, smell, touch, mental imagery, motor responses, or knowledge from embodiment in the physical world influencing what it writes. It is certainly not influenced by contemplative thought. These are major limitations. By taking the GPT neural network and integrating it with other functionally specialized neural networks, we can get it to interact with similar systems that process information of different modalities resulting in a multimodal approach.  This will give it a broader form of attention that can keep track of, not just text but a variety of other incentives, perceptions, and concepts. GPT-3 already prioritizes strategically chosen words, but we want it also to prioritize snapshots, audio clips, memories, beliefs, and intentions.


Determining priority should be influenced by real-world events and experiences. Thus, the system should be able to make its own perceptual distinctions using cameras, microphones, and other sensors. It should also be able to interact using motors or servos with real-world objects. Actual physical interaction develops what psychologists call “embodiment,” crucial experiences that shape learning and understanding (GPT-3, on the other hand, is very much disembodied software). Knowledge about perception and physical interaction will influence the AI’s word choice, just like our real-world experiences influence the things we say. For instance, by applying embodied knowledge to the events that it witnesses at a baseball game, an AI should understand what it is like to catch or hit a ball. This understanding coming from experience would influence how it perceives the game, what it expects to happen, and the words it uses to talk about the game. This kind of embodied knowledge could then interact with the millions of pages of written text that it has read about baseball from sports news and other sources.


For me, the end goal of AI is to create a system that can help us accomplish things we cannot do on our own. I am most interested in creating an AI that can learn about science, by doing things like reading nonfiction books and academic articles, and then make contributions to scientific knowledge by coming up with new insights, innovations, and technology. To do this, an AI must think like a human, which means it must have the equivalent of an entire human brain and all of its sensory and associative cortices, not just its language area. 


I think that to build superintelligence or artificial general intelligence, the system must be embodied and multimodal. You want it to be interacting with the world in an active way: watching events unfold, watching movies and YouTube videos, and interacting with people and animals. As it does this, it should be using the words that it has to describe its experience as psychological items (higher-order abstractions) to make predictions about what will come next and how to interact with the world.


GPT-3 uses its attention to keep track of long-term dependencies. It selectively prioritizes the most relevant of recent words so that it can refer back to them. This is how it keeps certain words “in mind” so that it doesn’t stray from the topic as it writes. GPT-3 is 2048 tokens (think words) wide. That is its “context window” or attention span. In my opinion, this may be more than large enough to serve as an equivalent of Broca’s area. GPT-3 must have an attention of thousands of tokens because it is compensating for the fact that it doesn’t have the equivalent of an overarching, hierarchical, embodied, multimodal, global working memory.


The architecture of GPT-3 is very similar to that of GPT-2 and GPT-1. They all use the same algorithm for attention. GPT-3 performs much better than its earlier iterations, mostly because it is much larger. It contains more layers, has wider layers, and was trained on more data. Some people think that using this architecture and continuing to scale it up could lead to artificial general intelligence, which is AI that can do anything a human can do. Some even speculate that it could lead to conscious AI. I am highly convinced that GPT-3, or other neural networks like it, will never lead to consciousness.


Continuing to scale up this system will lead to improved performance but also diminishing returns. Although it could lead to many general abilities, it will never lead to true understanding or comprehension. Trying to do so would be like creating an Olympic sprinter by building an intricately complex robotic foot. The foot may be necessary, but you will need all the other body parts to come together for it to run competitively. GPT-3 must be linked together with other specialized modules into a shared workspace for it to really shine. Before we talk about this workspace in the last section, let’s look at Broca’s in a little more detail.


Broca's and Wernicke's Areas


Broca's area is a motor area in the frontal lobe responsible for speech. Patients with damage to Broca's have trouble speaking. If the damage is sufficient, they may be disfluent, aphasic, or mute. Much like GPT-3, Broca's selects the next word to be spoken based on the words that came before. Once it chooses the word that fits in context, it hands the word down to lower-order motor control areas (like the primary motor area) that coordinate the body's actual muscular structures (voice box, tongue, lips, and mouth) to say the word. The neurons in your motor strip constitute your output layer, and there is a dictionary in there in some shape or form. To continue to explain the role of Broca's in language, we must introduce its sister area, Wernicke's.


Wernicke's area is a cortical area that helps us process heard speech. It is found in the temporal lobe and takes its inputs from early auditory areas that get their inputs straight from the ears. Neurological patients with damage to this area can hear most nonspeech sounds normally as their auditory areas are still intact, but they have a specific deficit in recognizing language. In other words, your Wernicke's area will not try to analyze the sound of a car but will try to analyze the voice of your friend. It acts as a pattern recognizer specifically for spoken words.


Wernicke's and Broca's are specialized modules whose (mostly unconscious) outputs affect how we perceive and use language and even how we think. It is interesting to note that Broca's is about 20% larger in women than in men, and this may be responsible for women's greater fluency with language and higher verbal abilities.


The diagram below shows how we can go from hearing our friend say something to us to responding to them with our own words. First, the primary auditory area takes sounds heard by the ears, processes them further, and sends its output to Wernicke's area. Wernicke's then picks out the words from these sounds and sends those to Broca's. Broca's, in turn, sends the words that should be spoken in response to the motor area, which will then send the appropriate instructions to the tongue, jaw, mouth, and lips. This seems like a tight loop, but keep in mind that several other loops involving various brain areas contribute to our verbal responses (loops that NLP systems such as GPT-3 don't have).



It is worth mentioning that GPT-3 handles both the input of text and its output, so in this sense, it serves as an analogue of both Broca's and Wernicke's areas. However, it cannot hear or speak. This is easily fixed, though, by connecting a speech-to-text program to its input to allow it to hear. Allowing it to speak is as easy as connecting its output to a text-to-speech program.
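The wiring described here is simply three stages chained together. The sketch below shows the shape of that chain in Python. Everything in it is hypothetical: `make_voice_agent` and the three stand-in stages are names I made up, and the stand-ins just pass text through so that the example runs without any real speech or language service.

```python
from typing import Callable

def make_voice_agent(
    speech_to_text: Callable[[bytes], str],
    language_model: Callable[[str], str],
    text_to_speech: Callable[[str], bytes],
) -> Callable[[bytes], bytes]:
    """Chain the three stages: hear, choose words, speak."""
    def respond(audio_in: bytes) -> bytes:
        heard_text = speech_to_text(audio_in)    # Wernicke-like input stage
        reply_text = language_model(heard_text)  # GPT-3-like word selection
        return text_to_speech(reply_text)        # motor-like output stage
    return respond

# Stand-in stages so the sketch runs without any external service.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")

def fake_lm(text: str) -> str:
    return "You said: " + text

def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")

agent = make_voice_agent(fake_stt, fake_lm, fake_tts)
print(agent(b"hello"))
```

Because each stage is passed in as a function, any one of them could be swapped for a real component without touching the other two.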


In the brain, Broca's and Wernicke's have a continual open circuit connecting them at all times. They are constantly working together. Your Wernicke's area also listens to the speech you generate, which helps provide real-time feedback about the words coming out of your mouth. This allows language perception to be constantly linked to language generation. Not only does it allow us to hear the words that we say out loud, but it also gives a voice to the subvocal inner speech that we create. In other words, the loop between these two areas is responsible for the voice in your head, your internal monologue. Your Broca's area allows its outputs to be sent to your auditory area even when actual speech is suppressed, and this is why you can hear your own voice in your head even when you are not speaking aloud. We basically hallucinate our inner voice. Inner speech may be an essential aspect of consciousness, so we should give our AI system this kind of circuit. 

The circuit connecting Broca's to Wernicke's is also responsible for the "phonological loop", which is a form of short-term sensory memory that allows us to remember a crystal-clear version of the last 2.5 seconds of what we just heard. This is why you can remember someone's last sentence word for word or remember a seven-digit phone number. Judging from the fact that all humans have one and that it is very useful to us day to day, the phonological loop may also make substantial contributions to consciousness. For this reason, Broca's and Wernicke's analogues may be essential ingredients for superintelligent AI.


GPT-3 may be ready as-is to serve as an equivalent of Broca's area in a larger system that is designed to interface with it. However, it is not ready to handle the long-term conceptual dependencies necessary for true cognition. To do this, it needs to interact with a global workspace.


What is a Global Workspace?


The Global Workspace is a popular model of consciousness and brain architecture from brain scientist Bernard Baars. It emphasizes that what we are conscious of is broadcast globally throughout the brain ("fame in the brain"), even to unconscious processing areas. These unconscious areas operate in parallel, with little communication between them. They are, however, influenced by the global information and can form new impressions of it, which in turn can be sent back to the global workspace. 


The diagram below, adapted from Baars' work, shows five lower-order systems separated from each other by black bars. Each of these systems is hierarchical, and only at the top of their hierarchy can they communicate with one another. The location where they meet and exchange information is the global workspace. Here in the workspace, the most critical elements are activated and bound together into a conscious perception.   



This neurological model can be instantiated in a computer in the form of interacting neural networks. The diagram below shows six different neural networks, which all remain separate until their output layers are connected in a shared global workspace. The black letters represent items held active in working memory.



The global workspace is like a discussion between specialists who share their most important ideas. If Broca's converges strongly on a series of words, those will be shared with the global workspace. From there, they are shared with other brain areas and modules. For example, when you read about a "pink rhino in a bathing suit," the words in this phrase are translated to words you hear in your "mind's ear." They are then broadcast to the global workspace, where you become conscious of them. Finally, they are shared with your visual processing areas so that you can form a mental picture in your "mind's eye."
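Here is one very reduced way to picture that "discussion between specialists" in Python. Each module submits a proposal with an activation strength, the strongest proposal wins the workspace, and the winner is broadcast to every module. The module names, the proposals, and the strength values are all invented for illustration, and picking a single winner by maximum strength is my simplifying assumption; Baars' model is far richer than this.

```python
from dataclasses import dataclass, field

@dataclass
class Module:
    """A specialist network that can receive broadcasts from the workspace."""
    name: str
    received: list = field(default_factory=list)

    def receive(self, content: str) -> None:
        self.received.append(content)

def broadcast_winner(proposals, modules):
    """Pick the most strongly activated proposal and share it with every module."""
    source, content, _strength = max(proposals, key=lambda p: p[2])
    for module in modules:
        module.receive(content)
    return source, content

modules = [Module("language"), Module("vision"), Module("motor")]

# (source module, content, activation strength) -- all values are made up.
proposals = [
    ("language", "pink rhino in a bathing suit", 0.9),
    ("vision", "coffee cup on desk", 0.4),
    ("motor", "reach for cup", 0.2),
]

winner = broadcast_winner(proposals, modules)
print(winner)
```

After the broadcast, the vision module has received the phrase from the language module, which is the point at which it could start building a mental picture.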


It would probably be helpful to train the AI language model (or module) by itself first before it is dropped into a larger global architecture. This is similar to the way our Broca's and Wernicke's areas come with genetically determined wiring patterns that have been selected over tens of millions of years of evolution (it is worth mentioning that even apes and monkeys have analogues of these two areas, and they generally perform the same functions). Once the language area is dropped in it can play a hand in training the rest of the system by interacting with it. Over time, the two systems will help to fine-tune each other.


Broca's area is always running in the background, but its processing does not always affect us. It only has access to consciousness when what it is doing is deemed important by dopaminergic centers. Similarly, the language it produces is only broadcast to the larynx and thus spoken aloud when other brain areas grant this. Our AI system should work this way too. It should have three levels of natural language generation activity: it should be able to speak, produce subvocal speech that only it can hear, and have speech generation going on in the background that unconsciously influences it (and the global workspace). 


Even if the system is not speaking or printing text to a console, its language generation should be running in the background. Like us, it may or may not be conscious of the words its Broca's area is stringing together. In other words, its output may not be taking center stage in the global workspace. However, whether subliminal or not, the language it generates should still influence the behavior of other modules. And just as you can hear the words coming out of your mouth, this larger system would be able to analyze GPT-3's outputs and provide it with feedback about what to say next. We would want it to be able to monitor its own language output.


Broca's area takes conceptual information from the global workspace and turns it into a stream of words. It translates, fills in the blanks, and finds the appropriate words to express what is intended. When you are approached by a stranger that seems to have a sense of urgency, Broca's area turns your intentions into words: "Hi, how can I help you?" We don't have the mental capacity to pick and choose all the words we use individually. Much of it is done completely unconsciously by this system.


 At first, the word selections made by the AI system would be almost entirely determined by the language model. This is analogous to how our language areas and their inherited architecture shape how we babble as infants. Slowly the weights and activation from the global workspace would start to influence the word selection randomly and subtly. Errors and reward feedback would alter the weights in various networks and slowly tune them to perform better. Over time, the language model will gradually relinquish control to the higher-order demands and constraints set by the larger system.
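One simple way to picture this gradual handover is as a mixing dial between two sets of word scores. The sketch below is only an illustration under my own assumptions: the linear blend, the `workspace_influence` parameter, and the scores themselves are all made up, and nothing here says how such a dial would actually be learned.

```python
def blend_scores(lm_scores, workspace_scores, workspace_influence):
    """Mix language-model scores with workspace scores.

    workspace_influence runs from 0.0 (the language model decides alone)
    to 1.0 (the workspace decides alone).
    """
    return {
        word: (1 - workspace_influence) * lm_score
        + workspace_influence * workspace_scores.get(word, 0.0)
        for word, lm_score in lm_scores.items()
    }

def pick(scores):
    """Return the word with the highest blended score."""
    return max(scores, key=scores.get)

# Made-up scores: the language model favors "cat", but the rest of the
# system (say, the camera is looking at a rhino) favors "rhino".
lm_scores = {"cat": 0.6, "dog": 0.3, "rhino": 0.1}
workspace_scores = {"cat": 0.1, "dog": 0.1, "rhino": 0.8}

early = pick(blend_scores(lm_scores, workspace_scores, 0.1))
late = pick(blend_scores(lm_scores, workspace_scores, 0.8))
print(early, late)
```

Early on, with the dial near zero, the language model's habit wins. Later, with the dial turned up, what the larger system is attending to decides the word.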


The diagram below shows a large system made of rectangles. Each rectangle represents a neural network. The largest network on the left (more of a square) contains semantic representations that can be held in either the focus of attention (FOA), the short-term memory store (STM), or in inert long-term memory. The letters in this square show that this system is updated iteratively. This means that the contents of the system's working memory have changed from time one (t1) to time two (t2). But significantly, it hasn't changed entirely because these two states overlap in the set of concepts they contain. This kind of behavior would be important for the language module, but also for the other modules in our AI system as well.
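The key property here is that consecutive states share most of their contents. A toy Python version is below. The letters stand in for concepts, as in the diagram, and the rule for what gets dropped (simply the oldest item) is a placeholder of mine; in the model described in this post, priority would be set by the focus of attention and the short-term store.

```python
def update_working_memory(current, incoming, capacity):
    """Add new items and keep only the most recent `capacity` items."""
    merged = [item for item in current if item not in incoming] + list(incoming)
    return merged[-capacity:]

t1 = ["A", "B", "C", "D"]
t2 = update_working_memory(t1, ["E"], capacity=4)

# The concepts that persist from time one to time two.
overlap = set(t1) & set(t2)
print(t2, sorted(overlap))
```

One item leaves, one item enters, and three carry over. That carried-over subset is what ties one moment of processing to the next.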


It is important to mention that GPT-3 is also updated iteratively. Its attention span for the words that it just read is limited. Once it is full it is forced to drop the words that have been there the longest. We can assume that Broca's area is also updated iteratively. But unlike Broca's, GPT-3 does not connect to a larger system that prioritizes its working memory by using an FOA and an STM.
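The "drop the oldest words" behavior can be sketched in a few lines of Python using a fixed-length queue. The window size of 5 is a toy value chosen so the effect is visible, and words stand in for tokens.

```python
from collections import deque

CONTEXT_SIZE = 5  # toy value; the post cites 2048 tokens for GPT-3

context = deque(maxlen=CONTEXT_SIZE)
for token in "the quick brown fox jumps over the lazy dog".split():
    context.append(token)  # once full, the oldest token is dropped

print(list(context))
```

By the time the sentence ends, "the quick brown fox" has already fallen out of the window. Nothing decides that those words were unimportant; they were simply the oldest.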


The network of neural networks described here should utilize nodes that fire for extended periods to simulate the sustained firing of cortical pyramidal neurons to create a focus of attention. Neurons that drop out of sustained firing should then remain primed using a form of synaptic potentiation amounting to an STM. This larger system should also use SSC, icSSC, iterative updating, multiassociative search, and progressive modification, as explained in my article here. This architecture should allow the system to form associations and predictions, formulate inferences, implement algorithms, compound intermediate results, and ultimately create a form of mental continuity.

Rather than relying on “next word prediction,” truly intelligent systems need a form of working memory and a global workspace. Linking the modern natural language generation models with the major mechanistic constructs from cognitive neuroscience could give us the superintelligence we want.

Sorry this entry is SO fragmented. I spent weeks on this but just couldn't seem to pull it together.

To see my model of working memory and artificial superintelligence, visit:

Friday, July 2, 2021

How AIs Put Their Sentences Together: Natural Language Generation

AI that can produce natural language is a hot topic today. Here we are going to discuss how it is structured, how it works, how it learns, and how it could possibly be improved.

Natural language processing (NLP) is a subfield of AI concerned with recognizing and analyzing natural language data. Alexa, Siri, and Google Assistant all use NLP techniques. Capabilities of NLP software include speech recognition, language translation, sentiment analysis, and language generation. Here we are primarily interested in natural language generation, which means the creation of written text. There is a long history of software that can produce language but only in the last few years has it approached human-level capability.

There are many state-of-the-art systems we could discuss, but here we are going to focus on one called GPT-3. It is an exciting new AI system that has proven to be highly adept at natural language generation. It can answer questions, write computer code, summarize long texts, and even write its own essays. Its writing is so good that often it seems as if it was written by a human.

You can feed GPT-3 the first two sentences of a news article and it will write the rest of the article in a convincing manner. You can ask it to write a poem in the style of a certain author, and its output may be indistinguishable from an actual poem by that author. In fact, one blogger created a blog where they only posted GPT-3 text as entries. The entries were so good that people were convinced they were written by a human and started subscribing to the blog.

Take a look at a few examples of its responses to simple questions:

Traditionally AI does poorly with common sense, but as you can see many of GPT-3’s responses are highly logical. GPT-3 was trained on thousands of websites, books, and most of Wikipedia. This enormous and diverse corpus of unlabeled text amounted to hundreds of billions of words. Even though what it is doing is simple and mechanical, GPT-3 has so much memory, and has been exposed to such a high volume of logical writing from good authors, that it is able to unconsciously piece together sentences of great complexity and meaning. The way it is structured is fascinating, and I hope that by the end of this post you have a strong intuitive understanding of how it works.

What is Natural Language Generation Doing?

NLP uses distributional semantics. This means keeping track of which words tend to appear together in the same sentences and how they are ordered. Linguist John Firth (1890 – 1960) said, “You shall know a word by the company it keeps.” NLP systems keep track of when and how words accompany each other statistically. These systems are fed huge amounts of data in the form of paragraphs and sentences, and they analyze how the words tend to be distributed. They then use this probabilistic knowledge in reverse to generate language.
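A bare-bones version of "keeping track of the company a word keeps" is just counting which words show up in the same sentence. The Python sketch below does that for three made-up sentences. Real systems learn far richer statistics than raw pair counts, so treat this only as an illustration of the idea.

```python
from collections import Counter
from itertools import combinations

# Three made-up training sentences.
sentences = [
    "the cat chased the mouse",
    "the dog chased the cat",
    "the mouse ate the cheese",
]

cooccurrence = Counter()
for sentence in sentences:
    # Sort so that each pair is always stored in the same order.
    unique_words = sorted(set(sentence.split()))
    for pair in combinations(unique_words, 2):
        cooccurrence[pair] += 1

print(cooccurrence[("cat", "chased")])
print(cooccurrence[("cat", "cheese")])
```

"Cat" and "chased" appear together in two sentences, while "cat" and "cheese" never do. From counts like these, a system can start to guess which words belong near each other without knowing what any of them mean.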

As they write, NLP systems are “filling in the blank” in a process called “next word prediction.” That’s right: GPT has no idea what it is going to say next. It literally only focuses on one word at a time, one after another. GPT-3 “knows” nothing. It only appears to have knowledge about the world because of the intricate statistics it keeps on the mathematical relationships between words from works written by human authors. GPT-3 is basically saying: “Based on the training data I have been exposed to, if I had to predict what the next word in this sentence was, I would guess that it would be _____.”

When you give an NLP system a single word, it will find the most statistically appropriate word to follow it. If you give it half a sentence, it will use all the words to calculate the next most appropriate word. Then, after the system makes its first recommendation, it uses that word, along with the rest of the sentence, to recommend the next word. These systems compile sentences iteratively in this manner, word by word. They are not thinking. They are not using logic, mental imagery, concepts, ideas, semantic or episodic memory. Rather, they are using a glorified version of the autocomplete in your Google search bar, or your phone’s text messaging app.
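
The word-by-word loop described above can be sketched with a toy "bigram" model. Treat this as an illustration under loud assumptions: the tiny corpus is invented, and the model looks back only one word, which is far cruder than what GPT does.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus (an assumption for illustration only).
corpus = "today is my birthday . this is my pet cat . my pet cat is happy ."

def train_bigrams(text):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    words = text.split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts

def autocomplete(counts, start, n_words=4):
    """Repeatedly append the most frequent follower of the last word."""
    sentence = [start]
    for _ in range(n_words):
        followers = counts.get(sentence[-1])
        if not followers:
            break
        sentence.append(followers.most_common(1)[0][0])
    return " ".join(sentence)

bigrams = train_bigrams(corpus)
print(autocomplete(bigrams, "this"))  # this is my pet cat
```

Each pass through the loop uses only the statistics gathered from the corpus. Nothing in the code "knows" what a pet or a cat is.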

To really get a sense of this, open the text app on your phone. Type one word, then see what the phone offers you as an autocomplete suggestion for the next word. Select their recommendation. You can keep selecting their recommendation to string together a sentence. Depending on the algorithm the phone uses (likely Markovian) the sentence may make vague sense or may make no sense at all. In principle though, this is how GPT and all other modern language generating models work. The screenshots below show a search on Google, and some sentences generated by my phone’s predictive text feature.

A. Google using autocomplete to give you likely predictions for your search. B. Using the autocomplete suggestions above my phone’s keyboard to generate nonsense sentences.


GPT-3 Has a Form of Attention

Most autocomplete systems are much more myopic than GPT. They may only take the previous word, or previous two words into consideration. This is partially because it becomes very computationally expensive to look back further than a couple of words. The more previous variables that are tracked, the more expensive. A computer program that had both a list of every word in the English language and the word that is most likely to follow each word, would take up very little space in computer memory and require very little processing resources. However, what GPT-3 does is much more complex because it looks at the last several words to make its decisions.

The more words, the more context. The more context, the better the prediction. Let’s say you were given the word “my” and asked to predict the next word. Not very easy, right? What if you were given “is my”? Still not very easy. How about “today is my”? Now those three words might give you the context you need to predict that the next word is “birthday.” Words occurring along a timeline are not independent or equiprobable. Rather, there are correlations and conditional dependencies between successive words. What comes later is dependent on what came before. In that four-word string “today is my birthday” there is a short-term dependency between “today” and “birthday.” So being able to have a working memory of previous words is very helpful. More sophisticated AIs like GPT-3 can deal with long-term dependencies too. This is when, an entire paragraph later, GPT-3 can still reference the fact that today is someone’s birthday.
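
To make the "more context, better prediction" point concrete, here is a small sketch that counts which words follow a one-word context versus a three-word context. The mini-corpus is invented for the example, and simple counting is only a stand-in for what a trained network learns.

```python
from collections import Counter, defaultdict

# Invented mini-corpus; real models see hundreds of billions of words.
text = ("today is my birthday . here is my dog . where is my phone . "
        "today is my birthday . this is my dog .")

def next_word_counts(words, context_len):
    """Map each context (tuple of previous words) to counts of the next word."""
    table = defaultdict(Counter)
    for i in range(context_len, len(words)):
        context = tuple(words[i - context_len:i])
        table[context][words[i]] += 1
    return table

words = text.split()
one_word = next_word_counts(words, 1)[("my",)]
three_words = next_word_counts(words, 3)[("today", "is", "my")]

print(dict(one_word))     # several candidates follow "my"
print(dict(three_words))  # only "birthday" follows "today is my"
```

With one word of context the model has to hedge among three candidates. With three words of context only one candidate remains.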

By attending to preceding words, GPT-3 has a certain degree of precision and appropriateness, and is able to stay on track. For instance, it can remember the beginning of the sentence (or paragraph), and acknowledge it or elaborate on it. Of course, this is essential to good writing. Its attentional resources enable it to remember cues over many time steps, allowing its behavior to retain pertinence by accounting for what came earlier. While it was trained, the GPT-3 software was able to learn what to pay attention to given the context it was considering. This way it does not have to keep everything that came earlier in mind, it only stores what it predicts will be important in the near future.
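
For readers who want to see what "paying attention" can look like numerically, below is a minimal sketch of scaled dot-product attention, the standard mechanism in transformer networks. The post itself does not give the math, so everything here is an assumption for illustration: the two-number vectors are made up, and a real model learns its queries, keys, and values during training.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    biggest = max(scores)
    exps = [math.exp(s - biggest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Weight each earlier word's value vector by how well its key matches the query."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    mixed = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, mixed

# Hypothetical vectors for the preceding words "today", "is", "my".
keys = [[2.0, 0.0], [0.0, 0.1], [0.1, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query = [3.0, 0.0]  # made-up query for the position being predicted

weights, mixed = attend(query, keys, values)
print([round(w, 2) for w in weights])  # "today" receives most of the weight
```

The weights show where the attention goes. In this made-up case the word "today" dominates, which is the kind of cue that would help predict "birthday."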

If you can remember that you were talking about the vice-president two sentences ago, then you will be able to use the pronoun “she” when referring to her again. In this case your use of “she” is dependent on a noun that you used several seconds ago. This is an example of an event being used as a long-term dependency. Long-term dependencies structure our thinking processes, and they allow us to predict what will happen next, what our friend will do next, and they help us finish each other’s sentences. To a large extent, intelligence is the ability to capture, remember, manage, and act on short- and long-term dependencies.

GPT-3 uses its attention to keep track of several long-term dependencies at a time. It selectively prioritizes the most relevant of recent items so that it can refer back to them. This is how it is able to keep certain words “in mind” so that it doesn’t stray from the topic as it writes. GPT-3 is 2048 tokens wide, where a token is a word or a piece of a word. So, it has on the order of 1,500 words as its “context window” or attention span. This is clearly much larger than what a human has direct access to from the immediate past (most people cannot even remember a 10-digit number). Its attention is what allows it to write in a rational human-like way. Reading the following text from GPT-2, can you spot places where it used its backward memory span to attend to short- and long-term dependencies?

As you can see GPT-2 takes the context from the human-written prompt above and creates an entire story. Its story retains many of the initial elements introduced by the prompt and expands on them. You can also see how it is able to introduce related words and concepts and then refer back to them paragraphs later in a reasonable way.


Some Technical but Interesting Details About GPT-3

GPT-3 was introduced in May 2020 by OpenAI, whose founders include Elon Musk and Sam Altman. GPT-3 stands for Generative Pre-trained Transformer 3. The “generative” in the name means that it can create its own content. The word “pre-trained” means that it has already learned what it needs to know. Its learning is actually now complete (for the most part) and thus its synaptic weights have been frozen. The word “transformer” refers to the type of neural network it is (an architecture that replaces the recurrence used by earlier language networks with attention). The transformer architecture, by the way, is relatively simple. It has also been used in other language models such as Google’s BERT and Microsoft’s Turing Natural Language Generation (T-NLG).

The 3 in GPT-3 denotes that it is a third-generation product coming after GPT and GPT-2 as the third iteration of the GPT-n series. GPT-1 and 2 were also groundbreaking and similarly seen as technologically disruptive. GPT-3 has a wider attention span than GPT-2 and many more layers. GPT-2 had 1.5 billion parameters, and GPT-3 has a total of 175 billion parameters. Thus, it is over 100 times larger than its impressive predecessor, which came about a year before it. What are those 175 billion parameters? The parameters are the adjustable values, chiefly the synaptic weights between its neurons, that learning can change. The more parameters, the more memory it has, and the more structural complexity to its memory.

You can make a rough comparison between the 175 billion parameters in GPT-3 and the 100 trillion synapses in the human brain. That should give you a sense of how much more information your brain is capable of holding (over 500x). It cost an estimated $4.6 million to train GPT-3. At that rate, trying to scale it up to the size of the brain would cost an unwieldy $2.5 billion or so. However, considering the fact that neural network training efficiency has been doubling every 16 months, by 2032 scientists may be able to create a system with the memory capacity of the human brain (100 trillion parameters) for around the same cost as GPT-3 ($5 million). This is one reason why many people are excited about the prospect of keeping the GPT architecture and just throwing more compute at it to achieve superintelligence.
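
The back-of-the-envelope arithmetic in the paragraph above can be reproduced in a few lines. This is a rough sketch, not a forecast: it assumes that training cost grows in direct proportion to parameter count and that the 16-month doubling continues unchanged.

```python
import math

gpt3_params = 175e9        # parameters in GPT-3
brain_synapses = 100e12    # rough synapse count quoted above
gpt3_cost_usd = 4.6e6      # estimated GPT-3 training cost quoted above
doubling_months = 16       # efficiency doubling time quoted above

scale = brain_synapses / gpt3_params
# Assumption: cost grows in direct proportion to parameter count.
naive_cost = gpt3_cost_usd * scale
doublings = math.log2(scale)
years_needed = doublings * doubling_months / 12

print(f"scale-up factor: {scale:.0f}x")
print(f"naive cost today: ${naive_cost / 1e9:.1f} billion")
print(f"years of efficiency gains to offset it: {years_needed:.1f}")
```

Under these assumptions the scale-up factor is a little over 570, the naive cost lands in the low billions of dollars, and about twelve years of efficiency doublings would be needed, which is where the early 2030s estimate comes from.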

It is worth mentioning that scaling up from GPT-2 to GPT-3 has not yet resulted in diminishing returns. That is, its performance has increased on a straight line. This suggests that just throwing more computing power at the same architecture could lead to equally stunning performance for GPT-4. This has led many researchers to wonder how big this can get, and how far we can take it. I think that it will continue to scale well for a while longer, but I don’t think the transformer architecture will ever approach any form of sentient consciousness. Most forms of AI (machine learning and deep learning) are one trick ponies. They perform well, but only in one specific domain. My belief is that a specialized system like GPT will continue to be used in the future but will make modular contributions to more generalist systems. I cover that in the next blog entry which you can read here.

GPT-3 is a closed book system, which means that it does not query a database to find its answers, it “speaks from its own knowledge.” It has read Wikipedia, but (unlike IBM’s Jeopardy champion “Watson”) does not have Wikipedia saved verbatim in files on its hard drive. Rather, it “read” or traversed through Wikipedia and saved only the impressions that did not already match its existing structure. In other words, it saved information about the incorrect predictions it made about Wikipedia. It is important to keep in mind that it is not a simple lookup table. It is an autoregressive language model, meaning that it predicts future values from its memories of past values. It interpolates and extrapolates from what it remembers. It is amazing at this, and its abilities generalize to a wide variety of tasks. GPT-3 outperforms many fine-tuned, state of the art models in a range of different domains, tasks, and benchmarks. In fact, some of its accomplishments are mind blowing.

Human accuracy at detecting articles that were produced by GPT-3 (and not another human) is barely above chance at 52%. This means that it is very difficult to tell if you are reading something written by it or by a real human. It also means that GPT-3 has nearly passed a written form of the Turing test. GPT-3 really is an amazing engineering feat. It shows that by simply taking a transformer network and exposing it to millions of sentences from text online, you can get a system that appears intelligent. It appears intelligent even though it is not modeled after the brain and is missing most of the major features thought by brain scientists and psychologists to be instrumental to intelligence.

It writes as if it has understanding. But in reality, it understands nothing. It cannot build abstract conceptual structures and cannot reliably synthesize new intellectual or academic ideas. It has, however, shown glimmers of a simple form of reasoning that allows it to create true content that was not in its training set. For example, although it cannot add 10-digit numbers (which a pocket calculator can do with ease) it can add 2- and 3-digit numbers (35 + 67) and do lots of other math that it was never trained to do and never encountered an example of. Its designers claim that it has never seen any multiplication tables. Specialists are now arguing about what it means that it can do math that it has never seen.

In the example at the beginning of this blog entry GPT-3 knew that there are no animals with three legs. This knowledge was not explicitly programmed into it by a programmer, nor was it spelled out explicitly in its training data. Pretty amazing. If GPT-3 were designed differently and got its knowledge from a database of a long laundry list of facts (like the Cyc AI project) programmed by hand it wouldn’t easily interface and interact with other neural networks. But since GPT-3 is a neural network, it should play constructively and collaboratively with other neural networks. This really underscores the potential value of architectures like this in the future.

GPT-3 is very different from something like IBM’s Watson, the Jeopardy champion (which I have written about here). Watson was programmed with thousands of lines of code, annotations, and conditionals. This programming helped it respond to particular contingencies that were identified by humans in advance. GPT-3 has very little in the way of formal rules. The logical structure that comes out of it is coming from the English language content that it has “read.”

GPT-3 was trained on prodigious amounts of data in Microsoft’s cloud using graphics processing units (GPUs). GPUs are often used to train neural networks because they have a large number of cores. In other words, a GPU is like having many weak CPUs all of which can work on different small problems at the same time. This is only useful if the computer’s work can be broken down into separate threads (parallelizable). This makes a GPU well-suited for the highly parallel task of modeling individual neurons because each neuron can be modeled independently. Multicore GPUs have only been accessible to public consumers for the last ten years. They started with 2 cores, then 4, and today they can have thousands. What was the impetus for engineers to build fantastically complex multicore GPUs? It was the demand for videogames. Gaming computers and videogame consoles require better and better graphics cards to handle state of the art games. Less than 10 years ago AI scientists realized that they could take advantage of this and use GPUs themselves. This is a major reason why deep learning AI is performing at high levels and is such a hot topic today.

OpenAI and Microsoft needed hundreds of GPUs to train GPT-3. If they had only used one $8,000 RTX 8000 GPU (15 TFLOPS) it would have taken more than 600 years to process all of the training that took place. If they had used the GPU on your home computer, it would have likely taken thousands of years. That gives you an idea of how much processing time and resources went into training this network. But what is involved in training a network? Let’s discuss what is happening under the hood. (Apart from training, querying the pretrained model is also resource expensive. GPT-2 was able to run on a single machine at inference time, but GPT-3 must run on a cluster.)
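
The "more than 600 years" figure can be sanity-checked with simple division. The total compute number below is not stated in this post. It is an assumption taken from the figure reported in the GPT-3 paper (about 3.14e23 floating-point operations), and the calculation also assumes the GPU runs at its full 15 TFLOPS the whole time. The cluster size is hypothetical.

```python
# Assumption (not stated in the post): total training compute of about
# 3.14e23 floating-point operations, the figure reported in the GPT-3 paper.
total_flops = 3.14e23
gpu_flops_per_second = 15e12   # 15 TFLOPS, as quoted above
seconds_per_year = 365.25 * 24 * 3600

years_on_one_gpu = total_flops / gpu_flops_per_second / seconds_per_year
print(f"{years_on_one_gpu:.0f} years on a single GPU")

# Hypothetical cluster size, assuming perfect parallel scaling.
n_gpus = 1000
print(f"{years_on_one_gpu / n_gpus * 12:.1f} months on {n_gpus} GPUs")
```

Under these assumptions a single card needs several centuries, while a large cluster brings the job down to a matter of months. That is why the training had to be spread across so many GPUs.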


How Neural Networks Work

This next section will offer an explanation for how artificial neural networks operate. It will then show how neural networks are applied to language processing and NLP networks like GPT-3. It is important to point out that this explanation is highly simplified and incomplete, but it should communicate the gist and give you some helpful intuitions about how to think about how AI functions.

To understand how a neural network works, first let’s look at a single neuron with three inputs. The three circles on the left in the figure below represent neurons (or nodes). These neurons, X1, X2, and X3, are in the input layer and they are taking information directly from an outside source (the data you input). Each neuron is capable of holding a value from 0 to 1. It sends this value to the next neuron to its right. The neuron on the right takes the values from these three inputs and adds them together to see if they sum above a certain threshold. If they do, the neuron will fire. If not, it won’t. Just as in the brain, the artificial neuron’s pattern of firing over time is dependent on its inputs.

We all know that neural networks learn, so how could this simple system learn? It learns by tweaking its synaptic weights, W1, W2, and W3. In other words, if the neuron on the right learns that inputs 1 and 2 are important but that input 3 is not important, it will increase the weights of the first two and decrease the weight of the third. As you can see in the formula below, it multiplies each input by its associated weight and then adds the products together to get a number. Again, if this number is higher than its threshold for activation, it will fire. When it fires, it sends information to the neurons that it is connected to. Remember, this simple four-neuron example would just be a minuscule fraction of a contemporary neural network.
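
The single neuron described above fits in a few lines of code: it fires when X1*W1 + X2*W2 + X3*W3 is greater than its threshold. The particular input values, weights, and threshold here are hypothetical and chosen only to illustrate the idea.

```python
def neuron_fires(inputs, weights, threshold):
    """Return True if the weighted sum of the inputs exceeds the threshold."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return weighted_sum > threshold

# Hypothetical values: inputs X1..X3 in [0, 1], weights W1..W3, threshold 1.0.
inputs = [0.9, 0.8, 0.7]
weights = [1.0, 0.8, 0.1]   # inputs 1 and 2 matter, input 3 barely does
print(neuron_fires(inputs, weights, threshold=1.0))  # True
```

Because the third weight is small, a strong signal on input 3 alone would not be enough to make this neuron fire.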

The figure below shows a bunch of neurons connected together to create a network. This is a very simple, vanilla neural network sometimes referred to as a multilayer perceptron. Each neuron is connected to every neuron in the adjacent layers. This is referred to as being “fully connected.” All these connections are associated with a weight, which can be changed by experience, and thus provide the network with many different possible ways to learn. Each weight is a tuning knob that is adjusted through trial and error in an effort to find an optimal network configuration.

The network below recognizes digits, and here it is shown correctly recognizing the handwritten number 3. A picture of the handwritten 3 is fed into the system. The photo is 28 pixels wide by 28 pixels tall, for a total of 784 pixels. This means that the input layer is going to need 784 neurons, one for each pixel. The brightness of each pixel, on a scale from 0 to 1, is fed into the input neurons as you can see below. These numbers pass through the network from left to right. As they do this they will be multiplied by their associated weights at each layer until the “activation energy” from the pattern of input results in a pattern of output.

How many output neurons would you expect this network to have? Well, if it recognizes digits, then it should have 10 outputs, one for each digit (0-9). After the activation energy from the inputs passes through the network, one of the ten neurons in the output layer will be activated more than the others. This will be the network’s answer or conclusion. In the example below, the output neuron corresponding to the number 3 is activated the most at an activation level of .95. The system correctly recognized the digit as a 3. Please note the ellipses in each column which indicate that the neurons are so numerous that some of them are not pictured.
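
Here is a minimal sketch of that left-to-right pass through a fully connected network with 784 inputs and 10 outputs. Several things are assumptions made for illustration: the weights are random (so the network is untrained and its guess is meaningless), the hidden layer size of 16 is arbitrary, and the smooth sigmoid function stands in for the simple fire-or-not threshold.

```python
import math
import random

random.seed(0)

def make_layer(n_in, n_out):
    """Random weights and zero biases for one fully connected layer (untrained)."""
    weights = [[random.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def forward(layer, activations):
    """Weighted sum per neuron, squashed to the range 0..1 with a sigmoid."""
    weights, biases = layer
    return [1 / (1 + math.exp(-(sum(w * a for w, a in zip(row, activations)) + b)))
            for row, b in zip(weights, biases)]

# 784 input pixels -> 16 hidden neurons (an arbitrary choice) -> 10 digit outputs.
hidden_layer = make_layer(784, 16)
output_layer = make_layer(16, 10)

pixels = [random.random() for _ in range(28 * 28)]  # stand-in for a real image
outputs = forward(output_layer, forward(hidden_layer, pixels))
guess = outputs.index(max(outputs))
print(len(outputs), "outputs; most active digit neuron:", guess)
```

After training, the same forward pass would produce one output neuron that stands well above the others, like the .95 activation for the digit 3 described above.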

These artificial neurons and the connections between them amount to an artificial neural network. It is modeled and run inside of a computer. It is kind of like a videogame in the sense that these structures don’t actually exist in the real world, but are simulated at high fidelity and speed using a computer. This rather simple network captures some of the important cognitive mechanisms found in the human brain. The neural network is considered a form of machine learning, and when it contains more than three hidden layers, it is referred to as deep learning. Some modern networks have thousands of hidden layers. Now the amazing thing about these mathematical neurons is that if you connect a large number of them up in the right way and choose a good learning rule (a method of updating its weights), it can learn just about any mathematical expression. After the network is trained on data, it will capture patterns in the data, becoming one big mathematical function that can be used for things like classification.

In the next diagram we see a neural network that is being used to classify produce. It is using object features (like color, size, and texture) in the first layer to determine the type of produce (orange, apple, or lettuce) in the hidden layer. Then these are associated with either the fruit or vegetable classification in the output layer. The boldness of the lines indicates the strength of the weights. You can see that red, small, and smooth are all connected strongly to cherry. This means that whenever all three of these are activated by an input pattern, cherry will be selected in the hidden layer. You can also see that cherry is more strongly connected to fruit than vegetable in the output layer. So, by using only an object’s features this system could tell a fruit from a vegetable.

Please keep in mind that this example is a simplified, “toy” example. However, neural networks do work hierarchically in this way. At each layer, they take simple features and allow them to converge on more complex features, culminating in some kind of conclusion. In the example involving digit recognition above, the first hidden layer generally recognizes short line segments, and the layers after it recognize increasingly complex line segments including loops and curves. Finally, the output layer puts these together to form whole numbers (as when two loops create the number 8). Again, neural networks go from concrete features to abstract categories because of the way that the neurons in low-order layers (to the left) project to neurons in high-order layers (to the right).

The next diagram shows that neural networks can take many forms of input and come up with appropriate output. Let’s start just by looking at the first of the three networks in the diagram. That top network received a picture of a cat and recognized it as a cat. To do this it had to take the pixel brightness of each pixel and turn them into a long list of numbers. There is one input neuron for each pixel so if there were 2,073,600 (1080 x 1920) pixels then there must be that many input neurons in the input layer. The numbers (vectors) then flow mathematically through the network and toward the two output neurons, dog and cat. Cat ended up with a higher activation level than dog. Thus, the system is “guessing” the object in the photo is a cat. But to guess correctly the system must first be trained.

Now let’s talk about learning. When the system gives a correct answer, the connections responsible for its decision are strengthened. When it gives a wrong answer, the connections are weakened. This is similar to the reward and punishment that goes on to influence neuroplasticity in the human brain. In the example above, the network correctly categorized the picture as a cat. After it did this it was then told by the program it interacts with that it got it right. So, it then went back and strengthened all the weights responsible for helping it make that decision. If it had falsely recognized the picture as a dog, then it would have been told that it got it wrong, and it would go back and weaken all of the weights responsible for helping it make the wrong decision. Going back and making these adjustments based on the outcome of the guess is known as backpropagation. Backprop, as it is sometimes called, is one of the fundamental algorithms responsible for the success of neural networks.
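
The strengthen-or-weaken idea can be sketched with a single trainable neuron. This is a deliberately reduced version of what the paragraph above describes: real backpropagation passes the error backward through every layer, while this sketch updates only one layer of weights. The features and labels are invented.

```python
import math

def predict(weights, features):
    """Probability of 'cat' from a single sigmoid neuron."""
    total = sum(w * x for w, x in zip(weights, features))
    return 1 / (1 + math.exp(-total))

def train_step(weights, features, label, lr=0.5):
    """Nudge each weight in proportion to the error and to its own input."""
    error = predict(weights, features) - label   # label: 1 = cat, 0 = dog
    return [w - lr * error * x for w, x in zip(weights, features)]

# Invented features: [pointy ears, long snout, bias input fixed at 1].
examples = [([1.0, 0.0, 1.0], 1), ([0.9, 0.2, 1.0], 1),
            ([0.2, 1.0, 1.0], 0), ([0.1, 0.9, 1.0], 0)]

weights = [0.0, 0.0, 0.0]
before = predict(weights, examples[0][0])
for _ in range(200):
    for features, label in examples:
        weights = train_step(weights, features, label)
after = predict(weights, examples[0][0])
print(f"P(cat) for a cat picture: {before:.2f} before, {after:.2f} after training")
```

Before training the neuron is at a coin flip. After many small corrections, the weights that helped it recognize cats have grown and the ones that misled it have shrunk.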

As you can see this system requires supervision. This is known as supervised machine learning. It must be told when it is right and when it is wrong and that necessitates that its data is prelabeled. To create the training data for this system a person had to collect and then label thousands of pictures of dogs and cats so that the AI could be told when it is right and when it is wrong.

Next, let’s look at the middle network in the diagram above. This is an optical digit recognizer like the one we saw earlier. This AI system is shown correctly recognizing the number six. The network is behaving in much the same way as the cat/dog classifier, except here you can see that it has 10 outputs rather than just two. This is because it must be able to differentiate between the numbers 0 through 9.

The last network in the diagram is a natural language processing system and it works in a way that is very similar to the first two networks. It is given the first four words in a sentence, “This is my pet…” It is shown correctly predicting the word “cat” as the most probable next word. But this system does not only distinguish cats from dogs. This network must differentiate between all the words in the English language, so it has an input neuron and an output neuron corresponding to every word in the dictionary. That’s right, natural language generating AIs like GPT-3 need a dictionary worth of inputs and a dictionary worth of outputs.

There are around 170,000 words that are currently used in the English language. However, most people only use around 20,000 to 30,000 words conversationally. Many AI natural language models therefore use around 50,000 of the most common words. This means that there are 50,000 different possible outputs for the neural network. Each output is built into the network’s structure and as the activation energy passes from the inputs, through the hidden layers, and toward the outputs, one of those words will be more highly activated than any other. That word will be the one the network chooses.
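
Choosing the most activated output word can be sketched as follows. The six-word vocabulary and the activation values are made up for the example. The softmax function is the standard way to turn raw output activations into probabilities, though this post does not name it.

```python
import math

def softmax(scores):
    """Convert raw output activations into probabilities that sum to 1."""
    biggest = max(scores)
    exps = [math.exp(s - biggest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# A six-word stand-in for a roughly 50,000-word vocabulary.
vocab = ["cat", "dog", "rock", "thus", "birthday", "name"]
# Hypothetical output activations after reading "This is my pet".
activations = [4.0, 3.2, 1.0, -2.0, 0.5, 0.8]

probs = softmax(activations)
best = max(range(len(vocab)), key=lambda i: probs[i])
print(vocab[best], round(probs[best], 2))
```

Every word in the vocabulary receives some probability, but the word with the highest activation is the one the network chooses.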

The next diagram shows how a natural language processing network makes its decisions. The neuron for the word “pet” spreads its activation to all the neurons in the first hidden layer because it is connected to each one. Again, due to their weights, some neurons value this input more than others. These then fire at the next hidden layers until they reach the output layer activating all the neurons there, at least to some extent. One neuron in the output layer values this pattern of activation more than any other. The neuron that is activated the most, “cat,” is the one chosen by the network as being the most likely to follow the word “pet.” This is a helpful diagram, but it is a huge oversimplification. This is because, when GPT-3 chooses the word “cat” it is not because one word (pet) selected it, it is because the vectors for many words converged on it together. Remember? We said that GPT-3 has an attention 2048 tokens wide. That gives you an idea of just how many inputs are being considered simultaneously to select the next output.

Now, let’s put all of this together and consider what is happening when a natural language processing system like GPT-3 is undergoing training. Luckily, its training data does not need to be labeled by a person and its training process does not have to be supervised. Why you ask? Because the right answer is already there in the text. The trick is, it hides the next word from itself. As it reads, the next word is hidden from its view. It must predict what that next word will be. After it guesses, it is allowed to see if it was right. If it gets it right, it learns. If it gets it wrong, it unlearns.
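
The "hide the next word" trick means that plain text labels itself. A short sketch of how training examples might be cut from a sentence is shown below. The function name and the three-word window are hypothetical choices, since GPT-3 uses a window of up to 2048 tokens.

```python
def training_pairs(text, context_len=3):
    """Slide over the text, hiding each next word and keeping what came before it."""
    words = text.split()
    pairs = []
    for i in range(1, len(words)):
        context = words[max(0, i - context_len):i]
        pairs.append((context, words[i]))   # (visible words, hidden answer)
    return pairs

for context, answer in training_pairs("this is my pet cat"):
    print(context, "->", answer)
```

A five-word sentence yields four practice questions, each with its answer already supplied by the text, so no human has to label anything.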

With the cat and dog classifier, the system would make a prediction, learn and then start all over again with a new picture. Natural language generating AIs do not start over with each word. Rather, they keep reading and using the previous words to update their attention in an iterative fashion. The diagram below gives an example of this. In the example, the system uses the context it is holding to guess the first two words accurately (“is” and “my”) but gets the next two wrong (“pet” and “cat”). When a system like GPT-3 reads through Wikipedia it is constantly making errors, but because its attention is so wide, after extensive training it develops a preternatural ability to make inferences about what word could be coming next. 


So to recap, GPT-3 takes a series of words (starting with the words you give it as a prompt or with the series that it is currently in the process of generating) and then fills in the next blank. It gives some consideration to each word in the dictionary every time it chooses the next word. To decide which one to use it pushes data input through a mathematical network of neurons, toward its entire dictionary of outputs. Whichever word receives the most spreading activation out of all the potential outputs is assigned the highest probability and is used to fill in the next blank. To accomplish this using mathematics the words themselves are represented as vectors (strings of numbers) and these vectors interact with the numerical structure of the existing network (through matrix multiplication).

Another way to frame this is to point out that GPT is basically asking, “given the previous words, what is the probability distribution for the next word?” Once it picks a word from that distribution, it adds it to the list and samples again from the new distribution.
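
That sample-append-repeat loop might look like the sketch below. The lookup table is a hypothetical stand-in for the neural network and conditions only on the last word, whereas GPT conditions on its whole context window.

```python
import random

# Hypothetical stand-in for the network: a lookup from the last word to a
# probability distribution over next words. GPT conditions on the whole context.
toy_model = {
    "this": {"is": 1.0},
    "is": {"my": 0.9, "a": 0.1},
    "my": {"pet": 0.6, "name": 0.4},
    "pet": {"cat": 0.7, "dog": 0.3},
}

def generate(prompt, max_new_words=5, seed=0):
    """Sample one word at a time and append it, then predict again."""
    rng = random.Random(seed)
    words = prompt.split()
    for _ in range(max_new_words):
        dist = toy_model.get(words[-1])
        if dist is None:          # no known continuation: stop
            break
        choices, probs = zip(*dist.items())
        words.append(rng.choices(choices, weights=probs, k=1)[0])
    return " ".join(words)

print(generate("this is"))
```

Because each word is sampled rather than always taking the single most likely one, different seeds can produce different continuations from the same prompt.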

The diagram above shows how, when given the string of words “this is my pet…,” an AI that had not finished training could come up with a word like “dog.” The right word was “cat.” So, when GPT-3 gets it wrong it will learn from its mistake. The mathematical difference between “cat” and “dog” is calculated and used to update the system. Of course, this is an arbitrary distinction (and much of what the system learns is arbitrary). There is nothing wrong with saying “this is my pet dog.” But if this phrase occurred in an article about cats there might be something wrong with it. If GPT-3’s attention is wide enough to recognize that it is reading an article about cats, the learning might be more helpful because it would help train the system to group similar words together.
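
One common way to put a number on how wrong a prediction was is the cross-entropy error, which looks at how much probability the network gave to the word that actually came next. The post only speaks of a "mathematical difference," so treat the choice of cross-entropy and the probabilities below as illustrative assumptions.

```python
import math

def cross_entropy(probs, correct_word):
    """Error is large when the correct word was given a low probability."""
    return -math.log(probs[correct_word])

# Hypothetical predictions for the blank in "this is my pet ____".
untrained = {"cat": 0.10, "dog": 0.60, "rock": 0.30}
trained = {"cat": 0.70, "dog": 0.25, "rock": 0.05}

print(round(cross_entropy(untrained, "cat"), 2))  # 2.3
print(round(cross_entropy(trained, "cat"), 2))    # 0.36
```

The error shrinks as more probability is shifted onto the correct word, and that error is the signal used to adjust the weights.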

Before training takes place, the system’s outputs are generated at random because its parameters (synaptic weights) are set to random values (like most neural networks). But during training inappropriate responses are compared to correct responses, the error of the response is calculated, and then this error value is used to alter the model’s parameters so it is more likely to choose the right word next time. When it does this, it changes the way that it mathematically represents the word “cat” in its network. For instance, the words “cat” and “feline” may not be related in its memory at all, but during training they will come to be more closely related because they are likely to pop up in the same sentences. Another way of saying this is that the system will learn to group things that appear close together in time (temporal contiguity). The way these two words (cat and feline) are encoded in memory as numbers (vectors of floats) will become more and more similar. This places semantically related words closer and closer together in a multidimensional web of definitions.
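
How "close together" two word vectors are is often measured with cosine similarity. The three-number vectors below are invented for the example and do not come from any real model.

```python
import math

def cosine_similarity(a, b):
    """1.0 means the vectors point the same way; near 0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 3-number vectors; real models use hundreds or thousands of numbers.
cat = [0.9, 0.8, 0.1]
feline = [0.8, 0.9, 0.2]
thus = [-0.1, 0.2, 0.9]

print(round(cosine_similarity(cat, feline), 2))  # close to 1: related words
print(round(cosine_similarity(cat, thus), 2))    # close to 0: unrelated words
```

Training gradually moves the vectors for words like "cat" and "feline" toward each other, so their similarity score rises.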

Thus far, we have explained how an NLP system learns to make predictions about language, but here we are interested in natural language generation. So how would you get such a system to write its own content? It is easy, you simply start a sentence for it, or ask it a question. That will fill its attention with context that it will then use to predict what the next word should be. It continues on, adding words to the end, generating highly complex, synthetic speech. A chain of predictions becomes a narrative. By now you should be able to see why a recent article refers to modern language models as “stochastic parrots.” They are. They do a fantastic job of mimicking human language in a pantomime, chaotic, difficult-to-predict way.



For every word in the English language, there is one word and only one word that is most likely to follow it (my … name). Some words will be slightly less likely to follow it (my … cat). Other words may have almost no probability of following (my … thus). Natural language models from 40 years ago would predict the next word only from the single word that directly preceded it. Most of the time they could not formulate a coherent phrase, much less a sentence or paragraph. But as you know, the newer language models look back much further than just one word. Their attention span maintains whole paragraphs in memory while constantly adding new words and subtracting the words that have been there the longest. Like my model of consciousness, they exhibit a form of working memory that evolves gradually through time. They use what I call “iterative updating” and “incremental change in state-spanning coactivity.”

But the human working memory doesn’t just track a succession of words, it tracks a succession of events. Just as there is one most probable word to follow any series of words, there is one most probable event to follow a sequence of events. We use our working memory to record events as they happen to help us decide what to predict next.

In the next blog entry, we will consider how a system like GPT-3 could be incorporated into a much larger system of neural networks to better approximate human working memory by using, not just words, but events as long-term dependencies. This will allow an AI to make predictions, not just of the best word to use next, but the best behavior. We will also discuss what other modular networks would be valuable to such a larger system. For instance, we would want modules that correspond to video input, mental imagery, and motor output. GPT-3 can currently interact with other programs, but these programs are far from being tightly integrated. To create a conscious system, we want multiple neural networks with different functional specializations to be tightly interwoven just like they are in our brain. What role would an AI like GPT-3 play in this larger system? It would probably play the role of Broca’s area, the human brain’s language region. For all the details, please stay tuned for the next entry.

Also you might want to watch my YouTube video lecture on working memory and consciousness: