Friday, July 2, 2021

How AIs Put Their Sentences Together: Natural Language Generation

AI that can produce natural language is a hot topic today. Here we are going to discuss how it is structured, how it works, how it learns, and how it could possibly be improved.

Natural language processing (NLP) is a subfield of AI concerned with recognizing and analyzing natural language data. Alexa, Siri, and Google Assistant all use NLP techniques. Capabilities of NLP software include speech recognition, language translation, sentiment analysis, and language generation. Here we are primarily interested in natural language generation, which means the creation of written text. There is a long history of software that can produce language but only in the last few years has it approached human-level capability.

There are many state-of-the-art systems we could discuss, but here we are going to focus on one called GPT-3. It is an exciting new AI system that has proven to be highly adept at natural language generation. It can answer questions, write computer code, summarize long texts, and even write its own essays. Its writing is so good that often it seems as if it was written by a human.

You can feed GPT-3 the first two sentences of a news article and it will write the rest of the article in a convincing manner. You can ask it to write a poem in the style of a certain author, and its output may be indistinguishable from an actual poem by that author. In fact, one blogger created a blog where they only posted GPT-3 text as entries. The entries were so good that people were convinced they were written by a human and started subscribing to the blog.

Take a look at a few examples of its responses to simple questions:


Traditionally AI does poorly with common sense, but as you can see many of GPT-3’s responses are highly logical. GPT-3 was trained on thousands of websites, books, and most of Wikipedia. This enormous and diverse corpus of unlabeled text amounted to hundreds of billions of words. Despite the fact that what it is doing is simple and mechanical, because GPT-3 has so much memory, and has been exposed to such a high volume of logical writing from good authors, it is able to unconsciously piece together sentences of great complexity and meaning. The way it is structured is fascinating and I hope that by the end of this post you have a strong intuitive understanding of how it works.


What is Natural Language Generation Doing?

NLP uses distributional semantics. This means keeping track of which words tend to appear together in the same sentences and how they are ordered. Linguist John Firth (1890 – 1960) said, “You shall know a word by the company it keeps.” NLP systems keep track of when and how words accompany each other statistically. These systems are fed huge amounts of data in the form of paragraphs and sentences, and they analyze how the words tend to be distributed. They then use this probabilistic knowledge in reverse to generate language.

As they write, NLP systems are “filling in the blank” in a process called “next word prediction.” That’s right, GPT has no idea what it is going to say next, it literally only focuses on one word at a time, one after another. GPT-3 “knows” nothing. It only appears to have knowledge about the world because of the intricate statistics it keeps on the mathematical relationships between words from works written by human authors. GPT-3 is basically saying: “Based on the training data I have been exposed to, if I had to predict what the next word in this sentence was, I would guess that it would be _____.”

When you give an NLP system a single word, it will find the most statistically appropriate word to follow it. If you give it half a sentence, it will use all the words to calculate the next most appropriate word. Then, after the system makes its first recommendation, it uses that word, along with the rest of the sentence, to recommend the next word. These systems compile sentences iteratively in this manner, word by word. They are not thinking. They are not using logic, mental imagery, concepts, ideas, semantic or episodic memory. Rather, they are using a glorified version of the autocomplete in your Google search bar, or your phone’s text messaging app.

To really get a sense of this, open the text app on your phone. Type one word, then see what the phone offers you as an autocomplete suggestion for the next word. Select their recommendation. You can keep selecting their recommendation to string together a sentence. Depending on the algorithm the phone uses (likely Markovian) the sentence may make vague sense or may make no sense at all. In principle though, this is how GPT and all other modern language generating models work. The screenshots below show a search on Google, and some sentences generated by my phone’s predictive text feature.


A. Google using autocomplete to give you likely predictions for your search. B. Using the autocomplete suggestions above my phone’s keyboard to generate nonsense sentences.

 

GPT-3 Has a Form of Attention

Most autocomplete systems are much more myopic than GPT. They may only take the previous word, or previous two words into consideration. This is partially because it becomes very computationally expensive to look back further than a couple of words. The more previous variables that are tracked, the more expensive. A computer program that had both a list of every word in the English language and the word that is most likely to follow each word, would take up very little space in computer memory and require very little processing resources. However, what GPT-3 does is much more complex because it looks at the last several words to make its decisions.
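The one-word-lookback program described above can be sketched in a few lines. This is only an illustration, assuming a tiny made-up corpus; the names `corpus`, `most_likely_next`, and `autocomplete` are hypothetical and do not come from any real keyboard app.

```python
from collections import Counter, defaultdict

# Tiny hypothetical corpus; a real system would count over far more text.
corpus = ("today is my birthday . my cat is my pet . "
          "today is my day off . my cat is asleep").split()

# Count how often each word follows each other word.
follow_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follow_counts[prev][nxt] += 1

# Keep only the single most likely follower of each word (the lookup table).
most_likely_next = {w: c.most_common(1)[0][0] for w, c in follow_counts.items()}

def autocomplete(word, steps=5):
    """Repeatedly accept the top suggestion, like tapping a phone keyboard."""
    words = [word]
    for _ in range(steps):
        if words[-1] not in most_likely_next:
            break  # this word was never followed by anything in the corpus
        words.append(most_likely_next[words[-1]])
    return " ".join(words)

print(autocomplete("today"))
```

Because the table only ever looks at the previous word, the output quickly falls into a loop ("my cat is my cat ..."), which is the kind of nonsense that short-memory autocomplete tends to produce.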

The more words, the more context. The more context, the better the prediction. Let’s say you were given the word “my” and asked to predict the next word. Not very easy, right? What if you were given “is my”?  Still not very easy. How about, “today is my”. Now those three words might give you the context you need to predict that the next word is “birthday.” Words occurring along a timeline are not independent or equiprobable. Rather, there are correlations and conditional dependencies between successive words. What comes later is dependent on what came before. In that four word string “today is my birthday” there is a short-term dependency between “today” and “birthday.” So being able to have a working memory of previous words is very helpful. More sophisticated AIs like GPT-3 can deal with long-term dependencies too. This is when, an entire paragraph later, GPT-3 can still reference the fact that today is someone’s birthday.
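To make the "more context, better prediction" point concrete, here is a small sketch that counts which words follow a context of one, two, or three words. The sentences are invented for illustration, and `next_word_counts` is a hypothetical helper.

```python
from collections import Counter

# Hypothetical toy corpus, one sentence per entry.
sentences = [
    "today is my birthday",
    "today is my birthday party",
    "this is my cat",
    "that is my house",
    "where is my phone",
    "my name is sam",
]

def next_word_counts(context):
    """Count the words that follow the given context (a tuple of words)."""
    n = len(context)
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        for i in range(len(words) - n):
            if tuple(words[i:i + n]) == context:
                counts[words[i + n]] += 1
    return counts

# Longer contexts leave fewer candidate next words.
for context in [("my",), ("is", "my"), ("today", "is", "my")]:
    print(context, dict(next_word_counts(context)))
```

With only "my" as context there are five candidate next words; with "today is my" only "birthday" remains.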

By attending to preceding words, GPT-3 has a certain degree of precision and appropriateness, and is able to stay on track. For instance, it can remember the beginning of the sentence (or paragraph), and acknowledge it or elaborate on it. Of course, this is essential to good writing. Its attentional resources enable it to remember cues over many time steps, allowing its behavior to retain pertinence by accounting for what came earlier. While it was trained, the GPT-3 software was able to learn what to pay attention to given the context it was considering. This way it does not have to keep everything that came earlier in mind; it only stores what it predicts will be important in the near future.
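This post does not spell out the mathematics of attention, so the following is only a sketch of the standard scaled dot-product attention used in transformer networks, reduced to a single head with random stand-in weights. The function name `causal_self_attention` and all of the sizes are assumptions chosen for illustration, not GPT-3's actual configuration.

```python
import numpy as np

def causal_self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product attention where each position sees only earlier ones."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Mask out future positions so a word cannot look ahead.
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # A softmax over each row turns the scores into attention weights.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8                      # e.g. four tokens, 8 numbers per token
x = rng.normal(size=(seq_len, d_model))      # random stand-in word vectors
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

output, weights = causal_self_attention(x, w_q, w_k, w_v)
print(np.round(weights, 2))                  # row i: how much word i attends to each earlier word
```

Each row of `weights` sums to 1 and is zero to the right of the diagonal, so every word distributes its attention only over itself and the words that came before it. In a trained model the weight matrices are learned, which is how the network comes to know what to pay attention to.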

If you can remember that you were talking about the vice-president two sentences ago, then you will be able to use the pronoun “she” when referring to her again. In this case your use of “she” is dependent on a noun that you used several seconds ago. This is an example of an event being used as a long-term dependency. Long-term dependencies structure our thinking processes, and they allow us to predict what will happen next, what our friend will do next, and they help us finish each other’s sentences. To a large extent, intelligence is the ability to capture, remember, manage, and act on short- and long-term dependencies.

GPT-3 uses its attention to keep track of several long-term dependencies at a time. It selectively prioritizes the most relevant of recent items so that it can refer back to them. This is how it is able to keep certain words “in mind” so that it doesn’t stray from the topic as it writes. GPT-3 is 2048 tokens wide, where a token is a word or a piece of a word. So, it has roughly a couple thousand words as its “context window” or attention span. This is clearly much larger than what a human has direct access to from the immediate past (most people cannot remember a 10-digit number). Its attention is what allows it to write in a rational human-like way. Reading the following text from GPT-2 can you spot places where it used its backward memory span to attend to short and long-term dependencies?


As you can see GPT-2 takes the context from the human-written prompt above and creates an entire story. Its story retains many of the initial elements introduced by the prompt and expands on them. You can also see how it is able to introduce related words and concepts and then refer back to them paragraphs later in a reasonable way.

 

Some Technical but Interesting Details About GPT-3

GPT-3 was introduced in May 2020 by OpenAI, which was co-founded by Elon Musk and Sam Altman, among others. GPT-3 stands for Generative Pre-trained Transformer 3. The “generative” in the name means that it can create its own content. The word “pre-trained” means that it has already learned what it needs to know. Its learning is actually now complete (for the most part) and thus its synaptic weights have been frozen. The word “transformer” refers to the type of neural network it is (an attention-based architecture that, unlike the recurrent networks that preceded it, processes all of the words in its context in parallel rather than one at a time). The transformer architecture, by the way, is relatively simple. It has also been used in other language models such as Google’s BERT and Microsoft’s Turing Natural Language Generation (T-NLG).

The 3 in GPT-3 denotes that it is a third-generation product coming after GPT and GPT-2 as the third iteration of the GPT-n series. GPT-1 and 2 were also groundbreaking and similarly seen as technologically disruptive. GPT-3 has a wider attention span than GPT-2 and many more layers. GPT-2 had 1.5 billion parameters, and GPT-3 has a total of 175 billion parameters. Thus, it is over 100 times larger than its impressive predecessor, which came a little over a year before it. What are those 175 billion parameters? The parameters are the adjustable synaptic weights on the connections between its neurons, the numbers that change during learning. The more parameters, the more memory it has, and the more structural complexity to its memory.

You can make a rough comparison between the 175 billion parameters in GPT-3 to the 100 trillion synapses in the human brain. That should give you a sense of how much more information your brain is capable of holding (over 500x). It cost $4.6 million to train GPT-3. At that rate, trying to scale it up to the size of the brain would cost an unwieldy $2.5 billion. However, considering the fact that neural network training efficiency has been doubling every 16 months, by 2032 scientists may be able to create a system with the memory capacity of the human brain (100 trillion parameters) for around the same cost of GPT-3 ($5 million). This is one reason why many people are excited about the prospect of keeping the GPT architecture and just throwing more compute at it to achieve superintelligence.
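The back-of-the-envelope numbers above can be checked with a short calculation. This sketch uses only the figures quoted in this post and assumes, as the text does, that training cost grows in proportion to the number of parameters and that efficiency keeps doubling every 16 months starting from 2020.

```python
import math

gpt3_params = 175e9       # parameters in GPT-3 (from the text)
brain_synapses = 100e12   # rough synapse count of the human brain (from the text)
gpt3_cost = 4.6e6         # dollars to train GPT-3 (from the text)
doubling_months = 16      # efficiency doubling time (from the text)

scale = brain_synapses / gpt3_params
naive_cost = gpt3_cost * scale          # ASSUMPTION: cost is linear in parameter count
doublings = math.log2(scale)            # halvings of cost needed to get back to GPT-3's price
years = doublings * doubling_months / 12

print(f"scale factor: {scale:.0f}x")
print(f"naive cost today: ${naive_cost / 1e9:.1f} billion")
print(f"years of efficiency gains needed: {years:.1f} (around {2020 + round(years)})")
```

The scale factor comes out a little above 500x, the naive cost lands in the neighborhood of the $2.5 billion quoted above, and roughly nine doublings at 16 months each point to the early 2030s.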

It is worth mentioning that scaling up from GPT-2 to GPT-3 has not yet resulted in diminishing returns. That is, its performance has increased on a straight line. This suggests that just throwing more computing power at the same architecture could lead to equally stunning performance for GPT-4. This has led many researchers to wonder how big this can get, and how far we can take it. I think that it will continue to scale well for a while longer, but I don’t think the transformer architecture will ever approach any form of sentient consciousness. Most forms of AI (machine learning and deep learning) are one trick ponies. They perform well, but only in one specific domain. My belief is that a specialized system like GPT will continue to be used in the future but will make modular contributions to more generalist systems. I cover that in the next blog entry which you can read here.

GPT-3 is a closed book system, which means that it does not query a database to find its answers; it “speaks from its own knowledge.” It has read Wikipedia, but (unlike IBM’s Jeopardy champion “Watson”) does not have Wikipedia saved verbatim in files on its hard drive. Rather, it “read” or traversed through Wikipedia and saved its impressions of it that did not already match its existing structure. In other words, it saved information about the incorrect predictions it made about Wikipedia. It is important to keep in mind that it is not a simple lookup table. It is an autoregressive language model, meaning that it predicts future values from its memories of past values. It interpolates and extrapolates from what it remembers. It is amazing at this, and its abilities generalize to a wide variety of tasks. GPT-3 outperforms many fine-tuned, state of the art models in a range of different domains, tasks, and benchmarks. In fact, some of its accomplishments are mind blowing.

Human accuracy at detecting articles that were produced by GPT-3 (and not another human) is barely above chance at 52%. This means that it is very difficult to tell if you are reading something written by it or by a real human. It also means that GPT-3 has nearly passed a written form of the Turing test. GPT-3 really is an amazing engineering feat. It shows that by simply taking a transformer network and exposing it to millions of sentences from text online, you can get a system that appears intelligent. It appears intelligent even though it is not modeled after the brain and is missing most of the major features thought by brain scientists and psychologists to be instrumental to intelligence.

It writes as if it has understanding. But in reality, it understands nothing. It cannot build abstract conceptual structures and cannot reliably synthesize new intellectual or academic ideas. It has, however, shown glimmers of a simple form of reasoning that allows it to create true content that was not in its training set. For example, although it cannot add 10-digit numbers (which a pocket calculator can do with ease) it can add 2- and 3-digit numbers (35 + 67) and do lots of other math that it was never trained to do and never encountered an example of. Its designers claim that it has never seen any multiplication tables. Specialists are now arguing about what it means that it can do math that it has never seen.

In the example at the beginning of this blog entry GPT-3 knew that there are no animals with three legs. This knowledge was not explicitly programmed into it by a programmer, nor was it spelled out explicitly in its training data. Pretty amazing. If GPT-3 were designed differently and got its knowledge from a database of a long laundry list of facts (like the Cyc AI project) programmed by hand it wouldn’t easily interface and interact with other neural networks. But since GPT-3 is a neural network, it should play constructively and collaboratively with other neural networks. This really underscores the potential value of architectures like this in the future.

GPT-3 is very different from something like IBM’s Watson, the Jeopardy champion (which I have written about here). Watson was programmed with thousands of lines of code, annotations, and conditionals. This programming helped it respond to particular contingencies that were identified by humans in advance. GPT-3 has very little in the way of formal rules. The logical structure that comes out of it is coming from the English language content that it has “read.”

GPT-3 was trained on prodigious amounts of data in Microsoft’s cloud using graphics processing units (GPUs). GPUs are often used to train neural networks because they have a large number of cores. In other words, a GPU is like having many weak CPUs all of which can work on different small problems at the same time. This is only useful if the computer’s work can be broken down into separate threads (parallelizable). This makes a GPU well-suited for the highly parallel task of modeling individual neurons because each neuron can be modeled independently. Multicore GPUs have only been accessible to public consumers for the last ten years. They started with 2 cores, then 4, and today they can have thousands. What was the impetus for engineers to build fantastically complex multicore GPUs? It was the demand for videogames. Gaming computers and videogame consoles require better and better graphics cards to handle state of the art games. Less than 10 years ago AI scientists realized that they could take advantage of this and use GPUs themselves. This is a major reason why deep learning AI is performing at high levels and is such a hot topic today.

OpenAI and Microsoft needed hundreds of GPUs to train GPT-3. If they had only used one $8,000 RTX 8000 GPU (15 TFLOPS) it would have taken more than 600 years to process all of the training that took place. If they had used the GPU on your home computer, it would likely have taken thousands of years. That gives you an idea of how much processing time and resources went into training this network. But what is involved in training a network? Let’s discuss what is happening under the hood. (Apart from training, querying the pretrained model is also resource intensive. GPT-2 was able to run on a single machine at inference time, but GPT-3 must run on a cluster.)
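The 600-year figure can be reproduced with simple division. Note that the total training compute used below (about 3.14 x 10^23 floating point operations) is an assumption brought in from the published estimate for the 175-billion-parameter model; it is not stated in this post. The 15 TFLOPS figure is the one quoted above.

```python
total_training_flops = 3.14e23   # ASSUMPTION: published estimate for GPT-3, not from this post
gpu_flops_per_second = 15e12     # 15 TFLOPS, the single-GPU figure quoted in the text
seconds_per_year = 365.25 * 24 * 3600

years_on_one_gpu = total_training_flops / gpu_flops_per_second / seconds_per_year
print(f"{years_on_one_gpu:.0f} years on one GPU")

# With perfect parallel scaling (an idealization), N GPUs divide the time by N.
for n_gpus in (100, 1000):
    print(n_gpus, "GPUs:", round(years_on_one_gpu / n_gpus, 2), "years")
```

Real clusters never scale perfectly, so the multi-GPU lines are best read as lower bounds on the training time.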

 

How Neural Networks Work

This next section will offer an explanation for how artificial neural networks operate. It will then show how neural networks are applied to language processing and NLP networks like GPT-3. It is important to point out that this explanation is highly simplified and incomplete, but it should communicate the gist and give you some helpful intuitions about how to think about how AI functions.

To understand how a neural network works, first let’s look at a single neuron with three inputs. The three circles on the left in the figure below represent neurons (or nodes). These neurons, X1, X2, and X3, are in the input layer and they are taking information directly from an outside source (the data you input). Each neuron is capable of holding a value from 0 to 1. It sends this value to the next neuron to its right. The neuron on the right takes the values from these three inputs and adds them together to see if they sum above a certain threshold. If they do, the neuron will fire. If not, it won’t. Just as in the brain, the artificial neuron’s pattern of firing over time is dependent on its inputs.

We all know that neural networks learn, so how could this simple system learn? It learns by tweaking its synaptic weights, W1, W2, and W3. In other words, if the neuron on the right learns that inputs 1 and 2 are important but that input 3 is not important, it will increase the weights of the first two and decrease the weight of the third. As you can see in the formula below it multiplies each input by its associated weight and then adds the products together to get a number. Again, if this number is higher than its threshold for activation it will fire. When it fires, it sends information to the neurons that it is connected to. Remember, this simple four neuron example would just be a minuscule fraction of a contemporary neural network.
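In code, the neuron described above fires when X1·W1 + X2·W2 + X3·W3 exceeds its threshold. The sketch below assumes made-up values for the weights and the threshold; `neuron_fires` is a hypothetical helper name.

```python
def neuron_fires(inputs, weights, threshold):
    """Return (weighted_sum, fired) for a simple threshold neuron."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return weighted_sum, weighted_sum > threshold

inputs = [0.9, 0.8, 0.7]      # X1, X2, X3, each between 0 and 1
weights = [0.6, 0.5, 0.05]    # W1, W2, W3 (hypothetical: input 3 matters little)
threshold = 0.8               # hypothetical activation threshold

total, fired = neuron_fires(inputs, weights, threshold)
print(total, fired)

# If only the unimportant third input is active, the neuron stays silent.
print(neuron_fires([0.0, 0.0, 1.0], weights, threshold))
```

Learning amounts to nudging the three numbers in `weights` so that the neuron fires for the input patterns it should respond to and stays quiet for the rest.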


The figure below shows a bunch of neurons connected together to create a network. This is a very simple, vanilla neural network sometimes referred to as a multilayer perceptron. Each neuron is connected to every neuron in the adjacent layer. This is referred to as being “fully connected.” All these connections are associated with a weight, which can be changed by experience, and thus provide the network with many different possible ways to learn. Each weight is a tuning knob that is adjusted through trial and error in an effort to find an optimal network configuration.

The network below recognizes digits, and here it is shown correctly recognizing the handwritten number 3. A picture of the handwritten 3 is fed into the system. The photo is 28 pixels wide, by 28 pixels tall, for a total of 784 pixels. This means that the input layer is going to need 784 neurons, one for each pixel. The brightness of each pixel, on a scale from 0 to 1, is fed into the input neurons as you can see below. These numbers pass through the network from left to right. As they do this they will be multiplied by their associated weights at each layer until the “activation energy” from the pattern of input results in a pattern of output.

How many output neurons would you expect this network to have? Well, if it recognizes digits, then it should have 10 outputs, one for each digit (0-9). After the activation energy from the inputs passes through the network, one of the ten neurons in the output layer will be activated more than the others. This will be the network’s answer or conclusion. In the example below, the output neuron corresponding to the number 3 is activated the most at an activation level of .95. The system correctly recognized the digit as a 3. Please note the ellipses in each column which indicate that the neurons are so numerous that some of them are not pictured.
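A digit recognizer of this shape can be sketched as a forward pass through a few fully connected layers. The hidden layer sizes, the sigmoid activation, and the random weights here are all assumptions made for illustration; since the network is untrained, its answer is just a guess.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    """Squash any number into the range 0 to 1."""
    return 1.0 / (1.0 + np.exp(-z))

# 784 input pixels, two hidden layers (sizes are hypothetical), 10 output digits.
sizes = [784, 16, 16, 10]
weights = [rng.normal(scale=0.1, size=(n_in, n_out))
           for n_in, n_out in zip(sizes, sizes[1:])]
biases = [np.zeros(n_out) for n_out in sizes[1:]]

def forward(pixels):
    """Pass activations from left to right through every layer."""
    activation = pixels
    for w, b in zip(weights, biases):
        activation = sigmoid(activation @ w + b)
    return activation

image = rng.random((28, 28))             # random stand-in for a handwritten digit
outputs = forward(image.reshape(784))    # one activation per digit, 0 through 9
print("most activated output neuron:", int(np.argmax(outputs)))
```

The output neuron with the highest activation is taken as the network's answer, which is the role played by the .95 activation on the "3" neuron in the example.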


These artificial neurons and the connections between them amount to an artificial neural network. It is modeled and run inside of a computer. It is kind of like a videogame in the sense that these structures don’t actually exist in the real world, but are simulated at high fidelity and speed using a computer. This rather simple network captures some of the important cognitive mechanisms found in the human brain. The neural network is considered a form of machine learning, and when it contains more than three hidden layers, it is referred to as deep learning. Some modern networks have thousands of hidden layers. Now the amazing thing about these mathematical neurons is that if you connect a large number of them up in the right way and choose a good learning rule (a method of updating its weights), it can learn just about any mathematical expression. After the network is trained on data, it will capture patterns in the data, becoming one big mathematical function that can be used for things like classification.

In the next diagram we see a neural network that is being used to classify produce. It is using object features (like color, size, and texture) in the first layer to determine the type of produce (such as cherry, orange, apple, or lettuce) in the hidden layer. Then these are associated with either the fruit or vegetable classification in the output layer. The boldness of the lines indicates the strength of the weights. You can see that red, small, and smooth are all connected strongly to cherry. This means that whenever all three of these are activated by an input pattern, cherry will be selected in the hidden layer. You can also see that cherry is more strongly connected to fruit than vegetable in the output layer. So, by using only an object’s features this system could tell a fruit from a vegetable.

Please keep in mind that this example is a simplified, “toy” example. However, neural networks do work hierarchically in this way. At each layer, they take simple features and allow them to converge on more complex features, culminating in some kind of conclusion. In the example involving digit recognition above, the first hidden layer generally recognizes short line segments, and the layers after it recognize increasingly complex line segments including loops and curves. Finally, the output layer puts these together to form whole numbers (as when two loops create the number 8). Again, neural networks go from concrete features to abstract categories because of the way that the neurons in low-order layers (to the left) project to neurons in high-order layers (to the right).



The next diagram shows that neural networks can take many forms of input and come up with appropriate output. Let’s start just by looking at the first of the three networks in the diagram. That top network received a picture of a cat and recognized it as a cat. To do this it had to take the pixel brightness of each pixel and turn them into a long list of numbers. There is one input neuron for each pixel so if there were 2,073,600 (1080 x 1920) pixels then there must be that many input neurons in the input layer. The numbers (vectors) then flow mathematically through the network and toward the two output neurons, dog and cat. Cat ended up with a higher activation level than dog. Thus, the system is “guessing” the object in the photo is a cat. But to guess correctly the system must first be trained.


Now let’s talk about learning. When the system gives a correct answer, the connections responsible for its decision are strengthened. When it gives a wrong answer, the connections are weakened. This is similar to the reward and punishment that goes on to influence neuroplasticity in the human brain. In the example above, the network correctly categorized the picture as a cat. After it did this it was then told by the program it interacts with that it got it right. So, it then went back and strengthened all the weights responsible for helping it make that decision. If it had falsely recognized the picture as a dog, then it would have been told that it got it wrong, and it would go back and weaken all of the weights responsible for helping it make the wrong decision. Going back and making these adjustments based on the outcome of the guess is known as backpropagation. Backprop, as it is sometimes called, is one of the fundamental algorithms responsible for the success of neural networks.

As you can see this system requires supervision. This is known as supervised machine learning. It must be told when it is right and when it is wrong and that necessitates that its data is prelabeled. To create the training data for this system a person had to collect and then label thousands of pictures of dogs and cats so that the AI could be told when it is right and when it is wrong.
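The strengthen-and-weaken loop driven by labeled examples can be sketched with a single artificial neuron trained by gradient descent. This is a simplification: full backpropagation applies the same error-driven update through every layer of a deep network, while this sketch has only one layer. The two features and their labels are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical labeled data: two features per picture, label 1 = cat, 0 = dog.
features = np.array([[0.9, 0.1], [0.8, 0.3], [0.2, 0.9], [0.1, 0.7]])
labels = np.array([1.0, 1.0, 0.0, 0.0])

weights = rng.normal(scale=0.1, size=2)   # start from small random weights
bias = 0.0
learning_rate = 1.0

def predict(x):
    """Activation between 0 (dog) and 1 (cat)."""
    return 1.0 / (1.0 + np.exp(-(x @ weights + bias)))

for epoch in range(500):
    error = predict(features) - labels    # how wrong each guess was
    # Move each weight against its share of the error (cross-entropy gradient).
    weights -= learning_rate * features.T @ error / len(labels)
    bias -= learning_rate * error.mean()

print(np.round(predict(features), 2))     # near 1 for cats, near 0 for dogs
```

Without the `labels` array there would be no error to compute, which is why this style of training needs prelabeled data.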

Next, let’s look at the middle network in the diagram above. This is an optical digit recognizer like the one we saw earlier. This AI system is shown correctly recognizing the number six. The network is behaving in much the same way as the cat/dog classifier, except here you can see that it has 10 outputs rather than just two. This is because it must be able to differentiate between the numbers 0 through 9.

The last network in the diagram is a natural language processing system and it works in a way that is very similar to the first two networks. It is given the first four words in a sentence, “This is my pet…” It is shown correctly predicting the word “cat” as the most probable next word. But this system does not only distinguish cats from dogs. This network must differentiate between all the words in the English language, so it has an input neuron and an output neuron corresponding to every word in the dictionary. That’s right, natural language generating AIs like GPT-3 need a dictionary worth of inputs and a dictionary worth of outputs.

There are around 170,000 words that are currently used in the English language. However, most people only use around 20,000 to 30,000 words conversationally. Many AI natural language models therefore use around 50,000 of the most common words. This means that there are 50,000 different possible outputs for the neural network. Each output is built into the network’s structure and as the activation energy passes from the inputs, through the hidden layers, and toward the outputs, one of those words will be more highly activated than any other. That word will be the one the network chooses.
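Choosing the most activated word can be sketched as follows. The post does not name the function that converts output activations into probabilities; the softmax used here is the conventional choice and should be read as an assumption, as should the five-word vocabulary and the activation values.

```python
import numpy as np

def softmax(scores):
    """Convert raw output activations into probabilities that sum to 1."""
    shifted = scores - np.max(scores)     # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

# A five-word stand-in for a vocabulary of roughly 50,000 words.
vocabulary = ["cat", "dog", "house", "thus", "birthday"]
output_activations = np.array([3.1, 2.4, 0.3, -2.0, 0.1])   # hypothetical values

probabilities = softmax(output_activations)
chosen = vocabulary[int(np.argmax(probabilities))]

for word, p in zip(vocabulary, probabilities):
    print(f"{word:>8}: {p:.3f}")
print("chosen word:", chosen)
```

Every word in the vocabulary receives some probability, but the one with the highest activation wins.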

The next diagram shows how a natural language processing network makes its decisions. The neuron for the word “pet” spreads its activation to all the neurons in the first hidden layer because it is connected to each one. Again, due to their weights, some neurons value this input more than others. These then fire at the next hidden layers until they reach the output layer activating all the neurons there, at least to some extent. One neuron in the output layer values this pattern of activation more than any other. The neuron that is activated the most, “cat,” is the one chosen by the network as being the most likely to follow the word “pet.” This is a helpful diagram, but it is a huge oversimplification. This is because, when GPT-3 chooses the word “cat” it is not because one word (pet) selected it; it is because the vectors for many words converged on it together. Remember? We said that GPT-3 has an attention span 2048 tokens wide. That gives you an idea of just how many inputs are being considered simultaneously to select the next output.



Now, let’s put all of this together and consider what is happening when a natural language processing system like GPT-3 is undergoing training. Luckily, its training data does not need to be labeled by a person and its training process does not have to be supervised. Why, you ask? Because the right answer is already there in the text. The trick is, it hides the next word from itself. As it reads, the next word is hidden from its view. It must predict what that next word will be. After it guesses, it is allowed to see if it was right. If it gets it right, it learns. If it gets it wrong, it unlearns.
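The "hide the next word" trick means that raw text labels itself. Here is a minimal sketch of how a sentence turns into training examples; the function name and the three-word context size are hypothetical (GPT-3's context is far wider).

```python
def next_word_examples(text, context_size=3):
    """Turn raw text into (visible context, hidden next word) training pairs."""
    words = text.split()
    examples = []
    for i in range(1, len(words)):
        context = words[max(0, i - context_size):i]   # what the model may see
        examples.append((context, words[i]))          # the word hidden from it
    return examples

for context, target in next_word_examples("this is my pet cat"):
    print(context, "->", target)
```

No person had to label anything: each hidden word serves as the correct answer for the words that precede it.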

With the cat and dog classifier, the system would make a prediction, learn and then start all over again with a new picture. Natural language generating AIs do not start over with each word. Rather, they keep reading and using the previous words to update their attention in an iterative fashion. The diagram below gives an example of this. In the example, the system uses the context it is holding to guess the first two words accurately (“is” and “my”) but gets the next two wrong (“pet” and “cat”). When a system like GPT-3 reads through Wikipedia it is constantly making errors, but because its attention is so wide, after extensive training it develops a preternatural ability to make inferences about what word could be coming next. 

 


So to recap, GPT-3 takes a series of words (starting with the words you give it as a prompt or with the series that it is currently in the process of generating) and then fills in the next blank. It gives some consideration to each word in the dictionary every time it chooses the next word. To decide which one to use it pushes data input through a mathematical network of neurons, toward its entire dictionary of outputs. Whichever word receives the most spreading activation out of all the potential outputs is assigned the highest probability and is used to fill in the next blank. To accomplish this using mathematics the words themselves are represented as vectors (strings of numbers) and these vectors interact with the numerical structure of the existing network (through matrix multiplication).

Another way to frame this is to point out that GPT is basically asking, “given the previous words what is the probability distribution for the next word?” Once it has that distribution, it samples a word from it, adds that word to the list, and then asks the same question again with the longer list.

The diagram above shows how, when given the string of words “this is my pet…”, an AI that had not finished training could come up with a word like “dog.” The right word was “cat.” So, when GPT-3 gets it wrong it will learn from its mistake. The mathematical difference between “cat” and “dog” is calculated and used to update the system. Of course, this is an arbitrary distinction (and much of what the system learns is arbitrary). There is nothing wrong with saying “this is my pet dog.” But if this phrase occurred in an article about cats there might be something wrong with it. Because GPT-3’s attention is wide enough to recognize an article about cats, the learning might be more helpful in that case, since it would help train the system to group similar words together.

Before training takes place, the system’s outputs are generated at random because its parameters (synaptic weights) are set to random values (like most neural networks). But during training inappropriate responses are compared to correct responses, the error of the response is calculated, and then this error value is used to alter the model’s parameters so it is more likely to choose the right word next time. When it does this, it changes the way that it mathematically represents the word “cat” in its network. For instance, the words “cat” and “feline” may not be related in its memory at all, but during training they will come to be more closely related because they are likely to pop up in the same sentences. Another way of saying this is that the system will learn to group things that appear close together in time (temporal contiguity). The way these two words (cat and feline) are encoded in memory as numbers (vectors of floats) will become more and more similar. This places semantically related words closer and closer together in a multidimensional web of definitions.
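The error-correction step described above can be sketched with a toy example. This is a simplified illustration, not GPT-3's training code: the vocabulary, the starting scores, and the learning rate are all assumptions, and the scores themselves are treated as the adjustable parameters. In a real network the same error signal would be passed backward through many layers of weights. The error measure used here is cross-entropy, a common choice for next-word prediction.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

vocabulary = ["cat", "dog", "name", "thus"]
scores = [0.1, 0.9, 0.3, -0.5]   # hypothetical untrained scores: "dog" is favored
target = vocabulary.index("cat")  # the word that actually came next
learning_rate = 0.5               # assumed step size

before = softmax(scores)
loss_before = -math.log(before[target])  # cross-entropy error for this guess

# For softmax with cross-entropy, the gradient for each score is
# (predicted probability - 1) for the correct word and
# (predicted probability) for every other word.
gradient = [p - (1.0 if i == target else 0.0) for i, p in enumerate(before)]

# Nudge every score a small step in the direction that reduces the error.
scores = [s - learning_rate * g for s, g in zip(scores, gradient)]

after = softmax(scores)
loss_after = -math.log(after[target])

print(f"P(cat) before: {before[target]:.3f}, after: {after[target]:.3f}")
```

A single update only nudges the probabilities. It takes many such nudges, over an enormous amount of text, before the right word reliably comes out on top.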

Thus far, we have explained how an NLP system learns to make predictions about language, but here we are interested in natural language generation. So how would you get such a system to write its own content? It is easy: you simply start a sentence for it, or ask it a question. That will fill its attention with context that it will then use to predict what the next word should be. It continues on, adding words to the end, generating highly complex, synthetic text. A chain of predictions becomes a narrative. By now you should be able to see why a recent article refers to modern language models as “stochastic parrots.” They are. They do a fantastic job of mimicking human language in a pantomime, chaotic, difficult-to-predict way.
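The generation loop just described can be sketched in a few lines of Python. Everything here is a stand-in: the small lookup table plays the role of the trained network, the probabilities in it are invented, and the function names are hypothetical. Unlike GPT-3, this toy predictor looks only at the last word rather than the whole context, but the loop of predicting, appending, and repeating has the same shape.

```python
import random

# Hypothetical stand-in for the trained network: for each word,
# the words that may follow it and their (made-up) probabilities.
NEXT_WORD_TABLE = {
    "this": {"is": 1.0},
    "is": {"my": 0.7, "a": 0.3},
    "my": {"pet": 0.6, "name": 0.4},
    "a": {"pet": 1.0},
    "pet": {"cat": 0.5, "dog": 0.5},
}

def next_word_distribution(context):
    """Toy predictor that uses only the last word of the context."""
    return NEXT_WORD_TABLE.get(context[-1], {})

def generate(prompt, max_new_words, rng):
    """Extend the prompt one sampled word at a time."""
    words = prompt.split()
    for _ in range(max_new_words):
        distribution = next_word_distribution(words)
        if not distribution:
            break  # the toy table has nothing to say about this word
        choices = list(distribution)
        weights = [distribution[w] for w in choices]
        # Sample the next word and append it, so it becomes part of
        # the context for the following prediction.
        words.append(rng.choices(choices, weights=weights, k=1)[0])
    return " ".join(words)

text = generate("this is", max_new_words=5, rng=random.Random(42))
print(text)
```

Because each word is sampled, running the loop with a different random seed can turn the same prompt into a different sentence, which is the “stochastic” half of “stochastic parrot.”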

 

Conclusion

For every word in the English language, there is one word and only one word that is most likely to follow it (my … name). Some words will be slightly less likely to follow it (my … cat). Other words may have almost no probability of following (my … thus). Natural language models from 40 years ago would predict the next word only from the single word that directly preceded it. Most of the time they could not formulate a coherent phrase, much less a sentence or paragraph. But as you know, the newer language models look back much further than just one word. Their attention span maintains whole paragraphs in memory while constantly adding new words and subtracting the words that have been there the longest. Like my model of consciousness, they exhibit a form of working memory that evolves gradually through time. They use what I call “iterative updating” and “incremental change in state-spanning coactivity.”
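The older approach of predicting from only the single preceding word can be illustrated by simply counting which words follow which. The sketch below is a generic illustration rather than any particular historical system, and the tiny corpus is invented for the example.

```python
from collections import Counter, defaultdict

# A made-up miniature corpus, already split into words.
corpus = "my name is sam . my cat is my pet . my name is short".split()

# For every word, count the words that directly follow it.
followers = defaultdict(Counter)
for previous, current in zip(corpus, corpus[1:]):
    followers[previous][current] += 1

def most_likely_next(word):
    """Return the most frequent follower of a word, or None if it was never seen."""
    counts = followers.get(word)
    if not counts:
        return None
    return counts.most_common(1)[0][0]

print(most_likely_next("my"))    # the most frequent follower of "my"
print(followers["my"]["cat"])    # a less frequent follower
print(followers["my"]["thus"])   # a word that never follows "my"
```

A model like this knows nothing beyond the previous word, which is why it loses the thread after a few words, while a model that keeps whole paragraphs in its attention does not.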

But the human working memory doesn’t just track a succession of words, it tracks a succession of events. Just as there is one most probable word to follow any series of words, there is one most probable event to follow a sequence of events. We use our working memory to record events as they happen to help us decide what to predict next.

In the next blog entry, we will consider how a system like GPT-3 could be incorporated into a much larger system of neural networks to better approximate human working memory by using, not just words, but events as long-term dependencies. This will allow an AI to make predictions, not just of the best word to use next, but the best behavior. We will also discuss what other modular networks would be valuable to such a larger system. For instance, we would want modules that correspond to video input, mental imagery, and motor output. GPT-3 can currently interact with other programs, but these programs are far from being tightly integrated. To create a conscious system, we want multiple neural networks with different functional specializations to be tightly interwoven just like they are in our brain. What role would an AI like GPT-3 play in this larger system? It would probably play the role of Broca’s area, the human brain’s language region. For all the details, please stay tuned for the next entry.


Also you might want to watch my YouTube video lecture on working memory and consciousness:


