AI that can produce natural language is a hot topic today.
Here we are going to discuss how it is structured, how it works, how it learns,
and how it could possibly be improved.
Natural language processing (NLP) is a subfield of AI
concerned with recognizing and analyzing natural language data. Alexa, Siri,
and Google Assistant all use NLP techniques. Capabilities of NLP software
include speech recognition, language translation, sentiment analysis, and
language generation. Here we are primarily interested in natural language
generation, which means the creation of written text. There is a long history
of software that can produce language but only in the last few years has it
approached human-level capability.
There are many state-of-the-art systems we could discuss,
but here we are going to focus on one called GPT-3. It is an exciting new AI
system that has proven to be highly adept at natural language generation. It
can answer questions, write computer code, summarize long texts, and even write
its own essays. Its writing is so good that it often seems as if it were written
by a human.
You can feed GPT-3 the first two sentences of a news article
and it will write the rest of the article in a convincing manner. You can ask
it to write a poem in the style of a certain author, and its output may be
indistinguishable from an actual poem by that author. In fact, one blogger
created a blog where they only posted GPT-3 text as entries. The entries were
so good that people were convinced they were written by a human and started
subscribing to the blog.
Take a look at a few examples of its responses to simple questions:
Traditionally, AI does poorly with common sense, but as you
can see, many of GPT-3’s responses are highly logical. GPT-3 was trained on
thousands of websites, books, and most of Wikipedia. This enormous and diverse
corpus of unlabeled text amounted to hundreds of billions of words. What it
does is simple and mechanical, but because GPT-3 has so much memory, and has
been exposed to such a high volume of logical writing from good authors, it is
able to unconsciously piece together sentences of great complexity and meaning.
The way it is structured is fascinating, and I hope that by the end of this
post you have a strong intuitive understanding of how it works.
What is Natural Language Generation Doing?
NLP uses distributional semantics. This means keeping track
of which words tend to appear together in the same sentences and how they are
ordered. Linguist John Firth (1890 – 1960) said, “You shall know a word by the
company it keeps.” NLP systems keep track of when and how words accompany each
other statistically. These systems are fed huge amounts of data in the form of
paragraphs and sentences, and they analyze how the words tend to be
distributed. They then use this probabilistic knowledge in reverse to generate
language.
As they write, NLP systems are “filling in the blank” in a
process called “next word prediction.” That’s right: GPT has no idea what it is
going to say next; it literally focuses on one word at a time, one after
another. GPT-3 “knows” nothing. It only appears to have knowledge about the
world because of the intricate statistics it keeps on the mathematical
relationships between words from works written by human authors. GPT-3 is
basically saying: “Based on the training data I have been exposed to, if I had
to predict what the next word in this sentence was, I would guess that it would
be _____.”
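That guessing game can be sketched in a few lines of Python using word-pair counts. This is a toy stand-in (a simple bigram model over a made-up corpus), not GPT-3's actual mechanism:

```python
from collections import Counter, defaultdict

# A tiny made-up corpus standing in for "huge amounts of data".
corpus = "today is my birthday . today is my day . this is my pet cat".split()

# Count how often each word follows each other word (distributional statistics).
follow_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follow_counts[prev][nxt] += 1

def predict_next(word):
    """Fill in the blank: the statistically most likely next word."""
    return follow_counts[word].most_common(1)[0][0]

print(predict_next("is"))     # "my" follows "is" in every example above
print(predict_next("today"))  # "is"
```

A real language model does the same thing in spirit, but with learned weights over billions of words rather than a lookup of raw counts.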
When you give an NLP system a single word, it will find
the most statistically appropriate word to follow it. If you give it half a
sentence, it will use all the words to calculate the next most appropriate
word. Then, after these NLP systems make the first recommendation, they use that
word, along with the rest of the sentence, to recommend the next word. They
compile sentences iteratively in this manner, word by word. They are not
thinking. They are not using logic, mental imagery, concepts, ideas, semantic
or episodic memory. Rather, they are using a glorified version of the
autocomplete in your Google search bar, or your phone’s text messaging app.
To really get a sense of this, open the text app on your
phone. Type one word, then see what the phone offers you as an autocomplete
suggestion for the next word. Select its recommendation. You can keep
selecting its recommendations to string together a sentence. Depending on the
algorithm the phone uses (likely Markovian) the sentence may make vague sense
or may make no sense at all. In principle though, this is how GPT and all other
modern language generating models work. The screenshots below show a search on
Google, and some sentences generated by my phone’s predictive text feature.
A. Google using autocomplete to give you likely predictions for your search. B. Using the autocomplete suggestions above my phone’s keyboard to generate nonsense sentences.
GPT-3 Has a Form of Attention
Most autocomplete systems are much more myopic than GPT.
They may only take the previous word, or previous two words into consideration.
This is partially because it becomes very computationally expensive to look
back further than a couple of words. The more previous variables that are
tracked, the more expensive. A computer program that had both a list of every
word in the English language and the word most likely to follow each one would
take up very little space in computer memory and require very little
processing power. However, what GPT-3 does is much more complex
because it looks at the last several words to make its decisions.
The more words, the more context. The more context, the
better the prediction. Let’s say you were given the word “my” and asked to
predict the next word. Not very easy, right? What if you were given “is
my”? Still not very easy. How about
“today is my”? Now those three words might give you the context you need to
predict that the next word is “birthday.” Words occurring along a timeline are
not independent or equiprobable. Rather, there are correlations and conditional
dependencies between successive words. What comes later is dependent on what
came before. In that four-word string “today is my birthday” there is a
short-term dependency between “today” and “birthday.” So being able to have a
working memory of previous words is very helpful. More sophisticated AIs like
GPT-3 can deal with long-term dependencies too. This is when, an entire
paragraph later, GPT-3 can still reference the fact that today is someone’s
birthday.
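The effect of added context can be shown with simple n-gram counts, a toy stand-in for what large models learn (the mini-corpus here is invented for illustration):

```python
from collections import Counter

# Hypothetical mini-corpus; real systems train on billions of words.
sentences = [
    "today is my birthday",
    "that is my cat",
    "here is my house",
    "today is my birthday",
]
tokens = [s.split() for s in sentences]

def next_word_dist(context):
    """Distribution over the next word given a tuple of preceding words."""
    n = len(context)
    counts = Counter()
    for sent in tokens:
        for i in range(len(sent) - n):
            if tuple(sent[i:i + n]) == context:
                counts[sent[i + n]] += 1
    return counts

# With only "my" as context, three continuations are possible...
print(next_word_dist(("my",)))
# ...but "today is my" pins the answer down to "birthday".
print(next_word_dist(("today", "is", "my")))
```

More preceding words means a sharper, more confident distribution over what comes next, which is exactly why a wide context window helps.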
By attending to preceding words, GPT-3 has a certain degree
of precision and appropriateness, and is able to stay on track. For instance,
it can remember the beginning of the sentence (or paragraph), and acknowledge
it or elaborate on it. Of course, this is essential to good writing. Its
attentional resources enable it to remember cues over many time steps,
allowing its output to stay pertinent by accounting for what came earlier.
While it was trained, GPT-3 learned what to pay attention to given the context
it was considering. This way it does not have to keep everything that came
earlier in mind; it stores only what it predicts will be important in the near
future.
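The core operation behind this, scaled dot-product attention, can be sketched in a few lines of NumPy. This is a simplified illustration of the mechanism transformers use, not GPT-3's actual code, and the numbers are random toy values:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: each position scores its relevance to
    every other position, then takes a weighted average of the values."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])         # relevance of each position to each other
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
    return weights @ V

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))   # three "words", each represented by 4 numbers
out = attention(x, x, x)      # self-attention: the sequence attends to itself
print(out.shape)              # (3, 4): same shape, but each row now blends in context
```

The learned part in a real transformer is a set of matrices that produce the queries, keys, and values; training adjusts those matrices so the model attends to whatever proved useful for prediction.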
If you can remember that you were talking about the
vice-president two sentences ago, then you will be able to use the pronoun
“she” when referring to her again. In this case your use of “she” is dependent
on a noun that you used several seconds ago. This is an example of an event
being used as a long-term dependency. Long-term dependencies structure our
thinking processes, and they allow us to predict what will happen next, what
our friend will do next, and they help us finish each other’s sentences. To a
large extent, intelligence is the ability to capture, remember, manage, and act
on short- and long-term dependencies.
GPT-3 uses its attention to keep track of several long-term
dependencies at a time. It selectively prioritizes the most relevant of recent
items so that it can refer back to them. This is how it is able to keep certain
words “in mind” so that it doesn’t stray from the topic as it writes. GPT-3’s
context window is 2,048 tokens wide, where a token is roughly equivalent to a
word (or a piece of one). So, it has a couple thousand words as its “context
window” or attention span. This is clearly much larger than what a human has
direct access to from the immediate past (most people cannot even remember a
10-digit number). Its attention is what
allows it to write in a rational, human-like way. Reading the following text
from GPT-2, can you spot places where it used its backward memory span to
attend to short- and long-term dependencies?
As you can see GPT-2 takes the context from the
human-written prompt above and creates an entire story. Its story retains many
of the initial elements introduced by the prompt and expands on them. You can
also see how it is able to introduce related words and concepts and then refer
back to them paragraphs later in a reasonable way.
Some Technical but Interesting Details About GPT-3
GPT-3 was introduced in May 2020 by OpenAI, which was
co-founded by Elon Musk and Sam Altman. GPT-3 stands for Generative Pre-trained
Transformer 3. The “generative” in the name means that it can create its own
content. The word “pre-trained” means that it has already learned what it needs
to know. Its learning is actually now complete (for the most part) and thus its
synaptic weights have been frozen. The word “transformer” refers to the type of
neural network it is (a successor to recurrent networks that relies on
attention rather than recurrence). The transformer
architecture, by the way, is relatively simple. It has also been used in other
language models such as Google’s BERT and Microsoft’s Turing Natural Language
Generation (T-NLG).
The 3 in GPT-3 denotes that it is a third-generation product
coming after GPT and GPT-2 as the third iteration of the GPT-n series. GPT-1
and 2 were also groundbreaking and similarly seen as technologically
disruptive. GPT-3 has a wider attention span than GPT-2 and many more layers.
GPT-2 had 1.5 billion parameters, and GPT-3 has a total of 175 billion
parameters. Thus, it is over 100 times larger than its impressive predecessor
which came two years before it. What are those 175 billion parameters? The
parameters are the adjustable connection weights between its neurons, the
values that change as it learns. The more parameters, the more memory it has,
and the more structural complexity to its memory.
You can make a rough comparison between the 175 billion
parameters in GPT-3 and the 100 trillion synapses in the human brain. That
should give you a sense of how much more information your brain is capable of
holding (over 500x). It cost $4.6 million to train GPT-3. At that rate, trying
to scale it up to the size of the brain would cost an unwieldy $2.6 billion.
However, considering the fact that neural network training efficiency has been
doubling every 16 months, by 2032 scientists may be able to create a system
with the memory capacity of the human brain (100 trillion parameters) for
around the same cost as GPT-3 (about $5 million). This is one reason why many people
are excited about the prospect of keeping the GPT architecture and just
throwing more compute at it to achieve superintelligence.
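The arithmetic behind these scaling figures is easy to check, assuming (as the estimate above does) that training cost scales linearly with parameter count and that efficiency keeps doubling every 16 months:

```python
# Checking the scaling arithmetic with the figures used above.
gpt3_params = 175e9        # parameters
brain_synapses = 100e12    # a rough synapse count for the human brain
gpt3_cost = 4.6e6          # dollars to train GPT-3

# Naive linear extrapolation of training cost with parameter count:
brain_scale_cost = gpt3_cost * (brain_synapses / gpt3_params)
print(round(brain_scale_cost / 1e9, 1))  # 2.6 (billion dollars)

# If training efficiency doubles every 16 months, cost halves at the same rate.
months = 12 * 12                          # twelve years, 2020 -> 2032
halvings = months / 16                    # 9 halvings
future_cost = brain_scale_cost / 2 ** halvings
print(round(future_cost / 1e6, 1))        # 5.1 (million dollars)
```

This is back-of-the-envelope reasoning, of course: it ignores everything except parameter count and a single efficiency trend, but it reproduces the figures quoted above.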
It is worth mentioning that scaling up from GPT-2 to GPT-3
has not yet resulted in diminishing returns. That is, performance has kept
improving roughly in proportion to scale. This suggests that just throwing more computing
power at the same architecture could lead to equally stunning performance for
GPT-4. This has led many researchers to wonder how big this can get, and how
far we can take it. I think that it will continue to scale well for a while
longer, but I don’t think the transformer architecture will ever approach any
form of sentient consciousness. Most forms of AI (machine learning and deep
learning) are one-trick ponies. They perform well, but only in one specific
domain. My belief is that a specialized system like GPT will continue to be
used in the future but will make modular contributions to more generalist
systems. I cover that in the next blog entry which you can read here.
GPT-3 is a closed book system, which means that it does not
query a database to find its answers, it “speaks from its own knowledge.” It
has read Wikipedia, but (unlike IBM’s Jeopardy champion “Watson”) does not have
Wikipedia saved verbatim in files on its hard drive. Rather, it “read” or
traversed through Wikipedia and saved its impressions of it that did not
already match its existing structure. In other words, it saved information
about the incorrect predictions it made about Wikipedia. It is important to
keep in mind that it is not a simple lookup table. It is an autoregressive
language model, meaning that it predicts future values from its memories of
past values. It interpolates and extrapolates from what it remembers. It is
amazing at this, and its abilities generalize to a wide variety of tasks. GPT-3
outperforms many fine-tuned, state-of-the-art models in a range of different
domains, tasks, and benchmarks. In fact, some of its accomplishments are
mind-blowing.
Human accuracy at detecting articles that were produced by
GPT-3 (and not another human) is barely above chance at 52%. This means that it
is very difficult to tell if you are reading something written by it or by a
real human. It also means that GPT-3 has nearly passed a written form of the
Turing test. GPT-3 really is an amazing engineering feat. It shows that by
simply taking a transformer network and exposing it to millions of sentences
of online text, you can get a system that appears intelligent. It appears
intelligent even though it is not modeled after the brain and is missing most
of the major features thought by brain scientists and psychologists to be
instrumental to intelligence.
It writes as if it has understanding. But in reality, it
understands nothing. It cannot build abstract conceptual structures and cannot
reliably synthesize new intellectual or academic ideas. It has, however, shown
glimmers of a simple form of reasoning that allows it to create true content
that was not in its training set. For example, although it cannot add 10-digit
numbers (which a pocket calculator can do with ease) it can add 2- and 3-digit
numbers (35 + 67) and do lots of other math that it was never trained to do and
never encountered an example of. Its designers claim that it has never seen any
multiplication tables. Specialists are now arguing about what it means that it
can do math that it has never seen.
In the example at the beginning of this blog entry GPT-3
knew that there are no animals with three legs. This knowledge was not
explicitly programmed into it by a programmer, nor was it spelled out
explicitly in its training data. Pretty amazing. If GPT-3 were designed
differently and got its knowledge from a database of a long laundry list of
facts (like the Cyc AI project) programmed by hand it wouldn’t easily interface
and interact with other neural networks. But since GPT-3 is a neural network,
it should play constructively and collaboratively with other neural networks.
This really underscores the potential value of architectures like this in the
future.
GPT-3 is very different from something like IBM’s Watson,
the Jeopardy champion (which I have written about here). Watson was programmed
with thousands of lines of code, annotations, and conditionals. This
programming helped it respond to particular contingencies that were identified
by humans in advance. GPT-3 has very little in the way of formal rules. The
logical structure that comes out of it is coming from the English language
content that it has “read.”
GPT-3 was trained on prodigious amounts of data in
Microsoft’s cloud using graphics processing units (GPUs). GPUs are often used
to train neural networks because they have a large number of cores. In other
words, a GPU is like having many weak CPUs all of which can work on different
small problems at the same time. This is only useful if the computer’s work can
be broken down into separate threads (parallelizable). This makes a GPU
well-suited for the highly parallel task of modeling individual neurons because
each neuron can be modeled independently. Multicore GPUs have only been
accessible to public consumers for the last ten years. They started with 2
cores, then 4, and today they can have thousands. What was the impetus for
engineers to build fantastically complex multicore GPUs? It was the demand for
videogames. Gaming computers and videogame consoles require better and better
graphics cards to handle state-of-the-art games. Less than 10 years ago, AI
scientists realized that they could take advantage of this and use GPUs
themselves. This is a major reason why deep learning AI is performing at high
levels and is such a hot topic today.
OpenAI and Microsoft needed hundreds of GPUs to train GPT-3.
If they had only used one $8,000 RTX 8000 GPU (15 TFLOPS) it would have taken
more than 600 years to process all of the training that took place. If they
had used the GPU on your home computer, it would likely have taken
thousands of years. That gives you an idea of how much processing time and
how many resources went into training this network. But what exactly is
involved in training a network? Let’s discuss what is happening under the hood. (Apart
from training, querying the pretrained model is also resource expensive. GPT-2
was able to run on a single machine at inference time, but GPT-3 must run on a
cluster.)
How Neural Networks Work
This next section will offer an explanation for how
artificial neural networks operate. It will then show how neural networks are
applied to language processing and NLP networks like GPT-3. It is important to
point out that this explanation is highly simplified and incomplete, but it
should communicate the gist and give you some helpful intuitions about how AI
functions.
To understand how a neural network works, first let’s look
at a single neuron with three inputs. The three circles on the left in the
figure below represent neurons (or nodes). These neurons, X1, X2, and X3, are
in the input layer and they are taking information directly from an outside
source (the data you input). Each neuron is capable of holding a value from 0
to 1. It sends this value to the next neuron to its right. The neuron on the
right takes the values from these three inputs and adds them together to see if
they sum above a certain threshold. If they do, the neuron will fire. If not,
it won’t. Just as in the brain, the artificial neuron’s pattern of firing over
time is dependent on its inputs.
We all know that neural networks learn, so how could this
simple system learn? It learns by tweaking its synaptic weights, W1, W2, and
W3. In other words, if the neuron on the right learns that inputs 1 and 2 are
important but that input 3 is not important, it will increase the weights of
the first two and decrease the weight of the third. As you can see in the
formula below it multiplies each input by its associated weight and then adds
the products together to get a number. Again, if this number is higher than its
threshold for activation it will fire. When it fires, it sends information to
the neurons that it is connected to. Remember, this simple four-neuron example
would be just a minuscule fraction of a contemporary neural network.
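The weighted-sum-and-threshold behavior described above can be written out directly. The input values, weights, and threshold here are toy numbers chosen for illustration:

```python
def neuron(inputs, weights, threshold=1.0):
    """A single artificial neuron: multiply each input by its weight,
    sum the products, and fire (output 1) only if the sum clears the threshold."""
    total = sum(x * w for x, w in zip(inputs, weights))
    return 1 if total > threshold else 0

# Inputs X1, X2, X3 each hold a value between 0 and 1.
# Inputs 1 and 2 carry large weights ("important"); input 3 does not.
print(neuron([0.9, 0.8, 0.1], [1.0, 1.0, 0.2]))  # 0.9 + 0.8 + 0.02 = 1.72 -> fires: 1
print(neuron([0.1, 0.2, 0.9], [1.0, 1.0, 0.2]))  # 0.1 + 0.2 + 0.18 = 0.48 -> silent: 0
```

Learning, in this picture, is nothing more than nudging those weight values up or down based on experience.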
The
figure below shows a bunch of neurons connected together to create a network.
This is a very simple, vanilla neural network sometimes referred to as a
multilayer perceptron. Each neuron is connected to every other neuron in the
adjacent layer. This is referred to as being “fully connected.” All these connections
are associated with a weight, which can be changed by experience, and thus
provide the network with many different possible ways to learn. Each weight is
a tuning knob that is adjusted through trial and error in an effort to find an
optimal network configuration.
The
network below recognizes digits, and here it is shown correctly recognizing the
handwritten number 3. A picture of the handwritten 3 is fed into the system.
The photo is 28 pixels wide, by 28 pixels tall, for a total of 784 pixels. This
means that the input layer is going to need 784 neurons, one for each pixel. The
brightness of each pixel, on a scale from 0 to 1, is fed into the corresponding
input neuron, as you can see below. These numbers pass through the network from
left to right. As they do, they will be multiplied by their associated weights at
each layer until the “activation energy” from the pattern of input results in a
pattern of output.
How
many output neurons would you expect this network to have? Well, if it
recognizes digits, then it should have 10 outputs, one for each digit (0-9).
After the activation energy from the inputs passes through the network, one of
the ten neurons in the output layer will be activated more than the others.
This will be the network’s answer or conclusion. In the example below, the
output neuron corresponding to the number 3 is activated the most at an activation
level of .95. The system correctly recognized the digit as a 3. Please note the
ellipses in each column which indicate that the neurons are so numerous that some
of them are not pictured.
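A forward pass through such a digit recognizer can be sketched with random, untrained weights. The layer sizes match the description above; the shapes are the point here, not the answer, since an untrained network guesses arbitrarily:

```python
import numpy as np

rng = np.random.default_rng(42)

# Untrained toy network: 784 pixel inputs -> 16 hidden neurons -> 10 digit outputs.
W1 = rng.normal(scale=0.1, size=(784, 16))
W2 = rng.normal(scale=0.1, size=(16, 10))

def forward(pixels):
    """Push pixel brightness values (0..1) left-to-right through the layers."""
    hidden = np.maximum(0, pixels @ W1)   # each hidden neuron: weighted sum, then fire-or-not
    output = hidden @ W2                  # activation reaching the 10 output neurons
    return output

image = rng.uniform(0, 1, size=784)       # stand-in for a 28x28 handwritten digit
scores = forward(image)
print(scores.shape)                       # (10,): one activation level per digit
print(int(np.argmax(scores)))             # the most activated output is the network's answer
```

Training would adjust W1 and W2 until the most activated output reliably matches the digit actually drawn in the image.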
These
artificial neurons and the connections between them amount to an artificial
neural network. It is modeled and run inside of a computer. It is kind of like
a videogame in the sense that these structures don’t actually exist in the real
world, but are simulated at high fidelity and speed using a computer. This
rather simple network captures some of the important cognitive mechanisms found
in the human brain. A neural network is considered a form of machine learning,
and when it contains multiple hidden layers, it is referred to as deep
learning. Some modern networks have thousands of hidden layers. Now the amazing
thing about these mathematical neurons is that if you connect a large number of
them up in the right way and choose a good learning rule (a method of updating
the weights), the network can learn just about any mathematical function. After
the network is trained on data, it will capture patterns in the data, becoming
one big mathematical function that can be used for things like classification.
In the
next diagram we see a neural network that is being used to classify produce. It
is using object features (like color, size, and texture) in the first layer to
determine the type of produce (orange, apple, or lettuce) in the hidden layer. Then
these are associated with either the fruit or vegetable classification in the
output layer. The boldness of the lines indicates the strength of the weights.
You can see that red, small, and smooth are all connected strongly to cherry.
This means that whenever all three of these are activated by an input pattern,
cherry will be selected in the hidden layer. You can also see that cherry is
more strongly connected to fruit than vegetable in the output layer. So, by
using only an object’s features this system could tell a fruit from a vegetable.
Please
keep in mind that this example is a simplified, “toy” example. However, neural
networks do work hierarchically in this way. At each layer, they take simple
features and allow them to converge on more complex features, culminating in
some kind of conclusion. In the example involving digit recognition above, the
first hidden layer generally recognizes short line segments, and the layers
after it recognize increasingly complex line segments including loops and
curves. Finally, the output layer puts these together to form whole numbers (as
when two loops create the number 8). Again, neural networks go from concrete
features to abstract categories because of the way that the neurons in
low-order layers (to the left) project to neurons in high-order layers (to the
right).
Now
let’s talk about learning. When the system gives a correct answer, the
connections responsible for its decision are strengthened. When it gives a wrong
answer, the connections are weakened. This is similar to the reward and
punishment that goes on to influence neuroplasticity in the human brain. In the
example above, the network correctly categorized the picture as a cat. After it
did this it was then told by the program it interacts with that it got it right.
So, it then went back and strengthened all the weights responsible for helping
it make that decision. If it had falsely recognized the picture as a dog, then it
would have been told that it got it wrong, and it would go back and
weaken all of the weights responsible for helping it make the wrong decision. Going
back and making these adjustments based on the outcome of the guess is known as
backpropagation. Backprop, as it is sometimes called, is one of the fundamental
algorithms responsible for the success of neural networks.
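The strengthen-on-success, weaken-on-failure idea can be sketched with a perceptron-style update on a single neuron. Real backpropagation computes gradients through every layer, so this is a deliberately simplified stand-in, but the flavor is the same:

```python
def update(weights, inputs, target, lr=0.5):
    """One round of reward/punishment on a single neuron."""
    prediction = 1 if sum(x * w for x, w in zip(inputs, weights)) > 0 else 0
    error = target - prediction          # +1: should have fired; -1: should not have
    # Strengthen or weaken each weight in proportion to its input's responsibility.
    return [w + lr * error * x for w, x in zip(weights, inputs)]

weights = [0.0, 0.0]
# The network is told the right answer (supervision) and adjusts its weights.
weights = update(weights, inputs=[1.0, 0.5], target=1)
print(weights)  # [0.5, 0.25]: weights grew toward the inputs that should have fired
```

Repeating this update over thousands of labeled examples is, in miniature, what training a classifier amounts to.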
As you
can see this system requires supervision. This is known as supervised machine
learning. It must be told when it is right and when it is wrong, and that
necessitates that its data be prelabeled. To create the training data for this
system a
person had to collect and then label thousands of pictures of dogs and cats so
that the AI could be told when it is right and when it is wrong.
Next,
let’s look at the middle network in the diagram above. This is an optical digit
recognizer like the one we saw earlier. This AI system is shown correctly
recognizing the number six. The network is behaving in much the same way as the
cat/dog classifier, except here you can see that it has 10 outputs rather than just
two. This is because it must be able to differentiate between the numbers 0
through 9.
The last network in the diagram is a natural language
processing system and it works in a way that is very similar to the first two
networks. It is given the first four words in a sentence, “This is my pet…” It
is shown correctly predicting the word “cat” as the most probable next word.
But this system does not only distinguish cats from dogs. This network must
differentiate between all the words in the English language, so it has an input
neuron and an output neuron corresponding to every word in the dictionary.
That’s right, natural language generating AIs like GPT-3 need a dictionary
worth of inputs and a dictionary worth of outputs.
There are around 170,000 words that are currently used in
the English language. However, most people only use around 20,000 to 30,000
words conversationally. Many AI natural language models therefore use around
50,000 of the most common words. This means that there are 50,000 different
possible outputs for the neural network. Each output is built into the
network’s structure and as the activation energy passes from the inputs,
through the hidden layers, and toward the outputs, one of those words will be
more highly activated than any other. That word will be the one the network
chooses.
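That final step, turning output activations into a word choice, can be sketched with a softmax over a toy vocabulary. The words and activation numbers here are made up for illustration; GPT-scale models do this over roughly 50,000 tokens:

```python
import numpy as np

# A tiny stand-in vocabulary with one output neuron per word.
vocab = ["cat", "dog", "birthday", "house", "the"]
logits = np.array([2.1, 1.9, -0.5, 0.2, 0.7])   # raw activation of each output neuron

probs = np.exp(logits) / np.exp(logits).sum()    # softmax: activations -> probabilities
best = vocab[int(np.argmax(probs))]
print(best)                           # "cat": the most activated output word
print(round(float(probs.sum()), 6))   # 1.0: a proper probability distribution
```

The softmax matters because it gives the network not just a single winner but a full probability distribution, which is what sampling-based generation draws from.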
The next diagram shows how a natural language processing
network makes its decisions. The neuron for the word “pet” spreads its
activation to all the neurons in the first hidden layer because it is connected
to each one. Again, due to their weights, some neurons value this input more
than others. These then fire at the next hidden layers until they reach the
output layer activating all the neurons there, at least to some extent. One
neuron in the output layer values this pattern of activation more than any
other. The neuron that is activated the most, “cat,” is the one chosen by the
network as being the most likely to follow the word “pet.” This is a helpful
diagram, but it is a huge oversimplification. When GPT-3 chooses the word
“cat,” it is not because one word (“pet”) selected it; it is because the
vectors for many words converged on it together. Remember, we said that GPT-3
has an attention window 2,048 tokens wide. That gives you an idea of just
how many inputs are being considered simultaneously to select the next output.
Now, let’s put all of this together and consider
what is happening when a natural language processing system like GPT-3 is
undergoing training. Luckily, its training data does not need to be labeled by
a person and its training process does not have to be supervised. Why? Because
the right answer is already there in the text. The trick is that, as it reads,
it hides the next word from itself. It must
predict what that next word will be. After it guesses, it is allowed to see if
it was right. If it gets it right, it learns. If it gets it wrong, it unlearns.
With the cat and dog classifier, the system would
make a prediction, learn and then start all over again with a new picture.
Natural language generating AIs do not start over with each word. Rather, they
keep reading and using the previous words to update their attention in an
iterative fashion. The diagram below gives an example of this. In the example,
the system uses the context it is holding to guess the first two words accurately
(“is” and “my”) but gets the next two wrong (“pet” and “cat”). When a system
like GPT-3 reads through Wikipedia it is constantly making errors, but because
its attention is so wide, after extensive training it develops a preternatural
ability to make inferences about what word could be coming next.
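This hide-predict-check loop can be sketched with running word-pair counts, a toy stand-in for the network's learned statistics (the sentence is invented for illustration):

```python
from collections import Counter, defaultdict

text = "today is my birthday and today is my party".split()

model = defaultdict(Counter)   # running next-word statistics
right = wrong = 0

# Self-supervised training: the next word is the label, hidden until after the guess.
for i in range(len(text) - 1):
    context, answer = text[i], text[i + 1]
    guess = model[context].most_common(1)[0][0] if model[context] else None
    if guess == answer:
        right += 1
    else:
        wrong += 1                 # an error: the model adjusts
    model[context][answer] += 1    # either way, update the statistics

print(right, wrong)  # 2 6: the second "today is" and "is my" are predicted from the first
```

No human ever labels anything; the text itself supplies both the question and the answer, which is why training on raw web text is possible at all.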
So to recap, GPT-3 takes a series of words (starting with
the words you give it as a prompt or with the series that it is currently in the
process of generating) and then fills in the next blank. It gives some
consideration to each word in the dictionary every time it chooses the next
word. To decide which one to use it pushes data input through a mathematical
network of neurons, toward its entire dictionary of outputs. Whichever word
receives the most spreading activation out of all the potential outputs is
assigned the highest probability and is used to fill in the next blank. To
accomplish this using mathematics the words themselves are represented as
vectors (strings of numbers) and these vectors interact with the numerical
structure of the existing network (through matrix multiplication).
Another way to frame this is to point out that GPT is
basically asking, “given the previous words what is the probability
distribution for the next word?” Once it finds that word, it adds it to the
list, and it samples again on that distribution.
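That sample-append-repeat loop can be sketched directly. The probability tables here are hypothetical, standing in for the distributions a trained network would produce:

```python
import random

random.seed(0)

# Hypothetical next-word distributions (in GPT-3 these come from the network).
distributions = {
    "this": {"is": 0.9, "was": 0.1},
    "is":   {"my": 0.7, "a": 0.3},
    "my":   {"pet": 0.5, "birthday": 0.5},
    "pet":  {"cat": 0.6, "dog": 0.4},
}

def generate(prompt, steps):
    """Autoregressive generation: sample a next word, append it, repeat."""
    words = prompt.split()
    for _ in range(steps):
        dist = distributions.get(words[-1])
        if dist is None:
            break   # no statistics for this word: stop
        next_word = random.choices(list(dist), weights=dist.values())[0]
        words.append(next_word)
    return " ".join(words)

print(generate("this", steps=4))
```

Because each step samples from a distribution rather than always taking the single most likely word, the same prompt can yield different continuations on different runs.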
The diagram above shows how when given the string of words,
“this is my pet…” an AI that had not finished training could come up with a
word like “dog.” The right word was “cat.” So, when GPT-3 gets it wrong it will
learn from its mistake. The mathematical difference between “cat” and “dog” is
calculated and used to update the system. Of course, this is an arbitrary
distinction (and much of what the system learns is arbitrary). There is nothing
wrong with saying “this is my pet dog.” But if this phrase occurred in an
article about cats there might be something wrong with it. Because GPT-3’s
attention is wide enough to recognize an article about cats, the resulting
learning can be more helpful: it trains the system to group similar words
together.
Before training takes place, the system’s outputs are
generated at random because its parameters (synaptic weights) are set to random
values (like most neural networks). But during training inappropriate responses
are compared to correct responses, the error of the response is calculated, and
then this error value is used to alter the model’s parameters so it is more likely
to choose the right word next time. When it does this, it changes the way that
it mathematically represents the word “cat” in its network. For instance, the
words “cat” and “feline” may not be related in its memory at all, but during
training they will come to be more closely related because they are likely to
pop up in the same sentences. Another way of saying this is that the system
will learn to group things that appear close together in time (temporal
contiguity). The way these two words (cat and feline) are encoded in memory as
numbers (vectors of floats) will become more and more similar. This places
semantically related words closer and closer together in a multidimensional web
of definitions.
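This drift of related words toward each other can be illustrated with cosine similarity between word vectors. The vectors here are made up for illustration; real models learn hundreds of dimensions per word:

```python
import numpy as np

def cosine(a, b):
    """Similarity between two word vectors: 1 = same direction, 0 = unrelated."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical word vectors before training: "cat" and "feline" point apart.
cat_before    = np.array([0.9, 0.1, 0.0])
feline_before = np.array([0.0, 0.2, 0.9])

# After training on sentences where both appear, the vectors have drifted together.
cat_after     = np.array([0.8, 0.5, 0.2])
feline_after  = np.array([0.7, 0.6, 0.3])

print(round(cosine(cat_before, feline_before), 2))  # low similarity
print(round(cosine(cat_after, feline_after), 2))    # high similarity
```

Measuring distance between these vectors is how a model can treat “cat” and “feline” as near-synonyms without ever being told they are related.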
Thus far, we have explained how an NLP system learns to make
predictions about language, but here we are interested in natural language
generation. So how would you get such a system to write its own content? It is
easy: you simply start a sentence for it, or ask it a question. That will fill
its attention with context that it will then use to predict what the next word
should be. It continues on, adding words to the end, generating highly complex,
synthetic speech. A chain of predictions becomes a narrative. By now you should
be able to see why a recent article refers to modern language models as
“stochastic parrots.” They are. They do a fantastic job of mimicking human
language in a parrot-like, chaotic, difficult-to-predict way.
Conclusion
For every word in the English language, there is one word
and only one word that is most likely to follow it (my … name). Some words will
be slightly less likely to follow it (my … cat). Other words may have almost no
probability of following (my … thus). Natural language models from 40 years ago
would predict the next word only from the single word that directly preceded
it. Most of the time they could not formulate a coherent phrase much less a
sentence or paragraph. But as you know, the newer language models look back
much further than just one word. Their attention span maintains whole
paragraphs in memory while constantly adding new words and subtracting the
words that have been there the longest. Like my model of consciousness, they
exhibit a form of working memory that evolves gradually through time. They use,
what I call, “iterative updating” and “incremental change in state-spanning
coactivity.”
But the human working memory doesn’t just track a succession
of words, it tracks a succession of events. Just as there is one most probable
word to follow any series of words, there is one most probable event to follow
a sequence of events. We use our working memory to record events as they happen
to help us decide what to predict next.
In the next blog entry, we will consider how a system like GPT-3 could be incorporated into a much larger system of neural networks to better approximate human working memory by using, not just words, but events as long-term dependencies. This will allow an AI to make predictions, not just of the best word to use next, but the best behavior. We will also discuss what other modular networks would be valuable to such a larger system. For instance, we would want modules that correspond to video input, mental imagery, and motor output. GPT-3 can currently interact with other programs, but these programs are far from being tightly integrated. To create a conscious system, we want multiple neural networks with different functional specializations to be tightly interwoven just like they are in our brain. What role would an AI like GPT-3 play in this larger system? It would probably play the role of Broca’s area, the human brain’s language region. For all the details, please stay tuned for the next entry.
Also, you might want to watch my YouTube video lecture on working memory and consciousness: