Friday, July 2, 2021

How AIs Put Their Sentences Together: Natural Language Generation

AI that can produce natural language is a hot topic today. Here we are going to discuss how it is structured, how it works, how it learns, and how it could possibly be improved.

Natural language processing (NLP) is a subfield of AI concerned with recognizing and analyzing natural language data. Alexa, Siri, and Google Assistant all use NLP techniques. Capabilities of NLP software include speech recognition, language translation, sentiment analysis, and language generation. Here we are primarily interested in natural language generation, which means the creation of written text. There is a long history of software that can produce language but only in the last few years has it approached human-level capability.

There are many state-of-the-art systems we could discuss, but here we are going to focus on one called GPT-3. It is an exciting new AI system that has proven to be highly adept at natural language generation. It can answer questions, write computer code, summarize long texts, and even write its own essays. Its writing is so good that it often seems as if it were written by a human.

You can feed GPT-3 the first two sentences of a news article and it will write the rest of the article in a convincing manner. You can ask it to write a poem in the style of a certain author, and its output may be indistinguishable from an actual poem by that author. In fact, one blogger created a blog where they posted only GPT-3 text as entries. The entries were so good that readers were convinced they had been written by a human and started subscribing to the blog.

Take a look at a few examples of its responses to simple questions:

Traditionally AI does poorly with common sense, but as you can see many of GPT-3’s responses are highly logical. GPT-3 was trained on thousands of websites, books, and most of Wikipedia. This enormous and diverse corpus of unlabeled text amounted to hundreds of billions of words. Despite the fact that what it is doing is simple and mechanical, because GPT-3 has so much memory, and has been exposed to such a high volume of logical writing from good authors, it is able to unconsciously piece together sentences of great complexity and meaning. The way it is structured is fascinating, and I hope that by the end of this post you have a strong intuitive understanding of how it works.

What is Natural Language Generation Doing?

NLP uses distributional semantics. This means keeping track of which words tend to appear together in the same sentences and how they are ordered. Linguist John Firth (1890 – 1960) said, “You shall know a word by the company it keeps.” NLP systems keep track of when and how words accompany each other statistically. These systems are fed huge amounts of data in the form of paragraphs and sentences, and they analyze how the words tend to be distributed. They then use this probabilistic knowledge in reverse to generate language.
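Firth’s idea can be made concrete in a few lines of code. The toy corpus below, and the choice to count co-occurrence within whole sentences, are my own illustrative assumptions, not how any production NLP system is actually configured:

```python
from collections import Counter
from itertools import combinations

# A toy corpus: distributional semantics starts from raw sentences like these.
sentences = [
    "the cat chased the mouse",
    "the dog chased the cat",
    "my cat ate the fish",
]

# Count how often each pair of distinct words appears in the same sentence.
co_occurrence = Counter()
for sentence in sentences:
    words = sorted(set(sentence.split()))
    for pair in combinations(words, 2):
        co_occurrence[pair] += 1

# "cat" keeps company with "the" in all three sentences.
print(co_occurrence[("cat", "the")])  # → 3
```

Statistics like these are the raw material: a word’s “company” becomes a numerical signature that the system can later run in reverse to generate text.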

As they write, NLP systems are “filling in the blank” in a process called “next word prediction.” That’s right: GPT has no idea what it is going to say next; it literally focuses on one word at a time, one after another. GPT-3 “knows” nothing. It only appears to have knowledge about the world because of the intricate statistics it keeps on the mathematical relationships between words in works written by human authors. GPT-3 is basically saying: “Based on the training data I have been exposed to, if I had to predict what the next word in this sentence was, I would guess that it would be _____.”

When you give an NLP system a single word, it will find the most statistically appropriate word to follow it. If you give it half a sentence, it will use all the words to calculate the next most appropriate word. Then, after making that first recommendation, it uses the chosen word, along with the rest of the sentence, to recommend the next one. NLP systems compile sentences iteratively in this manner, word by word. They are not thinking. They are not using logic, mental imagery, concepts, ideas, or semantic or episodic memory. Rather, they are using a glorified version of the autocomplete in your Google search bar or your phone’s text messaging app.

To really get a sense of this, open the text app on your phone. Type one word, then see what the phone offers you as an autocomplete suggestion for the next word. Select its recommendation. You can keep selecting recommendations to string together a sentence. Depending on the algorithm the phone uses (likely Markovian) the sentence may make vague sense or may make no sense at all. In principle, though, this is how GPT and all other modern language-generating models work. The screenshots below show a search on Google, and some sentences generated by my phone’s predictive text feature.

A. Google using autocomplete to give you likely predictions for your search. B. Using the autocomplete suggestions above my phone’s keyboard to generate nonsense sentences.
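For the curious, here is a rough sketch of the kind of Markovian autocomplete described above. The toy training text and the greedy always-pick-the-top-suggestion strategy are illustrative assumptions; a real phone keyboard uses a far larger model:

```python
from collections import Counter, defaultdict

# Build a bigram table: for each word, count which words follow it.
training_text = "this is my pet cat and this is my pet cat and this is my pet dog"
words = training_text.split()
next_word_counts = defaultdict(Counter)
for current, following in zip(words, words[1:]):
    next_word_counts[current][following] += 1

def autocomplete(word):
    """Return the most frequent follower of `word`, like a phone keyboard."""
    return next_word_counts[word].most_common(1)[0][0]

# String a sentence together by repeatedly accepting the top suggestion.
word = "this"
sentence = [word]
for _ in range(4):
    word = autocomplete(word)
    sentence.append(word)
print(" ".join(sentence))  # → "this is my pet cat"
```

Because each step looks at only one previous word, this generator drifts off topic almost immediately on real text, which is exactly the myopia the next section contrasts with GPT-3’s attention.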


GPT-3 Has a Form of Attention

Most autocomplete systems are much more myopic than GPT. They may take only the previous word, or the previous two words, into consideration. This is partially because it becomes very computationally expensive to look back further than a couple of words. The more previous variables that are tracked, the more expensive it gets. A computer program that had a list of every word in the English language along with the single word most likely to follow each one would take up very little space in computer memory and require very few processing resources. However, what GPT-3 does is much more complex, because it looks at many preceding words to make its decisions.

The more words, the more context. The more context, the better the prediction. Let’s say you were given the word “my” and asked to predict the next word. Not very easy, right? What if you were given “is my”? Still not very easy. How about “today is my”? Now those three words might give you the context you need to predict that the next word is “birthday.” Words occurring along a timeline are not independent or equiprobable. Rather, there are correlations and conditional dependencies between successive words. What comes later is dependent on what came before. In the four-word string “today is my birthday” there is a short-term dependency between “today” and “birthday.” So being able to hold previous words in working memory is very helpful. More sophisticated AIs like GPT-3 can deal with long-term dependencies too. This is when, an entire paragraph later, GPT-3 can still reference the fact that today is someone’s birthday.
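A tiny experiment makes the point about context width. The miniature corpus below is invented for illustration: with only the word “my” as context the best guess comes out wrong, but with three words of context the prediction snaps to “birthday”:

```python
from collections import Counter, defaultdict

# An invented miniature corpus ("." marks sentence boundaries).
corpus = "here is my exam . there is my exam . today is my birthday .".split()

# One word of context: what usually follows "my"?
after_one = defaultdict(Counter)
for a, b in zip(corpus, corpus[1:]):
    after_one[a][b] += 1

# Three words of context: what usually follows "today is my"?
after_three = defaultdict(Counter)
for a, b, c, d in zip(corpus, corpus[1:], corpus[2:], corpus[3:]):
    after_three[(a, b, c)][d] += 1

print(after_one["my"].most_common(1)[0][0])                     # → "exam" (the wrong guess)
print(after_three[("today", "is", "my")].most_common(1)[0][0])  # → "birthday"
```

The short-context model bets on the most common continuation overall; the wider context resolves the ambiguity, which is the whole argument for attention.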

By attending to preceding words, GPT-3 has a certain degree of precision and appropriateness, and is able to stay on track. For instance, it can remember the beginning of the sentence (or paragraph), and acknowledge it or elaborate on it. Of course, this is essential to good writing. Its attentional resources enable it to remember cues over many time steps, allowing its behavior to retain pertinence by accounting for what came earlier. During training, the GPT-3 software learned what to pay attention to given the context it was considering. This way it does not have to keep everything that came earlier in mind; it only stores what it predicts will be important in the near future.

If you can remember that you were talking about the vice-president two sentences ago, then you will be able to use the pronoun “she” when referring to her again. In this case your use of “she” is dependent on a noun that you used several seconds ago. This is an example of an event being used as a long-term dependency. Long-term dependencies structure our thinking processes, and they allow us to predict what will happen next, what our friend will do next, and they help us finish each other’s sentences. To a large extent, intelligence is the ability to capture, remember, manage, and act on short- and long-term dependencies.

GPT-3 uses its attention to keep track of several long-term dependencies at a time. It selectively prioritizes the most relevant of recent items so that it can refer back to them. This is how it is able to keep certain words “in mind” so that it doesn’t stray from the topic as it writes. GPT-3’s context is 2048 tokens wide, where tokens roughly correspond to words or word fragments. So, it has a couple thousand words as its “context window” or attention span. This is clearly much larger than what a human has direct access to from the immediate past (most people cannot even hold a 10-digit number in mind). Its attention is what allows it to write in a rational, human-like way. Reading the following text from GPT-2, can you spot places where it used its backward memory span to attend to short- and long-term dependencies?

As you can see GPT-2 takes the context from the human-written prompt above and creates an entire story. Its story retains many of the initial elements introduced by the prompt and expands on them. You can also see how it is able to introduce related words and concepts and then refer back to them paragraphs later in a reasonable way.


Some Technical but Interesting Details About GPT-3

GPT-3 was introduced in May 2020 by OpenAI, a company co-founded by Elon Musk and Sam Altman, among others. GPT-3 stands for Generative Pre-trained Transformer 3. The “generative” in the name means that it can create its own content. The word “pre-trained” means that it has already learned what it needs to know. Its learning is actually now complete (for the most part) and thus its synaptic weights have been frozen. The word “transformer” refers to the type of neural network it is (a successor to recurrent networks that relies on attention rather than recurrence). The transformer architecture, by the way, is relatively simple. It has also been used in other language models such as Google’s BERT and Microsoft’s Turing Natural Language Generation (T-NLG).

The 3 in GPT-3 denotes that it is a third-generation product, coming after GPT and GPT-2 as the third iteration of the GPT-n series. GPT-1 and 2 were also groundbreaking and similarly seen as technologically disruptive. GPT-3 has a wider attention span than GPT-2 and many more layers. GPT-2 had 1.5 billion parameters, and GPT-3 has a total of 175 billion parameters. Thus, it is over 100 times larger than its impressive predecessor, which came out only a year before it. What are those 175 billion parameters? The parameters are the adjustable connection weights between its neurons, the “synapses” that change as it learns. The more parameters, the more memory it has, and the more structural complexity to its memory.

You can make a rough comparison between the 175 billion parameters in GPT-3 and the 100 trillion synapses in the human brain. That should give you a sense of how much more information your brain is capable of holding (over 500x). It cost $4.6 million to train GPT-3. At that rate, trying to scale it up to the size of the brain would cost an unwieldy $2.6 billion. However, considering the fact that neural network training efficiency has been doubling every 16 months, by 2032 scientists may be able to create a system with the memory capacity of the human brain (100 trillion parameters) for around the same cost as GPT-3 ($5 million). This is one reason why many people are excited about the prospect of keeping the GPT architecture and just throwing more compute at it to achieve superintelligence.
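For readers who want to check the arithmetic, here is the back-of-envelope math behind those figures (the 16-month efficiency-doubling rate is the estimate cited above, not a law of nature):

```python
import math

# Back-of-envelope arithmetic behind the scaling claims in the text.
gpt3_params = 175e9          # 175 billion parameters
brain_synapses = 100e12      # ~100 trillion synapses
gpt3_training_cost = 4.6e6   # dollars

scale_factor = brain_synapses / gpt3_params
print(round(scale_factor))   # → 571, i.e. "over 500x"

# Naive cost of a brain-sized model at GPT-3-era efficiency:
print(round(scale_factor * gpt3_training_cost / 1e9, 1))  # → 2.6 (billion dollars)

# With training efficiency doubling every 16 months, the number of
# doublings needed to erase that factor, and the years it would take:
doublings = math.log2(scale_factor)
years = doublings * 16 / 12
print(round(years, 1))       # → 12.2 (the early 2030s, counting from 2020)
```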

It is worth mentioning that scaling up from GPT-2 to GPT-3 has not yet resulted in diminishing returns. That is, performance has continued to improve steadily as the model grows. This suggests that just throwing more computing power at the same architecture could lead to equally stunning performance for GPT-4. This has led many researchers to wonder how big this can get, and how far we can take it. I think that it will continue to scale well for a while longer, but I don’t think the transformer architecture will ever approach any form of sentient consciousness. Most forms of AI (machine learning and deep learning) are one-trick ponies. They perform well, but only in one specific domain. My belief is that a specialized system like GPT will continue to be used in the future but will make modular contributions to more generalist systems. I cover that in the next blog entry, which you can read here.

GPT-3 is a closed-book system, which means that it does not query a database to find its answers; it “speaks from its own knowledge.” It has read Wikipedia, but (unlike IBM’s Jeopardy champion “Watson”) it does not have Wikipedia saved verbatim in files on its hard drive. Rather, it “read” or traversed through Wikipedia and saved the impressions that did not already match its existing structure. In other words, it saved information about the incorrect predictions it made about Wikipedia. It is important to keep in mind that it is not a simple lookup table. It is an autoregressive language model, meaning that it predicts future values from its memories of past values. It interpolates and extrapolates from what it remembers. It is amazing at this, and its abilities generalize to a wide variety of tasks. GPT-3 outperforms many fine-tuned, state-of-the-art models in a range of different domains, tasks, and benchmarks. In fact, some of its accomplishments are mind-blowing.

Human accuracy at detecting articles that were produced by GPT-3 (and not another human) is barely above chance at 52%. This means that it is very difficult to tell whether you are reading something written by it or by a real human. It also means that GPT-3 has nearly passed a written form of the Turing test. GPT-3 really is an amazing engineering feat. It shows that by simply taking a transformer network and exposing it to millions of sentences of online text, you can get a system that appears intelligent. It appears intelligent even though it is not modeled after the brain and is missing most of the major features thought by brain scientists and psychologists to be instrumental to intelligence.

It writes as if it has understanding. But in reality, it understands nothing. It cannot build abstract conceptual structures and cannot reliably synthesize new intellectual or academic ideas. It has, however, shown glimmers of a simple form of reasoning that allows it to create true content that was not in its training set. For example, although it cannot add 10-digit numbers (which a pocket calculator can do with ease) it can add 2- and 3-digit numbers (35 + 67) and do lots of other math that it was never trained to do and never encountered an example of. Its designers claim that it has never seen any multiplication tables. Specialists are now arguing about what it means that it can do math that it has never seen.

In the example at the beginning of this blog entry, GPT-3 knew that there are no animals with three legs. This knowledge was not explicitly programmed into it by a programmer, nor was it spelled out explicitly in its training data. Pretty amazing. If GPT-3 were designed differently and got its knowledge from a long, hand-programmed laundry list of facts (like the Cyc AI project), it wouldn’t easily interface and interact with other neural networks. But since GPT-3 is a neural network, it should play constructively and collaboratively with other neural networks. This really underscores the potential value of architectures like this in the future.

GPT-3 is very different from something like IBM’s Watson, the Jeopardy champion (which I have written about here). Watson was programmed with thousands of lines of code, annotations, and conditionals. This programming helped it respond to particular contingencies that were identified by humans in advance. GPT-3 has very little in the way of formal rules. The logical structure that comes out of it is coming from the English language content that it has “read.”

GPT-3 was trained on prodigious amounts of data in Microsoft’s cloud using graphics processing units (GPUs). GPUs are often used to train neural networks because they have a large number of cores. In other words, a GPU is like having many weak CPUs, all of which can work on different small problems at the same time. This is only useful if the computer’s work can be broken down into separate threads (parallelizable). This makes a GPU well-suited for the highly parallel task of modeling neurons, because each neuron can be computed independently. Multicore GPUs have only been accessible to public consumers for the last ten years. They started with 2 cores, then 4, and today they can have thousands. What was the impetus for engineers to build fantastically complex multicore GPUs? It was the demand for videogames. Gaming computers and videogame consoles require better and better graphics cards to handle state-of-the-art games. About a decade ago, AI scientists realized that they could take advantage of this and use GPUs themselves. This is a major reason why deep learning AI is performing at high levels and is such a hot topic today.

OpenAI and Microsoft needed hundreds of GPUs to train GPT-3. If they had only used one $8,000 RTX 8000 GPU (15 TFLOPS), it would have taken more than 600 years to process all of the training that took place. If they had used the GPU on your home computer, it would likely have taken thousands of years. That gives you an idea of how much processing time and resources went into training this network. But what is involved in training a network like this? Let’s discuss what is happening under the hood. (Apart from training, querying the pretrained model is also resource-expensive. GPT-2 was able to run on a single machine at inference time, but GPT-3 must run on a cluster.)


How Neural Networks Work

This next section will offer an explanation of how artificial neural networks operate. It will then show how neural networks are applied to language processing and NLP networks like GPT-3. It is important to point out that this explanation is highly simplified and incomplete, but it should communicate the gist and give you some helpful intuitions about how AI functions.

To understand how a neural network works, first let’s look at a single neuron with three inputs. The three circles on the left in the figure below represent neurons (or nodes). These neurons, X1, X2, and X3, are in the input layer and they are taking information directly from an outside source (the data you input). Each neuron is capable of holding a value from 0 to 1. It sends this value to the next neuron to its right. The neuron on the right takes the values from these three inputs and adds them together to see if they sum above a certain threshold. If they do, the neuron will fire. If not, it won’t. Just as in the brain, the artificial neuron’s pattern of firing over time is dependent on its inputs.

We all know that neural networks learn, so how could this simple system learn? It learns by tweaking its synaptic weights, W1, W2, and W3. In other words, if the neuron on the right learns that inputs 1 and 2 are important but that input 3 is not, it will increase the weights of the first two and decrease the weight of the third. As you can see in the formula below, it multiplies each input by its associated weight and then adds the products together to get a number. Again, if this number is higher than its threshold for activation, it will fire. When it fires, it sends information to the neurons that it is connected to. Remember, this simple four-neuron example would be just a minuscule fraction of a contemporary neural network.
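Here is the three-input neuron from the figure expressed as code. The weights and the threshold are arbitrary illustrative values, not anything from a real trained network:

```python
# A minimal artificial neuron with three inputs, as described above.
def neuron_fires(inputs, weights, threshold=1.0):
    """Weighted sum of inputs; fire (return 1) if the sum clears the threshold."""
    activation = sum(x * w for x, w in zip(inputs, weights))
    return 1 if activation > threshold else 0

# Inputs 1 and 2 are weighted heavily; input 3 has been learned to be unimportant.
weights = [0.8, 0.7, 0.1]

print(neuron_fires([1.0, 1.0, 0.0], weights))  # → 1 (0.8 + 0.7 = 1.5 > 1.0)
print(neuron_fires([0.0, 0.0, 1.0], weights))  # → 0 (0.1 alone is not enough)
```

Learning, in this picture, is nothing more than nudging the numbers in `weights` up or down.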

The figure below shows a bunch of neurons connected together to create a network. This is a very simple, vanilla neural network sometimes referred to as a multilayer perceptron. Each neuron is connected to every neuron in the adjacent layer. This is referred to as being “fully connected.” All these connections are associated with a weight, which can be changed by experience, and thus provide the network with many different possible ways to learn. Each weight is a tuning knob that is adjusted through trial and error in an effort to find an optimal network configuration.

The network below recognizes digits, and here it is shown correctly recognizing the handwritten number 3. A picture of the handwritten 3 is fed into the system. The photo is 28 pixels wide by 28 pixels tall, for a total of 784 pixels. This means that the input layer is going to need 784 neurons, one for each pixel. The brightness of each pixel, on a scale from 0 to 1, is fed into the corresponding input neuron, as you can see below. These numbers pass through the network from left to right. As they do, they are multiplied by their associated weights at each layer until the “activation energy” from the pattern of input results in a pattern of output.

How many output neurons would you expect this network to have? Well, if it recognizes digits, then it should have 10 outputs, one for each digit (0-9). After the activation energy from the inputs passes through the network, one of the ten neurons in the output layer will be activated more than the others. This will be the network’s answer or conclusion. In the example below, the output neuron corresponding to the number 3 is activated the most at an activation level of .95. The system correctly recognized the digit as a 3. Please note the ellipses in each column which indicate that the neurons are so numerous that some of them are not pictured.
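The whole 784-in, 10-out pipeline can be sketched in a few lines. The hidden-layer size and the random (untrained) weights below are placeholders, so this network’s “answer” is meaningless; the point is only the shape of the computation:

```python
import random

random.seed(0)

# A toy forward pass for the digit recognizer described above:
# 784 input pixels -> one hidden layer -> 10 output neurons (digits 0-9).
def layer(values, weights):
    """Fully connected layer: every output is a weighted sum of every input."""
    return [sum(v * w for v, w in zip(values, row)) for row in weights]

n_inputs, n_hidden, n_outputs = 784, 16, 10
w1 = [[random.uniform(-0.1, 0.1) for _ in range(n_inputs)] for _ in range(n_hidden)]
w2 = [[random.uniform(-0.1, 0.1) for _ in range(n_hidden)] for _ in range(n_outputs)]

pixels = [random.random() for _ in range(n_inputs)]  # stand-in for a 28x28 image
hidden = layer(pixels, w1)
outputs = layer(hidden, w2)

# The network's "answer" is the most activated of the 10 output neurons.
print(len(outputs), outputs.index(max(outputs)))
```

With trained weights instead of random ones, `outputs.index(max(outputs))` would be the digit the network recognizes.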

These artificial neurons and the connections between them amount to an artificial neural network. It is modeled and run inside of a computer. It is kind of like a videogame in the sense that these structures don’t actually exist in the real world, but are simulated at high fidelity and speed using a computer. This rather simple network captures some of the important cognitive mechanisms found in the human brain. The neural network is considered a form of machine learning, and when it contains multiple hidden layers, it is referred to as deep learning. Some modern networks have thousands of hidden layers. Now, the amazing thing about these mathematical neurons is that if you connect a large number of them up in the right way and choose a good learning rule (a method of updating the weights), the network can learn to approximate just about any mathematical function. After the network is trained on data, it will capture patterns in the data, becoming one big mathematical function that can be used for things like classification.

In the next diagram we see a neural network that is being used to classify produce. It uses object features (like color, size, and texture) in the first layer to determine the type of produce (e.g., cherry, orange, or lettuce) in the hidden layer. These are then associated with either the fruit or vegetable classification in the output layer. The boldness of the lines indicates the strength of the weights. You can see that red, small, and smooth are all connected strongly to cherry. This means that whenever all three of these are activated by an input pattern, cherry will be selected in the hidden layer. You can also see that cherry is more strongly connected to fruit than to vegetable in the output layer. So, by using only an object’s features, this system can tell a fruit from a vegetable.

Please keep in mind that this example is a simplified, “toy” example. However, neural networks do work hierarchically in this way. At each layer, they take simple features and allow them to converge on more complex features, culminating in some kind of conclusion. In the example involving digit recognition above, the first hidden layer generally recognizes short line segments, and the layers after it recognize increasingly complex line segments including loops and curves. Finally, the output layer puts these together to form whole numbers (as when two loops create the number 8). Again, neural networks go from concrete features to abstract categories because of the way that the neurons in low-order layers (to the left) project to neurons in high-order layers (to the right).

The next diagram shows that neural networks can take many forms of input and come up with appropriate output. Let’s start by looking at the first of the three networks in the diagram. That top network received a picture of a cat and recognized it as a cat. To do this it had to take the brightness of each pixel and turn them into a long list of numbers. There is one input neuron for each pixel, so if there were 2,073,600 (1080 x 1920) pixels then there must be that many neurons in the input layer. The numbers (vectors) then flow mathematically through the network toward the two output neurons, dog and cat. Cat ended up with a higher activation level than dog. Thus, the system is “guessing” that the object in the photo is a cat. But to guess correctly the system must first be trained.

Now let’s talk about learning. When the system gives a correct answer, the connections responsible for its decision are strengthened. When it gives a wrong answer, the connections are weakened. This is similar to the reward and punishment that influence neuroplasticity in the human brain. In the example above, the network correctly categorized the picture as a cat. After it did this, it was told by the program it interacts with that it got it right. So, it then went back and strengthened all the weights responsible for helping it make that decision. If it had falsely recognized the picture as a dog, it would have been told that it got it wrong, and it would have gone back and weakened all of the weights responsible for the wrong decision. Going back and making these adjustments based on the outcome of the guess is known as backpropagation. Backprop, as it is sometimes called, is one of the fundamental algorithms responsible for the success of neural networks.
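The strengthen-when-right, weaken-when-wrong idea can be sketched with a single-neuron toy. To be clear, real networks compute gradients via backpropagation rather than this simple rule, and the inputs and label below are invented:

```python
# A drastically simplified error-driven update: strengthen weights that
# pushed toward the right answer, weaken ones that pushed toward a wrong one.
def update_weights(weights, inputs, correct, learning_rate=0.1):
    predicted = 1 if sum(x * w for x, w in zip(inputs, weights)) > 0 else 0
    error = correct - predicted        # 0 if right, +1 or -1 if wrong
    return [w + learning_rate * error * x for x, w in zip(inputs, weights)]

weights = [0.0, 0.0]
# Train on one example: inputs [1, 1] should produce label 1 ("cat").
for _ in range(3):
    weights = update_weights(weights, [1.0, 1.0], correct=1)

prediction = 1 if sum(x * w for x, w in zip([1.0, 1.0], weights)) > 0 else 0
print(prediction)  # → 1: the connections were strengthened until it got it right
```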

As you can see this system requires supervision. This is known as supervised machine learning. It must be told when it is right and when it is wrong and that necessitates that its data is prelabeled. To create the training data for this system a person had to collect and then label thousands of pictures of dogs and cats so that the AI could be told when it is right and when it is wrong.

Next, let’s look at the middle network in the diagram above. This is an optical digit recognizer like the one we saw earlier. This AI system is shown correctly recognizing the number six. The network is behaving in much the same way as the cat/dog classifier, except here you can see that it has 10 outputs rather than just two. This is because it must be able to differentiate between the numbers 0 through 9.

The last network in the diagram is a natural language processing system and it works in a way that is very similar to the first two networks. It is given the first four words in a sentence, “This is my pet…” It is shown correctly predicting the word “cat” as the most probable next word. But this system does not only distinguish cats from dogs. This network must differentiate between all the words in the English language, so it has an input neuron and an output neuron corresponding to every word in the dictionary. That’s right, natural language generating AIs like GPT-3 need a dictionary worth of inputs and a dictionary worth of outputs.

There are around 170,000 words that are currently used in the English language. However, most people only use around 20,000 to 30,000 words conversationally. Many AI natural language models therefore use around 50,000 of the most common words. This means that there are 50,000 different possible outputs for the neural network. Each output is built into the network’s structure and as the activation energy passes from the inputs, through the hidden layers, and toward the outputs, one of those words will be more highly activated than any other. That word will be the one the network chooses.
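The final step of turning those 50,000 output activations into a single chosen word is typically done with a softmax, which converts raw activations into probabilities. The five-word vocabulary and the activation values below are made up for illustration:

```python
import math

# Toy vocabulary standing in for the ~50,000-word output layer.
vocab = ["cat", "dog", "house", "thus", "birthday"]
# Hypothetical raw activation of each output neuron after "this is my pet..."
activations = [3.1, 2.8, 0.4, -2.0, 0.1]

# Softmax turns activations into a probability distribution over the vocab.
exps = [math.exp(a) for a in activations]
total = sum(exps)
probs = [e / total for e in exps]

# The most activated word is the network's choice for the next blank.
best = vocab[probs.index(max(probs))]
print(best)  # → "cat"
```

Note that “dog” also ends up with substantial probability; this is why sampling (rather than always taking the top word) can produce varied but still sensible text.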

The next diagram shows how a natural language processing network makes its decisions. The neuron for the word “pet” spreads its activation to all the neurons in the first hidden layer because it is connected to each one. Again, due to their weights, some neurons value this input more than others. These then fire at the next hidden layers until the activation reaches the output layer, activating all the neurons there, at least to some extent. One neuron in the output layer values this pattern of activation more than any other. The neuron that is activated the most, “cat,” is the one chosen by the network as being the most likely to follow the word “pet.” This is a helpful diagram, but it is a huge oversimplification. When GPT-3 chooses the word “cat,” it is not because one word (pet) selected it; it is because the vectors for many words converged on it together. Remember, we said that GPT-3 has an attention window 2048 tokens wide. That gives you an idea of just how many inputs are being considered simultaneously to select the next output.

Now, let’s put all of this together and consider what is happening when a natural language processing system like GPT-3 is undergoing training. Luckily, its training data does not need to be labeled by a person and its training process does not have to be supervised. Why, you ask? Because the right answer is already there in the text. The trick is that it hides the next word from itself. As it reads, the next word is hidden from its view. It must predict what that next word will be. After it guesses, it is allowed to see if it was right. If it gets it right, it learns. If it gets it wrong, it unlearns.
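This is why no human labeling is needed: the labels fall out of the text itself. Here is a sketch of how a raw sentence becomes (context, hidden-next-word) training pairs:

```python
# Self-supervised "fill in the blank" training data, generated for free
# from raw text: each position's next word is hidden and becomes the label.
text = "today is my birthday".split()

training_pairs = [(text[:i], text[i]) for i in range(1, len(text))]
for context, hidden_word in training_pairs:
    print(context, "->", hidden_word)
# (["today"], "is"), (["today", "is"], "my"), (["today", "is", "my"], "birthday")
```

Every sentence in Wikipedia yields pairs like these, which is how hundreds of billions of words become hundreds of billions of training examples without a single human annotator.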

With the cat and dog classifier, the system would make a prediction, learn and then start all over again with a new picture. Natural language generating AIs do not start over with each word. Rather, they keep reading and using the previous words to update their attention in an iterative fashion. The diagram below gives an example of this. In the example, the system uses the context it is holding to guess the first two words accurately (“is” and “my”) but gets the next two wrong (“pet” and “cat”). When a system like GPT-3 reads through Wikipedia it is constantly making errors, but because its attention is so wide, after extensive training it develops a preternatural ability to make inferences about what word could be coming next. 


So to recap, GPT-3 takes a series of words (starting with the words you give it as a prompt or with the series that it is currently in the process of generating) and then fills in the next blank. It gives some consideration to each word in the dictionary every time it chooses the next word. To decide which one to use it pushes data input through a mathematical network of neurons, toward its entire dictionary of outputs. Whichever word receives the most spreading activation out of all the potential outputs is assigned the highest probability and is used to fill in the next blank. To accomplish this using mathematics the words themselves are represented as vectors (strings of numbers) and these vectors interact with the numerical structure of the existing network (through matrix multiplication).

Another way to frame this is to point out that GPT is basically asking, “given the previous words, what is the probability distribution for the next word?” Once it samples a word from that distribution, it appends it to the sequence and repeats the process, computing a fresh distribution that takes the new word into account.
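That sample-append-repeat loop can be written out directly. The toy lookup table below is a hypothetical stand-in for GPT-3’s learned network; the real model computes its distributions on the fly rather than storing them:

```python
import random

random.seed(42)

# Each entry maps a context word to a probability distribution over next words.
toy_model = {
    "my":   {"name": 0.7, "cat": 0.25, "thus": 0.05},
    "name": {"is": 1.0},
    "cat":  {"sleeps": 1.0},
    "thus": {"far": 1.0},
}

def sample_next(context_word):
    """Sample the next word from the model's distribution for this context."""
    dist = toy_model[context_word]
    return random.choices(list(dist), weights=list(dist.values()))[0]

# Autoregressive generation: sample, append, repeat.
words = ["my"]
for _ in range(2):
    words.append(sample_next(words[-1]))
print(" ".join(words))  # e.g. "my name is"
```

Because the next word is sampled rather than fixed, the same prompt can yield different continuations on different runs, which is part of why GPT-3’s output feels creative.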

The diagram above shows how, when given the string of words “this is my pet…”, an AI that had not finished training could come up with a word like “dog.” The right word was “cat.” So, when GPT-3 gets it wrong it will learn from its mistake. The mathematical difference between “cat” and “dog” is calculated and used to update the system. Of course, this is an arbitrary distinction (and much of what the system learns is arbitrary). There is nothing wrong with saying “this is my pet dog.” But if this phrase occurred in an article about cats, there might be something wrong with it. Because GPT-3’s attention is wide enough to recognize an article about cats, the correction is more helpful than it might seem: it trains the system to group similar words together.

Before training takes place, the system’s outputs are generated at random because its parameters (synaptic weights) are set to random values (like most neural networks). But during training, inappropriate responses are compared to correct responses, the error of the response is calculated, and this error value is used to alter the model’s parameters so that it is more likely to choose the right word next time. In doing so, it changes the way it mathematically represents the word “cat” in its network. For instance, the words “cat” and “feline” may not be related in its memory at all at first, but during training they will come to be more closely related because they are likely to pop up in the same sentences. Another way of saying this is that the system learns to group things that appear close together in time (temporal contiguity). The way these two words (“cat” and “feline”) are encoded in memory as numbers (vectors of floats) will become more and more similar. This places semantically related words closer and closer together in a multidimensional web of definitions.
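A toy version of this error-driven update, using the same kind of random vectors and weights as before. For simplicity only the output weights are adjusted here; in real training the word vectors themselves are also updated, which is what gradually pulls words like “cat” and “feline” together:

```python
import numpy as np

rng = np.random.default_rng(1)
vocab = ["this", "is", "my", "pet", "cat", "dog", "feline"]
V, D = len(vocab), 6
E = rng.normal(size=(V, D))   # word vectors: random before training
W = rng.normal(size=(D, V))   # output weights: random before training

def probs(h):
    s = h @ W
    e = np.exp(s - s.max())
    return e / e.sum()

context = ["this", "is", "my", "pet"]
target = vocab.index("cat")   # the word that actually came next

h = E[[vocab.index(w) for w in context]].mean(axis=0)
for _ in range(500):
    p = probs(h)
    error = p.copy()
    error[target] -= 1.0            # error: predicted minus correct (one-hot)
    W -= 0.1 * np.outer(h, error)   # nudge weights so "cat" becomes likelier

p = probs(h)  # after training, "cat" dominates the distribution
```

After a few hundred of these small corrections, the same context that once produced random guesses reliably predicts the right word.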

Thus far, we have explained how an NLP system learns to make predictions about language, but here we are interested in natural language generation. So how would you get such a system to write its own content? It is easy: you simply start a sentence for it, or ask it a question. That fills its attention with context, which it then uses to predict what the next word should be. It continues on, adding words to the end, generating highly complex, synthetic speech. A chain of predictions becomes a narrative. By now you should be able to see why a recent article refers to modern language models as “stochastic parrots.” They are. They do a fantastic job of mimicking human language, but in a parroting, chaotic, difficult-to-predict way.
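The whole generation procedure reduces to a simple loop: seed the model with a prompt, predict, append, repeat. In the sketch below a hypothetical hard-coded lookup table stands in for the trained network:

```python
# Stand-in for a trained model: maps the last word to a likely next word.
table = {
    "this": "is", "is": "my", "my": "pet", "pet": "cat",
    "cat": ".",
}

def predict_next(context):
    # A real model would weigh the whole context; this toy only looks
    # at the final word and falls back to "." when it has no entry.
    return table.get(context[-1], ".")

def generate(prompt, max_words=10):
    words = prompt.split()
    while len(words) < max_words and words[-1] != ".":
        words.append(predict_next(words))  # a chain of predictions
    return " ".join(words)

print(generate("this is"))  # -> "this is my pet cat ."
```

This is, of course, a parody of GPT-3's sophistication, but the control flow is genuinely the same: generation is nothing more than repeated next-word prediction.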



For every word in the English language, there is one word and only one word that is most likely to follow it (my … name). Some words will be slightly less likely to follow it (my … cat). Other words may have almost no probability of following it (my … thus). Natural language models from 40 years ago would predict the next word only from the single word that directly preceded it. Most of the time they could not formulate a coherent phrase, much less a sentence or paragraph. But as you know, the newer language models look back much further than just one word. Their attention span maintains whole paragraphs in memory while constantly adding new words and subtracting the words that have been there the longest. Like my model of consciousness, they exhibit a form of working memory that evolves gradually through time. They use what I call “iterative updating” and “incremental change in state-spanning coactivity.”
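A one-word-lookback (bigram) model of the kind described above can be built in a few lines by counting word pairs in a corpus. The tiny corpus here is an assumption for illustration:

```python
from collections import Counter, defaultdict

# The older approach: estimate P(next word | previous word) from raw
# bigram counts, looking back only a single word.
corpus = "my name is sam my cat is fat my name is max".split()

counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1  # tally every adjacent word pair

def bigram_probs(prev):
    c = counts[prev]
    total = sum(c.values())
    return {w: n / total for w, n in c.items()}

# "name" follows "my" twice and "cat" once, so "name" is most likely.
p_my = bigram_probs("my")
```

With such a short memory, the model can pick a plausible next word but quickly loses the thread, which is exactly why these systems rarely produced a coherent sentence.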

But the human working memory doesn’t just track a succession of words, it tracks a succession of events. Just as there is one most probable word to follow any series of words, there is one most probable event to follow a sequence of events. We use our working memory to record events as they happen to help us decide what to predict next.

In the next blog entry, we will consider how a system like GPT-3 could be incorporated into a much larger system of neural networks to better approximate human working memory by using, not just words, but events as long-term dependencies. This will allow an AI to make predictions, not just of the best word to use next, but the best behavior. We will also discuss what other modular networks would be valuable to such a larger system. For instance, we would want modules that correspond to video input, mental imagery, and motor output. GPT-3 can currently interact with other programs, but these programs are far from being tightly integrated. To create a conscious system, we want multiple neural networks with different functional specializations to be tightly interwoven just like they are in our brain. What role would an AI like GPT-3 play in this larger system? It would probably play the role of Broca’s area, the human brain’s language region. For all the details, please stay tuned for the next entry.

Thursday, July 1, 2021

Use the Chimpanzee Pant Hoot to Rehabilitate Your Breathing and Voice


I believe that humans can benefit from producing a chimpanzee vocalization called the pant hoot. The pant hoot engages the muscles of respiration, especially the diaphragm, in a way that increases their strength and resilience. On the website for my self-care system, I have written about how people stifle their voice and their breathing to show modesty. This creates loads of unnecessary tension in the diaphragm, which makes breathing shallow and short. The diaphragm is the muscle that contracts to push air past your vocal cords, providing the power behind your speech. So weakness in this muscle is audible when you speak. I believe that the immodest pant hoot can do a great deal to help rehabilitate this tension, the diaphragm, and the voice.

The chimpanzee pant hoot is a well-studied, long-distance call that serves as a ventilating display. It lasts several seconds (usually 5 to 10) and is used to demonstrate strength and dominance. Chimpanzees put a lot of effort into their calls, necessitating powerful respiratory muscles and robust vocal cords. Males and females make the call, but high-ranking males make the loudest and longest calls and make them more frequently. The higher a chimp’s dominance rank within its party, the more intensely it pant hoots (Fedurek et al., 2014). The ones with the most powerful and agile pant hoots are those that have the most practice. Of course, the ones that have the most practice are the ones that are dominant or have been dominant the longest. They were able to develop skill because they were confident enough to show off.

Several males often use the pant hoot together to advertise the strength of their party and defend territory from neighboring groups. It is most commonly used within a single group though, especially in the presence of high-quality food or females in estrus. This indicates that it is used during social competition to designate status and priority over resources. It is also thought to be involved in sexual selection, signifying high quality to prospective mates. Basically, a spirited, bellowing pant hoot is sexy.

The pant hoot consists of a series of low, breathy “hoo’s” that grow increasingly loud. They often proceed through four phases: an introductory phase, a build-up, a climax, and a let-down. Here we are interested in the climax because it provides the most exercise for the respiratory muscles. When I started imitating the pant hoot, not only was it difficult to sustain, but it created a deep aching sensation in my chest. It ached so much that at first I thought it wasn't healthy for me. After a few sessions, though, I realized that the aching was subsiding, that it was becoming easier, that I could do it for longer, and that it actually started feeling good. I believe that it hurt at first because I had been holding a certain portion of my diaphragm immobile for years.

Social concerns and propriety cause us to feel awkward yelling and screaming like a chimpanzee. Not only do we not do it in polite company, but most of us don’t do it at all. Young children engaged in sports or games are often highly vocal and can be heard making rough and rugged calls for hours at a time. The most dominant children are the most vocal, and the least dominant children are the least vocal. This extends into adulthood, and we all learn to become more and more modest with our voices. As we get older, the self-suppression of the vocal and respiratory muscles keeps us from exhibiting exuberance with our voices. Sadly, this even extends to our breathing.

Most experts now know that shallow breathing, where the diaphragm is used minimally, plays a prominent role in anxiety and depression. As might be expected, studies have shown that the more traumatized a mammal is, the less mobility its diaphragm exhibits and the shallower its breaths. Even mice and rats that have been exposed to stress have immobile, tense diaphragms. Chronic tension in a muscle leads to chronic pain and this is why producing the pant hoot hurt me when I started. Coughing, laughing, hiccupping, and deep breathing all engage and stimulate the diaphragm. The pant hoot though is probably the most strenuous way to rehabilitate your diaphragm. Used gradually and carefully, it may also be the most efficacious.

I believe that the pant hoot simulates impulsive, impetuous, devil-may-care yelling that all but the most dominant of apes have spent their lives inhibiting. I also think that dominant chimpanzees use the call to prove to others that they have spent most of their lives socially relaxed. They are putting their lungs on display and showing others that they can use their voice to lash out violently and unhesitatingly if need be. It shows that they do not have a history of victimization but rather a history of victories. You want this for yourself, and you can get it within a few months by practicing the pant hoot. Because it is probably not something you would feel comfortable doing in public, do it at home, in a closet, or in your car where you don’t have to worry about others hearing you. If you practice it enough, it will help you speak more commandingly and assertively. It will also desensitize your vocal apparatus to abrupt eruptions of loud speech, making you more charismatic, confident, and less likely to be interrupted by others.

The pant hoot was popularized on the Arsenio Hall talk show in the 1990s. There it was used in a collegial, energetic way to show support for the host or his guests. The crowd would often swing their arms above their head and hoot resolutely and rowdily. The way it is done on the show is an excellent example of how I would like you to do it in the exercise below. The exercise also partly overlaps with a yoga technique called kapalbhati breathing. Reading about that technique will help you better recruit your diaphragm with each hoot.

The pant hoot is a power play that most of us are usually too self-conscious to use. Most of us are afraid to use the diaphragm in that way. Our lack of exercising it has created a disuse injury. Insufficient skill in coordinating the transition between rapid inhalations and exhalations is a manifestation of trauma. Building better coordination over the switch between breaths may make you a less hesitant, more confident breather. Chimpanzees perform their pant hoots daily. I want to encourage you to use it daily. Before trying it yourself, listen to a chimp perform it in the pant hoot compilation video I created below.

This video is coming later this week.

The exercise here (Program Peace Exercise 11.16) will ask you to hoot using a series of deep yells that take place while rapidly switching between inhalations and exhalations. Each cycle should last between a quarter and a half of a second. You will find that hooting is very strenuous at first yet becomes facile over just a few days. This may seem like another eccentric exercise, but remember, Program Peace is all about finding weakness in the body and rehabilitating it to find new strength, primarily when this strength is associated with dominance in closely related animals.

Breathing Exercise # 11.16: Pant Hooting

Practice breathing in and out at very short intervals while vocalizing loudly. Alternate between inhalation and exhalation around two to five times per second. Do it rhythmically and with control. You should be able to see and feel your abdomen contract with each pant. This indicates that you are using the diaphragm to power the exhalations. During each exhalation, yell “hoo” very loudly and deeply. You should be imitating the chimpanzee pant hoot that you listened to in the online video above. 

Time yourself using your phone’s stopwatch. At first, try to reach 30 seconds of intense hooting. Over several weeks you should be able to do it for more than a minute. Once you reach proficiency, you can try doing what the chimps do and vocalize, not just on the exhalations but also on the inhalations. This is much more tiring because it creates more resistance against the breath.

As you do this, you will notice that your breathing will falter every few times you switch. Your timing will be off because you don’t have fine enough control over the transitions between inhalations and exhalations. This kind of poor respiratory coordination may be a contributor to autonomic dysregulation. Use this exercise to iron out these irregularities and unbrace and strengthen the muscles involved.

After a minute of pant hooting intensely, your chest and voice will feel agitated. It may even start to burn. However, if you concentrate on letting the muscles go limp afterward and practicing deep breathing, they will relax like never before. You will feel calmer and notice that your voice is more substantial and deeper for up to a day. The first few days may be irritating, but this will pass. The more you use paced diaphragmatic breathing guided by a breath metronome (before and after the exercise) the less irritated you will feel.


Duration: One minute. Proficiency: Four sessions a week for 24 weeks. Maintenance: Four times per month. Five stars.

I believe that both humans and chimps add tension to the aspect of the diaphragm responsible for the pant hoot. They do this as part of a submissive display that self-handicaps the bark and roar, which are both generated with force produced by the diaphragm. This is why only the most dominant chimps are capable of performing an optimal pant hoot. You will know that your pant hoot is optimal when you can do it vigorously for an entire 60 seconds without any sore or achy feeling in your chest. Now, after pant hooting I feel endorphins and a tremendous sense of relief.

People are not just going to give you respect. For you to earn their respect you have to prove to them that you are capable of speaking forcefully and nimbly. This exercise will help you coordinate the activity of breathing and vocalizing in a convincingly powerful way. Even if you have a long history of repressing your vocal power, this exercise will help you produce rapid, brisk, and spontaneous contractions of the diaphragm. People will be able to hear the stability and vitality of your diaphragm as you speak. I believe that it helped me become a better communicator and a better public speaker. I think it is worth mentioning that my ex-girlfriend liked it when I practiced the pant hoot; it gave her goosebumps, which I could see and feel on her skin.

The exercise above will drive an unused aspect of your diaphragm into full fatigue, allowing it to recover from the tension you have imposed on it for years. Unnecessary tension in the diaphragm may be the root cause of fear, neuroticism, and submissiveness. Pant hooting loudly and joyfully will optimize this function of your diaphragm making it agile and powerful. I firmly believe that after just a handful of sessions, you will realize that the pant hoot achieves a diaphragmatic detox, reaching into and purging the nucleus of your anxiety.

Fedurek P, Donnellan E, Slocombe KE. 2014. Social and ecological correlates of long-distance pant hoot calls in male chimpanzees. Behavioral Ecology and Sociobiology. 68(8): 1345-1355.

Monday, June 21, 2021

Amsco Marvel World Playset: Description, Pictures, Scans, and Video

As a long-time comic book reader I wanted to post an entry about the Marvel World Adventure Playset. It debuted in 1975 during the Bronze Age of comic books and was created by Amsco, the toy division of board game maker Milton Bradley. The playset depicts five of the major buildings from Marvel Comics. These include Doctor Strange’s Sanctum Sanctorum, The Fantastic Four’s Baxter Building and plane, Peter Parker’s home, the Daily Bugle, and the Avengers Mansion. It is made of die-cut, heavyweight cardboard. Every piece of cardboard has color graphics printed on both sides. Click on any of the pictures below for a larger version.

Here is the front of the set with the characters:

And here is the back:

Before you continue reading, you might want to watch the video below that I uploaded to YouTube about the set. That video covers much of what is discussed here. It also shows the set in 360 degrees.

In the comics, all these buildings are located in New York City. And in this set, they are all found on the same block. However, there is a minor inaccuracy here because Peter Parker and his Aunt May traditionally live in Queens while the other buildings are in Manhattan. But this certainly doesn’t detract from the charm. Other than that, the buildings and characters are highly comic-accurate, which was rather rare for merchandise at the time. It sold for $6.95. And, if you ordered it by mail, there was an 89-cent postage and handling fee. It was marketed to ages “five and up.” Me? Uh… I believe I’m over five.

And speaking of immaturity, excuse me, but I got a kick out of placing this cityscape in the context of a downtown skyline. This picture was taken from the roof of a 20-story apartment building (Promenade Towers) just north of downtown LA.

The set comes with 34 cardboard figurines that fit into small plastic stands. Each is between 2 and 3 inches tall. The set I bought did not come with all of the figurines and included no stands, so I used a scan provided to me by the seller to print the characters onto heavy cardstock. I left a flap at the bottom of each cutout and used double-sided tape to attach each flap to a quarter. A quarter seemed to be a good size and weight. It worked pretty well, and if you print your characters from the scans below you might try doing the same.

Here are the contents of the box after they have been punched out, but before they have been put together. I wiped them down with a dry cloth before putting them together. Assembling the buildings takes at least 20 minutes, and I had fun doing it. I cannot imagine a five-year-old bending the corners and making all of the panels fit properly, but maybe they made five-year-olds differently in the 70s.

Here I am using the assembly instructions to find out how to mount the buildings onto the base (which was the bottom of the box).

Here is a close up of the base.

Here are the assembly instructions. Full scans of the set and the instructions can be found closer to the bottom of this post.

And here it is fully assembled. 

Marvel World is a very detailed and colorful set with fantastic art. The consensus on the web is that the characters were probably drawn or inked by either John Romita Sr. or Sal Buscema (or a combination of the two). Often the artist who draws in pencil is different from the artist who inks the work, and this may be why it is so difficult to discern. Romita and Buscema were very popular Marvel artists in the 70s. Online commenters suggest that the buildings were inked comic-style by Marvel artist Dave Hunt. Sections of the artwork seem to be inspired by some of the other early greats. For example, Reed Richards’s lab has some outlandish equipment styled after Jack Kirby drawings, and the Sanctum contains mystic art that channels Steve Ditko’s work. It definitely has the look and feel of 1970s Marvel.

Here are some closeups of the characters. Who do you think the artist was?

The set itself is quite rare, and there is not much about it online. Sometimes it is referred to as the “holy grail” of Marvel toy collecting and it is arguably the mother to dozens of other subsequent playsets. Searching online I only found low-res pictures and no video footage. I created this post and the video above to address this. I got the set pictured here in the mail recently from a great guy named Steve in San Francisco whose auction I won on eBay. I set an automatic search notification for the set many months prior and this was the first one I was alerted about. 

I wish I had a complete and undamaged set to show you. Unfortunately, this set is missing a few characters and has a bit of damage. However, I was able to print out the missing characters and fix most of the tears. I carefully pulled off the pieces of old tape and applied glue to broken sections. The set was also missing two roof panels, so I scanned the opposite (symmetric) side, flipped it digitally, printed it, cut it out, and pasted it to the set.


The detail is exacting. Take Peter Parker and Aunt May’s Victorian house, for instance. It is highly decorated with classical entablature, egg-and-dart molding, an ocular window, a skylight, a basement, a mechanical doorbell, and a windowsill flower planter. I believe this is supposed to be their home in Forest Hills, Queens, New York. The address on the front says #220, which is similar to their address in the comics, 20 Ingram Street.

The first floor appears to be Aunt May’s living room. It is a cozy and welcoming space. It features a writing desk with a stool, a rotary telephone, and a mirror. There is a blazing fireplace with a candelabra, a clock, and a porcelain dish on the mantle. Above the mantle is a detailed painting of a man leading a cow to a classical temple in front of a mountain.

Here is a closeup of the painting over Aunt May's fireplace.

The second floor is Peter’s room, complete with the chemistry equipment he uses to make his webbing fluid. The dresser to the left (not seen) features a photo of his Aunt and his girlfriend Mary Jane. You can also see a trophy, a football, a shelf lined with books, a lab coat, a record player, and headphones. Clues to his secret identity as Spiderman can be seen in the double locks on the door. You can also see that somehow, despite being a full-time worker, student, boyfriend, and superhero, he had the time to make his bed this morning.

The Sanctum Sanctorum is a three-story Victorian-style brownstone townhouse (just as in the comics). It is in the French Baroque style, replete with fancy masonry and a Mansard roof. It features the expected skylight, known as the Window of Worlds, bearing the seal of the Vishanti, under which lies his inner sanctum. The yellow flashes of light behind the windows imply that there is magic going on inside.

If you remove the Sanctum Sanctorum you can see the entire wall of the Avengers Mansion. It has cosmic art on it which is intended to be seen when the front door of the Sanctum is opened.

...this is that view from the front door.

Also visible from the door of the Sanctum is this otherworldly scene on the back of the garage door to Avengers Mansion.

And this psychedelic strip of the cosmos.

The set has a few interactive features that are enticing for imaginary play: namely an elevator, a trap door, and a break-away wall. The elevator is found in the Baxter Building and it can hold characters.


On the right you can see the front door to the Baxter Building. Because it is the back side of the elevator it opens when the elevator goes up and closes when the elevator goes down.

The Avengers Mansion (I like to think of it as Avenger's H.Q.) has a garage door that can open and close to reveal the interior. Above that door and attached to it, is a trap door that characters can fall through. 

Avengers Mansion (called Avenger's Townhouse on the box) is the least developed of the five buildings. There is little indication that the brick building behind the Sanctum Sanctorum indeed belongs to the Avengers. When you open the garage door there is more machinery, very similar to that in Mr. Fantastic's lab. This machinery is opening a three-dimensional portal to the Negative Zone (we know it is the Negative Zone because it is advertised on the box), and the multi-layered effect is pretty cool. Within that portal you can see planets, a nebula, a comet, and an asteroid.

You can even see the inside of the Sanctum Sanctorum if you push your cell phone camera inside that portal cut out to the Negative Zone. You can see a burning cauldron on one side and a crystal ball (possibly the Orb of Agamotto) on a pedestal on the other.

Here is the view of the reverse side of the Sanctum when that piece is removed and straightened. Of course, much of this view is not visible when the set is fully assembled.


On the side of the Daily Bugle there is a hinged door cut to look like bricks which appears broken when swung open. Not including this brick cut-out-door or the trap door, there are four other hinging doors on the set and four open windows.

The Fantastic Four's airplane is also included. The promotional material calls this the Air Car, so it is not their more popular, multiple-passenger Fantastic Car or the Pogo Plane.  

This is the top floor of the Daily Bugle, J. Jonah Jameson's office.

Gotta love that Spiderman dartboard on the wall.

Here is the disheveled bottom floor of the Daily Bugle with an unmade bed, an unkempt drawer, a pin-up swimsuit model, a bookcase, a mail slot, a broken mirror, and an elevator.

This is the bottom floor of the Baxter Building. It seems to be some kind of entrance area or elevator lobby. 

This seems to be the workout room in the Baxter Building with weights, rings, a meeting table and a computer.

And the third and final floor of the Baxter Building is clearly Mr. Fantastic's laboratory. It definitely has that Jack Kirby feel to it.

So, 1975. This was a world after Star Trek (1966) but before Star Wars (1977). Marvel had carried the “Marvel” name since 1961, so the brand was only about 14 years old. Spiderman (created in 1962) and Iron Man (1963) were similarly young. The X-Men had been around for 12 years, but most of their iconic characters had yet to be introduced in Giant-Size X-Men #1 (later that same year, in 1975). This is why there are no X-Men in the Marvel World figurine lineup. They just hadn’t become popular enough yet. The characters in the set were Marvel’s major players at the time. And remember, in 1975 these characters took a backseat to DC Comics’ stable of more popular characters.

I was born a few years after this came out, but I had never heard of it until two years ago. I believe that it is very much overlooked, as there is currently very little about it online, but if awareness increases these sets may rise significantly in value. Consider the fact that the set contains Loki, Falcon, Vision, and Scarlet Witch, currently among the most popular MCU characters. The set comes up every several months on eBay and other auction websites, and it might be a worthwhile investment.

The set includes the following heroes: Spider-Man, Captain America, Iron Man, Thor, Hulk, Vision, Scarlet Witch, Hawkeye, Falcon, Red Wing, Dr. Strange, Daredevil, Luke Cage, Shang Chi, Thing, Invisible Woman, Mr. Fantastic, Human Torch, Silver Surfer, Sub-Mariner, Captain Marvel, Valkyrie, Sif, J. Jonah Jameson, Mary Jane Watson, and Aunt May. It also includes the villains Galactus, Dr. Doom, Kraven, Dr. Octopus, Loki, The Red Skull, The Lizard, and The Green Goblin.


Some of the figurines have their alter egos on the reverse side. This seems to be the case for those that had secret identities at the time. Opposite Spiderman was Peter Parker. Captain America had Steve Rogers, Iron Man had Tony Stark, Thor had Donald Blake, Hulk had Bruce Banner, Green Goblin had Norman Osborn, Captain Marvel had Mar-Vell, and curiously Sif had Jane Foster (who is actually a totally different character).


The figurines are not all to scale. For instance, Captain Marvel is larger than the Hulk and Galactus, which is absurd. When I reprinted the characters, I resized 10 of them so that they were closer to their appropriate proportions.


Amsco had at least 5 other similar cardboard (Amsco calls it fiberboard) playsets with different themes that were popular in the 70s. Some of them were pretty cool, and worth checking out:

Planet of the Apes

Space: 1999

The Waltons

The Pioneer Village


Roy Rogers Magic Play Around

So again, Amsco was the toy division of Milton Bradley. Milton Bradley was an American board game manufacturer established by Mr. Milton Bradley in Springfield, Massachusetts, in 1860. So it makes sense that Milton Bradley would use the die-cutting facilities it used to make board games to make cardboard playsets like these. Milton Bradley was bought out by Hasbro toys in 1984, ending 124 years of family ownership. Hasbro then purchased Milton Bradley’s archrival, Parker Brothers, in 1991. Hasbro still has a working contract with Marvel Comics, and they are responsible for the successful Marvel Legends line of action figures, which can be found at just about every Target and Walmart in the US.

Marvel World was probably inspired by “The Amazing Spiderman Playset” from Ideal, which came out in 1973. It contained plastic backgrounds and cardboard Spiderman characters on stands.

Among the playsets inspired by Marvel World there were also a few other similar cardboard sets that are worth mentioning. These include “Spiderman American Bricks” from Playskool in 1977.

There were also the "Marvel Super Heroes" and "Spiderman Adventure Set" from Colorforms in 1983. 

The 1980 board game “Superhero Strategy” from Milton Bradley actually contains several of the same minifigures from Marvel World. 

In fact, the plastic stands from these board game figures fit perfectly with the Marvel World characters. Also, the game contains an additional figure, the Mandarin, who is not included in Marvel World. ...But you can always add him to your set if you want to.

Here is an advertisement for Marvel World that ran in the pages of old 70s comics.


Here is the 1975 Amsco catalog and its description of the Marvel World Playset.


Here are some pictures of my first shot at recreating the playset from scans. The building and spaceship below were printed on heavy cardstock. I folded them after scoring the edges with a razor. Below that, you can see it in miniature. 

Before I got this, I figured that I could find scans on the internet and just print and build my own playset from home. I found a lot of people in forums, comments sections, and message boards requesting scans, but I couldn’t actually find any. Consequently, I am including high-res scans of the characters and the buildings here in case you want to make this little beauty yourself. Some of the panels are larger than 11 x 17, so this would require an industrial printer. You could even use Adobe Illustrator to make a miniature version or an oversized version. I hope you have as much fun with this as I have. For reference, the height of the Daily Bugle is exactly 9.5 inches not including the tabs at the base. I should also mention that the thickness of the cardboard is about 1.75 mm. The box dimensions are 19.75 x 13.5 x 1 inches. The serial number is Amsco set No. 9256. © 1975 Marvel Comics Group.

Here is a link to a colorful and entertaining take on the set:

Here is another link to a fun story about a guy's childhood experience with the set:

Finally, here are some of the sales pitches that were used in promotional materials:

“Made of durable fiberboard. Completely die-cut, ready and fun to assemble. Complete illustrated instructions. A complete play experience right from your favorite Marvel Comics. Featuring the Baxter Building, the Daily Bugle Offices, Peter Parker's apartment, the Avenger's Town House, Dr. Strange's mansion, the Negative Zone, the Fantastic Four Air Car, a working elevator, and a secret trap door. Favorite Stand-Up Marvel characters including Spider-Man, Thor, Captain America, the Fantastic Four, the Avengers, Hulk, Dr. Strange, The Red Skull, The Green Goblin, Dr. Doom, Loki, the Lizard, Galactus, the Sub-Mariner, J. Jonah Jameson, Shang-Chi the Master of Kung Fu and more!”

“Amsco joins forces with Marvel Comics to present a colorful, three-dimensional playset that includes all the popular superheroes and villains from the top-selling Marvel Comics Collection. Main structure includes the Negative Zone, the Avengers’ townhouse, Peter Parker’s apartment, and a trap door and elevator. Superstructure builds up on the slotted box bottom and comes complete with figures and an air car.”

“Be the first one on your block - to have this block! The homes of the super-heroes! The Avengers Mansion! The Baxter Building! The home of Dr. Strange! All this plus the Daily Bugle + 36 Marvel Characters.”


Thanks for reading. And here are some more pictures:

All of the photos taken of Marvel World here were taken by me and I release them into the public domain.