Large language models reason across vast strings of text, relying on statistical prediction over many thousands of words. Human cognition, on the other hand, unfolds within compact scenes composed of only a handful of interrelated concepts. This essay argues that the human capacity to represent and iteratively update such bounded scenes, rather than simply track long contextual histories, is central to understanding consciousness and to constructing artificial systems that think more like minds.
Large language
models predict the next word (or token) by using the words that came before it. Those
words sit in the model's context window, which can hold anywhere from a single word up
to around a million tokens, and every word in that window helps the model choose the
next one. So they don't build sequences of singular scenes to help them think; instead,
they draw on a long, unsegmented list of precursor words.
When humans
think, they use a series of concepts to direct the train of thought. As
they do so, they form cohesive scenes composed of just a handful of concepts. Each
scene is a framed vignette, a unit of experience. Such a scene might involve
people, a place, and certain actions, and visual and auditory imagery are generally
generated to envision it. AI language models don't create
imagery to inform their processing. But more to the point of this post,
they also don't create a bite-sized scene to depict the pertinent concepts.
Instead, they lean on the long list of words in their context window. That
window holds useful context, but it is generally too big to represent a single concrete
occurrence or episode.
I believe the ability to experience (or imagine) an isolated
event as a mental tableau is crucial to human-like intelligence. Our mental
lives work this way because we have a focus of attention which can only hold
about 4 to 7 working memory items simultaneously. That size is large for an
animal but small for an AI. However, it is sufficient for holding the relevant
constructs necessary to model most static situations. How can an AI hope to
model or experience a distinct situation when it is using thousands of words to
do so? Moreover, how can it develop a form of consciousness, or
the ability to experience a distinct situation, when its attention is filled
with so many incongruous elements?
Now, LLMs do
use attention within their context window to narrow their focus to the
most pertinent words: each word is assigned a score and related to every
other word. But I think AI should be built from the ground up to generate
individually meaningful, semantically bounded scenes. If you were training an
AI to do this from text, you could treat each sentence as an individual scene.
That sentence could then prompt an imagery-generation system to depict it
visually and auditorily. These depictions would offer helpful information not
contained in the words themselves. When you read a book, many
details are left out, and the reader's imagination fills them in; in fact,
imagination and visualization are necessary to understand many passages in
books from fiction to nonfiction. AI needs an imagination to achieve these
insights, and to acquire one, it must break the information it is processing down
into experience-sized chunks.
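To make this concrete, here is a rough sketch of sentence-level scene chunking. The split_into_scenes and depict_scene functions are illustrative stand-ins, not a real imagery system; an actual implementation would route each sentence to a text-to-image or text-to-audio model:

```python
import re

def split_into_scenes(text):
    # Naive sentence segmentation: treat each sentence as one candidate scene.
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    return [s for s in sentences if s]

def depict_scene(sentence):
    # Hypothetical stand-in for an imagery-generation step.
    # A real system would call a text-to-image / text-to-audio model here.
    return {"prompt": sentence, "visual": None, "auditory": None}

passage = ("Maria walked into the kitchen. The kettle was already whistling. "
           "She poured two cups and sat by the window.")

for i, sentence in enumerate(split_into_scenes(passage), start=1):
    scene = depict_scene(sentence)
    print(f"Scene {i}: {scene['prompt']}")
```

Each depicted scene could then supply details the text leaves implicit, the way a reader's imagination does.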
Right now, many
psychologists, neuroscientists, and machine learning experts stand up, wave
their hands around excitedly, and proclaim that large language
models are nothing like the human brain. Many of them claim that AI will never
reach human aptitude and that, in its present form, it is going nowhere. I disagree wholeheartedly,
because I believe large language models already capture the two most important
abstract features of the brain: 1) having two forms of working
memory, a context window (akin to synaptic potentiation) and attention (akin to
the focus of attention), and 2) having these memory stores updated iteratively
as time passes.
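As a toy illustration of those two features (and only an illustration: the relevance score below is a made-up stand-in for learned attention weights), the loop keeps one growing context window and re-selects a small focus of attention each time a new word arrives:

```python
def relevance(token, query):
    # Toy relevance score: letter overlap with the newest word.
    # A real model would use learned attention weights instead.
    return len(set(token) & set(query))

context_window = []   # first store: everything seen so far (akin to synaptic potentiation)
ATTENTION_SPAN = 4    # second store: a small focus of attention

for token in "the cat sat on the warm kitchen mat".split():
    context_window.append(token)                      # updated iteratively, word by word
    focus = sorted(context_window,
                   key=lambda t: relevance(t, token),
                   reverse=True)[:ATTENTION_SPAN]     # re-select the most pertinent words
    print(f"new word: {token:<8} focus: {focus}")
```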
Language
models have both of these memory stores and iterate on them as each new word enters
the context window. The third ingredient they are missing is the ability to
form scenes. Once they can form scenes, they can iterate on them, and iterating on
the scenery that is built is crucial because it turns a mental picture into a
narrative, ultimately allowing the creation of chains of
interlinked experiences. My point is that AI is closer than most
people realize to capturing the important aspects of mental machinery; scenery may
be one of the last necessary pieces of the puzzle.
It is
important to mention that the entire context window could still be very helpful;
it could function as a parallel to the mammalian short-term memory store (short-term
potentiation and priming). I explain how this could work at my website,
aithought.com, which details a cognitive architecture built around these
ideas.
Instead of
reasoning across thousands of tokens, an AI system should instantiate a bounded semantic workspace, roughly equivalent to a human's focus of
attention. This workspace could hold, say, 4 to 7 (or fewer than a dozen) scene objects or latent constructs. Each iteration (or "thought")
would involve three steps, sketched in code below:
1. Updating the store of constructs with a new associative addition
2. Updating the semantic bindings between those constructs
3. Generating a new scene from this updated set: the next mental state
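Here is a minimal sketch of such a workspace, with an illustrative BoundedWorkspace class, a hard capacity, and "bindings" reduced to labeled pairs (a real model would learn these rather than enumerate them):

```python
from collections import deque
from itertools import combinations

class BoundedWorkspace:
    """Toy, scene-sized focus of attention (all names here are illustrative)."""

    def __init__(self, capacity=7):
        self.constructs = deque(maxlen=capacity)  # step 1's store; oldest item falls out when full
        self.bindings = {}                        # step 2's pairwise links between constructs

    def add_construct(self, concept):
        # Step 1: admit one new associative addition, displacing the oldest if at capacity.
        self.constructs.append(concept)

    def update_bindings(self):
        # Step 2: refresh semantic links among whatever is currently in focus.
        self.bindings = {pair: "related" for pair in combinations(self.constructs, 2)}

    def compose_scene(self):
        # Step 3: emit the next mental state as a single bounded scene.
        return {"elements": list(self.constructs), "links": dict(self.bindings)}

ws = BoundedWorkspace(capacity=5)
for concept in ["kitchen", "Maria", "kettle", "whistling", "morning", "window"]:
    ws.add_construct(concept)   # "kitchen" is displaced once capacity is exceeded
    ws.update_bindings()

print(ws.compose_scene()["elements"])   # ['Maria', 'kettle', 'whistling', 'morning', 'window']
```

Because the deque has a fixed maxlen, admitting a new construct silently displaces the oldest one, which is the workspace analogue of an item falling out of the focus of attention.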
The 4-to-7-item
limit on human working memory may represent an evolutionarily optimized
selection: enough elements to model useful complexity, but capped by the
metabolic constraints of our prehistoric ancestors. A computer is not bound by
those constraints, so 4 to 7 items is certainly smaller than its optimum. Even
so, the capacity of an AI's focus of attention should be limited to a reduced,
scene-sized subset of its total context window.
Essentially, language
models use token-based reasoning and humans use scene-based cognition. LLMs reason
over a flat temporal stream, a list of words, each embedded in a very
large contextual lattice; humans reason over a temporally chunked,
scene-bounded, imagistic construct held in working memory. LLMs rely on
syntactic continuity within the context window; humans rely on semantic
coherence within a scene and between scenes. The LLM's context
window functions somewhat like a cognitively unmanageable short-term
buffer.
Both feature context-dependent iterative updating, but with radically different
chunk sizes and representational units.
Boundedness is
not a shortcoming; it is a condition for awareness.