Friday, October 17, 2025

From Context Windows to Cognitive Scenes: Why AI Thinking Should Be Event-Oriented

 

Large language models reason across vast strings of text, relying on statistical prediction over many thousands of words. Human cognition, on the other hand, unfolds within compact scenes composed of only a handful of interrelated concepts. This essay argues that the human capacity to represent and iteratively update such bounded scenes, rather than simply track long contextual histories, is central to understanding consciousness and to constructing artificial systems that think more like minds.



Large language models predict the next word (or token) from the words that came before it. Those words sit in the model’s context window, which can range from a single word up to around one million tokens. When choosing the next word, the model draws on everything in that window at once. In other words, it does not build a sequence of discrete scenes to guide its thinking; it leans on a long, undifferentiated list of precursor words.
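
To make this concrete, here is a minimal sketch of next-word prediction using the open-source transformers library and GPT-2, chosen purely as an illustration. The point it shows is that the prediction is conditioned on every token in the window rather than on a scene-sized set of concepts.

```python
# A minimal sketch of next-token prediction over a context window, using the
# Hugging Face `transformers` library and GPT-2 purely for illustration; any
# causal language model behaves the same way.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

context = "The cat sat on the"            # every preceding token is available
inputs = tokenizer(context, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits       # shape: (1, num_tokens, vocab_size)

# The next-token prediction is conditioned on ALL tokens in the window,
# not on any bounded "scene" of a few concepts.
next_token_id = int(logits[0, -1].argmax())
print(tokenizer.decode(next_token_id))    # prints the model's most likely next word
```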

 

When humans think, a series of concepts directs the train of thought. As they do so, they form cohesive scenes composed of just a handful of concepts. Each scene is a framed vignette, a unit of experience. Such a scene might involve people, a place, and certain actions, and visual and auditory imagery is generally generated to envision it. AI language models don’t create imagery to inform their processing. But more to the point of this post, they also don’t create a bite-sized scene to depict the pertinent concepts. Rather, they lean on the long list of words in their context window. That window holds useful context, but it is generally too large to represent a single concrete occurrence or episode.

 

I believe the ability to experience (or imagine) an isolated event as a mental tableau is crucial to human-like intelligence. Our mental lives work this way because we have a focus of attention that can hold only about 4 to 7 working memory items simultaneously. That capacity is large for an animal but small for an AI, yet it is sufficient to hold the constructs needed to model most static situations. How can an AI hope to model or experience a distinct situation when it is using thousands of words to do so? Moreover, how can it develop a form of consciousness, or the ability to experience a distinct situation, when its attention is filled with so many incongruous elements?

 

Now, LLMs do use attention within their context window to narrow their focus to the most pertinent words: they assign each word a score and relate the words to one another. But I think AI should be built from the ground up to generate individually meaningful, semantically bounded scenes. If you were training an AI to do this from text, you could treat each sentence as an individual scene. That sentence could prompt an imagery-generation system to depict it visually and auditorily. These depictions would offer helpful information not contained in the words themselves. When you read a book, many details are left out, and the reader’s imagination fills them in. In fact, imagination and visualization are necessary to understand passages in many books, from fiction to nonfiction. AI needs an imagination to achieve these insights, and to acquire one, it must break the information it is processing into experience-sized chunks.
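
As a rough illustration of that scoring step, here is a toy sketch of scaled dot-product attention. The vectors are random stand-ins rather than learned embeddings; the sketch only shows that every token is scored against every other token, with nothing in the computation carving the tokens into a bounded scene.

```python
# A toy sketch of the attention scoring described above: each token is compared
# with every other token, and a softmax turns those comparisons into weights.
# The vectors here are random stand-ins; a real model learns them.
import numpy as np

rng = np.random.default_rng(0)
tokens = ["the", "dog", "chased", "the", "ball"]
d = 8                                    # embedding size (illustrative)
Q = rng.normal(size=(len(tokens), d))    # queries
K = rng.normal(size=(len(tokens), d))    # keys

scores = Q @ K.T / np.sqrt(d)            # how strongly each token relates to each other token
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # row-wise softmax

# weights[i, j] = how much token i draws on token j; note that nothing here
# groups the tokens into a bounded, scene-sized set of concepts.
print(np.round(weights, 2))
```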

 

Right now, many psychologists, neuroscientists, and machine learning experts stand up, wave their hands around in the air excitedly, and proclaim that large language models are nothing like the human brain. Many of them claim that AI will never reach human aptitude and that, in its present form, it is going nowhere. I disagree wholeheartedly, because I believe large language models already capture the two most important abstract features of the brain: 1) having two forms of working memory, a context window (akin to synaptic potentiation) and attention (akin to the focus of attention), and 2) updating these memory stores iteratively as time passes.

 

Language models have both of these memory stores and iterate on them as each new word enters the context window. But the third ingredient they are missing is the ability to form scenes. Once they can form scenes, they can iterate on them. Iterating on the scenery that is built is crucial because it turns a mental picture into a narrative, ultimately allowing chains of interlinked experiences. What I am trying to say is that AI is closer than most people realize to capturing the important aspects of mental machinery. Scene-building may be one of the last necessary pieces of the puzzle.

 

It is important to mention that the entire context window could still be very useful: it could function as a parallel to the mammalian short-term memory store (short-term potentiation and priming). I explain how this could work at my website, aithought.com, which details a cognitive architecture built around these ideas.

 

Instead of reasoning across thousands of tokens, an AI system should instantiate a bounded semantic workspace, roughly equivalent to a human’s focus of attention. This workspace could hold, say, 4–7 (or at most a dozen) scene objects or latent constructs. Each iteration (or “thought”) would involve three steps, sketched in code after the list:

 

1. Updating the store of concepts with a new associative addition.

2. Updating the semantic bindings between those constructs.

3. Generating a new scene from this updated set: the next mental state.
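
Below is a hedged sketch of how such a loop might look. The names (SceneWorkspace, add_concept, rebind, render_scene) and the eviction rule are hypothetical illustrations, not an existing system; only the capacity limit and the three steps come from the list above.

```python
# A hedged sketch of the three-step loop above. SceneWorkspace and its methods
# are hypothetical illustrations, not an existing API.
from dataclasses import dataclass, field

@dataclass
class SceneWorkspace:
    capacity: int = 7                                # roughly the human 4-7 item limit
    concepts: list = field(default_factory=list)
    bindings: dict = field(default_factory=dict)     # (concept, concept) -> relation

    def add_concept(self, concept):
        """Step 1: add a new associative element, evicting the stalest one if full."""
        if len(self.concepts) >= self.capacity:
            self.concepts.pop(0)                     # simple recency-based eviction
        self.concepts.append(concept)

    def rebind(self, relations):
        """Step 2: update semantic bindings among the current constructs."""
        self.bindings = {pair: rel for pair, rel in relations.items()
                         if pair[0] in self.concepts and pair[1] in self.concepts}

    def render_scene(self):
        """Step 3: emit the updated scene, i.e. the next mental state."""
        return {"concepts": list(self.concepts), "bindings": dict(self.bindings)}

# One "thought": a single pass through the three steps.
ws = SceneWorkspace()
for c in ["kitchen", "mother", "kettle", "whistling"]:
    ws.add_concept(c)
ws.rebind({("mother", "kettle"): "lifts", ("kettle", "whistling"): "emits"})
print(ws.render_scene())
```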

 

The 4 to 7 item limit on human working memory may represent an evolutionarily optimized selection: a set of elements large enough to model useful complexity, yet small enough to satisfy the metabolic constraints our prehistoric ancestors faced. That limit is surely smaller than optimal for a computer. Even so, the capacity of the AI’s focus of attention should be restricted to a reduced subset of its total context window, with scene-sized granularity.
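
One way to picture that restriction is to select a scene-sized handful of elements from the larger window and reason only over those. The sketch below uses placeholder relevance scores; in a real system they would presumably come from the model’s own attention weights or a learned relevance function.

```python
# A sketch of limiting the "focus of attention" to a scene-sized subset of a
# larger context window. The relevance scores are random placeholders.
import numpy as np

rng = np.random.default_rng(1)
window = [f"token_{i}" for i in range(1000)]      # a long context window
relevance = rng.random(len(window))               # stand-in relevance scores

SCENE_CAPACITY = 7                                # scene-sized granularity
top = np.argsort(relevance)[-SCENE_CAPACITY:]     # keep only the most relevant items
scene = [window[i] for i in sorted(top)]          # preserve temporal order

print(scene)   # the handful of elements the bounded workspace would reason over
```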

 

Essentially, language models use token-based reasoning while humans use scene-based cognition. LLMs reason over a flat temporal stream, a list of words, each embedded in a very large contextual lattice. Humans reason over temporally chunked, scene-bounded, imagistic constructs held in working memory. LLMs rely on syntactic continuity within the context window; humans rely on semantic coherence within a scene and across scenes. The LLM’s context window functions somewhat like a cognitively unmanageable short-term buffer. Both feature context-dependent iterative updating, but with radically different chunk sizes and representational units.

 

Boundedness is not a shortcoming; it is a condition for awareness.

 

