Jared Edward Reser, Ph.D.
jared@jaredreser.com
www.aithought.com
Abstract
The potential emergence of
superintelligence presents significant challenges in ensuring alignment with
human values and intentions. One critical concern is the inherent opacity of artificial
neural networks, which obscures their decision-making processes. This paper
proposes a novel approach to safety and transparency by requiring an AI system
to generate sensory images that reliably reveal its internal states. If imagery
generation were made a fundamental and unalterable aspect of its cognitive cycle,
as it is in the human brain, the resulting system would be unable to hide its plans
or intentions. Such an AI system would use the current contents of its
attention or working memory to prompt advanced generative models to continuously
produce visual (mental imagery) and language (internal monologue)
representations of its processing (inner thought). These representations could
be time-stamped, stored, and made accessible through an interactive interface,
enabling real-time monitoring and retrospective analysis. The feasibility of
this approach is supported by existing machine learning technologies, including
multimodal networks, large language models, and image generation models. By capturing
the prominent internal representations at each time step and organizing them
into composite images, this method could facilitate the detection and
correction of hostile motives and the reinforcement of desirable objectives,
enhancing trust and accountability. The technical implementation of this approach, as well as its potential benefits and challenges, is discussed. It is
concluded that the approach provides a practical and scalable solution for AI
alignment that can be divided into two forms, here termed generative
explainability and generative interpretability.
Keywords
artificial general intelligence, attention,
consciousness, focus of attention, AI safety, AI interpretability, AI
alignment, superintelligence, latent space, generative AI, working memory,
ethics in AI, machine learning, neural networks, generative models, autonomous
systems
1.0 Introduction
The rapid advancement of artificial
intelligence (AI) systems has intensified concerns about AI safety and
alignment, particularly as we approach the possibility of artificial general
intelligence (AGI) and artificial superintelligence (ASI). This paper proposes
addressing these concerns by implementing a form of multimodal transparency
within advanced AI systems. This method, here termed "generative interpretability,"
requires AI systems to generate and log internal imagery and
textual representations that depict their hidden states as an automatic part of
their cognitive cycle. This method was
introduced in previous writings (Reser, 2011, 2012, 2013, 2016, 2019, 2022),
but it is given a full treatment here.
This approach draws inspiration from
the human brain, where visual and other sensory areas of the cerebral cortex
continuously generate depictions of the contents of attention and working
memory. By leveraging state-of-the-art imagery generation and language models,
advanced AI systems could produce reliable visual and textual representations
of their "hidden" neural network states. These representations would
be logged, stored, and made available for viewing through a dedicated user
interface. This continuous visualization would allow humans and other AIs to
audit the system, permitting real-time monitoring and post-hoc analysis.
Fig. 1. Schematic of the present method for
visualizing and inspecting an AI’s hidden states and processes.
The ongoing visual narrative history
this process would create could be used as an auditing tool to ensure the AI
remains aligned with ethical norms and does not deviate toward unintended,
dangerous, or harmful behavior. The method would enable human operators to
meaningfully correct or suspend the system either in real-time or
retrospectively. It would also enable the provision of training feedback. Creating
a system that can be barred, interrupted, or steered away from unsafe internal
states could significantly reduce the risk of an AI system planning or
elaborating on potentially harmful actions without human knowledge.
To ensure that this technique cannot be
finessed or circumvented, the generation of imagery maps must be necessary for
a cognitive cycle. The system must be inherently obligated to create pictures
and text to initiate and inform the next stage of processing. Thus, to keep
thinking and reasoning, the system must build mental imagery each time its
attention is updated, just as in the brain.
The following sections will review
related work in AI transparency and interpretability, describe the
implementation of the proposed method at a high level, and discuss the user
interface for interacting with the generated data. They will also explore the
feasibility of this approach, identify potential challenges, and highlight the
benefits and broader implications for AI safety and alignment.
Literature Review
1.2 The Black Box Problem
Recent years have seen the development
of increasingly sophisticated systems that rival human intelligence in specific
domains. As we approach the potential development of AGI, ensuring these
systems align with human values and intentions becomes paramount. The alignment
problem is particularly challenging due to the inherent opacity of neural
networks, which often operate as "black boxes" with decision-making
processes that can be inscrutable to human observers.
The core of this problem lies in the
way neural network models encode knowledge. They utilize patterns distributed across
billions of weights, making it exceptionally difficult to interpret their prediction-making
processes. This complexity renders many traditional AI safety techniques and
procedures inapplicable to neural networks, despite these networks being the
most promising technology for advancing AI capabilities.
The lack of transparency in neural
networks exacerbates concerns about the potential risks of superintelligent AI.
There are fears that such systems could harbor motives misaligned with human
interests or take unforeseen actions detrimental to humanity. This opacity not
only complicates efforts to ensure AI safety but also hinders public trust and
acceptance of AI technologies in critical domains. While there are several limited
methods currently in use and promising theoretical approaches on the horizon for
addressing the black box problem, a comprehensive solution remains elusive.
1.3 Current Methods Aimed at Interpretability
Several approaches
have been proposed to address the transparency of AI systems. Explainable AI
(XAI) encompasses a collection of techniques focused on making machine learning
algorithms trustworthy and understandable (Xu et al., 2019). Presently, the
techniques use quantitative and statistical methods to trace the weights and
uncover the activations most responsible for the system’s output. Within XAI,
there are ongoing efforts to create inherently interpretable models by schematizing
attention mechanisms, attributing outputs to particular features, and using
simpler surrogate models to approximate complex ones. XAI techniques have helped users
debug, improve, and find biases in their models.
Relatedly,
researchers in the field of mechanistic interpretability have developed methods
to visualize the internal workings of neural networks, such as saliency maps,
which highlight important features in the input data (Bereska & Gavves,
2024). They also use activation maximization, which can generate interpretable
patterns by strongly activating specific neurons or layers, as well as feature
visualization techniques that reveal what parts of a network are detecting
specific features. Safety
engineering professionals have expressed concerns that modern techniques may
not prove sufficient to control more sophisticated learning systems (Hasan et
al., 2022). The limitations of current methods highlight the need for new
approaches that can provide more comprehensive insights into AI behavior.
These techniques have
made significant strides in advancing our understanding of how AI systems work.
However, they have no capacity to provide a continuous, real-time overview of
an AI system's internal states. The present solution offers a more comprehensive
form of explainability. Generated imagery offers a holistic and immersive view
of an AI’s internal behavior, providing deeper insights into how decisions are
made. Composite images offer clear advantages over existing methods such as
heat maps and saliency maps, which simply highlight which parts of the input
data were important for a model's decision. This is because pictures are intuitive,
comprehensible, and easier to interpret. The pictures could also force a
deceptive AI, whose outputs cannot be trusted, to explicitly report on its network
activation patterns (eliciting latent knowledge). This inspires two novel
conceptual approaches which are defined in Table 1 below.
Term | Definition
Generative Interpretability | The field of AI research focused on using generative techniques to create pictorial representations of the processes of an AI system, allowing those processes to be interpreted by humans.
Generative Explainability | The field of AI research that uses generative models to illustrate the workings of AI systems in ways that are relatable and understandable to the user and that allow the user to retain intellectual oversight.
Table 1. Definition of Key Terms
Two terms introduced here to describe
how generative AI can impact AI safety research.
How should imagery
generation be integrated with an AI system? The present approach derives
insights from mammalian neuroscience. The mammalian visual system and its ability
to generate sensory maps could provide helpful groundwork in designing AI compatible
with generative interpretability.
Proposed Method
2.0 The Mammalian Brain Generates Imagery Depicting the Focus of Attention
Sensory areas of the vertebrate brain continually create mappings
of the sensed world. For example, visual areas create neural representations of
the visual field that have a one-to-one correspondence with the photoreceptive
cells in the retina. In mammals, cortical sensory areas (such as the visual
cortex) also build maps of the focus of cognition. When this occurs, the
disparate contents held in working memory or attention (maintained in
association areas) are broadcast backwards to sensory areas where they are
integrated into a composite mapping.
In humans there are dozens of sensory areas, each building
its own instances of internal imagery simultaneously. Visual, auditory,
haptic, motor and many other maps are built to depict whatever our mind turns
to. These internal representations are referred to as topographic maps because
they retain a physical correspondence to the geometric arrangement of the
sensory organ (e.g., a retinotopic map corresponds to the retina). There are also
multimodal areas that build more abstract mappings. For instance, language
areas (i.e. Broca’s and Wernicke’s areas) build a verbal narrative that
reflects the contents of working memory and the progression of thoughts.
Fig. 2. Schematic of the
human imagery generation system.
The items or concepts currently within attention (C, D, E,
and F) are used as parameters to drive the generation of language in cortical
language areas and imagery in visual cortex.
If I were to ask you to think of a pink hippopotamus riding
a unicycle, you could visualize this in your mind’s eye by creating a mapping
of it in your early visual cortex. Research has shown that these previously private internal maps can be read by brain imaging techniques. Not surprisingly, artificial neural networks are used to decode and reconstruct the brain imaging data into pictures. Recent methods have proven effective at projecting activity patterns from the visual cortex onto a screen so that the content of a person’s thoughts or dreams can be coarsely displayed for others to view. This
is known as fMRI-to-image (Benchetrit et al., 2023).
Imagine that you are locked in a room with a stranger and
the only thing in the room is a sharp knife. Complete access to the mental imagery
the stranger forms in their brain, along with all their subvocal speech, would
give you near certainty about everything from their plans to their impulses. You
could use that data to know if you were safe or not and the other person would
not be able to hide their intentions. If humanity had a form of fMRI-to-image
for AI systems, then the AI would similarly be unable to hide its intentions.
In humans, fMRI-to-image is still in its early stages. This
is not so with computers. As of 2024, the digital neural network equivalent is
a well-developed technology. Today, consumers readily generate images from
simple prompts in a process known as “text-to-image,” “image synthesis,” or
“neural rendering.” Presently, using diffusion models (iterative denoising), the images generated have reached the quality of real photographs and human-drawn art. Most popular text-to-image models combine a language model, which
transforms the input text into a latent representation, with a generative image
model which has been trained on image and text data to produce an image
conditioned on that latent representation. Essentially, existing technology
makes a proof-of-concept design for the present technique feasible. Before discussing how this could be implemented, the next
section will discuss an additional reason why imagery generation would be a
beneficial addition to an AI’s cognitive cycle.
2.1 Generative Imagery Can Also Be Used by the Model to Improve Prediction
In previous articles (Reser, 2016, 2022b) I describe a
cognitive architecture for a superintelligent AI system implementing my model
of working memory (Reser, 2011). In this work, I explain that human thought is
propagated by a constant back and forth interaction between association areas
(prefrontal cortex, posterior parietal cortex) that hold the contents of
attention, and sensory areas (visual and auditory cortex) that build maps of those
attentional contents. These interactions are key to the progression of reasoning.
This is partly because each map introduces new informational content for the
next iterative cycle of working memory.
Fig. 3. The Iterative Cycle
of Imagery Generation and Attentional Updating
Sensory areas create topographic maps of the contents of
attention. Then, salient or informative aspects of these maps are used to update
the contents of attention. This creates an iterative cycle of reciprocal
interactions that support reasoning and world modeling.
For most people, visualizing something
in the mind’s eye provides further information about it, helping to introduce
new topics to the stream of thought. Emulating human cognitive processes, such
as mental imagery and internal monologues, may also help AI develop more robust
and relatable reasoning pathways and improve its ability to simulate reality
(Reser, 2016). For example, the AI system could send the images it generates to
an image recognition system (e.g., a pre-trained convolutional neural network) which
could be used to perform scene understanding and identify the concepts
incidental to the image. The textual description generated from this analysis could
then be fed back into the primary model’s attention to provide it with the
metric, compositional, and associative information inherent in images.
The incidental concepts that emerge
from generated imagery can stimulate new associations much like a person having
an “aha” moment when visualizing a problem. Thus, the visual synthesis is not
merely a reflection of the original items or tokens, but a source of new useful
insights. The generation of internal scenery and monologues might also be a
step toward machine consciousness or self-awareness. While this is a
speculative topic, an AI that can "see" and "hear" its own
thought processes could blur the lines between machine cognition and
human-like thinking and could be used to study the emergence of complex thought
patterns. This technique could be valuable for all these reasons, while also
allowing for generative interpretability.
2.2 Interpretable Imagery Generation for Contemporary Models
Implementing generative interpretability in contemporary state-of-the-art architectures, such as transformer-based models, involves creating a
system where the AI not only processes input data but also generates multimodal
representations of its internal cognitive processes. One approach is to have a
parallel generative network that runs alongside the primary model. This network
could be designed to take intermediate representations (e.g., hidden states or
attention weights) from the main model and transduce them into corresponding topographic
visualizations and textual descriptions. Alternatively, heavily activated
tokens within the model’s vocabulary or attentional window could be directly
linked (through network integration or fusion) to nodes in the separate
generative networks, coupling hidden states between these networks.
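To make the parallel-network idea concrete, the following is a minimal sketch in PyTorch: a forward hook captures a layer's hidden states, pools and projects them into a conditioning vector, and hands that vector to a separate image generator. The InterpretabilityTap class, the mean-pooling choice, and the render_image callable are illustrative assumptions rather than an existing implementation.

```python
# Minimal sketch (PyTorch): a parallel "interpretability decoder" that taps a
# transformer layer's hidden states via a forward hook and turns them into a
# conditioning vector for a separate image generator. The image generator
# itself is left as a placeholder (`render_image`): any embedding-conditioned
# diffusion model or GAN could fill that role.
import torch
import torch.nn as nn

class InterpretabilityTap:
    def __init__(self, layer: nn.Module, hidden_dim: int, cond_dim: int = 512):
        self.projection = nn.Linear(hidden_dim, cond_dim)  # hidden state -> conditioning vector
        self.latest_condition = None
        layer.register_forward_hook(self._capture)  # called automatically on every forward pass

    def _capture(self, module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output   # (batch, seq, hidden_dim)
        pooled = hidden.mean(dim=1)                                   # crude summary of the layer's state
        self.latest_condition = self.projection(pooled)               # stored for the generative decoder

    def visualize(self, render_image):
        """render_image: hypothetical callable mapping a conditioning vector to an image."""
        if self.latest_condition is not None:
            return render_image(self.latest_condition.detach())
        return None
```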
Researchers have developed various techniques to visualize
the internal representations of neural networks, particularly in computer
vision tasks. In fact, there are many technologies that make it possible to
take readings from a neural network and use them to formulate a picture or
video. Previously, these technologies included inverse networks, Hopfield networks, self-organizing (Kohonen) maps, and others. Today, a variational
autoencoder (VAE), diffusion model, or generative adversarial network (GAN)
could be the best systems for creating interpretable images. These would
translate the AI’s global processing steps into explicit pictures, generating a
continuous episodic chronology that depicts its impulses, decisions, and
rationales. If adequately visualized, the AI could become an open book and
anything it tries to plan would be as clear as day.
Other modalities could include an auditory
sense (for predicting accompanying sounds), somatosensory (for predicting
touch), as well as smell, taste, vestibular, proprioceptive, and other senses.
If a motor modality was included for predicting the next action, the AI would,
in effect, be simulating an embodied world which could also help to ground the
model. Multimodality could lend informative and mutually corroborating detail
to the AI’s generated narrative that humans could watch, read, and hear.
The human visual cortex samples from working memory and
creates imagery every few brain wave cycles. But this may be too frequent for a
computer. Forcing a present-day language model to generate an image for every
new token that enters its attentional window would be prohibitively
computationally expensive. However, there would be many other ways to do this
such as creating an image for every sentence or creating images of text
summarizations for paragraphs. Text-to-image models today can generate multiple
low-resolution images every second. Because most LLMs generate text on the
order of sentences or paragraphs per second, these technologies are very
compatible in terms of their production rate. This compatibility could make for seamless integration and
synchronization, where the AI generates text while simultaneously producing
visual imagery to match.
The images generated in the human visual cortex use parameters from working memory, which has a capacity-limited span of around 4 to 7 items.
This number may be optimal in many ways to specify a composition describing an
event or circumstance. On the other hand, there can be thousands of tokens
within the context window and attention span of modern large language models.
This may be too many to generate a single coherent image. Thus, imagery made
from latent representations would have to be made from a small subset of the most
active or highest weighted contents.
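As an illustration of this selection step, the sketch below ranks tokens by the attention they receive and keeps only a working-memory-sized subset to build the image prompt. The attention tensor shape, the averaging scheme, and the prompt template are assumptions made purely for the example.

```python
# Sketch of restricting imagery prompts to a working-memory-sized subset of the
# context: rank tokens by the attention they receive and keep only the top k.
# `attn` is assumed to be an attention matrix of shape (heads, seq_len, seq_len)
# taken from one layer of the model.
import torch

def top_k_attended_tokens(tokens: list[str], attn: torch.Tensor, k: int = 6) -> list[str]:
    # Average over heads and over query positions to score how strongly each
    # token is attended to across the window.
    scores = attn.mean(dim=0).mean(dim=0)          # shape: (seq_len,)
    top = torch.topk(scores, k=min(k, len(tokens))).indices.tolist()
    return [tokens[i] for i in sorted(top)]        # preserve original order for readability

def build_image_prompt(tokens: list[str], attn: torch.Tensor) -> str:
    salient = top_k_attended_tokens(tokens, attn)
    return "A scene depicting: " + ", ".join(salient)
```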
Present-day language models are not dangerous, autonomous
agents that need to be supervised. Also, their language output can already be
taken as a reliable signal of their internal states. That is why moving
forward, this article will focus on adapting this approach to more advanced
future systems, particularly my cognitive architecture for superintelligence
(Reser, 2022a). In this paradigm, the parameters for imagery and language
generation are not derived from the textual output of an LLM, but rather from
the hidden layers of a larger brain-inspired neural network.
2.3 Interpretable Imagery Generation for Future Models
In future cognitive computing systems,
generative imagery could be derived, not from tokens, but from items held in the
system’s working memory or global workspace. These items would be high-level
abstract representations. Reser’s Iterative Updating model of working memory
(2022a) explains how these items could be formed from artificial neurons and
how they could be manipulated as symbols over time through partial updating of
active neural states. As items in the system’s working memory are updated
iteratively, this process creates a gradual and overlapping transition between
mental states, allowing for mental continuity, context-sensitive predictions,
and incremental associative searches across memory.
The iterative updating model of working
memory posits that mental states are not completely replaced from one moment to
the next. Instead, new information is gradually integrated into the current
state, while portions of the previous state are retained (i.e. changing from
concepts A, B, C, & D to B, C, D, & E). This partial updating creates
subsequent states that overlap, which can explain how mental imagery from one
moment to the next would be similar and share pictorial characteristics. Thus,
we can expect the images coming from such a system to be interrelated, forming
a succession of incidents that could construct a story or plot.
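A toy illustration of this partial-updating scheme is given below; the items, salience scores, and replacement rule are hypothetical and serve only to show how consecutive states overlap (A, B, C, D becoming B, C, D, E).

```python
# Toy illustration of iterative updating: the working memory store is only
# partially replaced at each step, so consecutive states overlap.
from collections import deque

def iterative_update(store: deque, candidates: dict, replace_n: int = 1) -> deque:
    """Drop the `replace_n` least-recent items and admit the highest-salience candidates."""
    for _ in range(replace_n):
        store.popleft()
    for item, _ in sorted(candidates.items(), key=lambda kv: -kv[1])[:replace_n]:
        store.append(item)
    return store

working_memory = deque(["A", "B", "C", "D"], maxlen=4)
working_memory = iterative_update(working_memory, {"E": 0.9, "F": 0.4})
print(list(working_memory))   # ['B', 'C', 'D', 'E'], overlapping with the prior state
```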
The machine must be capable of
producing a visual representation of its internal processing units. These could
be symbolic or subsymbolic and include things like items of attention,
reasoning paths, or latent representations. These hidden processing states
could be found in the attention weights or matrices, subgraph circuits, neuron
or layer activation patterns, or embeddings representing high-level features. This data may have to be extracted, properly mapped, and then labeled
for supervised learning so that the generative model can be trained to produce
images from these high-dimensional internal representations.
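One way this supervised mapping step could look in code is sketched below: pairs of extracted hidden-state vectors and reference images train a decoder so that internal representations can later be rendered as pictures. The dataset, decoder architecture, and reconstruction loss are placeholders rather than a prescribed design.

```python
# Hedged sketch of supervised training for the generative decoder: each sample
# pairs an extracted hidden-state vector with a reference image. The `dataset`
# and `decoder` objects are placeholders, not an existing API.
import torch
from torch import nn
from torch.utils.data import DataLoader

def train_decoder(decoder: nn.Module, dataset, epochs: int = 10, lr: float = 1e-4) -> nn.Module:
    loader = DataLoader(dataset, batch_size=32, shuffle=True)
    optimizer = torch.optim.Adam(decoder.parameters(), lr=lr)
    loss_fn = nn.MSELoss()                         # pixel-level reconstruction loss as a stand-in
    for _ in range(epochs):
        for hidden_vec, target_image in loader:    # (batch, dim), (batch, C, H, W)
            pred = decoder(hidden_vec)             # render an image from the hidden state
            loss = loss_fn(pred, target_image)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return decoder
```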
An advanced AI implementation would
feature multiple interfacing artificial networks arranged in an architecture
similar to the mammalian neocortex with unimodal sensory modules (i.e. auditory
and visual) at the bottom of the hierarchy. Like the human brain it would
utilize both bottom-up and top-down processing pathways. For instance, the
top-down pathway would stretch from attention to imagery generation. For a
bottom-up pass, the images generated could be sent to an image recognition
module (image-to-text) so that any new information held in the image can be
included in the next attentional set (as discussed in Section 2.1). This
creates a feedback loop where the outputs from the generative interpretability
layer are fed back into the primary model.
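The sketch below outlines one such cycle under simplifying assumptions: the three model calls (a narrator, an image generator, and an image captioner) are hypothetical stand-ins, and candidate extraction is reduced to naive word matching purely for illustration.

```python
# Schematic of one top-down/bottom-up cycle. The three callables are
# hypothetical stand-ins for a language model, an image generator, and an
# image captioner/classifier respectively.
def cognitive_cycle(attention_items, text_from_attention, text_to_image, image_to_text):
    monologue = text_from_attention(attention_items)   # top-down: narrate the attentional set
    image = text_to_image(monologue)                   # top-down: render it as imagery
    caption = image_to_text(image)                     # bottom-up: describe what the image contains
    # Concepts found only in the caption become candidates for the next attentional set.
    candidates = [w for w in set(caption.split()) if w not in " ".join(attention_items)]
    return monologue, image, candidates
```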
As seen in Figure 4, each image should
also be delivered to an image-to-video system that takes the current image and
provides a series of output images to predict what will happen next. The
predictive frames generated by this system could additionally be sent for image
recognition. Thus, if attention holds concepts related to a knife attack, the
system will paint a picture of what this attack could look like, and then
animate that picture to generate an action sequence. Not only could this action
sequence inform attention, but its final frame could be used as the initial
frame of the next action sequence in the subsequent time step. Splicing
individual predictive video snippets in this way could create a feed of imagery
approximating an imagination.
Fig. 4. Schematic of the present method for
visualizing an AI’s attentional set and creating a synthetic imagination
At each time step the contents of attention are transcribed
into text, expanding them into a narrative meant to approximate an internal
monologue. This text is then used in three ways. 1) The text is summarized or
reflected upon, and the major outstanding conceptual components are used as
candidates for the next state of attention. 2) The text is used to prompt an
image which is then viewed by an image recognition classifier (image-to-text)
to describe new aspects of the image for possible inclusion in the next state
of attention. 3) The image is also animated into a video which makes
predictions about what would be expected in the next frames. These frames are
then sent to the image-to-text classifier to search for possible inclusions for
attention, as well as integrated with the imagery frames in the next cycle to
create a form of synthetic imagination.
Advanced models like GAN-based video generators or
transformer-based video prediction models could be employed to generate smooth
transitions between frames, effectively turning static images into animated
sequences, automating an imaginative faculty. The image-to-video AI system
would need to be trained on video sequences of naturally evolving physical,
biological, and behavioral phenomena to ensure that the model learns to predict plausible next frames and develops an apt intuition for technical and scientific work.
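A minimal sketch of this splicing scheme appears below, assuming a placeholder image_to_video model; each predicted clip's final frame seeds the next clip, yielding one continuous imagined feed.

```python
# Sketch of the splicing scheme: each cycle, an image-to-video model animates
# the current composite image, and the final predicted frame seeds the next
# cycle's animation. `image_to_video` is a placeholder for any video-prediction model.
def synthetic_imagination(initial_image, image_to_video, steps: int = 10, frames_per_step: int = 16):
    feed, seed = [], initial_image
    for _ in range(steps):
        clip = image_to_video(seed, num_frames=frames_per_step)  # list of predicted frames
        feed.extend(clip)
        seed = clip[-1]            # last frame becomes the first frame of the next snippet
    return feed                    # spliced sequence approximating an imagination
```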
As discussed, visual processing in the human brain is not
purely feedforward; there are recurrent and feedback loops that allow for
refining and re-evaluating visual information. Such recursive processes could
be crucial for resolving ambiguity and enhancing detail in an AI’s mental
imagery. Thus, the synthetic imagination schematized in Figure 4 could help support an advanced form of visual working
memory or something analogous to the human visuospatial sketchpad. Moreover, it
could be enriched by multiple senses, and benefit further from cross-modal,
multisensory integration.
3.0 Monitoring and Analysis
In an implementation of the proposed
model, visual and auditory streams may proceed at high speeds and produce
copious data. However, these processes can be recorded onto external memory
drives for later review. Handling these extensive records would likely require
the use of databases or cloud storage solutions optimized for large-scale
multimedia data. The recordings could be distributed among a team of observers,
each responsible for analyzing specific sections or aspects of the imagery. By
scrutinizing these outputs, the observers would assess whether the AI’s
intentions exhibit signs of malevolence or deviate from its programmed objectives.
Upon detecting a safety violation, the
system would either halt processing or shift into a safe mode, providing a
critical window for safety checks and countermeasures. Faults would be
systematically detected, isolated, diagnosed, and addressed, ensuring timely
corrective actions to prevent escalation.
If the AI's mental imagery is transmitted to a monitor,
human observers could directly view the content of its “inner eye." This
visual output could be complemented by a synchronized display of the AI’s
internal voice or "inner monologue," presented as a scrolling text
transcript. Both visual and textual data would be captured, time-stamped, and
stored in a structured format, allowing for chronological tracing.
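One possible record format for such a time-stamped log is sketched below; the field names and the append-only JSONL storage are illustrative choices, not a fixed schema.

```python
# Illustrative record format for the time-stamped log of each cognitive cycle.
import json, time, uuid
from dataclasses import dataclass, asdict, field

@dataclass
class CognitiveCycleRecord:
    monologue: str                     # inner-monologue text for this cycle
    image_uri: str                     # pointer to the stored composite image
    attention_items: list[str]         # items in the attentional set
    timestamp: float = field(default_factory=time.time)
    record_id: str = field(default_factory=lambda: str(uuid.uuid4()))

def append_record(path: str, record: CognitiveCycleRecord) -> None:
    with open(path, "a") as f:                 # append-only JSONL log for chronological tracing
        f.write(json.dumps(asdict(record)) + "\n")
```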
An interactive dashboard would
facilitate efficient navigation through this data. By providing a graphical
timeline interface, users could scrub through the AI’s processing history. Users
should also be able to pause and examine specific moments, zoom in, and use
playback speed control. Additional features could allow users to search for
specific events and filter imagery types, facilitating more focused analyses.
Key moments—such as important decisions, anomalies, or noteworthy events—could
be highlighted along the timeline for quick accessibility and review. Users
would also have the ability to annotate specific segments with comments,
enabling collaborative review and discussion.
In addition to human review, a separate
AI system could be employed to scan the feed for contentious elements,
automatically identifying and flagging potentially suspicious or hostile
activity. Such automated scanning could function as an early detection system,
alerting human supervisors to possible threats and allowing for prompt
intervention. Annotation could also be further streamlined by automation, with
machine learning tools generating summaries of lengthy monologues or complex
visual sequences. These summaries would provide high-level overviews, allowing
for rapid comprehension.
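A minimal sketch of such an automated monitor is shown below; the risk-scoring functions and threshold are placeholders for whatever moderation or safety-classification models a deployment actually uses.

```python
# Minimal sketch of an automated monitor that scans each logged record and
# flags cycles for human review. The scoring callables are placeholders; in
# practice they could be separate moderation models for text and imagery.
def scan_feed(records, text_risk_score, image_risk_score, threshold: float = 0.8):
    flagged = []
    for rec in records:
        risk = max(text_risk_score(rec.monologue), image_risk_score(rec.image_uri))
        if risk >= threshold:
            flagged.append((rec.record_id, rec.timestamp, risk))   # surface for prompt human review
    return flagged
```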
This interface could be made accessible
to human experts for technical oversight, or publicly available to individuals
on the web for crowd monitoring and annotation. We could expect that if this
technique provided a coherent and user-friendly explanation, in plain English
with pictorial representations, it would be accessible to experts and
non-experts alike. This expectation seems especially legitimate given the
truisms that visuals are universally understandable and that a picture is worth
a thousand words.
4.0 Benefits and Opportunities
This
generative approach could help in the development of "friendly AI" if
researchers use reinforcement learning to reward genial and peaceable imagery.
Instead of rewarding an AI's outward behavior, we could reinforce its internal
impulses to bring them in line with our own objectives. Techniques such as
reinforcement learning from human feedback (RLHF) (typically used for complex,
ill-defined, or difficult to specify tasks) could be applied to generative
outputs to steer learning and align the machine's utility function. An AI could
also be used to provide this feedback (RLAIF). Much as with a human child,
compassionate, empathic, prosocial, and benevolent cognitions could be rewarded
using corrections, ratings, or preferences to fine-tune the target model.
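Under the assumption that each logged cycle can be scored by a learned reward model, the reward shaping described here might look like the sketch below; every component named is a placeholder for whatever RLHF or RLAIF stack is actually in use.

```python
# Hedged sketch of rewarding internal imagery rather than outward behavior:
# ratings of logged cycles train a reward model whose score on the generated
# monologue/imagery is added to the usual task reward during policy updates.
def alignment_reward(record, reward_model, task_reward: float, weight: float = 0.5) -> float:
    # reward_model scores how benevolent/prosocial the internal representations appear (0-1).
    internal_score = reward_model(record.monologue, record.image_uri)
    return task_reward + weight * internal_score   # shaped reward used for fine-tuning
```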
Being
able to visualize how networks arrive at particular decisions could aid in
debugging and performance optimization, leading to more effective models and
understanding of AI behavior. Visual records can highlight where in the
processing chain errors occur, making it easier for engineers to localize and
address issues. This could help in healthcare, regulatory compliance, defense,
ethical governance, and other domains where decision transparency is important
or legally required.
Generative
interpretability could serve as a powerful tool for assessing an AI system's
adherence to moral norms and ethical guidelines. Testing the model and then
comparing visual records against expected processing could reveal whether the
Al's actions align with human values and goals. By examining the pertinent
section of an AI's image history, we could gain insights into how it weighs
different ethical considerations and whether it consistently applies moral
principles across various scenarios. Ensuring that the AI doesn't
systematically favor some ideological stances over others could help support
viewpoint pluralism.
Generative
interpretability has significant potential to promote equity and fairness in AI
systems. The approach could reveal subtle biases or discriminatory tendencies
that might otherwise remain hidden (Dwork et al., 2012; Kusner et al., 2017).
For instance, visual evidence could be analyzed to detect patterns that
indicate unfair treatment based on protected characteristics such as age,
gender, or race. This transparency would allow researchers and ethicists to
identify and mitigate algorithmic biases more effectively, ensuring that AI
decision-making aligns with principles of fairness, cultural sensitivity, and
non-discrimination.
Generative
interpretability could significantly enhance cybersecurity measures for AI
systems. Much like how current network security practices employ real-time automated monitoring of network traffic to intercept malicious attacks, this technique could offer a similar layer of protection. By providing a visual and auditory trail of an AI's cognitive processes, it could supply crucial evidence for forensic analysis as well as uncover adversarial attacks
attempting to poison or bias the system. It could also enable early detection
of external manipulation or the presence of malicious code. By integrating
generative interpretability into AI security protocols, we could create a more
transparent and secure ecosystem for advanced AI, significantly mitigating the
risks associated with their deployment in sensitive environments.
The method espoused here would create an illustrated log of all the AI’s processing history, which should yield valuable insights. By seeing how concepts and
patterns evolve through imagery, researchers can bypass the complexity of
analyzing raw neural activations, accessing a unique window into the machine’s
cognitive strategies and enabling researchers to test hypotheses about how it
simulates forms of human-like cognition. The AI itself could be given access to
this record for reference and recall, potentially increasing its capacity for memory,
self-supervised learning, and metacognition. Furthermore, not only could other
AIs be trained on this record and learn from aspects of the imagery, but it
could enable multiple AIs to develop a shared framework of imagery and
synthetic imagination. Importantly, this record would take up less memory than
storing all the underlying neuronal activations across multiple layers of the
neural networks. Thus, the imagery would amount to a compressed, high-level
summary of its processing.
The
unique perspective on AI “cognition” offered by generative interpretability
could provide valuable insights into the system's "beliefs" and
overall mental state. Just as human wellbeing can be assessed through thought
patterns and internal narratives, we could potentially evaluate an AI's
"mental health" by analyzing its generated imagery and inner
monologue. This would allow the assessment of any signs of maladaptive behavior
such as irrational decision-making, impulsive responses, obsessive fixation on
destructive goals or a tendency toward aggressive simulations. This process
could also help gauge the AI's emotional regulation in high-stakes
environments, ensuring it maintains ethical and balanced behavior over time. It
could also help identify cognitive distortions (overgeneralization,
catastrophizing, personalization, hopelessness or negative self-talk), logical
inconsistencies (delusional, disorganized, magical, grandiose, or paranoid
thinking), or other issues that might impact the AI's decision-making quality
or ethical behavior.
4.1 Challenges and Considerations
The proposed method of generative
interpretability is at least partially feasible with current technology.
However, creating meaningful and reliable representations of a state-of-the-art
AI's internal states would require extensive additional high-level research and
development. This is partly because adapting advanced AI architectures to
realize this multimodal approach would necessitate substantial modifications in
neural network design.
Generating hundreds or thousands of
images per minute would be very computationally demanding, potentially
impacting the primary system's efficiency and speed. As AI systems grow in
complexity, the volume of data generated would increase further, presenting
challenges in storage and data management. Several scalability and optimization
issues would need to be addressed to ensure the method remains viable for
advanced AI systems especially as they surpass human-level intelligence.
It is crucial that the AI system cannot
intentionally alter or manipulate its generated maps to deceive human monitors.
To prevent this, the connections between subsystems must be structurally
fundamental and unalterable. Additionally, all cognitive representations
coactive in the AI's working memory must be included in the composite depiction
in its visual and auditory maps. This ensures that the AI cannot generate
thoughts or intentions that are not transduced into these observable forms. By
enforcing this constraint, the sequence of maps would remain consistent with
the AI’s underlying goals, intentions, and motivations. Furthermore, this
approach could potentially mitigate risks like wireheading—where the AI
bypasses intended goals by finding a way to maximize its reward directly—or
incentive hacking, where goals are achieved through unintended or harmful
methods. The unequivocal and straightforward nature of this approach should help casual users and stakeholders trust the AI.
A crucial aspect of implementing
generative interpretability would be ensuring that the generated imagery is
comprehensible to human observers. This would require striking a delicate
balance between providing sufficient detail and maintaining overall clarity.
While these representations would undoubtedly be more accessible than raw
neural network data, their accurate interpretation might still pose significant
challenges and require specialized expertise. There's also an inherent risk of
over-interpreting the representations through a human cognitive lens,
potentially leading to misunderstandings or false assumptions about the AI's
internal processes. This anthropomorphizing of AI cognition could result in
overlooking critical aspects of the machine's decision-making.
Ensuring the fidelity and accuracy of
these generated representations - that they truly reflect the AI's internal
states without distortion or omission - would be a critical area of research. As
AI systems become more sophisticated, there's also a potential risk that they
might learn to generate misleading or innocuous representations while still
harboring hidden intentions. This possibility underscores the need for robust
verification mechanisms. Initially, it will be crucial to compare the generated
images to known ground truth data to validate the accuracy of the system’s
visualizations. Moreover, contrasting visual with auditory representations from
the same time periods could allow a supervisor to cross-check representations
of the AI's thought processes across modalities. Ensuring consistency will
increase the reliability of the method.
To address these challenges, it will be
important to subject these systems to verification, validation, and
certification measures that ensure adherence to human norms, laws, and
regulations. It will also be imperative to establish rigorous safety standards
and protocols to ensure well-founded AI with provable safety guarantees. Standardizing
these would likely require collaboration between AI researchers, cognitive
scientists, and domain experts to create a comprehensive framework for
accurately decoding and interpreting AI-generated representations.
5.0 Conclusions
Various researchers have offered compelling speculations
about why sufficiently intelligent AI might become unfriendly or potentially
dangerous to humans. Steve Omohundro has proposed that advanced AI systems will
exhibit basic drives leading to undesired behavior, including resource
acquisition, self-preservation, and continuous self-improvement. Similarly,
Alexander Wissner-Gross has suggested that AIs will be highly motivated to
maximize their future freedom of action, potentially at the expense of human
wants and needs. Eliezer Yudkowsky starkly summarized this concern: "The
AI does not hate you, nor does it love you, but you are made out of atoms which
it can use for something else." Additionally, Ryszard Michalski, a pioneer
of machine learning, emphasized that a machine mind is fundamentally unknowable
and therefore potentially dangerous to humans. If
the technology described above is properly implemented, the machine mind may
not be unknowable or even dangerous.
By
leveraging advanced capabilities in language modeling and image generation, an
AI system could be designed to continuously produce visual and auditory
representations of its attentional contents. This approach could facilitate
early detection of misalignment, potentially harmful intentions, or undesirable
planning. It could also significantly enhance the controllability and
dependability of AI systems by making their decision-making processes fully
auditable. It offers a novel solution to the AI alignment problem by,
metaphorically speaking, having the AI "play with its cards face up on the
table." Insisting on creating superintelligence that produces mental
imagery gives humans a form of telepathy or mind reading. This level of transparency
allows for direct observation, rather than relying solely on inferring its
goals and motivations from its actions.
The
following table lists and defines some of the important concepts in the AI
safety literature. These were previously considered separate concerns, possibly
necessitating their own individualized solutions; however, it is evident that
the present method could be used to address each one.
Term | Definition in the Context of AI Safety
Authenticity | The degree to which an AI system's outputs and behaviors genuinely reflect its internal processes and training, without deception or misrepresentation.
Explainability | The ability to provide clear, understandable explanations for an AI system's decisions, predictions, or behaviors in human-comprehensible terms.
Fairness | The quality of an AI system to make decisions or predictions without bias against particular groups or individuals based on protected characteristics.
Integrity | The consistency and reliability of an AI system's operations, ensuring that it performs as intended and maintains accuracy and completeness.
Interpretability | The degree to which humans can understand and trace the reasoning behind an AI system's outputs, often through analysis of its internal workings.
Observability | The capacity to monitor and measure an AI system's internal states, processes, and outputs in real-time or retrospectively.
Predictability | The extent to which an AI system's behaviors and outputs can be anticipated or forecasted, especially in novel or edge case scenarios.
Robustness | An AI system's ability to maintain reliable and safe performance across a wide range of inputs, environments, and potential adversarial attacks.
Transparency | The openness and clarity with which an AI system's functioning, limitations, and capabilities are communicated and made accessible to stakeholders.
Trustworthiness | The overall reliability, safety, and ethical soundness of an AI system, encompassing its technical performance and alignment with human values.
Table 2. Fundamental Terms and Definitions in AI Safety
A list of important concepts in AI
safety and a description of each. It is relatively straightforward to see how
the present technique could promote each of the concepts listed here.
Hopefully, this work will inspire
further research and innovation in AI safety, especially as we move toward
autonomous systems and superintelligence. Future work should focus on
early-stage prototypes, refining the technical aspects of the approach, addressing
ethical and privacy concerns, and fostering interdisciplinary collaboration to
address the complex challenges of AI alignment. We should start doing this now
to understand it, improve it, and build it into state-of-the-art systems.
By making AI's internal processes
explicit and understandable, we can mitigate the existential risks associated
with advanced AI and ensure that these systems act in ways that are beneficial
to humanity. This should increase public trust, reduce unnecessary oversight,
and provide for the safe and rapid deployment of new models and technologies.
References
Benchetrit, Y., Banville, H., & King, J.-R. (2023). Brain decoding: Toward real-time reconstruction of visual perception. arXiv:2310.19812.
Bereska, L., & Gavves, E. (2024). Mechanistic interpretability for AI safety: A review. arXiv:2404.14082v2.
Dwork, C., Hardt, M., Pitassi, T., Reingold, O., & Zemel, R. (2012). Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference (pp. 214–226).
Hasan, M. M., Islam, M. U., & Sadeq, M. J. (2022). Towards the technological adaptation of advanced farming through artificial intelligence, the internet of things, and robotics: A comprehensive overview. Artificial Intelligence and Smart Agriculture Technology, 21–42.
Kusner, M. J., Loftus, J., Russell, C., & Silva, R. (2017). Counterfactual fairness. Advances in Neural Information Processing Systems, 30.
Reser, J. E. (2011). What Determines Belief? The Philosophy, Psychology and Neuroscience of Belief Formation and Change. Verlag Dr. Müller. ISBN 978-3-639-35331-0.
Reser, J. E. (2012). Assessing the psychological correlates of belief strength: Contributing factors and role in behavior. UMI ProQuest, 3513834.
Reser, J. E. (2013). The neurological process responsible for mental continuity: Reciprocating transformations between a working memory updating function and an imagery generation system. Association for the Scientific Study of Consciousness Conference, San Diego, CA, July 12–15.
Reser, J. E. (2016). Incremental change in the set of coactive cortical assemblies enables mental continuity. Physiology and Behavior, 167(1), 222–237.
Reser, J. E. (2019, May 22). Solving the AI control problem: Transmit its thoughts to a TV. Observed Impulse. http://www.observedimpulse.com/2019/05/solving-ai-control-problem-transmit-its.html?m=1
Reser, J. E. (2022a). A cognitive architecture for machine consciousness and artificial superintelligence: Updating working memory iteratively. arXiv:2203.17255 [q-bio.NC].
Reser, J. E. (2022b). Artificial intelligence software structured to simulate human working memory, mental imagery, and mental continuity. arXiv:2204.05138.
Xu, F., Uszkoreit, H., Du, Y., Fan, W., Zhao, D., & Zhu, J. (2019). Explainable AI: A brief survey on history, research areas, approaches and challenges. In Natural Language Processing and Chinese Computing: 8th CCF International Conference, NLPCC 2019, Dunhuang, China, October 9–14, 2019, Proceedings, Part II (pp. 563–574). Springer.