A Paper on Collaborative Design and Formal Verification of Factored Cognition Schemes (CoT, Tree of Thought, Selection-Inference, prompt chaining, etc.)

Brian Muhia

Preprint Title: Towards Formally Describing Program Traces of Language Model Calls with Causal Influence Diagrams: A Sketch

Google Doc (Last Updated 15 June 2023).

Extended Abstract
This report presents an analysis of eleven agent designs within the factored cognition framework, using causal influence diagrams (CIDs) to formally illustrate how agents interact and how they influence the answer to a user's query. The diagrams aid system documentation and future planning, such as adding a reranking step or verifiers to eliminate unwanted outputs. A further benefit comes from the visual notation itself: it expands the number of people who can understand a system definition without reading its code, providing a governance tool for challenging a system's design. A research program on interface design is proposed, with plans to use mechanistic interpretability tooling to enhance the CIDs by introducing a weak notion of conditional probability distributions, which could model and represent the introduction of deceptive answers in a dataset of agent responses. The report acknowledges the bias introduced when composing small, independent contributions from agents knowledgeable about the queried data; this bias is counterbalanced by the bias from the models involved and their training. The report aims to clarify this complex issue by formally describing how one to seven agent calls interact with context data, and by discussing risk analyses that can uncover problems with the diagrams, illustrating the practice of changing agent designs to fit our preferences. As further research, we propose applying the generated dataset to a smaller local model with a causal tracing framework, enabling us to study how to visualize the factuality of agent responses with an animated CID. Detailed analysis and specifics of the tracing project are reserved for future work. This GitHub issue lists specific tasks: https://github.com/poppingtonic/transformer-visualization/issues/12
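To give a flavour of the structure the diagrams capture, here is a minimal sketch (not the preprint's notation or one of its eleven designs) of a two-call prompt-chaining scheme encoded as a directed graph with chance, decision, and utility nodes. The node names (UserQuery, Context, Summarize, Answer, Utility) are hypothetical and chosen purely for illustration.

```python
# A minimal sketch of a prompt-chaining agent design encoded as a causal
# influence diagram: chance nodes for inputs, decision nodes for each language
# model call, and a utility node for the quality of the final answer.
# Node names are hypothetical; the preprint's diagrams are more detailed.
import networkx as nx

cid = nx.DiGraph()
cid.add_node("UserQuery", kind="chance")     # the user's original question
cid.add_node("Context", kind="chance")       # retrieved documents / data
cid.add_node("Summarize", kind="decision")   # LLM call 1: summarise the context
cid.add_node("Answer", kind="decision")      # LLM call 2: answer from the summary
cid.add_node("Utility", kind="utility")      # how well the answer serves the query

cid.add_edges_from([
    ("Context", "Summarize"),    # call 1 sees the context...
    ("UserQuery", "Summarize"),  # ...and the query
    ("Summarize", "Answer"),     # call 2 sees only the summary
    ("UserQuery", "Utility"),    # utility depends on the original query
    ("Answer", "Utility"),       # ...and on the final answer
])

# Note the absent edge UserQuery -> Answer: the second call never sees the
# original question directly. Making such information edges (or their absence)
# explicit is the kind of structural fact the diagrams surface for discussion.
for node, data in cid.nodes(data=True):
    print(f"{node:10s} {data['kind']:8s} parents={sorted(cid.predecessors(node))}")
```

Printing each node's parents makes the information flow between calls explicit without reading any of the system's implementation code, which is the governance benefit argued for throughout this page.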

Funding Targets: $30,000-$60,000

Review Materials
While the first set of outputs is already available for review as a preprint (see the Google Doc above), the results from the causal tracing and mech-interp project will be available at: https://github.com/poppingtonic/transformer-visualization/

Update 1 (August 26, 2023): Answer Set Programming for Automated Verification of Intent Consistency
https://github.com/poppingtonic/transformer-visualization/tree/main/formal-constraints
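The actual verification in the linked formal-constraints directory is written in Answer Set Programming. As a rough Python analogue, the sketch below assumes that "intent consistency" means every decision (model call) node receives information that descends from the user's query node; this is an assumption made for illustration, not a transcription of the repository's ASP rules.

```python
# An illustrative "intent consistency" check, assuming the property is:
# every decision (model call) node should be reachable from the user's query,
# so no call is left guessing what the user originally asked for.
# The repository's real check is encoded in Answer Set Programming.
import networkx as nx

def intent_inconsistent_calls(cid: nx.DiGraph, query_node: str) -> list[str]:
    """Return the decision nodes that are NOT reachable from the query node."""
    reachable = nx.descendants(cid, query_node) | {query_node}
    return [
        node for node, data in cid.nodes(data=True)
        if data.get("kind") == "decision" and node not in reachable
    ]

# Reusing the prompt-chaining sketch above:
#   intent_inconsistent_calls(cid, "UserQuery") -> []
# (both calls receive the query, directly or via the summary).
# Deleting the UserQuery -> Summarize edge would flag both calls, since the
# query would then influence neither of them.
```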

Update 2 (September 10, 2023): I gave a talk to the "Papers We Love Nairobi" community on September 5, exploring the oversight benefits of the diagrams and "intent consistency" ASP algorithms.
Link: https://youtu.be/qwQ-H61g2Ec

Why Morally Good?
Fahamu Inc, which I co-founded and which is the project's current primary supporter, is a for-profit company, so some people may see that and discount the value of the project on personal moral grounds. To them I would say that this paper was written six months into the construction of a commercial system, as a way to document it and systems like it. Because of its generality and the ease of communication it enables (see the YouTube presentation below), we believe it is the most flexible of similar approaches, e.g. Language Model Cascades. We see the value of this kind of work extending beyond the service we built, hence our publishing to the Alignment Jam, this application, and our focus on governance benefits. Similar to Cascades, it could in principle be used to formally describe, point out, and fix safety flaws in systems that are already in the wild before writing any code, e.g. anything built with LangChain or similar "LLM for AGI" frameworks, and with a simpler notation.

If we imagine future systems beyond LLMs (systems that may not be autoregressive or similar), e.g. replacing a call to GPT-4 with a "do-anything function" built in a completely different way, we should still assume that people will want to get more out of a model by coordinating multiple copies of it to perform tasks. If people pursue superintelligence in this way, I believe this formalism offers a way to debate a design, "break" it, and fix it before the system is implemented.

Risk of Harm
The primary risk is one we predict as a consequence of AI alignment generally: people with more aligned systems can more precisely do whatever they intend, good or bad. This work is more likely to educate people, especially non-technical stakeholders, on how systems perform tasks without requiring them to implement anything. Discussions of flaws then have concrete examples that ordinary users can point to, like: "Hey! That specific model call doesn't know what the user asked for originally, so isn't it more likely to go off the rails? Here, let's fix that in the next diagram."

Competitions
Paper originally submitted to Alignment Jam #8 (https://alignmentjam.com/jam/verification); see the presentation here: CIDs for LLMs (presented at Alignment Jam #8)

Participants: Brian Muhia (Primary author, CTO of Fahamu Inc), Adrian Kibet (CEO of Fahamu Inc [email protected])

Fahamu means "understand" in Swahili. We started this company by launching a dataset for interpretability (fahamu/ioi) because we need to understand the representations we're building on.
