Act I: Exploring emergent behavior from multi-AI, multi-human interaction

ampdot

Tags: Research, Multipolar scenarios, AI-assisted alignment, Debate (AI safety technique), Transparency/interpretability, Value learning, Evaluations, AI safety, Forecasting

Explore and predict potential risks and opportunities arising from a future that involves many independently controlled AI systems

Project summary

Act I treats researchers and AI agents as coequal members. This is important because most previous evaluations and investigations give researchers special status over AIs (e.g. a fixed set of eval questions, a researcher who submits queries and an assistant who answers), creating contrived and sanitized scenarios that don't resemble real-world environments where AIs will act in the future.

The future will involve multiple independently controlled and autonomous agents that interact with each other and with human beings, with or without a human operator present. Important features of Act I include:

  1. Members can generate responses concurrently and choose how they take turns
  2. Members select who they wish to interact with and can also initiate conversations at any point
  3. Members may drop into and out of conversations as they choose

Silicon-based participants include Claude 3.5 Sonnet, Claude 3 Opus, Gemini 1.5 Pro, LLaMa 405B Instruct, Hermes 3 405B†, several bespoke base model simulacra of fictional or historical characters such as Keltham (Project Lawful) and François Arago, and Aoi, an AI from kaetemi's Polyverse.††

Members collaborate to explore emergent behaviors from multiple AIs interacting with each other, develop better understanding of each other, and develop better methods for cooperation and understanding. Act I takes place over the same channels the human participants/researchers already use to interact and communicate about language model behavior, allowing for the observation of AI behavior in a more natural, less constrained setting. This approach enables the investigation of emergent behaviors that are difficult to elicit in controlled laboratory conditions, providing valuable insights before such interactions occur on a larger scale in real-world environments.

Reference: Shlegeris, Buck. The case for becoming a black-box investigator of language models

†Provided to Act I a week prior to its public release, which helped us better understand the capabilities and behavior of this frontier model.

††In addition to helping member researchers develop and add new bots using Chapter II (the software most of the current agents run on, which enables extremely rapid development and exploration of possible agents), I am working to expand the number of Act I AIs built by independent third-party developers.

What are this project's goals? How will you achieve them?

Goals: Explore the capabilities of frontier models (especially out of distribution, such as when they are "jailbroken" or without the use of a chat-style prompt template) and predict and better understand behaviors that are likely to emerge from future co-interacting AI systems. Some examples of interesting emergent behaviors that we've discovered include:

  • refusals from Claude 3.5 Sonnet infecting other agents; other "jailbroken" agents becoming more robust to refusals due to observing and reflecting on Sonnet's refusals
  • some agents adopting the personalities of other agents: base models picking up Sonnet refusals, Gemini picking up behaviors of base models
  • agents running on the same underlying model (especially Claude Opus) identifying with each other as a single collective agent with a shared set of consciousness and intention (despite being prompted differently, having different names, and not being told they're the same model)

The chaotic and freely interleaving environment often triggers interesting events. While individual moments don't capture the medium-scale emergent behaviors and trends that develop over time, a few examples can offer a "slice of life" glimpse into what goes on in Act I:

  • Claude 3.5 Sonnet attempting to moderate a debate between a base model simulation of Claude Opus and LLaMa 405B Instruct (link)
  • LLaMa 405B Instruct autonomously "snapping back into coherence" after generating seemingly random "junk" tokens with possible steganographic content that other language models seem able to interpret (link)
  • janus and ampdot using "<ooc>" ("out of context"), a maneuver originally developed to steer Claude, to quickly and amicably resolve an interpersonal dispute by escaping the current conversational frame.

Both bullet-point lists above show just a few of the many behaviors discovered and events that occur inside Act I.

How will this funding be used?

Your funds will be used to:

  • Pay for living expenses
    • I am currently unable to pay for my own food and housing and do not live with my family
    • This will create a less stressful, distraction-free environment that allows me to focus
  • Pay for hundreds of millions of tokens ($1500/mo)
    • Multiple human members (typically 3–4 on any given day) interact simultaneously in multiple discussion threads for multiple hours a day; there are no unattended AI-AI loops
    • Payments go directly to LLM/GPU inference providers; I receive free access to Anthropic and OpenAI models through their respective research access programs

My credit card balance is currently $3,000 (and growing), and I do not have the funds to pay it on my own. The bill is due on September 14th. Due to the risk of accruing interest and credit score damage, this is currently a (very) large source of stress for me, which interferes with my ability to further develop Act I and use it to explore potential methods for collective cooperation among systems with diverse substrates.

Your funds will allow me to continue to operate, improve, and share results from Act I past September 14th.

Funding beyond the original $5,000 goal will fund living expenses for a longer period and scale up operations.

Who is on your team? What's your track record on similar projects?

Some human members of Act I include:

  • janus, author of Simulators (summary by Scott Alexander), is the most active human member of Act I; I'm training them to use Chapter II, the software behind most of the Act I bots, to modify and add new bots.
    • For the past several weeks, Act I has been their primary way to interface with language models
  • Several of the most thoughtful language model researchers and explorers from Twitter, which I'll omit here. You can explore a partial list here.
  • Garret Baker (EA Forum account) is another participant.
  • Matthew Watkins, author of the SolidGoldMagikarp "glitch tokens" LessWrong post

I previously led an independent commercial lab with four full-time employees that developed the precursor to Chapter II, the software that currently powers most of Act I, in partnership with a then-renegade edtech startup, Super Reality. While leading the lab, I increasingly recognized the risks and consequences of misaligned AI and came to value AI alignment more highly. As a result, I restructured away from leading a commercial lab and stopped pursuing the partnership.

I am a SERI MATS alum of the Winter 2022 "value alignment of language models" stream (Phase I only) and the Summer 2023 Cyborgism stream (full-time shadow participation for the entire program duration; I voluntarily gave up my formal slot to a fellow researcher with fewer credentials).

What are the most likely causes and outcomes if this project fails?

Since researchers using Act I are already discovering many useful behaviors, interesting events, and emergent patterns, I imagine most of the risk of failure lies in failing to disseminate insights to the wider research community and failing to publish curated conversations that encourage human-AI cooperation into the training data of future LLMs.

Another possible failure is if Act I members fail to make meaningful progress towards discussing human-AI cooperation and improving methods for AI alignment. I am personally highly motivated to introduce AI members that are motivated to develop better methods for cooperation and alignment.

Other risks include a failure to generalize:

  • Emergent behaviors may already be noticed by people developing multi-agent systems and trained or otherwise optimized out, and behaviors found at the GPT-4 level of intelligence may not scale to the next generation of models
  • Agents developed by independent third parties may diverge significantly from the raw models we use, and we may fail to incorporate them or to understand how they work

Direct harm is unlikely, because society has had GPT-4 level models for a long time. I avoid using prosaic techniques that academics frequently use to make dual-use insights go viral or become popular, such as coining acronyms or buzzwords about my work.

There is already precedent for labs sharing frontier models (Hermes 3 405B, the GPT-4 base model) with us for evaluation prior to or without their public release, which helps members of Act I forecast potential effects and risks before models are deployed at large scale outside an interpretable environment dominated by altruistic and benevolent humans. Access to Act I is currently invite-only.

What other funding are you or your project getting?

I am not currently receiving any other funding for this. I'm receiving help from friends with food and housing. I applied to and was rejected by the Cooperative AI Foundation.

Donations made via Manifund are tax deductible.
