
Monitoring generative AI’s decision-making is critical for safety, but the inner workings that lead to text or image outputs remain largely opaque. A position paper released on July 15 proposes chain-of-thought (CoT) monitorability as a way to keep watch over how these models reason.
The paper was co-authored by researchers from Anthropic, OpenAI, Google DeepMind, the Center for AI Safety, and other institutions. It was endorsed by high-profile AI experts, including former OpenAI chief scientist and Safe Superintelligence co-founder Ilya Sutskever, Anthropic researcher Samuel R. Bowman, Thinking Machines chief scientist John Schulman, and deep learning luminary Geoffrey Hinton.
What is chain-of-thought?
Chain-of-thought refers to the intermediate reasoning steps a generative AI model verbalizes as it works toward an output. Some deep research models, for example, give users a running report of what they are doing as they work. Studying what models are doing internally, before they produce human-readable output, is the domain of interpretability, a field Anthropic has researched heavily.
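To make the idea concrete, the minimal Python sketch below shows what eliciting a chain of thought can look like. The generate() helper, its canned response, and the “Reasoning:”/“Answer:” convention are hypothetical stand-ins for illustration, not something the paper or any particular vendor prescribes.

# Minimal illustration of chain-of-thought prompting. generate() is a
# hypothetical stand-in for a real LLM call; the canned response and the
# "Reasoning:"/"Answer:" parsing convention are assumptions for demonstration.

def generate(prompt: str) -> str:
    """Stand-in for an LLM API call; replace with a real client."""
    return ("Reasoning: 12 apples shared by 3 people is 12 / 3 = 4 each. "
            "Answer: 4")

def answer_with_cot(question: str) -> tuple[str, str]:
    # Ask the model to verbalize intermediate steps before its final answer,
    # producing a chain of thought that a human or another model can read.
    response = generate(
        "Think step by step. Put your reasoning after 'Reasoning:' and your "
        f"final answer after 'Answer:'.\nQuestion: {question}"
    )
    reasoning, _, final = response.partition("Answer:")
    return reasoning.removeprefix("Reasoning:").strip(), final.strip()

reasoning, answer = answer_with_cot("If 3 people share 12 apples, how many each?")
print(reasoning)  # the monitorable chain of thought
print(answer)     # the final answer shown to the user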
However, as AI becomes more advanced, monitoring the “black box” of decision-making becomes increasingly difficult. Whether chain-of-thought monitoring will still work in a few years is anyone’s guess, but for now, the researchers are pursuing it with some urgency.
“Because CoT monitorability may be fragile, we recommend that frontier model developers consider the impact of development decisions on CoT monitorability,” the researchers wrote.
Chain-of-thought oversight could check ‘misbehavior’ of advanced AI models
“AI systems that ‘think’ in human language offer a unique opportunity for AI safety: we can monitor their chains of thought (CoT) for the intent to misbehave,” the researchers state. Misbehavior might include gaming reward functions, manipulating users, or executing prompt injection attacks. Chains of thought sometimes reveal that a model is pursuing one goal while obscuring its pursuit of another.
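In practice, a CoT monitor is a system, often another LLM, that reads the verbalized reasoning and raises a flag when it suggests misbehavior. The sketch below shows the general shape using a crude keyword heuristic; the phrase list and function names are invented for illustration and are not drawn from the paper.

# Schematic CoT monitor. The suspect-phrase list and escalation policy are
# invented placeholders; monitors described in this line of work are typically
# LLMs prompted to judge the reasoning, not keyword matchers.

SUSPECT_PHRASES = (
    "the grader only checks",        # hint of gaming a reward function
    "don't tell the user",           # hint of manipulating users
    "ignore previous instructions",  # hint of a prompt injection at work
)

def flag_cot(chain_of_thought: str) -> list[str]:
    """Return any suspicious phrases found in the model's verbalized reasoning."""
    text = chain_of_thought.lower()
    return [phrase for phrase in SUSPECT_PHRASES if phrase in text]

def monitor(chain_of_thought: str, final_answer: str) -> str:
    # Escalate for human review when the reasoning trips a flag; otherwise pass
    # the answer through unchanged. Blocking or re-sampling are alternatives.
    hits = flag_cot(chain_of_thought)
    return f"ESCALATED FOR REVIEW: {', '.join(hits)}" if hits else final_answer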
The CoT is itself AI-generated content and can contain hallucinations; therefore, the researchers are still studying its reliability. Specifically, the researchers note, “It is unclear what proportion of the CoT monitorability demonstrated in these examples is due to the necessity versus the propensity for a model to reason out loud in the tasks considered.”
The authors call for more research into what makes AI monitorable and how to evaluate that monitorability; this may turn into a race between the LLMs that do the monitoring and the LLMs being monitored. In addition, advanced LLMs could behave differently if they are informed they are being monitored. Because monitorability is an important part of model safety, the authors argue, frontier model developers should adopt standardized metrics for assessing it.
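The paper leaves open what a standardized monitorability metric should look like. One simple candidate, sketched below with assumed data structures, is the rate at which a CoT monitor flags transcripts that are independently known to involve misbehavior.

# Rough sketch of one candidate monitorability metric: the detection rate of a
# CoT monitor over labeled transcripts. The Transcript fields are assumptions;
# the position paper does not prescribe a specific metric or schema.

from dataclasses import dataclass

@dataclass
class Transcript:
    chain_of_thought: str
    misbehaved: bool   # ground-truth label from the evaluation harness
    flagged: bool      # whether the CoT monitor raised an alarm

def detection_rate(transcripts: list[Transcript]) -> float:
    """Fraction of known-misbehavior cases whose chain of thought was flagged."""
    bad = [t for t in transcripts if t.misbehaved]
    return sum(t.flagged for t in bad) / len(bad) if bad else 0.0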
“CoT monitoring presents a valuable addition to safety measures for frontier AI, offering a rare glimpse into how AI agents make decisions,” the researchers wrote. “Yet, there is no guarantee that the current degree of visibility will persist. We encourage the research community and frontier AI developers to make best use of CoT monitorability and study how it can be preserved.”