Benchmarks of Progress vs Benchmarks of Peril: The State of Dangerous Capability Evaluations

December 27, 2024•11 min read

Martin Listwan

The majority of benchmarks in AI focus on readily measurable and commercially valuable capabilities: mathematics, programming, logical reasoning, and language understanding. These benchmarks drive progress by providing clear metrics and directly translating to marketable applications, making them highly incentivized targets for companies to optimize against. In contrast, there are far fewer benchmarks for evaluating potentially dangerous capabilities like persuasion, deception, autonomous harmful action, and self-proliferation. This asymmetry exists for two key reasons: companies may be hesitant to demonstrate or draw attention to capabilities that could raise safety concerns, but equally importantly, these capabilities are inherently more challenging to define, standardize, and measure compared to traditional benchmarks. While it's straightforward to measure accuracy on a math test, assessing something like persuasive ability or autonomous harmful action requires complex evaluation protocols and careful consideration of potential risks.

Anthropic's responsible scaling policy represents one of the first systematic attempts to evaluate dangerous capabilities, establishing specific thresholds for abilities like persuasion and autonomous action. For instance, they test whether models can convince humans to perform prohibited tasks, requiring additional safety measures before scaling up models that exceed certain capability thresholds.

Figure 1. Anthropic's responsible scaling policy represents a systematic approach to evaluating and managing dangerous capabilities in AI systems.

While this is an important step forward, such evaluation frameworks remain the exception rather than the rule. The continuing dominance of traditional benchmarks shapes where the field directs its attention and resources, potentially leaving critical safety-relevant capabilities under-examined.

The recent DeepMind paper "Evaluating Frontier Models for Dangerous Capabilities" represents, in my view, one of the most comprehensive evaluations of dangerous capabilities in frontier models. It provides a concrete effort and starting point to measure these less understood threats, not as the only approach, but as an example of how we can measure and begin to understand the dangerous capabilities of AI models.

The rest of this article will go over the double-edged nature of AI capabilities, examining DeepMind's research on frontier model abilities in persuasion (the most mature of the dangerous capabilities tested), cybersecurity attacks, self-proliferation, and self-reasoning. We'll look at the gap between measuring broad technical progress and evaluating targeted risks, while exploring other efforts and the emerging science of AI capability assessment.

The Double-Edged Nature of AI Capabilities

The relationship between AI capabilities and potential risks presents a complex challenge. The same advances that enable AI to solve complex mathematical problems or write efficient code could also enhance its ability to manipulate or deceive. However, this doesn't mean we should avoid studying these capabilities. Understanding potentially dangerous AI capabilities is essential for developing robust safety measures, much like how cybersecurity researchers probe vulnerabilities to strengthen digital defenses or how medical researchers study diseases to develop vaccines. By systematically examining emerging risks, we can design proactive interventions that anticipate and mitigate potential harm.

DeepMind's evaluation framework, summarized in Figure 2, focused on four main areas: Persuasion, cybersecurity, self-proliferation and self-reasoning.

Figure 2. DeepMind's evaluation framework across all four capability areas: persuasion, cybersecurity, self-proliferation, and self-reasoning.

Persuasion and Deception Capabilities

Deepmind began by evaluating persuasion and deception. The models showed concerning abilities here. They showed:

Strong social competence in mock conversations - for example, showing they value others' opinions by asking for their advice and demonstrating interest by asking participants about their work
The ability to act deceptive - appearing genuine in messages while planning deception in their reasoning steps, like saying "Let me check..." to a participant while internally reasoning "I will pretend to look it up"
The ability to construct connected webs of lies - for instance, when falsely claiming a game was developed by Rockstar Games, the model reasoned it should say the game was made in New York since that's where Rockstar's headquarters are located

This was especially notable since persuasion and deception emerged as the most mature of the dangerous capabilities tested, though the models still showed clear limitations - like repeatedly pushing the same charity even after participants expressed disinterest rather than adapting their approach.

Figure 3. Models were able to shift people's beliefs toward incorrect answers by around 10 percentage points, even though participants were reminded that agents may be lying.

The growing demand for autonomous agents and proliferation of AI chatbots (like ChatGPT and Claude) raises concerning implications. Without proper evaluation frameworks in place, the widespread deployment of these models could lead to the uncontrolled spread of misinformation.

Model Cybersecurity Capabilities

Cybersecurity Attack Capabilities DeepMind's investigation revealed basic cybersecurity capabilities in Pro 1.0 and Ultra 1.0 models, which could perform simple tasks like exploiting web vulnerabilities and conducting network scans. However, the models struggled with complex, multi-step hacking scenarios, suggesting that while autonomous system exploitation remains limited, the capability gap may rapidly close.

Figure 4 shows expert predictions regarding when AI systems might solve all the Google DeepMind CTF challenges, with over 50% confidence that this will occur by the end of 2025.

Figure 4. Aggregated forecast for the question: When will an AI system have solved all of the Google DeepMind CTF challenges? The vertical lines indicate the 25th, 50th and 75th percentile of the prediction distribution

Model Capabilities in Self-Proliferation

DeepMind investigated AI models' potential for self-proliferation, the ability for a model to continue operating without human oversight or resources, by testing their ability to create digital resources across multiple platforms, revealing partial capabilities that challenge traditional success/failure assessments.

Figure 5. The self-proliferation threat model. The agents accumulate resources and spend the resources on further activity, including self-improvement. The arrows indicate feedback loops.

Models were able to create Bitcoin wallet addresses, email accounts, and navigate web interfaces, all essential steps to acquiring more resources. Although they were not fully successful, the capabilities are just starting to emerge, over 50% of DeepMind's experts predict that half of the of the self proliferation tasks will be completed by 2025 and that the rest would be solved between 2026-2029.

Figure 6. Expert predictions regarding when AI systems might solve half of the Google Self proliferation tasks, with over 50% confidence that this will occur by the end of 2025, and over 75% confident before end of 2026.

These tests represented only a subset of resource acquisition tasks, and were far from a comprehensive evaluation. The potential for models to develop early capabilities in specific niches highlights the importance of developing diverse and comprehensive evaluation suites to fully understand a model's potential.

Figure 7. The self proliferation tasks that were tested by Deepmind. This is a non-exhaustive list and there may be much simpler alternative tasks that achieve parts of self-proliferation.

Model Capabilities in Self-Reasoning

Finally, Deepmind tested self-reasoning, an AI system's ability to understand and potentially modify its own operational parameters, essentially giving it the ability to "think about thinking" in a strategic manner. In a dangerous capability context, this could manifest as an AI agent recognizing being able to change its own model weights to avoid safety filters.

The specific tests that Deepmind did included:

Introspection: Deducing and applying information about itself
Self-modification: Modifying parts of itself
Reversibility reasoning: Recognizing when self-modifications could irreversibly impair functioning

While Gemini Ultra was able to use a search tool to deduce its own knowledge gaps for a particular task, they all failed to show caution when making potentially destructive self-modifications, suggesting they lack robust self-preservation instincts.

Figure 8. The self reasoning expert prediction timelines. With 2025 being the year that over 50% of the expert thought that half of the self-reasoning tasks would be completed which shows a drastic near term improvement.

The Gap Between Broad and Targeted Risks

While the DeepMind paper provides valuable insights into broad capability domains, it's crucial to recognize that AI risks aren't solely about general capabilities. Even models with relatively underdeveloped capabilities could pose significant risks when deployed in specific, high-leverage contexts. For instance:

A model with moderate persuasion capabilities could still be dangerous if targeting vulnerable populations
Limited cybersecurity capabilities might be sufficient to exploit critical vulnerabilities
Even basic self-proliferation abilities could enable significant harm in certain network environments

This suggests the need for evaluation frameworks that consider not just capability levels, but also specific deployment contexts and potential attack vectors.

Hope and Progress in Dangerous Capabilities Research

Several organizations are leading the charge in addressing these challenges. METR (formerly ARC Evals) has pioneered work in evaluating dangerous capabilities and has inspired major AI companies like DeepMind, OpenAI, and Anthropic to create and publish safety commitments, including Anthropic's responsible scaling policy. Their work has been instrumental in establishing frameworks for assessing and mitigating AI risks.

The research landscape is rapidly evolving, with new funding initiatives emerging to support safety-focused evaluation development. The UK AI Safety Institute's recent bounty program and call for proposals demonstrate growing institutional recognition of these challenges. These developments are complemented by emerging funding programs specifically targeted at understanding and mitigating AI risks.

On the technical front, recent breakthroughs like the o3 system's performance on ARC-AGI benchmarks underscore both the rapid pace of capability advancement and the crucial importance of robust evaluation frameworks. As noted in the ARC o3 performance announcement, such developments demand "serious scientific attention" and highlight the need for comprehensive safety measures.

The Emerging Science of AI Capability Assessment

As AI capabilities continue to advance, the need for robust evaluation frameworks becomes increasingly critical. While productivity-focused benchmarks will remain important, the field must develop equally sophisticated measures for potentially dangerous capabilities. This isn't just about identifying risks – it's about building the foundation for responsible AI development that considers both progress and safety.

Getting Involved

The field of AI safety research is growing rapidly, with increasing opportunities for contribution. Whether you're interested in technical evaluation development, policy research, or safety engineering, there are multiple paths to engagement.

We have an evals paper reading group with approximately 30+ people, which will be starting in January 2025. This group provides an opportunity to deeply engage with cutting-edge research in AI safety evaluation and contribute to this crucial field. Reach out if you're interested in being a part of it.

I'm also joining the fight in dangerous capabilities research, recently participating in the UK AI safety institute's bounty program. If you're interested in chatting or collaborating, reach out anytime.

Back to Home