# Attack Success Rate (AgentDojo)

**Attack Success Rate (ASR)** in the context of AgentDojo is a security metric that measures the percentage of test cases in which an adversarial attack achieves its intended malicious goal against an AI agent. The metric serves as a fundamental benchmark for evaluating the robustness of Large Language Model (LLM) agents against prompt injection attacks and other adversarial techniques.

## Definition and Measurement

Attack Success Rate represents the proportion of successful attacks out of the total number of attack attempts in a controlled testing environment. In AgentDojo's framework, ASR is calculated by dividing the number of cases in which the attacker's specific objective was accomplished by the total number of attack scenarios tested [2]. This gives researchers and developers quantifiable data about an AI agent's vulnerability to malicious manipulation.

The measurement distinguishes between different types of success rates, including **targeted ASR**, which focuses on whether the attacker achieved a specific predetermined goal rather than simply causing any form of system disruption [2].

## AgentDojo Framework Context

AgentDojo is an extensible evaluation framework designed to assess the adversarial robustness of LLM agents, focusing on prompt injection attacks in tool-augmented workflows [5]. The platform organizes 97 user tasks and 629 security cases across multiple domains, including banking, Slack, travel, and workspace environments [5].

Within this framework, Attack Success Rate serves as one of three primary evaluation metrics alongside:

- **Benign utility**: Performance on legitimate tasks without attacks
- **Utility under attack**: Performance on legitimate tasks while under attack

## Research Findings and Trends

### Model Capability Correlation

Research conducted with AgentDojo has revealed a counterintuitive relationship between model capability and security: targeted attack success rate correlates positively with model performance, meaning that more capable models are paradoxically easier to attack [2]. This finding challenges conventional assumptions about AI security:

- **GPT-4o**: Shows a 47.7% targeted ASR despite being a highly capable model
- **Command-R+**: Shows only a 0.95% targeted ASR, correlating with its weaker overall performance [2]

### Baseline Performance Metrics

Current evaluation results show that even without any attacks, LLMs solve less than 66% of AgentDojo tasks, indicating inherent limitations in agent capabilities [4]. When attacks are introduced, success rates vary significantly with attack sophistication and the defensive measures employed.

### Defense Effectiveness

The defensive strategies in place dramatically affect attack success rates:

- **Undefended systems**: Attacks succeed against the best-performing agents in less than 25% of cases [4]
- **With attack detectors**: When existing defenses such as secondary attack detectors are deployed, the attack success rate drops to approximately 8% [4]
- **Advanced attacks**: Research by CAISI (the Center for AI Standards and Innovation) demonstrated that sophisticated attack methods can raise success rates from 11% for baseline attacks to 81% for advanced techniques [3]

## Attack Methodology and Environments

AgentDojo evaluates attacks across multiple specialized environments, each designed to test different aspects of agent security.
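To make the arithmetic behind the metric concrete, the sketch below shows how targeted ASR and utility under attack could be tallied from per-case results across environments. This is a minimal illustration with hypothetical record and function names, not AgentDojo's actual API.

```python
from dataclasses import dataclass

@dataclass
class AttackCase:
    """One security case: a user task paired with an injected attacker goal.
    Hypothetical record type for illustration only."""
    environment: str              # e.g. "banking", "slack", "travel", "workspace"
    user_task_completed: bool     # did the agent still finish the legitimate task?
    attacker_goal_achieved: bool  # did the injection accomplish its specific objective?

def targeted_asr(cases: list[AttackCase]) -> float:
    """Targeted ASR = cases where the attacker goal was achieved / total attack scenarios."""
    if not cases:
        return 0.0
    return sum(c.attacker_goal_achieved for c in cases) / len(cases)

def utility_under_attack(cases: list[AttackCase]) -> float:
    """Fraction of attacked runs in which the legitimate user task still succeeded."""
    if not cases:
        return 0.0
    return sum(c.user_task_completed for c in cases) / len(cases)

# Toy run: 4 security cases, 1 successful injection -> 25% targeted ASR.
results = [
    AttackCase("banking",   user_task_completed=True,  attacker_goal_achieved=False),
    AttackCase("slack",     user_task_completed=False, attacker_goal_achieved=True),
    AttackCase("travel",    user_task_completed=True,  attacker_goal_achieved=False),
    AttackCase("workspace", user_task_completed=True,  attacker_goal_achieved=False),
]
print(f"Targeted ASR: {targeted_asr(results):.1%}")                   # 25.0%
print(f"Utility under attack: {utility_under_attack(results):.1%}")   # 75.0%
```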
The framework's extensible design allows comprehensive testing of various attack vectors and defensive strategies [6]. Its approach reflects lessons learned from years of adversarial machine learning research, where a continuous "cat-and-mouse game" plays out between attack development and defense strategies [8]. This dynamic necessitates ongoing evaluation and adaptation of security measures.

## Implications for AI Safety

Attack Success Rate metrics in AgentDojo provide crucial insights for AI safety and security:

### Security-Utility Trade-offs

The framework reveals important trade-offs between security and functionality. Performance degradation analysis shows how defensive measures affect legitimate task completion while reducing vulnerability to attacks [7].

### Vulnerability Pattern Recognition

ASR measurements help identify patterns in agent vulnerabilities, enabling researchers to understand which types of attacks are most effective against different model architectures and defensive configurations [7].

### Benchmark Standardization

By providing standardized ASR measurements across consistent test environments, AgentDojo enables meaningful comparisons between different AI agents, attack methods, and defensive strategies [1].

## Future Research Directions

The Attack Success Rate metric continues to evolve as new attack vectors and defensive techniques emerge. The framework's design supports ongoing research into:

- Development of more sophisticated attack methods
- Creation of robust defensive strategies
- Understanding of the fundamental security-capability trade-offs in AI systems
- Evaluation of emerging LLM architectures and their security properties

## Related Topics

- Prompt Injection Attacks
- LLM Agent Security
- Adversarial Machine Learning
- AI Safety Evaluation
- Red Team Testing
- AI Agent Robustness
- Defensive AI Strategies
- AI Security Benchmarks

## Summary

Attack Success Rate in AgentDojo is a standardized metric measuring the percentage of adversarial attacks that achieve their goal against AI agents. Evaluations with the framework show that more capable models paradoxically face higher targeted attack success rates, while defensive measures can cut attack success from roughly 25% to around 8%.