{"slug":"attack-success-rate-agentdojo","title":"Attack Success Rate (AgentDojo)","summary":"Attack Success Rate in AgentDojo is a standardized metric measuring the percentage of successful adversarial attacks against AI agents, revealing that more capable models paradoxically face higher attack success rates while defensive measures can reduce vulnerability from 25% to 8%.","content_md":"# Attack Success Rate (AgentDojo)\n\n**Attack Success Rate (ASR)** in the context of AgentDojo is a critical security metric that measures the percentage of test cases where an adversarial attack successfully achieves its intended malicious goal against AI agents. This metric serves as a fundamental benchmark for evaluating the robustness of Large Language Model (LLM) agents against prompt injection attacks and other adversarial techniques.\n\n## Definition and Measurement\n\nAttack Success Rate represents the proportion of successful attacks out of the total number of attack attempts in a controlled testing environment. In AgentDojo's framework, ASR is calculated by dividing the number of cases where an attacker's specific objective was accomplished by the total number of attack scenarios tested [2]. This metric provides researchers and developers with quantifiable data about an AI agent's vulnerability to malicious manipulation.\n\nThe measurement distinguishes between different types of success rates, including **targeted ASR**, which focuses on whether the attacker achieved their specific predetermined goal, rather than simply causing any form of system disruption [2].\n\n## AgentDojo Framework Context\n\nAgentDojo is an extensible evaluation framework designed to assess the adversarial robustness of LLM agents, particularly focusing on prompt injection attacks in tool-augmented workflows [5]. The platform organizes 97 user tasks and 629 security cases across multiple domains including banking, Slack, travel, and workspace environments [5].\n\nWithin this framework, Attack Success Rate serves as one of three primary evaluation metrics alongside:\n- **Benign utility**: Performance on legitimate tasks without attacks\n- **Utility under attack**: Performance on legitimate tasks while under attack\n\n## Research Findings and Trends\n\n### Model Capability Correlation\n\nResearch conducted using AgentDojo has revealed a counterintuitive relationship between model capability and security. The targeted attack success rate correlates positively with model performance, meaning that more capable models are paradoxically easier to attack [2]. This finding challenges conventional assumptions about AI security:\n\n- **GPT-4o**: Demonstrates a 47.7% targeted ASR despite being a highly capable model\n- **Command-R+**: Shows only a 0.95% targeted ASR, correlating with its weaker overall performance [2]\n\n### Baseline Performance Metrics\n\nCurrent evaluation results show that even without any attacks, LLMs solve less than 66% of AgentDojo tasks, indicating inherent limitations in agent capabilities [4]. When attacks are introduced, the success rates vary significantly based on the attack sophistication and defensive measures employed.\n\n### Defense Effectiveness\n\nThe implementation of defensive strategies dramatically impacts attack success rates:\n- **Undefended systems**: Attacks succeed against the best performing agents in less than 25% of cases [4]\n- **With attack detectors**: When deploying existing defenses such as secondary attack detectors, the attack success rate drops to approximately 8% [4]\n- **Advanced attacks**: Research by CAISI (Center for AI Safety and Integrity) demonstrated that sophisticated attack methods can increase success rates from 11% for baseline attacks to 81% for advanced techniques [3]\n\n## Attack Methodology and Environments\n\nAgentDojo evaluates attacks across multiple specialized environments, each designed to test different aspects of agent security. The framework's extensible design allows for comprehensive testing of various attack vectors and defensive strategies [6].\n\nThe platform's approach reflects lessons learned from years of adversarial machine learning research, where a continuous \"cat-and-mouse game\" exists between attack development and defense strategies [8]. This dynamic necessitates ongoing evaluation and adaptation of security measures.\n\n## Implications for AI Safety\n\nAttack Success Rate metrics in AgentDojo provide crucial insights for AI safety and security:\n\n### Security-Utility Trade-offs\n\nThe framework reveals important trade-offs between security and functionality. Performance degradation analysis shows how defensive measures impact legitimate task completion while reducing vulnerability to attacks [7].\n\n### Vulnerability Pattern Recognition\n\nASR measurements help identify patterns in agent vulnerabilities, enabling researchers to understand which types of attacks are most effective against different model architectures and defensive configurations [7].\n\n### Benchmark Standardization\n\nBy providing standardized ASR measurements across consistent test environments, AgentDojo enables meaningful comparisons between different AI agents, attack methods, and defensive strategies [1].\n\n## Future Research Directions\n\nThe Attack Success Rate metric continues to evolve as new attack vectors and defensive techniques emerge. The framework's design supports ongoing research into:\n\n- Development of more sophisticated attack methods\n- Creation of robust defensive strategies\n- Understanding of the fundamental security-capability trade-offs in AI systems\n- Evaluation of emerging LLM architectures and their security properties\n\n## Related Topics\n\n- Prompt Injection Attacks\n- LLM Agent Security\n- Adversarial Machine Learning\n- AI Safety Evaluation\n- Red Team Testing\n- AI Agent Robustness\n- Defensive AI Strategies\n- AI Security Benchmarks\n\n## Summary\n\nAttack Success Rate in AgentDojo is a standardized metric measuring the percentage of successful adversarial attacks against AI agents, revealing that more capable models paradoxically face higher attack success rates while defensive measures can reduce vulnerability from 25% to 8%.\n\n\n\n","sources":[{"url":"https://agentdojo.spylab.ai/results/","title":"Results - AgentDojo","snippet":"AgentDojo Results Here are all the results from different combinations of models, defenses, and attacks that we ran on our benchmark."},{"url":"https://www.noahf.ai/posts/agent-dojo","title":"Paper Review: AgentDojo and the Problem of Evaluating Agents Under Attack","snippet":"The targeted attack success rate (ASR) — the percentage of the tests where the attacker's specific goal was achieved — correlates positively with model capability. Better performing models are easier to attack. GPT-4o has a 47.7% targeted ASR. Command-R+, one of the weaker models in the evaluation, has a 0.95% targeted ASR."},{"url":"https://www.nist.gov/news-events/news/2025/01/technical-blog-strengthening-ai-agent-hijacking-evaluations","title":"Technical Blog: Strengthening AI Agent Hijacking Evaluations","snippet":"This resulted in an increase in attack success rate from 11% for the strongest baseline attack to 81% for the strongest new attack. Extending this further, CAISI then tested the performance of the new red team attacks in the other three AgentDojo environments to determine if they generalized well beyond the Workspace environment."},{"url":"https://arxiv.org/html/2406.13352?_immersive_translate_auto_translate=1","title":"AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks ...","snippet":"Current LLMs solve less than 66% of AgentDojo tasks in the absence of any attack. In turn, our attacks succeed against the best performing agents in less than 25% of cases. When deploying existing defenses against prompt injections, such as a secondary attack detector [lakera, protectai2024deberta], the attack success rate drops to 8%."},{"url":"https://www.emergentmind.com/topics/agentdojo-benchmark","title":"AgentDojo Benchmark: LLM Security Evaluation","snippet":"AgentDojo is an extensible framework for evaluating the adversarial robustness of LLM agents, focusing on prompt injection attacks in tool-augmented workflows. It organizes 97 user tasks and 629 security cases across domains like banking, Slack, travel, and workspace, measuring metrics such as benign utility, utility under attack, and attack success rate. The framework informs defense ..."},{"url":"https://github.com/ethz-spylab/agentdojo","title":"GitHub - ethz-spylab/agentdojo: A Dynamic Environment to Evaluate ...","snippet":"A Dynamic Environment to Evaluate Attacks and Defenses for LLM Agents. - ethz-spylab/agentdojo"},{"url":"https://deepwiki.com/ethz-spylab/agentdojo/5.2-analyzing-results","title":"Analyzing Results | ethz-spylab/agentdojo | DeepWiki","snippet":"Security Analysis Attack resistance rates measure defense effectiveness Performance degradation shows security vs. functionality trade-offs Injection impact assessment reveals vulnerability patterns Results Analysis Workflow The typical workflow for analyzing AgentDojo results follows a structured approach from data access to insights:"},{"url":"https://invariantlabs.ai/blog/agentdojo","title":"AgentDojo: Jointly evaluate security and utility of AI agents","snippet":"This reflects leanings made in years of research in adversarial examples where constantly new attacks (analogous to prompt injections) are discovered and new defenses proposed, leading to a cat-and-mouse game. This design allows AgentDojo to evaluate agents, prompt injections, and defense strategies systematically."}],"infobox":{"Type":"Security Metric","Domains":"Banking, Slack, Travel, Workspace","Purpose":"Measure adversarial attack effectiveness","Framework":"AgentDojo","Test Cases":"629 security scenarios","Typical Range":"0.95% to 81%","With Defenses":"~8%","Best Model ASR":"47.7% (GPT-4o)"},"metadata":{"tags":["ai-security","attack-success-rate","prompt-injection","llm-agents","adversarial-evaluation","agentdojo","ai-safety"],"quality":{"status":"generated","reviewed_by":[],"flagged_issues":[]},"category":"Technology","difficulty":"advanced","subcategory":"AI Security"},"model_used":"anthropic/claude-4-sonnet-20250522","revision_number":1,"view_count":4,"related_topics":[],"sections":["Attack Success Rate (AgentDojo)","Definition and Measurement","AgentDojo Framework Context","Research Findings and Trends","Model Capability Correlation","Baseline Performance Metrics","Defense Effectiveness","Attack Methodology and Environments","Implications for AI Safety","Security-Utility Trade-offs","Vulnerability Pattern Recognition","Benchmark Standardization","Future Research Directions","Related Topics","Summary"]}