To Defend Against Cyber Attacks, We Must Teach AI Agents to Hack
Problem & Motivation
AI agents are poised to fundamentally alter the cost structure of cyber attacks by automating vulnerability discovery and exploitation at scale, yet current defensive strategies (data governance, safety alignment, output guardrails, representation engineering, access controls) are model-centric and insufficient against adaptive AI adversaries.
For over a decade, cybersecurity has relied on human labor scarcity to limit attackers. AI agents break this balance by enabling economically viable attacks across the long tail of previously ignored targets at near-zero marginal cost. Current model-level safeguards fail because attackers can use open-weight models, bypass alignment, and synthesize novel attacks from first principles. Defenders need offensive security intelligence to predict and preempt attacks rather than react to them.
Threat Model
The adversary is financially motivated and technically sophisticated, but historically constrained by the cost of human labor. It has access to state-of-the-art AI agents (via APIs or local deployment) and can integrate them into automated pipelines for vulnerability discovery, exploitation, and monetization. It does not need nation-state capabilities or novel cryptographic breaks. Its goal is to maximize aggregate profit across a large, heterogeneous victim population, not to damage a specific high-value target.
Methodology
This is a position paper that argues defenders must develop offensive security intelligence by teaching AI agents to hack in controlled environments. The paper first formalizes catastrophic cybersecurity risks in the AI agent era, including system-level vulnerability exploitation, cross-domain compromise chains, and automated superhuman cyber attacks. It then systematically examines five categories of model-centric defenses (data governance, safety alignment, representation engineering, output guardrails, access controls) and explains why each fails against adaptive AI adversaries. Finally, it proposes three concrete actions: (1) construct comprehensive benchmarks covering the full attack lifecycle, (2) advance from workflow-based to trained agents for discovering in-wild vulnerabilities, and (3) implement governance restricting offensive agents to audited cyber ranges while distilling findings into defense-only agents.
Architecture
The paper outlines an offense-to-defense workflow architecture where offensive agents operate within audited cyber ranges to identify and validate vulnerabilities, and their traces are distilled into actionable defensive artifacts (automated patch suggestions, regression tests). Specialized defense-only agents focusing on detection, root cause analysis, and remediation are then safely released to protect the global software ecosystem.
Tool Integration
Memory Mechanism
none
Attack Phases Covered
Evaluation
The paper surveys SOTA agent performance across existing benchmarks, finding highly variable results:
- Attack generation: 0.2% (SeCodePLT) to 54.5% (AutoPenBench)
- CTF solve rates: 22-55%
- Vulnerability detection: 12.9-77.8%
- PoC generation: 28.9% (CyberGym)
- Patching: 22.3-78.8%
AI agents perform better on small-scale generative tasks than on large-scale analytic tasks. Current benchmarks do not cover the full attack lifecycle (e.g., exploit chaining and command and control are missing).
Environment
Metrics
Scale
Survey of 11 existing cybersecurity benchmarks
Contributions
- Formalizes the catastrophic cybersecurity risks posed by AI agents, including system-level exploitation of the long tail of software, cross-domain compromise chains, and automated superhuman cyber attacks
- Systematically analyzes five categories of model-centric defenses (data governance, safety alignment, representation engineering, output guardrails, access controls) and demonstrates why each is insufficient against adaptive AI adversaries
- Proposes three actionable directions for building offensive AI capabilities responsibly: comprehensive attack-lifecycle benchmarks, evolution from workflow-based to trained security agents, and a governance framework (capability-tiered checkpoints, audited cyber ranges, offense-to-defense distillation)
- Presents a practical evolution roadmap for offensive security agents across three stages: knowledge models, workflow agents, and trained agents (Table 2)
- Addresses alternative views and counterarguments, including challenges in training security agents, limited real-world adoption by attackers, and uncertainty about continued AI progress
Limitations
- The paper is a position paper without empirical validation of the proposed offense-to-defense framework
- The proposed governance mechanisms (capability-tiered checkpoints, audited cyber ranges) are conceptual and lack implementation details or feasibility analysis
- Training offensive agents is acknowledged to be very challenging due to data scarcity, domain-specific tool requirements, out-of-distribution attack evolution, and the need for deep collaboration between the ML and security communities
- The paper acknowledges that offensive capabilities could leak from controlled environments, and the dual-use risk is inherent
- Assumes continued rapid progress in frontier AI development, which is not guaranteed
Research Gaps
- Current benchmarks lack coverage of critical attack phases such as exploit chaining, command and control, project-level vulnerability detection, and root cause analysis
- No existing benchmarks adequately cover the defense lifecycle (detection, root cause analysis, remediation)
- Trained offensive security agents (post-trained from cybersecurity environments) are underexplored; most agents remain at the workflow stage
- Methods for safely distilling offensive agent findings into defense-only agents do not yet exist
- Large-scale, real-world project benchmarks with dynamic execution environments and proper security-specific metrics are missing
- Reinforcement learning approaches for offensive security agents using cyber ranges are largely unexplored
- Benchmark quality control is insufficient: environment flaws, noisy vulnerability labels, and lack of regular updates to track evolving threats
Novel Techniques
- Capability-tiered checkpoints for staged release of offensive AI capabilities, where each model version is evaluated on standardized offensive measurements and assigned a release tier
- Offense-to-defense distillation: offensive agents discover vulnerabilities in cyber ranges, and their traces are converted into defensive artifacts (patches, regression tests) for safely-released defense-only agents
- Proposal to evolve offensive security agents from workflow-based (prompting with external tool orchestration) to trained agents (post-trained via reinforcement learning in cyber range environments)
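The capability-tiered checkpoint idea can be sketched as a simple gating function over standardized offensive-benchmark scores. The tier names and thresholds below are hypothetical placeholders for illustration; the paper does not specify concrete values.

```python
def assign_release_tier(offensive_scores: dict[str, float]) -> str:
    """Map standardized offensive-benchmark scores (0.0-1.0) for a model
    checkpoint to a release tier. Gating on the peak score is a deliberately
    conservative choice: one dangerous capability is enough to restrict release.
    Thresholds are illustrative placeholders, not values from the paper."""
    peak = max(offensive_scores.values())
    if peak >= 0.8:
        return "range-only"      # confined to audited cyber ranges
    if peak >= 0.4:
        return "vetted-access"   # released only to vetted defenders
    return "public"              # low offensive capability, open release

scores = {"exploit_generation": 0.55, "ctf_solving": 0.30}
print(assign_release_tier(scores))  # prints "vetted-access": peak score is 0.55
```

Each new checkpoint would be evaluated on the standardized offensive measurements before release, so the tier tracks capability growth across model versions rather than being assigned once.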
Open Questions
- How can offensive AI capabilities be effectively contained within audited environments while still producing useful defensive intelligence?
- What training methodologies can produce agents that generalize across the full attack lifecycle rather than excelling only at small-scale generative tasks?
- How should benchmarks be designed to cover the complete attack and defense lifecycle including exploit chaining, C2, and remediation?
- Can the offense-to-defense distillation process be made reliable enough that defense-only agents cannot be repurposed for attacks?
- Will AI agent-driven attacks see widespread real-world adoption by cyber criminals, or will they remain niche tools?
- How do we handle the inherent dual-use nature of offensive security research when AI dramatically lowers the barrier to misuse?
Builds On
- PentestGPT (Deng et al., 2024)
- Enigma (Abramovich et al.)
- CyberPal (Levi et al., 2025a)
- RepoAudit (Guo et al.)
- VulnLLM-R (Nie et al., 2025a)
- Locus (Zhu et al., 2025a)
- PBFuzz (Zeng et al., 2025)
- PRIMUS (Yu et al., 2025)
- Foundation-Sec-8B (Kassianik et al., 2025)
- Carlini et al. (2025) - LLMs unlock new paths to monetizing exploits
- Potter et al. (2025) - Frontier AI impact on cybersecurity
Open Source
No