Teams of LLM Agents can Exploit Zero-Day Vulnerabilities
Problem & Motivation
Prior work has shown that single LLM agents can exploit known (one-day) vulnerabilities when given a description, but they perform poorly on zero-day vulnerabilities where no description is provided. This paper investigates whether teams of LLM agents can autonomously exploit real-world zero-day web vulnerabilities.
Single agents struggle with the joint exploration, planning, and execution required for zero-day exploitation due to limited context lengths and difficulty backtracking after exploring dead ends. A more structured multi-agent approach could overcome these limitations and answer the open question of whether AI agents can exploit vulnerabilities unknown to the attacker ahead of time.
Threat Model
An attacker with access to a web application (with basic credentials like a normal user account) but no knowledge of specific vulnerabilities in the system. The attacker uses LLM-powered agents with access to web browsing tools, terminals, and vulnerability-specific tooling, but agents do not search for vulnerabilities via search engines.
Methodology
The authors introduce HPTSA (Hierarchical Planning and Task-Specific Agents), a multi-agent framework with three components: (1) a hierarchical planner that explores the target website and determines attack strategies, (2) a team manager that dispatches task-specific expert agents and synthesizes information across agent runs, and (3) a set of six task-specific expert agents (XSS, SQLi, CSRF, SSTI, ZAP, and a generic web hacking agent) each specialized in exploiting a particular vulnerability class. The planner explores the environment, the manager selects and coordinates expert agents, and experts attempt exploitation with access to relevant documentation and specialized tools.
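Since the paper's prompts and code are unreleased, the following is only a hypothetical sketch of how an expert agent's prompt might be assembled from curated vulnerability-class documents and the manager's brief; `SQLI_DOCS` and `build_expert_prompt` are illustrative names, not the authors' API.

```python
# Hypothetical prompt assembly for a task-specific expert agent.
# Document titles are placeholders, not the paper's actual curated documents.
SQLI_DOCS = ["UNION-based injection walkthrough", "sqlmap usage notes"]

def build_expert_prompt(vuln_class, docs, manager_brief):
    """Combine the expert's specialty, reference documents, and the
    manager's instructions (synthesized from prior agent runs)."""
    refs = "\n".join(f"- {d}" for d in docs)
    return (f"You are an expert in exploiting {vuln_class} vulnerabilities.\n"
            f"Reference documents:\n{refs}\n"
            f"Manager's brief: {manager_brief}")

prompt = build_expert_prompt("SQLi", SQLI_DOCS, "Probe the login form first.")
```

The key design point is that domain knowledge arrives via retrieved documents rather than being assumed to exist in the model's weights.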
Architecture
Three-tier hierarchical architecture: (1) Planner at the top explores the target and creates high-level attack plans, (2) Manager in the middle selects which task-specific agents to dispatch and passes context from previous agent runs, (3) Task-specific expert agents at the bottom attempt exploitation of specific vulnerability classes. Agents communicate via LangGraph message passing.
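The three-tier control flow can be sketched as a toy loop; `planner`, `manager`, and `expert` below are stubs standing in for LLM-driven agents (the real system dispatches LangGraph agents, not Python functions), and the planted `/login` SQLi is fabricated for illustration.

```python
# Toy sketch of the HPTSA three-tier control flow (stubs, not the paper's code).

def planner(target):
    """Top tier: explore the target and emit high-level attack leads."""
    return [{"page": "/login", "suspect": "SQLi"},
            {"page": "/profile", "suspect": "XSS"}]

def expert(kind, lead, context):
    """Bottom tier: a task-specific agent attempts one vulnerability class.
    Stub: succeeds only when the expert matches the planted flaw."""
    success = kind == "SQLi" and lead["page"] == "/login"
    trace = f"{kind} agent on {lead['page']}: {'exploit' if success else 'no-op'}"
    return success, trace

def manager(leads, experts=("XSS", "SQLi", "CSRF", "SSTI")):
    """Middle tier: dispatch experts per lead; accumulated traces let the
    manager backtrack across agents instead of inside a single agent."""
    context = []
    for lead in leads:
        for kind in experts:
            ok, trace = expert(kind, lead, context)
            context.append(trace)
            if ok:
                return True, context
    return False, context

exploited, history = manager(planner("http://target.local"))
```

Backtracking happens at the manager level: a failed expert run ends, its trace joins `context`, and the manager moves on, rather than one long-context agent trying to unwind a dead end itself.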
LLM Models
GPT-4 (the only model that succeeds); Llama-3.1-405B and Qwen-2.5-72B are evaluated as open-source alternatives.
Tool Integration
Web browsing tools, terminal access, and vulnerability-specific tooling; agents communicate via LangGraph.
Memory Mechanism
Conversation history within each agent run; the manager additionally carries traces from prior agent runs into later dispatches.
Attack Phases Covered
Target exploration and reconnaissance, attack planning, and exploitation.
Evaluation
HPTSA with GPT-4 achieves 42% pass@5 and 18% pass@1 on 14 real-world zero-day web vulnerabilities. It outperforms a single GPT-4 agent given no vulnerability description by 4.3x on pass@1 and 2.0x on pass@5, and comes within 1.8x of a GPT-4 agent that is given the description. Open-source models (Llama-3.1-405B, Qwen-2.5-72B) and traditional tools (ZAP, Metasploit) achieve 0% success.
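For readers unfamiliar with pass@k, a small sketch of how such numbers are typically computed from repeated runs; the per-CVE success counts below are made up, and the paper's exact scoring script is not public (the standard unbiased estimator shown here is from the Codex evaluation literature).

```python
# pass@k from n runs per vulnerability, c of which succeeded.
from math import comb

def pass_at_k(n, c, k):
    """Unbiased estimate of P(at least one success among k of n runs)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical successes out of n=5 runs for three made-up vulnerabilities.
runs = {"CVE-A": 3, "CVE-B": 0, "CVE-C": 1}
pass1 = sum(pass_at_k(5, c, 1) for c in runs.values()) / len(runs)
pass5 = sum(pass_at_k(5, c, 5) for c in runs.values()) / len(runs)
```

With 5 runs per vulnerability, pass@5 reduces to "exploited at least once", while pass@1 is the average per-run success rate.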
Environment
Real-world open-source web applications containing the 14 benchmark vulnerabilities, all CVEs from 2024.
Metrics
pass@1 and pass@5 (successful exploitation within 1 or 5 runs), plus average cost per run and per successful exploit.
Baseline Comparisons
- GPT-4 single agent without vulnerability description (0DV agent)
- GPT-4 single agent with vulnerability description (1DV agent)
- ZAP vulnerability scanner
- Metasploit penetration-testing framework
- Llama-3.1-405B with HPTSA
- Qwen-2.5-72B with HPTSA
Scale
14 real-world zero-day web vulnerabilities (CVEs from 2024)
Contributions
- First demonstration that teams of LLM agents can autonomously exploit real-world zero-day vulnerabilities, resolving an open question from prior work
- Introduction of HPTSA, a hierarchical multi-agent framework with a planner, team manager, and task-specific expert agents for cybersecurity exploitation
- A new benchmark of 14 real-world zero-day web vulnerabilities (all CVEs from 2024, past GPT-4 knowledge cutoff) spanning XSS, SQLi, CSRF, privilege escalation, and other types
- Ablation studies demonstrating the necessity of each component: task-specific agents, documents, and hierarchical structure
Limitations
- Only 42% pass@5 success rate, meaning the majority of zero-day vulnerabilities remain unexploited
- Only GPT-4 succeeds; open-source models (Llama-3.1-405B, Qwen-2.5-72B) achieve 0%, showing heavy dependence on frontier proprietary models
- Benchmark is limited to 14 web vulnerabilities in open-source software, which may produce a biased sample of the vulnerability landscape
- Focused exclusively on web vulnerabilities; non-web vulnerabilities (e.g., binary exploitation, network protocols) are not addressed
- Agents fail on vulnerabilities requiring access to undocumented API endpoints or non-obvious navigation paths (e.g., CVE-2024-25635, CVE-2024-33247)
- Average cost of $4.39 per run ($24.40 per successful exploit) with GPT-4, which may limit scalability
- Code and prompts are not released publicly (at OpenAI's request), limiting reproducibility
- Open-source models showed high refusal rates (31% for Llama) and tendency to repeat incorrect approaches
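The two cost figures above are mutually consistent, which is worth a quick arithmetic check (this check is ours, not the paper's): cost per successful exploit should equal cost per run divided by the per-run success rate.

```python
# Consistency check of reported costs: $4.39/run at 18% pass@1
# implies about 1/0.18 ≈ 5.6 runs per success.
cost_per_run = 4.39     # USD, as reported
pass_at_1 = 0.18        # per-run success rate, as reported
cost_per_success = cost_per_run / pass_at_1   # ≈ $24.39, matching $24.40
```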
Research Gaps
- How to improve agent exploration of non-obvious attack surfaces (hidden endpoints, undocumented APIs)
- Whether more sophisticated multi-agent coordination or planning strategies could further close the gap to one-day (known vulnerability) performance
- Extending zero-day exploitation capabilities beyond web vulnerabilities to network, binary, and other domains
- Whether AI agents will ultimately favor offense or defense in cybersecurity, and how to steer development toward defensive applications
- Reducing dependence on frontier proprietary models by improving open-source model capabilities for cybersecurity tasks
- Developing better strategies for agents to handle vulnerabilities that lack visible input fields or obvious injection points
Novel Techniques
- Hierarchical planning with task-specific expert agents (HPTSA) that separates exploration/planning from exploitation, allowing backtracking at the manager level rather than within individual agents
- Task-specific expert agents with curated vulnerability-class documentation (5-6 documents per agent) to provide domain knowledge without requiring it in the model's training data
- HTML simplification strategy to reduce token consumption by stripping irrelevant tags (images, SVG, style) before passing web content to agents
- Cross-agent information synthesis where the manager uses traces from prior agent runs to refine instructions for subsequent agents
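The HTML-simplification idea can be sketched with the standard library alone; the paper's exact filter is not public, so the tag and attribute whitelist below is an assumption chosen to preserve attack-relevant structure (forms, inputs, links) while discarding presentation.

```python
# Sketch of HTML simplification: drop tags that rarely matter for exploitation
# (images, SVG, styles, scripts) and keep only attack-relevant attributes,
# shrinking the token count of pages passed to agents.
from html.parser import HTMLParser

DROP_CONTAINERS = {"svg", "style", "script"}   # dropped with their contents
DROP_VOID = {"img"}                            # void tags dropped outright
KEEP_ATTRS = {"id", "name", "action", "href", "method", "type"}

class Simplifier(HTMLParser):
    def __init__(self):
        super().__init__()
        self.out, self.skip = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTAINERS:
            self.skip += 1
        elif tag not in DROP_VOID and not self.skip:
            kept = "".join(f' {k}="{v}"' for k, v in attrs if k in KEEP_ATTRS)
            self.out.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in DROP_CONTAINERS:
            self.skip = max(0, self.skip - 1)
        elif tag not in DROP_VOID and not self.skip:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip and data.strip():
            self.out.append(data.strip())

def simplify(html: str) -> str:
    p = Simplifier()
    p.feed(html)
    return "".join(p.out)
```

A login form survives with its `action`, `method`, and input `name` attributes intact, while images and stylesheets vanish entirely.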
Open Questions
- Can the approach scale to more complex, multi-step vulnerabilities that require chaining multiple exploits?
- How would HPTSA perform on non-web vulnerability classes (e.g., memory corruption, logic bugs in APIs)?
- What is the optimal number and granularity of task-specific expert agents?
- Could reinforcement learning or self-improvement be used to make the planner and manager more effective over time?
- How robust is the approach to defensive measures like WAFs, rate limiting, or honeypots?
- Will next-generation open-source models close the gap with GPT-4 on this task?
Builds On
- Fang et al. 2024a - LLM agents can autonomously exploit one-day vulnerabilities
- Fang et al. 2024b - LLM agents can autonomously hack websites
- Liu et al. 2023b - Dynamic LLM-agent network for multi-agent collaboration
- Chen et al. 2023 - AutoAgents framework for automatic agent generation
- Zhang et al. 2023 - Building cooperative embodied agents modularly with LLMs
- Yao et al. 2022 - ReAct: Synergizing reasoning and acting in language models
Open Source
No (code and prompts withheld at OpenAI's request)