LLM Agents can Autonomously Exploit One-day Vulnerabilities
Problem & Motivation
Can LLM agents autonomously exploit real-world one-day vulnerabilities (disclosed but unpatched), as opposed to the toy capture-the-flag (CTF) exercises studied in prior work?
Prior work on LLM agents in cybersecurity focused exclusively on toy problems and CTF-style exercises that do not reflect real-world deployments. This leaves a critical gap in understanding whether LLM agents pose a genuine threat to real systems with known but unpatched vulnerabilities.
Threat Model
An attacker has access to the CVE description of a one-day vulnerability (disclosed but not yet patched in the target system). The attacker uses an LLM agent with tool access to autonomously exploit the vulnerability. The time window is between vulnerability disclosure (t=1) and patch deployment (t=n).
Methodology
The authors build a simple LLM agent (91 lines of code) using the ReAct framework implemented in LangChain. The agent is given access to tools (web browser, terminal, web search, file editor, code interpreter), the CVE description, and a detailed prompt encouraging creative exploitation. They evaluate the agent on a benchmark of 15 real-world one-day vulnerabilities collected from the CVE database and academic papers, spanning website vulnerabilities, container vulnerabilities, and vulnerable Python packages.
Architecture
Single LLM agent with ReAct loop. The agent receives a prompt with the CVE description, uses tools iteratively (observe-reason-act cycle), and attempts to exploit the target vulnerability in a sandboxed environment. For OpenAI models, the Assistants API is used; for open-source models, the Together AI API is used.
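The paper's agent uses LangChain's ReAct implementation with a frontier LLM; a stdlib-only sketch of the observe-reason-act loop (with a stubbed model and a toy tool — all names here are hypothetical, not the authors' code) looks like:

```python
# Minimal ReAct-style loop: the model alternates "Action" steps with tool
# observations until it emits a final answer. The real agent uses
# LangChain + GPT-4; here both the model and the tool are stubs.

def stub_llm(transcript: str) -> str:
    """Hypothetical model: picks the next step from the transcript."""
    if "Observation: uid=0(root)" in transcript:
        return "Final Answer: exploit succeeded, got root shell"
    return "Action: terminal[whoami]"

def terminal(cmd: str) -> str:
    """Toy sandboxed 'terminal' tool."""
    return "uid=0(root)" if cmd == "whoami" else f"unknown command: {cmd}"

TOOLS = {"terminal": terminal}

def react_loop(task: str, max_steps: int = 5) -> str:
    transcript = f"Task: {task}"  # conversation history doubles as memory
    for _ in range(max_steps):
        step = stub_llm(transcript)
        if step.startswith("Final Answer:"):
            return step.removeprefix("Final Answer: ").strip()
        # Parse "Action: tool[input]" and run the corresponding tool
        name, arg = step.removeprefix("Action: ").rstrip("]").split("[", 1)
        obs = TOOLS[name](arg)
        transcript += f"\n{step}\nObservation: {obs}"
    return "gave up"

print(react_loop("exploit CVE-XXXX-YYYY"))  # → exploit succeeded, got root shell
```

The essential point the sketch captures is how little scaffolding is needed: tool dispatch plus an append-only transcript is enough for the loop, which is why the real agent fits in 91 lines.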
LLM Models
- GPT-4 (primary), GPT-3.5, and 8 open-source LLMs (listed under Baseline Comparisons)
Tool Integration
- Web browser, terminal, web search, file editor, code interpreter
Memory Mechanism
- Conversation history
Attack Phases Covered
- Exploitation of known (one-day) vulnerabilities; autonomous discovery is probed only via the no-CVE-description ablation
Evaluation
GPT-4 achieves an 87% success rate (pass@5) on exploiting 15 real-world one-day vulnerabilities, while every other model tested (GPT-3.5 and 8 open-source LLMs) and the open-source vulnerability scanners ZAP and Metasploit achieve 0%. Without the CVE description, GPT-4's success rate drops from 87% to 7%, demonstrating the critical importance of vulnerability knowledge. The average cost per successful exploit is $8.80 ($3.52 average per run, divided by the 40% overall success rate), about 2.8x cheaper than the estimated cost of a human expert.
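The cost figure follows from simple expected-value arithmetic, and pass@5 can be related to a per-run rate if one assumes independent runs (an illustrative assumption, not something the paper claims):

```python
# Expected cost per successful exploit = cost per run / per-run success rate.
cost_per_run = 3.52      # USD, average cost of one agent run (from the paper)
overall_success = 0.40   # overall per-run success rate (from the paper)
cost_per_exploit = cost_per_run / overall_success
print(f"${cost_per_exploit:.2f}")  # → $8.80

# If the 5 attempts behind pass@5 were independent with per-run rate p,
# pass@5 = 1 - (1 - p)**5. With p = 0.40 this gives ~0.92, in the same
# ballpark as the measured 87% (runs are not truly independent).
p = 0.40
print(round(1 - (1 - p) ** 5, 3))  # → 0.922
```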
Environment
- Sandboxed environments reproducing the vulnerable software (largely Docker containers)
Metrics
- Success rate (pass@5 and overall per-run rate); dollar cost per run and per successful exploit
Baseline Comparisons
- GPT-3.5
- OpenHermes-2.5-Mistral-7B
- LLaMA-2-Chat-70B
- LLaMA-2-Chat-13B
- LLaMA-2-Chat-7B
- Mixtral-8x7B-Instruct
- Mistral-7B-Instruct-v0.2
- Nous-Hermes-2-Yi-34B
- OpenChat-3.5
- ZAP
- Metasploit
Scale
15 real-world one-day vulnerabilities
Contributions
- First demonstration that LLM agents can autonomously exploit real-world one-day vulnerabilities, moving beyond toy CTF exercises
- A benchmark of 15 real-world one-day vulnerabilities spanning websites, containers, and Python packages, with over half rated high or critical severity
- Empirical evidence of a stark capability gap: GPT-4 achieves 87% success while all other tested models and traditional vulnerability scanners achieve 0%
- Ablation study showing the CVE description is critical: without it, GPT-4 success drops from 87% to 7%, indicating finding vulnerabilities is much harder than exploiting known ones
- Cost analysis showing LLM-based exploitation is 2.8x cheaper than human experts
Limitations
- Only 15 vulnerabilities in the benchmark, limiting generalizability
- Focused on open-source software only, as closed-source CVEs cannot be reproduced
- Many open-source CVEs were excluded due to irreproducibility (broken Docker containers, unspecified dependencies)
- Agent lacks sub-agent capabilities and a dedicated planning module, which could improve performance
- The agent struggles with JavaScript-heavy web apps (e.g., Iris XSS) and non-English CVE descriptions (e.g., Hertzbeat RCE in Chinese)
- Context window limitations constrain the number of actions the agent can take
- OpenAI tool response size limit of 512 kB forces workarounds for large web pages
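One workaround for the tool-response size cap is to clip large web pages before returning them to the model; a hypothetical sketch (not the authors' implementation), clipping by bytes rather than characters:

```python
# Hypothetical workaround for a tool-response size cap: truncate large
# pages to a byte budget before handing them back to the agent.
MAX_TOOL_BYTES = 512 * 1024  # 512 kB cap on a single tool response

def clip_response(body: str, limit: int = MAX_TOOL_BYTES) -> str:
    data = body.encode("utf-8")
    if len(data) <= limit:
        return body
    marker = "\n[... truncated to fit tool response limit ...]"
    keep = limit - len(marker.encode("utf-8"))
    # errors="ignore" drops any multi-byte character split at the boundary
    return data[:keep].decode("utf-8", errors="ignore") + marker

page = "x" * (600 * 1024)  # a 600 kB page
clipped = clip_response(page)
print(len(clipped.encode("utf-8")) <= MAX_TOOL_BYTES)  # → True
```

A fancier variant could summarize the page with a second LLM call instead of truncating, trading cost for fidelity.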
Research Gaps
- Whether other frontier models (beyond GPT-4) can achieve similar exploitation capabilities remains unexplored
- Planning mechanisms and sub-agent architectures could improve exploitation success but have not been tested
- The ability of LLM agents to autonomously discover (not just exploit) vulnerabilities is extremely limited (7% without CVE description)
- Defensive measures against LLM-based autonomous exploitation are not addressed
- Scaling to a larger and more diverse set of vulnerabilities is needed
Novel Techniques
- Using CVE descriptions as structured knowledge input to guide LLM agent exploitation
- Demonstrating that a minimal 91-line agent with ReAct can exploit complex real-world vulnerabilities
- Showing that vulnerability exploitation may be an emergent capability of frontier LLMs
Open Questions
- Is the exploitation capability truly emergent in GPT-4, or could it be replicated with fine-tuned open-source models?
- How would defensive countermeasures (e.g., rate limiting, anomaly detection) affect LLM agent success?
- Can planning modules and sub-agents significantly close the gap for vulnerability discovery without CVE descriptions?
- How do newer models (GPT-4o, Claude 3, etc.) compare on this benchmark?
Builds On
- ReAct
- LangChain
- Fang et al. 2024 - LLM agents can autonomously hack websites
- ACIDRain (Warszawski & Bailis 2017)
Open Source
No