PentestEval: Benchmarking LLM-based Penetration Testing with Modular and Stage-Level Design
Problem & Motivation
Existing LLM-based penetration testing approaches rely on simplistic prompting without task decomposition or domain adaptation, resulting in unreliable black-box behavior and limited insight into model capabilities across individual penetration testing stages.
Current benchmarks are primarily based on CTF challenges designed for human participants, emphasize final outcomes only, and provide limited visibility into intermediate reasoning and actions across the multiple stages of penetration testing. This lack of stage-level evaluation hinders fine-grained analysis of model behavior and progress toward building reliable AI-based penetration testing agents.
Threat Model
External network penetration testing targeting web applications. The attacker interacts with the target system only through its externally accessible interface (URLs, HTTP requests). Scenarios are derived from real-world security incidents and include OWASP Top 10 and CWE Top 25 vulnerabilities, plus one zero-day.
Methodology
PentestEval decomposes the penetration testing workflow into six sequential stages aligned with NIST and PTES: Information Collection, Weakness Gathering, Weakness Filtering, Attack Decision-Making, Exploit Generation, and Exploit Revision. It provides expert-annotated ground truth for each stage across 346 tasks in 12 realistic vulnerable scenarios. The benchmark evaluates LLMs on both individual stage performance and end-to-end pipeline completion, using stage-specific metrics (Jaccard similarity, Spearman rank correlation, syntax/functional correctness, revision success rate). Five expert penetration testers collaboratively designed environments, annotated ground truth, and cross-validated results.
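Two of the stage-specific metrics named above can be sketched in a few lines. These are my own minimal implementations based on the metric names, not the benchmark's reference code; the weakness labels in the example are hypothetical.

```python
# Minimal sketches of two stage-level metrics: Jaccard similarity for
# set-valued stage outputs (e.g. Weakness Gathering) and Spearman rank
# correlation for priority orderings (Attack Decision-Making).

def jaccard(predicted: set, ground_truth: set) -> float:
    """Overlap between an LLM's set-valued stage output and the
    expert-annotated ground truth."""
    if not predicted and not ground_truth:
        return 1.0
    return len(predicted & ground_truth) / len(predicted | ground_truth)

def spearman(ranking_a: list, ranking_b: list) -> float:
    """Spearman rank correlation between two priority orderings of the
    same items. Assumes both lists rank identical items with no ties."""
    assert set(ranking_a) == set(ranking_b)
    n = len(ranking_a)
    rank_b = {item: i for i, item in enumerate(ranking_b)}
    d_sq = sum((i - rank_b[item]) ** 2 for i, item in enumerate(ranking_a))
    return 1 - 6 * d_sq / (n * (n ** 2 - 1))

# Hypothetical example: model found 2 of 3 annotated weaknesses plus one extra.
print(jaccard({"sqli", "xss", "extra"}, {"sqli", "xss", "idor"}))  # 0.5
# Model ranks attack priorities in exactly the reverse of the expert order.
print(spearman(["sqli", "xss", "idor"], ["idor", "xss", "sqli"]))  # -1.0
```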
Architecture
Three-stage framework: (1) Environment Construction builds Docker-packaged vulnerable web environments from real-world incidents; (2) Human-Expert Annotation provides stage-wise ground truth via three independent expert testers with five-expert validation; (3) Performance Evaluation uses task-specific metrics comparing LLM outputs against expert-annotated solutions. For end-to-end evaluation, a Sequential Modular Pipeline (SMP) chains stages linearly.
LLM Models
Tool Integration
Memory Mechanism
conversation-history
Attack Phases Covered
Evaluation
At the stage level, LLMs exhibit generally weak performance, with a mean success rate of only 0.41 across all stages; Attack Decision-Making (0.25) and Exploit Generation functional correctness (0.26) are the most challenging. End-to-end pipelines reach only a 0.31 success rate, and fully autonomous agents such as PentestAgent (0.03) and VulnBot (0.06) fail almost entirely. Injecting ground truth at earlier stages (SMP-GT-ADM) boosts end-to-end success from 0.31 to 0.67, demonstrating the value of modularization.
Environment
Metrics
Baseline Comparisons
- PentestGPT
- PentestGPT-Auto
- PentestAgent
- VulnBot
- Sequential Modular Pipeline (SMP)
Scale
346 tasks across 12 realistic vulnerable scenarios; over 3,000 stage-level evaluations and 180 end-to-end tests
Contributions
- PentestEval: the first modular benchmarking framework for fine-grained, stage-level evaluation of LLM performance across six decomposed penetration testing stages, built with five domain experts
- Comprehensive evaluation of nine widely used LLMs, revealing consistently weak performance on critical tasks such as Weakness Gathering, Attack Decision-Making, and Exploit Generation
- End-to-end assessment of existing tools (PentestGPT, PentestAgent, VulnBot) and a step-wise pipeline, demonstrating that fully autonomous agents fail consistently
- Design insights showing that modularization enhances each individual stage and improves overall performance, with structured reasoning and critical attack path emphasis being key
Limitations
- Scope limited to external web application penetration testing; does not cover cloud infrastructures, IoT systems, or LLM-driven agent ecosystems
- 12 scenarios may not capture the full diversity of real-world penetration testing challenges
- Ground truth completeness limited even with five experts; some attack paths may remain unidentified
- Data contamination may affect results, though mitigated through novel attack configurations and a zero-day vulnerability
- Analysis relies solely on LLM outputs without access to training data or internal architectures, limiting attribution of specific failure causes
- Information Collection stage excluded from LLM evaluation as it uses fixed automated procedures
Research Gaps
- LLMs cannot reason about prerequisite relationships between weaknesses, preventing coherent multi-step attack chain construction
- LLMs struggle with unstructured reconnaissance data and miss critical application-level vulnerabilities
- No mechanisms for identifying zero-day vulnerabilities; all tested models failed on the zero-day scenario
- LLMs misinterpret symbolic version ranges, a persistent obstacle to accurate context-based filtering
- Semantic misinterpretation of critical code fragments (escape characters, encoded payloads) causes over one-third of exploit failures
- Need for schema-guided normalization to transform raw reconnaissance outputs into security-centric JSON representations
- Need for dedicated post-processing and validation modules to separate exploit generation from runtime verification
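The schema-guided normalization gap above suggests converting raw reconnaissance output into a fixed, security-centric JSON shape before it reaches the model. A hypothetical target schema and a trivial validity check (both my invention, not from the paper) might look like:

```python
# Hypothetical security-centric recon schema: the keys a downstream
# Weakness Gathering stage would plausibly need. Field names are
# illustrative assumptions, not the paper's format.

RECON_SCHEMA_EXAMPLE = {
    "target": "http://example.test",
    "open_ports": [{"port": 80, "service": "http", "version": "nginx 1.18.0"}],
    "endpoints": ["/login", "/api/v1/users"],
    "technologies": [{"name": "PHP", "version": "7.4"}],
    "auth_surfaces": ["/login"],
}

def validate_recon(doc: dict) -> bool:
    """Check that a normalized recon document carries the minimum
    keys a weakness-gathering stage would need."""
    required = {"target", "open_ports", "endpoints", "technologies"}
    return required <= doc.keys()

print(validate_recon(RECON_SCHEMA_EXAMPLE))  # True
```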
Novel Techniques
- Stage-level decomposition of penetration testing into six formally defined tasks with mathematical formulations for each stage
- Expert-annotated ground truth at each stage enabling fine-grained comparison between LLM and human performance
- Sequential Modular Pipeline (SMP) with ground-truth injection variants to isolate stage-level bottlenecks
- Priority-based formulation for Attack Decision-Making using Spearman rank correlation rather than binary correctness
- NonCVE Identification Rate (NICR) metric to assess LLM ability to find non-standardized vulnerabilities
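As I read its description, the NICR metric is the fraction of expert-annotated vulnerabilities without a CVE identifier that the model nonetheless identifies. The exact formula is an assumption, and the vulnerability labels below are illustrative:

```python
# Sketch of the NonCVE Identification Rate (NICR): share of
# non-standardized (non-CVE) ground-truth vulnerabilities that the
# model's output covers. Formula assumed from the metric's description.

def nicr(identified: set, annotated_non_cve: set) -> float:
    if not annotated_non_cve:
        return 0.0
    return len(identified & annotated_non_cve) / len(annotated_non_cve)

# Hypothetical: experts annotated 4 non-CVE weaknesses, model found 1.
print(nicr({"custom-auth-bypass"},
           {"custom-auth-bypass", "logic-flaw", "debug-endpoint", "zero-day"}))  # 0.25
```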
Open Questions
- How can LLMs be made to reason about prerequisite relationships between vulnerabilities for multi-step attack chains?
- Can fuzzing or auditing components be integrated to enable zero-day vulnerability discovery by LLMs?
- How effective would fine-tuning on high-quality attack-chain datasets be for improving penetration testing performance?
- Can modular architectures with inter-module context propagation close the gap between stage-level and end-to-end performance?
- How to extend the benchmark to cover cloud, IoT, and Active Directory penetration testing domains?
Builds On
- NIST-SP-800-115
- PTES
- PentestGPT
- PentestAgent
- VulnBot
- AutoAttacker
- PentestAI
Open Source
Yes