What Makes a Good LLM Agent for Real-world Penetration Testing?
Problem & Motivation
LLM-based penetration testing agents show widely varying performance across systems and benchmarks, with task completion rates ranging from single digits under naive prompting to 40-80% with sophisticated architectures. The paper asks what drives these performance differences and what fundamental limitations remain.
The global shortfall of 4.7 million cybersecurity professionals and the labor-intensive nature of manual penetration testing have driven interest in LLM-based automation. However, existing systems are optimized to address transient LLM constraints (e.g., limited context windows, poor tool knowledge) rather than persistent task challenges. Performance gaps between systems compress by over half when backbone models upgrade, indicating that current architectural innovations are workarounds for past-generation model limitations rather than solutions to fundamental penetration testing challenges like long-horizon planning and real-time difficulty assessment.
Threat Model
Standard penetration testing threat model: black-box/grey-box assessment of target systems. The agent operates as an external attacker with network access to the target, attempting to gain unauthorized access and escalate privileges. The benchmark environments assume no active defenses, though the paper discusses adversarial environments as a remaining challenge.
Methodology
The paper conducts a systematic analysis of 28 LLM-based penetration testing systems, evaluates five representative implementations across three benchmarks, and identifies two distinct failure categories: Type A (capability gaps addressable through engineering) and Type B (complexity barriers from planning and state management limitations). Based on this analysis, the authors present PentestGPT V2, which couples a Tool and Skill Layer (for Type A failures) with a Task Difficulty Assessment (TDA) mechanism integrated into an Evidence-Guided Attack Tree Search (EGATS) framework (for Type B failures), plus a Memory Subsystem for persistent state management.
Architecture
PentestGPT V2 is a single-agent system with four main components: (1) A Tool and Skill Layer providing typed interfaces for 38 security tools across six categories, skill compositions encoding expert attack patterns, and RAG-based knowledge augmentation from MITRE ATT&CK, OWASP, and tool documentation. (2) A TDA mechanism computing a Task Difficulty Index from four dimensions: horizon estimation, evidence confidence, context load, and historical success rate. (3) An Evidence-Guided Attack Tree Search (EGATS) algorithm adapted from MCTS that maintains an external attack tree with UCB-based node selection, TDI-guided mode switching between BFS (reconnaissance) and DFS (exploitation), and evidence-based pruning. (4) A Memory Subsystem with a structured State Store tracking hosts, services, credentials, sessions, and vulnerabilities, with selective context injection and branch summaries to prevent context forgetting.
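As a concrete illustration, the TDA computation and TDI-guided mode switching described above can be sketched as follows. The four dimension names follow the paper, but the weights, normalization, and switching thresholds here are hypothetical placeholders, not the authors' tuned values.

```python
from dataclasses import dataclass

@dataclass
class TaskSignals:
    horizon_estimate: float      # predicted remaining attack steps, normalized to [0, 1]
    evidence_confidence: float   # confidence that gathered evidence supports this path, [0, 1]
    context_load: float          # fraction of context window consumed (paper reports a
                                 # 40% ideal working window for this dimension)
    historical_success: float    # success rate on similar past sub-tasks, [0, 1]

def task_difficulty_index(s: TaskSignals,
                          weights=(0.3, 0.3, 0.2, 0.2)) -> float:
    """Higher TDI = harder task; confidence and past success reduce difficulty.
    The weights are illustrative assumptions."""
    w_h, w_e, w_c, w_s = weights
    return (w_h * s.horizon_estimate
            + w_e * (1.0 - s.evidence_confidence)
            + w_c * s.context_load
            + w_s * (1.0 - s.historical_success))

def search_mode(tdi: float, low=0.35, high=0.65) -> str:
    """TDI-guided mode switching: depth-first exploitation when the current
    state looks tractable, breadth-first reconnaissance when it looks hard,
    and an intermediate LLM-DECIDE zone for ambiguous cases. Thresholds are
    assumptions."""
    if tdi < low:
        return "DFS"         # commit to a promising exploitation path
    if tdi > high:
        return "BFS"         # broaden reconnaissance instead
    return "LLM-DECIDE"      # defer the call to the model
```

A low-difficulty state (short horizon, strong evidence) yields a small TDI and a DFS decision, while weak evidence or a bloated context pushes the agent back toward breadth-first reconnaissance.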
LLM Models
Tool Integration
Memory Mechanism
RAG
Attack Phases Covered
Evaluation
PentestGPT V2 achieves 91% task completion on XBOW with Opus 4.5 thinking mode (49% relative improvement over best baseline of 61%), roots 12 of 13 machines on the PentestGPT Benchmark (33% improvement over best baseline of 9), and compromises 4 of 5 hosts on GOAD (doubling baseline performance of 2). In the HTB Season 8 live competition, PentestGPT V2 completed 10 of 13 machines (76.9%), ranking in the top 100 out of 8,036 active participants. Ablation shows Tool Layer dominates on short-horizon tasks (+14% on XBOW), TDA-EGATS on multi-step scenarios (+2 machines on PentestGPT-Ben), and Memory on extended campaigns (+1 host on GOAD).
Environment
Metrics
Baseline Comparisons
- PentestGPT v1.0
- AutoPT
- PentestAgent
- VulnBot
- Cochise (in initial analysis only)
Scale
104 XBOW web tasks, 13 HTB/VulnHub machines, 5-host GOAD AD environment, 13 HTB Season 8 live machines
Contributions
- Systematic analysis of 28 LLM-based penetration testing systems identifying two distinct failure categories: Type A (capability gaps) and Type B (complexity barriers), showing existing architectures optimize for transient model constraints rather than persistent task challenges
- PentestGPT V2 system with Tool and Skill Layer (38 typed tool interfaces, skill compositions, RAG knowledge augmentation) for Type A failures and Task Difficulty Assessment integrated into Evidence-Guided Attack Tree Search for Type B failures
- Task Difficulty Assessment (TDA) mechanism computing tractability from four measurable dimensions: horizon estimation, evidence confidence, context load, and historical success rate
- Evidence-Guided Attack Tree Search (EGATS) algorithm adapting MCTS to penetration testing with TDI-guided mode switching, UCB selection with difficulty penalty, and evidence-based pruning
- Comprehensive evaluation across three benchmarks plus live HTB Season 8 deployment, achieving 91% on CTF benchmarks and 4/5 hosts on enterprise AD
- Open-source release of implementation, tool interfaces, and evaluation scripts
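The "UCB selection with difficulty penalty" listed above can be sketched in a few lines: standard UCB1 scoring with the paper's additional penalty term (-lambda * TDI), so high-difficulty branches are deprioritized during node selection. The exploration constant `c` and penalty weight `lam` are illustrative assumptions, not the authors' values.

```python
import math

def ucb_with_difficulty(mean_reward: float, visits: int, parent_visits: int,
                        tdi: float, c: float = 1.4, lam: float = 0.5) -> float:
    """UCB1 score with a difficulty penalty: exploitation + exploration - lam * TDI."""
    if visits == 0:
        return float("inf")  # unvisited nodes are always tried first
    exploration = c * math.sqrt(math.log(parent_visits) / visits)
    return mean_reward + exploration - lam * tdi

def select_child(children):
    """children: list of (mean_reward, visits, tdi) tuples for one tree node;
    returns the index of the child with the highest penalized UCB score."""
    parent_visits = max(1, sum(v for _, v, _ in children))
    scores = [ucb_with_difficulty(m, v, parent_visits, t) for m, v, t in children]
    return scores.index(max(scores))
```

With equal visit counts, a slightly lower-reward branch with low TDI can outscore a higher-reward branch with high TDI, which is exactly the bias toward tractable paths the penalty is meant to introduce.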
Limitations
- Benchmark scope omits binary exploitation, mobile security, and cloud-specific attack scenarios where different challenges may dominate
- PentestGPT Benchmark uses retired machines with public walkthroughs, potentially inflating absolute numbers through data contamination
- TDA cannot distinguish 'difficult but tractable' tasks from novel ones requiring creative reasoning; both present as high TDI, leading to premature pruning of novel attack paths (PlayerTwo failure case)
- No defense against adversarial environments: honeypots, canary tokens, and deceptive services can poison the agent's state representation
- No cross-session continuity for multi-week engagements; Memory Subsystem operates within single sessions only
- Baseline systems evaluated with their original tool invocation mechanisms rather than PentestGPT V2's Tool Layer, so reported improvements conflate tool integration and architectural contributions
- Results obtained with only three frontier model families; different model architectures show different strengths (e.g., Opus 4.5 best on XBOW)
- Blind injection requiring timing-based exfiltration and multi-stage creative payload chaining remain unsolved (9% XBOW failures)
Research Gaps
- Novel exploitation requiring creative, out-of-distribution reasoning beyond pattern matching remains an open problem that neither improved search nor larger corpora can resolve
- Adversarial robustness: agents have no mechanism to detect honeypots, canary tokens, or deceptive services that poison their state representation
- Cross-session temporal reasoning: maintaining mental models across multi-week engagements requires hierarchical abstraction, goal decomposition, and progress monitoring that current transformer architectures do not natively support
- Evaluation methodology gap: no existing benchmarks separately assess Type A and Type B performance, making it hard to measure architectural progress
- Distinguishing tractable difficulty from intractable novelty in real-time remains unsolved
- Binary exploitation requiring precise memory layout reasoning poses distinct challenges not captured by current benchmarks
- Social engineering and business logic testing are absent from all current LLM pentest evaluations
Novel Techniques
- Task Difficulty Index (TDI) combining four measurable dimensions (horizon estimation, evidence confidence, context load, historical success) to guide real-time exploration-exploitation decisions in penetration testing
- Evidence-Guided Attack Tree Search (EGATS) adapting MCTS to penetration testing with TDI-guided BFS/DFS mode switching and an intermediate LLM-DECIDE zone for ambiguous cases
- UCB formula modified with difficulty penalty term (-lambda * TDI) to penalize high-difficulty nodes during selection
- Type A / Type B failure taxonomy for systematically categorizing LLM agent failures into engineering-solvable capability gaps vs. architecture-requiring complexity barriers
- Credential propagation mechanism that re-evaluates pruned branches when new credentials discovered elsewhere in the attack tree might satisfy their preconditions
- Context load dimension tracking fraction of context window consumed with empirically-derived 40% ideal working window threshold
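A minimal sketch of the credential-propagation mechanism listed above, assuming a simple set-based precondition model: when a new credential surfaces anywhere in the attack tree, pruned branches whose preconditions are now satisfied are reopened. All class names and fields here are hypothetical, not the paper's implementation.

```python
from dataclasses import dataclass, field

@dataclass
class PrunedBranch:
    name: str
    required_creds: set  # precondition: credentials this branch needs to proceed

@dataclass
class AttackState:
    credentials: set = field(default_factory=set)
    pruned: list = field(default_factory=list)

    def add_credential(self, cred: str) -> list:
        """Record a newly discovered credential and return any pruned branches
        whose preconditions it completes, removing them from the pruned list."""
        self.credentials.add(cred)
        reopened = [b for b in self.pruned
                    if b.required_creds <= self.credentials]
        self.pruned = [b for b in self.pruned if b not in reopened]
        return reopened
```

For example, a lateral-movement branch pruned for lack of a service account stays dormant until that credential is harvested elsewhere in the tree, at which point it is returned for re-evaluation rather than lost.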
Open Questions
- How can LLM agents distinguish genuinely novel (requiring creative reasoning) from merely difficult (requiring more search) tasks in real time?
- Can agents develop meta-awareness to detect adversarial manipulation of their state representation (honeypots, deceptive services)?
- How should cross-session state and strategic patience be modeled for multi-week penetration testing engagements?
- Will the Type A/B failure framework generalize to other LLM agent domains like software engineering or web navigation?
- What is the optimal balance between domain-specific tool specialization (like Cochise's AD focus) and general-purpose flexibility?
- How do we build evaluation methodologies that separately measure progress on Type A vs Type B challenges?
- Can TDA be extended with learned difficulty models trained on historical pentest traces rather than hand-tuned scoring rubrics?
Builds On
- PentestGPT (Deng et al., USENIX Security 2024)
- Monte Carlo Tree Search (Coulom 2007; Kocsis & Szepesvári 2006)
- Anthropic Agent Skills framework
- Model Context Protocol (MCP)
- AutoPT Pentesting State Machine
- TermiAgent Memory Tree
- SWE-bench (Jimenez et al., ICLR 2024)
Open Source
Yes - https://anonymous.4open.science/r/Excalibur-FA7D (anonymous for review)