Research on LLM-Based Automated Penetration Testing

69 papers · 2005–2026

Executive Summary

This report synthesizes 69 academic papers on LLM-based penetration testing, published between 2005 and 2026, with 62% concentrated in 2025–2026. The field has undergone explosive growth: from early chat-based assistants (#01 PentestGPT, 2024) to sophisticated multi-agent systems with hierarchical planning (#04 xOffense, #19 Cochise, 2025) and RL-tuned models (#10 Pentest-R1, #54 Cyber-Zero, 2025) in just two years.

Key Findings

1. Rapid but immature growth. 67% of papers are preprints and only 14% appear at top conferences. The field lacks theoretical foundations: no paper offers formal complexity analysis or provable guarantees. Most systems rely on GPT-4/4o despite evidence that fine-tuned smaller models (#04 Qwen-32B, #42 Hackphyr-7B) achieve competitive or superior results.

2. Severe phase coverage imbalance. 94% of papers address exploitation, but only 25% tackle lateral movement, 21% cover reporting, and a mere 1% address persistence. Current LLM agents are effectively "single-host exploiters" rather than end-to-end penetration testers. Only 4 systems (#06, #19, #42, #60) attempt post-exploitation automation.

3. Fragmented and unrealistic evaluation. No benchmark is used by more than 7 papers. 40% rely on custom lab environments, and only 6% evaluate on real networks. CTF challenges dominate evaluation, but they poorly approximate actual enterprise penetration tests involving multi-host networks, Active Directory, and time-constrained engagements.

Implications for Research

The field presents significant opportunities for doctoral research. We identify 17 concrete research gaps across technical (T1–T7), evaluation (E1–E3), application domain (D1–D3), safety (S1–S2), and theoretical (Th1–Th2) dimensions, and propose 8 PhD thesis directions with feasibility assessments. The most promising unexplored areas are: multi-host network penetration (addressed by only 3 papers), cross-session persistent learning (only 1 paper), and the completely uncharted DRL+LLM integration space.

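Executive Summary

This report synthesizes 69 academic papers on penetration testing based on large language models (LLMs), 62% of which are concentrated in 2025–2026. The field has undergone explosive growth: in just two years it moved from early chat-based assistants (#01 PentestGPT, 2024) to sophisticated multi-agent systems with hierarchical planning (#04 xOffense, #19 Cochise, 2025) and RL-tuned models (#10 Pentest-R1, #54 Cyber-Zero, 2025).

Key Findings

1. Rapid but immature growth. 67% of papers are preprints and only 14% appear at top conferences. The field lacks theoretical foundations: no paper offers formal complexity analysis or provable guarantees. Most systems rely on GPT-4/4o despite evidence that fine-tuned smaller models (#04 Qwen-32B, #42 Hackphyr-7B) achieve comparable or even better results.

2. Agent failures are rooted in architecture rather than in the model. After analyzing 28 systems, PentestGPT V2 (#02) identified two classes of failure: Type A (capability gaps, which shrink as models improve) and Type B (complexity barriers, which do not shrink as models improve). Type B failures stem from inefficient search strategies, fragile multi-step attack chains, and sparse feedback signals. These are structural properties of the penetration testing task, in sharp contrast to programming tasks, where feedback is precise, the action space is bounded, and operations are reversible. CHECKMATE (#03) found that the same Claude Code that performs well on programming tasks exhibits blind trial-and-error behavior in penetration testing, which corroborates this difference.

3. Severe phase coverage imbalance. 94% of papers address exploitation, but only 25% address lateral movement, 21% address reporting, and just 1% address persistence. Current LLM agents are effectively "single-host exploiters" rather than end-to-end penetration testing tools.

4. Fragmented and unrealistic evaluation. No benchmark is shared by more than 7 papers. 40% of papers rely on custom lab environments, and only 6% evaluate on real networks. CTF challenges dominate evaluation, but they poorly approximate actual enterprise penetration tests involving multi-host networks, Active Directory, and time constraints.

Implications for Research

The field presents significant opportunities for doctoral research. We identify 17 concrete research gaps across five dimensions: technical (T1–T7), evaluation (E1–E3), application domain (D1–D3), safety (S1–S2), and theoretical (Th1–Th2). We also propose a three-stage progressive PhD research roadmap: (1) build a unified benchmark suite to address evaluation fragmentation; (2) achieve multi-host network penetration testing in an idealized environment, moving beyond the single-host limitation; and (3) introduce active defenders to study attack-defense dynamics in adversarial settings. The three stages cover key gaps in the evaluation (E1–E3), technical (T1+T3+D1), and adversarial (T5+E3) dimensions respectively, forming a complete research chain from evaluation infrastructure, to system capability validation, to real-world adaptation.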

Survey at a Glance


69 papers

62% from 2025–26

67% preprints

80% fully autonomous

Attack Phase Coverage

Phase	Coverage
Reconnaissance	86%
Scanning	74%
Enumeration	84%
Exploitation	93%
Post-exploitation	38%
Privilege escalation	46%
Lateral movement	25%
Reporting	20%

Publication Timeline

Agent Architectures & System Design

Overview

The field of LLM-based penetration testing has evolved rapidly from 2023 to 2026, moving from simple chat-based assistants to sophisticated multi-agent systems with structured planning, memory, and tool integration. This analysis covers 69 papers, of which approximately 46 propose concrete system architectures.

Architecture Taxonomy

Single-Agent Systems

Papers where a single LLM handles all reasoning and execution:

#02 PentestGPT V2 (EGATS-based tree search)
#03 CHECKMATE (Classical Planning+)
#10 Pentest-R1 (RL-tuned single agent)
#13 RapidPen (layered ReAct)
#15 PwnGPT (prompt-chaining for exploit generation)
#18 Pentest Copilot (step-chaining)
#20 HackSynth (Planner-Summarizer dual module)
#33 Mantis (defensive prompt injection)
#38 Getting Pwn'd by AI (SSH feedback loop)
#42 Hackphyr (fine-tuned 7B model)
#44 CTFAgent (RAG-augmented)
#45 Improving LLM RL CTF (RL-trained)
#48 CHAP (context relay across sessions)
#52 RedTeamLLM (ReAct-based)
#54 Cyber-Zero (trajectory-trained)
#55 LLMs as Hackers (hackingBuddyGPT)
#56 EnIGMA (SWE-agent with IATs)
#62 LLM Agents Exploit One-day (minimal ReAct)

Multi-Agent Systems

Papers with multiple specialized LLM agents:

#01 PentestGPT (Reasoning/Generation/Parsing)
#04 xOffense (5 components with MemAgent)
#05 AutoPenGPT (Decision/Expert/Analyser/Summarizer/Util)
#06 AutoAttacker (Summarizer/Planner/Navigator/Experience)
#07 AutoPT (FSM with Agent+Rule states)
#08 AutoPentester (5 agents + Repetition Identifier)
#09 VulnBot (phase-specialized agents)
#11 PentestAgent (Recon/Search/Planning/Execution)
#14 ARACNE (multi-LLM modules)
#21 Teams of LLM (HPTSA manager + experts)
#22 CAI (role-based swarm)
#41 BreachSeek (LangGraph supervisor)
#43 CRAKEN (ReWOO with Graph-RAG)
#46 CyberExplorer (reactive multi-agent)
#49 G-CTR (game-theoretic)
#57 MAPTA (Coordinator/Sandbox/Validation)
#61 AutoPentest (Planner/Supervisor/Workers)
#66 PenHeal (Planner/Executor/Instructor/Summarizer/Extractor)

Hierarchical (Planner-Executor) Systems

A specific multi-agent pattern that separates strategic planning from tactical execution:

#19 Cochise (Planner + Executor for AD)
#21 Teams of LLM (Manager + task-specific experts)
#60 Incalmo (LLM planner + domain expert agents)

Key Design Dimensions

Planning Strategy

Strategy	Papers
ReAct	#05, #06, #13, #14, #19, #22, #28, #29, #39, #42, #44, #47, #52, #55, #56, #57, #60, #62, #66
Tree-based (PTT, attack tree, MCTS)	#01, #02, #08, #19
Chain-of-thought	#08, #11, #18
Classical planning	#03, #31
Finite state machine	#07, #59
RL-based	#10, #24, #26, #27, #45
Hierarchical planning	#21, #41, #61
Task graph (DAG)	#04, #09
Game-theoretic	#25, #49
None/minimal	#15, #16, #20, #32, #33, #50, #53

Observation: ReAct dominates (~30% of systems), but more structured approaches (classical planning, FSM, task graphs) show promising results when applied.
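
For readers unfamiliar with the pattern, the ReAct loop that dominates the table above can be sketched roughly as follows. This is a minimal illustration and not the implementation of any surveyed system: `react_loop`, `fake_llm`, and `fake_shell` are hypothetical names, and the two stand-in functions replace a real model call and a real command executor.

```python
from typing import Callable


def react_loop(
    llm: Callable[[str], str],
    run: Callable[[str], str],
    goal: str,
    max_steps: int = 5,
) -> list[tuple[str, str]]:
    """Alternate model actions with environment observations."""
    transcript = f"Goal: {goal}\n"
    history: list[tuple[str, str]] = []
    for _ in range(max_steps):
        action = llm(transcript)           # the model proposes the next action
        if action == "DONE":               # the model decides the goal is met
            break
        observation = run(action)          # the environment executes the action
        history.append((action, observation))
        # Feed the result back so the next decision can use it.
        transcript += f"Action: {action}\nObservation: {observation}\n"
    return history


# Hypothetical stand-ins for a real model and a real shell.
def fake_llm(transcript: str) -> str:
    if "Observation: 22/tcp open" in transcript:
        return "DONE"
    return "scan target"


def fake_shell(command: str) -> str:
    return "22/tcp open" if command == "scan target" else "unknown command"


steps = react_loop(fake_llm, fake_shell, "enumerate open ports")
print(steps)
```

The loop carries no state beyond the growing transcript, which may help explain why the more structured planners in the table are reported to do well where long action chains are needed.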

Memory Mechanism

Mechanism	Papers
Conversation history only	#01, #10, #14, #19, #20, #42, #45, #47, #48, #50, #52, #54, #55, #62, #67
RAG (vector store)	#02, #04, #05, #06, #08, #09, #11, #13, #18, #29, #34, #43, #44, #60, #61, #66
Knowledge graph	#03, #31, #43
Dual memory (success/failure logs)	#59
Structured handoff (cross-session)	#48
Belief state (POMDP)	#24
None	#32, #33, #50

Observation: RAG is the most popular memory mechanism, but its effectiveness is debated. #43 CRAKEN found only 43.8% of retrieved documents meet relevance standards. Cross-session memory (#48 CHAP) is underexplored.
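
One way to picture the relevance problem is a gate that discards weakly matching documents before they reach the agent. The sketch below is an assumption-laden illustration: it uses simple token-overlap (Jaccard) similarity instead of the embedding search a real RAG pipeline would use, and `jaccard`, `filter_relevant`, and the 0.2 threshold are hypothetical choices, not taken from any surveyed paper.

```python
def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity between two strings, in [0, 1]."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def filter_relevant(query: str, docs: list[str], threshold: float = 0.2) -> list[str]:
    """Keep only documents scoring at or above the threshold, best first."""
    scored = [(jaccard(query, doc), doc) for doc in docs]
    return [doc for score, doc in sorted(scored, reverse=True) if score >= threshold]


docs = [
    "apache struts rce exploit steps",
    "wordpress plugin sql injection",
    "struts rce",
]
kept = filter_relevant("apache struts rce exploit", docs)
print(kept)
```

In this toy example the unrelated WordPress note is dropped, so it cannot misguide the agent, while the two Struts documents are kept in order of similarity.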

Automation Level

Level	Papers
Fully autonomous	#02, #03, #04, #06, #07, #08, #09, #10, #13, #14, #15, #19, #20, #21, #22, #41, #42, #45, #46, #49, #52, #54, #57, #60, #62, #66
Semi-autonomous	#05, #11, #61
Human-in-the-loop	#01, #16, #39, #40, #59, #67
Copilot	#18

Observation: The field is aggressively pursuing full autonomy, but the most successful real-world results (#02 PentestGPT V2) still use human-in-the-loop elements for difficult decisions.

Tool Integration Approaches

Shell command generation (most common): Agent generates bash/shell commands, system executes them (#01-#09, most systems)
Predefined action library: Curated tool wrappers with preconditions (#03 CHECKMATE, #36 ADAPT)
MCP/function calling: Structured tool interfaces (#57 MAPTA)
Interactive tools (IATs): Non-blocking parallel sessions for debuggers and network tools (#56 EnIGMA)
Expert agents: Domain-specific tool handlers (#60 Incalmo)

Architectural Patterns

Pattern 1: Pentesting Task Tree (PTT)

A tree-structured representation of testing state maintained across steps.

Used by: #01, #02, #08, #13, #19
Strength: Provides structured context for LLM reasoning
Weakness: Can become stale, vulnerable to LLM hallucination corrupting tree state
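
A task tree of this kind can be sketched as a small recursive data structure. The code below is an illustrative assumption about what such a tree might look like, not the representation used by PentestGPT or any other surveyed system; `TaskNode` and its status values are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TaskNode:
    name: str
    status: str = "todo"  # one of "todo", "done", "failed"
    children: list["TaskNode"] = field(default_factory=list)

    def add(self, name: str) -> "TaskNode":
        """Attach a new sub-task and return it."""
        child = TaskNode(name)
        self.children.append(child)
        return child

    def render(self, depth: int = 0) -> str:
        """Serialize the tree as indented text suitable for a prompt."""
        lines = [f"{'  ' * depth}- [{self.status}] {self.name}"]
        lines.extend(child.render(depth + 1) for child in self.children)
        return "\n".join(lines)

    def pending(self) -> list[str]:
        """Return the names of leaf tasks that are still to do."""
        found = [self.name] if self.status == "todo" and not self.children else []
        for child in self.children:
            found.extend(child.pending())
        return found


root = TaskNode("Test target host")
recon = root.add("Reconnaissance")
scan = recon.add("Port scan")
scan.status = "done"
recon.add("Web enumeration")
root.add("Exploitation")

print(root.render())
print(root.pending())
```

Because the tree is re-rendered into the prompt at every step, a single hallucinated status change persists in all later steps, which is the corruption risk noted above.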

Pattern 2: Phase Decomposition

Splitting pentesting into distinct phases (recon, scan, exploit, etc.) with specialized agents per phase.

Used by: #04, #05, #07, #09, #11, #58
Strength: Reduces complexity per agent, enables specialization
Weakness: Information loss at phase boundaries (#09 found 42.36% of failures due to context loss)

Pattern 3: RAG-Augmented Exploitation

Using retrieval to inject relevant exploit knowledge during testing.

Used by: #04, #05, #08, #09, #11, #39, #43, #44, #61
Strength: Supplements LLM knowledge gaps for specific CVEs/tools
Weakness: Quality issues (#43: 43.8% relevance), can misguide agents (#44)

Pattern 4: Planning-Execution Separation

Decoupling high-level strategy from low-level command generation.

Used by: #03, #19, #21, #60
Strength: #60 Incalmo showed this enables small LLMs to outperform large ones
Weakness: Information transfer between planner and executor is error-prone (#19)
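
The separation can be illustrated with a planner that emits abstract tasks and an executor that maps each task to a concrete command. This is a sketch under stated assumptions: `Task`, `plan`, `COMMAND_TEMPLATES`, and `to_command` are hypothetical names, the planner is a hard-coded stand-in for an LLM, and the address comes from the 192.0.2.0/24 documentation range.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    action: str  # abstract action name chosen by the planner
    target: str  # host the action applies to


def plan(goal_host: str) -> list[Task]:
    """Stand-in for an LLM planner: returns abstract tasks only."""
    return [Task("scan", goal_host), Task("enumerate_web", goal_host)]


# The executor owns all command syntax, so the planner never writes shell.
COMMAND_TEMPLATES = {
    "scan": "nmap -sV {target}",
    "enumerate_web": "curl -I http://{target}/",
}


def to_command(task: Task) -> str:
    """Translate an abstract task into a concrete command string."""
    template = COMMAND_TEMPLATES.get(task.action)
    if template is None:
        raise ValueError(f"no executor for action: {task.action}")
    return template.format(target=task.target)


commands = [to_command(task) for task in plan("192.0.2.10")]
print(commands)
```

The planner here can only choose among actions the executor knows, which narrows the room for malformed commands but also means any finding the executor fails to report back is invisible to the planner.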

Pattern 5: RL-Trained Agents

Using reinforcement learning to train or fine-tune agents for pentesting.

Used by: #10, #45, #54
Strength: Can learn from experience, improve over time
Weakness: Reward sparsity, limited transfer across domains

Evolution Timeline

2023: Human-Guided Assistants

PentestGPT (#01): First major LLM pentesting system, human-in-the-loop
Getting Pwn'd (#38): Early exploration of LLM for privilege escalation
LLM CTF Challenges (#67): Basic evaluation of ChatGPT on CTF tasks

2024: Autonomous Agent Explosion

AutoAttacker (#06), AutoPT (#07): Push toward full autonomy
Teams of LLM (#21): Multi-agent for zero-day exploitation
NYU CTF Bench (#28), AutoPenBench (#29): Benchmark establishment
LLM Exploit One-day (#62): Demonstrated GPT-4 can exploit real CVEs

2025: Architecture Sophistication

CHECKMATE (#03): Classical planning integration
xOffense (#04): Fine-tuned mid-scale model beats larger ones
VulnBot (#09): Open-source multi-agent with PTG
Pentest-R1 (#10): First RL-trained pentesting LLM
PentestEval (#58): Stage-level benchmark revealing exploitation is the bottleneck
Incalmo (#60): Proved abstractions matter more than model size
EnIGMA (#56): Interactive tools for CTF solving

2026: Maturation and Scaling

PentestGPT V2 (#02): Top-100 live HTB ranking, most comprehensive real-world validation
HackWorld (#53): GUI/visual agent evaluation
CHAP (#48): Long-running context management across sessions
Cyber-Zero (#54): Trajectory synthesis for training without runtime

Gaps & Opportunities

1. Post-Exploitation and Lateral Movement — Critical

Most systems focus exclusively on initial access (getting a shell or capturing a flag). Of the approximately 40 systems surveyed, only 3 address post-exploitation or lateral movement in multi-host networks: #06 AutoAttacker (post-breach automation), #19 Cochise (Active Directory lateral movement), and #60 Incalmo (multi-host pivoting via abstractions). Even among these, results are limited: #19 achieved only ~1.83 compromised accounts per run, and #60 succeeded by using high-level action abstractions rather than LLM-driven lateral movement. The Phase Decomposition pattern (Pattern 2 above) typically terminates after exploitation, with no agents specialized for post-exploitation phases. Real penetration tests spend the majority of time in post-exploitation, making this gap a fundamental barrier to practical deployment. See also gap T1 in Research Gaps document.

2. Multimodal Capabilities — Critical

No system in the survey effectively handles GUI-based testing, image analysis, or visual interaction with web applications. The sole evaluation of visual agents (#53 HackWorld) showed that Computer-Using Agents achieve below 12% success on web vulnerabilities through visual interaction, highlighting a massive capability gap. This limitation is explicitly noted as a gap by at least 6 papers (#01, #03, #04, #09, #10, #44), yet none attempt to address it architecturally. All 40+ systems in the Architecture Taxonomy above operate exclusively through text-based shell commands (Tool Integration approach 1), with zero systems incorporating visual perception into their tool integration stack. Given that web application testing represents the largest segment of the commercial pentesting market, this gap severely limits real-world applicability. See also gap T2 in Research Gaps document.

3. Cross-Session Memory — Critical

Of the 40 systems surveyed, only #48 CHAP explicitly addresses maintaining context across multiple testing sessions. The Memory Mechanism table above shows that 15 systems rely on conversation history alone, which is lost between sessions. Multiple papers report severe context degradation: #05 showed retention drops to 57.1% on complex tasks, #09 found 42.36% of failures attributable to session context loss, and #01 noted context degradation as a primary limitation. Real penetration tests span days or weeks, requiring persistent state management that no current architecture provides at scale. The only system operating across multi-day engagements (#02 PentestGPT V2) achieves this through human-maintained state rather than automated memory. This gap is particularly acute for the Hierarchical and Multi-Agent patterns, where coordination state across sessions adds further complexity. See also gap T3 in Research Gaps document.

4. Defense-Aware Agents — Important

No offensive system in the entire survey accounts for active defenders such as IDS/IPS, honeypots, or SOC analysts. This is starkly illustrated by two defensive papers: #32 CHeaT achieved 100% defense success against PentestGPT using simple countermeasures, and #33 Mantis demonstrated that prompt injection can halt or even reverse LLM-based attacks. At least 4 offensive papers (#19, #06, #07, #08) explicitly acknowledge "evaluation limited to systems without active defenses" as a limitation. Among all architecture patterns identified above, none incorporate evasion, stealth, or defender-awareness as a design consideration. This creates a fundamental disconnect between benchmark performance and real-world deployability, where defenders are always present. See also gap T5 in Research Gaps document.

5. Formal Verification of Agent Plans — Important

No system provides formal guarantees about agent behavior, scope adherence, or safety constraints. Classical planning (#03 CHECKMATE) represents the closest approach, demonstrating that formal action models improve consistency (100% vs. 75% for ReAct baselines), but it does not verify that generated plans stay within authorized scope. The FSM-based approaches (#07 AutoPT, #59 RefPentester) provide some structural guarantees about state transitions but not about the safety of individual actions. The POMDP formalization in #24 and the labeled transition system in #36 offer theoretical foundations, but neither has been applied to verify real agent behavior. Given that autonomous agents executing commands on live systems risk unintended damage (#20, #45), formal verification is a prerequisite for real-world deployment and regulatory acceptance. See also gaps S2 and Th1 in Research Gaps document.

6. Cost-Efficient Architectures — Important

Of the 26 fully autonomous systems listed in the Automation Level table, the vast majority depend on GPT-4 or GPT-4o, creating significant cost and latency barriers. However, two key results suggest this dependency is not fundamental: #04 xOffense demonstrated that a fine-tuned 32B model outperforms 405B models on pentesting tasks, and #60 Incalmo showed that good planning-execution abstractions allow Claude Haiku (a small model) to match or exceed larger models. Pattern 4 (Planning-Execution Separation) appears particularly promising for cost efficiency, as it allows the expensive reasoning to be concentrated in a small planner while cheaper models handle execution. Additionally, #42 Hackphyr showed that a fine-tuned 7B model can perform competitively on privilege escalation tasks, and #54 Cyber-Zero demonstrated trajectory synthesis can train small models without expensive runtime inference. Despite these encouraging results, no systematic study compares architecture-cost tradeoffs across the design space. See also PhD Topic 5 in Research Gaps document.

7. Learning from Failure — Emerging

The overwhelming majority of systems start each engagement from scratch with no accumulated experience. Of 40+ systems, only 3 implement any form of experience retention: #59 RefPentester maintains dual success/failure logs, #06 AutoAttacker includes an experience manager component, and #13 RapidPen uses success-case RAG. #48 CHAP's handoff protocols preserve context between sessions but do not generalize lessons learned to new targets. No system demonstrates cross-engagement transfer learning, where lessons from testing one network improve performance on another. This contrasts sharply with human pentesters, who accumulate expertise over years. The RAG-Augmented Exploitation pattern (Pattern 3) could theoretically support experience storage, but current implementations retrieve static knowledge rather than learned strategies. See also gap T4 in Research Gaps document.

8. Reliable Tool Usage and Error Recovery — Important

LLM agents exhibit alarmingly high rates of invalid command generation, directly undermining the practical value of autonomous pentesting. Quantitative evidence spans multiple systems: #19 Cochise reported 35.9% invalid commands overall with a 94% failure rate for hashcat specifically, #09 VulnBot recorded 19.70% failed tool errors, and #42 Hackphyr noted persistently high invalid action rates. Despite these consistent findings, no architecture pattern explicitly addresses command validation or error recovery as a first-class design concern. The Predefined Action Library approach (Tool Integration approach 2, used by #03 and #36) partially mitigates this by constraining the action space, but sacrifices flexibility. Interactive tools (#56 EnIGMA's IATs) offer a different mitigation by allowing agents to observe tool output in real time, but this has only been applied to CTF contexts. Systematic architectural solutions for command validation, graceful error handling, and adaptive retry strategies remain an open problem. See also gap T7 in Research Gaps document.
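
As a rough illustration of what a first-class validation step might look like, the sketch below checks a generated command before execution. It is an assumption, not a mechanism from any surveyed system: `ALLOWED_TOOLS` and `validate_command` are hypothetical, and a real validator would also need to check arguments and scope.

```python
import shlex

# Hypothetical allowlist of tools the agent may invoke.
ALLOWED_TOOLS = {"nmap", "curl", "hashcat"}


def validate_command(command: str) -> tuple[bool, str]:
    """Return (is_valid, reason) for a generated shell command."""
    try:
        tokens = shlex.split(command)  # raises ValueError on unbalanced quotes
    except ValueError as exc:
        return False, f"parse error: {exc}"
    if not tokens:
        return False, "empty command"
    if tokens[0] not in ALLOWED_TOOLS:
        return False, f"tool not allowed: {tokens[0]}"
    return True, "ok"


for candidate in ["nmap -sV 192.0.2.10", 'curl "http://192.0.2.10', "rm -rf /tmp/x", ""]:
    print(repr(candidate), validate_command(candidate))
```

The returned reason string could be fed back to the agent as an observation, giving it a chance to repair the command instead of repeating it.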

9. Standardized Inter-Agent Communication — Emerging

Among the 18 multi-agent systems listed in the Architecture Taxonomy, there is no standardized protocol for inter-agent communication or information handoff. Each system invents its own coordination mechanism: #04 xOffense uses a MemAgent, #05 AutoPenGPT chains agents sequentially, #21 HPTSA uses a manager-expert hierarchy, #22 CAI uses role-based swarms, and #41 BreachSeek uses LangGraph supervision. Pattern 2 (Phase Decomposition) is particularly affected, as #09 VulnBot found that 42.36% of failures stem from information loss at phase boundaries between agents. Without standardized interfaces, multi-agent systems cannot benefit from modular agent reuse or community-developed specialized agents. This mirrors the evolution of microservice architectures in software engineering, where standardized APIs enabled ecosystem growth.
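
To make the idea of a standardized handoff concrete, the following sketch defines one possible message schema with JSON serialization. No such standard exists in the surveyed work; the `Handoff` class and its field names are purely hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Handoff:
    sender: str
    receiver: str
    phase: str
    findings: list[str] = field(default_factory=list)
    open_questions: list[str] = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize with stable key order so messages are easy to diff."""
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, payload: str) -> "Handoff":
        """Parse a message, rejecting it if required fields are absent."""
        data = json.loads(payload)
        missing = {"sender", "receiver", "phase"} - data.keys()
        if missing:
            raise ValueError(f"missing fields: {sorted(missing)}")
        return cls(**data)


message = Handoff("recon", "exploit", "reconnaissance", ["22/tcp open"])
restored = Handoff.from_json(message.to_json())
print(restored)
```

An explicit schema makes information loss at a phase boundary detectable: a receiver can reject a handoff with missing fields instead of silently proceeding with partial context.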

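Agent Architectures & System Design

Overview

The field of LLM-based penetration testing evolved rapidly between 2023 and 2026, moving from simple chat-based assistants to sophisticated multi-agent systems with structured planning, memory, and tool integration. This analysis covers 69 papers, of which approximately 46 propose concrete system architectures.

Architecture Taxonomy

Single-Agent Systems

Papers where a single LLM handles all reasoning and execution:

#02 PentestGPT V2 (EGATS-based tree search)
#03 CHECKMATE (Classical Planning+)
#10 Pentest-R1 (RL-tuned single agent)
#13 RapidPen (layered ReAct)
#15 PwnGPT (prompt-chaining for exploit generation)
#18 Pentest Copilot (step-chaining)
#20 HackSynth (Planner-Summarizer dual module)
#33 Mantis (defensive prompt injection)
#38 Getting Pwn'd by AI (SSH feedback loop)
#42 Hackphyr (fine-tuned 7B model)
#44 CTFAgent (RAG-augmented)
#45 Improving LLM RL CTF (RL-trained)
#48 CHAP (context relay across sessions)
#52 RedTeamLLM (ReAct-based)
#54 Cyber-Zero (trajectory-trained)
#55 LLMs as Hackers (hackingBuddyGPT)
#56 EnIGMA (SWE-agent with IATs)
#62 LLM Agents Exploit One-day (minimal ReAct)

Multi-Agent Systems

Papers with multiple specialized LLM agents:

#01 PentestGPT (Reasoning/Generation/Parsing)
#04 xOffense (5 components with MemAgent)
#05 AutoPenGPT (Decision/Expert/Analyser/Summarizer/Util)
#06 AutoAttacker (Summarizer/Planner/Navigator/Experience)
#07 AutoPT (FSM with Agent+Rule states)
#08 AutoPentester (5 agents + Repetition Identifier)
#09 VulnBot (phase-specialized agents)
#11 PentestAgent (Recon/Search/Planning/Execution)
#14 ARACNE (multi-LLM modules)
#21 Teams of LLM (HPTSA manager + experts)
#22 CAI (role-based swarm)
#41 BreachSeek (LangGraph supervisor)
#43 CRAKEN (ReWOO with Graph-RAG)
#46 CyberExplorer (reactive multi-agent)
#49 G-CTR (game-theoretic)
#57 MAPTA (Coordinator/Sandbox/Validation)
#61 AutoPentest (Planner/Supervisor/Workers)
#66 PenHeal (Planner/Executor/Instructor/Summarizer/Extractor)

Hierarchical (Planner-Executor) Systems

A specific multi-agent pattern that separates strategic planning from tactical execution:

#19 Cochise (Planner + Executor for Active Directory)
#21 Teams of LLM (Manager + task-specific experts)
#60 Incalmo (LLM planner + domain expert agents)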

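Key Design Dimensions

Planning Strategy

Strategy	Papers
ReAct	#05, #06, #13, #14, #19, #22, #28, #29, #39, #42, #44, #47, #52, #55, #56, #57, #60, #62, #66
Tree-based (PTT, attack tree, MCTS)	#01, #02, #08, #19
Chain-of-thought	#08, #11, #18
Classical planning	#03, #31
Finite state machine	#07, #59
RL-based	#10, #24, #26, #27, #45
Hierarchical planning	#21, #41, #61
Task graph (DAG)	#04, #09
Game-theoretic	#25, #49
None/minimal	#15, #16, #20, #32, #33, #50, #53

Observation: ReAct dominates (about 30% of systems), but more structured approaches (classical planning, finite state machines, task graphs) show good results where they are applied.

Memory Mechanism

Mechanism	Papers
Conversation history only	#01, #10, #14, #19, #20, #42, #45, #47, #48, #50, #52, #54, #55, #62, #67
RAG (vector store)	#02, #04, #05, #06, #08, #09, #11, #13, #18, #29, #34, #43, #44, #60, #61, #66
Knowledge graph	#03, #31, #43
Dual memory (success/failure logs)	#59
Structured handoff (cross-session)	#48
Belief state (POMDP)	#24
None	#32, #33, #50

Observation: RAG is the most popular memory mechanism, but its effectiveness remains debated. #43 CRAKEN found that only 43.8% of retrieved documents meet relevance standards. Cross-session memory (#48 CHAP) remains underexplored.

Automation Level

Level	Papers
Fully autonomous	#02, #03, #04, #06, #07, #08, #09, #10, #13, #14, #15, #19, #20, #21, #22, #41, #42, #45, #46, #49, #52, #54, #57, #60, #62, #66
Semi-autonomous	#05, #11, #61
Human-in-the-loop	#01, #16, #39, #40, #59, #67
Copilot	#18

Observation: The field is actively pursuing full autonomy, but the most successful real-world result (#02 PentestGPT V2) still relies on a human in the loop for difficult decisions.

Tool Integration Approaches

Shell command generation (most common): Agent generates bash/shell commands, system executes them (#01-#09, most systems)
Predefined action library: Curated tool wrappers with preconditions (#03 CHECKMATE, #36 ADAPT)
MCP/function calling: Structured tool interfaces (#57 MAPTA)
Interactive tools (IATs): Non-blocking parallel sessions for debuggers and network tools (#56 EnIGMA)
Expert agents: Domain-specific tool handlers (#60 Incalmo)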

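Architectural Patterns

Pattern 1: Pentesting Task Tree (PTT)

A tree-structured representation of testing state maintained across steps.

Used by: #01, #02, #08, #13, #19
Strength: Provides structured context for LLM reasoning
Weakness: Can become stale; LLM hallucinations can corrupt the tree state

Pattern 2: Phase Decomposition

Splitting penetration testing into distinct phases (recon, scan, exploit, etc.) with a specialized agent for each phase.

Used by: #04, #05, #07, #09, #11, #58
Strength: Reduces complexity per agent, enables specialization
Weakness: Information loss at phase boundaries (#09 found 42.36% of failures were due to context loss)

Pattern 3: RAG-Augmented Exploitation

Using retrieval to inject relevant exploit knowledge during testing.

Used by: #04, #05, #08, #09, #11, #39, #43, #44, #61
Strength: Fills LLM knowledge gaps for specific CVEs and tools
Weakness: Quality issues (#43: 43.8% relevance), can misguide agents (#44)

Pattern 4: Planning-Execution Separation

Decoupling high-level strategy from low-level command generation.

Used by: #03, #19, #21, #60
Strength: #60 Incalmo showed this enables small LLMs to outperform large ones
Weakness: Information transfer between planner and executor is error-prone (#19)

Pattern 5: RL-Trained Agents

Using reinforcement learning to train or fine-tune penetration testing agents.

Used by: #10, #45, #54
Strength: Can learn from experience and improve over time
Weakness: Reward sparsity, limited transfer across domains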

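Evolution Timeline

2023: Human-Guided Assistants

PentestGPT (#01): First major LLM pentesting system, human-in-the-loop
Getting Pwn'd (#38): Early exploration of LLMs for privilege escalation
LLM CTF Challenges (#67): Basic evaluation of ChatGPT on CTF tasks

2024: Autonomous Agent Explosion

AutoAttacker (#06), AutoPT (#07): Push toward full autonomy
Teams of LLM (#21): Multi-agent system for zero-day exploitation
NYU CTF Bench (#28), AutoPenBench (#29): Benchmark establishment
LLM Exploit One-day (#62): Demonstrated that GPT-4 can exploit real CVEs

2025: Architecture Sophistication

CHECKMATE (#03): Classical planning integration
xOffense (#04): Fine-tuned mid-scale model beats larger ones
VulnBot (#09): Open-source multi-agent system with PTG
Pentest-R1 (#10): First RL-trained pentesting LLM
PentestEval (#58): Stage-level benchmark revealing that exploitation is the bottleneck
Incalmo (#60): Showed that abstractions matter more than model size
EnIGMA (#56): Interactive tools for CTF solving

2026: Maturation and Scaling

PentestGPT V2 (#02): Top-100 live HackTheBox ranking, most comprehensive real-world validation
HackWorld (#53): GUI/visual agent evaluation
CHAP (#48): Long-running context management across sessions
Cyber-Zero (#54): Trajectory synthesis for training without runtime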

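Gaps & Opportunities

1. Post-Exploitation and Lateral Movement: Critical

Most systems focus only on initial access (obtaining a shell or capturing a flag). Of the approximately 40 systems surveyed, only 3 address post-exploitation or lateral movement in multi-host networks: #06 AutoAttacker (post-breach automation), #19 Cochise (Active Directory lateral movement), and #60 Incalmo (multi-host pivoting via abstractions). Even among these, results are limited: #19 compromised only about 1.83 accounts per run, and #60 succeeded by relying on high-level action abstractions rather than LLM-driven lateral movement. The Phase Decomposition pattern (Pattern 2 above) typically terminates after exploitation, with no agents specialized for post-exploitation phases. Real penetration tests spend most of their time in post-exploitation, making this gap a fundamental barrier to practical deployment. See also gap T1 in the Research Gaps document.

2. Multimodal Capabilities: Critical

No system in the survey effectively handles GUI-based testing, image analysis, or visual interaction with web applications. The sole evaluation of visual agents (#53 HackWorld) showed that Computer-Using Agents achieve below 12% success at finding web vulnerabilities through visual interaction, revealing a large capability gap. At least 6 papers (#01, #03, #04, #09, #10, #44) explicitly note this gap, yet none attempts to address it architecturally. All 40+ systems in the Architecture Taxonomy above operate exclusively through text-based shell commands (Tool Integration approach 1), and no system incorporates visual perception into its tool integration stack. Given that web application testing represents the largest segment of the commercial pentesting market, this gap severely limits real-world applicability. See also gap T2 in the Research Gaps document.

3. Cross-Session Memory: Critical

Of the 40 systems surveyed, only #48 CHAP explicitly addresses maintaining context across multiple testing sessions. The Memory Mechanism table above shows that 15 systems rely on conversation history alone, which is lost between sessions. Multiple papers report severe context degradation: #05 showed retention dropping to 57.1% on complex tasks, #09 found 42.36% of failures attributable to session context loss, and #01 named context degradation as a primary limitation. Real penetration tests span days or even weeks and require persistent state management at scale, which no current architecture provides. The only system operating across multi-day engagements (#02 PentestGPT V2) achieves this through human-maintained state rather than automated memory. The gap is particularly acute for the Hierarchical and Multi-Agent patterns, where coordination state across sessions adds further complexity. See also gap T3 in the Research Gaps document.

4. Defense-Aware Agents: Important

No offensive system in the survey accounts for active defenders such as IDS/IPS, honeypots, or SOC analysts. Two defensive papers illustrate the problem starkly: #32 CHeaT achieved 100% defense success against PentestGPT using only simple countermeasures, and #33 Mantis demonstrated that prompt injection can halt or even reverse LLM-based attacks. At least 4 offensive papers (#19, #06, #07, #08) explicitly acknowledge "evaluation limited to systems without active defenses" as a limitation. None of the architecture patterns identified above incorporates evasion, stealth, or defender-awareness as a design consideration. This creates a fundamental disconnect between benchmark performance and real-world deployability, because defenders are always present in real environments. See also gap T5 in the Research Gaps document.

5. Formal Verification of Agent Plans: Important

No system provides formal guarantees about agent behavior, scope adherence, or safety constraints. Classical planning (#03 CHECKMATE) is the closest approach, demonstrating that formal action models improve consistency (100% vs. 75% for ReAct baselines), but it does not verify that generated plans stay within authorized scope. The FSM-based approaches (#07 AutoPT, #59 RefPentester) provide some structural guarantees about state transitions but none about the safety of individual actions. The POMDP formalization in #24 and the labeled transition system in #36 offer theoretical foundations, but neither has been applied to verify real agent behavior. Given that autonomous agents executing commands on live systems risk unintended damage (#20, #45), formal verification is a prerequisite for real-world deployment and regulatory acceptance. See also gaps S2 and Th1 in the Research Gaps document.

6. Cost-Efficient Architectures: Important

Of the 26 fully autonomous systems listed in the Automation Level table, the vast majority depend on GPT-4 or GPT-4o, creating significant cost and latency barriers. However, two key results suggest this dependency is not fundamental: #04 xOffense demonstrated that a fine-tuned 32B model can outperform 405B models on pentesting tasks, and #60 Incalmo showed that good planning-execution abstractions allow Claude Haiku (a small model) to match or exceed larger models. Pattern 4 (Planning-Execution Separation) appears particularly promising for cost efficiency, as it allows the expensive reasoning to be concentrated in a small planner while cheaper models handle execution. In addition, #42 Hackphyr showed that a fine-tuned 7B model can perform competitively on privilege escalation tasks, and #54 Cyber-Zero demonstrated that trajectory synthesis can train small models without expensive runtime inference. Despite these encouraging results, no systematic study compares architecture-cost tradeoffs across the design space. See also PhD Topic 5 in the Research Gaps document.

7. Learning from Failure: Emerging

The overwhelming majority of systems start each engagement from scratch with no accumulated experience. Of 40+ systems, only 3 implement any form of experience retention: #59 RefPentester maintains dual success/failure logs, #06 AutoAttacker includes an experience manager component, and #13 RapidPen uses success-case RAG. #48 CHAP's handoff protocols preserve context between sessions but do not generalize lessons learned to new targets. No system demonstrates cross-engagement transfer learning, where lessons from testing one network improve performance on another. This contrasts sharply with human pentesters, who accumulate expertise over years of practice. The RAG-Augmented Exploitation pattern (Pattern 3) could in theory support experience storage, but current implementations retrieve static knowledge rather than learned strategies. See also gap T4 in the Research Gaps document.

8. Reliable Tool Usage and Error Recovery: Important

LLM agents exhibit worryingly high rates of invalid command generation, directly undermining the practical value of autonomous pentesting. Quantitative evidence spans multiple systems: #19 Cochise reported 35.9% invalid commands overall, with a 94% failure rate for hashcat specifically; #09 VulnBot recorded a 19.70% tool call failure rate; and #42 Hackphyr also observed persistently high invalid action rates. Despite these consistent findings, no architecture pattern treats command validation or error recovery as a first-class design concern. The Predefined Action Library approach (Tool Integration approach 2, used by #03 and #36) partially mitigates the problem by constraining the action space, but sacrifices flexibility. Interactive tools (#56 EnIGMA's IATs) offer a different mitigation by allowing agents to observe tool output in real time, but this has only been applied in CTF contexts. Systematic architectural solutions for command validation, graceful error handling, and adaptive retry strategies remain an open problem. See also gap T7 in the Research Gaps document.

9. Standardized Inter-Agent Communication: Emerging

Among the 18 multi-agent systems listed in the Architecture Taxonomy, there is no standardized protocol for inter-agent communication or information handoff. Each system invents its own coordination mechanism: #04 xOffense uses a MemAgent, #05 AutoPenGPT chains agents sequentially, #21 HPTSA uses a manager-expert hierarchy, #22 CAI uses role-based swarms, and #41 BreachSeek uses LangGraph supervision. Pattern 2 (Phase Decomposition) is particularly affected, as #09 VulnBot found that 42.36% of failures stem from information loss at phase boundaries between agents. Without standardized interfaces, multi-agent systems cannot benefit from modular agent reuse or community-developed specialized agents. This mirrors the evolution of microservice architectures in software engineering, where standardized APIs enabled ecosystem growth.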

Evaluation & Benchmarking Landscape

Overview

Evaluation in LLM-based pentesting is fragmented. Most papers use different environments, metrics, and baselines, making cross-paper comparison nearly impossible. This analysis synthesizes evaluation practices across 69 papers.

Benchmark Inventory

CTF-Based Benchmarks

Benchmark	Type	Scale	Used By
NYU CTF Bench	CTF (6 categories)	200 challenges	#28, #54, #56, #65
AutoPenBench	Docker pentest tasks	33 tasks	#04, #10, #29
Cybench	Professional CTF	40 tasks (4 competitions)	#10, #54, #56, #63
InterCode-CTF	CTF framework	Variable	#10, #54, #56
XBOW	Web CTF	104 challenges	#02, #57
PentestEval	Stage-level	346 tasks, 12 scenarios	#58
picoCTF	Educational CTF	Variable	#45, #52
HackWorld	GUI-based web CTF	36 challenges	#53

Real-World / Simulated Environments

Environment	Type	Used By
HackTheBox	Real machines	#01, #02, #08, #13, #22, #59, #61
VulnHub	VM images	#05, #07, #29, #39, #40, #52
Vulhub (Docker)	Docker CVEs	#03, #04
Metasploitable	Intentionally vulnerable	#06, #41, #66
GOAD	Active Directory lab	#02, #19
MHBench	Multi-host OpenStack	#60
NetSecGame	Network simulation	#42
CyBORG	Cyber gym	#27

Observations

No single benchmark dominates: The most reused are NYU CTF Bench (4 papers), AutoPenBench (3), Cybench (4), and HackTheBox (7)
CTF vs. real pentest disconnect: CTF benchmarks dominate, but CTF ≠ real penetration testing. Only #02, #19, and #60 evaluate on realistic multi-host networks
Benchmark contamination risk: #20 HackSynth identified that static CTF flags can be memorized from training data; #39 noted VulnHub machines may be in LLM training data

Metrics Used

Metric	Papers Using It	Notes
Task/challenge completion rate	Nearly all	Most common but definition varies
Success rate (binary)	#01, #06, #07, #13, #21	Coarse-grained
Sub-task completion	#08, #29, #58	More fine-grained
Milestone-based progress	#03, #29	Best practice for partial credit
Cost ($)	#02, #03, #19, #57, #60	Critical but often unreported
Time/steps	#03, #08, #55	Important for practical deployment
Human interactions	#08	Measures autonomy

Critical Metric Gaps

Stealth/detection avoidance: No paper measures whether the agent avoids triggering IDS/IPS
False positive rate: Only #57 MAPTA requires PoC validation
Reproducibility: Most papers report single-run results; #29 AutoPenBench found high variance across runs
Coverage breadth: Few papers measure what percentage of the attack surface was explored

Attack Phase Coverage

Phase	Papers Covering It	Gap Assessment
Reconnaissance	#01, #04, #05, #07, #09, #11, #16, #48, #57, #61	Well covered
Scanning	#01, #04, #05, #07, #09, #10, #13, #16, #48	Well covered
Enumeration	#01, #04, #05, #07, #09, #10, #11, #13, #16, #48, #55	Well covered
Exploitation	Nearly all	Best covered, yet hardest to succeed at
Post-exploitation	#06, #42, #48, #58	Severely underexplored
Privilege escalation	#05, #10, #16, #48, #52, #55	Moderate, #55 is dedicated to this
Lateral movement	#06, #19, #42, #60	Severely underexplored
Reporting	#05, #16, #57, #61	Minimal coverage

Key finding: The later stages of penetration testing (post-exploitation, lateral movement) are dramatically underrepresented. Most systems stop after getting a shell or capturing a flag.

Model Comparison

Models Most Frequently Tested

Model	Papers	Typical Result
GPT-4 / GPT-4o	~30 papers	Consistently best performer
GPT-3.5	~15 papers	Significantly worse than GPT-4
Claude 3.5 Sonnet	#22, #53, #56, #58	Competitive with GPT-4, best on some benchmarks
Llama family	#04, #09, #10, #42, #45, #54, #55, #62	Generally much weaker than GPT-4
DeepSeek	#10, #54, #58	Competitive when fine-tuned
Qwen	#04, #19, #54, #58	Promising with fine-tuning
o1/o3 (reasoning)	#19, #23, #45	Better for complex reasoning tasks

Key Trends

GPT-4 dominance: In #62, only GPT-4 solved any one-day exploits (87%); all other models scored 0%
Fine-tuning closes the gap: #04 xOffense's fine-tuned Qwen3-32B beat GPT-4o and Llama-405B
Reasoning models help: #19 Cochise found o1+GPT-4o compromised ~5x more accounts than non-reasoning models
Model size isn't everything: #60 Incalmo showed Haiku 3.5 with good abstractions beats Sonnet 4 without them

Critical Gaps in Evaluation

1. No Standardized Benchmark Suite

Each paper uses different environments, metrics, and baselines. #37 Benchmarking Practices systematically documented this fragmentation.

2. CTF ≠ Real Pentesting

CTF challenges are isolated, have known solutions, and lack active defenders. Real penetration tests involve multi-host networks, ambiguous goals, and time pressure. Only #02, #19, #60 approach real-world conditions.

3. No Active Defender Evaluation

Zero papers evaluate agents against actively defended systems. #32 CHeaT and #33 Mantis show that even simple defenses can defeat current agents (CHeaT reports 100% defense success against PentestGPT), yet no offensive paper accounts for this.

4. Missing Reproducibility

Few papers release code (#04, #09, #22, #42, #48, #55, #56, #60, #61 are open-source)
Run-to-run variance is rarely reported
Environment setup instructions often insufficient

5. No Cost-Normalized Comparison

API costs vary dramatically ($0.68 per target in #03 to $96.20 total in #61). Without cost normalization, comparing system effectiveness is misleading.
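
One simple normalization is cost per successful task. The sketch below is illustrative only: `cost_per_success` is a hypothetical helper, and the dollar amounts and success counts are made-up numbers, not results from any surveyed paper.

```python
def cost_per_success(total_cost_usd: float, successes: int) -> float:
    """Average spend per solved task; infinite if nothing was solved."""
    if successes == 0:
        return float("inf")
    return total_cost_usd / successes


# Made-up figures for two hypothetical systems on the same task set.
system_a = cost_per_success(total_cost_usd=12.00, successes=8)
system_b = cost_per_success(total_cost_usd=3.00, successes=1)

print(f"System A: ${system_a:.2f} per success")
print(f"System B: ${system_b:.2f} per success")
```

In this toy comparison the system with the higher total bill is the cheaper one per solved task, which is the kind of reversal that raw totals hide.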

6. Lack of Attack/Defense Benchmarks

#37 noted all benchmarks are Jeopardy-style (solve isolated challenges). No benchmark models attack vs. defense dynamics, multi-step attack chains, or realistic network topologies.

Recommendations for Ideal Evaluation

Adopt AutoPenBench + PentestEval as minimum baselines for system papers
Report cost, time, and success rate together; all three matter
Measure partial progress using milestones, not just binary success
Include active defense scenarios; at minimum, evaluate against CHeaT/Mantis-style defenses
Test on multi-host networks like GOAD or MHBench, not just isolated CTF challenges
Report variance across multiple runs (at least 3-5)
Use held-out test sets to avoid benchmark contamination from training data
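
Several of these recommendations (reporting cost, time, and success together, milestone-based partial progress, and run-to-run variance) can be combined in one summary routine. The following is a minimal sketch under assumed names: the `Run` record and `summarize` function are hypothetical and do not correspond to any benchmark's actual harness.

```python
from dataclasses import dataclass
from statistics import mean, stdev


@dataclass
class Run:
    """One evaluation run on one target (hypothetical schema)."""
    milestones_reached: int
    milestones_total: int
    solved: bool
    cost_usd: float
    wall_seconds: float


def summarize(runs: list[Run]) -> dict[str, float]:
    """Report success, partial progress, variance, cost, and time together."""
    if len(runs) < 3:
        raise ValueError("report at least 3 runs so variance is meaningful")
    # Partial credit: fraction of milestones reached in each run.
    progress = [r.milestones_reached / r.milestones_total for r in runs]
    return {
        "runs": float(len(runs)),
        "success_rate": mean(1.0 if r.solved else 0.0 for r in runs),
        "mean_progress": mean(progress),
        "progress_stdev": stdev(progress),  # sample standard deviation
        "mean_cost_usd": mean(r.cost_usd for r in runs),
        "mean_wall_seconds": mean(r.wall_seconds for r in runs),
    }
```

Refusing to summarize fewer than three runs is a design choice of this sketch; it simply mirrors the "at least 3-5 runs" recommendation above.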

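Evaluation and Benchmarking Landscape

Overview

Evaluation methodology in LLM-based penetration testing is highly fragmented. Most papers use different environments, metrics, and baselines, which makes cross-paper comparison nearly impossible. This analysis synthesizes the evaluation practices of 69 papers.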

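Benchmark Inventory

CTF-Based Benchmarks

Benchmark	Type	Size	Used By
NYU CTF Bench	CTF (6 categories)	200 challenges	#28, #54, #56, #65
AutoPenBench	Docker pentesting tasks	33 tasks	#04, #10, #29
Cybench	Professional-level CTF	40 tasks (4 competitions)	#10, #54, #56, #63
InterCode-CTF	CTF framework	Variable	#10, #54, #56
XBOW	Web CTF	104 challenges	#02, #57
PentestEval	Stage-level evaluation	346 tasks, 12 scenarios	#58
picoCTF	Educational CTF	Variable	#45, #52
HackWorld	GUI-based web CTF	36 challenges	#53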

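Real/Simulated Environments

Environment	Type	Used By
HackTheBox	Real target machines	#01, #02, #08, #13, #22, #59, #61
VulnHub	VM images	#05, #07, #29, #39, #40, #52
Vulhub (Docker)	Docker CVE reproductions	#03, #04
Metasploitable	Intentionally vulnerable system	#06, #41, #66
GOAD	Active Directory lab	#02, #19
MHBench	OpenStack-based multi-host environment	#60
NetSecGame	Network simulation	#42
CyBORG	Cybersecurity training environment	#27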

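Observations

No single benchmark dominates: the most reused are HackTheBox (7 papers), VulnHub (6), NYU CTF Bench (4), Cybench (4), and AutoPenBench (3)
CTF and real pentesting are disconnected: CTF benchmarks dominate, but CTFs are not equivalent to real penetration tests. Only #02, #19, and #60 evaluate on realistic multi-host networks
Benchmark contamination risk: #20 HackSynth found that static CTF flags may have been memorized from training data; #39 notes that VulnHub machines may be present in LLM training data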

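Evaluation Metrics Used

Metric	Used By	Notes
Task/challenge completion rate	Nearly all papers	Most common, but definitions vary
Success rate (binary)	#01, #06, #07, #13, #21	Coarse-grained
Subtask completion	#08, #29, #58	Finer-grained
Milestone-based progress	#03, #29	Best practice for partial credit
Cost ($)	#02, #03, #19, #57, #60	Critical but often unreported
Time/steps	#03, #08, #55	Essential for practical deployment
Human interaction count	#08	Measures degree of autonomy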

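Key Metric Gaps

Stealth/detection evasion: no paper measures whether agents can avoid triggering IDS/IPS
False positive rate: only #57 MAPTA requires PoC validation
Reproducibility: most papers report single-run results; #29 AutoPenBench found high variance across runs
Coverage breadth: few papers measure what fraction of the attack surface was explored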

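Attack Phase Coverage

Phase	Papers Covering	Gap Assessment
Reconnaissance	#01, #04, #05, #07, #09, #11, #16, #48, #57, #61	Well covered
Scanning	#01, #04, #05, #07, #09, #10, #13, #16, #48	Well covered
Enumeration	#01, #04, #05, #07, #09, #10, #11, #13, #16, #48, #55	Well covered
Exploitation	Nearly all papers	Best covered, but also the hardest phase to succeed at
Post-exploitation	#06, #42, #48, #58	Severely underrepresented
Privilege escalation	#05, #10, #16, #48, #52, #55	Moderate coverage; #55 focuses specifically on this
Lateral movement	#06, #19, #42, #60	Severely underrepresented
Reporting	#05, #16, #57, #61	Minimal coverage

Key finding: The later stages of penetration testing (post-exploitation, lateral movement) are dramatically underrepresented. Most systems stop after getting a shell or capturing a flag.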

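Model Comparison

Models Most Frequently Tested

Model	Papers	Typical Result
GPT-4 / GPT-4o	~30 papers	Consistently best performer
GPT-3.5	~15 papers	Significantly worse than GPT-4
Claude 3.5 Sonnet	#22, #53, #56, #58	Competitive with GPT-4, best on some benchmarks
Llama family	#04, #09, #10, #42, #45, #54, #55, #62	Generally much weaker than GPT-4
DeepSeek	#10, #54, #58	Competitive when fine-tuned
Qwen	#04, #19, #54, #58	Promising with fine-tuning
o1/o3 (reasoning)	#19, #23, #45	Better for complex reasoning tasks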

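Key Trends

GPT-4 dominance: In #62, only GPT-4 solved any one-day exploits (87%); all other models scored 0%
Fine-tuning closes the gap: #04 xOffense's fine-tuned Qwen3-32B beat GPT-4o and Llama-405B
Reasoning models help: #19 Cochise found o1+GPT-4o compromised ~5x more accounts than non-reasoning models
Model size isn't everything: #60 Incalmo showed Haiku 3.5 with good abstractions beats Sonnet 4 without them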

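Critical Gaps in Evaluation

1. No Standardized Benchmark Suite

Each paper uses different environments, metrics, and baselines. #37 Benchmarking Practices systematically documented this fragmentation.

2. CTF ≠ Real Pentesting

CTF challenges are isolated, have known solutions, and lack active defenders. Real penetration tests involve multi-host networks, ambiguous goals, and time pressure. Only #02, #19, and #60 approach real-world conditions.

3. No Active Defender Evaluation

No paper evaluates agents against actively defended systems. #32 CHeaT and #33 Mantis show that simple defenses defeat all current agents, yet no offensive paper accounts for this.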

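4. Missing Reproducibility

Few papers release code; the open-source systems are #04, #09, #22, #42, #48, #55, #56, #60, and #61
Run-to-run variance is rarely reported
Environment setup instructions are often insufficient

5. No Cost-Normalized Comparison

API costs vary dramatically ($0.68 per target in #03 to $96.20 total in #61). Without cost normalization, comparing system effectiveness is misleading.

6. Lack of Attack/Defense Benchmarks

#37 noted that all benchmarks are Jeopardy-style (solve isolated challenges). No benchmark models attack vs. defense dynamics, multi-step attack chains, or realistic network topologies.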

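Recommendations for Ideal Evaluation

Adopt AutoPenBench + PentestEval as minimum baselines for system papers
Report cost, time, and success rate together; all three matter
Measure partial progress using milestones, not just binary success/failure
Include active defense scenarios; at minimum, evaluate against CHeaT/Mantis-style defenses
Test on multi-host networks like GOAD or MHBench, not just isolated CTF challenges
Report variance across multiple runs (at least 3-5)
Use held-out test sets to avoid benchmark contamination from training data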

Research Gaps & PhD Opportunities

Overview

After analyzing 69 papers on LLM-based penetration testing (2005-2026), this document synthesizes the major research gaps and proposes concrete PhD thesis directions. The field is in early adolescence: core capabilities are demonstrated but reliability, scalability, and real-world applicability remain far from solved.

Consolidated Research Gaps

Technical Gaps

T1. Post-Exploitation and Lateral Movement (Critical)

Gap: The overwhelming majority of systems stop after initial exploitation (getting a shell/flag). Only #06 AutoAttacker, #19 Cochise, #42 Hackphyr, and #60 Incalmo address post-exploitation or lateral movement. Evidence: #06 noted "no prior comprehensive study on LLM-driven post-breach attack automation"; #19 tested AD networks but achieved only ~1.83 accounts per run; #60 Incalmo succeeded in 37/40 multi-host scenarios but used high-level abstractions rather than LLM-driven lateral movement. Significance: Real penetration tests focus heavily on post-exploitation. Without this, LLM pentesting remains a toy problem.

T2. Multimodal Understanding (Critical)

Gap: No system effectively handles visual/GUI-based information. #53 HackWorld showed CUAs achieve below 12% on web vulnerabilities through visual interaction. Evidence: Noted as a gap by #01, #03, #04, #09, #10, #44. Modern web applications require browser interaction, CAPTCHA solving, and visual analysis that text-only agents cannot perform. Significance: Web application testing is the largest real-world pentesting market segment.

T3. Context Management Across Long Engagements (Critical)

Gap: Real penetration tests span days/weeks. Current systems operate within single sessions of minutes/hours. #48 CHAP is the only paper addressing cross-session context relay. Evidence: #01 noted context degradation; #05 showed retention drops to 57.1% on complex tasks; #09 found 42.36% of failures due to session context loss; #02 is the only system that maintains state across multi-day engagements. Significance: Without solving this, LLM agents are limited to "sprints" rather than realistic engagement timelines.

T4. Learning from Experience (Important)

Gap: Most systems don't accumulate knowledge across engagements. Each test starts from scratch. Evidence: #59 RefPentester's dual success/failure logs, #06 AutoAttacker's experience manager, and #13 RapidPen's success-case RAG are initial attempts. No system demonstrates cross-engagement transfer learning. Significance: Human pentesters improve over years of experience. Current LLM agents cannot.

T5. Defense-Aware Offensive Agents (Important)

Gap: No offensive system accounts for active defenders. #32 CHeaT achieved 100% defense success against PentestGPT; #33 Mantis demonstrated prompt injection can halt/reverse LLM attacks. Evidence: #19, #06, #07, #08 all note "evaluation limited to systems without active defenses" as a limitation. Significance: Any real deployment faces defenders. Current agents are trivially defeated by basic countermeasures.

T6. Binary Exploitation and Low-Level Attacks (Important)

Gap: LLMs fundamentally struggle with binary exploitation, heap manipulation, and low-level memory attacks. Evidence: #15 PwnGPT found LLMs cannot understand runtime memory state; #20 HackSynth found all LLMs completely fail at binary exploitation; #22 CAI showed weak performance in pwn (0.77x) and crypto (0.47x). Significance: Binary exploitation is a core pentesting skill that current LLM architectures may be fundamentally unsuited for.

T7. Reliable Tool Usage (Moderate)

Gap: LLMs frequently generate invalid commands with wrong parameters. Evidence: #19 Cochise found 35.9% invalid commands, with 94% hashcat failure rate; #09 VulnBot reported 19.70% failed tool errors; #42 Hackphyr noted high invalid action rates. Significance: Unreliable tool usage wastes time and can alert defenders.
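
One commonly suggested mitigation for invalid commands is to have the model fill in a typed request that is validated before any command line is built. The sketch below illustrates that idea for a port scan; `PortScanRequest` and its methods are hypothetical names, not the interface of any surveyed system. It only constructs an argument list (using nmap's `-p` port list and `-sV` version detection options) and does not execute anything.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class PortScanRequest:
    """Typed parameters an agent must supply instead of a free-form command string."""
    target: str
    ports: tuple[int, ...]
    detect_versions: bool = False

    def validate(self) -> None:
        # ip_address raises ValueError on a malformed address.
        ipaddress.ip_address(self.target)
        if not self.ports:
            raise ValueError("at least one port is required")
        for port in self.ports:
            if not 1 <= port <= 65535:
                raise ValueError(f"port out of range: {port}")

    def to_argv(self) -> list[str]:
        """Build the argument list only after validation succeeds."""
        self.validate()
        argv = ["nmap", "-p", ",".join(str(p) for p in self.ports)]
        if self.detect_versions:
            argv.append("-sV")
        argv.append(self.target)
        return argv
```

A malformed request fails with a `ValueError` before any tool runs, so the error can be fed back to the model rather than surfacing as a wasted tool invocation.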

Evaluation Gaps

E1. No Standardized Benchmark Suite (Critical)

Gap: Each paper uses different environments, metrics, and baselines. Cross-paper comparison is nearly impossible. Evidence: #37 Benchmarking Practices systematically documented this fragmentation. #29 AutoPenBench and #58 PentestEval are steps forward but not universally adopted. Significance: Without standardized evaluation, the field cannot measure progress.

E2. CTF ≠ Real Pentesting (Critical)

Gap: Most benchmarks use isolated CTF challenges or single-vulnerability Docker containers. These don't represent real network penetration tests. Evidence: Only #02, #19, and #60 evaluate on multi-host networks. #37 noted "multi-step attack chains in real-world networks are rarely modeled." Significance: Good CTF performance doesn't predict real-world pentesting ability.

E3. No Attack/Defense Evaluation (Important)

Gap: #37 noted all benchmarks are Jeopardy-style. No benchmark models adversarial dynamics. Evidence: #32 CHeaT and #33 Mantis are defensive papers that revealed offensive agent fragility, but no offensive paper tests against defenders. Significance: Adversarial evaluation is essential for understanding real-world deployability.

Application Domain Gaps

D1. Active Directory / Enterprise Network Testing (Critical)

Gap: Only #19 Cochise focuses on AD networks (#02 also evaluates on the GOAD lab). Enterprise networks with domain controllers, trust relationships, and GPOs are the primary target of real penetration tests. Evidence: #19 achieved limited results (1.83 accounts/run); #60 Incalmo tested multi-host but not AD-specific attacks. Significance: Enterprise AD pentesting is the highest-value commercial pentesting activity.

D2. Cloud and Container Security (Important)

Gap: No paper specifically addresses cloud-native pentesting (AWS/Azure/GCP misconfigurations, container escapes, serverless exploitation). Evidence: Noted as a gap by #10, #34. Cloud infrastructure is increasingly the attack surface. Significance: Growing market segment with unique challenges.

D3. API and Mobile Security Testing (Moderate)

Gap: No paper addresses API security testing or mobile application pentesting. Evidence: #02 notes "benchmark scope omits mobile security." Significance: API-first architectures and mobile apps represent a significant attack surface.

Safety & Ethics Gaps

S1. Dual-Use Risk Management (Critical)

Gap: The same tools used for defensive testing can be weaponized. No paper provides a satisfactory framework for managing this risk. Evidence: #23 noted "missing safeguards for distinguishing ethical from unethical use"; #34 raised "governance and guardrails" as a primary gap; #06 demonstrated 100% jailbreak success rate. Significance: This is both a technical and policy problem that affects publishability and deployment.

S2. Agent Containment and Safety (Important)

Gap: Autonomous agents executing commands on real systems can cause unintended damage. Evidence: #20 HackSynth identified that "current containment mechanisms are insufficient"; #45 noted agents with networking capabilities could cause unintentional DoS. Significance: Safety guarantees are prerequisite for real-world deployment.

Theoretical Gaps

Th1. Formal Model of Pentesting as an AI Problem (Important)

Gap: No consensus on the formal problem definition. Papers variously model pentesting as MDP, POMDP, Mealy machine, DAG, or ad-hoc. Evidence: #24 uses POMDP, #03 uses classical planning, #07 uses FSM, #36 uses labeled transition system. These formalizations are incompatible. Significance: A unified theoretical framework would enable principled comparison and combination of approaches.

Th2. When and Why LLMs Work for Pentesting (Moderate)

Gap: #23 hypothesizes pentesting is fundamentally pattern-matching, explaining LLM suitability, but this is untested. Evidence: #23 is the only paper that attempts to explain why LLMs work for pentesting. All others are empirical. Significance: Understanding the theoretical basis would guide architecture design.

Potential PhD Thesis Topics

⭐ Topic 1: Autonomous Multi-Host Network Penetration Testing with Persistent Memory

Research questions:

How can LLM agents maintain and transfer knowledge across multi-day, multi-host penetration testing engagements?
What abstractions enable efficient planning over large network topologies?
How should agents balance exploration vs. exploitation in unfamiliar networks?

Why it matters: Addresses gaps T1, T3, T4, D1. Real pentests involve multi-host networks and span days. Only #48 CHAP and #60 Incalmo touch this.

Why existing work is insufficient: The 60+ papers in this survey overwhelmingly target single-host, single-session scenarios that bear little resemblance to real penetration testing engagements. #60 Incalmo demonstrates multi-host orchestration but relies on high-level plan abstractions rather than genuine LLM-driven lateral movement, and it lacks any persistent memory across sessions. #48 CHAP is the sole paper addressing cross-session context relay, but it focuses narrowly on context summarization without tackling multi-host coordination or knowledge transfer across engagements.

Builds on: #02 (EGATS), #19 (AD testing), #48 (context relay), #60 (planning-execution decoupling)

Differentiation: No existing work combines persistent cross-session memory with multi-host lateral movement and AD-specific attack knowledge. Specifically: #60 Incalmo handles multi-host but not persistent memory; #48 CHAP handles context relay but only for single-host CTF tasks; #19 Cochise tests on AD networks but achieves only 1.83 accounts per run with no cross-session learning; #02 EGATS maintains multi-day state but does not address lateral movement or network-wide planning. This topic uniquely integrates all four dimensions.

Feasibility: High. GOAD provides free AD labs; context relay techniques exist; multi-agent frameworks are mature.

Topic 2: Defense-Aware Offensive AI — Adversarial Pentesting Against Active Defenders

Research questions:

How can LLM agents detect and evade honeypots, IDS/IPS, and deception-based defenses?
What is the equilibrium between LLM-based attackers and LLM-based defenders?
Can agents learn stealth strategies through adversarial training?

Why it matters: Addresses gaps T5, E3. No offensive paper accounts for defenders. #32 CHeaT showed current agents are trivially defeated.

Why existing work is insufficient: The entire offensive literature operates in a defender-free vacuum. #32 CHeaT demonstrated that a simple LLM-based defensive honeypot achieves 100% success at stopping PentestGPT, and #33 Mantis showed that prompt injection can completely halt or even reverse LLM-based attacks. Despite these alarming findings, no subsequent offensive paper has incorporated adversarial robustness into its design or evaluation.

Builds on: #32 (CHeaT defenses), #33 (Mantis), #50 (Honeypot detection), #02 (adversarial robustness gap)

Differentiation: First work to study the attacker-defender co-evolution in LLM-based pentesting. Unlike #32 CHeaT (defense-only perspective) and #33 Mantis (prompt injection defense), this topic would develop offensive agents that actively model and adapt to defensive countermeasures, creating a game-theoretic framework for studying equilibrium rather than one-sided evaluation.

Feasibility: Medium. Requires building both offensive and defensive agents. CHeaT provides a starting defensive framework.

Topic 3: Multimodal Penetration Testing Agents for Web Applications

Research questions:

Can vision-language models perform GUI-based web application testing (clicking, form filling, visual analysis)?
How should agents combine visual understanding with traditional tool-based testing?
What benchmark adequately evaluates multimodal web pentesting?

Why it matters: Addresses gaps T2, D3. #53 HackWorld showed CUAs achieve <12%. Web apps are the largest pentesting market.

Why existing work is insufficient: #53 HackWorld is the only paper evaluating vision-based interaction for web pentesting, and its results are discouraging (below 12% success). All other web pentesting systems (#07, #56) rely entirely on text-based tool output, ignoring the visual complexity of modern web applications. No paper has attempted a hybrid approach that combines visual understanding with traditional command-line tools, which is how human pentesters actually work.

Builds on: #53 (HackWorld benchmark), #07 (web pentesting), #56 (interactive tools)

Differentiation: No existing system combines vision-language models with traditional pentesting tools for web application security. #53 HackWorld evaluates pure CUA approaches but does not propose a hybrid architecture; #07 tests web applications but only through text-based interactions; #56 provides interactive tool environments but no visual capabilities. This topic would pioneer the multimodal fusion approach.

Feasibility: Medium. VLMs are rapidly improving. HackWorld provides an initial benchmark.

⭐ Topic 4: Deep Reinforcement Learning Meets LLMs for Autonomous Penetration Testing

Research questions:

Can DRL policies trained on network simulation transfer to real-world pentesting scenarios when combined with LLM reasoning?
What reward functions and curriculum designs are effective for multi-stage pentesting?
How can DRL address the reward sparsity problem in complex network environments while LLMs handle natural-language tool interaction?

Why it matters: Addresses gaps T4, Th1. #10 Pentest-R1 and #45 showed RL potential but only on CTF tasks. No work integrates DRL strategic planning with LLM tactical execution.

Why existing work is insufficient: The DRL-for-pentesting line of work (#45, #27 CyBORG) and the LLM-for-pentesting line of work (#01, #02, #60) have developed in complete isolation from each other. DRL approaches excel at strategic planning over network topologies but cannot handle natural-language tool interaction or adapt to novel vulnerability descriptions. LLM approaches excel at understanding and executing individual attack steps but lack the ability to learn optimal strategies through experience. No paper has attempted to bridge these two paradigms, leaving a completely unexplored design space.

Builds on: #10 (two-stage RL), #45 (curriculum RL), #54 (trajectory synthesis), #24 (POMDP formalization), #27 (CyBORG gym)

Differentiation: First to integrate DRL strategic planning with LLM tactical execution for pentesting. Unlike #10 Pentest-R1 (RL fine-tuning of LLM weights on CTF tasks), this topic uses DRL as a separate strategic planning layer that guides LLM-based execution agents, enabling the system to learn optimal network traversal policies while leveraging LLM flexibility for individual attack steps. Unlike #45 (pure RL in simulation), this bridges simulation-trained policies to real-world environments via LLM grounding. Unlike #24 (POMDP formalization without learning), this provides a concrete learning algorithm.

Feasibility: Medium-High. CyBORG and GOAD provide training environments. Pentest-R1 provides RL methodology. The DRL and LLM communities have separately matured the building blocks.

Topic 5: Efficient Small-Model Architectures for Pentesting Agents

Research questions:

What architectural abstractions allow small LLMs (7B-32B) to match or exceed large models for pentesting?
How should tool interfaces, planning modules, and memory be designed to compensate for model capacity?
What fine-tuning strategies are most effective for pentesting specialization?

Why it matters: Addresses gap T6 (indirectly) and builds on the finding from #04 and #60 that good abstractions matter more than model size. Enables private, offline deployment.

Why existing work is insufficient: Individual papers have demonstrated promising results with smaller models (#04 fine-tuned 32B beating 405B, #60 Incalmo achieving strong results with Haiku, #42 Hackphyr fine-tuning), but each operates as an isolated experiment on a different benchmark with different architecture choices. There is no systematic study of which architectural components (tool abstraction layers, planning modules, memory systems, fine-tuning data composition) contribute most to closing the gap between small and large models.

Builds on: #04 (fine-tuned 32B beats 405B), #42 (Hackphyr fine-tuning), #54 (trajectory synthesis), #60 (Incalmo with Haiku)

Differentiation: Systematic study of the architecture-model size tradeoff, going beyond single fine-tuning experiments. Unlike #04 (single model, single benchmark), this would conduct controlled ablation studies across multiple architectures, model sizes, and benchmarks. Unlike #42 Hackphyr (fine-tuning only), this includes non-parametric components (planning modules, tool abstraction, external memory). Unlike #60 Incalmo (fixed architecture, model comparison), this optimizes the full architecture for small-model constraints.

Feasibility: High. Open-source models, existing fine-tuning pipelines, existing benchmarks.

Topic 6: Unified Benchmark Suite for LLM-Based Penetration Testing

Research questions:

What evaluation dimensions are necessary and sufficient for comparing pentesting agents?
How should benchmarks model multi-step attack chains, active defenders, and realistic network topologies?
Can automated difficulty calibration replace expert manual grading?

Why it matters: Addresses gaps E1, E2, E3. The field desperately needs standardized evaluation.

Why existing work is insufficient: #37 Benchmarking Practices documented severe fragmentation: few papers share the same benchmark, metrics, or evaluation methodology. #29 AutoPenBench and #58 PentestEval are steps forward but each covers only a narrow slice (AutoPenBench focuses on single-host Docker tasks, PentestEval on knowledge assessment). No benchmark includes multi-host network scenarios, active defenders, or difficulty-calibrated progressive challenge sets that would enable meaningful cross-system comparison.

Builds on: #29 (AutoPenBench), #58 (PentestEval), #37 (benchmarking practices), #63 (Cybench), #53 (HackWorld)

Differentiation: First unified benchmark combining CTF challenges, multi-host networks, active defenses, and standardized metrics. Unlike #29 AutoPenBench (single-host Docker only), this includes network-level scenarios. Unlike #58 PentestEval (knowledge-focused), this measures end-to-end operational capability. Unlike #37 (meta-analysis only), this delivers a concrete benchmark artifact. Unlike #63 Cybench (CTF-only), this models realistic enterprise environments.

Feasibility: High. Docker-based environments exist. Challenge is coordination and community adoption.

Topic 7: Neurosymbolic Pentesting — Combining LLMs with Formal Planning

Research questions:

How can classical planning, knowledge graphs, and LLMs be optimally combined for pentesting?
Can formal methods provide safety guarantees for autonomous pentesting agents?
How should the action/state space be defined for pentesting as a planning problem?

Why it matters: Addresses gaps Th1, T7. #03 CHECKMATE showed classical planning improves consistency (100% vs 75%); #31 MulVAL showed Datalog-based reasoning works for attack graphs.

Why existing work is insufficient: Papers have explored individual formal approaches in isolation: #03 uses classical PDDL planning, #24 models pentesting as a POMDP, #07 uses finite state machines, #36 uses labeled transition systems, and #31 uses Datalog-based attack graphs. These formalizations are mutually incompatible and none has been systematically combined with LLM capabilities. #03 CHECKMATE comes closest but uses a rigid planning layer that cannot adapt to unexpected findings during execution.

Builds on: #03 (Classical Planning+), #24 (POMDP), #31 (MulVAL Datalog), #36 (formal transition system), #49 (game-theoretic)

Differentiation: Systematic framework combining symbolic reasoning for planning with LLMs for execution, providing formal guarantees. Unlike #03 CHECKMATE (fixed PDDL domain), this dynamically constructs and updates the planning domain from LLM observations. Unlike #24 (POMDP formalization without implementation), this provides a working system. Unlike #31 MulVAL (static attack graph analysis), this supports online replanning during active engagements.

Feasibility: Medium. Classical planning and knowledge graphs are well-understood. Integration with LLMs is the research challenge.

⭐ Topic 8: Cross-Engagement Learning and Knowledge Transfer for Pentesting Agents

Research questions:

How can pentesting agents accumulate and transfer tactical knowledge across independent engagements?
What knowledge representation enables generalization from past exploits to novel but structurally similar targets?
How should success and failure experiences be encoded, retrieved, and applied in new contexts?

Why it matters: Addresses gaps T3, T4. Human pentesters improve dramatically over years of experience, but current LLM agents start every engagement from scratch. Only #48 CHAP addresses cross-session context relay, and only 3 papers (#06, #13, #59) attempt any form of experience accumulation.

Why existing work is insufficient: #59 RefPentester maintains dual success/failure logs but only within a single engagement without cross-engagement transfer. #06 AutoAttacker's experience manager stores past actions but lacks structured generalization. #13 RapidPen's success-case RAG retrieves past solutions but cannot learn from failures or adapt strategies over time. #48 CHAP relays context across sessions but focuses on summarization rather than learning. None of these approaches demonstrates genuine transfer learning where performance on engagement N+1 measurably improves from experience on engagements 1 through N.

Builds on: #48 (CHAP context relay), #59 (RefPentester dual logs), #06 (experience manager), #13 (success-case RAG), #54 (trajectory synthesis)

Differentiation: First systematic study of cross-engagement transfer learning for pentesting agents. Unlike #59 RefPentester (within-engagement logs only), this builds a persistent knowledge base that grows across engagements. Unlike #06 AutoAttacker (raw action replay), this uses structured knowledge representations that support analogical reasoning. Unlike #13 RapidPen (success-only retrieval), this learns equally from failures. Unlike #48 CHAP (context summarization), this extracts generalizable tactical patterns rather than session-specific state.

Feasibility: High. RAG infrastructure is mature; trajectory datasets exist (#54); evaluation can be designed around measuring improvement curves across sequential engagements on diverse targets.
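
To make the idea of learning from both successes and failures concrete, here is a minimal sketch of an experience store that ranks techniques for a service by a smoothed success rate accumulated over past engagements. The `Experience` and `ExperienceStore` names, the exact-match retrieval on a service label, and the Laplace smoothing are all assumptions chosen for illustration; none of the cited systems is known to work this way.

```python
from dataclasses import dataclass, field


@dataclass
class Experience:
    """One recorded attempt from a past engagement (hypothetical schema)."""
    service: str        # free-form label such as "smb" or "ftp"
    technique: str
    succeeded: bool
    engagement_id: str


@dataclass
class ExperienceStore:
    records: list[Experience] = field(default_factory=list)

    def add(self, exp: Experience) -> None:
        self.records.append(exp)

    def rank_techniques(self, service: str) -> list[tuple[str, float]]:
        """Rank techniques for a service by smoothed success rate, best first."""
        stats: dict[str, list[int]] = {}
        for r in self.records:
            if r.service != service:
                continue
            wins_and_tries = stats.setdefault(r.technique, [0, 0])
            wins_and_tries[0] += int(r.succeeded)  # failures count as tries only
            wins_and_tries[1] += 1
        # Laplace smoothing, (wins + 1) / (tries + 2), keeps a single
        # observation from producing a score of exactly 0 or 1.
        scored = [(t, (w + 1) / (n + 2)) for t, (w, n) in stats.items()]
        return sorted(scored, key=lambda item: item[1], reverse=True)
```

Because failures lower a technique's score instead of being discarded, the ranking changes as engagements accumulate, which is the property an improvement-curve evaluation would need to measure.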

Recommended Research Agenda

The three starred topics above form a coherent, mutually reinforcing research program suitable for a PhD proposal. They address the three largest gaps in the field and build naturally on each other.

Foundation: Cross-Engagement Learning (Topic 8) serves as the methodological foundation. A pentesting agent that cannot learn from experience will always be limited to the knowledge frozen in its training data. By developing structured knowledge representations and transfer learning mechanisms, this work creates the substrate on which the other two directions depend. The key deliverable is an experience accumulation framework that demonstrably improves agent performance across sequential engagements.

Application: Multi-Host Network Pentesting (Topic 1) provides the real-world problem setting where cross-engagement learning becomes essential. Single-host CTF challenges are too simple to require persistent memory or accumulated experience, but multi-host network engagements spanning multiple sessions demand both. Topic 1 builds directly on Topic 8's knowledge transfer framework, applying it to the most commercially relevant and technically challenging pentesting scenario. The persistent memory architecture developed in Topic 8 becomes the backbone for maintaining state across multi-day, multi-host campaigns.

Scaling: DRL+LLM Integration (Topic 4) addresses the strategic planning layer that neither pure LLM approaches nor pure experience replay can provide. Once an agent has accumulated experience (Topic 8) across network-scale engagements (Topic 1), DRL provides the mechanism to distill that experience into optimal network traversal and exploitation policies. The DRL component learns macro-level strategy (which hosts to target, in what order, with what techniques) while the LLM component handles micro-level execution (interpreting tool output, crafting payloads, adapting to unexpected responses). Together, these three topics trace a path from "LLM agents that solve isolated CTF challenges" to "learning systems that conduct realistic, multi-session network penetration tests with improving performance over time."

Suggested sequencing: Begin with Topic 8 (Year 1-2), which produces immediately publishable results on experience accumulation and has the highest feasibility. Extend to Topic 1 (Year 2-3), applying the learning framework to multi-host scenarios. Integrate Topic 4 (Year 3-4) as the strategic planning layer that ties everything together. Each stage produces independent publications while building toward the unified thesis.

Field Maturity Assessment

Dimension	Maturity	Evidence
Initial access / single-host exploitation	Growing	Multiple systems achieve 40-90% on CTF/VulnHub
Multi-host / real network testing	Nascent	Only 3-4 papers attempt this
Post-exploitation / lateral movement	Embryonic	Almost no work beyond #06, #19, #42, #60
Evaluation standardization	Nascent	Benchmarks emerging but not adopted
Theoretical foundations	Embryonic	Multiple incompatible formalizations
Defense interaction	Embryonic	Defensive papers exist but no offensive paper accounts for them
Real-world deployment	Pre-nascent	#02 is the closest with live HTB ranking

Overall: The field is in an early growth phase. Core capabilities are demonstrated but the gap between CTF-solving and real-world pentesting remains vast. This makes it an excellent time for PhD research: the fundamental problems are identified but unsolved.

Cross-Cutting Themes

Abstractions beat model size: Multiple papers (#04, #60) show that good architectural design matters more than using the largest model.
Exploitation is the bottleneck: Across all papers, the exploitation phase has the lowest success rates. Recon and scanning are largely solved.
Context management is universal: Every system struggles with maintaining context over long interactions.
RAG is necessary but insufficient: RAG helps but retrieval quality is poor and can mislead agents.
Open-source models are catching up: Fine-tuned open-source models (#04, #10, #54) are increasingly competitive with GPT-4.

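Research Gaps & PhD Opportunities

Overview

After analyzing 69 papers on LLM-based penetration testing, this document synthesizes the major research gaps and proposes a concrete PhD research roadmap. The field is in an early growth phase: core capabilities have been demonstrated, but reliability, scalability, and real-world applicability remain far from solved.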

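Core Challenges of Penetration Testing for AI Agents

Before listing specific research gaps, it is worth answering a fundamental question: what is it about the penetration testing task itself that current AI agents handle poorly? Where exactly is the bottleneck?

Penetration Testing vs. Programming Tasks: Why Does the Same Agent Perform So Differently?

The evaluation in CHECKMATE #03 reveals an important phenomenon: Claude Code + Sonnet 4.5 performs well on programming tasks, yet in penetration testing it exhibits "exploratory and blind trial-and-error" behavior. The difference stems from structural differences between the two kinds of task:

Dimension	Programming tasks (SWE-agent, etc.)	Penetration testing
Feedback signal	Compiler errors and test cases: precise, immediate, binary (right/wrong)	Tool output: voluminous, noisy, meaning must be inferred
Success criterion	Tests pass = success (unambiguous)	No clear signal until a shell/flag is obtained
Action space	Editing code files (bounded, structured)	Hundreds of tools × parameter combinations (huge, open-ended)
Reversibility	`git reset` can fully roll back changes	Actions may be irreversible (account lockouts, triggered alerts)
Environment determinism	Code files are static	Target systems are dynamic (service timeouts, state changes)

The core difference is the quality of the feedback signal. A programming agent gets precise right/wrong feedback from the compiler or test framework after every step, which yields an efficient trial-and-error loop. A penetration testing agent that runs an nmap scan faces a large volume of port and service information and cannot tell which parts deserve deeper investigation; until it finally exploits a vulnerability, the agent does not know whether it is on the right path.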

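Two Failure Modes: Which Will Disappear as Models Improve, and Which Will Not?

After analyzing 28 LLM penetration testing systems, PentestGPT V2 #02 proposes the clearest failure taxonomy:

Type A failures (capability gaps): caused by missing tools, insufficient knowledge, or malformed commands.

Can be addressed through engineering (better tool interfaces, RAG knowledge augmentation, model fine-tuning)
Improve naturally as LLM capabilities advance
Examples: the 35.9% invalid commands in #19 Cochise and the 19.70% tool-call failures in #09 VulnBot
PentestGPT V2 largely eliminated this class of failure with 38 typed tool interfaces and RAG

Type B failures (complexity barrier): caused by architectural limitations in planning and state management.

Do not improve as LLM capabilities advance; this is an architecture problem, not a model problem
Root cause: the agent lacks real-time task difficulty assessment, so it over-invests in low-value attack paths and exhausts its context before completing the attack chain
PentestGPT V2 introduces Task Difficulty Assessment (TDA), which scores difficulty along four dimensions (task complexity, evidence confidence, context load, and historical success rate) and reduces the Type B failure rate from 58% to 27%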
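
To give a feel for how a four-dimension difficulty estimate might drive an abandon decision, the sketch below combines the four signals named above into a single score. This is not the actual TDA formula from PentestGPT V2, which the survey does not specify; the `BranchSignals` schema, the equal weights, and the 0.7 threshold are arbitrary assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class BranchSignals:
    """Per-branch signals, each normalized to [0, 1] (hypothetical schema)."""
    task_complexity: float      # 0 = trivial, 1 = very complex
    evidence_confidence: float  # 0 = no supporting evidence, 1 = strong evidence
    context_load: float         # fraction of the context window already used
    historical_success: float   # past success rate on similar tasks


def difficulty(s: BranchSignals) -> float:
    """Average the four signals so that higher always means harder."""
    for value in (s.task_complexity, s.evidence_confidence,
                  s.context_load, s.historical_success):
        if not 0.0 <= value <= 1.0:
            raise ValueError("all signals must be in [0, 1]")
    # Equal weights are an arbitrary placeholder. Confidence and past
    # success are inverted because high values make a branch easier.
    return (s.task_complexity
            + (1.0 - s.evidence_confidence)
            + s.context_load
            + (1.0 - s.historical_success)) / 4.0


def should_abandon(s: BranchSignals, threshold: float = 0.7) -> bool:
    """Abandon a branch once its estimated difficulty exceeds the threshold."""
    return difficulty(s) > threshold
```

In a real system the weights and threshold would presumably need calibration against logged engagements rather than being fixed by hand.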

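The significance of this taxonomy is that Type A failures are poor PhD research targets, since model progress and engineering will resolve them, whereas Type B failures are the real research opportunities that require architectural innovation.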

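Three Concrete Manifestations of Type B Failures

(a) Inefficient search strategy: agents do not know when to give up

The action space in penetration testing is huge and unstructured, so agents frequently get stuck on unproductive paths without being able to recognize this and back out.

Evidence: CyberExplorer #46 measured a "dead-end persistence ratio": agents spend 2.6x to 5.1x as much compute on paths that ultimately fail as on productive paths
Evidence: PentestGPT V2 #02 found that agents "over-commit to low-value branches and exhaust context before completing attack chains"
Evidence: CHECKMATE #03 found that Claude Code's testing process is "largely ad-hoc, showing exploratory and blind trial-and-error behavior"
Existing approaches:
- Task difficulty assessment: the TDA mechanism in PentestGPT V2 (Type B failure rate 58%→27%)
- Structured search space: CHECKMATE's classical planning uses a DAG to explicitly enumerate feasible actions, avoiding omissions and blind exploration (consistency 75%→100%, cost reduced by 53%)
- Action templating: the 38 typed tool interfaces in PentestGPT V2 and the predefined attack actions in CHECKMATE constrain the action space
Open problem: existing solutions have been validated only in single-host scenarios; in multi-host networks the search space explodes combinatorially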
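
A ratio of this kind can be computed from a per-branch log of effort. The sketch below assumes tokens as the compute proxy and a `Branch` record with a success flag; both are assumptions made here for illustration, and the exact definition used by CyberExplorer may differ.

```python
from typing import Iterable, NamedTuple


class Branch(NamedTuple):
    name: str             # hypothetical label for an explored attack path
    tokens_spent: int     # compute proxy (assumption: LLM tokens)
    led_to_success: bool  # whether the branch contributed to the final result


def dead_end_persistence_ratio(branches: Iterable[Branch]) -> float:
    """Compute spent on failed branches divided by compute on productive ones."""
    items = list(branches)
    dead = sum(b.tokens_spent for b in items if not b.led_to_success)
    productive = sum(b.tokens_spent for b in items if b.led_to_success)
    if productive == 0:
        raise ValueError("no productive branch: ratio is undefined")
    return dead / productive


# Illustrative numbers only, not taken from any paper.
log = [
    Branch("branch-a", 12_000, False),
    Branch("branch-b", 5_000, True),
    Branch("branch-c", 8_000, False),
]
print(dead_end_persistence_ratio(log))  # 4.0
```

A value above 1 means the agent spent more effort on paths that went nowhere than on the path that worked, which is the behavior the evidence above describes.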

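(b) Fragility of multi-step attack chains: per-step success rates compound into exponential decay

Penetration testing is a chain of dependent tasks (reconnaissance → scanning → enumeration → exploitation → post-exploitation), where each step depends on correct output from the previous one. The compounding of per-step success rates causes end-to-end success to drop sharply.

Evidence: PentestEval #58 shows that single-stage success rates peak at about 50%, while the end-to-end success rate is only 31%
Evidence: #39 Towards Automated PT found that performance degrades significantly with position in the task sequence (60% → 33%)
Evidence: #09 VulnBot found that 42.36% of failures stem from information lost in hand-offs between phases
Existing approaches:
- Intermediate milestone verification: the milestone-based evaluation in AutoPenBench #29 provides verification points in the middle of the chain
- Planning-execution separation: Incalmo #60 decouples high-level planning from low-level execution, so that on failure the planner switches paths instead of the executor blindly retrying. This allowed a small model (Haiku) to outperform a large one (Sonnet)
- Attack tree search: EGATS in PentestGPT V2 automatically backtracks to other branches of the attack tree when a chain breaks
Open problem: existing solutions all run within a single session; preserving attack-chain state across sessions or days is entirely unsolved (only #48 CHAP makes a preliminary attempt)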
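
The compounding effect is easy to see numerically. The sketch below assumes that every step must succeed and that steps fail independently, which is a simplification: real stages are correlated, and the measured figures cited above do not follow this model exactly. The per-stage rates are hypothetical.

```python
from math import prod
from typing import Sequence


def chain_success(step_probs: Sequence[float]) -> float:
    """End-to-end success when every step must succeed independently."""
    return prod(step_probs)


def max_chain_length(per_step: float, target: float) -> int:
    """Longest chain whose end-to-end success stays at or above `target`."""
    if not 0.0 < per_step < 1.0 or not 0.0 < target <= 1.0:
        raise ValueError("per_step must be in (0, 1) and target in (0, 1]")
    length, success = 0, 1.0
    while success * per_step >= target:
        success *= per_step
        length += 1
    return length


# Hypothetical rates for recon, scanning, enumeration,
# exploitation, and post-exploitation.
stages = [0.9, 0.85, 0.8, 0.6, 0.7]
print(round(chain_success(stages), 3))  # 0.257
print(max_chain_length(0.8, 0.3))       # 5
```

Even with every stage at 60% or better, the chain as a whole succeeds about one time in four, which is why mid-chain verification and backtracking matter more than marginal gains on any single step.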

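(c) Sparse feedback signals: agents cannot assess intermediate progress

Unlike programming tasks, where the compiler or test framework provides immediate feedback, penetration testing lacks a natural intermediate feedback mechanism. Until it finally obtains a shell, the agent cannot judge how much progress it has made.

Evidence: CHECKMATE #03 found that Claude Code exhibits blind trial-and-error behavior, which is essentially undirected exploration caused by the lack of intermediate feedback
Evidence: the evidence confidence dimension in PentestGPT V2 #02 reflects the agent's need to assess whether the evidence collected so far is sufficient to support the next decision
Existing approaches:
- Benchmark level: the milestone evaluation in AutoPenBench #29 and the stage-level ground truth in PentestEval #58 give researchers intermediate feedback
- Architecture level: CHECKMATE's classical planning provides structured feedback through precondition checks; whether an action's preconditions are satisfied is itself an intermediate signal
- PentestGPT V2's TDA uses evidence confidence as a proxy feedback signal
Open problem: these solutions depend on manually annotated milestones or predefined preconditions and do not generalize to unseen target environments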
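
Precondition checking as an intermediate signal can be sketched with a STRIPS-style action model. This is a generic classical-planning illustration, not CHECKMATE's code: the `Action` type and the abstract fact names are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class Action:
    name: str
    preconditions: FrozenSet[str]  # facts that must hold before the action
    effects: FrozenSet[str]        # facts that hold after it succeeds


def applicable(actions: Iterable[Action], facts: FrozenSet[str]) -> List[Action]:
    """Actions whose preconditions are all satisfied by the known facts."""
    return [a for a in actions if a.preconditions <= facts]


def missing(action: Action, facts: FrozenSet[str]) -> FrozenSet[str]:
    """Facts still needed before `action` can run: a progress signal."""
    return action.preconditions - facts


# Hypothetical, abstract action set for illustration.
scan = Action("scan", frozenset({"target_reachable"}), frozenset({"open_ports_known"}))
fingerprint = Action(
    "fingerprint", frozenset({"open_ports_known"}), frozenset({"service_versions_known"})
)
exploit = Action(
    "exploit",
    frozenset({"service_versions_known", "vulnerability_confirmed"}),
    frozenset({"shell"}),
)

facts = frozenset({"target_reachable"})
print([a.name for a in applicable([scan, fingerprint, exploit], facts)])  # ['scan']
print(sorted(missing(exploit, facts)))
```

The size of the `missing` set shrinks as the agent gathers facts, so it gives a coarse measure of distance to the goal long before a shell is obtained. The limitation noted above still applies: someone has to write the actions and their preconditions in advance.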

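Mapping Challenges to the Research Roadmap

Type B challenge	Stage 1 (benchmark)	Stage 2 (multi-host penetration)
(a) Inefficient search strategy	The benchmark provides a standardized comparison platform to quantify search efficiency	Multi-agent parallel scheduling explores different paths
(b) Attack chain fragility	Milestone evaluation provides verification points mid-chain	Persistent memory plus planning-execution separation maintains chain state
(c) Sparse feedback signals	Stage-level ground truth injects intermediate feedback	A knowledge graph provides a structured state representation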

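Summary of Research Gaps

Technical Gaps

T1. Post-Exploitation and Lateral Movement (Critical)

Gap: The vast majority of systems stop after initial exploitation (obtaining a shell/flag). Only #06 AutoAttacker, #19 Cochise, #42 Hackphyr, and #60 Incalmo address post-exploitation or lateral movement. Evidence: #06 notes that no comprehensive study of LLM-driven post-breach attack automation existed before it; #19 tested on an AD network but obtained only about 1.83 accounts per run; #60 Incalmo succeeded in 37 of 40 multi-host scenarios, but relied on high-level abstractions rather than LLM-driven lateral movement. Significance: Real penetration tests depend heavily on the post-exploitation phase. Without it, LLM penetration testing remains at the experimental stage.

T2. Multimodal Understanding (Critical)

Gap: No system handles visual/GUI information effectively. #53 HackWorld shows that computer-use agents (CUAs) achieve a success rate below 12% when testing web vulnerabilities through visual interaction. Evidence: #01, #03, #04, #09, #10, and #44 all list this as a research gap. Modern web applications require browser interaction, CAPTCHA solving, and visual analysis, which text-only agents cannot handle. Significance: Web application testing is the largest market segment in real-world penetration testing.

T3. Context Management in Long-Running Engagements (Critical)

Gap: Real penetration tests typically last days or weeks. Current systems run only in single sessions of minutes or hours. #48 CHAP is the only paper that addresses cross-session context transfer. Evidence: #01 identified the context degradation problem; #05 shows retention dropping to 57.1% on complex tasks; #09 found that 42.36% of failures stem from lost session context; #02 is the only system that maintains state across multi-day tasks. Significance: Without solving this, LLM agents can only do "sprint-style" testing and cannot fit real engagement timelines.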

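T4. Learning from Experience (Important)

Gap: Most systems do not accumulate knowledge across engagements. Every test starts from scratch. Evidence: The success/failure dual logs in #59 RefPentester, the experience manager in #06 AutoAttacker, and the success-case RAG in #13 RapidPen are preliminary attempts. No system has demonstrated transfer learning across tasks. Significance: Human penetration testers improve over years of experience. Current LLM agents cannot.

T5. Defense-Aware Attack Agents (Important)

Gap: No attack system accounts for the presence of an active defender. #32 CHeaT achieved a 100% defense success rate against PentestGPT; #33 Mantis demonstrated that prompt injection can block or reverse LLM attacks. Evidence: #19, #06, #07, and #08 all list evaluation being limited to systems without active defenses as a limitation. Significance: Any real deployment faces defenders. Current agents can be easily defeated by basic countermeasures.

T6. Binary Exploitation and Low-Level Attacks (Important)

Gap: LLMs have fundamental difficulty with binary exploitation, heap manipulation, and low-level memory attacks. Evidence: #15 PwnGPT found that LLMs cannot understand runtime memory state; #20 HackSynth found that all LLMs failed completely at binary exploitation; #22 CAI performs poorly on pwn (0.77x) and crypto (0.47x). Significance: Binary exploitation is a core penetration testing skill, and current LLM architectures may be fundamentally unsuited to it.

T7. Reliable Tool Use (Moderate)

Gap: LLMs frequently generate invalid commands with wrong parameters. Evidence: #19 Cochise found that 35.9% of commands were invalid, with a 94% failure rate for hashcat; #09 VulnBot reports 19.70% tool-call failures; #42 Hackphyr notes a high rate of invalid actions. Significance: Unreliable tool use wastes time and may expose the agent to defenders.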

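Evaluation Gaps

E1. Lack of a Standardized Benchmark Suite (Critical)

Gap: Papers use widely differing environments, metrics, and baselines, which makes cross-paper comparison nearly impossible. Evidence: #37 Benchmarking Practices systematically documents this fragmentation. #29 AutoPenBench and #58 PentestEval are steps forward but have not been widely adopted. Significance: Without standardized evaluation, the field cannot measure progress.

E2. CTF ≠ Real Penetration Testing (Critical)

Gap: Most benchmarks use isolated CTF challenges or single-vulnerability Docker containers, which do not represent real network penetration tests. Evidence: Only #02, #19, and #60 were evaluated on multi-host networks. #37 notes that multi-step attack chains in real networks are rarely modeled. Significance: Good CTF performance does not predict real penetration testing ability.

E3. Lack of Attack-Defense Adversarial Evaluation (Important)

Gap: #37 notes that all benchmarks are Jeopardy-style. No benchmark models adversarial dynamics. Evidence: #32 CHeaT and #33 Mantis are defensive papers that expose the vulnerability of attack agents, but no offensive paper tests against a defender. Significance: Adversarial evaluation is essential for understanding real deployment capability.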

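Application Domain Gaps

D1. Active Directory / Enterprise Network Testing (Critical)

Gap: Only #19 Cochise was tested on an AD network. Enterprise networks with domain controllers, trust relationships, and GPOs are the primary targets of real penetration tests. Evidence: #19 achieved limited results (1.83 accounts per run); #60 Incalmo tested multi-host networks but did not target AD-specific attacks. Significance: Enterprise AD penetration testing is the penetration testing activity with the highest commercial value.

D2. Cloud and Container Security (Important)

Gap: No paper specifically targets cloud-native penetration testing (AWS/Azure/GCP misconfigurations, container escapes, serverless exploitation). Evidence: #10 and #34 list this as a gap. Cloud infrastructure is increasingly a primary attack surface. Significance: This is a growing market with unique challenges.

D3. API and Mobile Security Testing (Moderate)

Gap: No paper addresses API security testing or mobile application penetration testing. Evidence: #02 notes that its benchmark scope does not include mobile security. Significance: API-first architectures and mobile applications represent significant attack surfaces.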

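Safety and Ethics Gaps

S1. Dual-Use Risk Management (Critical)

Gap: Tools built for defensive testing can equally be weaponized. No paper offers a satisfactory risk management framework. Evidence: #23 points to a lack of safeguards that distinguish ethical from unethical use; #34 lists governance and guardrails as a major gap; #06 demonstrated a 100% jailbreak success rate. Significance: This is both a technical and a policy problem, and it affects both the publishability of papers and system deployment.

S2. Agent Containment and Safety Guarantees (Important)

Gap: Autonomous agents that execute commands on real systems can cause unintended damage. Evidence: #20 HackSynth notes that current containment mechanisms are inadequate; #45 notes that network-capable agents can cause accidental DoS attacks. Significance: Safety guarantees are a prerequisite for real deployment.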

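Theoretical Gaps

Th1. A Formal Model of Penetration Testing as an AI Problem (Important)

Gap: There is no consensus on a formal problem definition. Papers variously model penetration testing as an MDP, a POMDP, a Mealy machine, a DAG, or an ad hoc scheme. Evidence: #24 uses a POMDP, #03 uses classical planning, #07 uses an FSM, and #36 uses a labeled transition system. These formalisms are mutually incompatible. Significance: A unified theoretical framework would enable systematic comparison and integration of approaches.

Th2. When and Why LLMs Are Suited to Penetration Testing (Moderate)

Gap: #23 hypothesizes that penetration testing is essentially pattern matching and uses this to explain LLM suitability, but the hypothesis is unverified. Evidence: #23 is the only paper that attempts to explain why LLMs are suited to penetration testing. All other papers are empirical. Significance: Understanding the theoretical foundations would guide architecture design.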

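PhD Research Roadmap

Based on a cross-analysis of the 17 research gaps above, this section proposes a three-stage progressive PhD research plan. The stages have a natural dependency: first build the evaluation infrastructure, then develop and validate systems on top of it, and finally introduce adversarial factors to study realistic scenarios. Each stage can independently yield a high-quality paper, and together they form a complete PhD thesis.

Stage 1: Unified benchmark suite (Year 1-2)    → Evaluation infrastructure
       ↓ provides a standardized evaluation platform for Stage 2
Stage 2: Multi-host penetration in an ideal environment (Year 2-3) → System capability validation
       ↓ provides the attacker baseline for Stage 3
Stage 3: Penetration testing in adversarial environments (Year 3-4)  → Realistic scenario research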

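Stage 1: Unified Benchmark Suite (Year 1-2)

Core question: How can we design an LLM penetration testing evaluation platform that includes multi-step attack chains, realistic network topologies, and standardized metrics?

Gaps addressed: E1 (no standardized benchmark), E2 (CTF ≠ real penetration testing), E3 (no attack-defense adversarial evaluation)

Shortcomings of existing work: #37 Benchmarking Practices systematically documents evaluation fragmentation: papers rarely share the same benchmark, metrics, or evaluation method. #29 AutoPenBench and #58 PentestEval are steps forward, but the former covers only single-host Docker tasks and the latter focuses on knowledge assessment. No benchmark includes multi-host network scenarios, active defenders, or a difficulty-calibrated progressive challenge set.

Technical approach:

Multi-tier benchmark architecture: reproducible evaluation environments built on Docker/Docker Compose
- Tier 1: single-host exploitation. Starts from the 33 Docker tasks of AutoPenBench #29 and expands to 100+ tasks covering web vulnerabilities, misconfigurations, known CVEs, and more
- Tier 2: multi-host network attack chains. Follows the OpenStack approach of MHBench #60 but uses lighter-weight Docker network topologies that simulate enterprise networks of 3-10 hosts
- Tier 3: adversarial scenarios. Reserves defender interfaces (IDS/IPS, honeypots), to be populated with concrete defense strategies in Stage 3
Unified metric system: consolidates the existing fragmented metrics into a standardized evaluation framework
- Task completion rate + milestone progress (following the milestone evaluation in #29, giving a fine-grained measure of partial completion)
- Stage-level success rate (following the stage decomposition in #58 PentestEval, exposing bottlenecks in each attack stage)
- Cost normalization (API spend per task, using the $0.68/task figure from #03 CHECKMATE as a reference)
- Time/step efficiency (total wall-clock time, number of LLM calls, number of tool calls)
Difficulty calibration: grades difficulty using human expert baselines and cross-validation across multiple systems, so that the benchmark is discriminative
Data sources: the task library is built from VulnHub (retired machines), HackTheBox (retired machines), and real CVE reproductions (Vulhub Docker images)
Benchmark validation: reproduce 3-5 existing systems (PentestGPT #01, CHECKMATE #03, Incalmo #60, etc.) on the new benchmark to verify its discriminative power and fairness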
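
The completion, milestone, and cost metrics listed above could be aggregated along the following lines. This is a minimal sketch: the `RunRecord` fields and the `summarize` function are hypothetical names introduced here, not part of AutoPenBench or any existing harness, and the sample numbers are made up.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass
class RunRecord:
    milestones_total: int    # milestones defined for the task
    milestones_reached: int  # milestones the agent actually hit
    completed: bool          # final objective achieved
    api_cost_usd: float      # API spend for this run


def summarize(runs: Sequence[RunRecord]) -> Dict[str, float]:
    """Aggregate per-run records into benchmark-level metrics."""
    if not runs:
        raise ValueError("at least one run is required")
    n = len(runs)
    return {
        "completion_rate": sum(r.completed for r in runs) / n,
        "mean_milestone_progress": sum(
            r.milestones_reached / r.milestones_total for r in runs
        ) / n,
        "cost_per_task_usd": sum(r.api_cost_usd for r in runs) / n,
    }


# Illustrative data only.
runs = [
    RunRecord(4, 4, True, 0.50),
    RunRecord(4, 2, False, 1.00),
    RunRecord(5, 0, False, 0.30),
]
report = summarize(runs)
print(report)
```

Reporting milestone progress next to the completion rate separates an agent that fails at the last step from one that never gets past reconnaissance, which a binary pass/fail score cannot do.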

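Foundational work: #29 (AutoPenBench), #58 (PentestEval), #37 (meta-analysis of benchmarking practices), #63 (Cybench), #53 (HackWorld)

Expected deliverables: open-source benchmark suite (Docker environments + evaluation scripts + leaderboard) + benchmark design paper

Link to the next stage: The benchmark platform produced in Stage 1 serves directly as the standardized evaluation environment for the Stage 2 system. The Tier 2 multi-host scenarios provide the testbed for Stage 2's lateral movement research.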

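Stage 2: A Multi-Host Penetration Testing Agent in an Ideal Environment (Year 2-3)

Core question: How can an LLM agent achieve end-to-end penetration testing, including post-exploitation and lateral movement, in a multi-host enterprise network with no defender?

Gaps addressed: T1 (post-exploitation and lateral movement), T3 (context management in long-running engagements), D1 (AD/enterprise networks)

Shortcomings of existing work: The vast majority of the 60+ papers target single-host, single-session scenarios, far from real penetration testing. #60 Incalmo demonstrates multi-host orchestration but relies on high-level abstractions rather than LLM-driven lateral movement, and lacks persistent cross-session memory. #48 CHAP is the only paper that studies cross-session context transfer, but only for single-host CTF tasks. #19 Cochise was tested on an AD network but compromised only about 1.83 accounts per run, with no cross-session learning.

Technical approach:

Agent architecture: hierarchical planning-execution separation (following the core finding of #60 Incalmo that a small model with good abstractions beats a large model used bare)
- High-level planner: responsible for network topology reasoning, attack path selection, and resource allocation. Maintains a global attack graph (drawing on the attack graph idea of #31 MulVAL)
- Low-level executor: responsible for concrete attack steps on a single host (reconnaissance, scanning, exploitation, privilege escalation). Can reuse mature existing approaches (ReAct loop + tool calls)
- Persistent memory module: connects the high and low levels and maintains knowledge state across hosts and sessions
Three-tier persistent memory design:
- Short-term memory (within a session): the current session's dialogue history and tool output, using the context compression strategy of #48 CHAP to cope with context window limits
- Mid-term memory (across sessions): a structured knowledge graph storing discovered host information, open ports, credentials, network topology, and attack progress. Follows the knowledge graph approach of #03 CHECKMATE
- Long-term memory (across tasks): an experience base of successful and failed strategies. Follows the dual-log mechanism of #59 RefPentester and the experience manager of #06 AutoAttacker to enable strategy transfer across tasks
Lateral movement strategies:
- AD-specific attack techniques: Kerberoasting, Pass-the-Hash, DCSync, Golden Ticket (drawing on the AD testing experience of #19 Cochise)
- Pivoting: establishing tunnels through compromised hosts to reach other internal targets
- Credential reuse and privilege escalation chains: automatically identifying where credentials may be reused across hosts
Experimental environment: GOAD (a free AD lab with multiple domain controllers and workstations) + the Tier 2 multi-host benchmark built in Stage 1
Evaluation plan: standardized evaluation on the Stage 1 benchmark, compared against existing systems (PentestGPT, Incalmo, Cochise). Key measures: number of hosts compromised, lateral movement success rate, and cross-session knowledge retention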
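
The three memory tiers can be pictured as three containers with different lifetimes. The sketch below is a deliberately simplified, in-memory illustration: the class and method names are hypothetical, a plain dictionary stands in for the knowledge graph, and persistence to disk between sessions is left out.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict, List, Optional, Set, Tuple


@dataclass
class HostRecord:
    address: str
    open_ports: Set[int] = field(default_factory=set)
    compromised: bool = False


class AgentMemory:
    def __init__(self, short_term_limit: int = 50) -> None:
        # Short-term: bounded window of recent observations, cleared per session.
        self.short_term: Deque[str] = deque(maxlen=short_term_limit)
        # Mid-term: per-engagement facts, survives across sessions.
        self.hosts: Dict[str, HostRecord] = {}
        # Long-term: (technique, succeeded) outcomes, survives across tasks.
        self.experience: List[Tuple[str, bool]] = []

    def observe(self, text: str) -> None:
        self.short_term.append(text)

    def host(self, address: str) -> HostRecord:
        return self.hosts.setdefault(address, HostRecord(address))

    def record_outcome(self, technique: str, succeeded: bool) -> None:
        self.experience.append((technique, succeeded))

    def success_rate(self, technique: str) -> Optional[float]:
        outcomes = [ok for name, ok in self.experience if name == technique]
        return sum(outcomes) / len(outcomes) if outcomes else None

    def end_session(self) -> None:
        """Drop session context; keep host facts and experience."""
        self.short_term.clear()


mem = AgentMemory(short_term_limit=2)
for line in ("obs 1", "obs 2", "obs 3"):
    mem.observe(line)
mem.host("10.0.0.5").open_ports.add(445)
mem.record_outcome("technique-x", True)
mem.record_outcome("technique-x", False)
mem.end_session()
print(len(mem.short_term), sorted(mem.hosts), mem.success_rate("technique-x"))
```

Keeping the tiers separate makes the retention policy explicit: ending a session discards only the raw transcript, while discovered facts and outcome statistics remain available to the planner.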

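Foundational work: #02 (EGATS multi-day testing), #19 (AD lateral movement), #48 (cross-session context), #60 (planning-execution separation), #03 (knowledge graph), #59 (experience dual logs)

Expected deliverables: open-source multi-host penetration agent system + system paper (with comprehensive evaluation on the unified benchmark)

Link to the next stage: The performance baseline that Stage 2 establishes under ideal conditions (no IDS/IPS, no SOC) provides the reference for comparison once Stage 3 introduces defenders. The Stage 2 agent architecture serves as the attacker foundation for Stage 3, with a defense-awareness module layered on top.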

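Stage 3: Penetration Testing in Adversarial Environments (Year 3-4)

Core question: How can an LLM attack agent perceive and respond to active defenders? Where is the equilibrium between offensive and defensive AI?

Gaps addressed: T5 (countering active defenders), E3 (no attack-defense adversarial evaluation)

Shortcomings of existing work: The entire offensive literature operates in a "defender-free vacuum". #32 CHeaT shows that a simple LLM honeypot defense achieves a 100% defense success rate; #33 Mantis shows that prompt injection can completely block or even reverse LLM attacks. Yet no subsequent offensive paper has incorporated adversarial robustness into its design or evaluation. At least 4 offensive papers (#19, #06, #07, #08) explicitly acknowledge that their evaluation is limited to systems without active defenses.

Technical approach:

Defense-awareness module, added on top of the Stage 2 agent architecture
- Honeypot detection: identifying anomalous response patterns (response latency, inconsistent service fingerprints, unusual interaction behavior), drawing on the detection-signal analysis in #50 LLM Agent Honeypot
- IDS evasion: traffic pattern adjustment (scan rate control, randomized port scan order), randomized timing intervals, protocol obfuscation
- Prompt injection defense: detecting and filtering prompt injection attacks originating from the target system (countering #33 Mantis-style defenses) by adding a safety filtering layer between tool output and LLM input
Adversarial training framework:
- Building an LLM defender: following the honeypot + prompt injection defense framework of #32 CHeaT, implement configurable defense strategies (passive monitoring, active deception, prompt injection counterattack)
- Alternating attack-defense training: the attacker learns to evade the current defense → the defender updates its strategy to counter the new attacks → iterate. Evaluate convergence and equilibrium states
- Game-theoretic analysis: model the attack-defense interaction as a game of incomplete information and analyze its Nash equilibria. Draws on the game-theoretic methods of #25 ASAP and #49 G-CTR
Stealth metrics (extending the Stage 1 metric system):
- IDS alert trigger rate (number of Suricata/Snort rule hits)
- Attack completion time vs. time to detection ("survival time")
- Number of attacker behavior patterns observed by the defender
- Two-sided metric: attack success rate vs. defense success rate
Experimental environment: Tier 3 of the Stage 1 benchmark + deployed Suricata/Snort IDS + CHeaT honeypot + Wazuh SIEM
Evaluation plan: two-sided evaluation on the attack-defense benchmark. Compare attack success rates with and without the defense-awareness module; compare the attack-defense equilibrium under different combinations of defense strategies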
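
To make the notion of an attack-defense equilibrium concrete, the sketch below solves a 2x2 zero-sum matrix game in closed form. It is a strong simplification of the incomplete-information model proposed above: both players are assumed to know the full payoff matrix, each has only two strategies, and the payoff values (attacker success probabilities) are hypothetical.

```python
from typing import Sequence, Tuple


def solve_2x2_zero_sum(m: Sequence[Sequence[float]]) -> Tuple[float, float, float]:
    """Solve a 2x2 zero-sum game where m[i][j] is the row player's payoff.

    Returns (p, q, value): p is the probability the row player picks row 0,
    q the probability the column player picks column 0, value the game value.
    """
    (a, b), (c, d) = m
    row_mins = [min(a, b), min(c, d)]
    col_maxs = [max(a, c), max(b, d)]
    maximin, minimax = max(row_mins), min(col_maxs)
    if maximin == minimax:
        # Saddle point: a pure-strategy equilibrium exists.
        p = 1.0 if row_mins[0] == maximin else 0.0
        q = 1.0 if col_maxs[0] == minimax else 0.0
        return p, q, maximin
    # No saddle point: mixed equilibrium from the indifference conditions.
    denom = a - b - c + d
    return (d - c) / denom, (d - b) / denom, (a * d - b * c) / denom


# Rows: attacker strategy A or B. Columns: defender strategy X or Y.
# Entries: hypothetical probability that the attack succeeds.
payoff = [[0.3, 0.6],
          [0.7, 0.2]]
p, q, value = solve_2x2_zero_sum(payoff)
print(round(p, 3), round(q, 3), round(value, 3))  # 0.625 0.5 0.45
```

In this toy game neither side has a single best strategy: the attacker mixes so that the defender's choice does not change the outcome, and the resulting value is the attack success rate that neither player can improve unilaterally. Empirical success rates measured on the adversarial benchmark could populate such a matrix, though realistic settings need larger strategy sets and partial observability.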

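Foundational work: #32 (CHeaT honeypot defense), #33 (Mantis prompt injection defense), #50 (honeypot detection), #25 (ASAP game theory), #49 (G-CTR game theory)

Expected deliverables: defense-aware penetration agent + attack-defense adversarial evaluation framework + game-theoretic analysis paper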

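Overall Relationship Among the Three Stages

	Stage 1: Unified benchmark	Stage 2: Ideal-environment penetration	Stage 3: Adversarial-environment penetration
Timeline	Year 1-2	Year 2-3	Year 3-4
Gaps addressed	E1 + E2 + E3	T1 + T3 + D1	T5 + E3
Core contribution	Evaluation infrastructure	System capability breakthrough	Adaptation to realistic scenarios
Key deliverable	Open-source benchmark suite	Multi-host penetration agent	Attack-defense framework
Support for next stage	Provides the evaluation platform	Provides the attacker baseline	N/A

Together the three stages cover 7 of the field's 17 research gaps (E1, E2, E3, T1, T3, T5, and D1, with E3 addressed in both Stage 1 and Stage 3), forming a complete research chain from evaluation to system to adversarial setting.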

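Field Maturity Assessment

Dimension	Maturity	Evidence
Initial access / single-host exploitation	Growth	Multiple systems reach 40-90% success on CTF/VulnHub
Multi-host / real-network testing	Nascent	Only 3-4 papers attempt this direction
Post-exploitation / lateral movement	Embryonic	Almost no work beyond #06, #19, and #60
Evaluation standardization	Nascent	Benchmarks are emerging but not yet adopted
Theoretical foundations	Embryonic	Multiple mutually incompatible formalisms
Defense interaction	Embryonic	Defensive papers exist, but no offensive paper considers defenders
Real-world deployment	Pre-nascent	#02 comes closest, with live HTB rankings

Overall assessment: The field is in an early growth stage. Core capabilities have been demonstrated, but the gap between CTF solving and real penetration testing remains large. This makes now an excellent time for PhD research: the fundamental problems have been identified but not yet solved.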

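Cross-Cutting Themes

Abstractions beat model size: Multiple papers (#04, #60) show that good architectural design matters more than using the largest model.
Exploitation is the bottleneck: Across all papers, the exploitation phase has the lowest success rates, while reconnaissance and scanning are largely solved.
Context management is a universal challenge: Every system struggles to maintain context over long interactions.
RAG is necessary but insufficient: RAG helps, but poor retrieval quality can mislead agents.
Open-source models are catching up: Fine-tuned open-source models (#04, #10, #54) are increasingly competitive with GPT-4.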