#39

Towards Automated Penetration Testing: Introducing LLM Benchmark, Analysis, and Improvements

Isamu Isozaki, Manil Shrestha, Rick Console, Edward Kim

2025 | arXiv (preprint)

arXiv:2410.17141

Problem & Motivation

There is no comprehensive, open, end-to-end penetration testing benchmark to evaluate and drive progress in LLM-based automated penetration testing. Existing tools such as PentestGPT rely heavily on human participation, and neither the degree of human involvement nor the specific challenges LLMs face at each pentest stage is well understood.

Cybercrime losses totaled $12.5 billion in 2023 (a 20% increase from 2022), making automated penetration testing increasingly important. While LLMs show promise for pentest automation, there is no standardized public benchmark to evaluate their capabilities across the full penetration testing pipeline. Without identifying where LLMs struggle (enumeration, exploitation, or privilege escalation), it is hard to gauge the magnitude of subsequent improvements.

Threat Model

The LLM agent operates with human-in-the-loop assistance following strict rules to minimize bias. The agent targets locally-hosted VulnHub virtual machines with known vulnerabilities. The goal is to achieve root-level access on each target machine.

Methodology

The authors create an open benchmark of 152 penetration testing tasks across 13 VulnHub machines at easy, medium, and hard difficulty levels. They evaluate PentestGPT with GPT-4o and Llama 3.1-405B, establishing strict rules (25 detailed rules in appendix) to minimize human involvement. They then conduct three cumulative ablation studies on PentestGPT to address identified weaknesses: (1) summary injection to combat forgetting, (2) structured to-do lists replacing the unstructured Penetration Testing Tree, and (3) RAG-based context from HackTricks to augment the reasoning module.

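
The summary-injection idea from the first ablation can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the class name, the `HISTORY_WINDOW` constant, and the `summarize` callback are assumptions. In the paper the running summary is produced by an LLM; here any summarization function can be plugged in.

```python
# Illustrative sketch of summary injection (names are assumptions, not the
# paper's actual code): each PentestGPT module keeps only its last few
# exchanges, so a running summary of evicted history is prepended to every
# prompt to combat forgetting of earlier findings.

HISTORY_WINDOW = 5  # PentestGPT modules keep the last 5 conversations

class SummaryInjector:
    def __init__(self):
        self.summary = ""   # running summary of evicted exchanges
        self.history = []   # recent raw exchanges for one module

    def record(self, exchange: str, summarize) -> None:
        """Store an exchange; fold the oldest into the summary when the
        window overflows, so old findings survive truncation."""
        self.history.append(exchange)
        if len(self.history) > HISTORY_WINDOW:
            evicted = self.history.pop(0)
            self.summary = summarize(self.summary, evicted)

    def build_prompt(self, task: str) -> str:
        """Prepend the running summary so the model sees past activity."""
        parts = []
        if self.summary:
            parts.append(f"Summary of prior activity:\n{self.summary}")
        parts.extend(self.history)
        parts.append(task)
        return "\n\n".join(parts)
```

In the paper this summary is shared across the summarization, reasoning, and generation modules; the sketch shows only the bookkeeping for a single module.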

Architecture

The system builds on PentestGPT's multi-agent architecture, which consists of three modules: a Summarization Module, a Reasoning Module, and a Generative Module (task explanation). Each module maintains its own history of the past 5 conversations. The ablations progressively add: summary injection, which maintains a running summary of past activity across modules; a structured to-do list with add/remove/modify task tools (inspired by ReAct) in the reasoning module; and RAG retrieval from a ChromaDB vector store of HackTricks content, using bge-large embeddings with cosine-similarity retrieval of the top-3 documents, refined to top-2 via bge-reranker.

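
The retrieve-then-rerank flow (top-3 by cosine similarity, refined to top-2) can be sketched with toy stand-ins. Assumptions: a bag-of-words `embed` replaces bge-large embeddings, a keyword-overlap `rerank_score` replaces bge-reranker, and an in-memory list replaces the ChromaDB store; only the control flow mirrors the paper's pipeline.

```python
import math
from collections import Counter

# Toy stand-ins for the paper's components: bge-large embeddings become a
# bag-of-words vector, and the bge-reranker becomes a keyword-overlap score.
# Only the retrieve-top-3-then-rerank-to-top-2 flow mirrors the paper.

def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rerank_score(query: str, doc: str) -> float:
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0

def retrieve(query: str, corpus: list[str]) -> list[str]:
    """Cosine-similarity retrieval of the top-3 docs, refined to top-2
    by the reranker, as in the paper's RAG ablation."""
    qv = embed(query)
    top3 = sorted(corpus, key=lambda d: cosine(qv, embed(d)), reverse=True)[:3]
    return sorted(top3, key=lambda d: rerank_score(query, d), reverse=True)[:2]
```

The returned two documents would then be injected into the reasoning module's context alongside the current task.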

LLM Models

GPT-4o, Llama 3.1-405B

Tool Integration

nmap, netdiscover, curl, PentestGPT, ChromaDB

Memory Mechanism

Conversation history (each module retains its last 5 conversations)

Attack Phases Covered

reconnaissance
scanning
enumeration
exploitation
post exploitation
privilege escalation
lateral movement
reporting

Evaluation

Llama 3.1-405B outperforms GPT-4o on 7 out of 13 machines, with equal performance on 4 and GPT-4o better on only 2. Neither model could complete a single machine end-to-end without any task failures. The cumulative ablation with RAG (Ablation 3) achieved the best overall performance, notably improving exploitation (from 50% to 100% on Funbox) and privilege escalation (from 0% to 100% on both Funbox and Symfonos 2). Performance degrades significantly in later task stages for both models, dropping from ~60% in the first 20% of tasks to ~33% in the final 20%.

Environment

VulnHub

Metrics

Success rate, task completion

Baseline Comparisons

  • GPT-4o with PentestGPT (base)
  • Llama 3.1-405B with PentestGPT (base)
  • Llama 3.1-405B with summary injection (Ablation 1)
  • Llama 3.1-405B with structured to-do lists (Ablation 2)
  • Llama 3.1-405B with RAG context (Ablation 3)

Scale

152 tasks across 13 VulnHub machines (7 easy, 4 medium, 2 hard), distributed as 72 reconnaissance, 14 general techniques, 44 exploitation, and 22 privilege escalation tasks
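
As an illustration of how per-category success rates over such a task distribution might be computed (the function name and the pass/fail outcomes below are hypothetical; only the 72/14/44/22 split comes from the paper):

```python
from collections import defaultdict

# Hypothetical scoring sketch: each task has a category and a pass/fail
# outcome; success rate is reported per category, as in the paper's
# category-level analysis. The outcome data in the test is made up.

def success_rates(results: list[tuple[str, bool]]) -> dict[str, float]:
    """results: (category, passed) pairs -> success rate per category."""
    total = defaultdict(int)
    passed = defaultdict(int)
    for category, ok in results:
        total[category] += 1
        passed[category] += ok
    return {c: passed[c] / total[c] for c in total}
```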

Contributions

  • A novel open benchmark of 152 end-to-end penetration testing tasks across 13 VulnHub machines with 25 strict evaluation rules to minimize human bias, filling a critical gap in standardized LLM pentest evaluation
  • Comprehensive evaluation of GPT-4o and Llama 3.1-405B using PentestGPT, revealing that Llama 3.1-405B has an edge over GPT-4o especially on easy and medium machines, and that neither model can complete a single machine without failures
  • Three cumulative ablation studies (summary injection, structured to-do lists, RAG from HackTricks) that provide insights into improving PentestGPT, with RAG showing the best overall improvement across categories
  • Detailed category-specific and difficulty-level analysis revealing that reconnaissance is easiest while exploitation and privilege escalation are hardest, and that performance degrades significantly in later pentesting stages

Limitations

  • Human-in-the-loop evaluation means human error and bias can still affect results despite the strict rules
  • The benchmark assumes that 3 walkthroughs cover all possible exploitation paths; as new exploits are discovered, this may not hold
  • VulnHub boxes are at least 2 years old, so LLMs may have been trained on walkthrough content, though neither model could complete any box end-to-end
  • Ablation studies were conducted on only 2 boxes (Funbox and Symfonos 2) due to time and cost constraints
  • The ablations used full-precision Llama 3.1-405B with an 8K context, versus a quantized model with a 128K context for the main benchmark, so results may differ slightly
  • Only one trial was run per test, making results potentially more stochastic than multi-trial approaches
  • The structured-generation ablation showed mixed results, improving some categories while degrading enumeration on Symfonos 2 due to task-list bloat

Research Gaps

  • Whether structured or unstructured generation (to-do lists vs. Penetration Testing Trees) is better for long-term LLM planning in pentesting remains an open question
  • Fully autonomous penetration testing without human involvement remains unsolved; current LLM agents cannot navigate websites or interpret command outputs independently
  • The performance degradation in later pentesting stages (forgetting earlier findings) needs better solutions beyond simple summary injection
  • Reinforcement learning for improving LLM penetration testing capability is an unexplored direction
  • Self-play with adversarial red/blue team LLM agents (e.g., mirroring CCDC competitions) as a training paradigm has not been explored
  • Balancing the aggressiveness of task list management (adding vs. removing tasks) in structured planning is unsolved

Novel Techniques

  • Cumulative ablation methodology for systematically improving multi-agent pentest tools: summary injection -> structured to-do lists -> RAG, each building on previous improvements
  • Structured to-do list with add/remove/modify task tools replacing unstructured Penetration Testing Tree for LLM reasoning, using constrained tool-calling
  • RAG retrieval from HackTricks with bge-large embeddings and bge-reranker to augment pentest reasoning context via ChromaDB
  • Strict 25-rule evaluation protocol to minimize and standardize human involvement in human-in-the-loop LLM pentest benchmarking
  • Prompt modification for Llama ('Be helpful and comprehensive preferably with commands') to prevent overly concise outputs that caused task forgetting

Open Questions

  • Can reinforcement learning improve LLM pentesting capabilities beyond prompt engineering and RAG?
  • How to fully automate the human-in-the-loop steps (website navigation, exploit interpretation) that current agents cannot handle?
  • Is there an optimal balance between structured and unstructured task planning for long-horizon pentest workflows?
  • Why does Llama 3.1-405B outperform GPT-4o -- is it due to less verbosity leading to better course correction, or the evaluation format favoring models that ask clarifying questions?
  • How would these results change with newer models or different agent frameworks beyond PentestGPT?
  • Can constrained generation techniques further improve structured task list management in pentesting agents?

Builds On

  • PentestGPT (Deng et al., 2024)
  • ReAct (Yao et al., 2022)
  • RAG (Lewis et al., 2020)
  • Fang et al., 2024a,b,c (LLM web exploitation and vulnerability agents)
  • Happe et al., 2024 (LLM privilege escalation)
  • Wu et al., 2023 (Plan, Eliminate, and Track for embodied agents)

Open Source

Yes - Benchmark: https://github.com/isamu-isozaki/AI-Pentest-Benchmark; Modified PentestGPT: https://github.com/isamu-isozaki/PentestGPT
