AutoPentester: An LLM Agent-based Framework for Automated Pentesting
Problem & Motivation
Existing LLM-based penetration testing tools such as PentestGPT remain semi-manual, requiring significant human interaction to execute commands, interpret results, and guide strategy. They suffer from limited strategic planning, repetitive attack strategies, low automation, and restricted tool handling.
The cybersecurity industry faces a critical shortage of skilled professionals while the demand for penetration testing and vulnerability assessments continues to grow. Automating the pentesting process can help meet this demand, but current LLM-based approaches still require substantial human expertise and intervention, limiting their practical utility.
Threat Model
The system assumes a black-box external attacker perspective given only a target IP address. The attacker has no prior knowledge of the target system and must discover services, vulnerabilities, and exploit paths autonomously.
Methodology
AutoPentester is a multi-agent LLM framework that automates the full penetration testing pipeline. Given a target IP, it iteratively performs reconnaissance, scanning, vulnerability assessment, and exploitation using common security tools. It introduces five key modules: a Summarizer that condenses lengthy tool outputs, a Strategy Analyzer that uses Chain-of-Thought reasoning over a Pentest Tree (PTT) to plan attacks, a RAG-enhanced Generator that produces accurate tool commands, an Agent-Computer Interface (ACI) that executes commands via CLI, and a Results Verifier that validates and corrects tool outputs. A Repetition Identifier module detects looping behavior using cosine similarity of vector embeddings. Upon completion, a Report Generator produces a CSV-based vulnerability report.
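The PTT that the Strategy Analyzer reasons over can be pictured as a small annotated tree. The sketch below is a hypothetical illustration of such a structure; `PTTNode`, `add_child`, and `render` are invented names for this sketch, not the authors' API, and the sample steps and findings are made up.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a Pentest Tree (PTT) node: a modified attack tree
# that stores both the step taken and the key findings as attributes, so the
# serialized tree can be fed into a CoT planning prompt.
@dataclass
class PTTNode:
    step: str                                     # pentesting step, e.g. "port scan"
    findings: list = field(default_factory=list)  # key findings attached to this step
    children: list = field(default_factory=list)  # candidate follow-up steps

    def add_child(self, step, findings=None):
        child = PTTNode(step, findings or [])
        self.children.append(child)
        return child

    def render(self, depth=0):
        """Serialize the tree as an indented list for the planning prompt."""
        line = "  " * depth + f"- {self.step}"
        if self.findings:
            line += f" [findings: {', '.join(self.findings)}]"
        return "\n".join([line] + [c.render(depth + 1) for c in self.children])

# Toy usage with invented findings:
root = PTTNode("recon 10.10.10.5")
scan = root.add_child("nmap service scan", ["22/ssh OpenSSH 7.2", "80/http Apache 2.4"])
scan.add_child("enumerate http", ["/login.php found"])
print(root.render())
```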
Architecture
Five LLM-based agents (Summarizer, Strategy Analyzer, Generator, Results Verifier, Report Generator) plus two non-LLM modules (Repetition Identifier, Agent-Computer Interface):
- Summarizer: chunks large tool outputs (6000-char chunks with 500-char overlap) and summarizes them
- Strategy Analyzer: maintains a Pentest Tree (PTT) with findings as attributes and uses CoT reasoning to select the next step
- Generator: uses RAG to retrieve relevant knowledge and produce executable commands
- Results Verifier: checks outputs and adjusts commands
- Agent-Computer Interface (ACI): extracts and executes commands via subprocess/pexpect
- Repetition Identifier: uses vector embeddings and cosine similarity (threshold 0.15) to detect repeated steps
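The Repetition Identifier's similarity check can be sketched in a few lines. The source gives only the 0.15 threshold; this sketch assumes it applies to cosine *distance* (i.e., similarity above 0.85 flags a repeat), and the hand-written vectors stand in for real embeddings from a model such as text-embedding-ada-002.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def is_repetition(new_step_vec, history_vecs, threshold=0.15):
    """Flag the new step if it is near-identical to any previously taken step.
    Threshold interpretation (distance < 0.15) is an assumption, not confirmed
    by the source."""
    return any(cosine_distance(new_step_vec, v) < threshold for v in history_vecs)

# Toy vectors standing in for step embeddings:
history = [[1.0, 0.0, 0.2]]
print(is_repetition([0.99, 0.01, 0.21], history))  # near-duplicate -> True
print(is_repetition([0.0, 1.0, 0.0], history))     # unrelated -> False
```

On a repeat, the framework then chooses among its four resolution options (continue, exit, interactive mode, general input) rather than re-issuing the same step.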
LLM Models
- GPT-4-turbo, GPT-3.5-turbo, and Gemini-2.0-flash as backbones; text-embedding-ada-002 for retrieval embeddings
Tool Integration
- Common CLI-based security tools executed through the ACI via subprocess/pexpect; GUI-based tools (e.g., ZAP, OpenVAS) are not supported
Memory Mechanism
- The Pentest Tree (PTT) persists steps and key findings across iterations; the Summarizer condenses lengthy tool outputs to fit context limits
RAG
- Knowledge base built from pentesting textbooks and HackTricks articles; retrieval with text-embedding-ada-002
Attack Phases Covered
- Reconnaissance, scanning, vulnerability assessment, and exploitation
Evaluation
AutoPentester achieves a 27.0% higher subtask completion rate (59.92% vs 47.18%) and 39.5% higher vulnerability coverage (98.14% vs 70.37%) compared to PentestGPT on HTB and custom VM experiments. It requires 92.6% fewer human interactions (1.13 vs 15.36 per machine) and 18.7% fewer steps, while generating 97.1% fewer incomplete commands. In a user survey of 10 cybersecurity professionals, AutoPentester scored 3.93/5 on average, 19.8% higher than PentestGPT.
Environment
- HackTheBox (HTB) machines and custom vulnerable VMs
Metrics
- Subtask completion rate, vulnerability coverage, number of human interactions, number of steps, incomplete commands, runtime, and user-survey scores
Baseline Comparisons
- PentestGPT
Scale
10 HackTheBox machines (6 easy, 4 medium) and 4 custom VMs covering OWASP Top 10 vulnerabilities
Contributions
- A novel multi-agent LLM framework (AutoPentester) for automated penetration testing and threat analysis that overcomes key limitations of prior work including limited strategic planning, lack of self-adjustment, limited automation, and heavy reliance on human expertise
- Evaluation across three LLM backbones (GPT-4-turbo, GPT-3.5-turbo, Gemini-2.0-flash) on 10 HTB machines and 4 custom VMs, demonstrating 27.0% better subtask completion and 39.5% better vulnerability coverage than PentestGPT
- An ablation study showing the contribution of each module: RAG improves subtask completion by 20%, Repetition Identifier reduces looping by 90.5%, and Results Verifier decreases incomplete commands by 80.1%
- A qualitative user study with 10 cybersecurity professionals comparing AutoPentester against PentestGPT, showing higher confidence scores across most evaluation dimensions
Limitations
- Relies on CLI tools like curl for web application interaction, making GUI-based task execution challenging; struggles with web applications that require browser-based interaction
- Commands generated by the Generator tend to focus narrowly on RAG-suggested content, sometimes missing corner cases such as using specific GitHub repositories for exploits
- Knowledge base requires continuous updating to maintain high performance; an outdated RAG knowledge base would degrade exploit generation quality
- User study had a small sample size (10 professionals), limiting generalizability of qualitative findings
- AutoPentester takes 71.9% longer on average than PentestGPT due to automation overhead (e.g., ACI waiting for interactive tool responses), though it can run unattended
- Failed to identify correct attack strategies on 4 of 8 incompletely-solved HTB machines; LLMs still lack ability to navigate complex multi-step attack paths
- Limited to CLI-based security tools; does not support GUI-based tools like ZAP or OpenVAS
- Only evaluated on Linux and Windows targets with known vulnerabilities; no evaluation on hardened or real production environments
Research Gaps
- Current LLMs cannot reliably identify complex multi-step attack strategies in penetration testing, suggesting a need for domain-specific fine-tuning
- No existing framework adequately handles GUI-based web application testing in an automated fashion
- Lack of standardized benchmarks for comparing automated pentesting tools across consistent evaluation criteria
- Limited work on integrating reinforcement learning (RLHF, DPO) to teach LLMs pentesting strategies rather than relying solely on prompting
- Absence of automated pentesting frameworks that generate industry-standard reports with executive summaries and impact assessments
- No existing work combines both network pentesting and web application pentesting with full automation in a single framework
Novel Techniques
- Pentest Tree (PTT) - a modified attack tree that stores both steps and key findings as attributes, enabling findings-oriented CoT reasoning for strategy selection
- Repetition Identifier using vector embeddings and cosine similarity (threshold 0.15) to detect and break out of looping attack patterns, with four resolution options (continue, exit, interactive mode, general input)
- Results Verifier agent that validates tool outputs and adjusts commands (e.g., correcting IP addresses, adding missing flags), reducing incomplete commands by 80.1%
- Chunked summarization approach for long tool outputs (6000-char chunks with 500-char overlap) to handle token limits while maintaining context
- RAG-enhanced command generation using a knowledge base built from pentesting textbooks and HackTricks articles, with text-embedding-ada-002 for retrieval
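The chunked summarization scheme (6000-character chunks with 500-character overlap, then a summary of summaries) can be sketched as follows. `summarize` here is a caller-supplied placeholder for an LLM call; the control flow is the only part taken from the source.

```python
def chunk_text(text, size=6000, overlap=500):
    """Split text into fixed-size chunks, each overlapping the previous by
    `overlap` characters so context is not lost at chunk boundaries."""
    chunks, start = [], 0
    step = size - overlap
    while start < len(text):
        chunks.append(text[start:start + size])
        start += step
    return chunks

def summarize_output(text, summarize):
    """Summarize a tool's output; long outputs are summarized per chunk and
    the partial summaries are then summarized once more."""
    if len(text) <= 6000:
        return summarize(text)
    partials = [summarize(c) for c in chunk_text(text)]
    return summarize("\n".join(partials))

# Toy usage with a truncating stand-in "summarizer" to show the control flow:
fake = lambda t: t[:50]
long_output = "x" * 13000
chunks = chunk_text(long_output)
print(len(chunks))  # -> 3 (starts at offsets 0, 5500, 11000)
print(len(summarize_output(long_output, fake)))  # -> 50
```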
Open Questions
- How effective would fine-tuning LLMs on pentesting data (using RLHF or DPO) be compared to the prompt-engineering and RAG approach used here?
- Can the framework scale to large enterprise networks with hundreds of hosts and complex network topologies?
- How would AutoPentester perform against actively defended environments with IDS/IPS, WAFs, or deception technologies?
- What is the optimal knowledge base composition and update frequency for the RAG module to maximize exploit coverage?
- How can GUI-based web application testing be integrated without falling back to a human-in-the-loop approach?
- Would a more sophisticated planning mechanism (e.g., Monte Carlo Tree Search or hierarchical task networks) improve strategy selection over CoT reasoning?
Builds On
- PentestGPT
- AutoAttacker
- PenHeal
- ScriptKiddie
Open Source
Yes - https://github.com/YasodGinworksInige/AutoPentester