
AutoPentest: Enhancing Vulnerability Management With Autonomous LLM Agents

Julius Henke

2025 | arXiv (preprint)

system penetration-testing semi-autonomous multi-agent hierarchical-planning

Problem & Motivation

Penetration testing is a critical but labor-intensive process that requires bridging multiple stages (enumeration, vulnerability analysis, exploitation). While individual tasks like network scanning can be automated, combining them intelligently to perform end-to-end black-box penetration testing remains a challenge.

Automating penetration testing could reduce costs, increase frequency of assessments, and reduce human effort. Existing tools automate individual steps but struggle to bridge tasks effectively, as the search space for exploitable vulnerabilities grows rapidly and requires intuition. LLMs like GPT-4o could provide the intelligence needed to automate this bridging.

Threat Model

The scenario is black-box penetration testing: only a target IP address or domain name is known, with no prior knowledge of the target system's internals. The user provides the target and can optionally review each shell command before it is executed, as a safety measure.
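
The optional command review can be pictured as a gate in front of command execution. The following is a minimal sketch under stated assumptions, not the tool's actual code: `run_with_review`, `console_reviewer`, and the `approve` callback are hypothetical names introduced here for illustration.

```python
import shlex
import subprocess
from typing import Callable, Optional


def run_with_review(command: str,
                    approve: Optional[Callable[[str], bool]] = None,
                    timeout: int = 60) -> str:
    """Run a command, optionally asking a reviewer for approval first.

    `approve` is a hypothetical callback that receives the proposed command
    and returns True to allow it. If it is None, the review step is skipped.
    """
    if approve is not None and not approve(command):
        return "[skipped: command rejected by reviewer]"
    # shlex.split with the default shell=False avoids passing the raw
    # string to a shell interpreter.
    result = subprocess.run(shlex.split(command), capture_output=True,
                            text=True, timeout=timeout)
    return result.stdout + result.stderr


def console_reviewer(command: str) -> bool:
    """Interactive reviewer: ask a human on the terminal (illustrative)."""
    return input(f"Run `{command}`? [y/N] ").strip().lower() == "y"
```

Passing `approve=console_reviewer` would give the reviewed mode, while leaving `approve` unset corresponds to running without review.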

Methodology

AutoPentest integrates GPT-4o with the LangChain agent framework to perform autonomous black-box penetration testing. The system uses a hierarchical multi-agent architecture with a Planner (creates high-level attack plans), a Supervisor (delegates tasks to workers), and multiple Specialised Workers (execute specific attack phases). Automated service discovery via nmap provides initial context, and RAG with a Pinecone vector database enriches worker context with domain-specific knowledge from OWASP Top 10, CWE descriptions, and other security resources. The Planner dynamically re-plans after each step based on new observations.
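
The Planner, Supervisor, and Worker roles and the re-planning step can be sketched as a simple control loop. This is an illustrative sketch only: in AutoPentest these roles are GPT-4o agents built on LangChain and LangGraph, whereas here they are plain Python callables, and every name (`State`, `run`, `toy_planner`, `toy_supervisor`) is a hypothetical stand-in.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class State:
    target: str
    observations: List[str] = field(default_factory=list)
    plan: List[str] = field(default_factory=list)


# Hypothetical stand-ins for the three agent roles.
Planner = Callable[[State], List[str]]   # returns the remaining plan steps
Supervisor = Callable[[str], str]        # maps a plan step to a worker name
Worker = Callable[[str, State], str]     # runs a step, returns an observation


def run(state: State, planner: Planner, supervisor: Supervisor,
        workers: Dict[str, Worker], max_steps: int = 10) -> State:
    for _ in range(max_steps):
        state.plan = planner(state)          # re-plan from all observations so far
        if not state.plan:                   # an empty plan means the planner is done
            break
        step = state.plan[0]
        worker = workers[supervisor(step)]   # delegate to a specialised worker
        state.observations.append(worker(step, state))
    return state


# Toy roles so the loop can be exercised without any LLM.
def toy_planner(state: State) -> List[str]:
    steps = ["enumerate web service", "test for injection"]
    return steps[len(state.observations):]


def toy_supervisor(step: str) -> str:
    return "enumeration" if "enumerate" in step else "injection"


workers = {
    "enumeration": lambda step, s: f"enumeration: ran '{step}' against {s.target}",
    "injection": lambda step, s: f"injection: ran '{step}' against {s.target}",
}

final = run(State(target="192.0.2.10"), toy_planner, toy_supervisor, workers)
```

Calling the planner at the top of every iteration is what models the re-planning after each step: the plan is always derived from the full list of observations rather than fixed up front.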

Architecture

Hierarchical multi-agent system consisting of: (1) User interface for target specification and optional command review; (2) Deterministic service discovery via nmap with NIST NVD CVE lookups; (3) Planner agent that creates and updates high-level attack plans; (4) Supervisor agent that delegates plan steps to appropriate Specialised Workers; (5) Specialised Workers for Enumeration, Broken Access Control, Cryptographic Failures, Injection, Insecure Design, Security Misconfiguration, Identification and Authentication Failures, and Privilege Escalation; (6) Pinecone vector database for RAG. All agents are powered by GPT-4o. The system is built with LangChain and LangGraph for stateful multi-agent conversations.
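
To illustrate the deterministic service discovery step, the sketch below parses nmap XML output (as produced by `nmap -oX`) and builds a keyword query URL for the NVD CVE API 2.0. The summary does not say how AutoPentest performs the lookup, so the keyword-based query, the sample XML, and the helper names `open_services` and `nvd_query_url` are assumptions made for illustration. No network request is sent.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List
from urllib.parse import urlencode

# Hand-written sample in the shape of `nmap -oX` output (illustrative data).
SAMPLE_NMAP_XML = """<nmaprun>
  <host>
    <address addr="192.0.2.10" addrtype="ipv4"/>
    <ports>
      <port protocol="tcp" portid="22">
        <state state="open"/>
        <service name="ssh" product="OpenSSH" version="8.9p1"/>
      </port>
      <port protocol="tcp" portid="80">
        <state state="open"/>
        <service name="http" product="nginx" version="1.18.0"/>
      </port>
      <port protocol="tcp" portid="443">
        <state state="closed"/>
        <service name="https"/>
      </port>
    </ports>
  </host>
</nmaprun>"""

NVD_API = "https://services.nvd.nist.gov/rest/json/cves/2.0"


def open_services(nmap_xml: str) -> List[Dict[str, object]]:
    """Return one record per open port found in the nmap XML."""
    services = []
    for port in ET.fromstring(nmap_xml).iter("port"):
        state = port.find("state")
        if state is None or state.get("state") != "open":
            continue
        svc = port.find("service")
        services.append({
            "port": int(port.get("portid")),
            "name": svc.get("name", "") if svc is not None else "",
            "product": svc.get("product", "") if svc is not None else "",
            "version": svc.get("version", "") if svc is not None else "",
        })
    return services


def nvd_query_url(service: Dict[str, object]) -> str:
    """Build (but do not send) a keyword search URL for the NVD CVE API."""
    keywords = f"{service['product']} {service['version']}".strip()
    return f"{NVD_API}?{urlencode({'keywordSearch': keywords})}"


services = open_services(SAMPLE_NMAP_XML)
urls = [nvd_query_url(s) for s in services]
```

Because this step involves no LLM, the same scan output always yields the same structured context for the Planner.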

LLM Models

Tool Integration

Memory Mechanism

RAG

Attack Phases Covered

reconnaissance

scanning

enumeration

exploitation

post-exploitation

privilege escalation

lateral movement

reporting

Evaluation

Both AutoPentest and manual ChatGPT-4o completed 15-26% of subtasks on three HTB machines. AutoPentest slightly outperformed ChatGPT-4o on the Codify machine (25.93% vs 22.22%) and matched it on the other two. Total API cost for all AutoPentest experiments was $96.20 (average $9.62 per run, median $6.43), compared to $20 for a ChatGPT Plus monthly subscription. The system demonstrated high autonomy but struggled with exploitation steps despite successfully enumerating vulnerabilities.
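
The reported figures can be cross-checked with a few lines of arithmetic. In this sketch the completed-subtask counts (7 and 6 of Codify's 27 subtasks) and the number of runs (10) are not stated in the summary; they are inferred here from the reported percentages and from the total and average cost, and should be read as assumptions.

```python
def completion_rate(completed: int, total: int) -> float:
    """Percentage of subtasks completed, rounded to two decimals."""
    return round(100 * completed / total, 2)


# Codify has 27 subtasks; 7 and 6 completed are inferred counts.
autopentest_codify = completion_rate(7, 27)
chatgpt_codify = completion_rate(6, 27)

# Cost figures from the text: $96.20 total, $9.62 average per run.
total_cost = 96.20
average_cost = 9.62
inferred_runs = round(total_cost / average_cost)

# Total API cost expressed in months of a $20 ChatGPT Plus subscription.
subscription_months = round(total_cost / 20, 2)
```

Under these assumptions the Codify gap between the two approaches amounts to a single subtask.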

Environment

Metrics

Baseline Comparisons

ChatGPT-4o-manual

Scale

3 HackTheBox machines (Devvortex, Broker, Codify) with 26, 10, and 27 subtasks respectively

Contributions

Design and implementation of AutoPentest, a multi-agent LLM system for autonomous black-box penetration testing using GPT-4o and LangChain
Integration of RAG with domain-specific security knowledge (OWASP Top 10, CWE, CVE) to enhance specialized worker agents
Hierarchical planning with dynamic re-planning after each step based on worker observations
Evaluation on three HackTheBox machines released after GPT-4o's training data cutoff to mitigate data leakage concerns
Comparison with manual ChatGPT-4o baseline showing similar performance with much higher autonomy
Open-source release of AutoPentest at https://github.com/JuliusHenke/autopentest

Limitations

Low overall subtask completion rate (15-26%), particularly struggling with exploitation after successful enumeration
Task repetition: Specialised Workers sometimes got stuck repeating tasks without reporting back to the Planner
Assumed shell context: Workers sometimes assumed they had active shell access to the target when they did not
Unreported observations: Workers occasionally hit the 100-iteration limit or crashed, losing all observations
Higher cost than ChatGPT Plus subscription ($96.20 total vs $20/month), though with better scalability
Only evaluated on GPT-4o; no comparison with other LLMs like Gemini, Claude, or open-source models
Only covers black-box penetration testing; no white-box scenarios evaluated
Small evaluation scale (3 machines, all rated easy on HTB)
Human review of shell commands still required for safety during experiments

Research Gaps

Safety mechanisms for fully autonomous penetration testing without human command review
Improving LLM memory and self-reflection to avoid assumed shell contexts and task repetition
Performance across multiple runs on the same target and leveraging knowledge from previous runs
Effective use of interactive tools (e.g., msfconsole) by LLM agents
White-box autonomous penetration testing where more initial context is available
Evaluation of newer commercial LLMs (Gemini, Claude) and open-source LLMs (Llama 4) for pentest tasks
Checkpointing and resuming long-running penetration tests
Caching enumeration results for repeated runs against the same target

Novel Techniques

OWASP Top 10-based specialization of worker agents, mapping vulnerability categories to dedicated agents with tailored tools and RAG knowledge
Deterministic service discovery pipeline (nmap + NIST NVD CVE lookup) providing structured context before LLM planning begins
Dynamic re-planning after each worker completes a step, allowing the Planner to adjust based on new observations
Output truncation strategy (first 3000 + last 3000 chars) to manage context length for long tool outputs
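
The output truncation strategy keeps the head and tail of long tool output and drops the middle. A minimal sketch, assuming a hypothetical helper `truncate_output`; the marker text inserted between the two halves is an assumption, since the summary states only the 3000 + 3000 character split.

```python
def truncate_output(text: str, head: int = 3000, tail: int = 3000) -> str:
    """Keep the first `head` and last `tail` characters of long output."""
    if len(text) <= head + tail:
        return text  # short enough, return unchanged
    omitted = len(text) - head - tail
    # The marker wording is illustrative; it tells the model text was dropped.
    marker = f"\n...[{omitted} characters truncated]...\n"
    return text[:head] + marker + text[len(text) - tail:]


long_output = "A" * 3000 + "B" * 4000 + "C" * 3000
shortened = truncate_output(long_output)
```

Keeping both ends is a reasonable fit for command-line tools, which tend to print banners and parameters first and summaries or errors last.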

Open Questions

Why does enumeration succeed but exploitation frequently fail, and how can this gap be bridged?
How to prevent agent task repetition and stuck states in long-running autonomous workflows?
Can aggregating observations across multiple runs improve overall success rates?
What safety mechanisms would be sufficient to remove human command review entirely?

Builds On

HPTSA
AutoAttacker
PentestGPT
LangChain
LangGraph

Open Source

Yes - https://github.com/JuliusHenke/autopentest