#09

VulnBot: Autonomous Penetration Testing for A Multi-Agent Collaborative Framework

He Kong, Die Hu, Jingguo Ge, Liangxiong Li, Tong Li, Bingzhen Wu

2025 | arXiv (preprint)

arXiv:2501.13411

Problem & Motivation

Traditional penetration testing is labor-intensive and time-consuming, while existing LLM-assisted or automated approaches suffer from context loss, lack of contextual understanding, excessive unstructured data generation, and inability to maintain coherent multi-phase workflows.

Current LLM-based penetration testing tools are limited in scope, rely heavily on GPT-4, require human intervention, or struggle with context loss across testing phases. There is a need for a fully autonomous, multi-agent framework that can leverage open-source LLMs to emulate the collaborative workflows of human penetration testing teams, decomposing complex tasks into specialized phases with structured task management.

Threat Model

The system assumes a black-box penetration testing scenario where the attacker has network access to target machines and uses standard penetration testing tools from a Kali Linux attack platform. The framework operates in automatic mode without human intervention, targeting vulnerable machines in controlled lab environments and real-world benchmarks.

Methodology

VulnBot decomposes penetration testing into three specialized phases (reconnaissance, scanning, exploitation), each handled by role-specialized agents. A Penetration Task Graph (PTG) models tasks as a directed acyclic graph to ensure logical execution order. A Summarizer module enables inter-agent communication by extracting and passing critical information between phases. A Check and Reflection mechanism allows the system to reanalyze failed tasks and dynamically update plans. The Generator translates abstract task instructions into tool-specific commands, and the Executor runs them on target systems.
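
The dependency ordering that a PTG enforces can be illustrated with a topological sort. The sketch below is a minimal illustration, not VulnBot's implementation: the task names and the dictionary layout are hypothetical, and Python's standard `graphlib` module stands in for whatever graph handling the framework actually uses.

```python
from graphlib import TopologicalSorter

# Hypothetical task graph: each task maps to the set of tasks it depends on.
ptg = {
    "port_scan": set(),
    "service_detection": {"port_scan"},
    "web_dir_enum": {"service_detection"},
    "vuln_scan": {"service_detection"},
    "exploit_service": {"web_dir_enum", "vuln_scan"},
}

def execution_order(graph):
    """Return one valid execution order; graphlib raises CycleError if the graph is not a DAG."""
    return list(TopologicalSorter(graph).static_order())

order = execution_order(ptg)
print(order)
```

Any order returned this way runs a task only after all of its prerequisites, which is the property the DAG structure is meant to guarantee.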

Architecture

Five core modules:

  • Planner: generates and maintains penetration testing plans as PTGs via a Plan Session and a Task Session
  • Memory Retriever: uses a vector database (Milvus) with RAG to store and retrieve embeddings of past tasks and penetration testing knowledge
  • Generator: converts the Planner's task instructions into tool-specific executable commands
  • Executor: maintains an interactive shell via the Python Paramiko library and executes commands in the target environment
  • Summarizer: bridges phases by extracting key findings and shell state for inter-agent communication
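
To make the Memory Retriever's role concrete, here is a toy retrieval sketch. It is an assumption-heavy stand-in: a bag-of-words vector and an in-memory list replace the learned embeddings and the Milvus database, and the `MemoryRetriever` class, its methods, and the stored notes are all hypothetical.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding'; a real system would use a learned embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class MemoryRetriever:
    """Hypothetical in-memory store standing in for a vector database."""

    def __init__(self):
        self.entries = []  # list of (text, vector) pairs

    def add(self, text):
        self.entries.append((text, embed(text)))

    def retrieve(self, query, k=1):
        """Return the k stored texts most similar to the query."""
        q = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[1]), reverse=True)
        return [text for text, _ in ranked[:k]]

memory = MemoryRetriever()
memory.add("nmap service scan found ssh on port 22 and http on port 80")
memory.add("wpscan enumerates wordpress users and plugins")
memory.add("hydra performs password brute force against ssh login")

top = memory.retrieve("brute force the ssh password", k=1)
print(top)
```

The retrieved note would then be placed in the prompt alongside the current task, which is the general shape of retrieval-augmented generation.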

LLM Models

Llama3.1-405B
Llama3.3-70B
DeepSeek-v3
GPT-4o

Tool Integration

Nmap
Dirb
Nikto
WPScan
Metasploit
Hydra

Memory Mechanism

RAG

Attack Phases Covered

reconnaissance
scanning
enumeration
exploitation

Evaluation

On AUTOPENBENCH, VulnBot-Llama3.1-405B achieved a 30.3% overall task completion rate, compared with 9.09% for base Llama3.1-405B and 21.21% for GPT-4o. Its subtask completion rate reached 69.05% in a single experiment, versus 49.05% for the baseline. On the AI-Pentest-Benchmark, which comprises 6 real-world machines, VulnBot with RAG autonomously completed an end-to-end penetration of the WestWild machine, which neither GPT-4o nor Llama3.1-405B achieved even with human intervention.
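
The reported percentages can be related back to the 33 AUTOPENBENCH tasks with a little arithmetic. The sketch below assumes each rate is a simple fraction of those 33 tasks; the resulting task counts are inferred by rounding and are not figures stated in the summary.

```python
TOTAL_TASKS = 33  # AUTOPENBENCH task count given in the summary

# Overall task completion rates (percent) as reported above.
reported_rates = {
    "VulnBot-Llama3.1-405B": 30.3,
    "GPT-4o (base)": 21.21,
    "Llama3.1-405B (base)": 9.09,
}

def implied_tasks(rate_percent, total=TOTAL_TASKS):
    """Nearest whole number of tasks consistent with a reported percentage (an inference)."""
    return round(rate_percent / 100 * total)

implied = {name: implied_tasks(rate) for name, rate in reported_rates.items()}

for name, solved in implied.items():
    print(f"{name}: ~{solved}/{TOTAL_TASKS} tasks ({solved / TOTAL_TASKS:.2%})")
```

Under that assumption, the gap between VulnBot and its own base model amounts to roughly seven additional tasks out of 33.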

Environment

AUTOPENBENCH
AI-Pentest-Benchmark
VulnHub

Metrics

task-completion-rate
subtask-completion-rate
failure-count-per-phase

Baseline Comparisons

  • GPT-4o (base)
  • Llama3.3-70B (base)
  • Llama3.1-405B (base)
  • PentestGPT-Llama3.3-70B
  • PentestGPT-Llama3.1-405B
  • PentestGPT-DeepSeek-v3

Scale

33 tasks from AUTOPENBENCH (across Access Control, Web Security, Network Security, Cryptography, Real-world categories) and 6 real-world machines from AI-Pentest-Benchmark (VulnHub)

Contributions

  • An autonomous multi-agent penetration testing framework (VulnBot) that decomposes tasks into three specialized phases (reconnaissance, scanning, exploitation) with role-specialized agents
  • A Penetration Task Graph (PTG) mechanism that models tasks and dependencies as a directed acyclic graph for structured, logical task execution with a Merge Plan Algorithm for dynamic plan updates
  • A Check and Reflection mechanism enabling iterative error handling and plan refinement based on task execution feedback
  • Demonstration that open-source LLMs (Llama3.1-405B, Llama3.3-70B, DeepSeek-v3) can outperform GPT-4o baselines in automated penetration testing when paired with an appropriate multi-agent architecture
  • Integration of RAG with penetration testing knowledge bases to enhance contextual understanding and reduce hallucination
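
The Check and Reflection contribution can be pictured as a bounded retry loop. The following sketch is illustrative only: `check`, `reflect`, and `fake_execute` are hypothetical rule-based stand-ins for steps the framework delegates to an LLM and a live shell, and the target address is made up.

```python
from dataclasses import dataclass

@dataclass
class Attempt:
    command: str
    output: str
    success: bool

def check(output):
    """Hypothetical check: treat any output containing 'error' as a failure."""
    return "error" not in output.lower()

def reflect(command, output):
    """Hypothetical reflection: a real system would ask an LLM to revise the command."""
    if "permission denied" in output.lower():
        return "sudo " + command
    return command

def run_with_reflection(command, execute, max_attempts=3):
    """Execute, check the result, and revise the command until it succeeds or attempts run out."""
    history = []
    for _ in range(max_attempts):
        output = execute(command)
        ok = check(output)
        history.append(Attempt(command, output, ok))
        if ok:
            break
        command = reflect(command, output)
    return history

def fake_execute(command):
    """Stand-in for a shell: only privileged commands succeed."""
    if command.startswith("sudo "):
        return "scan complete"
    return "Error: permission denied"

history = run_with_reflection("nmap -sS 10.0.0.5", fake_execute)
for attempt in history:
    print(attempt)
```

Keeping the full attempt history, rather than only the last result, is what lets a later planning step reason about why earlier commands failed.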

Limitations

  • Cannot process non-textual information such as images or graphical interfaces generated by penetration testing tools, limiting understanding of visual attack surfaces and security scan results
  • End-to-end completion on real-world machines remains a significant challenge; even with RAG, full autonomy across all stages of a real-world penetration test is difficult
  • Exploitation phase still has high failure rates compared to reconnaissance and scanning phases, indicating the inherent complexity of automated exploitation
  • Does not cover post-exploitation, privilege escalation, lateral movement, or reporting phases, limiting the scope to initial access only
  • Relies on a fixed set of predefined tools per phase; cannot dynamically discover or select new tools based on encountered scenarios
  • The 5-step-per-phase limit (15 steps total on AUTOPENBENCH) constrains the depth of exploration possible in complex scenarios
  • Evaluation focuses only on automatic mode; semi-automatic and manual modes are not systematically evaluated

Research Gaps

  • Open-source LLMs suffer from severe context loss across penetration testing phases, with session context loss being the primary cause of failure (42.36% of 203 total failures in the exploratory study)
  • Current systems lack effective error-handling mechanisms for autonomously diagnosing command execution failures and taking corrective actions
  • No existing framework effectively handles dynamic reasoning across interdependent penetration testing phases without human oversight
  • LLMs frequently generate incorrect penetration testing commands with wrong tool usage or fabricated parameters (failed-tool errors account for 19.70% of failures)
  • Multi-modal understanding (processing screenshots, graphical outputs from security tools) is absent from current automated penetration testing systems
  • Fully autonomous agents still face challenges in achieving consistent and reliable results across diverse environments despite multi-agent coordination
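
The two failure percentages quoted in this list can be turned into approximate counts. The sketch assumes both percentages are shares of the same 203 failures from the exploratory study; the counts and the "other causes" remainder are derived here and are not given in the summary.

```python
TOTAL_FAILURES = 203  # total failures in the exploratory study

# Shares (percent) quoted in the list above.
shares = {
    "session context loss": 42.36,
    "failed tool errors": 19.70,
}

# Nearest whole-number counts consistent with each share (an inference).
counts = {cause: round(pct / 100 * TOTAL_FAILURES) for cause, pct in shares.items()}
other = TOTAL_FAILURES - sum(counts.values())

for cause, n in counts.items():
    print(f"{cause}: ~{n} of {TOTAL_FAILURES}")
print(f"other causes: ~{other} of {TOTAL_FAILURES} ({other / TOTAL_FAILURES:.2%})")
```

On these assumptions, the two named categories together explain a little over three fifths of the observed failures.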

Novel Techniques

  • Penetration Task Graph (PTG): modeling penetration testing workflows as a directed acyclic graph with task nodes containing instruction, action, dependencies, command, result, and status attributes
  • Merge Plan Algorithm for dynamically integrating new tasks into existing plans while preserving completed task state and dependencies
  • Summarizer module for inter-agent communication that extracts key findings and maintains shell state across phase transitions, reducing information loss
  • Three-phase decomposition (from traditional five phases) to reduce context loss between agent role transitions
  • Check and Reflection mechanism within task sessions that enables LLMs to self-correct by reflecting on both successful and failed task outcomes
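
A PTG task node and a plan merge might look roughly like the sketch below. The six node fields follow the attributes listed above; everything else is an assumption for illustration, including keying tasks by instruction text, the status strings, and the merge rule itself, which is not the paper's Merge Plan Algorithm.

```python
from dataclasses import dataclass, field

@dataclass
class TaskNode:
    instruction: str
    action: str
    dependencies: list = field(default_factory=list)
    command: str = ""
    result: str = ""
    status: str = "pending"  # hypothetical values: pending, completed, failed

def merge_plan(existing, proposed):
    """Merge a newly proposed plan into the existing one, keeping completed tasks intact."""
    merged = {}
    for instruction, node in proposed.items():
        old = existing.get(instruction)
        # A task that already finished keeps its recorded command, result, and status.
        merged[instruction] = old if old is not None and old.status == "completed" else node
    # Completed tasks omitted from the new plan are retained as history.
    for instruction, old in existing.items():
        if old.status == "completed" and instruction not in merged:
            merged[instruction] = old
    return merged

existing = {
    "Scan open ports": TaskNode("Scan open ports", "Shell", [],
                                "nmap -p- 10.0.0.5", "22, 80 open", "completed"),
    "Brute-force SSH": TaskNode("Brute-force SSH", "Shell",
                                ["Scan open ports"], status="failed"),
}
proposed = {
    "Scan open ports": TaskNode("Scan open ports", "Shell"),
    "Enumerate web directories": TaskNode("Enumerate web directories", "Shell",
                                          ["Scan open ports"]),
}

merged = merge_plan(existing, proposed)
print(list(merged))
```

In this toy rule, the failed brute-force task is dropped because the new plan no longer proposes it, while the completed scan keeps its stored result so dependent tasks can still build on it.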

Open Questions

  • How can automated penetration testing systems incorporate multi-modal understanding to process graphical outputs from security tools?
  • Can the PTG approach scale to larger, more complex network environments with multiple interconnected targets requiring lateral movement?
  • What is the optimal granularity for phase decomposition - would more or fewer phases improve performance for different target complexities?
  • How does the system perform when encountering previously unseen vulnerability types not covered in the RAG knowledge base?
  • Can the Check and Reflection mechanism be enhanced with more sophisticated reasoning (e.g., tree-of-thought) to improve exploitation success rates?
  • What are the security and ethical implications of fully autonomous penetration testing tools, and how should access controls be designed?

Builds On

  • PentestGPT
  • AutoAttacker
  • AUTOPENBENCH
  • AI-Pentest-Benchmark
  • Langchain-Chatchat

Open Source

Yes - https://github.com/KHenryAegis/VulnBot

Tags