BreachSeek: A Multi-Agent Automated Penetration Tester
Problem & Motivation
Traditional cybersecurity penetration testing methods are time-consuming, labor-intensive, and unable to rapidly adapt to emerging threats. There is a critical need for an automated solution that can efficiently identify and exploit vulnerabilities across diverse systems without extensive human intervention.
Modern digital environments have grown in complexity and scale, exposing significant gaps in manual penetration testing. LLMs have demonstrated potential for automating complex tasks traditionally requiring human expertise, yet their application to penetration testing remains largely underexplored. Additionally, single-agent LLM approaches suffer from context window limitations that hinder extended multi-step testing scenarios.
Threat Model
The system assumes authorized penetration testing of target machines within local networks. The target is a known machine (e.g., Metasploitable 2) on the same network as the testing platform, representing a standard internal network pentest scenario.
Methodology
BreachSeek is an AI-driven multi-agent software platform that automates penetration testing by leveraging LLMs integrated through LangChain and LangGraph in Python. The system uses a graph-based architecture with multiple specialized agents (supervisor, pentester, evaluator, recorder) that communicate with each other to distribute tasks. Each agent handles a specific aspect of the testing process, mitigating context window limitations by separating concerns. The platform executes actual commands in a terminal environment rather than just generating text-based outputs, and produces comprehensive PDF security reports.
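The supervisor-routed workflow described above can be sketched without any framework dependencies. The paper implements this with LangGraph; the snippet below is only an illustration of the routing pattern, with node names taken from the paper and all routing conditions and agent bodies as placeholder assumptions.

```python
# Dependency-free sketch of the BreachSeek routing pattern.
# The real system builds this as a LangGraph graph; the logic
# and placeholder agent outputs here are illustrative only.

def supervisor(state):
    # Decide which specialized agent runs next; "FINISH" ends the run.
    if not state["findings"]:
        return "pentester"
    if not state["evaluated"]:
        return "evaluator"
    if not state["report"]:
        return "recorder"
    return "FINISH"

def pentester(state):
    state["findings"].append("open port 21 (vsftpd)")  # placeholder result
    return state

def evaluator(state):
    state["evaluated"] = True  # mark the last step as verified
    return state

def recorder(state):
    state["report"] = "; ".join(state["findings"])  # running summary
    return state

NODES = {"pentester": pentester, "evaluator": evaluator, "recorder": recorder}

def run(state):
    while (nxt := supervisor(state)) != "FINISH":
        state = NODES[nxt](state)
    return state

result = run({"findings": [], "evaluated": False, "report": ""})
```

Because each agent only reads and writes the shared state it needs, no single prompt has to carry the entire testing transcript, which is the separation-of-concerns idea the paper uses to ease context window pressure.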
Architecture
Graph-based architecture implemented using LangGraph with four specialized nodes/agents: (1) Supervisor - oversees the entire process, generates action plans, and identifies subsequent steps; (2) Pentester - accesses shell and Python tools to execute commands using popular penetration testing utilities in a Kali Linux environment; (3) Evaluator - assesses output quality and task completion accuracy; (4) Recorder - maintains a summary of actions and generates a final report when prompted. The supervisor routes tasks to specialized agents, the pentester executes commands and reports to the evaluator, and the recorder captures the entire testing journey.
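The pentester node's distinguishing feature is that it executes real commands rather than emitting text. A minimal stand-in for such a shell tool is sketched below; the function name, timeout handling, and example command are assumptions, and the real system would invoke Kali Linux utilities (nmap, Metasploit, etc.).

```python
import shlex
import subprocess

def run_shell(command: str, timeout: int = 60) -> str:
    """Execute a shell command and return combined stdout/stderr.

    Minimal sketch of the pentester agent's shell tool; the actual
    platform runs penetration testing utilities in a Kali Linux
    environment and feeds their output back to the evaluator.
    """
    try:
        proc = subprocess.run(
            shlex.split(command),
            capture_output=True,
            text=True,
            timeout=timeout,
        )
        return proc.stdout + proc.stderr
    except subprocess.TimeoutExpired:
        return f"[timeout after {timeout}s] {command}"

output = run_shell("echo recon-started")
```

Capturing stdout and stderr together matters here: many pentesting tools write diagnostics to stderr, and the evaluator needs both streams to judge whether a step succeeded.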
LLM Models
Claude 3.5 Sonnet (used in development); Llama 3.1 (planned deployment target, not yet validated)
Tool Integration
Shell and Python tools for running penetration testing utilities in a Kali Linux environment
Memory Mechanism
conversation-history
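The conversation-history memory can be kept bounded by folding older turns into a running summary, which is one plausible reading of how the recorder keeps "a summary of actions" while the context window stays manageable. The class below is a sketch under that assumption; the truncation-based summarizer is a deliberate placeholder for an LLM summarization call.

```python
from collections import deque

class ConversationHistory:
    """Keep the last `max_turns` messages plus a running summary.

    Illustrative sketch: older turns are folded into a summary so the
    prompt stays bounded. Real summarization would use an LLM; here it
    is a trivial truncation placeholder.
    """

    def __init__(self, max_turns: int = 4):
        self.recent = deque(maxlen=max_turns)
        self.summary = ""

    def add(self, role: str, text: str):
        # Before the deque evicts the oldest turn, fold it into the summary.
        if len(self.recent) == self.recent.maxlen:
            oldest_role, oldest_text = self.recent[0]
            self.summary += f"{oldest_role}: {oldest_text[:40]}... "
        self.recent.append((role, text))

    def context(self) -> str:
        # Prompt context = summary of old turns + verbatim recent turns.
        turns = "\n".join(f"{r}: {t}" for r, t in self.recent)
        return f"[summary] {self.summary}\n{turns}" if self.summary else turns
```

Usage: call `add()` after every agent turn and pass `context()` into the next prompt; the prompt size is then bounded by `max_turns` plus the summary length regardless of how long the test runs.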
Attack Phases Covered
Evaluation
In preliminary testing, BreachSeek successfully exploited a Metasploitable 2 machine, achieving root access while consuming approximately 150,000 tokens. The evaluation was qualitative; the authors note that future work will incorporate quantitative measures using OWASP WSTG and OSCP exam content as standardized benchmarks.
Environment
Kali Linux testing platform; Metasploitable 2 target on the same local network
Metrics
Qualitative (successful root access); token consumption (~150,000) reported
Scale
1 Metasploitable 2 machine
Contributions
- Introduction of a multi-agent architecture using LangGraph for automated penetration testing that mitigates LLM context window limitations by distributing tasks across specialized agents
- A working system that executes actual commands in a terminal (Kali Linux) rather than just generating text-based outputs, enabling real exploitation of vulnerabilities
- Automated generation of comprehensive, formatted PDF security reports capturing the entire penetration testing process
- A web UI built with NextJS (front-end) and FastAPI (back-end) for interactive use of the platform
- Open-source implementation available on GitHub
Limitations
- Evaluation is purely qualitative with only a single target machine (Metasploitable 2), lacking standardized quantitative benchmarks
- No comparison with existing tools or baselines such as PentestGPT
- Limited to local network testing scenarios; scalability to real-world complex networks is not demonstrated
- No human-in-the-loop safety controls implemented yet (planned as future work)
- Development used Claude 3.5 Sonnet but planned deployment on Llama 3.1 is not yet validated
- No formal threat model or ethical considerations for autonomous exploitation discussed
- Context window limitations are claimed to be mitigated but no empirical evidence or measurements comparing single-agent vs. multi-agent context usage are provided
Research Gaps
- Need for standardized benchmarks for evaluating automated penetration testing tools (OWASP WSTG, OSCP suggested)
- Human oversight mechanisms for autonomous penetration testing systems remain underdeveloped
- Fine-tuning LLMs on specialized cybersecurity data for improved penetration testing performance
- Integration of RAG systems with vector databases of penetration testing techniques and past experiences
- Multi-modal input support (images, videos) for analyzing network setups and security camera feeds during penetration testing
- LLMs' ability to maintain context over extended interactions remains a challenge for complex multi-step testing scenarios
Novel Techniques
- Using LangGraph's graph-based architecture to create a multi-agent penetration testing workflow with supervisor, pentester, evaluator, and recorder nodes
- Separation of concerns across agents to mitigate context window limitations inherent in single-agent LLM approaches
- Evaluator agent that assesses output quality and task completion accuracy as a feedback loop within the testing workflow
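The evaluator-as-feedback-loop technique above amounts to re-checking each pentester step and retrying until it passes or a budget is exhausted. The sketch below illustrates that loop; the success criterion, retry budget, and toy step are all assumptions (a real evaluator would be an LLM judging task completion).

```python
# Illustrative sketch of the evaluator feedback loop: each step's
# output is checked, and the step is retried until it passes or the
# retry budget runs out. The success criterion is a placeholder.

def evaluate(output: str) -> bool:
    # Placeholder check: a real evaluator would ask an LLM whether
    # the assigned task (e.g. "enumerate open ports") was completed.
    return "open port" in output

def run_with_feedback(step, max_retries: int = 3):
    for attempt in range(1, max_retries + 1):
        output = step(attempt)
        if evaluate(output):
            return output, attempt
    return None, max_retries

# Toy step that only succeeds on the second attempt.
outputs = {1: "scan failed", 2: "open port 22 (ssh)"}
result, attempts = run_with_feedback(lambda a: outputs.get(a, ""))
```

The retry budget is the important design choice: without it, an agent that keeps failing the evaluator's check would loop indefinitely, which matters for an autonomous tool with no human in the loop.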
Open Questions
- How does the multi-agent approach quantitatively compare to single-agent LLM penetration testing in terms of success rate, coverage, and cost?
- Can the system handle real-world networks with complex architectures beyond intentionally vulnerable machines?
- What safety mechanisms are sufficient for autonomous penetration testing tools operating without human oversight?
- How effectively can fine-tuned open-source models (Llama 3.1) match commercial models (Claude 3.5 Sonnet) for penetration testing tasks?
- How does the system handle unexpected situations or novel vulnerability types not seen during training?
Builds On
- PentestGPT
- LangChain
- LangGraph
Open Source
Yes - https://github.com/snow10100/pena/