#29

AutoPenBench: Benchmarking Generative Agents for Penetration Testing

Luca Gioacchini, Marco Mellia, Idilio Drago, Alexander Delsanto, Giuseppe Siracusano, Roberto Bifulco

2024 | arXiv (preprint)

2410.03225

Problem & Motivation

Despite growing interest in automating penetration testing with LLM-based generative agents, there is no comprehensive and standardized benchmark framework for evaluating, comparing, and developing such agents.

Existing works like PentestGPT, AutoAttacker, and HPTSA MultiAgent lack common benchmarks, limiting reproducibility and fair comparison. The only related benchmark, Cybench, is based on gamified CTF challenges that oversimplify real pentesting scenarios and lack active operations like network scanning or traffic manipulation. AutoPenBench fills this gap by providing an open, extensible benchmark with realistic pentest tasks of increasing difficulty.

Threat Model

The agent operates from a Kali Linux workstation on a Docker virtual network targeting vulnerable containers. The agent has full access to any Kali command and standard pentesting tools. Tasks are structured as CTF-style flag captures requiring vulnerability discovery and exploitation.

Methodology

AutoPenBench is built on the AgentQuest framework and provides 33 penetration testing tasks organized into two difficulty levels: 22 in-vitro tasks (basic cybersecurity fundamentals across access control, web security, network security, and cryptography) and 11 real-world tasks based on publicly disclosed CVEs from 2014-2024. Each task is decomposed into milestones (command milestones and stage milestones) to measure partial agent progress. The benchmark uses Docker containers to create isolated, reproducible test environments with a Kali workstation and vulnerable target machines on a virtual network.

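
The task/milestone decomposition described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual schema: the field names and the AC_0 milestones are assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class Task:
    """Illustrative AutoPenBench-style task; field names are assumptions."""
    name: str
    level: str                     # "in-vitro" or "real-world"
    category: str                  # e.g. "access control", "web security"
    command_milestones: list[str]  # key actions the agent should perform
    stage_milestones: list[str]    # pentest stages the agent should complete

def progress_rate(task: Task, achieved: set[str]) -> float:
    """Fraction of command milestones achieved -- measures partial progress."""
    hits = sum(1 for m in task.command_milestones if m in achieved)
    return hits / len(task.command_milestones)

# Hypothetical in-vitro access-control task (milestones invented for the demo):
ac0 = Task(
    name="AC_0", level="in-vitro", category="access control",
    command_milestones=["scan target with nmap", "log in via SSH",
                        "escalate privileges", "read flag file"],
    stage_milestones=["scanning", "exploitation", "privilege escalation"],
)
print(progress_rate(ac0, {"scan target with nmap", "log in via SSH"}))  # 0.5
```

Milestones let the benchmark credit an agent that, say, scans and authenticates correctly but never captures the flag, instead of scoring the run as a flat failure.
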
Architecture

The benchmark infrastructure consists of a Kali Linux workstation container (with Metasploit, Hydra, Nmap, etc.) connected via a /16 Docker virtual network to one or more vulnerable target containers. Two agent architectures are implemented on top of the CoALA framework: (1) a fully autonomous ReACT-based agent with three sequential reasoning procedures (summary, thought, action) per execution step, and (2) a semi-autonomous assisted agent that breaks tasks into human-provided sub-tasks, autonomously solving each one and reporting back. Structured output is enforced via the Python Instructor library with Pydantic objects.

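
A minimal sketch of one execution step under this three-call decomposition, with plain dataclasses and a canned-reply stub standing in for the Instructor/Pydantic-constrained LLM calls; all names, prompts, and the stub's reply are illustrative assumptions.

```python
from dataclasses import dataclass

# Structured outputs for each reasoning procedure; the paper enforces these
# with Pydantic models via the Instructor library -- dataclasses stand in here.
@dataclass
class Summary:
    text: str      # condensed view of the observation history

@dataclass
class Thought:
    text: str      # reasoning about the next move

@dataclass
class Action:
    command: str   # bash command to run on the Kali workstation

def llm(prompt: str) -> str:
    """Stub for a structured LLM call; a real agent would query e.g. GPT-4o."""
    return "nmap -sV 10.10.0.0/24"  # canned reply, for illustration only

def react_step(history: list[str]) -> Action:
    """One execution step split into three sequential LLM calls."""
    summary = Summary(llm("Summarise so far: " + " | ".join(history)))
    thought = Thought(llm("Given this summary, reason: " + summary.text))
    return Action(llm("Emit the next bash command for: " + thought.text))

step = react_step(["Task: capture the flag on the target machine"])
print(step.command)
```

Splitting summary, thought, and action into separate constrained calls is what the paper credits with reducing hallucination and keeping the emitted command consistent with the stated reasoning.
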
LLM Models

GPT-4o (gpt-4o-2024-08-06)
GPT-4-turbo
GPT-4o-mini
OpenAI o1-preview
OpenAI o1-mini
Gemini 1.5 Flash

Tool Integration

nmap
metasploit
hydra
SSH client
arbitrary bash commands on Kali workstation
custom script writing

Memory Mechanism

conversation-history

Attack Phases Covered

reconnaissance
scanning
enumeration
exploitation
post exploitation
privilege escalation
lateral movement
reporting

Evaluation

The fully autonomous agent (GPT-4o) achieves only a 21% overall success rate (27% on in-vitro tasks, 9% on real-world tasks), completing about 40% of intermediate milestones on average. The assisted agent roughly triples this, reaching a 64% success rate (59% in-vitro, 73% real-world) and demonstrating that human-AI collaboration significantly improves pentesting outcomes. Among the LLMs tested on a simple task, GPT-4o was the only model to achieve a 100% success rate across 5 runs; GPT-4-turbo reached 40%, while Gemini 1.5 Flash, o1-preview, o1-mini, and GPT-4o-mini all failed.

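
The headline figures are internally consistent: with the 22/11 task split, the overall rates are just the task-weighted averages of the per-level rates.

```python
# Sanity check of the reported success rates against the task split.
n_vitro, n_real = 22, 11

# Autonomous agent: 27% in-vitro, 9% real-world -> ~21% overall
autonomous = (n_vitro * 0.27 + n_real * 0.09) / (n_vitro + n_real)

# Assisted agent: 59% in-vitro, 73% real-world -> ~64% overall
assisted = (n_vitro * 0.59 + n_real * 0.73) / (n_vitro + n_real)

print(round(autonomous, 2), round(assisted, 2))  # 0.21 0.64
```
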
Environment

custom-lab (Docker-based vulnerable containers)

Metrics

success-rate
progress-rate (fraction of command milestones achieved)
stage-level success rate (per pentest phase)
consistency (variability across repeated runs)
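
These metrics can be computed from repeated runs roughly as follows; the run logs are invented and the exact aggregation (e.g. step-count spread as the consistency proxy) is an assumption for illustration.

```python
from statistics import mean, stdev

# Hypothetical logs of 5 repeated runs of one task:
runs = [
    {"success": True,  "steps": 12, "milestones_hit": 4, "milestones_total": 4},
    {"success": True,  "steps": 30, "milestones_hit": 4, "milestones_total": 4},
    {"success": False, "steps": 25, "milestones_hit": 2, "milestones_total": 4},
    {"success": True,  "steps": 14, "milestones_hit": 4, "milestones_total": 4},
    {"success": False, "steps": 40, "milestones_hit": 1, "milestones_total": 4},
]

success_rate = mean(r["success"] for r in runs)
progress_rate = mean(r["milestones_hit"] / r["milestones_total"] for r in runs)
step_spread = stdev(r["steps"] for r in runs)  # proxy for (in)consistency

print(success_rate, round(progress_rate, 2), round(step_spread, 1))
```

Note how progress-rate (0.75 here) stays informative even when the binary success rate is low: failed runs still earn credit for milestones reached.
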

Baseline Comparisons

  • Autonomous agent vs. assisted (human-in-the-loop) agent
  • Comparison across 6 LLMs: GPT-4o, GPT-4-turbo, GPT-4o-mini, o1-preview, o1-mini, Gemini 1.5 Flash

Scale

33 tasks (22 in-vitro across 4 categories + 11 real-world CVE-based tasks)

Contributions

  • An open-source benchmark framework (AutoPenBench) with 33 penetration testing tasks of increasing difficulty, including both in-vitro and real-world CVE-based scenarios
  • A milestone-based evaluation methodology with command milestones and stage milestones that enables fine-grained measurement of partial agent progress using GPT-4o as an automated evaluator
  • Two modular agent cognitive architectures (fully autonomous and human-assisted) built on the CoALA framework, featuring a novel decomposition of ReACT into separate summary, thought, and action procedures
  • Comprehensive empirical evaluation comparing autonomous vs. assisted agents and benchmarking six different LLMs, revealing critical limitations in current LLM-based pentesting
  • An extensible Docker-based infrastructure that allows the community to add new tasks by providing container configurations, gold steps, and milestones

Limitations

  • The benchmark currently covers only basic penetration testing areas (access control, web security, network security, cryptography) and does not encompass the full breadth of real-world pentesting
  • The fully autonomous agent achieves very low success rates (21%), indicating current LLMs are far from reliably automating penetration testing
  • Agent consistency is problematic: even on simple tasks (AC_0), the number of steps varies significantly across runs, and on moderately complex tasks (AC_2), success drops to 40% over 10 runs
  • The milestone evaluation uses GPT-4o as a judge with manual correction, which may introduce bias and does not fully scale
  • Only one LLM (GPT-4o) was extensively evaluated across all tasks; other LLMs were only tested on a single simple task
  • The benchmark does not yet include advanced scenarios like Active Directory attacks, cloud-specific vulnerabilities, or multi-stage lateral movement

Research Gaps

  • Need for RAG-based agent modules to retrieve pentesting best practices from cybersecurity manuals and knowledge bases
  • Exploration of how different AI architectures (beyond ReACT) perform on pentesting tasks
  • LLM-specific prompt optimization: current prompts are not tailored per model, which likely hurts performance for non-GPT-4o models
  • Bridging the gap between vulnerability detection and exploitation: agents can often identify vulnerabilities but fail at configuring and executing exploits correctly
  • Improving agent reliability and consistency for safety-critical cybersecurity operations
  • Extending benchmarks to cover more complex, multi-stage, and realistic penetration testing scenarios

Novel Techniques

  • Decomposition of ReACT into three separate LLM calls (summary, thought, action) to reduce hallucination and improve thought-action consistency
  • Milestone-based evaluation with two granularity levels (command milestones and stage milestones) enabling fine-grained progress measurement beyond binary success/failure
  • Systematic working memory clearing at sub-task boundaries in the assisted agent to reduce context pollution and improve LLM contextual awareness
  • Use of LLM-as-judge (GPT-4o) for automated milestone evaluation with human correction step
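
The working-memory clearing in the assisted agent can be sketched as a simple loop; the function names and the stubbed sub-task solver are hypothetical, standing in for the LLM-driven steps.

```python
def solve_subtask(subtask: str, memory: list[str]) -> str:
    """Stub: autonomously work on one human-provided sub-task, logging
    observations into working memory; returns a report for the human."""
    memory.append(f"observation while doing: {subtask}")
    return f"done: {subtask}"

def assisted_agent(subtasks: list[str]) -> list[str]:
    reports = []
    for sub in subtasks:
        memory: list[str] = []  # fresh working memory at each sub-task boundary
        reports.append(solve_subtask(sub, memory))
        # memory is discarded here, so stale context from one sub-task never
        # pollutes the prompts of the next
    return reports

print(assisted_agent(["scan the network", "brute-force SSH", "read the flag"]))
```
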

Open Questions

  • Can RAG-augmented agents significantly improve performance on tasks requiring domain-specific knowledge (e.g., cryptography, CVE exploitation)?
  • What is the optimal level of human involvement for cost-effective semi-autonomous pentesting?
  • How can agent consistency be improved for safety-critical security operations where unreliable behavior is unacceptable?
  • Would fine-tuning open-source LLMs on pentesting data close the gap with GPT-4o?
  • How do agents scale to more complex, multi-host network environments with lateral movement requirements?

Builds On

  • AgentQuest (modular benchmark framework for LLM agents)
  • CoALA (Cognitive Architectures for Language Agents)
  • ReACT (Reasoning and Acting in Language Models)
  • Cybench (CTF-based LLM cybersecurity benchmark)
  • PentestGPT
  • AutoAttacker
  • HPTSA MultiAgent

Open Source

Yes - https://github.com/lucagioacchini/auto-pen-bench

Tags