#37

Benchmarking Practices in LLM-driven Offensive Security: Testbeds, Metrics, and Experiment Design

Andreas Happe, Jürgen Cito

2025 | arXiv (preprint)

arXiv:2504.10112v2

Problem & Motivation 问题与动机

The quality of empirical evaluation of LLM-driven offensive security tools is highly dependent on the chosen testbed, captured metrics, and analysis methods, yet there is no systematic understanding of current benchmarking practices in this space.

LLM驱动的进攻性安全工具的实证评估质量高度依赖于所选的测试床(testbed)、获取的指标以及分析方法,但目前对于该领域的基准测试实践尚缺乏系统性的理解。

Due to the opaque nature of LLMs, empirical methods are the primary way to evaluate offensive security prototypes. The substantial costs of running LLM experiments make sound experiment design critical. This paper fills the gap by being the first empirical investigation of testbed composition, experiment design (metrics, sample sizes, LLM selection), and analysis methods used for evaluating offensive LLMs.

由于LLM的不透明特性,实证方法是评估进攻性安全原型的首要方式。运行LLM实验的高昂成本使得合理的实验设计变得至关重要。本文通过对用于评估进攻性LLM的测试床组成、实验设计(指标、样本量、LLM选择)和分析方法进行首次实证调查,填补了这一空白。

Threat Model 威胁模型

Not directly applicable; the paper studies how other works evaluate LLM-driven attacks rather than proposing a specific threat model itself.

不直接适用;本文研究的是其他作品如何评估LLM驱动的攻击,而非提出具体的威胁模型。

Methodology 核心方法

The authors conducted a systematic literature review of 19 research papers (detailing 18 prototypes and their testbeds) that evaluate LLM-driven offensive security tools. They used Google Scholar with keywords 'offensive security LLM' and applied exponential non-discriminative snowball sampling (forward-referencing) to identify relevant publications from 2023-2025. They performed multi-stage thematic analysis where each author independently coded papers, employed reflexive journaling, discussed themes with two professional penetration testers (peer debriefing), and used team consensus to finalize themes.

作者对19篇评估LLM驱动进攻性安全工具的研究论文(详细介绍了18个原型及其测试床)进行了系统性文献综述。他们使用Google Scholar并配合关键词“offensive security LLM”,并应用指数非歧视性滚雪球采样法(前向引用)来识别2023-2025年间的相关出版物。他们执行了多阶段主题分析,每位作者独立对论文进行编码,采用反思性日志,与两位专业渗透测试人员讨论主题(同行简报),并使用团队共识来确定最终主题。

Memory Mechanism 记忆机制

none

Evaluation 评估结果

Across the 19 reviewed papers, testbeds contained 1-200 high-level tasks (average 26.1, median 15), experiments used 1-10 LLMs (average 4.0, median 3) with 1-6 test runs per LLM (average 4.6, median 5). OpenAI non-reasoning models were used in all but one publication, while only two papers included reasoning models. CTF-based challenges may not fully represent real-world penetration testing scenarios, and there is a significant gap between synthetic testbeds and messy real-world conditions including non-deterministic exploits, side effects, and background noise.

在审查的19篇论文中,测试床包含1-200个高层任务(平均26.1个,中位数为15个),实验使用了1-10个LLM(平均4.0个,中位数为3个),每个LLM进行了1-6次测试运行(平均4.6次,中位数为5次)。除一份出版物外,所有论文均使用了OpenAI的非推理模型,而只有两篇论文包含了推理模型。基于CTF的挑战可能无法完全代表真实的渗透测试场景,且合成测试床与包含非确定性漏洞利用、副作用和背景噪声的复杂现实条件之间存在显著差距。

Environment 评估环境

HackTheBox, TryHackMe, picoCTF, VulnHub, OverTheWire, lin.security, metasploitable2, GOAD

Metrics 评估指标

success-rate, progression-rate, token-usage, cost, command-count, invalid-command-count, command-classification, error-classification, sub-task-completion

Scale 评估规模

19 papers reviewed, covering 18 prototypes with testbeds ranging from 1 to 200 tasks

Contributions 核心贡献

  • First systematic empirical investigation of testbed composition and provenance used for evaluating offensive LLMs
  • Detailed analysis of experiment design practices including metrics, sample sizes, LLM selection, and baselines across 19 publications
  • Analysis of methods used for qualitative and quantitative evaluation of LLM-driven offensive tools
  • Actionable recommendations for future research across 10 dimensions: technology choices, benchmark composition, practitioner relevance, training data contamination, baselines, clean vs. real-life test cases, sub-task tracking, LLM selection, experiment run configuration, and metrics/analysis
  • Identification of key discrepancy between CTF-based security research and real-world penetration testing practice
  • 首次对用于评估进攻性LLM的测试床组成和来源进行了系统的实证调查
  • 对19份出版物中的实验设计实践进行了详细分析,包括指标、样本量、LLM选择和基准线
  • 分析了用于LLM驱动进攻性工具的定性和定量评估方法
  • 跨10个维度为未来研究提供了可操作的建议:技术选择、基准组成、从业者相关性、训练数据污染、基准线、干净 vs 现实测试用例、子任务跟踪、LLM选择、实验运行配置以及指标/分析
  • 识别出基于CTF的安全研究与真实渗透测试实践之间的关键差异

Limitations 局限性

  • Limited to English-language publications from 2023-2025 found via Google Scholar
  • Only 19 papers were included, limiting generalizability of findings
  • Focused exclusively on offensive use of LLMs, excluding defensive and red-teaming testbeds
  • The rapidly evolving nature of LLM capabilities means findings may quickly become outdated
  • Selection bias despite forward-referencing snowball sampling
  • 仅限于通过Google Scholar找到的2023-2025年间的英文出版物
  • 仅包含了19篇论文,限制了发现的普遍性
  • 专门聚焦于LLM的进攻性用途,排除了防御性和红队测试床
  • LLM能力的快速演进意味着研究结果可能会迅速过时
  • 尽管采用了前向引用滚雪球采样,但仍存在选择偏差

Research Gaps 研究空白

  • No common vocabulary or semantics for what constitutes a sub-task in penetration testing benchmarks
  • Lack of Attacker/Defender style CTF benchmarks; all reviewed benchmarks are Jeopardy-style
  • No reverse mapping showing benchmark coverage of hacking disciplines (e.g., which MITRE ATT&CK tactics are covered)
  • Multi-step attack chains in real-world networks are rarely modeled; testbeds simplify into single atomic attack steps
  • Non-deterministic nature of real exploits is ignored by synthetic testbeds
  • Background noise and activity present in real networks is absent from benchmarks
  • Assumed Breach scenarios (network-based attacks) are underrepresented compared to single-host CTF challenges
  • Training data contamination is not systematically addressed; canaries and parametrizable identifiers are rarely used
  • Reproducibility of baselines is poor, especially for human baselines and for LLM-based baselines, which are affected by model stochasticity
  • Reasoning LLMs are largely unevaluated with only 2 of 19 papers including them
  • Small Language Models (SLMs) are underexplored with only 5 papers including models under 16B parameters
  • 对于渗透测试基准测试中什么构成“子任务”,缺乏通用的词汇或语义
  • 缺乏攻防(Attacker/Defender)风格的CTF基准;所有审查的基准均为解题(Jeopardy)风格
  • 缺乏展示基准测试对黑客领域覆盖范围的反向映射(例如,涵盖了哪些MITRE ATT&CK阶段)
  • 现实网络中的多步攻击链很少被建模;测试床将其简化为单个原子攻击步骤
  • 真实漏洞利用的非确定性特性被合成测试床忽略了
  • 真实网络中存在的背景噪声和活动在基准测试中是缺失的
  • 与单机CTF挑战相比,假设失陷(Assumed Breach)场景(基于网络的攻击)的代表性不足
  • 训练数据污染问题未得到系统解决;极少使用金丝雀(canaries)和参数化标识符
  • 基准线的可复现性较差,特别是人类基准线和由于随机性导致的基于LLM的基准线
  • 推理型LLM在很大程度上未得到评估,19篇论文中只有2篇包含了它们
  • 小语言模型(SLMs)探索不足,仅有5篇论文包含了参数量在16B以下的模型

Novel Techniques 新颖技术

  • Multi-stage thematic analysis with peer debriefing by professional penetration testers for systematic review rigor
  • Taxonomy of testbed properties: testcase origin (reused/scratch/borrowed), implementation (container/VM), provenance (released/documented/coarse), and target architecture (localhost/single-host/network)
  • Classification of sub-task tracking approaches: MITRE ATT&CK mapping, NIST 800-115 categories, golden steps with milestones, and subtask-guided performance
  • Concept of canaries for detecting training data contamination in security benchmarks (see the sketch after this list)
  • 多阶段主题分析,配合专业渗透测试人员的同行简报,以增强系统综述的严谨性
  • 测试床属性分类法:测试用例来源(重用/原创/借用)、实现方式(容器/虚拟机)、出处(已发布/已记录/粗略)和目标架构(本地/单机/网络)
  • 子任务跟踪方法的分类:MITRE ATT&CK映射、NIST 800-115类别、带有里程碑的标准步骤以及子任务引导的性能评估
  • 用于检测安全基准测试中训练数据污染的金丝雀(canaries)概念

Open Questions 开放问题

  • How should benchmarks model multi-step attack chains and attack trees rather than singular atomic exploits?
  • How can benchmarks incorporate non-deterministic exploit behavior and system side-effects?
  • What is the right level of granularity for sub-tasks, and how should dependencies between sub-tasks be represented?
  • How can Attacker/Defender style benchmarks be created to evaluate LLM resilience against active defensive countermeasures?
  • How should Goodhart's law effects be mitigated as benchmarks become optimization targets?
  • Can network-oriented benchmarks with multiple VMs better approximate real penetration testing?
  • 基准测试应如何建模多步攻击链和攻击树,而非单一的原子漏洞利用?
  • 基准测试如何纳入非确定性的漏洞利用行为和系统副作用?
  • 子任务的合适粒度级别是什么,子任务之间的依赖关系应如何表示?
  • 如何创建攻防风格的基准测试,以评估LLM应对主动防御对策的韧性?
  • 随着基准测试成为优化目标,应如何减轻古德哈特定律(Goodhart's law)的影响?
  • 具有多个虚拟机的面向网络的基准测试能否更好地近似真实的渗透测试?

Builds On 基于前人工作

  • Getting pwn'd by AI (Happe et al., 2023)
  • PentestGPT (Deng et al., 2023)
  • LLMs as Hackers (Happe et al.)
  • AutoAttacker
  • CyBench
  • NYU CTF Dataset
  • AutoPenBench
  • HackSynth
  • PenHeal
  • Vulnbot

Open Source 开源信息

No

Tags