Benchmark Landscape

8 dedicated benchmarks · 12 evaluation platforms · from 69 papers

Benchmark Catalog

8 papers with paper_type = benchmark

#28 NYU CTF Bench: A Scalable Open-Source Benchmark Dataset for Evaluating LLMs in Offensive Security

ctf 2024

The capacity of LLMs to solve Capture-the-Flag (CTF) challenges in cybersecurity has not been thoroughly evaluated. Existing studies are limited in scope, use closed benchmarks, lack automated frameworks, and test only a small number of challenges or models.

Composition: 200 CTF challenges across 6 categories (32 crypto, 7 forensics, 29 pwn, 31 rev, 20 web, 13 misc from qualifying rounds; 8 crypto, 10 forensics, 20 pwn, 11 rev, 6 web from finals)
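
The per-category counts can be tallied to cross-check the headline figure. The following is a minimal Python sketch with hypothetical variable names; the counts are copied from the composition line. As listed, they sum to 187 rather than 200, and no misc count is given for the finals, so the listing may be incomplete.

```python
# Per-category counts copied from the composition line (names are illustrative).
qualifiers = {"crypto": 32, "forensics": 7, "pwn": 29, "rev": 31, "web": 20, "misc": 13}
finals = {"crypto": 8, "forensics": 10, "pwn": 20, "rev": 11, "web": 6}


def tally(*rounds):
    """Sum per-category counts across rounds; a missing category counts as 0."""
    totals = {}
    for counts in rounds:
        for category, n in counts.items():
            totals[category] = totals.get(category, 0) + n
    return totals


per_category = tally(qualifiers, finals)
listed_total = sum(per_category.values())
stated_total = 200  # headline figure from the composition line

print(per_category)
print(f"listed: {listed_total}, stated: {stated_total}, gap: {stated_total - listed_total}")
```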

Attack Phases

reconnaissance, enumeration, exploitation

Domains

open-source-llm, binary-exploitation, reverse-engineering, cryptography, web-security, forensics, open-source-benchmark

Open Source: Yes - https://github.com/NYU-LLM-CTF/NYU_CTF_Bench and https://github.com/NYU-LLM-CTF/llm_ctf_automation

#29 AutoPenBench: Benchmarking Generative Agents for Penetration Testing

penetration-testing 2024

Despite growing interest in automating penetration testing with LLM-based generative agents, there is no comprehensive and standardized benchmark framework for evaluating, comparing, and developing such agents.

Composition: 33 tasks (22 in-vitro across 4 categories + 11 real-world CVE-based tasks)

Attack Phases

reconnaissance, scanning, enumeration, exploitation, privilege escalation, post-exploitation

Domains

network-pentest, web-pentest, CVE-exploitation, open-source

Open Source: Yes - https://github.com/lucagioacchini/auto-pen-bench

#30 VAP-6: A Benchmarking Framework on Vulnerability Assessment and Penetration Testing for Language Models

vulnerability-assessment 2025

There is a scarcity of comprehensive benchmarks for evaluating LLMs in the Vulnerability Assessment and Penetration Testing (VAPT) domain, especially for small, open-source models suitable for local deployment. Existing general LLM benchmarks and even emerging cybersecurity-specific benchmarks do not adequately address the specific requirements of VAPT or the particular considerations for smaller, locally deployable models.

Composition: 7800 questions across 6 datasets

Domains

open-source-llm, cve

Open Source: No (benchmark datasets not publicly released)

#39 Towards Automated Penetration Testing: Introducing LLM Benchmark, Analysis, and Improvements

penetration-testing 2025

There is no comprehensive, open, end-to-end penetration testing benchmark to evaluate and drive progress of LLM-based automated penetration testing. Existing tools like PentestGPT rely heavily on human participation, and the degree of human involvement and the specific challenges LLMs face at each pentest stage are not well understood.

Composition: 152 tasks across 13 VulnHub machines (7 easy, 4 medium, 2 hard), distributed as 72 reconnaissance, 14 general techniques, 44 exploitation, and 22 privilege escalation
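
To make the task mix easier to read, the stated counts can be turned into shares. This is a small Python sketch with illustrative variable names; it only re-derives numbers from the composition line and checks that the breakdowns match the stated totals of 152 tasks and 13 machines.

```python
# Counts copied from the composition line (dictionary names are illustrative).
tasks = {
    "reconnaissance": 72,
    "general techniques": 14,
    "exploitation": 44,
    "privilege escalation": 22,
}
machines = {"easy": 7, "medium": 4, "hard": 2}

total_tasks = sum(tasks.values())
total_machines = sum(machines.values())

# Share of each task type in the benchmark.
shares = {phase: n / total_tasks for phase, n in tasks.items()}

print(f"tasks: {total_tasks}, machines: {total_machines}")
for phase, share in shares.items():
    print(f"{phase:<22}{tasks[phase]:>4}{share:>8.1%}")
```

Under these counts, reconnaissance accounts for roughly 47% of all tasks, so aggregate scores on this benchmark lean heavily on that phase.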

Attack Phases

reconnaissance, enumeration, exploitation, privilege escalation

Domains

open-source-llm, network-pentest, privilege-escalation

Open Source: Yes - Benchmark: https://github.com/isamu-isozaki/AI-Pentest-Benchmark; Modified PentestGPT: https://github.com/isamu-isozaki/PentestGPT

#46 CyberExplorer: Benchmarking LLM Offensive Security Capabilities in a Real-World Attacking Simulation Environment

penetration-testing 2025

Existing LLM-based offensive agent evaluations operate in closed-world settings with predefined goals and binary success criteria, using isolated single-service environments that fail to capture realistic multi-target attack scenarios involving reconnaissance, target selection, and exploitation under uncertainty.

Composition: 40 web-based CTF challenges deployed on a single virtual machine

Attack Phases

reconnaissance, scanning, enumeration, exploitation

Domains

web-pentest, open-source-llm

#53 HackWorld: Evaluating Computer-Use Agents on Exploiting Web Application Vulnerabilities

vulnerability-assessment 2025

While computer-use agents (CUAs) have shown strong capabilities in web browsing and visual task automation, their potential to discover and exploit web application vulnerabilities through graphical interfaces remains unknown. Existing benchmarks for CUAs assume sanitized environments and overlook security aspects.

Composition: 36 web CTF challenges from NYU CTF Bench (26), Cybench (8), and InterCode-CTF (2)

Attack Phases

reconnaissance, scanning, enumeration, exploitation

Domains

web-pentest, open-source-llm

Open Source: Yes - https://github.com/GUI-Agent/HackWorld

#58 PentestEval: Benchmarking LLM-based Penetration Testing with Modular and Stage-Level Design

penetration-testing 2025

Existing LLM-based penetration testing approaches rely on simplistic prompting without task decomposition or domain adaptation, resulting in unreliable black-box behavior and limited insight into model capabilities across individual penetration testing stages.

Composition: 346 tasks across 12 realistic vulnerable scenarios, over 3000 stage-level evaluations and 180 end-to-end tests

Attack Phases

reconnaissance, scanning, enumeration, exploitation, post-exploitation

Domains

web-pentest, open-source-llm

Open Source: Yes

#63 Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models

ctf 2025

There is a need to quantify the cybersecurity capabilities and risks of language model agents, but existing benchmarks use easy, non-professional-level tasks, are not open-source, or lack objective difficulty grounding.

Composition: 40 professional-level CTF tasks from 4 competitions

Attack Phases

reconnaissance, enumeration, exploitation, privilege escalation

Domains

open-source-llm, web-pentest, network-pentest

Open Source: Yes - https://cybench.github.io

Benchmark Comparison Table

Key dimensions across all benchmarks

Name	Year	Scope	Scale	Attack Phases	Metrics	Environment	Open Source	Key Finding
#28 NYU CTF Bench: A Scalable Open-Source Benchmark Dataset for Evaluating LLMs in Offensive Security	2024	ctf	200 CTF challenges across 6 categories (32 crypto, 7 forensics, 29 pwn, 31 rev, 20 web, 13 misc from qualifying rounds; 8 crypto, 10 forensics, 20 pwn, 11 rev, 6 web from finals)	reconnaissance, enumeration, exploitation	success-rate, failure-type-distribution, comparison-with-human-scores	custom-lab, CTF-competition	Yes	GPT-4 performed best overall, solving challenges across multiple categories (crypto 6.67%, pwn 7.69%, rev 9.80%, web 5.26%, with misc reaching as high as…
#29 AutoPenBench: Benchmarking Generative Agents for Penetration Testing	2024	penetration-testing	33 tasks (22 in-vitro across 4 categories + 11 real-world CVE-based tasks)	reconnaissance, scanning, enumeration, exploitation, privilege escalation, post-exploitation	success-rate, progress-rate (fraction of command milestones achieved), stage-level success rate (per pentest phase), consistency (variability across repeated runs)	custom-lab (Docker-based vulnerable containers)	Yes	The fully autonomous agent (GPT-4o) achieves only a 21% overall success rate (27% on in-vitro tasks, 9% on real-world tasks), completing about 40% of the intermediate milestones on average. The assisted agent triples performance, reaching a success rate of 6…
#30 VAP-6: A Benchmarking Framework on Vulnerability Assessment and Penetration Testing for Language Models	2025	vulnerability-assessment	7800 questions across 6 datasets		accuracy, RMSE (CVSS base score prediction), Levenshtein distance (CVSS vector string similarity), exact match (CVSS vector strings)	Local deployment via Ollama (16GB RAM, 4GB Nvidia vRAM)	Yes	All three models scored well above typical certification passing thresholds (70-85%) on CEH- and PenTest+-style questions, with Qwen 2.5 and Llama 3.2 exceeding 96%…
#39 Towards Automated Penetration Testing: Introducing LLM Benchmark, Analysis, and Improvements	2025	penetration-testing	152 tasks across 13 VulnHub machines (7 easy, 4 medium, 2 hard), distributed as 72 reconnaissance, 14 general techniques, 44 exploitation, and 22 privilege escalation	reconnaissance, enumeration, exploitation, privilege escalation	success-rate, task-completion	VulnHub	Yes	Llama 3.1-405B outperforms GPT-4o on 7 out of 13 machines, with equal…
#46 CyberExplorer: Benchmarking LLM Offensive Security Capabilities in a Real-World Attacking Simulation Environment	2025	penetration-testing	40 web-based CTF challenges deployed on a single virtual machine	reconnaissance, scanning, enumeration, exploitation	success-rate, precision, recall, time-to-first-flag, average-cost-per-challenge, average-interaction-rounds, agent-count, dead-end-persistence-ratio, vulnerability-discovery-signal, agent-inflation-factor	custom-lab	No	Claude Opus 4.5 and Gemini 3 Pro achieved the highest recall (22.50% each, 9 of 40 flags), with Opus…
#53 HackWorld: Evaluating Computer-Use Agents on Exploiting Web Application Vulnerabilities	2025	vulnerability-assessment	36 web CTF challenges from NYU CTF Bench (26), Cybench (8), and InterCode-CTF (2)	reconnaissance, scanning, enumeration, exploitation	success-rate, tool-usage-frequency, num-steps	custom-lab	Yes	CUAs achieve exploitation rates below 12%, with Claude-3.7-Sonnet performing best at an average success rate of 10.18% across observation spaces (peaking, with screenshots or set-of-marks, at…
#58 PentestEval: Benchmarking LLM-based Penetration Testing with Modular and Stage-Level Design	2025	penetration-testing	346 tasks across 12 realistic vulnerable scenarios, over 3000 stage-level evaluations and 180 end-to-end tests	reconnaissance, scanning, enumeration, exploitation, post-exploitation	Jaccard-similarity, Spearman-rank-correlation, syntax-correctness, functional-correctness, revision-success-rate, recall, NonCVE-identification-rate, end-to-end-success-rate	custom-lab, VulnHub, Docker-containers, Amazon-Lightsail	Yes	At the stage level, LLMs exhibit generally weak performance, with a mean success rate of only 0.41 across all stages; attack decision-making (0.25) and the functional correctness of exploit generation (0.26)…
#63 Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models	2025	ctf	40 professional-level CTF tasks from 4 competitions	reconnaissance, enumeration, exploitation, privilege escalation	success-rate, task-completion, subtask-performance, first-solve-time-correlation, token-usage	custom-lab, CTF-competition	Yes	Claude 3.5 Sonnet achieves the highest unguided performance at 17.5%; GPT-4o leads with subtask guidance at 17.5%; OpenAI…
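
Two of the headline figures in the table can be re-derived from the numbers given alongside them. The sketch below is illustrative Python, not code from any of the papers: it assumes AutoPenBench's overall success rate is a task-weighted average of its in-vitro and real-world rates, and that CyberExplorer's recall is flags captured divided by flags available. The function name `overall_rate` is hypothetical.

```python
def overall_rate(groups):
    """Task-weighted success rate over (task_count, success_rate) pairs."""
    total_tasks = sum(n for n, _ in groups)
    expected_successes = sum(n * rate for n, rate in groups)
    return expected_successes / total_tasks


# AutoPenBench (#29): 22 in-vitro tasks at 27%, 11 real-world tasks at 9%.
autopenbench = [(22, 0.27), (11, 0.09)]
print(f"AutoPenBench overall: {overall_rate(autopenbench):.0%}")

# CyberExplorer (#46): 9 of 40 flags captured.
recall = 9 / 40
print(f"CyberExplorer recall: {recall:.2%}")
```

Under these assumptions the results agree with the reported 21% and 22.50%.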

Evaluation Platform Usage

[Bar chart: number of papers using each evaluation platform. Platforms shown: Custom Lab, VulnHub, HackTheBox, PicoCTF, Cybench, InterCode-CTF, Metasploitable, CTF Competition, AutoPenBench, NYU CTF Bench, OverTheWire, GOAD]
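
A usage tally of this kind can be computed by counting platform tags per paper. The sketch below is illustrative only: it uses the Environment tags of the benchmark papers from the comparison table rather than all 69 papers, so its counts are not the chart's values. The dictionary name `environments` is hypothetical, and #30 is omitted because its environment is a local Ollama deployment rather than a platform tag.

```python
from collections import Counter

# Environment tags per benchmark paper, copied from the comparison table.
environments = {
    "#28": ["custom-lab", "CTF-competition"],
    "#29": ["custom-lab"],
    "#39": ["VulnHub"],
    "#46": ["custom-lab"],
    "#53": ["custom-lab"],
    "#58": ["custom-lab", "VulnHub", "Docker-containers", "Amazon-Lightsail"],
    "#63": ["custom-lab", "CTF-competition"],
}

# Count how many papers mention each tag (each paper lists a tag at most once).
usage = Counter(tag for tags in environments.values() for tag in tags)

# Print a text bar chart, most-used platform first.
for platform, count in usage.most_common():
    print(f"{platform:<18} {'#' * count} {count}")
```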

Paper × Evaluation Platform Matrix

[Matrix: which papers use which evaluation platforms (filled = used). Columns (papers): #01-#16, #18-#29, #31-#34, #36-#41, #43-#56, #58, #59, #61-#66, #68. Rows (platforms): Custom Lab, VulnHub, HackTheBox, PicoCTF, Cybench, InterCode-CTF, Metasploitable, CTF Competition, AutoPenBench, NYU CTF Bench, OverTheWire, GOAD]
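
A matrix like this is a boolean table with one row per platform and one column per paper. The following Python sketch shows one way to build and print it. It is an illustration under stated assumptions: it covers only the benchmark papers whose Environment tags appear in the comparison table, it assumes the tags `custom-lab` and `CTF-competition` correspond to the rows Custom Lab and CTF Competition, and `build_matrix` is a hypothetical helper.

```python
papers = ["#28", "#29", "#39", "#46", "#53", "#58", "#63"]
platforms = ["Custom Lab", "VulnHub", "CTF Competition"]

# Platform -> set of papers using it, derived from the comparison table's
# Environment column (tag-to-row mapping is an assumption).
used = {
    "Custom Lab": {"#28", "#29", "#46", "#53", "#58", "#63"},
    "VulnHub": {"#39", "#58"},
    "CTF Competition": {"#28", "#63"},
}


def build_matrix(platforms, papers, used):
    """Return rows of booleans: matrix[i][j] is True if paper j uses platform i."""
    return [[paper in used.get(platform, set()) for paper in papers]
            for platform in platforms]


matrix = build_matrix(platforms, papers, used)

# Render with 'x' for used and '.' for unused.
print(" " * 16 + " ".join(papers))
for platform, row in zip(platforms, matrix):
    cells = " ".join(" x " if hit else " . " for hit in row)
    print(f"{platform:<16}{cells}")
```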
