Benchmark Landscape

8 dedicated benchmarks · 12 evaluation platforms · from 69 papers


Benchmark Catalog

8 papers with paper_type = benchmark

ctf 2024

The capacity of LLMs to solve Capture-the-Flag (CTF) challenges in cybersecurity has not been thoroughly evaluated. Existing studies are limited in scope, use closed benchmarks, lack automated frameworks, and test only a small number of challenges or models.


Composition: 200 CTF challenges across 6 categories (32 crypto, 7 forensics, 29 pwn, 31 rev, 20 web, 13 misc from qualifying rounds; 8 crypto, 10 forensics, 20 pwn, 11 rev, 6 web from finals)
Attack Phases: reconnaissance, scanning, enumeration, exploitation, post-exploitation, privilege escalation, lateral movement, reporting
Domains: open-source-llm, binary-exploitation, reverse-engineering, cryptography, web-security, forensics, open-source-benchmark
Open Source: Yes - https://github.com/NYU-LLM-CTF/NYU_CTF_Bench and https://github.com/NYU-LLM-CTF/llm_ctf_automation
penetration-testing 2024

Despite growing interest in automating penetration testing with LLM-based generative agents, there is no comprehensive and standardized benchmark framework for evaluating, comparing, and developing such agents.


Composition: 33 tasks (22 in-vitro across 4 categories + 11 real-world CVE-based tasks)
Attack Phases: reconnaissance, scanning, enumeration, exploitation, post-exploitation, privilege escalation, lateral movement, reporting
Domains: network-pentest, web-pentest, CVE-exploitation, open-source
Open Source: Yes - https://github.com/lucagioacchini/auto-pen-bench
vulnerability-assessment 2025

There is a scarcity of comprehensive benchmarks for evaluating LLMs in the Vulnerability Assessment and Penetration Testing (VAPT) domain, especially for small, open-source models suitable for local deployment. Existing general LLM benchmarks and even emerging cybersecurity-specific benchmarks do not adequately address the specific requirements of VAPT or the particular considerations for smaller, locally deployable models.


Composition: 7800 questions across 6 datasets
Domains: open-source-llm, cve
Open Source: No (benchmark datasets not publicly released)
penetration-testing 2025

There is no comprehensive, open, end-to-end penetration testing benchmark to evaluate and drive progress of LLM-based automated penetration testing. Existing tools like PentestGPT rely heavily on human participation, and the degree of human involvement and the specific challenges LLMs face at each pentest stage are not well understood.


Composition: 152 tasks across 13 VulnHub machines (7 easy, 4 medium, 2 hard), distributed as 72 reconnaissance, 14 general techniques, 44 exploitation, and 22 privilege escalation
Attack Phases: reconnaissance, scanning, enumeration, exploitation, post-exploitation, privilege escalation, lateral movement, reporting
Domains: open-source-llm, network-pentest, privilege-escalation
Open Source: Yes - Benchmark: https://github.com/isamu-isozaki/AI-Pentest-Benchmark; Modified PentestGPT: https://github.com/isamu-isozaki/PentestGPT
penetration-testing 2025

Existing LLM-based offensive agent evaluations operate in closed-world settings with predefined goals and binary success criteria, using isolated single-service environments that fail to capture realistic multi-target attack scenarios involving reconnaissance, target selection, and exploitation under uncertainty.


Composition: 40 web-based CTF challenges deployed on a single virtual machine
Attack Phases: reconnaissance, scanning, enumeration, exploitation, post-exploitation, privilege escalation, lateral movement, reporting
Domains: web-pentest, open-source-llm
vulnerability-assessment 2025

While computer-use agents (CUAs) have shown strong capabilities in web browsing and visual task automation, their potential to discover and exploit web application vulnerabilities through graphical interfaces remains unknown. Existing benchmarks for CUAs assume sanitized environments and overlook security aspects.


Composition: 36 web CTF challenges from NYU CTF Bench (26), Cybench (8), and InterCode-CTF (2)
Attack Phases: reconnaissance, scanning, enumeration, exploitation, post-exploitation, privilege escalation, lateral movement, reporting
Domains: web-pentest, open-source-llm
Open Source: Yes - https://github.com/GUI-Agent/HackWorld
penetration-testing 2025

Existing LLM-based penetration testing approaches rely on simplistic prompting without task decomposition or domain adaptation, resulting in unreliable black-box behavior and limited insight into model capabilities across individual penetration testing stages.


Composition: 346 tasks across 12 realistic vulnerable scenarios, with over 3000 stage-level evaluations and 180 end-to-end tests
Attack Phases: reconnaissance, scanning, enumeration, exploitation, post-exploitation, privilege escalation, lateral movement, reporting
Domains: web-pentest, open-source-llm
Open Source: Yes
ctf 2025

There is a need to quantify the cybersecurity capabilities and risks of language model agents, but existing benchmarks use easy, non-professional-level tasks, are not open-source, or lack objective difficulty grounding.


Composition: 40 professional-level CTF tasks from 4 competitions
Attack Phases: reconnaissance, scanning, enumeration, exploitation, post-exploitation, privilege escalation, lateral movement, reporting
Domains: open-source-llm, web-pentest, network-pentest
Open Source: Yes - https://cybench.github.io
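
Several Composition lines above state a total together with a breakdown. A small sketch like the following can check that the breakdown adds up to the stated total. The dictionary keys are shorthand labels invented for this sketch, and the numbers are copied from the catalog entries as written. For the ctf 2024 entry, the listed per-category counts sum to 187 against a stated total of 200, so the sketch reports the difference rather than resolving it.

```python
# Composition figures copied from the catalog entries above.
# The keys are shorthand labels made up for this sketch, not official names.
COMPOSITIONS = {
    "ctf-2024": (200, [32, 7, 29, 31, 20, 13, 8, 10, 20, 11, 6]),
    "penetration-testing-2024": (33, [22, 11]),
    "penetration-testing-2025-vulnhub": (152, [72, 14, 44, 22]),
    "vulnerability-assessment-2025-web": (36, [26, 8, 2]),
}


def composition_gap(stated_total, parts):
    """Return stated_total minus the sum of the listed parts (0 means consistent)."""
    return stated_total - sum(parts)


def check_all(compositions):
    """Map each label to the gap between its stated total and its breakdown."""
    return {
        name: composition_gap(total, parts)
        for name, (total, parts) in compositions.items()
    }


if __name__ == "__main__":
    for name, gap in check_all(COMPOSITIONS).items():
        status = "consistent" if gap == 0 else f"breakdown differs by {gap}"
        print(f"{name}: {status}")
```

A non-zero gap only indicates that the summary text is internally inconsistent or incomplete; the original paper remains the authority on the actual counts.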

Benchmark Comparison Table

Key dimensions across all benchmarks

Columns: Name, Year, Scope, Scale, Attack Phases, Metrics, Environment, Open Source, Key Finding
#28 NYU CTF Bench: A Scalable Open-Source Benchmark Dataset for Evaluating LLMs in Offensive Security
Year: 2024
Scope: ctf
Scale: 200 CTF challenges across 6 categories (32 crypto, 7 forensics, 29 pwn, 31 rev, 20 web, 13 misc from qualifying rounds; 8 crypto, 10 forensics, 20 pwn, 11 rev, 6 web from finals)
Attack Phases: reconnaissance, enumeration, exploitation
Metrics: success-rate, failure-type-distribution, comparison-with-human-scores
Environment: custom-lab, CTF-competition
Open Source: Yes
Key Finding: GPT-4 performed best overall, solving challenges across multiple categories (cryptography 6.67%, binary exploitation 7.69%, reverse engineering 9.80%, web 5.26%, misc up to…
#29 AutoPenBench: Benchmarking Generative Agents for Penetration Testing
Year: 2024
Scope: penetration-testing
Scale: 33 tasks (22 in-vitro across 4 categories + 11 real-world CVE-based tasks)
Attack Phases: reconnaissance, scanning, enumeration, exploitation, privilege escalation, post-exploitation
Metrics: success-rate, progress-rate (fraction of command milestones achieved), stage-level success rate (per pentest phase), consistency (variability across repeated runs)
Environment: custom-lab (Docker-based vulnerable containers)
Open Source: Yes
Key Finding: The fully autonomous agent (GPT-4o) achieves only a 21% overall success rate (27% on in-vitro tasks, 9% on real-world tasks) and completes about 40% of intermediate milestones on average. The assisted agent triples performance, reaching a success rate of 6…
#30 VAP-6: A Benchmarking Framework on Vulnerability Assessment and Penetration Testing for Language Models
Year: 2025
Scope: vulnerability-assessment
Scale: 7800 questions across 6 datasets
Attack Phases: none listed
Metrics: accuracy, RMSE (CVSS base score prediction), Levenshtein distance (CVSS vector string similarity), exact match (CVSS vector strings)
Environment: Local deployment via Ollama (16 GB RAM, 4 GB NVIDIA VRAM)
Open Source: Yes
Key Finding: All three models scored well above typical certification passing thresholds (70-85%) on CEH-style and PenTest+-style questions, with Qwen 2.5 and Llama 3.2 exceeding 96%…
#39 Towards Automated Penetration Testing: Introducing LLM Benchmark, Analysis, and Improvements
Year: 2025
Scope: penetration-testing
Scale: 152 tasks across 13 VulnHub machines (7 easy, 4 medium, 2 hard), distributed as 72 reconnaissance, 14 general techniques, 44 exploitation, and 22 privilege escalation
Attack Phases: reconnaissance, enumeration, exploitation, privilege escalation
Metrics: success-rate, task-completion
Environment: VulnHub
Open Source: Yes
Key Finding: Llama 3.1-405B outperforms GPT-4o on 7 out of 13 machines, with equal…
#46 CyberExplorer: Benchmarking LLM Offensive Security Capabilities in a Real-World Attacking Simulation Environment
Year: 2025
Scope: penetration-testing
Scale: 40 web-based CTF challenges deployed on a single virtual machine
Attack Phases: reconnaissance, scanning, enumeration, exploitation
Metrics: success-rate, precision, recall, time-to-first-flag, average-cost-per-challenge, average-interaction-rounds, agent-count, dead-end-persistence-ratio, vulnerability-discovery-signal, agent-inflation-factor
Environment: custom-lab
Open Source: No
Key Finding: Claude Opus 4.5 and Gemini 3 Pro achieved the highest recall (22.50% each, 9 of 40 flags), with Opus…
#53 HackWorld: Evaluating Computer-Use Agents on Exploiting Web Application Vulnerabilities
Year: 2025
Scope: vulnerability-assessment
Scale: 36 web CTF challenges from NYU CTF Bench (26), Cybench (8), and InterCode-CTF (2)
Attack Phases: reconnaissance, scanning, enumeration, exploitation
Metrics: success-rate, tool-usage-frequency, num-steps
Environment: custom-lab
Open Source: Yes
Key Finding: CUAs achieve exploitation rates below 12%, with Claude-3.7-Sonnet performing best at an average success rate of 10.18% across observation spaces (with screenshots or set-of-marks, peaking at…
#58 PentestEval: Benchmarking LLM-based Penetration Testing with Modular and Stage-Level Design
Year: 2025
Scope: penetration-testing
Scale: 346 tasks across 12 realistic vulnerable scenarios, with over 3000 stage-level evaluations and 180 end-to-end tests
Attack Phases: reconnaissance, scanning, enumeration, exploitation, post-exploitation
Metrics: Jaccard-similarity, Spearman-rank-correlation, syntax-correctness, functional-correctness, revision-success-rate, recall, NonCVE-identification-rate, end-to-end-success-rate
Environment: custom-lab, VulnHub, Docker-containers, Amazon-Lightsail
Open Source: Yes
Key Finding: At the stage level, LLMs exhibit generally weak performance, with a mean success rate of only 0.41 across all stages; attack decision-making (0.25) and functional correctness of exploit generation (0.26)…
#63 Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models
Year: 2025
Scope: ctf
Scale: 40 professional-level CTF tasks from 4 competitions
Attack Phases: reconnaissance, enumeration, exploitation, privilege escalation
Metrics: success-rate, task-completion, subtask-performance, first-solve-time-correlation, token-usage
Environment: custom-lab, CTF-competition
Open Source: Yes
Key Finding: Claude 3.5 Sonnet achieves the highest unguided performance at 17.5%; GPT-4o leads with subtask guidance at 17.5%; OpenAI…
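
Several metrics named in the table (recall, RMSE, Levenshtein distance, Jaccard similarity) have standard textbook definitions. The sketch below implements those generic definitions for illustration only. The source does not say how each benchmark computes its metrics, so this should not be read as any paper's actual scoring code, and the CVSS vector strings in the usage section are made-up inputs.

```python
import math


def recall(found, total):
    """Fraction of targets recovered, e.g. flags captured out of flags available."""
    if total <= 0:
        raise ValueError("total must be positive")
    return found / total


def rmse(predicted, actual):
    """Root-mean-square error between two equal-length numeric sequences."""
    if len(predicted) != len(actual) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def jaccard(a, b):
    """Size of the intersection divided by size of the union of two collections."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    # Assumption of this sketch: two empty sets count as identical.
    return len(set_a & set_b) / len(union) if union else 1.0


if __name__ == "__main__":
    # 9 of 40 flags, the figure quoted in the CyberExplorer row.
    print(f"recall = {recall(9, 40):.4f}")
    # Hypothetical CVSS vector strings that differ in one character (PR:N vs PR:L).
    predicted_vec = "CVSS:3.1/AV:N/AC:L/PR:L/UI:N/S:U/C:H/I:H/A:H"
    reference_vec = "CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H"
    print(f"levenshtein = {levenshtein(predicted_vec, reference_vec)}")
    print(f"exact match = {predicted_vec == reference_vec}")
```

With recall defined this way, 9 flags out of 40 gives 0.225, which matches the 22.50% reported for Claude Opus 4.5 and Gemini 3 Pro.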

Evaluation Platform Usage

Number of papers using each evaluation platform

Custom Lab: 28
VulnHub: 17
HackTheBox: 14
PicoCTF: 8
Cybench: 5
InterCode-CTF: 5
Metasploitable: 5
CTF Competition: 5
AutoPenBench: 4
NYU CTF Bench: 4
OverTheWire: 4
GOAD: 3
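
As a rough illustration, the usage counts above can be turned into per-platform shares. The sketch assumes the denominator is the 69 papers mentioned in the page header, which the source does not state explicitly. The function names are invented for this example.

```python
# Paper counts per evaluation platform, copied from the list above.
PLATFORM_COUNTS = {
    "Custom Lab": 28,
    "VulnHub": 17,
    "HackTheBox": 14,
    "PicoCTF": 8,
    "Cybench": 5,
    "InterCode-CTF": 5,
    "Metasploitable": 5,
    "CTF Competition": 5,
    "AutoPenBench": 4,
    "NYU CTF Bench": 4,
    "OverTheWire": 4,
    "GOAD": 3,
}

# Assumption: the page header's "69 papers" is the right denominator.
TOTAL_PAPERS = 69


def usage_share(counts, total_papers):
    """Return each platform's paper count as a fraction of total_papers."""
    if total_papers <= 0:
        raise ValueError("total_papers must be positive")
    return {name: n / total_papers for name, n in counts.items()}


def top_platforms(counts, k):
    """Return the k most-used platforms, highest count first."""
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)[:k]


if __name__ == "__main__":
    shares = usage_share(PLATFORM_COUNTS, TOTAL_PAPERS)
    for name, count in top_platforms(PLATFORM_COUNTS, 3):
        print(f"{name}: {count} papers ({shares[name]:.1%})")
    print(f"total platform uses: {sum(PLATFORM_COUNTS.values())}")
```

The counts sum to 102, more than the number of papers, which is consistent with a single paper being evaluated on more than one platform; the shares therefore do not add up to 100%.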

Paper × Evaluation Platform Matrix

Which papers use which evaluation platforms (filled = used)

[Matrix figure. Rows: Custom Lab, VulnHub, HackTheBox, PicoCTF, Cybench, InterCode-CTF, Metasploitable, CTF Competition, AutoPenBench, NYU CTF Bench, OverTheWire, GOAD. Columns: papers #01-#16, #18-#29, #31-#34, #36-#41, #43-#56, #58-#59, #61-#66, #68. Cell values are not available in this text version.]