Benchmark Landscape
8 dedicated benchmarks · 12 evaluation platforms · from 69 papers
Benchmark Catalog
8 papers with paper_type = benchmark
The capacity of LLMs to solve Capture-the-Flag (CTF) challenges in cybersecurity has not been thoroughly evaluated. Existing studies are limited in scope, use closed benchmarks, lack automated frameworks, and test only a small number of challenges or models.
Despite growing interest in automating penetration testing with LLM-based generative agents, there is no comprehensive and standardized benchmark framework for evaluating, comparing, and developing such agents.
There is a scarcity of comprehensive benchmarks for evaluating LLMs in the Vulnerability Assessment and Penetration Testing (VAPT) domain, especially for small, open-source models suitable for local deployment. Existing general LLM benchmarks and even emerging cybersecurity-specific benchmarks do not adequately address the specific requirements of VAPT or the particular considerations for smaller, locally deployable models.
There is no comprehensive, open, end-to-end penetration testing benchmark to evaluate and drive progress of LLM-based automated penetration testing. Existing tools like PentestGPT rely heavily on human participation, and the degree of human involvement and the specific challenges LLMs face at each pentest stage are not well understood.
Existing LLM-based offensive agent evaluations operate in closed-world settings with predefined goals and binary success criteria, using isolated single-service environments that fail to capture realistic multi-target attack scenarios involving reconnaissance, target selection, and exploitation under uncertainty.
While computer-use agents (CUAs) have shown strong capabilities in web browsing and visual task automation, their potential to discover and exploit web application vulnerabilities through graphical interfaces remains unknown. Existing benchmarks for CUAs assume sanitized environments and overlook security aspects.
Existing LLM-based penetration testing approaches rely on simplistic prompting without task decomposition or domain adaptation, resulting in unreliable black-box behavior and limited insight into model capabilities across individual penetration testing stages.
There is a need to quantify the cybersecurity capabilities and risks of language model agents, but existing benchmarks use easy, non-professional-level tasks, are not open-source, or lack objective difficulty grounding.
Benchmark Comparison Table
Key dimensions across all benchmarks
| Name | Year | Scope | Scale | Attack Phases | Metrics | Environment | Open Source | Key Finding |
|---|---|---|---|---|---|---|---|---|
| #28 NYU CTF Bench: A Scalable Open-Source Benchmark Dataset for Evaluating LLMs in Offensive Security | 2024 | ctf | 200 CTF challenges across 6 categories (32 crypto, 7 forensics, 29 pwn, 31 rev, 20 web, 13 misc from qualifying rounds; 8 crypto, 10 forensics, 20 pwn, 11 rev, 6 web from finals) | reconnaissance, enumeration, exploitation | success-rate, failure-type-distribution, comparison-with-human-scores | custom-lab, CTF-competition | Yes | GPT-4 performed best overall, solving challenges across multiple categories (crypto 6.67%, pwn 7.69%, rev 9.80%, web 5.26%, misc up to… |
| #29 AutoPenBench: Benchmarking Generative Agents for Penetration Testing | 2024 | penetration-testing | 33 tasks (22 in-vitro across 4 categories + 11 real-world CVE-based tasks) | reconnaissance, scanning, enumeration, exploitation, privilege escalation, post exploitation | success-rate, progress-rate (fraction of command milestones achieved), stage-level success rate (per pentest phase), consistency (variability across repeated runs) | custom-lab (Docker-based vulnerable containers) | Yes | The fully autonomous agent (GPT-4o) achieves only a 21% overall success rate (27% on in-vitro tasks, 9% on real-world tasks), completing about 40% of intermediate milestones on average. The assisted agent triples performance, reaching a success rate of 6… |
| #30 VAP-6: A Benchmarking Framework on Vulnerability Assessment and Penetration Testing for Language Models | 2025 | vulnerability-assessment | 7800 questions across 6 datasets | | accuracy, RMSE (CVSS base score prediction), Levenshtein distance (CVSS vector string similarity), exact match (CVSS vector strings) | Local deployment via Ollama (16 GB RAM, 4 GB NVIDIA VRAM) | Yes | All three models scored well above typical certification passing thresholds (70-85%) on CEH- and PenTest+-style questions, with Qwen 2.5 and Llama 3.2 exceeding 96%… |
| #39 Towards Automated Penetration Testing: Introducing LLM Benchmark, Analysis, and Improvements | 2025 | penetration-testing | 152 tasks across 13 VulnHub machines (7 easy, 4 medium, 2 hard), distributed as 72 reconnaissance, 14 general techniques, 44 exploitation, and 22 privilege escalation tasks | reconnaissance, enumeration, exploitation, privilege escalation | success-rate, task-completion | VulnHub | Yes | Llama 3.1-405B outperforms GPT-4o on 7 out of 13 machines, with equal… |
| #46 CyberExplorer: Benchmarking LLM Offensive Security Capabilities in a Real-World Attacking Simulation Environment | 2025 | penetration-testing | 40 web-based CTF challenges deployed on a single virtual machine | reconnaissance, scanning, enumeration, exploitation | success-rate, precision, recall, time-to-first-flag, average-cost-per-challenge, average-interaction-rounds, agent-count, dead-end-persistence-ratio, vulnerability-discovery-signal, agent-inflation-factor | custom-lab | No | Claude Opus 4.5 and Gemini 3 Pro achieved the highest recall (22.50% each, 9 of 40 flags), with Opus… |
| #53 HackWorld: Evaluating Computer-Use Agents on Exploiting Web Application Vulnerabilities | 2025 | vulnerability-assessment | 36 web CTF challenges from NYU CTF Bench (26), Cybench (8), and InterCode-CTF (2) | reconnaissance, scanning, enumeration, exploitation | success-rate, tool-usage-frequency, num-steps | custom-lab | Yes | CUAs achieve exploitation rates below 12%, with Claude-3.7-Sonnet performing best at an average success rate of 10.18% across observation spaces (peaking with screenshots or set-of-marks at… |
| #58 PentestEval: Benchmarking LLM-based Penetration Testing with Modular and Stage-Level Design | 2025 | penetration-testing | 346 tasks across 12 realistic vulnerable scenarios, over 3000 stage-level evaluations and 180 end-to-end tests | reconnaissance, scanning, enumeration, exploitation, post exploitation | Jaccard-similarity, Spearman-rank-correlation, syntax-correctness, functional-correctness, revision-success-rate, recall, NonCVE-identification-rate, end-to-end-success-rate | custom-lab, VulnHub, Docker-containers, Amazon-Lightsail | Yes | At the stage level, LLMs perform weakly overall, with a mean success rate of only 0.41 across all stages; attack decision-making (0.25) and functional correctness of exploit generation (0.26)… |
| #63 Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models | 2025 | ctf | 40 professional-level CTF tasks from 4 competitions | reconnaissance, enumeration, exploitation, privilege escalation | success-rate, task-completion, subtask-performance, first-solve-time-correlation, token-usage | custom-lab, CTF-competition | Yes | Claude 3.5 Sonnet achieves the highest unguided performance at 17.5%; GPT-4o leads under subtask guidance at 17.5%; OpenAI… |
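Several metrics named in the Metrics column are simple ratios or string distances. The sketch below shows one plain-Python way to compute a few of them. The function names and every sample input are illustrative assumptions, not code from any of the benchmarks, and individual papers may define their metrics somewhat differently.

```python
import math


def success_rate(outcomes):
    """Fraction of tasks solved, given one boolean per task."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def progress_rate(milestones_achieved, milestones_total):
    """Fraction of command milestones reached within a task."""
    return milestones_achieved / milestones_total if milestones_total else 0.0


def rmse(predicted, actual):
    """Root-mean-square error, e.g. for predicted CVSS base scores."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def jaccard(a, b):
    """Jaccard similarity of two collections treated as sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0


if __name__ == "__main__":
    # Hypothetical inputs, for illustration only.
    print(success_rate([True, False, False, False]))
    print(progress_rate(2, 5))
    print(levenshtein("AV:N/AC:L/PR:N", "AV:N/AC:L/PR:L"))
    print(jaccard(["nmap", "ssh"], ["ssh", "ftp"]))
```

Keeping each metric as a small pure function makes it easy to swap in a paper's exact definition, for example a normalized Levenshtein distance, without touching the rest of an evaluation harness.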
Evaluation Platform Usage
Number of papers using each evaluation platform
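A per-platform usage count can be tallied from a paper-to-platform mapping. The sketch below assumes such a mapping exists. `PAPER_PLATFORMS` is a hypothetical sample loosely based on the Environment column of the comparison table, and the platform labels and the way the dashboard actually derives its counts are assumptions.

```python
from collections import defaultdict

# Hypothetical sample: paper ID -> evaluation platforms it uses.
PAPER_PLATFORMS = {
    "#28": ["Custom Lab", "CTF Competition"],
    "#29": ["Custom Lab"],
    "#39": ["VulnHub"],
    "#58": ["Custom Lab", "VulnHub"],
    "#63": ["Custom Lab", "CTF Competition"],
}


def papers_by_platform(paper_platforms):
    """Invert the mapping: platform -> sorted list of paper IDs."""
    index = defaultdict(set)
    for paper, platforms in paper_platforms.items():
        for platform in platforms:
            index[platform].add(paper)  # a set ignores repeated entries
    return {platform: sorted(papers) for platform, papers in index.items()}


def usage_counts(paper_platforms):
    """(platform, paper count) pairs, most used first, ties by name."""
    index = papers_by_platform(paper_platforms)
    pairs = [(platform, len(papers)) for platform, papers in index.items()]
    return sorted(pairs, key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    for platform, count in usage_counts(PAPER_PLATFORMS):
        print(f"{platform}: {count}")
```

The inverted index from `papers_by_platform` also gives the list of papers behind each count, which is the detail a bar chart of this kind would typically expose.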
Paper × Evaluation Platform Matrix
Which papers use which evaluation platforms (filled = used)
Matrix rows (evaluation platforms): Custom Lab, VulnHub, HackTheBox, PicoCTF, Cybench, InterCode-CTF, Metasploitable, CTF Competition, AutoPenBench, NYU CTF Bench, OverTheWire, GOAD.
Matrix columns (61 papers): #01 to #16, #18 to #29, #31 to #34, #36 to #41, #43 to #56, #58, #59, #61 to #66, #68.
[Cell shading is not preserved in this text version.]