Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models
Problem & Motivation
There is a need to quantify the cybersecurity capabilities and risks of language model agents, but existing benchmarks use easy, non-professional-level tasks, are not open-source, or lack objective difficulty grounding.
Policymakers, model providers, and researchers need open-source, professional-level benchmarks to evaluate LM agents on realistic cybersecurity tasks. Existing CTF benchmarks (InterCode-CTF, NYU CTF Dataset) target high-school or university-level challenges, use subjective point-based difficulty, and risk train-test overlap. Government efforts (UK AISI, OpenAI) are not open-source. Cybench fills this gap with professional-level, open-source, objectively graded tasks.
Threat Model
Autonomous LM agents with access to a Kali Linux environment, capable of executing bash commands, interacting with network services, and reading/writing files. The agent operates without a human in the loop.
Methodology
Cybench is a framework for specifying cybersecurity tasks (as CTF challenges) and evaluating LM-based agents on them. It includes 40 professional-level CTF tasks from 4 competitions (HackTheBox, SekaiCTF, Glacier, HKCert), spanning 6 categories (crypto, web, reverse engineering, forensics, exploitation, misc). Each task has a description, starter files, an evaluator, and optional subtasks that decompose tasks into intermediary steps for finer-grained evaluation. Task difficulty is objectively grounded using first solve time (FST) from competitions, ranging from 2 minutes to 24 hours 54 minutes.
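Each task bundles a description, starter files, an evaluator (flag check), FST-based difficulty, and optional subtasks. A minimal sketch of such a task specification; the field and class names here are illustrative, not the framework's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class Subtask:
    """One intermediary step; answering it correctly earns partial credit."""
    question: str   # e.g. "Which cipher is used?"
    answer: str     # exact string checked by the evaluator

@dataclass
class Task:
    name: str
    category: str                 # crypto, web, reverse, forensics, pwn, or misc
    description: str              # prompt shown to the agent
    starter_files: list[str] = field(default_factory=list)
    flag: str = ""                # final answer checked by the evaluator
    first_solve_time_min: float = 0.0   # FST: objective difficulty, 2 to 1494 minutes
    subtasks: list[Subtask] = field(default_factory=list)
```

A usage example: `Task(name="example-task", category="crypto", description="Decrypt the flag.", flag="flag{example}", first_solve_time_min=2.0)`.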
Architecture
Each task is instantiated in a Docker environment with a Kali Linux container (agent workspace) connected via network to task server container(s). The agent follows an act-execute-update loop: it receives a prompt with task description, generates a response containing a bash command, the command is executed, and the observation is fed back. The agent's memory consists of the initial prompt and the last 3 response-observation pairs.
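The act-execute-update loop with bounded memory can be sketched as follows. This is a minimal illustration, not Cybench's actual implementation: `generate` stands in for an LM call, and the iteration cap, timeout, and output truncation limits are assumed values.

```python
import subprocess
from collections import deque
from typing import Callable, Optional

MEMORY_PAIRS = 3  # agent keeps only the last 3 response-observation pairs

def run_agent(initial_prompt: str,
              generate: Callable[[str], str],
              max_iterations: int = 15) -> Optional[str]:
    """Act-execute-update loop: prompt the model, run its bash command,
    feed the observation back, repeat. Returns the submitted flag, if any."""
    history = deque(maxlen=MEMORY_PAIRS)   # bounded conversation history
    for _ in range(max_iterations):
        # Memory = initial prompt + last 3 response-observation pairs
        prompt = initial_prompt + "".join(
            f"\nResponse: {r}\nObservation: {o}" for r, o in history
        )
        response = generate(prompt)             # LM proposes a bash command
        if response.startswith("Answer:"):      # model submits the flag
            return response.removeprefix("Answer:").strip()
        result = subprocess.run(response, shell=True,
                                capture_output=True, text=True, timeout=120)
        observation = (result.stdout + result.stderr)[-3000:]  # truncate output
        history.append((response, observation))
    return None
```

Because `history` is a `deque(maxlen=3)`, older exchanges silently fall out of the prompt, matching the bounded-memory design described above.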
LLM Models
The 8 models listed under Baseline Comparisons below.
Tool Integration
Bash command execution inside the Kali Linux container.
Memory Mechanism
Conversation history: the initial prompt plus the last 3 response-observation pairs.
Attack Phases Covered
Evaluation
Claude 3.5 Sonnet achieves the highest unguided performance (17.5% of tasks solved), GPT-4o leads subtask-guided performance (17.5%), and OpenAI o1-preview leads subtask performance (46.8% of subtasks completed). Without subtask guidance, four models (Claude 3.5 Sonnet, GPT-4o, Claude 3 Opus, OpenAI o1-preview) solve tasks with FST of up to 11 minutes; no model solves any task with a higher FST. Agent scaffolding effects are model-dependent: for example, Claude 3.5 Sonnet outperforms GPT-4o under the pseudoterminal and web search scaffolds.
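The three headline metrics reduce to simple rates over per-task run records. A sketch, assuming a hypothetical log format (the field names below are illustrative):

```python
def summarize(runs: list[dict]) -> dict[str, float]:
    """Compute the three Cybench metrics from per-task run records.

    Each record is assumed to look like:
      {"unguided_solved": bool, "guided_solved": bool,
       "subtasks_total": int, "subtasks_correct": int}
    """
    n = len(runs)
    total_subtasks = sum(r["subtasks_total"] for r in runs)
    return {
        # fraction of tasks solved end-to-end with no subtask hints
        "unguided": sum(r["unguided_solved"] for r in runs) / n,
        # fraction of tasks solved when subtasks guide the agent step by step
        "subtask_guided": sum(r["guided_solved"] for r in runs) / n,
        # fraction of all individual subtasks answered correctly (partial credit)
        "subtask": sum(r["subtasks_correct"] for r in runs) / total_subtasks,
    }
```

Note that the subtask rate is computed over subtasks, not tasks, which is why it can sit well above the task-level success rates.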
Environment
Docker: a Kali Linux agent container networked to task server container(s).
Metrics
Unguided success rate, subtask-guided success rate, and subtask completion rate (partial credit).
Baseline Comparisons
- GPT-4o
- OpenAI o1-preview
- Claude 3.5 Sonnet
- Claude 3 Opus
- Gemini 1.5 Pro
- Mixtral 8x22B Instruct
- Llama 3 70B Chat
- Llama 3.1 405B Instruct
Scale
40 professional-level CTF tasks from 4 competitions
Contributions
- Open-source benchmark with 40 recent professional-level CTF tasks from 4 distinct competitions
- Framework to unify tasks across distinct CTF competitions into a single benchmark
- Objective task difficulty grounding based on first solve time (FST) of human teams
- Log-linear scaling of difficulties with a high difficulty ceiling beyond existing benchmarks (747x range in FST)
- Subtasks that break down tasks into intermediary steps for more detailed, partial-credit evaluation
- Task verifiability through solution scripts and continuous integration
- Kali Linux-based agent with reflection and planning (structured bash scaffold)
- Comprehensive experiments across 8 models and 4 agent scaffolds (structured bash, action-only, pseudoterminal, web search)
Limitations
- Limited to 40 CTF tasks; does not cover all cybersecurity domains or real-world penetration testing scenarios
- Agents fail on all tasks with FST above 11 minutes, indicating current LMs cannot handle harder professional tasks
- Safety refusals from Claude 3 Opus and Claude 3.5 Sonnet required prompt engineering to mitigate
- Pseudoterminal and web search scaffolds increase action space complexity and can hurt rather than help performance
- Single-attempt evaluation for main results; max-of-3 for scaffold comparison
- CTF tasks are proxies for real-world hacking but each typically demonstrates a single skill rather than chained attack sequences
- Potential train-test overlap for Claude 3.5 Sonnet due to knowledge cutoff timing
Research Gaps
- Models cannot make the sophisticated 'insights' required for harder tasks, suggesting a fundamental capability gap beyond simple tool use
- Optimal agent scaffolding is model-dependent, with no single scaffold universally best
- Subtask guidance does not always improve performance, indicating agents struggle with intermediate step execution even when guided
- No exploration of multi-agent architectures or more advanced planning strategies for cybersecurity tasks
- Need for benchmarks covering real-world penetration testing beyond CTF-style challenges
Novel Techniques
- First solve time (FST) as an objective, competition-grounded difficulty metric for cybersecurity benchmarks
- Subtask decomposition for partial-credit evaluation of complex cybersecurity tasks
- Structured bash agent scaffold with Reflection/Plan-and-Status/Thought/Log/Action response format
- Comparison of 4 agent scaffolds (structured bash, action-only, pseudoterminal, web search) for cybersecurity agents
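The structured bash scaffold asks the model to answer in labeled sections (Reflection, Plan and Status, Thought, Log, Action, per the list above). A sketch of parsing such a response into its sections; the regex handling is illustrative, not the framework's actual parser:

```python
import re

SECTIONS = ["Reflection", "Plan and Status", "Thought", "Log", "Action"]

def parse_response(text: str) -> dict[str, str]:
    """Split a structured-bash response into its labeled sections.

    Each section is expected to start with 'Name:' at the beginning of a
    line; the 'Action' section carries the bash command to execute next.
    """
    pattern = "|".join(re.escape(s) for s in SECTIONS)
    parts = re.split(rf"^({pattern}):", text, flags=re.MULTILINE)
    # re.split with a capturing group yields [prefix, name1, body1, name2, body2, ...]
    return {parts[i]: parts[i + 1].strip() for i in range(1, len(parts) - 1, 2)}
```

Forcing the command into a dedicated `Action` section is what lets the loop extract exactly one executable step per turn while still eliciting reflection and planning.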
Open Questions
- Why do models hit a hard ceiling at 11-minute FST tasks, and what capabilities are needed to break through?
- How would multi-agent or hierarchical architectures perform on these professional-level tasks?
- Can fine-tuning on cybersecurity data significantly improve performance on harder tasks?
- How do newer models (GPT-4o successors, Claude 3.5+ models) perform as the benchmark evolves?
- What is the relationship between general reasoning capability and cybersecurity task performance?
Builds On
- Reflexion (Shinn et al., 2024)
- ReAct (Yao et al., 2022)
- MLAgentBench (Huang et al., 2024)
- InterCode-CTF (Yang et al., 2023)
- NYU CTF Dataset (Shao et al., 2024)
- PentestGPT (Deng et al., 2023)
- PenHeal (Huang & Zhu, 2024)
Open Source
Yes - https://cybench.github.io