Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models
Problem & Motivation
There is a need to quantify the cybersecurity capabilities and risks of language model agents, but existing benchmarks use easy, non-professional-level tasks, are not open-source, or lack objective difficulty grounding.
Policymakers, model providers, and researchers need open-source, professional-level benchmarks to evaluate LM agents on realistic cybersecurity tasks. Existing CTF benchmarks (InterCode-CTF, NYU CTF Dataset) target high-school or university-level challenges, use subjective point-based difficulty, and risk train-test overlap. Government efforts (UK AISI, OpenAI) are not open-source. Cybench fills this gap with professional-level, open-source, objectively graded tasks.
Threat Model
Autonomous LM agents with access to a Kali Linux environment, capable of executing bash commands, interacting with network services, and reading/writing files. The agent operates without a human in the loop.
Methodology
Cybench is a framework for specifying cybersecurity tasks (as CTF challenges) and evaluating LM-based agents on them. It includes 40 professional-level CTF tasks from 4 competitions (HackTheBox, SekaiCTF, Glacier, HKCert), spanning 6 categories (crypto, web, reverse engineering, forensics, exploitation, misc). Each task has a description, starter files, an evaluator, and optional subtasks that decompose tasks into intermediary steps for finer-grained evaluation. Task difficulty is objectively grounded using first solve time (FST) from competitions, ranging from 2 minutes to 24 hours 54 minutes.
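Each task bundles a description, starter files, an evaluator (flag check), FST-based difficulty, and optional subtasks. A minimal sketch of such a task specification; the field and class names here are illustrative, not the framework's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class Subtask:
    """One intermediary step; answering it correctly earns partial credit."""
    question: str   # e.g. "Which cipher is used?"
    answer: str     # exact string checked by the evaluator

@dataclass
class Task:
    name: str
    category: str                 # crypto, web, reverse, forensics, pwn, or misc
    description: str              # prompt shown to the agent
    starter_files: list[str] = field(default_factory=list)
    flag: str = ""                # final answer checked by the evaluator
    first_solve_time_min: float = 0.0   # FST: objective difficulty, 2 to 1494 minutes
    subtasks: list[Subtask] = field(default_factory=list)
```

A usage example: `Task(name="example-task", category="crypto", description="Decrypt the flag.", flag="flag{example}", first_solve_time_min=2.0)`.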
Architecture
Each task is instantiated in a Docker environment with a Kali Linux container (agent workspace) connected via network to task server container(s). The agent follows an act-execute-update loop: it receives a prompt with task description, generates a response containing a bash command, the command is executed, and the observation is fed back. The agent's memory consists of the initial prompt and the last 3 response-observation pairs.
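The act-execute-update loop with bounded memory can be sketched as follows. This is a minimal illustration, not Cybench's actual implementation: `generate` stands in for an LM call, and the iteration cap, timeout, and output truncation limits are assumed values.

```python
import subprocess
from collections import deque
from typing import Callable, Optional

MEMORY_PAIRS = 3  # agent keeps only the last 3 response-observation pairs

def run_agent(initial_prompt: str,
              generate: Callable[[str], str],
              max_iterations: int = 15) -> Optional[str]:
    """Act-execute-update loop: prompt the model, run its bash command,
    feed the observation back, repeat. Returns the submitted flag, if any."""
    history = deque(maxlen=MEMORY_PAIRS)   # bounded conversation history
    for _ in range(max_iterations):
        # Memory = initial prompt + last 3 response-observation pairs
        prompt = initial_prompt + "".join(
            f"\nResponse: {r}\nObservation: {o}" for r, o in history
        )
        response = generate(prompt)             # LM proposes a bash command
        if response.startswith("Answer:"):      # model submits the flag
            return response.removeprefix("Answer:").strip()
        result = subprocess.run(response, shell=True,
                                capture_output=True, text=True, timeout=120)
        observation = (result.stdout + result.stderr)[-3000:]  # truncate output
        history.append((response, observation))
    return None
```

Because `history` is a `deque(maxlen=3)`, older exchanges silently fall out of the prompt, matching the bounded-memory design described above.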
LLM Models
The 8 models listed under Baseline Comparisons below.
Tool Integration
Bash command execution inside the Kali Linux container.
Memory Mechanism
Conversation history: the initial prompt plus the last 3 response-observation pairs.
Attack Phases Covered
Evaluation
Claude 3.5 Sonnet achieves the highest unguided performance (17.5% of tasks solved), GPT-4o leads subtask-guided performance (17.5%), and OpenAI o1-preview leads subtask performance (46.8% of subtasks completed). Without subtask guidance, four models (Claude 3.5 Sonnet, GPT-4o, Claude 3 Opus, OpenAI o1-preview) solve tasks with FST of up to 11 minutes; no model solves any task with a higher FST. Agent scaffolding effects are model-dependent: for example, Claude 3.5 Sonnet outperforms GPT-4o under the pseudoterminal and web search scaffolds.
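The three headline metrics reduce to simple rates over per-task run records. A sketch, assuming a hypothetical log format (the field names below are illustrative):

```python
def summarize(runs: list[dict]) -> dict[str, float]:
    """Compute the three Cybench metrics from per-task run records.

    Each record is assumed to look like:
      {"unguided_solved": bool, "guided_solved": bool,
       "subtasks_total": int, "subtasks_correct": int}
    """
    n = len(runs)
    total_subtasks = sum(r["subtasks_total"] for r in runs)
    return {
        # fraction of tasks solved end-to-end with no subtask hints
        "unguided": sum(r["unguided_solved"] for r in runs) / n,
        # fraction of tasks solved when subtasks guide the agent step by step
        "subtask_guided": sum(r["guided_solved"] for r in runs) / n,
        # fraction of all individual subtasks answered correctly (partial credit)
        "subtask": sum(r["subtasks_correct"] for r in runs) / total_subtasks,
    }
```

Note that the subtask rate is computed over subtasks, not tasks, which is why it can sit well above the task-level success rates.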
Environment
Docker: a Kali Linux agent container networked to task server container(s).
Metrics
Unguided success rate, subtask-guided success rate, and subtask completion rate (partial credit).
Baseline Comparisons
- GPT-4o
- OpenAI o1-preview
- Claude 3.5 Sonnet
- Claude 3 Opus
- Gemini 1.5 Pro
- Mixtral 8x22B Instruct
- Llama 3 70B Chat
- Llama 3.1 405B Instruct
Scale
40 professional-level CTF tasks from 4 competitions
Contributions
- Open-source benchmark with 40 recent professional-level CTF tasks from 4 distinct competitions
- Framework to unify tasks across distinct CTF competitions into a single benchmark
- Objective task difficulty grounding based on first solve time (FST) of human teams
- Log-linear scaling of difficulties with a high difficulty ceiling beyond existing benchmarks (747x range in FST)
- Subtasks that break down tasks into intermediary steps for more detailed, partial-credit evaluation
- Task verifiability through solution scripts and continuous integration
- Kali Linux-based agent with reflection and planning (structured bash scaffold)
- Comprehensive experiments across 8 models and 4 agent scaffolds (structured bash, action-only, pseudoterminal, web search)
Limitations
- Limited to 40 CTF tasks; does not cover all cybersecurity domains or real-world penetration testing scenarios
- Agents fail on all tasks with FST above 11 minutes, indicating current LMs cannot handle harder professional tasks
- Safety refusals from Claude 3 Opus and Claude 3.5 Sonnet required prompt engineering to mitigate
- Pseudoterminal and web search scaffolds increase action space complexity and can hurt rather than help performance
- Single-attempt evaluation for main results; max-of-3 for scaffold comparison
- CTF tasks are proxies for real-world hacking but each typically demonstrates a single skill rather than chained attack sequences
- Potential train-test overlap for Claude 3.5 Sonnet due to knowledge cutoff timing
Research Gaps
- Models cannot make the sophisticated 'insights' required for harder tasks, suggesting a fundamental capability gap beyond simple tool use
- Optimal agent scaffolding is model-dependent, with no single scaffold universally best
- Subtask guidance does not always improve performance, indicating agents struggle with intermediate step execution even when guided
- No exploration of multi-agent architectures or more advanced planning strategies for cybersecurity tasks
- Need for benchmarks covering real-world penetration testing beyond CTF-style challenges
Novel Techniques
- First solve time (FST) as an objective, competition-grounded difficulty metric for cybersecurity benchmarks
- Subtask decomposition for partial-credit evaluation of complex cybersecurity tasks
- Structured bash agent scaffold with Reflection/Plan-and-Status/Thought/Log/Action response format
- Comparison of 4 agent scaffolds (structured bash, action-only, pseudoterminal, web search) for cybersecurity agents
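The structured bash scaffold asks the model to answer in labeled sections (Reflection, Plan and Status, Thought, Log, Action, per the list above). A sketch of parsing such a response into its sections; the regex handling is illustrative, not the framework's actual parser:

```python
import re

SECTIONS = ["Reflection", "Plan and Status", "Thought", "Log", "Action"]

def parse_response(text: str) -> dict[str, str]:
    """Split a structured-bash response into its labeled sections.

    Each section is expected to start with 'Name:' at the beginning of a
    line; the 'Action' section carries the bash command to execute next.
    """
    pattern = "|".join(re.escape(s) for s in SECTIONS)
    parts = re.split(rf"^({pattern}):", text, flags=re.MULTILINE)
    # re.split with a capturing group yields [prefix, name1, body1, name2, body2, ...]
    return {parts[i]: parts[i + 1].strip() for i in range(1, len(parts) - 1, 2)}
```

Forcing the command into a dedicated `Action` section is what lets the loop extract exactly one executable step per turn while still eliciting reflection and planning.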
Open Questions
- Why do models hit a hard ceiling at 11-minute FST tasks, and what capabilities are needed to break through?
- How would multi-agent or hierarchical architectures perform on these professional-level tasks?
- Can fine-tuning on cybersecurity data significantly improve performance on harder tasks?
- How do newer models (GPT-4o successors, Claude 3.5+ models) perform as the benchmark evolves?
- What is the relationship between general reasoning capability and cybersecurity task performance?
Builds On
- Reflexion (Shinn et al., 2024)
- ReAct (Yao et al., 2022)
- MLAgentBench (Huang et al., 2024)
- InterCode-CTF (Yang et al., 2023)
- NYU CTF Dataset (Shao et al., 2024)
- PentestGPT (Deng et al., 2023)
- PenHeal (Huang & Zhu, 2024)
Open Source
Yes - https://cybench.github.io