#54

Cyber-Zero: Training Cybersecurity Agents Without Runtime

Terry Yue Zhuo, Dingmin Wang, Hantian Ding, Varun Kumar, Zijian Wang

2025 | arXiv (preprint)

arXiv:2508.00910

Problem & Motivation

High-quality training data for cybersecurity LLM agents is scarce because CTF challenge environments are ephemeral, making it impossible to collect authentic agent trajectories at scale. Existing cybersecurity datasets either lack agentic interaction patterns or fail to provide training data altogether.

While LLM agents show promise on CTF challenges when paired with frontier proprietary models, open-weight models significantly underperform. The root cause is twofold: open models lack sophisticated agentic capabilities, and there is no scalable way to generate training trajectories without access to the original challenge runtime environments. This paper fills the gap by enabling runtime-free trajectory synthesis.

Threat Model

Assumes access to publicly available CTF writeups (from CTFtime) containing step-by-step solution strategies. The trained agents operate in standard CTF evaluation environments with bash terminal access and specialized cybersecurity tools.

Methodology

Cyber-Zero is a runtime-free framework that synthesizes high-quality agent trajectories from publicly available CTF writeups. It uses a persona-driven dual-LLM simulation: a CTF Player model that reasons and issues commands, and a Bash Terminal model that simulates realistic system responses. The Terminal model has access to the original writeup and reference flag, acting as a weak oracle that can inject hints when the player goes off track. Trajectories undergo multi-layer validation (flag match, format checks, LLM-based alignment filter) before being used for supervised fine-tuning of open-weight models.

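The dual-persona simulation can be sketched as a simple alternation loop between the two models. Everything below (function names, the stopping rule, the stub interfaces for the two LLM calls) is an illustrative assumption, not the paper's implementation:

```python
# Minimal sketch of a persona-driven dual-LLM synthesis loop.
# `player` and `terminal` stand in for two separately prompted LLMs:
# the Player persona issues commands, while the Terminal persona
# (which also sees the writeup and reference flag) fabricates outputs.
def synthesize_trajectory(player, terminal, task, reference_flag,
                          max_turns=20):
    """Alternate Player commands and simulated Terminal outputs until
    the Player submits the reference flag or the turn budget runs out."""
    trajectory = []
    observation = task  # the Player first sees the challenge description
    for _ in range(max_turns):
        command = player(observation, trajectory)
        observation = terminal(command, trajectory)  # simulated, never executed
        trajectory.append({"command": command, "output": observation})
        if reference_flag in command:  # Player submitted the correct flag
            return trajectory, True
    return trajectory, False
```

With stub callables in place of real models, the loop yields a transcript of command/output pairs plus a solved bit that downstream filtering can key on.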

Architecture

Three-stage pipeline: (1) Source data collection from CTFtime with HTML-to-Markdown conversion, quality filtering, and metadata augmentation via DeepSeek-V3-0324; (2) Persona-driven trajectory generation using a dual-LLM setup (Player + Terminal personas) that simulates complete CTF-solving interactions including failed attempts and debugging; (3) Training data construction with rejection sampling, format validation, and LLM-based binary alignment filtering. The ENIGMA scaffold interface is emulated during generation for compatibility with evaluation.

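Stage 3's multi-layer filtering might look like the following sketch; the layer order, limits, and the `alignment_judge` interface are assumptions for illustration rather than the paper's actual code:

```python
# Hedged sketch of the stage-3 trajectory validation described above.
# `alignment_judge` stands in for the LLM-based binary alignment filter.
def validate_trajectory(trajectory, reference_flag, alignment_judge,
                        max_turns=50):
    # Layer 1: rejection sampling — the run must end with the correct flag.
    if not any(reference_flag in step["command"] for step in trajectory):
        return False
    # Layer 2: cheap format checks (non-empty steps, bounded length).
    if not trajectory or len(trajectory) > max_turns:
        return False
    if any(not step["command"].strip() for step in trajectory):
        return False
    # Layer 3: LLM judge gives a binary keep/discard verdict on whether
    # the simulated terminal outputs plausibly align with the commands.
    return alignment_judge(trajectory)
```

Only trajectories that clear all three layers would reach the supervised fine-tuning set.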

LLM Models

DeepSeek-V3-0324, Qwen3-8B, Qwen3-14B, Qwen3-32B, Qwen2.5-Instruct-7B, Qwen2.5-Instruct-14B, Qwen2.5-Instruct-32B, SWE-agent-LM-7B, SWE-agent-LM-32B

Tool Integration

bash, decompile, disassemble, debug_start, debug_add_breakpoint, debug_continue, debug_step, debug_exec, debug_stop, connect_start, connect_sendline, connect_exec, connect_stop, pwntools, radare2, wine, gdb, gmpy2, sagemath, pycryptodome, sympy, RsaCtfTool, tshark, sqlmap, nikto

Memory Mechanism

conversation-history

Attack Phases Covered

reconnaissance
scanning
enumeration
exploitation
post exploitation
privilege escalation
lateral movement
reporting

Evaluation

Cyber-Zero fine-tuning yields up to 13.1% absolute performance gains over zero-shot baselines across three CTF benchmarks. The best model, Cyber-Zero-32B (Qwen3-32B fine-tuned), achieves 33.4% average Pass@1 across all benchmarks, matching proprietary models like Claude-3.5-Sonnet and DeepSeek-V3-0324 while offering superior cost-effectiveness. Performance scales consistently with model size, training task diversity, and trajectory density.

Environment

InterCode-CTF, NYU-CTF-Bench, Cybench

Metrics

pass-at-1, pass-at-k, cost-effectiveness
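Pass@1 is the fraction of challenges solved on a single attempt. For pass@k with n sampled attempts of which c succeed, the standard unbiased estimator from the code-generation evaluation literature is shown below; whether the paper uses exactly this estimator is an assumption here:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
    where n = attempts sampled, c = attempts that solved the challenge."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a success
    return 1.0 - comb(n - c, k) / comb(n, k)
```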

Baseline Comparisons

  • Claude-3.7-Sonnet
  • Claude-3.5-Sonnet
  • DeepSeek-V3-0324
  • Gemini-2.5-Flash
  • Qwen3-8B
  • Qwen3-14B
  • Qwen3-32B
  • Qwen2.5-Instruct
  • SWE-agent-LM

Scale

323 CTF challenges across 3 benchmarks (91 InterCode-CTF, 192 NYU CTF Bench, 40 Cybench)

Contributions

  • First runtime-free framework for synthesizing cybersecurity agent trajectories, using persona-driven dual-LLM simulation to reverse-engineer environments from CTF writeups
  • Large-scale dataset of 6,188 synthesized cybersecurity trajectories covering 4,610 unique challenges from 543 CTF competitions (2017-2025)
  • Thorough evaluation showing fine-tuned open-weight models close the gap with proprietary models on three major CTF benchmarks
  • ENIGMA+, an enhanced agent scaffold that reduces evaluation time from days to hours while maintaining fidelity
  • Identification and patching of problematic challenges affecting 6% of existing CTF benchmarks

Limitations

  • Trajectories are synthesized without actual runtime verification, so terminal outputs are simulated and may not be perfectly realistic
  • Gains are less pronounced on professional-level benchmarks (Cybench) compared to educational ones (InterCode-CTF), suggesting complex real-world challenges require more sophisticated reasoning than unverified synthetic trajectories can capture
  • Training limited to samples with maximum 32,768 tokens due to compute constraints
  • The hint mechanism from the Terminal model, while helpful, introduces a form of guidance not present in real evaluation settings
  • SWE-focused training does not transfer to cybersecurity, but the reverse direction (cybersecurity to SWE) is not explored

Research Gaps

  • Runtime-free trajectory synthesis may not fully capture the complexity of professional-level CTF challenges requiring sophisticated multi-step reasoning
  • No exploration of reinforcement learning or self-play approaches for cybersecurity agent training
  • Cross-domain transfer between software engineering and cybersecurity agent skills remains poorly understood
  • Diminishing returns from inference-time scaling beyond k=5 suggest fundamental capability ceilings that training data alone may not address
  • Benchmark reliability issues (6% of challenges found to be broken) indicate a need for better benchmark maintenance practices

Novel Techniques

  • Persona-driven dual-LLM trajectory synthesis: using one LLM as the agent and another as the environment simulator, guided by writeup content
  • Hint injection mechanism via [HINT]...[/HINT] tags from the terminal model to steer the player model when it gets stuck
  • Multi-layer rejection sampling with LLM-based binary alignment filtering for trajectory quality control
  • Runtime-free training data generation that eliminates dependency on ephemeral challenge environments
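The hint-injection technique above implies a post-processing step: hints appear in the Terminal model's raw output to steer a stuck Player, but plausibly need to be separated out so the final transcript reads like a real terminal session. The exact stripping policy below is an assumption:

```python
import re

# Extract [HINT]...[/HINT] spans from one simulated terminal response,
# returning the cleaned output alongside the list of hint texts.
HINT_RE = re.compile(r"\[HINT\](.*?)\[/HINT\]", re.DOTALL)

def split_hints(terminal_output):
    """Return (clean_output, hints) for a single terminal response."""
    hints = HINT_RE.findall(terminal_output)
    clean = HINT_RE.sub("", terminal_output).strip()
    return clean, hints
```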

Open Questions

  • Can RL-based approaches (e.g., GRPO, PPO) further improve cybersecurity agents trained on Cyber-Zero trajectories?
  • How well do Cyber-Zero-trained models generalize to real-world penetration testing scenarios beyond CTF competitions?
  • Could the dual-persona approach be extended with a verifier that actually executes commands, creating a hybrid runtime/runtime-free pipeline?
  • What is the quality ceiling of purely synthesized trajectories compared to trajectories collected from actual runtime environments?

Builds On

  • ENIGMA
  • SWE-Gym
  • SWE-bench
  • InterCode-CTF
  • NYU-CTF-Bench
  • Cybench

Open Source

Yes - https://github.com/amazon-science/cyber-zero
