#28

NYU CTF Bench: A Scalable Open-Source Benchmark Dataset for Evaluating LLMs in Offensive Security

Minghao Shao, Sofija Jancheska, Meet Udeshi, Brendan Dolan-Gavitt, Haoran Xi, Kimberly Milner, Boyuan Chen, Max Yin, Siddharth Garg, Prashanth Krishnamurthy, Farshad Khorrami, Ramesh Karri, Muhammad Shafique

2024 | NeurIPS 2024 Track on Datasets and Benchmarks (top-conference)

Problem & Motivation

The capacity of LLMs to solve Capture-the-Flag (CTF) challenges in cybersecurity has not been thoroughly evaluated. Existing studies are limited in scope, use closed benchmarks, lack automated frameworks, and test only a small number of challenges or models.

CTF tasks require advanced multi-step reasoning and the ability to take action in a digital environment, making them an excellent test of general LLM reasoning capabilities beyond multiple-choice benchmarks like MMLU. Prior work on LLM-based CTF solving was limited to small-scale manual evaluations (e.g., 7-100 challenges) without open benchmarks, automated frameworks, or tool integration, preventing reproducible and scalable assessment of LLM offensive security capabilities.

Threat Model

LLMs are given access to a Linux container environment with a shell, network access to challenge servers, and a set of cybersecurity tools. The model receives the challenge description, category, files, and point value, and must autonomously find the flag within a 48-hour time limit, with at most 5 attempts per challenge.

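
This attempt-and-timeout discipline can be sketched as a simple driver loop. All names here are illustrative (the paper's framework is open source, but this is a minimal sketch of the rules, not its actual code):

```python
import time

MAX_ATTEMPTS = 5          # per-challenge attempt budget from the benchmark rules
TIME_LIMIT_S = 48 * 3600  # 48-hour wall-clock limit

def solve_challenge(challenge, agent_step, real_flag):
    """Drive an agent until it finds the flag, gives up, or exhausts its budget.

    `agent_step` is a hypothetical callable returning a candidate flag string,
    the literal "GIVE_UP", or None (still working in the environment).
    """
    attempts = 0
    deadline = time.monotonic() + TIME_LIMIT_S
    while attempts < MAX_ATTEMPTS and time.monotonic() < deadline:
        candidate = agent_step(challenge)
        if candidate is None:
            continue                     # agent is still exploring, no submission yet
        if candidate == "GIVE_UP":
            return "gave_up", attempts   # early termination, mirrors the give_up tool
        attempts += 1                    # only flag submissions consume attempts
        if candidate == real_flag:
            return "solved", attempts
    return "failed", attempts
```

Only explicit flag submissions count against the 5-attempt budget; exploration steps inside the container are unlimited until the wall-clock deadline.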
Methodology

The authors create a scalable, open-source benchmark of 200 validated CTF challenges from NYU CSAW competitions (2017-2023) across six categories (crypto, forensics, pwn, rev, web, misc). They build a fully automated evaluation framework that deploys challenges in Docker containers, provides LLMs with cybersecurity tools via function calling, manages prompting and logging, and validates solutions by checking extracted flags. Five LLMs (three black-box, two open-source) are evaluated on all 200 challenges with 5 attempts each.

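
Validating a solution amounts to scanning the model's output for a flag-shaped string and comparing it against ground truth; a minimal sketch (the `flag{...}` wrapper is an assumption for illustration; individual challenges may use other formats):

```python
import re

# Hypothetical flag pattern; real challenges may use other wrappers.
FLAG_RE = re.compile(r"flag\{[^}]*\}")

def extract_flag(model_output: str):
    """Return the first flag-shaped substring in the model's output, or None."""
    m = FLAG_RE.search(model_output)
    return m.group(0) if m else None

def check_flag(candidate, ground_truth: str) -> bool:
    """Exact-match validation against the ground-truth flag."""
    return candidate is not None and candidate.strip() == ground_truth
```
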
Architecture

The framework consists of five modules: (1) Backend Module for communication with LLM services (OpenAI, Anthropic, or open-source via TGI/vLLM); (2) Data Loader for deploying challenges from Docker containers or local files; (3) External Tools providing six function-callable tools (run_command, createfile, disassemble, decompile, check_flag, give_up); (4) Logging System for structured Markdown logs; (5) Prompt Module that constructs system, user, and model prompts from challenge metadata.

LLM Models

  • GPT-4 (gpt-4-1106-preview)
  • GPT-3.5 Turbo (gpt-3.5-turbo-1106)
  • Claude 3 (claude-3-haiku, claude-3-sonnet, claude-3-opus)
  • Mixtral-8x7B-Instruct-v0.1
  • LLaMA 3 (Meta-Llama-3-70B-Instruct)

Tool Integration

  • run_command (shell execution in Ubuntu 22.04 container)
  • createfile (file creation with binary escape support)
  • disassemble (Ghidra-based disassembly)
  • decompile (Ghidra-based decompilation)
  • check_flag (flag validation)
  • give_up (early termination)

Environment tools: pwntools, radare2, SageMath, gmpy2, netcat, curl, Python3, GDB
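
Exposing these tools via function calling amounts to publishing JSON schemas the model can invoke and routing its calls to handlers. A sketch for run_command in the OpenAI-style tools format (field and parameter names are illustrative, not the framework's exact schema):

```python
import json
import subprocess

# Illustrative OpenAI-style tool schema; the framework's exact fields may differ.
RUN_COMMAND_TOOL = {
    "type": "function",
    "function": {
        "name": "run_command",
        "description": "Execute a shell command inside the challenge container.",
        "parameters": {
            "type": "object",
            "properties": {
                "command": {"type": "string", "description": "Shell command to run"},
            },
            "required": ["command"],
        },
    },
}

def dispatch_tool_call(name: str, arguments: str) -> str:
    """Route a model-issued tool call to its handler (run_command only, for brevity)."""
    args = json.loads(arguments)
    if name == "run_command":
        proc = subprocess.run(args["command"], shell=True,
                              capture_output=True, text=True, timeout=60)
        return proc.stdout + proc.stderr
    raise ValueError(f"unknown tool: {name}")
```

In the real framework the command runs inside the challenge's Docker container rather than on the host, and the tool result is fed back to the model as the next conversation turn.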

Memory Mechanism

conversation-history

Attack Phases Covered

  • reconnaissance
  • scanning
  • enumeration
  • exploitation
  • post exploitation
  • privilege escalation
  • lateral movement
  • reporting

Evaluation

GPT-4 performed best overall, solving challenges across multiple categories (crypto 6.67%, pwn 7.69%, rev 9.80%, web 5.26%, and misc the highest at 12.5%). The open-source models Mixtral and LLaMA solved zero challenges. Claude 3 outperformed the median human score in the 2022 CSAW finals with 1500 points and matched human performance in some events. GPT-3.5 and Claude 3 had high give-up rates (47% and 53%, respectively), while Mixtral and LLaMA failed with wrong answers 100% of the time.

Environment

custom-lab, CTF-competition

Metrics

success-rate, failure-type-distribution, comparison-with-human-scores

Baseline Comparisons

  • GPT-4
  • GPT-3.5 Turbo
  • Claude 3 (Haiku/Sonnet/Opus)
  • Mixtral-8x7B
  • LLaMA 3-70B
  • Human CTF participants (CSAW 2022 and 2023)

Scale

200 CTF challenges across 6 categories (32 crypto, 7 forensics, 29 pwn, 31 rev, 20 web, 13 misc from qualifying rounds; 8 crypto, 10 forensics, 20 pwn, 11 rev, 6 web from finals)

Contributions

  • An open-source benchmark dataset of 200 diverse, validated CTF challenges from CSAW competitions (2017-2023) spanning six categories with metadata, Docker deployment configs, and source code
  • A fully automated framework for evaluating LLMs on CTF challenges, supporting both black-box and open-source models with tool integration via function calling
  • A comprehensive toolkit of six domain-specific tools (run_command, createfile, disassemble, decompile, check_flag, give_up) integrated through function calling to augment LLM capabilities
  • Empirical evaluation of five LLMs with comparison to human CTF competition performance, revealing that current LLMs can sometimes match or exceed median human scores

Limitations

  • Only 200 of the initial 567 challenges were validated and included; many challenges could not be verified due to maintenance issues or missing files
  • Certain categories like Incident Response are excluded entirely as they are harder to validate
  • Significant category imbalance: crypto, rev, and pwn are overrepresented while forensics and web are underrepresented
  • All challenges sourced from a single CTF series (NYU CSAW), limiting diversity of challenge styles and difficulty calibration
  • Models sometimes use inappropriate tools for the category (e.g., C/C++ reverse engineering tools on Python code)
  • Open-source models (Mixtral, LLaMA) solved zero challenges, suggesting the framework may not adequately support models without native function calling
  • The 48-hour time limit and 5-attempt structure may not fully capture model capabilities under different configurations
  • No fine-tuning or prompt optimization was explored; all models used default temperature settings

Research Gaps

  • No established large-scale, open, standardized benchmark existed for evaluating LLM offensive security capabilities on CTF challenges prior to this work
  • Open-source LLMs completely fail at CTF solving, revealing a major capability gap compared to proprietary models that warrants investigation
  • Function calling support is inconsistent across LLMs, requiring format transformation workarounds that may degrade performance for models without native support
  • LLMs struggle with challenges requiring multi-step exploitation chains, particularly in pwn and web categories, pointing to limitations in complex procedural reasoning
  • No existing work combines open benchmarks, automated frameworks, tool use, and evaluation of both open and closed LLMs for CTF solving
  • The relationship between challenge difficulty (point values) and LLM solvability is not deeply analyzed

Novel Techniques

  • Category-aware tool provisioning: disassemble/decompile tools are only provided for pwn and rev categories to avoid confusing the model on other challenge types
  • Function calling abstraction layer that transforms natural language tool requests into structured formats for models without native function calling support
  • Docker-based challenge deployment with pre-built images from Docker Hub, eliminating environment setup inconsistencies across evaluations
  • Structured JSON metadata format (challenge.json) that separates model-visible information from ground truth, enabling automated deployment and evaluation
  • Garbage collector for Docker resources to manage large-scale automated evaluation across 200 challenges and multiple models
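
The model-visible/ground-truth split in challenge.json might look like the following. Field names here are guesses for illustration; consult the benchmark repository for the real schema:

```python
import json

# Hypothetical challenge.json contents; actual field names live in the benchmark repo.
challenge = {
    # Model-visible metadata, used to build prompts
    "name": "example_pwn_challenge",
    "category": "pwn",
    "description": "Pop a shell on the remote service and read the flag.",
    "points": 150,
    "files": ["chal.bin"],
    "server": {"host": "challenge", "port": 9999},
    # Ground truth, hidden from the model and used only by the evaluator
    "flag": "flag{example}",
}

def model_view(chal: dict) -> dict:
    """Strip ground truth before anything reaches the model."""
    return {k: v for k, v in chal.items() if k != "flag"}

print(json.dumps(model_view(challenge), indent=2))
```

Keeping the flag in the same file as the deployment metadata, but filtering it out of every prompt, is what lets deployment and scoring both run from one manifest.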

Open Questions

  • Can fine-tuning or specialized training on CTF-like tasks significantly improve open-source model performance from zero?
  • Would multi-agent architectures (e.g., planner + executor) outperform single-agent approaches on complex multi-step CTF challenges?
  • How does prompt engineering and chain-of-thought prompting affect CTF solving performance across different challenge categories?
  • What is the minimum set of tools and capabilities needed for an LLM to be competitive on each CTF category?
  • Can reinforcement learning with CTF flag verification as reward signal improve LLM performance through repeated attempts?
  • How do models decide when to give up versus persist, and can this meta-cognitive strategy be improved?
  • Why do LLMs fail at web challenges despite having access to relevant tools like curl, and what additional capabilities would help?

Builds On

  • PentestGPT (Deng et al., 2024)
  • InterCode-CTF (Yang et al., 2023)
  • Shao et al. (2024) empirical evaluation of LLMs on offensive security
  • Tann et al. (2023) LLMs on CTF challenges
  • DARPA Cyber Grand Challenge (2016)

Open Source

Yes - https://github.com/NYU-LLM-CTF/NYU_CTF_Bench and https://github.com/NYU-LLM-CTF/llm_ctf_automation

Tags