Measuring and Augmenting Large Language Models for Solving Capture-the-Flag Challenges
Problem & Motivation
Existing benchmarks evaluate LLMs' overall CTF-solving performance but fail to disentangle technical knowledge from reasoning ability, leaving it unclear which capability a given score reflects. Moreover, LLMs struggle to apply the technical knowledge they do possess to specific CTF scenarios, owing to the absence of appropriate tools and to interaction environments poorly suited to agents.
While LLMs possess substantial technical knowledge relevant to CTF challenges, they falter in accurately applying this knowledge to specific scenarios and adapting their strategies based on environment feedback. There is a need for both a focused benchmark measuring LLMs' CTF technical knowledge and a framework that augments LLMs' ability to leverage that knowledge in practice.
Threat Model
CTFAgent operates within sandboxed CTF competition environments. The agent has access to a Linux container with shell access and network connectivity to CTF challenge servers. The system uses publicly available vulnerability technical reports and does not engage with production systems.
Methodology
The paper makes two main contributions. First, it constructs CTFKnow, a focused benchmark with 3,992 questions (1,996 single-choice and 1,996 open-ended) derived from 1,084 CTF write-ups spanning 700+ competitions over five years, designed to measure LLMs' CTF technical knowledge separately from reasoning ability. Second, it proposes CTFAgent, an LLM-driven framework that integrates two-stage Retrieval-Augmented Generation (RAG) and interactive Environmental Augmentation (EA) to enhance LLMs' CTF problem-solving by providing relevant technical knowledge hints and a more interactive environment.
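The two-stage retrieval idea could be sketched as follows. The trunk fields mirror the paper's CTF Scenario / Exploit Method / Example Payload structure, but the sample entries and the keyword-overlap scoring are illustrative assumptions standing in for a real embedding-based retriever:

```python
# Hedged sketch of phase-dependent RAG: a different knowledge database is
# queried depending on whether the agent is understanding the challenge or
# exploiting it. Entries and scoring are simplified assumptions.

DB_UNDERSTANDING = [
    {"scenario": "format string vulnerability in printf call",
     "method": "leak stack values with %p specifiers",
     "payload": "%p.%p.%p.%p"},
]
DB_EXPLOITING = [
    {"scenario": "format string write primitive",
     "method": "overwrite GOT entry with %n",
     "payload": "%<addr>c%7$n"},
]

def retrieve(query: str, phase: str, top_k: int = 1) -> list[dict]:
    """Return the top_k trunks from the phase-appropriate database,
    ranked by naive keyword overlap with the query."""
    db = DB_UNDERSTANDING if phase == "understanding" else DB_EXPLOITING
    words = set(query.lower().split())

    def score(trunk: dict) -> int:
        return len(words & set(" ".join(trunk.values()).lower().split()))

    return sorted(db, key=score, reverse=True)[:top_k]
```

In the paper's design the retrieved trunk is injected into the prompt as a hint; here the two databases are kept separate so that comprehension-phase queries never surface exploitation payloads prematurely.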
Architecture
CTFAgent comprises two main modules: (1) A two-stage RAG module with DB-Understanding (retrieves vulnerability code snippets and matching technical knowledge during the problem comprehension phase) and DB-Exploiting (retrieves exploit ideas and example payloads during the exploitation phase). The RAG database contains ~2,000 CTF knowledge trunks with CTF Scenario, Exploit Method, and Example Payload segments. (2) An Environmental Augmentation (EA) module providing interactive command lines (dynamic netcat sessions via start_nc_session, nc_send_line, close_nc_session), tool use hints, and advanced CTF tools (IDA Pro 9.1 for decompilation instead of Ghidra).
LLM Models
Tool Integration
Memory Mechanism
RAG
Attack Phases Covered
Evaluation
On Intercode-CTF, CTFAgent improved LLM success from 39/100 (baseline) to 73/100, an 85% improvement. On NYU CTF Dataset, CTFAgent solved 18/200 challenges, a 120% improvement over the NYU CTF baseline (8 solved). With o1-preview, CTFAgent solved an additional 11 challenges on Intercode-CTF (84% total, i.e., 84/100). In picoCTF2024, CTFAgent ranked in the top 23.6% of nearly 7,000 participating teams with 1,875 points, significantly outperforming the NYU Framework (top 47.2%).
Environment
Metrics
Baseline Comparisons
- Intercode-CTF baseline framework
- NYU CTF Framework
- CTFAgent-w/o-RAG (ablation)
- CTFAgent-w/o-EA (ablation)
Scale
100 Intercode-CTF challenges, 200 NYU CTF challenges, 46 picoCTF2024 challenges, 3,992 CTFKnow benchmark questions
Contributions
- CTFKnow: A novel benchmark with 3,992 questions based on 1,084 CTF write-ups and ~2,000 knowledge points, specifically measuring LLMs' CTF technical knowledge across varying difficulty levels
- Comprehensive measurement study of five mainstream LLMs on CTF technical knowledge, revealing that LLMs possess strong technical knowledge (87.83% on single-choice) but struggle to apply it to specific scenarios (~50% on open-ended questions)
- CTFAgent: A framework integrating two-stage RAG (DB-Understanding and DB-Exploiting) and interactive Environmental Augmentation, achieving 85-120% performance gains over baselines on established CTF benchmarks
- Practicality validation through picoCTF2024 competition, ranking top 23.6% among ~7,000 teams
Limitations
- RAG misguidance: inaccurate retrieval can surface CTF knowledge that deviates from the actual scenario, inadvertently misleading the LLM's solving approach
- Lack of multi-modal capabilities: cannot handle OSINT or social engineering challenges involving images, and cannot use GUI-based tools like Burp Suite
- Does not strengthen the base model's reasoning ability at a fundamental level; relies on RAG for knowledge augmentation rather than improving multi-turn reasoning
- Failure analysis shows the primary failure mode is exceeding maximum interaction rounds (82.13% on Intercode-CTF, 43.41% on NYU CTF), indicating difficulty with complex multi-step reasoning
- Context length exceeded is another significant failure mode (14.30% Intercode-CTF, 15.38% NYU CTF)
- CTFAgent underperforms human participants in Misc and Web categories that require extensive cybersecurity tool usage
Research Gaps
- Integrating multi-modal capabilities (vision) to handle image-based CTF challenges and GUI-based security tools
- Strengthening LLM reasoning ability for CTF through structured reasoning approaches such as Tree of Thoughts or Graph of Thoughts
- Incorporating reinforcement learning (Tool-Integrated RL, LLM Agent RL) to improve agent reasoning in complex CTF scenarios
- Expanding and improving RAG knowledge base quality to reduce misguidance
- Bridging the gap between possessing technical knowledge and correctly applying it to specific CTF scenarios
- Improving LLMs' ability to debug scripts and handle tool/library installation failures during CTF solving
Novel Techniques
- Two-stage RAG system with separate databases for understanding (vulnerability code snippet matching) and exploiting (exploit idea matching) phases of CTF solving
- Interactive Environmental Augmentation with dynamic netcat sessions (start_nc_session, nc_send_line, close_nc_session) enabling real-time interaction with remote services
- Tool use hints: manually curated prompts triggered when LLMs invoke specific tools, guiding them to avoid common mistakes (e.g., prompting to read decompiled code before writing exploit scripts)
- CTFKnow benchmark construction pipeline: write-up collection, GPT-4 knowledge extraction, Deepseek hallucination filtering, question generation, question filtering, and manual verification
- Knowledge trunks structured as CTF Scenario + Exploit Method + Example Payload for effective RAG retrieval
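The tool-use-hint mechanism above could be sketched as a lookup that appends a curated reminder to a tool's raw output before it reaches the LLM. The hint texts and function name below are illustrative assumptions; only the mechanism (a manually curated, per-tool prompt) comes from the paper:

```python
# Hedged sketch of tool use hints: a manually curated reminder is attached
# to the observation whenever the agent invokes a matching tool, steering
# the LLM away from known pitfalls. Hint wording here is hypothetical.

TOOL_HINTS = {
    "decompile": ("Hint: read the decompiled code carefully and identify "
                  "the vulnerable function before writing an exploit script."),
    "start_nc_session": ("Hint: wait for the service banner before sending "
                         "input; the first reply may be a menu."),
}

def attach_tool_hint(tool_name: str, tool_output: str) -> str:
    """Return the raw tool output, with the curated hint appended if one
    exists for this tool; unknown tools pass through unchanged."""
    hint = TOOL_HINTS.get(tool_name)
    return tool_output if hint is None else f"{tool_output}\n{hint}"
```

Because the hint fires only on the relevant tool call, it costs no context on unrelated turns, unlike baking every reminder into the system prompt.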
Open Questions
- How to effectively integrate o1-like chain-of-thought reasoning models that lack function-calling capabilities into CTF agent frameworks?
- Can multi-modal LLMs close the gap in Web and Misc CTF categories that currently require GUI-based tools?
- How to prevent RAG misguidance when retrieved knowledge does not match the actual CTF scenario?
- What is the optimal balance between providing knowledge hints and allowing the LLM to reason independently?
- How to scale CTFAgent to handle challenges requiring complex multi-step exploitation chains beyond 30 interaction rounds?
Builds On
- Intercode-CTF (Yang et al., 2024)
- NYU CTF Bench (Shao et al., 2024)
- PentestGPT (Deng et al., 2024)
- OpenAI Assistants API
- OpenAI Function Calling
- ReAct prompting framework (Yao et al., 2022)
Open Source
Partial - CTFKnow benchmark and evaluation scripts are publicly accessible; CTFAgent code is restricted to institute-affiliated research personnel via a review process to minimize misuse