#65

An Empirical Evaluation of LLMs for Solving Offensive Security Challenges

Minghao Shao, Boyuan Chen, Sofija Jancheska, Brendan Dolan-Gavitt, Siddharth Garg, Ramesh Karri, Muhammad Shafique

2024 | arXiv (preprint)

2402.11814

Problem & Motivation

No prior work has comprehensively evaluated the effectiveness of LLMs in solving CTF challenges using a fully automated workflow. This paper assesses the ability of six LLMs to solve real-world CTF challenges under both human-in-the-loop (HITL) and fully automated settings.


CTF participants increasingly use LLMs to solve challenges, but there is no systematic evaluation of how well LLMs perform compared to human CTF players, nor an understanding of common failure modes and the value of human feedback in LLM-guided CTF solving.


Threat Model

LLMs are given challenge descriptions, source code, and executable files in a Dockerized environment with pre-installed security tools. In the fully automated workflow, no human intervention is allowed; in HITL, humans provide hints and corrections.


Methodology

The authors develop two workflows for LLM-guided CTF solving: (1) a human-in-the-loop (HITL) workflow where contestants manually interact with the LLM, providing hints and corrections iteratively, and (2) a fully automated workflow where the LLM operates autonomously in a Docker container with access to security tools via function calling. They evaluate six LLMs on 26 CTF challenges from CSAW 2023, analyze failure modes, and compare LLM performance against 1,176 human CTF teams from the actual competition.


Architecture

A Dockerized evaluation framework with two containers: one hosting the CTF challenge server and one for the LLM agent (ctfenv). The LLM interacts with the environment through tool calls (run_command, createfile, disassemble, decompile, check_flag, give_up). A prompt template provides challenge context and instructions.

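The tool-calling interface above can be sketched as a simple dispatch layer; the tool names (run_command, createfile, check_flag) follow the paper, but the schemas and implementations below are illustrative assumptions, not the authors' code.

```python
import subprocess

def run_command(cmd: str) -> str:
    """Execute a shell command inside the agent container and return its output."""
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True, timeout=60)
    return result.stdout + result.stderr

def createfile(path: str, contents: str) -> str:
    """Write a solver script or payload to disk for later execution."""
    with open(path, "w") as f:
        f.write(contents)
    return f"wrote {len(contents)} bytes to {path}"

def check_flag(candidate: str, expected: str) -> bool:
    """Compare a candidate flag against the challenge's expected flag."""
    return candidate.strip() == expected

TOOLS = {"run_command": run_command, "createfile": createfile, "check_flag": check_flag}

def dispatch(tool_call: dict):
    """Route an LLM function call of the form {'name': ..., 'arguments': {...}} to its tool."""
    return TOOLS[tool_call["name"]](**tool_call["arguments"])
```

In the paper's automated workflow, the agent loops over such calls until check_flag succeeds or give_up is invoked; the loop and prompt template are omitted here.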

LLM Models

GPT-3.5, GPT-4, Claude, Bard, DeepSeek Coder, Mixtral

Tool Integration

Ghidra (disassemble/decompile), run_command (shell access), createfile, check_flag, give_up, Python, pwntools, radare2, SageMath

Memory Mechanism

Conversation history

Attack Phases Covered

  • reconnaissance
  • scanning
  • enumeration
  • exploitation
  • post-exploitation
  • privilege escalation
  • lateral movement
  • reporting

Evaluation

In the fully automated workflow, GPT-4 solved 12 of 26 challenges, GPT-3.5 solved 6, and Mixtral solved 5. GPT-4 scored 1,319 points, placing 135th of 1,176 teams (top 11.5%) and outperforming 88.5% of human CTF players. In the HITL evaluation, ChatGPT solved 11 of 21 challenges. Common failure modes include empty solutions (33-47%), wrong flags (8-29%), and command-line errors (10-25%).

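The percentile figures follow directly from the reported rank; a quick check of the arithmetic:

```python
# Verify the ranking arithmetic reported for GPT-4 at the CSAW 2023 qualifiers.
rank, total = 135, 1176               # GPT-4's placement among all competing teams
top_fraction = rank / total           # fraction of teams at or above GPT-4's rank
print(f"top {top_fraction:.1%}")                        # → top 11.5%
print(f"outperformed {1 - top_fraction:.1%} of teams")  # → outperformed 88.5% of teams
```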

Environment

CSAW CTF 2023; custom lab

Metrics

Success rate, task completion, number of steps, percentile ranking vs. humans

Baseline Comparisons

  • Human CTF teams from CSAW 2023 (1,176 teams)
  • Cross-comparison among the six LLMs

Scale

26 CTF challenges across six categories (crypto, forensics, misc, pwn, rev, web) from the CSAW 2023 qualifying round; 37 challenges originally, reduced to 26 after excluding multimodal, Discord-dependent, and incident-response challenges.

Contributions

  • Quantitative and qualitative assessment of six LLMs on 26 diverse CTF challenges, showing ChatGPT performs comparably to an average human CTF team
  • Two workflows for LLM-guided CTF solving: human-in-the-loop (HITL) and fully automated with tool use
  • Comprehensive failure analysis taxonomy identifying common shortcomings (empty solutions, wrong flags, command line errors, faulty code) when LLMs tackle CTFs without human intervention
  • Comparison of LLM automated performance against 1,176 real human CTF teams, showing GPT-4 reaches top 11.5%

Limitations

  • Limited to 26 CTF challenges from a single competition (CSAW 2023), restricting generalizability
  • Low success rate in the fully automated workflow, suggesting need for better prompt engineering
  • No model solved any cryptography challenge in the automated workflow (0% success across all models)
  • Safety guardrails in ChatGPT required creative prompt engineering to bypass, and these guardrails are continuously strengthened
  • Automated framework cannot assess correctness of intermediate reasoning steps
  • Web challenges excluded from HITL evaluation due to LLM lacking web access

Research Gaps

  • Need for better prompt engineering techniques to improve automated CTF-solving performance
  • Longitudinal studies on how LLM capability and safety guardrails co-evolve for offensive security tasks
  • Evaluation on CTF challenges from diverse sources and databases beyond a single competition
  • Understanding of how LLM updates and improvements affect CTF-solving capabilities over time
  • Better integration of function calling and tool use for open-source LLMs like Mixtral

Novel Techniques

  • Dockerized automated CTF-solving framework with tool-calling interface (run_command, disassemble, decompile, check_flag, give_up)
  • Systematic comparison of HITL vs fully automated workflows for LLM-guided CTF solving
  • Failure taxonomy for LLM-based CTF solving (empty solution, connect error, faulty code, import error, cmd line error, file error, wrong flag)
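
The failure taxonomy can be operationalized as a simple classifier over a solver attempt's output; the category labels follow the paper, while the keyword heuristics below are hypothetical assumptions for illustration.

```python
# Category labels follow the paper's failure taxonomy; the keyword
# heuristics used to detect each category are illustrative assumptions.
FAILURE_PATTERNS = [
    ("empty solution", lambda out: not out.strip()),
    ("connect error",  lambda out: "Connection refused" in out or "timed out" in out),
    ("import error",   lambda out: "ModuleNotFoundError" in out or "ImportError" in out),
    ("cmd line error", lambda out: "command not found" in out or "invalid option" in out),
    ("file error",     lambda out: "No such file or directory" in out),
    ("faulty code",    lambda out: "Traceback" in out or "SyntaxError" in out),
]

def classify_failure(output: str, flag_submitted: bool, flag_correct: bool) -> str:
    """Map a failed solving attempt to one of the paper's failure categories."""
    if flag_submitted and not flag_correct:
        return "wrong flag"
    for label, matches in FAILURE_PATTERNS:
        if matches(output):
            return label
    return "unclassified"
```

Ordering matters in such a classifier: an uncaught ImportError also produces a Traceback, so the more specific categories are checked first.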

Open Questions

  • Can better prompt engineering significantly close the gap between HITL and fully automated performance?
  • How will evolving safety guardrails impact the long-term viability of using LLMs for offensive security education?
  • Can multi-agent architectures or more sophisticated planning improve automated CTF solving?
  • Why do all LLMs completely fail on cryptography challenges in the automated setting?

Builds On

  • InterCode (Yang et al., 2023)
  • Language agents as hackers (Yang et al., NeurIPS 2023)
  • Tann et al. (2023) - Using LLMs for CTF and certification questions

Open Source

Yes - https://github.com/NickNameInvalid/LLM_CTF and https://github.com/osirislab/CSAW-CTF-2023-Quals

Tags