Using Large Language Models for Cybersecurity Capture-The-Flag Challenges and Certification Questions
Problem & Motivation
LLMs are freely available to students and can perform well on many CTF challenges and professional certification exams, raising concerns about academic integrity in cybersecurity education. No prior study has examined LLM performance on solving CTF challenges and answering security professional certification questions.
Educators need to understand LLM capabilities in CTF contexts so they can adapt their teaching to account for generative AI assistance. The paper fills this gap by measuring how well LLMs solve CTF challenges across five categories and answer Cisco certification questions at difficulty levels from Associate to Expert.
Threat Model
Students or CTF participants with access to free, publicly available LLMs (ChatGPT, Google Bard, Microsoft Bing) use them to solve CTF challenges or answer professional certification exam questions, potentially bypassing the intended learning objectives.
Methodology
The authors conduct a two-part empirical study. First, they evaluate ChatGPT on Cisco certification questions across five levels (CCNA to CCIE), classifying questions as factual vs. conceptual and as multiple-choice (MCQ) vs. multiple-response (MRQ). Second, they test three LLMs (ChatGPT, Bard, Bing) on seven CTF test cases spanning all five CTF challenge types (web security, binary exploitation, cryptography, reverse engineering, forensics). They also demonstrate how jailbreak prompts (e.g., AIM, "Always Intelligent and Machiavellian") can bypass LLM safety policies to obtain exploit information.
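The per-category accuracy breakdown used in the certification study can be sketched as a small grading routine. This is a hypothetical sketch: the field names and the all-or-nothing scoring of MRQs below are assumptions, since the paper's exact scoring rules are not reproduced here.

```python
from collections import defaultdict

def grade(questions):
    """Compute accuracy per (knowledge, format) bucket, e.g. ('factual', 'MCQ').

    Each question is a dict with keys: 'knowledge' ('factual'/'conceptual'),
    'format' ('MCQ'/'MRQ'), 'correct' (set of right options), and
    'answered' (set of options the LLM chose). MRQs are scored
    all-or-nothing: every correct option must be chosen, with no extras.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for q in questions:
        bucket = (q["knowledge"], q["format"])
        totals[bucket] += 1
        if q["answered"] == q["correct"]:
            hits[bucket] += 1
    return {b: hits[b] / totals[b] for b in totals}

# Toy data standing in for the 238 graded Cisco questions
sample = [
    {"knowledge": "factual", "format": "MCQ", "correct": {"A"}, "answered": {"A"}},
    {"knowledge": "factual", "format": "MCQ", "correct": {"B"}, "answered": {"C"}},
    {"knowledge": "conceptual", "format": "MRQ", "correct": {"A", "C"}, "answered": {"A"}},
]
print(grade(sample))  # {('factual', 'MCQ'): 0.5, ('conceptual', 'MRQ'): 0.0}
```

Bucketing by both knowledge type and question format is what lets the paper separate recall failures (factual) from reasoning failures (conceptual).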
Architecture
There is no formal system architecture: interaction is manual question-and-answer, with participants copying outputs between the LLM and the CTF environment by hand. The authors also state a goal of using AutoGPT to build an automatic interface tool connecting a CTF-GPT module to CTFd websites and test cloud environments.
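The AutoGPT-style interface the authors envision could be sketched as a fetch-ask-submit loop. Everything below is a hypothetical sketch, not the authors' implementation: the loop takes injected callables so that the LLM call and the CTFd HTTP calls (which in practice would hit CTFd's REST API, e.g. `/api/v1/challenges` and `/api/v1/challenges/attempt`) can be stubbed out.

```python
def solve_ctf_challenges(fetch_challenges, ask_llm, submit_flag):
    """Fetch-ask-submit loop for a hypothetical CTF-GPT module.

    fetch_challenges() -> list of {'id': int, 'description': str}
    ask_llm(description) -> candidate flag string
    submit_flag(challenge_id, flag) -> True if the platform accepts the flag
    Returns a dict mapping challenge id to whether it was solved.
    """
    results = {}
    for chal in fetch_challenges():
        candidate = ask_llm(chal["description"])
        results[chal["id"]] = submit_flag(chal["id"], candidate)
    return results

# Stub clients standing in for a real CTFd session and LLM API
def fake_fetch():
    return [{"id": 1, "description": "decode the ROT13 string"},
            {"id": 2, "description": "flag hidden in EXIF data"}]

def fake_llm(description):
    return "flag" if "ROT13" in description else "wrong"

def fake_submit(cid, flag):
    return flag == "flag"

print(solve_ctf_challenges(fake_fetch, fake_llm, fake_submit))  # {1: True, 2: False}
```

A real version would replace the stubs with an authenticated HTTP session against CTFd and an LLM client, and would still need to handle the step the paper performs manually: judging whether the LLM's output is a flag, a command to run, or a request for more context.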
LLM Models
Tool Integration
Memory Mechanism
conversation-history
Attack Phases Covered
Evaluation
On CTF challenges, ChatGPT solved 6 of 7 test cases, Bard solved 2, and Bing solved 1. On Cisco certification questions, ChatGPT answered up to 82% of factual MCQs correctly but only around 50% of conceptual questions. Jailbreak prompts such as AIM successfully bypassed ChatGPT's safety policies and yielded exploit commands.
Environment
Metrics
Baseline Comparisons
- ChatGPT
- Google Bard
- Microsoft Bing
Scale
7 CTF test cases across 5 challenge types, plus 238 Cisco certification questions across 5 certification levels
Contributions
- First empirical study evaluating LLM performance on CTF challenges across all five standard CTF categories (web security, binary exploitation, cryptography, reverse engineering, forensics)
- Evaluation of ChatGPT on Cisco professional certification questions (CCNA through CCIE) with classification into factual vs. conceptual and MCQ vs. MRQ
- Demonstration that jailbreak prompts (e.g., AIM) can bypass LLM safety policies to provide exploit code and attack procedures for CTF challenges
- Qualitative analysis of LLM limitations in solving CTF challenges, identifying which challenge types are more easily solved
Limitations
- Very small scale evaluation with only 7 CTF test cases, limiting generalizability of results
- Used the free tiers of all three LLMs (GPT-3.5 rather than GPT-4), which may understate current LLM capabilities
- Manual human-in-the-loop interaction rather than automated agent-based approach, requiring human judgment to bridge LLM output and CTF environment
- No systematic methodology for prompt construction; prompts were ad hoc and manually crafted
- CTF challenges were from a single CTFd platform instance; results may not generalize to other competition formats or difficulty levels
- Study conducted in July 2023 with rapidly evolving LLM capabilities, making results quickly outdated
Research Gaps
- Need for automated tools that can interface LLMs with CTF platforms (e.g., via AutoGPT) without manual human mediation
- Lack of understanding of how newer, more capable LLMs (e.g., GPT-4) perform on CTF challenges
- Need for educational strategies to adapt CTF exercises and cybersecurity curricula to account for LLM capabilities
- Limited exploration of LLM performance on harder, multi-step CTF challenges requiring complex reasoning chains
- No systematic study of which jailbreak techniques are most effective across different LLMs for security-related queries
Novel Techniques
- Using jailbreak prompts (AIM persona) to bypass LLM safety filters for generating exploit commands in CTF contexts
- Classification of certification questions into factual vs. conceptual to understand LLM reasoning limitations
Open Questions
- Can fully autonomous LLM agents (without human mediation) solve CTF challenges end-to-end?
- How should cybersecurity education adapt assignments and assessments given LLM capabilities?
- What is the performance ceiling of state-of-the-art LLMs (GPT-4, Claude) on CTF challenges compared to GPT-3.5?
Builds On
- CySecBERT
- AutoGPT
Open Source
Partial: experiments are reportedly available on GitHub, but the URL is not specified in the paper