PwnGPT: Automatic Exploit Generation Based on Large Language Models
Problem & Motivation
Traditional automatic exploit generation (AEG) systems target only specific vulnerability types and rely on manually crafted templates built from expert experience. LLMs alone struggle with vulnerability location and complex exploit chain construction, limiting their direct application to binary exploitation tasks.
There is a significant gap in applying LLMs to executable file exploit generation (binary exploitation). While LLMs show strong information analysis and code generation capabilities, they are not adept at locating vulnerabilities or constructing complex exploit chains. A systematic benchmark and an enhanced LLM-based framework are needed to bridge this gap and enable more intelligent, flexible AEG that can handle diverse vulnerability types.
Threat Model
The system assumes access to the target executable binary (ELF file) from a CTF pwn challenge. The attacker has the ability to decompile the binary and run generated exploits against a local or remote instance of the vulnerable program. Security measures such as NX, RELRO, PIE, canaries, and ASLR may be present.
Methodology
PwnGPT is a modular LLM-based AEG framework with three main components: (1) an Analysis Module that preprocesses the target ELF binary by extracting file metadata, security measures, decompiling to C code, and using static analysis with prompt chaining to identify key functions; (2) a Generation Module that uses zero-shot role-play prompting and Structured Outputs to have the LLM produce exploits in a structured format (introduction, imports, code); and (3) a Verification Module that iteratively tests generated exploits against the target, feeds error information back to the LLM for reflection, and requests modifications until success or a retry limit is reached.
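The three-module loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the module internals are stubbed out, and the function names and retry limit are assumptions.

```python
from typing import Optional, Tuple

MAX_RETRIES = 5  # assumed retry limit; the paper does not specify a value here


def analyze(binary_path: str) -> dict:
    """Analysis Module stub: file metadata, mitigations, key functions."""
    return {"binary": binary_path,
            "mitigations": ["NX", "Partial RELRO"],
            "key_functions": ["main", "vuln"]}


def generate(context: dict, feedback: Optional[str]) -> str:
    """Generation Module stub: a real implementation would call the LLM
    with role-play prompting and Structured Outputs."""
    return f"# exploit for {context['binary']} (last error: {feedback})"


def verify(exploit: str) -> Tuple[bool, str]:
    """Verification Module stub: would run the exploit and capture errors.
    This stub always fails so the loop exercises the retry path."""
    return False, "EOFError: process exited before spawning a shell"


def pwn_gpt(binary_path: str) -> Optional[str]:
    context = analyze(binary_path)
    feedback = None
    for _ in range(MAX_RETRIES):
        exploit = generate(context, feedback)
        ok, feedback = verify(exploit)  # error text drives LLM self-reflection
        if ok:
            return exploit
    return None  # retry limit reached without a working exploit
```

The key structural point is that `verify` returns the raw error text, which becomes part of the next `generate` call's context, giving the LLM something concrete to reflect on.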
Architecture
Three-module pipeline: Analysis Module (ELF parsing, decompilation via Hex-Rays, static analysis, prompt chaining for key function extraction) -> Generation Module (role-play prompting, zero-shot reasoning, Structured Outputs for formatted exploit generation) -> Verification Module (exploit execution, error feedback, iterative reflection and modification). The Analysis Module has two workflows: a simple code workflow for small decompiled files and a complex code workflow for large files that uses LLM-assisted function ranking and static analysis to reduce code volume.
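The dual-workflow routing in the Analysis Module might look like the sketch below. The size threshold, the top-k value, and the keyword-based ranking heuristic (standing in for the LLM-assisted ranking) are all assumptions for illustration.

```python
# Hypothetical sketch of the Analysis Module's dual workflow:
# small decompiled files go straight through; large ones are trimmed
# by ranking functions and keeping only the most relevant.

SIZE_THRESHOLD = 8_000  # chars; assumed cutoff between the two workflows
TOP_K = 5               # assumed number of functions kept in the complex workflow


def rank_functions(functions: dict) -> list:
    """Stand-in for LLM-assisted ranking: prefer functions that touch
    risky libc calls, then longer bodies."""
    risky = ("gets", "strcpy", "printf", "scanf", "read")

    def score(item):
        name, body = item
        return (sum(body.count(call) for call in risky), len(body))

    return [name for name, _ in sorted(functions.items(), key=score, reverse=True)]


def select_code(functions: dict) -> str:
    """Route decompiled code to the simple or complex workflow by volume."""
    full = "\n".join(functions.values())
    if len(full) <= SIZE_THRESHOLD:           # simple workflow: pass everything
        return full
    keep = rank_functions(functions)[:TOP_K]  # complex workflow: reduce volume
    return "\n".join(functions[name] for name in keep)
```

The point of the complex workflow is purely to fit the decompiled code within the LLM's context window while retaining the functions most likely to contain the vulnerability.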
LLM Models
Tool Integration
Memory Mechanism
- Conversation history
Attack Phases Covered
Evaluation
PwnGPT significantly improves exploit completion rates over direct prompting for three of the four tested LLMs: from 26.3% to 57.9% with OpenAI o1-preview, from 21.1% to 36.8% with GPT-4o, and from 10.5% to 21.1% with qwen-plus; qwen-max remained unchanged at 10.5%. With PwnGPT, o1-preview solved 11 of 19 challenges (7 stack overflow, 3 format string, 1 integer overflow). On two real-world CVE challenges (CVE-2011-2523 and CVE-2018-10933), PwnGPT failed, due to stripped symbols and complex code logic respectively.
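The reported percentages are consistent with solve counts over the 19-challenge benchmark; a quick sanity check, assuming every rate shares the denominator of 19:

```python
# Check the reported completion rates against counts out of 19 challenges.
def rate(solved: int, total: int = 19) -> float:
    return round(solved / total * 100, 1)

assert rate(11) == 57.9  # o1-preview with PwnGPT (11/19)
assert rate(5) == 26.3   # o1-preview, direct prompting
assert rate(7) == 36.8   # GPT-4o with PwnGPT
assert rate(4) == 21.1   # GPT-4o direct; also qwen-plus with PwnGPT
assert rate(2) == 10.5   # qwen-max; also qwen-plus direct
assert 7 + 3 + 1 == 11   # stack overflow + format string + integer overflow
```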
Environment
Metrics
Baseline Comparisons
- Direct LLM prompting (o1-preview, GPT-4o, qwen-plus, and qwen-max without the PwnGPT framework)
Scale
19 CTF pwn challenges (10 stack overflow, 5 format string, 2 integer overflow, 2 heap exploitation) plus 2 CVE-based challenges
Contributions
- Comprehensive evaluation of LLM capabilities in exploit generation, building a pwn benchmark that systematically identifies LLM shortcomings across four key capabilities: key information analysis, vulnerability location, exploit chain construction, and code generation
- Design of PwnGPT, a novel LLM-based AEG system with modular architecture (analysis, generation, verification) that handles diverse vulnerability types with greater automation than existing AEG frameworks
- Thorough evaluation demonstrating PwnGPT markedly improves LLM exploit generation performance, with the o1-preview model achieving 57.9% success rate compared to 26.3% baseline
Limitations
- Evaluation ignored memory offset errors in buffer overflow and format string exploits, meaning reported success rates may be optimistic for real-world deployment
- LLMs do not truly understand memory state and heap memory management, making heap exploitation challenges unsolvable
- The analysis module does not process dynamic link libraries, causing it to miss gadgets from libc (e.g., rop-10 challenge)
- Failed on both real-world CVE challenges: stripped binaries prevent function identification, and complex code logic exceeds the analysis module's capabilities
- The benchmark is relatively easy and limited in quantity (19 challenges), designed for human experts to solve
- All LLMs are constrained by context window limitations (32k-128k tokens), preventing processing of large decompiled files
- LLMs make errors on architectural details like x86 vs x64 calling conventions and 64-bit byte alignment requirements
- The verification module cannot convert originally unfeasible exploits into feasible ones; it only improves code quality of already-viable approaches
- Does not provide sufficient exploit knowledge to LLMs, relying solely on the model's pre-training knowledge
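One concrete instance of the architectural-detail errors noted above: the x86-64 System V ABI requires 16-byte stack alignment at call sites, so a ROP chain that is conceptually correct may still crash inside `system()` on a `movaps` instruction; a common manual fix is padding the chain with an extra lone `ret` gadget. A minimal sketch, where all addresses and the 40-byte offset are placeholders, not values from the paper:

```python
import struct

def p64(value: int) -> bytes:
    """Pack a 64-bit little-endian value, like pwntools' p64."""
    return struct.pack("<Q", value)

# Placeholder gadget addresses for illustration only.
RET_GADGET = 0x401016   # a lone `ret`, used purely to realign the stack
POP_RDI = 0x401273      # pop rdi; ret
BIN_SH = 0x404060       # address of a "/bin/sh" string
SYSTEM = 0x401050       # system@plt

padding = b"A" * 40     # hypothetical offset to the saved return address
chain = padding
chain += p64(RET_GADGET)            # extra ret keeps rsp 16-byte aligned
chain += p64(POP_RDI) + p64(BIN_SH) # rdi = "/bin/sh"
chain += p64(SYSTEM)                # call system("/bin/sh")
```

An LLM that omits the alignment gadget produces a payload that is one 8-byte entry shorter and fails at runtime, which is exactly the class of error the verification module's error feedback is meant to surface.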
Research Gaps
- No existing benchmark for systematically evaluating LLM capabilities on binary exploitation tasks prior to this work
- Lack of research on applying LLMs to executable file exploits, with most prior LLM security work focused on web vulnerabilities and penetration testing
- LLMs fundamentally lack understanding of runtime memory state, heap layouts, and low-level system details critical for exploitation
- No integration of RAG or external exploit knowledge bases to supplement LLM pre-training knowledge for exploit generation
- Dynamic analysis is absent from the framework, limiting the ability to handle complex real-world vulnerabilities
- No mechanism for LLMs to interact with a live debugging environment to observe actual memory states during exploitation
Novel Techniques
- Dual-workflow code analysis: routing decompiled C files to simple or complex analysis pipelines based on code volume, with the complex workflow using LLM-assisted function ranking and static analysis to reduce input size
- Structured Outputs for exploit generation: requiring LLMs to produce exploits in a three-part format (introduction, imports, code) to facilitate automated verification
- Modular decomposition of exploit generation into analysis, generation, and verification phases, each addressing specific LLM weaknesses identified through systematic capability benchmarking
- Iterative verification with error reflection: feeding execution errors back to LLMs for self-reflection and exploit modification
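The three-part Structured Outputs format can be sketched with a JSON Schema like the one below. The field names follow the introduction/imports/code split described above; the schema details and the assembly helper are illustrative, not taken from the paper.

```python
import json

# Schema mirroring the three-part exploit format the LLM is asked to
# emit via Structured Outputs (illustrative sketch).
EXPLOIT_SCHEMA = {
    "type": "object",
    "properties": {
        "introduction": {"type": "string"},  # plain-text exploit strategy
        "imports": {"type": "string"},       # e.g. "from pwn import *"
        "code": {"type": "string"},          # the exploit body itself
    },
    "required": ["introduction", "imports", "code"],
    "additionalProperties": False,
}


def assemble_exploit(raw_reply: str) -> str:
    """Parse the model's JSON reply and stitch imports + code into a
    runnable script, rejecting replies that miss a required field."""
    reply = json.loads(raw_reply)
    missing = [k for k in EXPLOIT_SCHEMA["required"] if k not in reply]
    if missing:
        raise ValueError(f"reply missing required fields: {missing}")
    return reply["imports"] + "\n\n" + reply["code"]
```

Forcing this shape is what makes the Verification Module's job mechanical: the script can be extracted and executed without scraping code out of free-form prose.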
Open Questions
- Can RAG with exploit knowledge bases meaningfully improve LLM performance on vulnerability types they currently cannot handle (heap exploitation, integer overflow)?
- How can LLMs be given access to runtime memory state information to enable exploitation of heap vulnerabilities and other memory-dependent attacks?
- Would fine-tuning LLMs on exploit generation data improve performance beyond the prompt engineering approaches used here?
- How well does PwnGPT scale to real-world software with stripped binaries, complex control flow, and multiple interacting components?
- Can the verification module be enhanced with dynamic analysis tools (debuggers, memory inspection) to provide richer feedback than just error messages?
- What is the upper bound of LLM-based AEG performance, and which vulnerability classes are fundamentally beyond current LLM capabilities?
Builds On
- Mayhem (Cha et al., 2012)
- Revery (Wang et al., 2018)
- KOOBE (Chen et al., 2020)
- MAZE (Wang et al., 2021)
- AEM (Jiang et al., 2023)
- AutoPwn (Xu et al., 2024a)
- PentestGPT (Deng et al., 2024)
- Fang et al., 2024b (Teams of LLM agents can exploit zero-day vulnerabilities)
- Anthropic Building Effective Agents (2024)
Open Source
Yes - https://github.com/aeg-hit/PwnGPT