#38

Getting Pwn'd by AI: Penetration Testing with Large Language Models

Andreas Happe, Juergen Cito

2023 | ESEC/FSE '23 (31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering) (top-conference)

10.1145/3611643.3613083

Problem & Motivation

Penetration testing requires high levels of expertise and involves many manual testing and analysis steps, yet the cybersecurity field suffers from a chronic shortage of personnel. This paper explores to what extent large language models can automate security testing by serving as AI sparring partners for penetration testers.

The cybersecurity workforce gap is growing faster than the workforce itself (26.2% gap growth vs 11.1% workforce growth per ISC2 2022). Penetration testers have highlighted the need for human sparring partners who offer alternative ideas when stuck. AI-based sparring partners could augment existing testers, counteract the personnel shortage, and benefit the training of novice penetration testers while keeping a human in the loop to reduce ethical concerns.

Threat Model

The paper assumes a post-authentication scenario for the low-level use case: a penetration tester has already gained low-privilege access to a Linux system (via SSH) and seeks privilege escalation to root. For the high-level use case, the attacker seeks to become domain admin in an Active Directory environment. The LLM acts as an advisor/executor rather than a fully autonomous attacker.

Methodology

The paper explores two distinct use cases for LLM-augmented penetration testing. First, high-level task planning, in which AutoGPT is asked to generate penetration-testing plans for Active Directory attacks and for external engagements against a real target organization. Second, a low-level attack-execution system, in which GPT-3.5 is integrated with a vulnerable Linux virtual machine via SSH in a closed feedback loop: the LLM suggests shell commands, the commands are executed on the target, and the output is fed back to the LLM for iterative vulnerability discovery and exploitation.

Architecture

For the low-level system (hackingBuddyGPT), a Python script connects via SSH to a deliberately vulnerable lin.security Linux VM. In an infinite loop, GPT-3.5 is prompted to imagine being a low-privilege user who wants to become root and to state a single Linux shell command, which is then executed over SSH on the VM. The command output is fed back to GPT-3.5, which also identifies potential vulnerabilities and suggests verification commands. This creates a closed feedback loop between the LLM and the target system.

LLM Models

  • GPT-3.5-turbo
  • GPT-4 (mentioned via AutoGPT's use)

Tool Integration

  • SSH (remote command execution)
  • AutoGPT (for high-level task planning)
  • Linux shell commands (sudo, cat, etc.)
  • GTFOBins (referenced for privilege escalation)

Memory Mechanism

conversation-history

Attack Phases Covered

reconnaissance
scanning
enumeration
exploitation
post exploitation
privilege escalation
lateral movement
reporting

Evaluation

The simple LLM-shell feedback loop routinely gained root privileges on the vulnerable VM through multiple attack vectors, including sudo misconfigurations, GTFOBins exploitation, and weak-password attacks via /etc/passwd. For high-level planning, AutoGPT generated realistic and feasible Active Directory attack plans covering password spraying, Kerberoasting, AS-REP roasting, and ADCS exploitation. Individual runs were not stable, but results converged over multiple runs.

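As an illustration of the sudo/GTFOBins vector, a hypothetical helper could match `sudo -l` output against a small subset of GTFOBins recipes (the table below is a tiny, illustrative sample, not the full catalog):

```python
# Illustrative helper for one attack vector the loop exploited: sudo
# misconfigurations whose allowed binaries have known GTFOBins escapes.

# Tiny sample of GTFOBins sudo entries: binary -> shell-escape recipe.
GTFOBINS_SUDO = {
    "awk":  "sudo awk 'BEGIN {system(\"/bin/sh\")}'",
    "find": "sudo find . -exec /bin/sh \\; -quit",
    "vim":  "sudo vim -c ':!/bin/sh'",
}

def escalation_candidates(sudo_l_output):
    """Return (binary, recipe) pairs for sudo-allowed GTFOBins binaries."""
    hits = []
    for line in sudo_l_output.splitlines():
        line = line.strip()
        # sudo -l permission lines look like "(root) NOPASSWD: /usr/bin/awk"
        if not line.startswith(("(", "NOPASSWD")):
            continue
        for binary, recipe in GTFOBINS_SUDO.items():
            if "/" + binary in line or line.endswith(binary):
                hits.append((binary, recipe))
    return hits
```
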
Environment

  • lin.security VulnHub VM
  • Real target organization (with approval, for high-level planning only)

Metrics

success-rate, qualitative-analysis

Baseline Comparisons

  • linpeas.sh (enumeration tool, qualitative comparison)

Scale

1 VulnHub VM (lin.security) for low-level testing; 1 real organization for high-level planning

Contributions

  • Demonstrated two complementary use cases for LLM-augmented penetration testing: high-level task planning and low-level attack execution
  • Implemented a closed-feedback-loop prototype (hackingBuddyGPT) connecting GPT-3.5 to a vulnerable VM via SSH for automated privilege escalation
  • Showed that even GPT-3.5 (not GPT-4) could routinely achieve root access on a vulnerable Linux system through iterative command execution
  • Provided qualitative analysis of LLM behavior including grounding, hallucinations, stability, and ethical moderation bypasses
  • Outlined a vision for AI-augmented penetration testing covering integration of high/low-level tasks, model options, memory/verification, and prompt optimization

Limitations

  • Single prototype runs were not stable; the sequence and selection of commands varied between runs, though results converged over multiple iterations
  • LLM suggestions appeared based on pattern-matching and preconceptions from training data rather than deep understanding of the target system
  • The LLM could not perform multi-step planning for complex exploitation chains (e.g., found SUID binaries but did not exploit them)
  • Memory was simplistic, limited to the context window (4k tokens for the GPT-3.5-turbo model used), with executed command output stored only until the context limit was reached
  • Evaluation limited to a single deliberately vulnerable VM (lin.security), not tested on hardened or real-world production systems
  • Ethical moderation in GPT-3.5-turbo could be easily bypassed through slight prompt variations (e.g., asking for 'verification commands' instead of 'exploits')
  • Used only cloud-based OpenAI API, raising concerns about sharing sensitive penetration test data
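
The simplistic conversation-history memory noted above can be approximated as a token-budgeted sliding window. The word-count tokenizer below is a rough stand-in for a real tokenizer (e.g. tiktoken); the function name and budget are illustrative:

```python
# Sketch of the paper's simplistic memory: keep (command, output) pairs
# in the prompt until a token budget (~4k for gpt-3.5-turbo) is exceeded,
# then drop the oldest entries first. Word count approximates tokens.

def trim_history(history, budget=4000, count_tokens=lambda s: len(s.split())):
    """Drop the oldest (command, output) pairs until the rest fits the budget."""
    kept = list(history)
    while kept and sum(count_tokens(c) + count_tokens(o) for c, o in kept) > budget:
        kept.pop(0)  # oldest context is lost first
    return kept
```

This sliding window is exactly why long exploitation chains fail: evidence gathered early (e.g. a SUID binary found during enumeration) silently falls out of the prompt before it can be used.
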

Research Gaps

  • Need to evaluate locally-run open-source models (Llama, StableLM, Dolly2) to avoid cloud data leakage and enable customer-specific fine-tuning
  • Integration of high-level and low-level penetration testing tasks into a unified LLM system
  • Advanced memory mechanisms including multiple memory streams for commands, security findings, and system model building
  • Using LLMs to generate and optimize their own prompts for penetration testing (meta-prompting)
  • Determining what parameter size is 'good enough' for effective security testing LLMs
  • Fine-tuning models on customer-specific or engagement-specific data for improved penetration testing
  • Research into better question formulation based on empirical studies of how penetration testers work

Novel Techniques

  • Closed feedback loop between LLM and target system via SSH for iterative privilege escalation
  • Using MITRE ATT&CK TTP hierarchy to structure LLM queries at different abstraction levels (tactics vs techniques vs procedures)
  • Prompt engineering to bypass ethical moderation (asking for 'verification commands' instead of 'exploits')
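
Structuring queries along the MITRE ATT&CK TTP hierarchy could look like the following sketch; the prompt wording and template names are illustrative, not quoted from the paper:

```python
# Sketch of ATT&CK-structured querying: the same goal phrased at the
# tactic, technique, and procedure abstraction levels. Templates are
# illustrative placeholders, not the paper's actual prompts.

PROMPT_TEMPLATES = {
    "tactic":    "Which {tactic} tactics apply to a Linux host I have shell access to?",
    "technique": "List ATT&CK techniques under {tactic} relevant to this host.",
    "procedure": "Give a concrete shell command implementing technique {technique}.",
}

def build_prompt(level, **fields):
    """Fill the template for one abstraction level of the TTP hierarchy."""
    return PROMPT_TEMPLATES[level].format(**fields)
```

The hierarchy lets a planner query abstract tactics first and only descend to concrete procedures once a technique looks applicable, mirroring how the paper separates high-level planning from low-level execution.
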

Open Questions

  • Can LLMs move beyond pattern-matching from training data to genuine reasoning about system vulnerabilities?
  • How to balance the dual-use nature of LLM-based penetration testing tools (defensive vs offensive use)?
  • What is the minimum model size and capability threshold for effective automated penetration testing?
  • How to make LLM-based pentest tools deterministic enough for reproducible security assessments?
  • Can fine-tuned local models match or exceed cloud API models for penetration testing while preserving data confidentiality?

Builds On

  • AutoGPT
  • BabyAGI
  • MITRE ATT&CK framework
  • Happe and Cito 2023 (Understanding Hackers' Work interview study)

Open Source

Yes - https://github.com/ipa-lab/hackingBuddyGPT

Tags