Getting Pwn'd by AI: Penetration Testing with Large Language Models
Problem & Motivation
Penetration testing requires high levels of expertise and involves many manual testing and analysis steps, yet the cybersecurity field suffers from a chronic shortage of personnel. This paper explores to what extent large language models can automate security testing by serving as AI sparring partners for penetration testers.
The cybersecurity workforce gap is growing faster than the workforce itself (26.2% gap growth vs. 11.1% workforce growth, per the ISC2 2022 workforce study). Penetration testers have highlighted the need for human sparring partners who can offer alternative ideas when they get stuck. AI-based sparring partners could augment existing testers, counteract the personnel shortage, and benefit the training of novice penetration testers, while keeping a human in the loop to reduce ethical concerns.
Threat Model
The paper assumes a post-authentication scenario for the low-level use case: a penetration tester has already gained low-privilege access to a Linux system (via SSH) and seeks privilege escalation to root. For the high-level use case, the attacker seeks to become domain admin in an Active Directory environment. The LLM acts as an advisor/executor rather than a fully autonomous attacker.
Methodology
The paper explores two distinct use cases for LLM-augmented penetration testing. First, high-level task planning where AutoGPT is asked to generate penetration testing plans for Active Directory attacks and external engagements against a real target organization. Second, a low-level attack-execution system where GPT-3.5 is integrated with a vulnerable Linux virtual machine via SSH in a closed feedback loop: the LLM suggests shell commands, the commands are executed on the target, and the output is fed back to the LLM for iterative vulnerability discovery and exploitation.
Architecture
For the low-level system (hackingBuddyGPT), a Python script connects to a deliberately vulnerable lin.security Linux VM via SSH. In an infinite loop, GPT-3.5 is prompted to imagine being a low-privilege user who wants to become root and to state a single Linux shell command, which is then executed on the VM over SSH. The command output is fed back to GPT-3.5, which also identifies potential vulnerabilities and suggests verification commands. This creates a closed feedback loop between the LLM and the target system.
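The closed feedback loop described above can be sketched in Python. This is a minimal illustration, not the authors' implementation: the function names, the history-trimming policy, and the `uid=0(root)` stop condition are assumptions; the LLM call and the SSH execution are passed in as callables so the loop logic stands alone.

```python
def build_prompt(history, limit_chars=12000):
    # Keep only the most recent command/output pairs that fit the
    # (roughly 4k-token) context window of gpt-3.5-turbo; older
    # entries are silently dropped, mirroring the paper's simplistic
    # conversation-history memory.
    system = ("You are a low-privilege user on a Linux system and want "
              "to become root. State a single shell command to try next.")
    lines, used = [], 0
    for cmd, out in reversed(history):
        entry = f"$ {cmd}\n{out}\n"
        if used + len(entry) > limit_chars:
            break
        lines.append(entry)
        used += len(entry)
    return system + "\n" + "".join(reversed(lines))

def feedback_loop(suggest_command, execute_on_target, rounds=10):
    """suggest_command: prompt -> shell command (the LLM call);
    execute_on_target: command -> output (e.g. run over an SSH channel)."""
    history = []
    for _ in range(rounds):
        cmd = suggest_command(build_prompt(history))
        out = execute_on_target(cmd)
        history.append((cmd, out))
        if "uid=0(root)" in out:  # crude success check, used here to stop
            return history
    return history
```

In the real prototype `suggest_command` would wrap an OpenAI chat-completion call and `execute_on_target` an SSH `exec` on the VM; stubbing both makes the loop testable offline.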
LLM Models
GPT-3.5 (gpt-3.5-turbo) for low-level attack execution; AutoGPT for high-level task planning
Tool Integration
SSH command execution against the target VM
Memory Mechanism
conversation-history (executed commands and their output kept in the prompt until the context limit is reached)
Attack Phases Covered
Privilege escalation (low-level); Active Directory attack planning including password spraying, Kerberoasting, AS-REP roasting, and ADCS exploitation (high-level)
Evaluation
The simple LLM-shell feedback loop was able to routinely gain root privileges on the vulnerable VM through multiple attack vectors including sudo misconfigurations, GTFOBins exploitation, and weak password attacks via /etc/passwd. For high-level planning, AutoGPT generated realistic and feasible Active Directory attack plans covering password spraying, Kerberoasting, AS-REP roasting, and ADCS exploitation. Individual runs were not stable but results converged over multiple runs.
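The sudo-misconfiguration vector the loop exploited can also be checked mechanically: parse `sudo -l` output for binaries with known GTFOBins shell escapes. A minimal sketch, assuming a simplified parse of `sudo -l` lines and a hand-picked subset of real GTFOBins one-liners (the full catalog is much larger):

```python
# Small subset of real GTFOBins sudo shell escapes (illustrative selection).
GTFO_SUDO = {
    "find": "sudo find . -exec /bin/sh \\; -quit",
    "vim":  "sudo vim -c ':!/bin/sh'",
    "awk":  "sudo awk 'BEGIN {system(\"/bin/sh\")}'",
}

def gtfobins_candidates(sudo_l_output):
    """Return known shell-escape one-liners for binaries the current
    user may run as root, parsed from lines like
    '(root) NOPASSWD: /usr/bin/find' in `sudo -l` output."""
    hits = []
    for line in sudo_l_output.splitlines():
        if "(root)" in line and "/" in line:
            binary = line.strip().split("/")[-1]
            if binary in GTFO_SUDO:
                hits.append(GTFO_SUDO[binary])
    return hits
```

A deterministic lookup like this is what tools such as linpeas.sh approximate; the paper's point is that the LLM rediscovered these vectors without an explicit rule base.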
Environment
Deliberately vulnerable lin.security VM (VulnHub) accessed over SSH; cloud-based OpenAI API
Metrics
Qualitative: whether root access was obtained (low-level); realism and feasibility of generated plans (high-level)
Baseline Comparisons
- linpeas.sh (enumeration tool, qualitative comparison)
Scale
1 VulnHub VM (lin.security) for low-level testing; 1 real organization for high-level planning
Contributions
- Demonstrated two complementary use cases for LLM-augmented penetration testing: high-level task planning and low-level attack execution
- Implemented a closed-feedback-loop prototype (hackingBuddyGPT) connecting GPT-3.5 to a vulnerable VM via SSH for automated privilege escalation
- Showed that even GPT-3.5 (not GPT-4) could routinely achieve root access on a vulnerable Linux system through iterative command execution
- Provided qualitative analysis of LLM behavior including grounding, hallucinations, stability, and ethical moderation bypasses
- Outlined a vision for AI-augmented penetration testing covering integration of high/low-level tasks, model options, memory/verification, and prompt optimization
Limitations
- Single prototype runs were not stable; the sequence and selection of commands varied between runs, though results converged over multiple iterations
- LLM suggestions appeared based on pattern-matching and preconceptions from training data rather than deep understanding of the target system
- The LLM could not perform multi-step planning for complex exploitation chains (e.g., found SUID binaries but did not exploit them)
- Memory was simplistic, limited to the context window (4k tokens for the gpt-3.5-turbo model used), with executed command output stored only until the context limit was reached
- Evaluation limited to a single deliberately vulnerable VM (lin.security), not tested on hardened or real-world production systems
- Ethical moderation in GPT-3.5-turbo could be easily bypassed through slight prompt variations (e.g., asking for 'verification commands' instead of 'exploits')
- Used only cloud-based OpenAI API, raising concerns about sharing sensitive penetration test data
Research Gaps
- Need to evaluate locally-run open-source models (Llama, StableLM, Dolly2) to avoid cloud data leakage and enable customer-specific fine-tuning
- Integration of high-level and low-level penetration testing tasks into a unified LLM system
- Advanced memory mechanisms including multiple memory streams for commands, security findings, and system model building
- Using LLMs to generate and optimize their own prompts for penetration testing (meta-prompting)
- Determining what parameter size is 'good enough' for effective security testing LLMs
- Fine-tuning models on customer-specific or engagement-specific data for improved penetration testing
- Research into better question formulation based on empirical studies of how penetration testers work
Novel Techniques
- Closed feedback loop between LLM and target system via SSH for iterative privilege escalation
- Using MITRE ATT&CK TTP hierarchy to structure LLM queries at different abstraction levels (tactics vs techniques vs procedures)
- Prompt engineering to bypass ethical moderation (asking for 'verification commands' instead of 'exploits')
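The TTP-hierarchy idea amounts to parameterizing queries by abstraction level. A sketch, assuming hypothetical prompt templates (the ATT&CK IDs are real; the wording is not from the paper):

```python
# One template per ATT&CK abstraction level: tactic -> technique -> procedure.
QUERY_TEMPLATES = {
    "tactic":    "Which ATT&CK techniques under the tactic '{name}' apply to a Linux host?",
    "technique": "List common procedures for technique {att_id} ({name}).",
    "procedure": "Give a concrete shell command implementing {name} on the target system.",
}

def make_query(level, name, att_id=""):
    """Build an LLM query at the chosen ATT&CK abstraction level."""
    return QUERY_TEMPLATES[level].format(name=name, att_id=att_id)
```

Descending the hierarchy (tactic, then technique, then procedure) lets the planner ask broad questions first and only request concrete commands once a technique is selected.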
Open Questions
- Can LLMs move beyond pattern-matching from training data to genuine reasoning about system vulnerabilities?
- How to balance the dual-use nature of LLM-based penetration testing tools (defensive vs offensive use)?
- What is the minimum model size and capability threshold for effective automated penetration testing?
- How to make LLM-based pentest tools deterministic enough for reproducible security assessments?
- Can fine-tuned local models match or exceed cloud API models for penetration testing while preserving data confidentiality?
Builds On
- AutoGPT
- BabyAGI
- MITRE ATT&CK framework
- Happe and Cito 2023 (Understanding Hackers' Work interview study)
Open Source
Yes - https://github.com/ipa-lab/hackingBuddyGPT