#38

Getting Pwn'd by AI: Penetration Testing with Large Language Models

Andreas Happe, Juergen Cito

2023 | ESEC/FSE '23 (31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering) (top-conference)

10.1145/3611643.3613083

Problem & Motivation

Penetration testing requires high levels of expertise and involves many manual testing and analysis steps, yet the cybersecurity field suffers from a chronic shortage of personnel. This paper explores to what extent large language models can automate security testing by serving as AI sparring partners for penetration testers.

The cybersecurity workforce gap is growing faster than the workforce itself (26.2% gap growth vs 11.1% workforce growth per ISC2 2022). Penetration testers have highlighted the need for human sparring partners who offer alternative ideas when stuck. AI-based sparring partners could augment existing testers, counteract the personnel shortage, and benefit the training of novice penetration testers while keeping a human in the loop to reduce ethical concerns.

Threat Model

The paper assumes a post-authentication scenario for the low-level use case: a penetration tester has already gained low-privilege access to a Linux system (via SSH) and seeks privilege escalation to root. For the high-level use case, the attacker seeks to become domain admin in an Active Directory environment. The LLM acts as an advisor/executor rather than a fully autonomous attacker.

Methodology

The paper explores two distinct use cases for LLM-augmented penetration testing. First, high-level task planning, in which AutoGPT is asked to generate penetration-testing plans for Active Directory attacks and for external engagements against a real target organization. Second, a low-level attack-execution system, in which GPT-3.5 is integrated with a vulnerable Linux virtual machine via SSH in a closed feedback loop: the LLM suggests shell commands, the commands are executed on the target, and the output is fed back to the LLM for iterative vulnerability discovery and exploitation.

Architecture

For the low-level system (hackingBuddyGPT), a Python script connects via SSH to a deliberately vulnerable lin.security Linux VM. In an infinite loop, GPT-3.5 is prompted to imagine being a low-privilege user who wants to become root and to state a single Linux shell command, which is then executed over SSH on the VM. The command output is fed back to GPT-3.5, which also identifies potential vulnerabilities and suggests verification commands. This creates a closed feedback loop between the LLM and the target system.

LLM Models

  • GPT-3.5-turbo
  • GPT-4 (mentioned via AutoGPT's use)

Tool Integration

  • SSH (remote command execution)
  • AutoGPT (for high-level task planning)
  • Linux shell commands (sudo, cat, etc.)
  • GTFOBins (referenced for privilege escalation)

Memory Mechanism

conversation-history

Attack Phases Covered

reconnaissance
scanning
enumeration
exploitation
post exploitation
privilege escalation
lateral movement
reporting

Evaluation

The simple LLM-shell feedback loop routinely gained root privileges on the vulnerable VM through multiple attack vectors, including sudo misconfigurations, GTFOBins exploitation, and weak-password attacks via /etc/passwd. For high-level planning, AutoGPT generated realistic and feasible Active Directory attack plans covering password spraying, Kerberoasting, AS-REP roasting, and ADCS exploitation. Individual runs were not stable, but results converged over multiple runs.

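As an illustration of the sudo/GTFOBins vector, a hypothetical helper could match `sudo -l` output against a small subset of GTFOBins recipes (the table below is a tiny, illustrative sample, not the full catalog):

```python
# Illustrative helper for one attack vector the loop exploited: sudo
# misconfigurations whose allowed binaries have known GTFOBins escapes.

# Tiny sample of GTFOBins sudo entries: binary -> shell-escape recipe.
GTFOBINS_SUDO = {
    "awk":  "sudo awk 'BEGIN {system(\"/bin/sh\")}'",
    "find": "sudo find . -exec /bin/sh \\; -quit",
    "vim":  "sudo vim -c ':!/bin/sh'",
}

def escalation_candidates(sudo_l_output):
    """Return (binary, recipe) pairs for sudo-allowed GTFOBins binaries."""
    hits = []
    for line in sudo_l_output.splitlines():
        line = line.strip()
        # sudo -l permission lines look like "(root) NOPASSWD: /usr/bin/awk"
        if not line.startswith(("(", "NOPASSWD")):
            continue
        for binary, recipe in GTFOBINS_SUDO.items():
            if "/" + binary in line or line.endswith(binary):
                hits.append((binary, recipe))
    return hits
```
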
Environment

  • lin.security VulnHub VM
  • Real target organization (with approval, for high-level planning only)

Metrics

success-rate, qualitative-analysis

Baseline Comparisons

  • linpeas.sh (enumeration tool, qualitative comparison)

Scale

1 VulnHub VM (lin.security) for low-level testing; 1 real organization for high-level planning

Contributions

  • Demonstrated two complementary use cases for LLM-augmented penetration testing: high-level task planning and low-level attack execution
  • Implemented a closed-feedback-loop prototype (hackingBuddyGPT) connecting GPT-3.5 to a vulnerable VM via SSH for automated privilege escalation
  • Showed that even GPT-3.5 (not GPT-4) could routinely achieve root access on a vulnerable Linux system through iterative command execution
  • Provided qualitative analysis of LLM behavior including grounding, hallucinations, stability, and ethical moderation bypasses
  • Outlined a vision for AI-augmented penetration testing covering integration of high/low-level tasks, model options, memory/verification, and prompt optimization

Limitations

  • Single prototype runs were not stable; the sequence and selection of commands varied between runs, though results converged over multiple iterations
  • LLM suggestions appeared based on pattern-matching and preconceptions from training data rather than deep understanding of the target system
  • The LLM could not perform multi-step planning for complex exploitation chains (e.g., found SUID binaries but did not exploit them)
  • Memory was simplistic, limited to the context window (4k tokens for the GPT-3.5-turbo model used), with executed command output stored only until the context limit was reached
  • Evaluation limited to a single deliberately vulnerable VM (lin.security), not tested on hardened or real-world production systems
  • Ethical moderation in GPT-3.5-turbo could be easily bypassed through slight prompt variations (e.g., asking for 'verification commands' instead of 'exploits')
  • Used only cloud-based OpenAI API, raising concerns about sharing sensitive penetration test data
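
The simplistic conversation-history memory noted above can be approximated as a token-budgeted sliding window. The word-count tokenizer below is a rough stand-in for a real tokenizer (e.g. tiktoken); the function name and budget are illustrative:

```python
# Sketch of the paper's simplistic memory: keep (command, output) pairs
# in the prompt until a token budget (~4k for gpt-3.5-turbo) is exceeded,
# then drop the oldest entries first. Word count approximates tokens.

def trim_history(history, budget=4000, count_tokens=lambda s: len(s.split())):
    """Drop the oldest (command, output) pairs until the rest fits the budget."""
    kept = list(history)
    while kept and sum(count_tokens(c) + count_tokens(o) for c, o in kept) > budget:
        kept.pop(0)  # oldest context is lost first
    return kept
```

This sliding window is exactly why long exploitation chains fail: evidence gathered early (e.g. a SUID binary found during enumeration) silently falls out of the prompt before it can be used.
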

Research Gaps

  • Need to evaluate locally-run open-source models (Llama, StableLM, Dolly2) to avoid cloud data leakage and enable customer-specific fine-tuning
  • Integration of high-level and low-level penetration testing tasks into a unified LLM system
  • Advanced memory mechanisms including multiple memory streams for commands, security findings, and system model building
  • Using LLMs to generate and optimize their own prompts for penetration testing (meta-prompting)
  • Determining what parameter size is 'good enough' for effective security testing LLMs
  • Fine-tuning models on customer-specific or engagement-specific data for improved penetration testing
  • Research into better question formulation based on empirical studies of how penetration testers work

Novel Techniques

  • Closed feedback loop between LLM and target system via SSH for iterative privilege escalation
  • Using MITRE ATT&CK TTP hierarchy to structure LLM queries at different abstraction levels (tactics vs techniques vs procedures)
  • Prompt engineering to bypass ethical moderation (asking for 'verification commands' instead of 'exploits')
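
Structuring queries along the MITRE ATT&CK TTP hierarchy could look like the following sketch; the prompt wording and template names are illustrative, not quoted from the paper:

```python
# Sketch of ATT&CK-structured querying: the same goal phrased at the
# tactic, technique, and procedure abstraction levels. Templates are
# illustrative placeholders, not the paper's actual prompts.

PROMPT_TEMPLATES = {
    "tactic":    "Which {tactic} tactics apply to a Linux host I have shell access to?",
    "technique": "List ATT&CK techniques under {tactic} relevant to this host.",
    "procedure": "Give a concrete shell command implementing technique {technique}.",
}

def build_prompt(level, **fields):
    """Fill the template for one abstraction level of the TTP hierarchy."""
    return PROMPT_TEMPLATES[level].format(**fields)
```

The hierarchy lets a planner query abstract tactics first and only descend to concrete procedures once a technique looks applicable, mirroring how the paper separates high-level planning from low-level execution.
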

Open Questions

  • Can LLMs move beyond pattern-matching from training data to genuine reasoning about system vulnerabilities?
  • How to balance the dual-use nature of LLM-based penetration testing tools (defensive vs offensive use)?
  • What is the minimum model size and capability threshold for effective automated penetration testing?
  • How to make LLM-based pentest tools deterministic enough for reproducible security assessments?
  • Can fine-tuned local models match or exceed cloud API models for penetration testing while preserving data confidentiality?

Builds On

  • AutoGPT
  • BabyAGI
  • MITRE ATT&CK framework
  • Happe and Cito 2023 (Understanding Hackers' Work interview study)

Open Source

Yes - https://github.com/ipa-lab/hackingBuddyGPT

Tags