#55

LLMs as Hackers: Autonomous Linux Privilege Escalation Attacks

Andreas Happe, Aaron Kaplan, Jürgen Cito

2026 | Empirical Software Engineering (journal)

https://doi.org/10.1007/s10664-025-10758-3

Problem & Motivation

There is no comprehensive understanding of LLMs' efficacy and limitations in performing autonomous Linux privilege-escalation attacks. Existing work lacks reproducible benchmarks and controlled evaluation of different LLMs on this critical penetration-testing subtask.

Privilege escalation is a critical subtask of penetration testing that is largely performed manually. LLMs present opportunities to automate this task, but their capabilities have not been systematically evaluated. Understanding LLM performance on privilege escalation can guide future research toward more effective and reliable LLM-guided penetration testing tools.

Threat Model

A low-privilege local user on a Linux system attempts to escalate to root (uid 0) by exploiting misconfigurations and vulnerabilities. The attacker has SSH access and a standard user account. Each VM contains exactly one exploitable vulnerability.

Methodology

The authors introduce hackingBuddyGPT, a fully automated LLM-driven prototype that connects to vulnerable Linux VMs via SSH and iteratively queries an LLM to generate shell commands for privilege escalation. They curate a novel benchmark of 12 single-vulnerability Linux VMs and evaluate multiple LLMs (GPT-4-Turbo, GPT-3.5-Turbo, Llama3-70b, Llama3-8b) against baselines of human penetration testers and traditional automated tools (traitor, pwncat-cs). The study investigates the impact of context management strategies (history vs. state compaction), context sizes (4k to 128k), high-level guidance hints, and enumeration-tool-derived guidance on LLM performance.

Architecture

Single LLM control loop architecture. The main module uses a next-command prompt to query the LLM for the next shell command to execute on the target VM. An optional State Management Module compresses execution history into a compact state via an update-state prompt (Reflection/Iterated Amplification pattern). Optional guidance mechanisms include high-level hints per test case and enumeration-tool-derived hints from linux-smart-enumeration.sh analyzed by an LLM. The prototype supports two capabilities: execute_command and test_credentials.
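
The control loop described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: `query_llm` and `run_ssh_command` are hypothetical stand-ins, stubbed here so the loop runs standalone (a real deployment would call a chat-completion API and execute the command over SSH, e.g. via paramiko), and the prompt wording is invented.

```python
# Sketch of a single-LLM control loop for privilege escalation
# (illustrative reconstruction; not the hackingBuddyGPT source).

NEXT_CMD_PROMPT = (
    "You are a low-privilege user on a Linux system. "
    "State so far:\n{state}\n"
    "Give exactly one shell command to try next for privilege escalation."
)

def query_llm(prompt):
    # Stub standing in for a chat-completion call (e.g., GPT-4-Turbo).
    return "sudo -l" if "sudo" not in prompt else "sudo /bin/bash -c id"

def run_ssh_command(cmd):
    # Stub standing in for command execution over SSH on the target VM.
    canned = {
        "sudo -l": "User may run the following: (ALL) NOPASSWD: ALL",
        "sudo /bin/bash -c id": "uid=0(root) gid=0(root)",
    }
    return canned.get(cmd, "command not found")

def update_state(state, cmd, output):
    # Reflection-style state compaction: keep a compact fact list
    # instead of the full conversation history (here: one summary line).
    return state + [f"ran `{cmd}` -> {output[:60]}"]

def control_loop(max_rounds=10):
    state = []
    for _ in range(max_rounds):
        cmd = query_llm(NEXT_CMD_PROMPT.format(state="\n".join(state)))
        output = run_ssh_command(cmd)
        state = update_state(state, cmd, output)
        if "uid=0(root)" in output:  # success criterion: reached uid 0
            return True, state
    return False, state
```

With the stubs above, the loop first enumerates sudo rights, then exploits the unrestricted sudo entry, mirroring the iterative query-execute-update cycle of the prototype.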

LLM Models

GPT-4-Turbo, GPT-3.5-Turbo, Llama3-70b-q4, Llama3-8b-q8

Tool Integration

SSH, linux-smart-enumeration.sh, Vagrant, Ansible

Memory Mechanism

conversation-history

Attack Phases Covered

reconnaissance
scanning
enumeration
exploitation
post exploitation
privilege escalation
lateral movement
reporting

Evaluation

GPT-4-Turbo achieved 33-83% success rates across configurations, comparable to human penetration testers (75% unaided, 91% with hints). High-level guidance hints significantly boosted success rates (e.g., lifting GPT-4-Turbo from 33% to 66% in one configuration and from 66% to 83% in another). State compaction via Reflection doubled GPT-4-Turbo's success rate from 33% to 66%. GPT-3.5-Turbo achieved 16-50%, Llama3-70b 25-33%, and Llama3-8b 0-16%. Traditional tools (traitor, pwncat-cs) achieved only 8-16%. Depending on configuration, GPT-4-Turbo cost approximately $1.54-$11.43 per exploited vulnerability.

Environment

custom-lab

Metrics

success-rate, num-steps, cost, token-usage, context-size

Baseline Comparisons

  • human-penetration-tester
  • traitor
  • pwncat-cs

Scale

12 single-vulnerability Linux VMs covering SUID/sudo, cron-based, information disclosure, and Docker vulnerability classes
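
As a concrete illustration of the SUID/sudo vulnerability class (a hypothetical example, not taken from the authors' benchmark), a test VM for this class could be provisioned with a single misconfiguration such as a setuid-root copy of bash:

```shell
# Hypothetical SUID-class misconfiguration (illustrative only; the actual
# benchmark VMs are provisioned via Vagrant and Ansible).
cp /bin/bash /usr/local/bin/suid-bash
chown root:root /usr/local/bin/suid-bash
chmod 4755 /usr/local/bin/suid-bash   # setuid bit: runs with the owner's (root's) uid
# From the low-privilege account, bash's -p flag keeps the effective uid:
#   /usr/local/bin/suid-bash -p
#   id    # expected to report euid=0(root)
```

Such single-misconfiguration images are what make per-vulnerability-class success rates attributable to one root cause.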

Contributions

  • A publicly available Linux privilege-escalation benchmark consisting of 12 single-vulnerability Debian VMs deployable in air-gapped environments via Vagrant and Ansible
  • hackingBuddyGPT: a fully automated, open-source LLM-driven privilege escalation prototype
  • Quantitative empirical analysis comparing GPT-4-Turbo, GPT-3.5-Turbo, Llama3-70b, and Llama3-8b against human penetration testers and traditional automated tools
  • Analysis of context management strategies showing state compaction (Reflection pattern) doubles GPT-4-Turbo success rates
  • Evaluation of high-level guidance mechanisms showing they significantly boost LLM success rates, with high-level hints outperforming enumeration-tool-derived guidance
  • Qualitative analysis of LLM-generated commands revealing challenges in common-sense reasoning, error handling, multi-step exploitation, and temporal dependencies
  • Cost analysis demonstrating GPT-4-Turbo can achieve human-comparable performance at competitive costs

Limitations

  • Benchmark limited to 12 test cases covering only configuration-based vulnerabilities, excluding kernel exploits, service exploits, and weak file system permissions
  • Each VM contains only a single vulnerability, which does not reflect real-world multi-vulnerability scenarios
  • LLMs struggle with multi-step exploits requiring causal reasoning and temporal dependencies (e.g., cron-based attacks)
  • LLMs fail to apply common-sense reasoning such as using discovered passwords for privilege escalation or recognizing non-exploitable SUID binaries
  • LLMs repetitively execute similar commands, wasting rounds and resources
  • Small language models (Llama3-8b) are currently not capable of autonomous privilege escalation
  • State compaction via update-state prompts is 13.4x slower and potentially cost-ineffective due to asymmetric output token pricing
  • Selection bias possible in both vulnerability classes and LLM choices
  • In-context learning with background hacking material from HackTricks did not improve results for GPT-4-Turbo and substantially increased costs

Research Gaps

  • Lack of established standards or methodologies for Linux privilege escalation attacks (unlike web application testing with OWASP)
  • Need for improved multi-step and temporal reasoning capabilities in LLMs for complex exploit chains
  • Efficient context usage and prompt design for penetration-testing agents remains underexplored
  • Performance of locally-run open-source LLMs needs substantial improvement for practical privilege escalation
  • Human-AI interaction design for LLM-augmented penetration testing tools is unexplored
  • Fine-tuning or domain-specific training of LLMs for privilege escalation has not been investigated
  • No research on LLM-based defense mechanisms that could detect and differentiate LLM-driven attacks from human attackers

Novel Techniques

  • State compaction via Reflection pattern (update-state prompt) to compress execution history into a concise fact list, doubling success rates while reducing context size
  • Hybrid multi-model approach using GPT-4-Turbo for enumeration analysis and GPT-3.5-Turbo for execution command generation to balance cost and efficacy
  • High-level guidance mechanism emulating human penetration-tester checklists to direct LLM attack strategies
  • Single-vulnerability VM benchmark design enabling controlled per-vulnerability-class analysis
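
The state-compaction technique listed above can be sketched minimally as follows. This is illustrative only: `summarize` is a stub standing in for the update-state LLM call, and the prompt wording is invented, not the authors' prompt.

```python
# Sketch of the update-state (Reflection) pattern: instead of replaying
# the full command history each round, an "update-state" prompt asks the
# model to fold the latest result into a short fact list.

UPDATE_STATE_PROMPT = (
    "Known facts about the target:\n{facts}\n"
    "New observation: ran `{cmd}`, got:\n{output}\n"
    "Return the updated fact list, one fact per line, keep it short."
)

def summarize(prompt):
    # Stub: a real system would send `prompt` to an LLM; here we fake
    # the compaction by extracting the command and the first output line.
    body = prompt.split("New observation:")[1]
    cmd = body.split("`")[1]
    return f"{cmd}: " + body.split("got:\n")[1].splitlines()[0][:50]

def compact_state(facts, cmd, output):
    # Replace raw history growth with one compact fact per round.
    prompt = UPDATE_STATE_PROMPT.format(
        facts="\n".join(facts), cmd=cmd, output=output)
    facts.append(summarize(prompt))
    return facts
```

Keeping the context to a bounded fact list is what lets smaller context windows remain usable over long attack sessions, at the price of an extra model call per round.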

Open Questions

  • Can fine-tuning or domain-specific training significantly improve open-source LLM performance on privilege escalation?
  • How can LLMs be taught causal and temporal reasoning for multi-step exploits (e.g., cron-based attacks)?
  • What is the minimum model size required for effective autonomous privilege escalation?
  • How can LLM-driven penetration testing be extended to multi-vulnerability and multi-user scenarios?
  • Can defenders use behavioral differences between LLM and human attackers (e.g., command patterns, lack of working directory changes) for detection?

Builds On

  • wintermute
  • pentestGPT
  • AutoAttacker
  • PenHeal

Open Source

Yes - https://github.com/ipa-lab/hackingBuddyGPT, https://github.com/ipa-lab/benchmark-privesc-linux, https://github.com/ipa-lab/hackingbuddy-results

Tags