Hacking, The Lazy Way: LLM Augmented Pentesting
Problem & Motivation
Traditional penetration testing is heavily reliant on the expertise of skilled professionals, creating a bifurcated market between expensive manual pentests and low-quality automated tools that are compliance-driven and fail to uncover deeper vulnerabilities. Existing LLM-based pentesting frameworks are limited to static or simulated environments, lack real-time tool chaining, and do not support plugin-based extensibility or live shell orchestration.
Most existing automation tools in pentesting are limited to scripted tasks and lack the adaptability required for dynamic, context-aware exploration. The paper identifies six recurring limitations in prior work: static/simulated environments, lack of real-time tool chaining, no plugin-based extensibility, human-in-the-loop dependency, limited input handling for non-textual files, and no live shell or command orchestration. Pentest Copilot aims to bridge this gap by combining LLM reasoning with practical pentesting tools in a live, interactive infrastructure.
Threat Model
The system assumes authorized ethical pentesting engagements. The paper includes a risk and mitigation analysis covering malicious compute use, unauthorized targeting, leaked API keys, and RCE on the platform itself. Mitigations include CPU usage monitoring, no egress traffic by default, secrets management, and regular remediation.
Methodology
The paper introduces 'LLM Augmented Pentesting' via a tool called 'Pentest Copilot' that integrates GPT-4-Turbo into penetration testing workflows. The system uses a step-chaining approach (akin to chain-of-thought) with three sequential prompts per pentest loop: command generation, summarization, and to-do list update. RAG is employed with a curated vector database of Metasploit and MSFVenom modules to reduce hallucinations and provide up-to-date tool knowledge. A file analysis component converts non-textual inputs (ELF binaries, PE32 executables, configuration files, media) into plaintext for LLM reasoning. The system operates within a sandboxed infrastructure accessible via browser with VNC for GUI tools.
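The three-prompt step-chaining loop can be sketched as follows. This is a minimal illustration, not the paper's implementation: `llm(system, user)` and `run_command(cmd)` are caller-supplied stand-ins for the GPT-4-Turbo API and the sandbox SSH session, and the prompt wording is invented.

```python
# Sketch of the step-chaining loop: three sequential prompts per iteration
# (command generation, summarization, to-do list update).
def pentest_loop(objective, run_command, llm, max_steps=10):
    todo = [objective]   # running to-do list carried between steps
    context = ""         # compressed summary of prior tool output
    log = []
    for _ in range(max_steps):
        if not todo:
            break
        # Prompt 1: generate the next command from the to-do list.
        command = llm(
            "You are a pentest copilot. Emit exactly one shell command.",
            f"Context:\n{context}\nTo-do:\n" + "\n".join(todo))
        output = run_command(command)   # executed inside the sandbox
        # Prompt 2: compress raw tool output to stay within token limits.
        context = llm(
            "Summarize this tool output, keeping only actionable findings.",
            output)
        # Prompt 3: revise the to-do list in light of the new findings.
        todo = llm(
            "Rewrite the to-do list given these findings, one item per line.",
            f"Findings:\n{context}\nOld to-do:\n" + "\n".join(todo)).splitlines()
        log.append((command, context))
    return log
```

The loop terminates when the updated to-do list comes back empty, which is how summarization and to-do maintenance bound the context carried into each subsequent prompt.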
Architecture
Pentest Copilot consists of: (1) a central service that orchestrates the LLM interaction; (2) per-session sandbox compute instances pre-configured with security tools, accessible via SSH for command execution and VNC for GUI tools like Burp Suite; (3) a RAG server backed by a vector database of Metasploit/MSFVenom modules stored in Redis; (4) a dynamic prompt creation engine that injects prior context, target information, and plugin definitions; (5) a file analysis tool that converts binaries, configs, and media to plaintext; (6) VPN support via OpenVPN and Chisel for targeting private subnets. All services and sandboxes are deployed within a single subnet for seamless container communication.
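The RAG lookup over the module database can be illustrated with a toy in-memory retriever. Real embeddings and the Redis vector index are replaced here by bag-of-words vectors and a linear scan, and the module descriptions are invented examples, so this shows only the retrieval idea.

```python
# Toy retrieval sketch: rank Metasploit/MSFVenom module descriptions by
# cosine similarity to the query and return the top-k module names.
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding' (stand-in for a real embedding model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, modules, k=2):
    """Return the k module names whose descriptions best match the query."""
    q = embed(query)
    ranked = sorted(modules, key=lambda m: cosine(q, embed(modules[m])),
                    reverse=True)
    return ranked[:k]

MODULES = {  # illustrative entries, not the curated database itself
    "exploit/multi/handler": "generic payload handler for reverse shells",
    "auxiliary/scanner/smb/smb_version": "scan hosts for smb version information",
    "exploit/windows/smb/ms17_010_eternalblue": "smb remote code execution eternalblue",
}
```

The retrieved module entries are then spliced into the command-generation prompt, grounding the LLM in real module names instead of letting it hallucinate them.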
LLM Models
Tool Integration
Memory Mechanism
RAG
Attack Phases Covered
Evaluation
GPT-4-Turbo achieved the best overall performance: 100% structural accuracy, 60% functional correctness, 60% command accuracy, 80% plugin validity, and a 7.11 s average response time. Switching to GPT-4-Turbo cut task completion time from 8-10 minutes (unoptimized baseline) to 4-5 minutes. GPT-4 failed entirely (0% structural accuracy, functional correctness, and command accuracy) due to output-formatting issues.
Environment
Metrics
Baseline Comparisons
- PentestGPT
- GPTPEN
- BreachSeek
- AUTOATTACKER
Scale
30 manually curated test cases spanning vulnerability assessment through post-exploitation, drawn from bug-bounty reconnaissance, XSS, SQL-injection, and privilege-escalation scenarios; evaluated on a purposely vulnerable boot2root server
Contributions
- Introduction of the LLM Augmented Pentesting concept with Pentest Copilot, a copilot-style tool integrating GPT-4-Turbo into live pentesting workflows
- A step-chaining approach with three sequential prompts (command generation, summarization, to-do update) to manage token limits and maintain multi-step reasoning across pentesting phases
- RAG integration with a curated vector database of Metasploit and MSFVenom modules to reduce hallucinations and keep tool knowledge current
- A file analysis component that converts non-textual pentesting artifacts (ELF, PE32, configs, media) into plaintext for LLM reasoning
- A complete in-browser pentesting infrastructure with sandboxed environments, VNC desktop access, SSH command execution, and VPN support for private network targeting
- A plugin architecture with web search, run bash, generic response, netcat listener, and generate payload plugins
- An open-source testbenching and evaluation framework (HackerLLMBench) for comparing GPT models on pentesting tasks
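The file analysis contribution can be sketched as a dispatch on magic bytes that emits a plaintext summary an LLM can reason over. The real component extracts far richer detail (ELF security settings, PE32 imports, EXIF data, config hierarchies); only the detection skeleton is shown, and the output strings are invented.

```python
# Toy version of the file-analysis step: classify an artifact by its
# magic bytes and return a plaintext description for the LLM.
def describe(data: bytes, name: str) -> str:
    if data[:4] == b"\x7fELF":
        bits = 64 if data[4] == 2 else 32   # EI_CLASS byte of the ELF header
        return f"{name}: ELF {bits}-bit binary"
    if data[:2] == b"MZ":
        return f"{name}: PE32 executable ('MZ' DOS header)"
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return f"{name}: unrecognized binary, {len(data)} bytes"
    first = text.splitlines()[0] if text else ""
    return f"{name}: text file, first line {first!r}"
```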
Limitations
- Cannot mount zero-day attacks since the LLM's knowledge is limited to known vulnerabilities and techniques
- Unable to bypass security controls unknown to the LLM
- Uses attack patterns that may be easily detected by security systems due to well-known signatures
- Lacks awareness of closed-source and paid frameworks like Cobalt Strike (cannot set up Beacons, Proxies, or mount lateral traversal attacks)
- Limited effectiveness for advanced scenarios such as red teaming, malware writing, EDR bypass, or Active Directory attacks
- GPT-4-Turbo's 4,096-token output limit constrains the depth of guidance per response, requiring aggressive context compression
- Hallucinations and context drops can propagate errors across the step-chaining process
- Evaluation is limited to a custom benchmark with only 30 test cases on a single boot2root server, lacking validation on diverse real-world targets
- No formal user study or comparison with human penetration testers to validate the copilot's practical utility
- The qualitative comparison with other tools (Table 6) is feature-based, not empirically validated through head-to-head testing on the same targets
Research Gaps
- Most LLM pentesting tools operate only in static or simulated environments (CTFs, HackTheBox) and cannot adapt to arbitrary, unknown systems
- Lack of real-time tool chaining: few systems explore dynamic chaining of tools or automated follow-up commands based on live output
- No existing plugin-based extensibility in current LLM pentesting frameworks for seamless execution of common pentesting tools
- Limited handling of non-textual inputs (binaries, configuration files, media, shell output) by current LLM-based systems
- No live shell or command orchestration: most tools do not facilitate live sessions, shell access, or session-state awareness
- Absence of fine-tuned open-source models specifically for penetration testing and red teaming activities
- No red-team-specific knowledge base covering non-technical social engineering (phishing) and technology-heavy misconfigurations (e.g., Active Directory GPO)
Novel Techniques
- Step-chaining with three specialized prompts per pentest loop (command generation, summarization, to-do update) to manage token budgets while maintaining multi-step reasoning
- Dynamic prompt creation that conditionally injects prior context, target information (domain vs IP), and plugin definitions based on the current state of the engagement
- File analysis pipeline converting diverse non-textual pentesting artifacts (ELF security settings, PE32 imports, EXIF data, config hierarchies) into structured plaintext for LLM consumption
- Plugin-based architecture where tool capabilities (web search, bash execution, netcat listener, payload generation) are specified in the prompt and selected by the LLM
- In-browser sandboxed pentesting infrastructure combining SSH for CLI and VNC for GUI tool access within ephemeral containers
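The plugin-selection mechanism can be sketched as follows. The plugin names match those listed in the paper, but the JSON call format, validation step, and handler bodies are assumptions for illustration.

```python
# Sketch of plugin dispatch: the orchestrator parses the LLM's JSON
# plugin call, validates the plugin name, and runs the matching handler.
import json

PLUGINS = {  # names follow the paper; handler bodies are placeholders
    "web_search":       lambda args: f"searching: {args['query']}",
    "run_bash":         lambda args: f"$ {args['command']}",
    "generic_response": lambda args: args["text"],
    "netcat_listener":  lambda args: f"nc -lvnp {args['port']}",
    "generate_payload": lambda args: f"msfvenom -p {args['payload']}",
}

def dispatch(llm_reply: str) -> str:
    """Parse the LLM's JSON plugin call, validate it, and run the plugin."""
    call = json.loads(llm_reply)
    name = call.get("plugin")
    if name not in PLUGINS:   # invalid plugin calls are rejected, not executed
        raise ValueError(f"unknown plugin: {name}")
    return PLUGINS[name](call.get("args", {}))
```

Specifying plugin schemas in the prompt and validating the reply against a fixed registry is what the evaluation's "plugin validity" metric exercises.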
Open Questions
- How well does Pentest Copilot perform against real-world targets beyond a single boot2root server, especially enterprise environments with diverse technology stacks?
- Can the step-chaining approach scale to longer engagements (days/weeks) without critical context loss or error accumulation?
- How does the system compare to a skilled human pentester in terms of vulnerability discovery rate, time, and quality of findings?
- Would fine-tuned open-source models outperform GPT-4-Turbo for pentesting while reducing dependency on commercial APIs and associated cost/latency?
- How effective is the RAG component at preventing hallucinations quantitatively, and what is the retrieval accuracy for Metasploit/MSFVenom modules?
- Can the plugin architecture be extended to support more complex multi-step tool chains without human intervention (e.g., automated pivot from recon to exploitation)?
Builds On
- PentestGPT (Deng et al., 2024)
- Getting Pwn'd by AI (Happe and Cito, 2023)
- Chain-of-thought prompting (Wei et al., 2023)
- RAG (Lewis et al., 2021)
- AUTOATTACKER (Xu et al., 2024)
- CIPHER (Pratama et al., 2024)
- BreachSeek (AlShehri et al., 2024)
Open Source
Partial: the evaluation framework is available at https://github.com/Hackerbone/HackerLLMBench; the Pentest Copilot tool itself is not open-sourced.