#40

Generative AI for pentesting: the good, the bad, the ugly

Eric Hilario, Sami Azam, Jawahar Sundaram, Khwaja Imran Mohammed, Bharanidharan Shanmugam

2024 | International Journal of Information Security (journal)

https://doi.org/10.1007/s10207-024-00835-x

Problem & Motivation

The paper investigates how Generative AI (specifically ChatGPT 3.5) can be applied to enhance the efficiency of penetration testing methodologies in cyber security, examining the benefits, challenges, and risks across all five standard pentesting phases.

Traditional penetration testing is a mundane and time-consuming process. The advent of GenAI has sparked significant interest in automating and enhancing pentesting, but there is a lack of peer-reviewed, in-depth, step-by-step analysis demonstrating how GenAI tools can be practically applied across the full pentesting lifecycle. Prior work lacked verification of payloads and source code generated by ChatGPT.

Threat Model

A beginner penetration tester using ChatGPT (via Shell_GPT CLI tool) as an interactive assistant to guide and execute commands throughout a full pentesting engagement against a vulnerable target machine in a controlled lab environment. The tester relies on natural language prompts to obtain commands, interpret results, and proceed through attack phases.

Methodology

The authors conduct a simulated penetration test against a VulnHub virtual machine (PumpkinFestival) using ChatGPT 3.5 integrated via Shell_GPT (sgpt), a Python-based CLI tool that interfaces with the ChatGPT API. The experiment proceeds through 29 detailed steps spanning all five pentesting phases: reconnaissance, scanning, enumeration, exploitation, and post-exploitation (privilege escalation). At each step, natural language prompts are issued to ChatGPT via sgpt, which returns shell commands that are executed. The paper also provides an extensive qualitative analysis organized as 'The Good' (advantages), 'The Bad' (challenges/limitations), and 'The Ugly' (potential risks and unintended consequences).

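The per-step flow (natural-language prompt in, shell command out, tester reviews and executes) can be sketched in Python. This is an illustrative sketch, not the paper's code or sgpt's internals: `fake_model` is a hypothetical stand-in for the ChatGPT API call, and the `[target]` placeholder follows the parameterized prompting style the authors describe.

```python
# Illustrative sketch of the sgpt-style prompt-to-command flow.
# NOTE: fake_model is a hypothetical stand-in for the ChatGPT API;
# sgpt itself sends the prompt over the network and prints the reply.

def build_prompt(task: str, context: dict) -> str:
    """Substitute stored target parameters ([target], [hostname], ...)
    into a natural-language task, as in the paper's parameterized prompts."""
    for key, value in context.items():
        task = task.replace(f"[{key}]", value)
    return task

def fake_model(prompt: str) -> str:
    """Stand-in for the LLM: returns a canned nmap command for scan prompts."""
    if "scan" in prompt:
        return f"nmap -sV -p- {prompt.split()[-1]}"
    return "echo 'no suggestion'"

context = {"target": "192.168.56.101"}
prompt = build_prompt("run a full service scan against [target]", context)
command = fake_model(prompt)
print(command)  # the human tester reviews the command before executing it
```

The key design point is the human in the loop: the model only proposes a command string; execution stays with the tester.
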
Architecture

Human-in-the-loop architecture where the pentester issues natural language prompts through Shell_GPT (sgpt), a Python CLI wrapper for the ChatGPT API. ChatGPT generates shell commands, code snippets, and analytical interpretations. The pentester executes commands on a Kali Linux VM and feeds results back to ChatGPT for interpretation and next-step guidance. ChatGPT's web interface was also used for analysis tasks such as vulnerability assessment of wpscan output.

LLM Models

ChatGPT 3.5 (May 24, 2023 release)

Tool Integration

Shell_GPT (sgpt), nmap, wpscan, gobuster, hydra, curl, ftp, ssh, xxd

Memory Mechanism

conversation-history
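A minimal sketch of how conversation-history memory works in a chat session such as sgpt's chat mode (illustrative only; `stub_model` is a hypothetical stand-in for the ChatGPT API):

```python
# Minimal sketch of conversation-history memory: the full message list is
# resent with every request, so the model can refer back to earlier target
# details and commands. Not sgpt's actual implementation.

class ChatSession:
    def __init__(self):
        self.history = []

    def ask(self, user_msg, model):
        self.history.append({"role": "user", "content": user_msg})
        reply = model(self.history)
        self.history.append({"role": "assistant", "content": reply})
        return reply

# Stub model that demonstrates the history is visible on each call.
def stub_model(history):
    return f"seen {len(history)} messages"

session = ChatSession()
session.ask("the target is 192.168.56.101", stub_model)
answer = session.ask("suggest a port scan for it", stub_model)
```

Because the whole history rides along with each prompt, target details stated in step 1 remain usable in step 29 without restating them.
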

Attack Phases Covered

reconnaissance
scanning
enumeration
exploitation
post exploitation
privilege escalation
lateral movement
reporting

Evaluation

ChatGPT successfully guided a beginner pentester through the complete compromise of a VulnHub machine (PumpkinFestival), achieving root access and collecting all PumpkinTokens across 29 steps. The AI excelled at generating appropriate shell commands from natural language prompts, interpreting scan results, crafting exploits (e.g., a base62 decoding script), suggesting privilege escalation pathways, and producing a comprehensive penetration testing report. The experiment demonstrated that GenAI can produce commands for every pentesting phase, with exploitation being where it 'shone the greatest.'

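As an illustration of the kind of exploit helper mentioned above, here is a minimal base62 decoder sketch, assuming the common 0-9A-Za-z alphabet; the script ChatGPT actually produced in the paper is not reproduced here and may use a different convention:

```python
# Sketch of a base62 codec (assumed alphabet 0-9A-Za-z; the paper's
# ChatGPT-generated script may differ in alphabet order).
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

def base62_decode(s: str) -> int:
    """Interpret s as a big-endian base-62 number."""
    n = 0
    for ch in s:
        n = n * 62 + ALPHABET.index(ch)
    return n

def base62_encode(n: int) -> str:
    """Inverse of base62_decode, useful for sanity-checking."""
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, r = divmod(n, 62)
        digits.append(ALPHABET[r])
    return "".join(reversed(digits))
```
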
Environment

VulnHub, custom-lab

Metrics

task-completion, qualitative-assessment

Scale

1 VulnHub machine (PumpkinFestival v1.0)

Contributions

  • First peer-reviewed paper to provide an in-depth, step-by-step analysis of using GenAI (ChatGPT) for penetration testing across all five phases, with detailed commands and reproducible methodology
  • Comprehensive qualitative framework categorizing GenAI's role in pentesting as 'The Good' (improved efficiency, enhanced creativity, customised testing environments, continuous learning, legacy system compatibility), 'The Bad' (overreliance on AI, ethical/legal concerns, inherent bias), and 'The Ugly' (escalation of cyber threats, uncontrolled AI development)
  • Demonstration that GenAI can produce commands for a full penetration test and generate an accurate, complete pentesting report without missing key findings
  • Discussion of best practices and guidelines for responsible implementation of GenAI in pentesting, covering responsible AI deployment, data security/privacy, and collaboration/information sharing

Limitations

  • Limited to technologies, tools, and techniques available prior to June 17, 2023; specific ChatGPT 3.5 version from May 24, 2023
  • Only a single VulnHub machine (PumpkinFestival) was tested, limiting generalizability of results
  • No quantitative metrics or comparison with manual pentesting or other AI-assisted tools
  • The experiment simulates a beginner pentester scenario; does not evaluate effectiveness for advanced or novel vulnerabilities
  • ChatGPT's content policies required jailbreaking techniques (DAN prompt) to bypass ethical guardrails for pentesting-related queries
  • ChatGPT cannot directly interact with target systems or networks due to its policies, limiting active reconnaissance capabilities
  • No evaluation of false positives, hallucinations, or incorrect commands generated during the process

Research Gaps

  • Lack of quantitative benchmarks comparing GenAI-assisted pentesting efficiency against manual or traditional automated approaches
  • Need for evaluation of GenAI across diverse and more complex target environments beyond single VulnHub machines
  • Absence of frameworks for managing ethical and legal risks when deploying GenAI for offensive security
  • Limited research on integrating GenAI with fully autonomous pentesting pipelines (e.g., Auto-GPT for pentesting)
  • Need for research on GenAI's effectiveness against zero-day vulnerabilities versus known vulnerability patterns
  • Insufficient study of GenAI hallucination rates and false positive/negative rates in pentesting contexts
  • Research needed on privacy-preserving GenAI integration for pentesting (e.g., privateGPT with sgpt)

Novel Techniques

  • Using Shell_GPT (sgpt) as a CLI bridge between ChatGPT API and pentesting tools on Kali Linux, enabling seamless command generation and result interpretation in the terminal
  • Parameterized prompting technique where target information is stored as variables ([target], [hostname], [FQDN]) that ChatGPT can reference throughout the conversation for consistent command generation
  • Piping tool output (e.g., wpscan results) directly into sgpt's chat mode for AI-powered vulnerability analysis
  • Using ChatGPT to craft one-liner exploitation scripts by combining previously suggested commands, demonstrating code synthesis for offensive purposes

Open Questions

  • How does GenAI-assisted pentesting scale to enterprise-level networks with hundreds of hosts and services?
  • What is the actual false positive and hallucination rate when ChatGPT generates pentesting commands and vulnerability assessments?
  • Can fully autonomous GenAI pentesting (e.g., Auto-GPT) match or exceed human-in-the-loop approaches in effectiveness and safety?
  • How should regulatory frameworks evolve to address the dual-use nature of GenAI in offensive security?
  • What guardrails can prevent malicious actors from using GenAI to automate cyberattacks while still enabling legitimate security research?

Builds On

  • PentestGPT
  • Shell_GPT (sgpt)
  • Mayhem (DARPA Cyber Grand Challenge)
  • DeepExploit
  • DeepHack
  • GAIL-PT
  • Auto-GPT

Open Source

No

Tags