#16

PenTest++: Elevating Ethical Hacking with AI and Automation PenTest++: Elevating Ethical Hacking with AI and Automation

Haitham S. Al-Sinani, Chris J. Mitchell

2025 | arXiv (preprint)

system penetration-testing semi-autonomous human-in-the-loop

PDF Preview 论文预览

Loading PDF... 加载 PDF 中...

Problem & Motivation 问题与动机

Traditional ethical hacking relies on skilled professionals performing resource-intensive, time-consuming manual processes across all stages from reconnaissance to exploitation. This limits scalability and efficiency, especially as cyber threats grow in complexity and diversity.

传统的道德黑客攻击依赖于专业人员在从侦察到漏洞利用的所有阶段执行资源密集型、耗时的手动过程。这限制了可扩展性和效率，尤其是随着网络威胁在复杂性和多样性方面的不断增长。

Existing penetration testing workflows require extensive manual expertise and command memorization, making them hard to scale. GenAI offers the potential to automate routine tasks and interpret complex outputs, but challenges like hallucinations and privacy concerns necessitate a human-in-the-loop approach. No prior work had systematically integrated GenAI-powered automation into every phase of the ethical hacking process while maintaining user control.

现有的渗透测试工作流需要广泛的手动专业知识和命令记忆，难以大规模扩展。生成式人工智能（GenAI）具有自动化常规任务和解释复杂输出的潜力，但幻觉和隐私问题等挑战使得必须采用“人机协同”的方法。此前尚无研究能够系统地将 GenAI 驱动的自动化整合到道德黑客攻击过程的每个阶段，同时保持用户的控制权。

Threat Model 威胁模型

Controlled virtual lab environment with an attacker Kali Linux VM and two target Debian Linux VMs on a local NAT network. The system assumes authorized penetration testing with user oversight at all decision points. No defensive mechanisms (IDS/IPS, firewalls) are active on target machines.

受控的虚拟实验室环境，包含一个攻击者 Kali Linux 虚拟机和两个位于本地 NAT 网络上的目标 Debian Linux 虚拟机。系统假设进行的是经过授权的渗透测试，且在所有决策点都有用户监督。目标机器上没有启用防御机制（如 IDS/IPS、防火墙）。

Methodology 核心方法

PenTest++ is a command-line Python tool that integrates ChatGPT (via the GPT-4 API) with traditional penetration testing tools to automate the five-phase ethical hacking workflow. The system automates reconnaissance (network discovery via nmap), scanning and enumeration (port/service scanning, directory brute-forcing), exploitation (tailored payload generation, credential cracking, reverse shells), and report generation. At each phase, the system sends tool outputs to ChatGPT for analysis and actionable recommendations, while prompting the user for confirmation before executing critical actions.

PenTest++ 是一个命令行 Python 工具，它将 ChatGPT（通过 GPT-4 API）与传统的渗透测试工具集成，实现了五个阶段的道德黑客工作流自动化。该系统自动执行侦察（通过 nmap 进行网络发现）、扫描和枚举（端口/服务扫描、目录爆破）、漏洞利用（量身定制的载荷生成、凭据破解、反弹 shell）以及报告生成。在每个阶段，系统都会将工具输出发送给 ChatGPT 进行分析并获取可操作的建议，同时在执行关键操作之前提示用户确认。

Architecture 架构设计

Modular Python-based CLI application with four main modules: (1) Reconnaissance Module for network host discovery, (2) Scanning & Enumeration Module for port scanning and service identification, (3) Gaining Access Module for service-specific exploitation workflows (FTP, HTTP, SSH, NFS, HTTP Proxy), and (4) Reporting and Documentation Module for automated report generation via ChatGPT. The system integrates with the OpenAI GPT-4 API via HTTP requests using Python's requests library, sending pentest logs and prompts as API payloads and receiving structured JSON responses.

基于 Python 的模块化 CLI 应用程序，包含四个主要模块：(1) 侦察模块，用于网络主机发现；(2) 扫描与枚举模块，用于端口扫描和服务识别；(3) 获取访问权限模块，用于针对特定服务（FTP, HTTP, SSH, NFS, HTTP Proxy）的漏洞利用工作流；(4) 报告与文档模块，通过 ChatGPT 自动生成报告。系统通过 Python 的 requests 库使用 HTTP 请求与 OpenAI GPT-4 API 集成，将渗透测试日志和提示词作为 API 载荷发送，并接收结构化的 JSON 响应。

LLM Models 使用的大模型

Tool Integration 工具集成

Memory Mechanism 记忆机制

conversation-history

Attack Phases Covered 覆盖的攻击阶段

reconnaissance

scanning

enumeration

exploitation

post exploitation

privilege escalation

lateral movement

reporting

Evaluation 评估结果

PenTest++ successfully completed end-to-end penetration testing on two Linux VMs in a controlled lab. On VM1 (192.168.1.7), it exploited anonymous FTP access, cracked an MD5 hash, discovered hidden web directories, and uploaded a PHP reverse shell to gain access. On VM2 (192.168.1.10), it chained NFS misconfiguration exploitation, password cracking, LFI vulnerability exploitation, and SSH key-based authentication to gain shell access. The evaluation was purely qualitative with no quantitative metrics reported.

PenTest++ 在受控实验室的两台 Linux 虚拟机上成功完成了端到端的渗透测试。在虚拟机 1 (192.168.1.7) 上，它利用了匿名 FTP 访问，破解了 MD5 哈希，发现了隐藏的 Web 目录，并上传了 PHP 反弹 shell 以获取访问权限。在虚拟机 2 (192.168.1.10) 上，它链接了 NFS 配置错误利用、密码破解、LFI 漏洞利用和基于 SSH 密钥的身份验证，从而获取了 shell 访问权限。评估纯粹是定性的，没有报告定量指标。

Environment 评估环境

Metrics 评估指标

Scale 评估规模

2 custom Linux VMs in a virtual lab

Contributions 核心贡献

Introduces PenTest++, an AI-augmented command-line tool that automates the five-phase ethical hacking workflow while maintaining human oversight at critical decision points
Demonstrates a mixed-initiative system design that balances AI-driven automation with user control, fostering trust and adaptability in penetration testing workflows
Shows how GenAI (ChatGPT) can be integrated as an analytical assistant to interpret tool outputs, identify vulnerabilities, suggest exploitation strategies, and generate penetration testing reports
Explores ethical implications of AI in penetration testing, including privacy risks, hallucination risks, and the need for robust safeguards
Provides a proof-of-concept with two case studies demonstrating multi-vector attack chaining guided by AI recommendations

引入了 PenTest++，这是一个人工智能增强的命令行工具，在实现五个阶段的道德黑客工作流自动化的同时，在关键决策点保持了人工监督
展示了一种混合主动系统设计，平衡了人工智能驱动的自动化与用户控制，增强了渗透测试工作流中的信任度和适应性
展示了如何将生成式人工智能（ChatGPT）集成为分析助手，以解释工具输出、识别漏洞、建议漏洞利用策略并生成渗透测试报告
探讨了人工智能在渗透测试中的伦理影响，包括隐私风险、幻觉风险以及对稳健防护措施的需求
通过两个案例研究提供了概念验证，展示了在人工智能建议指导下的多向量攻击链

Limitations 局限性

Evaluation is purely qualitative with no quantitative metrics (e.g., time saved, success rates, accuracy of AI recommendations) reported
Tested only on two Linux VMs in a controlled virtual environment with no defensive mechanisms; generalizability to real-world networks with diverse OSs, firewalls, and IDS/IPS is unvalidated
Uses an online LLM (ChatGPT-4o via API), raising significant privacy and data sensitivity concerns as sensitive pentest data (credentials, system configurations, logs) is sent to an external cloud service
ChatGPT hallucinations were observed during testing, with some recommendations requiring manual correction or validation
No comparison with other AI-driven penetration testing tools or with purely manual approaches to demonstrate relative effectiveness
Limited attack surface covered -- does not address privilege escalation, lateral movement, post-exploitation, or persistence
Exploitation modules are hardcoded for specific service types (FTP, HTTP, SSH, NFS, HTTP Proxy); novel or uncommon services are not handled
Ethical safeguards are discussed conceptually but not implemented as technical controls within the system

评估纯粹是定性的，没有报告定量指标（例如节省的时间、成功率、人工智能建议的准确性）
仅在没有防御机制的受控虚拟环境中的两台 Linux 虚拟机上进行了测试；其在具有多种操作系统、防火墙和 IDS/IPS 的现实网络中的泛化能力未经证实
使用在线大语言模型（通过 API 的 ChatGPT-4o），将敏感的渗透测试数据（凭证、系统配置、日志）发送到外部云服务，引发了重大的隐私和数据敏感性担忧
在测试过程中观察到了 ChatGPT 的幻觉，部分建议需要手动纠正或验证
没有与其他人工智能驱动的渗透测试工具或纯手动方法进行对比，以证明其相对有效性
涵盖的攻击面有限 —— 未解决权限提升、横向移动、后期利用或持久化问题
漏洞利用模块针对特定的服务类型（FTP, HTTP, SSH, NFS, HTTP Proxy）进行了硬编码；无法处理新型或罕见服务
伦理防护措施仅在概念层面进行了讨论，未作为技术控制手段实现在系统中

Research Gaps 研究空白

No existing work had systematically integrated GenAI automation into all phases of penetration testing while preserving user control
Lack of quantitative benchmarks and metrics for evaluating AI-augmented penetration testing tools
Privacy-preserving AI integration for penetration testing (e.g., offline/local LLMs) remains unexplored
Comparative analysis between different AI-driven penetration testing approaches is missing from the literature
Effectiveness of AI-augmented pentesting on diverse targets (Windows, macOS, Android, IoT, cloud) has not been studied
No formal framework exists for ethical safeguards in AI-powered offensive security tools

此前尚无研究能够系统地将生成式人工智能自动化整合到渗透测试的所有阶段，同时保留用户控制权
缺乏用于评估人工智能增强的渗透测试工具的定量基准和指标
针对渗透测试的隐私保护人工智能集成（例如离线/本地大语言模型）仍有待探索
文献中缺乏不同人工智能驱动的渗透测试方法之间的对比分析
人工智能增强的渗透测试在多种目标（Windows, macOS, Android, IoT, 云端）上的有效性尚未得到研究
目前尚不存在针对人工智能驱动的攻击性安全工具的正式伦理防护框架

Novel Techniques 新颖技术

Service-specific automated exploitation workflows that dynamically select attack strategies based on detected services (FTP anonymous access, NFS share mounting, HTTP directory brute-forcing, LFI exploitation, SSH key-based auth chaining)
Using ChatGPT as an inline analytical engine to parse and interpret tool outputs (nmap scans, configuration files, FTP file contents) and return structured JSON with actionable findings
Automated multi-format penetration testing report generation (text, JSON, PDF) via ChatGPT prompts constructed from structured log data
Mixed-initiative design pattern where automation proceeds but pauses at critical junctures for user confirmation, balancing efficiency with oversight

针对特定服务的自动化漏洞利用工作流，可根据检测到的服务动态选择攻击策略（匿名 FTP 访问、NFS 共享挂载、HTTP 目录爆破、LFI 利用、SSH 密钥身份验证链）
将 ChatGPT 用作内联分析引擎，解析并解释工具输出（nmap 扫描、配置文件、FTP 文件内容），并返回带有可操作发现的结构化 JSON
通过根据结构化日志数据构建的 ChatGPT 提示词，实现多格式渗透测试报告的自动生成（文本、JSON、PDF）
混合主动设计模式：自动化流程持续进行，但在关键节点暂停以待用户确认，在效率与监督之间取得平衡

Open Questions 开放问题

How much does ChatGPT actually improve penetration testing outcomes compared to manual testing or traditional automation alone?
What is the hallucination rate of LLMs when providing exploitation guidance, and how dangerous are incorrect recommendations in a security context?
Can offline or local LLMs achieve comparable performance to cloud-based models like GPT-4 for penetration testing assistance while preserving data privacy?
How would PenTest++ perform against hardened targets with active defenses (WAFs, IDS/IPS, EDR)?
What technical guardrails can prevent misuse of AI-augmented offensive tools while preserving their utility for authorized testing?
How does the system handle novel or zero-day vulnerabilities that the LLM has not been trained on?

与单纯的手动测试或传统自动化相比，ChatGPT 究竟在多大程度上改善了渗透测试的结果？
大语言模型在提供漏洞利用指导时的幻觉率是多少？在安全语境下，错误的建议有多危险？
离线或本地大语言模型在保持数据隐私的同时，能否在渗透测试辅助方面达到与 GPT-4 等云端模型相当的性能？
PenTest++ 在面对具有主动防御（WAF, IDS/IPS, EDR）的加固目标时表现如何？
哪些技术护栏可以防止人工智能增强的攻击性工具被滥用，同时保留其在授权测试中的效用？
系统如何处理大语言模型未经训练的新型漏洞或 0day 漏洞？

Builds On 基于前人工作

Al-Sinani et al. 2024 - Unleashing AI in ethical hacking
Al-Sinani and Mitchell 2024 - AI-enhanced ethical hacking Linux-focused
Al-Sinani and Mitchell 2024 - AI-augmented manual exploitation and privilege escalation