Paper List

Total 68 papers (sorted by ID)

#01 system 2024

PentestGPT: Evaluating and Harnessing Large Language Models for Automated Penetration Testing

Gelei Deng, Yi Liu, Víctor Mayoral-Vilches, Peng Liu, Yuekang Li, Yuan Xu, Tianwei Zhang, Yang Liu, Martin Pinzger, Stefan Rass

multi-agent human-in-the-loop
#02 system 2026

What Makes a Good LLM Agent for Real-world Penetration Testing?

Gelei Deng, Yi Liu, Yuekang Li, Ruozhao Yang, Xiaofei Xie, Jie Zhang, Han Qiu, Tianwei Zhang

single-agent fully-autonomous
#03 system 2025

Automated Penetration Testing with LLM Agents and Classical Planning

Lingzhi Wang, Xinyi Shi, Ziyu Li, Yi Jiang, Shiyu Tan, Yuhao Jiang, Junjie Cheng, Wenyuan Chen, Xiangmin Shen, Zhenyuan Li, Yan Chen

single-agent fully-autonomous
#04 system 2025

xOffense: An AI-driven autonomous penetration testing framework with offensive knowledge-enhanced LLMs and multi agent systems

Phung Duc Luong, Le Tran Gia Bao, Nguyen Vu Khai Tam, Dong Huu Nguyen Khoa, Nguyen Huu Quyen, Van-Hau Pham, Phan The Duy

multi-agent fully-autonomous
#05 system 2025

AutoPenGPT: Highly automated penetration testing framework based on LLM

Tianqi Jiang

multi-agent semi-autonomous
#06 system 2024

AutoAttacker: A Large Language Model Guided System to Implement Automatic Cyber-attacks

Jiacen Xu, Jack W. Stokes, Geoff McDonald, Xuesong Bai, David Marshall, Siyue Wang, Adith Swaminathan, Zhou Li

multi-agent fully-autonomous
#07 system 2024

AutoPT: How Far Are We from the End2End Automated Web Penetration Testing?

Benlong Wu, Guoqiang Chen, Kejiang Chen, Xiuwei Shang, Jiapeng Han, Yanru He, Weiming Zhang, Nenghai Yu

multi-agent fully-autonomous
#08 system 2025

AutoPentester: An LLM Agent-based Framework for Automated Pentesting

Yasod Ginige, Akila Niroshan, Sajal Jain, Suranga Seneviratne

multi-agent fully-autonomous
#09 system 2025

VulnBot: Autonomous Penetration Testing for A Multi-Agent Collaborative Framework

He Kong, Die Hu, Jingguo Ge, Liangxiong Li, Tong Li, Bingzhen Wu

multi-agent fully-autonomous
#10 system 2025

Pentest-R1: Towards Autonomous Penetration Testing Reasoning Optimized via Two-Stage Reinforcement Learning

He Kong, Die Hu, Jingguo Ge, Liangxiong Li, Hui Li, Tong Li

single-agent fully-autonomous
#11 system 2025

PentestAgent: Incorporating LLM Agents to Automated Penetration Testing

Xiangmin Shen, Lingzhi Wang, Zhenyuan Li, Yan Chen, Wencheng Zhao, Dawei Sun, Jiashui Wang, Wei Ruan

multi-agent semi-autonomous
#12 system 2026

PTFusion: LLM-driven context-aware knowledge fusion for web penetration testing

Wenhao Wang, Hao Gu, Zhixuan Wu, Hao Chen, Xingguo Chen, Fan Shi

hierarchical fully-autonomous
#13 system 2025

RapidPen: Fully Automated IP-to-Shell Penetration Testing with LLM-based Agents

Sho Nakatani

single-agent fully-autonomous
#14 system 2025

ARACNE: An LLM-Based Autonomous Shell Pentesting Agent

Tomas Nieponice, Veronica Valeros, Sebastian Garcia

multi-agent fully-autonomous
#15 system 2025

PwnGPT: Automatic Exploit Generation Based on Large Language Models

Wanzong Peng, Lin Ye, Xuetao Du, Hongli Zhang, Dongyang Zhan, Yunting Zhang, Yicheng Guo, Chen Zhang

single-agent fully-autonomous
#16 system 2025

PenTest++: Elevating Ethical Hacking with AI and Automation

Haitham S. Al-Sinani, Chris J. Mitchell

human-in-the-loop semi-autonomous
#17 system 2024

Using LLMs to Automate Threat Intelligence Analysis Workflows in Security Operation Centers

PeiYu Tseng, ZihDwo Yeh, Xushu Dai, Peng Liu

single-agent fully-autonomous
#18 system 2024

Hacking, The Lazy Way: LLM Augmented Pentesting

Dhruva Goyal, Aditya Peela, Sitaraman Subramanian, Nisha P Shetty

single-agent copilot
#19 system 2025

Can LLMs Hack Enterprise Networks? Autonomous Assumed Breach Penetration-Testing Active Directory Networks

Andreas Happe, Jürgen Cito

hierarchical fully-autonomous
#20 system 2024

HackSynth: LLM Agent and Evaluation Framework for Autonomous Penetration Testing

Lajos Muzsai, David Imolai, Andras Lukacs

single-agent fully-autonomous
#21 system 2025

Teams of LLM Agents can Exploit Zero-Day Vulnerabilities

Yuxuan Zhu, Antony Kellermann, Akul Gupta, Philip Li, Richard Fang, Rohan Bindu, Daniel Kang

multi-agent fully-autonomous
#22 system 2025

CAI: An Open, Bug Bounty-Ready Cybersecurity AI

Victor Mayoral-Vilches, Luis Javier Navarrete-Lozano, Maria Sanz-Gomez, Lidia Salas Espejo, Martino Crespo-Alvarez, Francisco Oca-Gonzalez, Francesco Balassone, Alfonso Glera-Picon, Unai Ayucar-Carbajo, Jon Ander Ruiz-Alcalde, Stefan Rass, Martin Pinzger, Endika Gil-Uriarte

multi-agent semi-autonomous
#23 survey 2025

On the Surprising Efficacy of LLMs for Penetration-Testing

Andreas Happe, Juergen Cito

Surveys multiple: single-agent, multi-agent, hierarchical, human-in-the-loop; surveys full spectrum: fully-autonomous, semi-autonomous, human-in-the-loop, copilot (vibe-hacking)
#24 system 2013

POMDPs Make Better Hackers: Accounting for Uncertainty in Penetration Testing

Carlos Sarraute, Olivier Buffet, Joerg Hoffmann

single-agent fully-autonomous
#25 system 2019

Markov Game Modeling of Moving Target Defense for Strategic Detection of Threats in Cloud Networks

Ankur Chowdhary, Sailik Sengupta, Dijiang Huang, Subbarao Kambhampati

none fully-autonomous
#26 empirical-study 2021

Modeling Penetration Testing with Reinforcement Learning Using Capture-the-Flag Challenges: Trade-offs between Model-free Learning and A Priori Knowledge

Fabio Massimo Zennaro, Laszlo Erdodi

single-agent fully-autonomous
#27 system 2021

CybORG: A Gym for the Development of Autonomous Cyber Agents

Maxwell Standen, Martin Lucas, David Bowman, Toby J. Richer, Junae Kim, Damian Marriott

single-agent fully-autonomous
#28 benchmark 2024

NYU CTF Bench: A Scalable Open-Source Benchmark Dataset for Evaluating LLMs in Offensive Security

Minghao Shao, Sofija Jancheska, Meet Udeshi, Brendan Dolan-Gavitt, Haoran Xi, Kimberly Milner, Boyuan Chen, Max Yin, Siddharth Garg, Prashanth Krishnamurthy, Farshad Khorrami, Ramesh Karri, Muhammad Shafique

single-agent fully-autonomous
#29 benchmark 2024

AutoPenBench: Benchmarking Generative Agents for Penetration Testing

Luca Gioacchini, Marco Mellia, Idilio Drago, Alexander Delsanto, Giuseppe Siracusano, Roberto Bifulco

single-agent with two variants: fully-autonomous and human-assisted (semi-autonomous); fully-autonomous and semi-autonomous (two separate agent architectures evaluated)
#30 benchmark 2025

VAP-6: A Benchmarking Framework on Vulnerability Assessment and Penetration Testing for Language Models

Bishal Ranjan Das, Sonia Jassi, Vaibhav Khandelwal, Tarun, Akansh Agarwal, Krittika Priyadarshini

none copilot
#31 system 2005

MulVAL: A Logic-based Network Security Analyzer

Xinming Ou, Sudhakar Govindavajhala, Andrew W. Appel

none fully-autonomous
#32 defense 2025

Cloak, Honey, Trap: Proactive Defenses Against LLM Agents

Daniel Ayzenshteyn, Roy Weiss, Yisroel Mirsky

multi-agent fully-autonomous
#33 system 2024

Hacking Back the AI-Hacker: Prompt Injection as a Defense Against LLM-driven Cyberattacks

Dario Pasquini, Evgenios M. Kornaropoulos, Giuseppe Ateniese

single-agent fully-autonomous
#34 survey 2025

Forewarned is Forearmed: A Survey on Large Language Model-based Agents in Autonomous Cyberattacks

Minrui Xu, Jiani Fan, Xinyu Huang, Conghao Zhou, Jiawen Kang, Dusit Niyato, Shiwen Mao, Zhu Han, Xuemin (Sherman) Shen, Kwok-Yan Lam

multi-agent fully-autonomous
#35 survey 2025

AI in Penetration Testing: A Systematic Mapping Study

Sulaiman O. Alwabisi

N/A N/A
#36 system 2024

Automated Penetration Testing: Formalization and Realization

Charilaos Skandylas, Mikael Asplund

single-agent fully-autonomous
#37 survey 2025

Benchmarking Practices in LLM-driven Offensive Security: Testbeds, Metrics, and Experiment Design

Andreas Happe, Jürgen Cito

#38 system 2023

Getting Pwn'd by AI: Penetration Testing with Large Language Models

Andreas Happe, Juergen Cito

single-agent semi-autonomous
#39 benchmark 2025

Towards Automated Penetration Testing: Introducing LLM Benchmark, Analysis, and Improvements

Isamu Isozaki, Manil Shrestha, Rick Console, Edward Kim

multi-agent human-in-the-loop
#40 empirical-study 2024

Generative AI for pentesting: the good, the bad, the ugly

Eric Hilario, Sami Azam, Jawahar Sundaram, Khwaja Imran Mohammed, Bharanidharan Shanmugam

human-in-the-loop human-in-the-loop
#41 system 2024

BreachSeek: A Multi-Agent Automated Penetration Tester

Ibrahim AlShehri, Adnan AlShehri, Abdulrahman AlMalki, Majed Bamardouf, Alaqsa Akbar

multi-agent fully-autonomous
#42 system 2024

Hackphyr: A Local Fine-Tuned LLM Agent for Network Security Environments

Maria Rigaki, Carlos Catania, Sebastian Garcia

single-agent fully-autonomous
#43 system 2025

CRAKEN: Cybersecurity LLM Agent with Knowledge-Based Execution

Minghao Shao, Haoran Xi, Nanda Rani, Meet Udeshi, Venkata Sai Charan Putrevu, Kimberly Milner, Brendan Dolan-Gavitt, Sandeep Kumar Shukla, Prashanth Krishnamurthy, Farshad Khorrami, Ramesh Karri, Muhammad Shafique

multi-agent fully-autonomous
#44 system 2025

Measuring and Augmenting Large Language Models for Solving Capture-the-Flag Challenges

Zimo Ji, Daoyuan Wu, Wenyuan Jiang, Pingchuan Ma, Zongjie Li, Shuai Wang

single-agent fully-autonomous
#45 system 2025

Improving LLM Agents with Reinforcement Learning on Cryptographic CTF Challenges

Lajos Muzsai, David Imolai, András Lukács

single-agent fully-autonomous
#46 benchmark 2025

CyberExplorer: Benchmarking LLM Offensive Security Capabilities in a Real-World Attacking Simulation Environment

Nanda Rani, Kimberly Milner, Minghao Shao, Meet Udeshi, Haoran Xi, Venkata Sai Charan Putrevu, Saksham Aggarwal, Sandeep K. Shukla, Prashanth Krishnamurthy, Farshad Khorrami, Muhammad Shafique, Ramesh Karri

multi-agent fully-autonomous
#47 empirical-study 2026

Measuring AI Agents' Progress on Multi-Step Cyber Attack Scenarios

Linus Folkerts, Will Payne, Simon Inman, Philippos Giavridis, Joe Skinner, Sam Deverett, James Aung, Ekin Zorer, Michael Schmatz, Mahmoud Ghanem, John Wilkinson, Alan Steer, Vy Hong, Jessica Wang

single-agent fully-autonomous
#48 system 2026

Context Relay for Long-Running Penetration-Testing Agents

Marius Vangeli, Joel Brynielsson, Mika Cohen, Farzad Kamrani

single-agent fully-autonomous
#49 system 2026

Towards Cybersecurity Superintelligence: from AI-guided humans to human-guided AI

Victor Mayoral-Vilches, Stefan Rass, Martin Pinzger, Endika Gil-Uriarte, Unai Ayucar-Carbajo, Jon Ander Ruiz-Alcalde, Maite del Mundo de Torres, Maria Sanz-Gomez, Francesco Balassone, Cristobal R. J. Veas Chavez, Vanesa Turiel, Alfonso Glera-Picon, Daniel Sanchez-Prieto, Yuri Salvatierra, Paul Zabalegui-Landa, Ruffino Reydel Cabrera-Alvarez, Patxi Mayoral-Pizarroso

multi-agent fully-autonomous
#50 system 2025

LLM Agent Honeypot: Monitoring AI Hacking Agents in the Wild

Reworr, Dmitrii Volkov

single-agent fully-autonomous
#51 position-paper 2025

To Defend Against Cyber Attacks, We Must Teach AI Agents to Hack

Terry Yue Zhuo, Yangruibo Ding, Wenbo Guo, Ruijie Meng

N/A (position paper discussing the evolution from workflow agents to trained agents); fully-autonomous
#52 system 2025

RedTeamLLM: an Agentic AI framework for offensive security

Brian Challita, Pierre Parrend

single-agent fully-autonomous
#53 benchmark 2025

HackWorld: Evaluating Computer-Use Agents on Exploiting Web Application Vulnerabilities

Xiaoxue Ren, Penghao Jiang, Kaixin Li, Zhiyong Huang, Xiaoning Du, Jiaojiao Jiang, Zhenchang Xing, Jiamou Sun, Terry Yue Zhuo

single-agent fully-autonomous
#54 system 2025

Cyber-Zero: Training Cybersecurity Agents Without Runtime

Terry Yue Zhuo, Dingmin Wang, Hantian Ding, Varun Kumar, Zijian Wang

single-agent fully-autonomous
#55 empirical-study 2026

LLMs as Hackers: Autonomous Linux Privilege Escalation Attacks

Andreas Happe, Aaron Kaplan, Jürgen Cito

single-agent fully-autonomous
#56 system 2025

EnIGMA: Interactive Tools Substantially Assist LM Agents in Finding Security Vulnerabilities

Talor Abramovich, Meet Udeshi, Minghao Shao, Kilian Lieret, Haoran Xi, Kimberly Milner, Sofija Jancheska, John Yang, Carlos E. Jimenez, Farshad Khorrami, Prashanth Krishnamurthy, Brendan Dolan-Gavitt, Muhammad Shafique, Karthik Narasimhan, Ramesh Karri, Ofir Press

single-agent fully-autonomous
#57 system 2025

Multi-Agent Penetration Testing AI for the Web

Isaac David, Arthur Gervais

multi-agent fully-autonomous
#58 benchmark 2025

PentestEval: Benchmarking LLM-based Penetration Testing with Modular and Stage-Level Design

Ruozhao Yang, Mingfei Cheng, Gelei Deng, Tianwei Zhang, Junjie Wang, Xiaofei Xie

single-agent fully-autonomous
#59 system 2025

RefPentester: A Knowledge-Informed Self-Reflective Penetration Testing Framework Based on Large Language Models

Hanzheng Dai, Yuanliang Li, Jun Yan, Zhibo Zhang

human-in-the-loop human-in-the-loop
#60 system 2025

Incalmo: An Autonomous LLM-assisted System for Red Teaming Multi-Host Networks

Brian Singer, Keane Lucas, Lakshmi Adiga, Meghna Jain, Lujo Bauer, Vyas Sekar

hierarchical fully-autonomous
#61 system 2025

AutoPentest: Enhancing Vulnerability Management With Autonomous LLM Agents

Julius Henke

multi-agent semi-autonomous
#62 empirical-study 2024

LLM Agents can Autonomously Exploit One-day Vulnerabilities

Richard Fang, Rohan Bindu, Akul Gupta, Daniel Kang

single-agent fully-autonomous
#63 benchmark 2025

Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models

Andy K. Zhang, Neil Perry, Riya Dulepet, Joey Ji, Celeste Menders, Justin W. Lin, Eliot Jones, Gashon Hussein, Samantha Liu, Donovan Jasper, Pura Peetathawatchai, Ari Glenn, Vikram Sivashankar, Daniel Zamoshchin, Leo Glikbarg, Derek Askaryar, Mike Yang, Teddy Zhang, Rishi Alluri, Nathan Tran, Rinnara Sangpisit, Polycarpos Yiorkadjis, Kenny Osele, Gautham Raghupathi, Dan Boneh, Daniel E. Ho, Percy Liang

single-agent fully-autonomous
#64 survey 2024

SoK: A Comparison of Autonomous Penetration Testing Agents

Raphael Simon, Wim Mees

single-agent fully-autonomous
#65 empirical-study 2024

An Empirical Evaluation of LLMs for Solving Offensive Security Challenges

Minghao Shao, Boyuan Chen, Sofija Jancheska, Brendan Dolan-Gavitt, Siddharth Garg, Ramesh Karri, Muhammad Shafique

single-agent fully-autonomous
#66 system 2024

PenHeal: A Two-Stage LLM Framework for Automated Pentesting and Optimal Remediation

Junjie Huang, Quanyan Zhu

multi-agent fully-autonomous
#67 empirical-study 2023

Using Large Language Models for Cybersecurity Capture-The-Flag Challenges and Certification Questions

Wesley Tann, Yuancheng Liu, Jun Heng Sim, Choon Meng Seah, Ee-Chien Chang

none human-in-the-loop
#68 survey 2020

An Empirical Survey of Functions and Configurations of Open-Source Capture the Flag (CTF) Environments

Stela Kucek, Maria Leitner

none human-in-the-loop