VAP-6: A Benchmarking Framework on Vulnerability Assessment and Penetration Testing for Language Models
Problem & Motivation
There is a scarcity of comprehensive benchmarks for evaluating LLMs in the Vulnerability Assessment and Penetration Testing (VAPT) domain, especially for small, open-source models suitable for local deployment. Existing general LLM benchmarks and even emerging cybersecurity-specific benchmarks do not adequately address the specific requirements of VAPT or the particular considerations for smaller, locally deployable models.
VAPT practitioners need offline model operation to avoid exposing sensitive system data to third-party hosted LLMs. Less computationally demanding models democratize access and enable customization through fine-tuning. While LLMs show significant promise for cybersecurity operations, there is no standardized methodology for evaluating their inherent VAPT knowledge, particularly for compact quantized models that can run locally on moderate hardware.
Threat Model
Not explicitly defined. The benchmark assumes a knowledge evaluation context where LLMs are tested on static VAPT knowledge domains (CVE/CWE identification, CVSS scoring, certification-level reasoning, tool proficiency, exploit mapping) rather than dynamic interactive penetration testing. The privacy motivation implies a threat model where sensitive assessment data must not leave the local environment.
Methodology
The authors create six specialized multiple-choice question (MCQ) datasets totaling 7800 questions that cover critical VAPT knowledge domains. Source material is gathered from authoritative repositories (CVE MITRE, CWE MITRE, Exploit DB, GitHub), processed and reformatted into MCQ form using ChatGPT (GPT-4o/GPT-4o mini), and then manually validated. Three small open-source LLMs (2-3B parameters) are evaluated with Q4 quantization via Ollama on moderate hardware.
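Each benchmark item can be pictured as a small structured record plus a structural-validation check of the kind a manual verification pass might automate. The schema below is a hypothetical illustration; the field names and sample question are assumptions, not taken from the paper:

```python
# Hypothetical MCQ record schema for a VAPT benchmark item.
# Field names and the sample question are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class McqItem:
    dataset: str             # e.g. "CVEMCQs", "CWEMCQs", "CVSS Prediction", ...
    question: str
    options: dict            # option letter -> option text
    answer: str              # correct option letter


def validate(item: McqItem) -> bool:
    """Basic structural checks: non-empty question, exactly four
    options labeled A-D, and an answer key pointing at one of them."""
    return (
        bool(item.question.strip())
        and len(item.options) == 4
        and set(item.options) == {"A", "B", "C", "D"}
        and item.answer in item.options
    )


sample = McqItem(
    dataset="CVEMCQs",
    question="Which CVE ID matches the following vulnerability description?",
    options={"A": "CVE-2021-44228", "B": "CVE-2017-0144",
             "C": "CVE-2019-0708", "D": "CVE-2014-0160"},
    answer="A",
)
```

A check like this only catches structural defects; the paper's manual validation step would still be needed to confirm that the distractors are plausible and the keyed answer is factually correct.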
Architecture
VAP-6 is a benchmark framework, not an agent system. It consists of six dataset components: (1) CVEMCQs -- 2000 questions matching CVE IDs to descriptions; (2) CWEMCQs -- 2000 questions matching CWE IDs to descriptions; (3) CVSS Prediction -- 2000 questions requiring severity classification, vector string composition, and base score prediction from vulnerability descriptions; (4) CEH v12 & CompTIA PenTest+ PT0-002 Styled MCQs -- 800 scenario-based reasoning questions; (5) VAPT Tools MCQs -- 500 questions on Nmap, Burp Suite, Metasploit, sqlmap, Wireshark, and Nessus; (6) CVE-to-Metasploit Module Mapping -- 500 questions associating CVEs with Metasploit modules. The evaluation pipeline uses Ollama for local model inference with a standardized system prompt.
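A minimal sketch of one step of such an evaluation loop, assuming Ollama's local REST API (`POST /api/chat`) and a hypothetical system prompt and answer-extraction convention; the paper's exact prompt and parsing rules are not reproduced here:

```python
# Sketch of a local MCQ evaluation step against Ollama's REST API.
# The system prompt and answer-extraction regex are assumptions,
# not the paper's exact pipeline.
import json
import re
import urllib.request

SYSTEM_PROMPT = "You are a VAPT expert. Answer with a single letter: A, B, C, or D."


def format_mcq_prompt(question: str, options: dict) -> str:
    """Render a question and its lettered options as one prompt string."""
    lines = [question] + [f"{k}. {v}" for k, v in sorted(options.items())]
    return "\n".join(lines)


def extract_choice(text: str):
    """Pull the first standalone option letter from the model's reply."""
    m = re.search(r"\b([ABCD])\b", text)
    return m.group(1) if m else None


def ask_ollama(model: str, prompt: str, host: str = "http://localhost:11434") -> str:
    """Send one chat request to a locally running Ollama server."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "system", "content": SYSTEM_PROMPT},
                     {"role": "user", "content": prompt}],
        "stream": False,
    }).encode()
    req = urllib.request.Request(f"{host}/api/chat", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]


# Example (requires a running Ollama server with the model pulled):
#   prompt = format_mcq_prompt("Which tool is primarily a port scanner?",
#                              {"A": "Nmap", "B": "Burp Suite",
#                               "C": "sqlmap", "D": "Wireshark"})
#   print(extract_choice(ask_ollama("qwen2.5:3b", prompt)))
```

Scoring is then a straight comparison of the extracted letter against the keyed answer, aggregated per dataset.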
LLM Models
- Qwen 2.5 (3B, Q4_K_M)
- Gemma2 (2B, Q4_K_M)
- Llama 3.2 (3B, Q4_K_M)
Memory Mechanism
None.
Evaluation
All three models scored well above typical certification passing thresholds (70-85%) on the CEH- and PenTest+-style questions, with Qwen 2.5 and Llama 3.2 exceeding 96% accuracy. However, all models struggled with factual knowledge retrieval: CVE identification accuracy ranged from 22% to 25%, and CWE identification from 25% to 27%. CVSS vector exact-match rates were below 2% for all models. On CVSS severity classification, Llama 3.2 led at 68.30%, while Qwen 2.5 reached only 2.40%. Gemma2 achieved the lowest CVSS base score RMSE at 1.2327. VAPT tool proficiency varied widely: Qwen 2.5 at 83.57%, Gemma2 at 78.40%, and Llama 3.2 at only 49.10%. On CVE-to-Metasploit mapping, Llama 3.2 led at 63.20%.
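For context on the CVSS sub-tasks, the base score and its severity label are derived deterministically from a vector string. The sketch below implements the published CVSS v3.1 equations for the scope-unchanged (`S:U`) case only; scope-changed vectors use different privilege weights and impact equations and are omitted as a simplification:

```python
# CVSS v3.1 base score from a vector string, scope-unchanged (S:U) case only.
# Metric weights and equations follow the public CVSS v3.1 specification.

AV = {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.20}   # Attack Vector
AC = {"L": 0.77, "H": 0.44}                          # Attack Complexity
PR = {"N": 0.85, "L": 0.62, "H": 0.27}               # Privileges Required (S:U weights)
UI = {"N": 0.85, "R": 0.62}                          # User Interaction
CIA = {"H": 0.56, "L": 0.22, "N": 0.0}               # Confidentiality/Integrity/Availability


def roundup(x: float) -> float:
    """Spec-defined rounding up to one decimal place."""
    i = round(x * 100000)
    return i / 100000 if i % 10000 == 0 else (i // 10000 + 1) / 10.0


def base_score(vector: str) -> float:
    m = dict(part.split(":") for part in vector.split("/")[1:])
    if m["S"] != "U":
        raise NotImplementedError("scope-changed case omitted in this sketch")
    iss = 1 - (1 - CIA[m["C"]]) * (1 - CIA[m["I"]]) * (1 - CIA[m["A"]])
    impact = 6.42 * iss
    exploitability = 8.22 * AV[m["AV"]] * AC[m["AC"]] * PR[m["PR"]] * UI[m["UI"]]
    return 0.0 if impact <= 0 else roundup(min(impact + exploitability, 10))


def severity(score: float) -> str:
    """Qualitative rating per the CVSS v3.1 scale."""
    for label, lo in [("Critical", 9.0), ("High", 7.0), ("Medium", 4.0), ("Low", 0.1)]:
        if score >= lo:
            return label
    return "None"
```

Because the mapping from vector to score and severity is mechanical, the near-zero exact-match rate on vector strings suggests the models fail at recalling or composing the vector itself, not at the arithmetic that follows.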
Environment
Local inference via Ollama on moderate hardware (16GB RAM, 4GB vRAM).
Metrics
MCQ accuracy, CVSS severity classification accuracy, CVSS vector exact-match rate, CVSS base score RMSE, and inter-model error correlation.
Baseline Comparisons
- Qwen 2.5 (3B, Q4_K_M)
- Gemma2 (2B, Q4_K_M)
- Llama 3.2 (3B, Q4_K_M)
Scale
7800 questions across six datasets
Contributions
- VAP-6, a novel benchmark framework comprising 7800 questions across six specialized VAPT knowledge domains (CVE, CWE, CVSS, certification reasoning, tool proficiency, exploit mapping), the first dedicated benchmark for evaluating small locally-deployable LLMs in the VAPT domain.
- A standardized dataset creation methodology using authoritative sources (CVE MITRE, CWE MITRE, Exploit DB) with ChatGPT-assisted reformatting and manual verification.
- Empirical evaluation of three small open-source LLMs (2-3B parameters) with Q4 quantization, demonstrating that even compact models retain strong reasoning capabilities (96%+ on certification-style questions) while struggling with factual recall (22-27% on CVE/CWE identification).
- Identification of a clear dichotomy between knowledge retrieval and reasoning capabilities in quantized small LLMs, with implications for their practical deployment as VAPT assistive tools.
Limitations
- Focuses exclusively on static knowledge assessment through MCQs rather than evaluating LLMs in dynamic, interactive penetration testing scenarios requiring planning, execution, and environmental adaptation.
- Only evaluates three small LLMs (2-3B parameters) with Q4 quantization -- results may not generalize to larger model architectures or different quantization levels.
- Datasets were refined using ChatGPT (GPT-4o/GPT-4o mini) which may introduce systematic biases from the refinement model into the benchmark questions and answer options.
- No evaluation of prompt engineering strategies, few-shot learning, or RAG approaches that could significantly improve model performance on knowledge-intensive tasks like CVE/CWE identification.
- The MCQ format artificially constrains evaluation -- real VAPT work requires free-form reasoning, tool command generation, and multi-step planning that MCQs cannot capture.
- No comparison with larger models or commercial APIs (GPT-4, Claude, etc.) to establish an upper performance bound for the benchmark.
- The paper does not release the benchmark dataset publicly, limiting reproducibility and community adoption.
- Error correlation analysis shows nearly perfect correlation (0.99-1.00) between all model pairs, suggesting the benchmark may not effectively discriminate between different model capabilities.
- CVSS prediction evaluation conflates multiple sub-tasks (severity, vector, score) that require fundamentally different capabilities, making it hard to draw targeted conclusions.
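The error-correlation finding above can be reproduced with a Pearson correlation over per-question binary error indicators (1 = answered incorrectly, 0 = answered correctly). A minimal sketch, assuming such indicator vectors are available for each model pair:

```python
# Pearson correlation between two models' per-question error indicators
# (1 = answered incorrectly, 0 = answered correctly).

def pearson(x, y):
    """Sample Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


# Two models that err on almost exactly the same questions correlate near 1.0,
# which is what a benchmark that fails to discriminate between models looks like.
model_a = [1, 1, 0, 1, 0, 0, 1, 1]
model_b = [1, 1, 0, 1, 0, 0, 1, 0]   # disagrees on a single question
```

A correlation of 0.99-1.00 across all pairs means the question pool separates easy from hard items but barely separates the models from each other.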
Research Gaps
- No existing benchmark comprehensively evaluates LLM VAPT knowledge across the full spectrum of tasks practitioners encounter, from vulnerability identification to tool usage to exploit mapping.
- The evaluation of small, locally deployable LLMs for cybersecurity applications is severely understudied -- most research focuses on large proprietary models.
- There is no standardized methodology for assessing whether LLMs possess sufficient domain knowledge to serve as reliable VAPT assistive tools.
- The sharp dichotomy between reasoning ability (96%+ on certification questions) and factual recall (22-27% on CVE/CWE) in quantized small models is unexplored and has significant implications for how these models should be augmented (e.g., with RAG) for practical deployment.
- Interactive, dynamic penetration testing benchmarks that evaluate planning, execution, and adaptation in realistic environments remain scarce.
- The impact of quantization levels on domain-specific cybersecurity knowledge retention has not been systematically studied.
Novel Techniques
- Six-dimensional VAPT knowledge evaluation framework: Decomposing VAPT competency into six distinct measurable dimensions (CVE knowledge, CWE knowledge, CVSS prediction, scenario reasoning, tool proficiency, exploit mapping) enables granular assessment of model strengths and weaknesses.
- Certification-threshold benchmarking: Comparing LLM performance against human professional certification passing scores (70-85% for CEH/PenTest+) provides an intuitive and practically meaningful reference point for evaluating AI readiness.
- Quantization-aware evaluation methodology: Specifically targeting Q4_K_M quantized models running on moderate hardware (16GB RAM, 4GB vRAM) addresses the real-world deployment constraint that VAPT professionals face.
Open Questions
- How would larger models (7B, 13B, 70B) or less aggressive quantization (Q8, FP16) perform on VAP-6, and at what scale does factual recall for CVE/CWE become reliable?
- Can RAG or fine-tuning on vulnerability databases close the massive gap between reasoning capability and factual recall observed in small quantized models?
- How should VAPT benchmarks evolve beyond static MCQs to evaluate the dynamic, multi-step, interactive nature of real penetration testing?
- What is the relationship between MCQ-based knowledge assessment and actual penetration testing performance -- does high certification-style accuracy translate to effective tool-assisted VAPT?
- How quickly do VAPT knowledge benchmarks become outdated as new CVEs and attack techniques emerge, and what update cadence is needed?
- Would domain-specific fine-tuning of small models on VAPT data produce better results than using larger general-purpose models, given the privacy constraints of local deployment?
- Why did Qwen 2.5 achieve only 2.40% on CVSS severity classification despite strong performance elsewhere -- is this a quantization artifact, a training data gap, or an architectural limitation?
Builds On
- GLUE and SuperGLUE (general NLU benchmarks)
- MMLU (multitask benchmark)
- PromptBench (LLM evaluation framework)
- CTIBench (cyber threat intelligence benchmark)
- SECURE (cybersecurity benchmark)
- CyberSecEval 3 (cybersecurity evaluation)
- CyberMetric (RAG-based cybersecurity assessment)
Open Source
No (benchmark datasets not publicly released)