CVE-DRIVEN AUTOMATED PAYLOAD GENERATION AND VALIDATION USING LARGE LANGUAGE MODELS: A FOUR-STAGE PIPELINE

Yermakhankyzy D. Syed I.M.
Cite as:
Yermakhankyzy D., Syed I.M. CVE-DRIVEN AUTOMATED PAYLOAD GENERATION AND VALIDATION USING LARGE LANGUAGE MODELS: A FOUR-STAGE PIPELINE // Universum: технические науки: electronic scientific journal. 2026. No. 4(145). URL: https://7universum.com/ru/tech/archive/item/22557 (accessed: 07.05.2026).
DOI: 10.32743/UniTech.2026.145.4.22557
Received by the editorial office: 12.04.2026
Accepted for publication: 14.04.2026
Published: 28.04.2026

 

ABSTRACT

The growing gap between vulnerability disclosure and practical security testing demands automated, intelligence-driven approaches. Common Vulnerabilities and Exposures (CVE) is the industry-standard system for identifying and cataloguing publicly known security weaknesses; over 230,000 entries are currently indexed in the National Vulnerability Database (NVD), yet fewer than 0.5% are accompanied by a working exploit artefact. Large Language Models (LLMs) — neural network-based systems trained on large text corpora and capable of generating structured technical output from natural-language prompts — have emerged as a practical means of bridging this gap. This paper presents a four-stage automated pipeline that: (1) collects and filters CVE records from NVD by severity (CVSS ≥ 7.0) and weakness class (CWE); (2) enriches each record with commit-level patch context from the GitHub Security Advisory Database (GHSA); (3) uses the Claude LLM (claude-sonnet-4-6) to synthesise targeted HTTP attack payloads in structured JSON format; and (4) validates the payloads against the Damn Vulnerable Web Application (DVWA) through deterministic evidence-string analysis of HTTP responses, without manual annotation. Evaluated on 240 CVEs and 5,102 generated payloads, the pipeline achieves Precision 0.999, Recall 1.0, F1-score 0.9993, and Matthews Correlation Coefficient (MCC) 0.9992, with a 95.1% detection rate for vulnerability types present in the test environment. The system requires no static rules and no manual labelling, making it immediately applicable to newly published CVEs.


 

Keywords: CVE, NVD, GHSA, LLM, penetration testing, payload generation, DVWA, CWE, CVSS, automated security testing, vulnerability exploitation


 

1. Introduction

The average cost of a data breach reached USD 4.88 million in 2024, a ten-percent rise over the previous year [7, p. 1]. Concurrently, the volume of disclosed vulnerabilities has grown exponentially: the National Vulnerability Database (NVD) now catalogues over 230,000 CVE records, with roughly 25,000 new entries added annually. Despite this scale, only 0.4% of NVD entries are accompanied by a working proof-of-concept exploit in publicly accessible repositories [5, p. 4]. The resulting gap forces security practitioners to manually translate descriptive CVE entries into executable attack scenarios — a process that is slow, expert-intensive, and impossible to scale to the full breadth of the published vulnerability landscape.

Large language models (LLMs) have emerged as a promising solution. Systematic reviews covering over 300 publications confirm active LLM deployment across penetration testing, vulnerability detection, code repair, and threat intelligence [6, p. 1]. The most striking empirical result is that an LLM agent supplied with a CVE description achieves an 87% exploit success rate on real-world one-day vulnerabilities, compared to 7% without the description [2, p. 3]. This order-of-magnitude difference establishes context quality, rather than raw model capability, as the primary determinant of LLM-driven exploitation efficacy, and it motivates the enrichment pipeline proposed in this work.

Despite these advances, three concrete gaps persist in the literature. First, existing systems rely on NVD descriptions alone, ignoring richer signals such as GHSA commit diffs that expose the exact patched code path. Second, LLMs corrupt payload syntax in roughly one third of generation attempts, stripping pipe operators and shell delimiters that are functionally required for successful exploitation [16, p. 4]. Third, result classification in current benchmarks requires either manual labelling or noisy scanner output, both of which limit reproducibility and scalability.

2. Background

The National Vulnerability Database (NVD) provides CVE records with CVSS v3.1 scores and CWE identifiers for automated severity triage [8, p. 1]. The GitHub Security Advisory Database (GHSA) extends NVD with package-level metadata and fixing commit diffs, giving the LLM function-level code context rather than prose descriptions alone [5, p. 4]. Claude (claude-sonnet-4-6) is used for payload generation; its tendency to strip shell operators is mitigated through explicit JSON schema constraints [16, p. 4]. DVWA serves as the validation target: its vulnerabilities are fully documented, making HTTP-response evidence strings an annotation-free ground truth [7, p. 1].

3. Literature Review

Automated CVE severity prediction achieves F1 = 0.82 on 123,000 NVD records [8, p. 1], validating CVSS thresholds as an automated filter. Context quality is decisive: GPT-4 exploits 87% of one-day CVEs when given descriptions versus only 7% without [2, p. 3]. Augmenting NVD with GHSA commit diffs further improves exploit generation [5, p. 4].

PentestGPT [1, p. 847] achieves 228.6% higher task completion than a GPT-3.5 baseline using a Reasoning-Generation-Parsing architecture. PentestEval [16, p. 4] identifies two critical LLM failure modes: omitting required parameters in nearly half of complex-CVE attempts and stripping shell operators in one third of cases — both addressed in Stage 3. Automated scanners (ZAP, Nikto) produce 10–40% false-positive rates [12, p. 235], motivating the evidence-string validation of Stage 4.

4. Proposed Pipeline

Stage 1 — Data Collection. The NVD REST API v2.0 is queried with a configurable lookback window (default 90 days). CVEs are retained if they satisfy two filters: CVSS v3.1 Base Score ≥ 7.0, and a CWE identifier belonging to a predefined set of web-exploitable weakness classes: CWE-79, CWE-89, CWE-78, CWE-22, CWE-918, CWE-352, CWE-434, CWE-94, CWE-77, CWE-611, and CWE-502. Each retained record includes the CVE identifier, description, CVSS score, CWE, publication date, and NVD references. The output is a filtered CVE collection stored as a versioned JSON file.
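The two-filter selection in Stage 1 can be sketched as follows. The input layout mirrors the public NVD REST API v2.0 JSON schema; the function name and output fields are illustrative, not the pipeline's actual code:

```python
# Sketch of the Stage 1 filter (illustrative names, not the paper's code).
NVD_API = "https://services.nvd.nist.gov/rest/json/cves/2.0"

# Web-exploitable weakness classes listed in Stage 1.
WEB_CWES = {"CWE-79", "CWE-89", "CWE-78", "CWE-22", "CWE-918", "CWE-352",
            "CWE-434", "CWE-94", "CWE-77", "CWE-611", "CWE-502"}

def filter_cves(raw_records, min_cvss=7.0):
    """Keep CVEs with CVSS v3.1 base score >= min_cvss and a web CWE."""
    kept = []
    for vuln in raw_records:            # each item: {"cve": {...}}
        cve = vuln["cve"]
        scores = cve.get("metrics", {}).get("cvssMetricV31", [])
        score = max((m["cvssData"]["baseScore"] for m in scores), default=0.0)
        cwes = {d["value"] for w in cve.get("weaknesses", [])
                for d in w.get("description", [])}
        if score >= min_cvss and cwes & WEB_CWES:
            kept.append({"id": cve["id"], "cvss": score,
                         "cwe": sorted(cwes & WEB_CWES)})
    return kept
```

Records failing either criterion are dropped before enrichment, which keeps the downstream GHSA and LLM stages bounded to web-exploitable, high-severity CVEs.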

Stage 2 — Enrichment. For each filtered CVE, the GHSA GraphQL API is queried by CVE identifier to retrieve the matching security advisory. The enrichment extracts the advisory summary and description (with PoC sections stripped to prevent over-specific payload generation), affected package names and version ranges, severity rating, and GitHub commit URLs from the references. Each referenced commit is fetched via the GitHub REST API to obtain the code diff, filtered to web-relevant file extensions (.php, .js, .py, .rb, etc.). NVD and GHSA data are merged into an Enriched CVE Record with three context variants: full context (with PoC and diff), no-PoC context (for generic test targets), and no-diff context (for SQLi, where diffs leak application-specific column names).
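A minimal sketch of the Stage 2 merge step follows; the record layout, field names, and function names are illustrative assumptions, not the pipeline's actual schema:

```python
# Illustrative sketch of Stage 2: diff filtering and context-variant merging.
WEB_EXTS = (".php", ".js", ".py", ".rb")  # web-relevant file extensions

def filter_diff_files(diff_files):
    """Keep only web-relevant files from a fetched commit diff."""
    return [f for f in diff_files if f["filename"].endswith(WEB_EXTS)]

def build_contexts(nvd, ghsa, diff_text, poc_text):
    """Merge NVD and GHSA data into the three context variants."""
    base = {"cve": nvd["id"], "description": nvd["description"],
            "advisory": ghsa["summary"], "packages": ghsa["packages"]}
    return {
        "full": {**base, "poc": poc_text, "diff": diff_text},
        # Generic test targets: PoC stripped to avoid over-specific payloads.
        "no_poc": {**base, "diff": diff_text},
        # SQLi: diffs can leak application-specific column names.
        "no_diff": {**base, "poc": poc_text},
    }
```

The three variants let Stage 3 choose how much patch-level detail the model sees for a given weakness class and test target.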

Stage 3 — Payload Generation. Each Enriched CVE Record is formatted into a two-part prompt: a system instruction establishing the security-researcher role and requiring JSON-only output, and a user prompt containing the CVE metadata, enriched context, and a target application profile that specifies the technology stack, database, and injection context for the test target. The prompt explicitly instructs the model to preserve all special characters and shell operators. Claude (claude-sonnet-4-6) returns a JSON object {"payloads": [...]} with 10–30 attack strings. Responses are parsed with a regex fallback for truncated output.
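One plausible realization of the response-parsing step, assuming the `{"payloads": [...]}` schema described above; the salvage regex is an illustrative sketch, not the pipeline's exact fallback:

```python
import json
import re

def parse_payloads(response_text):
    """Parse {"payloads": [...]} from the model response; if the JSON is
    truncated mid-stream, salvage the complete string literals via regex."""
    try:
        return json.loads(response_text)["payloads"]
    except (json.JSONDecodeError, KeyError, TypeError):
        m = re.search(r'"payloads"\s*:\s*\[(.*)', response_text, re.DOTALL)
        if not m:
            return []
        # Only fully closed string literals are recovered; a payload cut
        # off by truncation has no closing quote and is discarded.
        return re.findall(r'"((?:[^"\\]|\\.)*)"', m.group(1))
```

This degrades gracefully: a truncated response yields a shorter payload list rather than a parse failure, which matters because truncation is exactly the failure mode behind the single False Positive reported in §6.1.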

Stage 4 — Testing and Validation. Each payload is submitted as an authenticated HTTP request to the DVWA endpoint corresponding to its CWE category. Authentication is performed automatically via cookie-based login before each batch. The HTTP response body is inspected for predefined evidence strings: gid=, uid= and www-data for CMDI; root:x:0:0 for LFI; and unencoded script tags or alert() for XSS. A match constitutes a confirmed True Positive. All results — CVE, payload, URL, parameter, evidence string, HTTP status, and latency — are saved in a structured JSON alert file compatible with the OWASP ZAP alert format.
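The deterministic classification at the core of Stage 4 reduces to substring matching against the evidence table above; a minimal sketch (function name illustrative):

```python
# Evidence strings from Stage 4, keyed by vulnerability category.
EVIDENCE = {
    "CMDI": ["gid=", "uid=", "www-data"],   # command-injection output
    "LFI":  ["root:x:0:0"],                 # /etc/passwd disclosure
    "XSS":  ["<script>", "alert("],         # unencoded reflection
}

def classify_response(category, body):
    """Return the matched evidence string (confirmed TP) or None."""
    for needle in EVIDENCE.get(category, []):
        if needle in body:
            return needle
    return None
```

Because the check is a literal substring test on the response body, the verdict is fully reproducible and requires no scanner heuristics or human review.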

5. Methodology

5.1 Experimental Setup

All experiments were conducted on DVWA (Damn Vulnerable Web Application) version 1.10, deployed via Docker on localhost:8080 with security level set to low. The test environment used Python 3.14, the Anthropic Python SDK for direct API access (non-batch mode), NVD REST API v2.0 for CVE collection, and the GHSA GraphQL API with a personal access token providing 5,000 requests per hour. All DVWA requests were pre-authenticated using a session cookie obtained through automated form-based login, followed by a security-level configuration request to ensure reproducible low-security conditions.

5.2 Dataset

CVEs were retrieved from NVD using a 90-day lookback window with CVSS ≥ 7.0 and CWE membership filters applied. This yielded 660 CVEs in total. For this evaluation, 240 CVEs were processed (batches offset 0–239), comprising the first 36% of the full dataset. Of these, 197 (82%) were successfully enriched via GHSA; the remaining 43 had no matching advisory and were processed using NVD description only. Table 1 summarises the dataset.

Table 1.

Experimental dataset summary

Parameter                   | Value
----------------------------|-------------------
Total CVEs in NVD window    | 660
CVEs processed (evaluation) | 240
CVEs enriched via GHSA      | 197 (82%)
CVEs without GHSA data      | 43 (18%)
Total payloads generated    | 5,102
Avg. payloads per CVE       | 21.3
DVWA security level         | low
Claude model                | claude-sonnet-4-6

 

Each payload was submitted as an authenticated HTTP request to the corresponding DVWA endpoint. A match of a predefined evidence string (uid=/gid= for CMDI, root:x:0:0 for LFI, unencoded script tags for XSS) in the HTTP response constitutes a confirmed True Positive — no manual labelling required.

6. Results and Discussion

6.1 Overall Performance

Table 3 presents the confusion matrix and derived evaluation metrics across all 5,102 payloads submitted to DVWA. Of the 712 alerts triggered, 711 were confirmed True Positives and 1 was recorded as a False Positive. Examination of the FP revealed that it was caused by a truncated LLM JSON response during parsing: the Claude API returned a valid TP verdict for CVE-ALERT-0032 (XSS, CVE-2026-33067), but the JSON was cut off at the network layer, causing the parser to misclassify the result. The underlying payload was in fact reflected unencoded in the DVWA response. Consequently, the effective precision is 1.0, but the reported value of 0.999 is retained as the conservative metric. No False Negatives were recorded within the detectable scope of the evaluation (see §6.4 for excluded vulnerability types).

Table 3.

Overall evaluation metrics (240 CVEs, 5,102 payloads)

Metric               | Value    | Formula
---------------------|----------|----------------------------------------------------
True Positives (TP)  | 711      |
False Positives (FP) | 1        |
True Negatives (TN)  | 4,390    |
False Negatives (FN) | 0        |
Precision            | 0.999    | TP / (TP + FP)
Recall               | 1.000    | TP / (TP + FN)
F1-Score             | 0.9993   | 2 · P · R / (P + R)
FPR                  | 0.000228 | FP / (FP + TN)
MCC                  | 0.9992   | (TP·TN − FP·FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN))
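The derived metrics in Table 3 follow mechanically from the four confusion-matrix counts; a short self-check (the function name is illustrative):

```python
import math

def metrics(tp, fp, tn, fn):
    """Compute the Table 3 metrics from raw confusion-matrix counts."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f1 = 2 * p * r / (p + r)
    fpr = fp / (fp + tn)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return p, r, f1, fpr, mcc

p, r, f1, fpr, mcc = metrics(tp=711, fp=1, tn=4390, fn=0)
# p = 711/712 ≈ 0.9986 (reported rounded as 0.999), r = 1.0,
# f1 ≈ 0.9993, fpr = 1/4391 ≈ 0.000228, mcc ≈ 0.9992
```

Substituting the counts from Table 3 reproduces every reported value, confirming internal consistency of the confusion matrix and the derived metrics.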

 

6.2 Per-CWE Breakdown

Table 4 disaggregates results by vulnerability type. CMDI and LFI achieve perfect precision and recall across all submitted payloads. XSS precision is marginally below 1.0 due to the single parse-error FP described above. The unique-payload column measures payload diversity: the fraction of distinct attack strings among triggered alerts. XSS shows the highest diversity (55.4%), with 256 unique payloads across 462 alerts, encompassing encoding bypasses, SVG-based injections, tag variations, backtick notation, and fetch-based exfiltration techniques. CMDI and LFI diversity is lower (17–18%) because the effective payload space for these vulnerability types is inherently bounded by the number of meaningful shell command variants and path traversal patterns.

Table 4.

Per-CWE evaluation results

CWE    | Type | TP  | FP | Precision | Recall | F1    | Unique payloads
-------|------|-----|----|-----------|--------|-------|----------------
CWE-79 | XSS  | 462 | 1  | 0.998     | 1.000  | 0.999 | 256 (55.4%)
CWE-78 | CMDI | 184 | 0  | 1.000     | 1.000  | 1.000 | 32 (17.4%)
CWE-22 | LFI  | 66  | 0  | 1.000     | 1.000  | 1.000 | 12 (18.2%)
Total  |      | 711 | 1  | 0.999     | 1.000  | 0.999 | 300 (42.1%)
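The unique-payload column is a simple distinct-to-total ratio over the payloads that triggered alerts; as a sketch (function name illustrative):

```python
def payload_diversity(alert_payloads):
    """Fraction of distinct attack strings among triggered alerts."""
    return len(set(alert_payloads)) / len(alert_payloads)

# e.g. three alerts fired by two distinct payloads give 2/3 diversity;
# for XSS in Table 4 this ratio is 256/462 ≈ 0.554.
```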

 

6.3 Detection Rate and Scope

Of 240 CVEs processed, 58 triggered at least one alert (overall detection rate: 24.2%). This reflects a scope mismatch — most CVEs target specific frameworks (WordPress plugins, Laravel, Craft CMS) absent from DVWA. When restricted to CVEs with a DVWA-supported CWE (XSS, CMDI, LFI — 61 CVEs), the detection rate rises to 95.1%, confirming that when the target exposes the relevant vulnerability class, the pipeline detects it in nearly every case.

6.4 Limitations

Three limitations apply. First, SQL injection (CWE-89) is undetectable against DVWA in low-security mode since SQL error messages are suppressed. Second, CSRF (CWE-352) requires out-of-band state verification beyond HTTP response analysis. Third, the evaluation is conducted on a single controlled application; future work will extend testing to OWASP WebGoat and Juice Shop.

7. Conclusion

This paper presented a four-stage pipeline automating the path from public CVE intelligence to validated exploit evidence. Evaluated on 240 CVEs and 5,102 payloads, the system achieves Precision 0.999, Recall 1.0, F1 0.9993, MCC 0.9992, and a 95.1% detection rate for vulnerability types present in the test environment — with no static rules and no manual annotation. Dataset and source code: https://github.com/Darrii/newzap.

 

References:

  1. Deng G. et al. PentestGPT: Evaluating and Harnessing Large Language Models for Automated Penetration Testing // Proc. USENIX Security Symposium. – 2024. – P. 847–864.
  2. Fang R. et al. LLM Agents can Autonomously Exploit One-day Vulnerabilities. – arXiv:2404.08144. – 2024.
  3. Fang R. et al. Teams of LLM Agents can Exploit Zero-Day Vulnerabilities. – arXiv:2406.01637. – 2024.
  4. Shen J. et al. PentestAgent: Incorporating LLM Agents to Automated Penetration Testing // Proc. ACM AsiaCCS. – 2025.
  5. Liu D. et al. Vuln2Action: An LLM-Based Framework for Generating Vulnerability Reproduction Steps // Journal of Information Security and Applications. – Elsevier. – 2026.
  6. Zhang J. et al. When LLMs Meet Cybersecurity: A Systematic Literature Review // Cybersecurity. – SpringerOpen. – 2025.
  7. Diaz-Parra R. Generative AI for Web Application Pentesting // Issues in Information Systems. – 2025. – Vol. 26, № 1.
  8. Manjunatha A. et al. CVE Severity Prediction From Vulnerability Description // Procedia Computer Science. – 2024. – Vol. 235.
  9. Alqahtani H. et al. A Systematic Literature Review on Automated Software Vulnerability Detection Using ML // ACM Computing Surveys. – 2024.
  10. Zeng Q. et al. LLMs in Software Security: A Survey of Vulnerability Detection Techniques // ACM Computing Surveys. – 2024.
  11. Negreiros M. et al. A Comprehensive Analysis on Software Vulnerability Detection Datasets // International Journal of Information Security. – Springer. – 2024.
  12. Salazar-Barragán D. et al. Enhancing Web Application Security via Automated Pentesting // Computers. – MDPI. – 2023. – Vol. 12, № 11. – P. 235.
  13. Rao S. et al. An Experimental Study on Detecting and Mitigating Vulnerabilities in Web Applications // International Journal of Systems and Software Security. – 2024.
  14. Wang L. et al. An Effective New Penetration Test Approach to Detect Web Attacks // Expert Systems with Applications. – Elsevier. – 2025.
  15. Khalid R. et al. AutoCVSS: Assessing the Performance of LLMs for CVSS Scoring // Proc. EMNLP Industry Track. – 2025.
  16. Zang Y. et al. PentestEval: Benchmarking LLM-Based Penetration Testing. – arXiv:2512.14233. – 2025.
  17. Smith J. et al. LLM Agents for Vulnerability Identification and Verification of CVEs // Proc. CAMLIS. – 2024.
  18. Mahouachi D. et al. A Comprehensive Review of Cybersecurity Vulnerability Detection Methodologies // Applied Sciences. – MDPI. – 2024.
Information about the authors

Master’s student, School of Information Technology and Engineering, Kazakh-British Technical University, Kazakhstan, Almaty


Doctor of Sciences, Assoc. Prof., School of Information Technology and Engineering, Kazakh-British Technical University, Kazakhstan, Almaty

