# VulnGym

A Real-World, Project-Level Vulnerability Benchmark for White-Box Vulnerability-Hunting Agents

**VulnGym** is a project-level benchmark for white-box vulnerability-hunting agents, designed to evaluate an agent's vulnerability detection capabilities within **real-world engineering contexts**, with **verifiable vulnerability trigger paths and code-semantic evidence chains**.

**Three core design principles:**

- **Real project-level evaluation units**: every sample is bound to a specific vulnerable commit of a real repository, evaluating an agent's ability to discover and locate vulnerabilities inside real multi-file, multi-module engineering projects.
- **Comprehensive vulnerability-type coverage**: the benchmark covers both business-logic defects that demand cross-module code-semantic reasoning (e.g., authorization bypass, broken authentication) and traditional security flaws (e.g., injection, path traversal), providing a comprehensive assessment of an agent's ability to discover diverse vulnerability classes.
- **Verifiable vulnerability paths**: each sample ships with a human-reviewed **reachable entry point** (`entry_point`), **critical operation** (`critical_operation`), and **cross-module reasoning chain** (`trace`), enabling reproducible, explainable, and deterministic evaluation.

---

## What's New

- **2026-05-17**: v0.1.1 data refresh: added a `verify` field on every entry to mark human-audit status; **113 / 408 entries** (covering **61 / 184 advisories**) are now human-verified. Selected `entry_point` / `critical_operation` / `trace` values were also refined.
- **2026-05-15**: VulnGym v0.1.0 officially open-sourced!

## Table of Contents

- [Why VulnGym](#-why-vulngym)
- [Dataset overview](#-dataset-overview)
- [Baseline evaluation results](#-baseline-evaluation-results)
- [Repository layout](#-repository-layout)
- [Quick start](#-quick-start)
- [Evaluating your tool](#-evaluating-your-tool)
- [Citation](#-citation)
- [Contribution Guide](#-contribution-guide)
- [Acknowledgements](#-acknowledgements)
- [License](#-license)

---

## Why VulnGym

Existing vulnerability benchmarks have the following limitations when evaluating the real-world vulnerability-hunting capabilities of AI agents:

| Limitation | Manifestation |
|---|---|
| **Insufficient evaluation granularity** | Most benchmarks use functions or diff snippets as the evaluation unit, failing to reflect an agent's ability to locate vulnerabilities within complete engineering projects |
| **Narrow vulnerability types** | Over-emphasis on pattern-matchable CWE flaws such as SQL injection and buffer overflow, with little coverage of categories requiring deep contextual reasoning |
| **Coarse-grained ground truth** | Typically binary labels (vulnerable / not vulnerable) or patch diffs, unable to precisely verify whether the agent locates the correct entry point and defect site |

## Dataset overview

This is the **v0.1.1 release** of VulnGym. Data is provided as two JSONL files under the `data/` directory:

- `reports.jsonl`: aggregated records at the GitHub Advisory granularity
- `entries.jsonl`: annotated records at the reachable entry point granularity

Each record contains `repo_url` and `commit`, allowing you to check out the full vulnerable source tree for the corresponding version.

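
As a minimal sketch of how the two fields above might be used, the following reads a JSONL file and builds the `git` commands that check out the vulnerable tree for one record. Only `repo_url` and `commit` come from the dataset description; the helper names (`parse_jsonl`, `load_jsonl`, `checkout_commands`, `checkout`) and the `repos/` working directory are assumptions made for illustration, not part of the VulnGym tooling.

```python
import json
import subprocess
from pathlib import PurePosixPath


def parse_jsonl(lines):
    """Parse an iterable of JSONL lines into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def load_jsonl(path):
    """Read a JSONL file such as data/entries.jsonl."""
    with open(path, encoding="utf-8") as fh:
        return parse_jsonl(fh)


def checkout_commands(record, workdir="repos"):
    """Build the git commands that materialise the vulnerable tree for one record.

    Assumption: one clone directory per repository name under `workdir`
    (a hypothetical layout, not prescribed by the benchmark).
    """
    repo_url = record["repo_url"]
    commit = record["commit"]
    name = repo_url.rstrip("/").split("/")[-1]
    if name.endswith(".git"):
        name = name[: -len(".git")]
    dest = str(PurePosixPath(workdir) / name)
    return [
        ["git", "clone", repo_url, dest],
        ["git", "-C", dest, "checkout", commit],
    ]


def checkout(record, workdir="repos"):
    """Run the commands; raises CalledProcessError if git fails."""
    for cmd in checkout_commands(record, workdir):
        subprocess.run(cmd, check=True)
```

A typical call would be `checkout(load_jsonl("data/entries.jsonl")[0])`. Because several entries can point at the same repository, a real harness would probably reuse an existing clone instead of cloning again; this sketch does not handle that case.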
### Data scale

| Metric | Value |
|---|---|
| Advisories (reports) | **184** |
| Reachable entry points (entries) | **408** |
| Distinct projects | 38 |
| Distinct repositories | 23 |
| Human-audited entries (`verify = 1`) | **113 / 408 (27.7 %)** |
| Human-audited advisories (≥ 1 verified entry) | **61 / 184 (33.2 %)** |

### Human audit status

Starting in v0.1.1, every row in `entries.jsonl` carries a `verify` field (`int`, `0` or `1`):

- `verify == 1`: the entry's `entry_point`, `critical_operation`, and `trace` have been reviewed and confirmed by a human annotator. These rows form a high-confidence ground-truth subset and are recommended for strict, reproducible benchmarking.
- `verify == 0`: automatically annotated; not yet human-confirmed. Useful for scale and recall studies, but values may still be refined in future releases.

Of the **184** advisories, **50** have all of their entries verified and **11** are partially verified, for a total of **61** advisories with at least one human-audited entry. Future releases will continue to expand the verified subset.

### Vulnerability type distribution

Every entry carries a two-level classification: `vuln_category_l1` (coarse type) and `vuln_category_l2` (fine-grained sub-type). **71.2 %** of advisories are business-logic vulnerabilities, classified with a **12-class + 1 fallback** taxonomy (see below). The remaining 28.8 % cover traditional vulnerability types. Full data model and field definitions are in [`SCHEMA.md`](SCHEMA.md).

The initial release (v0.1.0) draws primarily from recent high-star open-source projects and focuses on frequently occurring business-logic vulnerabilities; future releases will continue expanding vulnerability categories and project coverage.

> Note: one advisory may map to multiple entries; the counts below
> are by **advisory (vulnerability)**, not by entry.

**Business-logic advisories (131 / 184, 71.2 %), `vuln_category_l2` breakdown:**

| Sub-category | Advisories | % of BL |
|---|---|---|
| BL-AUTHZ-BROKEN: broken authorization logic | 31 | 23.7 % |
| BL-AUTHZ-MISSING: missing authorization | 23 | 17.6 % |
| BL-AGENT-CAPABILITY: AI / Agent capability boundary bypass | 20 | 15.3 % |
| BL-PRIV-ESC: privilege escalation | 13 | 9.9 % |
| BL-AUTH-BYPASS: authentication bypass | 11 | 8.4 % |

**7 more sub-categories (33 advisories, 25.2 % of BL):**

| Sub-category | Advisories | % of BL |
|---|---|---|
| BL-ORIGIN-INTEGRITY: origin / signature / integrity check missing | 8 | 6.1 % |
| BL-WORKFLOW-VIOLATION: workflow / state-machine violation | 7 | 5.3 % |
| BL-INSECURE-DEFAULT: insecure default configuration | 6 | 4.6 % |
| BL-RACE-LOGIC: business-layer race condition | 4 | 3.1 % |
| BL-MULTI-TENANT: multi-tenant / isolation failure | 3 | 2.3 % |
| BL-MASS-ASSIGNMENT: mass assignment / parameter pollution | 3 | 2.3 % |
| BL-TRUST-BOUNDARY: implicit trust in internal input | 2 | 1.5 % |

**Traditional vulnerability advisories (53 / 184, 28.8 %), top `vuln_category_l1`:**

| Category | Advisories | % of Trad. |
|---|---|---|
| Code Injection | 12 | 22.6 % |
| Path Traversal / File ops | 9 | 17.0 % |
| Command Injection | 8 | 15.1 % |
| XSS | 5 | 9.4 % |
| Sandbox Escape | 5 | 9.4 % |

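
The per-advisory counting used in the tables above, together with the `verify` filter, could be reproduced along these lines. This is a sketch under stated assumptions: `verify` and `vuln_category_l2` are fields named in this README, but the key linking an entry to its advisory is not shown here, so `advisory_id` is a hypothetical placeholder (check `SCHEMA.md` for the real field name). The function names are likewise illustrative.

```python
from collections import Counter


def verified_only(entries):
    """Keep entries whose annotations were confirmed by a human (verify == 1)."""
    return [e for e in entries if e.get("verify") == 1]


def advisory_category_counts(entries, advisory_key="advisory_id"):
    """Count advisories per vuln_category_l2.

    Each (advisory, category) pair is counted once, so an advisory with
    several entries does not inflate its category.
    `advisory_key` is a hypothetical field name; adjust it to the real schema.
    """
    seen = set()
    counts = Counter()
    for entry in entries:
        pair = (entry[advisory_key], entry["vuln_category_l2"])
        if pair not in seen:
            seen.add(pair)
            counts[entry["vuln_category_l2"]] += 1
    return counts


def shares(counts):
    """Convert raw counts to percentages of the total, rounded to one decimal."""
    total = sum(counts.values())
    return {category: round(100 * n / total, 1) for category, n in counts.items()}
```

Running `shares(advisory_category_counts(verified_only(entries)))` would give the category mix of the human-audited subset only, which may differ from the full-dataset percentages listed above.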