Platform Updates
Benchmark Changelog
All material changes to the VAULT benchmark methodology, dataset, rubric, and platform are documented here. Versioned updates ensure full reproducibility and auditability of results.
v1.3 Rubric Update
April 2025
- Introduced weighted composite scoring across five criteria
- Factual Accuracy weight increased from ×2.0 to ×2.5
- Organization criterion weight adjusted from ×1.0 to ×0.8
- Temperature fixed at 0.1 for improved LLM judge reproducibility
- 16,384-token reasoning budget applied to all judge calls
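As a rough illustration of the weighted composite scoring introduced in v1.3, the sketch below combines per-criterion scores into a single number. Only the Factual Accuracy (×2.5) and Organization (×0.8) weights come from the notes above. The other three criterion names, their 1.0 weights, and the choice to normalize by the sum of weights are placeholders assumed for the example, not VAULT's published rubric.

```python
from typing import Mapping

# Factual Accuracy and Organization weights are taken from the v1.3 notes.
# ASSUMPTION: the remaining three names and their 1.0 weights are placeholders.
WEIGHTS_V1_3 = {
    "factual_accuracy": 2.5,
    "organization": 0.8,
    "criterion_3": 1.0,
    "criterion_4": 1.0,
    "criterion_5": 1.0,
}


def weighted_composite(
    scores: Mapping[str, float],
    weights: Mapping[str, float] = WEIGHTS_V1_3,
) -> float:
    """Return the weighted mean of per-criterion scores.

    ASSUMPTION: the composite is normalized by the total weight so it stays
    on the same scale as the individual scores.
    """
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    total_weight = sum(weights.values())
    weighted_sum = sum(weights[name] * scores[name] for name in weights)
    return weighted_sum / total_weight
```

With this normalization, a case that scores the same value on every criterion gets that value as its composite, while a high Factual Accuracy score pulls the composite up more than an equally high Organization score would.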
v1.2 Dataset Update
March 2025
- Dataset expanded to 5,000 canine and feline cases
- Improved case anonymization protocol (HIPAA-equivalent standards)
- Added feline-specific clinical context normalization
- Revised structured JSON output schema for score validation
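The revised schema itself is not reproduced in this changelog, so the following is only a sketch of what score validation against a structured JSON output might look like. The `scores` field, the criterion names passed in, and the 1 to 5 score range are all assumptions made for the example.

```python
import json


def validate_judge_output(
    raw: str,
    criteria: list[str],
    lo: float = 1.0,   # ASSUMPTION: minimum allowed score
    hi: float = 5.0,   # ASSUMPTION: maximum allowed score
) -> dict[str, float]:
    """Parse a judge response and check every expected criterion score.

    ASSUMPTION: the response is a JSON object with a "scores" object that
    maps criterion names to numbers. This shape is hypothetical.
    """
    data = json.loads(raw)
    if not isinstance(data, dict) or not isinstance(data.get("scores"), dict):
        raise ValueError("expected a JSON object with a 'scores' object")
    scores = data["scores"]

    unexpected = set(scores) - set(criteria)
    if unexpected:
        raise ValueError(f"unexpected criteria: {sorted(unexpected)}")

    validated: dict[str, float] = {}
    for name in criteria:
        value = scores.get(name)
        # bool is a subclass of int, so it is rejected explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"{name}: score must be a number")
        if not lo <= value <= hi:
            raise ValueError(f"{name}: score {value} outside [{lo}, {hi}]")
        validated[name] = float(value)
    return validated
```

Rejecting malformed output before it reaches the scoring step keeps an invalid judge response from silently contributing a score.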
v1.1 Platform Update
February 2025
- Launched VAULT API v1: benchmark runs, leaderboard, model registration
- Introduced API key management with scoped permissions and instant revocation
- Added webhook support for run completion and publication events
- Beta participant access program opened to approved organizations
v1.0 Initial Launch
January 2025
- VAULT Clinical Summarization benchmark launched in private beta
- LLM-as-a-judge evaluation framework established
- Initial dataset of 2,500 annotated canine cases
- Governance charter, data access policy, and acceptable use policy published
Versioning policy: Major version changes (v1.x → v2.x) indicate a breaking change to the evaluation methodology that may affect score comparability across versions. Minor version changes (v1.2 → v1.3) introduce backward-compatible refinements. Every benchmark run carries a suite version tag for full traceability.
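One way to apply the versioning policy when comparing two runs is to check their suite version tags, as in the sketch below. It assumes tags follow the `vMAJOR.MINOR` form used in this changelog (for example `v1.3`); the actual tag format attached to runs is not specified here, and the function names are hypothetical.

```python
import re

# ASSUMPTION: suite version tags look like "v1.3" (vMAJOR.MINOR).
_TAG_PATTERN = re.compile(r"^v(\d+)\.(\d+)$")


def parse_suite_version(tag: str) -> tuple[int, int]:
    """Split a tag such as "v1.3" into (major, minor) integers."""
    match = _TAG_PATTERN.match(tag)
    if match is None:
        raise ValueError(f"unrecognized suite version tag: {tag!r}")
    return int(match.group(1)), int(match.group(2))


def scores_comparable(tag_a: str, tag_b: str) -> bool:
    """Treat two runs as comparable when they share a major version.

    This mirrors the stated policy: only major version changes are
    flagged as potentially breaking score comparability.
    """
    return parse_suite_version(tag_a)[0] == parse_suite_version(tag_b)[0]
```

For example, `scores_comparable("v1.2", "v1.3")` returns `True`, while `scores_comparable("v1.3", "v2.0")` returns `False`.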