Platform Updates

Benchmark Changelog

All material changes to the VAULT benchmark methodology, dataset, rubric, and platform are documented here. Each update is versioned so that results can be reproduced and audited against the exact suite that produced them.

v1.3 Rubric Update

April 2025

—Introduced weighted composite scoring across five criteria
—Factual Accuracy weight increased from ×2.0 to ×2.5
—Organization criterion weight adjusted from ×1.0 to ×0.8
—Temperature fixed at 0.1 for improved LLM judge reproducibility
—16,384-token reasoning budget applied to all judge calls
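As a rough illustration of the weighted composite scoring introduced in v1.3, the sketch below computes a weighted mean over five criteria. Only the Factual Accuracy (×2.5) and Organization (×0.8) weights come from this changelog. The other three criterion names, their weights, and the normalization by total weight are assumptions made for the example, not the published rubric.

```python
from typing import Mapping

# Weights per criterion. Only factual_accuracy (2.5) and organization (0.8)
# are taken from the v1.3 entry; the remaining three names and weights are
# hypothetical placeholders.
WEIGHTS = {
    "factual_accuracy": 2.5,
    "organization": 0.8,
    "completeness": 1.0,        # hypothetical
    "clarity": 1.0,             # hypothetical
    "clinical_relevance": 1.0,  # hypothetical
}


def composite_score(scores: Mapping[str, float],
                    weights: Mapping[str, float] = WEIGHTS) -> float:
    """Return the weighted mean of per-criterion scores.

    Dividing by the total weight (an assumption) keeps the composite on the
    same scale as the individual criterion scores.
    """
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing criterion scores: {sorted(missing)}")
    total_weight = sum(weights.values())
    weighted_sum = sum(weights[name] * scores[name] for name in weights)
    return weighted_sum / total_weight
```

With these placeholder weights, a summary that scores low on Factual Accuracy loses more composite points than one that scores equally low on Organization, which is the effect the v1.3 reweighting describes.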

v1.2 Dataset Update

March 2025

—Dataset expanded to 5,000 canine and feline cases
—Improved case anonymization protocol (HIPAA-equivalent standards)
—Added feline-specific clinical context normalization
—Revised structured JSON output schema for score validation
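The revised output schema is not reproduced in this changelog, so the following is only a sketch of what score validation against a structured JSON judge response might look like. The payload shape (a top-level "scores" object), the criterion names other than Factual Accuracy and Organization, and the 1 to 5 score range are all hypothetical.

```python
import json

# Hypothetical criterion set and score range; the real schema may differ.
EXPECTED_CRITERIA = {
    "factual_accuracy",
    "organization",
    "completeness",        # hypothetical
    "clarity",             # hypothetical
    "clinical_relevance",  # hypothetical
}
SCORE_MIN, SCORE_MAX = 1, 5  # hypothetical range


def validate_judge_output(raw: str) -> dict:
    """Parse a judge response and return its scores, or raise ValueError."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"judge output is not valid JSON: {exc}") from exc

    if not isinstance(payload, dict):
        raise ValueError("judge output must be a JSON object")
    scores = payload.get("scores")
    if not isinstance(scores, dict):
        raise ValueError("'scores' must be a JSON object")
    if set(scores) != EXPECTED_CRITERIA:
        raise ValueError("'scores' must contain exactly the expected criteria")

    for name, value in scores.items():
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"score for {name!r} must be a number")
        if not SCORE_MIN <= value <= SCORE_MAX:
            raise ValueError(f"score for {name!r} is out of range")
    return scores
```

Rejecting malformed responses before they reach the composite calculation means a single bad judge call fails loudly instead of silently skewing a run.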

v1.1 Platform Update

February 2025

—Launched VAULT API v1, covering benchmark runs, the leaderboard, and model registration
—Introduced API key management with scoped permissions and instant revocation
—Added webhook support for run completion and publication events
—Beta participant access program opened to approved organizations

v1.0 Initial Launch

January 2025

—VAULT Clinical Summarization benchmark launched in private beta
—LLM-as-a-judge evaluation framework established
—Initial dataset of 2,500 annotated canine cases
—Governance charter, data access policy, and acceptable use policy published

Versioning policy: A major version change (v1.x → v2.x) indicates a breaking change to the evaluation methodology, so scores may not be comparable across major versions. A minor update (v1.2 → v1.3) is a backward-compatible refinement, such as a rubric adjustment. Every benchmark run carries a suite version tag for full traceability.
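The versioning policy above can be applied mechanically when comparing two runs. The sketch below assumes suite version tags use the same "vMAJOR.MINOR" form as the headings in this changelog; the actual tag format attached to runs is not specified here, and the function names are illustrative.

```python
import re
from typing import NamedTuple


class SuiteVersion(NamedTuple):
    major: int
    minor: int


# Assumed tag format: "v" followed by MAJOR.MINOR, for example "v1.3".
_TAG_PATTERN = re.compile(r"^v(\d+)\.(\d+)$")


def parse_suite_version(tag: str) -> SuiteVersion:
    """Parse a tag such as "v1.3" into its major and minor components."""
    match = _TAG_PATTERN.match(tag.strip())
    if match is None:
        raise ValueError(f"unrecognized suite version tag: {tag!r}")
    return SuiteVersion(int(match.group(1)), int(match.group(2)))


def scores_comparable(tag_a: str, tag_b: str) -> bool:
    """Treat two runs as comparable when they share a major version.

    Under the stated policy, only major version changes are breaking.
    """
    return parse_suite_version(tag_a).major == parse_suite_version(tag_b).major
```

For example, `scores_comparable("v1.2", "v1.3")` is true, while a run tagged "v1.3" and one tagged "v2.0" would be flagged as not directly comparable.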