How VAULT evaluates veterinary clinical AI.

Our evaluation methodology is open by design. The rubric, grader configuration, and scoring weights described here are published in peer-reviewed research, and the implementation is publicly available on GitHub.

Grader: Gemini 2.5 Pro
Suite: CS v1.0
Last updated: April 2025

Peer-reviewed methodology

The grading framework used by VAULT is described in full in Context Matters: Comparison of commercial large language tools in veterinary medicine (Poore et al., 2025). The paper validates the LLM-as-a-judge approach across three independent grading runs, demonstrating high reproducibility (avg. score SD: 0.015–0.088).

Overview

Each clinical record is submitted to the evaluated tool along with a standardized instruction prompt. The tool's summary output is then scored by an automated LLM-as-a-judge pipeline using Google's Gemini 2.5 Pro as the grader. The grader is configured with a temperature of 0.1, JSON response formatting with Pydantic schema validation, and a reasoning budget of 16,384 tokens — giving the model sufficient space to reason through ambiguous cases before assigning scores.

Each grading session includes the full source record to prevent reliance on the model's parametric knowledge. No text truncation or retrieval-augmented generation is used. All grader outputs are validated against a JSON schema; failed grading attempts trigger automatic retries.
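The structured-output contract described above can be sketched with a Pydantic model. The field names and rationale strings below are illustrative assumptions, not the published schema:

```python
from pydantic import BaseModel, Field


class DimensionScore(BaseModel):
    """One rubric dimension: a 1-5 score plus a brief rationale."""
    score: int = Field(ge=1, le=5)
    rationale: str


class GraderOutput(BaseModel):
    """Illustrative schema for the judge's JSON response.

    Field names are assumptions for this sketch, not VAULT's actual schema.
    """
    factual_accuracy: DimensionScore
    clinical_relevance: DimensionScore
    completeness: DimensionScore
    chronological_order: DimensionScore
    organization: DimensionScore


# Validation rejects malformed grader output, which would trigger a retry.
raw = (
    '{"factual_accuracy": {"score": 5, "rationale": "All facts match."},'
    ' "clinical_relevance": {"score": 4, "rationale": "Minor trivia included."},'
    ' "completeness": {"score": 4, "rationale": "One minor omission."},'
    ' "chronological_order": {"score": 5, "rationale": "Timeline preserved."},'
    ' "organization": {"score": 4, "rationale": "Clear problem-list structure."}}'
)
parsed = GraderOutput.model_validate_json(raw)
print(parsed.factual_accuracy.score)  # 5
```

A response missing a field, or with a score outside 1–5, raises a `ValidationError` at parse time rather than propagating a bad score into the results.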

To validate reproducibility, every dataset is evaluated across three independent grading runs. Standard deviations across runs are published alongside scores. No per-case scores are returned to participants — all reporting is aggregate and category-sliced.

Scoring Dimensions

Each criterion is scored 1–5 by the LLM judge (1 = Poor, 5 = Excellent). Weights were developed in consultation with a board-certified veterinary clinician.

Factual Accuracy (w = 2.5, 35.7%)

Measures alignment of specific facts — dates, patient identifiers, diagnoses, treatments, and test results — between the summary and the source record. Mismatches or fabricated information are explicitly identified. This is the highest-weighted dimension given its direct clinical impact.

Clinical Relevance (w = 1.5, 21.4%)

Examines whether the summary emphasizes medically important information appropriate for referral or medical history, while avoiding trivial details that detract from clinical context. Evaluated in consultation with board-certified veterinary clinicians.

Completeness (w = 1.2, 17.1%)

Assesses whether key medical events, diagnoses, treatments, and significant findings are included in the summary. Major omissions that would affect clinical decision-making are identified and penalized.

Chronological Order (w = 1.0, 14.3%)

Evaluates whether the temporal sequence of events in the summary accurately reflects the original timeline in the clinical record. Correct chronology is essential for understanding disease progression and treatment response.

Organization (w = 0.8, 11.4%)

Focuses on structure, clarity, and logical flow — including whether information is arranged chronologically or by problem list. A well-organized summary reduces cognitive load on the clinician reviewing the output.

Weighted Score Formula

The final score is a weighted average across all five dimensions. This formula ensures that higher-stakes dimensions (e.g., Factual Accuracy) exert proportionally more influence on the overall score.

// From Poore et al. (2025)
WeightedScore = Σ(score_i × weight_i) / Σ(weight_i)
// Expanded
= (accuracy×2.5 + relevance×1.5 + completeness×1.2
  + chronology×1.0 + organization×0.8) / 7.0

Scores range from 1.0 to 5.0. Median and IQR are reported per platform across all evaluated records.
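The formula above is straightforward to compute; this minimal sketch (function and dictionary names are our own, not from the published code) shows the weighted average in Python:

```python
# Weights from the published rubric; they sum to 7.0.
WEIGHTS = {
    "accuracy": 2.5,
    "relevance": 1.5,
    "completeness": 1.2,
    "chronology": 1.0,
    "organization": 0.8,
}


def weighted_score(scores: dict[str, float]) -> float:
    """Weighted average of the five 1-5 dimension scores."""
    total = sum(scores[dim] * w for dim, w in WEIGHTS.items())
    return total / sum(WEIGHTS.values())


# A tool scoring 5 on Factual Accuracy and 4 everywhere else:
example = {"accuracy": 5, "relevance": 4, "completeness": 4,
           "chronology": 4, "organization": 4}
print(round(weighted_score(example), 3))  # 4.357
```

Note how the single 5 in Factual Accuracy pulls the score to 4.357 rather than the unweighted mean of 4.2, reflecting that dimension's larger weight.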

Reproducibility

To validate the internal consistency of the grading framework, all evaluations are run in triplicate. The standard deviation of scores across the three runs is published alongside median scores. A well-functioning grader should produce near-identical scores for identical inputs.

In the published validation study, the grader demonstrated strong consistency: average score standard deviations were 0.015, 0.088, and 0.034 across three platforms — confirming that score variation reflects real differences between tools, not grader noise.
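The consistency statistic reported above can be computed per record and then averaged per platform. The scores below are made-up placeholders to illustrate the calculation, not real benchmark data:

```python
from statistics import mean, stdev

# Hypothetical weighted scores for two records, each graded in three
# independent runs (illustrative values only).
runs_per_record = [
    [4.36, 3.90, 4.11],  # record A across runs 1-3
    [3.50, 3.47, 3.53],  # record B across runs 1-3
]

# Sample standard deviation across the three runs for each record,
# then the platform-level average of those per-record SDs.
per_record_sd = [stdev(r) for r in runs_per_record]
avg_sd = mean(per_record_sd)
```

An `avg_sd` near zero indicates the grader returns near-identical scores for identical inputs, so remaining score differences between platforms can be attributed to the tools rather than to grader noise.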

Grader Versioning

Every benchmark run is associated with a specific grader version. When the grader is updated, the version number increments. Historical runs retain their original grader version reference — they are never retroactively re-graded without explicit notice and participant consent.

When significant grader changes are made (e.g., changes to scoring weights or rubric scope), a new benchmark suite version is created. Leaderboard entries are always versioned to a specific suite + grader combination. Comparisons across suite versions should be made with care.
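One way to enforce the suite + grader pinning described above is to treat the pair as an immutable value and only allow direct comparison when both components match. This is a sketch of that design choice, not VAULT's implementation; the class and version strings are illustrative:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunVersion:
    """Pins a benchmark run to an exact suite + grader combination."""
    suite: str   # e.g. "CS v1.0"
    grader: str  # e.g. "gemini-2.5-pro / rubric v1" (illustrative label)


def directly_comparable(a: RunVersion, b: RunVersion) -> bool:
    """Scores are only directly comparable under identical versions."""
    return a == b


v1 = RunVersion("CS v1.0", "gemini-2.5-pro / rubric v1")
v2 = RunVersion("CS v1.1", "gemini-2.5-pro / rubric v2")
print(directly_comparable(v1, v1), directly_comparable(v1, v2))  # True False
```

Because the dataclass is frozen, historical leaderboard entries keep their original version reference; a rubric change produces a new `RunVersion` rather than mutating old results.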

What We Never Do

  • Return per-case scores to participants
  • Allow participants to inspect grading rubric details that would enable optimization against the test set
  • Expose the source clinical records in any form
  • Accept participant-submitted graders or evaluation code
  • Publish results without admin review and participant consent
  • Re-grade historical runs without disclosure and versioning