CanHealth AI Validation Study

Publication Figures — Vancouver Coastal Health Authority & Sapien Secure Inc.

Figure 1. AI accuracy by priority level before and after panel consensus review. The AI achieved 96.9% overall accuracy after review, with perfect performance (100%) on high-acuity P1/P2 cases.
Figure 2. Consensus match rates for AI and individual human expert evaluators (blinded phase). The AI achieved 55.1% consensus agreement, lower than all human experts (range: 64.9%–84.8%).
Figure 3. Distribution of AI error patterns after panel review (n=12 errors). The majority (58.3%) were over-triage classifications, in which the AI assigned a higher urgency level than the panel consensus, reflecting a conservative safety bias.
Figure 4. AI performance under permissive versus strict scoring criteria. The permissive criterion (green line) accepts any expert-endorsed answer, whereas the strict criterion (orange line) requires an exact match to the panel consensus. The divergence between the two indicates that most AI "errors" reflect alternative valid interpretations rather than clinically incorrect decisions.
Figure 5. Confusion matrix showing AI predictions versus panel consensus for priority classifications. Diagonal elements (green gradient) represent correct predictions; off-diagonal elements indicate misclassifications. Most errors occurred between adjacent priority levels.
Figure 6. Intra-rater variability, measured as the rate of answer changes on re-evaluation. Individual evaluators changed their priority assignments on 10.7%–61.7% of re-reviewed cases (overall: 39.3%), highlighting the inherent inconsistency of human clinical triage decisions.