Clinical early warning from irregularly sampled EHR time series demands both well-calibrated risk estimates and explanations clinicians can verify. The core insight of this work is that asking an LLM to commit to a single outcome before scoring fosters overconfident, polarized predictions; instead, inspecting alternative outcomes and eliciting a dedicated rationale per outcome lets the model express graded, comparable risk via its implicit probabilities.
Key Findings
- Dialectical supervision: train an LLM to produce outcome-specific rationales (one rationale per candidate outcome) and derive a continuous risk score from the model’s implicit probabilities conditioned on those rationales. This reduces the tendency to collapse to extreme binary predictions.
- Empirical gains: across three irregularly sampled medical time-series benchmarks, the proposed method, TRIAGE, yields an average AUPRC improvement of 3.3% and reduces calibration error by 81% versus competitive baselines. An LLM-as-judge evaluation rated TRIAGE rationales ~20% higher in clinical reasoning quality than baseline post-hoc explanations.
- Practical result: a single, relatively small open-source LLM can deliver both discriminative, better-calibrated risk estimates and explicit, outcome-grounded natural language explanations in one pass.
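The risk-scoring idea in the findings above can be illustrated with a minimal Python sketch. It assumes the per-outcome log-likelihoods (how strongly the model supports each candidate outcome, given that outcome's rationale) have already been extracted from the LLM. The function names `risk_from_logprobs` and `expected_calibration_error` are hypothetical and are not taken from the released code. Expected calibration error is shown only as one common calibration metric; the source does not say which metric underlies the reported 81% reduction.

```python
import math
from typing import Sequence


def risk_from_logprobs(logp_positive: float, logp_negative: float) -> float:
    """Turn two outcome log-likelihoods into a graded risk score in [0, 1].

    Assumption: each input is the model's log-likelihood of one candidate
    outcome, conditioned on the rationale written for that outcome.
    """
    # Numerically stable two-way softmax: shift by the max before exp().
    shift = max(logp_positive, logp_negative)
    e_pos = math.exp(logp_positive - shift)
    e_neg = math.exp(logp_negative - shift)
    return e_pos / (e_pos + e_neg)


def expected_calibration_error(
    probs: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> float:
    """Equal-width-bin expected calibration error (ECE) for binary labels."""
    if len(probs) != len(labels) or not probs:
        raise ValueError("probs and labels must be non-empty and equal length")

    # Group example indices by which probability bin they fall into.
    bins = [[] for _ in range(n_bins)]
    for i, p in enumerate(probs):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[b].append(i)

    ece = 0.0
    for idx in bins:
        if not idx:
            continue
        mean_conf = sum(probs[i] for i in idx) / len(idx)
        event_rate = sum(labels[i] for i in idx) / len(idx)
        # Weight each bin's |confidence - observed rate| gap by its share.
        ece += (len(idx) / len(probs)) * abs(mean_conf - event_rate)
    return ece
```

When the two log-likelihoods are equal, `risk_from_logprobs` returns 0.5, so the score only moves toward 0 or 1 as the model's support for one outcome actually outweighs the other, rather than being forced to an extreme by an early committed prediction.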
Who it's for and trade-offs
Great fit if you need clinically grounded, inspectable risk estimates from irregular EHR time series and want explanations tied to alternative outcomes rather than post-hoc salience. Look elsewhere if you cannot supply any supervision for outcome-specific rationales, if deployment requires a fully validated clinical-grade pipeline (TRIAGE is a research prototype requiring external clinical validation), or if strict latency/compute limits prohibit running an LLM-based reasoning step. The method improves explainability and calibration but adds the cost of generating and supervising multiple outcome-specific rationales per example.
Where it fits
TRIAGE sits between opaque numerical time-series predictors and post-hoc explanation systems: it uses language-model reasoning as the primary evidentiary interface (rationales) while extracting calibrated probabilities rather than relying on a single committed prediction.
Implementation note
The authors provide training recipes (dialectical reasoning supervision plus self-refinement) and have released code to reproduce the experiments on public benchmarks; clinical deployment still requires dataset-specific validation, privacy safeguards, and regulatory review.
