TRIAGE: Dialectical Reasoning for Explainable Risk Prediction on Irregularly Sampled Medical Time Series with LLMs

TRIAGE trains an LLM to generate outcome-specific, dialectical rationales and derives continuous, calibrated risk scores for irregularly sampled medical time series, mitigating risk polarization. The paper reports a 3.3% average AUPRC improvement and an 81% reduction in calibration error across three benchmarks; the code is released.

Introduction

Clinical early warning from irregularly sampled EHR time series demands both well-calibrated risk estimates and explanations clinicians can verify. The core insight of this work is that asking an LLM to commit to a single outcome before scoring fosters overconfident, polarized predictions; instead, inspecting alternative outcomes and eliciting a dedicated rationale per outcome lets the model express graded, comparable risk via its implicit probabilities.
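As a rough illustration of how a graded risk score can come from implicit probabilities, the sketch below normalizes two outcome log-probabilities into a single value between 0 and 1. The summary does not give the paper's exact formula, so this two-way softmax is an assumption, and the function name `risk_from_logprobs` and its inputs (the log-probability an LLM assigns to each outcome after reading that outcome's rationale) are hypothetical.

```python
import math


def risk_from_logprobs(logp_positive: float, logp_negative: float) -> float:
    """Turn two outcome log-probabilities into a risk score in [0, 1].

    Assumption: each input is the log-probability the model assigns to one
    outcome, conditioned on the rationale written for that outcome. The
    normalization is a plain two-way softmax, not necessarily the paper's.
    """
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(logp_positive, logp_negative)
    e_pos = math.exp(logp_positive - m)
    e_neg = math.exp(logp_negative - m)
    return e_pos / (e_pos + e_neg)


if __name__ == "__main__":
    # Equal support for both outcomes gives a risk of exactly 0.5.
    print(risk_from_logprobs(-1.0, -1.0))
    # Weaker support for the positive outcome gives a graded risk below 0.5.
    print(risk_from_logprobs(-1.2, -0.4))
```

Because both outcomes are scored before normalizing, the result can land anywhere between 0 and 1 instead of snapping to the outcome the model would have committed to first.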

Key Findings
  • Dialectical supervision: train an LLM to produce outcome-specific rationales (one rationale per candidate outcome) and derive a continuous risk score from the model’s implicit probabilities conditioned on those rationales. This reduces the tendency to collapse to extreme binary predictions.
  • Empirical gains: across three irregularly sampled medical time-series benchmarks, TRIAGE yields an average AUPRC improvement of 3.3% and reduces calibration error by 81% versus competitive baselines. An LLM-as-judge evaluation rated TRIAGE rationales ~20% higher in clinical reasoning quality than baseline post-hoc explanations.
  • Practical result: a single, relatively small open-source LLM can deliver both discriminative, better-calibrated risk estimates and explicit, outcome-grounded natural language explanations in one pass.
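To make the calibration claim concrete, here is a minimal sketch of one common calibration metric, expected calibration error (ECE) for binary risk scores with equal-width bins. The summary does not say which calibration metric or binning scheme the paper uses, so both are assumptions, and the toy inputs are illustrative only, not data from the benchmarks.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    probs: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> float:
    """Binary ECE: weighted gap between mean predicted risk and observed rate.

    Assumption: equal-width bins over [0, 1]; `probs` are predicted
    probabilities of the positive outcome and `labels` are 0 or 1.
    """
    if len(probs) != len(labels) or len(probs) == 0:
        raise ValueError("probs and labels must be non-empty and equal length")
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
        # A probability of exactly 1.0 falls into the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    total = len(probs)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_risk = sum(p for p, _ in bucket) / len(bucket)
        observed_rate = sum(y for _, y in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(mean_risk - observed_rate)
    return ece


if __name__ == "__main__":
    labels = [1, 0, 0, 0]
    polarized = [1.0, 1.0, 0.0, 0.0]  # extreme scores, one of them wrong
    graded = [0.5, 0.5, 0.0, 0.0]     # same ranking, hedged scores
    print(expected_calibration_error(polarized, labels))
    print(expected_calibration_error(graded, labels))
```

In this toy case the polarized scores incur a calibration penalty because half of the confident positives are wrong, while the graded scores match the observed rates in every bin.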

Who it's for and trade-offs

Great fit if you need clinically grounded, inspectable risk estimates from irregular EHR time series and want explanations tied to alternative outcomes rather than post-hoc salience. Look elsewhere if you cannot supply any supervision for outcome-specific rationales, if deployment requires a fully validated clinical-grade pipeline (TRIAGE is a research prototype requiring external clinical validation), or if strict latency/compute limits prohibit running an LLM-based reasoning step. The method improves explainability and calibration but adds the cost of generating and supervising multiple outcome-specific rationales per example.

Where it fits

TRIAGE sits between opaque numerical time-series predictors and post-hoc explanation systems: it uses language-model reasoning (rationales) as the primary evidentiary interface while extracting calibrated probabilities rather than relying on a single committed prediction.

Implementation note

The authors provide training recipes (dialectical reasoning supervision + self-refinement) and released code to reproduce experiments on public benchmarks; clinical deployment still requires dataset-specific validation, privacy safeguards, and regulatory review.

Information

  • Website: arxiv.org
  • Authors: Hyeongwon Jang, Gyouk Chu, Changhun Kim, Joonhyung Park, Hangyul Yoon, Eunho Yang
  • Published date: 2026/06/08