TRL-Bench: Standardizing Cross-Paradigm Representation-Level Evaluation of Tabular Encoders

Standardizes representation-level evaluation for tabular encoders by exporting row-, column-, and table-level embeddings and probing them with shared lightweight heads across three suites (TRL-CTbench, TRL-Rbench, TRL-DLTE). Supplies curated benchmark assets and task rewrites (50 OpenML tables, 123 targets, a 47,772-table DLTE lake) to enable fair cross-paradigm comparison.

Introduction

Most assessments of tabular encoders live inside end-to-end pipelines, which hides whether a strong final score comes from reusable representations or pipeline specifics. TRL-Bench changes the lens: it extracts row-, column-, and table-level embeddings from arbitrary encoders and evaluates those representations under standardized downstream probes so that strengths become capability-specific signals rather than leaderboard artifacts.

Key Findings
  • Representation quality is capability-specific: across 20 models and 16 tasks, no single encoder dominates under standardized probes — performance depends on the match between pretraining objective and downstream signal. This argues against one-size-fits-all leaderboards.
  • Surface-text tasks favor generic text encoders, while tabular specialists win where pretraining aligns with structural tabular signals. In practice, choose encoders by capability match, not raw rank.
  • Within-table and cross-table goals separate training regimes: within-table prediction and cross-table linkage prefer different inductive biases, and atomic linkage correlates strongly with the row-matching stage of DLTE pipelines.
  • Compositional pipelines beat single-encoder reuse on DLTE: the best end-to-end systems combine capability-matched specialists, and overall quality depends on non-additive compositional fit rather than per-stage marginal rank.
  • Benchmark assets: curated tasks and data include 50 OpenML tables with 123 verified targets, 16 row-pair linkage rewrites, and a 47,772-table DLTE lake derived from 1,379 parent tables — enabling diverse probes at row/column/table granularities.
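
The finding about compositional pipelines can be illustrated with a toy calculation. The sketch below uses made-up end-to-end scores for a hypothetical two-stage pipeline (the stage names other than row matching, the encoder names, and every number are assumptions for illustration, not results from TRL-Bench). It shows how choosing the best encoder per stage by its average score can select a different, weaker pipeline than choosing the best joint combination.

```python
import numpy as np

# Hypothetical encoders for two pipeline stages (names are illustrative only).
retrieval_encoders = ["R0", "R1", "R2"]   # assumed first stage
matching_encoders = ["M0", "M1", "M2"]    # row-matching stage

# Made-up end-to-end scores: scores[i, j] is the pipeline quality when
# retrieval encoder i is combined with row-matching encoder j.
scores = np.array([
    [0.70, 0.72, 0.40],
    [0.60, 0.50, 0.85],
    [0.68, 0.70, 0.55],
])

# Per-stage marginal rank: pick each stage's encoder by its average score
# over all partners, ignoring how well the two stages fit together.
best_r = int(np.argmax(scores.mean(axis=1)))
best_m = int(np.argmax(scores.mean(axis=0)))
marginal_pick = (retrieval_encoders[best_r], matching_encoders[best_m])
marginal_score = float(scores[best_r, best_m])

# Compositional fit: pick the single best combination directly.
i, j = np.unravel_index(np.argmax(scores), scores.shape)
joint_pick = (retrieval_encoders[i], matching_encoders[j])
joint_score = float(scores[i, j])

print("marginal choice:", marginal_pick, marginal_score)
print("joint choice:   ", joint_pick, joint_score)
```

When the score table has strong interactions, as in this toy case, the two selections disagree, which is the sense in which quality is non-additive across stages.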

Who it's for and trade-offs

Great fit if you build or evaluate tabular encoders, design representation-learning pretraining, or need a standardized probe suite to compare encoders across paradigms. It helps select encoders by capability (surface-text, structural tabular, linkage, etc.) rather than single-number leaderboards.

Look elsewhere if you only care about full end-to-end pipeline scores (TRL-Bench isolates representations and uses lightweight probes), or if your production pipeline cannot export intermediate embeddings. The protocol emphasizes representation reuse and probe performance; it does not replace task-specific end-to-end evaluation when pipeline interactions matter.

Where it fits

TRL-Bench sits between task-specific leaderboards and full-stack benchmarks: it isolates the representation interface so researchers and practitioners can reason about reusable signal, choose specialists when appropriate, and compose encoders for complex data-lake enrichment tasks.

Methodology (brief)

Encoders export embeddings via wrappers at row/column/table levels. Shared lightweight heads then probe these embeddings across three suites: TRL-CTbench (column/table), TRL-Rbench (row), and TRL-DLTE (compositional Data-Lake Table Enrichment). The repository provides task reformulations and large curated assets to ensure reproducibility and cross-paradigm comparability.
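
The export-then-probe idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the TRL-Bench implementation: the function names (`fit_probe`, `evaluate`), the two toy "encoders", the ridge-regression head, and the synthetic data are all hypothetical. The point it shows is that every encoder is frozen, exports embeddings through the same interface, and is scored by the same lightweight head.

```python
import numpy as np

def add_bias(E):
    """Append a constant column so the probe can learn an intercept."""
    return np.hstack([E, np.ones((E.shape[0], 1))])

def fit_probe(E, y, n_classes, l2=1.0):
    """Closed-form ridge classifier on frozen embeddings E of shape (n, d)."""
    Eb = add_bias(E)
    Y = np.eye(n_classes)[y]                       # one-hot targets
    A = Eb.T @ Eb + l2 * np.eye(Eb.shape[1])
    return np.linalg.solve(A, Eb.T @ Y)            # weights, shape (d + 1, n_classes)

def probe_accuracy(W, E, y):
    return float(np.mean(np.argmax(add_bias(E) @ W, axis=1) == y))

def evaluate(encode, X_tr, y_tr, X_te, y_te, n_classes=2):
    """Export embeddings from a frozen encoder, then fit and score the shared probe."""
    W = fit_probe(encode(X_tr), y_tr, n_classes)
    return probe_accuracy(W, encode(X_te), y_te)

# Synthetic stand-in for table rows: the label depends only on the first two columns.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_tr, X_te, y_tr, y_te = X[:600], X[600:], y[:600], y[600:]

# Two hypothetical frozen encoders exposing the same row-embedding interface.
P = rng.normal(size=(8, 16))
Q = rng.normal(size=(6, 16))
encoder_a = lambda rows: rows @ P          # keeps the informative columns
encoder_b = lambda rows: rows[:, 2:] @ Q   # discards the informative columns

acc_a = evaluate(encoder_a, X_tr, y_tr, X_te, y_te)
acc_b = evaluate(encoder_b, X_tr, y_tr, X_te, y_te)
print(f"encoder_a probe accuracy: {acc_a:.3f}")
print(f"encoder_b probe accuracy: {acc_b:.3f}")
```

Because the head and the data split are held fixed, any gap between the two accuracies is attributable to the exported representation rather than to pipeline details.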

Information

  • Website: arxiv.org
  • Authors: Wei Pang, Xiangru Jian, Hehan Li, Zhixuan Yu, Alex Xue, Jinyang Li, Zhengyuan Dong, Xinjian Zhao, Hao Xu, Chao Zhang, Reynold Cheng, M. Tamer Özsu, Tianshu Yu
  • Published date: 2026/06/08