Long chain-of-thought (CoT) traces often mix useful reasoning with redundant trial-and-error steps, and outcome-based reinforcement learning tends to reinforce an entire long trajectory, redundant parts included. ThoughtFold introduces an introspective preference-learning approach that detects redundant sub-trajectories within correct CoTs and explicitly penalizes them. The model is encouraged to connect only the essential reasoning segments, "folding" long chains into shorter, more direct paths. (arxiv.org)
Key findings
- Introduces an introspective mechanism that generates a spectrum of candidate sub-trajectories from a correct CoT and learns fine-grained preferences among them, rather than relying solely on outcome-correct / outcome-incorrect signals. This produces denser learning signals focused on redundancy reduction. (arxiv.org)
- Proposes a masked preference optimization objective that explicitly penalizes redundant explorations and rewards concise bridging between essential steps, shifting the learning signal from final-outcome memorization to intra-trajectory efficiency. (arxiv.org)
- Reports that applying ThoughtFold to DeepSeek-R1-Distill-Qwen-7B reduces token usage by ~56% while maintaining state-of-the-art accuracy on the evaluated benchmarks, which indicates a substantial inference-efficiency gain without an accuracy trade-off. (arxiv.org)
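To make the first two findings concrete, here is a minimal Python sketch of the general idea, not the paper's implementation. It assumes the essential step indices are already known (the introspective detection step is not specified in this summary), that shorter candidates are preferred over longer ones, and that the mask zeroes out tokens shared by both candidates. The function names, the `essential` argument, the `beta` value, and the DPO-style loss form are all illustrative assumptions.

```python
import math
from itertools import combinations


def fold_candidates(steps, essential):
    """Keep every essential step plus each possible subset of the optional ones.

    `essential` is a set of step indices. How ThoughtFold actually finds
    them is not shown here; this is a hypothetical input.
    """
    optional = [i for i in range(len(steps)) if i not in essential]
    candidates = []
    for k in range(len(optional) + 1):
        for kept in combinations(optional, k):
            keep = sorted(set(essential) | set(kept))
            candidates.append([steps[i] for i in keep])
    return candidates


def preference_pairs(candidates):
    """Return (chosen, rejected) pairs, preferring the strictly shorter trace."""
    return [
        (a, b) if len(a) < len(b) else (b, a)
        for a, b in combinations(candidates, 2)
        if len(a) != len(b)
    ]


def masked_logratio(policy_logps, ref_logps, mask):
    """Sum policy-vs-reference log-probability gaps over unmasked tokens only."""
    return sum(m * (p - r) for p, r, m in zip(policy_logps, ref_logps, mask))


def masked_preference_loss(chosen, rejected, beta=0.1):
    """DPO-style loss, -log(sigmoid(margin)), on masked log-ratios.

    `chosen` and `rejected` are (policy_logps, ref_logps, mask) triples.
    This loss form is an assumed stand-in for the paper's objective.
    """
    margin = beta * (masked_logratio(*chosen) - masked_logratio(*rejected))
    return math.log1p(math.exp(-margin))


# Toy trace: two essential steps around two failed attempts.
steps = ["set up equation", "try x = 1 (fails)", "try x = 2 (fails)", "solve for x"]
cands = fold_candidates(steps, essential={0, 3})
pairs = preference_pairs(cands)
print(len(cands), len(pairs))
```

In this toy setup the shortest candidate contains only the essential steps, and it is the chosen side of every pair it appears in, so the loss pushes probability toward the folded trace.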
Who it's for and trade-offs
Great fit if you train or fine-tune LLMs for multi-step reasoning tasks and care about inference cost: ThoughtFold is designed to produce shorter reasoning traces that reduce token consumption during generation and training. Look elsewhere if your use case requires preserving full exploratory traces for interpretability, forensic analysis, or curriculum-style training where intermediate wrong turns are pedagogically important, because ThoughtFold intentionally downweights and folds such explorations. (arxiv.org)
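As a rough back-of-the-envelope illustration of the inference-cost argument, the sketch below applies the reported ~56% token reduction to a baseline trace length. The baseline token count and the price per million tokens are hypothetical placeholders, not figures from the paper, and the function names are illustrative.

```python
def folded_token_count(baseline_tokens: int, reduction: float = 0.56) -> int:
    """Estimate trace length after folding, given a fractional token reduction."""
    if not 0.0 <= reduction <= 1.0:
        raise ValueError("reduction must be between 0 and 1")
    return round(baseline_tokens * (1.0 - reduction))


def generation_cost(tokens: int, price_per_million: float) -> float:
    """Cost of generating `tokens` tokens at a given price per million tokens."""
    return tokens * price_per_million / 1_000_000


# Hypothetical numbers: a 10,000-token trace at 2.0 currency units per million tokens.
baseline = 10_000
folded = folded_token_count(baseline)
print(folded, generation_cost(baseline, 2.0), generation_cost(folded, 2.0))
```

Actual savings depend on the task and on how long the model's unfolded traces are, so treat the result as an order-of-magnitude estimate.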
Where it fits
Methodologically, this paper sits between outcome-based RL for CoTs and preference-learning approaches. It refines the unit of supervision from whole-trajectory outcomes to ranked sub-trajectories, offering a pragmatic route to more efficient reasoning in deployed LLMs without redesigning model architectures. (arxiv.org)
