Large Language Models (LLMs) typically solve complex problems through step-by-step reasoning, while evaluations usually score only the final answer. We challenge this practice by asking: does the final answer truly reflect the model's best conclusion, and could other points along the reasoning path yield better results? We introduce a method that segments a reasoning trace into subthoughts and generates alternative continuations from each one. Aggregating the answers obtained from these continuations by taking the most frequent one improves accuracy significantly over relying on the final answer alone. Experiments on mathematical reasoning benchmarks show consistent gains of up to 13% on AIME2024 and 10% on AIME2025.
Traditional evaluation of language models considers only the final answer of a reasoning process. Our method analyzes intermediate reasoning steps (subthoughts) to extract a more reliable consensus answer.
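As a rough illustration of this idea, the following is a minimal Python sketch, not the paper's exact implementation: the boundary cues in TRANSITION_CUES and the completion hook continue_fn are assumptions introduced here for illustration, standing in for however the model is re-prompted to finish the solution from each subthought prefix.

from collections import Counter

# Hypothetical cues marking subthought boundaries; the paper's exact
# segmentation markers may differ.
TRANSITION_CUES = ("Wait,", "Alternatively,", "Hmm,", "Let me check")

def segment_subthoughts(trace: str) -> list[str]:
    """Return cumulative prefixes of the trace, one per detected boundary,
    plus the full trace itself."""
    cuts = sorted({i for cue in TRANSITION_CUES
                   for i in range(len(trace)) if trace.startswith(cue, i)})
    return [trace[:i].rstrip() for i in cuts if i > 0] + [trace]

def mode_answer(prefixes: list[str], continue_fn) -> str:
    """Finish the solution from each prefix and return the most frequent answer.

    continue_fn(prefix) -> answer stands in for prompting the model to
    complete the reasoning from that prefix and extracting its final answer.
    """
    answers = [continue_fn(p) for p in prefixes]
    return Counter(answers).most_common(1)[0][0]

# Toy usage with a dummy completion function.
if __name__ == "__main__":
    trace = "The area is length times width. Wait, 12 * 8 = 96. So the answer is 96."
    prefixes = segment_subthoughts(trace)
    print(mode_answer(prefixes, lambda p: "96"))  # -> "96"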
The length of a rectangular garden is 12 meters, and the width is 8 meters. What is the area of the garden?
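For reference, the correct answer is area = length × width = 12 m × 8 m = 96 square meters.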
The evolution of answers across subthoughts reveals distinct patterns: consistent correctness, fluctuating incorrectness, and cases where the mode answer corrects an incorrect final answer.
The entropy of answer distributions is significantly lower for correctly solved problems than for incorrectly solved ones, indicating that answer consistency signals reliability.
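As a sketch of how such a consistency signal could be computed (the exact scoring used in the paper may differ), the snippet below measures the Shannon entropy of the empirical distribution of answers gathered from the subthought continuations.

import math
from collections import Counter

def answer_entropy(answers: list[str]) -> float:
    """Shannon entropy (bits) of the empirical answer distribution.

    Low entropy means the subthought continuations mostly agree, which the
    finding above associates with correctly solved problems.
    """
    counts = Counter(answers)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Example: high agreement yields low entropy; disagreement yields higher entropy.
print(answer_entropy(["96", "96", "96", "96"]))    # 0.0
print(answer_entropy(["96", "104", "96", "112"]))  # 1.5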
Using the most frequent answer (mode) from subthought completions consistently outperforms the baseline, with accuracy gains of up to 13% on AIME2024 and 10% on AIME2025.
@article{hammoud2025beyond,
  title         = {Beyond the Last Answer: Your Reasoning Trace Uncovers More than You Think},
  author        = {Hasan Abed Al Kader Hammoud and Hani Itani and Bernard Ghanem},
  journal       = {arXiv preprint arXiv},
  year          = {2025},
  archivePrefix = {arXiv},
  primaryClass  = {cs.AI}
}