Beyond the Last Answer: Your Reasoning Trace Uncovers More than You Think

Hasan Abed Al Kader Hammoud (KAUST), Hani Itani (KAUST), Bernard Ghanem (KAUST)

Abstract

Large Language Models (LLMs) typically solve complex problems through step-by-step reasoning, and evaluations usually score only the final answer. We challenge this practice by asking: Does the final answer truly reflect the model's best conclusion? Can different reasoning paths lead to better results? We introduce a method that segments a reasoning trace into subthoughts and generates an alternative continuation from the end of each one. Aggregating the resulting answers by choosing the most frequent one (the mode) improves accuracy significantly over using only the final answer. Experiments on the mathematical reasoning datasets AIME2024 and AIME2025 show consistent gains of up to 13% and 10%, respectively.

Our Approach

Traditional evaluation of language models considers only the final answer of a reasoning process. Our method analyzes intermediate reasoning steps (subthoughts) to extract a more reliable consensus answer.

1. Generate Initial Trace
   Generate a complete reasoning trace for a problem using standard greedy decoding, capturing the model's step-by-step thinking process.
2. Segment into Subthoughts
   Split the trace into sequential subthoughts based on linguistic cues such as "Wait," "Alternatively," and "Let me think" that signal progression in reasoning.
3. Generate Completions
   Prompt the model to complete the reasoning and provide an answer after each intermediate subthought, resulting in a distribution of answers.
4. Extract Mode Answer
   Extract the most frequently occurring answer (the mode) from all generated continuations, which gives more reliable results than the final answer alone.
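
The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the cue list contains only the three phrases quoted in step 2, the names `segment_trace`, `cumulative_prefixes`, and `mode_over_subthoughts` are hypothetical, and `complete` stands in for whatever model call finishes the reasoning from a given prefix and returns an extracted answer.

```python
import re
from collections import Counter
from typing import Callable, List, Tuple

# Only the cues quoted in step 2; the full cue list is an assumption.
CUES = ["Wait", "Alternatively", "Let me think"]
_CUE_RE = re.compile(r"\b(?:" + "|".join(re.escape(c) for c in CUES) + r")\b")


def segment_trace(trace: str) -> List[str]:
    """Split a trace into subthoughts, starting a new one at each cue phrase."""
    starts = [m.start() for m in _CUE_RE.finditer(trace)]
    bounds = sorted({0, *starts, len(trace)})
    pieces = [trace[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    return [p for p in pieces if p]


def cumulative_prefixes(subthoughts: List[str]) -> List[str]:
    """Prefix i holds subthoughts 1..i, i.e. the trace cut after subthought i."""
    return [" ".join(subthoughts[: i + 1]) for i in range(len(subthoughts))]


def mode_over_subthoughts(
    trace: str, complete: Callable[[str], str]
) -> Tuple[str, List[str]]:
    """Return (mode answer, all answers). `complete` is a hypothetical model call."""
    answers = [complete(p) for p in cumulative_prefixes(segment_trace(trace))]
    if not answers:
        raise ValueError("trace produced no subthoughts")
    # Ties go to the answer seen first (Counter ordering); this is an assumption.
    return Counter(answers).most_common(1)[0][0], answers
```

In practice `complete` would wrap a sampling call to the model plus an answer extractor; a plain function that maps a prefix string to an answer string is enough to exercise the sketch.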

Visualization of Subthought Analysis in Action

The length of a rectangular garden is 12 meters, and the width is 8 meters. What is the area of the garden?

<think>
First, to find the area of a rectangle, I need to multiply its length by its width.
</think>
Solution 1: 20
<think>
First, to find the area of a rectangle, I need to multiply its length by its width.
Hmm, the length of the rectangle is 12 meters and the width is 8 meters. I'll calculate the area as length × width = 12 × 8.
</think>
Solution 2: 90
<think>
First, to find the area of a rectangle, I need to multiply its length by its width.
Hmm, the length of the rectangle is 12 meters and the width is 8 meters. I'll calculate the area as length × width = 12 × 8.
So, computing 12 × 8 = 96. So the area of the rectangular garden is 96 square meters.
</think>
Solution 3: 96
<think>
First, to find the area of a rectangle, I need to multiply its length by its width.
Hmm, the length of the rectangle is 12 meters and the width is 8 meters. I'll calculate the area as length × width = 12 × 8.
So, computing 12 × 8 = 96. So the area of the rectangular garden is 96 square meters.
Let me double-check this calculation. The formula for the area of a rectangle is length × width. With length = 12 meters and width = 8 meters, we get 12 × 8 = 96 square meters.
</think>
Solution 4: 96
<think>
First, to find the area of a rectangle, I need to multiply its length by its width.
Hmm, the length of the rectangle is 12 meters and the width is 8 meters. I'll calculate the area as length × width = 12 × 8.
So, computing 12 × 8 = 96. So the area of the rectangular garden is 96 square meters.
Let me double-check this calculation. The formula for the area of a rectangle is length × width. With length = 12 meters and width = 8 meters, we get 12 × 8 = 96 square meters.
Actually, I think 12 × 8 is 50, not 96 square meters.
</think>
Solution 5: 50

Results Analysis

Last Answer: A_last = 50 (Wrong Answer)
Mode Answer: A_mode = 96 (Correct Answer)
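
As a quick check of the garden example, the five solutions shown in the visualization can be tallied directly. The variable names are illustrative only.

```python
from collections import Counter

# Solutions 1-5 from the visualization, in subthought order.
answers = [20, 90, 96, 96, 50]

last_answer = answers[-1]
mode_answer, mode_count = Counter(answers).most_common(1)[0]

print(last_answer)              # 50
print(mode_answer, mode_count)  # 96 2
```

The answer 96 appears twice while every other answer appears once, so the mode recovers the correct area even though the final subthought talks the model into 50.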

Key Findings

Answer Evolution Patterns

The evolution of answers across subthoughts reveals distinct patterns: consistent correctness, fluctuating incorrectness, and cases where the mode answer corrects an error in the final answer.

[Figure: Answer evolution patterns showing consistent, fluctuating, and correcting patterns]

Answer Distribution Entropy

The entropy of answer distributions is significantly lower for correctly solved problems than for incorrectly solved ones, indicating that answer consistency signals reliability.

[Figure: Entropy comparison between correct and incorrect answers]
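
The entropy signal can be computed from the same list of per-subthought answers. The sketch below uses Shannon entropy of the empirical answer distribution in bits; the log base and the function name `answer_entropy` are assumptions, since this page does not state how the entropy is normalized.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def answer_entropy(answers: Sequence[Hashable]) -> float:
    """Shannon entropy (in bits) of the empirical distribution of answers."""
    if not answers:
        raise ValueError("need at least one answer")
    n = len(answers)
    # p * log2(1/p) summed over distinct answers, with p = count / n.
    return sum((c / n) * math.log2(n / c) for c in Counter(answers).values())


# Garden example: four distinct answers among five completions.
print(round(answer_entropy([20, 90, 96, 96, 50]), 2))  # 1.92
# Full agreement across subthoughts gives zero entropy.
print(answer_entropy([96, 96, 96, 96, 96]))            # 0.0
```

Low values mean the completions largely agree; higher values mean the answers scatter across subthoughts, the regime this section associates with incorrectly solved problems.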

Accuracy Improvements

Using the most frequent answer (mode) from subthought completions consistently outperforms the final-answer baseline, with accuracy gains of up to 13% on AIME2024 and 10% on AIME2025.

[Figure: Accuracy improvements across different models]
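
A comparison of this kind can be scored with a small helper that evaluates both the last answer and the mode answer against ground truth. The sketch below is illustrative: `last_and_mode_accuracy` is a hypothetical name, and the records in `toy_runs` are made up for demonstration and are not results from the paper.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def last_and_mode_accuracy(
    runs: Sequence[Tuple[List[str], str]]
) -> Tuple[float, float]:
    """Each run is (answers in subthought order, ground truth).

    Returns (accuracy of the last answer, accuracy of the mode answer).
    """
    last_hits = mode_hits = 0
    for answers, truth in runs:
        last_hits += answers[-1] == truth
        mode_hits += Counter(answers).most_common(1)[0][0] == truth
    return last_hits / len(runs), mode_hits / len(runs)


# Made-up toy records for illustration only, NOT data from the paper.
toy_runs = [
    (["96", "96", "50"], "96"),  # mode corrects a wrong last answer
    (["7", "7", "7"], "7"),      # both correct
    (["1", "2", "3"], "5"),      # both wrong
    (["4", "4", "4"], "4"),      # both correct
]
last_acc, mode_acc = last_and_mode_accuracy(toy_runs)
print(last_acc, mode_acc)  # 0.5 0.75
```

On these toy records the mode gains 0.25 in absolute accuracy because one run ends on a wrong answer that most of its subthought completions disagree with.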

Citation

@article{hammoud2025beyond,
  title   = {Beyond the Last Answer: Your Reasoning Trace Uncovers More than You Think},
  author  = {Hasan Abed Al Kader Hammoud and Hani Itani and Bernard Ghanem},
  journal = {arXiv preprint arXiv},
  year    = {2025},
  archivePrefix = {arXiv},
  primaryClass  = {cs.AI}
}