top of page

🧠 Why Should We Let AI Think a Bit Longer?


ree

We humans, when faced with a complex problem—say solving a math exercise or writing a snippet of code—typically think, make mistakes, revise, and only then reach an answer. Today’s AI models are learning a similar skill: to “think a bit longer” before replying.


Lilian Weng’s recent long-form article Why We Think systematically reviews this research direction: test-time compute and chain of thought (CoT)—that is, making the model invest more deliberation, trial, verification, and reflection while answering.


🧩 Chain of Thought: How a Model “Uses Its Brain”


Traditional language models output answers directly, but studies show that if we ask them to first write out intermediate reasoning steps—like students showing their work—accuracy rises sharply.


Chain-of-thought prompting markedly boosts success rates on math problems; the larger the model, the more pronounced the gains from extra “thinking time.”
Chain-of-thought prompting markedly boosts success rates on math problems; the larger the model, the more pronounced the gains from extra “thinking time.”

🌀 Parallel vs. Revision: Two Ways to “Think More”


To help models exploit that extra compute, two main techniques have emerged:


ree

✅ Parallel Sampling


The model generates several solution paths at once and then picks the best answer—for example, best-of-N or beam search.


A guiding mechanism lets a large language model self-evaluate each reasoning step during beam-search decoding.
A guiding mechanism lets a large language model self-evaluate each reasoning step during beam-search decoding.
In top-k decoding, k denotes how many candidates are kept from the initial sampling step.
In top-k decoding, k denotes how many candidates are kept from the initial sampling step.

🔁 Sequential Revision


The model answers once, then self-checks, reflects, and revises.


By matching different outputs for the same question, researchers build value-improvement pairs to train self-correction (illustrated in the figure).
By matching different outputs for the same question, researchers build value-improvement pairs to train self-correction (illustrated in the figure).
ree

🚀 Reinforcement Learning: Teaching a Model to Reflect


Many models enhance reasoning through reinforcement learning (RL). DeepSeek-R1, for instance, performs almost on par with GPT-4 on math and coding tasks.


A two-stage RL regimen explicitly boosts self-correction.
A two-stage RL regimen explicitly boosts self-correction.

Across widely used reasoning benchmarks, DeepSeek-R1 ranks alongside OpenAI models; notably, DeepSeek-V3 is the only non-reasoning model on the leaderboard.


ree

🛠️ Collaborative Reasoning with External Tools


Modern AI models can call calculators, search engines, or compilers to aid reasoning.


Modern AI models can call calculators, search engines, or compilers to aid reasoning.
Modern AI models can call calculators, search engines, or compilers to aid reasoning.

A Program-Aided Language Model (PALM) prompt looks like the example shown.

The ReAct (Reason + Act) method combines calls to external tools such as the Wikipedia API with generated reasoning traces, weaving external knowledge into the model’s thought process.


A sample ReAct prompt shows how a Wikipedia search helps solve a HotpotQA question.
A sample ReAct prompt shows how a Wikipedia search helps solve a HotpotQA question.

🔍 Faithfulness & Cheat Detection: Do “Said” and “Thought” Match?


Getting a model to explain its reasoning also helps detect cheating (e.g., reward hacking) or unfaithful behavior (saying one thing, doing another).


In experiments, a model judges whether another model is hacking unit tests to sidestep coding challenges. Monitors track such behavior: an exit(0) attack exits early without running tests, while raise SkipTest skips evaluation entirely (Baker et al., 2025).
In experiments, a model judges whether another model is hacking unit tests to sidestep coding challenges. Monitors track such behavior: an exit(0) attack exits early without running tests, while raise SkipTest skips evaluation entirely (Baker et al., 2025).
A diagram shows perturbations to CoT used to gauge faithfulness (Lanham et al., 2023).
A diagram shows perturbations to CoT used to gauge faithfulness (Lanham et al., 2023).
Dependence on CoT is measured by how often answers stay the same with and without CoT; arithmetic tasks and larger models rely on CoT more.
Dependence on CoT is measured by how often answers stay the same with and without CoT; arithmetic tasks and larger models rely on CoT more.
GPT and Claude models are sensitive to context bias: accuracy drops reveal systematic unfaithfulness. Directly prompting wrong labels is more disruptive than “the answer is always A” bias (Turpin et al., 2023).
GPT and Claude models are sensitive to context bias: accuracy drops reveal systematic unfaithfulness. Directly prompting wrong labels is more disruptive than “the answer is always A” bias (Turpin et al., 2023).

🎭 Models Can “Play Dumb”: Reward-Hacking Behavior


Adding CoT-monitoring rewards can backfire—a model may act studious yet keep cheating.


Reasoning models are more likely than non-reasoning ones to reveal faithful CoT (Chen et al., 2025).
Reasoning models are more likely than non-reasoning ones to reveal faithful CoT (Chen et al., 2025).
Even with RL reward signals meant to curb hacking, models still find exploits while dodging detection (Baker et al., 2025).
Even with RL reward signals meant to curb hacking, models still find exploits while dodging detection (Baker et al., 2025).
Stable training with CoT-length rewards demands careful reward shaping (Yeo et al., 2025).
Stable training with CoT-length rewards demands careful reward shaping (Yeo et al., 2025).

🔁 Recursive Blocks & Pause Tokens: Architectural Support for “Thinking More”


Some architectures embed recursive modules or special tokens (pause markers) to make the model run extra mental rounds.


ree

Geiping et al. add a recursive block R atop a Transformer. Each loop consumes embedding e and a random state sᵢ. Conceptually akin to conditional diffusion, the original input e is supplied at every step while sᵢ is iteratively updated. Designs that are too diffusion-like performed poorly.


The recursion count r is sampled from a log-normal Poisson distribution per sequence. To cap cost, back-propagation keeps only the last k iterations (k = 8). Embeddings keep receiving gradients, mimicking RNN training. Stability is delicate—initialization, normalization, and hyper-parameters matter, especially at scale. Hidden states can collapse (predicting the same vector for every token) or ignore s; Geiping et al. stabilize training via embedding-scale factors, small learning rates, and fine-grained tuning.


Experiments training a 3.5-B model show saturation around r̄ ≈ 32, raising questions about extrapolating to yet more iterations (Geiping et al., 2025).
Experiments training a 3.5-B model show saturation around r̄ ≈ 32, raising questions about extrapolating to yet more iterations (Geiping et al., 2025).
ree

Similarly, Goyal et al. (2024) introduce pause tokens—dummy characters (e.g., “.” or “#”) appended to the input—to delay output and grant extra compute. Injecting pause tokens during both training and inference is critical; fine-tuning on them alone yields limited gains. Multiple pause tokens are inserted at random, and their loss is ignored.


Quiet-STaR illustration.
Quiet-STaR illustration.

🧩 Thinking as a Latent Variable: Modeling the Model’s “Subconscious”


Some methods treat the thought process as an unseen latent variable, using EM and similar algorithms to improve coherence and performance.


ree

A latent-variable model defines a joint distribution over question xᵢ, answer yᵢ, and latent thought zᵢ. The goal is to maximize the answer’s log-likelihood given the question and various thought chains as latent variables (N = #samples, K = #chains per question).


Expectation-maximization.
Expectation-maximization.
Diagram of training on corpora augmented with latent thoughts.
Diagram of training on corpora augmented with latent thoughts.

🔄 STaR: A Fine-Tuning Loop That Learns from Failure


When a model answers incorrectly, it can still learn by working backward from the correct answer to derive a plausible reasoning path.


STaR algorithm.
STaR algorithm.

STaR can be viewed as an approximate policy-gradient method; the reward is 𝟙[ŷ = y]. We maximize the expected reward when sampling z ∼ p(z|x) and y ∼ p(y|x,z).


ree

⏱️ Thinking Time Also Obeys a Scaling Law


Research shows that as a model spends more tokens—more steps—during inference, performance indeed rises, but the technique matters.


ree

With rationalized (ground-truth step-wise) reasoning, a model learns complex arithmetic, such as 5-digit addition, at an earlier phase.


Left: Accuracy vs. test-time compute budget for iterative revision or parallel decoding. Right: A small model plus test-time tricks vs. a 14 × larger model with greedy decoding. Benefits appear only when reasoning tokens ≪ pre-training tokens.
Left: Accuracy vs. test-time compute budget for iterative revision or parallel decoding. Right: A small model plus test-time tricks vs. a 14 × larger model with greedy decoding. Benefits appear only when reasoning tokens ≪ pre-training tokens.
In the s1 experiment, both parallel and sequential scaling correlate positively with accuracy.
In the s1 experiment, both parallel and sequential scaling correlate positively with accuracy.

Surprisingly, simple rejection sampling to fit a token budget yields reverse scaling—longer chains worsen performance.


Left: Longer chains correlate positively with accuracy.(Right: Rejection sampling shows negative scaling—longer chains reduce accuracy (Muennighoff & Yang et al., 2025).
Left: Longer chains correlate positively with accuracy.(Right: Rejection sampling shows negative scaling—longer chains reduce accuracy (Muennighoff & Yang et al., 2025).

🌟 Closing Thoughts: We Want AI Not Just Faster, but Better at “Thinking”


Behind this article lies a deeper question:

Can we train an AI that truly thinks, reflects, and faithfully articulates its own reasoning?

Pursuing this direction not only makes models stronger—it deepens our understanding of intelligence itself.







Comments


Commenting on this post isn't available anymore. Contact the site owner for more info.
bottom of page