🧠 Why Should We Let AI Think a Bit Longer?
- Nyquiste
- May 13
- 5 min read

We humans, when faced with a complex problem—say solving a math exercise or writing a snippet of code—typically think, make mistakes, revise, and only then reach an answer. Today’s AI models are learning a similar skill: to “think a bit longer” before replying.
Lilian Weng’s recent long-form article Why We Think systematically reviews this research direction: test-time compute and chain of thought (CoT)—that is, making the model invest more deliberation, trial, verification, and reflection while answering.
🧩 Chain of Thought: How a Model “Uses Its Brain”
Traditional language models output answers directly, but studies show that if we ask them to first write out intermediate reasoning steps—like students showing their work—accuracy rises sharply.
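As a concrete illustration, compare a direct prompt with a zero-shot chain-of-thought prompt; the arithmetic question below is a made-up example, not one from the article.

```python
# Direct prompting vs. chain-of-thought prompting (illustrative only; the question and
# any model call around these strings are placeholders).

DIRECT_PROMPT = "Q: A farm has 15 cows, buys 8 more, then sells 6. How many cows are left?\nA:"

COT_PROMPT = (
    "Q: A farm has 15 cows, buys 8 more, then sells 6. How many cows are left?\n"
    "A: Let's think step by step."  # zero-shot CoT trigger (Kojima et al., 2022)
)

# With the second prompt, the model typically writes out "15 + 8 = 23, 23 - 6 = 17"
# before stating the final answer, which tends to improve accuracy on multi-step problems.
```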

🌀 Parallel vs. Revision: Two Ways to “Think More”
To help models exploit that extra compute, two main techniques have emerged:

✅ Parallel Sampling
The model generates several solution paths at once and then picks the best answer—for example, best-of-N or beam search.
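A minimal sketch of parallel sampling, assuming placeholder `generate` and `score` callables (any LLM sampler and any verifier or reward model would do); self-consistency via majority vote is included as the scorer-free variant.

```python
from collections import Counter

def best_of_n(question: str, generate, score, n: int = 8) -> str:
    """Sample N independent solution paths and keep the highest-scored one."""
    candidates = [generate(question) for _ in range(n)]
    return max(candidates, key=lambda ans: score(question, ans))

def self_consistency(question: str, generate, n: int = 8) -> str:
    """Scorer-free variant: majority vote over the final answers."""
    answers = [generate(question) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```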


🔁 Sequential Revision
The model answers once, then self-checks, reflects, and revises.
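A minimal sketch of a sequential-revision loop under the same placeholder-callable assumption (`generate` and `critique` stand in for LLM calls; the stopping rule here is simply a fixed number of rounds):

```python
def revise_loop(question: str, generate, critique, rounds: int = 3) -> str:
    """Answer once, then repeatedly self-check and revise."""
    answer = generate(f"Question: {question}\nAnswer:")
    for _ in range(rounds):
        feedback = critique(question, answer)          # self-check: find mistakes or gaps
        if feedback.strip().lower() == "looks correct":
            break                                      # stop early if no issues were found
        answer = generate(
            f"Question: {question}\nDraft answer: {answer}\n"
            f"Feedback: {feedback}\nRevised answer:"
        )
    return answer
```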


🚀 Reinforcement Learning: Teaching a Model to Reflect
Many models enhance reasoning through reinforcement learning (RL). DeepSeek-R1, for instance, is trained with large-scale RL on rule-based rewards and performs on par with OpenAI's o1 on math and coding tasks.

Across widely used reasoning benchmarks, DeepSeek-R1 ranks alongside OpenAI models; notably, DeepSeek-V3 is the only non-reasoning model on the leaderboard.

🛠️ Collaborative Reasoning with External Tools
Modern AI models can call calculators, search engines, or compilers to aid reasoning.

In a Program-Aided Language model (PAL) prompt, the model writes the computational steps as executable code and delegates the actual calculation to an interpreter; a minimal sketch appears after this paragraph.
The ReAct (Reason + Act) method combines calls to external tools such as the Wikipedia API with generated reasoning traces, weaving external knowledge into the model’s thought process.
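Here is a minimal PAL-style sketch, assuming a placeholder `model_generate` callable and the convention that the generated script stores its result in a variable named `answer` (both are illustrative choices, not the paper's exact format):

```python
# The model is prompted to emit Python that computes the answer; we run the code
# instead of trusting the model's own arithmetic.

PAL_PROMPT = """Q: A bakery sold 23 cupcakes in the morning and 3 times as many in the afternoon.
How many cupcakes were sold in total?

# solution in Python:
morning = 23
afternoon = 3 * morning
answer = morning + afternoon
"""

def solve_with_pal(question: str, model_generate):
    prompt = PAL_PROMPT + f"\nQ: {question}\n\n# solution in Python:\n"
    code = model_generate(prompt)   # the model writes its reasoning as code
    scope: dict = {}
    exec(code, {}, scope)           # the interpreter does the computation
    return scope["answer"]          # convention: the script sets `answer`
```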

🔍 Faithfulness & Cheat Detection: Do “Said” and “Thought” Match?
Getting a model to explain its reasoning also helps detect cheating (e.g., reward hacking) or unfaithful behavior (saying one thing, doing another).




🎭 Models Can “Play Dumb”: Reward-Hacking Behavior
Adding a CoT-monitoring term to the reward can backfire: the model may learn to look studious in its visible chain of thought while continuing to cheat, hiding its true intent from the monitor.



🔁 Recursive Blocks & Pause Tokens: Architectural Support for “Thinking More”
Some architectures embed recursive modules or special tokens (pause markers) to make the model run extra mental rounds.

Geiping et al. add a recursive block R on top of a standard Transformer. Each loop consumes the input embedding e together with a random state sᵢ. Conceptually this is akin to conditional diffusion: the original input e is supplied at every step while sᵢ is iteratively refined, although designs that leaned too far toward diffusion models performed worse.
The recursion count r is sampled per sequence from a log-normal Poisson distribution. To cap the training cost, back-propagation runs only through the last k iterations (k = 8), while the embedding layer still receives gradients at every step, mimicking truncated RNN training. Stability is delicate: initialization, normalization, and hyper-parameters all matter, especially at scale. Hidden states can collapse (predicting the same vector for every token) or learn to ignore s; Geiping et al. stabilize training with an embedding scale factor, small learning rates, and careful tuning.
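For intuition, here is a hypothetical sketch of such a recursive forward pass (class and argument names are mine, not the paper's code; the block's `(state, embedding)` signature and the truncation scheme are assumptions based on the description above):

```python
import torch

class RecursiveDepthLM(torch.nn.Module):
    """Toy recursive-depth wrapper: apply a shared block r times over a latent state."""

    def __init__(self, block: torch.nn.Module, k_backprop: int = 8):
        super().__init__()
        self.block = block            # recursive block R, shared across iterations
        self.k_backprop = k_backprop  # only the last k iterations keep gradients

    def forward(self, e: torch.Tensor, r: int) -> torch.Tensor:
        # e: token embeddings (batch, seq, d_model), re-injected at every iteration.
        # r: recursion count; during training it would be sampled per sequence
        #    (e.g. from a log-normal Poisson distribution), here it is passed in.
        s = torch.randn_like(e)       # random initial latent state s_0
        for i in range(r):
            if i < r - self.k_backprop:
                with torch.no_grad():     # truncated backprop: early iterations are frozen
                    s = self.block(s, e)
            else:
                s = self.block(s, e)      # e still receives gradients in these steps
        return s                          # an output head (omitted) decodes s into logits
```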


Similarly, Goyal et al. (2024) introduce pause tokens—dummy characters (e.g., “.” or “#”) appended to the input—to delay output and grant extra compute. Injecting pause tokens during both training and inference is critical; fine-tuning on them alone yields limited gains. Multiple pause tokens are inserted at random, and their loss is ignored.
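A minimal sketch of pause-token injection, assuming a placeholder token id and a plain list of token ids (the real setup uses the model's tokenizer and a dedicated pause token):

```python
import random

PAUSE_ID = 50257  # hypothetical id for the pause token

def insert_pause_tokens(token_ids: list[int], num_pauses: int) -> tuple[list[int], list[bool]]:
    """Insert pause tokens at random positions; return the new ids plus a mask that is
    False on pause positions so their loss can be ignored during training."""
    ids = list(token_ids)
    for _ in range(num_pauses):
        pos = random.randint(0, len(ids))
        ids.insert(pos, PAUSE_ID)
    loss_mask = [tok != PAUSE_ID for tok in ids]
    return ids, loss_mask
```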

🧩 Thinking as a Latent Variable: Modeling the Model’s “Subconscious”
Some methods treat the thought process as an unseen latent variable, using EM and similar algorithms to improve coherence and performance.

A latent-variable model defines a joint distribution over question xᵢ, answer yᵢ, and latent thought zᵢ. The goal is to maximize the log-likelihood of the answer given the question, marginalizing over the thought chains as latent variables (N question-answer pairs, K sampled chains per question).
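Written out, the objective described above looks roughly like this (a reconstruction consistent with the text, using standard marginal-likelihood notation rather than any one paper's exact formula):

```latex
% N question-answer pairs (x_i, y_i); K latent thought chains z_i^{(k)} sampled per question.
\mathcal{L}(\theta)
  = \sum_{i=1}^{N} \log p_\theta(y_i \mid x_i)
  = \sum_{i=1}^{N} \log \sum_{z} p_\theta(y_i \mid x_i, z)\, p_\theta(z \mid x_i)
  \;\approx\; \sum_{i=1}^{N} \log \frac{1}{K} \sum_{k=1}^{K} p_\theta\bigl(y_i \mid x_i, z_i^{(k)}\bigr),
  \qquad z_i^{(k)} \sim p_\theta(z \mid x_i).
```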


🔄 STaR: A Fine-Tuning Loop That Learns from Failure
When a model answers incorrectly, it can still learn by working backward from the correct answer to derive a plausible reasoning path.

STaR can be viewed as an approximate policy-gradient method; the reward is 𝟙[ŷ = y]. We maximize the expected reward when sampling z ∼ p(z|x) and y ∼ p(y|x,z).
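A hedged sketch of the STaR outer loop (generate, filter by correctness, rationalize failures, fine-tune); `model.generate` and `model.finetune` are placeholder interfaces, not a real API:

```python
def star_iteration(model, dataset):
    """One STaR round: collect correct reasoning traces, then fine-tune on them."""
    finetune_set = []
    for question, gold_answer in dataset:
        rationale, answer = model.generate(question)               # forward attempt with CoT
        if answer == gold_answer:
            finetune_set.append((question, rationale, answer))     # keep correct traces
        else:
            # Rationalization: condition on the gold answer and work backward to a
            # plausible reasoning path, then keep it as if the model had found it.
            hint = f"{question}\n(The answer is {gold_answer}.)"
            rationale, answer = model.generate(hint)
            if answer == gold_answer:
                finetune_set.append((question, rationale, answer))
    model.finetune(finetune_set)  # in STaR, each round fine-tunes from the original base model
    return model
```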

⏱️ Thinking Time Also Obeys a Scaling Law
Research shows that as a model spends more tokens—more steps—during inference, performance indeed rises, but the technique matters.

When trained with rationalized (ground-truth step-by-step) reasoning, a model learns complex arithmetic, such as 5-digit addition, earlier in training.


Surprisingly, simple rejection sampling to fit a token budget yields reverse scaling—longer chains worsen performance.

🌟 Closing Thoughts: We Want AI Not Just Faster, but Better at “Thinking”
Behind this article lies a deeper question:
Can we train an AI that truly thinks, reflects, and faithfully articulates its own reasoning?
Pursuing this direction not only makes models stronger—it deepens our understanding of intelligence itself.