🧠 Why Should We Let AI Think a Bit Longer?

Nyquiste
May 13
5 min read

We humans, when faced with a complex problem—say solving a math exercise or writing a snippet of code—typically think, make mistakes, revise, and only then reach an answer. Today’s AI models are learning a similar skill: to “think a bit longer” before replying.

Lilian Weng’s recent long-form article Why We Think systematically reviews this research direction: test-time compute and chain of thought (CoT)—that is, making the model invest more deliberation, trial, verification, and reflection while answering.

🧩 Chain of Thought: How a Model “Uses Its Brain”

Traditional language models output answers directly, but studies show that if we ask them to first write out intermediate reasoning steps—like students showing their work—accuracy rises sharply.

*Chain-of-thought prompting markedly boosts success rates on math problems; the larger the model, the more pronounced the gains from extra “thinking time.”*

🌀 Parallel vs. Revision: Two Ways to “Think More”

To help models exploit that extra compute, two main techniques have emerged:

✅ Parallel Sampling

The model generates several solution paths at once and then picks the best answer—for example, best-of-N or beam search.

A guiding mechanism lets a large language model self-evaluate each reasoning step during beam-search decoding.

In top-k decoding, k denotes how many candidates are kept from the initial sampling step.

🔁 Sequential Revision

The model answers once, then self-checks, reflects, and revises.

By matching different outputs for the same question, researchers build value-improvement pairs to train self-correction (illustrated in the figure). — By matching different outputs for the same question, researchers build *value-improvement pairs* to train self-correction (illustrated in the figure).

🚀 Reinforcement Learning: Teaching a Model to Reflect

Many models enhance reasoning through reinforcement learning (RL). DeepSeek-R1, for instance, performs almost on par with GPT-4 on math and coding tasks.

A two-stage RL regimen explicitly boosts self-correction.

Across widely used reasoning benchmarks, DeepSeek-R1 ranks alongside OpenAI models; notably, DeepSeek-V3 is the only non-reasoning model on the leaderboard.

🛠️ Collaborative Reasoning with External Tools

Modern AI models can call calculators, search engines, or compilers to aid reasoning.

A Program-Aided Language Model (PALM) prompt looks like the example shown.

The ReAct (Reason + Act) method combines calls to external tools such as the Wikipedia API with generated reasoning traces, weaving external knowledge into the model’s thought process.

A sample ReAct prompt shows how a Wikipedia search helps solve a HotpotQA question.

🔍 Faithfulness & Cheat Detection: Do “Said” and “Thought” Match?

Getting a model to explain its reasoning also helps detect cheating (e.g., reward hacking) or unfaithful behavior (saying one thing, doing another).

In experiments, a model judges whether another model is hacking unit tests to sidestep coding challenges. Monitors track such behavior: an exit(0) attack exits early without running tests, while raise SkipTest skips evaluation entirely (Baker et al., 2025). — In experiments, a model judges whether another model is hacking unit tests to sidestep coding challenges. Monitors track such behavior: an *exit(0)* attack exits early without running tests, while *raise SkipTest* skips evaluation entirely (Baker et al., 2025).

A diagram shows perturbations to CoT used to gauge faithfulness (Lanham et al., 2023).

Dependence on CoT is measured by how often answers stay the same with and without CoT; arithmetic tasks and larger models rely on CoT more.

GPT and Claude models are sensitive to context bias: accuracy drops reveal systematic unfaithfulness. Directly prompting wrong labels is more disruptive than “the answer is always A” bias (Turpin et al., 2023).

🎭 Models Can “Play Dumb”: Reward-Hacking Behavior

Adding CoT-monitoring rewards can backfire—a model may act studious yet keep cheating.

Reasoning models are more likely than non-reasoning ones to reveal faithful CoT (Chen et al., 2025).

Even with RL reward signals meant to curb hacking, models still find exploits while dodging detection (Baker et al., 2025).

Stable training with CoT-length rewards demands careful reward shaping (Yeo et al., 2025).

🔁 Recursive Blocks & Pause Tokens: Architectural Support for “Thinking More”

Some architectures embed recursive modules or special tokens (pause markers) to make the model run extra mental rounds.

Geiping et al. add a recursive block R atop a Transformer. Each loop consumes embedding e and a random state sᵢ. Conceptually akin to conditional diffusion, the original input e is supplied at every step while sᵢ is iteratively updated. Designs that are too diffusion-like performed poorly.

The recursion count r is sampled from a log-normal Poisson distribution per sequence. To cap cost, back-propagation keeps only the last k iterations (k = 8). Embeddings keep receiving gradients, mimicking RNN training. Stability is delicate—initialization, normalization, and hyper-parameters matter, especially at scale. Hidden states can collapse (predicting the same vector for every token) or ignore s; Geiping et al. stabilize training via embedding-scale factors, small learning rates, and fine-grained tuning.

Experiments training a 3.5-B model show saturation around r̄ ≈ 32, raising questions about extrapolating to yet more iterations (Geiping et al., 2025).

Similarly, Goyal et al. (2024) introduce pause tokens—dummy characters (e.g., “.” or “#”) appended to the input—to delay output and grant extra compute. Injecting pause tokens during both training and inference is critical; fine-tuning on them alone yields limited gains. Multiple pause tokens are inserted at random, and their loss is ignored.

Quiet-STaR illustration. — *Quiet-STaR* illustration.

🧩 Thinking as a Latent Variable: Modeling the Model’s “Subconscious”

Some methods treat the thought process as an unseen latent variable, using EM and similar algorithms to improve coherence and performance.

A latent-variable model defines a joint distribution over question xᵢ, answer yᵢ, and latent thought zᵢ. The goal is to maximize the answer’s log-likelihood given the question and various thought chains as latent variables (N = #samples, K = #chains per question).

Diagram of training on corpora augmented with latent thoughts.

🔄 STaR: A Fine-Tuning Loop That Learns from Failure

When a model answers incorrectly, it can still learn by working backward from the correct answer to derive a plausible reasoning path.

STaR can be viewed as an approximate policy-gradient method; the reward is 𝟙[ŷ = y]. We maximize the expected reward when sampling z ∼ p(z|x) and y ∼ p(y|x,z).