🗞️ Chain-of-Thought Reasoning without Prompting

nlp
cot
llm agents mooc
llms
Can you make a LLM reason without prompting? Looks like you can…
Published

September 18, 2024

Special Paper ⭐️

This paper-reading is part of UC Berkeley's Large Language Model Agents MOOC that I'm currently participating in. A more involved readout on the actual lectures will be available in my Blog section under the LLM agents mooc tag, and the associated deep dives should be available under the nlp rabbit hole tag.

What is this paper about?

As noted in the subtitle, this paper asks the question Can LLMs reason effectively without prompting? and then proceeds to answer it. Specifically, by eliciting Chain-of-Thought (CoT) reasoning paths, the paper argues that we can make an LLM produce well-reasoned responses without explicitly pre-training, fine-tuning, or even prompting it.

The authors claim that this can be achieved by making the LLM auto-magically tap into CoT reasoning paths via a special decoding technique called CoT Decoding.

Through objective experiments, the authors also show that, among the decoding strategies they compare, CoT-decoding is the only one that effectively improves language model reasoning.

What are the key contributions of this paper?

The key contributions of this paper can be summarized as follows:

  1. Novel finding that LLMs can reason via simple decoding changes, without the use of prompting. This is the main finding of the paper. Using a specific CoT decoding technique, the authors show that it is indeed possible to make LLMs reason without explicitly prompting them to reason before answering. Furthermore, this methodology works out of the box, without the need for human input in pre-training, fine-tuning, or even prompting during actual inference.

  2. CoT Decoding method. The paper also proposes a novel decoding technique called CoT decoding, built on the top-k decoding technique but focusing on the tokens where the model’s confidence is higher. Compared with normal Greedy Decoding, the model’s reasoning capabilities improve drastically if the decoding logic is changed to select alternative CoT paths based on top-k decoding. For the experiments, k=10 was used and the results were documented.

So then what’s the difference between CoT Decoding and Top-k decoding? In the case of CoT decoding, the model is made to select an alternate decoding path such that the answer’s confidence is higher. This is the next finding.

  3. CoT-decoding reliably selects CoT-paths based on answer confidence. It is proposed that CoT decoding selects CoT paths reliably based on answer confidence. Essentially, if the model branches over the top-k candidate first tokens and follows the paths whose subsequent tokens consistently carry a high probability margin given the prior context and input sequence, those paths are found to reliably tap into CoT reasoning paths, thereby allowing the LLM to reason reliably.

This way, without manually introducing CoT via pre-training, fine-tuning, or even prompting, just by choosing the CoT Decoding technique we can trick the LLM into eliciting the reasoning behavior (a rough sketch of this procedure appears after the takeaways list below).

  4. Other takeaways worth mentioning
  • CoT-decoding elicits reasoning across model scales. i.e. within a given model family, as the model parameters are scaled up, the authors find that the reasoning abilities elicited by CoT decoding also scale for free.
  • CoT-decoding partially closes the reasoning gap between pre-trained and instruction-tuned models, without using any supervised data.
  • The presence of correct CoT paths depends on the task difficulty level and correlates with how prominent the task is in the pre-training distribution. Hence, depending on the model and the complexity of the given task, the usefulness of CoT decoding will greatly depend on whether such CoT paths exist. So it is not a universal solution to the reasoning problem.
  • CoT-decoding unveils the model’s intrinsic vulnerabilities in reasoning. i.e. CoT-decoding is not just useful for making a model reason without prompting, but also for finding inherent weaknesses in the model w.r.t. its reasoning abilities.
  • Finally, the authors also admit that CoT decoding is a compute-intensive decoding method and thereby incurs additional computational cost. Especially if a cost-benefit analysis is done w.r.t. the accuracy gained from the elicited reasoning, the benefits wane quickly. This is one of the potential future works of this paper.
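To make the branch-and-score idea concrete, here is a minimal sketch of CoT decoding as I understand it, using a Hugging Face causal LM. The checkpoint name, the number of new tokens, and the simplification of averaging the confidence margin over the whole continuation (the paper aligns it to the answer span) are my assumptions for illustration, not the authors’ code.

```python
# Minimal CoT-decoding sketch (my reconstruction, not the paper's implementation).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in checkpoint; the paper experiments with far larger models
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def cot_decode(question: str, k: int = 10, max_new_tokens: int = 64):
    """Branch over the top-k first tokens, continue greedily, keep the most confident path."""
    inputs = tokenizer(question, return_tensors="pt")
    with torch.no_grad():
        first_logits = model(**inputs).logits[0, -1]          # distribution over the 1st new token
    top_k = torch.topk(torch.softmax(first_logits, dim=-1), k)

    best_text, best_conf = None, -1.0
    for first_token in top_k.indices:                         # one branch per top-k first token
        ids = torch.cat([inputs["input_ids"][0], first_token.view(1)]).unsqueeze(0)
        margins = []
        for _ in range(max_new_tokens):
            with torch.no_grad():
                step_logits = model(ids).logits[0, -1]
            probs = torch.softmax(step_logits, dim=-1)
            top2 = torch.topk(probs, 2).values
            margins.append((top2[0] - top2[1]).item())        # top-1 vs top-2 probability margin
            next_id = probs.argmax().view(1, 1)               # greedy continuation after the branch
            ids = torch.cat([ids, next_id], dim=1)
            if next_id.item() == tokenizer.eos_token_id:
                break
        confidence = sum(margins) / len(margins)              # simplified "answer confidence"
        text = tokenizer.decode(ids[0, inputs["input_ids"].shape[1]:])
        if confidence > best_conf:
            best_text, best_conf = text, confidence
    return best_text, best_conf
```

The path with the highest average margin is the one returned, which is the sense in which the decoding, rather than any prompt, surfaces the reasoning chain.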

What are my key takeaways?

Well, to me it opened up a whole vista of knowledge in NLP, especially around the encoder/decoder techniques used in the transformer architecture. Since this paper compares CoT decoding with other decoding techniques like Greedy, Beam search, Top-K, and Self-Consistency, I had to go and read about those to make sense of this paper. That’s how I entered the nlp rabbit hole, which I plan to document separately.

Especially the self-consistency technique [1] really intrigued me. This technique basically asks the model to generate several different responses for the same input instruction and then picks the answer that is most consistent among them. This rests on the assumption that the most consistent response is often the most accurate one. Think about it: a fact cannot change, so however many times you ask the same question, the answer should not change. The whole technique banks on this simple common sense. More about this paper later in its separate papershelf post.
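As a toy illustration of that idea, here is a minimal self-consistency sketch. The `generate` and `extract_answer` helpers are hypothetical placeholders standing in for any sampling-based LLM call and answer parser; they are not code from [1].

```python
# Minimal self-consistency sketch: sample several reasoning paths, majority-vote the answer.
from collections import Counter

def self_consistent_answer(prompt: str, generate, extract_answer, n_samples: int = 10):
    # Sample several independent responses for the same prompt (generate is assumed
    # to do temperature sampling so each call can follow a different reasoning path) ...
    answers = [extract_answer(generate(prompt)) for _ in range(n_samples)]
    # ... and keep the answer the sampled paths agree on most often.
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / n_samples
```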

I understood the difference between Encoder-only, Encoder-Decoder, and Decoder-only models, how they differ from the original transformer paper [2], and especially why decoder-only models exist in the first place even though they are still transformer architectures.

The key differences:

  • Encoder-only models process the entire input at once and create bidirectional representations.
  • Decoder-only models process input sequentially and can generate new tokens.
  • Full encoder-decoder models can both process input sequences and generate output sequences.

As such, if one wants to generate text, the presence of a decoder is a must, since an encoder only produces a representation of the input sequence in a multi-dimensional space. Without a decoder, an output sequence cannot be generated. That’s why the original paper [2] was revolutionary in introducing output generation in the first place. As an evolution, much simpler decoder-only models were invented to focus on the generative side of NLP.
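A rough way to see this difference in practice, using off-the-shelf Hugging Face checkpoints as stand-ins (bert-base-uncased for encoder-only, gpt2 for decoder-only); the specific checkpoints are my choice for illustration and have nothing to do with the paper itself.

```python
# Encoder-only vs decoder-only, side by side (illustrative checkpoints, my choice).
from transformers import AutoTokenizer, AutoModel, AutoModelForCausalLM

text = "Chain-of-thought reasoning without prompting"

# Encoder-only: reads the whole input at once and returns contextual embeddings,
# one vector per token -- a representation, not new text.
enc_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
embeddings = encoder(**enc_tok(text, return_tensors="pt")).last_hidden_state
print(embeddings.shape)  # (1, num_tokens, 768)

# Decoder-only: consumes the input left to right and keeps predicting the next
# token, which is what makes free-form text generation possible.
dec_tok = AutoTokenizer.from_pretrained("gpt2")
decoder = AutoModelForCausalLM.from_pretrained("gpt2")
out = decoder.generate(**dec_tok(text, return_tensors="pt"), max_new_tokens=20)
print(dec_tok.decode(out[0]))
```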

Subscribe to Techno Adventure Newsletter

I also publish a newsletter where I share my techno adventures at the intersection of Telecom, AI/ML, SW Engineering, and Distributed Systems. If you’d like my posts delivered directly to your inbox whenever I publish, then consider subscribing to my Substack.

I pinky promise 🤙🏻 . I won’t sell your emails!

References

[1] X. Wang et al., “Self-consistency improves chain of thought reasoning in language models,” 2023. Available: https://arxiv.org/abs/2203.11171
[2] A. Vaswani et al., “Attention is all you need,” 2023. Available: https://arxiv.org/abs/1706.03762