
ByteDance AI Research Unveils Reinforced Fine-Tuning (ReFT) Method to Enhance the Generalizability of Learning LLMs for Reasoning with Math Problem Solving as an Example

One effective method to improve the reasoning skills of LLMs is to employ supervised fine-tuning (SFT) with chain-of-thought (CoT) annotations. However, this approach has limitations in terms of generalization because it heavily depends on the provided CoT data. In scenarios like math problem-solving, each question in the training data typically has only one annotated reasoning path. In the ideal case, it would be more beneficial for the algorithm to learn from multiple annotated reasoning paths associated with a given question, as this could enhance its overall performance and adaptability.

Researchers at ByteDance Research propose a practical method called Reinforced Fine-Tuning (ReFT) to improve how well LLMs trained for reasoning generalize, using math problem-solving as an illustrative example. ReFT first warms up the model with SFT. It then applies online reinforcement learning, specifically the Proximal Policy Optimization (PPO) algorithm. During this stage, the model is exposed to multiple reasoning paths automatically sampled for each question. The rewards for reinforcement learning come naturally from the ground-truth answers, which makes the resulting model more robust and adaptable in its reasoning.
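To make the reward signal concrete, here is a minimal Python sketch of a reward derived from a ground-truth answer. It is an illustration under stated assumptions, not the paper's implementation: the helper names `extract_answer` and `reward`, the "last number in the text is the final answer" convention, and the simple binary 1/0 reward are all hypothetical choices made for this example.

```python
import re
from typing import Optional

# Assumption for this sketch: the final answer is the last number in the text.
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def extract_answer(cot: str) -> Optional[float]:
    """Return the last number in a sampled chain of thought, or None if absent."""
    matches = _NUMBER.findall(cot.replace(",", ""))
    return float(matches[-1]) if matches else None


def reward(cot: str, ground_truth: float, tol: float = 1e-6) -> float:
    """Binary terminal reward: 1.0 if the extracted answer matches, else 0.0."""
    predicted = extract_answer(cot)
    if predicted is None:
        return 0.0
    return 1.0 if abs(predicted - ground_truth) <= tol else 0.0


# Three hypothetical reasoning paths sampled for the same question (answer: 12).
samples = [
    "Tom has 3 boxes of 4 apples, so 3 * 4 = 12. The answer is 12",
    "3 + 4 = 7. The answer is 7",
    "I am not sure.",
]
rewards = [reward(s, 12.0) for s in samples]
```

Because the reward depends only on the final answer, any sampled reasoning path that reaches the correct result is rewarded, not just the single annotated one.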

Recent research has focused on improving CoT prompt design and data engineering, aiming to make CoT comprehensive and fine-grained enough to yield step-by-step reasoning solutions. Some approaches use Python programs as CoT prompts, which yield more accurate reasoning steps and significant improvements over natural language CoT. Another line of work focuses on improving the quality and quantity of CoT data, including efforts to obtain more CoT data from OpenAI’s ChatGPT. Reinforcement learning has also been applied to fine-tuning, improving performance over conventional supervised fine-tuning specifically on math problems.

The study proposes ReFT to enhance the generalizability of LLMs trained for reasoning, specifically in math problem-solving. ReFT combines SFT with online reinforcement learning using the PPO algorithm. The model is first warmed up with SFT and then fine-tuned with reinforcement learning, where multiple reasoning paths are automatically sampled for each question and rewards are derived from the ground-truth answers. In addition, inference-time strategies such as majority voting and re-ranking can be combined with ReFT to boost performance further.
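The two inference-time strategies mentioned above can be sketched in a few lines of Python. This is a hedged illustration only: `majority_vote` and `rerank` are hypothetical helper names, and the `score` function stands in for a learned reward model, which is not shown here.

```python
from collections import Counter
from typing import Callable, Optional, Sequence


def majority_vote(answers: Sequence[Optional[str]]) -> Optional[str]:
    """Return the most frequent answer, ignoring samples with no answer (None)."""
    valid = [a for a in answers if a is not None]
    if not valid:
        return None
    return Counter(valid).most_common(1)[0][0]


def rerank(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the candidate that a (hypothetical) scoring model rates highest."""
    return max(candidates, key=score)


# Final answers extracted from five sampled reasoning paths for one question.
sampled_answers = ["12", "7", "12", None, "12"]
voted = majority_vote(sampled_answers)

# Toy stand-in for a reward model: it simply prefers longer explanations.
candidates = ["The answer is 7", "3 * 4 = 12, so the answer is 12"]
best = rerank(candidates, score=len)
```

Majority voting needs only the sampled answers, while re-ranking depends on a separate scorer, so the quality of that scorer determines how much re-ranking helps.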

ReFT significantly outperforms SFT in both reasoning capability and generalizability on math problem-solving. Extensive experiments on the GSM8K, MathQA, and SVAMP datasets show that ReFT performs better than SFT. Its performance can be boosted further by combining it with inference-time strategies such as majority voting and re-ranking. The researchers also use Python programs as CoT prompts, which show more accurate reasoning steps and significant improvements over natural language CoT. Previous work on reinforcement learning and re-ranking has likewise demonstrated better performance than supervised fine-tuning and majority voting.

In conclusion, ReFT stands out as a fine-tuning method for improving models at solving math problems. Unlike SFT, ReFT optimizes a non-differentiable objective by exploring multiple CoT annotations rather than relying on a single one. Extensive experiments across three datasets using two foundation models have shown that ReFT surpasses SFT in performance and generalization. Models trained with ReFT are compatible with techniques like majority voting and reward model re-ranking. ReFT also outperforms several open-source models of similar size in math problem-solving, highlighting its effectiveness and practical value.


Check out the Paper. All credit for this research goes to the researchers of this project.



Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.

