
    LLMs Can Learn Complex Math from Just One Example: Researchers from University of Washington, Microsoft, and USC Unlock the Power of 1-Shot Reinforcement Learning with Verifiable Reward

By The News | May 3, 2025

Recent advancements in LLMs such as OpenAI-o1, DeepSeek-R1, and Kimi-1.5 have significantly improved their performance on complex mathematical reasoning tasks. A key contributor to these improvements is Reinforcement Learning with Verifiable Reward (RLVR), which uses rule-based rewards, typically a binary signal indicating whether a model's solution to a problem is correct. Beyond enhancing final-output accuracy, RLVR has also been observed to foster beneficial cognitive behaviors such as self-reflection and to improve generalization across tasks. While much research has focused on optimizing reinforcement learning algorithms like PPO and GRPO for greater stability and performance, the influence of training data, both its quantity and its quality, remains less understood. Questions about how much data, and what kind, is truly effective for RLVR are still open, despite work such as LIMR introducing metrics to identify impactful examples and reduce dataset size while maintaining performance.
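To make the reward signal concrete, here is a minimal, hypothetical sketch of a rule-based verifiable reward that checks the model's final answer against a reference; the `extract_final_answer` helper and the exact-match comparison are illustrative assumptions, not the actual graders used in these papers.

```python
# Minimal sketch of a rule-based verifiable reward: 1.0 if the model's final
# answer matches the reference answer, 0.0 otherwise. Real math graders
# typically normalize expressions before comparing; this keeps it simple.
import re

def extract_final_answer(completion: str) -> str:
    """Return the content of the last \\boxed{...} span, a common math-answer convention."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    return matches[-1].strip() if matches else completion.strip()

def verifiable_reward(completion: str, reference_answer: str) -> float:
    """Binary RLVR-style reward: 1.0 if the final answer matches the reference, else 0.0."""
    return 1.0 if extract_final_answer(completion) == reference_answer.strip() else 0.0

# Toy usage
print(verifiable_reward(r"... so the result is \boxed{42}", "42"))  # 1.0
print(verifiable_reward(r"... hence \boxed{41}", "42"))             # 0.0
```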

In contrast to the extensive research on data selection in supervised fine-tuning and human-feedback-based reinforcement learning, the role of data in RLVR has seen limited exploration. While LIMR demonstrated that using a small subset of data (1.4k out of 8.5k examples) could maintain performance, it did not examine the extreme case of minimal data use. Another concurrent study found that training with as few as four examples under PPO led to notable improvements, but this finding was not deeply investigated or benchmarked against full-dataset performance. Although RLVR shows great promise for enhancing reasoning in LLMs, a deeper, systematic study of data efficiency and selection in this context is still lacking.

Researchers from the University of Washington, University of Southern California, Microsoft, University of California, Santa Cruz, and Georgia Institute of Technology show that RLVR can significantly enhance large language models' mathematical reasoning using a single training example, a setup they call 1-shot RLVR. Applying it to Qwen2.5-Math-1.5B improves its MATH500 accuracy from 36.0% to 73.6%, matching the performance obtained with much larger datasets. The improvements generalize across models, tasks, and algorithms. The study also reveals effects such as cross-domain generalization, increased self-reflection, and post-saturation generalization, and highlights the roles of policy gradient loss and entropy-driven exploration.

    The study investigates how much the RLVR training dataset can be reduced while retaining comparable performance to the full dataset. Remarkably, the authors find that a single training example—1-shot RLVR—can significantly boost mathematical reasoning in LLMs. The study shows that this effect generalizes across tasks, models, and domains. Interestingly, training on one example often enhances performance on unrelated domains. A simple data selection strategy based on training accuracy variance is proposed, but results show that even randomly chosen examples can yield major gains. 
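As a rough illustration of a variance-based selection rule, the sketch below ranks candidate examples by the variance of their historical training accuracy and picks the most volatile ones; the `historical_accuracy` input format and the ranking criterion are assumptions for illustration, not the paper's exact procedure.

```python
# Hypothetical sketch: rank candidate training examples by the variance of their
# per-step training accuracy (rollout success rate) observed in a prior run,
# then pick the top-variance example(s) for 1-shot (or few-shot) RLVR.
import statistics

def select_by_accuracy_variance(historical_accuracy: dict[str, list[float]], k: int = 1) -> list[str]:
    """historical_accuracy maps example_id -> accuracy per training step; returns top-k ids."""
    variances = {
        ex_id: statistics.pvariance(accs)
        for ex_id, accs in historical_accuracy.items()
        if len(accs) > 1
    }
    return sorted(variances, key=variances.get, reverse=True)[:k]

# Toy usage: the example whose accuracy fluctuates most during training is selected.
history = {
    "pi_1": [0.1, 0.4, 0.9, 0.6],   # high variance -> informative training signal
    "pi_2": [0.95, 0.96, 0.97],     # already solved -> low variance
}
print(select_by_accuracy_variance(history))  # ['pi_1']
```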

The study evaluates the method using Qwen2.5-Math-1.5B as the primary model, along with Qwen2.5-Math-7B, Llama-3.2-3B-Instruct, and DeepSeek-R1-Distill-Qwen-1.5B. The authors use a 1,209-example subset of the DeepScaleR dataset for data selection and the MATH dataset for comparison. Training is run with the Verl pipeline, using carefully chosen hyperparameters and batch configurations. Surprisingly, training with just one or two examples (especially the examples denoted π1 and π13) leads to strong generalization, even beyond math tasks. This "post-saturation generalization" persists despite signs of overfitting. The study also finds increased model self-reflection and shows that even simple examples can significantly enhance performance across domains.
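Since a single example must still fill whole training batches, one plausible way to picture 1-shot RLVR is to repeat the prompt, sample a group of rollouts, and compute group-normalized advantages in the style of GRPO; the sketch below is a simplified assumption about this setup, not the Verl pipeline's actual interface.

```python
# Hypothetical sketch of how one training example can fill a rollout batch:
# repeat the single prompt, sample G completions, score each with the binary
# verifiable reward, and compute GRPO-style group-normalized advantages.
import numpy as np

def one_shot_rollout_batch(prompt: str, reference: str, sample_fn, reward_fn, group_size: int = 8):
    """sample_fn(prompt) -> completion string; reward_fn(completion, reference) -> 0.0 or 1.0."""
    completions = [sample_fn(prompt) for _ in range(group_size)]
    rewards = np.array([reward_fn(c, reference) for c in completions], dtype=np.float32)
    # Group-normalized advantage: each rollout's reward relative to the group mean,
    # scaled by the group standard deviation (epsilon avoids division by zero).
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
    return completions, rewards, advantages
```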

    In conclusion, the study explores the mechanisms behind the success of 1-shot RLVR, demonstrating that base models already possess strong reasoning abilities. Experiments show that even a single example can significantly improve performance on reasoning tasks, suggesting the model’s inherent capacity for reasoning. The study highlights that policy gradient loss is key to 1-shot RLVR’s effectiveness, with entropy loss further enhancing performance. Additionally, encouraging exploration through techniques like entropy regularization can improve post-saturation generalization. The findings also emphasize the need for careful data selection to optimize the model’s performance, particularly in data-constrained scenarios. 
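To make the loss decomposition described above concrete, here is a minimal sketch of a policy gradient objective combined with an entropy bonus; the coefficient value, tensor shapes, and function signature are illustrative assumptions rather than the study's exact implementation.

```python
# Illustrative sketch of the training objective discussed above: a policy gradient
# term weighted by advantages, plus an entropy bonus that encourages exploration.
import torch

def pg_loss_with_entropy(log_probs: torch.Tensor, advantages: torch.Tensor,
                         token_entropy: torch.Tensor, entropy_coef: float = 0.01) -> torch.Tensor:
    """log_probs: per-token log-probs of sampled tokens; advantages: per-token advantages;
    token_entropy: per-token entropy of the policy distribution."""
    policy_loss = -(log_probs * advantages).mean()      # REINFORCE-style policy gradient term
    entropy_bonus = token_entropy.mean()                # higher entropy -> more exploration
    return policy_loss - entropy_coef * entropy_bonus   # subtracting the bonus rewards entropy
```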


Check out the Paper and GitHub Page.


    Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.

