
    Huawei Introduces Pangu Ultra MoE: A 718B-Parameter Sparse Language Model Trained Efficiently on Ascend NPUs Using Simulation-Driven Architecture and System-Level Optimization

    By The News | May 11, 2025 | 4 Mins Read

    Sparse large language models (LLMs) based on the Mixture of Experts (MoE) framework have gained traction because they scale efficiently by activating only a subset of parameters per token. This dynamic sparsity lets MoE models retain high representational capacity while limiting computation per token. However, as model sizes approach trillions of parameters, training them efficiently requires both algorithmic innovation and tightly integrated hardware-software optimization. These challenges are especially relevant when deploying models on non-standard AI accelerators like Ascend NPUs, which require specific architectural alignment to deliver optimal performance.
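
    To make the idea of per-token sparsity concrete, here is a minimal sketch of top-k expert routing in plain Python. The expert count, the value of k, and the random gate scores are illustrative assumptions, not values taken from Pangu Ultra MoE.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(gate_logits, k):
    """Pick the k highest-scoring experts for one token.

    Returns [(expert_index, weight), ...] with weights renormalized
    so that they sum to 1 over the chosen experts.
    """
    probs = softmax(gate_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]

# Hypothetical sizes, chosen only for illustration.
NUM_EXPERTS, TOP_K = 8, 2
random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]

route = top_k_route(logits, TOP_K)
# Only TOP_K of NUM_EXPERTS expert networks run for this token.
active_fraction = TOP_K / NUM_EXPERTS
```

    In this toy setup only a quarter of the experts run for the token, which is the source of the compute savings described above.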

    A major technical challenge lies in the inefficient utilization of hardware resources while training sparse LLMs. Since only a portion of the parameters is active for each token, workloads across devices become unbalanced, leading to synchronization delays and underused processing power. This imbalance also affects memory utilization, because different experts process different numbers of tokens and some receive more than their capacity allows. These inefficiencies are compounded at large scale, such as across thousands of AI chips, where communication and memory management bottlenecks significantly hinder throughput. The inability to fully harness the computational promise of sparsity in practice restricts the deployment of such models on hardware systems like Ascend NPUs.
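
    The imbalance described above can be quantified with a simple ratio of the busiest expert's load to the mean load. The sketch below uses a hypothetical token-to-expert assignment; the metric is a common diagnostic, not one the article attributes to the Pangu team.

```python
from collections import Counter

def expert_load(assignments, num_experts):
    """Count how many tokens were routed to each expert."""
    counts = Counter(assignments)
    return [counts.get(e, 0) for e in range(num_experts)]

def imbalance_ratio(load):
    """Max load divided by mean load; 1.0 means perfectly balanced."""
    mean = sum(load) / len(load)
    return max(load) / mean

# Hypothetical routing result: 8 tokens, 4 experts.
assignments = [0, 0, 0, 0, 0, 1, 1, 2]
load = expert_load(assignments, num_experts=4)
ratio = imbalance_ratio(load)
```

    When each expert lives on a different device, the step time is set by the busiest one, so a ratio well above 1.0 means the other devices sit idle waiting at the synchronization point.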

    Several strategies have been proposed to tackle these challenges. These include auxiliary losses to balance token distribution across experts and drop-and-pad strategies that limit expert overload by discarding tokens exceeding capacity. However, these techniques either reduce model performance or introduce inefficiencies in memory and computation. Other efforts include heuristic expert placement and traditional communication patterns like All-to-All dispatching, but these often fail to scale well or maintain high throughput. Moreover, standard memory-saving techniques like recomputation are usually coarse-grained, targeting whole layers instead of specific operations, leading to increased runtime without proportional memory savings.
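
    The two mitigation strategies mentioned above can be sketched in a few lines. The drop-and-pad function follows the description in the text; the auxiliary loss uses one widely used form (the fraction of tokens per expert multiplied by the mean router probability, as in Switch Transformer), which is an assumption here since the article does not say which variant is meant.

```python
PAD = -1  # marker for an unused capacity slot

def drop_and_pad(assignments, num_experts, capacity):
    """Give every expert exactly `capacity` slots.

    Tokens beyond an expert's capacity are dropped; experts with
    fewer tokens are padded so that all buffers have the same shape.
    """
    buckets = [[] for _ in range(num_experts)]
    dropped = []
    for token_id, expert in enumerate(assignments):
        if len(buckets[expert]) < capacity:
            buckets[expert].append(token_id)
        else:
            dropped.append(token_id)
    padded = [b + [PAD] * (capacity - len(b)) for b in buckets]
    return padded, dropped

def aux_balance_loss(assignments, router_probs, num_experts):
    """Switch-style load-balancing loss: E * sum_e f_e * P_e."""
    n = len(assignments)
    frac_tokens = [assignments.count(e) / n for e in range(num_experts)]
    mean_prob = [sum(p[e] for p in router_probs) / n for e in range(num_experts)]
    return num_experts * sum(f * p for f, p in zip(frac_tokens, mean_prob))

# Hypothetical inputs for illustration.
padded, dropped = drop_and_pad([0, 0, 0, 1], num_experts=2, capacity=2)
balanced = aux_balance_loss([0, 1], [[0.5, 0.5], [0.5, 0.5]], num_experts=2)
skewed = aux_balance_loss([0, 0], [[0.9, 0.1], [0.9, 0.1]], num_experts=2)
```

    The sketch shows both costs named in the text: dropping discards token 2 entirely (a quality cost), while padding fills expert 1's buffer with an empty slot (wasted memory and compute).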

    Researchers from the Pangu team at Huawei Cloud introduced a highly structured and optimized training approach for large MoE models tailored to Ascend NPUs. They developed Pangu Ultra MoE, a sparse LLM with 718 billion parameters, focusing on aligning model architecture and system design with the capabilities of the Ascend hardware. Their approach begins with a simulation-based model configuration process that evaluates thousands of architecture variants using metrics grounded in actual hardware behavior. These simulations inform design decisions before any physical training is undertaken, thus saving substantial computational resources and enabling informed tuning of model hyperparameters.
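
    The general shape of such a search is to enumerate candidate configurations and rank them with a simulator before training anything. The sketch below is only a skeleton of that idea: the candidate grid is hypothetical, and `toy_score` is an arbitrary placeholder, not Huawei's hardware-grounded simulator.

```python
from itertools import product

def enumerate_configs(grid):
    """Expand a dict of candidate values into a list of config dicts."""
    names = list(grid)
    return [dict(zip(names, values))
            for values in product(*(grid[n] for n in names))]

def toy_score(cfg):
    """Placeholder for a simulator (assumption, for illustration only).

    A real simulator would model compute, memory, and communication
    on the target hardware; this one just applies arbitrary penalties.
    """
    aligned_hidden = 1.0 if cfg["hidden_size"] % 512 == 0 else 0.9
    even_experts = 1.0 if cfg["num_experts"] % 64 == 0 else 0.8
    return aligned_hidden * even_experts / cfg["num_layers"]

def rank_configs(grid, score_fn, top_n):
    """Return the top_n configurations by descending score."""
    return sorted(enumerate_configs(grid), key=score_fn, reverse=True)[:top_n]

# Hypothetical search space.
grid = {
    "num_layers": [48, 61, 72],
    "hidden_size": [6144, 7680, 8192],
    "num_experts": [128, 256],
}
best = rank_configs(grid, toy_score, top_n=3)
```

    Swapping `toy_score` for a function backed by measured operator latencies is what would turn this skeleton into a search grounded in actual hardware behavior.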

    The simulation method analyzes combinations of parameters such as the number of layers, hidden size, and expert count using a five-dimensional parallelism strategy that includes Pipeline Parallelism, Tensor Parallelism, Expert Parallelism, Data Parallelism, and Context Parallelism. The final model configuration adopted by Huawei included 256 experts, a hidden size of 7680, and 61 transformer layers. To further optimize performance, the researchers integrated an Adaptive Pipe Overlap mechanism to mask communication costs and used hierarchical All-to-All communication to reduce inter-node data transfer. They also employed fine-grained recomputation, such as recomputing only key-value vectors in attention modules, and introduced tensor swapping to dynamically offload activation memory to host devices.
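
    To see why a hierarchical scheme can reduce inter-node traffic, it helps to count cross-node messages. The sketch below compares a flat All-to-All with a generic two-level scheme in which devices first aggregate within a node and nodes then exchange one combined message per pair. This is a simplified counting model under assumed cluster sizes, not the exact scheme used for Pangu Ultra MoE.

```python
def flat_inter_node_messages(num_nodes, devices_per_node):
    """Cross-node messages when every device sends to every remote device."""
    total_devices = num_nodes * devices_per_node
    remote_peers = total_devices - devices_per_node
    return total_devices * remote_peers

def hierarchical_inter_node_messages(num_nodes):
    """Cross-node messages when each node sends one aggregated
    message to every other node (intra-node traffic not counted)."""
    return num_nodes * (num_nodes - 1)

# Hypothetical cluster shape.
NODES, DEVICES_PER_NODE = 4, 8
flat = flat_inter_node_messages(NODES, DEVICES_PER_NODE)
hierarchical = hierarchical_inter_node_messages(NODES)
```

    In this model the aggregated messages are larger, so the benefit comes from sending fewer, bigger transfers over the slower inter-node links and keeping the fine-grained exchange on the faster intra-node links.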

    Pangu Ultra MoE achieved a Model FLOPs Utilization (MFU) of 30.0% and processed 1.46 million tokens per second using 6,000 Ascend NPUs. The baseline reached an MFU of 18.9% and 0.61 million tokens per second on 4,000 NPUs. The researchers also introduced dynamic expert placement strategies, which improved device-level load balance and yielded a relative 10% MFU improvement. The model performed competitively on benchmark evaluations, attaining 81.3% on AIME2024, 97.4% on MATH500, 94.8% on CLUEWSC, and 91.5% on MMLU. In the healthcare domain, it outperformed DeepSeek R1 by scoring 87.1% on MedQA and 80.8% on MedMCQA, indicating strength in domain-specific applications.
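
    Because the two runs use different numbers of NPUs, the throughput figures are easier to compare per device. The short calculation below uses only the numbers quoted above; the helper name is ours.

```python
def per_device_throughput(tokens_per_second, num_devices):
    """Average tokens processed per second by a single device."""
    return tokens_per_second / num_devices

# Figures quoted in the article.
optimized = per_device_throughput(1.46e6, 6000)  # tokens/s per NPU
baseline = per_device_throughput(0.61e6, 4000)   # tokens/s per NPU

throughput_gain = optimized / baseline  # per-device speedup
mfu_gain = 30.0 / 18.9                  # ratio of reported MFU values
```

    Per-device throughput rises from about 152.5 to about 243.3 tokens per second, a gain of roughly 1.6x, which is in line with the roughly 1.59x ratio between the two reported MFU values.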

    This study illustrates how the Pangu team at Huawei effectively tackled the core difficulties of training massive MoE models on specialized hardware. Their systematic architecture search, efficient communication techniques, and tailored memory optimizations represent a strong framework for scalable AI training. The work demonstrates practical ways to unlock the performance potential of sparse models and sets a direction for future system-aware AI design.


    Check out the Paper. All credit for this research goes to the researchers of this project.



    Nikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in Material Science, he is exploring new advancements and creating opportunities to contribute.

