    RWKV-X Combines Sparse Attention and Recurrent Memory to Enable Efficient 1M-Token Decoding with Linear Complexity

    By The News | May 5, 2025 | 4 min read

    LLMs built on Transformer architectures face significant scaling challenges when processing long-context inputs because their complexity grows quadratically with sequence length. Linear-complexity alternatives, such as linear attention models, state space models like Mamba, linear RNNs like DeltaNet, and RWKV, address this bottleneck, but they struggle with long-context understanding. For instance, RWKV-7 (2.9B) achieves high accuracy on passkey retrieval up to 28K tokens but degrades rapidly beyond that point, and even continual pretraining with 128K-length data does not remove the limitation. The issue extends beyond RWKV to other architectures such as Mamba, representing a fundamental challenge for this class of models.
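
    To make the scaling argument concrete, the toy PyTorch sketch below contrasts full causal attention, which materializes a T × T score matrix, with a fixed-size recurrent state update. The update rule is an illustrative simplification, not RWKV-7’s actual formulation.

```python
# Minimal sketch: why full attention is O(T^2) in sequence length while a
# recurrent state update is O(T). Shapes only; not RWKV-7's real update rule.
import torch

T, d = 1024, 64                      # sequence length, head dimension
q = torch.randn(T, d)
k = torch.randn(T, d)
v = torch.randn(T, d)

# Full attention: a T x T score matrix, so memory and compute grow quadratically.
scores = (q @ k.T) / d**0.5          # (T, T)
mask = torch.tril(torch.ones(T, T, dtype=torch.bool))
attn_out = torch.softmax(scores.masked_fill(~mask, float("-inf")), dim=-1) @ v

# Linear-complexity alternative: carry a fixed-size state and update it once
# per token, so compute grows linearly and per-token decoding cost is constant.
state = torch.zeros(d, d)            # fixed-size recurrent state
rec_out = torch.empty(T, d)
for t in range(T):
    state = state + torch.outer(k[t], v[t])   # toy additive update (no decay)
    rec_out[t] = q[t] @ state
```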

    Linear-complexity language models have emerged as alternatives to transformer-based architectures, which suffer from quadratic computational demands when processing long sequences. The RWKV model series combines transformer parallelizability during training with an RNN-like recurrent state representation, and has evolved through multiple iterations, from the foundational RWKV-4 through RWKV-5 and RWKV-6 to RWKV-7. Hybrid language models such as Jamba, Zamba, and MiniMax each explore their own way of combining attention with recurrent or state-space components. Native Sparse Attention organizes tokens into temporal blocks with three distinct attention paths: compressed coarse-grained tokens, selectively retained fine-grained tokens, and sliding windows for local contextual information. Other sparse-attention approaches include SeerAttention and Mixture of Block Attention (MoBA).
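
    The sketch below illustrates, under assumed chunk size, top-k, and window values, the three key/value paths described for Native Sparse Attention. It is a schematic of the idea, not the paper’s implementation.

```python
# Illustrative sketch of three attention paths: compressed coarse-grained
# chunks, top-k selected fine-grained chunks, and a local sliding window.
# Chunk size, k, and window are assumed values, not the paper's settings.
import torch

T, d, chunk, topk, window = 512, 64, 64, 2, 128
q, k, v = torch.randn(T, d), torch.randn(T, d), torch.randn(T, d)

# Path 1: compress each chunk of keys/values into one coarse token (mean pool).
k_chunks = k.view(T // chunk, chunk, d)
v_chunks = v.view(T // chunk, chunk, d)
k_coarse, v_coarse = k_chunks.mean(dim=1), v_chunks.mean(dim=1)

def paths_for_query(t: int):
    """Return the three index sets a query at position t would attend to."""
    # Path 2: score coarse chunks and keep the top-k for fine-grained attention.
    n_done = t // chunk                        # chunks fully in the past
    if n_done > 0:
        chunk_scores = q[t] @ k_coarse[:n_done].T
        keep = chunk_scores.topk(min(topk, n_done)).indices.tolist()
    else:
        keep = []
    fine_idx = [i for c in keep for i in range(c * chunk, (c + 1) * chunk)]
    # Path 3: sliding window of recent tokens for local context.
    local_idx = list(range(max(0, t - window), t + 1))
    return keep, fine_idx, local_idx

coarse_ids, fine_ids, local_ids = paths_for_query(300)
```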

    Researchers from the Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen, Hohai University, Nanjing, Shenzhen University, and Qinghai University, Xining, have proposed a novel hybrid architecture called RWKV-X that combines RWKV’s efficiency for short-range modeling with a sparse attention mechanism designed to capture long-range context. Unlike previous hybrid approaches, RWKV-X achieves linear-time complexity during training and constant-time complexity during inference decoding. When continually pretrained on 64K-token sequences, it shows near-perfect accuracy on the 64K passkey retrieval benchmark. The model consistently outperforms previous RWKV-7 models on long-context benchmarks while maintaining strong performance on short-context tasks.
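
    One way to see why per-token decoding cost can stay constant in such a hybrid is sketched below: the recurrent branch carries a fixed-size state, and the sparse branch attends only to a bounded number of selected chunks. All shapes and the way the two branches are combined are illustrative assumptions, not RWKV-X’s actual implementation.

```python
# Hedged sketch of a single decode step in a recurrent + sparse-attention
# hybrid. Neither branch's cost grows with total context length: the state is
# fixed-size and only top-k chunk summaries are expanded into full attention.
import torch

d, chunk, topk = 64, 64, 2

def decode_step(q_t, k_t, v_t, state, chunk_keys, chunk_values):
    # Recurrent branch: one fixed-size state update, independent of history length.
    state = state + torch.outer(k_t, v_t)
    rec = q_t @ state
    # Sparse branch: score chunk summaries, attend only inside the top-k chunks.
    scores = q_t @ chunk_keys.mean(dim=1).T             # (num_chunks,)
    keep = scores.topk(min(topk, len(scores))).indices
    ks = chunk_keys[keep].reshape(-1, d)                # at most topk*chunk keys
    vs = chunk_values[keep].reshape(-1, d)
    attn = torch.softmax(ks @ q_t / d**0.5, dim=0) @ vs
    return rec + attn, state                            # toy combination rule

state = torch.zeros(d, d)
chunk_keys = torch.randn(8, chunk, d)                   # 8 past chunks as an example
chunk_values = torch.randn(8, chunk, d)
q_t, k_t, v_t = (torch.randn(d) for _ in range(3))
out, state = decode_step(q_t, k_t, v_t, state, chunk_keys, chunk_values)
```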

    RWKV-X is a hybrid architecture that integrates RWKV-7 blocks with sparse attention blocks. Rather than training from scratch, RWKV-X builds upon existing models using an interleaved block expansion approach and a zero-initialization mechanism inspired by LLaMA Pro. Training follows a two-stage process:

    • In the first stage, the model is trained on short 1024-token contexts from the MiniPile dataset while all parameters except the newly added blocks are frozen.
    • The second stage involves long-context continual pretraining on the ProLong-64K dataset with a context length of 64K tokens, processing approximately 1 billion tokens in total. During this phase, all parameters are unfrozen and jointly optimized, and training uses a Long-context Cross-Entropy (LongCE) loss that dynamically weights tokens by their importance (a minimal sketch of this schedule follows the list).
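
    The sketch below illustrates this recipe under stated assumptions: new zero-initialized blocks are interleaved into a frozen base (in the spirit of LLaMA Pro), only those blocks are trained in stage one, and everything is unfrozen in stage two with a token-weighted cross-entropy standing in for LongCE. Module names and the weighting rule are placeholders, not the released code.

```python
# Hedged sketch of interleaved block expansion with zero-init plus the
# two-stage freeze/unfreeze schedule described above.
import torch
import torch.nn as nn

class ZeroInitSparseBlock(nn.Module):
    """Stand-in for a sparse-attention block whose output projection is
    zero-initialized, so the expanded model initially behaves like the base."""
    def __init__(self, d):
        super().__init__()
        self.attn = nn.Linear(d, d)           # placeholder for sparse attention
        self.out = nn.Linear(d, d)
        nn.init.zeros_(self.out.weight)
        nn.init.zeros_(self.out.bias)
    def forward(self, x):
        return x + self.out(self.attn(x))     # residual; a no-op at init

def expand_interleaved(base_blocks, d, every=2):
    """Insert a new zero-init block after every `every` base blocks."""
    expanded = nn.ModuleList()
    for i, blk in enumerate(base_blocks, 1):
        expanded.append(blk)
        if i % every == 0:
            expanded.append(ZeroInitSparseBlock(d))
    return expanded

d = 64
base = nn.ModuleList(nn.Linear(d, d) for _ in range(4))  # stand-in RWKV-7 blocks
model = expand_interleaved(base, d)

# Stage 1: freeze everything except the newly added blocks (short 1024-token data).
for m in model:
    is_new = isinstance(m, ZeroInitSparseBlock)
    for p in m.parameters():
        p.requires_grad = is_new

# Stage 2: unfreeze all parameters for 64K-context continual pretraining.
for p in model.parameters():
    p.requires_grad = True

def weighted_ce(logits, targets, weights):
    # logits: (T, V), targets: (T,), weights: (T,) per-token importance weights
    # (placeholder for the LongCE weighting rule).
    losses = nn.functional.cross_entropy(logits, targets, reduction="none")
    return (weights * losses).sum() / weights.sum()
```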

    Short-context evaluation reveals that RWKV-X maintains competitive performance across standard benchmarks. The smaller RWKV-X (0.22B) achieves an average score of 51.0, comparable to RWKV-7’s 51.8. At a larger scale, RWKV-X (3.6B) reaches 71.9, closely matching RWKV-7 (2.9B, 72.8) and Qwen2.5-3B (71.4) while surpassing LLaMA3.2-3B (69.7). These results confirm RWKV-X’s effectiveness as a general-purpose LLM backbone that does not sacrifice performance on shorter contexts. Efficiency analysis likewise shows superior scaling for long sequences: at 128K tokens, RWKV-X achieves a 1.37x speedup over Flash-Attention v3, and the advantage widens as context length increases.

    In this paper, the researchers introduced RWKV-X, a hybrid language model that combines RWKV’s efficiency for modeling short-range dependencies with a novel sparse attention mechanism designed specifically for long-range context. While RWKV-X demonstrates strong performance and efficiency in long-context language modeling, several limitations remain. First, its sparse attention mechanism relies on top-k chunk selection, a heuristic that may overlook semantically relevant dependencies. Second, in the current implementation, sparse-attention decoding runs slower than vanilla RWKV, indicating that further engineering is needed to optimize performance.


    Check out the Paper. Also, don’t forget to follow us on Twitter.



    Sajjad Ansari is a final-year undergraduate at IIT Kharagpur. As a tech enthusiast, he delves into the practical applications of AI, focusing on AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible manner.

