
    Dion: the distributed orthonormal update revolution is here

By admin · August 12, 2025



Training AI models requires choosing an optimizer, and for nearly a decade, AdamW has been the optimizer of choice. Given that durability and success, it was fair to doubt that any further improvement was possible. And yet, last December, a new optimizer called Muon showed serious promise by powering a nanoGPT speedrun. This promise has since proved out, with multiple AI labs (e.g., Kimi-AI and Essential-AI) reporting 2x scale improvements and with the release of the 1T-parameter Kimi K2 model. Restated: you can train a model to similar performance with half as many GPUs.

There’s one fly in the ointment: Muon performs large matrix multiplications inside the optimizer, which demand heavy communication in large models at the scale where FSDP and tensor parallelism become desirable. Going back to the inspiration for Muon, the key idea is an orthonormal update, which sparked the search for more scalable alternative linear algebras realizing the same goal. That is exactly what Dion provides. We have open-sourced this new optimizer to enable anyone to train large models more efficiently at scale.

    What’s an orthonormal update?

Figure 1. Illustration of matrix parameters

At the core of Transformers, a set of input activations is multiplied by a learned weight matrix to produce a new set of output activations. When the weight matrix is updated during training, the resulting change in the output activations generally depends on the direction of the input activations. As a result, the learning rate must be chosen conservatively to accommodate the input direction that induces the largest change. Orthonormalized updates alter this behavior by (approximately) making the change in output activations invariant to the direction of the input. This is achieved by enforcing orthonormality on the update matrix, thereby equalizing its effect across all input directions.
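To make this concrete, here is a minimal sketch of an orthonormalized update in PyTorch. It uses an explicit SVD for clarity (Muon approximates this step with Newton–Schulz iterations instead), and the function and variable names are illustrative rather than taken from any released implementation.

```python
import torch

def orthonormalize(update: torch.Tensor) -> torch.Tensor:
    # SVD: update = U @ diag(S) @ V^T
    U, S, Vt = torch.linalg.svd(update, full_matrices=False)
    # Replacing the singular values with 1 equalizes the update's effect
    # across input directions: for any unit input x in the row space of
    # the update, ||(U @ Vt) @ x|| = 1.
    return U @ Vt

# Usage: apply an orthonormalized step to a weight matrix.
W = torch.randn(512, 256)
G = torch.randn(512, 256)   # gradient (or momentum) for W
lr = 0.02                   # illustrative learning rate
W -= lr * orthonormalize(G)
```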

    What is Dion?

While Muon has shown strong empirical results, scaling it to very large models poses challenges. As reported by Essential AI, applying Muon to large architectures like LLaMA-3 becomes compute-bound, and potentially communication-bound, due to the cost of the Newton–Schulz orthonormalization steps.

Figure 2. Pseudocode of the centralized version of Dion

This is where Dion enters. At a high level, Dion introduces a new axis for scalability: the rank. Specifically, for a given rank r, Dion orthonormalizes only the top r singular directions, reducing communication and compute overhead while preserving performance. Empirically, we observe that the rank needed for good performance grows much more slowly than the number of parameters as models get larger.

Dion implements orthonormalization using amortized power iteration. Power iteration typically extracts the largest singular direction through repeated matrix multiplication. By amortizing this process across optimization steps, applied to the slowly evolving momentum matrix, we reduce the cost to just two matrix multiplications per step. Incorporating a QR decomposition allows us to extract an approximate orthonormal basis spanning the top singular directions, rather than just the leading one. This amortized power iteration is fully compatible with standard distributed training techniques such as FSDP and tensor parallelism: we can orthogonalize a matrix without ever seeing a full row or column of it. Here, we show a simple centralized version, but the technique extends to more complex forms of parallelization, as presented in the paper.

Low-rank approximation would ordinarily introduce error, but Dion overcomes this through an error feedback mechanism: the residual of the low-rank approximation is kept in the momentum matrix, so any systematic gradient structure that is not captured at first accumulates and is eventually applied in a future update.
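Putting these pieces together, the sketch below shows one centralized Dion-style step: the two matrix multiplications of the amortized power iteration against a cached rank-r basis, a QR decomposition to obtain an orthonormal column basis, and the error feedback that keeps the low-rank residual in the momentum matrix. This is a simplified illustration of the scheme described above, not the released implementation; the hyperparameter names and the exact scaling of the error-feedback and update terms are assumptions.

```python
import torch

def dion_step(W, G, M, Q, lr=0.01, mu=0.95):
    # Fold the new gradient into the slowly evolving momentum matrix.
    M = M + G
    # Amortized power iteration: two matmuls against the cached basis Q.
    P = M @ Q                                  # (m, r)
    P, _ = torch.linalg.qr(P)                  # orthonormal columns via QR
    R = M.T @ P                                # (n, r)
    # Error feedback: keep the low-rank residual in the momentum so that
    # structure missed at rank r accumulates and is applied later.
    M = M - (1.0 - mu) * (P @ R.T)
    # Column-normalize R to form the next right basis.
    Q = R / R.norm(dim=0, keepdim=True).clamp_min(1e-8)
    # Approximately orthonormal rank-r update.
    W = W - lr * (P @ Q.T)
    return W, M, Q

# Usage: rank r = 32 for a 512 x 256 weight matrix (rank fraction 1/8).
m, n, r = 512, 256, 32
W, M = torch.randn(m, n), torch.zeros(m, n)
Q, _ = torch.linalg.qr(torch.randn(n, r))      # initial orthonormal basis
for _ in range(3):
    G = torch.randn(m, n)                      # stand-in for a real gradient
    W, M, Q = dion_step(W, G, M, Q)
```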


    How does it work?

Something very strange happened in our experiments. Usually, adding an extra constraint to the way an algorithm works can be expected to decrease overall performance. And indeed, at the 120M-parameter scale of the speedrun, Dion’s update takes more time than Muon’s while not yielding any significant gains. But at larger scales, we observed a different trend: Dion began to outperform Muon.

Figure 3. Wall-clock time speedup of Dion for 3B model training

Why would adding a constraint improve the update rule? The answer lies in what the constraint enforces. Dion achieves a much closer approximation to true orthonormalization than Muon. This precision, initially subtle, becomes increasingly important as the number of singular vectors grows. Over increasing model scale and training steps, this small advantage accumulates, leading to a measurable improvement in performance.
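One way to make "closer approximation to true orthonormalization" concrete: if an update matrix X (with at least as many rows as columns) is truly orthonormalized, then XᵀX is the identity, so the Frobenius norm of XᵀX − I measures the deviation. The helper below is a small self-contained check along these lines, not a reproduction of our measurements.

```python
import torch

def orthonormality_error(X: torch.Tensor) -> float:
    # For an orthonormalized update X (rows >= columns), X^T X = I,
    # so ||X^T X - I||_F quantifies the deviation from orthonormality.
    eye = torch.eye(X.shape[1], dtype=X.dtype)
    return (X.T @ X - eye).norm().item()

G = torch.randn(512, 256, dtype=torch.float64)
# Exact orthonormalization via SVD is (numerically) error-free ...
U, _, Vt = torch.linalg.svd(G, full_matrices=False)
print(orthonormality_error(U @ Vt))        # ~1e-13
# ... while a merely norm-scaled gradient is far from orthonormal.
print(orthonormality_error(G / G.norm()))
```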

This edge grows further with batch size: with larger batches, update quality tends to degrade, but notably more slowly with Dion than with Muon (and Muon is already a significant improvement over AdamW).

Figure 4. Scaling of Dion across different batch sizes

Here you can see how the number of steps needed to reach a fixed pretraining loss, relative to AdamW, varies as batch size grows, for full-rank and 1/4-rank Dion (in orange) and Muon (in blue).

    In our experiments, these benefits extend to various post-training regimes as well.

    We also experimented with rank, discovering empirically that larger models tolerate smaller rank well.

Figure 5. Low-rank Dion across different model sizes

Projecting this trend out to the scale of the LLaMA-3 405B-parameter model suggests that Dion remains fully effective for large dense models even with rank fractions as low as 1/16 or 1/64.

Using hardware timings of the individual update steps suggests a story that looks like this:

Figure 6. Estimated wall-clock time of each optimizer step for Llama 3 405B; lower is better. Muon is highlighted in orange as the baseline, next to Dion with varying rank fractions; suggested rank fractions for a 405B-parameter model are shown in blue. Using Dion with rank fraction 1/16 or lower offers an order-of-magnitude speedup over Muon.

    We’ve open-sourced a PyTorch FSDP2 + Tensor Parallel (TP) implementation of Dion, available via a simple pip install. Our goal is to make faster training with Dion accessible to everyone. As a bonus, the repository also includes a PyTorch FSDP2 implementation of Muon.
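For illustration, wiring Dion into a training loop might look like the sketch below. The import path and the constructor arguments shown here are assumptions made for the example; consult the repository’s README for the actual API.

```python
import torch
from dion import Dion   # assumed import path

model = torch.nn.Linear(1024, 1024)
# rank_fraction is an assumed argument name for the rank knob.
opt = Dion(model.parameters(), lr=0.01, rank_fraction=1 / 16)

loss = model(torch.randn(8, 1024)).square().mean()
loss.backward()
opt.step()
opt.zero_grad()
```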

    Acknowledgements

    We thank Riashat Islam and Pratyusha Sharma for their helpful feedback on the writing and presentation.
