{"id":29496,"date":"2025-02-24T11:11:03","date_gmt":"2025-02-24T03:11:03","guid":{"rendered":"https:\/\/www.1ai.net\/?p=29496"},"modified":"2025-02-24T11:11:03","modified_gmt":"2025-02-24T03:11:03","slug":"%e6%9c%88%e4%b9%8b%e6%9a%97%e9%9d%a2-kimi-%e5%bc%80%e6%ba%90-moonlight%ef%bc%9a30-%e4%ba%bf-160-%e4%ba%bf%e5%8f%82%e6%95%b0%e6%b7%b7%e5%90%88%e4%b8%93%e5%ae%b6%e6%a8%a1%e5%9e%8b","status":"publish","type":"post","link":"https:\/\/www.1ai.net\/en\/29496.html","title":{"rendered":"Dark Side of the Moon Kimi Open Source Moonlight: 3 Billion \/ 16 Billion Parameter Mixed Expert Models"},"content":{"rendered":"<p>February 24th.<a href=\"https:\/\/www.1ai.net\/en\/tag\/%e6%9c%88%e4%b9%8b%e6%9a%97%e9%9d%a2\" title=\"[Sees articles with labels]\" target=\"_blank\" >Dark Side of the Moon<\/a> <a href=\"https:\/\/www.1ai.net\/en\/tag\/kimi\" title=\"[View articles tagged with [Kimi]]\" target=\"_blank\" >Kimi<\/a> Yesterday, we released a new technical report on \"Muon Scalable for LLM Training\" and announced the launch of \"Moonlight\": a 3 billion \/ 16 billion parameterized system trained on Muon.<a href=\"https:\/\/www.1ai.net\/en\/tag\/%e6%b7%b7%e5%90%88%e4%b8%93%e5%ae%b6%e6%a8%a1%e5%9e%8b\" title=\"[Sees articles with labels of [mixed expert model]\" target=\"_blank\" >hybrid expert model<\/a>(MoE). Using 5.7 trillion tokens, better performance is achieved at lower floating point operations counts (FLOPs), thus improving the Pareto efficiency bound.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-29497\" title=\"48c894d9j00ss63h500dxd000kp011vp\" src=\"https:\/\/www.1ai.net\/wp-content\/uploads\/2025\/02\/48c894d9j00ss63h500dxd000kp011vp.jpg\" alt=\"48c894d9j00ss63h500dxd000kp011vp\" width=\"745\" height=\"1363\" \/><\/p>\n<p>Dark Side of the Moon says the team discovered that the Muon optimizer can be used by the<strong>Add weight attenuation, carefully adjust the update magnitude of each parameter<\/strong>and other technologies are extended with the following highlights:<\/p>\n<blockquote>\n<ul>\n<li>These techniques allow Muon to be used out-of-the-box for large-scale training without the need for hyperparameter tuning. Expansion law experiments show that Muon achieves about 2x computational efficiency compared to AdamW, which computes optimal training.<\/li>\n<\/ul>\n<\/blockquote>\n<p>The model used in this thesis is Moonlight-16B-A3B, with a total number of parameters of 15.29B and an activation parameter of 2.24B, which uses the Muon optimizer to obtain the above results with 5.7T Tokens of training data.<\/p>\n<blockquote>\n<ul>\n<li>Our model not only breaks the current Pareto frontiers, but also achieves better performance than previous models with a significantly reduced number of FLOPs required for training.<\/li>\n<li>We open-source a distributed version of our Muon implementation that is optimized for both memory usage and communication efficiency. 
The relevant links are attached below:

- GitHub: https://github.com/MoonshotAI/Moonlight
- Hugging Face: https://huggingface.co/moonshotai
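For reference, a minimal sketch of loading the released weights with Hugging Face transformers. The repo id `moonshotai/Moonlight-16B-A3B-Instruct` is assumed from the organization link above, and chat-template details may differ; check the model card before use.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repo id under the moonshotai organization linked above.
model_id = "moonshotai/Moonlight-16B-A3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",      # use the checkpoint's native precision
    device_map="auto",       # spread the MoE weights across available devices
    trust_remote_code=True,  # custom MoE architectures ship their own modeling code
)

messages = [{"role": "user", "content": "Briefly explain the Muon optimizer."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```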