Moore Threads Supports DeepSeek Open Source Week "Family Bucket"

March 2 news: DeepSeek has officially concluded its Open Source Week. Moore Threads Intelligent Technology (Beijing) Co., Ltd. announced in a post yesterday evening that, in a short period of time, it has achieved full support for DeepSeek's open source projects, covering FlashMLA, DeepEP, DeepGEMM, DualPipe, and the Fire-Flyer File System (3FS).


1AI has compiled Moore Threads' support for the DeepSeek Open Source Week "family bucket" below:

FlashMLA

  • FlashMLA is an open-source repository of efficient MLA (Multi-Head Latent Attention) inference kernels designed to accelerate the computation of the MLA mechanism. It is particularly suited to the DeepSeek family of models (e.g., DeepSeek-V2, V3, and R1).
  • Building on its new MUSA Compute Capability 3.1 architecture, which provides native FP8 compute, Moore Threads upgraded its high-performance linear algebra template library MUTLASS to quickly support FlashMLA. Based on MUTLASS 0.2.0, Moore Threads released the open-source repository MT-FlashMLA, which enables rapid deployment of DeepSeek's FlashMLA.
  • MT-FlashMLA Open Source Address:
  • https://github.com/MooreThreads/MT-flashMLA
  • MUTLASS FlashAttention3 Address:
  • https://github.com/MooreThreads/mutlass/tree/main/experimental/mp31_flash_attention_fwd
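The core idea behind the MLA mechanism that FlashMLA accelerates can be sketched in a few lines. This is an illustrative numpy sketch of latent KV compression, not the FlashMLA API; all shapes, weight names, and dimensions here are made up for the example.

```python
import numpy as np

# Sketch of Multi-Head Latent Attention (MLA): instead of caching full
# per-head K/V tensors, the model caches a small shared latent vector
# per token and re-projects it to K/V at attention time, shrinking the
# KV cache. All names/shapes are illustrative, not the FlashMLA API.

rng = np.random.default_rng(0)
d_model, n_heads, d_head, d_latent, seq = 64, 4, 16, 8, 10

x = rng.standard_normal((seq, d_model))

# Down-projection to the latent KV cache (this is what gets stored).
W_down = rng.standard_normal((d_model, d_latent)) / np.sqrt(d_model)
kv_latent = x @ W_down                      # (seq, d_latent) -- the cache

# Up-projections recover per-head K and V from the latent cache.
W_up_k = rng.standard_normal((d_latent, n_heads * d_head)) / np.sqrt(d_latent)
W_up_v = rng.standard_normal((d_latent, n_heads * d_head)) / np.sqrt(d_latent)
k = (kv_latent @ W_up_k).reshape(seq, n_heads, d_head)
v = (kv_latent @ W_up_v).reshape(seq, n_heads, d_head)

# Standard MHA would cache K and V per head; MLA stores only kv_latent.
mha_cache_floats = 2 * seq * n_heads * d_head   # K + V
mla_cache_floats = seq * d_latent
print(f"MHA cache: {mha_cache_floats} floats, MLA cache: {mla_cache_floats} floats")
```

The cache reduction is what makes a dedicated inference kernel worthwhile: the up-projection adds compute per step, which kernels like FlashMLA are built to hide.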

DeepEP

  • DeepEP is an open-source EP (expert parallelism) communication library for MoE (mixture-of-experts) model training and inference. It is mainly intended for large-model training, especially cluster training that requires expert parallelism, and significantly improves training efficiency by optimizing the utilization of communication channels. Based on its MUSA Compute Capability 3.1 full-featured GPUs, Moore Threads has adapted DeepEP for the first time, supporting the following features:
  • Efficient, optimized all-to-all communication with dispatch & combine support
  • Intra-node communication over MTLink + GPU (MUSA Compute Capability 3.1)
  • High-throughput compute kernels for the training and inference prefill phases
  • Low-latency compute kernels for the inference decoding phase
  • Native support for FP8 data dispatch
  • Flexible control of GPU resources for efficient overlap of computation and communication
  • MT-DeepEP Open Source Address:
  • https://github.com/MooreThreads/MT-DeepEP
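The dispatch/combine round-trip that an EP library like DeepEP performs across GPUs can be sketched in-process. This is an illustrative single-process simulation; real DeepEP moves these buffers over MTLink/RDMA, and the function and variable names here are hypothetical.

```python
import numpy as np

# Sketch of MoE expert-parallel dispatch/combine. "dispatch" groups
# tokens by destination expert (the all-to-all send), each expert
# processes its contiguous slice, and "combine" scatters results back
# to the original token order. Simulated in one process for clarity.

rng = np.random.default_rng(1)
n_tokens, d, n_experts = 6, 4, 3
tokens = rng.standard_normal((n_tokens, d))
expert_of = rng.integers(0, n_experts, size=n_tokens)   # top-1 routing

# dispatch: stable sort groups tokens by expert id
order = np.argsort(expert_of, kind="stable")
dispatched = tokens[order]

# each expert processes its slice (toy expert: scale by expert id + 1)
processed = dispatched.copy()
start = 0
for e in range(n_experts):
    count = int((expert_of == e).sum())
    processed[start:start + count] *= (e + 1)
    start += count

# combine: scatter results back to the original token order
combined = np.empty_like(processed)
combined[order] = processed

# every token ends up transformed by its own expert, in original order
expected = tokens * (expert_of + 1)[:, None]
assert np.allclose(combined, expected)
print("dispatch/combine round-trip OK")
```

In a real cluster the sort becomes an all-to-all exchange, which is why channel utilization, the feature the article highlights, dominates MoE training throughput.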

DeepGEMM

  • DeepGEMM is an FP8 GEMM library supporting dense and mixture-of-experts (MoE) matrix multiplication, powering DeepSeek V3 / R1 training and inference. The repository builds on a C++ template library for high-performance general matrix multiplication (GEMM). Using MUTLASS, Moore Threads has implemented optimized FP8 matrix multiplication on its new GPU architecture, supporting the corresponding DeepGEMM functionality.
  • MUTLASS FP8 GEMM Address:
  • https://github.com/MooreThreads/mutlass/tree/main/examples/02_mp31_fp8_gemm_with_collective_builder
  • https://github.com/MooreThreads/mutlass/tree/main/examples/03_mp31_fp8_scaling_gemm
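The scaled low-precision GEMM pattern behind FP8 libraries like DeepGEMM can be sketched as follows. NumPy has no FP8 dtype, so int8 stands in for the 8-bit format here; the quantize-multiply-rescale structure is the point, not the exact numerics, and the helper name is made up.

```python
import numpy as np

# Sketch of a scaled low-precision GEMM: quantize inputs with
# per-tensor scales, multiply in low precision with a wide
# accumulator, then rescale the result. int8 stands in for FP8.

rng = np.random.default_rng(2)
A = rng.standard_normal((8, 16)).astype(np.float32)
B = rng.standard_normal((16, 8)).astype(np.float32)

def quantize(x, qmax=127):
    scale = np.abs(x).max() / qmax           # per-tensor scale factor
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale

qa, sa = quantize(A)
qb, sb = quantize(B)

# low-precision multiply, wide (int32) accumulator, then dequantize
# with the product of the two input scales
C = (qa.astype(np.int32) @ qb.astype(np.int32)).astype(np.float32) * (sa * sb)

rel_err = np.abs(C - A @ B).max() / np.abs(A @ B).max()
print(f"max relative error vs fp32 GEMM: {rel_err:.3f}")
```

Real FP8 kernels additionally use finer-grained (per-block) scaling to control this quantization error, which is what makes FP8 viable for training.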

DualPipe

  • DualPipe is a bidirectional pipeline-parallel algorithm proposed in DeepSeek-V3, which significantly reduces "pipeline bubbles" (device idle time) by fully overlapping computation and communication across the forward and backward phases. Compared with traditional pipeline parallelism, DualPipe adopts a bidirectional data-flow design in which data is fed in from both ends of the pipeline, markedly improving resource utilization and training efficiency.
  • Moore Threads supports the complete DualPipe algorithm through its open-source deep learning framework Torch-MUSA and the MUSA software stack. MT-DualPipe has been fully integrated into the MT-Megatron framework and the MT-TransformerEngine framework (soon to be open source), enabling a complete reproduction of the DeepSeek V3 training workflow.
  • MT-DualPipe Open Source Address:
  • https://github.com/MooreThreads/MT-DualPipe
  • Torch-MUSA open source address:
  • https://github.com/MooreThreads/Torch_MUSA
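A back-of-envelope sketch shows why a bidirectional schedule shrinks pipeline bubbles. The formula below is the standard bubble estimate for one-directional pipelines; the assumption that feeding from both ends roughly halves the warm-up depth is an illustrative simplification, not the actual DualPipe schedule, which also overlaps compute with communication.

```python
# Toy bubble model: in a one-directional pipeline with p stages and m
# microbatches, each device idles ~(p - 1) slots during warm-up and
# cool-down, giving a bubble fraction of (p - 1) / (m + p - 1).
# Feeding microbatches from both ends (illustrative assumption)
# roughly halves that warm-up depth.

def bubble_fraction(p, m, bidirectional=False):
    warmup = (p - 1) / 2 if bidirectional else (p - 1)
    return warmup / (m + warmup)

p, m = 8, 32  # 8 pipeline stages, 32 microbatches
uni = bubble_fraction(p, m)
bi = bubble_fraction(p, m, bidirectional=True)
print(f"one-directional bubble: {uni:.1%}, bidirectional: {bi:.1%}")
```

Under this toy model the bubble fraction drops from about 17.9% to about 9.9% for the chosen p and m; the real gain depends on the concrete schedule and overlap.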

3FS

  • The Fire-Flyer File System (3FS) is a parallel file system that exploits the full bandwidth of modern SSDs and RDMA networks, serving as a key underpinning for the demanding storage workloads of DeepSeek V3 and R1 training and inference.
  • Moore Threads completed the adaptation of the high-performance distributed file system 3FS within a single day and efficiently developed a storage plugin, achieving seamless integration with its Quam cluster and providing a full-stack storage acceleration solution for AI training, AI inference, scientific computing, and other scenarios.
  • 3FS CSI Driver Address:
  • https://github.com/MooreThreads/csi-driver-3fs