Dialogue with Zheng Weimin, Academician of the Chinese Academy of Engineering: DeepSeek, what's so great about it?

On January 27, the DeepSeek app topped the free app download chart on Apple's US App Store, surpassing ChatGPT. On the same day, DeepSeek also became the No. 1 free app on Apple's China App Store.

What's so great about DeepSeek?


Today, Zheng Weimin, academician of the Chinese Academy of Engineering and professor in the Department of Computer Science at Tsinghua University, along with a number of people from the AI community, spoke with Sina Technology and pointed out the keys to DeepSeek's breakout success.

Currently, the industry's praise for DeepSeek focuses on three main areas.

  • First, at the technical level, DeepSeek-V3, the model behind the DeepSeek app, and DeepSeek-R1, the company's newly launched model, have achieved capabilities comparable to OpenAI's GPT-4o and o1 models, respectively.
  • Second, the two models developed by DeepSeek are much cheaper, at roughly one-tenth the cost of OpenAI's GPT-4o and o1 models.
  • Third, DeepSeek has open-sourced the technology of these two models, which allows more AI teams to develop more AI-native applications based on the most advanced and lowest-cost models.

So how does DeepSeek achieve model cost reduction?

Zheng Weimin noted that DeepSeek's self-developed MLA architecture and DeepSeek MoE architecture played a key role in bringing down the cost of training its models. "MLA compresses the size of the KV Cache mainly by transforming the attention operator, so that more KV Cache can be stored within the same memory capacity," he explained. "This architecture, coupled with the modification of the FFN layer in the DeepSeek-V3 model, implements a very large and very sparse MoE layer, which is the most crucial reason for DeepSeek's low training cost."
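To make the compression idea concrete, here is a minimal sketch, assuming illustrative names and sizes (d_latent, LowRankKVAttention) that are not DeepSeek's actual implementation: instead of caching full keys and values, only a small latent vector per token is cached and projected back up when attention is computed.

```python
# Hypothetical sketch of low-rank KV-cache compression in the spirit of MLA.
# All dimensions and names here are illustrative, not DeepSeek's configuration.
import torch
import torch.nn as nn

class LowRankKVAttention(nn.Module):
    def __init__(self, d_model=1024, d_latent=128, n_heads=8):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        # Compress each token's hidden state into a small latent vector...
        self.kv_down = nn.Linear(d_model, d_latent)
        # ...and expand it back to full keys/values only when attention is computed.
        self.k_up = nn.Linear(d_latent, d_model)
        self.v_up = nn.Linear(d_latent, d_model)

    def _split_heads(self, t):
        b, s, _ = t.shape
        return t.view(b, s, self.n_heads, self.d_head).transpose(1, 2)

    def forward(self, x, kv_cache):
        # x: (batch, 1, d_model) is the newly generated token.
        kv_cache = torch.cat([kv_cache, self.kv_down(x)], dim=1)   # cache only the latent
        q = self._split_heads(self.q_proj(x))
        k = self._split_heads(self.k_up(kv_cache))
        v = self._split_heads(self.v_up(kv_cache))
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).flatten(2)                # merge heads
        return out, kv_cache
```

With this layout the cache holds d_latent numbers per token instead of the full keys and values, which is where the memory saving comes from.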

In technical terms, KV Cache is an optimization technique commonly used to store the key-value pairs of tokens an AI model has already processed in order to improve computational efficiency. During inference, the KV cache acts as a memory bank holding the keys and values of previous tokens, and attention scores are computed against these cached entries. By "exchanging storage for computation", the model avoids recomputing the projections of every previous token from scratch at each step, which improves the efficiency of compute utilization.
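As a generic illustration of this "exchange storage for computation" idea (a sketch, not DeepSeek's code), each decoding step only computes the new token's key and value and appends them to the cache, instead of re-projecting the whole prefix:

```python
# Generic sketch of KV caching in autoregressive decoding.
import torch

def decode_step(x_t, W_q, W_k, W_v, cache_k, cache_v):
    """One decoding step: only the new token's key/value are computed; the rest are reused."""
    q_t = x_t @ W_q                                      # query for the new token
    cache_k = torch.cat([cache_k, x_t @ W_k], dim=0)     # append the new key to the cache
    cache_v = torch.cat([cache_v, x_t @ W_v], dim=0)     # append the new value to the cache
    scores = torch.softmax(q_t @ cache_k.T / cache_k.shape[-1] ** 0.5, dim=-1)
    return scores @ cache_v, cache_k, cache_v            # attend over all cached tokens

# Usage: the per-step cost grows with the cached prefix rather than re-projecting
# every previous token at every step.
d = 64
W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))
cache_k = cache_v = torch.empty(0, d)
for token_embedding in torch.randn(5, 1, d):             # five decoding steps
    out, cache_k, cache_v = decode_step(token_embedding, W_q, W_k, W_v, cache_k, cache_v)
```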

In addition, according to Zheng Weimin, DeepSeek has also solved the performance challenges posed by "very large and very sparse MoE models," and that is "the key reason why DeepSeek is so inexpensive to train."


Currently, enhancing the specialized capabilities of large AI models through MoE (mixture-of-experts) architectures is becoming an industry-recognized approach: the more experts a model contains, the sparser and more efficient it is, but too many experts may also lead to less accurate final generations.
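For reference, this is roughly what sparse activation looks like in a MoE layer: a router scores all experts for each token, but only the top few are actually run. The sketch below uses made-up sizes and a simple top-k router; it is not DeepSeek's architecture.

```python
# Minimal sketch of a sparse MoE layer: each token is routed to only top_k experts.
import torch
import torch.nn as nn

class SparseMoE(nn.Module):
    def __init__(self, d_model=512, n_experts=64, top_k=4):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)      # one routing score per expert
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                                # x: (n_tokens, d_model)
        gate = torch.softmax(self.router(x), dim=-1)
        weight, idx = gate.topk(self.top_k, dim=-1)      # keep only top_k experts per token
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e in idx[:, slot].unique():              # run each selected expert once
                mask = idx[:, slot] == e
                out[mask] += weight[mask, slot, None] * self.experts[e](x[mask])
        return out
```

Adding experts increases total parameters while the per-token compute stays tied to top_k, which is why the model gets sparser and cheaper as the expert count grows.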

According to Zheng Weimin, "DeepSeek's strength lies in its ability to train MoE models; among publicly released MoE models, it is the first to successfully train one this large." Sina Technology learned that, to keep a large-scale MoE model's experts operating in balance, DeepSeek uses an advanced expert load-balancing technique that requires no auxiliary loss function: even though only a small number of expert parameters are activated for each token, the different expert networks are activated at a relatively balanced frequency, preventing activations from piling up on a handful of experts.
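One way an auxiliary-loss-free balancing scheme can work, sketched below with made-up names (bias, gamma) and a simplified update rule: a per-expert bias is added to the routing scores only when choosing which experts win, and that bias is nudged down for over-loaded experts and up for under-loaded ones, so no balancing term has to be added to the training loss.

```python
# Sketch of auxiliary-loss-free expert load balancing via a per-expert routing bias.
# The update rule and constants are illustrative of the general idea only.
import torch

def route_with_bias(scores, bias, top_k, gamma=1e-3):
    """scores: (n_tokens, n_experts) raw router outputs; bias: (n_experts,)."""
    _, idx = (scores + bias).topk(top_k, dim=-1)          # bias steers which experts are picked
    gate = torch.softmax(scores, dim=-1).gather(-1, idx)  # gate weights ignore the bias
    # Count how often each expert was chosen in this batch.
    load = torch.zeros(scores.shape[1])
    load.scatter_add_(0, idx.flatten(), torch.ones(idx.numel()))
    # Nudge over-loaded experts down and under-loaded experts up for the next batch.
    bias = bias - gamma * torch.sign(load - load.mean())
    return idx, gate, bias
```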

In addition, DeepSeek leverages the sparse-activation design of the expert networks to limit the number of GPU cluster nodes each token is dispatched to, which keeps the communication overhead between GPUs stable at a low level.
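A sketch of what such node-limited dispatch can look like (hypothetical grouping and names; real systems fuse this into the routing and all-to-all communication kernels): experts are grouped by the node that hosts them, and each token may only select experts on a few nodes, so its activations travel to at most that many nodes.

```python
# Sketch of node-limited routing: each token may use experts on at most max_nodes nodes.
import torch

def node_limited_topk(scores, experts_per_node, top_k, max_nodes):
    """scores: (n_tokens, n_experts); experts are laid out node by node."""
    n_tokens, n_experts = scores.shape
    n_nodes = n_experts // experts_per_node
    # Score each node by the best expert it offers to this token.
    node_scores = scores.view(n_tokens, n_nodes, experts_per_node).max(dim=-1).values
    keep_nodes = node_scores.topk(max_nodes, dim=-1).indices           # nodes the token may use
    # Mask out every expert that lives on another node, then pick top_k as usual.
    node_of_expert = torch.arange(n_experts) // experts_per_node
    allowed = (node_of_expert[None, :, None] == keep_nodes[:, None, :]).any(-1)
    masked = scores.masked_fill(~allowed, float("-inf"))
    return masked.topk(top_k, dim=-1).indices                          # chosen experts per token
```

Because a token's activations only need to be sent to the few nodes whose experts were selected, cross-node traffic per token stays bounded no matter how many experts the model has in total.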
