The Valley's Going Crazy: How DeepSeek Achieved State-of-the-Art AI with a $6 Million Budget
Over the past weekend, more than 20 people approached me, buzzing about DeepSeek—the new AI contender that's shaking up Silicon Valley.
"DeepSeek R1 made things even scarier."
These were the chilling words of a Meta insider as they grappled with a harsh truth:
Did you know that every single leader in Meta’s GenAI division earns more than the entire $5.6 million it cost to train DeepSeek v3?
—a model that has set the AI world ablaze over the past few days, with the Nasdaq Composite plunging 3.1% and Nvidia 11%.
And btw, there are dozens of such "leaders" on Meta’s payroll.
So how did DeepSeek do it?
DeepSeek's Secret Sauce: High Performance on a Shoestring Budget
DeepSeek-V3 is a Mixture-of-Experts (MoE) language model comprising a staggering 671 billion total parameters, with 37 billion activated for each token. Despite its massive scale, the model achieves performance comparable to leading closed-source models while requiring only a fraction of the training resources. As stated in the technical report:
"DeepSeek-V3 costs only 2.788M GPU hours for its full training. Assuming the rental price of the H800 GPU is $2 per GPU hour, our total training costs amount to only $5.576M."
This level of efficiency is made possible through a combination of architectural innovations, training optimizations, and infrastructure improvements.
DeepSeek has managed to achieve what many thought impossible - developing a state-of-the-art AI model for a fraction of the cost of its competitors. Let's break down the key innovations that made this possible:
1. Innovative Architectural Designs
DeepSeek's engineers didn't just rely on brute-force computing power. They implemented cutting-edge architectural designs like Mixture of Experts (MoE) and Multi-head Latent Attention (MLA). These innovations allow for more efficient data processing and reduced computational demands without sacrificing output quality.
The use of MoE architecture, in particular, enables DeepSeek to scale model capacity while keeping computational costs in check. This approach allows for selective activation of only the most relevant "expert" components for each input, significantly improving efficiency. To put it in simpler words -
Imagine a large AI model, which is like a big, complex brain. This model is made up of many smaller “experts,” each trained to be very good at a specific type of task. In a typical AI model, every piece of data is processed by the entire model. But in MoE, when a new task or problem comes up, the system decides which experts are the best suited to handle it, and only those experts are used to process that task.
This makes MoE more efficient because it doesn’t need to use all the experts every time. It only uses a few, which saves on computational resources and speeds things up. At the same time, because you’re using specialists for specific tasks, the system can perform better and more accurately in those areas.
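To make that concrete, here is a tiny sketch in PyTorch of the general MoE routing idea. It is not DeepSeek's router (their design uses more refined scoring and shared experts), and the layer sizes and top-k value are made up for illustration; it only shows that each token activates just a couple of experts out of many.

```python
# A toy sketch of MoE routing (not DeepSeek's actual code): a router scores
# each token against every expert, and only the top-k experts process it.
import torch
import torch.nn as nn

class ToyMoELayer(nn.Module):
    def __init__(self, d_model=64, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)   # scores each expert per token
        self.experts = nn.ModuleList(
            [nn.Linear(d_model, d_model) for _ in range(n_experts)]
        )
        self.top_k = top_k

    def forward(self, x):                              # x: (tokens, d_model)
        scores = self.router(x).softmax(dim=-1)        # routing probabilities
        weights, idx = scores.topk(self.top_k, dim=-1) # pick top-k experts per token
        out = torch.zeros_like(x)
        for token, (w, experts) in enumerate(zip(weights, idx)):
            for weight, e in zip(w, experts):          # only the chosen experts run
                out[token] += weight * self.experts[int(e)](x[token])
        return out

tokens = torch.randn(4, 64)          # 4 tokens, hidden size 64
print(ToyMoELayer()(tokens).shape)   # torch.Size([4, 64])
```

Only 2 of the 8 toy experts ever run for a given token, which is the whole point: capacity scales with the number of experts while per-token compute stays roughly constant.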
Multi-head Latent Attention (MLA)
At the core of DeepSeek-V3's architecture is the Multi-head Latent Attention (MLA) mechanism, which enables efficient inference by reducing the Key-Value (KV) cache. MLA employs a low-rank joint compression for attention keys and values. This compression significantly reduces memory requirements during generation while maintaining performance comparable to standard Multi-Head Attention.
Specifically, if you are familiar with the concept of “attention”, what DeepSeek builds on here is Multi-head Attention.
Multi-head Attention: It is like having multiple "attention heads," or different ways of looking at the input. Each head looks at the input in a different way, focusing on different relationships or aspects of the data. The idea is to capture a broader understanding of the information by looking at it from multiple angles.
The Challenge: The downside of traditional Multi-head Attention is that it can be memory-heavy, especially when processing long sequences of data. This is because the model needs to store large amounts of data (called Key-Value or KV pairs) to remember important parts of the input. As the model generates more output, the memory needed to store these Key-Value pairs can grow quickly, slowing down the system.
How MLA Helps: MLA improves on this by using a technique called low-rank joint compression for the Key-Value pairs. This means it reduces the size of the information the model has to store, while still keeping enough detail to make accurate predictions. Think of it like compressing a large file into a smaller size, but still keeping most of the important content intact.
The result is that MLA reduces the memory requirements during the model’s generation process (when it's creating new output), making the whole process faster and more efficient. Despite the reduction in memory, MLA still performs similarly to the original Multi-Head Attention, so the quality of the results doesn’t suffer.
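Here is a rough toy in PyTorch of that low-rank compression idea, with invented dimensions. It is my simplification, not DeepSeek's MLA implementation; it only shows why caching one small latent vector per token is much cheaper than caching full per-head keys and values.

```python
# A toy illustration of the low-rank idea behind MLA: cache a small latent per token
# and re-expand it into keys/values when attention is computed.
import torch
import torch.nn as nn

d_model, n_heads, d_head, d_latent = 1024, 16, 64, 128     # hypothetical sizes

down = nn.Linear(d_model, d_latent, bias=False)            # joint compression
up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)   # expand latent -> keys
up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)   # expand latent -> values

h = torch.randn(1, 512, d_model)          # hidden states for 512 cached tokens

# Standard multi-head attention cache: keys + values for every head and token
full_cache = 2 * 512 * n_heads * d_head   # = 1,048,576 numbers

# MLA-style cache: only the compressed latent per token
latent = down(h)                          # (1, 512, d_latent)
mla_cache = 512 * d_latent                # = 65,536 numbers

k = up_k(latent).view(1, 512, n_heads, d_head)   # reconstructed on the fly
v = up_v(latent).view(1, 512, n_heads, d_head)
print(f"KV cache shrinks by ~{full_cache / mla_cache:.0f}x")   # ~16x in this toy setup
```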
In addition, their report notes:
“On top of the efficient architecture of DeepSeek-V2, we pioneer an auxiliary-loss-free strategy for load balancing, which minimizes the performance degradation that arises from encouraging load balancing.
We investigate a Multi-Token Prediction (MTP) objective and prove it beneficial to model performance. It can also be used for speculative decoding for inference acceleration.”
Auxiliary-Loss-Free Load Balancing
A key innovation in DeepSeek-V3 is the auxiliary-loss-free load balancing strategy. This approach introduces a bias term for each expert, which is dynamically adjusted during training to ensure balanced expert utilization without relying on auxiliary losses that can degrade model performance. To understand this better -
Imagine you're organizing a group of friends to help clean up a big room. Each friend is responsible for a specific task, like sweeping, mopping, or organizing the toys. You want each friend to do their fair share of work so that no one gets overwhelmed or underused. But here's the catch: some friends may naturally be faster or slower than others at certain tasks, so you need a way to keep things fair without making it feel like a competition.
In this scenario, Auxiliary-Loss-Free Load Balancing is like giving each friend a “helping hand” based on how much work they've already done, so that they don't end up doing too much or too little. Instead of tracking each person’s performance with complicated metrics or adding extra rules (which could distract them from just getting the work done), you adjust their workload a little as they go along. This helps to keep everyone working at a balanced pace, without slowing things down with extra steps that might complicate the job.
In DeepSeek-V3, each expert (or group of model components working on a task) is like a friend in that group. The bias term is the adjustment — a small change to each expert's task — that helps balance how much work each expert is doing during training. This "adjustment" is dynamic, meaning it changes based on how much work each expert has done so far.
The key innovation here is that this balancing doesn’t require extra rules or penalties (like auxiliary losses), which could slow down or confuse the training process. Instead, the model makes these adjustments naturally and efficiently, just like how you might shift tasks among your friends without making it a big deal. This keeps everything running smoothly without extra effort or negative side effects.
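In code, the idea can be sketched roughly like this. This is my own toy version, not DeepSeek's training loop: the expert count, the batch of fake router scores, and the update speed gamma are all assumptions, but it shows how a per-expert bias steers routing toward balance without adding any loss term.

```python
# Rough sketch: each expert gets a bias added to its routing score only when choosing
# the top-k experts. After each batch, overloaded experts are nudged down and
# underloaded ones nudged up, so load evens out with no auxiliary loss.
import torch

n_experts, top_k, gamma = 8, 2, 0.01      # gamma = bias update speed (assumed)
bias = torch.zeros(n_experts)

def route(scores):
    """scores: (tokens, n_experts) affinity scores from the router."""
    # the bias influences which experts are picked, but not the mixing weights
    _, chosen = (scores + bias).topk(top_k, dim=-1)
    return chosen

for step in range(100):
    scores = torch.randn(256, n_experts)                  # fake router scores
    chosen = route(scores)
    load = torch.bincount(chosen.flatten(), minlength=n_experts).float()
    target = load.mean()
    bias -= gamma * torch.sign(load - target)             # push down busy, lift idle

print(bias)   # experts that were consistently over-picked end up with negative bias
```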
Multi-Token Prediction (MTP)
DeepSeek-V3 incorporates a Multi-Token Prediction objective, which extends the prediction scope to multiple future tokens at each position. This densifies training signals and enables the model to pre-plan its representations for better future token prediction.
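As a rough sketch of what "denser training signals" means, the toy below adds a second prediction head for the token two steps ahead and sums both losses. DeepSeek's actual MTP modules are small sequential transformer blocks rather than extra linear heads, and all the sizes here are invented.

```python
# Minimal sketch of a multi-token prediction objective: besides predicting token t+1,
# an extra head also predicts token t+2, and both losses are added.
import torch
import torch.nn.functional as F

vocab, d = 100, 32
hidden = torch.randn(1, 10, d)                 # fake hidden states for 10 positions
tokens = torch.randint(0, vocab, (1, 12))      # the sequence being modeled

head_next = torch.nn.Linear(d, vocab)          # predicts token t+1
head_plus2 = torch.nn.Linear(d, vocab)         # extra MTP head: predicts token t+2

logits1 = head_next(hidden)                    # (1, 10, vocab)
logits2 = head_plus2(hidden)

loss_next  = F.cross_entropy(logits1.flatten(0, 1), tokens[:, 1:11].flatten())
loss_plus2 = F.cross_entropy(logits2.flatten(0, 1), tokens[:, 2:12].flatten())
loss = loss_next + loss_plus2                  # denser supervision at every position
print(loss.item())
```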
2. Training Optimizations and Infrastructure
FP8 Mixed Precision Training
One of the most significant optimizations in DeepSeek-V3 is the implementation of an FP8 mixed precision training framework. This approach includes:
Fine-grained quantization with tile-wise and block-wise grouping
Increased accumulation precision in Tensor Cores
Online quantization for accurate scales
These approaches adjust the numbers so they fit well into the lower-precision FP8 format, while accumulating intermediate results at higher precision (up to FP32) at later steps, so the final calculations stay accurate even though earlier steps used lower precision (see the simulation below).
The technical report notes:
"We introduce an FP8 mixed precision training framework and, for the first time, validate its effectiveness on an extremely large-scale model."
Efficient Pipeline Parallelism
DeepSeek-V3 employs a novel pipeline parallelism algorithm called DualPipe, which effectively overlaps computation and communication phases:
"DualPipe not only accelerates model training by effectively overlapping forward and backward computation-communication phases, but also reduces the pipeline bubbles."
To explain in plain English, one of the key challenges in training AI models is how to manage communication between different parts of the system. When training a model across many nodes, there’s a lot of back-and-forth communication. DualPipe helps reduce this issue by optimizing how the system communicates. It ensures that while one part of the model is calculating, it can still send data to other parts of the model at the same time, making the whole process much more efficient.
This algorithm, combined with optimized cross-node all-to-all communication kernels, allows for near-zero communication overhead even with fine-grained experts across nodes.
In other words: When you train an AI model across multiple nodes, there’s a need to send data back and forth between them. DeepSeek-V3’s team created special communication tools (called "communication kernels") to make sure the data travels as fast as possible. These tools fully use the available bandwidth (like high-speed data lanes between nodes), so that no time is wasted while data is being transferred.
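As a purely conceptual toy (nothing like the real DualPipe algorithm or its communication kernels), the snippet below shows the overlap principle: start the "communication" asynchronously, keep computing while it is in flight, and only wait when the result is actually needed.

```python
# Toy illustration of overlapping computation with communication.
import time
from concurrent.futures import ThreadPoolExecutor

def all_to_all_stub(data):
    time.sleep(0.5)                 # pretend this is cross-node communication
    return data

def compute(chunk):
    time.sleep(0.5)                 # pretend this is a forward/backward pass
    return chunk * 2

with ThreadPoolExecutor() as pool:
    start = time.time()
    future = pool.submit(all_to_all_stub, [1, 2, 3])   # communication starts...
    local = compute(10)                                 # ...while we keep computing
    remote = future.result()                            # wait only when needed
    print(f"overlapped: {time.time() - start:.1f}s")    # ~0.5s instead of ~1.0s
```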
Memory Optimizations
To reduce memory footprint during training, DeepSeek-V3 implements several techniques:
Recomputation of RMSNorm and MLA up-projections (results are not saved during the forward pass but recomputed when needed in back-prop; see the sketch at the end of this subsection.)
Exponential Moving Average (EMA) of the model weights kept in CPU memory (so model quality can be tracked during training without taking up extra GPU memory.)
Shared embedding and output head for Multi-Token Prediction
When a model is making predictions, it has a “head” that’s responsible for generating the final outputs. In DeepSeek’s case, the model also makes predictions for multiple future tokens at once (this is the Multi-Token Prediction, or MTP). To save memory, DeepSeek places the embedding layer (which processes input data) and the output head (which makes predictions) on the same pipeline-parallel (PP) rank. This means the parameters (weights) and gradients for these parts can be physically shared between the MTP module and the main model, reducing the amount of memory needed to store them. It’s another trick that makes training more memory-efficient without sacrificing performance.
These optimizations enable training without costly Tensor Parallelism, further improving efficiency.
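For the recomputation trick in particular, here is a minimal sketch using PyTorch's stock gradient-checkpointing utility as a stand-in for DeepSeek's selective recomputation of RMSNorm and MLA up-projections; the layer stack and sizes are placeholders.

```python
# Sketch of the recomputation idea via PyTorch gradient checkpointing (not DeepSeek's
# code): activations inside the wrapped block are not kept in memory during the
# forward pass; they are recomputed during back-prop.
import torch
from torch.utils.checkpoint import checkpoint

layer = torch.nn.Sequential(
    torch.nn.LayerNorm(256),        # stands in for RMSNorm / MLA up-projections
    torch.nn.Linear(256, 1024),
    torch.nn.GELU(),
    torch.nn.Linear(1024, 256),
)

x = torch.randn(8, 256, requires_grad=True)
y = checkpoint(layer, x, use_reentrant=False)    # forward without storing activations
y.sum().backward()                               # activations recomputed here
print(x.grad.shape)                              # torch.Size([8, 256])
```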
3. Pre-training and Fine-tuning Process