2084: A Dive into DeepSeek
While deepstock is training, an intro on DeepSeek V3
*deepstock is still training (the initial results look very promising, as we're getting over 60% accuracy on predicting buy/sell ratings)… but while we wait for those results, here is an intro to what makes DeepSeek interesting, by Lisanthan Moodley, MSc, Imperial College London.
Introduction
Over the past two weeks, the AI industry has experienced a major shake-up.
The open-source release of DeepSeek V3 & R1, boasting performance metrics on par with leading commercial models, triggered a massive market reaction: $2 trillion wiped from US market caps, with Nvidia alone losing $600 billion.
This made me wonder: Was the market's extreme reaction justified?
I decided to take a dive into the inner workings of DeepSeek, focusing on the white paper. Flipping through its pages felt overwhelming, as numerous components contribute to its overall performance. After further research, it became clear that the V3 & R1 models weren't an overnight success, but rather the culmination of a year's worth of publications from the DeepSeek team.
Ultimately, these breakthroughs came from simple yet powerful changes to the transformer architecture, combined with optimisation techniques that reach down to the GPU core. U.S. companies have overlooked this, a byproduct of having ample investment and access to the best chips. It's clear that, despite limited resources, DeepSeek's advancements are truly innovative… where there's a will, there's always a way.
Innovation #1
Among DeepSeek's most groundbreaking innovations is their novel attention mechanism, first used in the V2 model (released June 2024) and further refined in V3. This breakthrough tackles a crucial challenge for LLMs: reducing the KV cache for faster inference and training.
Several attention mechanisms had previously been explored to optimise the KV cache (discussed in DeepSeek's V2 white paper). These include: Multi-Head Attention, Grouped Query Attention and Multi-Query Attention.
Current Attention Mechanisms:
Multi-Head Attention (MHA):
Linearly projects the queries (Q), keys (K), and values (V) into lower-dimensional subspaces multiple times (the number of projections is a hyperparameter). Each projection can be viewed as a different "perspective" on the input.
Performs attention in parallel on each of the projected representations (each is known as an "attention head").
Concatenates the outputs from all heads.
Applies a final linear projection to merge these outputs into the desired dimensionality.
This enables the model to focus on different positions of the input simultaneously, capturing various dependencies and contextual information.
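The steps above can be sketched in a few lines of NumPy; a minimal illustration with made-up dimensions and random weights, not any real model's configuration:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads):
    """x: (seq, d_model); each weight matrix: (d_model, d_model)."""
    seq, d_model = x.shape
    d_head = d_model // n_heads
    # 1) Project, then split into heads: (n_heads, seq, d_head)
    def split(h):
        return h.reshape(seq, n_heads, d_head).transpose(1, 0, 2)
    Q, K, V = split(x @ Wq), split(x @ Wk), split(x @ Wv)
    # 2) Scaled dot-product attention on every head in parallel
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)
    out = softmax(scores) @ V                       # (n_heads, seq, d_head)
    # 3) Concatenate heads, 4) final linear projection
    concat = out.transpose(1, 0, 2).reshape(seq, d_model)
    return concat @ Wo

rng = np.random.default_rng(0)
d, heads = 16, 4
x = rng.standard_normal((5, d))
Ws = [rng.standard_normal((d, d)) * 0.1 for _ in range(4)]
y = multi_head_attention(x, *Ws, n_heads=heads)
print(y.shape)  # (5, 16): same shape as the input
```

Note that during autoregressive generation, the per-head K and V tensors are what get cached, which is exactly the memory cost the next two variants attack.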
Multi-Query Attention (MQA):
Similar to MHA, but instead of each head having its own key-value (KV) pair, a single shared KV pair is used across all heads.
The shared KV reduces the memory footprint (less KV cache to load) and can lead to faster training and inference without a significant loss in performance.
Grouped Query Attention (GQA):
Follows the same principles as MHA, but query heads are divided into groups (essentially, multiple query heads share the same key and value matrices).
Each group shares its own set of key-value projections, and the number of shared KV projections is treated as a hyperparameter.
It balances the need for diverse contextual representations with computational efficiency.
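To see why the three schemes differ so much in memory terms, here is the per-token, per-layer KV-cache size for each; the head counts and dimensions below are illustrative, not any particular model's:

```python
# Per-token KV-cache size (elements per layer) under each scheme.
# Illustrative dimensions only (not DeepSeek's actual config).
n_heads, d_head, n_groups = 32, 128, 8

def kv_cache_per_token(n_kv_heads):
    # One K vector and one V vector of size d_head per KV head
    return 2 * n_kv_heads * d_head

mha = kv_cache_per_token(n_heads)    # every head keeps its own K/V
gqa = kv_cache_per_token(n_groups)   # heads share K/V within each group
mqa = kv_cache_per_token(1)          # one K/V pair shared by all heads

print(mha, gqa, mqa)  # 8192 2048 256
```

The ordering mirrors the accuracy trade-off discussed next: less cache generally means less expressive keys and values.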
With reference to the results above, Multi-Head Attention (MHA) achieved the highest accuracy, followed by Grouped Query Attention (GQA), and finally Multi-Query Attention (MQA). It is evident that altering the key and value components, whether by sharing or omitting them, leads to a reduction in accuracy.
Since completely dropping or sharing KV pairs isn't a viable solution, the DeepSeek team introduced an innovative compression technique called Multi-Head Latent Attention.
Proposed Attention Mechanism
Multi-Head Latent Attention (MLA)
The proposed attention mechanism, called Multi-Head Latent Attention, builds on the same principles as MHA while introducing a crucial enhancement: a dimensionality reduction of KV before it is cached. This allows the model to condense information to a “manageable latent space” without sacrificing the semantic richness of the original embeddings.
So how are they able to do this?… (Back to Linear Algebra 101)
Without being too technical: queries, keys, and values are matrices. DeepSeek's approach employs a learned projection matrix to compress the token embeddings (denoted as the latent cₜ in the diagram) into a lower-dimensional space before they pass through the attention mechanism…
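A toy sketch of the compression idea follows; the dimensions and weight names are made up for illustration, and the real MLA has several more projections than shown here:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_latent, seq = 64, 8, 10   # d_latent << d_model

W_down = rng.standard_normal((d_model, d_latent)) * 0.1  # learned down-projection
W_uk   = rng.standard_normal((d_latent, d_model)) * 0.1  # up-projection for keys
W_uv   = rng.standard_normal((d_latent, d_model)) * 0.1  # up-projection for values

h = rng.standard_normal((seq, d_model))  # token hidden states

c = h @ W_down   # the latent c_t: this small tensor is all that gets cached
K = c @ W_uk     # keys reconstructed from the latent at attention time
V = c @ W_uv     # values likewise

print(c.shape, K.shape)  # the cache holds (10, 8) instead of two (10, 64) tensors
```

The saving comes entirely from caching c instead of K and V: the up-projections are recomputed (or folded into other matrices) rather than stored.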
Rotary Positional Embedding (RoPE):
In addition to MLA (Multi-Head Latent Attention), Rotary Positional Embeddings (RoPE) are used. RoPE was first introduced in the paper "RoFormer: Enhanced Transformer with Rotary Position Embedding" and later adopted by Meta in developing its Llama models.
What is RoPE?
At a basic level, RoPE provides positional context to the queries and keys. It is a process that integrates both absolute and relative positional information by multiplying each query and key vector by a position-dependent rotation matrix.
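A small NumPy sketch of the rotation; this is the common "rotate-half" formulation with the conventional base of 10000, not anything DeepSeek-specific:

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate consecutive dimension pairs of x by position-dependent angles."""
    d = x.shape[-1]
    half = d // 2
    freqs = base ** (-np.arange(half) * 2.0 / d)   # one frequency per 2D plane
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

# The key property: the score between a query at position m and a key at
# position n depends only on the offset m - n, because rotating both vectors
# by the same extra amount leaves their dot product unchanged.
rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)
s1 = rope(q, 3) @ rope(k, 1)    # offset of 2
s2 = rope(q, 7) @ rope(k, 5)    # same offset of 2
print(np.isclose(s1, s2))  # True
```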
An issue DeepSeek has stated is that applying RoPE makes it harder to compress the KV pairs. Because matrix multiplication is not commutative, the positional rotation cannot simply be absorbed into the compression, so positional embeddings get "lost" post-compression.
To work around this, DeepSeek uses a decoupling technique that isolates positional embeddings from token embeddings. Rather than combining both embeddings together before projection, DeepSeek first processes and compresses the token embeddings, then introduces the positional embeddings through concatenation (see diagram above).
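A toy sketch of the decoupling idea: the compressed content part carries no position, and a small RoPE-rotated part is concatenated onto it. Sizes and weight names below are made up, and the real MLA also up-projects the content keys from the latent rather than using it directly:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_latent, d_rope = 64, 8, 4   # illustrative sizes only

W_down = rng.standard_normal((d_model, d_latent)) * 0.1  # content compression
W_kr   = rng.standard_normal((d_model, d_rope)) * 0.1    # separate RoPE branch

def rope(x, pos, base=10000.0):
    """Standard rotate-half RoPE applied to a small vector."""
    half = x.shape[-1] // 2
    ang = pos * base ** (-np.arange(half) * 2.0 / x.shape[-1])
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * np.cos(ang) - x2 * np.sin(ang),
                           x1 * np.sin(ang) + x2 * np.cos(ang)], axis=-1)

h = rng.standard_normal(d_model)    # hidden state of one token at position 5
c = h @ W_down                      # compressed content part (position-free)
k_rope = rope(h @ W_kr, pos=5)      # small positional part, rotated by RoPE
key = np.concatenate([c, k_rope])   # decoupled key: content then position

print(key.shape)  # (12,) = 8 content dims + 4 RoPE dims
```

Because the rotation only ever touches the small dedicated branch, the content part can be compressed and cached without the non-commutativity problem above.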
Intuitively, expanding a "compressed token embedding" and combining it with its "true positional embedding" should recover the original context (my thinking 🤔).
Surprisingly, the results above show that "Some Attention (MLA)" outperforms "Full Attention (MHA)," which invites some skepticism.
Personally, I wouldn't rule it out…
We only need about 40% of the words in a sentence to grasp its meaning, so in theory compressing the KV pairs could help remove that "surplus of words". Additionally, it's possible that this architecture leans more heavily on RoPE (compared to other models) to offset potential context losses from the compression process.
Of course, to really confirm these findings, we’d need to take a look at the training data. But for now, it’s definitely an interesting development worth keeping an eye on.
Innovation #2
Mixture of Experts: Dynamic Load Balancing
Mixture of Experts (MoE) is not a new concept: it was first introduced in 1991 and has since been adopted in models from OpenAI and Mistral (e.g., Mixtral). At its core, MoE enhances specialisation within a neural network by activating only specific parts of the model during inference. This selective activation enables DeepSeek to scale its model to 671 billion parameters while requiring only 37 billion active parameters for inference. In contrast, dense models like Llama 3.1 must utilise all 405 billion parameters to generate a response. Ultimately, MoE improves computational efficiency without compromising accuracy.
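The core routing idea can be sketched as generic top-k gating; the expert count, gating function and k below are illustrative, and DeepSeek's actual router (with shared experts and its own affinity scoring) is more elaborate:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_forward(x, experts, W_gate, k=2):
    """Route one token to its top-k experts; the rest stay inactive."""
    scores = softmax(x @ W_gate)     # affinity of this token for each expert
    top = np.argsort(scores)[-k:]    # indices of the k activated experts
    # Output is a score-weighted sum over only the selected experts
    y = sum(scores[i] * (x @ experts[i]) for i in top)
    return y, top

rng = np.random.default_rng(0)
d, n_experts = 16, 8
experts = [rng.standard_normal((d, d)) * 0.1 for _ in range(n_experts)]
W_gate = rng.standard_normal((d, n_experts)) * 0.1
x = rng.standard_normal(d)

y, active = moe_forward(x, experts, W_gate, k=2)
print(sorted(active))  # only 2 of the 8 experts did any work for this token
```

Scaling the same idea up is what lets total parameters grow far faster than per-token compute.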
One challenge of implementing MoE is load balancing. Some experts may receive little to no data, leading to an imbalance and potential routing collapse during training. To address this, a common technique is to introduce an auxiliary loss that penalises experts receiving tokens too frequently. However, if this penalty is too high, it can limit the model’s learning capacity.
To strike a better balance between load distribution and model performance, DeepSeek introduced an auxiliary-loss-free load balancing strategy. Instead of relying on an auxiliary loss penalty, they implemented a bias term to manage expert allocation more effectively.
According to the paper, the bias term is dynamically updated after each training batch. The routing mechanism includes an additional parameter that continuously adjusts, tracking overloaded and underloaded experts to even out data distribution. Alongside this bias term, a very small "complementary sequence-wise auxiliary loss" is included as a safety measure to ensure stable training.
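A toy simulation of the mechanism; the update rule, step size gamma, and the synthetic scores are all assumptions for illustration, and only the overall idea (biasing top-k selection, then nudging the bias against overload after each batch) follows the paper's description:

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, k, gamma = 8, 2, 0.01   # gamma: bias update speed (assumed value)
bias = np.zeros(n_experts)

for batch in range(100):
    # Synthetic affinity scores; expert 0 is made artificially popular
    scores = rng.random((256, n_experts))
    scores[:, 0] += 0.5
    # Bias is added for top-k SELECTION only, not for output weighting
    chosen = np.argsort(scores + bias, axis=1)[:, -k:]
    load = np.bincount(chosen.ravel(), minlength=n_experts)
    # Nudge overloaded experts down and underloaded experts up
    bias -= gamma * np.sign(load - load.mean())

print(bias[0] < 0)  # True: the over-favoured expert was pushed back down
```

Because the bias never enters the loss, the model's learning signal is untouched; balancing happens purely in the routing decision.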
By integrating these novel implementations, DeepSeek has significantly enhanced MoE models, as demonstrated by their high sparsity factor.
The Future
The release of DeepSeek signifies a major breakthrough in deep learning, with industry leaders drawing parallels to the "Sputnik Moment" in the space race.
While DeepSeek's innovations are significant, it's important to distinguish between evolutionary and revolutionary.
Unlike the transformer architecture ("Attention Is All You Need"), DeepSeek's contributions seem to focus more on optimisation and practical improvements. That said, DeepSeek’s approach will reshape the economics of AI training and inference, potentially paving the way for profitability among major players like OpenAI and Anthropic, who currently operate at a loss.
The big question remains: Do these computational gains push us closer to Artificial General Intelligence (AGI), or are they simply solving today's most pressing challenges of speed, cost, and efficiency?
For now, DeepSeek stands as a milestone, one that will redefine how we build and deploy intelligent systems in the future. Whether it proves to be the catalyst for AGI, only time will tell…
(still a bit ambiguous in the white paper tbh)





