
DeepSeek: A New Chapter in Artificial Intelligence

DeepSeek is a true phenomenon. Just days after its release, the Chinese chatbot skyrocketed to the top of the most downloaded apps on the Apple App Store, dethroning ChatGPT. For many, it was a shock that a relatively unknown company with minimal investment—its budget is roughly 14 times smaller than OpenAI’s—managed to outpace, even if temporarily, the undisputed market leader.

History of DeepSeek

DeepSeek was founded by Chinese billionaire Liang Wenfeng. Educated at Zhejiang University, Liang received a Bachelor of Engineering in electronic information engineering in 2007 and a Master of Engineering in information and communication engineering in 2010.

In 2008, Liang formed a team with his university classmates to accumulate data related to financial markets and explore quantitative trading using machine learning. In February 2016, Liang and two other engineering classmates co-founded High-Flyer, a company focused on leveraging artificial intelligence for trading algorithms (making investments, spotting patterns in stock prices, etc.).

In April 2023, High-Flyer established an artificial general intelligence lab dedicated to developing artificial intelligence tools that would not be used for stock trading. By May 2023, this lab became an independent entity named DeepSeek.

In January 2025, DeepSeek made headlines with the release of DeepSeek-R1, a 671-billion-parameter open-source reasoning AI model. The model quickly gained popularity, becoming the number one free app on the U.S. Apple App Store.

Liang Wenfeng

Key milestones:

  • 2016. High-Flyer foundation. This company, initially focused on AI trading algorithms, laid the groundwork for DeepSeek.
  • 2023. DeepSeek foundation. Founded in April as an artificial general intelligence lab under High-Flyer, DeepSeek became independent by May.
  • 2025. DeepSeek-R1 release. It quickly became a worldwide sensation, topping the charts as one of the most popular chatbots.

DeepSeek’s journey to the top has been anything but easy. In its early days, the company relied on Nvidia A100 graphics chips, which were later banned from export to China by the U.S. administration. Developers then switched to the less powerful H800 chips, but those were also restricted soon after. Despite these challenges, DeepSeek managed to train DeepSeek-V3, the base model that R1 builds on, for a reported $5.6 million worth of H800 compute. To put that in perspective, training GPT-4 is estimated to have cost between $50 million and $100 million.

“Our biggest challenge has never been money, it is the embargo on high-end chips,” Liang has said.

DeepSeek R1

DeepSeek features and key technologies

Unlike many other popular chatbots, DeepSeek models are open-source, meaning users can explore how the technology works under the hood. This transparency builds trust, as it ensures the chatbot isn’t a mysterious "black box"—its behavior can be examined and understood by the community.

Open-source components enable developers and researchers to contribute improvements, fix bugs, or adapt the technology for specific needs. Thanks to these community contributions, open-source projects tend to evolve quickly: new features, improvements, and applications emerge faster than with proprietary systems.

DeepSeek models rely on several important technical solutions to work as efficiently as possible:

  • MoE (Mixture of Experts)
  • MLA (Multi-head Latent Attention)
  • MTP (Multi-Token Prediction)

MoE (Mixture of Experts)

Mixture of Experts (MoE) is a machine learning technique that involves combining the predictions of multiple specialized models (the "experts") to improve the overall performance of the model.

Here's how it works in DeepSeek:

  • In DeepSeek-V3, each MoE layer contains a large pool of 256 specialized sub-networks (routed experts). Each expert is a smaller model trained to handle specific patterns or features in the data. For example, in natural language processing, one expert might specialize in syntax, another in semantics, another in domain-specific knowledge, etc.
  • A gating network decides which experts to activate for each input token. It evaluates the input and assigns weights to the experts, selecting the top 8 experts most relevant to the current token. This ensures that only a small subset of the total experts is used at any given time.
  • Instead of running all 256 experts for every token (which would be computationally expensive), only the top 8 experts are activated. This drastically reduces the computational cost while still leveraging the model's full capacity.

By activating only a small subset of experts, DeepSeek achieves resource efficiency. The model can scale to a very large size (in terms of parameters) without a proportional increase in computation.
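
To make the routing concrete, here is a minimal PyTorch sketch of a top-k MoE layer. It is an illustration, not DeepSeek's implementation: the dimensions, layer names, and the slow token-by-token loop are assumptions chosen for readability. What it does show is the mechanism described above: a gating network scores all 256 experts, and only the 8 highest-scoring ones run for each token.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Toy top-k Mixture-of-Experts layer (illustrative; not DeepSeek's code)."""

    def __init__(self, d_model=512, d_hidden=1024, n_experts=256, top_k=8):
        super().__init__()
        # Each "expert" is a small feed-forward network.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])
        self.gate = nn.Linear(d_model, n_experts)  # the gating (router) network
        self.top_k = top_k

    def forward(self, x):  # x: (n_tokens, d_model)
        scores = self.gate(x)                                       # (n_tokens, n_experts)
        weights, expert_ids = scores.topk(self.top_k, dim=-1)       # keep only the top-8 experts per token
        weights = F.softmax(weights, dim=-1)                        # normalize the selected scores
        out = torch.zeros_like(x)
        for t in range(x.size(0)):                                  # simple (slow) loop for clarity
            for slot in range(self.top_k):
                e = int(expert_ids[t, slot])
                out[t] += weights[t, slot] * self.experts[e](x[t])  # run only the chosen experts
        return out
```

For example, `MoELayer()(torch.randn(4, 512))` processes four token vectors while evaluating only 8 of the 256 experts for each one.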

MLA (Multi-head Latent Attention)

Multi-head Latent Attention (MLA) is a powerful mechanism that combines the strengths of multi-head attention and latent space representations to improve efficiency and performance.

Here’s how it works in DeepSeek:

  • In standard multi-head attention, the input is split into multiple "heads," each of which learns to focus on different aspects of the data.
  • In MLA, the input (e.g., a sequence of text tokens) is first encoded into a high-dimensional representation.
  • That representation is then projected into a lower-dimensional latent space using a learned transformation (e.g., a linear layer).
  • The latent representation is split into multiple heads, each of which computes attention scores in the latent space. This allows the model to focus on different aspects of the data efficiently.
  • Because attention operates through a compact latent, MLA shrinks the key-value cache and reduces the computational cost of the attention mechanism, making it feasible to process long sequences.

The combination of multi-head attention and latent representations enables the model to capture complex patterns and relationships in the data, leading to better performance on tasks like natural language processing, recommendation systems, or data analysis.
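
The sketch below illustrates attention computed through a compressed latent, in the spirit of the description above. It is not DeepSeek's MLA implementation (which also treats positional information specially); the dimensions and layer names are assumptions. The key point is that only the small `latent` tensor would need to be cached during generation, with per-head keys and values reconstructed from it on the fly.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentAttention(nn.Module):
    """Multi-head attention over a low-dimensional latent (illustrative sketch)."""

    def __init__(self, d_model=512, n_heads=8, d_latent=64):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.kv_down = nn.Linear(d_model, d_latent)  # compress tokens into the latent (this is what gets cached)
        self.k_up = nn.Linear(d_latent, d_model)     # reconstruct per-head keys from the latent
        self.v_up = nn.Linear(d_latent, d_model)     # reconstruct per-head values from the latent
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):  # x: (batch, seq, d_model)
        b, s, _ = x.shape
        latent = self.kv_down(x)  # (batch, seq, d_latent): much smaller than full keys + values
        q = self.q_proj(x).view(b, s, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k_up(latent).view(b, s, self.n_heads, self.d_head).transpose(1, 2)
        v = self.v_up(latent).view(b, s, self.n_heads, self.d_head).transpose(1, 2)
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # standard attention per head
        return self.out_proj(attn.transpose(1, 2).reshape(b, s, -1))
```

With these toy numbers, the cached state per token shrinks from 2 × 512 values (full keys plus values) to just 64.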

MTP (Multi-Token Prediction)

Variant of Multi-Token Prediction in DeepSeek

Multi-token prediction (MTP) is a technique used in language models to predict multiple tokens (words or subwords) ahead in a sequence, rather than just the next token. This approach can improve the model's ability to generate coherent and contextually accurate text, as it encourages the model to consider longer-term dependencies and structure in the data.

Here's how it works in DeepSeek:

  • The input sequence (e.g., a sentence or paragraph) is encoded using a transformer-based architecture, which captures contextual information about each token in the sequence.
  • DeepSeek models have multiple output heads, each trained to predict a different future token.
  • Head 1 predicts the next token, head 2 the token after that, and head 3 the token three positions ahead.
  • At inference time, the model generates text autoregressively, but the multi-token training ensures that each prediction is informed by a broader context, leading to more coherent and accurate text generation.

DeepSeek applies multi-token prediction to enhance the quality of its language models, making them more effective at tasks like text generation, translation, and summarization.
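
Below is a minimal sketch of the training objective, assuming a generic transformer that produces hidden states. It is a simplification rather than DeepSeek's exact MTP design, and the class name, dimensions, and plain linear heads are assumptions; the idea of a separate head per future offset matches the list above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTokenHeads(nn.Module):
    """Toy multi-token prediction: one output head per future offset (illustrative sketch)."""

    def __init__(self, d_model=512, vocab_size=32000, n_future=3):
        super().__init__()
        self.heads = nn.ModuleList([nn.Linear(d_model, vocab_size) for _ in range(n_future)])

    def loss(self, hidden, targets):
        # hidden:  (batch, seq, d_model) -- transformer outputs for each position
        # targets: (batch, seq)          -- the token ids of the sequence itself
        total = 0.0
        for offset, head in enumerate(self.heads, start=1):
            logits = head(hidden[:, :-offset])   # head k predicts the token at position t + k
            labels = targets[:, offset:]
            total = total + F.cross_entropy(
                logits.reshape(-1, logits.size(-1)), labels.reshape(-1)
            )
        return total / len(self.heads)           # average the per-offset losses
```

At inference time only the next-token prediction is needed; the extra heads serve as an additional training signal (and, reportedly, can be reused for speculative decoding).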

Current models

Two of the most recent DeepSeek models are DeepSeek-V3, released in December 2024, and DeepSeek-R1, released in January 2025.

V3 is a direct competitor to GPT-4o, while R1 can be compared to OpenAI’s o1 model:

Comparison of GPT-4o, o1, DeepSeek-V3, and DeepSeek-R1

DeepSeek-V3 is a reliable choice for most everyday tasks, capable of answering questions on any topic. It excels at natural-sounding conversation and creative output, which makes it a good fit for writing, content creation, or answering generic questions that have likely been answered many times before.

DeepSeek-R1, on the other hand, shines when it comes to complex problem-solving, logic, and step-by-step reasoning tasks. R1 was designed to tackle challenging queries that require thorough analysis and structured solutions. This model is great for coding challenges and logic-heavy questions.

| Model | Strengths | Weaknesses |
| --- | --- | --- |
| DeepSeek-V3 | General coding assistance and explaining concepts in simpler terms | May sacrifice some niche expertise for versatility |
| | Creative writing with deep understanding of context | May overgeneralize in highly technical domains |
| | Well-suited for quick content generation | Lacks reasoning abilities |
| DeepSeek-R1 | Can handle niche technical tasks | Struggles with broader context or ambiguous queries |
| | High accuracy in specialized domains (math or code, for example) | Rigid and formulaic output in creative tasks |
| | Optimized for technical writing such as legal documents or academic summaries | Less adaptable to style and tone changes |

Both models have similar technical specs:

| | DeepSeek-V3 | DeepSeek-R1 |
| --- | --- | --- |
| Base model | DeepSeek-V3-Base | DeepSeek-V3-Base |
| Type | General-purpose model | Reasoning model |
| Parameters | 671B (37B activated) | 671B (37B activated) |
| Context length | 128K | 128K |

The key difference is in their training. Here’s how DeepSeek-R1 was trained on top of V3:

  • Cold Start Fine-tuning: Rather than overwhelming the model with large volumes of data right away, it begins with a smaller, high-quality dataset to refine its responses from the outset.
  • Reinforcement Learning Without Human Labels: Unlike V3, DeepSeek-R1 develops its reasoning ability through reinforcement learning with automatically checkable, rule-based rewards rather than human-labeled data, meaning it learns to reason independently instead of just mimicking training data.
  • Rejection Sampling for Synthetic Data: The model generates multiple responses per prompt, and only the best-quality answers are kept as synthetic training data (see the sketch after this list).
  • Blending Supervised & Synthetic Data: The training data merges the best AI-generated responses with the supervised fine-tuned data from DeepSeek-V3.
  • Final RL Process: A final round of reinforcement learning ensures the model generalizes well to a wide variety of prompts and can reason effectively across topics.
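
As a rough illustration of the rejection-sampling step referenced above, the sketch below samples several answers per prompt and keeps only the best one. The `generate_fn` and `score_fn` callables are hypothetical placeholders, e.g. a sampling call to the model and a rule-based reward that checks the final answer:

```python
def build_sft_dataset(generate_fn, score_fn, prompts, n_samples=4):
    """Rejection sampling sketch: sample several answers per prompt, keep only the best one.

    generate_fn(prompt) -> str          produces one candidate answer (e.g. sampled from the model)
    score_fn(prompt, answer) -> float   rates it (e.g. a rule-based check of the final answer)
    """
    dataset = []
    for prompt in prompts:
        candidates = [generate_fn(prompt) for _ in range(n_samples)]
        best = max(candidates, key=lambda answer: score_fn(prompt, answer))
        dataset.append({"prompt": prompt, "response": best})  # kept for the next fine-tuning round
    return dataset
```

The resulting records are then blended with supervised data, as described in the next step of the list.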

Now, let’s look at some benchmarks to see how both V3 and R1 compare to other popular models:

DeepSeek-R1 vs OpenAI o1 vs OpenAI o1 mini vs DeepSeek-V3

AIME 2024 and MATH-500 are mathematics benchmarks, GPQA Diamond and MMLU are general knowledge tests, and finally, Codeforces and SWE-bench Verified are coding benchmarks.

Distilled DeepSeek models

Distillation in artificial intelligence is the process of creating smaller, more efficient models from larger ones, preserving much of their reasoning power while reducing computational demands.

Deploying V3 and R1 isn’t practical for everyone, since they require 8 NVIDIA H200 GPUs with 141GB of memory each. That’s why DeepSeek created 6 distilled models ranging from 1.5B to 70B parameters:

  • They started with six open-source base models from the Llama 3.1/3.3 and Qwen 2.5 families.
  • Then they generated 800,000 high-quality reasoning samples using R1.
  • Finally, they fine-tuned the smaller models on this synthetic reasoning data (a rough sketch of this step follows below).
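
In code terms, that last step is ordinary supervised fine-tuning of the small model on teacher-generated samples. Here is a minimal sketch assuming a causal language model whose forward pass returns logits of shape (batch, seq, vocab); the function name and batch format are illustrative:

```python
import torch.nn.functional as F

def distill_step(student, optimizer, batch):
    """One fine-tuning step on R1-generated reasoning samples (illustrative sketch)."""
    input_ids = batch["input_ids"]          # (batch, seq): tokenized prompt + R1's reasoning + answer
    logits = student(input_ids).logits      # (batch, seq, vocab)
    # Standard next-token cross-entropy: the student learns to reproduce the teacher's outputs.
    loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        input_ids[:, 1:].reshape(-1),
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Reportedly, the distilled models were trained with supervised fine-tuning only, without an additional reinforcement learning stage.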

Here’s how these six models fared in key benchmarks, demonstrating their abilities in math (AIME 2024 and MATH-500), general knowledge (GPQA Diamond), and coding (LiveCode Bench and CodeForces):

DeepSeek-R1 distilled models in benchmarks

Predictably, the results improved as the parameter count grew: the smallest 1.5B model performed the worst, while the largest 70B model performed the best. Curiously, the most balanced option appears to be the Qwen-32B distillation, which comes close to the Llama-70B version despite having roughly half as many parameters.

DeepSeek’s Future

DeepSeek has achieved remarkable success in a short time, gaining global recognition almost overnight. The chatbot seemed to appear out of nowhere, but there’s a risk it could fade just as quickly. Maintaining brand visibility and trust over the long term is a significant challenge, especially in such a highly competitive market. Tech giants like Google and OpenAI have budgets that far exceed DeepSeek’s financial resources, and they also hold a technical edge.

One of the major hurdles DeepSeek faces is the compute gap. Compared to its U.S. counterparts, DeepSeek operates at a significant disadvantage in terms of computational power. This gap is exacerbated by U.S. export controls on advanced chips, which limit DeepSeek’s access to the latest hardware needed to develop and deploy more powerful AI models.

While DeepSeek has shown impressive efficiency in its operations, access to more advanced computational resources could significantly accelerate its progress and strengthen its competitiveness against companies with greater capabilities. Closing this compute gap is crucial for DeepSeek to scale its innovations and establish itself as a stronger contender on the global stage.

That said, it’s important not to paint too bleak a picture, because DeepSeek has already achieved something remarkable. The company has proven that even with limited resources, it’s possible to create a world-class product—something many believed was only achievable with billion-dollar budgets and massive infrastructure. DeepSeek’s success is likely to inspire countless others and further accelerate the already rapid advancement of AI technologies.