DeepSeek’s AI breakthrough bypasses industry-standard CUDA, uses assembly-like PTX programming instead

Pelican Press · January 28, 2025

DeepSeek made quite a splash in the AI industry by training its Mixture-of-Experts (MoE) language model with 671 billion parameters on a cluster of 2,048 Nvidia H800 GPUs in about two months, showing 10X higher efficiency than AI industry leaders like Meta. The breakthrough was achieved through numerous fine-grained optimizations and the use of assembly-like PTX (Parallel Thread Execution) programming instead of Nvidia’s CUDA, according to an analysis from Mirae Asset Securities Korea.

PTX (Parallel Thread Execution) is an intermediate instruction set architecture designed by Nvidia for its GPUs. PTX sits between higher-level GPU programming languages (like CUDA C/C++ or other language frontends) and the low-level machine code (streaming assembly, or SASS). PTX is a close-to-metal ISA that exposes the GPU as a data-parallel computing device and, therefore, allows fine-grained optimizations, such as register allocation and thread/warp-level adjustments, at a level of control that CUDA C/C++ and other high-level languages do not directly expose. Once PTX is compiled into SASS, it is optimized for a specific generation of Nvidia GPUs.
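To make the layering concrete, here is a minimal sketch of PTX embedded in a CUDA C++ program through inline assembly. It is not taken from DeepSeek's code; the function and kernel names (lane_id_ptx, compare_lane_ids, run_lane_id_check) are hypothetical and chosen only for illustration. The kernel reads the PTX special register %laneid directly and compares it with the same value computed in CUDA C++ for a one-dimensional thread block.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Read the calling thread's lane index (0..31) within its warp directly
// from the PTX special register %laneid, using inline PTX assembly.
__device__ unsigned int lane_id_ptx() {
    unsigned int lane;
    asm volatile("mov.u32 %0, %%laneid;" : "=r"(lane));
    return lane;
}

// Each thread stores 1 if the PTX value equals the value derived in CUDA C++.
// The comparison assumes a one-dimensional thread block.
__global__ void compare_lane_ids(int *match) {
    unsigned int from_ptx = lane_id_ptx();
    unsigned int from_cuda = threadIdx.x % warpSize;
    match[threadIdx.x] = (from_ptx == from_cuda) ? 1 : 0;
}

// Launch one block of 64 threads (two warps) and report whether every
// thread saw the same lane index through both routes.
bool run_lane_id_check() {
    const int n = 64;
    int h_match[n];
    int *d_match = nullptr;
    if (cudaMalloc(&d_match, n * sizeof(int)) != cudaSuccess) return false;
    compare_lane_ids<<<1, n>>>(d_match);
    cudaError_t err = cudaMemcpy(h_match, d_match, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_match);
    if (err != cudaSuccess) return false;
    for (int i = 0; i < n; ++i) {
        if (!h_match[i]) return false;
    }
    return true;
}

int main() {
    printf("lane ids %s\n",
           run_lane_id_check() ? "match" : "differ or a CUDA error occurred");
    return 0;
}
```

Compiling the file with nvcc -ptx emits the intermediate PTX, and running cuobjdump -sass on the compiled binary shows the architecture-specific SASS that the PTX is lowered to.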

For example, when training its V3 model, DeepSeek reconfigured Nvidia’s H800 GPUs: out of 132 streaming multiprocessors, it allocated 20 for server-to-server communication, possibly for compressing and decompressing data to overcome connectivity limitations of the processor and speed up transactions. To maximize performance, DeepSeek also implemented advanced pipeline algorithms, possibly by making extra-fine thread/warp-level adjustments.
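The article does not say how the 20-of-132 split was implemented. As a rough illustration of the general idea of giving different parts of one GPU different jobs, the sketch below assigns roles by block index inside a single kernel launch. Everything in it is an assumption made for illustration: the names (role_split_kernel, run_role_split_check), the use of block index as a stand-in for a streaming multiprocessor, and the counters that replace real communication and compute work. In CUDA the hardware scheduler, not the programmer, decides which multiprocessor runs which block.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int kTotalBlocks = 132;  // one block per SM, using the article's SM count
constexpr int kCommBlocks = 20;    // blocks that take the communication role

// Blocks with a low index take the "communication" role and the rest take
// the "compute" role. Counters stand in for the real work of each role.
__global__ void role_split_kernel(int *role_counts) {
    if (threadIdx.x != 0) return;  // one vote per block
    if (static_cast<int>(blockIdx.x) < kCommBlocks) {
        atomicAdd(&role_counts[0], 1);  // hypothetical communication work
    } else {
        atomicAdd(&role_counts[1], 1);  // hypothetical compute work
    }
}

// Launch the kernel and check that the split is 20 communication blocks
// and 112 compute blocks.
bool run_role_split_check() {
    int h_counts[2] = {0, 0};
    int *d_counts = nullptr;
    if (cudaMalloc(&d_counts, sizeof(h_counts)) != cudaSuccess) return false;
    cudaMemset(d_counts, 0, sizeof(h_counts));
    role_split_kernel<<<kTotalBlocks, 32>>>(d_counts);
    cudaError_t err = cudaMemcpy(h_counts, d_counts, sizeof(h_counts),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_counts);
    if (err != cudaSuccess) return false;
    printf("communication blocks: %d, compute blocks: %d\n",
           h_counts[0], h_counts[1]);
    return h_counts[0] == kCommBlocks &&
           h_counts[1] == kTotalBlocks - kCommBlocks;
}

int main() {
    return run_role_split_check() ? 0 : 1;
}
```

With these numbers, roughly 15% of the blocks (20 of 132) are set aside for communication and the remaining 112 are left for computation.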

These modifications go far beyond standard CUDA-level development and are notoriously difficult to maintain, so this level of optimization reflects the exceptional skill of DeepSeek’s engineers. The global GPU shortage, amplified by U.S. restrictions, has compelled companies like DeepSeek to adopt innovative solutions, and DeepSeek has made a breakthrough. However, it is unclear how much money DeepSeek had to invest in development to achieve its results.

The breakthrough disrupted the market, as some investors believed that the need for high-performance hardware for new AI models would decline, hurting the sales of companies like Nvidia. Industry veterans, such as Pat Gelsinger, ex-chief executive of Intel, believe that applications like AI can take advantage of all the computing power they can access. As for DeepSeek’s breakthrough, Gelsinger sees it as a way to add AI to a broad set of inexpensive devices in the mass market.
