May 27
- AMD vs NVIDIA Inference Benchmark: Who Wins on Performance and Cost per Million Tokens?
➀ The article compares the performance and cost efficiency of AMD and NVIDIA GPUs for various AI tasks such as chat, translation, reasoning, and summarization.
➁ It highlights the MI325X and MI300X as cost-effective options for Llama3 70B chat and translation tasks.
➂ The analysis reveals that AMD GPUs are less cost-effective in rental scenarios due to limited availability and higher prices.
➃ The article discusses the need for better inference benchmarks and explores the features and capabilities of NVIDIA's Dynamo framework.
April 29
- AMD's New Sense of Urgency: MI450X, Chance to Beat NVIDIA, and NVIDIA's New Moat
➀ AMD is struggling to catch up with NVIDIA's NCCL and needs exclusive access to a persistent cluster of at least 1,024 MI300-class GPUs.
➁ AMD's RCCL library is a fork of NVIDIA's NCCL and requires significant engineering hours to stay in sync with NVIDIA's major refactor.
➂ AMD is planning to rewrite RCCL from scratch to stop being a fork of NCCL.
➃ NVIDIA's NCCL continues to advance with new features and performance improvements.
➄ AMD has made progress in software infrastructure but is falling behind in ML libraries.
➅ AMD lacks support for features like disaggregated prefill and NVMe KV Cache Tiering.
➆ Recommendations are made to both AMD and NVIDIA for improving their competitive positions.
April 28
- Microsoft’s Datacenter Freeze: 1.5GW Self-Build Slowdown & Lease Cancellation Misconceptions
➀ The market has focused on '2GW of lease cancellations', but this only covers non-binding LOIs, not firm contracts.
➁ Microsoft has ~5GW of pre-leased capacity under binding contracts set to start operations between 2025 and 2028.
➂ Microsoft walked away from significantly more than 2GW of non-binding contracts over the last two quarters.
April 22
- Huawei AI CloudMatrix 384 – China's Answer to Nvidia GB200 NVL72
➀ Huawei has unveiled the CloudMatrix 384, an AI accelerator and rack-scale architecture that competes with Nvidia's GB200 NVL72.
➁ The system uses 384 Ascend 910C chips and reaches high system-level performance even though each chip delivers only one-third the performance of an Nvidia Blackwell GPU.
➂ The CloudMatrix 384 offers 300 PFLOPS of dense BF16 compute, almost double that of the GB200 NVL72, with more than 3.6x the aggregate memory capacity and 2.1x the memory bandwidth.
October 28
- Fab Whack-a-Mole: Chinese Companies' Evasion of Export Controls
➀ Current Western export controls have slowed China's progress in advanced logic, but are not perfect or infallible.
➁ Loopholes in restrictions include offshore manufacturing, end-use workarounds, and renaming/reclassifying technologies.
➂ Huawei's fab network poses a national security concern: it exploits gaps in sanctions while advancing domestic semiconductor supply chains.
➃ Wafer fab equipment (WFE) suppliers' lobbying for relaxed controls is undercut by their own strong business performance and by the long-term market share they stand to lose to domestic Chinese firms.
➄ Suggestions for improving export controls include expanding the entity list, aligning ally restrictions, tightening supply chain restrictions, and improving enforcement.