In the latest round of machine learning benchmark results from MLCommons, computers built around Nvidia’s new Blackwell GPU architecture outperformed all others. But AMD’s latest spin on its Instinct GPUs, the MI325X, proved a match for the Nvidia H200, the product it was meant to counter. The comparable results came mostly on tests of one of the smaller-scale large language models, Llama2-70B (for 70 billion parameters). And in an effort to keep up with a rapidly changing AI landscape, MLPerf added three new benchmarks to better reflect where machine learning is headed.
MLPerf runs benchmarks for machine learning systems in an effort to provide an apples-to-apples comparison among computer systems. Submitters use their own software and hardware, but the underlying neural networks must be the same. There are now 11 server benchmarks in total, three of which were added this year.
It has been “hard to keep up with the rapid development of the field,” says Miro Hodak, the co-chair of MLPerf Inference. ChatGPT appeared only in late 2022, OpenAI unveiled its first large language model (LLM) that can reason through tasks just last September, and LLMs have grown exponentially: GPT-3 had 175 billion parameters, while GPT-4 is thought to have nearly 2 trillion. As a result of this breakneck innovation, “we’ve increased the pace of getting new benchmarks into the field,” says Hodak.
The new benchmarks include two LLMs. The popular and relatively compact Llama2-70B is already an established MLPerf benchmark, but the consortium wanted something that mimicked the responsiveness people are expecting of chatbots today. So the new benchmark “Llama2-70B Interactive” tightens the requirements. Computers must produce at least 25 tokens per second under any circumstance and cannot take more than 450 milliseconds to begin an answer.
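The two limits that make the interactive benchmark harder can be sketched as a simple pass/fail check. The thresholds below come from the article; the function and variable names are illustrative, not actual MLPerf harness code.

```python
# Sketch: checking one response trace against the Llama2-70B Interactive
# limits (25 tokens per second per user, 450 ms to the first token).

def meets_interactive_sla(ttft_ms: float, tokens: int, total_ms: float) -> bool:
    """Return True if a single response satisfies both latency constraints."""
    TTFT_LIMIT_MS = 450.0       # max time to begin an answer
    MIN_TOKENS_PER_S = 25.0     # min sustained generation rate
    tokens_per_s = tokens / (total_ms / 1000.0)
    return ttft_ms <= TTFT_LIMIT_MS and tokens_per_s >= MIN_TOKENS_PER_S

# A response that starts in 400 ms and emits 500 tokens over 12 s passes:
print(meets_interactive_sla(400.0, 500, 12_000.0))  # True
print(meets_interactive_sla(600.0, 500, 12_000.0))  # False: slow first token
```

Note that a system must hold both constraints for every response, so a single slow outlier can sink an otherwise fast submission.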
Seeing the rise of “agentic AI”—networks that can reason through complex tasks—MLPerf sought to test an LLM that would have some of the characteristics needed for that. They chose Llama3.1 405B for the job. That LLM has what’s called a wide context window. That’s a measure of how much information—documents, samples of code, etc.—it can take in at once. For Llama3.1 405B that’s 128,000 tokens, more than 30 times as much as Llama2 70B.
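The "more than 30 times" comparison checks out against Llama2's published 4,096-token context window (a figure from the model's documentation, not this article):

```python
# Back-of-envelope check of the context-window comparison above.
llama31_ctx = 128_000  # Llama3.1 405B, per the article
llama2_ctx = 4_096     # Llama2's published context window
print(llama31_ctx / llama2_ctx)  # 31.25, i.e. "more than 30 times"
```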
The final new benchmark, RGAT, is a graph attention network, which classifies information in a network. The dataset used to test RGAT consists of scientific papers with relationships among authors, institutions, and fields of study, amounting to 2 terabytes of data. RGAT must classify the papers into just under 3,000 topics.
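The core idea behind graph attention, which RGAT builds on, fits in a few lines of NumPy: each node re-weights its neighbors' features with attention scores before aggregating them. Everything here is a toy illustration with made-up shapes and random weights, not the actual benchmark model.

```python
import numpy as np

# One graph-attention layer over a toy 4-node graph.
rng = np.random.default_rng(0)
num_nodes, feat_in, feat_out = 4, 3, 2
X = rng.normal(size=(num_nodes, feat_in))   # node features
A = np.array([[1, 1, 0, 0],                 # adjacency, with self-loops
              [1, 1, 1, 0],
              [0, 1, 1, 1],
              [0, 0, 1, 1]], dtype=float)
W = rng.normal(size=(feat_in, feat_out))    # shared linear projection
a = rng.normal(size=(2 * feat_out,))        # attention vector

H = X @ W                                   # projected features
# Raw attention score for each pair (i, j): a^T [h_i || h_j]
scores = (H @ a[:feat_out])[:, None] + (H @ a[feat_out:])[None, :]
scores = np.maximum(0.2 * scores, scores)   # LeakyReLU (slope 0.2)
scores = np.where(A > 0, scores, -np.inf)   # only attend to neighbors
alpha = np.exp(scores - scores.max(axis=1, keepdims=True))
alpha = alpha / alpha.sum(axis=1, keepdims=True)  # softmax over neighbors
out = alpha @ H                             # attention-weighted aggregation
print(out.shape)  # (4, 2)
```

In the benchmark's setting, the node features would describe papers, authors, and institutions, and a final layer would map the aggregated features to topic labels.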
Blackwell, Instinct Results
Nvidia continued its domination of MLPerf benchmarks through its own submissions and those of some 15 partners, such as Dell, Google, and Supermicro. Both its first- and second-generation Hopper architecture GPUs, the H100 and the memory-enhanced H200, made strong showings. “We were able to get another 60 percent performance over the last year” from Hopper, which went into production in 2022, says Dave Salvator, director of accelerated computing products at Nvidia. “It still has some headroom in terms of performance.”
But it was Nvidia’s Blackwell architecture GPU, the B200, that really dominated. “The only thing faster than Hopper is Blackwell,” says Salvator. The B200 packs in 36 percent more high-bandwidth memory than the H200, but more importantly it can perform key machine-learning math using numbers with a precision as low as 4 bits instead of the 8 bits Hopper pioneered. Lower precision compute units are smaller, so more fit on the GPU, which leads to faster AI computing.
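The precision trade-off can be illustrated with plain uniform quantization: a 4-bit grid has only 16 levels versus 256 at 8 bits, so values take half the space at the cost of larger rounding error. This is a toy for intuition, not Blackwell's actual FP4 number format.

```python
import numpy as np

def quantize(x: np.ndarray, bits: int) -> np.ndarray:
    """Snap values onto a symmetric uniform grid with 2**bits levels."""
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1)
    return np.round(x / scale) * scale

w = np.random.default_rng(1).normal(size=1000)  # stand-in for weights
for bits in (8, 4):
    err = np.abs(w - quantize(w, bits)).mean()
    print(bits, "bits: mean error", round(err, 4))
# 4-bit error is larger, but each value needs half the storage and
# half the datapath width, which is why more units fit on the GPU.
```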
In the Llama3.1 405B benchmark, an eight-B200 system from Supermicro delivered nearly four times the tokens per second of an eight-H200 system by Cisco. And the same Supermicro system was three times as fast as the quickest H200 computer at the interactive version of Llama2-70B.
Nvidia used its combination of Blackwell GPUs and Grace CPU, called GB200, to demonstrate how well its NVL72 data links can integrate multiple servers in a rack, so they perform as if they were one giant GPU. In an unverified result the company shared with reporters, a full rack of GB200-based computers delivers 869,200 tokens/s on Llama2 70B. The fastest system reported in this round of MLPerf was an Nvidia B200 server that delivered 98,443 tokens/s.
AMD is positioning its latest Instinct GPU, the MI325X, as a competitive match for Nvidia’s H200. The MI325X has the same architecture as its predecessor, the MI300, but adds even more high-bandwidth memory and memory bandwidth: 288 gigabytes and 6 terabytes per second (boosts of 50 percent and 13 percent, respectively).
Adding more memory is a play to handle larger and larger LLMs. “Larger models are able to take advantage of these GPUs because the model can fit in a single GPU or a single server,” says Mahesh Balasubramanian, director of data center GPU marketing at AMD. “So you don’t have to have that communication overhead of going from one GPU to another GPU or one server to another server. When you take out those communications your latency improves quite a bit.” AMD was able to take advantage of the extra memory through software optimization to boost the inference speed of DeepSeek-R1 8-fold.
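The "fits in a single server" arithmetic above can be sketched with a rough rule of thumb: weight memory is about parameters times bytes per parameter, ignoring the KV cache and activations. The memory figures are the article's; the 1-byte (8-bit) weight assumption is mine.

```python
# Rough model-fits-in-memory check.
def weight_gb(params_billions: float, bytes_per_param: float) -> float:
    # 1e9 params x bytes per param ~= that many gigabytes
    return params_billions * bytes_per_param

llama31_405b = weight_gb(405, 1.0)  # ~405 GB at 8-bit precision
per_gpu_hbm = 288                   # MI325X memory, per the article
server_hbm = 8 * per_gpu_hbm        # eight GPUs per server = 2,304 GB

print(llama31_405b <= per_gpu_hbm)  # False: too big for any single GPU
print(llama31_405b <= server_hbm)   # True: fits in one eight-GPU server
```

Once the weights fit inside one server, all cross-server traffic for the forward pass disappears, which is the latency win Balasubramanian describes.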
On the Llama2-70B test, an eight-GPU MI325X computer came within 3 to 7 percent of the speed of a similarly tricked-out H200-based system. And on image generation, the MI325X system was within 10 percent of the Nvidia H200 computer.
AMD’s other noteworthy mark this round came from its partner Mangoboost, which nearly quadrupled Llama2-70B performance by spreading the computation across four computers.
Intel has historically put forth CPU-only systems in the inference competition to show that for some workloads you don’t really need a GPU. This round brought the first data from Intel’s Xeon 6 chips, which were formerly known as Granite Rapids and are made using Intel’s 3-nanometer process. At 40,285 samples per second, the best image-recognition result for a dual-Xeon 6 computer was about one-third the performance of a Cisco computer with two Nvidia H100s.
Compared to Xeon 5 results from October 2024, the new CPU provides about an 80 percent boost on that benchmark and an even bigger boost on object detection and medical imaging. Since it first started submitting Xeon results in 2021 (with the Xeon 3), the company has achieved an 11-fold performance boost on ResNet.
For now, it seems Intel has quit the field in the AI accelerator chip battle. Its alternative to the Nvidia H100, Gaudi 3, did not make an appearance in the new MLPerf results, nor in version 4.1, released last October. Gaudi 3 got a later-than-planned release because its software was not ready. In the opening remarks at Intel Vision 2025, the company’s invite-only customer conference, newly minted CEO Lip-Bu Tan seemed to apologize for Intel’s AI efforts. “I’m not happy with our current position,” he told attendees. “You’re not happy either. I hear you loud and clear. We are working toward a competitive system. It won’t happen overnight, but we will get there for you.”
Google’s TPU v6e chip also made a showing, though the results were restricted to the image-generation task. At 5.48 queries per second, the 4-TPU system saw a 2.5-fold boost over a similar computer using its predecessor, the TPU v5e, in the October 2024 results. Even so, 5.48 queries per second was roughly in line with a similarly sized Lenovo computer using Nvidia H100s.