By now you’ve probably heard AI datacenters called factories. It’s an apt description: power goes in and tokens come out.
Admittedly it’s an oversimplification, but the economics of AI inference at scale are deceptively simple. The more tokens you can generate for a given amount of power, the better. Sell enough tokens to cover your infrastructure, power, facilities, and operations costs, and anything left over is profit.
“For the datacenters, inference tokens per watt translates directly to the revenues of the CSPs” (cloud service providers), Nvidia CEO Jensen Huang reiterated during the company’s most recent earnings call.
Just as the assembly line revolutionized manufacturing in the 1900s, a similar transformation is taking place in the datacenter. Any optimization that drives up the number of tokens per second, per dollar, per watt (TPS/$/W) is a competitive advantage. But this is where things get complicated. Scaling inference isn’t as simple as more GPUs, more tokens.
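To make the factory arithmetic concrete, here’s a minimal sketch of the tokens-per-dollar-of-power relationship. All figures are hypothetical, chosen only to show the units working out:

```python
# Toy inference-economics sketch. Numbers are illustrative, not benchmark data.
# More tokens per second per watt means more tokens per dollar of electricity.

def tokens_per_dollar_of_power(tps_per_watt: float, power_price_per_kwh: float) -> float:
    """Tokens generated per dollar spent on electricity."""
    tokens_per_kwh = tps_per_watt * 1000 * 3600  # watts -> kilowatts, seconds -> hours
    return tokens_per_kwh / power_price_per_kwh

# Hypothetical: 10 tokens/s/W at $0.08 per kWh works out to 450 million
# tokens per dollar of power, before any other operating costs.
print(tokens_per_dollar_of_power(10, 0.08))
```

Every other cost (hardware depreciation, facilities, staff) stacks on top of that figure, which is why small efficiency gains compound into real margin.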
Not all tokens are created equal
With modern hardware, it’d be fairly trivial to maximize token throughput at the expense of individual user experience.
“It’s not one size fits all in terms of the answer. There are SLAs, there’s different application types,” Dave Salvator, director of accelerated computing products at Nvidia, told El Reg.
This changes the equation a bit. The question becomes how many TPS/$/W you can generate for a given “goodput.”
Goodput can mean a lot of things, but in the case of LLM inference, it usually refers to a service-level target such as time to first token under a few hundred milliseconds, or a per-user generation rate greater than X number of tokens a second.
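A provider might measure goodput as the share of requests that meet those service-level targets. Here’s a minimal sketch; the thresholds and the request records are made up for illustration:

```python
# Goodput sketch: the fraction of requests that met their latency targets.
# SLO thresholds below are illustrative, not from the article.

TTFT_SLO_S = 0.3     # time to first token under 300 ms
MIN_USER_TPS = 20.0  # at least 20 tokens/s delivered to each user

def goodput(requests: list) -> float:
    """Fraction of requests whose service-level targets were met."""
    if not requests:
        return 0.0
    good = [r for r in requests
            if r["ttft_s"] <= TTFT_SLO_S and r["user_tps"] >= MIN_USER_TPS]
    return len(good) / len(requests)

sample = [
    {"ttft_s": 0.15, "user_tps": 42.0},  # fast start, fast generation: counts
    {"ttft_s": 0.90, "user_tps": 55.0},  # slow first token: does not count
    {"ttft_s": 0.20, "user_tps": 12.0},  # sluggish generation: does not count
]
print(goodput(sample))  # one of three requests met both targets
```

Raw throughput counts every token; goodput only counts the ones users would actually pay for.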
SemiAnalysis’s InferenceX (formerly InferenceMax) benchmark illustrates this quite nicely. The synthetic benchmark offers arguably the best look into the performance scaling and economics of generative AI inference yet.
In the graphic below, total token throughput per megawatt is charted against user interactivity in a Pareto curve of various B300 configurations. Here, ideal performance is up and to the right.
InferenceX’s efficiency Pareto curve can be broken down into three main categories. Bulk tokens on the left, expensive low-latency tokens on the right, and the so-called “Goldilocks zone” in the middle. – Click to enlarge
As you can see, the chips can reach throughputs exceeding 3.5 million tokens a second per megawatt, but trade away interactivity – that is, tokens per second per user – to do it. These tokens are cheap to serve, but glacially slow. This configuration is more akin to a city bus. It’s not fast, but it can carry a lot of people.
On the other end of the spectrum, the chips can be configured to maximize user interactivity but doing so sacrifices throughput. Faster tokens make this tier of “premium” tokens more desirable, but lower throughput means providers need to charge more for them.
The area in between has been called the “Goldilocks” zone. It offers enough interactivity for a good user experience while delivering enough throughput to remain cost effective.
Software matters
Goodput can be tricky as achieving it depends heavily on hardware, software, and the model in question. LLM inference is no longer as simple as wrangling enough compute to hit your goodput. Models have to be paired with the right software to perform optimally.
vLLM, a popular inference serving framework, might work great for one model and underperform alternatives, like SGLang or TensorRT LLM, when running another. This is one of the reasons that Nvidia has been pushing its inference microservices (NIMs) so hard. By taking the guesswork out of inference deployments, they not only get to sell you the hardware but a subscription to go with it.
For the same amount of power, InferenceX data shows that TensorRT LLM running on Nvidia’s B200 GPUs is significantly more efficient at serving models like DeepSeek R1 than something like SGLang. Having said that, open source inference engines are still prized by large hyperscalers and model houses as they can be optimized and customized for their specific workloads.
Software can make a big difference to the efficiency of the underlying hardware. In this case, Nvidia’s in-house inference engine TensorRT LLM delivers better performance than SGLang – Click to enlarge
Disaggregated compute
The performance gap widens considerably when looking at disaggregated serving frameworks like Nvidia’s Dynamo or AMD’s MoRI. By distributing the same work across a pool of GPUs, these frameworks break the workload up into smaller pieces, running the compute-intensive prefill (prompt processing) phase on some GPUs and the bandwidth-limited decode (token generation) phase on others.
The exact ratio of prefill GPUs to decode GPUs is going to vary from model to model and depend to some degree on your desired goodput. You might want fewer decode and more prefill GPUs if you’re trying to serve lots of users. Meanwhile, you’d want the opposite for latency-sensitive apps like code assistants.
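A back-of-the-envelope way to think about that split: weigh how many GPU-seconds of prefill versus decode work each request generates. The per-GPU rates and traffic mix below are invented for illustration, not benchmark figures:

```python
# Rough split of a GPU pool between prefill and decode for a given traffic mix.
# All rates here are hypothetical placeholders.

def split_pool(total_gpus, prompt_toks, output_toks,
               prefill_tps_per_gpu, decode_tps_per_gpu):
    """Balance prefill and decode capacity in proportion to the work each phase does."""
    prefill_work = prompt_toks / prefill_tps_per_gpu  # GPU-seconds per request
    decode_work = output_toks / decode_tps_per_gpu
    share = prefill_work / (prefill_work + decode_work)
    prefill_gpus = max(1, round(total_gpus * share))
    return prefill_gpus, total_gpus - prefill_gpus

# Chat-style traffic: long prompts, short answers. Prefill chews through prompt
# tokens much faster per GPU, so decode still claims a healthy share of the pool.
print(split_pool(8, prompt_toks=4000, output_toks=500,
                 prefill_tps_per_gpu=20000, decode_tps_per_gpu=2000))
```

Shift the traffic mix toward long generations, as a code assistant would, and the same arithmetic hands more of the pool to decode.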
One of the biggest gains in inference efficiency has come from running different parts of the workload on different GPUs – Click to enlarge
Disaggregated serving along with techniques like multi-token prediction, a form of speculative decoding we’ve discussed previously, can move the Pareto curve up and to the right by a significant margin.
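The payoff from speculative decoding can be sketched with a textbook simplification: assume each drafted token is accepted independently with some probability. The numbers below are illustrative, not the article’s data:

```python
# Expected tokens emitted per target-model verification step in speculative
# decoding, under the standard simplifying assumption that each of k drafted
# tokens is accepted independently with probability a.

def expected_tokens_per_step(a: float, k: int) -> float:
    """Geometric acceptance gives (1 - a**(k+1)) / (1 - a) tokens per step."""
    if a >= 1.0:
        return k + 1.0  # every draft accepted, plus the model's own token
    return (1.0 - a ** (k + 1)) / (1.0 - a)

# Hypothetical 80 percent acceptance with 4 drafted tokens yields roughly
# 3.4 tokens per expensive target-model pass instead of 1.
print(expected_tokens_per_step(0.8, 4))
```

Since the target model’s forward pass dominates decode cost, anything above one token per pass translates almost directly into extra throughput at the same interactivity.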
Driving the rack-scale transition
The move to mixture-of-experts (MoE) model architectures, which activate only a subset of the whole model to process and generate tokens, is changing the way we build systems.
“Those experts have to be communicating with each other a lot,” said Salvator, who explained this has driven a shift toward both disaggregated compute and larger rack-scale architectures, like Nvidia’s NVL72, AMD’s Helios, and AWS’ Trainium3.
These architectures offer more GPUs or XPUs connected by high-speed scale-up fabrics, which helps to reduce latency and boost throughput.
The challenge, of course, becomes finding the ideal combination of expert, pipeline, data, and tensor parallelism to hit your goodput target while maximizing throughput for a given amount of power.
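The search space is larger than it sounds, even before expert parallelism enters the picture. As a simplified sketch, the tensor (TP), pipeline (PP), and data (DP) parallel degrees just have to multiply out to the GPU count:

```python
# Enumerate candidate (TP, PP, DP) parallelism layouts for a GPU domain.
# Expert parallelism is typically folded into these dimensions; this sketch
# ignores it, along with memory and bandwidth constraints, for simplicity.

def layouts(n_gpus: int, max_tp: int = 8):
    combos = []
    for tp in range(1, max_tp + 1):      # tensor parallel degree
        if n_gpus % tp:
            continue
        rest = n_gpus // tp
        for pp in range(1, rest + 1):    # pipeline parallel degree
            if rest % pp == 0:
                combos.append((tp, pp, rest // pp))  # remainder is data parallel
    return combos

# Even a single 72-GPU NVL72-sized domain admits dozens of valid splits,
# each landing at a different point on the throughput/interactivity curve.
print(len(layouts(72)))
```

Each layout trades communication overhead against batch efficiency differently, which is why the same rack can sit at very different points on the Pareto curve depending on configuration.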
Another big jump in inference efficiency has come from the move to rack-scale architectures, like Nvidia’s GB200 and GB300 NVL72 racks – Click to enlarge
Comparing Nvidia’s enterprise-focused B300s to its rack-scale GB300s, we see that the smaller system performs well in scenarios where user interactivity is low, but runs out of steam above about 50 tokens per second per user. The rack-scale system, meanwhile, maintains a higher degree of interactivity without compromising on throughput.
For the moment, Nvidia is the only vendor with a mature rack-scale platform you can actually buy. However, that won’t be true for long. AMD’s MI455X-based Helios rack systems are due out in the second half of 2026 and boast performance that, at least on paper, is on par with Nvidia’s next-gen Vera Rubin racks.
While rack-scale architectures achieve greater efficiency, AMD’s VP of AI software Anush Elangovan argues that there’s still a place for eight-way GPU boxes in more traditional, air-cooled datacenters.
And depending on which end of the performance spectrum you’re optimizing for, these eight-way systems often deliver 85 percent or more of rack-scale performance, particularly on the right end of the Pareto curve.
That, along with the fact that those smaller systems are considerably less expensive, might explain why both AMD and Nvidia continue to serve this segment even as they push their NVL and Helios machines.
We can see this in the data.
At least for neocloud operators, rack-scale’s cost benefit is primarily seen at higher throughputs and lower interactivity. Meanwhile at higher interactivity, Nvidia and AMD’s eight GPU boxes still hold up well. – Click to enlarge
On top of tracking how efficiently Nvidia and AMD’s AI accelerators churn out tokens, InferenceX also tracks inference costs. The closer the Pareto curve gets to the bottom right corner, the better value those tokens are.
In this example, we can see that below about 70 Tok/s per user, Nvidia’s rack-scale systems, like the GB200 NVL72, dominate, delivering the highest volume of tokens at the lowest cost. But, as interactivity increases, Nvidia and AMD’s smaller systems become more cost effective. Again, a lot of this depends on which software levers you pull. And it might explain why Nvidia burned $20 billion on Groq’s intellectual property and talent.
The chip designer’s SRAM-heavy AI accelerators excelled in the kind of latency-sensitive applications that fall toward the right side of these graphs.
An unrelenting rate of change
Hardware is only as good as the software that runs on it, and that software is improving quickly. Inference providers that fail to update their software stacks regularly could be leaving a lot of performance on the table.
“The state of the art of AI is very much a moving target,” Salvator said. “We are continuing to optimize both our software and our hardware to try and address that state of the art.”
Less than a month ago, AMD’s MI355X accelerators trailed Nvidia’s equivalent chips by a wide margin in the SGLang inference framework – Click to enlarge
Nvidia accelerators have aged well in part because the company’s software continues to deliver performance gains long after they ship. The same is true of AMD, which, despite a considerably smaller software engineering staff, is delivering performance optimizations as fast as it can.
Take AMD’s MI355X for example. On paper, the chip roughly matches Nvidia’s B200 and B300 accelerators. Yet, as of early February, Nvidia’s Blackwell GPUs delivered significantly higher performance in SGLang.
In less than a month, AMD has managed to close the gap with Nvidia, at least for FP8 inference running in SGLang. – Click to enlarge
Less than a month later, AMD had closed that gap considerably, and now outperforms Nvidia in some regimes – at least in an apples-to-apples comparison between the two chips running SGLang.
AMD still has a ways to go to catch up with Nvidia’s in-house inference engine TensorRT LLM. However, given the progress AMD has managed to make in less than a month we wouldn’t be surprised to see the House of Zen close that gap as well.
“The software aspect makes the whole difference. It’s about just how much brain power we put on which data type and which model type. The rate of progress is literally daily,” Ramine Roane, CVP of AI product management at AMD, told El Reg.
More levers to pull
Up until this point, you might have noticed we’ve mostly been looking at InferenceX figures for FP8. While the latest Blackwell and Instinct GPUs from Nvidia and AMD offer native FP4 acceleration, we are only now starting to see models released at this precision.
OpenAI’s GPT-OSS was among the first major open-weight models to use MXFP4. As it stands, most models are still released at 16-bit or, increasingly, 8-bit precision, as that’s become the lowest common denominator for hardware support.
This is likely to change as the economics of inference strongly favor lower precisions. The reason is simple: smaller model weights need less memory capacity, bandwidth, and compute to achieve the same level of performance as higher-precision models.
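The bandwidth half of that argument is easy to put in numbers. Decode is usually limited by how fast weights stream from memory, so halving the bytes per weight roughly doubles the ceiling on tokens per second. The bandwidth and model size below are hypothetical round numbers:

```python
# Why lower precision pays: a rough memory-bandwidth ceiling on decode speed,
# assuming every generated token reads all model weights once. Figures are
# illustrative, not benchmark results.

def max_decode_tps(n_params: float, bytes_per_weight: float,
                   mem_bw_bytes_per_s: float) -> float:
    """Upper bound on tokens/s for a bandwidth-bound decode."""
    return mem_bw_bytes_per_s / (n_params * bytes_per_weight)

BW = 8e12       # hypothetical 8 TB/s of HBM bandwidth
PARAMS = 70e9   # a hypothetical 70B-parameter dense model

for label, nbytes in [("FP16", 2), ("FP8", 1), ("FP4", 0.5)]:
    print(label, round(max_decode_tps(PARAMS, nbytes, BW), 1))
```

Real systems batch many users against one weight read, but the scaling holds: FP4 weights leave twice the bandwidth headroom of FP8, and four times that of FP16.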
SemiAnalysis InferenceX results illustrate this nicely. The jump in throughput and interactivity going from FP8 to FP4 can be substantial, but only if optimized kernels are available for that model.
But while FP4 may offer better throughput, it does so at a cost. Quantization, particularly at 4 bits and lower, gets a bad rap for lobotomizing models.
“If your accuracy loss is too severe, the speed up becomes irrelevant,” Salvator said.
However, the FP4 data types supported by AMD and Nvidia’s latest accelerators use some clever math to vastly expand the number of values that can represent model weights from 16 to more than 4,000.
We dug into this in greater detail back when GPT-OSS launched last year, but in a nutshell, it involves using a scaling factor across a block of model weights to achieve an output quality closer to FP8 or even BF16.
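A toy version of the idea looks like this. Each block of weights shares one scale factor, so a handful of 4-bit codes can cover a much wider range of magnitudes than 16 fixed levels ever could. This sketch uses plain integer codes rather than true FP4 values, and is not Nvidia’s or AMD’s actual scheme:

```python
# Toy block-scaled 4-bit quantization in the spirit of MXFP4: one shared scale
# per block of weights, plus a small integer code per weight. Simplified for
# clarity; real MX formats use FP4 codes and power-of-two scales.

def quantize_block(weights, levels=16):
    """Return a shared scale and one small integer code per weight."""
    scale = max(abs(w) for w in weights) / (levels // 2 - 1)
    codes = [round(w / scale) for w in weights]  # each fits in 4 bits
    return scale, codes

def dequantize_block(scale, codes):
    return [c * scale for c in codes]

block = [0.7, -0.35, 0.05, 0.21]
scale, codes = quantize_block(block)
approx = dequantize_block(scale, codes)
# Worst-case reconstruction error is bounded by half a quantization step, scale/2
```

Because the scale adapts per block, a block of tiny weights gets a tiny step size, which is what keeps output quality close to FP8 or BF16 despite the 4-bit codes.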
A race to the bottom
For inference providers serving open-weight models, tokens are a commodity, and the business is a race to the bottom: whoever can offer the most desirable models, the highest quality tokens, or the fastest tokens for the lowest cost wins out.
Some inference providers, like Cerebras, have leaned into their unique hardware architecture to provide “premium” low-latency tokens. This won the startup a contract with OpenAI to serve its GPT-5.3-Codex-Spark coding model at thousands of tokens a second.
Others, like Fireworks, have developed tools to help customers customize their models for their specific applications. “Our design point has always been customization,” Fireworks CEO Lin Qiao told El Reg.
When Fireworks first launched its tuning platform, the gap in quality between open and closed models was quite large, she explained. Approaches like supervised fine tuning provided a way for customers to achieve performance closer to that of proprietary models while also imbuing them with their company’s domain knowledge.
Since then, open-weight models have made significant gains, Qiao explained. “It’s a clear trend that closed and open model quality is converging, especially in the LLM space, and that makes tuning much more appealing.”
However, even fine-tuned model serving is quickly becoming a commodity. All of the major cloud providers now offer similar services, which means smaller inference-as-a-service and neocloud providers not only need to constantly optimize their hardware and software stacks, but also must think carefully about how they differentiate themselves from the rest of the pack. ®

