If 2023 and 2024 were all about AI training, 2025 is poised to be all about AI inference. The entire AI ecosystem needs that transition to happen to prove that the abundance of AI hype generated over the last two years was justified.
Inference, or the notion of AI delivering something new and valuable based on all the data it has been trained on, is the key to a maturing market and the success of agentic AI applications designed to put AI into practical, everyday use. It is also the key to ensuring that projects like Stargate do not end up as massive money pits.
Boosting inference performance is one part of enabling this transition. Since late last summer, Nvidia has been touting the inference performance of its Blackwell GPU systems as 4x that of the previous Hopper architecture. The MLPerf benchmark results Nvidia released in November were another feather in Nvidia’s inference performance cap. But another major factor in making AI more practical to use is reducing the cost of inference, as a growing number of AI agents begin to generate millions of tokens to initiate their AI processes.
That cost reduction will pave the road leading to “profitable AI,” according to Dave Salvator, director of accelerated computing products at Nvidia.
“AI inference is notoriously difficult, as it requires many steps to strike the right balance between throughput and user experience,” stated Salvator, in a blog post this week. “But the underlying goal is simple: generate more tokens at a lower cost. Tokens represent words in a large language model (LLM) system — and with AI inference services typically charging for every million tokens generated, this goal offers the most visible return on AI investments and energy used per task.”
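To make the "more tokens at a lower cost" goal concrete, the arithmetic can be sketched as below. This is only an illustration: the function name, the hourly GPU price, and the throughput figures are hypothetical and do not come from Nvidia's post.

```python
def cost_per_million_tokens(gpu_hourly_cost_usd: float, tokens_per_second: float) -> float:
    """Serving cost in USD per one million generated tokens.

    Both inputs are illustrative assumptions, not published figures.
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hourly_cost_usd / tokens_per_hour * 1_000_000


# Hypothetical example: a GPU billed at $4 per hour.
baseline = cost_per_million_tokens(4.0, 2000)   # 2,000 tokens/s
optimized = cost_per_million_tokens(4.0, 4000)  # same GPU, doubled throughput
print(f"baseline:  ${baseline:.4f} per million tokens")
print(f"optimized: ${optimized:.4f} per million tokens")
```

Under these assumptions, doubling throughput on the same hardware halves the cost per million tokens, which is why software optimization alone can change the economics of a service billed per token.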
Salvator added that Nvidia aims to deliver on that goal through ongoing software optimization for partners and users of its Nvidia AI inference platform, combined with the ability of its Hopper platform to deliver up to 15x more energy efficiency for inference workloads compared to previous generations.
In his post, Salvator name-checked a number of partners and customers that are working with Nvidia to change the economics of inference. Those include Perplexity AI, an AI-powered search engine that handles more than 435 million monthly queries, with each of those queries representing multiple AI inference requests.
Perplexity AI uses Nvidia’s H100 GPUs, Triton Inference Server, and TensorRT-LLM to manage this volume of requests.
“Supporting over 20 AI models, including Llama 3 variations like 8B and 70B, Perplexity processes diverse tasks such as search, summarization and question-answering,” Salvator stated. “By using smaller classifier models to route tasks to GPU pods, managed by Nvidia Triton, the company delivers cost-efficient, responsive service under strict service level agreements. Through model parallelism, which splits LLMs across GPUs, Perplexity achieved a threefold cost reduction while maintaining low latency and high accuracy.”
Part of Nvidia’s recipe for reducing inference cost and improving efficiency has been to incorporate ReDrafter, an open-source approach to speculative decoding published by Apple, into its TensorRT-LLM library, Salvator stated, adding, “ReDrafter uses smaller ‘draft’ modules to predict tokens in parallel, which are then validated by the main model. This technique significantly reduces response times for LLMs, particularly during periods of low traffic.”
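The general draft-and-verify idea behind speculative decoding can be sketched in a few lines. This is a simplified greedy illustration, not ReDrafter itself or the TensorRT-LLM API: the toy "models" are plain functions over integer tokens, and every name below is hypothetical.

```python
from typing import Callable, List, Tuple

Token = int
NextTokenFn = Callable[[List[Token]], Token]


def greedy_decode(target_next: NextTokenFn, prompt: List[Token], max_new_tokens: int) -> List[Token]:
    """Reference decoding: one target-model call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target_next(tokens))
    return tokens[len(prompt):]


def speculative_decode(
    target_next: NextTokenFn,
    draft_next: NextTokenFn,
    prompt: List[Token],
    max_new_tokens: int,
    k: int = 4,
) -> Tuple[List[Token], int]:
    """Greedy draft-and-verify. Returns (new tokens, number of verification passes)."""
    tokens = list(prompt)
    goal = len(prompt) + max_new_tokens
    passes = 0
    while len(tokens) < goal:
        n = min(k, goal - len(tokens))
        # 1. The cheap draft model proposes up to n tokens.
        draft: List[Token] = []
        for _ in range(n):
            draft.append(draft_next(tokens + draft))
        # 2. The target model checks every drafted position. A real system
        #    scores all positions in one batched forward pass; this toy calls
        #    the function per position but counts the whole check as one pass.
        passes += 1
        accepted: List[Token] = []
        for i in range(n):
            expected = target_next(tokens + draft[:i])
            accepted.append(expected)
            if expected != draft[i]:
                break  # keep the target's correction, discard the rest of the draft
        tokens.extend(accepted)
    return tokens[len(prompt):], passes


# Toy models (assumptions for illustration): the target counts upward modulo 10,
# and the draft agrees with it except right after a 4.
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


output, passes = speculative_decode(toy_target, toy_draft, [0], max_new_tokens=12)
print(output, passes)
```

Because every accepted token is one the target model would have produced anyway, the output matches plain greedy decoding; the saving comes from needing fewer target-model passes whenever the draft guesses correctly.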
Meanwhile, the Triton Inference Server has played a major role in helping Nvidia customer Docusign to optimize inference. As Alex Zakhvatov, senior product manager at Docusign, stated in the Nvidia post, “We no longer need to deploy bespoke, framework-specific inference servers for our AI models. We leverage Triton as a unified inference server for all AI frameworks and also use it to identify the right production scenario to optimize cost- and performance-saving engineering efforts.”