CPUs Are Quietly Throttling Your LLM Performance

Georgia Tech researchers found that adding a few CPU cores — costing just 1.5% more — can make LLM inference up to 5× faster. Here's why your GPUs are probably starving right now.

Your expensive GPU cluster is sitting idle. Not because the GPUs lack power—they’re some of the fastest AI accelerators ever built. They’re idle because the CPU can’t feed them work fast enough.

A new research paper from Georgia Tech reveals a counterintuitive bottleneck in large language model (LLM) inference: the CPU, not the GPU, often determines how fast your AI responds. When running models like Llama or GPT across multiple GPUs, inadequate CPU resources can leave $30,000 GPUs waiting for a $50 processor to catch up. The researchers found that simply adding more CPU cores—a marginal cost increase of roughly 1.5% on cloud instances—improved response times by 1.36× to 5.40× without requiring any additional GPUs.

The paper, titled “Characterizing CPU-Induced Slowdowns in Multi-GPU LLM Inference,” systematically documents how CPU underprovisioning creates cascading failures across GPU clusters, and why this problem is getting worse as AI models scale.


The Hidden Job CPUs Do in GPU Clusters

When you send a prompt to an LLM running on multiple GPUs, most people assume the GPUs immediately start computing. They don’t. The CPU handles three critical preprocessing and coordination tasks first, and if it’s starved for resources, the entire pipeline stalls.

1. Tokenization


Your text prompt must first be converted into the numerical tokens the model understands. For a 100,000-token prompt—roughly a short book—this preprocessing can take multiple seconds of CPU time. Modern tokenizers like HuggingFace’s Rust-based implementation spawn multiple threads to accelerate this work, which means they compete aggressively for whatever CPU cores are available. The researchers found that tokenization alone can account for up to 50% of time-to-first-token latency in LLM serving. And this fraction doesn’t shrink as prompts grow: tokenization cost rises linearly with input length, and optimizations like FlashAttention keep GPU compute scaling nearly linearly with input length as well, so the two grow in lockstep.
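
The linear scaling is easy to see with a toy tokenizer—a stand-in sketch, not HuggingFace’s actual BPE implementation:

```python
import re
import time

# Toy regex tokenizer -- a stand-in for a real BPE tokenizer, used only
# to show that tokenization cost grows linearly with prompt length.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def tokenize(text: str) -> list[str]:
    return TOKEN_RE.findall(text)

base = "The quick brown fox jumps over the lazy dog. " * 2000

for scale in (1, 2, 4):
    text = base * scale
    start = time.perf_counter()
    tokens = tokenize(text)
    elapsed = time.perf_counter() - start
    print(f"{len(tokens):>8} tokens in {elapsed * 1e3:.1f} ms")
```

Doubling the prompt roughly doubles the CPU time—and on a core that is already oversubscribed, every one of those milliseconds comes straight out of time-to-first-token.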

2. Kernel Coordination


The CPU must launch thousands of computation kernels across GPUs. Each launch requires traversing the CUDA runtime, issuing commands via PCIe bus writes to GPU registers, and managing synchronization between devices. If the CPU thread handling these launches gets delayed by even milliseconds—because other processes are competing for the same cores—GPUs sit idle waiting for instructions. Modern multi-GPU frameworks assign one CPU process per GPU specifically to isolate kernel launch responsibilities, which means a system with 8 GPUs needs at least 8–10 concurrent CPU processes just to keep the GPUs fed.
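
A back-of-the-envelope model (our illustration, not the paper’s methodology) shows why launch latency matters: if each kernel launch costs the CPU more time than the kernel takes to run, the GPU spends the difference idle.

```python
# Simple model of launch-bound execution: with serial dispatch, the GPU
# can never be busier than t_kernel / t_launch when launches dominate.

def gpu_utilization(t_kernel_us: float, t_launch_us: float) -> float:
    """Fraction of time the GPU is busy under serial kernel dispatch."""
    return min(1.0, t_kernel_us / t_launch_us)

# CUDA launch overhead is typically on the order of 5-10 microseconds;
# small kernels on a fast GPU can finish in just a few.
for t_kernel in (2.0, 5.0, 20.0, 100.0):
    util = gpu_utilization(t_kernel, t_launch_us=10.0)
    print(f"kernel {t_kernel:>6.1f} us, launch 10 us -> {util:.0%} busy")
```

The numbers here are illustrative, but the shape is the point: the faster the GPU (i.e., the shorter each kernel), the more sensitive utilization becomes to CPU-side dispatch delays.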

3. Inter-GPU Synchronization

Coordinating communication between GPUs using libraries like NCCL requires barrier-style synchronization where all GPUs must reach the same point before proceeding. If one CPU core is delayed scheduling its GPU’s work, every other GPU in the cluster waits. The researchers profiled cases where one CPU core delayed by just 1 millisecond forced all other GPUs to busy-wait for that duration, amplifying a small per-core delay into a cluster-wide stall.
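
The amplification effect can be reproduced in miniature with a thread barrier—a minimal sketch in which threads stand in for per-GPU CPU processes, not the paper’s actual NCCL benchmark:

```python
import threading
import time

# Four workers stand in for per-GPU CPU processes meeting at an
# NCCL-style barrier. Delaying one worker by ~1 ms forces every other
# worker to wait at the barrier for roughly that long.
N_WORKERS = 4
DELAY_S = 0.001            # one slow worker's scheduling delay
barrier = threading.Barrier(N_WORKERS)
waits = {}

def worker(rank: int) -> None:
    if rank == 0:
        time.sleep(DELAY_S)   # simulate a descheduled CPU core
    start = time.perf_counter()
    barrier.wait()            # no rank proceeds until all ranks arrive
    waits[rank] = time.perf_counter() - start

threads = [threading.Thread(target=worker, args=(r,)) for r in range(N_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

for rank in sorted(waits):
    print(f"rank {rank}: waited {waits[rank] * 1e3:.2f} ms at the barrier")
```

One delayed participant sets the pace for the whole group—exactly the dynamic the researchers observed across GPUs in a cluster.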

The insight here isn’t subtle: GPUs capable of processing trillions of operations per second end up idling because a CPU running at a few gigahertz can’t dispatch work fast enough. And according to the data, this happens constantly.

What the Data Shows

The research team analyzed 4.65 million job scheduler logs from two university computing clusters running approximately 35,000 CPU cores and several hundred GPUs including NVIDIA H200 and RTX 6000 models. The findings were striking:

  • Users routinely request absurdly low CPU-to-GPU ratios.
  • On the instructional cluster, the median allocation was just 1–2 CPU cores per GPU.
  • Some users requested a single CPU core for 4 or 8 GPUs—a ratio of 0.25 cores per GPU.
  • Even on the research cluster with enforced proportional allocation, 60% of jobs showed CPU-to-GPU ratios below 8 cores per GPU—the threshold where pathological slowdowns begin.

This isn’t unique to academic clusters. AWS GPU instances commonly provide only 3–6 virtual CPUs per GPU by default. A p5.48xlarge instance with 8× H100 GPUs costs $55.04/hour, while each additional CPU core costs roughly $0.05/hour—meaning users often skimp on a $0.40/hour expense that’s blocking efficient use of a $55/hour resource. GPU compute costs roughly 100–1,600× more than CPU cores, yet the cheaper component frequently becomes the bottleneck.

The Experiment: How Bad Can It Get?


To quantify the impact, the researchers designed an “attacker-victim” experiment using vLLM, a popular LLM serving framework. They sent a target request (the “victim”) to a model running on 4 or 8 GPUs, then bombarded the system with concurrent “attacker” requests at 8–16 requests per second, each with prompts ranging from 1,800 to 114,000 tokens. Results:

  • With minimal CPU allocation (number of GPUs + 1 core): systems frequently failed to complete within the 200-second timeout. CPU utilization stayed near 100%, but GPUs sat mostly idle—less than 40% utilized in many cases.
  • With adequate CPU allocation (4–8 cores per GPU): The same workload completed 1.36× to 5.40× faster. Time-to-first-token latency dropped from multiple minutes to under 30 seconds. GPU utilization jumped to 80–95%.

The researchers tested this across multiple GPU generations—NVIDIA H100, H200, and RTX 6000 Blackwell—and the pattern held universally, confirming this is a fundamental characteristic of how multi-GPU systems behave under CPU contention, not a quirk of specific hardware.

Why It’s Getting Worse, Not Better

Several emerging trends will intensify CPU bottlenecks rather than alleviate them:

  • Longer contexts: Models now support million-token contexts. Tokenizing a 1-million-token prompt requires multiple seconds of CPU time per request, scaling linearly with input length.
  • AI agents: Tool calls require frequent CPU-side processing between GPU inference steps. Recent work shows tool processing can account for up to 90.6% of total latency in agentic workflows.
  • Multimodal inputs: Images and video require CPU-intensive preprocessing before GPU inference begins.
  • Faster GPUs: As GPU performance improves faster than CPU performance, per-GPU compute time decreases—making CPU coordination overhead a larger relative fraction of total latency.
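
The last trend is Amdahl’s law in action. The arithmetic below is our illustration with made-up round numbers, not figures from the paper: hold CPU-side overhead fixed while GPU compute shrinks with each hardware generation, and the CPU’s share of end-to-end latency climbs.

```python
# Fixed CPU-side overhead (tokenization + launches + sync) against
# ever-shrinking GPU compute time per request. All values hypothetical.
cpu_overhead_ms = 50.0
for gpu_compute_ms in (1000.0, 500.0, 250.0, 125.0):
    frac = cpu_overhead_ms / (cpu_overhead_ms + gpu_compute_ms)
    print(f"GPU {gpu_compute_ms:>6.0f} ms -> CPU is {frac:.0%} of latency")
```

Each GPU generation that halves compute time roughly doubles the CPU’s relative share of the critical path—without the CPU getting any slower.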

The Economics Make This Baffling

Adding 16 CPU cores to an 8× H100 instance at $0.05/hour each costs just $0.80/hour—a 1.5% increase in total instance cost. The researchers’ data shows this is remarkably cost-effective: for CPU-starved workloads, performance scales nearly linearly with additional cores at minimal expense. Instead of provisioning more GPUs, simply allocating adequate CPU resources delivers better throughput per dollar.
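
The arithmetic is worth spelling out, using the figures from the article (actual cloud pricing varies by region and provider):

```python
# Marginal cost of extra CPU cores on an 8x H100 instance, using the
# article's figures: $55.04/hr for the instance, ~$0.05/hr per core.
gpu_instance_per_hr = 55.04
core_per_hr = 0.05
extra_cores = 16

extra_cost = extra_cores * core_per_hr
increase = extra_cost / gpu_instance_per_hr
print(f"+{extra_cores} cores: ${extra_cost:.2f}/hr, a {increase:.1%} increase")
```

An $0.80/hour line item against a $55/hour instance—for a 1.36× to 5.40× speedup—is about as lopsided as infrastructure trade-offs get.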

Yet users consistently underprovision—either because they’re unaware of the bottleneck or because scheduler defaults allocate minimal cores. The default Slurm parameter --cpus-per-task=1 allocates a single CPU core per task, severely degrading multi-GPU performance unless users explicitly override it.

What You Should Do

For Cloud Users

Don’t accept default CPU allocations. Explicitly request 4–8 CPU cores per GPU for LLM serving workloads. The marginal cost is negligible compared to GPU instance pricing, and the performance improvement is substantial.

For Cluster Operators

Enforce minimum CPU-to-GPU ratios in schedulers. The researchers suggest at least 4 cores per GPU to avoid pathological slowdowns, with 8 cores preferred for high-throughput serving.
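
A hypothetical admission check along these lines—our sketch, not a real Slurm plugin—could flag underprovisioned requests before they reach the queue:

```python
# Hypothetical scheduler-side check: classify a job request by its
# CPU-to-GPU ratio against the thresholds the paper recommends.
MIN_CORES_PER_GPU = 4        # below this, pathological slowdowns begin
PREFERRED_CORES_PER_GPU = 8  # preferred for high-throughput serving

def check_allocation(cpus: int, gpus: int) -> str:
    """Return 'reject', 'warn', or 'ok' for a requested allocation."""
    if gpus == 0:
        return "ok"          # CPU-only jobs need no ratio check
    ratio = cpus / gpus
    if ratio < MIN_CORES_PER_GPU:
        return "reject"      # e.g. Slurm's default --cpus-per-task=1
    if ratio < PREFERRED_CORES_PER_GPU:
        return "warn"
    return "ok"

for cpus, gpus in [(1, 4), (16, 4), (32, 4)]:
    print(f"{cpus} CPUs / {gpus} GPUs -> {check_allocation(cpus, gpus)}")
```

In practice the same policy can be expressed directly in scheduler configuration; the point is simply to make the safe ratio the default rather than something each user must remember to request.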

For Framework Developers

Current optimizations like CUDA Graphs and chunked prefill help but don’t eliminate CPU bottlenecks. The paper identifies shared-memory broadcast contention as a structural issue that requires architectural changes—such as GPU-initiated networking or persistent GPU kernels—that remove CPUs from the critical path entirely.

The Bottom Line

AI infrastructure optimization has focused almost exclusively on GPU efficiency while ignoring the control plane. Organizations spend enormous sums on GPUs—a single DGX H100 system costs over $300,000—while inadvertently starving them with inadequate CPU provisioning, a problem that costs dollars per hour to fix.

In the age of trillion-parameter models and GPU clusters costing millions of dollars, a $50 CPU shortage can waste it all. The researchers demonstrate this isn’t a theoretical concern but a measurable, widespread problem affecting real deployments today—and one that’s trivial to fix if you know to look for it.

The full paper, Characterizing CPU-Induced Slowdowns in Multi-GPU LLM Inference by Euijun Chung, Yuxiao Jia, Aaron Jezghani, and Hyesoon Kim from Georgia Tech, is available on arXiv (arXiv:2603.22774v1).
