YK Research

Amdahl's Revenge

Why Your Blackwell Cluster Sits 80% Idle. And Who Captures the Gap.

18 April 2026 · YK Research · Follow-up to The CPU Shortage (16 Apr)

The Mispricing

GPU Util in Agent Loops: ~18%
CPU Latency Share: 90.6%
Effective $/tok Penalty: 5.6×
CPU:GPU Target: 1:4

The market has priced “AI = GPU wins.” It has not priced what happens after inference becomes agentic. In complex reasoning workflows on Blackwell-class clusters, GPU utilization drops under 20%; the other 80% of wall time is CPU-side orchestration and waiting. A $40k Blackwell running at 20% utilization is $32k of stranded capex per node. At hyperscaler scale, that's billions.

This is Amdahl's Law showing up in the 2026 AI capex cycle. It rewires who wins.

The Profile That Named It

Raj et al. (Georgia Tech + Intel, Nov 2025) ran five agentic workloads (Haystack RAG, Toolformer, ChemCrow, LangChain, SWE-Agent) on a real cluster and measured where the time goes.

Tool processing on CPUs can take up to 90.6% of the total latency. CPU dynamic energy consumes up to 44% of total dynamic energy at large batch sizes.
Raj et al., arXiv:2511.00739 (Nov 2025)

The paper's proposed schedulers (CGAM + MAWS) recover 2.1× and 1.41× P50 latency speedups, respectively. That's the tell: if software scheduling alone can claw back 2×, the current deployment is deeply suboptimal.

Lead author Ritik Raj received the IBM PhD Fellowship in February 2026 specifically for this line of work. IBM — the industrial research giant that has been CPU-centric for 60 years — is betting on agentic-AI CPU efficiency. They see the same trade.

GPU Utilization by Workload Type

Source: Raj et al. (2025), industry benchmarks on Blackwell / H100 clusters. Reasoning agent = chain-of-thought with tool calls.

Effective $/Token Penalty

Effective cost = 1 / utilization. At 18% util, a reasoning agent pays 5.6× the datasheet $/token.
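The arithmetic, as a minimal Python sketch (the 18% figure is from above; the other utilization points are illustrative):

    # Effective $/token multiplier: a GPU that is busy only `util` of the time
    # amortizes its full cost over `util` of the tokens it could have served.
    def effective_cost_multiplier(util: float) -> float:
        return 1.0 / util

    for util in (0.18, 0.40, 0.60, 0.80):
        print(f"{util:.0%} utilization -> {effective_cost_multiplier(util):.2f}x datasheet $/token")
    # 18% utilization -> 5.56x datasheet $/token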

Where the Missing 80% Goes

Source: Synthesized from Raj et al. plus production agent runtime profiles. Exact mix varies by framework (LangGraph, CrewAI, custom). A schematic loop showing where these phases sit is sketched after the breakdown.

State Management (~30%)

Tracking what each sub-agent has done, dependencies, parent/child relationships. Framework overhead from LangGraph, CrewAI, custom orchestrators.

Verification / Reflection (~20%)

Does this output make sense? Continue, retry, or spawn a new agent? Often requires its own inference call back to the GPU — but the CPU decides when.

Serialization (~15%)

Tokens → strings → JSON for tool calls, then parse responses back into context. Not a memcpy problem — it's the string and schema-validation tax at million-req/sec scale.

I/O Wait (~15%)

Waiting on external APIs, databases, web search. CPU blocks, GPU idle. PCIe generation and lane count matter here.
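To make the breakdown concrete, here is a schematic agent step in Python. The phase boundaries mirror the four categories above; llm_generate and call_tool are hypothetical stand-ins for the GPU inference endpoint and the tool runtime, and the structure is an illustrative sketch, not code from Raj et al. or any named framework.

    import json
    import time

    def run_agent_step(llm_generate, call_tool, state: dict) -> dict:
        """One loop iteration; only llm_generate touches the GPU."""
        timings = {}

        t0 = time.perf_counter()                 # GPU inference: the parallelizable slice
        raw = llm_generate(state["context"])
        timings["gpu_inference"] = time.perf_counter() - t0

        t0 = time.perf_counter()                 # Serialization: tokens -> JSON and back
        tool_call = json.loads(raw)              # parse model output into a tool call
        payload = json.dumps(tool_call["args"])  # re-serialize arguments for the tool API
        timings["serialization"] = time.perf_counter() - t0

        t0 = time.perf_counter()                 # I/O wait: CPU blocks, GPU sits idle
        result = call_tool(tool_call["name"], payload)
        timings["io_wait"] = time.perf_counter() - t0

        t0 = time.perf_counter()                 # State management: history, dependencies
        state["history"].append({"call": tool_call, "result": result})
        state["context"] += "\n" + str(result)
        timings["state_mgmt"] = time.perf_counter() - t0

        t0 = time.perf_counter()                 # Verification: continue, retry, or stop?
        state["done"] = result is not None       # real verifiers are richer, often re-calling the GPU
        timings["verification"] = time.perf_counter() - t0

        state["timings"] = timings
        return state

Everything outside the first timed block is serial CPU work; that is the 80% this note is pricing.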

Amdahl's Law, Applied to AI Capex

Maximum speedup with N parallel units is 1 / ((1-p) + p/N), where p is the parallelizable fraction. Adding GPUs only helps the p portion.

Classic Amdahl's Law. In an agent loop where only 20% of work is parallelizable, adding 64 GPUs gets you 1.24× speedup, not 64×. You're paying for 64, using 1.24.
If 80% of wall time is serial CPU work, throwing more Blackwells at it gives you almost zero speedup. You have to attack the serial portion.
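A minimal sketch of that arithmetic, using p = 0.20 from the figures above:

    # Amdahl's Law: speedup from N parallel units when only fraction p is parallelizable.
    def amdahl_speedup(p: float, n: int) -> float:
        return 1.0 / ((1.0 - p) + p / n)

    p = 0.20  # ~20% of agent-loop wall time is GPU-parallel
    for n in (1, 8, 64, 1024):
        print(f"N={n:>4}: {amdahl_speedup(p, n):.2f}x")
    # N=  64: 1.24x -- paying for 64 GPUs, using 1.24
    # Ceiling as N -> infinity: 1 / (1 - p) = 1.25x, however many GPUs you add.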

Why This Is Structural, Not a Cycle

1. The workload mix has already changed

Agents are in production at scale — Cursor, Devin, SWE-Agent, Copilot Workspace, every enterprise RAG deployment. This isn't a projection. It's already the majority of new token volume at many providers.

2. The hardware vendors have priced it in

NVIDIA is shipping Vera as a standalone CPU (no GPU attached). CoreWeave is the announced customer; Jensen Huang hinted in a Jan 2026 Bloomberg interview that “many more” are coming. A company printing 75% GPU gross margins does not split out a standalone CPU SKU unless it sees a large, serial market forming.

3. Intel got caught by surprise

Their Q4 2025 earnings call acknowledged unexpectedly strong server CPU demand and raised 2026 capex on foundry tools, shifting wafer allocation from PC to server. Intel is the largest server CPU incumbent. If they didn't see it coming, the market hasn't either.

CPU:GPU Ratio Shift

The shift is roughly 4× more CPU per GPU for reasoning-heavy workloads than for static inference; the sketch below translates that into the 1:4 target from the summary box.
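Back-of-envelope, with a loudly flagged assumption: the 1-socket-per-16-GPUs baseline below is an illustrative provisioning norm, not a figure from this note or from Raj et al.

    # Assumed baseline: ~1 CPU socket per 16 GPUs for static inference (illustrative).
    static_cpus_per_gpu = 1 / 16
    agent_cpus_per_gpu = static_cpus_per_gpu * 4                  # the ~4x shift for reasoning agents
    print(f"1 CPU socket per {1 / agent_cpus_per_gpu:.0f} GPUs")  # -> 1 per 4, i.e. CPU:GPU 1:4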

Who Captures the Gap

AMD ($AMD), Still Highest Conviction

EPYC Venice on TSMC N2 with 256 cores / 512 threads is the best general-purpose server CPU of 2026. The Amdahl framing tightens the 16 April call: AMD isn't just winning server share, it's winning the most CPU-constrained segment of that share.

NVIDIA ($NVDA), The Quiet Rerating

Consensus: “NVDA is peak margins, ex-growth.” Vera standalone changes that. NVIDIA now sells both sides of the tray. Add NVLink-C2C coherent CPU↔GPU memory (1.8 TB/s, only NVIDIA has this) and they own the architecture that best kills the Amdahl bottleneck. Vera is a hidden growth pillar nobody is modeling.

TSMC ($TSM), Wins Regardless

Builds for AMD Venice, NVIDIA Vera, ARM AGI, Graviton5, Cobalt 200, Axion. The CPU wave adds structural wafer demand on top of GPUs and mobile.

ARM ($ARM), Dual Royalty

Licensing fees from every custom server CPU (Graviton, Cobalt, Axion, Grace/Vera, Ampere) plus direct AGI CPU revenue. Head-node CPUs for reasoning-heavy agents skew ARM (coherent memory + perf/W). Tension: competing with licensees. TAM expansion dwarfs the tension.

Ampere (via SoftBank, unlisted), The Hidden Hand

SoftBank acquired Ampere to fold it into the ARM orbit. AmpereOne MX ships 256 cores with aggressive perf/W. This is SoftBank pre-positioning to capture merchant agentic-CPU demand, independent of hyperscaler custom silicon. Watch SoftBank earnings for color.

GUC (3443.TW), Picks and Shovels

TSMC's back-end design affiliate. Every hyperscaler custom CPU pays GUC for back-end work. Low float, under-covered, direct exposure. Still the sleeper.

What Kills It

Risk: Agent frameworks collapse CPU tax into the GPU
Severity: MEDIUM · Probability: 25%
Impact on thesis: Speculative decoding, parallel tool-call batching, in-GPU state reduce CPU share.
Mitigant: Ratio shift smaller, but still happens. AMD/TSMC calls largely intact.

Risk: 20% utilization is cherry-picked worst case
Severity: HIGH · Probability: 30%
Impact on thesis: Well-tuned production inference hits 60-80%. Thesis becomes tail use case.
Mitigant: Raj et al. profile is on real workloads. Reasoning loop is the growing mix, not the exception.

Risk: Agentic AI capability stalls
Severity: HIGH · Probability: 15%
Impact on thesis: Workload mix reverts to batch inference. CPU demand normalizes.
Mitigant: Current usage curves argue against this. Enterprise adoption is accelerating.

Risk: Hyperscaler custom silicon ramps in <2 years
Severity: MEDIUM · Probability: 20%
Impact on thesis: Merchant CPU TAM (AMD, Ampere, merchant ARM) compresses.
Mitigant: NVIDIA, TSMC calls unaffected. AMD rerates down, not out.

Risk: Amdahl framing proves too literal
Severity: LOW · Probability: 15%
Impact on thesis: Real workloads are async (overlapped CPU/GPU), not strictly serial.
Mitigant: Even with overlap, the 90.6% CPU latency share is the dominant term.

Position & Bottom Line

Unchanged sizing from 16 April. What changes is the conviction mechanism: the thesis now has a named quantitative backbone (90.6% CPU latency in Raj et al.) and a named trade for NVIDIA beyond “more Blackwells.” Treat Vera standalone as a margin-preserving growth lever NVIDIA has not yet been credited for.

AMD: Accumulate · NVDA: Hold · TSM: Accumulate · ARM: Selective
If the 16 April piece was 'CPUs are in short supply,' this one is 'the reason why is Amdahl's Law applied to agent loops, and it rewires who wins the AI capex cycle.' GPU capex is a stock. Agent CPU capex is the flow that makes it productive. The market has bought the stock and not modeled the flow.
[1] Raj, R. et al., “A CPU-Centric Perspective on Agentic AI,” arXiv:2511.00739v2, Nov 2025. (“Tool processing on CPUs can take up to 90.6% of the total latency.” Lead author received the IBM PhD Fellowship, Feb 2026.)
[2] Patel, D. (SemiAnalysis), “CPUs are Back: The Datacenter CPU Landscape in 2026,” 18 Apr 2026.
[3] NVIDIA Developer Blog, “Inside the NVIDIA Rubin Platform: Six New Chips, One AI Supercomputer” (Vera standalone reference).