For two years the industry has answered every inference question with "bigger cluster." That answer was always going to fray the moment the memory bus stopped being two cities apart.
The interesting shift in hardware right now isn't model size or peak FLOPS. It's the topology — where the weights live, where the activations live, and how far they have to travel between the two. The discrete-card model that everyone learned to optimise for was a gaming-era artifact. The unified-memory model is the structural correction, and it's quietly making single-user, single-tenant inference economically obvious in places it wasn't a year ago.
1. The discrete-card model is a gaming artifact
CPUs and GPUs evolved separately because they served different buyers. The CPU lineage came out of general computing. The GPU lineage came out of consumer 3D rendering and never quite shook off the assumption that the GPU is a subordinate coprocessor on the other end of a bus. The PCIe link between them is the accident of history that AI inference still pays for, every single forward pass.
For training, that cost amortises. You move the weights across the bus once and then do trillions of operations on the GPU side. Single-user inference doesn't get that amortisation once the working set outgrows VRAM: offloaded layers and KV-cache blocks cross the bus again on every token, and the bus becomes the floor of your latency budget.
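A back-of-the-envelope sketch of that floor, assuming the only cost is moving bytes over the link at its nominal rate. The function name and the example figures are illustrative assumptions, not measurements: a 32 GB/s link (roughly the theoretical one-direction rate of PCIe 4.0 x16) and 2 GB of data crossing it per token.

```python
def bus_floor_ms(bytes_per_token: float, bus_gb_per_s: float) -> float:
    """Lower bound on per-token latency from bus transfer alone, in milliseconds."""
    if bus_gb_per_s <= 0:
        raise ValueError("bus bandwidth must be positive")
    seconds = bytes_per_token / (bus_gb_per_s * 1e9)
    return seconds * 1e3


# Assumed figures: 2 GB crossing a 32 GB/s link on every token.
floor = bus_floor_ms(bytes_per_token=2e9, bus_gb_per_s=32.0)
print(f"{floor:.1f} ms per token, at best {1000 / floor:.0f} tokens/s")
```

Under those assumptions the floor is 62.5 ms per token, about 16 tokens per second, before any compute happens.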
This is the part of the architecture that the rest of the stack has been bending around for a decade. Quantisation, KV-cache offload, speculative decoding, paged attention — half of them are clever tricks to keep more of the working set on the GPU side and stop crossing the bus. They work. They also exist because the topology is wrong for the workload.
2. The unified bet is structural, not a single vendor's idea
Several vendors are converging on the same answer from different angles.
- Apple Silicon — CPU, GPU, neural engine, and a single LPDDR memory pool on one package. The current Ultra and Max parts ship up to 192 GB of that pool, every byte of which is addressable by both the CPU and the GPU without a copy.
- AMD's Strix Halo (Ryzen AI Max+ 395) — the same bet on the x86 side. Up to 128 GB of shared LPDDR, an integrated GPU that gets to treat that pool as VRAM, and a power envelope that fits in a laptop.
- Nvidia's small workstation parts (the GB10 / DGX Spark line) — Nvidia's own admission that the unified topology has a real customer. 128 GB of coherent memory between a Grace CPU and a Blackwell GPU, in a box that sits on a desk.
These aren't the same product. They're three different bets on the same thesis: for inference, the memory bus matters more than peak FLOPS, and the right place for the bus is inside the package.
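One way to see why bandwidth dominates: during single-stream decoding, each new token has to read roughly every weight once, so memory bandwidth divided by model size gives a rough ceiling on tokens per second. The sketch below uses that common approximation. It ignores KV-cache reads, compute time, and caching effects, and the bandwidth figures are placeholders rather than vendor specifications.

```python
def decode_ceiling_tok_s(params_billion: float, bits_per_weight: int,
                         mem_bandwidth_gb_s: float) -> float:
    """Rough upper bound on single-stream decode speed.

    Assumes every weight is read once per token and nothing else limits speed.
    """
    weight_gb = params_billion * bits_per_weight / 8  # 1e9 params * bytes each = GB
    return mem_bandwidth_gb_s / weight_gb


# Placeholder bandwidths (GB/s), chosen only to show the shape of the trade-off.
for label, bandwidth in [("weights streamed over a 32 GB/s bus", 32.0),
                         ("weights in a 400 GB/s unified pool", 400.0)]:
    ceiling = decode_ceiling_tok_s(params_billion=32, bits_per_weight=4,
                                   mem_bandwidth_gb_s=bandwidth)
    print(f"{label}: ~{ceiling:.0f} tokens/s ceiling")
```

Peak FLOPS never appears in the estimate, which is the point: for one user decoding one stream, the ceiling is set by how fast the weights can be read.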
The framework layer is what actually earns the hardware. MLX on the Apple side, ROCm with its unified extensions on the AMD side, CUDA on the Nvidia side. Frameworks designed for the discrete world — PyTorch and TensorFlow in their default modes — have to be coaxed into using the unified topology properly. Frameworks built around the new topology from day one don't have that legacy to fight, and they ship optimisations the older stacks are still backporting.
The buzzword most decks haven't caught up with yet is the framework name, not the hardware. The hardware is well-rehearsed by now. The framework is where the productivity actually lives.
3. What actually changes for inference
This is the comparison that decides whether you should care:
| Dimension | Discrete CPU + GPU | Unified Memory |
|---|---|---|
| Where the KV cache lives | VRAM (24–80 GB ceiling per card) | Shared with system RAM (96–192 GB on a workstation) |
| Cost of model swap | Full PCIe transfer of weights on every load | Memory-mapped, near-instant on warm boot |
| Idle power | 50–300 W floor while the card is alive | 5–15 W floor for the whole machine |
| Practical max resident model, 8-bit weights | ~70 B on a single 80 GB card | ~120 B on a 192 GB workstation, without sharding |
| Sustained tokens/sec per dollar of capex | Strong for batched serving, weak for one user | The reverse |
| Concurrency model | Built for many tenants on one card | Built for one tenant on the whole box |
The unified column does not win every fight. It wins the specific fight that local inference has been losing on the discrete-card model: holding a 30–70 B parameter model resident, with a large KV cache, at low idle power, for one user at a time, without sharding. That workload was the bad neighbour in every shared-cluster argument, and it's now the workload the new topology is good at.
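The "resident model plus large KV cache" budget can be estimated with simple arithmetic. A minimal sketch, assuming a dense transformer with grouped-query attention. The model shape below (80 layers, 8 KV heads, head dimension 128) is a hypothetical 70 B-class configuration, not any specific model's published numbers, and the function names are made up for illustration.

```python
def weights_gb(params_billion: float, bits_per_weight: int) -> float:
    """Weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_value: int = 2) -> float:
    """KV-cache footprint in GB for one sequence.

    The leading 2 covers the separate key and value tensors in each layer.
    """
    total_bytes = 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_value
    return total_bytes / 1e9


# Hypothetical 70 B-class shape with a 128k-token context and an FP16 cache.
cache = kv_cache_gb(layers=80, kv_heads=8, head_dim=128, context_tokens=131_072)
for bits in (16, 8, 4):
    total = weights_gb(70, bits) + cache
    print(f"{bits}-bit weights: {weights_gb(70, bits):.0f} GB + "
          f"{cache:.1f} GB cache = {total:.1f} GB")
```

Under these assumptions the 8-bit case comes to roughly 113 GB: beyond any single 80 GB card, but inside a 192 GB pool without sharding.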
4. When the unified bet still loses
Four cases where the discrete-card model still wins, and where the unified topology will frustrate you if you pitch it in anyway:
- Training. Gradient synchronisation is bound by interconnect bandwidth in ways inference isn't. The fabric between cards in a real training cluster still beats a single package, and it isn't close. If you're training a foundation model or even seriously fine-tuning a 70 B one, the hyperscaler is still where that work belongs.
- High-concurrency serving. Forty users on one workstation will eat each other. The unified machines are built for the case where you are the user, not for the case where a thousand users are. Multi-tenant inference still wants discrete cards and a serving stack designed around them.
- Bursty request flow that needs autoscaling. A box on a desk does not scale to a traffic spike. If your workload bursts by 10x at unpredictable hours, the elastic API is doing the elastic part for you, and that's worth paying for.
- Models above the package memory ceiling. A 400 B parameter model doesn't fit in 192 GB at any quantisation you'd actually want to run: even at 4 bits the weights alone are about 200 GB. The largest models still need the cluster.
The shorthand: the unified bet wins on single-tenant inference of mid-sized models at low duty cycle. It loses on training, on multi-tenant serving, and on anything past the memory ceiling. Don't pitch the architecture at a workload it's not for.
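The shorthand can be written down as a checklist. This is a sketch only: the `Workload` fields, the `unified_fit` name, and both thresholds (a 192 GB pool and a single concurrent user) are assumptions made for illustration, and the memory test counts weights only, ignoring the KV cache.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Workload:
    training: bool          # training or heavy fine-tuning rather than inference
    concurrent_users: int   # simultaneous users hitting the model
    bursty: bool            # unpredictable spikes that need autoscaling
    params_billion: float   # model size in billions of parameters
    bits_per_weight: int    # quantisation you are willing to run


def unified_fit(w: Workload, pool_gb: float = 192.0,
                max_concurrent: int = 1) -> List[str]:
    """Return the reasons a workload is a poor fit; an empty list means none found."""
    reasons = []
    if w.training:
        reasons.append("training")
    if w.concurrent_users > max_concurrent:  # assumed threshold, tune to taste
        reasons.append("high-concurrency serving")
    if w.bursty:
        reasons.append("bursty traffic needing autoscaling")
    if w.params_billion * w.bits_per_weight / 8 > pool_gb:  # weights only
        reasons.append("weights exceed package memory")
    return reasons


nightly_agent = Workload(training=False, concurrent_users=1, bursty=False,
                         params_billion=32, bits_per_weight=4)
print(unified_fit(nightly_agent))
```

An empty list is not a recommendation, only the absence of one of the four disqualifiers above.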
5. The eighteen-month window
Three things have to be true at once for the math to flip on a workload, and right now all three are true on more workloads than they were last year:
- The model is good enough at the size that fits. A 27–32 B model, 4-bit quantised, on the unified topology is genuinely useful for a real internal tool — extraction, summarisation, structured generation, classification with rationale. A year ago it wasn't. The quality line moved.
- The framework is mature enough to expose the hardware. Quantisation kernels, KV-cache management, and speculative decoding are now first-class on the unified frameworks. The "it runs, but it's slow" gap has mostly closed for the formats that matter.
- The economics pencil out. A workstation in the $5–8K range, amortised over two years against a steady API spend for the same workload on the same model class, comes out ahead for any non-trivial internal use case. Add data sovereignty or rate-limit risk and the comparison stops being close.
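The third condition is the easiest to check for your own workload. A minimal break-even sketch: every input below is a made-up example figure, and the function ignores resale value, staff time, and any difference in model quality.

```python
def breakeven_months(hardware_cost: float, monthly_api_spend: float,
                     monthly_running_cost: float = 0.0) -> float:
    """Months until the workstation's up-front cost is recovered.

    Returns infinity when running the box costs at least as much as the API.
    """
    monthly_saving = monthly_api_spend - monthly_running_cost
    if monthly_saving <= 0:
        return float("inf")
    return hardware_cost / monthly_saving


# Assumed inputs: a $6,500 machine, $400/month of API spend, $25/month of power.
months = breakeven_months(6500, monthly_api_spend=400, monthly_running_cost=25)
print(f"break-even after {months:.1f} months")
```

With those inputs the box pays for itself in a little over 17 months, inside the two-year amortisation window. Halve the API spend and it no longer does, which is why the steady-spend assumption carries most of the weight.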
The conversation in most companies is still "which API." That will keep being the right answer for the bursty, multi-tenant, public-facing workloads. For the long tail of internal inference — the agent that runs nightly over the inbox, the extraction pipeline behind the dashboard, the copilot for a team of twelve — the default answer is shifting, and the people running the numbers usually know it before the people writing the slide decks do.
The honest read isn't that the cloud is over. It's that the cloud is no longer the default answer to "where does this inference run?" — and once the default cracks, the question gets interesting again on workloads where nobody bothered asking it before.