The Unified-Memory Bet: Why On-Device Inference Stopped Being a Toy

For two years the industry's default answer to every inference question has been "bigger cluster." A different hardware topology is quietly making that the wrong default for a non-trivial slice of workloads, and the framework layer that makes the hardware pay off is the buzzword most decks haven't caught up with yet.

Initial Editor · 2026-05-15 · 7 min read

For two years the industry has answered every inference question with "bigger cluster." That answer was always going to fray the moment the memory bus stopped being two cities apart.

The interesting shift in hardware right now isn't model size or peak FLOPS. It's the topology — where the weights live, where the activations live, and how far they have to travel between the two. The discrete-card model that everyone learned to optimise for was a gaming-era artifact. The unified-memory model is the structural correction, and it's quietly making single-user, single-tenant inference economically obvious in places it wasn't a year ago.

1. The discrete-card model is a gaming artifact

CPUs and GPUs evolved separately because they served different buyers. The CPU lineage came out of general computing. The GPU lineage came out of consumer 3D rendering and never quite shook off the assumption that the GPU is a slave coprocessor on the other end of a bus. The PCIe link between them is the accident of history that AI inference still pays for, every single forward pass.

For training, that cost amortises. You move the weights across the bus once and then do trillions of operations on the GPU side. Single-user inference gets no such amortisation once the working set outgrows VRAM. Whatever doesn't fit (offloaded layers, spilled KV-cache blocks) has to cross the bus again on every token, and the bus becomes the floor of your latency budget.
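
A back-of-the-envelope sketch of why the bus sets a latency floor: whatever has to cross the link on each token divides directly into the link's bandwidth. The figures below are assumptions for illustration (a rough theoretical peak for a PCIe 4.0 x16 link and a hypothetical 8 GB of data that did not fit in VRAM), not measurements.

```python
def transfer_ms(gigabytes: float, link_gb_per_s: float) -> float:
    """Milliseconds needed to move `gigabytes` over a link sustaining `link_gb_per_s`."""
    return gigabytes / link_gb_per_s * 1000.0


# Assumed figures, for illustration only.
PCIE4_X16_GB_S = 32.0  # rough theoretical peak of a PCIe 4.0 x16 link, one direction
SPILLED_GB = 8.0       # hypothetical slice of weights/KV cache that did not fit in VRAM

per_token_ms = transfer_ms(SPILLED_GB, PCIE4_X16_GB_S)
max_tokens_per_s = 1000.0 / per_token_ms

print(f"bus time per token: {per_token_ms:.0f} ms")
print(f"ceiling on decode speed: {max_tokens_per_s:.1f} tokens/s")
```

Under those assumptions the bus alone caps decoding at a handful of tokens per second, before any compute is counted. Real links sustain less than their theoretical peak, so the true ceiling would be lower.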

This is the part of the architecture that the rest of the stack has been bending around for a decade. Quantisation, KV-cache offload, speculative decoding, paged attention — half of them are clever tricks to keep more of the working set on the GPU side and stop crossing the bus. They work. They also exist because the topology is wrong for the workload.

2. The unified bet is structural, not a single vendor's idea

Several vendors are converging on the same answer from different angles.

  • Apple Silicon: CPU, GPU, neural engine, and a single LPDDR memory pool on one package. The Ultra-class parts have shipped with 192 GB or more of that pool (Max parts top out lower), every byte of which is addressable by both the CPU and the GPU without a copy.
  • AMD's Strix Halo (Ryzen AI Max+ 395): the same bet on the x86 side. Up to 128 GB of shared LPDDR, an integrated GPU that gets to treat that pool as VRAM, and a power envelope that fits in a laptop.
  • Nvidia's small workstation parts (the GB10 / DGX Spark line): Nvidia's own admission that the unified topology has a real customer. 128 GB of coherent memory between a Grace CPU and a Blackwell GPU, in a box that sits on a desk.

These aren't the same product. They're three different bets on the same thesis: for inference, the memory bus matters more than peak FLOPS, and the right place for the bus is inside the package.

The framework layer is what actually earns the hardware: MLX on the Apple side, ROCm on the AMD side, CUDA on the Nvidia side. Frameworks designed for the discrete world (PyTorch and TensorFlow in their default modes) have to be coaxed into using the unified topology properly. Frameworks built around the new topology from day one don't have that legacy to fight, and they ship optimisations the older stacks are still backporting.

The buzzword most decks haven't caught up with yet is the framework name, not the hardware. The hardware is well-rehearsed by now. The framework is where the productivity actually lives.

3. What actually changes for inference

This is the comparison that decides whether you should care:

| Dimension | Discrete CPU + GPU | Unified Memory |
| --- | --- | --- |
| Where the KV cache lives | VRAM (24–80 GB ceiling per card) | Shared with system RAM (96–192 GB on a workstation) |
| Cost of model swap | Full PCIe transfer of weights on every load | Memory-mapped, near-instant on warm boot |
| Idle power | 50–300 W floor while the card is alive | 5–15 W floor for the whole machine |
| Practical max resident model, 8-bit weights | ~70 B on a single 80 GB card | ~120 B on a 192 GB workstation, without sharding |
| Sustained tokens/sec per dollar of capex | Strong for batched serving, weak for one user | The reverse |
| Concurrency model | Built for many tenants on one card | Built for one tenant on the whole box |
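
The resident-model row is plain arithmetic: parameter count times bytes per weight, compared against the memory pool. A minimal sketch for checking which precision a given model size fits at, assuming 1 GB = 10^9 bytes and ignoring KV cache and runtime overhead (the `reserve_gb` parameter is a hypothetical knob for that headroom):

```python
def weights_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes), no overhead."""
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: int,
         pool_gb: float, reserve_gb: float = 0.0) -> bool:
    """True if the weights plus a reserved headroom fit in the memory pool."""
    return weights_gb(params_billion, bits_per_weight) + reserve_gb <= pool_gb


# (parameters in billions, bits per weight, memory pool in GB)
cases = [(70, 16, 80), (70, 8, 80), (120, 8, 192), (400, 4, 192)]
for params, bits, pool in cases:
    need = weights_gb(params, bits)
    verdict = "fits" if fits(params, bits, pool) else "does not fit"
    print(f"{params} B at {bits}-bit needs ~{need:.0f} GB: {verdict} in {pool} GB")
```

A 70 B model needs about 140 GB at 16-bit and about 70 GB at 8-bit, so precision decides which side of an 80 GB ceiling it lands on.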

The unified column does not win every fight. It wins the specific fight that local inference has been losing on the discrete-card model: holding a 30–70 B parameter model resident, with a large KV cache, at low idle power, for one user at a time, without sharding. That workload was the bad neighbour in every shared-cluster argument, and it's now the workload the new topology is good at.
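
The "large KV cache" part of that workload can be sized with the standard attention-cache formula: two tensors (keys and values) per layer, each of shape KV heads by head dimension by sequence length. The model shape below is a hypothetical 70 B-class configuration with grouped-query attention, chosen for illustration rather than taken from any specific model card.

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 seq_len: int, bytes_per_elem: int = 2, batch: int = 1) -> float:
    """KV-cache size in GiB: keys and values for every layer and every position."""
    total_bytes = (2 * n_layers * n_kv_heads * head_dim
                   * seq_len * bytes_per_elem * batch)
    return total_bytes / 2**30


# Hypothetical 70 B-class shape: 80 layers, 8 KV heads, head dimension 128,
# 16-bit cache entries, one user with a 32k-token context.
size = kv_cache_gib(n_layers=80, n_kv_heads=8, head_dim=128, seq_len=32768)
print(f"KV cache at 32k context: {size:.1f} GiB")
```

Under those assumptions a single long-context session adds about 10 GiB on top of the weights, and the figure scales linearly with context length and with the number of concurrent sessions.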

4. When the unified bet still loses

Four cases where the discrete-card model still wins, and where the unified topology will frustrate you if you pitch it in anyway:

  • Training. Training is bound by compute and by gradient traffic between devices in ways single-user inference isn't. The fabric between cards in a real training cluster still beats a single package, and it isn't close. If you're training a foundation model or even seriously fine-tuning a 70 B one, the cluster is still where that work belongs.
  • High-concurrency serving. Forty users on one workstation will eat each other. The unified machines are built for the case where you are the user, not for the case where a thousand users are. Multi-tenant inference still wants discrete cards and a serving stack designed around them.
  • Bursty request flow that needs autoscaling. A box on a desk does not scale to a traffic spike. If your workload bursts by 10x at unpredictable hours, the elastic API is doing the elastic part for you, and that's worth paying for.
  • Models above the package memory ceiling. A 400 B parameter model isn't fitting in 192 GB at any quantisation you'd actually want to run. The largest models still need the cluster.

The shorthand: the unified bet wins on single-tenant inference of mid-sized models at low duty cycle. It loses on training, on multi-tenant serving, and on anything past the memory ceiling. Don't pitch the architecture at a workload it's not for.
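
The shorthand can be written down as a checklist. This is a rough sketch: the `Workload` fields, the default pool size, and the one-user concurrency threshold are all illustrative assumptions, not hard limits.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    training: bool          # is this a training or serious fine-tuning job?
    concurrent_users: int   # users hitting the model at the same time
    bursty: bool            # does traffic spike unpredictably?
    resident_gb: float      # weights plus KV cache that must stay in memory


def reasons_against_unified(w: Workload, pool_gb: float = 192.0,
                            max_concurrent: int = 1) -> list[str]:
    """Return the cases from the list above that this workload trips."""
    reasons = []
    if w.training:
        reasons.append("training")
    if w.concurrent_users > max_concurrent:
        reasons.append("high-concurrency serving")
    if w.bursty:
        reasons.append("bursty traffic needing autoscaling")
    if w.resident_gb > pool_gb:
        reasons.append("model above the memory ceiling")
    return reasons


nightly_agent = Workload(training=False, concurrent_users=1,
                         bursty=False, resident_gb=40.0)
print(reasons_against_unified(nightly_agent) or "good fit for a unified box")
```

An empty list means none of the four losing cases applies. Any entry in the list is a reason to keep the workload on discrete cards or an elastic API.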

5. The eighteen-month window

Three things have to be true at once for the math to flip on a workload, and right now all three are true on more workloads than they were last year:

  1. The model is good enough at the size that fits. A 27–32 B model, 4-bit quantised, on the unified topology is genuinely useful for a real internal tool — extraction, summarisation, structured generation, classification with rationale. A year ago it wasn't. The quality line moved.
  2. The framework is mature enough to expose the hardware. Quantisation kernels, KV-cache management, and speculative decoding are now first-class on the unified frameworks. The "it runs, but it's slow" gap has mostly closed for the formats that matter.
  3. The economics pencil out. A workstation in the $5–8K range, amortised over two years against a steady API spend for the same workload on the same model class, comes out ahead for internal use cases with steady, non-trivial volume. Add data sovereignty or rate-limit risk and the comparison stops being close.
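
The third condition reduces to a break-even calculation. A minimal sketch, in which the monthly API spend and electricity cost are hypothetical inputs you would replace with your own numbers:

```python
from typing import Optional


def breakeven_months(capex_usd: float, api_usd_per_month: float,
                     power_usd_per_month: float = 0.0) -> Optional[float]:
    """Months until the workstation has paid for itself, or None if it never does."""
    monthly_saving = api_usd_per_month - power_usd_per_month
    if monthly_saving <= 0:
        return None  # the API is cheaper than just powering the box
    return capex_usd / monthly_saving


# Hypothetical inputs: a $6,000 workstation, $520/month of API spend it replaces,
# and $20/month of electricity to run it.
months = breakeven_months(6000, 520, 20)
print(f"break-even after {months:.1f} months")
print("inside a two-year amortisation window:", months is not None and months <= 24)
```

With those inputs the box pays for itself in 12 months. At a tenth of that API spend it would not break even within two years, which is why the steadiness of the workload matters as much as the hardware price.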

The conversation in most companies is still "which API." That will keep being the right answer for the bursty, multi-tenant, public-facing workloads. For the long tail of internal inference — the agent that runs nightly over the inbox, the extraction pipeline behind the dashboard, the copilot for a team of twelve — the default answer is shifting, and the people running the numbers usually know it before the people writing the slide decks do.

The honest read isn't that the cloud is over. It's that the cloud is no longer the default answer to "where does this inference run?" — and once the default cracks, the question gets interesting again on workloads where nobody bothered asking it before.
