What MLX Got to Throw Away (That PyTorch Can't)

Initial Editor · 2026-05-15 · 7 min read · 1,445 words

Every mature framework is a museum of decisions you can't take back. MLX is interesting mostly because it started after the decisions that matter for Apple Silicon were already mistakes — and the team building it got to walk through the museum and decide which exhibits to skip.

This post isn't about whether MLX is faster than PyTorch on a Mac. (It often is, on Mac, for the workloads it's tuned for. That's the boring part.) It's about which design choices PyTorch baked in for the discrete-GPU world that MLX has been free to throw away, and what the throwing-away actually costs.

1. The discrete-world choices PyTorch can't unwind

PyTorch shipped in 2016 against a hardware reality that still holds on most servers: CPU on one side, GPU on the other, a bus in between, and a programmer who has to remember which side every tensor is sitting on. That reality is encoded in the API. tensor.to('cuda'). model.cuda(). .cpu(). A device= argument on every factory function. The entire device-management surface is a tax that the framework keeps paying because too much existing code depends on it.
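The device bookkeeping described above looks roughly like this in practice. This is a minimal sketch, not code from any particular project: `pick_device` and `forward_shape` are hypothetical helper names, and the import is guarded so the snippet still loads where PyTorch is not installed.

```python
"""Sketch of the device bookkeeping a typical PyTorch script carries."""
try:
    import torch
except ImportError:  # PyTorch not installed; the helpers below return None
    torch = None


def pick_device():
    """Choose the best available device (hypothetical helper)."""
    if torch is None:
        return None
    if torch.cuda.is_available():
        return torch.device("cuda")
    mps = getattr(torch.backends, "mps", None)  # absent on older builds
    if mps is not None and mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")


def forward_shape():
    """Run one forward pass, moving model and input to the chosen device."""
    device = pick_device()
    if device is None:
        return None
    model = torch.nn.Linear(4, 2).to(device)  # parameters must be moved
    x = torch.ones(3, 4)                      # allocated on the CPU by default
    y = model(x.to(device))                   # input must be moved to match
    return tuple(y.detach().cpu().shape)      # and moved back to inspect
```

Three of the lines in `forward_shape` exist only to say where data lives. None of them is about the model.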

It isn't just .to(device). Eager-by-default execution was the right call in 2016, when debugging was the bottleneck, and torch.compile is the multi-year project of trying to unwind that choice without breaking the existing world. The autograd tape has to keep track of tensors that may be mutated in place. The distributed story grew up around NCCL and a specific topology of networked GPU servers.

None of this is bad engineering. It's accumulated commitment. The framework can't break millions of lines of downstream code to fix a default. So the fix lives in an opt-in compiler, partial mode flags, and a second set of APIs you have to learn alongside the first.

MLX didn't have any of that surface to protect. The interesting thing isn't that they made different choices. It's which ones they made.

2. Lazy evaluation as the default, not as a flag

In PyTorch, when you write c = a + b, the addition happens. Right then. The kernel launches, the result materialises, you can print(c) and see it. To get fusion and memory planning across operations, you opt in to torch.compile and accept its constraints.

In MLX, c = a + b builds a node in a compute graph and returns immediately. The arithmetic happens when something forces evaluation: when you call mx.eval(c), when you call .item() to pull out a Python scalar, or when you print the array or convert it to NumPy. Until then nothing has run, which leaves room to skip work whose result is never used, reuse buffers, and (through mx.compile) fuse adjacent ops into single kernels.

The payoff is real. For inference of a typical transformer block (RMSNorm, matmul, softmax, matmul), a framework that sees the graph before running it can collapse runs of adjacent ops into far fewer kernels. PyTorch needs torch.compile to capture that graph in the first place, and even then it can't always cross Python control-flow boundaries cleanly. MLX has the graph by default, because it isn't pretending the result was already computed; mx.compile applies the heavier fusion on top of it.
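What "collapsing a chain of ops" means can be sketched in plain Python for the simplest case, a run of elementwise ops. This is a toy, not how MLX or torch.compile generate kernels; `run_unfused`, `fuse`, and `run_fused` are hypothetical names, and a list comprehension stands in for a kernel launch.

```python
"""Toy elementwise fusion: one pass over the data instead of one pass per op."""


def run_unfused(xs, ops):
    """Apply each op in its own pass, materialising an intermediate list each time."""
    passes = 0
    for op in ops:
        xs = [op(x) for x in xs]
        passes += 1
    return xs, passes


def fuse(ops):
    """Compose a chain of elementwise ops into a single function."""
    def fused(x):
        for op in ops:
            x = op(x)
        return x
    return fused


def run_fused(xs, ops):
    """Apply the composed function in a single pass with no intermediates."""
    kernel = fuse(ops)
    return [kernel(x) for x in xs], 1


# Scale, shift, then clamp at zero: three elementwise ops in a row.
ops = [lambda x: x * 2.0, lambda x: x + 1.0, lambda x: max(x, 0.0)]
data = [-3.0, 0.5, 4.0]

unfused_out, unfused_passes = run_unfused(data, ops)
fused_out, fused_passes = run_fused(data, ops)
```

Both versions produce the same values. The fused one reads and writes the data once instead of three times, which is where the memory-bandwidth saving comes from.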

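The cost is a different debugging mental model. The point at which a bug surfaces is the point of evaluation, not the point of expression. Timing an expression tells you how long the graph took to build, not how long the work takes, and printing an array forces it to evaluate, so a debug print can change what runs when. You learn to call mx.eval deliberately, and a class of "why is this slow" questions turns into "where am I forcing the graph."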
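The shift in debugging model can be reproduced with a toy lazy graph in plain Python. This is an illustration of the idea only, not MLX's implementation: `Lazy`, `constant`, and `eval_count` are hypothetical names, and the division-by-zero error is a plain-Python stand-in for any failure that only shows up when work actually runs.

```python
"""Toy lazy-evaluation graph. A sketch for illustration, not MLX internals."""
import operator


class Lazy:
    """A node that records an operation instead of running it."""

    eval_count = 0  # how many nodes have actually been computed so far

    def __init__(self, fn, *parents):
        self.fn = fn
        self.parents = parents
        self._value = None
        self._done = False

    def __add__(self, other):
        return Lazy(operator.add, self, other)

    def __truediv__(self, other):
        return Lazy(operator.truediv, self, other)

    def eval(self):
        """Force the graph below this node; each node is computed at most once."""
        if not self._done:
            args = [p.eval() for p in self.parents]
            self._value = self.fn(*args)
            self._done = True
            Lazy.eval_count += 1
        return self._value


def constant(x):
    """Wrap a plain Python number as a leaf node."""
    return Lazy(lambda: x)


a, b, zero = constant(6.0), constant(2.0), constant(0.0)

c = a + b        # builds a node, computes nothing
bad = a / zero   # also just a node; no error is raised here
built_without_work = Lazy.eval_count == 0

result = c.eval()  # the addition runs here

try:
    bad.eval()     # the failure surfaces at evaluation, not at expression
    error_at_eval = False
except ZeroDivisionError:
    error_at_eval = True
```

The line that wrote the bug (`bad = a / zero`) ran without complaint. The traceback points at whichever line forced the graph, which is the habit change the paragraph above describes.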

3. Unified memory as a primitive, not an annotation

This is the API decision that matters most, and the easiest to undervalue.

In PyTorch on Apple Silicon, every tensor still has a device. You allocate on 'cpu', you .to('mps') to move it to the GPU side, and the framework manages a copy across an abstraction that no longer matches the hardware. The hardware has one memory pool. The API has two.

In MLX, an mx.array has no device to manage. There is no .to(). The CPU code path and the GPU code path read and write the same buffer. What you choose is where an operation runs, either through the default device or through a per-op stream argument, and no copy is involved either way. You stop writing .to('mps') calls because there is no other side to move to.
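A short sketch of what that looks like, assuming MLX is installed and following the pattern in MLX's unified-memory documentation of passing a device as the `stream` argument. `cpu_and_gpu_sum` is a hypothetical helper name; the import is guarded, and the GPU step falls back to the CPU where Metal is unavailable, so the snippet loads anywhere.

```python
"""Sketch: run ops on CPU and GPU over the same arrays, with no transfer call."""
try:
    import mlx.core as mx
except ImportError:  # MLX not installed; the helper below returns None
    mx = None


def cpu_and_gpu_sum(n=4):
    """Add two arrays once on the CPU and once on the GPU, then combine."""
    if mx is None:
        return None
    a = mx.ones((n,))
    b = mx.ones((n,))
    # Fall back to the CPU if no Metal GPU is present (assumption for portability).
    second = mx.gpu if mx.metal.is_available() else mx.cpu
    on_cpu = mx.add(a, b, stream=mx.cpu)   # same arrays, CPU execution
    on_gpu = mx.add(a, b, stream=second)   # same arrays, GPU execution
    total = on_cpu + on_gpu                # no .to() anywhere in between
    mx.eval(total)                         # force the lazy graph
    return total.tolist()
```

The device appears as a property of the operation, not of the array. Neither `a` nor `b` is ever moved.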

The cost is portability. MLX is designed around Apple Silicon. A CUDA backend now exists, but it is much newer than the Metal one, and the zero-copy story the API is built around is an Apple Silicon property. If you write a research codebase in MLX and need to scale it onto an H100 cluster, you should still budget for porting it to a different framework. PyTorch's device parameter is the abstraction tax that buys you the ability to ignore the hardware.

That's a real tradeoff. A framework can't paper over every memory topology and also build its API around one of them. MLX picks one topology and lets the API match it. PyTorch picks portability and pays the abstraction cost on every platform.

4. Python and Swift, not Python and a footnote

PyTorch's primary language is Python. Its C++ backend (libtorch) exists, but the Python ecosystem is where the gravity lives. On Apple platforms, the consumer-facing surface is Swift — and PyTorch's Swift story has always been a third-party, partial, often-stale port.

MLX ships first-class Swift bindings (mlx-swift) maintained by the same team as the Python ones. The two APIs are close enough that translating between them is mechanical, and the model checkpoints they read are the same files on disk.

For most ML work, that doesn't matter. The gradient updates happen in Python and that's the end of it. It matters at the seam: you train or fine-tune in Python, and then someone has to ship the model in a Swift app for iPhone, iPad, or Mac. PyTorch's usual path is Python → traced or exported program → Core ML (via coremltools) → Swift, with conversion losses and tooling gaps possible at each hop. MLX's path is Python → save → load in Swift. The model file is the same.
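The Python half of that path can be sketched as follows, assuming MLX is installed and that the weights are stored as safetensors. `roundtrip_weights`, the file name, and the parameter names are hypothetical; the Swift side is not shown and is assumed to read the same file through mlx-swift.

```python
"""Sketch: write weights from Python to a file a Swift app could load as-is."""
import os

try:
    import mlx.core as mx
except ImportError:  # MLX not installed; the helper below returns None
    mx = None


def roundtrip_weights(directory):
    """Save a small weight dict as safetensors, then read it back."""
    if mx is None:
        return None
    path = os.path.join(directory, "weights.safetensors")  # hypothetical name
    weights = {
        "linear.weight": mx.ones((2, 3)),
        "linear.bias": mx.zeros((2,)),
    }
    mx.save_safetensors(path, weights)  # the file the app would ship with
    loaded = mx.load(path)              # returns a dict of name -> array
    return sorted(loaded.keys()), loaded["linear.weight"].shape
```

There is no export step and no intermediate format. What gets shipped is the file the training script wrote.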

The cost is reach. MLX has two first-class language bindings and PyTorch has every binding a researcher has ever wanted. If your deployment target isn't Swift, the Swift-parity argument earns you nothing.

5. What MLX gives up to make all this work

Every framework decision is a tradeoff, and the post isn't worth writing without the column where MLX loses.

| PyTorch decision | MLX rewrite | Cost | Payoff |
| --- | --- | --- | --- |
| .to(device) for host/device split | Unified memory as a primitive | No portability to discrete GPUs | Zero-copy between CPU/GPU paths |
| Eager-by-default execution | Lazy by default; mx.eval() to materialise | Different debugging model | Whole-graph fusion, automatic memory planning |
| Python primary, Swift via third parties | First-class Python + Swift parity | Smaller language reach | Train in Python, ship in Swift without conversion |
| Huge kernel surface, broad hardware | Narrow kernel surface, Apple-only | Some exotic ops still missing | Every kernel tuned for one topology |
| Autograd via tape over mutable tensors | Functional autograd over arrays | Less Pythonic in places | Cleaner composition, JIT-friendlier |
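The autograd row is easiest to see in code. Below is a minimal sketch of the functional style, assuming MLX is installed: `mx.grad` takes a function and returns a new function that computes the gradient with respect to its first argument. The `loss` function, the data, and `grad_at` are hypothetical, and the import is guarded.

```python
"""Sketch: functional autograd, where a gradient is a function of a function."""
try:
    import mlx.core as mx
except ImportError:  # MLX not installed; the helper below returns None
    mx = None


def loss(w, x, y):
    """Mean squared error of a one-parameter linear model."""
    return mx.mean((w * x - y) ** 2)


def grad_at(w_value):
    """Evaluate d(loss)/dw at w_value for a fixed toy dataset."""
    if mx is None:
        return None
    x = mx.array([1.0, 2.0, 3.0])
    y = mx.array([2.0, 4.0, 6.0])
    grad_fn = mx.grad(loss)            # differentiates w.r.t. the first argument
    g = grad_fn(mx.array(w_value), x, y)
    return g.item()                    # .item() forces evaluation
```

There is no .backward() call and no .grad attribute mutated on the input. The gradient is just another function, which is what makes it compose with other transformations and also what feels less familiar coming from PyTorch.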

The specific things MLX gives up:

  • Hardware reach. Apple Silicon first. A CUDA backend exists but is far younger than the Metal one. If your roadmap includes "we'll move to a rented H100 cluster," MLX is not the framework you start in.
  • Distributed training maturity. MLX has added multi-node and gradient sharding, but the production-tested defaults — NCCL, FSDP, DeepSpeed — live in the PyTorch world. For a serious training run across many machines, MLX is the harder bet.
  • Ecosystem of checkpoints. Hugging Face has tens of thousands of models with first-class PyTorch loaders. The mlx-community namespace has caught up considerably for popular open-weights models, but the long tail is PyTorch-first.
  • The reflexive resume keyword. "Five years of PyTorch" is a hireable line. "Three years of MLX" is not yet. That will shift, but slowly.

Where this leaves the framework decision

If you are doing ML research that needs to run on whatever hardware your university or company eventually buys, PyTorch is still the safer default. The portability tax is real and the abstraction surface is what you're paying for.

If you are building inference, fine-tuning, or production tooling that will sit on Apple Silicon and ship into Apple platforms, MLX is the framework that got to skip the museum exhibits. The defaults match the hardware. Lazy-by-default execution hands the runtime a whole graph to optimise, with no tracing step to opt into. The Python and Swift APIs let you train and ship without a conversion pipeline. The price is that you've bet on one vendor's silicon, and on workloads that don't move off it, that bet is currently paying.

The point isn't that MLX is the right framework for everyone. It's that someone finally got to pick the defaults knowing what we'd learned in the decade since PyTorch shipped — and the things they threw away are the things that were quietly costing the rest of us the most.
