What MLX Got to Throw Away (That PyTorch Can't)

Every mature framework is a museum of decisions you can't take back. MLX is interesting mostly because it started late. By the time it was designed, the choices that hurt on Apple Silicon were already visible as mistakes, and the team building it got to walk through the museum and decide which exhibits to skip.

This post isn't about whether MLX is faster than PyTorch on a Mac. (It often is, on Mac, for the workloads it's tuned for. That's the boring part.) It's about which design choices PyTorch baked in for the discrete-GPU world that MLX has been free to throw away, and what the throwing-away actually costs.

1. The discrete-world choices PyTorch can't unwind

PyTorch shipped in 2016 against a hardware reality that still holds for discrete GPUs: CPU on one side, GPU on the other, a bus in between, and a programmer who has to remember which side every tensor is sitting on. That reality is encoded in the API: tensor.to('cuda'), model.cuda(), .cpu(). The entire device-management surface is a tax that the framework keeps paying because too much existing code depends on it.

It isn't just .to(device). Eager-by-default execution was the right call in 2016, when debugging was the bottleneck, and torch.compile is the multi-year project of trying to unwind that choice without breaking the existing world. The autograd tape has to cope with tensors being mutated in place. The distributed story grew up around NCCL and a specific topology of networked GPU servers.

None of this is bad engineering. It's accumulated commitment. The framework can't break millions of lines of downstream code to fix a default. So the fix lives in an opt-in compiler, partial mode flags, and a second set of APIs you have to learn alongside the first.

MLX didn't have any of that surface to protect. The interesting thing isn't that they made different choices. It's which ones they made.

2. Lazy evaluation as the default, not as a flag

In PyTorch, when you write c = a + b, the addition happens. Right then. The kernel launches, the result materialises, you can print(c) and see it. To get fusion and memory planning across operations, you opt in to torch.compile and accept its constraints.

In MLX, c = a + b builds a node in a compute graph and returns immediately. The arithmetic happens when something forces evaluation: when you call mx.eval(c), when you call .item() to pull a Python scalar, or when you print the array or convert it to NumPy. Until then, the framework holds the whole graph rather than one op at a time, so it can skip work that is never requested and reuse buffers. Fusing adjacent ops into single kernels is a step on top of that graph, through mx.compile.
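A toy model makes the deferral concrete. The sketch below is plain Python, not MLX: LazyArray, leaf, and the executed counter are hypothetical names invented for illustration, and real MLX arrays are far more involved. It only shows the shape of the idea: operators build nodes, and nothing runs until something forces the graph.

```python
class LazyArray:
    """A node in a toy compute graph. Hypothetical; not MLX's implementation."""

    executed = 0  # counts how many graph nodes have actually been computed

    def __init__(self, op=None, inputs=(), value=None):
        self.op = op          # callable combining input values, None for a leaf
        self.inputs = inputs  # upstream LazyArray nodes
        self.value = value    # cached result once evaluated

    def __add__(self, other):
        return LazyArray(lambda x, y: [a + b for a, b in zip(x, y)], (self, other))

    def __mul__(self, other):
        return LazyArray(lambda x, y: [a * b for a, b in zip(x, y)], (self, other))

    def eval(self):
        """Force evaluation: compute upstream nodes first, then this one."""
        if self.value is None:
            args = [node.eval() for node in self.inputs]
            self.value = self.op(*args)
            LazyArray.executed += 1
        return self.value


def leaf(values):
    """Wrap concrete data as a graph leaf (already 'evaluated')."""
    return LazyArray(value=list(values))


a = leaf([1.0, 2.0, 3.0])
b = leaf([10.0, 20.0, 30.0])

c = a + b        # builds a node; no arithmetic yet
unused = a * b   # also just a node; never evaluated below
d = c * c

assert LazyArray.executed == 0  # nothing has run so far
result = d.eval()               # forces c, then d
print(result)                   # [121.0, 484.0, 1089.0]
print(LazyArray.executed)       # 2: c and d ran, `unused` never did
```

The node that is never asked for is never computed, which is the cheapest optimisation a lazy framework gets.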

The payoff is real. For inference of a typical transformer block (RMSNorm, matmul, softmax, matmul), a framework that sees the whole chain can run it in far fewer kernels than one that launches each op as it is written. PyTorch needs torch.compile to recover that graph, and even then it can't always cross Python control-flow boundaries cleanly. MLX has the graph by default, because it isn't pretending the result was already computed. Its fusion pass, mx.compile, is still a call you make, but it starts from a graph the framework already holds.
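Why fewer kernels matters is easiest to see on a chain of elementwise ops. The sketch below is a stdlib-only illustration, with hypothetical helper names (unfused, fuse, fused), and it models a "kernel" as one pass over a Python list. It does not cover matmuls or reductions, which real compilers handle very differently.

```python
import math


def unfused(xs, ops):
    """Run each op as its own pass, materialising one intermediate per op."""
    passes = 0
    for op in ops:
        xs = [op(x) for x in xs]  # a full read and write of the data per op
        passes += 1
    return xs, passes


def fuse(ops):
    """Compose elementwise ops into a single function (the 'fused kernel')."""
    def kernel(x):
        for op in ops:
            x = op(x)
        return x
    return kernel


def fused(xs, ops):
    """Run the composed kernel once: one pass, one output buffer."""
    kernel = fuse(ops)
    return [kernel(x) for x in xs], 1


ops = [lambda x: x * 2.0, lambda x: x + 1.0, math.tanh]
data = [0.0, 0.5, 1.0]

out_a, passes_a = unfused(data, ops)
out_b, passes_b = fused(data, ops)

assert out_a == out_b      # same arithmetic, same order, same result
print(passes_a, passes_b)  # 3 1
```

The arithmetic is identical in both versions. What changes is how many times the data is read and written, which is usually the real cost on memory-bound workloads.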

The cost is a different debugging mental model. The point at which a bug surfaces is the point of evaluation, not the point of expression. Print statements stop being free: printing an array forces the graph behind it to evaluate. You learn to call mx.eval deliberately, and a class of "why is this slow" questions turns into "where am I forcing the graph."
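The gap between where a bug is written and where it surfaces can be shown with plain thunks. This is a minimal sketch in stdlib Python, assuming nothing about MLX's internals; const, lazy, and log are hypothetical names used only here.

```python
log = []  # records the order in which things actually happen


def const(value):
    """A leaf thunk that just returns its value."""
    return lambda: value


def lazy(name, fn, *deps):
    """Wrap fn as a thunk; its inputs are evaluated only when it is called."""
    def thunk():
        args = [dep() for dep in deps]
        log.append(f"run {name}")
        return fn(*args)
    return thunk


numerator = const(1.0)
denominator = const(0.0)

# The bug (division by zero) is written on this line...
ratio = lazy("ratio", lambda a, b: a / b, numerator, denominator)
log.append("ratio defined")
scaled = lazy("scaled", lambda r: r * 100.0, ratio)
log.append("scaled defined")

try:
    scaled()  # ...but nothing runs, and nothing fails, until here
except ZeroDivisionError:
    log.append("error raised at evaluation")

for entry in log:
    print(entry)
```

Both definitions complete without complaint. The exception appears only when the last node is forced, and the traceback points at the evaluation call rather than at the line that introduced the bad value.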

3. Unified memory as a primitive, not an annotation

This is the API decision that matters most, and the easiest to undervalue.

In PyTorch on Apple Silicon, every tensor still has a device. You allocate on 'cpu', you .to('mps') to move it to the GPU side, and the framework manages a copy across an abstraction that no longer matches the hardware. The hardware has one memory pool. The API has two.

In MLX, an mx.array has no device parameter. There is no .to(). The CPU code path and the GPU code path read and write the same buffer. You pick the device per operation, or leave it to the default, and the framework runs the kernel there without copying. You stop writing .to('mps') calls because there is no other side to move to.
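The difference between "the tensor has a device" and "the operation has a device" can be sketched without either framework. The model below is stdlib Python with hypothetical classes (DiscreteTensor, UnifiedArray) and a hypothetical run_on helper. The "devices" are just labels, and the only thing being counted is how often a buffer gets duplicated.

```python
from array import array


class DiscreteTensor:
    """Toy host/device split: a tensor lives on exactly one device."""

    copies = 0  # how many times any buffer has been duplicated

    def __init__(self, values, device="cpu"):
        self.buf = array("d", values)  # always a fresh buffer
        self.device = device

    def to(self, device):
        if device == self.device:
            return self
        DiscreteTensor.copies += 1     # crossing the "bus" duplicates the data
        return DiscreteTensor(self.buf, device)


class UnifiedArray:
    """Toy unified memory: one buffer and no device attribute at all."""

    def __init__(self, values):
        self.buf = array("d", values)


def run_on(device, fn, x):
    """Apply fn elementwise in place; `device` only names who does the work."""
    if isinstance(x, DiscreteTensor) and x.device != device:
        raise RuntimeError(f"tensor is on {x.device}, kernel runs on {device}")
    for i in range(len(x.buf)):
        x.buf[i] = fn(x.buf[i])
    return x


# Discrete model: every hop between devices needs an explicit move.
t = DiscreteTensor([1.0, 2.0, 3.0])
t = run_on("gpu", lambda v: v * 2.0, t.to("gpu"))
t = run_on("cpu", lambda v: v + 1.0, t.to("cpu"))

# Unified model: the same two steps touch one buffer, with nothing to move.
u = UnifiedArray([1.0, 2.0, 3.0])
run_on("gpu", lambda v: v * 2.0, u)
run_on("cpu", lambda v: v + 1.0, u)

print(t.buf.tolist(), DiscreteTensor.copies)  # [3.0, 5.0, 7.0] 2
print(u.buf.tolist())                         # [3.0, 5.0, 7.0]
```

MLX's own API follows the second shape. As its documentation describes it, the device is an argument of the operation (a stream argument such as stream=mx.gpu or stream=mx.cpu), not a property of the array.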

The cost is portability. MLX code does not run on CUDA. It does not run on ROCm. It does not run on a desktop with an Nvidia card. If you write a research codebase in MLX and need to scale it onto an H100 cluster, you are rewriting against a different framework. PyTorch's device parameter is the abstraction tax that buys you the ability to ignore the hardware.

That's a real tradeoff. A framework can't abstract over every memory topology and also have an API that matches one of them exactly. MLX picks one topology and lets the API match it. PyTorch picks portability and pays the abstraction cost on every platform.

4. Python and Swift, not Python and a footnote

PyTorch's primary language is Python. Its C++ backend (libtorch) exists, but the Python ecosystem is where the gravity lives. On Apple platforms, the consumer-facing surface is Swift — and PyTorch's Swift story has always been a third-party, partial, often-stale port.

MLX ships first-class Swift bindings (mlx-swift) maintained by the same team as the Python ones. The two APIs are close enough that translating between them is mechanical, and the model checkpoints they read are the same files on disk.

For most ML work, that doesn't matter, because the gradient updates happen in Python and that's the end of it. It matters at the seam: you train or fine-tune in Python, and then someone has to ship the model in a Swift app for iPhone, iPad, or Mac. PyTorch's path is Python → export (a traced model, or ONNX in older pipelines) → Core ML → Swift, with conversion losses and tooling gaps at every hop. MLX's path is Python → save → load in Swift. The model file is the same.
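"The model file is the same" works when the file is just named tensors as raw bytes, with no framework-specific graph inside. The post does not say which format is involved, so the sketch below assumes a simplified safetensors-style layout: an 8-byte header length, a JSON header, then float32 data. It is stdlib Python and not a compliant implementation; save_checkpoint and load_checkpoint are hypothetical names.

```python
import json
import struct


def save_checkpoint(tensors):
    """Serialise {name: [floats]} as header length + JSON header + raw float32."""
    header, chunks, offset = {}, [], 0
    for name, values in tensors.items():
        raw = struct.pack(f"<{len(values)}f", *values)  # little-endian float32
        header[name] = {
            "dtype": "F32",
            "shape": [len(values)],
            "data_offsets": [offset, offset + len(raw)],
        }
        chunks.append(raw)
        offset += len(raw)
    header_bytes = json.dumps(header).encode("utf-8")
    return struct.pack("<Q", len(header_bytes)) + header_bytes + b"".join(chunks)


def load_checkpoint(blob):
    """Inverse of save_checkpoint; any language with JSON and byte slicing can do this."""
    (header_len,) = struct.unpack_from("<Q", blob, 0)
    header = json.loads(blob[8:8 + header_len].decode("utf-8"))
    data = blob[8 + header_len:]
    tensors = {}
    for name, meta in header.items():
        start, end = meta["data_offsets"]
        count = (end - start) // 4  # 4 bytes per float32
        tensors[name] = list(struct.unpack_from(f"<{count}f", data, start))
    return tensors


weights = {"layer.weight": [0.5, -1.25, 2.0], "layer.bias": [0.0, 1.0]}
blob = save_checkpoint(weights)
restored = load_checkpoint(blob)

assert restored == weights
print(len(blob), sorted(restored))
```

Nothing in the reader depends on Python. A Swift loader would need the same three steps (read the length, parse the JSON, slice the bytes), which is why no conversion step sits between training and shipping.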

The cost is reach. MLX has two first-class language bindings and PyTorch has every binding a researcher has ever wanted. If your deployment target isn't Swift, the Swift-parity argument earns you nothing.

5. What MLX gives up to make all this work

Every framework decision is a tradeoff, and the post isn't worth writing without the column where MLX loses.

PyTorch decision	MLX rewrite	Cost	Payoff
`.to(device)` for host/device split	Unified memory as a primitive	No portability to discrete GPUs	Zero-copy between CPU/GPU paths
Eager-by-default execution	Lazy by default; `mx.eval()` to materialise	Different debugging model	Whole graph available for fusion and memory planning
Python primary, Swift via third parties	First-class Python + Swift parity	Smaller language reach	Train in Python, ship in Swift without conversion
Huge kernel surface, broad hardware	Narrow kernel surface, Apple-only	Some exotic ops still missing	Every kernel tuned for one topology
Autograd via tape over mutable tensors	Functional autograd over arrays	Less Pythonic in places	Cleaner composition, JIT-friendlier
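The last row is the least self-explanatory. "Functional autograd" means differentiation is a transform on a function rather than a side effect recorded on tensors. In MLX, mx.grad takes a function and returns a new function that computes its gradient. The toy below shows that shape in stdlib Python. As a loudly labelled simplification, it uses forward-mode dual numbers, which is not how MLX computes gradients, and Dual, _lift, grad, and loss are hypothetical names.

```python
class Dual:
    """A value paired with its derivative (forward-mode autodiff)."""

    def __init__(self, val, dot=0.0):
        self.val = val  # the ordinary value
        self.dot = dot  # derivative of that value with respect to the input

    def __add__(self, other):
        other = _lift(other)
        return Dual(self.val + other.val, self.dot + other.dot)

    __radd__ = __add__

    def __mul__(self, other):
        other = _lift(other)
        # Product rule: (uv)' = u v' + u' v
        return Dual(self.val * other.val,
                    self.val * other.dot + self.dot * other.val)

    __rmul__ = __mul__


def _lift(x):
    """Treat plain numbers as constants (derivative zero)."""
    return x if isinstance(x, Dual) else Dual(float(x))


def grad(f):
    """Return a new function computing df/dx; f itself is left untouched."""
    def df(x):
        return f(Dual(float(x), 1.0)).dot
    return df


def loss(w):
    return 3 * w * w + 2 * w + 1  # analytically, d/dw = 6w + 2


dloss = grad(loss)   # no tape, no .backward(), no state stored on w
print(loss(2.0))     # 17.0
print(dloss(2.0))    # 14.0
```

Because grad returns an ordinary function, it composes with anything else that takes a function. That is the "cleaner composition" payoff in the table, and it is also why the style can feel less familiar than calling .backward() on a tensor.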

The specific things MLX gives up:

Hardware reach. Apple Silicon only. If your roadmap includes "we'll move to a rented H100 cluster," MLX is not the framework you start in.
Distributed training maturity. MLX has added multi-node and gradient sharding, but the production-tested defaults — NCCL, FSDP, DeepSpeed — live in the PyTorch world. For a serious training run across many machines, MLX is the harder bet.
Ecosystem of checkpoints. Hugging Face has tens of thousands of models with first-class PyTorch loaders. The mlx-community namespace has caught up considerably for popular open-weights models, but the long tail is PyTorch-first.
The reflexive resume keyword. "Five years of PyTorch" is a hireable line. "Three years of MLX" is not yet. That will shift, but slowly.

Where this leaves the framework decision

If you are doing ML research that needs to run on whatever hardware your university or company eventually buys, PyTorch is still the safer default. The portability tax is real and the abstraction surface is what you're paying for.

If you are building inference, fine-tuning, or production tooling that will sit on Apple Silicon and ship into Apple platforms, MLX is the framework that got to skip the museum exhibits. The defaults match the hardware. The lazy-by-default execution hands the runtime the whole graph without an opt-in. The Python and Swift APIs let you train and ship without a conversion pipeline. The price is that you've bet on one vendor's silicon. On workloads that don't move off it, that bet is currently paying.

The point isn't that MLX is the right framework for everyone. It's that someone finally got to pick the defaults knowing what we'd learned in the decade since PyTorch shipped — and the things they threw away are the things that were quietly costing the rest of us the most.
