Running a 671B-parameter model on a single node

The conventional wisdom is that serving a frontier-scale model means a rack of datacenter GPUs and the budget to match. For dense models that is broadly true. For mixture-of-experts models, it is not — and the gap between the conventional wisdom and the actual requirement is where the economics get interesting.

Why MoE is different

A mixture-of-experts model has an enormous total parameter count but activates only a small fraction of it per token. A 671-billion-parameter MoE model might route each token through a handful of experts, so the compute per token looks more like that of a much smaller dense model. The catch is memory: all those experts still have to live somewhere fast enough to reach on demand.
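To make the total-versus-active distinction concrete, here is a minimal sketch in Python. The split between shared and expert parameters, the expert count, and the number of experts per token are all hypothetical values chosen only so the total comes to 671 billion; they are not taken from any model card.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoEShape:
    """Hypothetical parameter breakdown for illustration only."""
    shared_params: float      # attention + dense layers, used by every token
    params_per_expert: float  # parameters behind one expert slot, summed over layers
    num_experts: int          # routed experts available
    experts_per_token: int    # experts the router selects for each token

    @property
    def total_params(self) -> float:
        # Everything that has to be stored somewhere.
        return self.shared_params + self.params_per_expert * self.num_experts

    @property
    def active_params(self) -> float:
        # Everything a single token actually touches.
        return self.shared_params + self.params_per_expert * self.experts_per_token


# Assumed numbers, picked so that total_params is 671e9.
shape = MoEShape(
    shared_params=15e9,
    params_per_expert=2.5625e9,
    num_experts=256,
    experts_per_token=8,
)
active_fraction = shape.active_params / shape.total_params

print(f"total:  {shape.total_params / 1e9:.1f}B")
print(f"active: {shape.active_params / 1e9:.1f}B ({active_fraction:.1%} of total)")
```

Under these assumed numbers a token touches only a few percent of the stored weights, which is why compute scales with the active count while memory scales with the total.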

That reframes the problem. The dense and attention layers — the part that benefits most from raw GPU compute — are comparatively small. The sparse experts are large but are touched selectively. So instead of asking "how many GPUs fit this model," the right question is "where do the experts live, and how fast can we reach them."

The KTransformers pattern

The answer that makes single-node serving work is to split the model across two kinds of memory. Dense and attention layers run on the GPU, where their arithmetic intensity pays off. The sparse experts sit in system RAM — 1.5 to 3 terabytes of it — and are streamed to compute as tokens route to them.

This is the pattern KTransformers popularized, and it changes the hardware bill of materials completely. You no longer need enough GPU memory to hold the whole model. You need one or two GPUs for the dense path and a great deal of high-bandwidth system memory for the experts.
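A rough sizing sketch shows how that split plays out. It assumes a hypothetical division of the 671B parameters into 15B shared (attention and dense) and 656B expert parameters, and it counts weights only: KV cache, activations, and runtime overhead are ignored. The `expert_copies` parameter is an illustrative knob for setups that keep more than one copy of the experts in RAM, for example one per CPU socket.

```python
def weight_bytes(params: float, bits_per_param: float) -> float:
    """Bytes needed to store `params` weights at the given precision."""
    return params * bits_per_param / 8


def placement(shared_params: float, expert_params: float,
              bits_per_param: float, expert_copies: int = 1) -> dict:
    """Weights-only footprint: shared layers on GPU, experts in system RAM."""
    return {
        "gpu_gb": weight_bytes(shared_params, bits_per_param) / 1e9,
        "ram_tb": expert_copies * weight_bytes(expert_params, bits_per_param) / 1e12,
    }


# Assumed split of a 671B-parameter model (not from any model card).
SHARED, EXPERTS = 15e9, 656e9

for bits in (16, 8, 4):
    for copies in (1, 2):
        p = placement(SHARED, EXPERTS, bits, copies)
        print(f"{bits:>2}-bit, {copies} expert copy(ies): "
              f"GPU {p['gpu_gb']:.1f} GB, RAM {p['ram_tb']:.2f} TB")
```

The point of the sketch is the asymmetry: under these assumptions the GPU-resident part is measured in tens of gigabytes, while the expert store is measured in terabytes.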

Decode is a bandwidth problem

Once experts live in system memory, token generation speed becomes a memory-bandwidth problem, not a GPU-count problem. Decode throughput tracks aggregate memory bandwidth almost linearly, which is why the CPU and memory generation matters more than the GPU here.

A dual-socket Intel Xeon AMX node of the current, still-affordable generation delivers roughly 0.6 to 0.7 TB/s of aggregate memory bandwidth, enough for usable interactive decode on a 671B MoE model. The newer generation, with twelve memory channels and faster MRDIMM memory, reaches 1.2 to 1.7 TB/s: roughly two to three times the decode throughput on the same architecture.
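The bandwidth argument can be written as a simple ceiling: every generated token has to read its active expert weights from RAM once, so tokens per second cannot exceed bandwidth divided by bytes read per token. The sketch below applies that to the bandwidth figures above. The active expert parameter count (20.5B) and the 8-bit weight format are assumptions for illustration, and the result is an upper bound for a single stream, not a benchmark: it ignores GPU-side work, KV-cache traffic, and the gap between peak and achieved bandwidth.

```python
def decode_ceiling_tokens_per_s(bandwidth_tb_s: float,
                                active_expert_params: float,
                                bits_per_param: float) -> float:
    """Bandwidth-bound upper limit on single-stream decode speed."""
    bytes_per_token = active_expert_params * bits_per_param / 8
    return bandwidth_tb_s * 1e12 / bytes_per_token


# Assumptions, not measurements: expert weights read per token and their precision.
ACTIVE_EXPERT_PARAMS = 20.5e9
BITS_PER_PARAM = 8

# Aggregate bandwidth ranges quoted in the text, in TB/s.
for label, bandwidths in (("current generation", (0.6, 0.7)),
                          ("newer generation", (1.2, 1.7))):
    low, high = (decode_ceiling_tokens_per_s(b, ACTIVE_EXPERT_PARAMS, BITS_PER_PARAM)
                 for b in bandwidths)
    print(f"{label}: at most {low:.0f} to {high:.0f} tokens/s")
```

Because the ceiling is bandwidth divided by a constant, doubling bandwidth doubles the bound, which is the linear relationship described above.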

Two implementation details decide whether you actually get that bandwidth. The node must be dual-socket, because a single socket's bandwidth is too low to serve interactively. And expert weights must be replicated per socket and pinned to NUMA-local memory, so each socket-and-GPU pair reads from its own local copy without paying a cross-socket penalty. Get those right and the spec sheet is real; get them wrong and throughput collapses back toward a single socket.
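As an illustration of what NUMA-aware pinning involves at the operating-system level, here is a minimal Linux-only sketch in Python. It is not how KTransformers or any particular runtime implements it. It reads a node's CPU list from sysfs and restricts the calling thread to those CPUs; the helper names are hypothetical. The sketch relies on Linux's default first-touch policy, under which memory is normally allocated on the node of the thread that first writes it, so a worker pinned this way before it loads its copy of the experts will tend to place that copy in local RAM.

```python
import os
from pathlib import Path


def parse_cpulist(text: str) -> set[int]:
    """Parse Linux cpulist syntax such as '0-3,8-11' into a set of CPU ids."""
    cpus: set[int] = set()
    for part in text.strip().split(","):
        if not part:
            continue
        lo, _, hi = part.partition("-")
        # A bare id like '5' has no upper bound, so it is a one-element range.
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus


def node_cpus(node: int) -> set[int]:
    """CPUs belonging to one NUMA node, as reported by sysfs (Linux only)."""
    path = Path(f"/sys/devices/system/node/node{node}/cpulist")
    return parse_cpulist(path.read_text())


def pin_to_node(node: int) -> None:
    """Restrict the calling thread to the CPUs of the given NUMA node."""
    os.sched_setaffinity(0, node_cpus(node))


# Hypothetical usage, one worker per socket, called before loading weights:
#   pin_to_node(0)   # in the worker serving socket 0
#   pin_to_node(1)   # in the worker serving socket 1
```

CPU affinity alone does not guarantee memory placement; launching each worker under `numactl --cpunodebind=N --membind=N` is a stricter alternative that binds both CPUs and allocations to one node.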

What this unlocks

The result is that a model which appears to demand a rack runs on one node with one or two GPUs and a lot of RAM, with no shared tenancy and no per-token meter in front of it. The GPU itself becomes a swappable component: a consumer card today, a datacenter accelerator as availability allows, without changing the architecture.

That is exactly what Forge MoE is shaped around — Xeon AMX plus GPU, dual-socket, NUMA-aware, with the memory bandwidth to make single-node MoE serving practical rather than theoretical.
