Running Kimi 2.6 on a single Forge MoE node

Kimi 2.6 is a trillion-parameter mixture-of-experts model that most people assume needs a rack of accelerators. It fits on one Forge MoE node. Here is why, and what to expect.

Kimi 2.6 is a large open-weight mixture-of-experts model with roughly a trillion total parameters, of which only a few tens of billions are active on any given token. The instinct when you see "trillion parameters" is to reach for a rack of datacenter GPUs and an interconnect to match. For dense models that instinct is correct. For a sparse MoE model like Kimi, it is the wrong mental model, and acting on it costs an order of magnitude more than it should.

A Forge MoE node runs Kimi 2.6 on a single machine. Here is the reasoning, the configuration, and an honest account of what you get.

Why a trillion-parameter MoE fits on one node

The number that scares people — total parameter count — is not the number that governs decode speed. In a mixture-of-experts model, each token is routed to a small subset of experts. The dense attention path and the active experts do the compute; the overwhelming majority of the weights sit idle for any individual token. Kimi 2.6 keeps a trillion parameters on hand but only touches a small fraction per step.
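To make the routing step concrete, here is a minimal sketch of top-k expert selection for a single token. The expert count (256) and k (8) are assumptions chosen for illustration, not Kimi 2.6's published configuration, and the router logits are random stand-ins for what a trained router would produce.

```python
import math
import random

def route_token(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    top = ranked[:k]
    # Softmax over the selected logits only, so the k mixing weights sum to 1.
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

NUM_EXPERTS = 256  # assumed for illustration
TOP_K = 8          # assumed for illustration

rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]  # stand-in router output
selected = route_token(logits, TOP_K)

touched = len(selected) / NUM_EXPERTS
print(f"experts touched this step: {len(selected)} of {NUM_EXPERTS} ({touched:.1%})")
```

Whatever the real expert count and k are, the shape is the same: the weights of every expert the router did not pick are never read for that token.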

That structural fact is what the KTransformers pattern exploits. Instead of holding every weight in scarce, expensive GPU memory, you split the model by how it is actually used:

  • Dense layers and attention run on the GPU — the compute-bound, latency-sensitive part of each step.
  • The sparse expert weights live in system RAM — hundreds of gigabytes of them, streamed across high-bandwidth memory channels as the router selects which experts a token needs.

This inverts the usual constraint. You no longer need enough GPU memory to hold a trillion parameters. You need one capable GPU for the dense path, and a lot of fast system memory with the bandwidth to feed the experts. That is exactly what a Forge MoE node is built around.
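The split can be sketched as a placement rule over the model's tensors. This is an illustration of the idea only, not KTransformers' actual configuration format: the tensor names, the sizes, the ".experts." naming convention, and the 24 GB GPU budget are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Tensor:
    name: str        # hypothetical naming scheme, not from a real checkpoint
    size_gb: float   # made-up sizes for illustration

def plan_placement(tensors, gpu_budget_gb):
    """Send routed-expert weights to system RAM and everything else to the GPU."""
    plan = {"gpu": [], "ram": []}
    used = {"gpu": 0.0, "ram": 0.0}
    for t in tensors:
        device = "ram" if ".experts." in t.name else "gpu"
        plan[device].append(t.name)
        used[device] += t.size_gb
    if used["gpu"] > gpu_budget_gb:
        raise ValueError(
            f"dense path needs {used['gpu']:.1f} GB, GPU has {gpu_budget_gb:.1f} GB"
        )
    return plan, used

tensors = [
    Tensor("layers.0.attention.qkv", 0.20),
    Tensor("layers.0.shared_mlp", 0.10),
    Tensor("layers.0.experts.0.up_proj", 0.05),
    Tensor("layers.0.experts.1.up_proj", 0.05),
]
plan, used = plan_placement(tensors, gpu_budget_gb=24.0)  # budget is an assumption
print(plan)
print(used)
```

The design point is that the GPU budget only has to cover the dense path; the expert pool is bounded by system memory instead.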

The node

Forge MoE pairs dual Intel Xeon processors with 1.5 to 3 TB of system memory and a GPU for the dense path. The Xeons include AMX (Advanced Matrix Extensions), the on-die matrix units that make CPU-side expert evaluation viable. The memory is the product here: aggregate bandwidth lands between 0.6 and 1.7 TB/s depending on generation, and on Forge MoE the experts stream from that pool rather than from disk.

Kimi 2.6's full-precision weights do not fit in system RAM, and they do not need to. Quantized to roughly 4 bits per weight, a trillion-parameter model occupies on the order of 500 GB — comfortably inside the node's memory envelope, with room left for the KV cache and headroom for longer contexts. Four-bit quantization is the standard operating point for self-hosted MoE serving; the quality cost is small and well understood, and it is what makes single-node serving practical.
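The 500 GB figure is simple arithmetic, sketched below for a round trillion parameters. It is a rough estimate: it ignores quantization metadata (scales and zero points) and any layers kept at higher precision, both of which add to the real footprint.

```python
def weight_footprint_gb(num_params, bits_per_weight):
    """Raw weight storage in GB (1 GB = 1e9 bytes), excluding quantization metadata."""
    return num_params * bits_per_weight / 8 / 1e9

TOTAL_PARAMS = 1e12            # "roughly a trillion", per the text
NODE_RAM_GB = (1500, 3000)     # the 1.5 to 3 TB range quoted for the node

for bits in (16, 8, 4):
    size = weight_footprint_gb(TOTAL_PARAMS, bits)
    print(f"{bits:>2} bits/weight: {size:,.0f} GB")

four_bit = weight_footprint_gb(TOTAL_PARAMS, 4)
for ram in NODE_RAM_GB:
    # Whatever is left over is available for KV cache, activations, and the OS.
    print(f"{ram} GB node: {ram - four_bit:,.0f} GB left after 4-bit weights")
```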

What to expect

MoE decode on this architecture is bandwidth-bound, not compute-bound. Throughput tracks how fast you can stream the active experts out of memory each step, which is why memory bandwidth is the figure that matters and why the generation gap is real: a Gen 6 node with MRDIMM-class bandwidth roughly doubles the decode rate of a Gen 4 node on the same model. Two things follow from that:

  • One node is sized for interactive and small-batch serving — an internal assistant, an agentic backend, a coding copilot, evaluation and batch jobs. It is not a high-concurrency public endpoint on day one; for that you run several nodes and shard the traffic.
  • NUMA-aware placement matters. On a dual-socket machine, the expert weights and the threads that read them should stay in the same NUMA domain, so that reads do not cross the inter-socket link. The serving stack handles this, but it is the difference between the bandwidth on the spec sheet and the bandwidth you actually see.
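A back-of-envelope ceiling follows from the bandwidth-bound claim: at batch size 1, each decoded token has to read its streamed expert weights once, so tokens per second cannot exceed bandwidth divided by bytes read per token. The sketch below assumes 30 billion streamed parameters per token, a placeholder for "a few tens of billions" rather than a published figure, and it ignores GPU time, KV-cache traffic, and NUMA losses, so real throughput will be lower.

```python
def decode_ceiling_tokens_per_s(bandwidth_bytes_per_s,
                                streamed_params_per_token,
                                bits_per_weight):
    """Upper bound on single-stream decode rate when weight streaming is the bottleneck."""
    bytes_per_token = streamed_params_per_token * bits_per_weight / 8
    return bandwidth_bytes_per_s / bytes_per_token

STREAMED_PARAMS = 30e9   # assumption: expert parameters read from RAM per token
BITS_PER_WEIGHT = 4      # the 4-bit operating point described in the text

# Bandwidth endpoints are the 0.6 and 1.7 TB/s figures quoted for the node.
for label, bandwidth in (("0.6 TB/s", 0.6e12), ("1.7 TB/s", 1.7e12)):
    ceiling = decode_ceiling_tokens_per_s(bandwidth, STREAMED_PARAMS, BITS_PER_WEIGHT)
    print(f"{label}: at most {ceiling:.0f} tokens/s per stream (theoretical ceiling)")
```

The absolute numbers depend entirely on the assumed parameter count, but the ratio between the two lines does not: the ceiling scales linearly with memory bandwidth, which is why the generation gap shows up directly in decode rate.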

You bring the weights and the stack. KTransformers is the reference path for this pattern; vLLM and SGLang are options as their MoE-offload support matures. The node is yours — full root, your choice of runtime, your quantization, your context length.

Why a dedicated node, not a per-token API

You can rent Kimi-class inference by the token from a hosted gateway. That is the right call for spiky, low-volume, or experimental traffic. The moment your usage becomes steady, the math turns: a metered endpoint prices every token, every month, forever, and you never touch the machine underneath it.

A Forge MoE node is a fixed monthly cost for the whole machine. Run it at one token per day or saturate it around the clock: the price does not move. There is no per-token meter, no egress charge on the data you pull back, and no neighbor quietly stealing cycles on a shared GPU. For a team running a model like Kimi 2.6 as core infrastructure rather than an occasional API call, owning the node end to end costs a fraction of renting equivalent accelerator capacity, and the cost is one you can forecast.
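The point where the math turns can be estimated with one division. Both prices below are hypothetical placeholders, not Forge pricing or any gateway's real rate; substitute your own figures.

```python
def breakeven_tokens_per_month(node_monthly_cost, price_per_million_tokens):
    """Monthly token volume at which a metered API costs the same as a fixed-price node."""
    return node_monthly_cost / price_per_million_tokens * 1_000_000

NODE_MONTHLY_COST = 5000.0   # hypothetical placeholder, in your currency
PRICE_PER_MILLION = 2.50     # hypothetical placeholder metered rate per million tokens

tokens = breakeven_tokens_per_month(NODE_MONTHLY_COST, PRICE_PER_MILLION)
print(f"break-even volume: {tokens / 1e9:.2f} billion tokens per month")
```

Below the break-even volume the metered endpoint is cheaper; above it, every additional token on the node is free at the margin. The comparison only holds if the node can actually sustain that volume, so check it against your measured throughput.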

Getting started

If you are serving an open-weight MoE model — Kimi 2.6, DeepSeek-class architectures, or the next trillion-parameter release — Forge MoE is built for exactly this shape of workload. Read the AI inference overview for the node specs, or apply for charter access to reserve one in the founding cohort.

Ready to get off the cloud meter?

Charter applications are open for the first deployment. Apply in two minutes, or join the waitlist — no payment required.