Run trillion-parameter models on a single node.
Forge MoE runs mixture-of-experts models — dense layers on GPU, sparse experts in fast system memory — so one node serves models that would otherwise need a rack.
MoE done right
KTransformers pattern: experts stream from 1.5–3 TB of RAM.
Bandwidth is the product
0.6–1.7 TB/s aggregate memory bandwidth; Gen 6 doubles decode throughput.
The GPU floats
RTX 4090/5090 today; datacenter cards as supply lands.
Dedicated, not shared
The whole node — no shared GPU, no token meter.
Workloads this is shaped for.
Serve open-weight models at predictable cost.
DeepSeek-class architectures on one node.
Dedicated GPU time with no contention.
AI Inference, answered.
What models can Forge MoE run?
Mixture-of-experts models in the hundreds-of-billions to trillion-parameter range (e.g. DeepSeek-class) via the KTransformers pattern, plus dense models within the node's envelope.
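As a rough illustration of what "within the node's envelope" can mean, the sketch below checks whether a model's weights fit in system RAM at a given precision. The function names, the 90% usable-RAM figure, and the bytes-per-parameter values are assumptions made for the example, not Forge MoE specifications; it also ignores KV cache, activations, and the share of weights held on the GPU.

```python
# Back-of-envelope check: do a model's weights fit in the node's system RAM?
# All names and example values here are illustrative assumptions.

TB = 1e12  # decimal terabyte, in bytes


def weights_footprint_tb(total_params: float, bytes_per_param: float) -> float:
    """Approximate weight storage in TB, ignoring KV cache and runtime overhead."""
    return total_params * bytes_per_param / TB


def fits_in_ram(total_params: float, bytes_per_param: float,
                ram_tb: float, usable_fraction: float = 0.9) -> bool:
    """True if the weights fit in the assumed usable share of RAM."""
    return weights_footprint_tb(total_params, bytes_per_param) <= ram_tb * usable_fraction


if __name__ == "__main__":
    # A hypothetical trillion-parameter model at 1 and 2 bytes per parameter,
    # against the 1.5 TB and 3 TB ends of the RAM range quoted above.
    for ram_tb in (1.5, 3.0):
        for bytes_per_param in (1.0, 2.0):
            ok = fits_in_ram(1e12, bytes_per_param, ram_tb)
            print(f"1T params @ {bytes_per_param} B/param in {ram_tb} TB RAM: {ok}")
```

Under these assumptions, a trillion parameters at one byte each is about 1 TB and fits at either end of the range, while two bytes each is about 2 TB and needs the larger configuration.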
Which GPUs do you use?
The card floats with supply: RTX 4090/5090 today, with datacenter accelerators offered as a premium variant once they land. You pick a capability tier.
Why not just rent datacenter GPUs?
For bandwidth-bound MoE decode, one Xeon AMX node with one or two GPUs serves models that would otherwise need many accelerators — at a fraction of the cost, with no shared tenancy.
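To make "bandwidth-bound" concrete, here is a minimal sketch of the usual ceiling estimate: if every generated token has to read its active expert weights from system memory once, decode speed cannot exceed memory bandwidth divided by the bytes read per token. The function name and the 30-billion active-parameter figure are hypothetical inputs chosen for the arithmetic, not measurements of any model or of Forge MoE; real throughput will sit below this ceiling.

```python
# Upper-bound estimate for bandwidth-bound MoE decode.
# Assumption: each token reads all of its active expert weights from RAM once.


def decode_ceiling_tokens_per_s(bandwidth_tb_s: float,
                                active_expert_params: float,
                                bytes_per_param: float) -> float:
    """Tokens per second if memory bandwidth were the only limit."""
    bytes_per_token = active_expert_params * bytes_per_param
    return bandwidth_tb_s * 1e12 / bytes_per_token


if __name__ == "__main__":
    # Hypothetical model: 30e9 active expert parameters at 1 byte each,
    # evaluated at both ends of the 0.6-1.7 TB/s range quoted on this page.
    for bw in (0.6, 1.7):
        ceiling = decode_ceiling_tokens_per_s(bw, 30e9, 1.0)
        print(f"{bw} TB/s -> at most {ceiling:.1f} tokens/s")
```

With these assumed inputs, 30 GB read per token gives a ceiling of 20 tokens per second at 0.6 TB/s and roughly 57 at 1.7 TB/s. The estimate depends on bandwidth and active parameters, not on total model size, which is why a large MoE can decode on one node.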
Is there a per-token API?
No. Forge MoE is a dedicated node you run your own stack on. A per-token gateway on idle capacity is on the roadmap.
Ready to get off the cloud meter?
Charter applications are open for the first deployment. Apply in two minutes, or join the waitlist — no payment required.