Question 1

What models can Forge MoE run?

Accepted Answer

Mixture-of-experts (MoE) models from hundreds of billions to around a trillion parameters (e.g. DeepSeek-class), using the KTransformers pattern, plus dense models that fit within the node's envelope.

Question 2

Which GPUs do you use?

Accepted Answer

The card floats with supply: RTX 4090 or 5090 today, with datacenter accelerators offered as a premium variant as they become available. You pick a capability tier, not a specific card.

Question 3

Why not just rent datacenter GPUs?

Accepted Answer

For bandwidth-bound MoE decode, one Xeon AMX node with one or two GPUs serves models that would otherwise need many accelerators, at a fraction of the cost and with no shared tenancy.
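
To make "bandwidth-bound" concrete, here is a rough back-of-envelope sketch. It assumes single-stream decode is limited purely by how fast the active weights can be read from memory, and it ignores KV-cache reads, compute, and interconnect. The function name and every number are illustrative assumptions, not Forge MoE specifications or measurements.

```python
def decode_tokens_per_second(active_params: float,
                             bytes_per_param: float,
                             mem_bandwidth_bytes_per_s: float) -> float:
    """Rough upper bound on decode rate when weight reads dominate.

    Each generated token must read every active parameter once, so the
    ceiling is memory bandwidth divided by bytes read per token.
    """
    if min(active_params, bytes_per_param, mem_bandwidth_bytes_per_s) <= 0:
        raise ValueError("all inputs must be positive")
    bytes_per_token = active_params * bytes_per_param
    return mem_bandwidth_bytes_per_s / bytes_per_token


# Hypothetical figures, chosen only to show the shape of the argument.
TOTAL_PARAMS = 600e9    # assumed total parameters in the model
ACTIVE_PARAMS = 40e9    # assumed parameters activated per token (MoE routing)
BYTES_PER_PARAM = 0.5   # assumed 4-bit quantized weights
BANDWIDTH = 400e9       # assumed usable memory bandwidth, bytes per second

# An MoE model reads only its active experts for each token.
moe_rate = decode_tokens_per_second(ACTIVE_PARAMS, BYTES_PER_PARAM, BANDWIDTH)

# A dense model of the same total size reads every weight for each token.
dense_rate = decode_tokens_per_second(TOTAL_PARAMS, BYTES_PER_PARAM, BANDWIDTH)

print(f"MoE ceiling:   {moe_rate:.1f} tokens/s")
print(f"Dense ceiling: {dense_rate:.1f} tokens/s")
```

Under these assumptions the ceiling scales with active parameters rather than total parameters, which is why memory bandwidth matters more than accelerator count for this workload. Real throughput will be lower than this bound.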

Question 4

Is there a per-token API?

Accepted Answer

No. Forge MoE is a dedicated node you run your own stack on. A per-token gateway on idle capacity is on the roadmap.

Run trillion-parameter models on a single node.

MoE done right

Bandwidth is the product

The GPU floats

Dedicated, not shared

Workloads this is shaped for.

AI Inference, answered.

Ready to get off the cloud meter?