Solution

Self-hosted AI inference on dedicated hardware.

Serve open-weight models on whole nodes you control, with predictable cost instead of per-token billing. Forge MoE pairs Intel Xeon AMX with GPUs so a single node can serve trillion-parameter mixture-of-experts models.

The problem

The problem with inference APIs and shared GPUs

Per-token inference APIs are convenient until volume grows. Then cost becomes unpredictable and you are rate-limited on infrastructure you do not control. Shared GPU rentals add cold starts and noisy-neighbor contention. For steady inference workloads, paying retail per token on someone else's hardware is the expensive path.

How Smelt solves it

Whole-node GPU access

The entire node is yours: no shared GPU, no token meter in front of your workload, no cold-start lottery.

MoE on a single node

The KTransformers pattern runs dense and attention layers on the GPU and keeps the sparse experts in 1.5 to 3 TB of system memory, where the Xeon CPUs compute them with AMX, so one node serves models that would otherwise need a rack.
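As a rough illustration of that split, the sketch below estimates how many gigabytes of weights land on the GPU and how many in system memory. Every figure in it is hypothetical (a 1-trillion-parameter model, 97% of parameters in experts, 8-bit weights) and is not a Forge MoE specification; `Placement` and `place_weights` are names made up for this example. It counts weights only and ignores KV cache and activations.

```python
from dataclasses import dataclass

GB = 10**9  # decimal gigabyte


@dataclass
class Placement:
    gpu_gb: float   # dense and attention weights, held in GPU memory
    host_gb: float  # sparse expert weights, held in system memory


def place_weights(total_params: float,
                  expert_fraction: float,
                  bytes_per_param: float) -> Placement:
    """Split a model's weight footprint between GPU and system memory.

    expert_fraction is the share of parameters that live in sparse experts.
    """
    if not 0.0 <= expert_fraction <= 1.0:
        raise ValueError("expert_fraction must be between 0 and 1")
    expert_params = total_params * expert_fraction
    dense_params = total_params - expert_params
    return Placement(
        gpu_gb=dense_params * bytes_per_param / GB,
        host_gb=expert_params * bytes_per_param / GB,
    )


# Hypothetical model: 1e12 parameters, 97% in experts, 1 byte per parameter.
placement = place_weights(1e12, 0.97, 1.0)
print(f"GPU: {placement.gpu_gb:.0f} GB, system memory: {placement.host_gb:.0f} GB")
```

Under these assumed numbers the expert weights come to roughly 970 GB, which fits in 1.5 TB of system memory, while the dense share is small enough for GPU memory.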

Predictable cost

Fixed node pricing instead of per-token billing. Run as much inference as the hardware allows for one rate.
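Whether a fixed rate beats per-token billing depends on your volume. Here is a minimal sketch of the break-even arithmetic. The prices are placeholders, not quotes from Smelt or any API provider, and the function names and the 30-day month are assumptions made for this example.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumes a 30-day month


def break_even_tokens(node_price_per_month: float,
                      api_price_per_million: float) -> float:
    """Monthly token volume at which a fixed-price node costs the same as an API."""
    if api_price_per_million <= 0:
        raise ValueError("api_price_per_million must be positive")
    return node_price_per_month / api_price_per_million * 1_000_000


def required_tokens_per_second(tokens_per_month: float) -> float:
    """Sustained throughput needed to process that volume within one month."""
    return tokens_per_month / SECONDS_PER_MONTH


# Hypothetical prices: 10,000 per month for the node, 2.00 per million API tokens.
tokens = break_even_tokens(10_000.0, 2.0)
rate = required_tokens_per_second(tokens)
print(f"break-even: {tokens:,.0f} tokens/month (~{rate:,.0f} tokens/s sustained)")
```

Below the break-even volume the per-token API is cheaper. Above it the fixed node is cheaper, provided the hardware can sustain the required throughput for your model.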

Bring your own stack

Run vLLM, KTransformers, or your own serving stack on bare metal or under managed Kubernetes.

FAQ

Questions, answered.

Can I run DeepSeek or other large MoE models?

Yes. Forge MoE is purpose-built for mixture-of-experts models in the hundreds of billions to trillion-parameter range, using the KTransformers pattern of dense layers on GPU and sparse experts in system memory.

Is there a per-token charge?

No. You run your own serving stack on a dedicated node at a fixed rate. A per-token gateway on idle capacity is on the roadmap, but the core product has no metering in front of it.

Bare metal or managed?

Either. Take full root access, or run a single-tenant Kubernetes cluster and let us manage it.

Ready to get off the cloud meter?

Charter applications are open for the first deployment. Apply in two minutes or join the waitlist. No payment is required.