
    GPU Inference API Pricing Compared

    By Ryan Ewing


    GPU inference pricing is confusing. Every provider has a different pricing model — per token, per second, per GPU-hour, reserved vs. on-demand — and the real cost depends on your workload pattern. This post cuts through the noise with a direct comparison and explains why some providers can charge less than others.

    The Short Answer

    If you're running inference on open-source models, you're likely overpaying. The major cloud providers (AWS Bedrock, Google Vertex, Azure) charge 3–10x more per token than specialized inference providers. The specialized providers (Together AI, Fireworks, Replicate) are cheaper but still carry margins on dedicated GPU reservations.

    Lilac takes a different approach: we route inference to idle enterprise GPUs that are already powered on. No reserved capacity to amortize. The result is lower per-token pricing with no minimums or commitments.

    Lilac Pricing

    Model        Input             Output            Latency
    Kimi K2.5    $0.40/M tokens    $2.00/M tokens    0.38s TTFT

    More models are being added. Get in touch to request specific models, or start on the cheap inference API page.

    How Pricing Typically Works

    Most inference providers price in one of three ways:

    Per-token pricing — You pay for input and output tokens separately. This is the most common model for API-based inference. Prices vary wildly by model size and provider.

    Per-second / per-GPU-hour pricing — You rent dedicated GPU capacity and pay for uptime regardless of utilization. Cheaper at scale but you eat the cost of idle time between requests.

    Reserved capacity — Commit to a monthly spend for guaranteed throughput. Lowest per-token cost but highest commitment. Most providers require 1–3 month minimums.
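    To see how these models compare in practice, here's a back-of-the-envelope sketch for a hypothetical workload. All rates below are illustrative assumptions, not quotes from any provider:

```python
# Illustrative monthly cost under each pricing model.
# Every rate here is a made-up example, not a real provider's price.

MONTHLY_TOKENS = 500_000_000       # 500M tokens/month
HOURS_PER_MONTH = 730

# Per-token: pay only for what you use.
per_token_rate = 1.00 / 1_000_000  # $1.00 per 1M tokens
per_token_cost = MONTHLY_TOKENS * per_token_rate

# Per-GPU-hour: pay for uptime, idle or not.
gpu_hourly = 2.50                  # dedicated GPU rental
per_hour_cost = gpu_hourly * HOURS_PER_MONTH

# Reserved: discounted hourly rate in exchange for a monthly commitment.
reserved_hourly = 1.80
reserved_cost = reserved_hourly * HOURS_PER_MONTH

print(f"per-token:    ${per_token_cost:,.0f}")   # $500
print(f"per-GPU-hour: ${per_hour_cost:,.0f}")    # $1,825
print(f"reserved:     ${reserved_cost:,.0f}")    # $1,314
```

    The crossover depends entirely on volume: at this workload per-token wins, but a GPU kept busy around the clock flips the comparison in favor of dedicated or reserved capacity.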

    Why Pricing Varies So Much

    The cost of running inference depends on three things:

    1. GPU hardware cost — An H100 costs ~$2–4/hour depending on the provider. An A100 costs ~$1–2/hour. This sets the floor.
    2. Utilization rate — A GPU running at 90% utilization serves 3x the requests of one at 30%. Higher utilization = lower cost per token.
    3. Margin structure — Cloud providers like AWS add 40–60% margin on top of hardware costs. Specialized providers run thinner margins.

    The math is simple: if a provider can run GPUs at higher utilization with lower margins, they can charge less per token.
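    That math can be made concrete with a small sketch. The hourly rate, throughput, and margins below are assumptions chosen to illustrate the relationship, not measured figures:

```python
def cost_per_million_tokens(gpu_hourly_cost, tokens_per_sec_peak, utilization, margin):
    """Effective price per 1M tokens for a GPU endpoint.

    gpu_hourly_cost: what the provider pays per GPU-hour (USD)
    tokens_per_sec_peak: throughput when the GPU is fully busy
    utilization: fraction of each hour spent serving requests
    margin: markup the provider adds on top of hardware cost
    """
    tokens_per_hour = tokens_per_sec_peak * 3600 * utilization
    hardware_cost_per_token = gpu_hourly_cost / tokens_per_hour
    return hardware_cost_per_token * (1 + margin) * 1_000_000

# Same H100 at $3/hr and 1,000 tok/s peak, two operating points:
low_util  = cost_per_million_tokens(3.0, 1000, 0.30, 0.50)  # 30% busy, 50% margin
high_util = cost_per_million_tokens(3.0, 1000, 0.90, 0.15)  # 90% busy, 15% margin
print(f"low utilization:  ${low_util:.2f}/M tokens")   # ~$4.17
print(f"high utilization: ${high_util:.2f}/M tokens")  # ~$1.06
```

    Identical hardware, roughly a 4x difference in break-even price per token, driven entirely by utilization and margin.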

    How Idle GPU Economics Work

    Here's what makes Lilac's pricing possible:

    The average data center runs at ~50% GPU utilization, and enterprise on-prem clusters can run as low as 10%. That's millions of GPU-hours sitting idle — hardware that's already powered on, cooled, and paid for.

    Lilac connects inference requests to that spare capacity. GPU providers install a lightweight Kubernetes operator and earn revenue on hardware that would otherwise sit idle. Because the fixed costs (power, cooling, depreciation) are already covered by the provider's primary workloads, the marginal cost of serving inference is dramatically lower.

    This is the same economic model that made AWS possible: Amazon had excess server capacity from their e-commerce peaks and sold it to developers. Lilac applies this to GPU inference.

    The cheapest GPU is one that's already running and has nothing to do.

    What to Look For When Comparing

    When evaluating inference providers, look beyond the per-token headline price:

    • Cold start time — Serverless providers may have 10–30s cold starts. Shared endpoints (like Lilac) are always warm.
    • Minimums and commitments — Many providers require monthly minimums of $500–5,000. Lilac has no minimums.
    • Rate limits — Some cheap providers throttle heavily. Check the actual throughput you'll get.
    • Model availability — Not every provider supports every model. Check that your target model is available.
    • API compatibility — Lilac uses an OpenAI-compatible API, so switching requires changing one line of code (your base_url).

    Getting Started

    Lilac's API is OpenAI-compatible. Swap your base URL and start running inference:

    from openai import OpenAI
    
    client = OpenAI(
        base_url="https://api.getlilac.com/v1",
        api_key="lk-...",
    )
    
    response = client.chat.completions.create(
        model="kimi-k2-5",
        messages=[{"role": "user", "content": "Hello!"}],
    )
    print(response.choices[0].message.content)
    

    No contracts, no minimums, pay per token. Request API access to get started.