
    How Idle GPUs Make Cheap Inference Possible

    By Lucas Ewing


    TL;DR

    We serve Kimi K2.5 at $0.40/M input, $2.00/M output by running models on idle enterprise GPUs. Early access customers get 25% off all tokens above 1B/month for 3 months ($0.30/M input, $1.50/M output). No contracts, no minimums. Get API access or email contact@getlilac.com.
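    The tiered pricing above is easy to sanity-check in code. A minimal sketch of a monthly-bill calculator, under one assumption of mine that the post does not spell out: the 1B/month threshold is counted separately for input and output tokens.

```python
# Hypothetical cost calculator for the early-access discount described above.
# Assumption (mine, not stated in the post): the 1B/month threshold applies
# separately to input tokens and output tokens.
BASE = {"input": 0.40, "output": 2.00}        # $/M tokens, standard rate
DISCOUNTED = {"input": 0.30, "output": 1.50}  # 25% off above the threshold
THRESHOLD = 1_000_000_000                     # 1B tokens/month

def monthly_cost(tokens: int, kind: str) -> float:
    """Dollar cost for one token type under the tiered early-access pricing."""
    below = min(tokens, THRESHOLD)
    above = max(tokens - THRESHOLD, 0)
    return below / 1e6 * BASE[kind] + above / 1e6 * DISCOUNTED[kind]

# 1.5B input tokens -> $400 + $150 = $550; 0.5B output tokens -> $1,000
bill = monthly_cost(1_500_000_000, "input") + monthly_cost(500_000_000, "output")
# bill == 1550.0
```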


    Why idle GPUs matter

    The average enterprise GPU cluster runs at about 50% utilization. Training jobs finish, inference traffic dips, and the hardware just sits there. The power, cooling, and depreciation are already paid for, but the GPUs aren't doing anything.

    Lilac's Kubernetes operator finds that spare capacity and spins up inference workloads on it. When the cluster's own jobs need GPUs back, our operator steps aside immediately. The GPU owner's workloads always come first.

    Since the fixed costs are already covered, the cost of serving inference on that spare capacity is much lower than renting dedicated GPUs. We pass those savings through to you.
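    To make the scale of that spare capacity concrete, here is a back-of-the-envelope calculation. The cluster size is a hypothetical number of my choosing; the ~50% utilization figure comes from the post.

```python
# Back-of-the-envelope illustration (cluster size is hypothetical): how many
# GPU-hours a half-utilized cluster leaves idle each month.
def idle_gpu_hours(num_gpus: int, utilization: float, hours: float) -> float:
    """GPU-hours left unused over a period at a given average utilization."""
    return num_gpus * (1.0 - utilization) * hours

# A hypothetical 64-GPU cluster at ~50% average utilization, over a
# 30-day month (720 hours):
spare = idle_gpu_hours(64, 0.50, 720)  # 23,040 GPU-hours of paid-for capacity
```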

    What we serve today

    We're currently serving Kimi K2.5, Moonshot AI's 1T parameter mixture-of-experts model with 32B active parameters.
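    The mixture-of-experts design is part of why serving is cheap: only a small fraction of the weights is active for any given token. A quick check on the numbers above:

```python
# Quick check on the MoE figures above: fraction of the 1T total parameters
# that are active for any single token.
total_params = 1_000_000_000_000   # 1T total parameters
active_params = 32_000_000_000     # 32B active parameters
active_fraction = active_params / total_params  # 0.032, i.e. ~3.2%
```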

    Model        Input Price         Output Price
    Kimi K2.5    $0.40 / M tokens    $2.00 / M tokens

    The API is OpenAI-compatible, so switching is a one-line change: point base_url at our endpoint.

    from openai import OpenAI

    # Point the OpenAI SDK at Lilac's endpoint; everything else stays the same.
    client = OpenAI(
        base_url="https://api.getlilac.com/v1",
        api_key="lilac_sk_...",
    )

    response = client.chat.completions.create(
        model="kimi-k2-5",
        messages=[{"role": "user", "content": "Hello!"}],
    )

    How we compare

    We benchmarked our endpoint with NVIDIA AIPerf at 100 concurrent requests and compared against every Kimi K2.5 provider listed on OpenRouter.

    Chart: Kimi K2.5 provider comparison (output speed vs. price across OpenRouter providers)

    At 28 tokens/sec per user and $2.00/M output, Lilac is in a comparable speed band while landing at the lowest price in this benchmark snapshot. Fireworks and Venice push higher throughput, but at $3.00-3.50/M output.
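    It can help to translate those benchmark numbers into per-request terms. A small sketch of what 28 tokens/sec per user and $2.00/M output mean for a single 1,000-token completion:

```python
# Sanity check of the benchmark figures above: wall-clock time and output
# cost for one completion at the measured per-user decode speed.
OUTPUT_PRICE_PER_M = 2.00   # $/M output tokens, from the pricing table
TOKENS_PER_SEC = 28         # per-user decode speed from the benchmark

def completion_stats(output_tokens: int) -> tuple[float, float]:
    """Return (seconds to stream, output cost in dollars) for one completion."""
    seconds = output_tokens / TOKENS_PER_SEC
    cost = output_tokens / 1_000_000 * OUTPUT_PRICE_PER_M
    return seconds, cost

seconds, cost = completion_stats(1000)  # ~35.7 s, $0.002 of output tokens
```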

    How the operator works

    GPU providers install our Kubernetes operator with a single kubectl apply. It does four things:

    1. Monitors node utilization and finds reclaimable GPU capacity
    2. Deploys inference servers (vLLM) onto idle nodes
    3. Routes API requests to healthy instances
    4. Preempts inference workloads when primary jobs need GPUs back

    Providers choose which node pools to expose and set their own availability windows. Their workloads always take priority.
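    The steps above can be sketched as a simple reconcile loop. This is an illustrative sketch only, not Lilac's actual implementation; the node model and the utilization thresholds are hypothetical.

```python
# Illustrative sketch of the operator's decision loop (not the real operator).
# Thresholds and the Node model are hypothetical, chosen for illustration.
from dataclasses import dataclass

IDLE_THRESHOLD = 0.10  # hypothetical: deploy inference below 10% primary load
BUSY_THRESHOLD = 0.50  # hypothetical: preempt inference above 50% primary load

@dataclass
class Node:
    name: str
    gpu_utilization: float       # fraction of GPUs busy with primary jobs
    running_inference: bool = False

def reconcile(nodes: list[Node]) -> list[tuple[str, str]]:
    """One control-loop pass: deploy on idle nodes, preempt on busy ones."""
    actions = []
    for node in nodes:
        if not node.running_inference and node.gpu_utilization < IDLE_THRESHOLD:
            node.running_inference = True
            actions.append(("deploy", node.name))
        elif node.running_inference and node.gpu_utilization > BUSY_THRESHOLD:
            node.running_inference = False  # primary workloads take priority
            actions.append(("preempt", node.name))
    return actions
```

    In a real operator this loop would run against cluster metrics on a timer, and "preempt" would mean evicting the vLLM pod so the scheduler can place the primary job.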


    More on idle GPU economics: The GPU Scarcity Paradox. Broader pricing comparison: GPU Inference API Pricing Compared. Demand-side entry points: Cheap Inference API and Kimi K2.5 API.