Self-Hosted LLM vs API: A Cost Breakdown

The self-hosted LLM vs API cost question comes down to two things: how much you use it, and how steadily. Calling a hosted model API costs anywhere from cents to tens of dollars per million tokens with zero fixed cost, so it wins when your usage is low or bursty. Self-hosting an open-weight model means renting a GPU by the hour or month, a fixed bill whether you use it or not, so it wins only at high, steady volume that keeps the hardware busy. Most teams starting out are firmly in API territory and do not need their own GPU yet.

The honest version is that there is a crossover point, often called the break-even, where the fixed cost of a GPU finally beats the per-token cost of an API. Where that point sits depends heavily on your volume, your model, and how well you keep the hardware utilized. This guide walks through the cost drivers on each side, the hidden costs nobody quotes you, and how to find your own break-even. If you want the hardware side in detail, see our guide on VPS specs to run an LLM, and for picking a server in general, how to choose a VPS. The pattern here mirrors the one in which hosting type is best for small websites: match the tool to where you are now, not where you might be someday.

What does it cost to call a hosted LLM API?

Hosted APIs charge per token, with no fixed monthly cost. You pay separately for input (your prompt) and output (the model's reply), priced per million tokens. A token is roughly three-quarters of a word, so a million tokens is about 750,000 words of prompts and replies combined, enough for a few hundred document summaries. You pay only for what you send and receive.

Prices span a wide range by model. On the closed-model side, OpenAI's flagship GPT-5.5 lists at $5.00 per million input tokens and $30.00 per million output tokens, while its smallest model, GPT-5.4-nano, is $0.20 input and $1.25 output (OpenAI API pricing, figures approximate and current as of June 2026). Open-weight models served through a provider are cheaper still. Together AI lists Llama 3.1 8B at $0.18 per million tokens and Llama 3.3 70B at $0.88 per million (Together AI pricing). All of these are approximate and change often, so treat them as a snapshot, not a contract.

The structure is what matters more than the exact numbers. The cost scales linearly with use: ten times the traffic, ten times the bill. There is nothing to provision, no idle waste, and no maintenance. Output tokens usually cost several times more than input tokens, so chatty, long-form responses cost disproportionately more than short ones. That linear, no-floor shape is the whole appeal, and also the whole limitation.
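To make the linear shape concrete, here is a minimal sketch of the per-token arithmetic. The rates are the approximate OpenAI list prices quoted above; the monthly workload of 40 million input tokens and 10 million output tokens is a hypothetical chosen only for illustration.

```python
def api_cost_usd(input_tokens, output_tokens, input_rate_per_m, output_rate_per_m):
    """Return the API bill in USD; rates are USD per million tokens."""
    input_cost = (input_tokens / 1_000_000) * input_rate_per_m
    output_cost = (output_tokens / 1_000_000) * output_rate_per_m
    return input_cost + output_cost

# Hypothetical monthly workload: 40M input tokens, 10M output tokens.
flagship = api_cost_usd(40_000_000, 10_000_000, 5.00, 30.00)  # about $500
nano = api_cost_usd(40_000_000, 10_000_000, 0.20, 1.25)       # about $20.50

# Ten times the traffic gives ten times the bill.
flagship_10x = api_cost_usd(400_000_000, 100_000_000, 5.00, 30.00)

print(f"flagship: ${flagship:,.2f}  nano: ${nano:,.2f}  flagship at 10x: ${flagship_10x:,.2f}")
```

In the flagship case the 10 million output tokens cost more than the 40 million input tokens, which is the output-heavy skew described above.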

What does it cost to self-host an open-weight model?

Self-hosting flips the structure: you pay a fixed price for a GPU, and the marginal cost of each additional token effectively goes to zero. You rent (or buy) a graphics card, load an open-weight model like Llama or Mistral onto it, and serve requests yourself. The meter runs on time, not tokens, so the bill is the same whether the card is flat out or sitting idle.

Cloud GPU rental is the usual on-ramp because it avoids a hardware purchase. As a snapshot of current marketplace rates, RunPod lists an RTX 4090 at $0.69 per hour, an A100 PCIe at $1.39 per hour, and an H100 SXM at $3.29 per hour (RunPod pricing). Spot and community tiers run cheaper, dedicated enterprise instances run higher, and a small model can run on a single consumer card while a 70B model wants a high-memory data-center GPU. Run an H100 around the clock and the math is simple: roughly $3.29 an hour is about $2,400 a month, paid in full whether you serve one request or a million.
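The monthly figure is just the hourly rate multiplied by hours. A small sketch using the RunPod rates quoted above, assuming a 730-hour month (24 hours times 365 days, divided by 12) and a card that is never switched off:

```python
HOURS_PER_MONTH = 730  # assumption: average month, running around the clock

# Snapshot hourly rates from the paragraph above (USD per hour).
RUNPOD_HOURLY_USD = {"RTX 4090": 0.69, "A100 PCIe": 1.39, "H100 SXM": 3.29}

def gpu_monthly_cost_usd(hourly_rate, hours=HOURS_PER_MONTH):
    """Fixed monthly bill for one GPU, independent of how many tokens it serves."""
    return hourly_rate * hours

for name, rate in RUNPOD_HOURLY_USD.items():
    print(f"{name}: ${gpu_monthly_cost_usd(rate):,.0f} per month")
```

Passing a smaller `hours` value models a card that is shut down overnight, which only helps if your provider stops billing when the instance is stopped.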

The cost drivers on the self-hosting side are the GPU class you need (driven by model size and required speed), how many hours you keep it running, and your utilization. The card does not care whether it is busy. A GPU at 10 percent load costs about ten times as much per token as the same card at full load (Braincuber). That single fact is why self-hosting punishes low and bursty traffic so badly.
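The utilization effect can be sketched in a few lines. The throughput figure below (1,000 tokens per second at full load) and the $2,400 monthly bill are hypothetical placeholders, not measurements. The ratio between the two results does not depend on either of them.

```python
SECONDS_PER_MONTH = 730 * 3600  # assumption: 730-hour month

def cost_per_million_tokens(monthly_gpu_usd, peak_tokens_per_sec, utilization):
    """Effective USD per million tokens for a GPU billed at a flat monthly rate."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_served = peak_tokens_per_sec * utilization * SECONDS_PER_MONTH
    return monthly_gpu_usd / (tokens_served / 1_000_000)

# Hypothetical card: $2,400 a month, 1,000 tokens/sec when fully loaded.
busy = cost_per_million_tokens(2400, 1000, 1.0)
idle = cost_per_million_tokens(2400, 1000, 0.1)

print(f"full load: ${busy:.2f}/M  10% load: ${idle:.2f}/M  ratio: {idle / busy:.1f}x")
```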

Where is the break-even point between self-hosting and an API?

The break-even is the monthly volume at which a fixed GPU bill becomes cheaper than per-token API fees. Below it, the API is cheaper because you pay nothing for idle time. Above it, self-hosting wins because the per-token cost has fallen toward zero. For a mid-size open model, that crossover tends to land in the range of roughly one hundred million to a few hundred million tokens per day.

One published analysis puts a concrete number on it. Running Llama 70B, self-hosting costs roughly $4,360 a month at around 500 million tokens per day, against roughly $22,500 a month to serve the same volume through an API, a fivefold gap in self-hosting's favor at that scale. The same piece pegs the practical crossover near $4,200 a month of API spend (Braincuber). The shape of the curve is the lesson: API cost rises in a straight line with usage, while self-hosting is a flat line set by your GPU. The two cross at one point, and that point moves with your model and your prices.
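The published figures can be turned into a rough break-even volume. This is a back-of-the-envelope sketch, not part of the cited analysis: it assumes a 30-day month, a single blended API rate, and a self-hosting bill that stays flat as volume falls, which will not hold if a smaller deployment could run on cheaper hardware.

```python
DAYS_PER_MONTH = 30  # assumption

# Figures quoted above for Llama 70B at 500 million tokens per day.
api_monthly_usd = 22_500
selfhost_monthly_usd = 4_360
tokens_per_day = 500_000_000

# Blended API rate implied by those figures, in USD per million tokens.
monthly_tokens_m = tokens_per_day * DAYS_PER_MONTH / 1_000_000
implied_rate_per_m = api_monthly_usd / monthly_tokens_m  # 1.50

def break_even_tokens_per_day(fixed_monthly_usd, api_rate_per_m, days=DAYS_PER_MONTH):
    """Daily volume at which per-token API spend equals a flat monthly bill."""
    return fixed_monthly_usd / api_rate_per_m * 1_000_000 / days

crossover = break_even_tokens_per_day(selfhost_monthly_usd, implied_rate_per_m)
print(f"implied rate: ${implied_rate_per_m:.2f}/M")
print(f"break-even: about {crossover / 1e6:.0f}M tokens per day")
```

Under these assumptions the crossover lands a little under 100 million tokens per day. Treat that as an order-of-magnitude check rather than a target.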

A few things shift the break-even, sometimes dramatically:

The model you compare against. Against an expensive flagship closed model, self-hosting an open model breaks even at a far lower volume. Against a cheap open-model API provider already running optimized infrastructure at thin margins, the break-even shifts much higher, because the per-token price you have to beat is already low.
Your utilization. The break-even assumes the GPU stays busy. Idle hours raise your effective per-token cost and push the crossover further away.
Whether you can use spot or cheaper hardware. Cheaper or interruptible GPUs lower the fixed line and move the break-even down.
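A quick sweep shows how far those factors move the crossover. This sketch holds the self-hosting bill at the $4,360 a month quoted earlier and varies the API rate being replaced. The $10 blended flagship rate assumes an 80/20 input/output split at $5 and $30 per million; the $1.50 rate is simply $22,500 divided by the 15,000 million tokens in a 30-day month at 500 million a day. Both are illustrative assumptions, and the sweep ignores any quality difference between models.

```python
def break_even_tokens_per_day(fixed_monthly_usd, api_rate_per_m, days=30):
    """Daily volume at which per-token API spend equals a flat monthly bill."""
    return fixed_monthly_usd / api_rate_per_m * 1_000_000 / days

FIXED_MONTHLY_USD = 4_360  # self-hosting figure quoted earlier

scenarios = {
    "flagship closed model, assumed $10.00/M blended": 10.00,
    "rate implied by the published comparison, $1.50/M": 1.50,
    "cheap open-model provider, $0.88/M": 0.88,
}

for name, rate in scenarios.items():
    volume = break_even_tokens_per_day(FIXED_MONTHLY_USD, rate)
    print(f"{name}: about {volume / 1e6:.0f}M tokens/day")

# Cheaper or interruptible hardware lowers the fixed line (hypothetical 50% saving).
spot = break_even_tokens_per_day(FIXED_MONTHLY_USD * 0.5, 0.88)
print(f"same provider with half-price hardware: about {spot / 1e6:.0f}M tokens/day")
```

The break-even volume is inversely proportional to the API rate, so the cheap provider requires more than ten times the volume the flagship comparison does.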

What hidden costs do both options carry?

Both sides have costs that never appear on the headline price, and ignoring them is how budgets blow up. On the API side it is mostly per-token sprawl and data movement. On the self-hosting side it is idle time and the engineering effort to run the thing. The honest accounting includes all of it, not just the sticker price.

On the API side, watch for:

Output token blowup. Output usually costs several times more than input, so verbose responses and long chains of reasoning quietly multiply your bill.
Retries and context stuffing. Failed calls you retry, plus large prompts and long conversation history resent on every turn, all bill as fresh input tokens.
Egress, if you pull data across clouds. Moving data out of a cloud is metered. AWS charges about $0.09 per GB for the first 10 TB of internet egress each month (AWS egress, via EgressCost). For pure API calls this is small, but it grows once large documents or images are involved.
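Two of these are easy to underestimate, so here is a rough sketch of each. The first function counts billed input tokens when the whole conversation history is resent on every turn. It ignores system prompts and any prompt-caching discount a provider may offer, and the turn sizes are hypothetical. The second applies the flat $0.09 per GB egress rate quoted above.

```python
def conversation_input_tokens(turns, user_tokens, reply_tokens):
    """Total billed input tokens when the full history is resent each turn."""
    total = 0
    history = 0
    for _ in range(turns):
        total += history + user_tokens         # prior turns plus the new message
        history += user_tokens + reply_tokens  # this exchange joins the history
    return total

def egress_cost_usd(gigabytes, rate_per_gb=0.09):
    """Egress cost at a flat per-GB rate (first-tier AWS figure quoted above)."""
    return gigabytes * rate_per_gb

# Hypothetical chat: 20 turns, 100-token messages, 300-token replies.
resent = conversation_input_tokens(20, 100, 300)
fresh_only = 20 * 100
print(f"billed input tokens: {resent:,} versus {fresh_only:,} typed by the user")
print(f"500 GB of egress: ${egress_cost_usd(500):.2f}")
```

Because each turn resends everything before it, billed input grows roughly with the square of the conversation length, which is why trimming or summarizing history has an outsized effect.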

On the self-hosting side, watch for:

Idle GPU time. This is the big one. Every hour the card sits unused is money spent serving nothing, and it is why bursty workloads suit APIs better.
Engineering and maintenance. Someone has to deploy the model, monitor it, patch it, and keep it up. Full operational cost commonly runs three to five times the raw GPU price once people are counted (Braincuber).
Overprovisioning for peaks. Sizing hardware for your busiest hour means paying for that capacity during every quiet hour too.
Egress and storage. Model weights are large, and moving outputs or data out of your provider is metered the same way it is for an API.
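The people cost and the peak-sizing penalty can both be estimated before you rent anything. A minimal sketch, using the three-to-five-times multiplier cited above and hypothetical throughput numbers:

```python
import math

def full_cost_range_usd(raw_gpu_monthly_usd, low=3.0, high=5.0):
    """Range for full operational cost, using the 3x to 5x multiplier cited above."""
    return raw_gpu_monthly_usd * low, raw_gpu_monthly_usd * high

def peak_sizing(peak_tokens_per_sec, avg_tokens_per_sec, per_gpu_tokens_per_sec):
    """GPUs needed to cover the busiest moment, and the average utilization that results."""
    gpus = math.ceil(peak_tokens_per_sec / per_gpu_tokens_per_sec)
    utilization = avg_tokens_per_sec / (gpus * per_gpu_tokens_per_sec)
    return gpus, utilization

low, high = full_cost_range_usd(2400)
print(f"raw GPU $2,400/month, full cost roughly ${low:,.0f} to ${high:,.0f}")

# Hypothetical traffic: 2,500 tokens/sec at peak, 500 on average, 1,000 per GPU.
gpus, utilization = peak_sizing(2500, 500, 1000)
print(f"{gpus} GPUs sized for peak, average utilization {utilization:.0%}")
```

In this example the fleet sized for peak sits at roughly one-sixth utilization on average, so the effective per-token cost is about six times what the same hardware would deliver fully loaded.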

When does each option win?

Each option wins in a clear, predictable zone, and the deciding factor is your sustained volume and how steady it is. Low or bursty usage favors APIs, because you pay nothing for idle time and skip all the operational work. High, steady, predictable volume favors self-hosting, because a busy GPU drives the per-token cost toward zero. Most teams sit in the first zone longer than they expect.

Choose a hosted API if:

Your usage is low, early-stage, or spiky, and idle hardware would sit wasted.
You want zero infrastructure to manage and a bill that tracks usage exactly.
You need frequent access to the newest or largest closed models.
You are still prototyping and your volume is not stable enough to predict.

Choose to self-host if:

Your volume is high and steady enough to keep a GPU genuinely busy.
You have run the numbers and you are comfortably past your own break-even.
You need data residency, privacy, or control that a third-party API cannot give you.
You have the engineering capacity to operate and maintain the deployment.

A reserved or dedicated GPU is the right move when you have consistent, heavy, predictable inference and the team to run it. For occasional or unpredictable workloads, you genuinely do not need it yet, and forcing it on bursty traffic is one of the more expensive mistakes in this space. Platforms that host open models, including Carpathian, can shift some of that operational burden by running the hardware for you, though that convenience is itself a cost to weigh against doing it yourself.

What is the honest limitation of this comparison?

The honest limitation is that there is no universal break-even number, and anyone who quotes you one without your inputs is guessing. The crossover depends heavily on three things you control: your token volume, the specific model you run, and how well you keep the hardware utilized. Change any one and the answer moves, sometimes by an order of magnitude.

The published figures here, like a roughly $4,200-a-month crossover or the fivefold gap at 500 million tokens a day, are useful reference points from a specific analysis, not laws (Braincuber). Your own number could be far lower if you are escaping an expensive flagship model, or far higher if you are competing with a cheap open-model provider running at scale. The only reliable method is to do the arithmetic with your numbers: estimate your monthly tokens, multiply by current API rates for your model, and compare that against the monthly cost of a GPU sized for your model at your expected utilization.

How do you run the numbers for yourself?

Work through it in order and the answer falls out. The point is to compare your API spend at your volume against the fixed cost of a GPU that could do the same job, with utilization factored in honestly. It takes five minutes and beats any rule of thumb.

Estimate your monthly token volume, splitting input and output, since output usually costs more.
Price the API path. Multiply your input and output tokens by the per-million rates for the model you would use, from the provider's current pricing page.
Price the self-hosted path. Pick the GPU your model needs, take its hourly rate, and multiply by the hours you would keep it running. Be honest about whether it stays busy.
Add the hidden costs. On self-hosting, add engineering and maintenance, often several times the raw GPU cost. On the API, add egress and the overhead of retries and resent context.
Compare and recheck quarterly. Both API prices and GPU rates move, so a decision that was right last quarter can flip. Revisit when your volume materially changes.
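The steps above can be sketched as one small function. Every default here is an assumption to replace with your own numbers: `ops_multiplier` stands in for engineering and maintenance, `api_overhead` for retries and resent context, and `close_margin` for how much cheaper self-hosting must be before it is worth the operational burden. The function does not check whether the GPUs can actually serve the volume, so `gpu_count` has to come from your own throughput estimate.

```python
def compare_paths(input_tokens_m, output_tokens_m, input_rate, output_rate,
                  gpu_hourly_usd, gpu_count=1, gpu_hours=730,
                  ops_multiplier=3.0, api_overhead=1.10, close_margin=0.20):
    """Compare monthly API spend with self-hosting cost; token volumes are in millions."""
    api_usd = (input_tokens_m * input_rate + output_tokens_m * output_rate) * api_overhead
    selfhost_usd = gpu_hourly_usd * gpu_hours * gpu_count * ops_multiplier
    # Self-hosting has to win by a clear margin; otherwise stay on the API.
    choice = "self-host" if selfhost_usd < api_usd * (1 - close_margin) else "api"
    return {"api_usd": api_usd, "selfhost_usd": selfhost_usd, "choice": choice}

# Hypothetical workloads priced at $0.88/M against H100s at $3.29/hour.
small = compare_paths(40, 10, 0.88, 0.88, 3.29)
close = compare_paths(12_000, 3_000, 0.88, 0.88, 3.29, gpu_count=2)
heavy = compare_paths(60_000, 15_000, 0.88, 0.88, 3.29, gpu_count=4)

for name, result in [("small", small), ("close", close), ("heavy", heavy)]:
    print(f"{name}: API ${result['api_usd']:,.0f} vs self-host "
          f"${result['selfhost_usd']:,.0f} -> {result['choice']}")
```

Rerunning this with fresh prices each quarter is the cheapest way to notice when the answer has flipped.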

If the two paths come out close, default to the API. It carries no operational burden and no idle risk, and it lets you switch later once your volume is high and steady enough to make self-hosting an easy call. The better-run infrastructure tends to be the kind you stop thinking about: sized to what you use, quietly dependable, and priced for the work it does rather than the capacity it might.