AI Inference API: OpenAI-Compatible Endpoints

June 27, 2026
9 min read

Learn how inference APIs work, how to migrate, and how to keep your own provider key with a BYOK gateway.

An AI inference API is how your application turns a hosted model into a feature: you send a request with your messages, the model returns a completion, and your application uses the result. The catch with most providers is lock-in and unpredictable per-token billing. Carpathian's AI inference API is OpenAI-compatible, so the libraries you already use keep working after you change two things: the base URL and the API key.

This guide covers what an inference API is, why OpenAI compatibility matters, how to point existing code at Carpathian, how to keep your own provider key through a gateway, and how to avoid the mistakes that bite teams the first time they integrate.

What is an AI inference API?

"Inference" is the act of running a trained model to get an answer, as opposed to "training," which is the expensive process of building the model in the first place. An inference API exposes that capability over HTTP so your code never touches a GPU or a model weight; it just calls an endpoint.

You send a structured request like this:

{
  "model": "your-model",
  "messages": [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "Summarize this support ticket." }
  ]
}

The API returns the model's response along with token usage. The mental model is simple: think of the API as a function call that happens to run on someone else's hardware. You pass messages in, you get text out, and the heavy lifting (loading the model, allocating GPU memory, running the math) happens behind the endpoint.
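To make the "function call" mental model concrete, here is a minimal sketch of reading a reply. The sample payload is invented for illustration; the field names (choices, message, usage) follow the standard chat-completions response format, and extract_reply is a hypothetical helper name.

```python
import json

# Illustrative response body in the standard chat-completions shape.
# The id, text, and token counts are made up for this example.
SAMPLE_RESPONSE = """
{
  "id": "chatcmpl-example",
  "object": "chat.completion",
  "model": "your-model",
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant", "content": "Customer cannot reset their password."},
      "finish_reason": "stop"
    }
  ],
  "usage": {"prompt_tokens": 24, "completion_tokens": 7, "total_tokens": 31}
}
"""

def extract_reply(body: str) -> tuple[str, int]:
    """Return the assistant text and total token count from a response body."""
    data = json.loads(body)
    text = data["choices"][0]["message"]["content"]
    total_tokens = data["usage"]["total_tokens"]
    return text, total_tokens

text, tokens = extract_reply(SAMPLE_RESPONSE)
print(text)
print(tokens)
```

When you use an SDK, the same fields are exposed as attributes on the response object, so you rarely parse the JSON by hand.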

The anatomy of a chat-completions request

A few fields do most of the work, and understanding them makes the rest of this guide concrete:

  • model selects which hosted model answers. When the model is pinned to a fixed version, this value stays stable and so does the behavior behind it.
  • messages is the conversation so far, as a list of role-tagged turns (system, user, assistant). The system message sets behavior; the rest is the dialogue.
  • stream controls whether the response arrives all at once or token by token, which matters for chat UIs where you want text to appear as it is generated.
  • temperature and related sampling parameters tune how deterministic or creative the output is.

Because this shape is the standard, the same request works across OpenAI-compatible providers, which is the whole point of the next section.
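When stream is true, OpenAI-compatible endpoints send the reply as server-sent events: each "data:" line carries a JSON chunk whose choices[0].delta.content holds the next piece of text, and the stream ends with "data: [DONE]". The sketch below reassembles text from such lines. The sample lines are invented and collect_stream is a hypothetical helper; in practice the SDK does this parsing for you.

```python
import json
from typing import Iterable

def collect_stream(lines: Iterable[str]) -> str:
    """Join the text fragments from chat-completions streaming lines."""
    parts: list[str] = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything that is not data
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # sentinel that marks the end of the stream
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            # The first chunk may carry only the role; content can be absent or null.
            piece = choice.get("delta", {}).get("content")
            if piece:
                parts.append(piece)
    return "".join(parts)

# Invented sample of what a streamed reply looks like on the wire.
sample_lines = [
    'data: {"choices": [{"index": 0, "delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"index": 0, "delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"index": 0, "delta": {"content": "lo"}}]}',
    'data: {"choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]}',
    "data: [DONE]",
]
print(collect_stream(sample_lines))
```

A chat UI would render each fragment as it arrives instead of joining them at the end, but the parsing is the same.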

Why OpenAI compatibility matters

The OpenAI request and response format has become the de facto standard for talking to LLMs. SDKs, agent frameworks, observability tools, and tutorials all speak it. An OpenAI-compatible API means:

  • No rewrite. The official OpenAI SDKs and most agent frameworks work unchanged.
  • A two-line switch. Point the SDK at a new base_url and pass a new key.
  • No lock-in. If your code targets the standard, you can move between compatible providers without rearchitecting, which keeps a provider honest on price and reliability.
  • A bigger ecosystem. Tooling built for the standard (retries, logging, evaluation harnesses) works without adapters.

Carpathian exposes standard chat-completions-style endpoints, so a client configured for OpenAI works against Carpathian-hosted models. The value is not only convenience today; it is the option to leave tomorrow, which is the cheapest insurance you can buy against lock-in.

Switching existing code to Carpathian

Most SDKs let you override the base URL and key. Conceptually, in Python:

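from openai import OpenAI

# Placeholder values: the real base URL and key come from your account.
client = OpenAI(
    base_url="https://your-region.carpathian.ai/ai/v1",
    api_key="your-carpathian-key",
)

# The request is the same call you would make against OpenAI.
response = client.chat.completions.create(
    model="your-model",
    messages=[{"role": "user", "content": "Hello"}],
)

# The reply text lives on the first choice.
print(response.choices[0].message.content)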

The same idea applies in JavaScript or any language with an OpenAI client: set the base URL, set the key, and leave the rest. The exact base URL and model name come from your account, and the shape of your code does not change. See the documentation for the current endpoint and the models available to you.
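One way to keep the switch down to two values is to read both from the environment. This is a sketch, not Carpathian's documented setup: the variable names LLM_BASE_URL and LLM_API_KEY and the load_settings helper are assumptions made for illustration.

```python
import os
from dataclasses import dataclass
from typing import Mapping

@dataclass(frozen=True)
class LLMSettings:
    base_url: str
    api_key: str

def load_settings(env: Mapping[str, str] = os.environ) -> LLMSettings:
    """Read provider settings from the environment and fail loudly if any are missing."""
    missing = [name for name in ("LLM_BASE_URL", "LLM_API_KEY") if not env.get(name)]
    if missing:
        raise RuntimeError("missing configuration: " + ", ".join(missing))
    # Strip a trailing slash so the value is stored in one consistent form.
    return LLMSettings(base_url=env["LLM_BASE_URL"].rstrip("/"), api_key=env["LLM_API_KEY"])

# Example with an explicit mapping; in production the default os.environ is used.
settings = load_settings({
    "LLM_BASE_URL": "https://your-region.carpathian.ai/ai/v1/",
    "LLM_API_KEY": "your-carpathian-key",
})
print(settings.base_url)
```

The resulting values can be passed straight to OpenAI(base_url=settings.base_url, api_key=settings.api_key), so moving between providers becomes a deployment change rather than a code change.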

A safe migration sequence

Switching providers is low risk when you do it in order rather than all at once:

  1. Generate a key. Create an account and issue an API key scoped to the application you are migrating.
  2. Point one client at the new base URL. Change base_url and api_key in a single, non-critical service first.
  3. Send a test completion. Confirm the response shape matches what your code already expects (it should, because the format is standard).
  4. Run your existing tests. If you have an evaluation set or golden outputs, run them against the new endpoint and compare.
  5. Move one feature, then expand. Cut over a low-stakes feature, watch latency and quality, then bring the rest across.
  6. Retire the old path. Once you trust the new endpoint, remove the old key and stop the variable bill.

Because the model is pinned, prompts that pass validation today keep producing the same kind of output, so you are not re-testing on someone else's release schedule.
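Step 4 can be sketched as a small comparison harness. Everything here is illustrative: compare_to_golden, the complete callable (a stand-in for whatever function calls your endpoint), and the 0.8 similarity threshold are assumptions, and the difflib ratio is a crude text-similarity measure rather than a quality metric.

```python
from difflib import SequenceMatcher
from typing import Callable

def compare_to_golden(
    complete: Callable[[str], str],
    golden: dict[str, str],
    threshold: float = 0.8,
) -> list[str]:
    """Return the prompts whose new output drifts too far from the golden output."""
    failures = []
    for prompt, expected in golden.items():
        actual = complete(prompt)
        similarity = SequenceMatcher(None, expected, actual).ratio()
        if similarity < threshold:
            failures.append(prompt)
    return failures

# Stand-in for a real call to the new endpoint, so the example runs offline.
def fake_complete(prompt: str) -> str:
    canned = {
        "Summarize ticket 1": "User cannot log in after password reset.",
        "Summarize ticket 2": "Totally unrelated answer.",
    }
    return canned[prompt]

# Hypothetical golden outputs recorded from the old provider.
golden = {
    "Summarize ticket 1": "User cannot log in after a password reset.",
    "Summarize ticket 2": "Invoice PDF fails to download in Safari.",
}
print(compare_to_golden(fake_complete, golden))
```

Prompts that come back in the failure list are the ones worth a human look before you cut over; an empty list means the new endpoint stayed within the threshold on every case.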

Already paying for OpenAI or Anthropic? Keep your keys.

If you are not ready to move models but want control over cost, access, and security, Carpathian also offers a bring-your-own-key (BYOK) gateway. You keep your provider account, and requests flow through a proxy that adds:

  • Rate limiting, so a runaway loop or a leaked key cannot drain your account in an afternoon.
  • IP controls, so the key only works from your own servers and approved networks.
  • Token budgets, so each team or application has a ceiling you set.
  • Usage logging, so you can answer "who called what, when, and how much" during a review.

It is the same OpenAI-compatible shape, so your code still does not change. The gateway is about governance over a provider you already use, while private LLM hosting is about moving the model itself onto US-based infrastructure you control. Many teams start with BYOK for governance, then graduate to private hosting when data residency becomes a hard requirement.
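To make two of those controls concrete, here is a toy version of an IP allowlist combined with a per-team token budget. It illustrates the idea only and is not how Carpathian's gateway is implemented; GatewayPolicy, its fields, and the sample network and budget are all hypothetical.

```python
import ipaddress
from dataclasses import dataclass, field

@dataclass
class GatewayPolicy:
    allowed_networks: list[str]            # CIDR ranges allowed to use the key
    token_budgets: dict[str, int]          # per-team token ceiling
    used: dict[str, int] = field(default_factory=dict)

    def ip_allowed(self, ip: str) -> bool:
        addr = ipaddress.ip_address(ip)
        return any(addr in ipaddress.ip_network(net) for net in self.allowed_networks)

    def authorize(self, team: str, ip: str, tokens: int) -> bool:
        """Admit the request only if the caller's IP and remaining budget allow it."""
        if not self.ip_allowed(ip):
            return False
        spent = self.used.get(team, 0)
        if spent + tokens > self.token_budgets.get(team, 0):
            return False
        self.used[team] = spent + tokens   # record usage for admitted requests
        return True

policy = GatewayPolicy(
    allowed_networks=["10.0.0.0/8"],       # example private range
    token_budgets={"support": 1000},       # example ceiling
)
print(policy.authorize("support", "10.1.2.3", 600))     # True: allowed IP, within budget
print(policy.authorize("support", "10.1.2.3", 600))     # False: would exceed the budget
print(policy.authorize("support", "203.0.113.9", 10))   # False: IP outside the allowlist
```

A real gateway would also reset budgets on a schedule and log every decision; the sketch leaves both out to keep the core check visible.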

Flat pricing instead of per-token guesswork

The hardest part of building on a public AI API is forecasting the bill. Per-token pricing means your cost rises with your success: the more your feature gets used, the more you pay, and finance gets a different number every month. Carpathian's model is flat monthly pricing on US-based infrastructure, so AI becomes a predictable line item rather than a variable you re-estimate. For steady, high-throughput workloads (classifying every ticket, enriching every record), flat pricing is usually the cheaper and saner choice. Current plans are on the pricing page.
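Whether flat pricing wins for a given workload is simple arithmetic. The sketch below computes the monthly token volume at which a flat fee and a per-token rate cost the same. The dollar figures are hypothetical placeholders, not Carpathian's or any provider's actual prices, and the single blended rate ignores the usual split between input and output token prices.

```python
def per_token_cost(tokens_per_month: int, usd_per_million_tokens: float) -> float:
    """Monthly bill under per-token pricing."""
    return tokens_per_month / 1_000_000 * usd_per_million_tokens

def break_even_tokens(flat_usd_per_month: float, usd_per_million_tokens: float) -> float:
    """Monthly token volume above which the flat fee is the cheaper option."""
    return flat_usd_per_month / usd_per_million_tokens * 1_000_000

# Hypothetical numbers for illustration only.
FLAT_FEE = 500.0          # USD per month
RATE = 2.0                # USD per million tokens

print(break_even_tokens(FLAT_FEE, RATE))          # 250000000.0
print(per_token_cost(400_000_000, RATE))          # 800.0
```

With these placeholder inputs, a workload of 400 million tokens a month would cost more per token than the flat fee, while a workload well under 250 million would not. Plug in your own volumes and the published prices to see which side of the line you are on.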

Common mistakes when integrating an inference API

  • Hardcoding the base URL. Put the base URL and key in configuration, not in code, so switching providers stays a two-line change.
  • Ignoring streaming early. If you are building a chat UI, decide on streaming up front; retrofitting it later means reworking how the response is consumed.
  • Skipping retries and timeouts. Network calls fail. Wrap requests with sensible timeouts and retry logic before you ship, not after an incident.
  • Leaking keys in client code. API keys belong on your server, never in a browser bundle or mobile app where anyone can read them.
  • Forgetting token accounting. Even on flat pricing, tracking usage tells you which features are heavy and where to optimize prompts.
  • Not pinning the model. If you rely on a provider that can swap the model silently, your tested outputs are not guaranteed; a pinned model removes that risk.
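The retry advice above can be sketched with a small backoff wrapper. call_with_retries is a hypothetical helper, and the attempt count and delays are placeholder values. The official OpenAI Python SDK also accepts timeout and max_retries arguments on the client, which may be enough on its own.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def call_with_retries(
    fn: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    retry_on: tuple[type[BaseException], ...] = (TimeoutError, ConnectionError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying transient failures with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
    raise ValueError("attempts must be at least 1")

# Stand-in for a network call that fails twice before succeeding.
calls = {"count": 0}

def flaky() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("simulated timeout")
    return "ok"

# Sleep is replaced with a no-op here so the example finishes instantly.
print(call_with_retries(flaky, sleep=lambda seconds: None))
```

Only errors listed in retry_on are retried; anything else, such as an authentication failure, propagates immediately, which is usually what you want.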

Real-world example: replacing a support summarizer

Imagine a team that already summarizes support tickets through a public API and dreads the monthly bill, which climbs every time the product grows. Their summarizer is a single service that builds a messages array and calls chat completions. To migrate, they change the base URL and key, run their existing golden-output tests against the Carpathian endpoint, and confirm the summaries match what their reviewers expect. They cut over one queue, watch it for a day, then move the rest. The code diff is two lines; the billing change is from a variable number to a fixed one. Nothing about their architecture changes, which is exactly what an OpenAI-compatible standard is supposed to deliver.
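The summarizer in this scenario might look roughly like the sketch below. The function names, prompt wording, and the injected complete callable are all hypothetical; the point is that the provider sits behind one call, so swapping it does not touch the message-building logic.

```python
from typing import Callable

Message = dict[str, str]

def build_summary_messages(ticket_text: str) -> list[Message]:
    """Build the chat messages for a one-ticket summary request."""
    return [
        {"role": "system", "content": "You summarize support tickets in one sentence."},
        {"role": "user", "content": "Summarize this support ticket:\n\n" + ticket_text.strip()},
    ]

def summarize(ticket_text: str, complete: Callable[[list[Message]], str]) -> str:
    """Summarize a ticket using whichever provider the complete callable wraps."""
    return complete(build_summary_messages(ticket_text)).strip()

# Stand-in for a real chat-completions call so the example runs offline.
def fake_complete(messages: list[Message]) -> str:
    return " Customer reports a login failure. "

print(summarize("I can't log in since yesterday.", fake_complete))
```

In the real service, complete would wrap the SDK client, and the migration would change only how that client is constructed.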

Build checklist

  1. Create an account and generate an API key.
  2. Point your OpenAI SDK at your Carpathian base URL and key.
  3. Send a test chat completion and confirm the response shape matches what your code expects.
  4. Add timeouts, retries, and key handling on the server side.
  5. Move a non-critical feature over first, validate outputs, then expand.
  6. Decide whether you also want private LLM hosting for residency, or the BYOK gateway for governance over an existing provider.

Frequently asked questions

Will my existing OpenAI code work? If it targets the standard chat-completions format, yes. You change the base URL and key; the request and response shapes stay the same, which is the entire benefit of an OpenAI-compatible API.

What is an OpenAI-compatible API, exactly? It is an endpoint that accepts the same request format and returns the same response format that OpenAI clients expect. Because that format is the de facto standard, code, SDKs, and tooling built for it work without modification.

Do you support streaming responses? The API follows the standard format used by OpenAI-compatible clients, including streamed responses where the client requests them, so chat UIs can render text as it is generated.

Can I keep using my own OpenAI or Anthropic key? Yes, through the bring-your-own-key gateway. You get rate limiting, IP controls, budgets, and logging on top of your existing provider key, without changing your code.

What is the difference between the BYOK gateway and managed hosting? The gateway adds governance in front of a provider you already use, so the model still runs on that provider. Managed hosting runs the model on Carpathian's US-based infrastructure, which is the right choice when data residency is a requirement. See private LLM hosting for that path.

How is this billed? Flat monthly pricing rather than per-token, so your AI cost stays predictable as usage grows. See pricing for current plans.

Where does inference run? On Carpathian's US-based infrastructure, the same cloud behind our server hosting and software development, so you are not routing requests through an anonymous overseas region.

Can I try a model before writing any code? Yes. Start with the free AI chat to see whether the model fits your use case, then move to the API when you are ready to build. If you are still comparing options, see our roundup of ChatGPT alternatives in 2026.

About the Author

Samuel Malkasian

Founder | Carpathian AI