One OpenAI-compatible API
for every LLM.
A production-grade gateway in a single 30 MB Go binary. Routes to OpenAI, Anthropic, Gemini, and local Ollama with automatic failover, exact and semantic caching, per-customer spend caps, and cost tracking out of the box.
Open source today. Managed cloud launching soon — join the waitlist.
curl http://localhost:8080/v1/chat/completions \
-H "Authorization: Bearer llm0_live_..." \
-H "Content-Type: application/json" \
-d '{"model":"gpt-4o-mini","messages":[{"role":"user","content":"Hello!"}]}'
Same endpoint for claude-opus-4-7, gemini-2.5-pro, or any local Ollama model.
Performance
Faster than LiteLLM. Faster than Portkey.
Measured server-side from gateway_logs.latency_ms. Full methodology, reproduce steps, and per-status-code percentiles in the README.
What's in the box
Built for production, out of the box.
Every feature your LLM infrastructure needs on day one — shipped together, measured end-to-end, documented in a single README.
Multi-provider routing
OpenAI, Anthropic, Gemini, and local Ollama behind a single OpenAI-compatible endpoint. Provider chosen from model name prefix — add new models via SQL, no redeploy.
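The prefix rule is simple enough to sketch in a few lines of Python. This is a minimal sketch, not the gateway's actual code: the prefix-to-provider table here is illustrative, and the real gateway loads its model mapping from Postgres.

```python
# Sketch of prefix-based model routing. The table below is illustrative;
# the gateway reads its mapping from SQL, so new models need no redeploy.
PREFIX_TO_PROVIDER = {
    "gpt-": "openai",
    "claude-": "anthropic",
    "gemini-": "google",
}

def route(model: str) -> str:
    """Pick a provider from the model name prefix; default to local Ollama."""
    for prefix, provider in PREFIX_TO_PROVIDER.items():
        if model.startswith(prefix):
            return provider
    return "ollama"  # unrecognized names are assumed to be local models
```

Adding a provider is then just another row in the table, which is why new models can ship as a SQL insert.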
Automatic cross-provider failover
On 429s, 5xx responses, timeouts, or connection errors, the gateway transparently retries the next provider in the chain. Configurable cloud-first, local-first, or air-gapped modes.
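Under illustrative assumptions (a provider is any callable that returns a response or raises a retryable error), the failover loop looks roughly like this sketch:

```python
# Sketch of cross-provider failover: walk the provider chain in order,
# fall through on retryable failures, and only surface an error once
# every provider in the chain has failed.
class RetryableError(Exception):
    """Stand-in for a 429, 5xx, timeout, or connection failure."""

def call_with_failover(providers, request):
    last_error = None
    for provider in providers:
        try:
            return provider(request)
        except RetryableError as exc:
            last_error = exc  # try the next provider in the chain
    raise last_error  # whole chain exhausted; surface the final error
```

The chain ordering is what the cloud-first, local-first, and air-gapped modes configure.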
Exact + semantic caching
Exact matches hit in <1 ms from Redis. Semantic cache goes further: embed the request, find the closest prior question via pgvector cosine similarity, and return its cached answer. Paraphrased duplicates like “What is the capital of France?” and “Tell me the capital of france” hit at 0.954 similarity in 41 ms — at $0 cost, no LLM call.
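The lookup itself can be sketched with the stdlib. Assumptions: the gateway stores pgvector embeddings alongside cached answers, and the 0.95 threshold and toy vectors below are illustrative; the embedding step is elided.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def semantic_lookup(query_vec, cache, threshold=0.95):
    """Return the cached answer nearest the query embedding, but only
    if its similarity clears the threshold; otherwise miss."""
    best_answer, best_sim = None, -1.0
    for stored_vec, answer in cache:
        sim = cosine(query_vec, stored_vec)
        if sim > best_sim:
            best_answer, best_sim = answer, sim
    return best_answer if best_sim >= threshold else None
```

In production the nearest-neighbor scan is a pgvector index query rather than a Python loop, but the hit/miss decision is the same.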
Per-API-key rate limiting
A token-bucket algorithm runs atomically in Redis via Lua. No race conditions under concurrency. Rejections short-circuit at ~2 ms, protecting the gateway from abuse bursts.
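The algorithm is small; this in-memory Python sketch mirrors the check-and-decrement the gateway performs in Redis (where a Lua script makes it atomic). Capacity and refill rate here are illustrative.

```python
import time

class TokenBucket:
    """In-memory token bucket: `capacity` is the burst size, `rate` is
    tokens refilled per second. The gateway keeps this state in Redis and
    updates it inside a Lua script so the whole check is atomic."""

    def __init__(self, capacity: float, rate: float, now=time.monotonic):
        self.capacity, self.rate, self.now = capacity, rate, now
        self.tokens, self.updated = capacity, now()

    def allow(self) -> bool:
        t = self.now()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (t - self.updated) * self.rate)
        self.updated = t
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # bucket empty: reject with 429
```

Because rejection is just arithmetic on two numbers, denied requests never touch a provider, which is what keeps the short-circuit path in the low milliseconds.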
Per-customer spend caps
Daily and monthly USD limits per end-user via one HTTP header. Hit the cap and choose: block with 429, or downgrade to a cheaper model. Plus hard project-level monthly caps.
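The cap decision reduces to a small pure function. This sketch assumes the gateway already tracks per-customer spend; the field names, the 429 choice, and the downgrade model are illustrative.

```python
def spend_cap_decision(daily_spend, daily_cap, monthly_spend, monthly_cap,
                       on_breach="block"):
    """Decide what to do with a request given a customer's current spend.

    Returns ("allow", None), ("block", 429), or ("downgrade", model),
    depending on the configured breach behavior.
    """
    if daily_spend >= daily_cap or monthly_spend >= monthly_cap:
        if on_breach == "downgrade":
            return ("downgrade", "gpt-4o-mini")  # illustrative cheaper model
        return ("block", 429)
    return ("allow", None)
```

The per-customer identity comes from the HTTP header on the request, so callers opt in without any SDK changes.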
Zero lock-in, self-hosted
One Go binary, Postgres + Redis. MIT licensed. docker compose up and you own the gateway, the data, and the spend controls. Managed cloud coming for teams that want zero ops.
More providers coming
- Groq
- Together
- Fireworks
- DeepSeek
- xAI / Grok
- Mistral
- AWS Bedrock
Prefix-based routing means the community can add new providers without touching the core. Request a provider →
Managed cloud
Everything the self-hosted gateway has — plus the parts teams actually pay for.
The open-source gateway is complete. The managed cloud adds the dashboard, visibility, alerts, and team features you'd otherwise have to build yourself.
Web dashboard
Create and rotate API keys, set per-customer spend caps, toggle caching, and manage team members — all without touching SQL or a shell.
Real-time analytics
Spend by provider, cache-hit rate, p50/p99 latency, top customers, model mix — live charts with 7-, 30-, and 90-day views.
Budget alerts
Get notified by email, Slack, or webhook when a customer, project, or the whole account crosses daily, monthly, or forecast spend thresholds.
Webhooks & events
Subscribe to every request, cache hit, cache miss, rate-limit trigger, and budget breach. Stream into your own datastore, BI tool, or incident channel.
Teams, SSO, and audit
Invite teammates with role-based access, enforce SSO/SAML on enterprise, and get an immutable audit log of every key rotation, limit change, and admin action.
Zero ops, global edge
We run Redis, Postgres, pgvector, and the embedding service. Multi-region failover, automatic upgrades, and 99.95% SLA on team + enterprise plans.
Self-hosted stays MIT-licensed and free forever. Managed cloud launches with a generous free tier plus team and enterprise plans.
How it works
Drop-in. No code rewrite.
Point your SDK at the gateway
No code changes. The gateway speaks the OpenAI API — your existing openai / anthropic / google-genai SDKs just need a new base URL and key.
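Stripped to the stdlib, the request your SDK sends is just an HTTP POST against the gateway's /v1 path. This sketch only constructs the request (it doesn't send it); the port and the `llm0_live_example` key are placeholders — in practice you'd simply set `base_url` and `api_key` on your existing client.

```python
import json
import urllib.request

def gateway_request(model: str, content: str,
                    base_url: str = "http://localhost:8080/v1",
                    api_key: str = "llm0_live_example"):
    """Build (not send) the OpenAI-style chat completion request that an
    SDK pointed at the gateway would issue."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": content}],
    })
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body.encode("utf-8"),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
        method="POST",
    )
```

The same request shape works for every model the gateway routes, which is the whole point of the single OpenAI-compatible surface.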
Gateway picks a provider
Model name prefix (gpt-*, claude-*, gemini-*, or Ollama) decides where the request goes. Caches are checked first; on a hit, the response returns in <20 ms.
Automatic failover if it fails
Any 429, 5xx, timeout, or connection failure triggers a transparent retry against the next provider. Your users never see the error.
Coming soon
Managed LLM0 — more power, zero ops.
Everything in the open-source gateway, plus the dashboard, live analytics, budget alerts, webhooks, and team seats that teams actually pay for.