Stop one user from blowing up
your AI bill.
OpenAI emails you after the money's gone. LLM0 is a drop-in spend firewall that blocks requests before the API call when a per-user or per-project budget would be exceeded, plus cost-per-customer tracking, caching, and failover. One line of code.
Open source today. Managed cloud launching soon — join the waitlist.
curl http://localhost:8080/v1/chat/completions \
-H "Authorization: Bearer llm0_live_..." \
-H "Content-Type: application/json" \
-d '{"model":"gpt-4o-mini","messages":[{"role":"user","content":"Hello!"}]}'

Point your base_url here instead of api.openai.com. The same goes for Claude, Gemini, or local Ollama. Every call is metered against your caps.
What's in the box
Spend control first. Gateway included.
Hard caps, per-customer cost tracking, and caching to cut the bill — wrapped around a fast multi-provider gateway. Shipped together, measured end-to-end, documented in a single README.
Hard budget caps — block before the call
LLM0 estimates each request's cost and rejects it with a structured 402/429 before the API call if it would breach a per-user or per-project budget. OpenAI/Anthropic/Gemini only cap after you've overspent — this stops the $X,XXX surprise overnight.
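To make the pre-call check concrete, here is a minimal Python sketch of the idea. It is not LLM0's actual code: the price table, the rough 4-characters-per-token heuristic, and the names `estimate_cost_usd` and `check_budget` are all assumptions made for illustration, and the prices are placeholders rather than real provider pricing.

```python
from dataclasses import dataclass

# Hypothetical prices in USD per 1M tokens (input, output); not real provider pricing.
PRICES_PER_MTOK = {"example-small": (0.50, 1.50)}


@dataclass
class Decision:
    allowed: bool
    status: int           # 200 when allowed, 402 when the budget would be breached
    estimated_usd: float


def estimate_cost_usd(model: str, prompt: str, max_output_tokens: int) -> float:
    """Worst-case estimate: prompt tokens (~4 chars each) plus the full output allowance."""
    in_price, out_price = PRICES_PER_MTOK[model]
    prompt_tokens = max(1, len(prompt) // 4)
    return (prompt_tokens * in_price + max_output_tokens * out_price) / 1_000_000


def check_budget(spent_usd: float, cap_usd: float, estimate_usd: float) -> Decision:
    """Reject before the upstream call if this request would push spend past the cap."""
    if spent_usd + estimate_usd > cap_usd:
        return Decision(False, 402, estimate_usd)
    return Decision(True, 200, estimate_usd)
```

Estimating against the full output allowance is deliberately pessimistic: a request is only let through if even its worst case fits under the cap.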
Per-customer spend caps
Daily and monthly USD limits per end-user via one HTTP header. Hit the cap and choose: block with 429, or downgrade to a cheaper model. A leaked key or runaway loop can't drain your account.
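The block-or-downgrade choice can be pictured as a small policy function. The sketch below is an assumption about how such a policy might look, not LLM0's configuration format: `CustomerCaps`, `apply_caps`, and the model names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class CustomerCaps:
    daily_usd: float
    monthly_usd: float
    downgrade_model: Optional[str] = None  # None means "block with 429" when a cap is hit


def apply_caps(
    caps: CustomerCaps,
    spent_today_usd: float,
    spent_month_usd: float,
    requested_model: str,
) -> Tuple[int, Optional[str]]:
    """Return (status, model): 200 with the model to use, or 429 with no model."""
    over_cap = (
        spent_today_usd >= caps.daily_usd or spent_month_usd >= caps.monthly_usd
    )
    if not over_cap:
        return 200, requested_model
    if caps.downgrade_model is not None:
        # Cap reached, but the customer is configured to fall back to a cheaper model.
        return 200, caps.downgrade_model
    return 429, None
```

For example, a customer with a 5 USD daily cap and no downgrade model gets a 429 once the day's spend reaches 5 USD, while the same customer with `downgrade_model` set keeps working on the cheaper model.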
Know your cost per customer
Tag every request with X-Customer-ID and see exactly which users and which pricing tiers cost you money. Find the whales and the money-losing subscribers before they wreck your margin.
Cut spend with exact + semantic caching
Exact matches hit in <1 ms from Redis. Semantic cache embeds the request and returns the closest prior answer via pgvector — paraphrased duplicates hit at $0, no LLM call. Local Ollama calls are always $0.
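The two cache tiers can be sketched in a few lines of Python. Everything here is a stand-in chosen for illustration: an in-memory dict plays the role of Redis, a plain list plays the role of pgvector, `toy_embed` is a bag-of-words toy rather than a real embedding model, and the 0.8 similarity threshold is an arbitrary assumption.

```python
import hashlib
import json
import math
from collections import Counter


def exact_key(request: dict) -> str:
    """Hash a canonical JSON form so key order in the request does not matter."""
    canonical = json.dumps(request, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def toy_embed(text: str) -> Counter:
    """Toy bag-of-words vector; a real deployment would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class TwoTierCache:
    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold
        self.exact = {}      # stand-in for Redis
        self.semantic = []   # stand-in for a pgvector table: (embedding, answer)

    def put(self, request: dict, prompt: str, answer: str) -> None:
        self.exact[exact_key(request)] = answer
        self.semantic.append((toy_embed(prompt), answer))

    def get(self, request: dict, prompt: str):
        """Try the exact tier first, then the nearest semantic neighbour."""
        hit = self.exact.get(exact_key(request))
        if hit is not None:
            return "exact", hit
        query = toy_embed(prompt)
        best = max(self.semantic, key=lambda e: cosine(query, e[0]), default=None)
        if best is not None and cosine(query, best[0]) >= self.threshold:
            return "semantic", best[1]
        return "miss", None
```

The threshold is the main tuning knob: set it too low and unrelated prompts share answers, too high and paraphrases miss.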
Per-API-key rate limiting
Token-bucket algorithm runs atomically in Redis via Lua. No race conditions under concurrency. Rejections short-circuit at ~2 ms, protecting you from abuse bursts on a public endpoint.
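For readers unfamiliar with the token-bucket algorithm, here is a minimal single-process Python sketch. It is not the Lua script LLM0 runs in Redis; the class and field names are assumptions, and the clock is passed in explicitly so the behaviour is easy to follow. In a shared store, the refill-and-take step below is the part that has to run atomically.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    capacity: float        # maximum burst size
    refill_per_sec: float  # sustained request rate
    tokens: float          # tokens currently available
    updated_at: float      # timestamp (seconds) of the last refill

    def allow(self, now: float, cost: float = 1.0) -> bool:
        """Refill for the elapsed time, then take `cost` tokens if enough are available."""
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated_at = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A bucket with capacity 2 and a refill rate of 1 per second allows a burst of two requests, rejects the third, and allows one more a second later.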
One drop-in line — OpenAI, Claude, Gemini, Ollama
Change a base_url, not your SDK. One OpenAI-compatible endpoint in front of every provider, with normalized SSE streaming and automatic cross-provider failover on 429/5xx/timeout. Single 30 MB Go binary, MIT, self-hosted.
More providers coming
- Groq
- Together
- Fireworks
- DeepSeek
- xAI / Grok
- Mistral
- AWS Bedrock
Prefix-based routing means the community can add new providers without touching core. Request a provider →
How it works
Drop-in. Caps enforced before every call.
Point your base_url at LLM0
No application code changes. LLM0 speaks the OpenAI API, so your existing openai / anthropic / google-genai SDKs just need a new base URL and key. Tag each request with X-Customer-ID to track cost per user.
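As a sketch of what a tagged request looks like on the wire, the Python below builds the same call as the curl example with only the standard library and adds the X-Customer-ID header. The helper name `build_chat_request` and the customer ID value are hypothetical; the URL and the key placeholder are taken from the curl example. The request is built but not sent, so the snippet runs without a gateway.

```python
import json
import urllib.request


def build_chat_request(
    base_url: str, api_key: str, customer_id: str, model: str, user_message: str
) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request aimed at the gateway."""
    body = json.dumps(
        {"model": model, "messages": [{"role": "user", "content": user_message}]}
    ).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
            "X-Customer-ID": customer_id,  # hypothetical ID for one of your end-users
        },
    )


req = build_chat_request(
    "http://localhost:8080/v1", "llm0_live_...", "customer-123", "gpt-4o-mini", "Hello!"
)
# urllib.request.urlopen(req) would send it; omitted so the sketch runs offline.
```

With the openai Python SDK, the equivalent is to pass `base_url`, `api_key`, and `default_headers={"X-Customer-ID": ...}` when constructing the client.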
Budget checked before the call
LLM0 estimates the request's cost and checks it against per-user and per-project caps. Over budget? It blocks with a 402/429 before any money is spent. Caches are checked too — hits return in <20 ms at $0.
Routes, fails over, and logs the cost
When a request is under budget, LLM0 picks the provider from the model prefix (gpt-*, claude-*, gemini-*, Ollama), retries on the next provider on any 429/5xx/timeout, and records exact spend per customer.
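Prefix routing with failover can be sketched as follows. This is an illustration under assumptions, not LLM0's router: the provider names, the choice to treat any unprefixed model as Ollama, and the callable-per-provider shape are all hypothetical. A real gateway would also map the model name to an equivalent on the fallback provider, which this sketch skips.

```python
from typing import Callable, Dict, List, Tuple

# Prefixes come from the text above; the provider names are assumed labels.
PREFIX_TO_PROVIDER = [("gpt-", "openai"), ("claude-", "anthropic"), ("gemini-", "google")]

ProviderCall = Callable[[str], Tuple[int, str]]  # model -> (HTTP status, body)


def pick_provider(model: str, default: str = "ollama") -> str:
    """Pick a provider by model prefix; anything unmatched goes to the default."""
    for prefix, provider in PREFIX_TO_PROVIDER:
        if model.startswith(prefix):
            return provider
    return default


def is_retryable(status: int) -> bool:
    return status == 429 or 500 <= status <= 599


def call_with_failover(
    model: str, providers: Dict[str, ProviderCall], fallbacks: List[str]
) -> Tuple[str, int, str]:
    """Try the primary provider, then each fallback, on 429/5xx/timeout."""
    primary = pick_provider(model)
    order = [primary] + [p for p in fallbacks if p != primary]
    last_status = 0
    for name in order:
        try:
            status, body = providers[name](model)
        except TimeoutError:
            last_status = 504  # report a timeout as a gateway timeout
            continue
        if not is_retryable(status):
            return name, status, body
        last_status = status
    return order[-1], last_status, ""
```

Client errors such as 400 are returned immediately rather than retried, since another provider would be unlikely to accept the same bad request.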
Performance
A firewall that adds no latency.
Budget checks, rate limits, and cache lookups run in milliseconds — faster than LiteLLM, in one 30 MB binary. Measured server-side from gateway_logs.latency_ms. Full methodology, reproduce steps, and per-status-code percentiles in the README.
Managed cloud
Everything self-hosted has — plus the parts teams actually pay for.
The open-source firewall enforces the caps. The managed cloud adds the spend dashboard, cost-per-customer visibility, alerts, and team features you'd otherwise have to build yourself.
Exposure dashboard
See your worst-case exposure at a glance, set and edit per-customer caps, rotate keys, and manage team members — all without touching SQL or a shell.
Cost-per-customer analytics
See exactly which customers and pricing tiers cost you money — spend by provider, cache-hit savings, model mix, and top spenders, with 7-, 30-, and 90-day views.
Budget alerts
Get notified by email, Slack, or webhook when a customer, project, or the whole account crosses daily, monthly, or forecast spend thresholds.
Webhooks & events
Subscribe to every request, cache hit, cache miss, rate-limit trigger, and budget breach. Stream into your own datastore, BI tool, or incident channel.
Teams, SSO, and audit
Invite teammates with role-based access, enforce SSO/SAML on enterprise, and get an immutable audit log of every key rotation, limit change, and admin action.
Zero ops, global edge
We run Redis, Postgres, pgvector, and the embedding service. Multi-region failover, automatic upgrades, and a 99.95% SLA on team + enterprise plans.
Self-hosted stays MIT-licensed and free forever. Managed cloud launches with a generous free tier plus team and enterprise plans.
Coming soon
Managed LLM0 — spend visibility, zero ops.
Everything in the open-source firewall, plus the spend dashboard, cost-per-customer analytics, budget alerts, webhooks, and team seats. Free tier, no credit card.