v0.2.0 shipped · MIT licensed · Self-hosted

Stop one user from blowing
your AI bill.

OpenAI emails you after the money's gone. LLM0 is a drop-in spend firewall that blocks requests before the API call when a per-user or per-project budget would be exceeded, with cost-per-customer tracking, caching, and failover included. One line of code.

Join managed waitlist · Self-host today

Open source today. Managed cloud launching soon — join the waitlist.

terminal

curl http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer llm0_live_..." \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-4o-mini","messages":[{"role":"user","content":"Hello!"}]}'

Point your base_url here instead of api.openai.com — same for Claude, Gemini, or local Ollama. Every call is metered against your caps.

What's in the box

Spend control first. Gateway included.

Hard caps, per-customer cost tracking, and caching to cut the bill — wrapped around a fast multi-provider gateway. Shipped together, measured end-to-end, documented in a single README.

Hard budget caps — block before the call

LLM0 estimates each request's cost and rejects it with a structured 402/429 before the API call if it would breach a per-user or per-project budget. Provider-side limits at OpenAI, Anthropic, and Gemini only kick in after you've overspent; LLM0 stops the overnight $X,XXX surprise.
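As a rough illustration of the idea (not LLM0's actual implementation), a pre-call check can price the worst case for a request and compare it with what is left in each budget. The model name, the prices, and the helper names below are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical prices in USD per 1,000 tokens; real prices vary by model.
PRICE_PER_1K = {"example-model": {"input": 0.001, "output": 0.002}}


@dataclass
class Budget:
    limit_usd: float
    spent_usd: float

    def remaining(self) -> float:
        return self.limit_usd - self.spent_usd


def estimate_cost(model: str, input_tokens: int, max_output_tokens: int) -> float:
    """Worst-case cost: assume the reply uses every allowed output token."""
    price = PRICE_PER_1K[model]
    return (input_tokens * price["input"] + max_output_tokens * price["output"]) / 1000


def check_budgets(estimate_usd: float, budgets: dict) -> tuple:
    """Return (allowed, name of the first budget the request would breach)."""
    for name, budget in budgets.items():
        if estimate_usd > budget.remaining():
            return False, name
    return True, None


if __name__ == "__main__":
    cost = estimate_cost("example-model", input_tokens=1000, max_output_tokens=500)
    budgets = {"user": Budget(1.00, 0.9995), "project": Budget(100.0, 10.0)}
    print(check_budgets(cost, budgets))  # the request is refused before any API call
```

Pricing the worst case is a conservative choice: a request is refused if it could breach the cap, even when the actual reply would have been shorter.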

Per-customer spend caps

Daily and monthly USD limits per end-user via one HTTP header. Hit the cap and choose: block with 429, or downgrade to a cheaper model. A leaked key or runaway loop can't drain your account.
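The block-or-downgrade choice can be pictured as a small policy function. This is a sketch under assumptions: the `OnCap` setting, the function name, and the returned dictionary shape are illustrative and are not taken from LLM0's configuration.

```python
from enum import Enum


class OnCap(Enum):
    BLOCK = "block"          # reject with HTTP 429
    DOWNGRADE = "downgrade"  # serve the request from a cheaper model


def resolve_request(model, daily_spent, daily_limit, monthly_spent, monthly_limit,
                    on_cap, fallback_model):
    """Decide what happens to a request given one customer's spend so far."""
    over_cap = daily_spent >= daily_limit or monthly_spent >= monthly_limit
    if not over_cap:
        return {"status": 200, "model": model}
    if on_cap is OnCap.DOWNGRADE:
        return {"status": 200, "model": fallback_model}
    return {"status": 429, "model": None}


if __name__ == "__main__":
    # Customer has used the whole 5 USD daily limit; the policy is to downgrade.
    print(resolve_request("big-model", 5.0, 5.0, 20.0, 100.0,
                          OnCap.DOWNGRADE, "small-model"))
```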

Know your cost per customer

Tag every request with X-Customer-ID and see exactly which users and which pricing tiers cost you money. Find the whales and the money-losing subscribers before they wreck your margin.
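To make the per-customer view concrete, here is a minimal sketch of the aggregation. The row fields (`customer_id`, `cost_usd`) and the plan prices are assumptions for illustration, not LLM0's log schema.

```python
from collections import defaultdict


def spend_by_customer(rows):
    """Sum logged cost per customer ID."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["customer_id"]] += row["cost_usd"]
    return dict(totals)


def money_losers(totals, plan_price_usd):
    """Customers whose LLM spend exceeds what their plan pays."""
    return sorted(
        cid for cid, spent in totals.items()
        if spent > plan_price_usd.get(cid, 0.0)
    )


if __name__ == "__main__":
    rows = [
        {"customer_id": "a", "cost_usd": 0.50},
        {"customer_id": "b", "cost_usd": 0.25},
        {"customer_id": "a", "cost_usd": 0.25},
    ]
    totals = spend_by_customer(rows)
    print(totals)
    print(money_losers(totals, {"a": 0.50, "b": 1.00}))
```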

Cut spend with exact + semantic caching

Exact matches hit in <1 ms from Redis. Semantic cache embeds the request and returns the closest prior answer via pgvector — paraphrased duplicates hit at $0, no LLM call. Local Ollama calls are always $0.
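The two cache tiers can be sketched in memory. A dict stands in for Redis and a list stands in for a pgvector table. The 0.95 similarity threshold, the class name, and the toy two-dimensional embeddings are assumptions; a real deployment would get embeddings from an embedding model.

```python
import hashlib
import json
import math


def exact_key(model, messages):
    """Stable hash of the request, used as the exact-match cache key."""
    payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class TwoTierCache:
    def __init__(self, threshold=0.95):
        self.exact = {}      # stand-in for Redis
        self.semantic = []   # stand-in for a pgvector table: (embedding, answer)
        self.threshold = threshold

    def put(self, key, embedding, answer):
        self.exact[key] = answer
        self.semantic.append((embedding, answer))

    def get(self, key, embedding):
        """Try the exact tier first, then the closest stored embedding."""
        if key in self.exact:
            return self.exact[key], "exact"
        best = max(self.semantic, key=lambda e: cosine(e[0], embedding), default=None)
        if best is not None and cosine(best[0], embedding) >= self.threshold:
            return best[1], "semantic"
        return None, "miss"
```

The threshold trades savings against accuracy: a lower value catches more paraphrases but raises the chance of returning an answer to a different question.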

Per-API-key rate limiting

A token-bucket algorithm runs atomically in Redis via Lua, so there are no race conditions under concurrency. Rejections short-circuit at ~2 ms, protecting you from abuse bursts on a public endpoint.
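For readers unfamiliar with token buckets, here is a single-process sketch of the algorithm. It is not the Redis Lua script LLM0 uses; running the same refill-then-take step inside Redis is what makes it atomic across many gateway processes. The capacity and refill rate are illustrative.

```python
class TokenBucket:
    """Each request takes one token; tokens refill at a steady rate up to capacity."""

    def __init__(self, capacity, refill_per_sec, now=0.0):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = float(capacity)
        self.updated = now

    def allow(self, now, cost=1.0):
        # Refill for the time elapsed since the last call, capped at capacity.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


if __name__ == "__main__":
    bucket = TokenBucket(capacity=2, refill_per_sec=1.0)
    # A burst of three at t=0: the third is rejected. One more is allowed at t=1.
    print([bucket.allow(0.0), bucket.allow(0.0), bucket.allow(0.0), bucket.allow(1.0)])
```

Passing `now` explicitly keeps the sketch deterministic; a live limiter would read a monotonic clock instead.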

One drop-in line — OpenAI, Claude, Gemini, Ollama

Change a base_url, not your SDK. One OpenAI-compatible endpoint in front of every provider, with normalized SSE streaming and automatic cross-provider failover on 429/5xx/timeout. Single 30 MB Go binary, MIT, self-hosted.

More providers coming

Groq
Together
Fireworks
DeepSeek
xAI / Grok
Mistral
AWS Bedrock

Prefix-based routing means the community can add new providers without touching core. Request a provider →

How it works

Drop-in. Caps enforced before every call.

Point your base_url at LLM0

No application code changes. LLM0 speaks the OpenAI API, so your existing openai / anthropic / google-genai SDKs just need a new base URL and key. Tag each request with X-Customer-ID to track cost per user.
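A minimal sketch of what such a request looks like, built with Python's standard library so it runs without any SDK installed. The local address and path mirror the curl example; the customer ID and the helper name are hypothetical.

```python
import json
import urllib.request

# Assumed local address, taken from the curl example on this page.
LLM0_BASE_URL = "http://localhost:8080/v1"


def build_chat_request(api_key, customer_id, prompt, model="gpt-4o-mini"):
    """Build (but do not send) an OpenAI-style chat request aimed at LLM0."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{LLM0_BASE_URL}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
            "X-Customer-ID": customer_id,  # tags the spend to one end-user
        },
        method="POST",
    )


if __name__ == "__main__":
    req = build_chat_request("llm0_live_...", "customer-123", "Hello!")
    print(req.get_method(), req.full_url)
    # Sending requires a running LLM0 instance:
    # with urllib.request.urlopen(req) as resp:
    #     print(json.load(resp))
```

With an OpenAI-compatible SDK the equivalent change is the client's base URL and API key, plus the extra header on each call.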

Budget checked before the call

LLM0 estimates the request's cost and checks it against per-user and per-project caps. Over budget? It blocks with a 402/429 before any money is spent. Caches are checked too — hits return in <20 ms at $0.

Routes, fails over, and logs the cost

When a request is under budget, the model prefix (gpt-*, claude-*, gemini-*, Ollama) picks the provider. LLM0 retries on the next provider after any 429, 5xx, or timeout, and records exact spend per customer.
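The routing and failover step can be sketched as follows. Several details are assumptions: unmatched model names fall through to Ollama, the fallback order is passed in by the caller, and the mapping of a model name onto an equivalent model at the fallback provider is left out. `send` is a hypothetical callable standing in for the real provider call.

```python
# Prefix table from this page; the default for unmatched names is an assumption.
ROUTES = [("gpt-", "openai"), ("claude-", "anthropic"), ("gemini-", "gemini")]
DEFAULT_PROVIDER = "ollama"


class ProviderTimeout(Exception):
    """Raised by send() when a provider does not answer in time."""


def pick_provider(model):
    for prefix, provider in ROUTES:
        if model.startswith(prefix):
            return provider
    return DEFAULT_PROVIDER


def is_retryable(status):
    return status == 429 or 500 <= status <= 599


def call_with_failover(model, fallbacks, send):
    """Try the routed provider first, then each fallback on 429/5xx/timeout."""
    primary = pick_provider(model)
    order = [primary] + [p for p in fallbacks if p != primary]
    last_failure = None
    for provider in order:
        try:
            status, body = send(provider, model)
        except ProviderTimeout:
            last_failure = (provider, "timeout")
            continue
        if not is_retryable(status):
            return provider, status, body  # success, or an error retrying won't fix
        last_failure = (provider, status)
    raise RuntimeError(f"all providers failed; last failure: {last_failure}")
```

Non-retryable errors such as a 400 are returned immediately, since sending the same bad request to another provider would not help.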

Performance

A firewall that adds only milliseconds.

Budget checks, rate limits, and cache lookups run in milliseconds — faster than LiteLLM, in one 30 MB binary. Measured server-side from gateway_logs.latency_ms. Full methodology, reproduce steps, and per-status-code percentiles in the README.

3 ms · Cache-hit p50 · Full auth + rate-limit + cache path (4 vCPU Linux)

2 ms · Fast-fail rejection · Rate-limited requests short-circuit

1,672 req/s · Sustained throughput · DigitalOcean 4 vCPU shared Linux droplet

30 MB · Binary size · One static Go binary, Postgres + Redis
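For anyone reproducing the percentile figures from logged latencies, a nearest-rank percentile is one simple way to compute them. This is an assumption about method, not the README's documented procedure, and the sample values are made up.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


if __name__ == "__main__":
    latencies_ms = [5, 1, 3, 2, 4]  # hypothetical latency_ms samples
    print(percentile(latencies_ms, 50), percentile(latencies_ms, 99))
```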

Managed cloud

Everything self-hosted has — plus the parts teams actually pay for.

The open-source firewall enforces the caps. The managed cloud adds the spend dashboard, cost-per-customer visibility, alerts, and team features you'd otherwise have to build yourself.

Exposure dashboard

See your worst-case exposure at a glance, set and edit per-customer caps, rotate keys, and manage team members — all without touching SQL or a shell.

Cost-per-customer analytics

See exactly which customers and pricing tiers cost you money — spend by provider, cache-hit savings, model mix, and top spenders, with 7-, 30-, and 90-day views.

Budget alerts

Get notified by email, Slack, or webhook when a customer, project, or the whole account crosses daily, monthly, or forecast spend thresholds.

Webhooks & events

Subscribe to every request, cache hit, cache miss, rate-limit trigger, and budget breach. Stream into your own datastore, BI tool, or incident channel.

Teams, SSO, and audit

Invite teammates with role-based access, enforce SSO/SAML on enterprise, and get an immutable audit log of every key rotation, limit change, and admin action.

Zero ops, global edge

We run Redis, Postgres, pgvector, and the embedding service. Multi-region failover, automatic upgrades, and a 99.95% SLA on team + enterprise plans.

Self-hosted stays MIT-licensed and free forever. Managed cloud launches with a generous free tier plus team and enterprise plans.

Coming soon

Managed LLM0 — spend visibility, zero ops.

Everything in the open-source firewall, plus the spend dashboard, cost-per-customer analytics, budget alerts, webhooks, and team seats. Free tier, no credit card.

Stop one user from blowing your AI bill.