
How to put Claude Code in charge of a free LLM API pool

Alex Kim
13 min read
Last updated: June 26, 2026

TL;DR

A free LLM API pool stacks the free tiers from many providers behind one OpenAI-compatible endpoint. The interesting move isn't using it to replace your paid model. It's the inversion: keep Claude Code as the orchestrator, give it the free pool as a worker tier through an MCP server, and let it delegate cheap, low-stakes tasks while it spends its budget only on the hard ones.

What a free LLM API pool actually is

A free LLM API pool aggregates the free tiers from many providers behind a single OpenAI-compatible endpoint, so one API call can route across all of them. Here's the part that surprised me: stacked together, those free tiers are not a toy. FreeLLMAPI, the open-source project this is built on, pools 16 providers and puts the combined free quota at around 1.7 billion tokens a month.

On its own, each tier is basically a demo. A few million tokens here, a few thousand requests a day there, and every one of them has its own SDK and its own way of rate-limiting you. Wiring up 16 of those by hand is the kind of project you start on a Saturday and abandon by Sunday. A pool does the boring part for you. One endpoint, one client, and a router underneath that picks a model, fails over when a provider says no, and keeps count so you don't blow past anyone's free cap.
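To make "picks a model, fails over, keeps count" concrete, here is a minimal sketch of that loop. It is not FreeLLMAPI's actual router: the `Provider` class, the `daily_request_cap` field, and the `ProviderError` exception are all names I made up for illustration, and a real router would also track token quotas and reset windows.

```python
from dataclasses import dataclass
from typing import Callable


class ProviderError(Exception):
    """Hypothetical error for a provider that is rate-limited or down."""


@dataclass
class Provider:
    name: str
    daily_request_cap: int          # assumed free-tier cap, in requests per day
    call: Callable[[str], str]      # sends a prompt, returns the completion text
    used_today: int = 0

    def has_quota(self) -> bool:
        return self.used_today < self.daily_request_cap


def complete(providers: list[Provider], prompt: str) -> tuple[str, str]:
    """Try providers in order, skipping exhausted ones and failing over on errors.

    Returns (provider_name, completion_text).
    """
    for provider in providers:
        if not provider.has_quota():
            continue                # never go past a free cap
        provider.used_today += 1    # count the attempt, even if it fails
        try:
            return provider.name, provider.call(prompt)
        except ProviderError:
            continue                # this provider said no, try the next one
    raise RuntimeError("every provider is exhausted or failing")
```

The caller only ever sees `complete()`. Which provider answered, and how many were skipped along the way, stays underneath.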

The stack runs across Groq, Cerebras, Google, Mistral, OpenRouter, Cohere, Cloudflare, and the rest of the 16 providers. That's 54 free models in all right now, with more as you add providers. Full credit for the aggregator goes to Tashfeen Ahmed, who built and maintains it. What I want to talk about is the layer you put on top.

Why free-only routing is the wrong default

Routing everything to free models is the wrong default, and the reason is compounding. When a weak model plans badly, it doesn't hand you one bad answer. It points every step after it in the wrong direction, and you find out three steps later when nothing lines up. I've watched it happen. The free pool is genuinely good at bulk work and genuinely shaky at the calls that decide whether the work mattered.

So I stopped asking "free or paid" and started asking "which of these tasks actually needs the expensive brain." Most of my work is lopsided. A coding session is maybe a fifth real reasoning, the architecture and the nasty debugging, and four-fifths mechanical. Renaming things. Converting a file. Summarizing what a function does. I was paying top rates for all of it, and the part that ate my weekly cap was the part a free model would have shrugged off.

That gap is the whole opportunity. Keep the smart model in charge. Hand it a bench of free workers. Let it decide, task by task, who does what.

The inversion: Claude as orchestrator, the free pool as workers

Make Claude Code the orchestrator and the free pool its worker tier, with an MCP server in between that turns the pool into tools Claude can call. Claude stays the brain. It reads your code, holds the plan, makes the judgment calls. When it hits something mechanical, it hands that one thing to a free model instead of burning its own tokens on it.

Claude Code already speaks the Model Context Protocol, so you register the pool once and Claude picks up a handful of tools:

| Tool | What it does |
| --- | --- |
| delegate(prompt, model?) | Run one completion on the free pool. Omit the model for auto-routing. |
| fanout(tasks[], concurrency?) | Run many prompts in parallel across the pool, bounded concurrency. |
| list_models() | List available free models and the auto context-window ceiling. |
| claude_budget() | Report real weekly and 5-hour usage, plus a routing mode. |
| log_route_decision(gist, route, ...) | Record each routing decision for the audit trail. |
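"Bounded concurrency" is the part of fanout worth seeing in code. Here is a rough sketch of the idea using an asyncio semaphore. The function signature is my own guess at the shape, and `delegate` here is any async callable you pass in, not the server's real implementation.

```python
import asyncio
from typing import Awaitable, Callable


async def fanout(
    tasks: list[str],
    delegate: Callable[[str], Awaitable[str]],
    concurrency: int = 4,
) -> list[str]:
    """Run delegate() over every prompt, at most `concurrency` at a time.

    Results come back in the same order as `tasks`.
    """
    gate = asyncio.Semaphore(concurrency)

    async def run_one(prompt: str) -> str:
        async with gate:            # blocks while `concurrency` calls are in flight
            return await delegate(prompt)

    return await asyncio.gather(*(run_one(p) for p in tasks))
```

The cap matters more on a free pool than on a paid one. Free tiers rate-limit hard, so firing fifty requests at once mostly buys you fifty refusals.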

Think of it as a manager with an outsourcing team. The manager looks at each request, decides whether it's worth doing personally, and ships the rest out. Then, and this is the part that keeps it honest, the manager reads the work that comes back before using it. The judgment stays in the room. The volume goes out the door.

The routing rubric: which tasks are worth premium tokens

Claude decides per task with two hard gates and a difficulty score. The gates are vetoes. If either one trips, the task stays on Claude no matter how trivial it looks. The score handles everything the gates didn't touch.

Two gates, and they win over everything:

  1. Sensitivity. Anything carrying a secret or credential stays on Claude and never reaches the pool. This isn't a guideline I hope Claude remembers. It's enforced in code: delegate and fanout scan the prompt and flat-out refuse anything that looks like a credential. Since delegating ships your prompt to outside providers, this gate doubles as a privacy fence. More on that below.
  2. Capability. If a task needs more context than the free pool can hold, it stays on Claude.
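A sketch of what the sensitivity gate could look like, assuming a simple regex scan. The patterns below are illustrative and far from complete, and `assert_delegable` and `SensitivePromptError` are hypothetical names, not the project's API.

```python
import re

# Illustrative patterns only. A real gate needs a broader, tested set.
CREDENTIAL_PATTERNS = [
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    re.compile(r"\bAKIA[0-9A-Z]{16}\b"),                 # shape of an AWS access key ID
    re.compile(r"\bBearer\s+[A-Za-z0-9._\-]{20,}"),
    re.compile(r"(?i)\b(api[_-]?key|secret|password|token)\b\s*[:=]\s*\S{8,}"),
]


class SensitivePromptError(ValueError):
    """Raised when a prompt looks like it carries a credential."""


def assert_delegable(prompt: str) -> str:
    """Return the prompt unchanged, or refuse it if any pattern matches."""
    for pattern in CREDENTIAL_PATTERNS:
        if pattern.search(prompt):
            raise SensitivePromptError(
                "prompt looks like it carries a credential; keep it on Claude"
            )
    return prompt
```

The design choice that matters is that the check raises instead of warning. A gate that only logs is a guideline, and the whole point is that this one isn't.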

Then the difficulty score, 1 to 10, which is really just reasoning depth times ambiguity times steps times how much it hurts to get it wrong:

| Score | Band | Examples | Default route |
| --- | --- | --- | --- |
| 1–3 | Trivial | Boilerplate, format conversion, summarize, extract | Free pool |
| 4–6 | Moderate | Standard codegen, drafting, a straightforward refactor | Free pool |
| 7–8 | Hard | Subtle multi-step reasoning, architecture, tricky debugging | Claude |
| 9–10 | Critical | Deep or novel reasoning, safety-critical correctness | Claude |

By default, anything scoring 7 or higher stays on Claude and everything under it gets delegated, as long as no gate fired. There's one more safety net I lean on: Claude reads the free model's answer before it trusts it, and if the thing comes back wrong, off-spec, or empty, it just redoes that one task itself. The pool gets first crack. Claude is the backstop.
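The rubric is a protocol Claude follows, not code, but writing it down as code makes the order of operations easy to check. This is a sketch under that caveat: `Task`, `Decision`, `decide`, and `run_with_backstop` are hypothetical names, the score is taken as given, and `accept` stands in for Claude reviewing the worker's answer.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Task:
    gist: str
    score: int               # 1-10 difficulty, assigned by the orchestrator
    context_tokens: int      # rough size of what the task needs to read
    has_credential: bool     # result of the sensitivity scan


@dataclass(frozen=True)
class Decision:
    route: str               # "claude" or "pool"
    gate: Optional[str]      # which gate fired, if any


def decide(task: Task, pool_context_limit: int, threshold: int = 7) -> Decision:
    """Gates first (they veto), then the score against the threshold."""
    if task.has_credential:
        return Decision("claude", "sensitivity")
    if task.context_tokens > pool_context_limit:
        return Decision("claude", "capability")
    if task.score >= threshold:
        return Decision("claude", None)
    return Decision("pool", None)


def run_with_backstop(
    task: Task,
    pool_context_limit: int,
    pool_call: Callable[[str], str],
    claude_call: Callable[[str], str],
    accept: Callable[[str], bool] = bool,   # default only rejects empty answers
) -> str:
    """Give the pool first crack; redo the task on Claude if the answer fails review."""
    if decide(task, pool_context_limit).route == "pool":
        answer = pool_call(task.gist)
        if accept(answer):
            return answer
    return claude_call(task.gist)
```

Note that a trivial task carrying a credential never reaches the score check at all. That ordering is what "the gates are vetoes" means in practice.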

The budget guard: pause before you blow the weekly cap

The budget guard watches Claude Code's live usage and raises the routing threshold as your weekly cap runs down, so the closer you get to empty, the more it delegates. It nudges; it doesn't enforce. The monitor watches and biases the rubric, and it never freezes Claude in the middle of something.

Claude Code only hands its live rate-limit numbers to one place: the statusLine. A small script grabs them there and stashes them, and claude_budget() turns that into a routing mode.

| Mode | Trigger | Behavior |
| --- | --- | --- |
| normal | Under 60% weekly | Use Claude freely, delegate bulk work as convenient |
| prefer_sonnet | 60%+ weekly | Drop the orchestrator's own Opus calls to Sonnet |
| delegate_more | 75%+ weekly | Push routine sub-tasks to the free pool; threshold rises to 9 |
| exhausted | 95%+ weekly | Report the reset time and pause |
| unknown | No live signal | Behave as normal |
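The table boils down to a small lookup. Here is a sketch of it, with the cutoffs taken straight from the table. The function names are hypothetical, and I'm assuming usage arrives as a percentage of the weekly cap, or `None` when there's no live signal.

```python
from typing import Optional


def routing_mode(weekly_used_pct: Optional[float]) -> str:
    """Map weekly usage (0-100, or None if unknown) to a routing mode."""
    if weekly_used_pct is None:
        return "unknown"
    if weekly_used_pct >= 95:
        return "exhausted"
    if weekly_used_pct >= 75:
        return "delegate_more"
    if weekly_used_pct >= 60:
        return "prefer_sonnet"
    return "normal"


def keep_threshold(mode: str, default: int = 7) -> int:
    """Score at or above which a task stays on Claude in the given mode."""
    return 9 if mode == "delegate_more" else default
```

Checking the highest cutoff first is what keeps the overlapping "60%+" and "75%+" rows unambiguous: 80% lands in delegate_more, not prefer_sonnet.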

One honest catch. That live signal only shows up in an interactive session, where Claude Code is actually painting a statusLine. Run it headless, with claude -p or the Agent SDK or a cron job, and there's no statusLine and no other documented way to read those rate-limit windows. I looked. So on a headless box you set a conservative mode by hand and move on. The guard is automatic when you're at the keyboard and manual when you're not.

The feedback loop: log every decision

Log every routing decision, both directions, including the tasks Claude keeps for itself. The log holds the task type (never the prompt itself), the route, the score, and which gate fired if one did. That trail is doing real work.

Two reasons it earns its keep. One, you can actually see what happened, instead of crossing your fingers that the rubric ran. Two, every one of those decisions is a labeled example. A good rubric run by a smart model is a labeling machine. Let it run across a few thousand tasks and you've quietly built a dataset of task, score, route, and outcome, which is exactly what you'd train a small, cheap, fast classifier on to take over the routing later. The expensive model writes the labels. A tiny model eventually does the job.
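For a sense of what one labeled example might look like, here is a sketch that serializes a decision as a JSON Lines record. The field names and the `route_record` helper are my assumptions, not the server's actual log schema. The one thing it takes from the design above is that the record carries a task type and never the prompt.

```python
import json
from typing import Optional


def route_record(
    task_type: str,
    route: str,
    score: int,
    gate: Optional[str] = None,
    outcome: Optional[str] = None,
) -> str:
    """Serialize one routing decision as a single JSON line (no prompt text)."""
    if route not in ("claude", "pool"):
        raise ValueError(f"unknown route: {route!r}")
    if not 1 <= score <= 10:
        raise ValueError("score must be between 1 and 10")
    record = {
        "task_type": task_type,   # a category label, never the prompt itself
        "route": route,
        "score": score,
        "gate": gate,             # "sensitivity", "capability", or None
        "outcome": outcome,       # filled in later, e.g. "accepted" or "redone"
    }
    return json.dumps(record, sort_keys=True)


# Usage: append one line per decision to a log file.
# with open("route_log.jsonl", "a", encoding="utf-8") as log:
#     log.write(route_record("format_conversion", "pool", 2) + "\n")
```

One line per decision is the format a classifier-training job wants later anyway: each line is already a row.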

How to build it

You put this on top of the open-source free pool, then point Claude Code at it as an MCP server. The shape of it:

  1. Run the pool on a loopback port. Set HOST=127.0.0.1 and an unused port like PORT=3733 so nothing's exposed past your machine. Drop your provider keys in on the Keys page; they're stored encrypted.
  2. Register the MCP server with Claude Code, once:
    claude mcp add --transport http freellmapi \
      http://localhost:3733/mcp \
      -H "Authorization: Bearer freellmapi-…"
    
  3. Wire up the budget signal. Point Claude Code's statusLine at the capture script so claude_budget() has something to read. On a headless box, skip it and pin a mode in the server env instead.
  4. Get the rubric into context. The rubric is a protocol Claude follows, not server code, so it has to actually reach Claude when it connects. The MCP server passes it through at connect time, and a line in your CLAUDE.md backs it up. A rubric that only lives in a doc never runs.

That last one is the trap. The gate and the log are code, so it's obvious when they work. The rubric is a protocol, and a protocol you never deliver to the model is dead on arrival, even while the shipped code makes the whole thing look alive.

Data egress: what stays on Claude

Delegating ships your prompt to outside free providers, each with its own retention and training policy, so the sensitive stuff stays on Claude, full stop. When Claude's the orchestrator, its prompts are full of your code, your files, your tool output. That's exactly what you don't want wandering off to a retention policy you've never read.

The credential gate covers the automatic half. It scans every delegated prompt and refuses anything that smells like a secret. But it catches credentials, not confidentiality in general. Proprietary code and private data are still your call. My rule of thumb is simple: if you'd think twice about pasting it into some random free chatbot, don't delegate it. The pool is for volume, not for secrets.

Frequently asked questions

Is there a free LLM API?

Yes. Most major providers offer a free tier, and open-source projects like FreeLLMAPI aggregate many of them behind one OpenAI-compatible endpoint. Stacked together, the free tiers add up to roughly 1.7 billion tokens per month across 54 free models. Each tier alone is small, but pooled and routed they cover a lot of real work.

Can I use Claude Code with free models?

Yes, in two ways. You can point Claude Code at a free pool directly as its model backend, or you can keep Claude as the orchestrator and give it the free pool as a worker tier through an MCP server. The second is usually better, because Claude keeps making the judgment calls while delegating only the mechanical tasks.

How many free LLM tokens can you get per month?

The FreeLLMAPI project estimates roughly 1.7 billion tokens per month across its 16 stacked provider free tiers. That figure depends on which providers you add keys for and their current limits, which change often. Treat it as an order-of-magnitude ceiling, not a guarantee, since free tiers shift without notice.

What is an LLM router or LLM gateway?

An LLM router is a layer that sits in front of multiple model providers and picks which one handles each request. A gateway adds shared concerns on top: authentication, rate tracking, failover, and usage logging. A free LLM API pool is a router and gateway combined, aimed specifically at stacking free tiers behind one endpoint.

Does a free LLM pool send my data to third parties?

Yes, delegating a prompt forwards it to whichever free provider handles it, and each has its own retention and training policy. Treat the pool as untrusted for sensitive content. The orchestration pattern adds a credential scan that refuses prompts carrying secrets, but proprietary code and private data still need your own judgment before delegating.

How do I stop burning my Claude budget on simple tasks?

Delegate the simple tasks to a free model and reserve your paid model for the hard ones. The practical version is a routing rubric: score each task on difficulty, send anything trivial to a free pool, and keep the high-stakes reasoning on Claude. A budget signal can raise the bar automatically as your weekly cap depletes.

What is an MCP server in this context?

An MCP (Model Context Protocol) server exposes tools that Claude Code can call during a session. Here, the MCP server wraps the free LLM pool and offers tools like delegate and fanout, so Claude can hand a task to a free model the same way it would call any other tool, then read the result back.

Can free models replace Claude entirely?

For bulk, low-stakes work, often yes. For orchestration, planning, and high-stakes reasoning, not reliably, because errors at the planning layer compound through everything downstream. The pattern here keeps free models in the role they're good at (volume) and keeps the expensive model in the role that decides whether the work was worth doing.

Is FreeLLMAPI free and open source?

Yes, FreeLLMAPI is an open-source project (MIT licensed) by Tashfeen Ahmed, available on GitHub with a live model catalog at freellmapi.co. It aggregates free provider tiers behind one OpenAI-compatible endpoint. The Claude-orchestrator layer described here is a pattern you build on top of that foundation.

Where this goes next

Start small. You don't need the classifier or the budget guard or the log to get something out of this on day one. You need the free pool running, the pool registered with Claude Code as an MCP server, and one rule in your head: hand off the boring tasks, keep the hard ones. Everything else is tuning.

The real change is how you think about the model. Once your smartest one can offload its own grunt work, you stop treating it like something you ration and start treating it like a manager you've actually staffed. The weekly cap stops being a meter you watch out of the corner of your eye and turns into something the system just handles.

If you want the patterns we use to run Claude Code in production, come hang out in the WotAI community with 760+ other builders, and grab the newsletter for the weekly breakdown.
