Prompt caching, model routing, structured outputs, and batch pricing. The four levers that knocked $4,200/mo off a client bill last quarter.
An AI bill that doubled in three months is usually a routing problem, not a usage problem. Here are the four levers that move the needle, in order of effort-to-reward.
If your system prompt is over 1k tokens and you call the model more than 50 times per hour, enable prompt caching. Anthropic discounts cache reads by roughly 90%; OpenAI's automatic caching discounts cached input tokens by 50%. On one client this alone cut the bill 38%.
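A minimal sketch of what the caching setup looks like with Anthropic's messages API: the large static prompt goes first, marked with `cache_control`, so every subsequent call reads the prefix from cache. The model id and prompt strings are placeholders, not the client's actual setup.

```python
def build_cached_request(system_prompt: str, user_msg: str) -> dict:
    """Build a messages-API payload whose static system prefix is cacheable.

    Anthropic caches everything up to the cache_control marker, so the big
    unchanging prompt goes first and the per-request text goes last.
    """
    return {
        "model": "claude-sonnet-4-5",  # placeholder model id
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": system_prompt,
                "cache_control": {"type": "ephemeral"},  # cache this prefix
            }
        ],
        "messages": [{"role": "user", "content": user_msg}],
    }

payload = build_cached_request("You are a support triage agent...", "Ticket: ...")
```

The key design point: anything per-request (ticket text, user data) must come after the cached block, or every call busts the cache.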
Not every request needs the flagship model. Route extraction and classification to Haiku 4.5 or gpt-4o-mini; route reasoning to Sonnet 4.6 or GPT-5.5. Our router averages a 4-to-1 cost reduction on the calls it downgrades, with no quality regression on the cheap-model jobs.
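The router can be surprisingly dumb and still capture most of the savings. A sketch of the two-tier split described above — the task labels and model ids are illustrative, not our production table:

```python
# Mechanical tasks that a small model handles reliably (illustrative set).
CHEAP_TASKS = {"extract", "classify", "tag", "dedupe"}

def route(task_type: str) -> str:
    """Pick a cheap model for mechanical tasks, the flagship for reasoning."""
    if task_type in CHEAP_TASKS:
        return "gpt-4o-mini"   # cheap tier
    return "gpt-5.5"           # flagship tier (placeholder id)
```

In practice the hard part is not the routing logic, it's tagging every call site with an honest task type so the cheap tier actually gets used.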
When you need 12 fields, ask for exactly 12 fields with response_format=json_schema. Free-text responses plus a hand-rolled parser burn 3-5x more tokens for the same data, and the parser breaks weekly.
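What that looks like concretely, following OpenAI's documented response_format shape — a three-field schema stands in for the twelve, and the field names are hypothetical:

```python
def invoice_schema() -> dict:
    """Request-body fragment forcing the model to emit exactly these fields."""
    return {
        "type": "json_schema",
        "json_schema": {
            "name": "invoice_fields",
            "strict": True,  # model output must validate against the schema
            "schema": {
                "type": "object",
                "properties": {
                    "vendor": {"type": "string"},
                    "total_usd": {"type": "number"},
                    "due_date": {"type": "string"},
                },
                "required": ["vendor", "total_usd", "due_date"],
                "additionalProperties": False,
            },
        },
    }
```

With `strict: True` the response is guaranteed valid JSON matching the schema, so the weekly parser-breakage class of bugs disappears entirely.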
The OpenAI Batch API is 50% off list price and completes within 24 hours. Lead enrichment, content drafts, and weekly classification jobs all belong there.
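Batch jobs are submitted as a JSONL file, one request per line in the format the Batch API documents. A sketch of packaging a lead-enrichment run — the prompts and custom_ids are illustrative:

```python
import json

def to_batch_line(custom_id: str, prompt: str) -> str:
    """Serialize one chat-completion request as a Batch API JSONL row."""
    return json.dumps({
        "custom_id": custom_id,          # your key for matching results back
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",      # cheap tier; batch halves it again
            "messages": [{"role": "user", "content": prompt}],
        },
    })

lines = [to_batch_line(f"lead-{i}", f"Enrich company record {i}") for i in range(3)]
batch_jsonl = "\n".join(lines)  # upload this file, then create the batch job
```

Results come back keyed by `custom_id`, so anything that tolerates a 24-hour turnaround gets the discount with one afternoon of plumbing.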
A B2B SaaS client was at $7,100/mo in March. We applied all four: April bill $2,900, May tracking similar. Quality scores unchanged. Engineering time to implement: 2 weeks.
If your AI bill grew faster than your usage, our AI integration audit finds the leaks. Send us your last 3 invoices for a 30-minute review.