← Findings 🕐 9 min read

Findings

The Wasted Tokens Report: The Optimization Opportunity Most Organizations Haven't Acted On Yet

Brandon Sneider · March 2026

A 30-minute diagnostic with your CTO or CIO, structured around 10 questions:

“The organizations that implemented AI gateways early discovered something remarkable: 50-90% of their inference costs were addressable through basic optimization. The same AI investment, dramatically better economics.”

Executive Summary

Research shows 50-90% of enterprise inference costs can be eliminated with basic optimization – model routing, semantic caching, and prompt engineering. For most organizations, this is the highest-ROI AI initiative available today
Enterprise LLM API spending hit $8.4B by mid-2025 and is projected to reach $15B in 2026. The organizations that build cost visibility now will compound that advantage as spending scales
The average enterprise with 500+ developers has a $200K-$2M/year optimization opportunity in duplicate, redundant, or misrouted AI token consumption – recoverable without reducing AI usage or capability
An AI gateway providing unified observability typically pays for itself in 30-60 days through cost optimization alone. The AI gateway market grew from $400M (2023) to $3.9B (2024) because the ROI case is straightforward
Shadow AI presents both a cost and a security risk: while 77% of employees use GenAI at work (EY 2025), only 28% of organizations have formal AI usage policies. Unmonitored tools add an estimated $670K per breach incident (IBM 2025)
The opportunity is not to spend less on AI – it is to get dramatically more value from every dollar already being spent

The Five Types of Token Waste

1. The Shadow AI Tax — “What You Don’t Know IS Hurting You”

The problem: While only 40% of companies have purchased official AI subscriptions, employees at 90%+ of organizations actively use AI tools through personal accounts that IT never approved.

The cost:

77% of employees use GenAI at work, often without disclosure (EY 2025 Work Reimagined Survey)
When 71.2% of unauthorized AI usage concentrates on ChatGPT alone (Harmonic Security, 22M enterprise prompts analyzed), organizations face:
- Duplicate subscriptions: Teams buying their own Copilot/ChatGPT/Claude licenses = 2-5x the cost of an enterprise agreement
- Data leakage: Proprietary code and client data going to consumer AI tools with no enterprise data protections
- Zero cost visibility: No idea how much is being spent across the org

Metric to present: “For every $1 you spend on official AI tools, your employees are spending an estimated $0.50-$2.00 on unauthorized tools you can’t see.”

Real numbers:

Company Size	Official AI Spend	Estimated Shadow AI Spend	Total Invisible Cost
100 devs	$38K/yr (Copilot Business)	$19K-$76K/yr	$57K-$114K/yr
500 devs	$190K/yr	$95K-$380K/yr	$285K-$570K/yr
2,000 devs	$760K/yr	$380K-$1.5M/yr	$1.1M-$2.3M/yr
5,000 devs	$1.9M/yr	$950K-$3.8M/yr	$2.9M-$5.7M/yr

2. The Duplicate Context Tax — “Paying to Read the Same Thing Over and Over”

The problem: LLMs charge for every input token on every call. In multi-agent systems and agentic workflows, the same context (system prompts, documentation, codebase snippets) gets sent repeatedly.

The cost:

A 20-turn conversation can consume 5,000-10,000 tokens when only 500-1,000 of recent context would suffice — a 5-10x waste
A Reflexion loop running 10 cycles consumes 50x the tokens of a single pass
Multi-agent architectures with shared context windows routinely duplicate 60-80% of input tokens
Without prompt caching, an orchestrator agent spawning 50 workers all sharing the same context pays full price 50 times

What caching saves:

Cached tokens cost 10% of normal input price — a 90% reduction
Latency drops ~75% for cached portions
For an org running 1M+ agentic requests/month, caching alone saves $50K-$500K/year

Metric to present: “Without prompt caching, you’re paying 10x more for AI to re-read context it already knows. It’s like paying your consultant’s hourly rate to re-read the brief before every single conversation.”

3. The Wrong Model Tax — “Using a Ferrari to Go to the Grocery Store”

The problem: Without model routing, organizations default to their most expensive model for everything — including tasks a model 25-100x cheaper could handle equally well.

The cost:

Task	Appropriate Model	Cost	Default (GPT-4/Claude Opus)	Cost	Waste Factor
Code autocomplete	GPT-4o mini / Haiku	$0.15/1M tokens	GPT-4o / Opus	$2.50-$15/1M	17-100x
Simple Q&A	GPT-4o mini / Haiku	$0.15/1M tokens	GPT-4o / Opus	$2.50-$15/1M	17-100x
Code review	Sonnet / GPT-4o	$3/1M tokens	Opus / GPT-4	$15/1M	5x
Architecture	Opus / GPT-4	$15/1M tokens	Opus / GPT-4	$15/1M	1x (correct)

With intelligent model routing, organizations cut token spend by 30-50% without quality degradation (Kosmoy)
A fine-tuned 7B-13B model handles commodity tasks at equal quality for 25-100x less per token

Metric to present: “You’re running $15/million-token models on tasks a $0.15/million-token model handles perfectly. That’s like hiring a partner-level attorney to do document review.”

4. The Verbose Prompt Tax — “Bloated Instructions Nobody Reads”

The problem: System prompts, RAG context, and conversation history are rarely optimized. Teams copy-paste maximum context “just in case.”

The cost:

Teams routinely pass 4-8 full documents into a prompt when a paragraph would do
Aggressive context trimming cuts input tokens by 50%+ with no loss in precision
System prompts that grow to 2,000-5,000 tokens when 500 would suffice
Conversation history carrying the full thread when only the last 3-5 turns matter

Real-world example:

Before optimization: 8,000 input tokens per request × 100K requests/month = 800M tokens/month
After optimization: 2,000 input tokens per request × 100K requests/month = 200M tokens/month
At GPT-4o pricing ($2.50/1M input): $2,000/month → $500/month = 75% savings

Metric to present: “One customer reduced LLM costs by 90% through prompt optimization alone. The AI was doing excellent work — it was just being given 10x more context than it needed.”

5. The No-Visibility Tax — “You Can’t Optimize What You Can’t See”

The problem: Without an AI gateway, organizations have no unified view of:

Total AI spend across all tools, teams, and use cases
Which teams/projects consume the most tokens
What success rate their AI calls achieve
Where hallucinations and failures occur
How costs compare to value delivered

The cost:

Organizations without centralized AI observability overspend by 50-90% vs. those with gateways (LeanLM)
Gartner predicts 70% of organizations building multi-LLM applications will use AI gateway capabilities by 2028 — those who wait pay the tax longer
The AI gateway market exploded from $400M (2023) to $3.9B (2024) because the ROI is obvious

What a gateway reveals:

Cost per team, per project, per use case
Token consumption patterns (peak times, waste patterns)
Quality metrics (success rate, hallucination rate, latency)
Compliance violations (unauthorized models, data policy breaches)
Optimization opportunities (caching candidates, routing opportunities)

The Total Wasted Tokens Calculator

For a company with N developers using AI tools without an AI gateway:

Item	Low Estimate	High Estimate	Formula
Shadow AI duplicate spend	N × $190/yr	N × $760/yr	Unauthorized subscriptions per developer
Duplicate context waste	N × $120/yr	N × $600/yr	Based on avg. agentic usage patterns
Wrong model routing	N × $200/yr	N × $1,000/yr	Premium model overuse
Verbose prompt waste	N × $100/yr	N × $400/yr	Unoptimized context windows
No-visibility tax	N × $150/yr	N × $500/yr	Missed optimization opportunities
Total waste per developer	N × $760/yr	N × $3,260/yr

Company Size	Low Waste Estimate	High Waste Estimate
100 developers	$76,000/yr	$326,000/yr
500 developers	$380,000/yr	$1,630,000/yr
1,000 developers	$760,000/yr	$3,260,000/yr
5,000 developers	$3,800,000/yr	$16,300,000/yr

Plus the breach risk: Shadow AI-related breaches cost an average of $670K more than standard breaches (IBM 2025).

The AI Gateway Solution Landscape

Gateway	Type	Starting Price	Best For
Portkey	Managed SaaS	$49/mo	Fast start, excellent observability
Helicone	Open source / managed	Free (self-host)	Budget-conscious, dev-friendly
LiteLLM	Open source	Free (self-host)	High-volume, technical teams
Kong AI Gateway	Enterprise platform	$100/model/mo	Existing Kong customers
TrueFoundry	Enterprise platform	Custom	Full MLOps platform
Cloudflare AI Gateway	CDN-integrated	Free tier available	Cloudflare ecosystem
Azure API Management	Cloud-native	Varies	Microsoft shops
Apigee (Google)	Cloud-native	Varies	Google shops

Free Diagnostic: “The AI Spend Audit”

The 10-Question AI Spend Audit

A 30-minute diagnostic with your CTO or CIO, structured around 10 questions:

How many AI tool subscriptions does your organization pay for? (Per tool, per seat count)
Do you know how many employees use AI tools on personal accounts?
What LLM API calls are your applications making? (Models, volume, cost)
Do you have prompt caching enabled across your AI integrations?
Are you using the same model tier for all tasks?
What’s your average input token count per request?
Can you see AI spend broken down by team or project?
Have you had any data leakage incidents related to AI tool usage?
Do you have an AI gateway or unified observability layer?
What’s your total monthly AI spend? (Most can’t answer this.)

Output: A one-page “AI Spend Health Check” with:

Estimated total AI spend (visible + invisible)
Estimated waste percentage
Top 3 optimization opportunities
Projected savings from an AI gateway
Risk assessment for shadow AI exposure

This naturally leads to:

Full AI Gateway Implementation engagement ($75K-$200K)
AI Cost Optimization retainer ($10K-$25K/month)
Shadow AI Audit and Policy Development ($25K-$50K)

Key Talking Points for C-Suite

For the CFO:

“Your AI tool spending is growing 40%+ year-over-year, but you probably can’t tell me where 50-90% of it goes. An AI gateway is like implementing expense management for AI — you wouldn’t run a $5M travel budget without Concur, so why are you running a $5M AI budget without visibility?”

For the CTO:

“Every team is running their own AI experiments with their own tools and their own models. Without a gateway, you’re paying premium prices for commodity tasks, you can’t cache repeated context, and you have no way to know which AI investments are actually paying off.”

For the CISO:

“77% of your employees are using AI tools you don’t monitor. Each unauthorized tool is an unaudited data exfiltration vector. Shadow AI breaches cost $670K more than standard breaches because you don’t even know they happened until it’s too late.”

For the CEO:

“Your competitors who implement AI gateways and observability will know exactly what they’re spending on AI, which teams are getting value, and where to double down. You’ll be guessing. In a market moving this fast, guessing is losing.”

Sources

What This Means for Your Organization

If you have 500+ developers using AI tools, you likely have a $380K-$1.6M/year optimization opportunity sitting in duplicate, misrouted, and unmonitored token consumption. That figure is recoverable without cutting AI usage – in most cases, the same work gets done with the same tools at a fraction of the cost.

The path to capturing it is well-established. An AI gateway (Portkey, Helicone, or LiteLLM) delivers value through three mechanisms: model routing (sending commodity tasks to commodity models, saving 30-50%), prompt caching (eliminating duplicate context, saving up to 90% on repeated calls), and visibility (showing you where the optimization opportunities live so you can act on them). The technology is mature, the implementation timeline is 4-8 weeks, and most organizations reach positive ROI within 30-60 days.

The practical first step is the 30-minute AI spend audit outlined above. If your CTO can answer “what is your total monthly AI spend, broken down by team and tool?” you are already ahead of most organizations. If not, that visibility gap is the single highest-priority item – because once you can see where the spend goes, the optimization decisions make themselves.

If you suspect your organization has a significant optimization opportunity but lack the visibility to quantify it, that 30-minute diagnostic conversation is the fastest way to find out.

Brandon Sneider | brandon@brandonsneider.com March 2026