Prepared by Jay Oberhelman in collaboration with Claude Pro, Perplexity Pro, ChatGPT Pro, and local LMS_Gemma4_12B-Q4_K_M
The reframe
The AI bill is not the chat layer.
For a 500-person org, the cost story is not employees asking questions. It is the organization repeatedly gathering, re-explaining, and re-sending the same documents and context before the model can do anything useful, plus any automation left running without a spending cap.
Spend frontier tokens on judgment, synthesis, and writing — not on chewing the same raw context a hundred times.
Kill the anecdote with arithmetic (derived)
A $500M bill cannot be human chat. The arithmetic rules it out.
A normal assistant turn is roughly 15 input + 500 output tokens. At frontier list pricing (~$3 / $15 per million in/out), that turn costs about three-quarters of a cent. Round up to a penny and walk it forward.
One turn: ~$0.01. 15 in + 500 out tokens, rounded generously upward.
To reach $500M: ~50–67B turns. At a penny each the count is 50B; at the unrounded three-quarters of a cent it is about 67B.
Spread over 500 staff: ~385k turns per person, per working day, every working day (using the 50B figure).
Therefore: not people. No human types 385k times a day. The spend is machine-driven.
The number only makes sense as unbounded context, repeated ingestion, and automation with no guardrails — reportedly an org that never set usage limits. That is a governance failure, not ordinary employee behavior. It is also the exact failure mode worth designing out before an SMB scales.
Pricing basis is frontier list price (Sonnet-class ~$3/$15, Opus-class ~$5/$25 per M in/out); exact figures and sourcing live in the companion pricing notes. Turn count assumes ~260 working days. The $500M figure itself is anecdotal — used as a failure pattern, not a citation.
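The walk-forward above can be checked in a few lines. This is a minimal sketch using only the memo's own inputs (15 in + 500 out tokens, ~$3/$15 per million, 500 staff, ~260 working days); the function and variable names are illustrative. The unrounded per-turn cost comes out slightly above 0.75 cents, so the upper turn count lands near 66B rather than exactly 67B.

```python
# Reproduces the memo's back-of-envelope math; names are illustrative.
IN_TOKENS, OUT_TOKENS = 15, 500      # tokens per turn (memo's assumption)
IN_PRICE, OUT_PRICE = 3.00, 15.00    # USD per million tokens (Sonnet-class list)
BILL = 500_000_000                   # the anecdotal $500M
STAFF, WORKDAYS = 500, 260


def turn_cost(in_tok: int, out_tok: int) -> float:
    """USD cost of one turn at the list prices above."""
    return (in_tok * IN_PRICE + out_tok * OUT_PRICE) / 1_000_000


exact = turn_cost(IN_TOKENS, OUT_TOKENS)   # about three-quarters of a cent
rounded = 0.01                             # the memo's generous round-up

turns_at_penny = BILL / rounded            # about 50 billion
turns_at_exact = BILL / exact              # about 66 billion
per_person_day = turns_at_penny / (STAFF * WORKDAYS)

print(f"cost per turn: ${exact:.6f}")
print(f"turns needed: {turns_at_penny / 1e9:.0f}B to {turns_at_exact / 1e9:.0f}B")
print(f"turns per person per working day: {per_person_day:,.0f}")
```

Swapping in the Opus-class prices from the note above only lowers the turn count; it does not bring the per-person figure anywhere near what a human can type.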
How cost actually behaves (directional)
A messy-org model, not idealized vendor math
These bands describe a real organization with duplication and friction — not a clean "everyone chats efficiently" world. The value is in showing where the meter runs, so leadership can manage the few behaviors that dominate spend. Confidence tags below follow the same logic as the benchmark hub: what's derived, what's directional, what's inferred.
Where usage concentrates (directional)
| Band | Share of users | Behavior | $/user/day |
| --- | --- | --- | --- |
| L1 · Chat | 55% | Short prompts, Q&A, light drafting | $0.50–2 |
| L2 · Docs | 25% | Uploads, synthesis, repeated context setup | $3–12 |
| L3 · Formal | 15% | Long artifacts, redrafting, meetings-to-docs | $8–25 |
| L4 · Power | 5% | Heavy uploads, large contexts, proto-agents | $20–100+ |
The two heaviest bands, L3 and L4, cover only 20% of users but drive most of the spend and almost all of the surprise.
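A quick way to see where the meter runs is to multiply each band's share of a 500-person org by its per-user range. The sketch below does that with the band figures above; treating "$100+" as a flat $100 is a simplifying assumption, and the names are illustrative. Under these inputs, L3 and L4 together account for roughly two-thirds of daily spend.

```python
# Band figures come from the memo; headcount of 500 is the memo's.
STAFF = 500
BANDS = {
    # name: (share of users, low $/user/day, high $/user/day)
    "L1 Chat":   (0.55, 0.50, 2.0),
    "L2 Docs":   (0.25, 3.0, 12.0),
    "L3 Formal": (0.15, 8.0, 25.0),
    "L4 Power":  (0.05, 20.0, 100.0),   # ASSUMPTION: "100+" capped at 100
}


def daily_spend(bands, staff):
    """Return {band: (low, high)} daily USD spend for the whole org."""
    return {
        name: (staff * share * lo, staff * share * hi)
        for name, (share, lo, hi) in bands.items()
    }


spend = daily_spend(BANDS, STAFF)
total_lo = sum(lo for lo, _ in spend.values())
total_hi = sum(hi for _, hi in spend.values())
top_lo = spend["L3 Formal"][0] + spend["L4 Power"][0]
top_hi = spend["L3 Formal"][1] + spend["L4 Power"][1]

for name, (lo, hi) in spend.items():
    print(f"{name:10s} ${lo:>8,.0f} to ${hi:>8,.0f} per day")
print(f"total      ${total_lo:,.0f} to ${total_hi:,.0f} per day")
print(f"L3+L4 share of spend: {top_lo / total_lo:.0%} to {top_hi / total_hi:.0%}")
```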
Two operating modes (scenario)

| Mode | Per day | Annualized |
| --- | --- | --- |
| Mostly sane | $1.8k–4.5k | $0.47M–1.17M |
| Messy / bloated | $6k–18k | $1.56M–4.68M |
The difference between these two rows is not headcount and not model choice. It is whether context gets rebuilt constantly or captured once. That gap, roughly 3.3× at the low end and 4× at the high end, is the whole argument.
Scenario envelopes for a non-tech 500-person org, ~260 working days. Directional planning ranges, not a forecast or quote.
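The annualized figures are simply the daily envelopes times ~260 working days. A minimal sketch of that conversion, plus the ratio between the two modes, using only the memo's numbers (names are illustrative):

```python
WORKDAYS = 260  # the memo's working-day assumption

MODES = {
    # mode: (low, high) daily spend in USD, from the memo's scenario envelopes
    "mostly sane": (1_800, 4_500),
    "messy / bloated": (6_000, 18_000),
}


def annualize(daily_range, workdays=WORKDAYS):
    """Scale a (low, high) daily range to a yearly range."""
    lo, hi = daily_range
    return lo * workdays, hi * workdays


annual = {mode: annualize(rng) for mode, rng in MODES.items()}
for mode, (lo, hi) in annual.items():
    print(f"{mode:16s} ${lo / 1e6:.2f}M to ${hi / 1e6:.2f}M per year")

# Ratio of messy to sane at each end of the range.
ratio_lo = MODES["messy / bloated"][0] / MODES["mostly sane"][0]
ratio_hi = MODES["messy / bloated"][1] / MODES["mostly sane"][1]
print(f"messy vs sane: {ratio_lo:.1f}x to {ratio_hi:.1f}x")
```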
Where the waste comes from
The bloat is structural, not curiosity
What runs the meter
Context-window walls. The model "forgets," so users react the only way they can — re-pasting the same documents into every new chat. The wall manufactures the waste.
Full-context resends. Whole email threads, policies, and prior drafts shipped into each request instead of a structured knowledge layer.
"Just one more version." Duplicate drafts spun up for internal alignment and stakeholder politics.
Premium models for grunt work. Frontier models doing OCR, extraction, and cleanup that belong on cheap ones.
Humans orchestrating broken processes. Power users hand-feeding workflows; the meter becomes a tax on organizational mess.
The mental-model reset
Vendors price like SaaS — a seat, a flat feel. The behavior is cloud: every additional ingest, retry, and output keeps the meter moving. Treating it as seat-spend is exactly why the bill surprises people.
Once it's read as utility compute, the right instinct follows automatically: don't pay to move the same data through the meter twice. Cache the expensive part. Govern the loops. Reserve the premium engine for the work that actually changes a decision.
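To make "don't pay to move the same data through the meter twice" concrete, here is a minimal sketch comparing two ways of serving the same document to 100 requests. Only the $3 per million input price comes from this memo. The document size, payload size, and small-model price are hypothetical placeholders, and output tokens are left out so the comparison stays on the input side.

```python
FRONTIER_IN = 3.00       # USD per million input tokens (memo's Sonnet-class list price)
CHEAP_IN = 0.25          # HYPOTHETICAL small-model input price, USD per million
DOC_TOKENS = 40_000      # HYPOTHETICAL size of one raw document packet
PAYLOAD_TOKENS = 1_500   # HYPOTHETICAL size of the compact payload
REQUESTS = 100           # how many times the same context is needed


def resend_cost(doc_tokens: int, requests: int) -> float:
    """Every request pushes the raw document through the frontier model."""
    return doc_tokens * requests * FRONTIER_IN / 1e6


def ingest_once_cost(doc_tokens: int, payload_tokens: int, requests: int) -> float:
    """One cheap-model pass over the raw document, then only the payload is sent."""
    ingest = doc_tokens * CHEAP_IN / 1e6
    reuse = payload_tokens * requests * FRONTIER_IN / 1e6
    return ingest + reuse


resend = resend_cost(DOC_TOKENS, REQUESTS)
once = ingest_once_cost(DOC_TOKENS, PAYLOAD_TOKENS, REQUESTS)

print(f"resend raw document each time: ${resend:.2f}")
print(f"ingest once, reuse payload:    ${once:.2f}")
print(f"input-side ratio: {resend / once:.0f}x")
```

The absolute dollar amounts here are small; the point is the ratio, which then multiplies across every document, user, and retry.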
The fix
Separate ingestion from judgment
The ingest layer, in one sentence: a centralized, indexed store of the SMB's own knowledge that the model queries, instead of staff pasting raw files into a chat box every time. Build the working set once; reuse it everywhere.
It is not "build our own GPT" and it is not a GPU program. In phase one it is a small managed cloud service that prepares documents before Claude ever sees them.
The operating model
Tier 1 · Utility service
A managed cloud service runs OCR, cleanup, classification, extraction, chunking, and first-pass summaries on small / cheap models.
Inputs: permit packets, inspection reports, emails & SOPs, scans & images.
Tier 2 · Structured handoff
The service emits one standard payload: document type, key fields, short summary, relevant excerpts, links back to source.
Properties: smaller context, consistent shape, reusable across teams.
Tier 3 · Claude Enterprise
Claude sees only the request plus the compact payload, and spends its tokens on reasoning, comparison, synthesis, and drafting.
Outcomes: high-value output, governable spend, lower effective cost.
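One possible shape for the Tier-2 handoff is sketched below. The five fields mirror the payload list above (document type, key fields, short summary, relevant excerpts, links back to source); the class name, field names, helper function, and example values are all hypothetical, not an existing schema.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class HandoffPayload:
    """HYPOTHETICAL Tier-2 payload; fields mirror the memo's list."""
    document_type: str
    key_fields: dict[str, str]
    summary: str
    excerpts: list[str] = field(default_factory=list)
    source_links: list[str] = field(default_factory=list)

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)


def build_request(question: str, payload: HandoffPayload) -> str:
    """Tier 3 sees only the request plus the compact payload."""
    return f"{question}\n\nContext payload:\n{payload.to_json()}"


# Placeholder values only; nothing here is real data.
example = HandoffPayload(
    document_type="inspection_report",
    key_fields={"site": "EXAMPLE-SITE", "status": "follow-up required"},
    summary="Placeholder summary produced by the Tier-1 service.",
    excerpts=["Placeholder excerpt from the source document."],
    source_links=["https://example.invalid/docs/placeholder"],
)
print(build_request("What follow-up is needed?", example))
```

Keeping the payload to one fixed shape is what makes it reusable across teams: any workflow can build a request the same way, whatever the source document was.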
Two levers, not one
Cost: extraction is done once and reused instead of re-billed per user, per chat. Accuracy: a model handed a tight, relevant payload is far less likely to get distracted or hallucinate than one fed fifty pasted documents. The discipline that saves money also improves the answers — which is why "we'll just be careful" is not a substitute for it.
The ask
One workflow. One pilot. Clear criteria.
Approve a managed Tier-1 utility layer in front of Claude Enterprise. Cloud-only, no new IT function, no hardware.
Pick one document-heavy workflow (permits, inspections, or funding applications) for a 30–60 day test.
Instrument it. Tag each use as judgment, synthesis, reconstruction, or automated; track token spend, turnaround, and how often the same context gets rebuilt.
Decide on evidence. Cost avoided, quality, and rework reduction are the criteria for any expansion — not enthusiasm.
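The instrumentation step could start as small as the sketch below: one record per use, tagged with the four categories above, with token spend, turnaround, and a count of how often the same context is sent again. The class and field names are hypothetical, and hashing the context text only catches byte-identical resends; a real pilot would likely need a looser notion of "same context".

```python
import hashlib
from collections import Counter
from dataclasses import dataclass

USE_TAGS = {"judgment", "synthesis", "reconstruction", "automated"}


@dataclass
class UsageEvent:
    tag: str              # one of USE_TAGS
    context: str          # the context text sent with the request
    tokens: int           # total tokens billed for the request
    turnaround_s: float   # seconds from request to usable output


class PilotLog:
    """HYPOTHETICAL in-memory log for the pilot metrics."""

    def __init__(self) -> None:
        self.events: list[UsageEvent] = []
        self.context_seen: Counter = Counter()

    def record(self, event: UsageEvent) -> None:
        if event.tag not in USE_TAGS:
            raise ValueError(f"unknown tag: {event.tag}")
        digest = hashlib.sha256(event.context.encode("utf-8")).hexdigest()
        self.context_seen[digest] += 1
        self.events.append(event)

    def tokens_by_tag(self) -> dict:
        totals: Counter = Counter()
        for e in self.events:
            totals[e.tag] += e.tokens
        return dict(totals)

    def rebuild_count(self) -> int:
        """Requests that re-sent a context already seen at least once."""
        return sum(n - 1 for n in self.context_seen.values())

    def mean_turnaround(self) -> float:
        return sum(e.turnaround_s for e in self.events) / len(self.events)


log = PilotLog()
log.record(UsageEvent("reconstruction", "permit packet text", 42_000, 30.0))
log.record(UsageEvent("reconstruction", "permit packet text", 42_000, 28.0))
log.record(UsageEvent("judgment", "compact payload", 2_000, 12.0))
print(log.tokens_by_tag())
print("rebuilds:", log.rebuild_count())
```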
How many times are we paying AI to remember what we already know?
Keep that question at the center and the spend follows organizational memory, not license counts. An SMB does not need to become an AI infrastructure company to get the benefit of AI infrastructure discipline.
AI ingestion block
Compact machine-readable summary for loading this memo into another model. House format — not RAG scaffolding.
{
  "artifact": "decision_memo",
  "subject": "SMB AI cost, context, ingest-layer architecture",
  "audience": "COO + flat exec/VP leadership; single room",
  "thesis": "Enterprise AI cost scales with repeated context reconstruction, not headcount; it behaves like cloud compute, not SaaS seats.",
  "falsification": "~$0.01/turn implies ~50-67B turns to reach $500M, i.e. ~385k turns/person/workday over 500 staff -> impossible for humans -> spend is ingestion + unguarded automation.",
  "cost_model": {
    "user_bands": { "L1_chat":"55% / $0.50-2", "L2_docs":"25% / $3-12", "L3_formal":"15% / $8-25", "L4_power":"5% / $20-100+" },
    "modes": { "sane":"$0.47M-1.17M/yr", "messy":"$1.56M-4.68M/yr" },
    "basis": "messy-org scenario envelopes, ~260 workdays, frontier list pricing; directional not forecast",
    "confidence": { "math":"derived", "bands":"directional", "modes":"scenario", "500M_anecdote":"inferred" }
  },
  "solution": {
    "definition": "indexed knowledge store the model queries vs users pasting raw files",
    "tiers": { "1":"cheap-model OCR/extraction/summary", "2":"structured payload handoff", "3":"Claude Enterprise for judgment" },
    "levers": ["cost: preprocess once, reuse","accuracy: tight payload reduces distraction/hallucination"]
  },
  "ask": "one document-heavy workflow, 30-60 day instrumented pilot, expansion gated on cost-avoided + quality + rework",
  "decision_question": "How many times are we paying AI to remember what we already know?"
}