Decision memo · v1.0 · 2026-06-19
SMB AI — cost, context & the ingest layer
Prepared by Jay Oberhelman in collaboration with Claude Pro, Perplexity Pro, ChatGPT Pro, and local LMS_Gemma4_12B-Q4_K_M
The reframe

The AI bill is not the chat layer.

For a 500-person org, the cost story is not employees asking questions. It is the organization repeatedly gathering, re-explaining, and re-sending the same documents and context before the model can do anything useful — and any automation set loose without a cap.

Spend frontier tokens on judgment, synthesis, and writing — not on chewing the same raw context a hundred times.
Kill the anecdote with arithmetic · derived

A $500M bill is not human chat. The math forces it.

A normal assistant turn is roughly 15 input + 500 output tokens. At frontier list pricing (~$3 / $15 per million in/out), that turn costs about three-quarters of a cent. Round up to a penny and walk it forward.
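The per-turn figure can be checked in a few lines. This is a minimal sketch, assuming the Sonnet-class list prices quoted in this memo; the function and constant names are illustrative only.

```python
# Assumed list prices in USD per million tokens (Sonnet-class figures from this memo).
INPUT_PRICE_PER_M = 3.00
OUTPUT_PRICE_PER_M = 15.00


def turn_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one assistant turn at the list prices above."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000


cost = turn_cost(input_tokens=15, output_tokens=500)
print(f"one turn: ${cost:.6f}")  # about three-quarters of a cent
```

Output tokens dominate: nearly all of the cost of this turn comes from the 500 output tokens, which is why rounding to a penny is generous rather than tight.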

  • One turn: ~$0.01. 15 in + 500 out tokens, rounded generously upward.
  • To reach $500M: ~50–67B turns. At a penny each the count is about 50B; at the unrounded three-quarters of a cent it is about 67B.
  • Spread over 500 staff: ~385k turns per person, per working day, every working day (using the 50B low end).
  • Therefore: not people. No human types 385k times a day. The spend is machine-driven.

The number only makes sense as unbounded context, repeated ingestion, and automation with no guardrails — reportedly an org that never set usage limits. That is a governance failure, not ordinary employee behavior. It is also the exact failure mode worth designing out before an SMB scales.

Pricing basis is frontier list price (Sonnet-class ~$3/$15, Opus-class ~$5/$25 per M in/out); exact figures and sourcing live in the companion pricing notes. Turn count assumes ~260 working days. The $500M figure itself is anecdotal — used as a failure pattern, not a citation.
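The turn count follows directly from the figures in this section. A minimal sketch, assuming the memo's 500 staff and ~260 working days; the function names are illustrative.

```python
TARGET_SPEND_USD = 500_000_000  # the anecdotal bill, used as a failure pattern
STAFF = 500
WORKING_DAYS = 260              # the memo's working-year assumption


def turns_to_reach(target_usd: float, cost_per_turn_usd: float) -> float:
    """Number of turns needed to spend target_usd at a given cost per turn."""
    return target_usd / cost_per_turn_usd


def turns_per_person_per_day(total_turns: float,
                             staff: int = STAFF,
                             days: int = WORKING_DAYS) -> float:
    """Spread a turn count evenly over staff and working days."""
    return total_turns / (staff * days)


for label, cost in [("rounded penny", 0.01), ("unrounded ~0.75 cent", 0.0075)]:
    total = turns_to_reach(TARGET_SPEND_USD, cost)
    per_day = turns_per_person_per_day(total)
    print(f"{label}: {total / 1e9:.1f}B turns, "
          f"{per_day:,.0f} per person per working day")
```

The penny-per-turn case gives roughly 385k turns per person per working day; the unrounded case gives roughly 513k. Either way the conclusion holds.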

How cost actually behaves · directional

A messy-org model, not idealized vendor math

These bands describe a real organization with duplication and friction — not a clean "everyone chats efficiently" world. The value is in showing where the meter runs, so leadership can manage the few behaviors that dominate spend. Confidence tags below follow the same logic as the benchmark hub: what's derived, what's directional, what's inferred.

Where usage concentrates · directional

Band          Share   Behavior                                       $/user/day
L1 · Chat     55%     Short prompts, Q&A, light drafting             $0.50–2
L2 · Docs     25%     Uploads, synthesis, repeated context setup     $3–12
L3 · Formal   15%     Long artifacts, redrafting, meetings-to-docs   $8–25
L4 · Power    5%      Heavy uploads, large contexts, proto-agents    $20–100+

The top two bands (L3 and L4) drive most of the spend and almost all of the surprise.
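The band table can be turned into a rough daily envelope. This is a sketch under two assumptions the memo does not state outright: that "Share" means share of the 500 users, and that the open-ended "$100+" is capped at $100 for the arithmetic.

```python
HEADCOUNT = 500  # the memo's org size

# band -> (percent of users, low $/user/day, high $/user/day)
# Assumption: "100+" is treated as 100, so the high end is a floor, not a cap.
BANDS = {
    "L1_chat":   (55, 0.50, 2.0),
    "L2_docs":   (25, 3.0, 12.0),
    "L3_formal": (15, 8.0, 25.0),
    "L4_power":  (5, 20.0, 100.0),
}


def daily_spend(bands: dict, headcount: int) -> dict:
    """Return {band: (low_usd, high_usd)} of daily spend per band."""
    result = {}
    for name, (percent, low, high) in bands.items():
        users = headcount * percent // 100
        result[name] = (users * low, users * high)
    return result


spend = daily_spend(BANDS, HEADCOUNT)
total_low = sum(low for low, _ in spend.values())
total_high = sum(high for _, high in spend.values())
top_low = spend["L3_formal"][0] + spend["L4_power"][0]
top_high = spend["L3_formal"][1] + spend["L4_power"][1]

print(f"daily total: ${total_low:,.2f} to ${total_high:,.2f}")
print(f"L3 + L4 share of spend: {top_low / total_low:.0%} to {top_high / total_high:.0%}")
```

Under those assumptions, L3 and L4 together are 20% of users and about 68% of daily spend at both ends of the range. How these bands map onto the operating-mode scenarios is not specified here, so the totals are an illustration rather than a reconciliation.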

Two operating modes · scenario

Mode              Per day        Annualized
Mostly sane       $1.8k–4.5k     $0.47M–1.17M
Messy / bloated   $6k–18k        $1.56M–4.68M

The difference between these two rows is not headcount and not model choice. It is whether context gets rebuilt constantly or captured once. That gap, roughly 3x to 4x on the ranges above, is the whole argument.

Scenario envelopes for a non-tech 500-person org, ~260 working days. Directional planning ranges, not a forecast or quote.
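The annualized column is the daily range multiplied by the working-day count. A minimal sketch, assuming the ~260 working days stated above; the dictionary keys are illustrative labels.

```python
WORKING_DAYS = 260  # the memo's working-year assumption

# mode -> (low, high) daily spend in USD, taken from the scenario table
MODES = {
    "mostly_sane": (1_800, 4_500),
    "messy_bloated": (6_000, 18_000),
}


def annualize(daily_low: int, daily_high: int, days: int = WORKING_DAYS) -> tuple:
    """Scale a daily spend range to a working year."""
    return daily_low * days, daily_high * days


annual = {mode: annualize(*daily) for mode, daily in MODES.items()}
for mode, (low, high) in annual.items():
    print(f"{mode}: ${low / 1e6:.2f}M to ${high / 1e6:.2f}M per year")

# Ratio between the two modes at each end of the range.
low_ratio = annual["messy_bloated"][0] / annual["mostly_sane"][0]
high_ratio = annual["messy_bloated"][1] / annual["mostly_sane"][1]
print(f"messy vs sane: {low_ratio:.1f}x to {high_ratio:.1f}x")
```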

Where the waste comes from

The bloat is structural, not curiosity

What runs the meter

  • Context-window walls. The model "forgets," so users react the only way they can — re-pasting the same documents into every new chat. The wall manufactures the waste.
  • Full-context resends. Whole email threads, policies, and prior drafts shipped into each request instead of a structured knowledge layer.
  • "Just one more version." Duplicate drafts spun up for internal alignment and stakeholder politics.
  • Premium models for grunt work. Frontier models doing OCR, extraction, and cleanup that belong on cheap ones.
  • Humans orchestrating broken processes. Power users hand-feeding workflows; the meter becomes a tax on organizational mess.

The mental-model reset

Vendors price like SaaS: a seat, with a flat-rate feel. The behavior is cloud: every additional ingest, retry, and output keeps the meter moving. Treating it as seat-spend is exactly why the bill surprises people.

Once it's read as utility compute, the right instinct follows automatically: don't pay to move the same data through the meter twice. Cache the expensive part. Govern the loops. Reserve the premium engine for the work that actually changes a decision.
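"Govern the loops" can be as simple as a hard spending cap that an automated job must charge against before each call. The sketch below is hypothetical: the class and function names are invented for illustration, costs are tracked in whole cents, and no vendor API is involved.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a charge would push spend past the cap."""


class SpendCap:
    """Hypothetical hard cap on spend, tracked in integer cents."""

    def __init__(self, max_cents: int) -> None:
        self.max_cents = max_cents
        self.spent_cents = 0

    def charge(self, cents: int) -> None:
        """Record a charge, refusing it if the cap would be exceeded."""
        if self.spent_cents + cents > self.max_cents:
            raise BudgetExceeded(
                f"cap {self.max_cents}c reached at {self.spent_cents}c"
            )
        self.spent_cents += cents


def run_capped(task_costs_cents: list, cap: SpendCap) -> int:
    """Run tasks in order until done or the cap refuses one; return count run."""
    completed = 0
    for cost in task_costs_cents:
        try:
            cap.charge(cost)
        except BudgetExceeded:
            break  # stop the loop instead of letting the meter run
        completed += 1
    return completed


cap = SpendCap(max_cents=100)
done = run_capped([40, 40, 40], cap)
print(done, cap.spent_cents)  # 2 tasks run, 80 cents spent
```

The design choice is that the loop stops rather than degrades: an automation that hits its cap halts and waits for a human decision.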

The fix

Separate ingestion from judgment

The ingest layer, in one sentence: a centralized, indexed store of the SMB's own knowledge that the model queries, instead of staff pasting raw files into a chat box every time. Build the working set once; reuse it everywhere.

It is not "build our own GPT" and it is not a GPU program. In phase one it is a small managed cloud service that prepares documents before Claude ever sees them.

The operating model

Tier 1

Utility service

A managed cloud service runs OCR, cleanup, classification, extraction, chunking, and first-pass summaries on small / cheap models.

  • Permit packets
  • Inspection reports
  • Emails & SOPs
  • Scans & images

Tier 2

Structured handoff

The utility service emits one standard payload: document type, key fields, short summary, relevant excerpts, links back to source.

  • Smaller context
  • Consistent shape
  • Reusable across teams
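One way to picture the standard payload is as a small typed record. This is a sketch only: the field names mirror the list in the Tier 2 description, but the class names, the sample values, and the example link are hypothetical, not a defined schema.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Excerpt:
    """A relevant passage plus a link back to where it came from."""
    text: str
    source_link: str


@dataclass
class IngestPayload:
    """Hypothetical shape for the Tier 2 handoff described above."""
    document_type: str
    key_fields: dict
    summary: str
    excerpts: list = field(default_factory=list)
    source_links: list = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize the payload, including nested excerpts, to JSON."""
        return json.dumps(asdict(self), indent=2)


# Made-up sample values, for illustration only.
payload = IngestPayload(
    document_type="inspection_report",
    key_fields={"site": "EXAMPLE-SITE", "status": "follow-up required"},
    summary="Sample summary text produced by the Tier 1 service.",
    excerpts=[Excerpt("Sample excerpt.", "https://example.invalid/doc/1#p3")],
    source_links=["https://example.invalid/doc/1"],
)
print(payload.to_json())
```

Because every team receives the same shape, the Tier 3 prompt can be written once against these fields rather than against whatever each user happened to paste.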

Tier 3

Claude Enterprise

Claude sees only the request plus the compact payload — and spends its tokens on reasoning, comparison, synthesis, and drafting.

  • High-value output
  • Governable spend
  • Lower effective cost

Two levers, not one

Cost: extraction is done once and reused instead of re-billed per user, per chat. Accuracy: a model handed a tight, relevant payload is far less likely to get distracted or hallucinate than one fed fifty pasted documents. The discipline that saves money also improves the answers — which is why "we'll just be careful" is not a substitute for it.
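The cost lever amounts to caching extraction results by document content, so the same file is processed once no matter how many people ask about it. A minimal in-memory sketch follows; `ExtractionCache` and `fake_extract` are hypothetical stand-ins, and a real service would use persistent storage and a real extraction step.

```python
import hashlib


class ExtractionCache:
    """Cache extraction results keyed by a hash of the document bytes."""

    def __init__(self, extract_fn) -> None:
        self._extract = extract_fn
        self._store = {}
        self.extract_calls = 0  # how many times the expensive step actually ran

    def get(self, raw: bytes) -> dict:
        """Return the extraction for raw, computing it only on first sight."""
        key = hashlib.sha256(raw).hexdigest()
        if key not in self._store:
            self.extract_calls += 1
            self._store[key] = self._extract(raw)
        return self._store[key]


def fake_extract(raw: bytes) -> dict:
    """Placeholder for the Tier 1 work (OCR, cleanup, summary)."""
    text = raw.decode("utf-8", errors="replace")
    return {"length": len(text), "summary": text[:40]}


cache = ExtractionCache(fake_extract)
document = b"Sample policy text that several teams keep re-pasting."

for _ in range(3):  # three users ask about the same document
    result = cache.get(document)

print(cache.extract_calls)  # the expensive step ran once
```

Keying on content rather than filename means a renamed or re-uploaded copy of the same document still hits the cache.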

The ask

One workflow. One pilot. Clear criteria.

  1. Approve a managed Tier-1 utility layer in front of Claude Enterprise. Cloud-only, no new IT function, no hardware.
  2. Pick one document-heavy workflow (permits, inspections, or funding applications) for a 30–60 day test.
  3. Instrument it. Tag each use as judgment, synthesis, reconstruction, or automated; track token spend, turnaround, and how often the same context gets rebuilt.
  4. Decide on evidence. Cost avoided, quality, and rework reduction are the criteria for any expansion — not enthusiasm.
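The instrumentation in step 3 can start as one log record per model call and a small roll-up. This sketch assumes a hypothetical `context_id` that identifies the document set sent with a request; the four tags are the ones named in the step, and everything else is illustrative.

```python
from collections import defaultdict
from dataclasses import dataclass

TAGS = {"judgment", "synthesis", "reconstruction", "automated"}


@dataclass(frozen=True)
class UsageEvent:
    """One model call, as it might be logged during the pilot."""
    tag: str
    input_tokens: int
    output_tokens: int
    context_id: str  # hypothetical id for the document set sent with the call


def summarize(events: list) -> tuple:
    """Return (tokens per tag, rebuild count per context sent more than once)."""
    tokens_by_tag = defaultdict(int)
    sends_by_context = defaultdict(int)
    for event in events:
        if event.tag not in TAGS:
            raise ValueError(f"unknown tag: {event.tag}")
        tokens_by_tag[event.tag] += event.input_tokens + event.output_tokens
        sends_by_context[event.context_id] += 1
    # A rebuild is every send of the same context after the first.
    rebuilds = {ctx: n - 1 for ctx, n in sends_by_context.items() if n > 1}
    return dict(tokens_by_tag), rebuilds


# Made-up sample log, for illustration only.
log = [
    UsageEvent("reconstruction", 12_000, 300, "permit-packet-A"),
    UsageEvent("reconstruction", 12_000, 250, "permit-packet-A"),
    UsageEvent("judgment", 800, 600, "permit-packet-A"),
    UsageEvent("synthesis", 2_000, 900, "inspection-B"),
]
tokens, rebuilds = summarize(log)
print(tokens)
print(rebuilds)
```

The rebuild count is the number that speaks most directly to the question this memo keeps at its center.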

How many times are we paying AI to remember what we already know?

Keep that question at the center and the spend follows organizational memory, not license counts. An SMB does not need to become an AI infrastructure company to get the benefit of AI infrastructure discipline.

AI ingestion block

Compact machine-readable summary for loading this memo into another model. House format — not RAG scaffolding.

{
  "artifact": "decision_memo",
  "subject": "SMB AI cost, context, ingest-layer architecture",
  "audience": "COO + flat exec/VP leadership; single room",
  "thesis": "Enterprise AI cost scales with repeated context reconstruction, not headcount; it behaves like cloud compute, not SaaS seats.",
  "falsification": "~$0.01/turn implies ~50-67B turns to reach $500M, i.e. ~385k turns/person/workday over 500 staff -> impossible for humans -> spend is ingestion + unguarded automation.",
  "cost_model": {
    "user_bands": { "L1_chat":"55% / $0.50-2", "L2_docs":"25% / $3-12", "L3_formal":"15% / $8-25", "L4_power":"5% / $20-100+" },
    "modes": { "sane":"$0.47M-1.17M/yr", "messy":"$1.56M-4.68M/yr" },
    "basis": "messy-org scenario envelopes, ~260 workdays, frontier list pricing; directional not forecast",
    "confidence": { "math":"derived", "bands":"directional", "modes":"scenario", "500M_anecdote":"inferred" }
  },
  "solution": {
    "definition": "indexed knowledge store the model queries vs users pasting raw files",
    "tiers": { "1":"cheap-model OCR/extraction/summary", "2":"structured payload handoff", "3":"Claude Enterprise for judgment" },
    "levers": ["cost: preprocess once, reuse","accuracy: tight payload reduces distraction/hallucination"]
  },
  "ask": "one document-heavy workflow, 30-60 day instrumented pilot, expansion gated on cost-avoided + quality + rework",
  "decision_question": "How many times are we paying AI to remember what we already know?"
}