Economics of Intelligence
API providers charge per million tokens, and prices drop every month.
- Input Tokens (Prompt): Cheap.
- Output Tokens (Completion): Expensive (usually 3x-10x the input price).
This creates a false sense of security. Developers think, "It's only $0.15 per million tokens, I can iterate forever."
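The asymmetry is easy to see with a back-of-envelope calculation. The prices below are illustrative (real rates vary by provider and model), and `request_cost` is a hypothetical helper, not any provider's SDK:

```python
# Illustrative per-million-token prices; real rates vary by provider and model.
INPUT_PRICE_PER_M = 0.15    # $ per 1M input (prompt) tokens
OUTPUT_PRICE_PER_M = 0.60   # $ per 1M output (completion) tokens (4x input here)

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single API call."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

# A single call looks trivially cheap...
one_call = request_cost(2_000, 500)
print(f"${one_call:.6f} per call")
# ...but a million such calls is real money.
print(f"${one_call * 1_000_000:,.0f} per million calls")
```

The per-call figure ($0.0006 here) is what makes "I can iterate forever" feel true; the per-million figure ($600) is what shows up on the invoice.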
The RAG Multiplier
Retrieval Augmented Generation (RAG) is the standard architecture for modern AI apps.
- User asks a short question (10 tokens).
- You search your vector database.
- You retrieve 10 relevant documents (2,000 tokens).
- You inject them into the system prompt.
Your "10-token" query is actually a 2,010-token request. Every single turn of the conversation re-sends this massive context.
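A quick sketch of how the multiplier compounds over a conversation. The token counts are assumptions from the example above, and it assumes retrieval re-runs and the full chat history is re-sent on every turn (the common naive setup):

```python
CONTEXT_TOKENS = 2_000   # retrieved documents injected each turn
QUESTION_TOKENS = 10     # the user's short question
ANSWER_TOKENS = 200      # assumed average completion length

def input_tokens_for_turn(turn: int) -> int:
    # Each turn re-sends the retrieved context plus the full chat history so far.
    history = (turn - 1) * (QUESTION_TOKENS + ANSWER_TOKENS)
    return CONTEXT_TOKENS + history + QUESTION_TOKENS

total_input = sum(input_tokens_for_turn(t) for t in range(1, 11))
print(total_input)  # → 29550 input tokens for a 10-turn chat
```

Ten "10-token" questions become roughly 30,000 billed input tokens, before a single output token is counted.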
Chain of Thought (CoT) Costs
Newer reasoning models (like o1 or DeepSeek-R1) use "hidden" Chain of Thought tokens to think before they answer. You pay for these thinking tokens. A complex logic puzzle might generate 10,000 hidden tokens before outputting the final 50-token answer. You are billed for 10,050 output tokens.
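In code, the billing works out like this. The output price is a made-up round number for illustration (reasoning models are typically priced well above standard models):

```python
OUTPUT_PRICE_PER_M = 60.00  # $ per 1M output tokens; illustrative reasoning-model rate

def cot_cost(hidden_tokens: int, visible_tokens: int) -> float:
    # Hidden chain-of-thought tokens are billed at the same output rate
    # as the visible answer, even though the user never sees them.
    return (hidden_tokens + visible_tokens) * OUTPUT_PRICE_PER_M / 1_000_000

print(f"${cot_cost(10_000, 50):.2f}")  # → $0.60 for a 50-token visible answer
```

The visible answer accounts for under 1% of the bill; the thinking is the product you are paying for.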
Optimization Strategy
- Caching: Use prompt caching for static system instructions.
- Small Models for Routing: Use a tiny model (Llama-3-8B) to classify the query, and only call the big model (GPT-4) for complex tasks.
- Concise Context: Don't dump the whole PDF. Summarize chunks before injection.
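The routing idea can be sketched in a few lines. Everything here is a placeholder: the keyword heuristic stands in for a real small-model classifier call, and the model names are just labels:

```python
# Hypothetical router sketch. In production, is_complex would be a cheap
# small-model call (e.g. Llama-3-8B classifying the query), not keywords.
COMPLEX_HINTS = ("analyze", "compare", "plan", "prove", "debug")

def is_complex(query: str) -> bool:
    # Stand-in heuristic for the small classifier model.
    return any(hint in query.lower() for hint in COMPLEX_HINTS)

def route(query: str) -> str:
    # Only escalate to the expensive model when the query needs it.
    return "big-model" if is_complex(query) else "small-model"

print(route("What are your opening hours?"))            # → small-model
print(route("Analyze these logs and debug the crash"))  # → big-model
```

Even a mediocre router helps: if 80% of traffic is simple and gets handled by the cheap model, the expensive model's bill shrinks by roughly that fraction.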
Use the Token Cost Estimator to forecast your bill at 10k users. The difference between unoptimized RAG and optimized routing is often the difference between a gross margin of 10% and 80%.
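A minimal forecasting sketch in the spirit of such an estimator. Every traffic and pricing number below is an assumption chosen for illustration, not a measured figure:

```python
# Back-of-envelope monthly bill forecaster; all inputs are assumptions.
def monthly_bill(users, requests_per_user_per_day,
                 input_tokens, output_tokens,
                 in_price_per_m, out_price_per_m, days=30):
    requests = users * requests_per_user_per_day * days
    cost_per_request = (input_tokens * in_price_per_m
                        + output_tokens * out_price_per_m) / 1_000_000
    return requests * cost_per_request

# Unoptimized RAG: every query hits the big model with full 2,010-token context.
naive = monthly_bill(10_000, 5, 2_010, 500, 2.50, 10.00)

# Optimized: most traffic routed to a cheap model with concise context,
# only a fraction escalated to the big model.
cheap = monthly_bill(10_000, 4, 500, 300, 0.15, 0.60)
big = monthly_bill(10_000, 1, 1_000, 500, 2.50, 10.00)

print(f"naive: ${naive:,.0f}/mo  optimized: ${cheap + big:,.0f}/mo")
```

Under these assumptions the unoptimized setup runs about $15,000/month against roughly $2,600/month optimized, which is exactly the kind of gap that swings gross margin from 10% to 80%.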