Cost Optimization

Strategies to minimize API costs during compilation and production.

Compilation Costs

Compilation is the most expensive phase because you're evaluating many candidate prompts. Here's how to control costs:

1. Use Token Budgets

const result = await compiler.compile(program, trainset, {
  candidates: 50,
  budget: {
    maxTokens: 100000,  // Hard limit
    onBudgetWarning: (used, max) => {
      console.warn(`⚠️ Token usage: ${used}/${max}`);
    }
  }
});

2. Early Stopping

Skip poorly performing candidates early:

const result = await compiler.compile(program, trainset, {
  candidates: 50,
  earlyStopThreshold: 0.5,  // Skip if score < 50%
});

3. Smaller Training Sets

Start with 5-10 high-quality examples:

// Quality > Quantity
const trainset = [
  // Choose diverse, representative examples
  { input: {...}, output: {...} },
  { input: {...}, output: {...} },
  // 5-10 total is often enough
];

4. Use Cheaper Models

// Use fast/cheap model for compilation
const compileProvider = createProvider("openai", {
  defaultModel: "gpt-4o-mini"  // Cheaper than gpt-4o
});

// Use better model for production
const prodProvider = createProvider("openai", {
  defaultModel: "gpt-4o"
});

Cost Tracking

Track token usage during compilation:

import { CostTracker, estimateCost } from "@mzhub/promptc";

// Estimate before running
const estimate = compiler.estimateCost(trainset.length, {
  candidates: 20
});
console.log("Estimated tokens:", estimate.estimatedTokens);

// Actual usage after compilation
console.log("Actual tokens:", result.meta.tokenUsage.totalTokens);
console.log("API calls:", result.meta.tokenUsage.calls);

// Cost calculation (example rates)
const rates = {
  "gpt-4o-mini": { input: 0.15 / 1000000, output: 0.60 / 1000000 },
  "gpt-4o": { input: 2.50 / 1000000, output: 10.00 / 1000000 }
};

const cost = 
  result.meta.tokenUsage.inputTokens * rates["gpt-4o-mini"].input +
  result.meta.tokenUsage.outputTokens * rates["gpt-4o-mini"].output;

console.log(`Cost: $${cost.toFixed(4)}`);

Production Costs

1. Caching

Cache responses for repeated inputs:

import { PromptCache } from "@mzhub/promptc";

const cache = new PromptCache({
  maxSize: 1000,
  ttlMs: 60 * 60 * 1000  // 1 hour
});

async function runWithCache(input) {
  const cached = cache.get("extract", input);
  if (cached) return cached;
  
  const result = await compiled.run(input);
  cache.set("extract", input, result);
  return result;
}

2. Model Selection

| Task Type         | Recommended Model | Why                                        |
| ----------------- | ----------------- | ------------------------------------------ |
| Simple extraction | gpt-4o-mini       | Fast, cheap, accurate for structured tasks |
| Complex reasoning | gpt-4o            | Better at multi-step logic                 |
| High volume       | groq (llama)      | Fastest inference, lowest latency          |
| Development       | ollama            | Free, runs locally                         |
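
The table above can be encoded as a small lookup so the model choice lives in one place. This is a minimal sketch, not part of the `@mzhub/promptc` API; the `TaskType` names and `pickModel` helper are illustrative, and the non-OpenAI model identifiers are placeholders you would replace with whatever your provider exposes.

```typescript
// Hypothetical helper: map a task profile to a default model.
// Categories mirror the table above; model names are examples only.
type TaskType = "extraction" | "reasoning" | "high-volume" | "development";

const MODEL_BY_TASK: Record<TaskType, string> = {
  extraction: "gpt-4o-mini",              // fast, cheap, accurate for structured tasks
  reasoning: "gpt-4o",                    // better at multi-step logic
  "high-volume": "llama-3.1-8b-instant",  // e.g. a llama model served via groq
  development: "llama3",                  // e.g. a local ollama instance
};

function pickModel(task: TaskType): string {
  return MODEL_BY_TASK[task];
}

console.log(pickModel("extraction")); // "gpt-4o-mini"
```

Centralizing the mapping makes it easy to swap models later without touching every call site.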

3. Batching

Process multiple items in a single call when possible:

// Anti-pattern: N separate API calls, one per item
const results = await Promise.all(
  items.map(item => program.run({ text: item }))
);

// Better: design the schema so one call processes a whole batch
const BatchSchema = defineSchema({
  description: "Extract names from multiple texts",
  inputs: {
    texts: z.array(z.string())  // Process multiple at once
  },
  outputs: {
    results: z.array(z.object({
      text: z.string(),
      names: z.array(z.string())
    }))
  }
});
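
Very large inputs still need to be split so each call stays within the model's context window. A minimal sketch of a generic chunking helper (the `chunk` function is illustrative, not part of the library; the batch size of 10 is an arbitrary example):

```typescript
// Hypothetical helper: split items into fixed-size chunks so each
// batched API call stays within a reasonable context size.
function chunk<T>(items: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Usage with a batch schema like the one above (program.run assumed):
// for (const texts of chunk(allTexts, 10)) {
//   const { results } = await program.run({ texts });
// }

console.log(chunk([1, 2, 3, 4, 5], 2)); // [[1, 2], [3, 4], [5]]
```
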
Cost vs Quality

Always test cheaper options first. gpt-4o-mini often performs as well as gpt-4o for structured extraction tasks, at roughly 1/10th the cost.
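
One way to make "test cheaper options first" concrete is to evaluate both models on the same held-out set and accept the cheaper one if its score is within a tolerance. This is a sketch under assumed inputs: `ModelResult`, `cheaperIsGoodEnough`, the 2-point tolerance, and the example scores and costs are all illustrative, not measured values or library APIs.

```typescript
// Sketch: accept a cheaper model when its eval score is close enough.
interface ModelResult {
  model: string;
  score: number;   // e.g. accuracy on a held-out eval set, 0..1
  costUsd: number; // cost of running that eval
}

function cheaperIsGoodEnough(
  cheap: ModelResult,
  expensive: ModelResult,
  maxScoreDrop = 0.02 // tolerate up to 2 points of score loss
): boolean {
  return (
    cheap.costUsd < expensive.costUsd &&
    expensive.score - cheap.score <= maxScoreDrop
  );
}

// Illustrative numbers only:
const mini = { model: "gpt-4o-mini", score: 0.91, costUsd: 0.04 };
const full = { model: "gpt-4o", score: 0.92, costUsd: 0.45 };
console.log(cheaperIsGoodEnough(mini, full)); // true
```

Tune `maxScoreDrop` to your task: a 2-point drop may be fine for tagging but unacceptable for billing-critical extraction.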