Evaluators

Evaluators score how well LLM outputs match expected outputs during compilation.

Built-in Evaluators

exactMatch

Returns 1.0 if outputs are deeply equal, 0.0 otherwise:

import { exactMatch } from "@mzhub/promptc";

const evaluator = exactMatch();

evaluator({ name: "Alice" }, { name: "Alice" });  // 1.0
evaluator({ name: "Alice" }, { name: "Bob" });    // 0.0
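"Deeply equal" means nested structures are compared recursively, not by reference. A minimal sketch of that notion (an illustration of the idea, not the library's implementation — it does not distinguish arrays from plain objects):

```typescript
// Minimal structural deep-equality check, sketched for illustration only.
function deepEqual(a: unknown, b: unknown): boolean {
  if (a === b) return true; // same reference or equal primitives
  if (typeof a !== "object" || typeof b !== "object" || a === null || b === null) {
    return false;
  }
  const keysA = Object.keys(a as object);
  const keysB = Object.keys(b as object);
  if (keysA.length !== keysB.length) return false;
  // Recurse into every key of the first object
  return keysA.every((k) =>
    deepEqual(
      (a as Record<string, unknown>)[k],
      (b as Record<string, unknown>)[k]
    )
  );
}
```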

partialMatch

Returns the fraction of fields that match:

import { partialMatch } from "@mzhub/promptc";

const evaluator = partialMatch();

evaluator(
  { a: 1, b: 2, c: 3 }, 
  { a: 1, b: 2, c: 4 }
);  // 0.666 (2 out of 3 match)
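Conceptually, the score is a field-by-field comparison against the ground truth. A sketch of that scoring rule, assuming strict equality per field (an illustration, not the library source):

```typescript
// Illustrative partial-match scoring: the fraction of ground-truth
// fields whose values strictly equal the prediction's. The per-field
// strict-equality assumption is ours, not taken from the library.
function partialMatchScore(
  prediction: Record<string, unknown>,
  groundTruth: Record<string, unknown>
): number {
  const keys = Object.keys(groundTruth);
  if (keys.length === 0) return 0; // nothing to compare
  const matched = keys.filter((k) => prediction[k] === groundTruth[k]);
  return matched.length / keys.length;
}
```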

arrayOverlap

Computes Jaccard similarity for arrays:

import { arrayOverlap } from "@mzhub/promptc";

const evaluator = arrayOverlap();

evaluator(["a", "b", "c"], ["a", "b", "c"]);  // 1.0 (identical)
evaluator(["a", "b"], ["b", "c"]);            // 0.33 (1/3 overlap)
evaluator(["a", "b"], ["c", "d"]);            // 0.0 (no overlap)
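Jaccard similarity is the size of the intersection divided by the size of the union of the two arrays treated as sets. A minimal sketch of that computation (an illustration of the formula, not the library's implementation):

```typescript
// Illustrative Jaccard similarity: |A ∩ B| / |A ∪ B|.
function jaccard(a: string[], b: string[]): number {
  const setA = new Set(a);
  const setB = new Set(b);
  const union = new Set([...setA, ...setB]);
  if (union.size === 0) return 1; // both empty: treat as identical (our convention)
  let intersection = 0;
  for (const x of setA) {
    if (setB.has(x)) intersection++;
  }
  return intersection / union.size;
}
```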

llmJudge

Uses an LLM to score output quality:

import { llmJudge, createProvider } from "@mzhub/promptc";

const provider = createProvider("openai", {
  apiKey: process.env.OPENAI_API_KEY
});

const evaluator = llmJudge({
  provider,
  criteria: "accuracy and completeness"  // Optional
});

// Returns a score between 0 and 1
const score = await evaluator(prediction, groundTruth);

Cost consideration: llmJudge makes an API call for each evaluation. Use it sparingly during compilation, or combine it with cheaper evaluators.

Evaluator Interface

All evaluators follow this signature:

type Evaluator<O> = (
  prediction: O,    // LLM output
  groundTruth: O    // Expected output
) => number | Promise<number>;  // Score between 0 and 1

Custom Evaluators

Create your own evaluator for domain-specific scoring:

// Simple custom evaluator
const containsKeywords = (prediction, groundTruth) => {
  const keywords = groundTruth.keywords || [];
  if (keywords.length === 0) return 1;  // nothing required; avoid dividing by zero
  const text = prediction.text?.toLowerCase() || "";

  const found = keywords.filter(k => text.includes(k.toLowerCase()));
  return found.length / keywords.length;
};

// Use with compiler
const compiler = new BootstrapFewShot(containsKeywords);

Combining Evaluators

Combine multiple evaluators with weighted averaging:

const combinedEvaluator = async (prediction, groundTruth) => {
  const exactScore = exactMatch()(prediction, groundTruth);
  const overlapScore = arrayOverlap()(
    prediction.items || [], 
    groundTruth.items || []
  );
  
  // Weighted average: 60% exact, 40% overlap
  return exactScore * 0.6 + overlapScore * 0.4;
};

Choosing an Evaluator

| Use Case | Recommended Evaluator |
| --- | --- |
| Exact answers (classification, extraction) | exactMatch |
| Partial correctness allowed | partialMatch |
| List/set outputs | arrayOverlap |
| Subjective quality (summaries, creative text) | llmJudge |
| Domain-specific scoring | Custom evaluator |