Test Data & Evaluation

Test Data Structure

Test data lives in test-data/ and covers two workflows:

  • Data capture (TC-* test cases) — audio → transcription → extraction → review
  • Assessment (AT-* test cases) — observations → AI assessment → evaluation

Directory Layout

test-data/
├── mags/                       # MAG configuration and assessment test cases
│   ├── cobalt-london/          # Cobalt-London (MAG source ID 113, 2 markets)
│   ├── lbr-syp/                # LBR SYP Lumber (MAG 737, 205 markets)
│   └── sa-vegoils/             # SA Vegoils (MAG 1993, 5 markets)
├── contacts/                   # Data capture test cases (organized by person)
│   └── {contactId}/
│       ├── contact-info.json   # Name, company
│       └── {date}/{mag}/       # Per-call test data
├── suites/
│   └── suite-definitions.yaml  # Test suite groupings
├── audio-profiles/             # Audio effect profiles (baseline, noise, etc.)
├── sound-effects/              # Ambient sounds for TTS generation
├── tts-cache/                  # Cached TTS audio files
├── assessment-outputs/         # Assessment eval run results
├── extraction-outputs/         # Extraction eval run results
├── review-outputs/             # Review eval run results
├── review-calibration-outputs/ # Review calibration results
└── transcription-outputs/      # Transcription eval run results

MAG Test Case Structure

mags/{mag-folder}/
  mag-metadata.json             # MAG config with sourceId
  assessments/
    {AT-XXXX}/
      scenario.json             # Category, description, expected contacts
      golden-assessment.json    # Expected assessment output (ground truth)
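
As a rough sketch, a scenario.json might look like the following; the three fields come from the structure above, while key casing and values are illustrative:

{
  "category": "price-move",
  "description": "Benchmark moves up on two corroborating contacts",
  "expectedContacts": ["contact-042", "contact-043"]
}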

Data Capture Test Case Structure

contacts/{contactId}/{date}/{mag-folder}/
  call-metadata.json            # Markets, price data for this call
  transcript.yaml               # Conversation script for TTS generation
  golden-record.json            # Expected extraction output (ground truth)
  treatments/
    clean/                      # Baseline audio quality
    bad-conn/                   # VoIP packet loss simulation
    noise/                      # Background noise overlay
    fast-speech/                # Accelerated speech (1.5x)
    telephone/                  # Low bitrate phone codec
    combined/                   # Multiple effects combined
    garbled-numbers/            # Failure case: numbers mis-transcribed
    missing-prices/             # Failure case: prices omitted
    truncated/                  # Failure case: call cut off

Each treatment folder contains a test case ID (TC-XXXX) and generated audio files.
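
The transcript format is not spelled out above; as a minimal sketch of a speaker-turn script for TTS generation, with every key below illustrative:

speakers:
  - id: reporter
    voice: en-GB-standard
  - id: contact
    voice: en-US-standard
turns:
  - speaker: reporter
    text: "Morning, anything trading in standard grade today?"
  - speaker: contact
    text: "We saw business at 16.20 to 16.40 a pound."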

Eval CLI Commands

The evaluation CLI lives in src/cli/ and is invoked via npx tsx src/cli/index.ts.

Transcribe

Evaluate transcription accuracy across providers and audio treatments.

npx tsx src/cli/index.ts transcribe \
  --provider assemblyai|azure-speech|whisper|elevenlabs \
  --delay 500 --resume
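
For example, a resumable run against Whisper with a 500 ms inter-batch delay:

npx tsx src/cli/index.ts transcribe --provider whisper --delay 500 --resume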

Extract

Evaluate extraction accuracy against golden records.

npx tsx src/cli/index.ts extract \
  --provider gemini|anthropic|openai \
  --delay 500 --resume

Review

Evaluate review quality (confidence scoring, issue detection).

npx tsx src/cli/index.ts review \
  --provider gemini|anthropic|openai \
  --delay 500 --resume

Assess

Evaluate assessment accuracy against golden assessments.

npx tsx src/cli/index.ts assess \
  --mag cobalt-london|lbr-syp|sa-vegoils \
  --provider gemini|anthropic|openai \
  --delay 500 --resume --derive

The --derive flag enables the derive strategy, in which prices for non-benchmark markets are produced by applying TypeScript formulas instead of being assessed directly (used for SYP Lumber).
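
As a conceptual sketch of that strategy, assuming additive differentials against assessed benchmark prices (every name and type below is illustrative, not the project's actual code):

// Conceptual sketch: derive non-benchmark market prices from benchmark
// assessments via per-market formulas. Shapes are illustrative.
interface MarketPrice {
  marketId: string;
  price: number;
}

interface DeriveFormula {
  marketId: string;     // non-benchmark market to derive
  benchmarkId: string;  // benchmark market the formula references
  differential: number; // additive offset applied to the benchmark price
}

function deriveMarkets(
  benchmarks: MarketPrice[],
  formulas: DeriveFormula[],
): MarketPrice[] {
  const byId = new Map(benchmarks.map((b): [string, number] => [b.marketId, b.price]));
  return formulas.flatMap((f) => {
    const base = byId.get(f.benchmarkId);
    // Skip formulas whose benchmark was not assessed in this run.
    if (base === undefined) return [];
    return [{ marketId: f.marketId, price: base + f.differential }];
  });
}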

Report

Generate comparison reports across eval runs.

npx tsx src/cli/index.ts report \
  --stage assess|extract \
  --runs run_id_1 run_id_2 \
  --format table|csv|json

Common Flags

  • --delay <ms> — inter-batch delay to avoid rate limits
  • --resume — skip already-completed items (resume an interrupted run)
  • --provider <name> — select the AI provider
  • --mag <name> — filter to a specific MAG

Interpreting Results

Each eval run produces a directory in test-data/{stage}-outputs/:

{stage}_YYYYMMDD_HHMMSS/
  manifest.json                 # Run metadata: provider, prompt hash, timestamp, counters
  progress.json                 # Real-time progress tracking
  {test-case-id}/
    {stage}-metadata.yaml       # Input/output metadata, token usage
    {stage}-result.json         # Raw provider output
    evaluation-result.json      # Accuracy metrics vs golden record

Key Metrics

  • Transcription — WER (Word Error Rate): text accuracy vs the original transcript
  • Transcription — price accuracy: whether price values survived transcription
  • Extraction — accuracy %: extracted prices matching the golden record
  • Assessment — accuracy %: assessed prices within tolerance of the golden assessment
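
WER is a word-level edit-distance rate: WER = (S + D + I) / N, where S, D, and I count the substitutions, deletions, and insertions needed to turn the hypothesis into the reference transcript, and N is the number of words in the reference. For example, a 10-word reference transcribed with one substituted word and one dropped word scores (1 + 1 + 0) / 10 = 0.20, i.e. 20% WER; lower is better.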

Manifest Fields

The manifest.json tracks:

  • provider — which AI provider was used
  • promptHash — hash of the prompt template (for reproducibility)
  • startedAt / completedAt — timing
  • total / completed / failed — progress counters
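
Assuming flat keys matching the fields above (all values illustrative), a manifest might look like:

{
  "provider": "gemini",
  "promptHash": "3fa9c2e1",
  "startedAt": "2025-01-15T09:30:00Z",
  "completedAt": "2025-01-15T09:42:11Z",
  "total": 24,
  "completed": 23,
  "failed": 1
}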

Adding a New Test Case

New Data Capture Test Case (TC-*)

  1. Create a contact directory: test-data/contacts/{contactId}/
  2. Add contact-info.json with name and company (sketched after this list)
  3. Create date/MAG subdirectory: {date}/{mag-folder}/
  4. Add call-metadata.json (markets, expected prices)
  5. Add transcript.yaml (conversation script)
  6. Add golden-record.json (expected extraction output)
  7. Generate audio: npx tsx src/cli/index.ts generate-audio
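
As a sketch of the files from steps 2 and 4, assuming flat JSON shapes (key names beyond those described above are illustrative):

contact-info.json:

{
  "name": "Jane Smith",
  "company": "Example Trading Co"
}

call-metadata.json:

{
  "markets": ["cobalt-std-grade"],
  "expectedPrices": [
    { "market": "cobalt-std-grade", "low": 16.20, "high": 16.40, "unit": "USD/lb" }
  ]
}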

New Assessment Test Case (AT-*)

  1. Navigate to test-data/mags/{mag-folder}/assessments/
  2. Create directory: AT-XXXX/
  3. Add scenario.json (category, description, expected contacts)
  4. Add golden-assessment.json (expected per-market assessment output; sketched below)
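
As a sketch of the golden assessment shape, assuming per-market expected values with a tolerance (all key names are illustrative):

{
  "markets": [
    {
      "marketId": "benchmark-1",
      "expectedLow": 16.20,
      "expectedHigh": 16.40,
      "tolerance": 0.05
    }
  ]
}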

Supported Commodities

  • Cobalt (cobalt-london) — TC: multiple contacts, 6+ treatments each; AT: ~10 scenarios. Base metal, $/lb ranges.
  • SYP Lumber (lbr-syp) — TC: multiple contacts, 6+ treatments each; AT: ~20 scenarios. Derived pricing, 205 markets.
  • SA Vegoils (sa-vegoils) — TC: multiple contacts, 6+ treatments each; AT: ~10 scenarios. Forward curve tenors (M1–M6).