# Test Data & Evaluation

## Test Data Structure
Test data lives in `test-data/` and covers two workflows:

- Data capture (TC-* test cases) — audio → transcription → extraction → review
- Assessment (AT-* test cases) — observations → AI assessment → evaluation
### Directory Layout

```
test-data/
├── mags/                          # MAG configuration and assessment test cases
│   ├── cobalt-london/             # Cobalt-London (MAG source ID 113, 2 markets)
│   ├── lbr-syp/                   # LBR SYP Lumber (MAG 737, 205 markets)
│   └── sa-vegoils/                # SA Vegoils (MAG 1993, 5 markets)
├── contacts/                      # Data capture test cases (organized by person)
│   └── {contactId}/
│       ├── contact-info.json      # Name, company
│       └── {date}/{mag}/          # Per-call test data
├── suites/
│   └── suite-definitions.yaml     # Test suite groupings
├── audio-profiles/                # Audio effect profiles (baseline, noise, etc.)
├── sound-effects/                 # Ambient sounds for TTS generation
├── tts-cache/                     # Cached TTS audio files
├── assessment-outputs/            # Assessment eval run results
├── extraction-outputs/            # Extraction eval run results
├── review-outputs/                # Review eval run results
├── review-calibration-outputs/    # Review calibration results
└── transcription-outputs/         # Transcription eval run results
```
### MAG Test Case Structure

```
mags/{mag-folder}/
  mag-metadata.json            # MAG config with sourceId
  assessments/
    {AT-XXXX}/
      scenario.json            # Category, description, expected contacts
      golden-assessment.json   # Expected assessment output (ground truth)
```
### Data Capture Test Case Structure

```
contacts/{contactId}/{date}/{mag-folder}/
  call-metadata.json       # Markets, price data for this call
  transcript.yaml          # Conversation script for TTS generation
  golden-record.json       # Expected extraction output (ground truth)
  treatments/
    clean/                 # Baseline audio quality
    bad-conn/              # VoIP packet loss simulation
    noise/                 # Background noise overlay
    fast-speech/           # Accelerated speech (1.5x)
    telephone/             # Low-bitrate phone codec
    combined/              # Multiple effects combined
    garbled-numbers/       # Failure case: numbers mis-transcribed
    missing-prices/        # Failure case: prices omitted
    truncated/             # Failure case: call cut off
```

Each treatment folder contains a test case ID (TC-XXXX) and the generated audio files.
## Eval CLI Commands

The evaluation CLI lives in `src/cli/` and is invoked via `npx tsx src/cli/index.ts`.
### Transcribe

Evaluate transcription accuracy across providers and audio treatments.

```bash
npx tsx src/cli/index.ts transcribe \
  --provider assemblyai|azure-speech|whisper|elevenlabs \
  --delay 500 --resume
```
### Extract

Evaluate extraction accuracy against golden records.
### Review

Evaluate review quality (confidence scoring, issue detection).
### Assess

Evaluate assessment accuracy against golden assessments.

```bash
npx tsx src/cli/index.ts assess \
  --mag cobalt-london|lbr-syp|sa-vegoils \
  --provider gemini|anthropic|openai \
  --delay 500 --resume --derive
```

The `--derive` flag enables the derive strategy (TypeScript formula application for non-benchmark markets, used for SYP Lumber).
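As a rough illustration of what a derive step can look like — the names (`DeriveFormula`, `deriveMarketPrice`) and the linear formula shape are hypothetical, not the actual implementation — a non-benchmark market's price is computed deterministically from an already-assessed benchmark:

```typescript
// Hypothetical sketch of a derive step: compute a non-benchmark market
// price from a benchmark assessment via a stored linear formula.
// Names and formula shape are illustrative, not the repo's actual code.
interface DeriveFormula {
  benchmarkMarketId: string; // market the formula is anchored to
  multiplier: number;        // scale applied to the benchmark price
  offset: number;            // additive adjustment after scaling
}

function deriveMarketPrice(
  benchmarkPrices: Map<string, number>,
  formula: DeriveFormula,
): number {
  const base = benchmarkPrices.get(formula.benchmarkMarketId);
  if (base === undefined) {
    throw new Error(`No assessed price for benchmark ${formula.benchmarkMarketId}`);
  }
  return base * formula.multiplier + formula.offset;
}
```

The appeal of this shape for a MAG like SYP Lumber (205 markets) is that the AI only assesses the benchmark markets; the rest are plain TypeScript arithmetic, so they are cheap, reproducible, and easy to unit-test.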
### Report

Generate comparison reports across eval runs.

```bash
npx tsx src/cli/index.ts report \
  --stage assess|extract \
  --runs run_id_1 run_id_2 \
  --format table|csv|json
```
### Common Flags

| Flag | Purpose |
|---|---|
| `--delay <ms>` | Inter-batch delay to avoid rate limits |
| `--resume` | Skip already-completed items (resume an interrupted run) |
| `--provider <name>` | Select AI provider |
| `--mag <name>` | Filter to a specific MAG |
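The `--resume` and `--delay` flags typically combine into a loop like the following — a minimal sketch of the pattern, not the CLI's actual internals; `runOne` and the `completed` set (loaded from `progress.json`) are placeholders:

```typescript
// Illustrative eval-runner loop: skip items already recorded as
// completed (--resume) and pause between items to stay under provider
// rate limits (--delay). Not the actual CLI implementation.
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function runEval(
  items: string[],
  completed: Set<string>,                  // ids loaded from progress.json
  runOne: (id: string) => Promise<void>,   // run a single test case
  delayMs: number,                         // value of --delay
): Promise<number> {
  let ran = 0;
  for (const id of items) {
    if (completed.has(id)) continue; // --resume: skip finished test cases
    await runOne(id);
    completed.add(id);               // real code would persist progress here
    ran += 1;
    await sleep(delayMs);
  }
  return ran;
}
```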
## Interpreting Results

Each eval run produces a directory in `test-data/{stage}-outputs/`:

```
{stage}_YYYYMMDD_HHMMSS/
  manifest.json              # Run metadata: provider, prompt hash, timestamp, counters
  progress.json              # Real-time progress tracking
  {test-case-id}/
    {stage}-metadata.yaml    # Input/output metadata, token usage
    {stage}-result.json      # Raw provider output
    evaluation-result.json   # Accuracy metrics vs golden record
```
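A run directory with this layout is easy to post-process. The sketch below walks the per-case subdirectories and averages a score; the `accuracy` field name inside `evaluation-result.json` is an assumption for illustration, not a documented field:

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Sketch: average per-case accuracy across a run directory laid out as
// above. The `accuracy` field name is assumed for illustration.
function averageAccuracy(runDir: string): number {
  const cases = fs
    .readdirSync(runDir, { withFileTypes: true })
    .filter((entry) => entry.isDirectory());
  const scores: number[] = [];
  for (const c of cases) {
    const file = path.join(runDir, c.name, "evaluation-result.json");
    if (!fs.existsSync(file)) continue; // incomplete case (e.g. interrupted run)
    const result = JSON.parse(fs.readFileSync(file, "utf8"));
    if (typeof result.accuracy === "number") scores.push(result.accuracy);
  }
  if (scores.length === 0) return 0;
  return scores.reduce((a, b) => a + b, 0) / scores.length;
}
```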
### Key Metrics
| Stage | Primary Metric | What It Measures |
|---|---|---|
| Transcription | WER (Word Error Rate) | Text accuracy vs original transcript |
| Transcription | Price accuracy | Whether price values survived transcription |
| Extraction | Accuracy % | Extracted prices matching golden record |
| Assessment | Accuracy % | Assessed prices within tolerance of golden assessment |
### Manifest Fields

The `manifest.json` tracks:

- `provider` — which AI provider was used
- `promptHash` — hash of the prompt template (for reproducibility)
- `startedAt` / `completedAt` — timing
- `total` / `completed` / `failed` — progress counters
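The fields above can be captured in a type. This is a sketch inferred from the field list, with assumed types (e.g. ISO-8601 timestamp strings) and a hypothetical `isComplete` helper that is not part of the CLI:

```typescript
// Assumed shape of manifest.json, inferred from the documented fields.
// Timestamp representation and optionality are assumptions.
interface RunManifest {
  provider: string;       // AI provider used for the run
  promptHash: string;     // hash of the prompt template, for reproducibility
  startedAt: string;      // ISO-8601 timestamp
  completedAt?: string;   // absent while the run is still in progress
  total: number;          // test cases in the run
  completed: number;      // finished successfully
  failed: number;         // finished with errors
}

// Hypothetical helper: a run is done when every case is accounted for.
function isComplete(manifest: RunManifest): boolean {
  return manifest.completed + manifest.failed >= manifest.total;
}
```

Comparing `promptHash` across two runs is what makes report-stage comparisons meaningful: if the hashes differ, accuracy deltas may reflect prompt changes rather than provider differences.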
## Adding a New Test Case

### New Data Capture Test Case (TC-*)

1. Create a contact directory: `test-data/contacts/{contactId}/`
2. Add `contact-info.json` with name and company
3. Create the date/MAG subdirectory: `{date}/{mag-folder}/`
4. Add `call-metadata.json` (markets, expected prices)
5. Add `transcript.yaml` (conversation script)
6. Add `golden-record.json` (expected extraction output)
7. Generate audio: `npx tsx src/cli/index.ts generate-audio`
### New Assessment Test Case (AT-*)

1. Navigate to `test-data/mags/{mag-folder}/assessments/`
2. Create the directory: `AT-XXXX/`
3. Add `scenario.json` (category, description, expected contacts)
4. Add `golden-assessment.json` (expected per-market assessment output)
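A forgotten file is the most common way a new case silently drops out of a run, so a pre-flight check is cheap insurance. The helper below is hypothetical, not part of the repo; it only verifies that the two files the steps above require are present:

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Hypothetical pre-flight check for a new AT-* case directory: returns
// the required files that are missing (empty array means the case is
// structurally complete).
function validateAssessmentCase(caseDir: string): string[] {
  const required = ["scenario.json", "golden-assessment.json"];
  return required.filter((file) => !fs.existsSync(path.join(caseDir, file)));
}
```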
## Supported Commodities
| Commodity | MAG | Test Cases (TC) | Test Cases (AT) | Notes |
|---|---|---|---|---|
| Cobalt | cobalt-london | Multiple contacts, 6+ treatments each | ~10 scenarios | Base metal, $/lb ranges |
| SYP Lumber | lbr-syp | Multiple contacts, 6+ treatments each | ~20 scenarios | Derived pricing, 205 markets |
| SA Vegoils | sa-vegoils | Multiple contacts, 6+ treatments each | ~10 scenarios | Forward curve tenors (M1–M6) |