E2E Eval Pipeline - Usage Guide¶
Prerequisites¶
- Environment Variables: ASSEMBLYAI_API_KEY (required for transcription); optionally EXTRACTION_PROVIDER and REVIEW_PROVIDER to override the default AI provider (gemini).
- Dependencies: Node.js with ts-node/tsx available (all CLI commands below are run via npx).
CLI Commands¶
List Available Test Cases¶
Output:
Available MAGs:
- MAG_001
Available Suites:
- default: Default test suite with all standard MAGs
Test cases: MAG_001
Run Evaluation Pipeline¶
Run all test cases:
Run specific MAG:
Run specific step only:
# Extraction only (skips transcription)
npx ts-node src/cli/index.ts run --mag MAG_001 --step extraction
# Review only (runs extraction first, then review)
npx ts-node src/cli/index.ts run --mag MAG_001 --step review
Run with verbose output:
Running from a Specific Step¶
The pipeline supports entry at any step, allowing you to:

- Test extraction with pre-existing or deliberately mangled transcripts
- Test review with known extraction results
- Skip expensive transcription when iterating on extraction/review
Pipeline Steps¶
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  Transcription  │────▶│   Extraction    │────▶│     Review      │────▶│   Evaluation    │
│  (audio→text)   │     │  (text→struct)  │     │  (confidence)   │     │   (vs golden)   │
└─────────────────┘     └─────────────────┘     └─────────────────┘     └─────────────────┘
         ▲                       ▲                       ▲                       ▲
         │                       │                       │                       │
     audioPath              transcript              extraction            goldenRecord
                              (skip)                  (skip)                  (skip)
Entry Points¶
| Start From | Required Input | Skipped Steps | Use Case |
|---|---|---|---|
| Audio | audioPath | None | Full E2E test |
| Transcript | transcript | Transcription | Test extraction with known/mangled text |
| Extraction | transcript + extraction | Transcription, Extraction | Test review only |
| Review only | - | Transcription, Extraction, Evaluation | Just run review step |
CLI Usage¶
1. Full Pipeline (from audio):
# Requires audio files in test-data/mags/MAG_001/audio/
npx ts-node src/cli/index.ts run --mag MAG_001
2. Start from Transcript (skip transcription):
# Uses transcript from test-data/mags/MAG_001/transcripts/original.txt
# Skips transcription step entirely
npx ts-node src/cli/index.ts run --mag MAG_001 --step extraction
3. Run Review Only:
# Runs extraction first, then review (no transcription)
npx ts-node src/cli/index.ts run --mag MAG_001 --step review
Programmatic Usage¶
Full Pipeline:
import { createPipelineExecutor } from '@/services/pipeline/index.js';
const executor = createPipelineExecutor(services, config);
// Start from audio - runs all steps
const result = await executor.run({
testCaseId: 'MAG_001',
audioPath: 'path/to/audio.mp3',
goldenRecord: goldenRecord,
});
Start from Transcript (skip transcription):
// Provide transcript instead of audioPath
// Pipeline automatically skips transcription
const result = await executor.run({
testCaseId: 'MAG_001',
transcript: 'Speaker 1: The cobalt price is $14.10...',
goldenRecord: goldenRecord,
// Optional: explicitly skip transcription
skipTranscription: true,
});
Start from Extraction (test review only):
// Provide both transcript and pre-computed extraction
const result = await executor.run({
testCaseId: 'MAG_001',
transcript: 'Speaker 1: The cobalt price is $14.10...',
extraction: {
extractedPoints: [...],
provider: 'gemini',
model: 'gemini-2.5-flash',
},
goldenRecord: goldenRecord,
skipTranscription: true,
skipExtraction: true,
});
Run Individual Steps:
// Transcription only
const transcription = await executor.transcribe('path/to/audio.mp3', 'test-case-id');
// Extraction only
const extraction = await executor.extract(
'transcript text...',
{ market: 'Cobalt', methodology: 'assessment' },
'test-case-id'
);
// Review only (requires extraction result)
const review = await executor.review(
'transcript text...',
extractionResult,
'test-case-id'
);
Testing with Mangled Transcripts¶
A key use case is testing extraction resilience with deliberately corrupted transcripts:
1. Create mangled transcript:
Speaker 1: The price came in at... [static]... teen dollars.
Speaker 2: Sorry, I didn't catch that. Did you say four-teen?
2. Run extraction on it:
import { readFileSync } from 'fs';
const mangledTranscript = readFileSync(
'test-data/mangled/garbled-numbers/transcript.txt',
'utf-8'
);
// Test how well extraction handles corrupted input
const result = await executor.run({
testCaseId: 'garbled-numbers-test',
transcript: mangledTranscript,
goldenRecord: goldenRecord, // Compare against expected output
skipTranscription: true,
});
// Check extraction accuracy with degraded input
console.log('Extraction accuracy:', result.evaluationResults?.extraction?.accuracy);
Skip Flags Reference¶
| Flag | Effect |
|---|---|
| skipTranscription | Skip audio→text, use provided transcript |
| skipExtraction | Skip text→struct, use provided extraction |
| skipReview | Skip confidence assessment |
| skipEvaluation | Skip comparison with golden record |
Auto-skip behavior:

- If transcript is provided but no audioPath, transcription is auto-skipped
- If extraction is provided, extraction is auto-skipped
- If no goldenRecord is provided, evaluation is auto-skipped
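These auto-skip rules can be sketched as a small resolver. This is an illustrative sketch, not the pipeline's actual implementation: the field names mirror the `executor.run()` examples in this guide, but `resolveSkips` itself is a hypothetical helper.

```typescript
// Illustrative input shape; field names mirror the executor.run() examples.
interface RunInput {
  audioPath?: string;
  transcript?: string;
  extraction?: unknown;
  goldenRecord?: unknown;
  skipTranscription?: boolean;
  skipExtraction?: boolean;
  skipEvaluation?: boolean;
}

// Hypothetical helper: derive effective skip flags from the provided inputs.
function resolveSkips(input: RunInput) {
  return {
    // Transcript present but no audio → transcription auto-skipped.
    skipTranscription:
      input.skipTranscription ?? (!!input.transcript && !input.audioPath),
    // Pre-computed extraction provided → extraction auto-skipped.
    skipExtraction: input.skipExtraction ?? !!input.extraction,
    // No golden record → nothing to evaluate against.
    skipEvaluation: input.skipEvaluation ?? !input.goldenRecord,
  };
}
```

For example, a run input containing only a transcript would skip transcription and evaluation but still run extraction; explicit flags always win over the inferred defaults.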
Example: A/B Testing Extraction Approaches¶
// Same transcript, different extraction configs
const transcript = readFileSync('transcript.txt', 'utf-8');
// Approach A: Gemini Flash
const resultA = await executor.run({
testCaseId: 'approach-a',
transcript,
goldenRecord,
skipTranscription: true,
});
// Approach B: Different prompt/model (swap service)
const executorB = createPipelineExecutor({
...services,
extraction: new ClaudeExtractionService({ model: 'claude-sonnet-4' }),
}, config);
const resultB = await executorB.run({
testCaseId: 'approach-b',
transcript,
goldenRecord,
skipTranscription: true,
});
// Compare results
console.log('Gemini accuracy:', resultA.evaluationResults?.extraction?.accuracy);
console.log('Claude accuracy:', resultB.evaluationResults?.extraction?.accuracy);
Compare Two Runs¶
Output formats:
# Table format (default)
npx ts-node src/cli/index.ts compare abc123 def456
# JSON format
npx ts-node src/cli/index.ts compare abc123 def456 --format json
# CSV format
npx ts-node src/cli/index.ts compare abc123 def456 --format csv
# Show only differences
npx ts-node src/cli/index.ts compare abc123 def456 --diffs-only
Generate Audio Variants¶
# Generate for all MAGs with all profiles
npx tsx src/cli/index.ts generate-audio --all --profiles all
# Generate for specific MAG and test case
npx tsx src/cli/index.ts generate-audio --mag MAG_001_T001 --profiles all
# Specific profiles only
npx tsx src/cli/index.ts generate-audio --mag MAG_001_T001 --profiles clean noise_high voice_female_american
Output Structure:
test-data/audio-outputs/run_xxx/
├── manifest.json # Run metadata
├── progress.json # Run progress/status
└── MAG_001/T001/
├── translations/
│ ├── MAG_001_T001_source.txt # Original transcript
│ └── MAG_001_T001_en.txt # Normalized transcript
├── audio/
│ └── elevenlabs_v_{voiceIds}_spd_{speed}_stb_{stab}_sim_{sim}/
│ ├── MAG_001_T001_en.mp3 # Combined TTS audio
│ ├── MAG_001_T001_en_segments/ # Per-speaker segments
│ └── tts-config.yaml # TTS parameters used
└── profiles/
└── {profile_name}/
├── MAG_001_T001_en_effects.mp3 # Audio with effects
└── MAG_001_T001_en_effects_segments/ # Segment variations
Audio Profile Categories (49 profiles):
| Category | Profiles | Description |
|---|---|---|
| bad_connection | 9 | Packet loss, jitter, distortion |
| noise | 4 | Background noise at various levels |
| combined | 5 | Multiple effects combined |
| speed | 5 | Speech rate variations (0.7x-1.5x) |
| telephone | 4 | Telephone audio quality |
| office | 10 | Office environment + effects |
| voice_accent | 5 | Different accents and speakers |
| Other | 7 | Business call, variable effects |
Voice Accent Profiles:

- voice_female_american - American female speakers
- voice_female_british - British female speakers
- voice_female_indian - Indian female speakers
- voice_male_spanish - Spanish male speakers
- voice_multi_speaker_diverse - Spanish male + Indian female (cross-cultural)
Transcribe Audio Files¶
# Transcribe from latest audio run
npx tsx src/cli/index.ts transcribe --verbose
# Transcribe from specific audio run
npx tsx src/cli/index.ts transcribe --audio-run run_001_20260125_181823 --verbose
# Transcribe specific MAG only
npx tsx src/cli/index.ts transcribe --mag MAG_001_T001 --verbose
# Transcribe specific profiles
npx tsx src/cli/index.ts transcribe --profiles clean noisy --verbose
Output: test-data/transcription-outputs/transcription_YYYYMMDD_HHMMSS/
Transcription Output Structure:
transcription_xxx/MAG_001/T001/profiles/{profile}/
├── transcription.txt # AssemblyAI output text
├── transcription-result.json # Full API response
├── transcription-metadata.yaml # Provider, model, timing
├── original-transcript.txt # Original for comparison
├── evaluation-result.json # WER, CER, domain metrics
└── evaluation-comparison.yaml # Line-by-line diff
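The WER in evaluation-result.json is conventionally word-level edit distance divided by the reference word count. A minimal sketch of that formula (illustrative only, not the evaluator's actual code):

```typescript
// Word error rate: Levenshtein edit distance over word tokens,
// divided by the number of words in the reference transcript.
function wordErrorRate(reference: string, hypothesis: string): number {
  const ref = reference.trim().split(/\s+/);
  const hyp = hypothesis.trim().split(/\s+/);
  // Classic dynamic-programming edit distance table.
  const d: number[][] = Array.from({ length: ref.length + 1 }, (_, i) =>
    Array.from({ length: hyp.length + 1 }, (_, j) =>
      i === 0 ? j : j === 0 ? i : 0,
    ),
  );
  for (let i = 1; i <= ref.length; i++) {
    for (let j = 1; j <= hyp.length; j++) {
      const sub = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      d[i][j] = Math.min(
        d[i - 1][j] + 1,      // deletion
        d[i][j - 1] + 1,      // insertion
        d[i - 1][j - 1] + sub, // substitution (or match)
      );
    }
  }
  return d[ref.length][hyp.length] / ref.length;
}
```

CER follows the same formula over characters instead of words; the "domain metrics" in the same file are project-specific and not covered by this sketch.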
Extract Structured Data¶
# Extract from latest transcription run
npx tsx src/cli/index.ts extract --verbose
# Extract from specific transcription run
npx tsx src/cli/index.ts extract --transcription-run transcription_20260126_022028 --verbose
# Extract from original transcripts (skip transcription step)
npx tsx src/cli/index.ts extract --use-original --verbose
# Extract specific MAG only
npx tsx src/cli/index.ts extract --mag MAG_001_T001 --verbose
# Override AI provider (default: gemini; set via --provider or EXTRACTION_PROVIDER env var)
npx tsx src/cli/index.ts extract --provider gemini --verbose
Output: test-data/extraction-outputs/extraction_YYYYMMDD_HHMMSS/
Extraction Output Structure:
extraction_xxx/MAG_001/T001/profiles/{profile}/
├── extracted-points.json # Extracted price points
├── extraction-result.json # Full Gemini response
├── extraction-metadata.yaml # Model, provider, timing
├── evaluation-result.json # Precision, recall, F1 vs golden
└── evaluation-comparison.yaml # Point-by-point comparison
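The precision, recall, and F1 figures in evaluation-result.json follow the standard definitions over extracted points matched against the golden record. As a worked sketch (illustrative; it assumes simple counts of matched, extracted, and golden points, which is not necessarily how the evaluator aggregates fuzzy matches):

```typescript
// Standard precision/recall/F1 over point matches against the golden record.
function extractionScores(matched: number, extracted: number, golden: number) {
  const precision = extracted === 0 ? 0 : matched / extracted; // of what we extracted, how much was right
  const recall = golden === 0 ? 0 : matched / golden;          // of what we should have found, how much we did
  const f1 =
    precision + recall === 0
      ? 0
      : (2 * precision * recall) / (precision + recall);       // harmonic mean
  return { precision, recall, f1 };
}
```

For example, 9 matched points out of 10 extracted against 10 golden points gives precision 0.9, recall 0.9, and F1 0.9.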
Review Extractions¶
# Review from latest extraction run
npx tsx src/cli/index.ts review --verbose
# Review from specific extraction run
npx tsx src/cli/index.ts review --extraction-run extraction_20260126_025447 --verbose
# Review specific MAG only
npx tsx src/cli/index.ts review --mag MAG_001_T001 --verbose
# Override AI provider (default: gemini; set via --provider or REVIEW_PROVIDER env var)
npx tsx src/cli/index.ts review --provider gemini --verbose
Output: test-data/review-outputs/review_YYYYMMDD_HHMMSS/
Review Output Structure:
review_xxx/MAG_001/T001/profiles/{profile}/
├── review-result.json # Confidence scores + rationale
└── review-metadata.yaml # Model, provider, timing
Review Result Schema:
{
"overallConfidence": 0.85,
"pointReviews": [
{
"pointIndex": 0,
"confidence": 1.0,
"issues": [],
"suggestions": []
},
{
"pointIndex": 1,
"confidence": 0.65,
"issues": ["Source text appears garbled"],
"suggestions": ["Flag as low quality"]
}
],
"rationale": "Overall assessment explanation...",
"model": "gemini-2.5-flash",
"provider": "gemini"
}
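A consumer of this schema might flag low-confidence points for human follow-up. A hedged sketch (the interface mirrors the JSON example above; `lowConfidencePoints` and the 0.7 threshold are illustrative, not part of the project's API):

```typescript
// Shape mirrors one entry of pointReviews in review-result.json above.
interface PointReview {
  pointIndex: number;
  confidence: number;
  issues: string[];
  suggestions: string[];
}

// Hypothetical helper: indices of points below a confidence threshold.
function lowConfidencePoints(reviews: PointReview[], threshold = 0.7): number[] {
  return reviews
    .filter((r) => r.confidence < threshold)
    .map((r) => r.pointIndex);
}
```

Applied to the example above, this would surface point 1 (confidence 0.65, "Source text appears garbled") while leaving point 0 alone.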
Validate Assessment Runs (Post-hoc)¶
Test the assessment validator against existing assessment output files without re-running the full pipeline:
# Validate one or more assessment runs
npx tsx src/cli/index.ts validate-runs \
--runs assessment_20260226_204618 assessment_20260226_204630 \
--mag cobalt-london --tc AT-0002 --provider gemini
# With verbose output and rate-limit delay
npx tsx src/cli/index.ts validate-runs \
--runs assessment_20260226_204618 \
--verbose --delay 500
# Override the Gemini model used for LLM review
npx tsx src/cli/index.ts validate-runs \
--runs assessment_20260226_204618 \
--model gemini-3-flash-preview
Runs are resolved from test-data/assessment-outputs/ by ID, or you can pass a full path.
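That resolution rule (run ID under test-data/assessment-outputs/, or a full path used as-is) could look roughly like the sketch below. `resolveRunDir` is hypothetical, and the existence check is injected so the logic is testable without touching the disk:

```typescript
import { isAbsolute, join } from 'node:path';

// Hypothetical sketch of resolving a --runs argument to a directory.
function resolveRunDir(
  runIdOrPath: string,
  exists: (p: string) => boolean,
  baseDir = 'test-data/assessment-outputs',
): string {
  // An absolute path, or a relative path that exists as given, is used directly...
  if (isAbsolute(runIdOrPath) || exists(runIdOrPath)) return runIdOrPath;
  // ...otherwise treat the argument as a run ID under the outputs directory.
  return join(baseDir, runIdOrPath);
}
```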
Adding Test Cases¶
1. Create MAG Directory¶
2. Create Test Case Metadata¶
test-data/mags/MAG_002/test-case.json:
{
"id": "MAG_002",
"market": "Steel",
"language": "en",
"description": "Steel rebar market conversation",
"tags": ["steel", "metals"],
"createdAt": "2025-01-23T00:00:00.000Z",
"updatedAt": "2025-01-23T00:00:00.000Z"
}
3. Create Golden Record¶
test-data/mags/MAG_002/golden-record.json:
{
"testCaseId": "MAG_002",
"sourceTranscript": "Speaker 1: The steel rebar price...",
"extractedPoints": [
{
"market": "Steel Rebar",
"priceValue": 450.00,
"volume": 100,
"volumeUnit": "tonnes",
"type": "transaction",
"sourceReferences": [
{
"lineStart": 1,
"lineEnd": 1,
"text": "The steel rebar price was $450 per tonne for 100 tonnes."
}
]
}
]
}
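The golden record shape above can be described with a TypeScript interface (field names are taken from the example JSON; treat this as a sketch of the expected schema, not the project's canonical types, and `isGoldenRecord` as an illustrative sanity check):

```typescript
// Field names mirror the golden-record.json example above.
interface SourceReference {
  lineStart: number;
  lineEnd: number;
  text: string;
}

interface ExtractedPoint {
  market: string;
  priceValue: number;
  volume?: number;      // assumed optional; not every point may carry volume
  volumeUnit?: string;
  type: string;         // e.g. 'transaction'
  sourceReferences: SourceReference[];
}

interface GoldenRecord {
  testCaseId: string;
  sourceTranscript: string;
  extractedPoints: ExtractedPoint[];
}

// Minimal runtime check before feeding a hand-written file into the pipeline.
function isGoldenRecord(value: unknown): value is GoldenRecord {
  const v = value as GoldenRecord;
  return (
    typeof v?.testCaseId === 'string' &&
    typeof v?.sourceTranscript === 'string' &&
    Array.isArray(v?.extractedPoints)
  );
}
```

Running a check like this when adding a test case catches malformed golden records before they show up as mysterious evaluation failures.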
4. Add Transcript¶
test-data/mags/MAG_002/transcripts/original.txt:
Speaker 1: The steel rebar price was $450 per tonne for 100 tonnes.
Speaker 2: That matches our expectations for the current market.
5. Update Suite Definition¶
Add to test-data/suites/suite-definitions.yaml:
- name: default
description: Default test suite with all standard MAGs
testCases:
- MAG_001
- MAG_002
6. Generate Audio (Optional)¶
Viewing Results¶
Run Results Location¶
Results are stored in test-data/results/{run_id}/:
test-data/results/abc12345-1234-5678-9abc-def012345678/
├── run-config.json # Versioning info (git sha, config hash, etc.)
├── summary.json # Aggregated metrics
└── mags/
└── MAG_001.json # Per-MAG detailed results
Run Config Schema¶
{
"runId": "abc12345-1234-5678-9abc-def012345678",
"gitSha": "5172ed4...",
"timestamp": "2025-01-23T12:00:00.000Z",
"configHash": "a1b2c3d4",
"promptHashes": {
"extraction": "e5f6g7h8",
"review": "i9j0k1l2"
},
"testSuiteHash": "m3n4o5p6",
"services": {
"transcription": { "provider": "assemblyai", "model": "best" },
"extraction": { "provider": "gemini", "model": "gemini-2.5-flash" },
"review": { "provider": "gemini", "model": "gemini-2.5-flash" }
}
}
Summary Schema¶
{
"runId": "abc12345...",
"timestamp": "2025-01-23T12:00:00.000Z",
"totalTestCases": 1,
"passed": 1,
"failed": 0,
"errors": 0,
"metrics": {
"transcription": {
"avgWer": 0.05,
"avgCer": 0.02
},
"extraction": {
"avgAccuracy": 0.95,
"avgPrecision": 0.90,
"avgRecall": 0.95,
"avgF1": 0.92
},
"review": {
"avgConfidenceCorrelation": 0.85,
"avgCalibrationError": 0.08
}
},
"durationMs": 45000
}
Logfire Integration¶
When Logfire is enabled, all pipeline runs are traced:
- Enable in config: configs/eval/default.yaml
- View traces in Logfire dashboard:
- Pipeline execution spans
- Individual step durations
- LLM call details
- Evaluation metrics
Programmatic Usage¶
Using the Pipeline Directly¶
import {
createPipelineExecutor,
type PipelineServicesConfig,
} from '@/services/pipeline/index.js';
import { AssemblyAITranscriptionService } from '@/services/transcription/assemblyai.js';
import { GeminiExtractionService } from '@/services/extraction/gemini.js';
import { GeminiReviewService } from '@/services/review/gemini.js';
// Create services
const services: PipelineServicesConfig = {
transcription: new AssemblyAITranscriptionService({ logfireEnabled: false }),
extraction: new GeminiExtractionService({ logfireEnabled: false }),
review: new GeminiReviewService({ logfireEnabled: false }),
};
// Create executor
const executor = createPipelineExecutor(services, {
logfireEnabled: false,
evaluation: {
numericTolerance: 0.01,
matchThreshold: 0.8,
},
});
// Run full pipeline
const result = await executor.run({
testCaseId: 'MAG_001',
audioPath: 'test-data/mags/MAG_001/audio/clean.mp3',
goldenRecord: goldenRecord,
});
// Or run individual steps
const transcription = await executor.transcribe('path/to/audio.mp3');
const extraction = await executor.extract('transcript text...', { market: 'Cobalt' });
const review = await executor.review('transcript text...', extractionResult);
Using Individual Evaluators¶
import {
TranscriptionEvaluator,
ExtractionEvaluator,
ReviewEvaluator,
} from '@/services/evaluation/index.js';
// Transcription evaluation
const transcriptionEval = new TranscriptionEvaluator({ numericTolerance: 0.01 });
const werResult = await transcriptionEval.evaluate(
originalTranscript,
aiTranscript,
{ provider: 'assemblyai', model: 'best' }
);
// Extraction evaluation
const extractionEval = new ExtractionEvaluator({ matchThreshold: 0.8 });
const extractionResult = await extractionEval.evaluate(
goldenRecord,
extractedPoints,
{ provider: 'gemini', model: 'gemini-2.5-flash' }
);
// Review evaluation
const reviewEval = new ReviewEvaluator();
const reviewResult = await reviewEval.evaluate(
reviewedPoints,
goldenRecord,
{ provider: 'gemini', model: 'gemini-2.5-flash' }
);
Troubleshooting¶
Common Issues¶
- "Transcription service not configured"
  - Ensure ASSEMBLYAI_API_KEY is set
  - Check service initialization in CLI
- "Profile not found"
  - Check configs/audio-generation/profiles/ for available profiles
  - Run npx ts-node src/cli/index.ts generate-audio --verbose to see available profiles
- "Test case not found"
  - Verify the MAG directory exists in test-data/mags/
  - Check that test-case.json and golden-record.json exist
- Low extraction accuracy
  - Review the fuzzy match threshold in config
  - Check that the golden record format matches the expected schema
  - Verify transcript quality
Debug Mode¶
Set NODE_DEBUG=pipeline for detailed logging, e.g.:

NODE_DEBUG=pipeline npx ts-node src/cli/index.ts run --mag MAG_001