E2E Eval Pipeline - Usage Guide

Prerequisites

  1. Environment Variables:

    ASSEMBLYAI_API_KEY=your_assemblyai_key
    GEMINI_API_KEY=your_gemini_key
    ELEVENLABS_API_KEY=your_elevenlabs_key  # For audio generation
    

  2. Dependencies:

    npm install
    

CLI Commands

List Available Test Cases

npx ts-node src/cli/index.ts list

Output:

Available MAGs:
  - MAG_001

Available Suites:
  - default: Default test suite with all standard MAGs
    Test cases: MAG_001

Run Evaluation Pipeline

Run all test cases:

npx ts-node src/cli/index.ts run --all

Run specific MAG:

npx ts-node src/cli/index.ts run --mag MAG_001

Run specific step only:

# Extraction only (skips transcription)
npx ts-node src/cli/index.ts run --mag MAG_001 --step extraction

# Review only (runs extraction first, then review)
npx ts-node src/cli/index.ts run --mag MAG_001 --step review

Run with verbose output:

npx ts-node src/cli/index.ts run --all --verbose

Running from a Specific Step

The pipeline supports entry at any step, allowing you to:

- Test extraction with pre-existing or deliberately mangled transcripts
- Test review with known extraction results
- Skip expensive transcription when iterating on extraction/review

Pipeline Steps

┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  Transcription  │────▶│   Extraction    │────▶│     Review      │────▶│   Evaluation    │
│  (audio→text)   │     │ (text→struct)   │     │ (confidence)    │     │ (vs golden)     │
└─────────────────┘     └─────────────────┘     └─────────────────┘     └─────────────────┘
      ▲                       ▲                       ▲                       ▲
      │                       │                       │                       │
 audioPath              transcript               extraction             goldenRecord
                         (skip)                   (skip)                  (skip)

Entry Points

| Start From  | Required Input          | Skipped Steps                         | Use Case                                |
| ----------- | ----------------------- | ------------------------------------- | --------------------------------------- |
| Audio       | audioPath               | None                                  | Full E2E test                           |
| Transcript  | transcript              | Transcription                         | Test extraction with known/mangled text |
| Extraction  | transcript + extraction | Transcription, Extraction             | Test review only                        |
| Review only | -                       | Transcription, Extraction, Evaluation | Just run the review step                |

CLI Usage

1. Full Pipeline (from audio):

# Requires audio files in test-data/mags/MAG_001/audio/
npx ts-node src/cli/index.ts run --mag MAG_001

2. Start from Transcript (skip transcription):

# Uses transcript from test-data/mags/MAG_001/transcripts/original.txt
# Skips transcription step entirely
npx ts-node src/cli/index.ts run --mag MAG_001 --step extraction

3. Run Review Only:

# Runs extraction first, then review (no transcription)
npx ts-node src/cli/index.ts run --mag MAG_001 --step review

Programmatic Usage

Full Pipeline:

import { createPipelineExecutor } from '@/services/pipeline/index.js';

const executor = createPipelineExecutor(services, config);

// Start from audio - runs all steps
const result = await executor.run({
  testCaseId: 'MAG_001',
  audioPath: 'path/to/audio.mp3',
  goldenRecord: goldenRecord,
});

Start from Transcript (skip transcription):

// Provide transcript instead of audioPath
// Pipeline automatically skips transcription
const result = await executor.run({
  testCaseId: 'MAG_001',
  transcript: 'Speaker 1: The cobalt price is $14.10...',
  goldenRecord: goldenRecord,
  // Optional: explicitly skip transcription
  skipTranscription: true,
});

Start from Extraction (test review only):

// Provide both transcript and pre-computed extraction
const result = await executor.run({
  testCaseId: 'MAG_001',
  transcript: 'Speaker 1: The cobalt price is $14.10...',
  extraction: {
    extractedPoints: [...],
    provider: 'gemini',
    model: 'gemini-2.5-flash',
  },
  goldenRecord: goldenRecord,
  skipTranscription: true,
  skipExtraction: true,
});

Run Individual Steps:

// Transcription only
const transcription = await executor.transcribe('path/to/audio.mp3', 'test-case-id');

// Extraction only
const extraction = await executor.extract(
  'transcript text...',
  { market: 'Cobalt', methodology: 'assessment' },
  'test-case-id'
);

// Review only (requires extraction result)
const review = await executor.review(
  'transcript text...',
  extractionResult,
  'test-case-id'
);

Testing with Mangled Transcripts

A key use case is testing extraction resilience with deliberately corrupted transcripts:

1. Create mangled transcript:

test-data/mangled/garbled-numbers/transcript.txt
Speaker 1: The price came in at... [static]... teen dollars.
Speaker 2: Sorry, I didn't catch that. Did you say four-teen?

2. Run extraction on it:

import { readFileSync } from 'fs';

const mangledTranscript = readFileSync(
  'test-data/mangled/garbled-numbers/transcript.txt',
  'utf-8'
);

// Test how well extraction handles corrupted input
const result = await executor.run({
  testCaseId: 'garbled-numbers-test',
  transcript: mangledTranscript,
  goldenRecord: goldenRecord, // Compare against expected output
  skipTranscription: true,
});

// Check extraction accuracy with degraded input
console.log('Extraction accuracy:', result.evaluationResults?.extraction?.accuracy);

Skip Flags Reference

| Flag              | Effect                                    |
| ----------------- | ----------------------------------------- |
| skipTranscription | Skip audio→text, use provided transcript  |
| skipExtraction    | Skip text→struct, use provided extraction |
| skipReview        | Skip confidence assessment                |
| skipEvaluation    | Skip comparison with golden record        |

Auto-skip behavior:

- If transcript is provided but no audioPath, transcription is auto-skipped
- If extraction is provided, extraction is auto-skipped
- If no goldenRecord is provided, evaluation is auto-skipped
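The auto-skip rules above can be sketched as a small resolver. This is an illustrative sketch, not the executor's actual implementation: the PipelineInput shape mirrors the fields used in the examples in this guide, and resolveSkipFlags is a hypothetical name.

```typescript
// Hypothetical sketch of the auto-skip rules. The real executor's
// types and resolution logic may differ.
interface PipelineInput {
  testCaseId: string;
  audioPath?: string;
  transcript?: string;
  extraction?: unknown;
  goldenRecord?: unknown;
  skipTranscription?: boolean;
  skipExtraction?: boolean;
  skipEvaluation?: boolean;
}

interface ResolvedSkips {
  transcription: boolean;
  extraction: boolean;
  evaluation: boolean;
}

function resolveSkipFlags(input: PipelineInput): ResolvedSkips {
  return {
    // Transcript provided but no audio → transcription auto-skipped
    transcription:
      input.skipTranscription ?? (!!input.transcript && !input.audioPath),
    // Pre-computed extraction provided → extraction auto-skipped
    extraction: input.skipExtraction ?? input.extraction !== undefined,
    // No golden record → nothing to evaluate against
    evaluation: input.skipEvaluation ?? input.goldenRecord === undefined,
  };
}
```

Explicit flags always win; the auto-detection only fills in what you leave unset.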

Example: A/B Testing Extraction Approaches

// Same transcript, different extraction configs
const transcript = readFileSync('transcript.txt', 'utf-8');

// Approach A: Gemini Flash
const resultA = await executor.run({
  testCaseId: 'approach-a',
  transcript,
  goldenRecord,
  skipTranscription: true,
});

// Approach B: Different prompt/model (swap service)
const executorB = createPipelineExecutor({
  ...services,
  extraction: new ClaudeExtractionService({ model: 'claude-sonnet-4' }),
}, config);

const resultB = await executorB.run({
  testCaseId: 'approach-b',
  transcript,
  goldenRecord,
  skipTranscription: true,
});

// Compare results
console.log('Gemini accuracy:', resultA.evaluationResults?.extraction?.accuracy);
console.log('Claude accuracy:', resultB.evaluationResults?.extraction?.accuracy);

Compare Two Runs

npx ts-node src/cli/index.ts compare <run1-id> <run2-id>

Output formats:

# Table format (default)
npx ts-node src/cli/index.ts compare abc123 def456

# JSON format
npx ts-node src/cli/index.ts compare abc123 def456 --format json

# CSV format
npx ts-node src/cli/index.ts compare abc123 def456 --format csv

# Show only differences
npx ts-node src/cli/index.ts compare abc123 def456 --diffs-only

Generate Audio Variants

# Generate for all MAGs with all profiles
npx tsx src/cli/index.ts generate-audio --all --profiles all

# Generate for specific MAG and test case
npx tsx src/cli/index.ts generate-audio --mag MAG_001_T001 --profiles all

# Specific profiles only
npx tsx src/cli/index.ts generate-audio --mag MAG_001_T001 --profiles clean noise_high voice_female_american

Output Structure:

test-data/audio-outputs/run_xxx/
├── manifest.json                    # Run metadata
├── progress.json                    # Run progress/status
└── MAG_001/T001/
    ├── translations/
    │   ├── MAG_001_T001_source.txt  # Original transcript
    │   └── MAG_001_T001_en.txt      # Normalized transcript
    ├── audio/
    │   └── elevenlabs_v_{voiceIds}_spd_{speed}_stb_{stab}_sim_{sim}/
    │       ├── MAG_001_T001_en.mp3          # Combined TTS audio
    │       ├── MAG_001_T001_en_segments/    # Per-speaker segments
    │       └── tts-config.yaml              # TTS parameters used
    └── profiles/
        └── {profile_name}/
            ├── MAG_001_T001_en_effects.mp3         # Audio with effects
            └── MAG_001_T001_en_effects_segments/   # Segment variations

Audio Profile Categories (49 profiles):

| Category       | Profiles | Description                         |
| -------------- | -------- | ----------------------------------- |
| bad_connection | 9        | Packet loss, jitter, distortion     |
| noise          | 4        | Background noise at various levels  |
| combined       | 5        | Multiple effects combined           |
| speed          | 5        | Speech rate variations (0.7x-1.5x)  |
| telephone      | 4        | Telephone audio quality             |
| office         | 10       | Office environment + effects        |
| voice_accent   | 5        | Different accents and speakers      |
| Other          | 7        | Business call, variable effects     |

Voice Accent Profiles:

- voice_female_american - American female speakers
- voice_female_british - British female speakers
- voice_female_indian - Indian female speakers
- voice_male_spanish - Spanish male speakers
- voice_multi_speaker_diverse - Spanish male + Indian female (cross-cultural)

Transcribe Audio Files

# Transcribe from latest audio run
npx tsx src/cli/index.ts transcribe --verbose

# Transcribe from specific audio run
npx tsx src/cli/index.ts transcribe --audio-run run_001_20260125_181823 --verbose

# Transcribe specific MAG only
npx tsx src/cli/index.ts transcribe --mag MAG_001_T001 --verbose

# Transcribe specific profiles
npx tsx src/cli/index.ts transcribe --profiles clean noisy --verbose

Output: test-data/transcription-outputs/transcription_YYYYMMDD_HHMMSS/

Transcription Output Structure:

transcription_xxx/MAG_001/T001/profiles/{profile}/
├── transcription.txt              # AssemblyAI output text
├── transcription-result.json      # Full API response
├── transcription-metadata.yaml    # Provider, model, timing
├── original-transcript.txt        # Original for comparison
├── evaluation-result.json         # WER, CER, domain metrics
└── evaluation-comparison.yaml     # Line-by-line diff
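For intuition, the WER figure in evaluation-result.json is word-level edit distance divided by the reference word count. A minimal sketch follows; the real evaluator may tokenize, case-fold, and normalize punctuation differently.

```typescript
// Levenshtein edit distance over token arrays (dynamic programming).
function editDistance<T>(a: T[], b: T[]): number {
  const dp = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) =>
      i === 0 ? j : j === 0 ? i : 0
    )
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1, // deletion
        dp[i][j - 1] + 1, // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1) // substitution
      );
    }
  }
  return dp[a.length][b.length];
}

// WER = word-level edits / reference word count. CER is the same idea
// computed over characters instead of words.
function wordErrorRate(reference: string, hypothesis: string): number {
  const ref = reference.trim().split(/\s+/);
  const hyp = hypothesis.trim().split(/\s+/);
  return editDistance(ref, hyp) / ref.length;
}
```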

Extract Structured Data

# Extract from latest transcription run
npx tsx src/cli/index.ts extract --verbose

# Extract from specific transcription run
npx tsx src/cli/index.ts extract --transcription-run transcription_20260126_022028 --verbose

# Extract from original transcripts (skip transcription step)
npx tsx src/cli/index.ts extract --use-original --verbose

# Extract specific MAG only
npx tsx src/cli/index.ts extract --mag MAG_001_T001 --verbose

# Override AI provider (default: gemini; set via --provider or EXTRACTION_PROVIDER env var)
npx tsx src/cli/index.ts extract --provider gemini --verbose

Output: test-data/extraction-outputs/extraction_YYYYMMDD_HHMMSS/

Extraction Output Structure:

extraction_xxx/MAG_001/T001/profiles/{profile}/
├── extracted-points.json          # Extracted price points
├── extraction-result.json         # Full Gemini response
├── extraction-metadata.yaml       # Model, provider, timing
├── evaluation-result.json         # Precision, recall, F1 vs golden
└── evaluation-comparison.yaml     # Point-by-point comparison
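The precision, recall, and F1 figures in evaluation-result.json follow the standard definitions over matched price points. The sketch below uses hypothetical counts; in the actual pipeline, "matched" is determined by a fuzzy comparison governed by matchThreshold in the config.

```typescript
// Standard precision/recall/F1 over point counts:
//   precision = matched / extracted (how many extracted points are right)
//   recall    = matched / golden    (how many golden points were found)
//   F1        = harmonic mean of the two
function extractionScores(matched: number, extracted: number, golden: number) {
  const precision = extracted === 0 ? 0 : matched / extracted;
  const recall = golden === 0 ? 0 : matched / golden;
  const f1 =
    precision + recall === 0
      ? 0
      : (2 * precision * recall) / (precision + recall);
  return { precision, recall, f1 };
}
```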

Review Extractions

# Review from latest extraction run
npx tsx src/cli/index.ts review --verbose

# Review from specific extraction run
npx tsx src/cli/index.ts review --extraction-run extraction_20260126_025447 --verbose

# Review specific MAG only
npx tsx src/cli/index.ts review --mag MAG_001_T001 --verbose

# Override AI provider (default: gemini; set via --provider or REVIEW_PROVIDER env var)
npx tsx src/cli/index.ts review --provider gemini --verbose

Output: test-data/review-outputs/review_YYYYMMDD_HHMMSS/

Review Output Structure:

review_xxx/MAG_001/T001/profiles/{profile}/
├── review-result.json             # Confidence scores + rationale
└── review-metadata.yaml           # Model, provider, timing

Review Result Schema:

{
  "overallConfidence": 0.85,
  "pointReviews": [
    {
      "pointIndex": 0,
      "confidence": 1.0,
      "issues": [],
      "suggestions": []
    },
    {
      "pointIndex": 1,
      "confidence": 0.65,
      "issues": ["Source text appears garbled"],
      "suggestions": ["Flag as low quality"]
    }
  ],
  "rationale": "Overall assessment explanation...",
  "model": "gemini-2.5-flash",
  "provider": "gemini"
}
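Mirroring the schema above as TypeScript types makes post-processing straightforward, for example flagging low-confidence points. The types below are transcribed from the example JSON; lowConfidencePoints and its 0.7 threshold are illustrative choices, not pipeline defaults.

```typescript
// Types transcribed from the review-result.json example above.
interface PointReview {
  pointIndex: number;
  confidence: number;
  issues: string[];
  suggestions: string[];
}

interface ReviewResult {
  overallConfidence: number;
  pointReviews: PointReview[];
  rationale: string;
  model: string;
  provider: string;
}

// Hypothetical helper: collect points below a confidence threshold.
function lowConfidencePoints(result: ReviewResult, threshold = 0.7): PointReview[] {
  return result.pointReviews.filter((p) => p.confidence < threshold);
}
```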

Validate Assessment Runs (Post-hoc)

Test the assessment validator against existing assessment output files without re-running the full pipeline:

# Validate one or more assessment runs
npx tsx src/cli/index.ts validate-runs \
  --runs assessment_20260226_204618 assessment_20260226_204630 \
  --mag cobalt-london --tc AT-0002 --provider gemini

# With verbose output and rate-limit delay
npx tsx src/cli/index.ts validate-runs \
  --runs assessment_20260226_204618 \
  --verbose --delay 500

# Override the Gemini model used for LLM review
npx tsx src/cli/index.ts validate-runs \
  --runs assessment_20260226_204618 \
  --model gemini-3-flash-preview

Runs are resolved from test-data/assessment-outputs/ by ID, or you can pass a full path.

Adding Test Cases

1. Create MAG Directory

mkdir -p test-data/mags/MAG_002/transcripts
mkdir -p test-data/mags/MAG_002/audio

2. Create Test Case Metadata

test-data/mags/MAG_002/test-case.json:

{
  "id": "MAG_002",
  "market": "Steel",
  "language": "en",
  "description": "Steel rebar market conversation",
  "tags": ["steel", "metals"],
  "createdAt": "2025-01-23T00:00:00.000Z",
  "updatedAt": "2025-01-23T00:00:00.000Z"
}

3. Create Golden Record

test-data/mags/MAG_002/golden-record.json:

{
  "testCaseId": "MAG_002",
  "sourceTranscript": "Speaker 1: The steel rebar price...",
  "extractedPoints": [
    {
      "market": "Steel Rebar",
      "priceValue": 450.00,
      "volume": 100,
      "volumeUnit": "tonnes",
      "type": "transaction",
      "sourceReferences": [
        {
          "lineStart": 1,
          "lineEnd": 1,
          "text": "The steel rebar price was $450 per tonne for 100 tonnes."
        }
      ]
    }
  ]
}
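When the evaluator compares an extracted priceValue against the golden record, small numeric differences are tolerated (numericTolerance, 0.01 in the executor config shown under Programmatic Usage). Whether that tolerance is applied as a relative or an absolute bound is an assumption in this sketch; check the evaluator before relying on it.

```typescript
// Sketch of tolerant numeric matching against a golden record value.
// ASSUMPTION: the tolerance is relative to the golden value; the real
// evaluator may apply it absolutely instead.
function numericMatch(golden: number, extracted: number, tolerance = 0.01): boolean {
  if (golden === 0) return Math.abs(extracted) <= tolerance;
  return Math.abs(extracted - golden) / Math.abs(golden) <= tolerance;
}
```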

4. Add Transcript

test-data/mags/MAG_002/transcripts/original.txt:

Speaker 1: The steel rebar price was $450 per tonne for 100 tonnes.
Speaker 2: That matches our expectations for the current market.

5. Update Suite Definition

Add to test-data/suites/suite-definitions.yaml:

- name: default
  description: Default test suite with all standard MAGs
  testCases:
    - MAG_001
    - MAG_002

6. Generate Audio (Optional)

npx ts-node src/cli/index.ts generate-audio --mag MAG_002

Viewing Results

Run Results Location

Results are stored in test-data/results/{run_id}/:

test-data/results/abc12345-1234-5678-9abc-def012345678/
├── run-config.json    # Versioning info (git sha, config hash, etc.)
├── summary.json       # Aggregated metrics
└── mags/
    └── MAG_001.json   # Per-MAG detailed results

Run Config Schema

{
  "runId": "abc12345-1234-5678-9abc-def012345678",
  "gitSha": "5172ed4...",
  "timestamp": "2025-01-23T12:00:00.000Z",
  "configHash": "a1b2c3d4",
  "promptHashes": {
    "extraction": "e5f6g7h8",
    "review": "i9j0k1l2"
  },
  "testSuiteHash": "m3n4o5p6",
  "services": {
    "transcription": { "provider": "assemblyai", "model": "best" },
    "extraction": { "provider": "gemini", "model": "gemini-2.5-flash" },
    "review": { "provider": "gemini", "model": "gemini-2.5-flash" }
  }
}
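The short hashes (configHash, promptHashes, testSuiteHash) pin down exactly what was run so results stay comparable across runs. One plausible way to produce such a hash is shown below; the pipeline's actual scheme, including how it canonicalizes key order, may differ.

```typescript
import { createHash } from 'node:crypto';

// Illustrative: hash a JSON encoding and keep an 8-character prefix,
// matching the shape of the values in run-config.json.
function shortHash(value: unknown): string {
  return createHash('sha256').update(JSON.stringify(value)).digest('hex').slice(0, 8);
}
```

Note that JSON.stringify is sensitive to key order, so a real implementation would canonicalize the object first to keep hashes reproducible.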

Summary Schema

{
  "runId": "abc12345...",
  "timestamp": "2025-01-23T12:00:00.000Z",
  "totalTestCases": 1,
  "passed": 1,
  "failed": 0,
  "errors": 0,
  "metrics": {
    "transcription": {
      "avgWer": 0.05,
      "avgCer": 0.02
    },
    "extraction": {
      "avgAccuracy": 0.95,
      "avgPrecision": 0.90,
      "avgRecall": 0.95,
      "avgF1": 0.92
    },
    "review": {
      "avgConfidenceCorrelation": 0.85,
      "avgCalibrationError": 0.08
    }
  },
  "durationMs": 45000
}
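The counters in summary.json can be typed and aggregated directly. The types below mirror the fields used above; treating passed / totalTestCases as the headline number is an illustrative choice.

```typescript
// Counters transcribed from the summary.json example above.
interface RunSummary {
  totalTestCases: number;
  passed: number;
  failed: number;
  errors: number;
}

// Hypothetical helper: fraction of test cases that passed.
function passRate(summary: RunSummary): number {
  return summary.totalTestCases === 0 ? 0 : summary.passed / summary.totalTestCases;
}
```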

Logfire Integration

When Logfire is enabled, all pipeline runs are traced:

  1. Enable it in configs/eval/default.yaml:

    pipeline:
      logfire_enabled: true
    

  2. View traces in the Logfire dashboard:

    - Pipeline execution spans
    - Individual step durations
    - LLM call details
    - Evaluation metrics

Programmatic Usage

Using the Pipeline Directly

import {
  createPipelineExecutor,
  type PipelineServicesConfig,
} from '@/services/pipeline/index.js';
import { AssemblyAITranscriptionService } from '@/services/transcription/assemblyai.js';
import { GeminiExtractionService } from '@/services/extraction/gemini.js';
import { GeminiReviewService } from '@/services/review/gemini.js';

// Create services
const services: PipelineServicesConfig = {
  transcription: new AssemblyAITranscriptionService({ logfireEnabled: false }),
  extraction: new GeminiExtractionService({ logfireEnabled: false }),
  review: new GeminiReviewService({ logfireEnabled: false }),
};

// Create executor
const executor = createPipelineExecutor(services, {
  logfireEnabled: false,
  evaluation: {
    numericTolerance: 0.01,
    matchThreshold: 0.8,
  },
});

// Run full pipeline
const result = await executor.run({
  testCaseId: 'MAG_001',
  audioPath: 'test-data/mags/MAG_001/audio/clean.mp3',
  goldenRecord: goldenRecord,
});

// Or run individual steps
const transcription = await executor.transcribe('path/to/audio.mp3');
const extraction = await executor.extract('transcript text...', { market: 'Cobalt' });
const review = await executor.review('transcript text...', extractionResult);

Using Individual Evaluators

import {
  TranscriptionEvaluator,
  ExtractionEvaluator,
  ReviewEvaluator,
} from '@/services/evaluation/index.js';

// Transcription evaluation
const transcriptionEval = new TranscriptionEvaluator({ numericTolerance: 0.01 });
const werResult = await transcriptionEval.evaluate(
  originalTranscript,
  aiTranscript,
  { provider: 'assemblyai', model: 'best' }
);

// Extraction evaluation
const extractionEval = new ExtractionEvaluator({ matchThreshold: 0.8 });
const extractionResult = await extractionEval.evaluate(
  goldenRecord,
  extractedPoints,
  { provider: 'gemini', model: 'gemini-2.5-flash' }
);

// Review evaluation
const reviewEval = new ReviewEvaluator();
const reviewResult = await reviewEval.evaluate(
  reviewedPoints,
  goldenRecord,
  { provider: 'gemini', model: 'gemini-2.5-flash' }
);

Troubleshooting

Common Issues

  1. "Transcription service not configured"

    - Ensure ASSEMBLYAI_API_KEY is set
    - Check service initialization in the CLI

  2. "Profile not found"

    - Check configs/audio-generation/profiles/ for available profiles
    - Run npx ts-node src/cli/index.ts generate-audio --verbose to see available profiles

  3. "Test case not found"

    - Verify the MAG directory exists in test-data/mags/
    - Check that test-case.json and golden-record.json exist

  4. Low extraction accuracy

    - Review the fuzzy match threshold in the config
    - Check that the golden record format matches the expected schema
    - Verify transcript quality
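For intuition on what a fuzzy match threshold does, here is a simple token-overlap (Jaccard) similarity checked against the default matchThreshold of 0.8. The pipeline's real matcher is not shown in this guide and likely differs.

```typescript
// Jaccard similarity over whitespace tokens: |A ∩ B| / |A ∪ B|.
// Purely illustrative — not the pipeline's matching algorithm.
function tokenSimilarity(a: string, b: string): number {
  const ta = new Set(a.toLowerCase().split(/\s+/).filter(Boolean));
  const tb = new Set(b.toLowerCase().split(/\s+/).filter(Boolean));
  if (ta.size === 0 && tb.size === 0) return 1;
  let shared = 0;
  for (const t of ta) if (tb.has(t)) shared++;
  return shared / (ta.size + tb.size - shared);
}

function isMatch(a: string, b: string, threshold = 0.8): boolean {
  return tokenSimilarity(a, b) >= threshold;
}
```

Raising the threshold makes matching stricter (fewer matches, higher precision); lowering it is more forgiving of transcript noise.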

Debug Mode

Set NODE_DEBUG=pipeline for detailed logging:

NODE_DEBUG=pipeline npx ts-node src/cli/index.ts run --mag MAG_001 --verbose