E2E Test & Eval Pipeline - Architecture

Overview

This document describes the architecture decisions for the E2E Test & Evaluation Pipeline implemented in FAS-39.

Pipeline Flow

Original Transcript → Golden Record (human-curated)
N Audio Files (varying quality) → AssemblyAI → Transcriptions → EVAL vs Original
Transcript + Context → Gemini Flash → Extraction → EVAL vs Golden Record
Transcript + Extraction → Gemini Flash → Review (confidence + rationale)

Architecture Decisions

1. LangGraph for Orchestration

Decision: Use LangGraph from day 1 for pipeline orchestration.

Rationale:

  • Foundation code, not throwaway: needs to support 3+ providers per element within weeks
  • Simple graphs now, easy to flex later
  • Allows complexity to be added or removed easily
  • Supports conditional flows and checkpointing when needed

Implementation:

  • src/services/pipeline/graphs/ contains individual step graphs
  • full-pipeline-graph.ts orchestrates the complete flow
  • Each node calls abstracted service classes, NOT inline AI code

2. Service Abstraction Pattern

Decision: Abstract all AI services behind interfaces from day 1.

Rationale:

  • Need to swap providers/models without changing orchestration logic
  • Each service handles its own Logfire instrumentation
  • Testable in isolation

Interfaces:

ITranscriptionService   AssemblyAI (or Whisper, Deepgram, etc.)
IExtractionService      Gemini Flash (or Claude, GPT, etc.)
IReviewService          Gemini Flash (or Claude, GPT, etc.)

Implementation: src/services/interfaces.ts
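A minimal sketch of what such interfaces could look like. The method names and result shapes below are illustrative assumptions, not the actual contents of src/services/interfaces.ts; a mock implementation shows how orchestration code stays provider-agnostic:

```typescript
// Hypothetical result shape -- the real types live in src/services/types.ts.
export interface TranscriptionResult {
  text: string;
  provider: string;
}

export interface ITranscriptionService {
  transcribe(audioPath: string): Promise<TranscriptionResult>;
}

export interface IExtractionService {
  extract(transcript: string, context?: string): Promise<Record<string, unknown>>;
}

export interface IReviewService {
  review(
    transcript: string,
    extraction: Record<string, unknown>
  ): Promise<{ confidence: number; rationale: string }>;
}

// A mock provider for isolated tests: orchestration code only sees the interface.
export class MockTranscriptionService implements ITranscriptionService {
  async transcribe(audioPath: string): Promise<TranscriptionResult> {
    return { text: `transcript of ${audioPath}`, provider: 'mock' };
  }
}
```

Swapping AssemblyAI for Whisper or Deepgram then means adding another class that implements ITranscriptionService, with no changes to the pipeline graphs.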

3. Logfire for Observability

Decision: Use Logfire for both application logging AND LLM traces (not LangSmith).

Rationale:

  • Single pane of glass for app logs + LLM traces
  • Reduces tool sprawl
  • Disabled by default, enabled via parameter

Implementation:

  • src/lib/logfire.ts wraps the Logfire SDK
  • Each service and pipeline node uses the trace() wrapper
  • The logfireEnabled parameter controls activation
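The gating behavior can be sketched as follows. This is an assumption about the wrapper's shape, not the real src/lib/logfire.ts; a console logger stands in for the Logfire SDK so the control flow is visible:

```typescript
// Module-level flag mirroring the logfireEnabled parameter described above.
let logfireEnabled = false;

export function setLogfireEnabled(on: boolean): void {
  logfireEnabled = on;
}

// Wrap any async unit of work; a no-op passthrough when observability is off.
export async function trace<T>(name: string, fn: () => Promise<T>): Promise<T> {
  if (!logfireEnabled) return fn();
  const start = Date.now();
  try {
    return await fn();
  } finally {
    // Stand-in for a Logfire span; the real wrapper would emit a trace here.
    console.log(`[trace] ${name} took ${Date.now() - start}ms`);
  }
}
```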

4. Entry at Any Step

Decision: Each pipeline segment must be invocable independently.

Rationale:

  • Test extraction resilience with deliberately mangled transcripts
  • Test review on known-bad extractions
  • Supports various testing scenarios

Implementation:

  • skipTranscription, skipExtraction, skipReview, skipEvaluation flags
  • PipelineInput accepts pre-computed results for any step
  • Individual runTranscription(), runExtraction(), runReview() functions
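A hypothetical shape for PipelineInput, using the flag names above; the field names for pre-computed results are assumptions and may differ from the real src/services/pipeline/types.ts:

```typescript
interface PipelineInput {
  skipTranscription?: boolean;
  skipExtraction?: boolean;
  skipReview?: boolean;
  skipEvaluation?: boolean;
  audioPath?: string;                    // required unless transcription is skipped
  transcript?: string;                   // pre-computed result when skipTranscription is set
  extraction?: Record<string, unknown>;  // pre-computed result when skipExtraction is set
}

// Entering at the extraction step with a deliberately mangled transcript:
const input: PipelineInput = {
  skipTranscription: true,
  transcript: 'buy ten thousand barrels at fourteen point one zero',
};
```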

5. Fuzzy Tolerance for Evaluation

Decision: Use fuzzy matching for extraction evaluation.

Rationale:

  • 14.10 and 14.1 are equivalent
  • bid, Bid, and BID should all match
  • Numeric tolerance is configurable (default: 1%)

Implementation:

  • src/services/evaluation/utils/fuzzy-matcher.ts
  • Levenshtein distance for string similarity
  • Percentage-based tolerance for numeric values
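The two tolerance rules can be sketched in a few lines. This is a minimal illustration of the matching behavior described above, not the real fuzzy-matcher.ts (which also applies Levenshtein distance for near-miss strings):

```typescript
// Numbers match when within a percentage tolerance of the expected value.
function numbersMatch(expected: number, actual: number, tolerancePct = 1): boolean {
  if (expected === 0) return actual === 0;
  return (Math.abs(expected - actual) / Math.abs(expected)) * 100 <= tolerancePct;
}

// Strings match case-insensitively after trimming.
function stringsMatch(a: string, b: string): boolean {
  return a.trim().toLowerCase() === b.trim().toLowerCase();
}

numbersMatch(14.10, 14.1); // true: same numeric value
stringsMatch('Bid', 'BID'); // true: case-insensitive
numbersMatch(100, 102);     // false: 2% error exceeds the 1% default tolerance
```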

6. Test Data Storage

Decision: Files for test cases, Logfire for results (hybrid approach).

Rationale:

  • Test cases in repo = version controlled, PR review of changes
  • Results in Logfire = time-series view, dashboard integration
  • Clear separation of input (repo) vs output (observability)

Implementation:

test-data/
├── suites/               # Suite definitions (YAML)
├── mags/                 # Test cases (JSON + transcript files)
│   └── MAG_001/
│       ├── test-case.json
│       ├── golden-record.json
│       ├── transcripts/
│       └── audio/
├── mangled/              # Edge case tests
└── results/              # Run outputs (JSON)

7. Config Versioning

Decision: Track full configuration with every run for reproducibility.

Rationale:

  • Compare results across different versions
  • Link to application version (git sha)
  • Track prompt changes

Implementation: src/services/utils/config-hasher.ts

Tracked fields:

  • run_id: UUID
  • git_sha: Current commit
  • config_hash: sha256(config)[:8]
  • prompt_hashes: {extraction: sha[:8], review: sha[:8]}
  • test_suite_hash: Hash of all test case files
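The sha256(config)[:8] scheme can be sketched with Node's built-in crypto module. This is an assumption about how config-hasher.ts works (key canonicalization in particular may differ):

```typescript
import { createHash } from 'crypto';

// Truncated sha256, as in config_hash: sha256(config)[:8].
function shortHash(value: string): string {
  return createHash('sha256').update(value).digest('hex').slice(0, 8);
}

function hashConfig(config: Record<string, unknown>): string {
  // Serialize with sorted top-level keys so logically identical
  // configs produce identical hashes regardless of insertion order.
  const canonical = JSON.stringify(config, Object.keys(config).sort());
  return shortHash(canonical);
}
```

Hashing the serialized config rather than storing it verbatim keeps run metadata compact while still letting two runs be compared for exact config equality.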

8. TTS Caching Strategy

Decision: Content-based caching with deterministic folder naming.

Rationale:

  • Same text + voice + parameters should reuse cached audio
  • Different voice configurations must have separate cache folders
  • Avoid redundant ElevenLabs API calls

Implementation:

Cache file naming: {timestamp}_{hash}.mp3

  • Hash is SHA256 of text, voiceId, speed, stability, similarityBoost, and language
  • Ensures deterministic deduplication across runs

TTS folder naming: {provider}_v_{voiceIds}_spd_{speed}_stb_{stability}_sim_{similarity}

  • Example: elevenlabs_v_abc123_def456_spd_1.00_stb_0.50_sim_0.75
  • All parameters included to ensure unique folders per configuration
  • Voice IDs sorted alphabetically for consistency
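Both naming rules can be reconstructed as below. Parameter names mirror the description above; the hash truncation length is an assumption, and the real implementation may serialize the inputs differently:

```typescript
import { createHash } from 'crypto';

interface TtsParams {
  text: string;
  voiceId: string;
  speed: number;
  stability: number;
  similarityBoost: number;
  language: string;
}

// Content-based cache key: identical inputs always produce the same hash.
function cacheFileName(p: TtsParams, timestamp: string): string {
  const hash = createHash('sha256')
    .update(JSON.stringify([p.text, p.voiceId, p.speed, p.stability, p.similarityBoost, p.language]))
    .digest('hex')
    .slice(0, 12); // truncation length is an assumption
  return `${timestamp}_${hash}.mp3`;
}

// Folder name encodes every parameter; voice IDs are sorted for consistency.
function ttsFolderName(
  provider: string,
  voiceIds: string[],
  speed: number,
  stability: number,
  similarity: number
): string {
  const voices = [...voiceIds].sort().join('_');
  return `${provider}_v_${voices}_spd_${speed.toFixed(2)}_stb_${stability.toFixed(2)}_sim_${similarity.toFixed(2)}`;
}

// ttsFolderName('elevenlabs', ['def456', 'abc123'], 1, 0.5, 0.75)
//   → 'elevenlabs_v_abc123_def456_spd_1.00_stb_0.50_sim_0.75'
```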

Voice-only profiles: When a profile has no audio effects (effects: []), the clean TTS audio is copied to the profiles output folder for consistency.

9. Voice Profile Organization

Decision: Organize voice profiles by actual ElevenLabs voice characteristics.

Current voice_accent profiles:

  • voice_female_american - American female speakers
  • voice_female_british - British female speakers
  • voice_female_indian - Indian female speakers
  • voice_male_spanish - Spanish male speakers
  • voice_multi_speaker_diverse - Cross-cultural: Spanish male + Indian female

Profile structure:

id: voice_female_american
name: Voice - Female American
category: voice_accent
tts:
  speaker_voices:
    Speaker 1: <elevenlabs_voice_id>
    Speaker 2: <elevenlabs_voice_id>
effects: []  # Voice-only, no audio degradation

10. CLI-First Invocation

Decision: CLI as the primary invocation method.

Rationale:

  • Can run in GitHub Actions for automated eval on PRs
  • No auth/network considerations
  • Easier to script batch test runs
  • An API can wrap the CLI later if needed

Implementation: src/cli/index.ts

Commands:

  • run - Execute full evaluation pipeline
  • compare - Compare two runs
  • generate-audio - Generate audio variants from transcripts
  • transcribe - Transcribe audio files (AssemblyAI) + evaluate
  • extract - Extract structured data (Gemini) + evaluate
  • review - Review extractions (Gemini confidence/rationale)
  • list - List available test cases and suites

11. Centralized AI Configuration

Decision: All AI model settings centralized in a single file.

Rationale:

  • Single source of truth for model versions
  • Easy to update models across all services
  • Consistent provider naming

Implementation: src/lib/ai-config.ts

export const GEMINI_CONFIG = {
  DEFAULT_MODEL: 'gemini-2.5-flash',
  PROVIDER: 'gemini',
} as const;

export const ASSEMBLYAI_CONFIG = {
  PROVIDER: 'assemblyai',
  DEFAULT_MODEL: 'default',
} as const;

Directory Structure

src/
├── lib/
│   ├── ai-config.ts            # Centralized AI model configuration
│   ├── logfire.ts              # Logfire integration
│   └── ...
├── services/
│   ├── types.ts                # Shared types
│   ├── interfaces.ts           # Service interfaces
│   ├── transcription/          # AssemblyAI implementation
│   │   ├── assemblyai.ts       # AssemblyAI service
│   │   └── transcription-storage.ts  # Results storage
│   ├── extraction/             # Gemini extraction
│   │   ├── gemini.ts           # Gemini extraction service
│   │   └── extraction-storage.ts  # Results storage
│   ├── review/                 # Gemini review
│   │   ├── gemini.ts           # Gemini review service
│   │   └── review-storage.ts   # Results storage
│   ├── audio-generation/       # Audio generation (from audio-pipeline)
│   ├── evaluation/             # Evaluators
│   │   ├── transcription-evaluator.ts
│   │   ├── extraction-evaluator.ts
│   │   ├── review-evaluator.ts
│   │   └── utils/
│   │       ├── wer-calculator.ts
│   │       ├── domain-extractor.ts
│   │       └── fuzzy-matcher.ts
│   ├── pipeline/               # LangGraph orchestration
│   │   ├── graphs/
│   │   │   ├── transcription-graph.ts
│   │   │   ├── extraction-graph.ts
│   │   │   ├── review-graph.ts
│   │   │   └── full-pipeline-graph.ts
│   │   ├── factory.ts
│   │   └── types.ts
│   └── utils/
│       └── config-hasher.ts
└── cli/
    ├── index.ts                # CLI entry point
    ├── config-loader.ts        # Test data loading
    ├── types.ts
    └── commands/
        ├── run.ts              # Full pipeline command
        ├── compare.ts          # Run comparison
        ├── generate-audio.ts   # Audio generation
        ├── transcribe.ts       # Transcription + eval
        ├── extract.ts          # Extraction + eval
        └── review.ts           # Review + confidence

test-data/
├── audio-outputs/              # Generated audio files (gitignored)
│   └── run_xxx/
│       ├── manifest.json       # Run metadata
│       ├── progress.json       # Run progress
│       └── MAG_001/T001/
│           ├── translations/   # Normalized transcripts
│           ├── audio/          # TTS audio (grouped by config)
│           └── profiles/       # Final audio with effects
├── transcription-outputs/      # Transcription results (gitignored)
│   └── transcription_xxx/
│       ├── manifest.json
│       └── MAG_001/T001/profiles/{profile}/
│           ├── transcription.txt
│           ├── evaluation-result.json
│           └── evaluation-comparison.yaml
├── extraction-outputs/         # Extraction results (gitignored)
│   └── extraction_xxx/
│       └── MAG_001/T001/profiles/{profile}/
│           ├── extracted-points.json
│           ├── evaluation-result.json
│           └── evaluation-comparison.yaml
├── review-outputs/             # Review results (gitignored)
│   └── review_xxx/
│       └── MAG_001/T001/profiles/{profile}/
│           └── review-result.json
├── audio-profiles/             # Audio profile definitions
│   ├── bad_connection/         # Connection degradation profiles
│   ├── combined/               # Combined effect profiles
│   ├── noise/                  # Background noise profiles
│   ├── office/                 # Office environment profiles
│   ├── speed/                  # Speech rate profiles
│   ├── telephone/              # Telephone quality profiles
│   └── voice_accent/           # Voice/accent variation profiles
├── mags/                       # Test cases
│   └── MAG_001/
│       ├── test-case.json
│       ├── transcripts/
│       │   └── T001/
│       │       ├── transcript.txt
│       │       └── golden-record.json
│       ├── sound-effects.yaml
│       └── sounds/             # MAG-specific sound files
├── tts-cache/                  # TTS cache (tracked)
│   └── en/
│       └── {timestamp}_{hash}.mp3  # Content-based naming
├── sound-effects/              # Sound effect files (tracked)
├── suites/                     # Test suite definitions
└── mangled/                    # Edge case test transcripts

Evaluation Metrics

Transcription Accuracy

  • WER (Word Error Rate) - Industry standard
  • CER (Character Error Rate)
  • Domain Metrics:
      • Price accuracy
      • Volume accuracy
      • Number accuracy
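WER is the word-level edit distance (insertions + deletions + substitutions) divided by the reference length. A minimal sketch, assuming whitespace tokenization; the real wer-calculator.ts may normalize text differently:

```typescript
// Word Error Rate via dynamic-programming edit distance over word tokens.
function wer(reference: string, hypothesis: string): number {
  const ref = reference.toLowerCase().split(/\s+/).filter(Boolean);
  const hyp = hypothesis.toLowerCase().split(/\s+/).filter(Boolean);
  // d[i][j] = edit distance between ref[0..i) and hyp[0..j).
  const d: number[][] = Array.from({ length: ref.length + 1 }, (_, i) =>
    Array.from({ length: hyp.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0))
  );
  for (let i = 1; i <= ref.length; i++) {
    for (let j = 1; j <= hyp.length; j++) {
      const sub = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      d[i][j] = Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + sub);
    }
  }
  return ref.length === 0 ? 0 : d[ref.length][hyp.length] / ref.length;
}

// wer('sell ten lots at bid', 'sell ten lots at bed')
//   → 0.2 (1 substitution / 5 reference words)
```

CER is the same computation over characters instead of words.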

Extraction Accuracy

  • Precision - Correct extractions / Total extracted
  • Recall - Correct extractions / Total in golden record
  • F1 Score - Harmonic mean
  • Per-field accuracy breakdown
  • False positives/negatives count
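The three headline scores follow directly from the true/false positive/negative counts. A sketch of the arithmetic, not the real extraction-evaluator.ts:

```typescript
// Precision, recall, and F1 from extraction match counts,
// guarding against division by zero on empty runs.
function extractionScores(tp: number, fp: number, fn: number) {
  const precision = tp + fp === 0 ? 0 : tp / (tp + fp);
  const recall = tp + fn === 0 ? 0 : tp / (tp + fn);
  const f1 = precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
  return { precision, recall, f1 };
}

// 8 correct extractions, 2 spurious, 2 missed → precision 0.8, recall 0.8, F1 0.8
```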

Review Quality

  • ECE (Expected Calibration Error)
  • MCE (Maximum Calibration Error)
  • Confidence Correlation - Does confidence predict accuracy?
  • Over/Underconfidence rates
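ECE buckets predictions by stated confidence, compares each bucket's mean confidence to its observed accuracy, and weights the gaps by bucket size. A sketch of the standard equal-width-binning formulation, assuming confidences in [0, 1]; the real review-evaluator.ts may bin differently:

```typescript
interface Prediction {
  confidence: number; // stated confidence in [0, 1]
  correct: boolean;   // whether the reviewed extraction was actually right
}

function ece(preds: Prediction[], bins = 10): number {
  let total = 0;
  for (let b = 0; b < bins; b++) {
    const lo = b / bins;
    const hi = (b + 1) / bins;
    // Half-open bins (lo, hi]; confidence 0 falls into the first bin.
    const inBin = preds.filter(
      (p) => (p.confidence > lo && p.confidence <= hi) || (b === 0 && p.confidence === 0)
    );
    if (inBin.length === 0) continue;
    const avgConf = inBin.reduce((s, p) => s + p.confidence, 0) / inBin.length;
    const accuracy = inBin.filter((p) => p.correct).length / inBin.length;
    // Weight the calibration gap by the fraction of predictions in this bin.
    total += (inBin.length / preds.length) * Math.abs(avgConf - accuracy);
  }
  return total;
}
```

MCE is the same per-bin gap, but taking the maximum over bins instead of the weighted sum.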

Out of Scope

Per the FAS-39 ticket, the following are out of scope:

  • Dashboard UI (separate ticket)
  • Production hardening
  • Assessment stage (separate from Review)
  • Profile Management System (own ticket)
  • Test Weighting System (dashboard ticket)
  • Multi-run averaging (dashboard ticket)

References

  • FAS-39 Linear ticket
  • FAS-39/call-transcript.txt - Original requirements discussion
  • FAS-39/content.md - Ticket description