E2E Test & Eval Pipeline - Architecture
Overview
This document describes the architecture decisions for the E2E Test & Evaluation Pipeline implemented in FAS-39.
Pipeline Flow
Original Transcript → Golden Record (human-curated)
↓
N Audio Files (varying quality) → AssemblyAI → Transcriptions → EVAL vs Original
↓
Transcript + Context → Gemini Flash → Extraction → EVAL vs Golden Record
↓
Transcript + Extraction → Gemini Flash → Review (confidence + rationale)
Architecture Decisions
1. LangGraph for Orchestration
Decision: Use LangGraph from day 1 for pipeline orchestration.
Rationale:
- Foundation code, not throwaway: needs to support 3+ providers per element within weeks
- Simple graphs now, easy to flex later
- Allows complexity to be added or removed easily
- Supports conditional flows and checkpointing when needed
Implementation:
- src/services/pipeline/graphs/ contains individual step graphs
- full-pipeline-graph.ts orchestrates the complete flow
- Each node calls abstracted service classes, NOT inline AI code
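The node-calls-a-service pattern can be sketched without the LangGraph dependency. This is a library-free illustration, not the real graph code: the state shape, `makeExtractionNode`, and the sequential `runGraph` runner are all hypothetical stand-ins for the compiled LangGraph flow.

```typescript
// Hypothetical sketch of the node pattern: each pipeline "node" is a small
// async function that delegates to an injected service, mirroring how the
// graphs in src/services/pipeline/graphs/ call abstracted service classes
// rather than embedding AI calls inline. All names here are illustrative.

interface PipelineState {
  transcript?: string;
  extraction?: Record<string, unknown>;
}

type Node = (state: PipelineState) => Promise<PipelineState>;

// A node closes over a service function and returns updated state.
const makeExtractionNode = (
  extract: (t: string) => Promise<Record<string, unknown>>
): Node => async (state) => ({
  ...state,
  extraction: await extract(state.transcript ?? ""),
});

// Trivial sequential runner standing in for the compiled graph.
async function runGraph(nodes: Node[], initial: PipelineState): Promise<PipelineState> {
  let state = initial;
  for (const node of nodes) state = await node(state);
  return state;
}
```

Because each node is just state-in/state-out, swapping the extraction provider means passing a different service function; the orchestration shape is untouched.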
2. Service Abstraction Pattern
Decision: Abstract all AI services behind interfaces from day 1.
Rationale:
- Need to swap providers/models without changing orchestration logic
- Each service handles its own Logfire instrumentation
- Testable in isolation
Interfaces:
ITranscriptionService → AssemblyAI (or Whisper, Deepgram, etc.)
IExtractionService → Gemini Flash (or Claude, GPT, etc.)
IReviewService → Gemini Flash (or Claude, GPT, etc.)
Implementation: src/services/interfaces.ts
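The interface seam might look like the following. This is a sketch only; the actual definitions live in src/services/interfaces.ts, and the result field names (`text`, `confidence`, `field`, `value`, `rationale`) are assumptions.

```typescript
// Sketch of the service abstraction seam. Field names are assumed, not
// copied from src/services/interfaces.ts.

interface TranscriptionResult { text: string; confidence: number }
interface ITranscriptionService {
  transcribe(audioPath: string): Promise<TranscriptionResult>;
}

interface ExtractedPoint { field: string; value: string }
interface IExtractionService {
  extract(transcript: string): Promise<ExtractedPoint[]>;
}

interface IReviewService {
  review(transcript: string, points: ExtractedPoint[]):
    Promise<{ confidence: number; rationale: string }>;
}

// Swapping providers means swapping the class, not the orchestration.
// A fake implementation also makes the pipeline testable in isolation:
class FakeTranscriptionService implements ITranscriptionService {
  async transcribe(audioPath: string): Promise<TranscriptionResult> {
    return { text: `stub transcript for ${audioPath}`, confidence: 1 };
  }
}
```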
3. Logfire for Observability
Decision: Use Logfire for both application logging AND LLM traces (not LangSmith).
Rationale:
- Single pane of glass for app logs + LLM traces
- Reduces tool sprawl
- Disabled by default, enabled via parameter
Implementation:
- src/lib/logfire.ts wraps the Logfire SDK
- Each service and pipeline node uses the trace() wrapper
- logfireEnabled parameter controls activation
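The wrapper-plus-toggle pattern can be illustrated as follows. This is not the real src/lib/logfire.ts; the `emit()` sink stands in for the Logfire SDK, and `makeTrace` is a hypothetical factory.

```typescript
// Illustrative trace() wrapper: runs the operation inside a span when
// logfireEnabled is true, and is a transparent pass-through otherwise.
// The emit() callback stands in for the actual Logfire SDK.

type Emit = (span: { name: string; ms: number; ok: boolean }) => void;

function makeTrace(logfireEnabled: boolean, emit: Emit) {
  return async function trace<T>(name: string, fn: () => Promise<T>): Promise<T> {
    if (!logfireEnabled) return fn(); // disabled by default: zero overhead
    const start = Date.now();
    try {
      const result = await fn();
      emit({ name, ms: Date.now() - start, ok: true });
      return result;
    } catch (err) {
      emit({ name, ms: Date.now() - start, ok: false }); // record failures too
      throw err;
    }
  };
}
```

Returning the wrapped function's result unchanged is what lets every service and node adopt the wrapper without altering its own return types.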
4. Entry at Any Step
Decision: Each pipeline segment must be invocable independently.
Rationale:
- Test extraction resilience with deliberately mangled transcripts
- Test review on known-bad extractions
- Supports various testing scenarios
Implementation:
- skipTranscription, skipExtraction, skipReview, skipEvaluation flags
- PipelineInput accepts pre-computed results for any step
- Individual runTranscription(), runExtraction(), runReview() functions
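A minimal sketch of how the skip flags and pre-computed inputs could determine which steps run. The `PipelineInput` shape shown here is an assumption; only the flag names come from the document.

```typescript
// Hypothetical step-selection logic: a step is skipped when its flag is set,
// and transcription is also skipped when a pre-computed transcript is supplied.
// The real PipelineInput type covers all four steps; two suffice to illustrate.

interface PipelineInput {
  audioPath?: string;
  transcript?: string;          // pre-computed result: implies skipping transcription
  skipTranscription?: boolean;
  skipExtraction?: boolean;
}

function stepsToRun(input: PipelineInput): string[] {
  const steps: string[] = [];
  if (!input.skipTranscription && !input.transcript) steps.push("transcription");
  if (!input.skipExtraction) steps.push("extraction");
  return steps;
}
```

Passing a deliberately mangled `transcript` enters the pipeline directly at extraction, which is exactly the resilience-testing scenario above.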
5. Fuzzy Tolerance for Evaluation
Decision: Use fuzzy matching for extraction evaluation.
Rationale:
- 14.10 and 14.1 are equivalent
- bid, Bid, BID should all match
- Numeric tolerance configurable (default: 1%)
Implementation:
- src/services/evaluation/utils/fuzzy-matcher.ts
- Levenshtein distance for string similarity
- Percentage-based tolerance for numeric values
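The two matching rules can be sketched as below. This is a minimal illustration of the approach; the shipped fuzzy-matcher.ts may normalize differently or use other thresholds.

```typescript
// Classic dynamic-programming Levenshtein distance over characters.
function levenshtein(a: string, b: string): number {
  const dp = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0))
  );
  for (let i = 1; i <= a.length; i++)
    for (let j = 1; j <= b.length; j++)
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,                                   // deletion
        dp[i][j - 1] + 1,                                   // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1)  // substitution
      );
  return dp[a.length][b.length];
}

// Percentage-based numeric tolerance (default 1%, matching the doc).
function numbersMatch(expected: number, actual: number, tolerance = 0.01): boolean {
  if (expected === 0) return actual === 0;
  return Math.abs(expected - actual) / Math.abs(expected) <= tolerance;
}

// Case-insensitive string match within a small edit distance.
function stringsMatch(a: string, b: string, maxDistance = 1): boolean {
  return levenshtein(a.toLowerCase(), b.toLowerCase()) <= maxDistance;
}
```

Note that 14.10 and 14.1 match under this scheme because they parse to the same number, so their relative difference is zero.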
6. Test Data Storage
Decision: Files for test cases, Logfire for results (hybrid approach).
Rationale:
- Test cases in repo = version controlled, PR review of changes
- Results in Logfire = time-series view, dashboard integration
- Clear separation of input (repo) vs output (observability)
Implementation:
test-data/
├── suites/ # Suite definitions (YAML)
├── mags/ # Test cases (JSON + transcript files)
│ └── MAG_001/
│ ├── test-case.json
│ ├── golden-record.json
│ ├── transcripts/
│ └── audio/
├── mangled/ # Edge case tests
└── results/ # Run outputs (JSON)
7. Config Versioning
Decision: Track full configuration with every run for reproducibility.
Rationale:
- Compare results across different versions
- Link to the application version (git sha)
- Track prompt changes
Implementation: src/services/utils/config-hasher.ts
Tracked fields:
- run_id: UUID
- git_sha: Current commit
- config_hash: sha256(config)[:8]
- prompt_hashes: {extraction: sha[:8], review: sha[:8]}
- test_suite_hash: Hash of all test case files
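The sha256(config)[:8] idea can be sketched as follows. This is an assumption about how config-hasher.ts might work; in particular, the key-sorting `stableStringify` helper is hypothetical, added here because plain JSON.stringify is sensitive to key order.

```typescript
import { createHash } from "node:crypto";

// Hypothetical config hasher: canonicalize the config so that key order
// cannot change the hash, SHA-256 it, and keep the first 8 hex characters.

function stableStringify(value: unknown): string {
  if (value === null || typeof value !== "object") return JSON.stringify(value);
  if (Array.isArray(value)) return `[${value.map(stableStringify).join(",")}]`;
  const keys = Object.keys(value as object).sort(); // canonical key order
  return `{${keys
    .map((k) => `${JSON.stringify(k)}:${stableStringify((value as Record<string, unknown>)[k])}`)
    .join(",")}}`;
}

function configHash(config: object): string {
  return createHash("sha256").update(stableStringify(config)).digest("hex").slice(0, 8);
}
```

An 8-character prefix keeps run metadata readable while still making accidental collisions between genuinely different configs unlikely at this scale.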
8. TTS Caching Strategy
Decision: Content-based caching with deterministic folder naming.
Rationale:
- Same text + voice + parameters should reuse cached audio
- Different voice configurations must have separate cache folders
- Avoid redundant ElevenLabs API calls
Implementation:
Cache file naming: {timestamp}_{hash}.mp3
- Hash is SHA256 of text, voiceId, speed, stability, similarityBoost, and language
- Ensures deterministic deduplication across runs
TTS folder naming: {provider}_v_{voiceIds}_spd_{speed}_stb_{stability}_sim_{similarity}
- Example: elevenlabs_v_abc123_def456_spd_1.00_stb_0.50_sim_0.75
- All parameters included to ensure unique folders per configuration
- Voice IDs sorted alphabetically for consistency
Voice-only profiles: When a profile has no audio effects (effects: []), the clean TTS audio is copied to the profiles/ output folder for consistency.
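The deterministic folder naming above can be sketched directly from the template and example. The `TtsConfig` shape and `toFixed(2)` formatting are assumptions made to reproduce the documented example string; the real code may format differently.

```typescript
// Hypothetical builder for the TTS folder name described above:
// {provider}_v_{voiceIds}_spd_{speed}_stb_{stability}_sim_{similarity}

interface TtsConfig {
  provider: string;
  voiceIds: string[];
  speed: number;
  stability: number;
  similarityBoost: number;
}

function ttsFolderName(c: TtsConfig): string {
  const voices = [...c.voiceIds].sort().join("_"); // sorted alphabetically for consistency
  return `${c.provider}_v_${voices}_spd_${c.speed.toFixed(2)}` +
    `_stb_${c.stability.toFixed(2)}_sim_${c.similarityBoost.toFixed(2)}`;
}
```

Sorting the voice IDs before joining is what makes the folder name order-independent, so the same multi-speaker configuration always maps to the same cache folder.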
9. Voice Profile Organization
Decision: Organize voice profiles by actual ElevenLabs voice characteristics.
Current voice_accent profiles:
- voice_female_american - American female speakers
- voice_female_british - British female speakers
- voice_female_indian - Indian female speakers
- voice_male_spanish - Spanish male speakers
- voice_multi_speaker_diverse - Cross-cultural: Spanish male + Indian female
Profile structure:
id: voice_female_american
name: Voice - Female American
category: voice_accent
tts:
  speaker_voices:
    Speaker 1: <elevenlabs_voice_id>
    Speaker 2: <elevenlabs_voice_id>
effects: []  # Voice-only, no audio degradation
10. CLI-First Invocation
Decision: CLI as the primary invocation method.
Rationale:
- Can run in GitHub Actions for automated eval on PRs
- No auth/network considerations
- Easier to script batch test runs
- API can wrap the CLI later if needed
Implementation: src/cli/index.ts
Commands:
- run - Execute the full evaluation pipeline
- compare - Compare two runs
- generate-audio - Generate audio variants from transcripts
- transcribe - Transcribe audio files (AssemblyAI) + evaluate
- extract - Extract structured data (Gemini) + evaluate
- review - Review extractions (Gemini confidence/rationale)
- list - List available test cases and suites
11. Centralized AI Configuration
Decision: All AI model settings centralized in a single file.
Rationale:
- Single source of truth for model versions
- Easy to update models across all services
- Consistent provider naming
Implementation: src/lib/ai-config.ts
export const GEMINI_CONFIG = {
DEFAULT_MODEL: 'gemini-2.5-flash',
PROVIDER: 'gemini',
} as const;
export const ASSEMBLYAI_CONFIG = {
PROVIDER: 'assemblyai',
DEFAULT_MODEL: 'default',
} as const;
Directory Structure
src/
├── lib/
│ ├── ai-config.ts # Centralized AI model configuration
│ ├── logfire.ts # Logfire integration
│ └── ...
├── services/
│ ├── types.ts # Shared types
│ ├── interfaces.ts # Service interfaces
│ ├── transcription/ # AssemblyAI implementation
│ │ ├── assemblyai.ts # AssemblyAI service
│ │ └── transcription-storage.ts # Results storage
│ ├── extraction/ # Gemini extraction
│ │ ├── gemini.ts # Gemini extraction service
│ │ └── extraction-storage.ts # Results storage
│ ├── review/ # Gemini review
│ │ ├── gemini.ts # Gemini review service
│ │ └── review-storage.ts # Results storage
│ ├── audio-generation/ # Audio generation (from audio-pipeline)
│ ├── evaluation/ # Evaluators
│ │ ├── transcription-evaluator.ts
│ │ ├── extraction-evaluator.ts
│ │ ├── review-evaluator.ts
│ │ └── utils/
│ │ ├── wer-calculator.ts
│ │ ├── domain-extractor.ts
│ │ └── fuzzy-matcher.ts
│ ├── pipeline/ # LangGraph orchestration
│ │ ├── graphs/
│ │ │ ├── transcription-graph.ts
│ │ │ ├── extraction-graph.ts
│ │ │ ├── review-graph.ts
│ │ │ └── full-pipeline-graph.ts
│ │ ├── factory.ts
│ │ └── types.ts
│ └── utils/
│ └── config-hasher.ts
└── cli/
├── index.ts # CLI entry point
├── config-loader.ts # Test data loading
├── types.ts
└── commands/
├── run.ts # Full pipeline command
├── compare.ts # Run comparison
├── generate-audio.ts # Audio generation
├── transcribe.ts # Transcription + eval
├── extract.ts # Extraction + eval
└── review.ts # Review + confidence
test-data/
├── audio-outputs/ # Generated audio files (gitignored)
│ └── run_xxx/
│ ├── manifest.json # Run metadata
│ ├── progress.json # Run progress
│ └── MAG_001/T001/
│ ├── translations/ # Normalized transcripts
│ ├── audio/ # TTS audio (grouped by config)
│ └── profiles/ # Final audio with effects
├── transcription-outputs/ # Transcription results (gitignored)
│ └── transcription_xxx/
│ ├── manifest.json
│ └── MAG_001/T001/profiles/{profile}/
│ ├── transcription.txt
│ ├── evaluation-result.json
│ └── evaluation-comparison.yaml
├── extraction-outputs/ # Extraction results (gitignored)
│ └── extraction_xxx/
│ └── MAG_001/T001/profiles/{profile}/
│ ├── extracted-points.json
│ ├── evaluation-result.json
│ └── evaluation-comparison.yaml
├── review-outputs/ # Review results (gitignored)
│ └── review_xxx/
│ └── MAG_001/T001/profiles/{profile}/
│ └── review-result.json
├── audio-profiles/ # Audio profile definitions
│ ├── bad_connection/ # Connection degradation profiles
│ ├── combined/ # Combined effect profiles
│ ├── noise/ # Background noise profiles
│ ├── office/ # Office environment profiles
│ ├── speed/ # Speech rate profiles
│ ├── telephone/ # Telephone quality profiles
│ └── voice_accent/ # Voice/accent variation profiles
├── mags/ # Test cases
│ └── MAG_001/
│ ├── test-case.json
│ ├── transcripts/
│ │ └── T001/
│ │ ├── transcript.txt
│ │ └── golden-record.json
│ ├── sound-effects.yaml
│ └── sounds/ # MAG-specific sound files
├── tts-cache/ # TTS cache (tracked)
│ └── en/
│ └── {timestamp}_{hash}.mp3 # Content-based naming
├── sound-effects/ # Sound effect files (tracked)
├── suites/ # Test suite definitions
└── mangled/ # Edge case test transcripts
Evaluation Metrics
Transcription Accuracy
- WER (Word Error Rate) - Industry standard
- CER (Character Error Rate)
- Domain Metrics:
- Price accuracy
- Volume accuracy
- Number accuracy
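WER is word-level edit distance divided by the reference word count. A minimal sketch follows; the shipped wer-calculator.ts may normalize punctuation and casing first, which this version does not.

```typescript
// Word Error Rate: Levenshtein distance over word tokens, divided by the
// number of reference words. Tokenization here is naive whitespace splitting.

function wer(reference: string, hypothesis: string): number {
  const ref = reference.trim().split(/\s+/);
  const hyp = hypothesis.trim().split(/\s+/);
  const dp = Array.from({ length: ref.length + 1 }, (_, i) =>
    Array.from({ length: hyp.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0))
  );
  for (let i = 1; i <= ref.length; i++)
    for (let j = 1; j <= hyp.length; j++)
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,                                       // deletion
        dp[i][j - 1] + 1,                                       // insertion
        dp[i - 1][j - 1] + (ref[i - 1] === hyp[j - 1] ? 0 : 1)  // substitution
      );
  return ref.length === 0 ? 0 : dp[ref.length][hyp.length] / ref.length;
}
```

CER is the same computation applied to characters instead of word tokens, which is why the two metrics are usually reported together.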
Extraction Accuracy
- Precision - Correct extractions / Total extracted
- Recall - Correct extractions / Total in golden record
- F1 Score - Harmonic mean
- Per-field accuracy breakdown
- False positives/negatives count
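Given the true-positive, false-positive, and false-negative counts that the evaluator produces, the three headline scores follow directly:

```typescript
// Precision, recall, and F1 from extraction match counts. Guards avoid
// division by zero when a run extracts nothing or the golden record is empty.

function extractionScores(tp: number, fp: number, fn: number) {
  const precision = tp + fp === 0 ? 0 : tp / (tp + fp);   // correct / total extracted
  const recall = tp + fn === 0 ? 0 : tp / (tp + fn);      // correct / total in golden record
  const f1 = precision + recall === 0
    ? 0
    : (2 * precision * recall) / (precision + recall);    // harmonic mean
  return { precision, recall, f1 };
}
```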
Review Quality
- ECE (Expected Calibration Error)
- MCE (Maximum Calibration Error)
- Confidence Correlation - Does confidence predict accuracy?
- Over/Underconfidence rates
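ECE measures how far stated confidence is from observed accuracy. A standard binned formulation is sketched below; the `Prediction` shape and 10-bin default are assumptions, and the real review-evaluator.ts may bin differently.

```typescript
// Expected Calibration Error: bin predictions by stated confidence, then
// average |mean confidence - accuracy| per bin, weighted by bin size.
// MCE would instead take the maximum per-bin gap.

interface Prediction { confidence: number; correct: boolean } // confidence in [0, 1]

function ece(preds: Prediction[], bins = 10): number {
  let total = 0;
  for (let b = 0; b < bins; b++) {
    const lo = b / bins;
    const hi = (b + 1) / bins;
    const inBin = preds.filter(
      (p) => (p.confidence > lo && p.confidence <= hi) || (b === 0 && p.confidence === 0)
    );
    if (inBin.length === 0) continue;
    const avgConf = inBin.reduce((sum, p) => sum + p.confidence, 0) / inBin.length;
    const accuracy = inBin.filter((p) => p.correct).length / inBin.length;
    total += (inBin.length / preds.length) * Math.abs(avgConf - accuracy);
  }
  return total;
}
```

A review model that always says 0.95 but is right 70% of the time would score a large ECE, flagging exactly the overconfidence this metric is listed to catch.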
Out of Scope
Per the FAS-39 ticket:
- Dashboard UI (separate ticket)
- Production hardening
- Assessment stage (separate from Review)
- Profile Management System (own ticket)
- Test Weighting System (dashboard ticket)
- Multi-run averaging (dashboard ticket)
References
- FAS-39 Linear ticket
- FAS-39/call-transcript.txt - Original requirements discussion
- FAS-39/content.md - Ticket description