# E2E Eval Pipeline - Feedback & Recommendations

## What Worked Well
### 1. LangGraph + Service Abstraction

Outcome: Clean separation between orchestration and AI service calls.

- LangGraph StateGraph provides clear state management
- Skip flags enable flexible entry points
- Service interfaces make provider swapping straightforward
- Each node can be tested independently

Recommendation: Keep this pattern for future pipeline extensions.
### 2. Fuzzy Matching for Extraction Evaluation

Outcome: Practical evaluation that handles real-world variance.

- Numeric tolerance (14.10 == 14.1) prevents false negatives
- Levenshtein distance for string similarity works well
- Per-field weighting (market + price weighted higher) aligns with business importance

Recommendation: Consider adding configurable field weights per market type.
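The matching described above can be sketched roughly as follows. The function names, tolerance, and weights are illustrative, not the evaluator's actual API:

```typescript
// Sketch of fuzzy field comparison: numeric values match within a
// tolerance (so "14.10" == "14.1"), strings fall back to normalized
// Levenshtein similarity, and fields are combined with per-field weights.

function levenshtein(a: string, b: string): number {
  // Single-row dynamic-programming edit distance.
  const dp: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    let prev = dp[0];
    dp[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const tmp = dp[j];
      dp[j] = Math.min(
        dp[j] + 1,                                // deletion
        dp[j - 1] + 1,                            // insertion
        prev + (a[i - 1] === b[j - 1] ? 0 : 1),   // substitution
      );
      prev = tmp;
    }
  }
  return dp[b.length];
}

function fieldSimilarity(expected: string, actual: string): number {
  const eNum = Number(expected);
  const aNum = Number(actual);
  // Numeric tolerance: compare as numbers when both parse.
  if (!Number.isNaN(eNum) && !Number.isNaN(aNum)) {
    return Math.abs(eNum - aNum) < 1e-6 ? 1 : 0;
  }
  const maxLen = Math.max(expected.length, actual.length) || 1;
  return 1 - levenshtein(expected.toLowerCase(), actual.toLowerCase()) / maxLen;
}

// Weighted per-field score; the weight values are an assumed example
// (e.g. market and price weighted higher than other fields).
function weightedScore(
  fields: Record<string, { expected: string; actual: string }>,
  weights: Record<string, number>,
): number {
  let total = 0;
  let weightSum = 0;
  for (const [name, pair] of Object.entries(fields)) {
    const w = weights[name] ?? 1;
    total += w * fieldSimilarity(pair.expected, pair.actual);
    weightSum += w;
  }
  return weightSum > 0 ? total / weightSum : 0;
}
```

A per-market-type weight table could then be swapped in per the recommendation above without touching the matching logic.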
### 3. Confidence Calibration Metrics (ECE/MCE)

Outcome: Meaningful assessment of review quality beyond simple accuracy.

- Expected Calibration Error shows how well confidence predicts accuracy
- Bucket-based analysis reveals over/underconfidence patterns
- Correlation metric provides single summary number

Recommendation: Use ECE as primary review quality metric, with MCE for worst-case analysis.
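The bucket-based computation can be sketched as below. The record shape and the ten-bucket default are assumptions, not the evaluator's actual interface:

```typescript
// Minimal ECE/MCE sketch: bucket predictions by stated confidence,
// then compare each bucket's mean confidence to its observed accuracy.
interface Prediction {
  confidence: number; // model's stated confidence in [0, 1]
  correct: boolean;   // whether the prediction matched the golden record
}

function calibrationErrors(preds: Prediction[], buckets = 10) {
  const bins = Array.from({ length: buckets }, () => ({ conf: 0, acc: 0, n: 0 }));
  for (const p of preds) {
    const i = Math.min(buckets - 1, Math.floor(p.confidence * buckets));
    bins[i].conf += p.confidence;
    bins[i].acc += p.correct ? 1 : 0;
    bins[i].n += 1;
  }
  let ece = 0; // expected calibration error: weighted mean gap
  let mce = 0; // maximum calibration error: worst-case bucket gap
  for (const b of bins) {
    if (b.n === 0) continue;
    const gap = Math.abs(b.conf / b.n - b.acc / b.n);
    ece += (b.n / preds.length) * gap;
    mce = Math.max(mce, gap);
  }
  return { ece, mce };
}
```

This is why ECE works as the headline number (it averages over all reviews) while MCE surfaces the single worst-calibrated confidence range.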
### 4. Config Versioning

Outcome: Full reproducibility and comparability of runs.

- Git SHA + config hash + prompt hashes enable exact reproduction
- Test suite hash detects test data changes
- Run comparison CLI makes A/B testing straightforward

Recommendation: Consider adding SDK versions (assemblyai, @google/genai) to versioning.
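A deterministic config hash, one ingredient of the versioning scheme above, could be computed like this. The key normalization and 12-character truncation are illustrative choices, not the pipeline's actual implementation; combining the hash with the git SHA and prompt hashes is left to the run record:

```typescript
import { createHash } from "node:crypto";

// Serialize with sorted keys so semantically equal configs hash
// identically regardless of property order.
function stableStringify(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map(stableStringify).join(",")}]`;
  }
  if (value !== null && typeof value === "object") {
    const entries = Object.entries(value as Record<string, unknown>)
      .sort(([a], [b]) => a.localeCompare(b))
      .map(([k, v]) => `${JSON.stringify(k)}:${stableStringify(v)}`);
    return `{${entries.join(",")}}`;
  }
  return JSON.stringify(value);
}

function configHash(config: object): string {
  return createHash("sha256")
    .update(stableStringify(config))
    .digest("hex")
    .slice(0, 12); // short prefix is enough to distinguish runs
}
```

Adding SDK versions per the recommendation would just mean including them as fields of the hashed config object.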
## Challenges Encountered

### 1. Type Conflicts Between Modules

Issue: The `PipelineResult` and `PipelineState` names were used in both the audio-generation and pipeline modules.

Resolution: Renamed the pipeline types to `EvalPipelineResult` and `EvalPipelineState` for clarity.

Recommendation: Use more specific type names from the start for shared modules.
### 2. Evaluation Result Type Mismatches

Issue: The CLI expected different property names than the evaluators returned (e.g., `f1` vs `f1Score`, `calibrationError` vs `ece`).

Resolution: Fixed the CLI to use the actual property names from the evaluators.

Recommendation: Share types between evaluators and the CLI, or use a single result schema.
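A minimal sketch of such a shared schema. The interface and field names are hypothetical (chosen to match the property names mentioned above), and the runtime guard is one way the CLI could fail fast on shape drift:

```typescript
// Hypothetical shared result schema imported by both the evaluators
// and the CLI, so property names can't drift apart.
interface ExtractionEvalResult {
  precision: number;
  recall: number;
  f1Score: number; // the name the evaluator actually returns, not "f1"
}

// Runtime guard the CLI could apply before rendering results.
function isExtractionEvalResult(v: unknown): v is ExtractionEvalResult {
  const r = v as ExtractionEvalResult;
  return (
    typeof r?.precision === "number" &&
    typeof r?.recall === "number" &&
    typeof r?.f1Score === "number"
  );
}
```

With a single module owning these types, a rename like `f1` → `f1Score` becomes a compile error in the CLI instead of a silent `undefined`.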
### 3. Audio Generation Integration

Issue: The audio pipeline expects file paths, so the CLI needed to coordinate output directories.

Resolution: The CLI handles directory creation and passes the transcript path to the pipeline.

Recommendation: Consider adding an output directory override to `AudioPipeline.run()`.
## Test Case Management Recommendations

### 1. Directory Structure

The current structure works well:

```
test-data/mags/{MAG_ID}/
├── test-case.json        # Metadata
├── golden-record.json    # Ground truth
├── transcripts/
│   └── original.txt
└── audio/
    ├── clean.mp3
    └── business_call.mp3
```

Recommendation: Keep this structure. It's version-control friendly and self-documenting.
### 2. Golden Record Creation Process

Current: Schema defined, but the creation process is not formalized.

Recommendation:

1. Create an extraction guidelines document (what counts as a "price"?)
2. Use Claude/GPT to draft an initial extraction from the transcript
3. A human validates and corrects
4. A second human spot-checks critical cases
5. Store provenance (who created it, when, methodology version)
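The provenance from step 5 could be captured alongside each golden record with a shape like this; every field name here is a proposal, not an existing schema:

```typescript
// Suggested provenance block stored next to (or inside) each
// golden-record.json. All fields are proposals.
interface GoldenRecordProvenance {
  createdBy: string;          // human who validated the extraction (step 3)
  draftedBy?: string;         // model that produced the initial draft (step 2)
  reviewedBy?: string;        // second human spot-checker (step 4)
  createdAt: string;          // ISO 8601 timestamp
  methodologyVersion: string; // version of the extraction guidelines (step 1)
}

// Example record (values are placeholders).
const example: GoldenRecordProvenance = {
  createdBy: "analyst@example.com",
  draftedBy: "claude-draft",
  createdAt: new Date("2024-01-15").toISOString(),
  methodologyVersion: "1.0",
};
```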
### 3. Test Suite Organization

Current: Single suite definition file with test case lists.

Recommendation for scale:

- Group by market type (metals, energy, agricultural)
- Tag-based filtering (edge-case, regression, standard)
- Separate "quick" vs "full" suites for CI integration
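A tag-aware suite entry and filter might look like this sketch; the type and function are hypothetical, not part of the current suite definition file:

```typescript
// Hypothetical shape for a tag-aware suite definition entry.
interface SuiteTestCase {
  magId: string;
  marketType: "metals" | "energy" | "agricultural";
  tags: string[]; // e.g. "edge-case", "regression", "standard", "quick"
}

// Select a subset of cases by market type and/or required tags,
// e.g. a "quick" suite for CI vs the full suite for nightly runs.
function filterSuite(
  cases: SuiteTestCase[],
  opts: { tags?: string[]; marketType?: string },
): SuiteTestCase[] {
  return cases.filter(
    (c) =>
      (!opts.marketType || c.marketType === opts.marketType) &&
      (!opts.tags || opts.tags.every((t) => c.tags.includes(t))),
  );
}
```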
### 4. Edge Case Coverage

Current: `mangled/` directory for edge cases.

Recommendation: Formalize edge case categories:

- `no-market-data` - Conversations with zero extractable prices
- `garbled-numbers` - Partially corrupted numeric values
- `multiple-currencies` - USD/EUR/GBP in the same transcript
- `ambiguous-type` - Unclear if bid/offer/transaction
- `nested-ranges` - "14-15, possibly 14.5-15.5"
## Dashboard Ticket Suggestions

Based on implementation experience, we recommend these follow-up tickets:

### 1. Results Dashboard (High Priority)

Scope:

- View run summaries with filtering by date/config
- Drill down to per-MAG results
- Compare two runs side-by-side
- Visualize confidence calibration curves

Data Source: Read from `test-data/results/` JSON files or Logfire.
### 2. Test Suite Manager (Medium Priority)

Scope:

- CRUD for test cases
- Golden record editor with transcript highlighting
- Bulk import from CSV/Excel
- Duplicate detection
### 3. Metrics Trends (Medium Priority)

Scope:

- Track WER/accuracy over time (by git SHA)
- Regression alerts when metrics drop
- Compare models (Gemini vs Claude vs GPT)
### 4. Profile Manager (Low Priority)

Scope:

- Create/edit audio generation profiles
- Preview effects settings
- A/B test profile variants
## Logfire Integration Feedback

### What Works

- Span-based tracing captures full pipeline flow
- Attributes allow filtering by test case, provider, model
- Existing `src/lib/logfire.ts` wrapper integrates cleanly
### Gaps Identified

- Structured test case storage - Logfire is observability-focused, not a database
  - Recommendation: Keep test cases in files, results summaries in Logfire
- Query capabilities - Limited compared to SQL
  - Recommendation: Export to JSON for complex analysis, use Logfire for real-time monitoring
- Retention - Check Logfire retention limits for long-term comparison
  - Recommendation: Archive run summaries to files for permanent storage
### Recommended Logfire Usage

| Data | Storage |
|---|---|
| Test case definitions | Files (version controlled) |
| Golden records | Files (version controlled) |
| Run config | Files + Logfire spans |
| Per-MAG results | Files + Logfire spans |
| Aggregate metrics | Files + Logfire metrics |
| LLM traces | Logfire only |
| Real-time monitoring | Logfire |
## Next Steps

### Immediate (This Sprint)

- ~~Generate audio files for MAG_001~~ (needs ElevenLabs API key)
- Run full pipeline end-to-end with real API keys
- Validate Logfire traces appear correctly
### Short-Term (Next Sprint)

- Add 2-3 more MAG test cases
- Implement CI integration (run on PR)
- Create results dashboard ticket
### Medium-Term

- Multi-language support (separate ticket)
- Additional transcription providers (Whisper, Deepgram)
- Additional extraction models (Claude, GPT)
## Summary

The E2E eval pipeline implementation successfully:

- ✅ Uses LangGraph for orchestration with service abstraction
- ✅ Provides WER + domain metrics for transcription evaluation
- ✅ Implements fuzzy matching for extraction evaluation
- ✅ Includes confidence calibration for review evaluation
- ✅ Supports entry at any pipeline step
- ✅ Tracks configuration versions for reproducibility
- ✅ Provides a CLI for running and comparing evaluations

Key recommendations:

1. Keep file-based test data storage (version controlled)
2. Use Logfire for observability, not as a primary data store
3. Formalize the golden record creation process
4. Build the results dashboard as the next priority