# E2E Eval Pipeline - Feedback & Recommendations

## What Worked Well
### 1. LangGraph + Service Abstraction

Outcome: Clean separation between orchestration and AI service calls.

- LangGraph StateGraph provides clear state management
- Skip flags enable flexible entry points
- Service interfaces make provider swapping straightforward
- Each node can be tested independently

Recommendation: Keep this pattern for future pipeline extensions.
### 2. Fuzzy Matching for Extraction Evaluation

Outcome: Practical evaluation that handles real-world variance.

- Numeric tolerance (14.10 == 14.1) prevents false negatives
- Levenshtein distance for string similarity works well
- Per-field weighting (market + price weighted higher) aligns with business importance

Recommendation: Consider adding configurable field weights per market type.
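The matching described above can be sketched roughly as follows. The function names, tolerance, and weights are illustrative, not the evaluator's actual API:

```typescript
// Sketch of fuzzy field comparison: numeric values match within a
// tolerance (so "14.10" == "14.1"), strings fall back to normalized
// Levenshtein similarity, and fields are combined with per-field weights.

function levenshtein(a: string, b: string): number {
  // Single-row dynamic-programming edit distance.
  const dp: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    let prev = dp[0];
    dp[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const tmp = dp[j];
      dp[j] = Math.min(
        dp[j] + 1,                                // deletion
        dp[j - 1] + 1,                            // insertion
        prev + (a[i - 1] === b[j - 1] ? 0 : 1),   // substitution
      );
      prev = tmp;
    }
  }
  return dp[b.length];
}

function fieldSimilarity(expected: string, actual: string): number {
  const eNum = Number(expected);
  const aNum = Number(actual);
  // Numeric tolerance: compare as numbers when both parse.
  if (!Number.isNaN(eNum) && !Number.isNaN(aNum)) {
    return Math.abs(eNum - aNum) < 1e-6 ? 1 : 0;
  }
  const maxLen = Math.max(expected.length, actual.length) || 1;
  return 1 - levenshtein(expected.toLowerCase(), actual.toLowerCase()) / maxLen;
}

// Weighted per-field score; the weight values are an assumed example
// (e.g. market and price weighted higher than other fields).
function weightedScore(
  fields: Record<string, { expected: string; actual: string }>,
  weights: Record<string, number>,
): number {
  let total = 0;
  let weightSum = 0;
  for (const [name, pair] of Object.entries(fields)) {
    const w = weights[name] ?? 1;
    total += w * fieldSimilarity(pair.expected, pair.actual);
    weightSum += w;
  }
  return weightSum > 0 ? total / weightSum : 0;
}
```

A per-market-type weight table could then be swapped in per the recommendation above without touching the matching logic.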
### 3. Confidence Calibration Metrics (ECE/MCE)

Outcome: Meaningful assessment of review quality beyond simple accuracy.

- Expected Calibration Error shows how well confidence predicts accuracy
- Bucket-based analysis reveals over/underconfidence patterns
- Correlation metric provides single summary number

Recommendation: Use ECE as primary review quality metric, with MCE for worst-case analysis.
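The bucket-based computation can be sketched as below. The record shape and the ten-bucket default are assumptions, not the evaluator's actual interface:

```typescript
// Minimal ECE/MCE sketch: bucket predictions by stated confidence,
// then compare each bucket's mean confidence to its observed accuracy.
interface Prediction {
  confidence: number; // model's stated confidence in [0, 1]
  correct: boolean;   // whether the prediction matched the golden record
}

function calibrationErrors(preds: Prediction[], buckets = 10) {
  const bins = Array.from({ length: buckets }, () => ({ conf: 0, acc: 0, n: 0 }));
  for (const p of preds) {
    const i = Math.min(buckets - 1, Math.floor(p.confidence * buckets));
    bins[i].conf += p.confidence;
    bins[i].acc += p.correct ? 1 : 0;
    bins[i].n += 1;
  }
  let ece = 0; // expected calibration error: weighted mean gap
  let mce = 0; // maximum calibration error: worst-case bucket gap
  for (const b of bins) {
    if (b.n === 0) continue;
    const gap = Math.abs(b.conf / b.n - b.acc / b.n);
    ece += (b.n / preds.length) * gap;
    mce = Math.max(mce, gap);
  }
  return { ece, mce };
}
```

This is why ECE works as the headline number (it averages over all reviews) while MCE surfaces the single worst-calibrated confidence range.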
### 4. Config Versioning

Outcome: Full reproducibility and comparability of runs.

- Git SHA + config hash + prompt hashes enable exact reproduction
- Test suite hash detects test data changes
- Run comparison CLI makes A/B testing straightforward

Recommendation: Consider adding SDK versions (assemblyai, @google/genai) to versioning.
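A deterministic config hash, one ingredient of the versioning scheme above, could be computed like this. The key normalization and 12-character truncation are illustrative choices, not the pipeline's actual implementation; combining the hash with the git SHA and prompt hashes is left to the run record:

```typescript
import { createHash } from "node:crypto";

// Serialize with sorted keys so semantically equal configs hash
// identically regardless of property order.
function stableStringify(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map(stableStringify).join(",")}]`;
  }
  if (value !== null && typeof value === "object") {
    const entries = Object.entries(value as Record<string, unknown>)
      .sort(([a], [b]) => a.localeCompare(b))
      .map(([k, v]) => `${JSON.stringify(k)}:${stableStringify(v)}`);
    return `{${entries.join(",")}}`;
  }
  return JSON.stringify(value);
}

function configHash(config: object): string {
  return createHash("sha256")
    .update(stableStringify(config))
    .digest("hex")
    .slice(0, 12); // short prefix is enough to distinguish runs
}
```

Adding SDK versions per the recommendation would just mean including them as fields of the hashed config object.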
## Challenges Encountered

### 1. Type Conflicts Between Modules

Issue: The `PipelineResult` and `PipelineState` names were used in both the audio-generation and pipeline modules.

Resolution: Renamed the pipeline types to `EvalPipelineResult` and `EvalPipelineState` for clarity.

Recommendation: Use more specific type names from the start for shared modules.
### 2. Evaluation Result Type Mismatches

Issue: The CLI expected different property names than the evaluators returned (e.g., `f1` vs `f1Score`, `calibrationError` vs `ece`).

Resolution: Fixed the CLI to use the actual property names from the evaluators.

Recommendation: Share types between evaluators and the CLI, or use a single result schema.
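A minimal sketch of such a shared schema. The interface and field names are hypothetical (chosen to match the property names mentioned above), and the runtime guard is one way the CLI could fail fast on shape drift:

```typescript
// Hypothetical shared result schema imported by both the evaluators
// and the CLI, so property names can't drift apart.
interface ExtractionEvalResult {
  precision: number;
  recall: number;
  f1Score: number; // the name the evaluator actually returns, not "f1"
}

// Runtime guard the CLI could apply before rendering results.
function isExtractionEvalResult(v: unknown): v is ExtractionEvalResult {
  const r = v as ExtractionEvalResult;
  return (
    typeof r?.precision === "number" &&
    typeof r?.recall === "number" &&
    typeof r?.f1Score === "number"
  );
}
```

With a single module owning these types, a rename like `f1` → `f1Score` becomes a compile error in the CLI instead of a silent `undefined`.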
### 3. Audio Generation Integration

Issue: The audio pipeline expects file paths, so the CLI needed to coordinate output directories.

Resolution: The CLI handles directory creation and passes the transcript path to the pipeline.

Recommendation: Consider adding an output directory override to `AudioPipeline.run()`.
## Test Case Management Recommendations

### 1. Directory Structure

The current structure works well:

```
test-data/mags/{MAG_ID}/
├── test-case.json        # Metadata
├── golden-record.json    # Ground truth
├── transcripts/
│   └── original.txt
└── audio/
    ├── clean.mp3
    └── business_call.mp3
```

Recommendation: Keep this structure. It's version-control friendly and self-documenting.
### 2. Golden Record Creation Process

Current: Schema defined, but the creation process is not formalized.

Recommendation:

1. Create an extraction guidelines document (what counts as a "price"?)
2. Use Claude/GPT to draft an initial extraction from the transcript
3. A human validates and corrects
4. A second human spot-checks critical cases
5. Store provenance (who created it, when, methodology version)
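The provenance from step 5 could be captured alongside each golden record with a shape like this; every field name here is a proposal, not an existing schema:

```typescript
// Suggested provenance block stored next to (or inside) each
// golden-record.json. All fields are proposals.
interface GoldenRecordProvenance {
  createdBy: string;          // human who validated the extraction (step 3)
  draftedBy?: string;         // model that produced the initial draft (step 2)
  reviewedBy?: string;        // second human spot-checker (step 4)
  createdAt: string;          // ISO 8601 timestamp
  methodologyVersion: string; // version of the extraction guidelines (step 1)
}

// Example record (values are placeholders).
const example: GoldenRecordProvenance = {
  createdBy: "analyst@example.com",
  draftedBy: "claude-draft",
  createdAt: new Date("2024-01-15").toISOString(),
  methodologyVersion: "1.0",
};
```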
### 3. Test Suite Organization

Current: Single suite definition file with test case lists.

Recommendation for scale:

- Group by market type (metals, energy, agricultural)
- Tag-based filtering (edge-case, regression, standard)
- Separate "quick" vs "full" suites for CI integration
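A tag-aware suite entry and filter might look like this sketch; the type and function are hypothetical, not part of the current suite definition file:

```typescript
// Hypothetical shape for a tag-aware suite definition entry.
interface SuiteTestCase {
  magId: string;
  marketType: "metals" | "energy" | "agricultural";
  tags: string[]; // e.g. "edge-case", "regression", "standard", "quick"
}

// Select a subset of cases by market type and/or required tags,
// e.g. a "quick" suite for CI vs the full suite for nightly runs.
function filterSuite(
  cases: SuiteTestCase[],
  opts: { tags?: string[]; marketType?: string },
): SuiteTestCase[] {
  return cases.filter(
    (c) =>
      (!opts.marketType || c.marketType === opts.marketType) &&
      (!opts.tags || opts.tags.every((t) => c.tags.includes(t))),
  );
}
```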
### 4. Edge Case Coverage

Current: `mangled/` directory for edge cases.

Recommendation: Formalize edge case categories:

- `no-market-data` - Conversations with zero extractable prices
- `garbled-numbers` - Partially corrupted numeric values
- `multiple-currencies` - USD/EUR/GBP in the same transcript
- `ambiguous-type` - Unclear if bid/offer/transaction
- `nested-ranges` - "14-15, possibly 14.5-15.5"
## Dashboard Ticket Suggestions

Based on implementation experience, we recommend these follow-up tickets:

### 1. Results Dashboard (High Priority)

Scope:

- View run summaries with filtering by date/config
- Drill down to per-MAG results
- Compare two runs side-by-side
- Visualize confidence calibration curves

Data Source: Read from `test-data/results/` JSON files or Logfire.
### 2. Test Suite Manager (Medium Priority)

Scope:

- CRUD for test cases
- Golden record editor with transcript highlighting
- Bulk import from CSV/Excel
- Duplicate detection
### 3. Metrics Trends (Medium Priority)

Scope:

- Track WER/accuracy over time (by git SHA)
- Regression alerts when metrics drop
- Compare models (Gemini vs Claude vs GPT)
### 4. Profile Manager (Low Priority)

Scope:

- Create/edit audio generation profiles
- Preview effects settings
- A/B test profile variants
## Logfire Integration Feedback

### What Works

- Span-based tracing captures full pipeline flow
- Attributes allow filtering by test case, provider, model
- Existing `src/lib/logfire.ts` wrapper integrates cleanly
### Gaps Identified

- Structured test case storage - Logfire is observability-focused, not a database
  - Recommendation: Keep test cases in files, results summaries in Logfire
- Query capabilities - Limited compared to SQL
  - Recommendation: Export to JSON for complex analysis, use Logfire for real-time monitoring
- Retention - Check Logfire retention limits for long-term comparison
  - Recommendation: Archive run summaries to files for permanent storage
### Recommended Logfire Usage

| Data | Storage |
|---|---|
| Test case definitions | Files (version controlled) |
| Golden records | Files (version controlled) |
| Run config | Files + Logfire spans |
| Per-MAG results | Files + Logfire spans |
| Aggregate metrics | Files + Logfire metrics |
| LLM traces | Logfire only |
| Real-time monitoring | Logfire |
## Next Steps

### Immediate (This Sprint)

- ~~Generate audio files for MAG_001~~ (needs ElevenLabs API key)
- Run full pipeline end-to-end with real API keys
- Validate Logfire traces appear correctly
### Short-Term (Next Sprint)

- Add 2-3 more MAG test cases
- Implement CI integration (run on PR)
- Create results dashboard ticket
### Medium-Term

- Multi-language support (separate ticket)
- Additional transcription providers (Whisper, Deepgram)
- Additional extraction models (Claude, GPT)
## Summary

The E2E eval pipeline implementation successfully:

- ✅ Uses LangGraph for orchestration with service abstraction
- ✅ Provides WER + domain metrics for transcription evaluation
- ✅ Implements fuzzy matching for extraction evaluation
- ✅ Includes confidence calibration for review evaluation
- ✅ Supports entry at any pipeline step
- ✅ Tracks configuration versions for reproducibility
- ✅ Provides a CLI for running and comparing evaluations

Key recommendations:

1. Keep file-based test data storage (version controlled)
2. Use Logfire for observability, not as a primary data store
3. Formalize the golden record creation process
4. Build the results dashboard as the next priority