E2E Eval Pipeline - Feedback & Recommendations

What Worked Well

1. LangGraph + Service Abstraction

Outcome: Clean separation between orchestration and AI service calls.

  • LangGraph StateGraph provides clear state management
  • Skip flags enable flexible entry points
  • Service interfaces make provider swapping straightforward
  • Each node can be tested independently

Recommendation: Keep this pattern for future pipeline extensions.
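The node/service split above can be sketched independently of LangGraph specifics. A minimal sketch, assuming illustrative interface and state names (these are not the project's actual types): each node is a pure state transition over a narrow service interface, so providers swap cleanly and skip flags gate entry points.

```typescript
// Illustrative sketch of the orchestration/service split; all names here
// are assumptions, not the project's real interfaces.

interface TranscriptionService {
  transcribe(audioPath: string): Promise<string>;
}

interface EvalPipelineState {
  audioPath?: string;
  transcript?: string;
  skipTranscription?: boolean; // skip flag: enter the pipeline mid-way
}

// A node is a pure state transition, testable in isolation.
async function transcribeNode(
  state: EvalPipelineState,
  svc: TranscriptionService,
): Promise<EvalPipelineState> {
  // Flexible entry point: a pre-supplied transcript bypasses the service call.
  if (state.skipTranscription && state.transcript) return state;
  const transcript = await svc.transcribe(state.audioPath!);
  return { ...state, transcript };
}

// A fake provider is enough to unit-test the node, no API key required.
const fakeService: TranscriptionService = {
  transcribe: async () => "gold at 14.1 bid",
};
```

Because the node only sees the interface, swapping AssemblyAI for another provider is a one-line change at wiring time.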

2. Fuzzy Matching for Extraction Evaluation

Outcome: Practical evaluation that handles real-world variance.

  • Numeric tolerance (14.10 == 14.1) prevents false negatives
  • Levenshtein distance for string similarity works well
  • Per-field weighting (market + price weighted higher) aligns with business importance

Recommendation: Consider adding configurable field weights per market type.
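The three ideas above (numeric tolerance, edit distance, per-field weighting) can be sketched as follows; the field names and weight values are illustrative assumptions, not the evaluator's actual configuration:

```typescript
// Numeric tolerance: "14.10" and "14.1" should compare equal.
function numericMatch(a: string, b: string, eps = 1e-9): boolean {
  const x = parseFloat(a);
  const y = parseFloat(b);
  return Number.isFinite(x) && Number.isFinite(y) && Math.abs(x - y) < eps;
}

// Classic Levenshtein edit distance for string similarity.
function levenshtein(a: string, b: string): number {
  const dp = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0)),
  );
  for (let i = 1; i <= a.length; i++)
    for (let j = 1; j <= b.length; j++)
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1, // deletion
        dp[i][j - 1] + 1, // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1), // substitution
      );
  return dp[a.length][b.length];
}

// Per-field weights (values are illustrative): market and price count more.
const FIELD_WEIGHTS: Record<string, number> = { market: 2, price: 2, counterparty: 1 };

function weightedScore(scores: Record<string, number>): number {
  let num = 0, den = 0;
  for (const [field, score] of Object.entries(scores)) {
    const w = FIELD_WEIGHTS[field] ?? 1;
    num += w * score;
    den += w;
  }
  return den === 0 ? 0 : num / den;
}
```

Making `FIELD_WEIGHTS` a per-market-type lookup would implement the configurable-weights recommendation.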

3. Confidence Calibration Metrics (ECE/MCE)

Outcome: Meaningful assessment of review quality beyond simple accuracy.

  • Expected Calibration Error shows how well confidence predicts accuracy
  • Bucket-based analysis reveals over/underconfidence patterns
  • Correlation metric provides single summary number

Recommendation: Use ECE as primary review quality metric, with MCE for worst-case analysis.

4. Config Versioning

Outcome: Full reproducibility and comparability of runs.

  • Git SHA + config hash + prompt hashes enable exact reproduction
  • Test suite hash detects test data changes
  • Run comparison CLI makes A/B testing straightforward

Recommendation: Consider adding SDK versions (assemblyai, @google/genai) to versioning.
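A run fingerprint along these lines could back the versioning scheme; the field names below are illustrative and the project's actual schema may differ:

```typescript
import { createHash } from "node:crypto";

function sha256(input: string): string {
  return createHash("sha256").update(input).digest("hex");
}

interface RunVersion {
  gitSha: string;
  configHash: string;
  promptHashes: Record<string, string>;
  testSuiteHash: string;
}

function versionRun(
  gitSha: string,
  config: object,
  prompts: Record<string, string>,
  testSuite: object,
): RunVersion {
  return {
    gitSha,
    // Note: JSON.stringify is key-order sensitive; a real implementation
    // should canonicalize key order before hashing.
    configHash: sha256(JSON.stringify(config)),
    promptHashes: Object.fromEntries(
      Object.entries(prompts).map(([name, text]) => [name, sha256(text)]),
    ),
    testSuiteHash: sha256(JSON.stringify(testSuite)),
  };
}
```

Adding SDK versions would just mean hashing an extra `{ assemblyai, "@google/genai" }` version map into the same structure.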

Challenges Encountered

1. Type Conflicts Between Modules

Issue: The PipelineResult and PipelineState type names were used in both the audio-generation and pipeline modules.

Resolution: Renamed pipeline types to EvalPipelineResult and EvalPipelineState for clarity.

Recommendation: Use more specific type names from the start for shared modules.

2. Evaluation Result Type Mismatches

Issue: CLI expected different property names than evaluators returned (e.g., f1 vs f1Score, calibrationError vs ece).

Resolution: Fixed CLI to use actual property names from evaluators.

Recommendation: Share types between evaluators and CLI, or use a single result schema.
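A single shared result schema, imported by both the evaluator and the CLI, prevents the f1 vs f1Score drift described above. The shapes below are a sketch, not the project's actual types:

```typescript
// One canonical schema module; evaluator and CLI both import from it.
interface ExtractionEvalResult {
  precision: number;
  recall: number;
  f1Score: number; // one canonical name, no renaming layer in the CLI
}

interface ReviewEvalResult {
  ece: number; // Expected Calibration Error
  mce: number; // Maximum Calibration Error
}

// The CLI formats exactly what the evaluator emitted.
function formatSummary(r: ExtractionEvalResult): string {
  return `P=${r.precision.toFixed(2)} R=${r.recall.toFixed(2)} F1=${r.f1Score.toFixed(2)}`;
}
```

With this, a renamed field becomes a compile error in the CLI instead of a runtime mismatch.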

3. Audio Generation Integration

Issue: The audio pipeline expects file paths, so the CLI needed to coordinate output directories.

Resolution: CLI handles directory creation, passes transcript path to pipeline.

Recommendation: Consider adding output directory override to AudioPipeline.run().

Test Case Management Recommendations

1. Directory Structure

Current structure works well:

test-data/mags/{MAG_ID}/
├── test-case.json      # Metadata
├── golden-record.json  # Ground truth
├── transcripts/
│   └── original.txt
└── audio/
    ├── clean.mp3
    └── business_call.mp3

Recommendation: Keep this structure. It's version-control friendly and self-documenting.
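The tree above maps to a small path helper. `magPaths` and its return fields are hypothetical names; only the directory layout comes from the structure shown:

```typescript
import { join } from "node:path";

// Resolve the conventional file locations for one MAG test case.
function magPaths(root: string, magId: string, profile = "clean") {
  const base = join(root, "mags", magId);
  return {
    testCase: join(base, "test-case.json"),       // metadata
    goldenRecord: join(base, "golden-record.json"), // ground truth
    transcript: join(base, "transcripts", "original.txt"),
    audio: join(base, "audio", `${profile}.mp3`), // e.g. clean, business_call
  };
}
```

Centralizing the layout in one helper keeps CLI, pipeline, and future dashboard code agreeing on where files live.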

2. Golden Record Creation Process

Current: Schema defined, but creation process not formalized.

Recommendation:

  1. Create an extraction guidelines document (what counts as a "price"?)
  2. Use Claude/GPT to draft an initial extraction from the transcript
  3. Human validates and corrects
  4. Second human spot-checks critical cases
  5. Store provenance (who created, when, methodology version)
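The provenance step could be a small stamp stored alongside each golden record; the field names below are illustrative assumptions:

```typescript
// Hypothetical provenance metadata for a golden record.
interface GoldenRecordProvenance {
  createdBy: string;          // human validator
  reviewedBy?: string;        // second spot-checker, if any
  createdAt: string;          // ISO 8601 timestamp
  draftModel?: string;        // LLM used for the initial draft, if any
  methodologyVersion: string; // which guidelines document was followed
}

function stampProvenance(
  createdBy: string,
  methodologyVersion: string,
  draftModel?: string,
): GoldenRecordProvenance {
  return { createdBy, methodologyVersion, draftModel, createdAt: new Date().toISOString() };
}
```

Storing this next to `golden-record.json` keeps the audit trail version-controlled with the data it describes.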

3. Test Suite Organization

Current: Single suite definition file with test case lists.

Recommendation for scale:

  • Group by market type (metals, energy, agricultural)
  • Tag-based filtering (edge-case, regression, standard)
  • Separate "quick" vs "full" suites for CI integration
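Those three groupings reduce to one filter over a slightly richer case reference; the shape below is a sketch, not the current suite schema:

```typescript
interface TestCaseRef {
  id: string;
  marketType: "metals" | "energy" | "agricultural";
  tags: string[]; // e.g. "edge-case", "regression", "standard", "quick"
}

// Select cases by market type and/or required tags; no filters means all cases.
function selectCases(
  all: TestCaseRef[],
  opts: { marketType?: string; tags?: string[] } = {},
): TestCaseRef[] {
  return all.filter(
    (c) =>
      (!opts.marketType || c.marketType === opts.marketType) &&
      (!opts.tags || opts.tags.every((t) => c.tags.includes(t))),
  );
}
```

A CI "quick" suite is then just `selectCases(all, { tags: ["quick"] })` rather than a separate suite file.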

4. Edge Case Coverage

Current: mangled/ directory for edge cases.

Recommendation: Formalize edge case categories:

  • no-market-data - Conversations with zero extractable prices
  • garbled-numbers - Partially corrupted numeric values
  • multiple-currencies - USD/EUR/GBP in same transcript
  • ambiguous-type - Unclear if bid/offer/transaction
  • nested-ranges - "14-15, possibly 14.5-15.5"

Dashboard Ticket Suggestions

Based on implementation experience, recommend these follow-up tickets:

1. Results Dashboard (High Priority)

Scope:

  • View run summaries with filtering by date/config
  • Drill down to per-MAG results
  • Compare two runs side-by-side
  • Visualize confidence calibration curves

Data Source: Read from test-data/results/ JSON files or Logfire.

2. Test Suite Manager (Medium Priority)

Scope:

  • CRUD for test cases
  • Golden record editor with transcript highlighting
  • Bulk import from CSV/Excel
  • Duplicate detection

3. Metrics Trend Tracking (Medium Priority)

Scope:

  • Track WER/accuracy over time (by git sha)
  • Regression alerts when metrics drop
  • Compare models (Gemini vs Claude vs GPT)

4. Profile Manager (Low Priority)

Scope:

  • Create/edit audio generation profiles
  • Preview effects settings
  • A/B test profile variants

Logfire Integration Feedback

What Works

  • Span-based tracing captures full pipeline flow
  • Attributes allow filtering by test case, provider, model
  • Existing src/lib/logfire.ts wrapper integrates cleanly

Gaps Identified

  1. Structured test case storage - Logfire is observability-focused, not a database
     Recommendation: Keep test cases in files, results summary in Logfire

  2. Query capabilities - Limited compared to SQL
     Recommendation: Export to JSON for complex analysis, use Logfire for real-time monitoring

  3. Retention - Check Logfire retention limits for long-term comparison
     Recommendation: Archive run summaries to files for permanent storage

Data Storage

  Data                    Storage
  ----------------------  --------------------------
  Test case definitions   Files (version controlled)
  Golden records          Files (version controlled)
  Run config              Files + Logfire spans
  Per-MAG results         Files + Logfire spans
  Aggregate metrics       Files + Logfire metrics
  LLM traces              Logfire only
  Real-time monitoring    Logfire

Next Steps

Immediate (This Sprint)

  1. ~~Generate audio files for MAG_001~~ (needs ElevenLabs API key)
  2. Run full pipeline end-to-end with real API keys
  3. Validate Logfire traces appear correctly

Short-Term (Next Sprint)

  1. Add 2-3 more MAG test cases
  2. Implement CI integration (run on PR)
  3. Create results dashboard ticket

Medium-Term

  1. Multi-language support (separate ticket)
  2. Additional transcription providers (Whisper, Deepgram)
  3. Additional extraction models (Claude, GPT)

Summary

The E2E eval pipeline implementation successfully:

  • ✅ Uses LangGraph for orchestration with service abstraction
  • ✅ Provides WER + domain metrics for transcription evaluation
  • ✅ Implements fuzzy matching for extraction evaluation
  • ✅ Includes confidence calibration for review evaluation
  • ✅ Supports entry at any pipeline step
  • ✅ Tracks configuration versions for reproducibility
  • ✅ Provides CLI for running and comparing evaluations

Key recommendations:

  1. Keep file-based test data storage (version controlled)
  2. Use Logfire for observability, not as primary data store
  3. Formalize golden record creation process
  4. Build results dashboard as next priority