MetricContext Documentation

This document explains all the variables in the MetricContext class, which is the data capsule that metrics receive for evaluation. For the list of metrics, how to run them, and their individual documentation, see metrics/README.md.

Overview

MetricContext is constructed by the MetricsRunner during metrics computation. It combines:

  1. Dataset variables from the EvaluationRecord (ground truth, agent config). Constant across runs.
  2. Result variables from raw logs (audit_log.json, pipecat_events.jsonl, elevenlabs_events.jsonl), processed by MetricsContextProcessor into structured variables that metrics can consume.

1. Dataset Variables

Record Identification

  • record_id: str - Unique identifier for this evaluation record.

Ground Truth (from Dataset)

These fields come from the EvaluationRecord in the dataset JSONL file.

  • user_goal: dict - Structured description of the user's goal, containing:
    • high_level_user_goal: Natural language summary of what the user wants to accomplish.
    • starting_utterance: The user's opening message.
    • decision_tree: Criteria and behaviors guiding the user simulator (must-have criteria, negotiation behavior, resolution/failure conditions, edge cases).
    • information_required: Key facts the user knows (names, dates, confirmation numbers, preferences, etc.).
    • Example:
      {
        "high_level_user_goal": "You want to move your AUS to LAX flight from March 20 to March 25...",
        "starting_utterance": "Hi, I need to change my flight to March 25.",
        "decision_tree": {"must_have_criteria": [...], "negotiation_behavior": [...], ...},
        "information_required": {"Passenger first name": "Samantha", ...}
      }
  • user_persona: str - Description of the user's persona/characteristics for the conversation.

Scenario Database State

These fields track the state of the scenario database before and after the conversation, used for deterministic metrics.

  • initial_scenario_db: dict[str, Any] - Scenario database state at the start of the conversation.
  • initial_scenario_db_hash: str - Hash of the initial state, for integrity verification.
  • final_scenario_db: dict[str, Any] - Actual scenario database state after the conversation ended. Used by task completion to evaluate what changes the agent made.
  • final_scenario_db_hash: str - Hash of the final state, for integrity verification.
  • expected_scenario_db: dict[str, Any] - Expected final state after successful task completion. Compare against final_scenario_db to assess task completion.
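As a rough illustration of how a deterministic metric might compare the two states, the sketch below reports the top-level keys whose values differ between expected_scenario_db and final_scenario_db. The helper name diff_scenario_db and the sample data are hypothetical; the actual task-completion metric may compare states differently.

```python
from typing import Any


def diff_scenario_db(expected: dict[str, Any], final: dict[str, Any]) -> dict[str, tuple[Any, Any]]:
    """Return {key: (expected_value, final_value)} for top-level keys that differ."""
    missing = object()  # sentinel for keys absent on one side
    diffs = {}
    for key in sorted(set(expected) | set(final)):
        exp_val = expected.get(key, missing)
        fin_val = final.get(key, missing)
        if exp_val != fin_val:
            diffs[key] = (
                None if exp_val is missing else exp_val,
                None if fin_val is missing else fin_val,
            )
    return diffs


# Hypothetical data, for illustration only.
expected_db = {"reservations": {"ABC123": {"date": "March 25"}}, "payments": []}
final_db = {"reservations": {"ABC123": {"date": "March 20"}}, "payments": []}

differences = diff_scenario_db(expected_db, final_db)
task_completed = not differences  # an empty diff means the states match
```

An empty result means the agent left the database in exactly the expected state; any entry points at a table that needs closer inspection.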

Agent Configuration

These fields contain the agent's configuration used during the conversation.

  • agent_role: str - The role/persona of the agent (e.g., "customer service representative").
  • agent_instructions: str - The system instructions given to the agent.
  • agent_tools: list[dict] - List of tools available to the agent with their schemas.
  • current_date_time: str - The simulated current date/time for the conversation.

2. Result Variables

Why Multiple Representations of the Same Conversation?

The conversation is captured in several overlapping forms — conversation_trace, per-role turn dictionaries (intended_*, transcribed_*), and audio timestamps — because no single representation serves all metrics.

The key distinction is between intended and transcribed turns:

  • Intended turns: What the speaker intended to say

    • For assistant: The text sent to the TTS engine before audio generation, or the text output returned by audio-native (S2S, S2T+TTS) models.
    • For user: The text the user simulator was instructed to speak.
  • Transcribed turns: What was actually heard and transcribed by speech recognition

    • For assistant: What the user simulator heard when the assistant spoke (via their STT).
    • For user: What the STT heard when the user spoke. For audio-native models, most endpoints return a transcript of the user's speech, but it is generated by a secondary STT service and was not used by the audio-native model itself, so it should not be used for evaluation.

With that in mind, here is when to use each representation:

  • conversation_trace is the default choice for most metrics (faithfulness, conciseness, conversation progression, etc.). It provides a unified, chronological view that automatically handles architecture differences (cascade vs audio-native) and includes tool calls. For assistant turns it uses intended text; for user turns it uses transcribed text in cascade but intended text in audio-native mode (since audio-native transcriptions are unreliable — see audio-native section below). However, it cannot be used to assess TTS or STT performance, since it doesn't expose both sides of the comparison.
  • intended_* turns are needed when evaluating speech fidelity: comparing what the speaker intended to say against what was actually heard (the corresponding transcribed_* turns).
  • transcribed_* turns are needed when evaluating STT accuracy in cascade systems — comparing what the speaker said vs. what the STT transcribed. For audio-native models, this field is unreliable (see the audio-native section below).
  • Audio timestamps are needed for metrics that analyze timing (turn-taking, response speed) rather than content.

In short: conversation_trace is the safest default, but the other variables exist for metrics that need to compare different stages of the speech pipeline (intended → spoken → heard). See more details below.

Per-Role Turn Data (Indexed by Turn ID)

These dictionaries map turn IDs (integers) to content for each speaker. See Why Multiple Representations? above for guidance on which to use.

How Turn IDs Are Determined

Turn IDs are assigned sequentially (0, 1, 2, ...) based on when events occur in the conversation timeline.

Turn Indexing Convention:

  • Assistant turn 0 = The greeting (first assistant utterance)
  • User turn 1 = First user utterance
  • Assistant turn 1 = Response to user turn 1
  • User turn 2 = Second user utterance
  • Assistant turn 2 = Response to user turn 2
  • And so on...

This indexing means that comparing the same turn ID across user and assistant gives you the assistant's reply to that user turn.
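A minimal sketch of that lookup, assuming the per-role dictionaries described in this document. The helper name pair_turns is hypothetical, and the sample data reuses the example values from the Variables section.

```python
def pair_turns(user_turns: dict[int, str], assistant_turns: dict[int, str]) -> list[tuple[int, str, str]]:
    """Pair each user turn with the assistant reply that shares its turn ID."""
    pairs = []
    for turn_id in sorted(user_turns):
        reply = assistant_turns.get(turn_id)
        if reply is not None:  # the assistant may never have replied (e.g., the call ended)
            pairs.append((turn_id, user_turns[turn_id], reply))
    return pairs


user_turns = {1: "I need to change my flight", 2: "I want to fly next week instead"}
assistant_turns = {0: "Hello, how can I help you?", 1: "Sure, let me help with that."}

pairs = pair_turns(user_turns, assistant_turns)
```

Turn 0 (the greeting) has no user counterpart and user turn 2 has no reply yet, so only turn 1 is paired.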

Turn Advancement:

  • Turns advance primarily when user audio starts (detected via ElevenLabs audio_start events)
  • The system tracks whether the assistant has spoken in the current turn before advancing
  • This ensures each turn after the greeting represents a complete exchange: user speaks → assistant responds → the next user audio start opens a new turn

Special Cases:

  • Interruptions: When a user interrupts the assistant (starts speaking while assistant is still talking), the current turn is labeled as interrupted and a new turn begins for the user's interrupting speech. When the assistant interrupts the user, the turn is not advanced.
  • Empty sessions: If user audio starts but no speech is detected (false positive), the turn is rolled back to avoid creating empty turns.
  • Delayed transcripts: Transcripts that arrive late (after the turn has advanced) are assigned, where possible, to the turn in which their corresponding audio occurred rather than to the current turn.

For more details, see the "Log Processing Challenges and Robustness" section below.
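The basic advancement rule can be sketched as a small state machine. This is a simplified illustration only: the event labels ("assistant_speech", "user_audio_start") and the function name are hypothetical, and the sketch deliberately leaves out interruptions, rollbacks, and delayed transcripts.

```python
def assign_turn_ids(events: list[str]) -> list[int]:
    """Assign a turn ID to each event.

    Simplified rule: the turn advances when user audio starts, but only if
    the assistant has already spoken in the current turn.
    """
    turn_id = 0
    assistant_spoke = False
    assigned = []
    for event in events:
        if event == "user_audio_start" and assistant_spoke:
            turn_id += 1
            assistant_spoke = False
        elif event == "assistant_speech":
            assistant_spoke = True
        assigned.append(turn_id)
    return assigned


# Hypothetical event sequence: greeting, user speaks, assistant replies,
# user speaks twice in a row, assistant replies.
events = [
    "assistant_speech",
    "user_audio_start",
    "assistant_speech",
    "user_audio_start",
    "user_audio_start",
    "assistant_speech",
]
turn_ids = assign_turn_ids(events)
```

Note that the second consecutive user audio start does not open a new turn, because the assistant has not yet spoken in the current one.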

Variables

  • transcribed_assistant_turns: dict[int, str] - What was actually heard when the assistant spoke (transcribed by the user simulator STT).
    • Example: {0: "Hello, how can I help you today?", 1: "I can help you change your flight.", ...}
  • transcribed_user_turns: dict[int, str] - What was actually transcribed when the user spoke (by the agent STT). Unreliable for audio-native models — use intended_user_turns instead (see audio-native section below).
    • Example: {1: "I need to change my flight", 2: "I want to fly next week instead", ...}
  • intended_assistant_turns: dict[int, str] - What the assistant intended to say (text sent to TTS engine on the agent side).
    • Example: {0: "Hello, how can I help you?", 1: "Sure, let me help with that.", ...}
  • intended_user_turns: dict[int, str] - What the user simulator intended to say (ground truth user intent). For audio-native models, use this instead of transcribed_user_turns.
    • Example: {1: "I need to change my flight", 2: "I want to fly next week instead", ...}
  • audio_timestamps_assistant_turns: dict[int, list[tuple[float, float]]] - Audio timing segments (start, end) for each assistant turn, in seconds from conversation start. Multiple segments per turn are possible (e.g., interruptions). Sourced from ElevenLabs speech detection events.
    • Example: {0: [(0.0, 3.2)], 1: [(5.1, 8.7)], ...}
  • audio_timestamps_user_turns: dict[int, list[tuple[float, float]]] - Same format as above, for user turns. Sourced from ElevenLabs speech detection events.
    • Example: {1: [(3.5, 5.0)], 2: [(9.0, 11.2)], ...}
  • assistant_interrupted_turns: set[int] - Set of turn IDs where the assistant interrupted the user (started speaking while user audio was still active).
  • user_interrupted_turns: set[int] - Set of turn IDs where the user interrupted the assistant (started speaking while assistant audio was still active).
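To show how the intended and transcribed dictionaries can be combined, here is a sketch of a per-turn word error rate between intended_assistant_turns and transcribed_assistant_turns. The function names are hypothetical, and the text normalization (lowercasing and whitespace splitting, with punctuation left in place) is an assumption; a real speech-fidelity metric would likely normalize more carefully.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        return 0.0 if not hyp else 1.0
    # Classic dynamic-programming edit distance, kept one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(ref)


def per_turn_wer(intended: dict[int, str], transcribed: dict[int, str]) -> dict[int, float]:
    """Compute WER only for turn IDs present in both dictionaries."""
    shared = sorted(intended.keys() & transcribed.keys())
    return {turn_id: word_error_rate(intended[turn_id], transcribed[turn_id]) for turn_id in shared}


intended = {0: "Hello, how can I help you?", 1: "Sure, let me help with that."}
transcribed = {0: "Hello, how can I help you today?", 1: "Sure, let me help with that."}

scores = per_turn_wer(intended, transcribed)
```

Restricting the comparison to shared turn IDs keeps the sketch tolerant of the off-by-one discrepancies described later in this document.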

Conversation Trace

  • conversation_trace: list[dict] - A unified, chronological view of the entire conversation — user/assistant turns and tool calls/responses — in the order they logically occurred from the assistant's perspective. This is the recommended source of truth for understanding conversation flow, and is used by most judge metrics (faithfulness, conciseness, conversation progression, etc.).

    Format:

    [
      {"role": "assistant", "content": "Hello!", "type": "intended", "turn_id": 0},
      {"role": "user", "content": "I need help", "type": "transcribed", "turn_id": 1},
      {"role": "assistant", "content": "Let me check", "type": "intended", "turn_id": 1},
      {"tool_name": "get_reservation", "parameters": {...}, "type": "tool_call", "turn_id": 1},
      {"tool_name": "get_reservation", "tool_response": {...}, "type": "tool_response", "turn_id": 1},
      {"role": "assistant", "content": "I found your reservation", "type": "intended", "turn_id": 1}
    ]
    

Key differences from turn-indexed dictionaries:

  • Includes tool calls: Unlike the per-role turn dictionaries, this includes tool call and tool response entries interleaved with text.

  • Chronological order: Shows the exact sequence of events as they happened. A single turn may contain multiple user or assistant entries (e.g., if speech was split by tool calls, or if interruptions occurred).

  • Grouped by speaker: Consecutive entries from the same speaker are grouped together.

  • type field indicates source: Each entry has a type field:

    • "intended": For assistant turns (TTS text), and for user turns in audio-native mode (what the user simulator was instructed to say).
    • "transcribed": For user turns in cascade (what the assistant's STT heard).
    • "tool_call": Tool invocation with parameters.
    • "tool_response": Tool execution result.

    Note on reconciliation: The postprocessor reconciles the trace with voice log data, which can change types or insert entries: (1) the assistant greeting is always placed first — if missing or out of order, it is created/moved with type="intended"; (2) if the user's final turn arrived after the last audit-log entry, it is appended with type="intended"; (3) if transcription is missing (e.g., conversation ended before STT finished), the intended text is used as backfill. This means some entries may show type="intended" where you might expect "transcribed".
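As an illustration of consuming the trace, the sketch below flattens it into a plain-text transcript, the kind of input a judge prompt might use. The function name render_trace and the output layout are hypothetical; only the entry keys come from the format shown above.

```python
import json


def render_trace(trace: list[dict]) -> str:
    """Flatten conversation_trace into a plain-text transcript."""
    lines = []
    for entry in trace:
        entry_type = entry.get("type")
        if entry_type == "tool_call":
            params = json.dumps(entry.get("parameters", {}))
            lines.append(f"[tool call] {entry['tool_name']}({params})")
        elif entry_type == "tool_response":
            response = json.dumps(entry.get("tool_response", {}))
            lines.append(f"[tool response] {entry['tool_name']}: {response}")
        else:  # "intended" or "transcribed" speech entries
            lines.append(f"{entry['role']}: {entry['content']}")
    return "\n".join(lines)


trace = [
    {"role": "assistant", "content": "Hello!", "type": "intended", "turn_id": 0},
    {"role": "user", "content": "I need help", "type": "transcribed", "turn_id": 1},
    {"tool_name": "get_reservation", "parameters": {"confirmation_number": "ABC123"},
     "type": "tool_call", "turn_id": 1},
    {"tool_name": "get_reservation", "tool_response": {"status": "success"},
     "type": "tool_response", "turn_id": 1},
]

transcript = render_trace(trace)
```

Branching on the type field first matters because tool entries carry no role or content keys.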

Tool Call Data

Extracted from the audit log.

  • tools_called: list[str] - List of unique tool names that were called (no duplicates).
  • tool_params: list[dict] - List of all tool calls with their parameters.
    • Format: [{"tool_name": "get_reservation", "tool_parameters": {"confirmation_number": "ABC123"}}, ...]
  • tool_responses: list[dict] - List of all tool responses.
    • Format: [{"tool_name": "get_reservation", "tool_response": {"status": "success", "reservation": {...}}}, ...]
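A short sketch of how these lists might be summarized, for example to spot redundant calls. The helper name count_tool_calls and the change_flight tool are hypothetical; the entry shape follows the tool_params format above.

```python
from collections import Counter


def count_tool_calls(tool_params: list[dict]) -> Counter:
    """Count how many times each tool was invoked."""
    return Counter(call["tool_name"] for call in tool_params)


# Hypothetical data in the tool_params format.
tool_params = [
    {"tool_name": "get_reservation", "tool_parameters": {"confirmation_number": "ABC123"}},
    {"tool_name": "get_reservation", "tool_parameters": {"confirmation_number": "ABC123"}},
    {"tool_name": "change_flight", "tool_parameters": {"confirmation_number": "ABC123"}},
]

counts = count_tool_calls(tool_params)
unique_tools = sorted(counts)  # comparable to tools_called, which holds unique names
repeated = [name for name, n in counts.items() if n > 1]
```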

Conversation Metadata

Counts and flags computed during benchmark execution.

  • num_turns: int - Total number of conversation turns.
  • num_assistant_turns: int - Total number of assistant turns in the conversation.
  • num_user_turns: int - Total number of user turns in the conversation.
  • num_tool_calls: int - Total number of tool calls made by the assistant.
  • conversation_finished: bool - Whether the conversation ended naturally with a goodbye message.
  • conversation_ended_reason: Optional[str] - Reason the conversation ended. This is used to determine if the conversation is valid.
    • "goodbye": User or assistant said goodbye
    • "timeout": Conversation exceeded max duration
    • "transfer": Assistant transferred to live agent
    • "error": An error occurred
  • duration_seconds: float - Total duration of the conversation in seconds.
  • pipeline_type: PipelineType - The pipeline architecture used (CASCADE, AUDIO_LLM, or S2S). Access context.is_audio_native for a convenience boolean that returns True for both AUDIO_LLM and S2S.
  • latency_assistant_turns: dict[int, float] - Per-turn latency in seconds (user speech end to assistant speech start), keyed by turn ID.
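For instance, latency_assistant_turns could be summarized as follows. This is a sketch with hypothetical sample values and a hypothetical helper name; it is not how the built-in response-speed metric is necessarily implemented.

```python
from statistics import mean, median
from typing import Optional


def summarize_latency(latency_by_turn: dict[int, float]) -> Optional[dict]:
    """Summarize per-turn latency; returns None when no latencies were recorded."""
    if not latency_by_turn:
        return None
    values = list(latency_by_turn.values())
    return {
        "mean": mean(values),
        "median": median(values),
        "max": max(values),
        # Turn ID with the slowest assistant response.
        "slowest_turn": max(latency_by_turn, key=latency_by_turn.get),
    }


# Hypothetical latencies in seconds, keyed by turn ID.
latencies = {1: 0.8, 2: 1.4, 3: 0.9}
summary = summarize_latency(latencies)
```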

File Paths

These are absolute paths to output files saved during benchmark execution.

  • output_dir: str - Path to the directory containing all outputs for this record.
  • audio_assistant_path: Optional[str] - Path to the assistant's audio channel (mono WAV file).
  • audio_user_path: Optional[str] - Path to the user simulator's audio channel (mono WAV file).
  • audio_mixed_path: Optional[str] - Path to the mixed stereo audio (left=assistant, right=user).
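Since these are plain WAV files, they can be inspected with Python's standard wave module. The sketch below uses a hypothetical helper, describe_wav, and demonstrates it on a synthetic file of silence rather than real benchmark output; the file name and audio parameters in the demo are assumptions.

```python
import os
import tempfile
import wave


def describe_wav(path: str) -> dict:
    """Return channel count, sample rate, and duration of a WAV file."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate": rate,
            "duration_seconds": frames / rate,
        }


# Demo on a synthetic file: one second of 16 kHz stereo silence (16-bit samples).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "audio_mixed.wav")
    with wave.open(demo_path, "wb") as out:
        out.setnchannels(2)
        out.setsampwidth(2)
        out.setframerate(16000)
        out.writeframes(b"\x00" * (16000 * 2 * 2))  # frames * channels * bytes per sample
    info = describe_wav(demo_path)
```

On a real record, a metric would pass context.audio_mixed_path (after checking it is not None) and could compare the result against duration_seconds.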

Audio-Native (S2S, S2T+TTS) vs Cascade Architecture

This section explains the architectural differences that affect which variables are reliable. For a quick summary, see Why Multiple Representations?.

"Audio-native" is an umbrella term for architectures where the model processes raw audio input directly, as opposed to cascade where the model receives STT text. This includes Speech-to-Speech (S2S) and Speech-to-Text + TTS (S2T+TTS) architectures.

Pipelines

Cascade: User audio → Agent STT → Text → LLM → Text → TTS → Assistant audio

The LLM processes transcribed text, so transcribed_user_turns reflects what the assistant actually saw.

Audio-native (S2S): User audio → Raw audio directly to model → Assistant audio

Audio-native (S2T+TTS): User audio → Raw audio directly to model → Text → TTS → Assistant audio

The model processes raw audio. The audit log may contain a transcript from the service's own secondary STT, but this is not what the model used — it's just for reference. This is why transcribed_user_turns is unreliable for audio-native models and intended_user_turns should be used instead.

Check context.pipeline_type to determine which mode was used, or context.is_audio_native for a boolean grouping of S2S and AUDIO_LLM.
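Conceptually, the grouping amounts to the following. This is a standalone sketch, not the library's definition: the member names come from this document, but the enum's string values and the free-function form are assumptions.

```python
from enum import Enum


class PipelineType(Enum):
    # Member names are taken from this document; the string values are assumed.
    CASCADE = "cascade"
    AUDIO_LLM = "audio_llm"
    S2S = "s2s"


def is_audio_native(pipeline_type: PipelineType) -> bool:
    """True for architectures where the model consumes raw audio directly."""
    return pipeline_type in (PipelineType.AUDIO_LLM, PipelineType.S2S)


native_flags = {p.name: is_audio_native(p) for p in PipelineType}
```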

Writing Audio-Native-Aware Metrics

If your metric needs user text directly (rather than via conversation_trace, which handles this automatically), branch on context.is_audio_native:

async def compute(self, context: MetricContext) -> MetricScore:
    # Option 1: manual branching on the pipeline architecture
    user_turns = context.intended_user_turns if context.is_audio_native else context.transcribed_user_turns

    # Option 2: use conversation_trace (handles audio-native vs cascade automatically)
    for entry in context.conversation_trace:
        # Tool call/response entries have no "role" key (see the trace format), so use .get()
        if entry.get("role") == "user":
            # type is "intended" for audio-native, "transcribed" for cascade
            user_text = entry["content"]

Log Processing Challenges and Robustness

The Challenge of Joining Heterogeneous Logs

The MetricsContextProcessor must join logs from three independent sources that were never designed to work together:

  • ElevenLabs logs: User simulator events (intended speech, what was heard, audio timestamps)
  • Pipecat logs: Assistant framework events (TTS text, STT transcripts, turn boundaries). No tool calls.
  • Audit logs: All LLM calls and tool calls. No STT/TTS information.

Each source has its own timing, format, and quirks. Audio-native and cascade architectures also produce different event structures, adding further complexity.

Common Log Issues

Real-world logs frequently exhibit:

  1. Delayed or out-of-order timestamps due to async processing
  2. Duplicated entries from the same event logged multiple times
  3. False speech detection (STT triggers on silence or background noise)
  4. Missing events due to errors or early termination

These can cause alignment discrepancies between variables — e.g., turn counts off by one, tool calls in unexpected positions, or audio timestamps not matching transcript boundaries.
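Two of these issues, duplicates and out-of-order timestamps, can be illustrated with a small normalization pass. This is a sketch under assumed field names (timestamp, type, text); the real log schemas and the processor's actual logic may differ.

```python
def normalize_events(events: list[dict]) -> list[dict]:
    """Drop exact duplicates, then sort the remaining events by timestamp."""
    seen = set()
    unique = []
    for event in events:
        key = (event["timestamp"], event["type"], event.get("text"))
        if key not in seen:
            seen.add(key)
            unique.append(event)
    # sorted() is stable, so events sharing a timestamp keep their logged order.
    return sorted(unique, key=lambda event: event["timestamp"])


# Hypothetical events: one duplicate, and one logged out of order.
raw_events = [
    {"timestamp": 5.1, "type": "tts_text", "text": "Sure, let me help with that."},
    {"timestamp": 3.5, "type": "audio_start", "text": None},
    {"timestamp": 5.1, "type": "tts_text", "text": "Sure, let me help with that."},
]

ordered = normalize_events(raw_events)
```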

Robustness Measures

The processor handles these through:

  1. Multiple fallback paths when preferred logs are missing or malformed
  2. Timestamp-based sorting to reconstruct chronological order
  3. Fuzzy matching for minor text differences between sources
  4. Interruption detection from audio overlap
  5. Turn reconciliation merging data from multiple sources
  6. Graceful degradation when some data sources are incomplete
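Two of these measures, fuzzy matching and interruption detection from audio overlap, can be sketched with the standard library. The function names, the 0.9 similarity threshold, and the choice to report (assistant turn, user turn) pairs are all assumptions made for illustration; the processor's own rules are not specified here.

```python
from difflib import SequenceMatcher

Segment = tuple[float, float]  # (start, end) in seconds


def texts_match(a: str, b: str, threshold: float = 0.9) -> bool:
    """Treat two strings as the same utterance if they are similar enough."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def find_user_interruptions(
    assistant_segments: dict[int, list[Segment]],
    user_segments: dict[int, list[Segment]],
) -> list[tuple[int, int]]:
    """Return (assistant_turn_id, user_turn_id) pairs where user speech
    starts while an assistant segment is still active."""
    pairs = []
    for a_turn, a_segs in sorted(assistant_segments.items()):
        for u_turn, u_segs in sorted(user_segments.items()):
            if any(
                a_start < u_start < a_end
                for a_start, a_end in a_segs
                for u_start, _u_end in u_segs
            ):
                pairs.append((a_turn, u_turn))
    return pairs


# Hypothetical timestamps: user turn 2 starts at 7.9s, before the assistant
# finishes its turn-1 segment at 8.7s.
assistant_audio = {0: [(0.0, 3.2)], 1: [(5.1, 8.7)]}
user_audio = {1: [(3.5, 5.0)], 2: [(7.9, 11.2)]}

interruptions = find_user_interruptions(assistant_audio, user_audio)
same_text = texts_match("I need to change my flight", "I need to change my flight.")
```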

Testing and Validation

The processor is extensively tested with real-world scenarios in tests/fixtures/processor_histories.json, covering normal conversations, interruptions (both directions), delayed events, empty speech, tool calls between turns, and complex multi-turn exchanges. Many samples have been manually verified.

Expected Discrepancies

Minor discrepancies are still possible (turn IDs off by one, audio timestamps not perfectly aligned with transcripts, text in slightly different positions across representations), but these are infrequent. Metrics should be designed to tolerate them.


Data Flow Summary

Benchmark Execution:
  ├─ EvaluationRecord (dataset.jsonl)
  │  ├─ user_goal, user_persona, scenario_db → MetricContext
  │  └─ Feeds to AssistantServer + UserSimulator
  │
  ├─ AssistantServer writes:
  │  ├─ audit_log.json (tool calls, user/assistant turns)
  │  ├─ pipecat_events.jsonl (TTS text, turn boundaries)
  │  ├─ response_latencies.json (response speed data)
  │  └─ audio files (assistant, user, mixed)
  │
  └─ UserSimulator writes:
     └─ elevenlabs_events.jsonl (speech transcripts, audio timing)

Metrics Computation:
  ├─ MetricsRunner loads:
  │  ├─ Ground truth from dataset
  │  ├─ Scenario database states (initial, final, expected)
  │  ├─ Agent configuration
  │  └─ ConversationResult (result.json)
  │
  ├─ MetricsContextProcessor processes:
  │  ├─ Builds unified history from all 3 log sources
  │  ├─ Extracts turns in single pass through timeline:
  │  │  ├─ audit_log.json → transcribed_user_turns, tool calls/responses
  │  │  ├─ pipecat_events.jsonl → intended_assistant_turns
  │  │  └─ elevenlabs_events.jsonl → transcribed_assistant_turns, 
  │  │                                intended_user_turns, audio timestamps
  │  ├─ Detects interruptions from audio overlap
  │  ├─ Reconciles conversation_trace with voice logs
  │  └─ Loads response latencies
  │
  └─ Creates MetricContext with:
     ├─ Ground truth fields (user_goal, user_persona)
     ├─ Scenario database states
     ├─ Agent configuration
     ├─ Basic statistics (num_turns, num_tool_calls, duration)
     ├─ File paths (audio files, output_dir)
     ├─ Per-role turn data (intended/transcribed for both roles)
     ├─ Audio timestamps per turn
     ├─ Tool call data (params, responses)
     ├─ Conversation trace (unified timeline)
     ├─ Interruption tracking
     └─ Response speed latencies

Helper Methods

to_dict() -> dict[str, Any]

  • Converts the entire MetricContext to a serializable dictionary (used for storing context in metrics.json).
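One practical consequence of JSON serialization is worth keeping in mind when reading a stored context back: JSON object keys are always strings, tuples become lists, and sets are not serializable without conversion. The sketch below demonstrates this with a hypothetical converter, to_jsonable, and hypothetical field values; it does not reproduce what to_dict() itself does.

```python
import json


def to_jsonable(value):
    """Recursively convert sets and tuples into JSON-friendly lists."""
    if isinstance(value, dict):
        return {key: to_jsonable(item) for key, item in value.items()}
    if isinstance(value, (set, frozenset)):
        return sorted(to_jsonable(item) for item in value)
    if isinstance(value, (list, tuple)):
        return [to_jsonable(item) for item in value]
    return value


# Hypothetical subset of context fields.
context_fields = {
    "record_id": "rec-001",
    "user_interrupted_turns": {2},
    "audio_timestamps_user_turns": {1: [(3.5, 5.0)]},
}

restored = json.loads(json.dumps(to_jsonable(context_fields)))

# Turn IDs come back as strings, so convert them before indexing by int.
timestamps = {int(k): v for k, v in restored["audio_timestamps_user_turns"].items()}
```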

For examples of how metrics use MetricContext, see the CLAUDE.md "Adding a New Metric" section or the individual metric documentation.