End-to-end evaluation framework for conversational voice agents. It runs bot-to-bot audio simulations and scores each agent on task accuracy (EVA-A) and interaction experience (EVA-X). Includes per-scenario backend state, accent and noise perturbations, and 213 scenarios across airline, healthcare HR, and enterprise IT domains.