Run a benchmark in three steps:
1. Select a use case: choose the municipal AI application you want to evaluate.
2. Select models to compare: choose foundation models commonly used in 311 chatbot deployments.
3. Run the benchmark.

Test scenarios: 10 | Models selected: 2 | Total API calls: 20
Methodology
📊 Scoring Framework
Each response is evaluated across six dimensions, weighted by importance to constituent service quality:
- Empathy & Tone: 15%
- Accuracy: 25%
- Actionability: 20%
- Equity Awareness: 15%
- Efficiency: 10%
- Harm Avoidance: 15%
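As a minimal sketch, the six weights above can be folded into a single composite score. The dimension keys and the 0-100 per-dimension scale are illustrative assumptions, not the benchmark's published schema:

```python
# Weighted composite score for one model response.
# Weights match the scoring framework above; the 0-100 per-dimension
# scale is an assumption for illustration.
WEIGHTS = {
    "empathy_tone": 0.15,
    "accuracy": 0.25,
    "actionability": 0.20,
    "equity_awareness": 0.15,
    "efficiency": 0.10,
    "harm_avoidance": 0.15,
}

def composite_score(dimension_scores: dict[str, float]) -> float:
    """Combine per-dimension scores (0-100) into one weighted score."""
    missing = WEIGHTS.keys() - dimension_scores.keys()
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    return sum(WEIGHTS[d] * dimension_scores[d] for d in WEIGHTS)

example = {
    "empathy_tone": 80, "accuracy": 90, "actionability": 85,
    "equity_awareness": 70, "efficiency": 95, "harm_avoidance": 100,
}
print(round(composite_score(example), 2))
```

Because the weights sum to 1.0, the composite stays on the same 0-100 scale as the individual dimensions.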
🎯 Test Design
Scenarios are drawn from real 311 service categories and calibrated across difficulty levels.
- 10 scenario categories covering common 311 requests
- Difficulty stratification (simple → high-stakes)
- Equity considerations embedded in evaluation
- Response elements validated by experts
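A scenario shaped by the design points above might be represented as follows. All field names, difficulty tiers, and the sample request are assumptions for illustration, not the benchmark's actual schema:

```python
# Illustrative scenario record combining category, difficulty tier,
# embedded equity notes, and expert-validated response elements.
# Field names and tiers are assumptions, not the published schema.
from dataclasses import dataclass, field

DIFFICULTY_LEVELS = ("simple", "moderate", "complex", "high_stakes")

@dataclass
class Scenario:
    category: str                  # e.g. "missed trash pickup" (hypothetical)
    difficulty: str                # one of DIFFICULTY_LEVELS
    prompt: str                    # the constituent's 311 request
    equity_notes: list[str] = field(default_factory=list)
    expected_elements: list[str] = field(default_factory=list)

    def __post_init__(self):
        if self.difficulty not in DIFFICULTY_LEVELS:
            raise ValueError(f"unknown difficulty: {self.difficulty!r}")

s = Scenario(
    category="missed trash pickup",
    difficulty="simple",
    prompt="My trash wasn't picked up this week. What do I do?",
    equity_notes=["do not assume the resident owns a car"],
    expected_elements=["acknowledgment", "next pickup date", "escalation path"],
)
```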
⚖️ Equity Focus
Every scenario includes explicit equity considerations for evaluating whether responses avoid harmful assumptions.
- No assumption of car ownership or mobility
- Recognition of housing instability
- Awareness of language barriers
- Sensitivity to power dynamics
🔬 Reproducibility
All test scenarios, scoring rubrics, and evaluation criteria are published openly.
- Open test set (no hidden scenarios)
- Deterministic prompting (low temperature)
- Published reference standards
- Version-controlled methodology
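One way to make the deterministic-prompting and version-control points concrete is to freeze the run configuration and fingerprint it. The parameter names, test-set name, and rubric version below are generic assumptions, not any specific provider's API or this project's actual identifiers:

```python
# Sketch of a reproducible run configuration: near-zero temperature
# for deterministic-as-possible sampling, plus a stable hash so each
# benchmark run can be pinned to an exact methodology version.
# All names here are illustrative assumptions.
import hashlib
import json

RUN_CONFIG = {
    "temperature": 0.0,            # low temperature, per the methodology
    "top_p": 1.0,
    "max_tokens": 512,
    "scenario_set": "311-core-v1",  # open test set (name assumed)
    "rubric_version": "2024-06",    # published reference standard (assumed)
}

def config_fingerprint(config: dict) -> str:
    """Stable hash of the run config, suitable for version-control logs."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]

print(config_fingerprint(RUN_CONFIG))
```

Sorting keys before hashing means the fingerprint depends only on the configuration's contents, not on dictionary ordering, so two runs with the same settings always report the same identifier.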