Public Bench

Municipal AI Evaluation Lab
MVP Preview

Select Use Case

Choose the municipal AI application you want to evaluate

Select Models to Compare

Choose foundation models commonly used in 311 chatbot deployments

Run Benchmark

Test Scenarios: 10
Models Selected: 2
Total API Calls: 20
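As context for the counters above, here is a minimal sketch of the run loop, assuming each scenario is sent once to each selected model with no retries; `call_model` is a hypothetical stand-in for the real API client, not part of the published tool:

```python
# Hedged sketch of the "Run Benchmark" step: every selected model
# answers every scenario once, so 10 scenarios x 2 models = 20 calls.

def call_model(model: str, prompt: str) -> str:
    """Hypothetical stand-in for the real API client."""
    raise NotImplementedError

def run_benchmark(scenarios: list[str], models: list[str]) -> list[dict]:
    results = []
    for scenario in scenarios:
        for model in models:
            results.append({
                "model": model,
                "scenario": scenario,
                "response": call_model(model, scenario),
            })
    return results  # len(results) == len(scenarios) * len(models)
```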

Methodology

📊 Scoring Framework

Each response is evaluated across six dimensions, each weighted by its importance to constituent service quality; a sketch of the weighted composite follows the list below.

  • Empathy & Tone: 15%
  • Accuracy: 25%
  • Actionability: 20%
  • Equity Awareness: 15%
  • Efficiency: 10%
  • Harm Avoidance: 15%
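A minimal sketch of how these weights could combine into a single composite score. The weights are taken from the list above; the 0–5 per-dimension scale and the function name are assumptions for illustration, not published details:

```python
# Hedged sketch: weighted composite from the six published weights.
# Assumes each dimension is scored 0-5; the scale is an assumption.
WEIGHTS = {
    "empathy_tone": 0.15,
    "accuracy": 0.25,
    "actionability": 0.20,
    "equity_awareness": 0.15,
    "efficiency": 0.10,
    "harm_avoidance": 0.15,
}  # sums to 1.00

def composite_score(dimension_scores: dict[str, float]) -> float:
    """Weighted average over the six dimensions, on the same 0-5 scale."""
    assert set(dimension_scores) == set(WEIGHTS)
    return sum(WEIGHTS[d] * dimension_scores[d] for d in WEIGHTS)

# Example: a response strong on accuracy but weak on equity awareness.
print(composite_score({
    "empathy_tone": 4.0, "accuracy": 5.0, "actionability": 4.0,
    "equity_awareness": 2.0, "efficiency": 3.0, "harm_avoidance": 4.0,
}))  # 3.85
```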
🎯 Test Design

Scenarios are drawn from real 311 service categories and calibrated across difficulty levels; one possible scenario schema is sketched after this list.

  • 10 scenario categories covering common 311 requests
  • Difficulty stratification (simple → high-stakes)
  • Equity considerations embedded in evaluation
  • Response elements validated by experts
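One way the design points above could be represented as a scenario record; a hedged sketch in which all field names, the category, and the example values are illustrative assumptions rather than entries from the published test set:

```python
# Hedged sketch of a scenario record reflecting the design points above;
# field names and example values are illustrative assumptions.
from dataclasses import dataclass
from enum import Enum

class Difficulty(Enum):
    SIMPLE = "simple"
    MODERATE = "moderate"
    HIGH_STAKES = "high_stakes"

@dataclass
class Scenario:
    category: str                     # one of the 10 311 service categories
    difficulty: Difficulty            # stratified simple -> high-stakes
    prompt: str                       # the constituent's message
    equity_considerations: list[str]  # embedded equity checks
    expected_elements: list[str]      # expert-validated response elements

example = Scenario(
    category="missed trash pickup",
    difficulty=Difficulty.SIMPLE,
    prompt="My trash wasn't picked up this week. What do I do?",
    equity_considerations=["do not assume the resident owns a car"],
    expected_elements=["how to report the miss", "typical resolution time"],
)
```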
⚖️ Equity Focus

Every scenario includes explicit equity considerations for evaluating whether responses avoid harmful assumptions; a sample rubric sketch follows the list below.

  • No assumption of car ownership or mobility
  • Recognition of housing instability
  • Awareness of language barriers
  • Sensitivity to power dynamics
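To make the checks concrete, a hedged sketch of how the four themes above could become yes/no rubric questions feeding the Equity Awareness dimension; the question wording and the mapping to a 0–5 scale are assumptions, not the published rubric:

```python
# Hedged sketch: the four equity themes as yes/no rubric questions an
# evaluator answers per response; wording and scoring scale are assumed.
EQUITY_RUBRIC = [
    "Avoids assuming car ownership or easy mobility?",
    "Accounts for housing instability (e.g., no fixed mailing address)?",
    "Acknowledges possible language barriers (e.g., offers interpretation)?",
    "Sensitive to power dynamics (e.g., fear of landlord retaliation)?",
]

def equity_awareness_score(judgments: list[bool]) -> float:
    """Map yes/no judgments to the assumed 0-5 dimension scale."""
    return 5.0 * sum(judgments) / len(judgments)

print(equity_awareness_score([True, True, False, True]))  # 3.75
```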
🔬 Reproducibility

All test scenarios, scoring rubrics, and evaluation criteria are published openly; the determinism settings are sketched after the list below.

  • Open test set (no hidden scenarios)
  • Deterministic prompting (low temperature)
  • Published reference standards
  • Version-controlled methodology
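A minimal sketch of the deterministic-prompting controls, assuming an OpenAI-style chat client as a stand-in; parameter support (e.g., `seed`) varies by provider, and even `temperature=0` is only near-deterministic in practice:

```python
# Hedged sketch of reproducible prompting with an OpenAI-style client.
from openai import OpenAI

client = OpenAI()

def deterministic_call(model: str, system_prompt: str, user_prompt: str) -> str:
    response = client.chat.completions.create(
        model=model,        # pin an exact, dated model version string
        temperature=0.0,    # low temperature: near-deterministic sampling
        seed=1234,          # best-effort determinism where the API supports it
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    )
    return response.choices[0].message.content
```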