ACM CAIS · RLEVAL WORKSHOP · 2026
TwinRouterBench: Fast Static and Live Dynamic Evaluation for Realistic Agentic LLM Routing
A step-level benchmark for agentic LLM routing: at each call, pick the cheapest model tier that still resolves the task from the prefix seen so far. A fast static track of 970 execution-verified labels across five workloads pairs with a live dynamic track on SWE-bench Verified, where a trained router matches unrouted Opus 4.6 while cutting API cost by 53%.
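The step-level target described above (the cheapest tier that still resolves the task from the current prefix) can be illustrated with a minimal sketch. Everything named here is an assumption for illustration and not the benchmark's actual interface: the `Tier` type, the tier names and costs, the `resolves` oracle (standing in for an execution-verified label), and the fallback to the most expensive tier when no tier resolves.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Tier:
    """Hypothetical model tier; costs are illustrative units, not real prices."""
    name: str
    cost_per_call: float


def cheapest_resolving_tier(
    tiers: Sequence[Tier],
    prefix: Sequence[str],
    resolves: Callable[[Tier, Sequence[str]], bool],
) -> Tier:
    """Return the cheapest tier for which `resolves(tier, prefix)` is true.

    `resolves` is a hypothetical oracle standing in for an execution-verified
    label. If no tier resolves, fall back to the most expensive tier
    (an assumption of this sketch, not something the source states).
    """
    ordered = sorted(tiers, key=lambda t: t.cost_per_call)
    for tier in ordered:
        if resolves(tier, prefix):
            return tier
    return ordered[-1]


# Hypothetical tiers, ordered here only for readability.
TIERS = [Tier("small", 1.0), Tier("medium", 5.0), Tier("large", 25.0)]


def toy_resolves(tier: Tier, prefix: Sequence[str]) -> bool:
    """Toy oracle: longer prefixes need a more expensive tier."""
    if len(prefix) <= 2:
        needed = 1.0
    elif len(prefix) <= 4:
        needed = 5.0
    else:
        needed = 25.0
    return tier.cost_per_call >= needed


if __name__ == "__main__":
    for n in (1, 3, 5):
        prefix = [f"step {i}" for i in range(n)]
        print(n, cheapest_resolving_tier(TIERS, prefix, toy_resolves).name)
```

In this framing a router is a learned approximation of the oracle: it sees only the prefix and must predict the label that `cheapest_resolving_tier` would return.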