Overview
This skill provides a systematic framework for designing and implementing batching schedulers for hardware accelerators such as TPUs and custom ASICs. It covers the mathematical analysis of padding budgets, cost-model-driven optimization, and generation-length bucketing strategies. Following a 'math first, code second' approach, it helps developers weigh compilation cost, padding waste, and P95/P99 latency constraints against one another, then settle on a configuration through structured parameter search and invariant checking.
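As a rough illustration of the padding-budget analysis mentioned above, the sketch below measures the fraction of padded tokens wasted by a given set of length buckets and then picks the smallest bucket set that stays within a budget. The function names (`padding_waste`, `buckets_within_budget`), the sample lengths, and the use of bucket count as a stand-in for compilation cost are all assumptions made for this example; they are not part of the skill itself.

```python
from bisect import bisect_left


def padding_waste(lengths, buckets):
    """Fraction of padded tokens that are padding when each length
    is rounded up to the smallest bucket that fits it."""
    buckets = sorted(buckets)
    real = padded = 0
    for n in lengths:
        i = bisect_left(buckets, n)  # index of first bucket >= n
        if i == len(buckets):
            raise ValueError(f"length {n} exceeds largest bucket {buckets[-1]}")
        real += n
        padded += buckets[i]
    return 0.0 if padded == 0 else 1.0 - real / padded


def buckets_within_budget(lengths, candidates, budget):
    """Among candidate bucket sets whose waste is within `budget`, return the
    one with the fewest buckets (assumed proxy for compilation cost)."""
    feasible = [b for b in candidates if padding_waste(lengths, b) <= budget]
    return min(feasible, key=len) if feasible else None


# Hypothetical sequence lengths and candidate bucket sets.
lengths = [10, 20, 30, 60, 100, 120]
candidates = [[128], [32, 128], [16, 32, 64, 128]]
choice = buckets_within_budget(lengths, candidates, budget=0.2)
```

With these sample inputs only the four-bucket set stays under a 20% padding budget, so it is the one selected. A fuller cost model would also weigh per-bucket compilation time and tail latency rather than bucket count alone.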