Optimizes LLM inference workloads on compilation-based accelerators by balancing request batching, shape selection, and padding overhead to minimize costs while meeting latency requirements.
This skill provides a systematic framework for designing and implementing batching schedulers specifically for hardware accelerators like TPUs and custom ASICs. It offers detailed guidance on mathematical analysis for padding budgets, cost-model-driven optimization, and generation-length bucketing strategies. By prioritizing a 'math first, code second' approach, it helps developers navigate the complex trade-offs between compilation costs, padding waste, and P95/P99 latency constraints, ensuring efficient and reliable inference performance through structured parameter search and invariant checking.
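The padding-budget analysis mentioned above can be sketched numerically: with a fixed set of compiled sequence lengths, every request is rounded up to the smallest compiled length that fits it, and the fraction of padded tokens is the waste the scheduler must budget for. The compiled lengths and request lengths below are illustrative assumptions, not values from this skill.

```python
# Hypothetical sketch of the padding-budget math: given a fixed set of
# compiled sequence lengths, each request is padded up to the nearest
# compiled length, and we track what fraction of tokens is pure padding.

def padding_waste(seq_lens, compiled_lens):
    """Fraction of padded tokens when each request is rounded up
    to the smallest compiled sequence length that fits it."""
    padded = 0
    actual = 0
    for s in seq_lens:
        # Smallest compiled length >= s; fall back to the largest shape.
        target = min((c for c in compiled_lens if c >= s),
                     default=max(compiled_lens))
        padded += target
        actual += s
    return 1.0 - actual / padded

# Illustrative request mix against assumed shapes {128, 256, 512, 1024}:
waste = padding_waste([100, 200, 300, 500, 700], [128, 256, 512, 1024])
print(f"padding waste: {waste:.1%}")
```

With more (or better-placed) compiled lengths the waste drops, but each extra shape adds compilation cost, which is exactly the trade-off the cost model has to arbitrate.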
Key Features
1. Continuous structural and metric invariant verification
2. Systematic parameter search for multi-objective optimization
3. Generation-length bucketing strategies for request distributions
4. Cost-model-driven shape optimization and selection
5. Mathematical padding analysis and budget calculation
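The generation-length bucketing strategy from the list above can be sketched as follows: requests are queued by predicted output length so that batches share similar decode horizons and short requests are not held behind long ones. The bucket boundaries and prediction values here are illustrative assumptions.

```python
import bisect

# Hypothetical generation-length bucketing: requests are grouped by
# predicted output length. Bucket edges are inclusive upper bounds,
# with a final catch-all bucket for anything longer.

BUCKET_EDGES = [64, 256, 1024]  # assumed boundaries, in tokens

def bucket_for(predicted_len):
    """Index of the first bucket whose upper edge covers predicted_len."""
    return bisect.bisect_left(BUCKET_EDGES, predicted_len)

# One queue per bucket, including the catch-all tail bucket.
queues = {i: [] for i in range(len(BUCKET_EDGES) + 1)}
for req_id, predicted in [("a", 30), ("b", 500), ("c", 2000), ("d", 64)]:
    queues[bucket_for(predicted)].append(req_id)

print(queues)
```

Each queue can then be batched with its own padded shape, which is what makes the per-bucket padding budget tractable to analyze.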
Use Cases
1. Designing batch schedulers for LLM inference on TPUs or custom ASICs
2. Balancing throughput and cost metrics against strict P95/P99 latency thresholds
3. Optimizing request packing to reduce padding waste and operational costs
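A cost-model-driven shape search of the kind these use cases call for can be sketched as scoring each candidate (batch size, padded length) pair by estimated cost per useful token, with the one-time compilation cost amortized over a serving horizon. Every constant below is an illustrative assumption, not a measured hardware number.

```python
from itertools import product

# Hypothetical cost-model-driven shape search. A larger padded shape
# costs more per step, while its compile cost amortizes the same; a
# shape bigger than the arriving demand just burns padded tokens.

COMPILE_COST = 5.0            # assumed one-time compile cost per shape
STEP_COST_PER_TOKEN = 0.001   # assumed cost per padded token processed
TOKENS_PER_WINDOW = 3000      # assumed useful tokens arriving per window
HORIZON_BATCHES = 10_000      # batches over which compile cost amortizes

def score(batch, length):
    """Estimated cost per useful token for one compiled shape."""
    padded = batch * length
    useful = min(TOKENS_PER_WINDOW, padded)  # can't serve more than arrives
    cost = STEP_COST_PER_TOKEN * padded + COMPILE_COST / HORIZON_BATCHES
    return cost / useful

best = min(product([8, 16, 32], [256, 512]), key=lambda bl: score(*bl))
print("best (batch, length):", best)
```

In a real scheduler this scalar score would be replaced by the skill's multi-objective search, with P95/P99 latency entering as hard constraints rather than folded into a single cost.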