Forge: Optimize PyTorch to Fast CUDA/Triton Kernels with AI