Overview
This skill provides expert guidance for implementing tensor parallelism patterns in PyTorch, specifically for scaling large-scale models that exceed the memory of a single device. It offers detailed protocols for ColumnParallelLinear and RowParallelLinear layers, ensuring correct weight sharding and the precise execution of collective operations like all-gather and all-reduce. By following these implementation patterns, developers can avoid common distributed computing pitfalls, such as incorrect bias handling or returning incomplete local shards, while ensuring mathematical consistency across parallel ranks.