SNU 확장형고성능컴퓨팅 Final Project
- Name: NVIDIA TITAN RTX
- CUDA Runtime Version: 11.8
- Compute Capability: 7.5
- Total number of SMs: 72
- Max threads per block: 1024
- Max threads per multiprocessor: 1024
- Threads per warp: 32
- Max regs per block: 65536
- Max regs per multiprocessor: 65536
- Total global mem: 24220 MB
- Max shared mem per block: 48 KB
- Shared mem per multiprocessor: 64 KB
- Max warps per multiprocessor: 32
| ID | Kernel Optimization | Communication Optimization | Throughput(sentences/sec) | Commit Number |
|---|---|---|---|---|
| 1 | Naive Conv1D | Sequential Conv1Ds | 686 | 43562e5 |
| 2 | Conv1D to Input Spread and WMMA (Warp Matrix Multiply Accumulate, Use Tensor core) | Sequential Conv1Ds | 4710 | 9635014 |
| 3 | Add a & b SMEM tiling to WMMA Kernel(50% Occupancy) | Sequential Conv1Ds | 9622 | 07acaef |
| 4 | ID3 | Use 4 nodes (MPI Scatter/Gather) | 29484 | afd88a5 |
| 5 | ID3 | Logically Concurrent CONV1Ds | 29437 | 67da912 |
| 6 | Remove c SMEM from WMMA Kernel & Increase WMMA_BLOCKDIM to 1024 (100% Occupancy) | ID5 | 36542 | 583ccbe |
| 7 | ID6 | Split WMMA with Pipelining | 36800 | 3fcb6f3 |
shpc_final_project_report.md
