This repo provides evidence that Apple AMX instructions remain functional on M4 hardware.
Thanks to corsix/amx and dougallj, I was able to verify this on a 10-core M4 iMac.

fma16_mat_f16f16_x*y+z (far z)

| Z accumulators | 1 Thread | 2 Threads | 3 Threads | 4 Threads | 5 Threads | 6 Threads |
|---|---|---|---|---|---|---|
| 1 per thread | 2009.4 GFLOPS | 2454.8 GFLOPS | 3649.5 GFLOPS | 3653.2 GFLOPS | 3701.1 GFLOPS | 4162.0 GFLOPS |
| 2 per thread | 3986.7 GFLOPS | 4706.4 GFLOPS | 4527.6 GFLOPS | 4571.6 GFLOPS | 4606.0 GFLOPS | 4621.4 GFLOPS |

From corsix/amx/fma.md:

fma16 in matrix mode, each Z accumulator being f16[32][32]

| ZACs | 1 Thread | 2 Threads | 3 Threads | 4 Threads | 5 Threads | 6 Threads |
|---|---|---|---|---|---|---|
| 1 per thread | 1453.0 GFLOPS | 2958.4 GFLOPS | 2705.5 GFLOPS | 3553.5 GFLOPS | 4609.2 GFLOPS | 5268.5 GFLOPS |
| 2 per thread | 2958.9 GFLOPS | 5915.7 GFLOPS | 4862.3 GFLOPS | 5355.6 GFLOPS | 5546.6 GFLOPS | 6263.4 GFLOPS |
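
To relate the GFLOPS figures above to instruction throughput, here is a rough sketch in Python. It assumes that one matrix-mode fma16 instruction updates a full f16[32][32] accumulator and that each multiply-add counts as 2 floating-point operations (the usual convention, but an assumption here). The helper names `instructions_per_second` and `thread_scaling` are made up for illustration and are not part of this repo or of corsix/amx.

```python
# Assumption: one matrix-mode fma16 instruction performs a 32x32 grid of
# f16 multiply-adds, and each multiply-add is counted as 2 FLOPs.
ROWS = 32
COLS = 32
FLOPS_PER_FMA16 = 2 * ROWS * COLS  # 2048 FLOPs per instruction


def instructions_per_second(gflops: float) -> float:
    """Convert a measured GFLOPS figure into fma16 instructions per second."""
    return gflops * 1e9 / FLOPS_PER_FMA16


def thread_scaling(gflops_n_threads: float, gflops_one_thread: float) -> float:
    """Speedup of an N-thread run relative to the 1-thread run."""
    return gflops_n_threads / gflops_one_thread


if __name__ == "__main__":
    # Figures taken from the first table (1 Z accumulator per thread).
    print(instructions_per_second(2009.4))
    print(thread_scaling(4162.0, 2009.4))
```

Under these assumptions, `instructions_per_second(2009.4)` comes out at roughly 0.98 billion fma16 instructions per second for the single-thread, one-accumulator case.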