Klas2 - further extensions for SIMD and related work #132
Closed
valassi wants to merge 193 commits into
…plify the code.
Prepare to improve kernel launchers by moving c++ event loops further inside.
= 16384
NumThreadsPerBlock = 32
NumIterations = 1
-----------------------------------------------------------------------
FP precision = DOUBLE (nan=0)
Complex type = THRUST::COMPLEX
RanNumb memory layout = AOSOA[4]
Momenta memory layout = AOSOA[4]
Wavefunction GPU memory = LOCAL
Random number generation = CURAND DEVICE (CUDA code)
-----------------------------------------------------------------------
NumberOfEntries = 1
TotalTime[Rnd+Rmb+ME] (123)= ( 7.378591e-03 ) sec
TotalTime[Rambo+ME] (23)= ( 6.737728e-03 ) sec
TotalTime[RndNumGen] (1)= ( 6.408630e-04 ) sec
TotalTime[Rambo] (2)= ( 5.967797e-03 ) sec
TotalTime[MatrixElems] (3)= ( 7.699310e-04 ) sec
MeanTimeInMatrixElems = ( 7.699310e-04 ) sec
[Min,Max]TimeInMatrixElems = [ 7.699310e-04 , 7.699310e-04 ] sec
-----------------------------------------------------------------------
TotalEventsComputed = 524288
EvtsPerSec[Rnd+Rmb+ME](123)= ( 7.105530e+07 ) sec^-1
EvtsPerSec[Rmb+ME] (23)= ( 7.781377e+07 ) sec^-1
EvtsPerSec[MatrixElems] (3)= ( 6.809545e+08 ) sec^-1
***********************************************************************
NumMatrixElements(notNan) = 524288
MeanMatrixElemValue = ( 1.371958e-02 +- 1.132119e-05 ) GeV^0
[Min,Max]MatrixElemValue = [ 6.071582e-03 , 3.374915e-02 ] GeV^0
StdDevMatrixElemValue = ( 8.197419e-03 ) GeV^0
MeanWeight = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight = [ 4.515827e-01 , 4.515827e-01 ]
StdDevWeight = ( 0.000000e+00 )
***********************************************************************
00 CudaFree : 1.084176 sec
0a ProcInit : 0.000522 sec
0b MemAlloc : 0.035510 sec
0c GenCreat : 0.009668 sec
0d SGoodHel : 0.001756 sec
1a GenSeed : 0.000012 sec
1b GenRnGen : 0.000629 sec
2a RamboIni : 0.000041 sec
2b RamboFin : 0.000013 sec
2c CpDTHwgt : 0.000475 sec
2d CpDTHmom : 0.005438 sec
3a SigmaKin : 0.000013 sec
3b CpDTHmes : 0.000757 sec
4a DumpLoop : 0.003222 sec
8a CompStat : 0.003654 sec
9a GenDestr : 0.000053 sec
9b DumpScrn : 0.000229 sec
9c DumpJson : 0.000008 sec
TOTAL : 1.146176 sec
TOTAL (123) : 0.007379 sec
TOTAL (23) : 0.006738 sec
TOTAL (1) : 0.000641 sec
TOTAL (2) : 0.005968 sec
TOTAL (3) : 0.000770 sec
***********************************************************************
./check.exe -p 16384 32 1
***********************************************************************
NumBlocksPerGrid = 16384
NumThreadsPerBlock = 32
NumIterations = 1
-----------------------------------------------------------------------
FP precision = DOUBLE (nan=0)
Complex type = STD::COMPLEX
RanNumb memory layout = AOSOA[4]
Momenta memory layout = AOSOA[4]
Random number generation = CURAND (C++ code)
-----------------------------------------------------------------------
NumberOfEntries = 1
TotalTime[Rnd+Rmb+ME] (123)= ( 1.512748e+00 ) sec
TotalTime[Rambo+ME] (23)= ( 1.477347e+00 ) sec
TotalTime[RndNumGen] (1)= ( 3.540115e-02 ) sec
TotalTime[Rambo] (2)= ( 1.121947e-01 ) sec
TotalTime[MatrixElems] (3)= ( 1.365152e+00 ) sec
MeanTimeInMatrixElems = ( 1.365152e+00 ) sec
[Min,Max]TimeInMatrixElems = [ 1.365152e+00 , 1.365152e+00 ] sec
-----------------------------------------------------------------------
TotalEventsComputed = 524288
EvtsPerSec[Rnd+Rmb+ME](123)= ( 3.465798e+05 ) sec^-1
EvtsPerSec[Rmb+ME] (23)= ( 3.548848e+05 ) sec^-1
EvtsPerSec[MatrixElems] (3)= ( 3.840509e+05 ) sec^-1
***********************************************************************
NumMatrixElements(notNan) = 524288
MeanMatrixElemValue = ( 1.371958e-02 +- 1.132119e-05 ) GeV^0
[Min,Max]MatrixElemValue = [ 6.071582e-03 , 3.374915e-02 ] GeV^0
StdDevMatrixElemValue = ( 8.197419e-03 ) GeV^0
MeanWeight = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight = [ 4.515827e-01 , 4.515827e-01 ]
StdDevWeight = ( 0.000000e+00 )
***********************************************************************
0a ProcInit : 0.000329 sec
0b MemAlloc : 0.000044 sec
0c GenCreat : 0.000853 sec
1a GenSeed : 0.000008 sec
1b GenRnGen : 0.035393 sec
2a RamboIni : 0.016423 sec
2b RamboFin : 0.095772 sec
3a SigmaKin : 1.365152 sec
4a DumpLoop : 0.004525 sec
8a CompStat : 0.003041 sec
9a GenDestr : 0.000072 sec
9b DumpScrn : 0.000189 sec
9c DumpJson : 0.000009 sec
TOTAL : 1.521810 sec
TOTAL (123) : 1.512748 sec
TOTAL (23) : 1.477347 sec
TOTAL (1) : 0.035401 sec
TOTAL (2) : 0.112195 sec
TOTAL (3) : 1.365152 sec
***********************************************************************
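The throughput figures in these logs follow directly from the event count and the timers. As a minimal sketch (the function names `totalEvents`, `eventsPerSec` and `relDiff` are hypothetical helpers, not part of check.exe), the arithmetic is:

```cpp
#include <cassert>
#include <cmath>

// Total events = blocks per grid * threads per block * iterations,
// i.e. the three arguments of "-p 16384 32 1".
long totalEvents(long blocks, long threads, long iterations)
{
  return blocks * threads * iterations;
}

// Throughput in events per second for a wall-clock time given in seconds.
double eventsPerSec(long events, double seconds)
{
  assert(seconds > 0.0);
  return static_cast<double>(events) / seconds;
}

// Relative difference, handy when comparing against a value printed
// with only seven significant digits.
double relDiff(double computed, double printed)
{
  return std::fabs(computed - printed) / std::fabs(printed);
}
```

With the values above, `totalEvents(16384, 32, 1)` gives the 524288 reported as TotalEventsComputed, and dividing it by TotalTime[MatrixElems] (7.699310e-04 sec on GPU, 1.365152e+00 sec on CPU) reproduces the printed EvtsPerSec[MatrixElems] values to the printed precision.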
…cal variable. This has a smaller performance degradation on GPU? Can get it back by using a local variable in calculate_wavefunctions, if needed. ./gcheck.exe -p 16384 32 1 *********************************************************************** NumBlocksPerGrid = 16384 NumThreadsPerBlock = 32 NumIterations = 1 ----------------------------------------------------------------------- FP precision = DOUBLE (nan=0) Complex type = THRUST::COMPLEX RanNumb memory layout = AOSOA[4] Momenta memory layout = AOSOA[4] Wavefunction GPU memory = LOCAL Random number generation = CURAND DEVICE (CUDA code) ----------------------------------------------------------------------- NumberOfEntries = 1 TotalTime[Rnd+Rmb+ME] (123)= ( 7.408483e-03 ) sec TotalTime[Rambo+ME] (23)= ( 6.765105e-03 ) sec TotalTime[RndNumGen] (1)= ( 6.433780e-04 ) sec TotalTime[Rambo] (2)= ( 5.962489e-03 ) sec TotalTime[MatrixElems] (3)= ( 8.026160e-04 ) sec MeanTimeInMatrixElems = ( 8.026160e-04 ) sec [Min,Max]TimeInMatrixElems = [ 8.026160e-04 , 8.026160e-04 ] sec ----------------------------------------------------------------------- TotalEventsComputed = 524288 EvtsPerSec[Rnd+Rmb+ME](123)= ( 7.076861e+07 ) sec^-1 EvtsPerSec[Rmb+ME] (23)= ( 7.749887e+07 ) sec^-1 EvtsPerSec[MatrixElems] (3)= ( 6.532240e+08 ) sec^-1 *********************************************************************** NumMatrixElements(notNan) = 524288 MeanMatrixElemValue = ( 1.371958e-02 +- 1.132119e-05 ) GeV^0 [Min,Max]MatrixElemValue = [ 6.071582e-03 , 3.374915e-02 ] GeV^0 StdDevMatrixElemValue = ( 8.197419e-03 ) GeV^0 MeanWeight = ( 4.515827e-01 +- 0.000000e+00 ) [Min,Max]Weight = [ 4.515827e-01 , 4.515827e-01 ] StdDevWeight = ( 0.000000e+00 ) *********************************************************************** 00 CudaFree : 0.687299 sec 0a ProcInit : 0.000427 sec 0b MemAlloc : 0.035461 sec 0c GenCreat : 0.012625 sec 0d SGoodHel : 0.001746 sec 1a GenSeed : 0.000010 sec 1b GenRnGen : 0.000633 sec 2a RamboIni : 0.000017 sec 2b RamboFin : 
0.000016 sec 2c CpDTHwgt : 0.000506 sec 2d CpDTHmom : 0.005423 sec 3a SigmaKin : 0.000015 sec 3b CpDTHmes : 0.000787 sec 4a DumpLoop : 0.005516 sec 8a CompStat : 0.003650 sec 9a GenDestr : 0.000055 sec 9b DumpScrn : 0.000293 sec 9c DumpJson : 0.000008 sec TOTAL : 0.754489 sec TOTAL (123) : 0.007408 sec TOTAL (23) : 0.006765 sec TOTAL (1) : 0.000643 sec TOTAL (2) : 0.005962 sec TOTAL (3) : 0.000803 sec *********************************************************************** ./check.exe -p 16384 32 1 *********************************************************************** NumBlocksPerGrid = 16384 NumThreadsPerBlock = 32 NumIterations = 1 ----------------------------------------------------------------------- FP precision = DOUBLE (nan=0) Complex type = STD::COMPLEX RanNumb memory layout = AOSOA[4] Momenta memory layout = AOSOA[4] Random number generation = CURAND (C++ code) ----------------------------------------------------------------------- NumberOfEntries = 1 TotalTime[Rnd+Rmb+ME] (123)= ( 1.509930e+00 ) sec TotalTime[Rambo+ME] (23)= ( 1.475219e+00 ) sec TotalTime[RndNumGen] (1)= ( 3.471084e-02 ) sec TotalTime[Rambo] (2)= ( 1.109848e-01 ) sec TotalTime[MatrixElems] (3)= ( 1.364234e+00 ) sec MeanTimeInMatrixElems = ( 1.364234e+00 ) sec [Min,Max]TimeInMatrixElems = [ 1.364234e+00 , 1.364234e+00 ] sec ----------------------------------------------------------------------- TotalEventsComputed = 524288 EvtsPerSec[Rnd+Rmb+ME](123)= ( 3.472268e+05 ) sec^-1 EvtsPerSec[Rmb+ME] (23)= ( 3.553968e+05 ) sec^-1 EvtsPerSec[MatrixElems] (3)= ( 3.843094e+05 ) sec^-1 *********************************************************************** NumMatrixElements(notNan) = 524288 MeanMatrixElemValue = ( 1.371958e-02 +- 1.132119e-05 ) GeV^0 [Min,Max]MatrixElemValue = [ 6.071582e-03 , 3.374915e-02 ] GeV^0 StdDevMatrixElemValue = ( 8.197419e-03 ) GeV^0 MeanWeight = ( 4.515827e-01 +- 0.000000e+00 ) [Min,Max]Weight = [ 4.515827e-01 , 4.515827e-01 ] StdDevWeight = ( 0.000000e+00 ) 
*********************************************************************** 0a ProcInit : 0.000314 sec 0b MemAlloc : 0.000042 sec 0c GenCreat : 0.000848 sec 1a GenSeed : 0.000008 sec 1b GenRnGen : 0.034703 sec 2a RamboIni : 0.016324 sec 2b RamboFin : 0.094661 sec 3a SigmaKin : 1.364234 sec 4a DumpLoop : 0.004442 sec 8a CompStat : 0.003013 sec 9a GenDestr : 0.000119 sec 9b DumpScrn : 0.000195 sec 9c DumpJson : 0.000007 sec TOTAL : 1.518909 sec TOTAL (123) : 1.509930 sec TOTAL (23) : 1.475219 sec TOTAL (1) : 0.034711 sec TOTAL (2) : 0.110985 sec TOTAL (3) : 1.364234 sec ***********************************************************************
…ions. There are at least two issues here:
- on both cpu and gpu, dividing by denominators should be done once, not on each hel
- the helicity filtering on cpp uses a loop that is buggy (gives up too early)
Note: essentially I am inverting the helicity and event loops, this is the key.
***********************************************************************
NumBlocksPerGrid = 16384
NumThreadsPerBlock = 32
NumIterations = 1
-----------------------------------------------------------------------
FP precision = DOUBLE (nan=0)
Complex type = THRUST::COMPLEX
RanNumb memory layout = AOSOA[4]
Momenta memory layout = AOSOA[4]
Wavefunction GPU memory = LOCAL
Random number generation = CURAND DEVICE (CUDA code)
-----------------------------------------------------------------------
NumberOfEntries = 1
TotalTime[Rnd+Rmb+ME] (123)= ( 7.218796e-03 ) sec
TotalTime[Rambo+ME] (23)= ( 6.566758e-03 ) sec
TotalTime[RndNumGen] (1)= ( 6.520380e-04 ) sec
TotalTime[Rambo] (2)= ( 5.775874e-03 ) sec
TotalTime[MatrixElems] (3)= ( 7.908840e-04 ) sec
MeanTimeInMatrixElems = ( 7.908840e-04 ) sec
[Min,Max]TimeInMatrixElems = [ 7.908840e-04 , 7.908840e-04 ] sec
-----------------------------------------------------------------------
TotalEventsComputed = 524288
EvtsPerSec[Rnd+Rmb+ME](123)= ( 7.262818e+07 ) sec^-1
EvtsPerSec[Rmb+ME] (23)= ( 7.983970e+07 ) sec^-1
EvtsPerSec[MatrixElems] (3)= ( 6.629139e+08 ) sec^-1
***********************************************************************
NumMatrixElements(notNan) = 524288
MeanMatrixElemValue = ( 1.371958e-02 +- 1.132119e-05 ) GeV^0
[Min,Max]MatrixElemValue = [ 6.071582e-03 , 3.374915e-02 ] GeV^0
StdDevMatrixElemValue = ( 8.197419e-03 ) GeV^0
MeanWeight = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight = [ 4.515827e-01 , 4.515827e-01 ]
StdDevWeight = ( 0.000000e+00 )
***********************************************************************
00 CudaFree : 0.897224 sec
0a ProcInit : 0.000579 sec
0b MemAlloc : 0.037365 sec
0c GenCreat : 0.009812 sec
0d SGoodHel : 0.001844 sec
1a GenSeed : 0.000011 sec
1b GenRnGen : 0.000641 sec
2a RamboIni : 0.000033 sec
2b RamboFin : 0.000012 sec
2c CpDTHwgt : 0.000482 sec
2d CpDTHmom : 0.005249 sec
3a SigmaKin : 0.000013 sec
3b CpDTHmes : 0.000778 sec
4a DumpLoop : 0.005650 sec
8a CompStat : 0.003652 sec
9a GenDestr : 0.000068 sec
9b DumpScrn : 0.000303 sec
9c DumpJson : 0.000007 sec
TOTAL : 0.963722 sec
TOTAL (123) : 0.007219 sec
TOTAL (23) : 0.006567 sec
TOTAL (1) : 0.000652 sec
TOTAL (2) : 0.005776 sec
TOTAL (3) : 0.000791 sec
***********************************************************************
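The restructuring described in this commit message (helicity loop outside, event loop inside, a good-helicity filter computed up front, and a single division by the denominator) can be sketched as follows. This is a toy illustration under stated assumptions, not the code of CPPProcess.cc: `toyME` is a made-up stand-in for the per-helicity contribution, and `getGoodHel` and `sigmaKin` are hypothetical signatures loosely named after the functions that appear in the logs. The source does not show the buggy filtering loop, so the filter here simply scans all events for each helicity before declaring it bad.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Made-up per-helicity contribution for one event (assumption for illustration).
// Helicity 1 contributes only for x > 1, so a filter that inspects just the
// first event could wrongly discard it.
double toyME(int ihel, double x)
{
  switch (ihel)
  {
    case 0: return x;
    case 1: return (x > 1.0) ? x : 0.0;
    case 3: return 2.0 * x;
    default: return 0.0; // helicity 2 never contributes
  }
}

// A helicity is "good" if at least one event gives a nonzero contribution.
// Every event is scanned before a helicity is declared bad.
std::vector<bool> getGoodHel(int ncomb, const std::vector<double>& events)
{
  std::vector<bool> isGood(ncomb, false);
  for (int ihel = 0; ihel < ncomb; ++ihel)
    for (double x : events)
      if (toyME(ihel, x) != 0.0) { isGood[ihel] = true; break; }
  return isGood;
}

// Helicity loop outside, event loop inside; divide by the denominator once.
std::vector<double> sigmaKin(int ncomb, const std::vector<bool>& isGood,
                             const std::vector<double>& events, double denominator)
{
  assert(static_cast<int>(isGood.size()) == ncomb);
  assert(denominator != 0.0);
  std::vector<double> me(events.size(), 0.0);
  for (int ihel = 0; ihel < ncomb; ++ihel)
  {
    if (!isGood[ihel]) continue; // skip helicities known to give zero
    for (std::size_t ievt = 0; ievt < events.size(); ++ievt)
      me[ievt] += toyME(ihel, events[ievt]);
  }
  for (double& m : me) m /= denominator; // once per event, not once per helicity
  return me;
}
```

Putting the event loop innermost is also what makes it possible for neighbouring events to be processed together in SIMD lanes, which is presumably why the inversion is called the key step.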
The physics results are correct but performance gets degraded by almost a factor 2. It looks like I am calculating too many helicities.
***********************************************************************
NumBlocksPerGrid = 16384
NumThreadsPerBlock = 32
NumIterations = 1
-----------------------------------------------------------------------
FP precision = DOUBLE (nan=0)
Complex type = STD::COMPLEX
RanNumb memory layout = AOSOA[4]
Momenta memory layout = AOSOA[4]
Random number generation = CURAND (C++ code)
-----------------------------------------------------------------------
NumberOfEntries = 1
TotalTime[Rnd+Rmb+ME] (123)= ( 2.463410e+00 ) sec
TotalTime[Rambo+ME] (23)= ( 2.428130e+00 ) sec
TotalTime[RndNumGen] (1)= ( 3.528046e-02 ) sec
TotalTime[Rambo] (2)= ( 1.116047e-01 ) sec
TotalTime[MatrixElems] (3)= ( 2.316525e+00 ) sec
MeanTimeInMatrixElems = ( 2.316525e+00 ) sec
[Min,Max]TimeInMatrixElems = [ 2.316525e+00 , 2.316525e+00 ] sec
-----------------------------------------------------------------------
TotalEventsComputed = 524288
EvtsPerSec[Rnd+Rmb+ME](123)= ( 2.128302e+05 ) sec^-1
EvtsPerSec[Rmb+ME] (23)= ( 2.159225e+05 ) sec^-1
EvtsPerSec[MatrixElems] (3)= ( 2.263252e+05 ) sec^-1
***********************************************************************
NumMatrixElements(notNan) = 524288
MeanMatrixElemValue = ( 1.371958e-02 +- 1.132119e-05 ) GeV^0
[Min,Max]MatrixElemValue = [ 6.071582e-03 , 3.374915e-02 ] GeV^0
StdDevMatrixElemValue = ( 8.197419e-03 ) GeV^0
MeanWeight = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight = [ 4.515827e-01 , 4.515827e-01 ]
StdDevWeight = ( 0.000000e+00 )
***********************************************************************
0a ProcInit : 0.000353 sec
0b MemAlloc : 0.000047 sec
0c GenCreat : 0.000870 sec
0d SGoodHel : 0.000162 sec
1a GenSeed : 0.000009 sec
1b GenRnGen : 0.035272 sec
2a RamboIni : 0.016716 sec
2b RamboFin : 0.094888 sec
3a SigmaKin : 2.316525 sec
4a DumpLoop : 0.004477 sec
8a CompStat : 0.002808 sec
9a GenDestr : 0.000077 sec
9b DumpScrn : 0.000211 sec
9c DumpJson : 0.000007 sec
TOTAL : 2.472422 sec
TOTAL (123) : 2.463410 sec
TOTAL (23) : 2.428130 sec
TOTAL (1) : 0.035280 sec
TOTAL (2) : 0.111605 sec
TOTAL (3) : 2.316525 sec
***********************************************************************
./check.exe -p 16384 32 1 *********************************************************************** NumBlocksPerGrid = 16384 NumThreadsPerBlock = 32 NumIterations = 1 ----------------------------------------------------------------------- FP precision = DOUBLE (nan=0) Complex type = STD::COMPLEX RanNumb memory layout = AOSOA[4] Momenta memory layout = AOSOA[4] Random number generation = CURAND (C++ code) ----------------------------------------------------------------------- NumberOfEntries = 1 TotalTime[Rnd+Rmb+ME] (123)= ( 1.451416e+00 ) sec TotalTime[Rambo+ME] (23)= ( 1.423717e+00 ) sec TotalTime[RndNumGen] (1)= ( 2.769905e-02 ) sec TotalTime[Rambo] (2)= ( 1.000820e-01 ) sec TotalTime[MatrixElems] (3)= ( 1.323635e+00 ) sec MeanTimeInMatrixElems = ( 1.323635e+00 ) sec [Min,Max]TimeInMatrixElems = [ 1.323635e+00 , 1.323635e+00 ] sec ----------------------------------------------------------------------- TotalEventsComputed = 524288 EvtsPerSec[Rnd+Rmb+ME](123)= ( 3.612252e+05 ) sec^-1 EvtsPerSec[Rmb+ME] (23)= ( 3.682530e+05 ) sec^-1 EvtsPerSec[MatrixElems] (3)= ( 3.960972e+05 ) sec^-1 *********************************************************************** NumMatrixElements(notNan) = 524288 MeanMatrixElemValue = ( 1.371958e-02 +- 1.132119e-05 ) GeV^0 [Min,Max]MatrixElemValue = [ 6.071582e-03 , 3.374915e-02 ] GeV^0 StdDevMatrixElemValue = ( 8.197419e-03 ) GeV^0 MeanWeight = ( 4.515827e-01 +- 0.000000e+00 ) [Min,Max]Weight = [ 4.515827e-01 , 4.515827e-01 ] StdDevWeight = ( 0.000000e+00 ) *********************************************************************** 0a ProcInit : 0.000317 sec 0b MemAlloc : 0.028313 sec 0c GenCreat : 0.000863 sec 0d SGoodHel : 0.000164 sec 1a GenSeed : 0.000011 sec 1b GenRnGen : 0.027688 sec 2a RamboIni : 0.006959 sec 2b RamboFin : 0.093123 sec 3a SigmaKin : 1.323635 sec 4a DumpLoop : 0.004347 sec 8a CompStat : 0.003050 sec 9a GenDestr : 0.000076 sec 9b DumpScrn : 0.000253 sec 9c DumpJson : 0.000007 sec TOTAL : 1.488808 sec TOTAL (123) : 
1.451416 sec TOTAL (23) : 1.423717 sec TOTAL (1) : 0.027699 sec TOTAL (2) : 0.100082 sec TOTAL (3) : 1.323635 sec ***********************************************************************
…functions. ./gcheck.exe -p 16384 32 1 *********************************************************************** NumBlocksPerGrid = 16384 NumThreadsPerBlock = 32 NumIterations = 1 ----------------------------------------------------------------------- FP precision = DOUBLE (nan=0) Complex type = THRUST::COMPLEX RanNumb memory layout = AOSOA[4] Momenta memory layout = AOSOA[4] Wavefunction GPU memory = LOCAL Random number generation = CURAND DEVICE (CUDA code) ----------------------------------------------------------------------- NumberOfEntries = 1 TotalTime[Rnd+Rmb+ME] (123)= ( 7.351215e-03 ) sec TotalTime[Rambo+ME] (23)= ( 6.713424e-03 ) sec TotalTime[RndNumGen] (1)= ( 6.377910e-04 ) sec TotalTime[Rambo] (2)= ( 5.919468e-03 ) sec TotalTime[MatrixElems] (3)= ( 7.939560e-04 ) sec MeanTimeInMatrixElems = ( 7.939560e-04 ) sec [Min,Max]TimeInMatrixElems = [ 7.939560e-04 , 7.939560e-04 ] sec ----------------------------------------------------------------------- TotalEventsComputed = 524288 EvtsPerSec[Rnd+Rmb+ME](123)= ( 7.131991e+07 ) sec^-1 EvtsPerSec[Rmb+ME] (23)= ( 7.809547e+07 ) sec^-1 EvtsPerSec[MatrixElems] (3)= ( 6.603489e+08 ) sec^-1 *********************************************************************** NumMatrixElements(notNan) = 524288 MeanMatrixElemValue = ( 1.371958e-02 +- 1.132119e-05 ) GeV^0 [Min,Max]MatrixElemValue = [ 6.071582e-03 , 3.374915e-02 ] GeV^0 StdDevMatrixElemValue = ( 8.197419e-03 ) GeV^0 MeanWeight = ( 4.515827e-01 +- 0.000000e+00 ) [Min,Max]Weight = [ 4.515827e-01 , 4.515827e-01 ] StdDevWeight = ( 0.000000e+00 ) *********************************************************************** 00 CudaFree : 0.686566 sec 0a ProcInit : 0.000418 sec 0b MemAlloc : 0.034625 sec 0c GenCreat : 0.012608 sec 0d SGoodHel : 0.001840 sec 1a GenSeed : 0.000013 sec 1b GenRnGen : 0.000625 sec 2a RamboIni : 0.000017 sec 2b RamboFin : 0.000012 sec 2c CpDTHwgt : 0.000512 sec 2d CpDTHmom : 0.005378 sec 3a SigmaKin : 0.000013 sec 3b CpDTHmes : 0.000781 sec 4a DumpLoop : 
0.005419 sec 8a CompStat : 0.003564 sec 9a GenDestr : 0.000099 sec 9b DumpScrn : 0.000212 sec 9c DumpJson : 0.000007 sec TOTAL : 0.752710 sec TOTAL (123) : 0.007351 sec TOTAL (23) : 0.006713 sec TOTAL (1) : 0.000638 sec TOTAL (2) : 0.005919 sec TOTAL (3) : 0.000794 sec *********************************************************************** ./check.exe -p 16384 32 1 *********************************************************************** NumBlocksPerGrid = 16384 NumThreadsPerBlock = 32 NumIterations = 1 ----------------------------------------------------------------------- FP precision = DOUBLE (nan=0) Complex type = STD::COMPLEX RanNumb memory layout = AOSOA[4] Momenta memory layout = AOSOA[4] Random number generation = CURAND (C++ code) ----------------------------------------------------------------------- NumberOfEntries = 1 TotalTime[Rnd+Rmb+ME] (123)= ( 1.510862e+00 ) sec TotalTime[Rambo+ME] (23)= ( 1.483222e+00 ) sec TotalTime[RndNumGen] (1)= ( 2.763983e-02 ) sec TotalTime[Rambo] (2)= ( 9.924585e-02 ) sec TotalTime[MatrixElems] (3)= ( 1.383976e+00 ) sec MeanTimeInMatrixElems = ( 1.383976e+00 ) sec [Min,Max]TimeInMatrixElems = [ 1.383976e+00 , 1.383976e+00 ] sec ----------------------------------------------------------------------- TotalEventsComputed = 524288 EvtsPerSec[Rnd+Rmb+ME](123)= ( 3.470125e+05 ) sec^-1 EvtsPerSec[Rmb+ME] (23)= ( 3.534790e+05 ) sec^-1 EvtsPerSec[MatrixElems] (3)= ( 3.788273e+05 ) sec^-1 *********************************************************************** NumMatrixElements(notNan) = 524288 MeanMatrixElemValue = ( 1.371958e-02 +- 1.132119e-05 ) GeV^0 [Min,Max]MatrixElemValue = [ 6.071582e-03 , 3.374915e-02 ] GeV^0 StdDevMatrixElemValue = ( 8.197419e-03 ) GeV^0 MeanWeight = ( 4.515827e-01 +- 0.000000e+00 ) [Min,Max]Weight = [ 4.515827e-01 , 4.515827e-01 ] StdDevWeight = ( 0.000000e+00 ) *********************************************************************** 0a ProcInit : 0.000333 sec 0b MemAlloc : 0.027811 sec 0c GenCreat : 0.000846 
sec 0d SGoodHel : 0.000151 sec 1a GenSeed : 0.000009 sec 1b GenRnGen : 0.027630 sec 2a RamboIni : 0.006760 sec 2b RamboFin : 0.092486 sec 3a SigmaKin : 1.383976 sec 4a DumpLoop : 0.004520 sec 8a CompStat : 0.003015 sec 9a GenDestr : 0.000075 sec 9b DumpScrn : 0.000257 sec 9c DumpJson : 0.000010 sec TOTAL : 1.547881 sec TOTAL (123) : 1.510862 sec TOTAL (23) : 1.483222 sec TOTAL (1) : 0.027640 sec TOTAL (2) : 0.099246 sec TOTAL (3) : 1.383976 sec ***********************************************************************
…ixx/oxx functions." This reverts commit b63320f.
Note: with this older implementation, there are 55 lines from "objdump -d -C CPPProcess.o | egrep 'vaddpd|vmul|vfmadd132pd|ymm' | wc -l"
…hin the ixx/oxx functions."" This reverts commit c075db4.
Note: with this newer implementation, there are 126 lines from "objdump -d -C CPPProcess.o | egrep 'vaddpd|vmul|vfmadd132pd|ymm' | wc -l".
SIMD vectorization does not yet have a positive effect on performance (one would also have to migrate the FFV functions, which requires RRRRIIII), but this is a first proof of concept that the changes go in the right direction.
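To make the RRRRIIII remark concrete, here is a minimal sketch of a vector of four complex numbers stored as four real parts followed by four imaginary parts, instead of four interleaved (real, imaginary) pairs as with an array of std::complex. The names `neppV`, `ComplexV` and `mul` are assumptions for illustration and are not taken from this PR; plain arrays are used rather than compiler vector extensions so that the sketch builds with any C++11 compiler.

```cpp
#include <array>
#include <cassert>

// Events per vector: 4 doubles fill one 256-bit register and match AOSOA[4].
constexpr int neppV = 4;

// Hypothetical RRRRIIII container: all real parts are contiguous, then all
// imaginary parts, so each half can be loaded as one packed vector of doubles.
struct ComplexV
{
  std::array<double, neppV> re;
  std::array<double, neppV> im;
};

// Element-wise complex product for four events at once. Both loop statements
// act on contiguous doubles with no shuffling between real and imaginary lanes.
inline ComplexV mul(const ComplexV& a, const ComplexV& b)
{
  assert(neppV > 0);
  ComplexV c;
  for (int i = 0; i < neppV; ++i)
  {
    c.re[i] = a.re[i] * b.re[i] - a.im[i] * b.im[i];
    c.im[i] = a.re[i] * b.im[i] + a.im[i] * b.re[i];
  }
  return c;
}
```

Whether a given compiler turns these loops into packed multiplies and fused multiply-adds depends on the compiler and flags; the objdump command quoted above is one way to check what was actually generated.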
./gcheck.exe -p 16384 32 1 *********************************************************************** NumBlocksPerGrid = 16384 NumThreadsPerBlock = 32 NumIterations = 1 ----------------------------------------------------------------------- FP precision = DOUBLE (nan=0) Complex type = THRUST::COMPLEX RanNumb memory layout = AOSOA[4] Momenta memory layout = AOSOA[4] Wavefunction GPU memory = LOCAL Random number generation = CURAND DEVICE (CUDA code) ----------------------------------------------------------------------- NumberOfEntries = 1 TotalTime[Rnd+Rmb+ME] (123)= ( 7.653209e-03 ) sec TotalTime[Rambo+ME] (23)= ( 7.009831e-03 ) sec TotalTime[RndNumGen] (1)= ( 6.433780e-04 ) sec TotalTime[Rambo] (2)= ( 6.199672e-03 ) sec TotalTime[MatrixElems] (3)= ( 8.101590e-04 ) sec MeanTimeInMatrixElems = ( 8.101590e-04 ) sec [Min,Max]TimeInMatrixElems = [ 8.101590e-04 , 8.101590e-04 ] sec ----------------------------------------------------------------------- TotalEventsComputed = 524288 EvtsPerSec[Rnd+Rmb+ME](123)= ( 6.850564e+07 ) sec^-1 EvtsPerSec[Rmb+ME] (23)= ( 7.479324e+07 ) sec^-1 EvtsPerSec[MatrixElems] (3)= ( 6.471421e+08 ) sec^-1 *********************************************************************** NumMatrixElements(notNan) = 524288 MeanMatrixElemValue = ( 1.371958e-02 +- 1.132119e-05 ) GeV^0 [Min,Max]MatrixElemValue = [ 6.071582e-03 , 3.374915e-02 ] GeV^0 StdDevMatrixElemValue = ( 8.197419e-03 ) GeV^0 MeanWeight = ( 4.515827e-01 +- 0.000000e+00 ) [Min,Max]Weight = [ 4.515827e-01 , 4.515827e-01 ] StdDevWeight = ( 0.000000e+00 ) *********************************************************************** 00 CudaFree : 0.687987 sec 0a ProcInit : 0.000422 sec 0b MemAlloc : 0.034919 sec 0c GenCreat : 0.011849 sec 0d SGoodHel : 0.001837 sec 1a GenSeed : 0.000013 sec 1b GenRnGen : 0.000631 sec 2a RamboIni : 0.000019 sec 2b RamboFin : 0.000013 sec 2c CpDTHwgt : 0.000519 sec 2d CpDTHmom : 0.005649 sec 3a SigmaKin : 0.000015 sec 3b CpDTHmes : 0.000795 sec 4a DumpLoop : 0.005283 
sec 8a CompStat : 0.003659 sec 9a GenDestr : 0.000051 sec 9b DumpScrn : 0.000226 sec 9c DumpJson : 0.000007 sec TOTAL : 0.753894 sec TOTAL (123) : 0.007653 sec TOTAL (23) : 0.007010 sec TOTAL (1) : 0.000643 sec TOTAL (2) : 0.006200 sec TOTAL (3) : 0.000810 sec *********************************************************************** ./check.exe -p 16384 32 1 *********************************************************************** NumBlocksPerGrid = 16384 NumThreadsPerBlock = 32 NumIterations = 1 ----------------------------------------------------------------------- FP precision = DOUBLE (nan=0) Complex type = STD::COMPLEX RanNumb memory layout = AOSOA[4] Momenta memory layout = AOSOA[4] Random number generation = CURAND (C++ code) ----------------------------------------------------------------------- NumberOfEntries = 1 TotalTime[Rnd+Rmb+ME] (123)= ( 1.467684e+00 ) sec TotalTime[Rambo+ME] (23)= ( 1.439913e+00 ) sec TotalTime[RndNumGen] (1)= ( 2.777057e-02 ) sec TotalTime[Rambo] (2)= ( 9.937939e-02 ) sec TotalTime[MatrixElems] (3)= ( 1.340534e+00 ) sec MeanTimeInMatrixElems = ( 1.340534e+00 ) sec [Min,Max]TimeInMatrixElems = [ 1.340534e+00 , 1.340534e+00 ] sec ----------------------------------------------------------------------- TotalEventsComputed = 524288 EvtsPerSec[Rnd+Rmb+ME](123)= ( 3.572214e+05 ) sec^-1 EvtsPerSec[Rmb+ME] (23)= ( 3.641109e+05 ) sec^-1 EvtsPerSec[MatrixElems] (3)= ( 3.911039e+05 ) sec^-1 *********************************************************************** NumMatrixElements(notNan) = 524288 MeanMatrixElemValue = ( 1.371958e-02 +- 1.132119e-05 ) GeV^0 [Min,Max]MatrixElemValue = [ 6.071582e-03 , 3.374915e-02 ] GeV^0 StdDevMatrixElemValue = ( 8.197419e-03 ) GeV^0 MeanWeight = ( 4.515827e-01 +- 0.000000e+00 ) [Min,Max]Weight = [ 4.515827e-01 , 4.515827e-01 ] StdDevWeight = ( 0.000000e+00 ) *********************************************************************** 0a ProcInit : 0.000319 sec 0b MemAlloc : 0.027103 sec 0c GenCreat : 0.000906 sec 0d 
SGoodHel : 0.000186 sec 1a GenSeed : 0.000009 sec 1b GenRnGen : 0.027762 sec 2a RamboIni : 0.006784 sec 2b RamboFin : 0.092595 sec 3a SigmaKin : 1.340534 sec 4a DumpLoop : 0.004592 sec 8a CompStat : 0.003529 sec 9a GenDestr : 0.000083 sec 9b DumpScrn : 0.000231 sec 9c DumpJson : 0.000011 sec TOTAL : 1.504643 sec TOTAL (123) : 1.467684 sec TOTAL (23) : 1.439913 sec TOTAL (1) : 0.027771 sec TOTAL (2) : 0.099379 sec TOTAL (3) : 1.340534 sec ***********************************************************************
There is more vectorization, but it segfaults...
./check.exe -p 16384 32 1
Segmentation fault (core dumped)
objdump -d -C check.exe | egrep 'vaddpd|vmul|vfmadd132pd|ymm' | wc -l
216
Using avx2 in -march, valgrind at least tells me it is a General Protection Fault
==481028== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==481028== General Protection Fault
==481028==    at 0x40876E: MG5_sm::oxzxxxM0(double const*, int, int, std::complex<double> (*) [4], int, int) (in /afs/cern.ch/user/a/avalassi/GPU2020/madgraph4gpu/epoch1/cuda/ee_mumu/SubProcesses/P1_Sigma_sm_epem_mupmum/check.exe)
==481028==    by 0x40A48C: Proc::calculate_wavefunctions(int, double const*, double*, int) (in /afs/cern.ch/user/a/avalassi/GPU2020/madgraph4gpu/epoch1/cuda/ee_mumu/SubProcesses/P1_Sigma_sm_epem_mupmum/check.exe)
==481028==    by 0x40AD16: Proc::sigmaKin_getGoodHel(double const*, double*, bool*, int) (in /afs/cern.ch/user/a/avalassi/GPU2020/madgraph4gpu/epoch1/cuda/ee_mumu/SubProcesses/P1_Sigma_sm_epem_mupmum/check.exe)
==481028==    by 0x405207: main (in /afs/cern.ch/user/a/avalassi/GPU2020/madgraph4gpu/epoch1/cuda/ee_mumu/SubProcesses/P1_Sigma_sm_epem_mupmum/check.exe)
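The trace above does not say why oxzxxxM0 faults. One frequent cause of a General Protection Fault in AVX code, offered here only as a hypothesis to check, is an aligned 256-bit load or store applied to an address that is not 32-byte aligned. A minimal sketch of how one might test for that, using a hypothetical helper `isAligned` and a hypothetical 32-byte-aligned buffer type `Double4`:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// True if p sits on an 'alignment'-byte boundary (alignment must be a power of two).
inline bool isAligned(const void* p, std::size_t alignment)
{
  assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
  return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}

// Hypothetical buffer of four doubles forced onto a 32-byte boundary, so that
// aligned 256-bit loads and stores on it are legal.
struct alignas(32) Double4
{
  double v[4];
};
```

Calling `isAligned(ptr, 32)` on the buffers passed into the vectorized function (for example in a debug build) would confirm or rule out this hypothesis.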
objdump -d -C check.exe | egrep 'vaddpd|vmul|vfmadd132pd|ymm' | wc -l
253
./check.exe -p 16384 32 1
***********************************************************************
NumBlocksPerGrid = 16384
NumThreadsPerBlock = 32
NumIterations = 1
-----------------------------------------------------------------------
FP precision = DOUBLE (nan=0)
Complex type = STD::COMPLEX
RanNumb memory layout = AOSOA[4]
Momenta memory layout = AOSOA[4]
Random number generation = CURAND (C++ code)
-----------------------------------------------------------------------
NumberOfEntries = 1
TotalTime[Rnd+Rmb+ME] (123)= ( 1.271031e+00 ) sec
TotalTime[Rambo+ME] (23)= ( 1.243470e+00 ) sec
TotalTime[RndNumGen] (1)= ( 2.756105e-02 ) sec
TotalTime[Rambo] (2)= ( 9.889553e-02 ) sec
TotalTime[MatrixElems] (3)= ( 1.144575e+00 ) sec
MeanTimeInMatrixElems = ( 1.144575e+00 ) sec
[Min,Max]TimeInMatrixElems = [ 1.144575e+00 , 1.144575e+00 ] sec
-----------------------------------------------------------------------
TotalEventsComputed = 524288
EvtsPerSec[Rnd+Rmb+ME](123)= ( 4.124902e+05 ) sec^-1
EvtsPerSec[Rmb+ME] (23)= ( 4.216329e+05 ) sec^-1
EvtsPerSec[MatrixElems] (3)= ( 4.580635e+05 ) sec^-1
***********************************************************************
NumMatrixElements(notNan) = 524288
MeanMatrixElemValue = ( 1.371958e-02 +- 1.132119e-05 ) GeV^0
[Min,Max]MatrixElemValue = [ 6.071582e-03 , 3.374915e-02 ] GeV^0
StdDevMatrixElemValue = ( 8.197419e-03 ) GeV^0
MeanWeight = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight = [ 4.515827e-01 , 4.515827e-01 ]
StdDevWeight = ( 0.000000e+00 )
***********************************************************************
0a ProcInit : 0.000263 sec
0b MemAlloc : 0.026885 sec
0c GenCreat : 0.000805 sec
0d SGoodHel : 0.000086 sec
1a GenSeed : 0.000009 sec
1b GenRnGen : 0.027552 sec
2a RamboIni : 0.006703 sec
2b RamboFin : 0.092193 sec
3a SigmaKin : 1.144575 sec
4a DumpLoop : 0.004472 sec
8a CompStat : 0.003578 sec
9a GenDestr : 0.000075 sec
9b DumpScrn : 0.000161 sec
9c DumpJson : 0.000008 sec
TOTAL : 1.307366 sec
TOTAL (123) : 1.271031 sec
TOTAL (23) : 1.243470 sec
TOTAL (1) : 0.027561 sec
TOTAL (2) : 0.098896 sec
TOTAL (3) : 1.144575 sec
***********************************************************************
…rt it" This reverts commit e3e79c5. Allow a host to build avx512 even if it is unable to run it...
… it cannot run it.
This does prevent a crash on pmpe04, but not on some github CI nodes.
…cted. It looks like this is doing compile-time dispatching, not runtime?
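The distinction raised here can be sketched as follows. Macros such as `__AVX512F__` and `__AVX2__` are fixed when the translation unit is compiled (they reflect the -march or -m flags), whereas `__builtin_cpu_supports`, a gcc and clang builtin on x86, queries the CPU that is actually executing the program. The function names `builtFor` and `hostCanRun` are hypothetical, and nothing here is claimed to be how this PR handles the problem; other compilers may need a different runtime check.

```cpp
#include <cassert>
#include <string>

// Compile-time view: which instruction set this translation unit was built for.
inline std::string builtFor()
{
#if defined(__AVX512F__)
  return "avx512";
#elif defined(__AVX2__)
  return "avx2";
#else
  return "none";
#endif
}

// Runtime view: whether the CPU running this process supports the given ISA.
inline bool hostCanRun(const std::string& isa)
{
  if (isa == "none") return true;
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
  if (isa == "avx512") return __builtin_cpu_supports("avx512f") != 0;
  if (isa == "avx2") return __builtin_cpu_supports("avx2") != 0;
#endif
  return false; // unknown ISA name, or no runtime query available on this platform
}
```

An executable built for avx512 could call `hostCanRun(builtFor())` at the start of main and exit with a clear message on a host that lacks the instructions, instead of crashing on the first unsupported instruction. For this to be reliable, the file containing the check must itself be compiled without the wide-vector flags.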
… as expected." This reverts commit cadeb13.
…uild (icc only?)" This reverts commit 714f074.
Fix conflicts: epoch2/cuda/ee_mumu/SubProcesses/Makefile
Fix conflicts: epoch1/cuda/ee_mumu/SubProcesses/P1_Sigma_sm_epem_mupmum/runTest.cc
…est.exe. This enables the GPU test (from runTest_cu.o).
Before the fix:
./build.avx512/runTest.exe
Running main() from /afs/cern.ch/user/a/avalassi/GPU2020/madgraph4gpuQua/test/googletest/googletest/src/gtest_main.cc
[==========] Running 0 tests from 0 test suites.
[==========] 0 tests from 0 test suites ran. (1 ms total)
[ PASSED ] 0 tests.
After the fix:
./build.avx512/runTest.exe
Running main() from /afs/cern.ch/user/a/avalassi/GPU2020/madgraph4gpuQua/test/googletest/googletest/src/gtest_main.cc
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from EPOCH1_EEMUMU_GPU/MadgraphTestDouble
[ RUN ] EPOCH1_EEMUMU_GPU/MadgraphTestDouble.eemumu/0
[ OK ] EPOCH1_EEMUMU_GPU/MadgraphTestDouble.eemumu/0 (956 ms)
[----------] 1 test from EPOCH1_EEMUMU_GPU/MadgraphTestDouble (956 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (956 ms total)
[ PASSED ] 1 test.
…t.exe. Use a makefile structure much closer to Stephan's original version for tests
All tests run successfully now
./build.avx512/runTest.exe
Running main() from /afs/cern.ch/user/a/avalassi/GPU2020/madgraph4gpuQua/test/googletest/googletest/src/gtest_main.cc
[==========] Running 2 tests from 2 test suites.
[----------] Global test environment set-up.
[----------] 1 test from EPOCH1_EEMUMU_CPU/MadgraphTestDouble
[ RUN ] EPOCH1_EEMUMU_CPU/MadgraphTestDouble.eemumu/0
[ OK ] EPOCH1_EEMUMU_CPU/MadgraphTestDouble.eemumu/0 (24 ms)
[----------] 1 test from EPOCH1_EEMUMU_CPU/MadgraphTestDouble (24 ms total)
[----------] 1 test from EPOCH1_EEMUMU_GPU/MadgraphTestDouble
[ RUN ] EPOCH1_EEMUMU_GPU/MadgraphTestDouble.eemumu/0
[ OK ] EPOCH1_EEMUMU_GPU/MadgraphTestDouble.eemumu/0 (968 ms)
[----------] 1 test from EPOCH1_EEMUMU_GPU/MadgraphTestDouble (968 ms total)
[----------] Global test environment tear-down
[==========] 2 tests from 2 test suites ran. (992 ms total)
[ PASSED ] 2 tests.
Fix conflicts: epoch1/cuda/ee_mumu/SubProcesses/Makefile
Add a comment about bug madgraph5#136
Klas2 - further extensions for SIMD and related work
This includes minimal changes to allow clang builds. These were initially used to test SIMD, which is however disabled for clang because compiler vector extensions do not work in that case. The clang support is WIP, as is this whole PR.