Klas2 - further extensions for SIMD and related work #132

Closed
valassi wants to merge 193 commits into madgraph5:master from valassi:klas2
@valassi valassi commented Mar 19, 2021

Klas2 - further extensions for SIMD and related work

This includes minimal changes to allow clang builds. Clang was initially used to test SIMD, which is however disabled there because compiler vector extensions do not work in that case.

This whole PR is WIP.

…plify the code.

Prepare to improve kernel launchers by moving c++ event loops further inside.

       = 16384
NumThreadsPerBlock         = 32
NumIterations              = 1
-----------------------------------------------------------------------
FP precision               = DOUBLE (nan=0)
Complex type               = THRUST::COMPLEX
RanNumb memory layout      = AOSOA[4]
Momenta memory layout      = AOSOA[4]
Wavefunction GPU memory    = LOCAL
Random number generation   = CURAND DEVICE (CUDA code)
-----------------------------------------------------------------------
NumberOfEntries            = 1
TotalTime[Rnd+Rmb+ME] (123)= ( 7.378591e-03                 )  sec
TotalTime[Rambo+ME]    (23)= ( 6.737728e-03                 )  sec
TotalTime[RndNumGen]    (1)= ( 6.408630e-04                 )  sec
TotalTime[Rambo]        (2)= ( 5.967797e-03                 )  sec
TotalTime[MatrixElems]  (3)= ( 7.699310e-04                 )  sec
MeanTimeInMatrixElems      = ( 7.699310e-04                 )  sec
[Min,Max]TimeInMatrixElems = [ 7.699310e-04 ,  7.699310e-04 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed        = 524288
EvtsPerSec[Rnd+Rmb+ME](123)= ( 7.105530e+07                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 7.781377e+07                 )  sec^-1
EvtsPerSec[MatrixElems] (3)= ( 6.809545e+08                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)  = 524288
MeanMatrixElemValue        = ( 1.371958e-02 +- 1.132119e-05 )  GeV^0
[Min,Max]MatrixElemValue   = [ 6.071582e-03 ,  3.374915e-02 ]  GeV^0
StdDevMatrixElemValue      = ( 8.197419e-03                 )  GeV^0
MeanWeight                 = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight            = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight               = ( 0.000000e+00                 )
***********************************************************************
00 CudaFree :     1.084176 sec
0a ProcInit :     0.000522 sec
0b MemAlloc :     0.035510 sec
0c GenCreat :     0.009668 sec
0d SGoodHel :     0.001756 sec
1a GenSeed  :     0.000012 sec
1b GenRnGen :     0.000629 sec
2a RamboIni :     0.000041 sec
2b RamboFin :     0.000013 sec
2c CpDTHwgt :     0.000475 sec
2d CpDTHmom :     0.005438 sec
3a SigmaKin :     0.000013 sec
3b CpDTHmes :     0.000757 sec
4a DumpLoop :     0.003222 sec
8a CompStat :     0.003654 sec
9a GenDestr :     0.000053 sec
9b DumpScrn :     0.000229 sec
9c DumpJson :     0.000008 sec
TOTAL       :     1.146176 sec
TOTAL (123) :     0.007379 sec
TOTAL  (23) :     0.006738 sec
TOTAL   (1) :     0.000641 sec
TOTAL   (2) :     0.005968 sec
TOTAL   (3) :     0.000770 sec
***********************************************************************
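
For reference when reading these logs: the throughput lines appear to be TotalEventsComputed divided by the corresponding TotalTime, and TotalEventsComputed is the product of the three values passed to -p (blocks, threads per block, iterations). A minimal C++ sketch of that arithmetic follows. The names totalEvents and eventsPerSec are hypothetical and are not taken from the code in this PR.

```cpp
#include <cassert>

// Events processed in one run: blocks x threads x iterations,
// i.e. the three positional values in "-p 16384 32 1".
inline long totalEvents(long blocks, long threadsPerBlock, long iterations)
{
  return blocks * threadsPerBlock * iterations;
}

// Throughput in events per second for a measured time in seconds.
inline double eventsPerSec(long nEvents, double seconds)
{
  assert(seconds > 0.);
  return static_cast<double>(nEvents) / seconds;
}
```

With the numbers from the log above, totalEvents(16384, 32, 1) gives 524288, and dividing by TotalTime[Rnd+Rmb+ME] = 7.378591e-03 sec gives about 7.1055e+07 events per second, in agreement with the EvtsPerSec[Rnd+Rmb+ME] line.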

./check.exe -p 16384 32 1
***********************************************************************
NumBlocksPerGrid           = 16384
NumThreadsPerBlock         = 32
NumIterations              = 1
-----------------------------------------------------------------------
FP precision               = DOUBLE (nan=0)
Complex type               = STD::COMPLEX
RanNumb memory layout      = AOSOA[4]
Momenta memory layout      = AOSOA[4]
Random number generation   = CURAND (C++ code)
-----------------------------------------------------------------------
NumberOfEntries            = 1
TotalTime[Rnd+Rmb+ME] (123)= ( 1.512748e+00                 )  sec
TotalTime[Rambo+ME]    (23)= ( 1.477347e+00                 )  sec
TotalTime[RndNumGen]    (1)= ( 3.540115e-02                 )  sec
TotalTime[Rambo]        (2)= ( 1.121947e-01                 )  sec
TotalTime[MatrixElems]  (3)= ( 1.365152e+00                 )  sec
MeanTimeInMatrixElems      = ( 1.365152e+00                 )  sec
[Min,Max]TimeInMatrixElems = [ 1.365152e+00 ,  1.365152e+00 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed        = 524288
EvtsPerSec[Rnd+Rmb+ME](123)= ( 3.465798e+05                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 3.548848e+05                 )  sec^-1
EvtsPerSec[MatrixElems] (3)= ( 3.840509e+05                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)  = 524288
MeanMatrixElemValue        = ( 1.371958e-02 +- 1.132119e-05 )  GeV^0
[Min,Max]MatrixElemValue   = [ 6.071582e-03 ,  3.374915e-02 ]  GeV^0
StdDevMatrixElemValue      = ( 8.197419e-03                 )  GeV^0
MeanWeight                 = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight            = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight               = ( 0.000000e+00                 )
***********************************************************************
0a ProcInit :     0.000329 sec
0b MemAlloc :     0.000044 sec
0c GenCreat :     0.000853 sec
1a GenSeed  :     0.000008 sec
1b GenRnGen :     0.035393 sec
2a RamboIni :     0.016423 sec
2b RamboFin :     0.095772 sec
3a SigmaKin :     1.365152 sec
4a DumpLoop :     0.004525 sec
8a CompStat :     0.003041 sec
9a GenDestr :     0.000072 sec
9b DumpScrn :     0.000189 sec
9c DumpJson :     0.000009 sec
TOTAL       :     1.521810 sec
TOTAL (123) :     1.512748 sec
TOTAL  (23) :     1.477347 sec
TOTAL   (1) :     0.035401 sec
TOTAL   (2) :     0.112195 sec
TOTAL   (3) :     1.365152 sec
***********************************************************************
…cal variable.

This seems to cause a small performance degradation on GPU.
If needed, it can be recovered by using a local variable in calculate_wavefunctions.

./gcheck.exe -p 16384 32 1
***********************************************************************
NumBlocksPerGrid           = 16384
NumThreadsPerBlock         = 32
NumIterations              = 1
-----------------------------------------------------------------------
FP precision               = DOUBLE (nan=0)
Complex type               = THRUST::COMPLEX
RanNumb memory layout      = AOSOA[4]
Momenta memory layout      = AOSOA[4]
Wavefunction GPU memory    = LOCAL
Random number generation   = CURAND DEVICE (CUDA code)
-----------------------------------------------------------------------
NumberOfEntries            = 1
TotalTime[Rnd+Rmb+ME] (123)= ( 7.408483e-03                 )  sec
TotalTime[Rambo+ME]    (23)= ( 6.765105e-03                 )  sec
TotalTime[RndNumGen]    (1)= ( 6.433780e-04                 )  sec
TotalTime[Rambo]        (2)= ( 5.962489e-03                 )  sec
TotalTime[MatrixElems]  (3)= ( 8.026160e-04                 )  sec
MeanTimeInMatrixElems      = ( 8.026160e-04                 )  sec
[Min,Max]TimeInMatrixElems = [ 8.026160e-04 ,  8.026160e-04 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed        = 524288
EvtsPerSec[Rnd+Rmb+ME](123)= ( 7.076861e+07                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 7.749887e+07                 )  sec^-1
EvtsPerSec[MatrixElems] (3)= ( 6.532240e+08                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)  = 524288
MeanMatrixElemValue        = ( 1.371958e-02 +- 1.132119e-05 )  GeV^0
[Min,Max]MatrixElemValue   = [ 6.071582e-03 ,  3.374915e-02 ]  GeV^0
StdDevMatrixElemValue      = ( 8.197419e-03                 )  GeV^0
MeanWeight                 = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight            = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight               = ( 0.000000e+00                 )
***********************************************************************
00 CudaFree :     0.687299 sec
0a ProcInit :     0.000427 sec
0b MemAlloc :     0.035461 sec
0c GenCreat :     0.012625 sec
0d SGoodHel :     0.001746 sec
1a GenSeed  :     0.000010 sec
1b GenRnGen :     0.000633 sec
2a RamboIni :     0.000017 sec
2b RamboFin :     0.000016 sec
2c CpDTHwgt :     0.000506 sec
2d CpDTHmom :     0.005423 sec
3a SigmaKin :     0.000015 sec
3b CpDTHmes :     0.000787 sec
4a DumpLoop :     0.005516 sec
8a CompStat :     0.003650 sec
9a GenDestr :     0.000055 sec
9b DumpScrn :     0.000293 sec
9c DumpJson :     0.000008 sec
TOTAL       :     0.754489 sec
TOTAL (123) :     0.007408 sec
TOTAL  (23) :     0.006765 sec
TOTAL   (1) :     0.000643 sec
TOTAL   (2) :     0.005962 sec
TOTAL   (3) :     0.000803 sec
***********************************************************************

./check.exe -p 16384 32 1
***********************************************************************
NumBlocksPerGrid           = 16384
NumThreadsPerBlock         = 32
NumIterations              = 1
-----------------------------------------------------------------------
FP precision               = DOUBLE (nan=0)
Complex type               = STD::COMPLEX
RanNumb memory layout      = AOSOA[4]
Momenta memory layout      = AOSOA[4]
Random number generation   = CURAND (C++ code)
-----------------------------------------------------------------------
NumberOfEntries            = 1
TotalTime[Rnd+Rmb+ME] (123)= ( 1.509930e+00                 )  sec
TotalTime[Rambo+ME]    (23)= ( 1.475219e+00                 )  sec
TotalTime[RndNumGen]    (1)= ( 3.471084e-02                 )  sec
TotalTime[Rambo]        (2)= ( 1.109848e-01                 )  sec
TotalTime[MatrixElems]  (3)= ( 1.364234e+00                 )  sec
MeanTimeInMatrixElems      = ( 1.364234e+00                 )  sec
[Min,Max]TimeInMatrixElems = [ 1.364234e+00 ,  1.364234e+00 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed        = 524288
EvtsPerSec[Rnd+Rmb+ME](123)= ( 3.472268e+05                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 3.553968e+05                 )  sec^-1
EvtsPerSec[MatrixElems] (3)= ( 3.843094e+05                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)  = 524288
MeanMatrixElemValue        = ( 1.371958e-02 +- 1.132119e-05 )  GeV^0
[Min,Max]MatrixElemValue   = [ 6.071582e-03 ,  3.374915e-02 ]  GeV^0
StdDevMatrixElemValue      = ( 8.197419e-03                 )  GeV^0
MeanWeight                 = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight            = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight               = ( 0.000000e+00                 )
***********************************************************************
0a ProcInit :     0.000314 sec
0b MemAlloc :     0.000042 sec
0c GenCreat :     0.000848 sec
1a GenSeed  :     0.000008 sec
1b GenRnGen :     0.034703 sec
2a RamboIni :     0.016324 sec
2b RamboFin :     0.094661 sec
3a SigmaKin :     1.364234 sec
4a DumpLoop :     0.004442 sec
8a CompStat :     0.003013 sec
9a GenDestr :     0.000119 sec
9b DumpScrn :     0.000195 sec
9c DumpJson :     0.000007 sec
TOTAL       :     1.518909 sec
TOTAL (123) :     1.509930 sec
TOTAL  (23) :     1.475219 sec
TOTAL   (1) :     0.034711 sec
TOTAL   (2) :     0.110985 sec
TOTAL   (3) :     1.364234 sec
***********************************************************************
…ions.

There are at least two issues here:
- on both CPU and GPU, dividing by the denominators should be done once, not for each helicity
- the helicity filtering in the C++ code uses a loop that is buggy (it gives up too early)
Note: essentially I am inverting the helicity and event loops, this is the key.
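
The loop inversion and the single division described above can be sketched as follows. This is only an illustration under stated assumptions, not the code in this PR: MeContribution and sumOverHelicities are hypothetical names, and the per-helicity calculation is replaced by a plain function pointer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the contribution of one helicity to one event.
using MeContribution = double (*)(std::size_t ievt, int ihel);

// Sketch: helicity loop outside, event loop inside, one division per event.
inline void sumOverHelicities(MeContribution me,
                              const std::vector<int>& goodHel,
                              double denominator,
                              std::vector<double>& allMEs)
{
  assert(denominator != 0.);
  for (double& v : allMEs) v = 0.; // reset the running sums
  for (int ihel : goodHel) // outer loop: good helicities only
    for (std::size_t ievt = 0; ievt < allMEs.size(); ++ievt) // inner loop: events
      allMEs[ievt] += me(ievt, ihel);
  for (double& v : allMEs) v /= denominator; // divide once, after the helicity sum
}
```

With the event loop innermost, consecutive iterations touch consecutive events, which is the access pattern a vectorizing compiler needs.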

***********************************************************************
NumBlocksPerGrid           = 16384
NumThreadsPerBlock         = 32
NumIterations              = 1
-----------------------------------------------------------------------
FP precision               = DOUBLE (nan=0)
Complex type               = THRUST::COMPLEX
RanNumb memory layout      = AOSOA[4]
Momenta memory layout      = AOSOA[4]
Wavefunction GPU memory    = LOCAL
Random number generation   = CURAND DEVICE (CUDA code)
-----------------------------------------------------------------------
NumberOfEntries            = 1
TotalTime[Rnd+Rmb+ME] (123)= ( 7.218796e-03                 )  sec
TotalTime[Rambo+ME]    (23)= ( 6.566758e-03                 )  sec
TotalTime[RndNumGen]    (1)= ( 6.520380e-04                 )  sec
TotalTime[Rambo]        (2)= ( 5.775874e-03                 )  sec
TotalTime[MatrixElems]  (3)= ( 7.908840e-04                 )  sec
MeanTimeInMatrixElems      = ( 7.908840e-04                 )  sec
[Min,Max]TimeInMatrixElems = [ 7.908840e-04 ,  7.908840e-04 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed        = 524288
EvtsPerSec[Rnd+Rmb+ME](123)= ( 7.262818e+07                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 7.983970e+07                 )  sec^-1
EvtsPerSec[MatrixElems] (3)= ( 6.629139e+08                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)  = 524288
MeanMatrixElemValue        = ( 1.371958e-02 +- 1.132119e-05 )  GeV^0
[Min,Max]MatrixElemValue   = [ 6.071582e-03 ,  3.374915e-02 ]  GeV^0
StdDevMatrixElemValue      = ( 8.197419e-03                 )  GeV^0
MeanWeight                 = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight            = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight               = ( 0.000000e+00                 )
***********************************************************************
00 CudaFree :     0.897224 sec
0a ProcInit :     0.000579 sec
0b MemAlloc :     0.037365 sec
0c GenCreat :     0.009812 sec
0d SGoodHel :     0.001844 sec
1a GenSeed  :     0.000011 sec
1b GenRnGen :     0.000641 sec
2a RamboIni :     0.000033 sec
2b RamboFin :     0.000012 sec
2c CpDTHwgt :     0.000482 sec
2d CpDTHmom :     0.005249 sec
3a SigmaKin :     0.000013 sec
3b CpDTHmes :     0.000778 sec
4a DumpLoop :     0.005650 sec
8a CompStat :     0.003652 sec
9a GenDestr :     0.000068 sec
9b DumpScrn :     0.000303 sec
9c DumpJson :     0.000007 sec
TOTAL       :     0.963722 sec
TOTAL (123) :     0.007219 sec
TOTAL  (23) :     0.006567 sec
TOTAL   (1) :     0.000652 sec
TOTAL   (2) :     0.005776 sec
TOTAL   (3) :     0.000791 sec
***********************************************************************
The physics results are correct, but performance is degraded by almost a factor of 2.
It looks like I am calculating too many helicities.
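
For context, a good-helicity filter of the kind timed as "0d SGoodHel" could look roughly like the sketch below. It assumes a helicity is "good" if at least one event in the sample gives a non-zero contribution. HelicityME and getGoodHelSketch are hypothetical names, and this is not the sigmaKin_getGoodHel implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the contribution of one helicity to one event.
using HelicityME = double (*)(std::size_t ievt, int ihel);

// Sketch: flag helicity ihel as good if any of the nevt events gives a
// non-zero contribution. All events are scanned before a helicity is
// declared bad; only a positive result ends the scan early.
inline std::vector<bool> getGoodHelSketch(HelicityME me, int ncomb, std::size_t nevt)
{
  assert(ncomb >= 0);
  std::vector<bool> isGoodHel(static_cast<std::size_t>(ncomb), false);
  for (int ihel = 0; ihel < ncomb; ++ihel)
    for (std::size_t ievt = 0; ievt < nevt; ++ievt)
      if (me(ievt, ihel) != 0.)
      {
        isGoodHel[static_cast<std::size_t>(ihel)] = true;
        break; // one non-zero event is enough
      }
  return isGoodHel;
}
```

If the mask ends up with more entries set than expected, the later helicity sum does more work than necessary, which would be consistent with the slowdown seen here.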

***********************************************************************
NumBlocksPerGrid           = 16384
NumThreadsPerBlock         = 32
NumIterations              = 1
-----------------------------------------------------------------------
FP precision               = DOUBLE (nan=0)
Complex type               = STD::COMPLEX
RanNumb memory layout      = AOSOA[4]
Momenta memory layout      = AOSOA[4]
Random number generation   = CURAND (C++ code)
-----------------------------------------------------------------------
NumberOfEntries            = 1
TotalTime[Rnd+Rmb+ME] (123)= ( 2.463410e+00                 )  sec
TotalTime[Rambo+ME]    (23)= ( 2.428130e+00                 )  sec
TotalTime[RndNumGen]    (1)= ( 3.528046e-02                 )  sec
TotalTime[Rambo]        (2)= ( 1.116047e-01                 )  sec
TotalTime[MatrixElems]  (3)= ( 2.316525e+00                 )  sec
MeanTimeInMatrixElems      = ( 2.316525e+00                 )  sec
[Min,Max]TimeInMatrixElems = [ 2.316525e+00 ,  2.316525e+00 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed        = 524288
EvtsPerSec[Rnd+Rmb+ME](123)= ( 2.128302e+05                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 2.159225e+05                 )  sec^-1
EvtsPerSec[MatrixElems] (3)= ( 2.263252e+05                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)  = 524288
MeanMatrixElemValue        = ( 1.371958e-02 +- 1.132119e-05 )  GeV^0
[Min,Max]MatrixElemValue   = [ 6.071582e-03 ,  3.374915e-02 ]  GeV^0
StdDevMatrixElemValue      = ( 8.197419e-03                 )  GeV^0
MeanWeight                 = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight            = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight               = ( 0.000000e+00                 )
***********************************************************************
0a ProcInit :     0.000353 sec
0b MemAlloc :     0.000047 sec
0c GenCreat :     0.000870 sec
0d SGoodHel :     0.000162 sec
1a GenSeed  :     0.000009 sec
1b GenRnGen :     0.035272 sec
2a RamboIni :     0.016716 sec
2b RamboFin :     0.094888 sec
3a SigmaKin :     2.316525 sec
4a DumpLoop :     0.004477 sec
8a CompStat :     0.002808 sec
9a GenDestr :     0.000077 sec
9b DumpScrn :     0.000211 sec
9c DumpJson :     0.000007 sec
TOTAL       :     2.472422 sec
TOTAL (123) :     2.463410 sec
TOTAL  (23) :     2.428130 sec
TOTAL   (1) :     0.035280 sec
TOTAL   (2) :     0.111605 sec
TOTAL   (3) :     2.316525 sec
***********************************************************************
./check.exe -p 16384 32 1
***********************************************************************
NumBlocksPerGrid           = 16384
NumThreadsPerBlock         = 32
NumIterations              = 1
-----------------------------------------------------------------------
FP precision               = DOUBLE (nan=0)
Complex type               = STD::COMPLEX
RanNumb memory layout      = AOSOA[4]
Momenta memory layout      = AOSOA[4]
Random number generation   = CURAND (C++ code)
-----------------------------------------------------------------------
NumberOfEntries            = 1
TotalTime[Rnd+Rmb+ME] (123)= ( 1.451416e+00                 )  sec
TotalTime[Rambo+ME]    (23)= ( 1.423717e+00                 )  sec
TotalTime[RndNumGen]    (1)= ( 2.769905e-02                 )  sec
TotalTime[Rambo]        (2)= ( 1.000820e-01                 )  sec
TotalTime[MatrixElems]  (3)= ( 1.323635e+00                 )  sec
MeanTimeInMatrixElems      = ( 1.323635e+00                 )  sec
[Min,Max]TimeInMatrixElems = [ 1.323635e+00 ,  1.323635e+00 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed        = 524288
EvtsPerSec[Rnd+Rmb+ME](123)= ( 3.612252e+05                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 3.682530e+05                 )  sec^-1
EvtsPerSec[MatrixElems] (3)= ( 3.960972e+05                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)  = 524288
MeanMatrixElemValue        = ( 1.371958e-02 +- 1.132119e-05 )  GeV^0
[Min,Max]MatrixElemValue   = [ 6.071582e-03 ,  3.374915e-02 ]  GeV^0
StdDevMatrixElemValue      = ( 8.197419e-03                 )  GeV^0
MeanWeight                 = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight            = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight               = ( 0.000000e+00                 )
***********************************************************************
0a ProcInit :     0.000317 sec
0b MemAlloc :     0.028313 sec
0c GenCreat :     0.000863 sec
0d SGoodHel :     0.000164 sec
1a GenSeed  :     0.000011 sec
1b GenRnGen :     0.027688 sec
2a RamboIni :     0.006959 sec
2b RamboFin :     0.093123 sec
3a SigmaKin :     1.323635 sec
4a DumpLoop :     0.004347 sec
8a CompStat :     0.003050 sec
9a GenDestr :     0.000076 sec
9b DumpScrn :     0.000253 sec
9c DumpJson :     0.000007 sec
TOTAL       :     1.488808 sec
TOTAL (123) :     1.451416 sec
TOTAL  (23) :     1.423717 sec
TOTAL   (1) :     0.027699 sec
TOTAL   (2) :     0.100082 sec
TOTAL   (3) :     1.323635 sec
***********************************************************************
…functions.

./gcheck.exe -p 16384 32 1
***********************************************************************
NumBlocksPerGrid           = 16384
NumThreadsPerBlock         = 32
NumIterations              = 1
-----------------------------------------------------------------------
FP precision               = DOUBLE (nan=0)
Complex type               = THRUST::COMPLEX
RanNumb memory layout      = AOSOA[4]
Momenta memory layout      = AOSOA[4]
Wavefunction GPU memory    = LOCAL
Random number generation   = CURAND DEVICE (CUDA code)
-----------------------------------------------------------------------
NumberOfEntries            = 1
TotalTime[Rnd+Rmb+ME] (123)= ( 7.351215e-03                 )  sec
TotalTime[Rambo+ME]    (23)= ( 6.713424e-03                 )  sec
TotalTime[RndNumGen]    (1)= ( 6.377910e-04                 )  sec
TotalTime[Rambo]        (2)= ( 5.919468e-03                 )  sec
TotalTime[MatrixElems]  (3)= ( 7.939560e-04                 )  sec
MeanTimeInMatrixElems      = ( 7.939560e-04                 )  sec
[Min,Max]TimeInMatrixElems = [ 7.939560e-04 ,  7.939560e-04 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed        = 524288
EvtsPerSec[Rnd+Rmb+ME](123)= ( 7.131991e+07                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 7.809547e+07                 )  sec^-1
EvtsPerSec[MatrixElems] (3)= ( 6.603489e+08                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)  = 524288
MeanMatrixElemValue        = ( 1.371958e-02 +- 1.132119e-05 )  GeV^0
[Min,Max]MatrixElemValue   = [ 6.071582e-03 ,  3.374915e-02 ]  GeV^0
StdDevMatrixElemValue      = ( 8.197419e-03                 )  GeV^0
MeanWeight                 = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight            = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight               = ( 0.000000e+00                 )
***********************************************************************
00 CudaFree :     0.686566 sec
0a ProcInit :     0.000418 sec
0b MemAlloc :     0.034625 sec
0c GenCreat :     0.012608 sec
0d SGoodHel :     0.001840 sec
1a GenSeed  :     0.000013 sec
1b GenRnGen :     0.000625 sec
2a RamboIni :     0.000017 sec
2b RamboFin :     0.000012 sec
2c CpDTHwgt :     0.000512 sec
2d CpDTHmom :     0.005378 sec
3a SigmaKin :     0.000013 sec
3b CpDTHmes :     0.000781 sec
4a DumpLoop :     0.005419 sec
8a CompStat :     0.003564 sec
9a GenDestr :     0.000099 sec
9b DumpScrn :     0.000212 sec
9c DumpJson :     0.000007 sec
TOTAL       :     0.752710 sec
TOTAL (123) :     0.007351 sec
TOTAL  (23) :     0.006713 sec
TOTAL   (1) :     0.000638 sec
TOTAL   (2) :     0.005919 sec
TOTAL   (3) :     0.000794 sec
***********************************************************************

./check.exe -p 16384 32 1
***********************************************************************
NumBlocksPerGrid           = 16384
NumThreadsPerBlock         = 32
NumIterations              = 1
-----------------------------------------------------------------------
FP precision               = DOUBLE (nan=0)
Complex type               = STD::COMPLEX
RanNumb memory layout      = AOSOA[4]
Momenta memory layout      = AOSOA[4]
Random number generation   = CURAND (C++ code)
-----------------------------------------------------------------------
NumberOfEntries            = 1
TotalTime[Rnd+Rmb+ME] (123)= ( 1.510862e+00                 )  sec
TotalTime[Rambo+ME]    (23)= ( 1.483222e+00                 )  sec
TotalTime[RndNumGen]    (1)= ( 2.763983e-02                 )  sec
TotalTime[Rambo]        (2)= ( 9.924585e-02                 )  sec
TotalTime[MatrixElems]  (3)= ( 1.383976e+00                 )  sec
MeanTimeInMatrixElems      = ( 1.383976e+00                 )  sec
[Min,Max]TimeInMatrixElems = [ 1.383976e+00 ,  1.383976e+00 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed        = 524288
EvtsPerSec[Rnd+Rmb+ME](123)= ( 3.470125e+05                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 3.534790e+05                 )  sec^-1
EvtsPerSec[MatrixElems] (3)= ( 3.788273e+05                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)  = 524288
MeanMatrixElemValue        = ( 1.371958e-02 +- 1.132119e-05 )  GeV^0
[Min,Max]MatrixElemValue   = [ 6.071582e-03 ,  3.374915e-02 ]  GeV^0
StdDevMatrixElemValue      = ( 8.197419e-03                 )  GeV^0
MeanWeight                 = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight            = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight               = ( 0.000000e+00                 )
***********************************************************************
0a ProcInit :     0.000333 sec
0b MemAlloc :     0.027811 sec
0c GenCreat :     0.000846 sec
0d SGoodHel :     0.000151 sec
1a GenSeed  :     0.000009 sec
1b GenRnGen :     0.027630 sec
2a RamboIni :     0.006760 sec
2b RamboFin :     0.092486 sec
3a SigmaKin :     1.383976 sec
4a DumpLoop :     0.004520 sec
8a CompStat :     0.003015 sec
9a GenDestr :     0.000075 sec
9b DumpScrn :     0.000257 sec
9c DumpJson :     0.000010 sec
TOTAL       :     1.547881 sec
TOTAL (123) :     1.510862 sec
TOTAL  (23) :     1.483222 sec
TOTAL   (1) :     0.027640 sec
TOTAL   (2) :     0.099246 sec
TOTAL   (3) :     1.383976 sec
***********************************************************************
Note: with this older implementation, there are 55 lines from
"objdump -d -C CPPProcess.o | egrep 'vaddpd|vmul|vfmadd132pd|ymm' | wc -l"
…hin the ixx/oxx functions.""

This reverts commit c075db4.

Note: with this newer implementation, there are 126 lines from
"objdump -d -C CPPProcess.o | egrep 'vaddpd|vmul|vfmadd132pd|ymm' | wc -l"

A positive effect of SIMD vectorization on performance is still not visible
(one would also have to migrate the FFV functions, which requires RRRRIIII),
but this is a first proof of concept that the changes go in the right direction.
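
To illustrate what an RRRRIIII layout means, here is a minimal sketch using the gcc vector_size extension: four real parts are stored in one vector and four imaginary parts in another, so that one complex multiplication maps to a handful of 4-wide operations. The type and function names (double_v, cxtype_v_sketch, lane) are assumptions made for this example and are not the types used in this PR.

```cpp
#include <cassert>
#include <complex>

// gcc/clang vector extension: four doubles in one 256-bit value.
typedef double double_v __attribute__((vector_size(32)));

// Hypothetical RRRRIIII complex vector: four complex numbers stored as
// four real parts followed by four imaginary parts.
struct cxtype_v_sketch
{
  double_v re; // RRRR
  double_v im; // IIII
};

// Element-wise complex product: (a.re + i a.im) * (b.re + i b.im).
inline cxtype_v_sketch operator*(const cxtype_v_sketch& a, const cxtype_v_sketch& b)
{
  return cxtype_v_sketch{ a.re * b.re - a.im * b.im, a.re * b.im + a.im * b.re };
}

// Extract lane i as a scalar std::complex, for checks and printing.
inline std::complex<double> lane(const cxtype_v_sketch& c, int i)
{
  assert(i >= 0 && i < 4);
  return std::complex<double>(c.re[i], c.im[i]);
}
```

With an array of std::complex<double> the memory layout is RIRIRIRI instead, and the same product needs shuffles to separate real and imaginary parts before the multiplications.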
./gcheck.exe -p 16384 32 1
***********************************************************************
NumBlocksPerGrid           = 16384
NumThreadsPerBlock         = 32
NumIterations              = 1
-----------------------------------------------------------------------
FP precision               = DOUBLE (nan=0)
Complex type               = THRUST::COMPLEX
RanNumb memory layout      = AOSOA[4]
Momenta memory layout      = AOSOA[4]
Wavefunction GPU memory    = LOCAL
Random number generation   = CURAND DEVICE (CUDA code)
-----------------------------------------------------------------------
NumberOfEntries            = 1
TotalTime[Rnd+Rmb+ME] (123)= ( 7.653209e-03                 )  sec
TotalTime[Rambo+ME]    (23)= ( 7.009831e-03                 )  sec
TotalTime[RndNumGen]    (1)= ( 6.433780e-04                 )  sec
TotalTime[Rambo]        (2)= ( 6.199672e-03                 )  sec
TotalTime[MatrixElems]  (3)= ( 8.101590e-04                 )  sec
MeanTimeInMatrixElems      = ( 8.101590e-04                 )  sec
[Min,Max]TimeInMatrixElems = [ 8.101590e-04 ,  8.101590e-04 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed        = 524288
EvtsPerSec[Rnd+Rmb+ME](123)= ( 6.850564e+07                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 7.479324e+07                 )  sec^-1
EvtsPerSec[MatrixElems] (3)= ( 6.471421e+08                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)  = 524288
MeanMatrixElemValue        = ( 1.371958e-02 +- 1.132119e-05 )  GeV^0
[Min,Max]MatrixElemValue   = [ 6.071582e-03 ,  3.374915e-02 ]  GeV^0
StdDevMatrixElemValue      = ( 8.197419e-03                 )  GeV^0
MeanWeight                 = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight            = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight               = ( 0.000000e+00                 )
***********************************************************************
00 CudaFree :     0.687987 sec
0a ProcInit :     0.000422 sec
0b MemAlloc :     0.034919 sec
0c GenCreat :     0.011849 sec
0d SGoodHel :     0.001837 sec
1a GenSeed  :     0.000013 sec
1b GenRnGen :     0.000631 sec
2a RamboIni :     0.000019 sec
2b RamboFin :     0.000013 sec
2c CpDTHwgt :     0.000519 sec
2d CpDTHmom :     0.005649 sec
3a SigmaKin :     0.000015 sec
3b CpDTHmes :     0.000795 sec
4a DumpLoop :     0.005283 sec
8a CompStat :     0.003659 sec
9a GenDestr :     0.000051 sec
9b DumpScrn :     0.000226 sec
9c DumpJson :     0.000007 sec
TOTAL       :     0.753894 sec
TOTAL (123) :     0.007653 sec
TOTAL  (23) :     0.007010 sec
TOTAL   (1) :     0.000643 sec
TOTAL   (2) :     0.006200 sec
TOTAL   (3) :     0.000810 sec
***********************************************************************

./check.exe -p 16384 32 1
***********************************************************************
NumBlocksPerGrid           = 16384
NumThreadsPerBlock         = 32
NumIterations              = 1
-----------------------------------------------------------------------
FP precision               = DOUBLE (nan=0)
Complex type               = STD::COMPLEX
RanNumb memory layout      = AOSOA[4]
Momenta memory layout      = AOSOA[4]
Random number generation   = CURAND (C++ code)
-----------------------------------------------------------------------
NumberOfEntries            = 1
TotalTime[Rnd+Rmb+ME] (123)= ( 1.467684e+00                 )  sec
TotalTime[Rambo+ME]    (23)= ( 1.439913e+00                 )  sec
TotalTime[RndNumGen]    (1)= ( 2.777057e-02                 )  sec
TotalTime[Rambo]        (2)= ( 9.937939e-02                 )  sec
TotalTime[MatrixElems]  (3)= ( 1.340534e+00                 )  sec
MeanTimeInMatrixElems      = ( 1.340534e+00                 )  sec
[Min,Max]TimeInMatrixElems = [ 1.340534e+00 ,  1.340534e+00 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed        = 524288
EvtsPerSec[Rnd+Rmb+ME](123)= ( 3.572214e+05                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 3.641109e+05                 )  sec^-1
EvtsPerSec[MatrixElems] (3)= ( 3.911039e+05                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)  = 524288
MeanMatrixElemValue        = ( 1.371958e-02 +- 1.132119e-05 )  GeV^0
[Min,Max]MatrixElemValue   = [ 6.071582e-03 ,  3.374915e-02 ]  GeV^0
StdDevMatrixElemValue      = ( 8.197419e-03                 )  GeV^0
MeanWeight                 = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight            = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight               = ( 0.000000e+00                 )
***********************************************************************
0a ProcInit :     0.000319 sec
0b MemAlloc :     0.027103 sec
0c GenCreat :     0.000906 sec
0d SGoodHel :     0.000186 sec
1a GenSeed  :     0.000009 sec
1b GenRnGen :     0.027762 sec
2a RamboIni :     0.006784 sec
2b RamboFin :     0.092595 sec
3a SigmaKin :     1.340534 sec
4a DumpLoop :     0.004592 sec
8a CompStat :     0.003529 sec
9a GenDestr :     0.000083 sec
9b DumpScrn :     0.000231 sec
9c DumpJson :     0.000011 sec
TOTAL       :     1.504643 sec
TOTAL (123) :     1.467684 sec
TOTAL  (23) :     1.439913 sec
TOTAL   (1) :     0.027771 sec
TOTAL   (2) :     0.099379 sec
TOTAL   (3) :     1.340534 sec
***********************************************************************
There is more vectorization, but it segfaults...

./check.exe -p 16384 32 1
  Segmentation fault (core dumped)
objdump -d -C check.exe | egrep 'vaddpd|vmul|vfmadd132pd|ymm'  | wc -l
  216
With avx2 in -march, valgrind at least tells me that it is a General Protection Fault:

==481028== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==481028==  General Protection Fault
==481028==    at 0x40876E: MG5_sm::oxzxxxM0(double const*, int, int, std::complex<double> (*) [4], int, int) (in /afs/cern.ch/user/a/avalassi/GPU2020/madgraph4gpu/epoch1/cuda/ee_mumu/SubProcesses/P1_Sigma_sm_epem_mupmum/check.exe)
==481028==    by 0x40A48C: Proc::calculate_wavefunctions(int, double const*, double*, int) (in /afs/cern.ch/user/a/avalassi/GPU2020/madgraph4gpu/epoch1/cuda/ee_mumu/SubProcesses/P1_Sigma_sm_epem_mupmum/check.exe)
==481028==    by 0x40AD16: Proc::sigmaKin_getGoodHel(double const*, double*, bool*, int) (in /afs/cern.ch/user/a/avalassi/GPU2020/madgraph4gpu/epoch1/cuda/ee_mumu/SubProcesses/P1_Sigma_sm_epem_mupmum/check.exe)
==481028==    by 0x405207: main (in /afs/cern.ch/user/a/avalassi/GPU2020/madgraph4gpu/epoch1/cuda/ee_mumu/SubProcesses/P1_Sigma_sm_epem_mupmum/check.exe)
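
The valgrind trace above does not say what triggered the fault. One common cause of a General Protection Fault in AVX code is an aligned vector load or store on memory that is not aligned to the vector width. That this is the cause here is an assumption, not something the log confirms. A minimal sketch of an alignment check that could be applied to the momenta buffers before calling the vectorized code follows; `isAligned` and `kAvx2Bytes` are hypothetical names, not taken from this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Width in bytes of one 256-bit AVX2 register (four doubles).
constexpr std::size_t kAvx2Bytes = 32;

// Hypothetical helper (not from this PR): returns true if 'ptr' is aligned
// to 'alignment' bytes. 'alignment' must be a non-zero power of two.
inline bool isAligned( const void* ptr, std::size_t alignment )
{
  assert( alignment != 0 && ( alignment & ( alignment - 1 ) ) == 0 );
  return ( reinterpret_cast<std::uintptr_t>( ptr ) & ( alignment - 1 ) ) == 0;
}
```

For example, with a buffer declared as `alignas(32) double buf[8]`, `isAligned( buf, kAvx2Bytes )` holds, while `isAligned( buf + 1, kAvx2Bytes )` does not, because `buf + 1` sits 8 bytes past a 32-byte boundary.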
objdump -d -C check.exe | egrep 'vaddpd|vmul|vfmadd132pd|ymm'  | wc -l
253

./check.exe -p 16384 32 1
***********************************************************************
NumBlocksPerGrid           = 16384
NumThreadsPerBlock         = 32
NumIterations              = 1
-----------------------------------------------------------------------
FP precision               = DOUBLE (nan=0)
Complex type               = STD::COMPLEX
RanNumb memory layout      = AOSOA[4]
Momenta memory layout      = AOSOA[4]
Random number generation   = CURAND (C++ code)
-----------------------------------------------------------------------
NumberOfEntries            = 1
TotalTime[Rnd+Rmb+ME] (123)= ( 1.271031e+00                 )  sec
TotalTime[Rambo+ME]    (23)= ( 1.243470e+00                 )  sec
TotalTime[RndNumGen]    (1)= ( 2.756105e-02                 )  sec
TotalTime[Rambo]        (2)= ( 9.889553e-02                 )  sec
TotalTime[MatrixElems]  (3)= ( 1.144575e+00                 )  sec
MeanTimeInMatrixElems      = ( 1.144575e+00                 )  sec
[Min,Max]TimeInMatrixElems = [ 1.144575e+00 ,  1.144575e+00 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed        = 524288
EvtsPerSec[Rnd+Rmb+ME](123)= ( 4.124902e+05                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 4.216329e+05                 )  sec^-1
EvtsPerSec[MatrixElems] (3)= ( 4.580635e+05                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)  = 524288
MeanMatrixElemValue        = ( 1.371958e-02 +- 1.132119e-05 )  GeV^0
[Min,Max]MatrixElemValue   = [ 6.071582e-03 ,  3.374915e-02 ]  GeV^0
StdDevMatrixElemValue      = ( 8.197419e-03                 )  GeV^0
MeanWeight                 = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight            = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight               = ( 0.000000e+00                 )
***********************************************************************
0a ProcInit :     0.000263 sec
0b MemAlloc :     0.026885 sec
0c GenCreat :     0.000805 sec
0d SGoodHel :     0.000086 sec
1a GenSeed  :     0.000009 sec
1b GenRnGen :     0.027552 sec
2a RamboIni :     0.006703 sec
2b RamboFin :     0.092193 sec
3a SigmaKin :     1.144575 sec
4a DumpLoop :     0.004472 sec
8a CompStat :     0.003578 sec
9a GenDestr :     0.000075 sec
9b DumpScrn :     0.000161 sec
9c DumpJson :     0.000008 sec
TOTAL       :     1.307366 sec
TOTAL (123) :     1.271031 sec
TOTAL  (23) :     1.243470 sec
TOTAL   (1) :     0.027561 sec
TOTAL   (2) :     0.098896 sec
TOTAL   (3) :     1.144575 sec
***********************************************************************
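
The throughput figures in the log follow directly from the printed counters: `TotalEventsComputed` is `NumBlocksPerGrid * NumThreadsPerBlock * NumIterations` (16384 * 32 * 1 = 524288), and `EvtsPerSec[MatrixElems]` is that count divided by `TotalTime[MatrixElems]` (524288 / 1.144575 s, about 4.58e+05 per second). A small sketch of this arithmetic is given below; the function names are illustrative and are not taken from `check.exe`.

```cpp
#include <cassert>
#include <cstddef>

// Illustrative helper: total number of events processed in one run.
inline std::size_t totalEvents( std::size_t blocksPerGrid,
                                std::size_t threadsPerBlock,
                                std::size_t iterations )
{
  return blocksPerGrid * threadsPerBlock * iterations;
}

// Illustrative helper: throughput in events per second.
inline double eventsPerSecond( std::size_t nEvents, double seconds )
{
  assert( seconds > 0. );
  return static_cast<double>( nEvents ) / seconds;
}

// Illustrative helper: ratio of an older timing to a newer one.
inline double speedup( double secondsOld, double secondsNew )
{
  assert( secondsNew > 0. );
  return secondsOld / secondsNew;
}
```

Applied to the two `3a SigmaKin` timings visible in this section (1.340534 s and 1.144575 s), `speedup` gives a factor of roughly 1.17, assuming both runs used the same `-p 16384 32 1` configuration.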
valassi added 21 commits March 24, 2021 19:03
…rt it"

This reverts commit e3e79c5.
Allow a host to build avx512 even if it is unable to run it...
This does prevent a crash on pmpe04, but not on some GitHub CI nodes.
…cted.

It looks like this is doing compile-time dispatching, not runtime?
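
For contrast with compile-time dispatching, a runtime check asks the CPU what it supports before entering code built for a given instruction set. The sketch below uses the GCC/clang builtins `__builtin_cpu_init` and `__builtin_cpu_supports` on x86. The function names and the level names (`"none"`, `"sse4"`, `"avx2"`, `"avx512"`) are assumptions made for illustration; this is not the mechanism used in this PR.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not from this PR): can the current host execute
// code compiled for the given SIMD level? Unknown levels return false.
inline bool hostCanRun( const std::string& level )
{
  if( level == "none" ) return true; // plain scalar build runs everywhere
#if ( defined( __GNUC__ ) || defined( __clang__ ) ) && ( defined( __x86_64__ ) || defined( __i386__ ) )
  __builtin_cpu_init(); // safe to call more than once
  if( level == "sse4" ) return __builtin_cpu_supports( "sse4.2" ) != 0;
  if( level == "avx2" ) return __builtin_cpu_supports( "avx2" ) != 0;
  if( level == "avx512" ) return __builtin_cpu_supports( "avx512f" ) != 0;
#endif
  return false; // unknown level, or a platform without these builtins
}

// Hypothetical guard: in debug builds, stop early with a clear assertion
// instead of hitting an illegal instruction deep inside the vectorized code.
inline void requireHostCanRun( const std::string& level )
{
  assert( hostCanRun( level ) && "host CPU does not support the requested SIMD level" );
}
```

A test executable built with AVX512 flags could call `hostCanRun( "avx512" )` at startup and skip its tests when the answer is false, which would be one way to keep a build usable on hosts that can compile but not execute those instructions.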
Fix conflicts: epoch2/cuda/ee_mumu/SubProcesses/Makefile
Fix conflicts: epoch1/cuda/ee_mumu/SubProcesses/P1_Sigma_sm_epem_mupmum/runTest.cc
valassi added 3 commits April 5, 2021 12:13
…est.exe.

This enables the GPU test (from runTest_cu.o).

Before the fix:
./build.avx512/runTest.exe
Running main() from /afs/cern.ch/user/a/avalassi/GPU2020/madgraph4gpuQua/test/googletest/googletest/src/gtest_main.cc
[==========] Running 0 tests from 0 test suites.
[==========] 0 tests from 0 test suites ran. (1 ms total)
[  PASSED  ] 0 tests.

After the fix:
./build.avx512/runTest.exe
Running main() from /afs/cern.ch/user/a/avalassi/GPU2020/madgraph4gpuQua/test/googletest/googletest/src/gtest_main.cc
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from EPOCH1_EEMUMU_GPU/MadgraphTestDouble
[ RUN      ] EPOCH1_EEMUMU_GPU/MadgraphTestDouble.eemumu/0
[       OK ] EPOCH1_EEMUMU_GPU/MadgraphTestDouble.eemumu/0 (956 ms)
[----------] 1 test from EPOCH1_EEMUMU_GPU/MadgraphTestDouble (956 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (956 ms total)
[  PASSED  ] 1 test.
…t.exe.

Use a makefile structure much closer to Stephan's original version for tests

All tests run successfully now
./build.avx512/runTest.exe
Running main() from /afs/cern.ch/user/a/avalassi/GPU2020/madgraph4gpuQua/test/googletest/googletest/src/gtest_main.cc
[==========] Running 2 tests from 2 test suites.
[----------] Global test environment set-up.
[----------] 1 test from EPOCH1_EEMUMU_CPU/MadgraphTestDouble
[ RUN      ] EPOCH1_EEMUMU_CPU/MadgraphTestDouble.eemumu/0
[       OK ] EPOCH1_EEMUMU_CPU/MadgraphTestDouble.eemumu/0 (24 ms)
[----------] 1 test from EPOCH1_EEMUMU_CPU/MadgraphTestDouble (24 ms total)

[----------] 1 test from EPOCH1_EEMUMU_GPU/MadgraphTestDouble
[ RUN      ] EPOCH1_EEMUMU_GPU/MadgraphTestDouble.eemumu/0
[       OK ] EPOCH1_EEMUMU_GPU/MadgraphTestDouble.eemumu/0 (968 ms)
[----------] 1 test from EPOCH1_EEMUMU_GPU/MadgraphTestDouble (968 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 2 test suites ran. (992 ms total)
[  PASSED  ] 2 tests.
Fix conflicts: epoch1/cuda/ee_mumu/SubProcesses/Makefile
Add a comment about bug madgraph5#136
@valassi
Member Author

valassi commented Apr 8, 2021

This PR #132 (klas2) is OBSOLETE. It is superseded by PR #152 (klas2ep12: all SIMD fixes, plus epoch1/2 merging as per issue #139 and PR #152). I am closing this now.

The PR #72 (klas), which this PR #132 (klas2) was itself superseding, is also obsolete and has been closed.
