klas - kernel launchers and SIMD vectorization #72

Closed
valassi wants to merge 155 commits into madgraph5:master from valassi:klas

@valassi valassi commented Nov 29, 2020

See #71
This is work in progress... comments welcome (preferably on the PR, or also directly on the code)

…plify the code.

Prepare to improve kernel launchers by moving C++ event loops further inside.

NumBlocksPerGrid           = 16384
NumThreadsPerBlock         = 32
NumIterations              = 1
-----------------------------------------------------------------------
FP precision               = DOUBLE (nan=0)
Complex type               = THRUST::COMPLEX
RanNumb memory layout      = AOSOA[4]
Momenta memory layout      = AOSOA[4]
Wavefunction GPU memory    = LOCAL
Random number generation   = CURAND DEVICE (CUDA code)
-----------------------------------------------------------------------
NumberOfEntries            = 1
TotalTime[Rnd+Rmb+ME] (123)= ( 7.378591e-03                 )  sec
TotalTime[Rambo+ME]    (23)= ( 6.737728e-03                 )  sec
TotalTime[RndNumGen]    (1)= ( 6.408630e-04                 )  sec
TotalTime[Rambo]        (2)= ( 5.967797e-03                 )  sec
TotalTime[MatrixElems]  (3)= ( 7.699310e-04                 )  sec
MeanTimeInMatrixElems      = ( 7.699310e-04                 )  sec
[Min,Max]TimeInMatrixElems = [ 7.699310e-04 ,  7.699310e-04 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed        = 524288
EvtsPerSec[Rnd+Rmb+ME](123)= ( 7.105530e+07                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 7.781377e+07                 )  sec^-1
EvtsPerSec[MatrixElems] (3)= ( 6.809545e+08                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)  = 524288
MeanMatrixElemValue        = ( 1.371958e-02 +- 1.132119e-05 )  GeV^0
[Min,Max]MatrixElemValue   = [ 6.071582e-03 ,  3.374915e-02 ]  GeV^0
StdDevMatrixElemValue      = ( 8.197419e-03                 )  GeV^0
MeanWeight                 = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight            = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight               = ( 0.000000e+00                 )
***********************************************************************
00 CudaFree :     1.084176 sec
0a ProcInit :     0.000522 sec
0b MemAlloc :     0.035510 sec
0c GenCreat :     0.009668 sec
0d SGoodHel :     0.001756 sec
1a GenSeed  :     0.000012 sec
1b GenRnGen :     0.000629 sec
2a RamboIni :     0.000041 sec
2b RamboFin :     0.000013 sec
2c CpDTHwgt :     0.000475 sec
2d CpDTHmom :     0.005438 sec
3a SigmaKin :     0.000013 sec
3b CpDTHmes :     0.000757 sec
4a DumpLoop :     0.003222 sec
8a CompStat :     0.003654 sec
9a GenDestr :     0.000053 sec
9b DumpScrn :     0.000229 sec
9c DumpJson :     0.000008 sec
TOTAL       :     1.146176 sec
TOTAL (123) :     0.007379 sec
TOTAL  (23) :     0.006738 sec
TOTAL   (1) :     0.000641 sec
TOTAL   (2) :     0.005968 sec
TOTAL   (3) :     0.000770 sec
***********************************************************************

./check.exe -p 16384 32 1
***********************************************************************
NumBlocksPerGrid           = 16384
NumThreadsPerBlock         = 32
NumIterations              = 1
-----------------------------------------------------------------------
FP precision               = DOUBLE (nan=0)
Complex type               = STD::COMPLEX
RanNumb memory layout      = AOSOA[4]
Momenta memory layout      = AOSOA[4]
Random number generation   = CURAND (C++ code)
-----------------------------------------------------------------------
NumberOfEntries            = 1
TotalTime[Rnd+Rmb+ME] (123)= ( 1.512748e+00                 )  sec
TotalTime[Rambo+ME]    (23)= ( 1.477347e+00                 )  sec
TotalTime[RndNumGen]    (1)= ( 3.540115e-02                 )  sec
TotalTime[Rambo]        (2)= ( 1.121947e-01                 )  sec
TotalTime[MatrixElems]  (3)= ( 1.365152e+00                 )  sec
MeanTimeInMatrixElems      = ( 1.365152e+00                 )  sec
[Min,Max]TimeInMatrixElems = [ 1.365152e+00 ,  1.365152e+00 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed        = 524288
EvtsPerSec[Rnd+Rmb+ME](123)= ( 3.465798e+05                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 3.548848e+05                 )  sec^-1
EvtsPerSec[MatrixElems] (3)= ( 3.840509e+05                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)  = 524288
MeanMatrixElemValue        = ( 1.371958e-02 +- 1.132119e-05 )  GeV^0
[Min,Max]MatrixElemValue   = [ 6.071582e-03 ,  3.374915e-02 ]  GeV^0
StdDevMatrixElemValue      = ( 8.197419e-03                 )  GeV^0
MeanWeight                 = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight            = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight               = ( 0.000000e+00                 )
***********************************************************************
0a ProcInit :     0.000329 sec
0b MemAlloc :     0.000044 sec
0c GenCreat :     0.000853 sec
1a GenSeed  :     0.000008 sec
1b GenRnGen :     0.035393 sec
2a RamboIni :     0.016423 sec
2b RamboFin :     0.095772 sec
3a SigmaKin :     1.365152 sec
4a DumpLoop :     0.004525 sec
8a CompStat :     0.003041 sec
9a GenDestr :     0.000072 sec
9b DumpScrn :     0.000189 sec
9c DumpJson :     0.000009 sec
TOTAL       :     1.521810 sec
TOTAL (123) :     1.512748 sec
TOTAL  (23) :     1.477347 sec
TOTAL   (1) :     0.035401 sec
TOTAL   (2) :     0.112195 sec
TOTAL   (3) :     1.365152 sec
***********************************************************************
…cal variable.

This seems to cause a small performance degradation on GPU (EvtsPerSec[MatrixElems] goes from 6.81e+08 in the previous log to 6.53e+08 below).
If needed, this can be recovered by using a local variable in calculate_wavefunctions.

./gcheck.exe -p 16384 32 1
***********************************************************************
NumBlocksPerGrid           = 16384
NumThreadsPerBlock         = 32
NumIterations              = 1
-----------------------------------------------------------------------
FP precision               = DOUBLE (nan=0)
Complex type               = THRUST::COMPLEX
RanNumb memory layout      = AOSOA[4]
Momenta memory layout      = AOSOA[4]
Wavefunction GPU memory    = LOCAL
Random number generation   = CURAND DEVICE (CUDA code)
-----------------------------------------------------------------------
NumberOfEntries            = 1
TotalTime[Rnd+Rmb+ME] (123)= ( 7.408483e-03                 )  sec
TotalTime[Rambo+ME]    (23)= ( 6.765105e-03                 )  sec
TotalTime[RndNumGen]    (1)= ( 6.433780e-04                 )  sec
TotalTime[Rambo]        (2)= ( 5.962489e-03                 )  sec
TotalTime[MatrixElems]  (3)= ( 8.026160e-04                 )  sec
MeanTimeInMatrixElems      = ( 8.026160e-04                 )  sec
[Min,Max]TimeInMatrixElems = [ 8.026160e-04 ,  8.026160e-04 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed        = 524288
EvtsPerSec[Rnd+Rmb+ME](123)= ( 7.076861e+07                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 7.749887e+07                 )  sec^-1
EvtsPerSec[MatrixElems] (3)= ( 6.532240e+08                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)  = 524288
MeanMatrixElemValue        = ( 1.371958e-02 +- 1.132119e-05 )  GeV^0
[Min,Max]MatrixElemValue   = [ 6.071582e-03 ,  3.374915e-02 ]  GeV^0
StdDevMatrixElemValue      = ( 8.197419e-03                 )  GeV^0
MeanWeight                 = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight            = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight               = ( 0.000000e+00                 )
***********************************************************************
00 CudaFree :     0.687299 sec
0a ProcInit :     0.000427 sec
0b MemAlloc :     0.035461 sec
0c GenCreat :     0.012625 sec
0d SGoodHel :     0.001746 sec
1a GenSeed  :     0.000010 sec
1b GenRnGen :     0.000633 sec
2a RamboIni :     0.000017 sec
2b RamboFin :     0.000016 sec
2c CpDTHwgt :     0.000506 sec
2d CpDTHmom :     0.005423 sec
3a SigmaKin :     0.000015 sec
3b CpDTHmes :     0.000787 sec
4a DumpLoop :     0.005516 sec
8a CompStat :     0.003650 sec
9a GenDestr :     0.000055 sec
9b DumpScrn :     0.000293 sec
9c DumpJson :     0.000008 sec
TOTAL       :     0.754489 sec
TOTAL (123) :     0.007408 sec
TOTAL  (23) :     0.006765 sec
TOTAL   (1) :     0.000643 sec
TOTAL   (2) :     0.005962 sec
TOTAL   (3) :     0.000803 sec
***********************************************************************

./check.exe -p 16384 32 1
***********************************************************************
NumBlocksPerGrid           = 16384
NumThreadsPerBlock         = 32
NumIterations              = 1
-----------------------------------------------------------------------
FP precision               = DOUBLE (nan=0)
Complex type               = STD::COMPLEX
RanNumb memory layout      = AOSOA[4]
Momenta memory layout      = AOSOA[4]
Random number generation   = CURAND (C++ code)
-----------------------------------------------------------------------
NumberOfEntries            = 1
TotalTime[Rnd+Rmb+ME] (123)= ( 1.509930e+00                 )  sec
TotalTime[Rambo+ME]    (23)= ( 1.475219e+00                 )  sec
TotalTime[RndNumGen]    (1)= ( 3.471084e-02                 )  sec
TotalTime[Rambo]        (2)= ( 1.109848e-01                 )  sec
TotalTime[MatrixElems]  (3)= ( 1.364234e+00                 )  sec
MeanTimeInMatrixElems      = ( 1.364234e+00                 )  sec
[Min,Max]TimeInMatrixElems = [ 1.364234e+00 ,  1.364234e+00 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed        = 524288
EvtsPerSec[Rnd+Rmb+ME](123)= ( 3.472268e+05                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 3.553968e+05                 )  sec^-1
EvtsPerSec[MatrixElems] (3)= ( 3.843094e+05                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)  = 524288
MeanMatrixElemValue        = ( 1.371958e-02 +- 1.132119e-05 )  GeV^0
[Min,Max]MatrixElemValue   = [ 6.071582e-03 ,  3.374915e-02 ]  GeV^0
StdDevMatrixElemValue      = ( 8.197419e-03                 )  GeV^0
MeanWeight                 = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight            = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight               = ( 0.000000e+00                 )
***********************************************************************
0a ProcInit :     0.000314 sec
0b MemAlloc :     0.000042 sec
0c GenCreat :     0.000848 sec
1a GenSeed  :     0.000008 sec
1b GenRnGen :     0.034703 sec
2a RamboIni :     0.016324 sec
2b RamboFin :     0.094661 sec
3a SigmaKin :     1.364234 sec
4a DumpLoop :     0.004442 sec
8a CompStat :     0.003013 sec
9a GenDestr :     0.000119 sec
9b DumpScrn :     0.000195 sec
9c DumpJson :     0.000007 sec
TOTAL       :     1.518909 sec
TOTAL (123) :     1.509930 sec
TOTAL  (23) :     1.475219 sec
TOTAL   (1) :     0.034711 sec
TOTAL   (2) :     0.110985 sec
TOTAL   (3) :     1.364234 sec
***********************************************************************
…ions.

There are at least two issues here:
- on both CPU and GPU, the division by the denominators should be done once, not for each helicity
- the helicity filtering in the C++ code uses a buggy loop (it gives up too early)
Note: essentially I am inverting the helicity and event loops; this is the key.
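
To illustrate the two points above, here is a minimal C++ sketch of the inverted loop order (helicity loop outside, event loop inside) with the division by the denominator done once per event after the helicity sum. This is not the code in this PR: `me2ForHelicity`, `sigmaKinSketch` and the toy formula inside are hypothetical placeholders, and a single denominator is assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the |M|^2 contribution of one helicity to one event
// (in the real code this would come from calculate_wavefunctions).
inline double me2ForHelicity(int ihel, double momentum)
{
  return (ihel + 1) * momentum * momentum;
}

// Sketch: helicity loop outside, event loop inside.
// The division by the denominator is applied once per event, after the sum.
std::vector<double> sigmaKinSketch(const std::vector<double>& momenta,
                                   const std::vector<int>& goodHel,
                                   double denominator)
{
  assert(denominator != 0.);
  std::vector<double> allMEs(momenta.size(), 0.);
  for (int ihel : goodHel) // outer loop: helicities
  {
    for (std::size_t ievt = 0; ievt < momenta.size(); ++ievt) // inner loop: events
    {
      allMEs[ievt] += me2ForHelicity(ihel, momenta[ievt]);
    }
  }
  for (double& me : allMEs) me /= denominator; // once per event, not once per helicity
  return allMEs;
}
```

With the event loop innermost, consecutive iterations touch consecutive events, which is the access pattern an AOSOA layout and SIMD vectorization would want.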

***********************************************************************
NumBlocksPerGrid           = 16384
NumThreadsPerBlock         = 32
NumIterations              = 1
-----------------------------------------------------------------------
FP precision               = DOUBLE (nan=0)
Complex type               = THRUST::COMPLEX
RanNumb memory layout      = AOSOA[4]
Momenta memory layout      = AOSOA[4]
Wavefunction GPU memory    = LOCAL
Random number generation   = CURAND DEVICE (CUDA code)
-----------------------------------------------------------------------
NumberOfEntries            = 1
TotalTime[Rnd+Rmb+ME] (123)= ( 7.218796e-03                 )  sec
TotalTime[Rambo+ME]    (23)= ( 6.566758e-03                 )  sec
TotalTime[RndNumGen]    (1)= ( 6.520380e-04                 )  sec
TotalTime[Rambo]        (2)= ( 5.775874e-03                 )  sec
TotalTime[MatrixElems]  (3)= ( 7.908840e-04                 )  sec
MeanTimeInMatrixElems      = ( 7.908840e-04                 )  sec
[Min,Max]TimeInMatrixElems = [ 7.908840e-04 ,  7.908840e-04 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed        = 524288
EvtsPerSec[Rnd+Rmb+ME](123)= ( 7.262818e+07                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 7.983970e+07                 )  sec^-1
EvtsPerSec[MatrixElems] (3)= ( 6.629139e+08                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)  = 524288
MeanMatrixElemValue        = ( 1.371958e-02 +- 1.132119e-05 )  GeV^0
[Min,Max]MatrixElemValue   = [ 6.071582e-03 ,  3.374915e-02 ]  GeV^0
StdDevMatrixElemValue      = ( 8.197419e-03                 )  GeV^0
MeanWeight                 = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight            = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight               = ( 0.000000e+00                 )
***********************************************************************
00 CudaFree :     0.897224 sec
0a ProcInit :     0.000579 sec
0b MemAlloc :     0.037365 sec
0c GenCreat :     0.009812 sec
0d SGoodHel :     0.001844 sec
1a GenSeed  :     0.000011 sec
1b GenRnGen :     0.000641 sec
2a RamboIni :     0.000033 sec
2b RamboFin :     0.000012 sec
2c CpDTHwgt :     0.000482 sec
2d CpDTHmom :     0.005249 sec
3a SigmaKin :     0.000013 sec
3b CpDTHmes :     0.000778 sec
4a DumpLoop :     0.005650 sec
8a CompStat :     0.003652 sec
9a GenDestr :     0.000068 sec
9b DumpScrn :     0.000303 sec
9c DumpJson :     0.000007 sec
TOTAL       :     0.963722 sec
TOTAL (123) :     0.007219 sec
TOTAL  (23) :     0.006567 sec
TOTAL   (1) :     0.000652 sec
TOTAL   (2) :     0.005776 sec
TOTAL   (3) :     0.000791 sec
***********************************************************************
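The physics results are correct but C++ performance is degraded by almost a factor of 2 (EvtsPerSec[MatrixElems] is 2.26e+05 in the log below, against 3.84e+05 in the previous C++ log).
It looks like I am calculating too many helicities.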
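
For reference, a one-off helicity filter could look like the sketch below. The actual filter in this PR is not shown here, so this is only an assumption about its general shape: a helicity is flagged as good if it contributes for at least one event, and every event is scanned before a helicity is discarded. `toyContribution`, `getGoodHelSketch` and `listGoodHel` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical toy model: odd helicities never contribute,
// even helicities contribute momentum^2.
inline double toyContribution(int ihel, double momentum)
{
  return (ihel % 2 == 0) ? momentum * momentum : 0.;
}

// Flag a helicity as good if it gives a non-zero contribution for at least one event.
// A helicity is only discarded after all events have been checked.
std::vector<bool> getGoodHelSketch(const std::vector<double>& momenta, int ncomb)
{
  std::vector<bool> isGoodHel(ncomb, false);
  for (int ihel = 0; ihel < ncomb; ++ihel)
  {
    for (double p : momenta)
    {
      if (toyContribution(ihel, p) != 0.)
      {
        isGoodHel[ihel] = true;
        break; // one contributing event is enough to keep this helicity
      }
    }
  }
  return isGoodHel;
}

// Compact list of the good helicity indices, to be looped over in the ME calculation.
std::vector<int> listGoodHel(const std::vector<bool>& isGoodHel)
{
  std::vector<int> good;
  for (std::size_t ihel = 0; ihel < isGoodHel.size(); ++ihel)
    if (isGoodHel[ihel]) good.push_back(static_cast<int>(ihel));
  return good;
}
```

The matrix element loop would then iterate only over the indices returned by `listGoodHel`, so helicities that never contribute are not recomputed for every event.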

***********************************************************************
NumBlocksPerGrid           = 16384
NumThreadsPerBlock         = 32
NumIterations              = 1
-----------------------------------------------------------------------
FP precision               = DOUBLE (nan=0)
Complex type               = STD::COMPLEX
RanNumb memory layout      = AOSOA[4]
Momenta memory layout      = AOSOA[4]
Random number generation   = CURAND (C++ code)
-----------------------------------------------------------------------
NumberOfEntries            = 1
TotalTime[Rnd+Rmb+ME] (123)= ( 2.463410e+00                 )  sec
TotalTime[Rambo+ME]    (23)= ( 2.428130e+00                 )  sec
TotalTime[RndNumGen]    (1)= ( 3.528046e-02                 )  sec
TotalTime[Rambo]        (2)= ( 1.116047e-01                 )  sec
TotalTime[MatrixElems]  (3)= ( 2.316525e+00                 )  sec
MeanTimeInMatrixElems      = ( 2.316525e+00                 )  sec
[Min,Max]TimeInMatrixElems = [ 2.316525e+00 ,  2.316525e+00 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed        = 524288
EvtsPerSec[Rnd+Rmb+ME](123)= ( 2.128302e+05                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 2.159225e+05                 )  sec^-1
EvtsPerSec[MatrixElems] (3)= ( 2.263252e+05                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)  = 524288
MeanMatrixElemValue        = ( 1.371958e-02 +- 1.132119e-05 )  GeV^0
[Min,Max]MatrixElemValue   = [ 6.071582e-03 ,  3.374915e-02 ]  GeV^0
StdDevMatrixElemValue      = ( 8.197419e-03                 )  GeV^0
MeanWeight                 = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight            = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight               = ( 0.000000e+00                 )
***********************************************************************
0a ProcInit :     0.000353 sec
0b MemAlloc :     0.000047 sec
0c GenCreat :     0.000870 sec
0d SGoodHel :     0.000162 sec
1a GenSeed  :     0.000009 sec
1b GenRnGen :     0.035272 sec
2a RamboIni :     0.016716 sec
2b RamboFin :     0.094888 sec
3a SigmaKin :     2.316525 sec
4a DumpLoop :     0.004477 sec
8a CompStat :     0.002808 sec
9a GenDestr :     0.000077 sec
9b DumpScrn :     0.000211 sec
9c DumpJson :     0.000007 sec
TOTAL       :     2.472422 sec
TOTAL (123) :     2.463410 sec
TOTAL  (23) :     2.428130 sec
TOTAL   (1) :     0.035280 sec
TOTAL   (2) :     0.111605 sec
TOTAL   (3) :     2.316525 sec
***********************************************************************
./check.exe -p 16384 32 1
***********************************************************************
NumBlocksPerGrid           = 16384
NumThreadsPerBlock         = 32
NumIterations              = 1
-----------------------------------------------------------------------
FP precision               = DOUBLE (nan=0)
Complex type               = STD::COMPLEX
RanNumb memory layout      = AOSOA[4]
Momenta memory layout      = AOSOA[4]
Random number generation   = CURAND (C++ code)
-----------------------------------------------------------------------
NumberOfEntries            = 1
TotalTime[Rnd+Rmb+ME] (123)= ( 1.451416e+00                 )  sec
TotalTime[Rambo+ME]    (23)= ( 1.423717e+00                 )  sec
TotalTime[RndNumGen]    (1)= ( 2.769905e-02                 )  sec
TotalTime[Rambo]        (2)= ( 1.000820e-01                 )  sec
TotalTime[MatrixElems]  (3)= ( 1.323635e+00                 )  sec
MeanTimeInMatrixElems      = ( 1.323635e+00                 )  sec
[Min,Max]TimeInMatrixElems = [ 1.323635e+00 ,  1.323635e+00 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed        = 524288
EvtsPerSec[Rnd+Rmb+ME](123)= ( 3.612252e+05                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 3.682530e+05                 )  sec^-1
EvtsPerSec[MatrixElems] (3)= ( 3.960972e+05                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)  = 524288
MeanMatrixElemValue        = ( 1.371958e-02 +- 1.132119e-05 )  GeV^0
[Min,Max]MatrixElemValue   = [ 6.071582e-03 ,  3.374915e-02 ]  GeV^0
StdDevMatrixElemValue      = ( 8.197419e-03                 )  GeV^0
MeanWeight                 = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight            = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight               = ( 0.000000e+00                 )
***********************************************************************
0a ProcInit :     0.000317 sec
0b MemAlloc :     0.028313 sec
0c GenCreat :     0.000863 sec
0d SGoodHel :     0.000164 sec
1a GenSeed  :     0.000011 sec
1b GenRnGen :     0.027688 sec
2a RamboIni :     0.006959 sec
2b RamboFin :     0.093123 sec
3a SigmaKin :     1.323635 sec
4a DumpLoop :     0.004347 sec
8a CompStat :     0.003050 sec
9a GenDestr :     0.000076 sec
9b DumpScrn :     0.000253 sec
9c DumpJson :     0.000007 sec
TOTAL       :     1.488808 sec
TOTAL (123) :     1.451416 sec
TOTAL  (23) :     1.423717 sec
TOTAL   (1) :     0.027699 sec
TOTAL   (2) :     0.100082 sec
TOTAL   (3) :     1.323635 sec
***********************************************************************
…functions.

./gcheck.exe -p 16384 32 1
***********************************************************************
NumBlocksPerGrid           = 16384
NumThreadsPerBlock         = 32
NumIterations              = 1
-----------------------------------------------------------------------
FP precision               = DOUBLE (nan=0)
Complex type               = THRUST::COMPLEX
RanNumb memory layout      = AOSOA[4]
Momenta memory layout      = AOSOA[4]
Wavefunction GPU memory    = LOCAL
Random number generation   = CURAND DEVICE (CUDA code)
-----------------------------------------------------------------------
NumberOfEntries            = 1
TotalTime[Rnd+Rmb+ME] (123)= ( 7.351215e-03                 )  sec
TotalTime[Rambo+ME]    (23)= ( 6.713424e-03                 )  sec
TotalTime[RndNumGen]    (1)= ( 6.377910e-04                 )  sec
TotalTime[Rambo]        (2)= ( 5.919468e-03                 )  sec
TotalTime[MatrixElems]  (3)= ( 7.939560e-04                 )  sec
MeanTimeInMatrixElems      = ( 7.939560e-04                 )  sec
[Min,Max]TimeInMatrixElems = [ 7.939560e-04 ,  7.939560e-04 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed        = 524288
EvtsPerSec[Rnd+Rmb+ME](123)= ( 7.131991e+07                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 7.809547e+07                 )  sec^-1
EvtsPerSec[MatrixElems] (3)= ( 6.603489e+08                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)  = 524288
MeanMatrixElemValue        = ( 1.371958e-02 +- 1.132119e-05 )  GeV^0
[Min,Max]MatrixElemValue   = [ 6.071582e-03 ,  3.374915e-02 ]  GeV^0
StdDevMatrixElemValue      = ( 8.197419e-03                 )  GeV^0
MeanWeight                 = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight            = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight               = ( 0.000000e+00                 )
***********************************************************************
00 CudaFree :     0.686566 sec
0a ProcInit :     0.000418 sec
0b MemAlloc :     0.034625 sec
0c GenCreat :     0.012608 sec
0d SGoodHel :     0.001840 sec
1a GenSeed  :     0.000013 sec
1b GenRnGen :     0.000625 sec
2a RamboIni :     0.000017 sec
2b RamboFin :     0.000012 sec
2c CpDTHwgt :     0.000512 sec
2d CpDTHmom :     0.005378 sec
3a SigmaKin :     0.000013 sec
3b CpDTHmes :     0.000781 sec
4a DumpLoop :     0.005419 sec
8a CompStat :     0.003564 sec
9a GenDestr :     0.000099 sec
9b DumpScrn :     0.000212 sec
9c DumpJson :     0.000007 sec
TOTAL       :     0.752710 sec
TOTAL (123) :     0.007351 sec
TOTAL  (23) :     0.006713 sec
TOTAL   (1) :     0.000638 sec
TOTAL   (2) :     0.005919 sec
TOTAL   (3) :     0.000794 sec
***********************************************************************

./check.exe -p 16384 32 1
***********************************************************************
NumBlocksPerGrid           = 16384
NumThreadsPerBlock         = 32
NumIterations              = 1
-----------------------------------------------------------------------
FP precision               = DOUBLE (nan=0)
Complex type               = STD::COMPLEX
RanNumb memory layout      = AOSOA[4]
Momenta memory layout      = AOSOA[4]
Random number generation   = CURAND (C++ code)
-----------------------------------------------------------------------
NumberOfEntries            = 1
TotalTime[Rnd+Rmb+ME] (123)= ( 1.510862e+00                 )  sec
TotalTime[Rambo+ME]    (23)= ( 1.483222e+00                 )  sec
TotalTime[RndNumGen]    (1)= ( 2.763983e-02                 )  sec
TotalTime[Rambo]        (2)= ( 9.924585e-02                 )  sec
TotalTime[MatrixElems]  (3)= ( 1.383976e+00                 )  sec
MeanTimeInMatrixElems      = ( 1.383976e+00                 )  sec
[Min,Max]TimeInMatrixElems = [ 1.383976e+00 ,  1.383976e+00 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed        = 524288
EvtsPerSec[Rnd+Rmb+ME](123)= ( 3.470125e+05                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 3.534790e+05                 )  sec^-1
EvtsPerSec[MatrixElems] (3)= ( 3.788273e+05                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)  = 524288
MeanMatrixElemValue        = ( 1.371958e-02 +- 1.132119e-05 )  GeV^0
[Min,Max]MatrixElemValue   = [ 6.071582e-03 ,  3.374915e-02 ]  GeV^0
StdDevMatrixElemValue      = ( 8.197419e-03                 )  GeV^0
MeanWeight                 = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight            = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight               = ( 0.000000e+00                 )
***********************************************************************
0a ProcInit :     0.000333 sec
0b MemAlloc :     0.027811 sec
0c GenCreat :     0.000846 sec
0d SGoodHel :     0.000151 sec
1a GenSeed  :     0.000009 sec
1b GenRnGen :     0.027630 sec
2a RamboIni :     0.006760 sec
2b RamboFin :     0.092486 sec
3a SigmaKin :     1.383976 sec
4a DumpLoop :     0.004520 sec
8a CompStat :     0.003015 sec
9a GenDestr :     0.000075 sec
9b DumpScrn :     0.000257 sec
9c DumpJson :     0.000010 sec
TOTAL       :     1.547881 sec
TOTAL (123) :     1.510862 sec
TOTAL  (23) :     1.483222 sec
TOTAL   (1) :     0.027640 sec
TOTAL   (2) :     0.099246 sec
TOTAL   (3) :     1.383976 sec
***********************************************************************
Note: with this older implementation, there are 55 lines from
"objdump -d -C CPPProcess.o | egrep 'vaddpd|vmul|vfmadd132pd|ymm' | wc -l"
…hin the ixx/oxx functions.""

This reverts commit c075db4.

Note: with this newer implementation, there are 126 lines from
"objdump -d -C CPPProcess.o | egrep 'vaddpd|vmul|vfmadd132pd|ymm' | wc -l"

SIMD vectorization does not yet improve performance
(the FFV functions would also have to be migrated, which requires RRRRIIII),
but this is a first proof of concept that the changes go in the right direction.
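
As an illustration of what RRRRIIII could mean in practice, here is a minimal sketch assuming it denotes a complex "vector" stored as four real parts followed by four imaginary parts (as opposed to interleaved RIRIRIRI). `neppV`, `ComplexV` and `mul` are hypothetical names, and the width of 4 assumes doubles in a 256-bit AVX2 register.

```cpp
#include <cassert>
#include <cstddef>

// Assumption: 4 doubles fit in one 256-bit AVX2 register.
constexpr std::size_t neppV = 4;

// Complex vector in RRRRIIII layout: all real parts first, then all imaginary parts.
struct alignas(32) ComplexV
{
  double re[neppV];
  double im[neppV];
};

// Lane-by-lane complex product. Each statement in the loop body works on
// contiguous doubles, so the compiler can map it to packed multiply/add instructions.
inline ComplexV mul(const ComplexV& a, const ComplexV& b)
{
  ComplexV c;
  for (std::size_t i = 0; i < neppV; ++i)
  {
    c.re[i] = a.re[i] * b.re[i] - a.im[i] * b.im[i];
    c.im[i] = a.re[i] * b.im[i] + a.im[i] * b.re[i];
  }
  return c;
}
```

With an interleaved std::complex<double> array the real and imaginary parts would have to be shuffled apart before each product, which is one plausible reason why the FFV functions need the split layout.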
./gcheck.exe -p 16384 32 1
***********************************************************************
NumBlocksPerGrid           = 16384
NumThreadsPerBlock         = 32
NumIterations              = 1
-----------------------------------------------------------------------
FP precision               = DOUBLE (nan=0)
Complex type               = THRUST::COMPLEX
RanNumb memory layout      = AOSOA[4]
Momenta memory layout      = AOSOA[4]
Wavefunction GPU memory    = LOCAL
Random number generation   = CURAND DEVICE (CUDA code)
-----------------------------------------------------------------------
NumberOfEntries            = 1
TotalTime[Rnd+Rmb+ME] (123)= ( 7.653209e-03                 )  sec
TotalTime[Rambo+ME]    (23)= ( 7.009831e-03                 )  sec
TotalTime[RndNumGen]    (1)= ( 6.433780e-04                 )  sec
TotalTime[Rambo]        (2)= ( 6.199672e-03                 )  sec
TotalTime[MatrixElems]  (3)= ( 8.101590e-04                 )  sec
MeanTimeInMatrixElems      = ( 8.101590e-04                 )  sec
[Min,Max]TimeInMatrixElems = [ 8.101590e-04 ,  8.101590e-04 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed        = 524288
EvtsPerSec[Rnd+Rmb+ME](123)= ( 6.850564e+07                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 7.479324e+07                 )  sec^-1
EvtsPerSec[MatrixElems] (3)= ( 6.471421e+08                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)  = 524288
MeanMatrixElemValue        = ( 1.371958e-02 +- 1.132119e-05 )  GeV^0
[Min,Max]MatrixElemValue   = [ 6.071582e-03 ,  3.374915e-02 ]  GeV^0
StdDevMatrixElemValue      = ( 8.197419e-03                 )  GeV^0
MeanWeight                 = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight            = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight               = ( 0.000000e+00                 )
***********************************************************************
00 CudaFree :     0.687987 sec
0a ProcInit :     0.000422 sec
0b MemAlloc :     0.034919 sec
0c GenCreat :     0.011849 sec
0d SGoodHel :     0.001837 sec
1a GenSeed  :     0.000013 sec
1b GenRnGen :     0.000631 sec
2a RamboIni :     0.000019 sec
2b RamboFin :     0.000013 sec
2c CpDTHwgt :     0.000519 sec
2d CpDTHmom :     0.005649 sec
3a SigmaKin :     0.000015 sec
3b CpDTHmes :     0.000795 sec
4a DumpLoop :     0.005283 sec
8a CompStat :     0.003659 sec
9a GenDestr :     0.000051 sec
9b DumpScrn :     0.000226 sec
9c DumpJson :     0.000007 sec
TOTAL       :     0.753894 sec
TOTAL (123) :     0.007653 sec
TOTAL  (23) :     0.007010 sec
TOTAL   (1) :     0.000643 sec
TOTAL   (2) :     0.006200 sec
TOTAL   (3) :     0.000810 sec
***********************************************************************

./check.exe -p 16384 32 1
***********************************************************************
NumBlocksPerGrid           = 16384
NumThreadsPerBlock         = 32
NumIterations              = 1
-----------------------------------------------------------------------
FP precision               = DOUBLE (nan=0)
Complex type               = STD::COMPLEX
RanNumb memory layout      = AOSOA[4]
Momenta memory layout      = AOSOA[4]
Random number generation   = CURAND (C++ code)
-----------------------------------------------------------------------
NumberOfEntries            = 1
TotalTime[Rnd+Rmb+ME] (123)= ( 1.467684e+00                 )  sec
TotalTime[Rambo+ME]    (23)= ( 1.439913e+00                 )  sec
TotalTime[RndNumGen]    (1)= ( 2.777057e-02                 )  sec
TotalTime[Rambo]        (2)= ( 9.937939e-02                 )  sec
TotalTime[MatrixElems]  (3)= ( 1.340534e+00                 )  sec
MeanTimeInMatrixElems      = ( 1.340534e+00                 )  sec
[Min,Max]TimeInMatrixElems = [ 1.340534e+00 ,  1.340534e+00 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed        = 524288
EvtsPerSec[Rnd+Rmb+ME](123)= ( 3.572214e+05                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 3.641109e+05                 )  sec^-1
EvtsPerSec[MatrixElems] (3)= ( 3.911039e+05                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)  = 524288
MeanMatrixElemValue        = ( 1.371958e-02 +- 1.132119e-05 )  GeV^0
[Min,Max]MatrixElemValue   = [ 6.071582e-03 ,  3.374915e-02 ]  GeV^0
StdDevMatrixElemValue      = ( 8.197419e-03                 )  GeV^0
MeanWeight                 = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight            = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight               = ( 0.000000e+00                 )
***********************************************************************
0a ProcInit :     0.000319 sec
0b MemAlloc :     0.027103 sec
0c GenCreat :     0.000906 sec
0d SGoodHel :     0.000186 sec
1a GenSeed  :     0.000009 sec
1b GenRnGen :     0.027762 sec
2a RamboIni :     0.006784 sec
2b RamboFin :     0.092595 sec
3a SigmaKin :     1.340534 sec
4a DumpLoop :     0.004592 sec
8a CompStat :     0.003529 sec
9a GenDestr :     0.000083 sec
9b DumpScrn :     0.000231 sec
9c DumpJson :     0.000011 sec
TOTAL       :     1.504643 sec
TOTAL (123) :     1.467684 sec
TOTAL  (23) :     1.439913 sec
TOTAL   (1) :     0.027771 sec
TOTAL   (2) :     0.099379 sec
TOTAL   (3) :     1.340534 sec
***********************************************************************
There is more vectorization, but it segfaults...

./check.exe -p 16384 32 1
  Segmentation fault (core dumped)
objdump -d -C check.exe | egrep 'vaddpd|vmul|vfmadd132pd|ymm'  | wc -l
  216
Using avx2 in -march, valgrind at least tells me it is a General Protection Fault

==481028== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==481028==  General Protection Fault
==481028==    at 0x40876E: MG5_sm::oxzxxxM0(double const*, int, int, std::complex<double> (*) [4], int, int) (in /afs/cern.ch/user/a/avalassi/GPU2020/madgraph4gpu/epoch1/cuda/ee_mumu/SubProcesses/P1_Sigma_sm_epem_mupmum/check.exe)
==481028==    by 0x40A48C: Proc::calculate_wavefunctions(int, double const*, double*, int) (in /afs/cern.ch/user/a/avalassi/GPU2020/madgraph4gpu/epoch1/cuda/ee_mumu/SubProcesses/P1_Sigma_sm_epem_mupmum/check.exe)
==481028==    by 0x40AD16: Proc::sigmaKin_getGoodHel(double const*, double*, bool*, int) (in /afs/cern.ch/user/a/avalassi/GPU2020/madgraph4gpu/epoch1/cuda/ee_mumu/SubProcesses/P1_Sigma_sm_epem_mupmum/check.exe)
==481028==    by 0x405207: main (in /afs/cern.ch/user/a/avalassi/GPU2020/madgraph4gpu/epoch1/cuda/ee_mumu/SubProcesses/P1_Sigma_sm_epem_mupmum/check.exe)
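
A General Protection Fault in vectorized code is often, though not always, a sign that an aligned vector load or store touched an address that is not a multiple of the vector width. The trace above does not confirm that this is the cause here. A minimal C++ sketch of how one might check for it, assuming 32-byte AVX2 vectors; `isAligned`, `kAvx2Bytes` and `allocAlignedDoubles` are hypothetical names, not part of the madgraph4gpu code:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Assumption: an AVX2 register holds 4 doubles, i.e. 32 bytes.
constexpr std::size_t kAvx2Bytes = 32;

// Hypothetical helper: true if p sits on an `alignment`-byte boundary.
inline bool isAligned( const void* p, std::size_t alignment )
{
  return reinterpret_cast<std::uintptr_t>( p ) % alignment == 0;
}

// Hypothetical helper: allocate n doubles on a 32-byte boundary.
// std::aligned_alloc (C++17) needs a size that is a multiple of the alignment,
// so the byte count is rounded up. Release the buffer with std::free.
inline double* allocAlignedDoubles( std::size_t n )
{
  std::size_t bytes = n * sizeof( double );
  bytes = ( bytes + kAvx2Bytes - 1 ) / kAvx2Bytes * kAvx2Bytes;
  return static_cast<double*>( std::aligned_alloc( kAvx2Bytes, bytes ) );
}
```

Calling `isAligned( ptr, kAvx2Bytes )` on the momenta and wavefunction buffers just before the crashing call would show whether alignment is worth investigating further.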
valassi added 6 commits March 19, 2021 17:57
Performance with clang10 is about 1.25E6, slightly better than gcc9 (1.15E6) but lower than Fortran (1.50E6)

time ./build.none/check.exe -p 2048 256 12
***********************************************************************
NumBlocksPerGrid            = 2048
NumThreadsPerBlock          = 256
NumIterations               = 12
-----------------------------------------------------------------------
FP precision                = DOUBLE (nan=0)
Complex type                = STD::COMPLEX
RanNumb memory layout       = AOSOA[8] [HARDCODED FOR REPRODUCIBILITY]
Momenta memory layout       = AOSOA[1] == AOS
Internal loops fptype_sv    = VECTOR[1] == SCALAR (no SIMD)
Random number generation    = CURAND (C++ code)
OMP threads / `nproc --all` = 1 / 4
MatrixElements compiler     = clang 10.0.0
-----------------------------------------------------------------------
NumberOfEntries             = 12
TotalTime[Rnd+Rmb+ME] (123) = ( 7.234199e+00                 )  sec
TotalTime[Rambo+ME]    (23) = ( 6.911213e+00                 )  sec
TotalTime[RndNumGen]    (1) = ( 3.229851e-01                 )  sec
TotalTime[Rambo]        (2) = ( 1.849719e+00                 )  sec
TotalTime[MatrixElems]  (3) = ( 5.061495e+00                 )  sec
MeanTimeInMatrixElems       = ( 4.217912e-01                 )  sec
[Min,Max]TimeInMatrixElems  = [ 4.214358e-01 ,  4.223094e-01 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed         = 6291456
EvtsPerSec[Rnd+Rmb+ME](123) = ( 8.696825e+05                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23) = ( 9.103258e+05                 )  sec^-1
EvtsPerSec[MatrixElems] (3) = ( 1.243004e+06                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)   = 6291456
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
[Min,Max]MatrixElemValue    = [ 6.071582e-03 ,  3.374925e-02 ]  GeV^0
StdDevMatrixElemValue       = ( 8.202858e-03                 )  GeV^0
MeanWeight                  = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight             = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight                = ( 0.000000e+00                 )
***********************************************************************
0a ProcInit :     0.000383 sec
0b MemAlloc :     0.070821 sec
0c GenCreat :     0.000904 sec
0d SGoodHel :     0.000438 sec
1a GenSeed  :     0.000030 sec
1b GenRnGen :     0.322956 sec
2a RamboIni :     0.081141 sec
2b RamboFin :     1.768578 sec
3a SigmaKin :     5.061495 sec
4a DumpLoop :     0.074358 sec
8a CompStat :     0.084354 sec
9a GenDestr :     0.000020 sec
9b DumpScrn :     0.009514 sec
9c DumpJson :     0.000002 sec
TOTAL       :     7.474991 sec
TOTAL (123) :     7.234199 sec
TOTAL  (23) :     6.911214 sec
TOTAL   (1) :     0.322985 sec
TOTAL   (2) :     1.849719 sec
TOTAL   (3) :     5.061495 sec
***********************************************************************
real    0m7.499s
user    0m7.376s
sys     0m0.121s

time ./build.none/gcheck.exe -p 2048 256 12
***********************************************************************
NumBlocksPerGrid            = 2048
NumThreadsPerBlock          = 256
NumIterations               = 12
-----------------------------------------------------------------------
FP precision                = DOUBLE (nan=0)
Complex type                = THRUST::COMPLEX
RanNumb memory layout       = AOSOA[8] [HARDCODED FOR REPRODUCIBILITY]
Momenta memory layout       = AOSOA[4]
Random number generation    = CURAND DEVICE (CUDA code)
MatrixElements compiler     = nvcc 11.0.221
-----------------------------------------------------------------------
NumberOfEntries             = 12
TotalTime[Rnd+Rmb+ME] (123) = ( 9.123791e-02                 )  sec
TotalTime[Rambo+ME]    (23) = ( 8.373227e-02                 )  sec
TotalTime[RndNumGen]    (1) = ( 7.505641e-03                 )  sec
TotalTime[Rambo]        (2) = ( 7.402575e-02                 )  sec
TotalTime[MatrixElems]  (3) = ( 9.706521e-03                 )  sec
MeanTimeInMatrixElems       = ( 8.088767e-04                 )  sec
[Min,Max]TimeInMatrixElems  = [ 8.009510e-04 ,  8.176020e-04 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed         = 6291456
EvtsPerSec[Rnd+Rmb+ME](123) = ( 6.895660e+07                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23) = ( 7.513777e+07                 )  sec^-1
EvtsPerSec[MatrixElems] (3) = ( 6.481680e+08                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)   = 6291456
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
[Min,Max]MatrixElemValue    = [ 6.071582e-03 ,  3.374925e-02 ]  GeV^0
StdDevMatrixElemValue       = ( 8.202858e-03                 )  GeV^0
MeanWeight                  = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight             = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight                = ( 0.000000e+00                 )
***********************************************************************
00 CudaFree :     0.802752 sec
0a ProcInit :     0.000472 sec
0b MemAlloc :     0.032316 sec
0c GenCreat :     0.009958 sec
0d SGoodHel :     0.002051 sec
1a GenSeed  :     0.000017 sec
1b GenRnGen :     0.007489 sec
2a RamboIni :     0.000106 sec
2b RamboFin :     0.000051 sec
2c CpDTHwgt :     0.006522 sec
2d CpDTHmom :     0.067347 sec
3a SigmaKin :     0.000081 sec
3b CpDTHmes :     0.009625 sec
4a DumpLoop :     0.079669 sec
8a CompStat :     0.046016 sec
9a GenDestr :     0.000063 sec
9b DumpScrn :     0.000268 sec
9c DumpJson :     0.000002 sec
TOTAL       :     1.064805 sec
TOTAL (123) :     0.091238 sec
TOTAL  (23) :     0.083732 sec
TOTAL   (1) :     0.007506 sec
TOTAL   (2) :     0.074026 sec
TOTAL   (3) :     0.009707 sec
***********************************************************************
real    0m1.365s
user    0m0.447s
sys     0m0.478s
valassi added 4 commits March 20, 2021 19:01
…hout SIMD!

With AVX=none, gcc9 now reaches 1.28E6 (it was 1.15E6; remember that Fortran is 1.50E6).
This means that with AVX=none, gcc9 and clang10 are comparable.

Note however that the speedup from AVX=none to AVX=avx2 is lower than the ideal factor 4:
4.40E6 / 1.28E6 is only 3.4, we can do better...
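
The arithmetic behind that ratio can be sketched in a few lines of C++. The function names are illustrative only and do not come from the code in this PR; the "ideal" speedup is assumed to equal the SIMD width (4 doubles for AVX2):

```cpp
// Throughput in events per second.
inline double evtsPerSec( double nEvents, double seconds )
{
  return nEvents / seconds;
}

// Speedup of a vectorized build over a scalar build,
// from the time each spends on the same number of events.
inline double simdSpeedup( double scalarSeconds, double vectorSeconds )
{
  return scalarSeconds / vectorSeconds;
}

// Fraction of the ideal speedup actually achieved
// (assumption: the ideal equals the number of doubles per vector).
inline double simdEfficiency( double speedup, int simdWidth )
{
  return speedup / simdWidth;
}
```

With the MatrixElems times from the two logs in this commit (4.896731 sec for none, 1.429614 sec for avx2, 6291456 events each), `simdSpeedup` gives about 3.4, which is roughly 86% of the ideal factor 4.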

time ./build.none/check.exe -p 2048 256 12
***********************************************************************
NumBlocksPerGrid            = 2048
NumThreadsPerBlock          = 256
NumIterations               = 12
-----------------------------------------------------------------------
FP precision                = DOUBLE (nan=0)
Complex type                = STD::COMPLEX
RanNumb memory layout       = AOSOA[8] [HARDCODED FOR REPRODUCIBILITY]
Momenta memory layout       = AOSOA[1] == AOS
Internal loops fptype_sv    = VECTOR[1] == SCALAR (no SIMD)
Random number generation    = CURAND (C++ code)
OMP threads / `nproc --all` = 1 / 4
MatrixElements compiler     = gcc (GCC) 9.2.0
-----------------------------------------------------------------------
NumberOfEntries             = 12
TotalTime[Rnd+Rmb+ME] (123) = ( 7.160223e+00                 )  sec
TotalTime[Rambo+ME]    (23) = ( 6.836318e+00                 )  sec
TotalTime[RndNumGen]    (1) = ( 3.239050e-01                 )  sec
TotalTime[Rambo]        (2) = ( 1.939587e+00                 )  sec
TotalTime[MatrixElems]  (3) = ( 4.896731e+00                 )  sec
MeanTimeInMatrixElems       = ( 4.080609e-01                 )  sec
[Min,Max]TimeInMatrixElems  = [ 4.074413e-01 ,  4.092229e-01 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed         = 6291456
EvtsPerSec[Rnd+Rmb+ME](123) = ( 8.786676e+05                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23) = ( 9.202989e+05                 )  sec^-1
EvtsPerSec[MatrixElems] (3) = ( 1.284828e+06                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)   = 6291456
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
[Min,Max]MatrixElemValue    = [ 6.071582e-03 ,  3.374925e-02 ]  GeV^0
StdDevMatrixElemValue       = ( 8.202858e-03                 )  GeV^0
MeanWeight                  = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight             = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight                = ( 0.000000e+00                 )
***********************************************************************
0a ProcInit :     0.000369 sec
0b MemAlloc :     0.070329 sec
0c GenCreat :     0.000909 sec
0d SGoodHel :     0.000105 sec
1a GenSeed  :     0.000026 sec
1b GenRnGen :     0.323879 sec
2a RamboIni :     0.077785 sec
2b RamboFin :     1.861802 sec
3a SigmaKin :     4.896730 sec
4a DumpLoop :     0.073871 sec
8a CompStat :     0.025105 sec
9a GenDestr :     0.000082 sec
9b DumpScrn :     0.008952 sec
9c DumpJson :     0.000006 sec
TOTAL       :     7.339950 sec
TOTAL (123) :     7.160223 sec
TOTAL  (23) :     6.836318 sec
TOTAL   (1) :     0.323905 sec
TOTAL   (2) :     1.939587 sec
TOTAL   (3) :     4.896730 sec
***********************************************************************
real    0m7.362s
user    0m7.236s
sys     0m0.123s

time ./build.avx2/check.exe -p 2048 256 12
***********************************************************************
NumBlocksPerGrid            = 2048
NumThreadsPerBlock          = 256
NumIterations               = 12
-----------------------------------------------------------------------
FP precision                = DOUBLE (nan=0)
Complex type                = STD::COMPLEX
RanNumb memory layout       = AOSOA[8] [HARDCODED FOR REPRODUCIBILITY]
Momenta memory layout       = AOSOA[4]
Internal loops fptype_sv    = VECTOR[4] (AVX2)
Random number generation    = CURAND (C++ code)
OMP threads / `nproc --all` = 1 / 4
MatrixElements compiler     = gcc (GCC) 9.2.0
-----------------------------------------------------------------------
NumberOfEntries             = 12
TotalTime[Rnd+Rmb+ME] (123) = ( 3.598255e+00                 )  sec
TotalTime[Rambo+ME]    (23) = ( 3.275359e+00                 )  sec
TotalTime[RndNumGen]    (1) = ( 3.228953e-01                 )  sec
TotalTime[Rambo]        (2) = ( 1.845746e+00                 )  sec
TotalTime[MatrixElems]  (3) = ( 1.429614e+00                 )  sec
MeanTimeInMatrixElems       = ( 1.191345e-01                 )  sec
[Min,Max]TimeInMatrixElems  = [ 1.187156e-01 ,  1.201074e-01 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed         = 6291456
EvtsPerSec[Rnd+Rmb+ME](123) = ( 1.748474e+06                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23) = ( 1.920844e+06                 )  sec^-1
EvtsPerSec[MatrixElems] (3) = ( 4.400809e+06                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)   = 6291456
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
[Min,Max]MatrixElemValue    = [ 6.071582e-03 ,  3.374925e-02 ]  GeV^0
StdDevMatrixElemValue       = ( 8.202858e-03                 )  GeV^0
MeanWeight                  = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight             = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight                = ( 0.000000e+00                 )
***********************************************************************
0a ProcInit :     0.000379 sec
0b MemAlloc :     0.070129 sec
0c GenCreat :     0.000908 sec
0d SGoodHel :     0.000100 sec
1a GenSeed  :     0.000025 sec
1b GenRnGen :     0.322871 sec
2a RamboIni :     0.110108 sec
2b RamboFin :     1.735638 sec
3a SigmaKin :     1.429614 sec
4a DumpLoop :     0.075421 sec
8a CompStat :     0.024105 sec
9a GenDestr :     0.000091 sec
9b DumpScrn :     0.008895 sec
9c DumpJson :     0.000002 sec
TOTAL       :     3.778286 sec
TOTAL (123) :     3.598255 sec
TOTAL  (23) :     3.275360 sec
TOTAL   (1) :     0.322895 sec
TOTAL   (2) :     1.845746 sec
TOTAL   (3) :     1.429614 sec
***********************************************************************
real    0m3.799s
user    0m3.677s
sys     0m0.120s
Fix conflicts in epoch1/cuda/ee_mumu/src/Makefile
valassi added 9 commits March 20, 2021 20:19
…omp.sp

Fix conflicts in epoch1/cuda/ee_mumu/SubProcesses/Makefile

NB: this commit is about using -lgomp for both gcc and clang (removing an if),
however in this branch -lgomp was already used (the if was a noop),
because of a mistake in a previous merge commit.
Fix conflicts in epoch1/cuda/ee_mumu/SubProcesses/Makefile

This merge essentially merges the clang PR madgraph5#134
/cvmfs/sft.cern.ch/lcg/releases/gcc/10.1.0-6f386/x86_64-centos7/bin/g++  -O3  -std=c++17 -I. -I../../src -I../../../../../tools -I../../../../../test/googletest/googletest/include -I../../../../../test/include  -Wall -Wshadow -Wextra -fopenmp -DMGONGPU_COMMONRAND_ONHOST -ffast-math   -c runTest.cc -o build.none/runTest.o
runTest.cc: In member function ‘virtual double CPUTest::getMomentum(std::size_t, unsigned int, unsigned int) const’:
runTest.cc:86:98: error: invalid types ‘double[const long unsigned int]’ for array subscript
   86 | ::npar * mgOnGpu::np4 + particle * mgOnGpu::np4 + component][ieppM];
      |                                                             ^

Note however that the test segfaults with gcc10, but not with gcc9.
This happens in both the none and avx2 builds.
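
The compile error above comes from a trailing `[ieppM]` subscript applied to a plain `double` in the build.none test. One way to avoid depending on the element type is to compute a single flat index. The following is a minimal sketch, assuming an AOSOA layout `[ipagM][ipar][ip4][ieppM]`; `aosoaIndex`, `ipagM` and `neppM` are illustrative names, the values of `npar` and `np4` are assumed, and this is not necessarily how the problem was fixed in this PR:

```cpp
#include <cstddef>

// Assumed dimensions (illustrative): 4 particles, 4 momentum components each.
constexpr std::size_t npar = 4;
constexpr std::size_t np4 = 4;

// Flat index into a buffer laid out as [ipagM][ipar][ip4][ieppM],
// where neppM is the number of events per page (neppM = 1 is plain AOS).
inline std::size_t aosoaIndex( std::size_t ievt, std::size_t ipar, std::size_t ip4, std::size_t neppM )
{
  const std::size_t ipagM = ievt / neppM; // page that holds this event
  const std::size_t ieppM = ievt % neppM; // position of the event in its page
  return ipagM * npar * np4 * neppM + ipar * np4 * neppM + ip4 * neppM + ieppM;
}
```

The same expression covers both `AOSOA[1] == AOS` and `AOSOA[4]`, so a test written this way does not need a separate code path per build.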
@valassi
Member Author

valassi commented Mar 22, 2021

While waiting to merge the PR, I also included a few patches from the follow-up branch klas2. Without them, make AVX="none" was failing to build. This is still ready to merge.

@valassi
Member Author

valassi commented Apr 8, 2021

This PR #72 is OBSOLETE. It is superseded by PR #132 (additional SIMD fixes), which is itself superseded by PR #152 (all SIMD fixes, plus epoch1/2 merging as per issue #139 and PR #152).

I copied a few comments to PR #132.

I am closing this now.
