Merge epoch2 and epoch1 - third part (CPPProcess and HelAmps) #151
Also comment out the code in HelAmps_sm.h/cc, but keep the files as they are
…tional changes For the moment: only M-X clean in emacs
…tional changes For the moment: manual change of formatting (pointers, newlines, indentation...)
…s): no functional changes yet
Process = EPOCH1_EEMUMU Process = EPOCH2_EEMUMU
…ode changes As mentioned before, epoch1 is slightly faster [slower] than epoch2 in C++ [CUDA]
Process = EPOCH1_EEMUMU
MatrixElements compiler = gcc (GCC) 9.2.0
EvtsPerSec[MatrixElems] (3) = ( 1.152193e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372152e-02 +- 3.269516e-06 ) GeV^0
TOTAL : 8.005549 sec
real 0m8.038s
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU
MatrixElements compiler = nvcc 11.0.221
EvtsPerSec[MatrixElems] (3) = ( 5.600203e+08 ) sec^-1
MeanMatrixElemValue = ( 1.372152e-02 +- 3.269516e-06 ) GeV^0
TOTAL : 0.899564 sec
real 0m1.208s
-------------------------------------------------------------------------
Process = EPOCH2_EEMUMU
MatrixElements compiler = gcc (GCC) 9.2.0
EvtsPerSec[MatrixElems] (3) = ( 1.096791e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372152e-02 +- 3.269516e-06 ) GeV^0
TOTAL : 8.236883 sec
real 0m8.267s
-------------------------------------------------------------------------
Process = EPOCH2_EEMUMU
MatrixElements compiler = nvcc 11.0.221
EvtsPerSec[MatrixElems] (3) = ( 6.087068e+08 ) sec^-1
MeanMatrixElemValue = ( 1.372152e-02 +- 3.269516e-06 ) GeV^0
TOTAL : 0.834394 sec
real 0m1.143s
-------------------------------------------------------------------------
…poch1 and epoch2
Process = EPOCH1_EEMUMU_CPP
EvtsPerSec[MatrixElems] (3) = ( 1.150654e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372152e-02 +- 3.269516e-06 ) GeV^0
TOTAL : 8.016891 sec
real 0m8.048s
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 5.515364e+08 ) sec^-1
MeanMatrixElemValue = ( 1.372152e-02 +- 3.269516e-06 ) GeV^0
TOTAL : 0.837137 sec
real 0m1.144s
-------------------------------------------------------------------------
Process = EPOCH2_EEMUMU_CPP
EvtsPerSec[MatrixElems] (3) = ( 1.095637e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372152e-02 +- 3.269516e-06 ) GeV^0
TOTAL : 8.274270 sec
real 0m8.304s
-------------------------------------------------------------------------
Process = EPOCH2_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 6.100476e+08 ) sec^-1
MeanMatrixElemValue = ( 1.372152e-02 +- 3.269516e-06 ) GeV^0
TOTAL : 1.005009 sec
real 0m1.317s
-------------------------------------------------------------------------
…(remove M0 as in epoch2)
…vxxxxx, sxxxxx, oxxxxx, opzxxx, omzxxx from epoch2 This is meant to ease line-by-line code comparisons NB: opzxxx is used in epoch2 but is unused in epoch1!! (oxzxxx is used twice...?!)
… in imzxxx, ixzxxx, opzxxx, oxzxxx Add it also in epoch1 in opzxxx (still commented out) - same indentation eases code comparison. No functional changes so far.
Process = EPOCH1_EEMUMU_CPP
EvtsPerSec[MatrixElems] (3) = ( 1.155993e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372152e-02 +- 3.269516e-06 ) GeV^0
TOTAL : 7.980323 sec
real 0m8.011s
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 5.516421e+08 ) sec^-1
MeanMatrixElemValue = ( 1.372152e-02 +- 3.269516e-06 ) GeV^0
TOTAL : 1.100832 sec
real 0m1.410s
-------------------------------------------------------------------------
Process = EPOCH2_EEMUMU_CPP
EvtsPerSec[MatrixElems] (3) = ( 1.095220e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372152e-02 +- 3.269516e-06 ) GeV^0
TOTAL : 8.271548 sec
real 0m8.302s
-------------------------------------------------------------------------
Process = EPOCH2_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 6.115229e+08 ) sec^-1
MeanMatrixElemValue = ( 1.372152e-02 +- 3.269516e-06 ) GeV^0
TOTAL : 1.017229 sec
real 0m1.325s
-------------------------------------------------------------------------
…mment Keep the meaning consistent with what was there in the same epoch
…nges Use the same comments and indentation whenever possible
…2_3, FFV4_0, FFV4_3
…, FFV4_0, FFV4_3 as in epoch2
…nges Use the same comments and code formatting whenever possible
…_getGoodHel() as in epoch1
…l changes Use the same comments and code formatting whenever possible
…ugFinalise() out of the event loop
…anges Use the same comments and code formatting whenever possible
Process = EPOCH1_EEMUMU_CPP
EvtsPerSec[MatrixElems] (3) = ( 1.156029e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372152e-02 +- 3.269516e-06 ) GeV^0
TOTAL : 7.981517 sec
real 0m8.011s
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 5.544006e+08 ) sec^-1
MeanMatrixElemValue = ( 1.372152e-02 +- 3.269516e-06 ) GeV^0
TOTAL : 0.834080 sec
real 0m1.142s
-------------------------------------------------------------------------
Process = EPOCH2_EEMUMU_CPP
EvtsPerSec[MatrixElems] (3) = ( 1.098745e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372152e-02 +- 3.269516e-06 ) GeV^0
TOTAL : 8.252090 sec
real 0m8.281s
-------------------------------------------------------------------------
Process = EPOCH2_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 6.074557e+08 ) sec^-1
MeanMatrixElemValue = ( 1.372152e-02 +- 3.269516e-06 ) GeV^0
TOTAL : 0.833006 sec
real 0m1.141s
-------------------------------------------------------------------------
…ast (as in epoch2) instead of ME>MELast. Minor non-functional change in epoch2 too
… 'const int' This is for nhel and nsf in the ixxx and oxxx functions
…mass' by 'const fptype mass' This is both in epoch1 and epoch2, but only in unused commented-out code
This may hopefully give some performance, and will help in vectorization. Results are correct, but no performance gain at all
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.045520e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372152e-02 +- 3.269516e-06 ) GeV^0
TOTAL : 8.498546 sec
real 0m8.525s
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 6.185252e+08 ) sec^-1
MeanMatrixElemValue = ( 1.372152e-02 +- 3.269516e-06 ) GeV^0
TOTAL : 0.845187 sec
real 0m1.149s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 184
-------------------------------------------------------------------------
Process = EPOCH2_EEMUMU_CPP
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.132483e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372152e-02 +- 3.269516e-06 ) GeV^0
TOTAL : 8.033916 sec
real 0m8.059s
-------------------------------------------------------------------------
Process = EPOCH2_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 6.599302e+08 ) sec^-1
MeanMatrixElemValue = ( 1.372152e-02 +- 3.269516e-06 ) GeV^0
TOTAL : 0.800913 sec
real 0m1.106s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 164
-------------------------------------------------------------------------
Results are correct, and it is faster even with more registers?
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.054736e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372152e-02 +- 3.269516e-06 ) GeV^0
TOTAL : 8.438149 sec
real 0m8.464s
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 6.410104e+08 ) sec^-1
MeanMatrixElemValue = ( 1.372152e-02 +- 3.269516e-06 ) GeV^0
TOTAL : 1.164075 sec
real 0m1.470s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 188
-------------------------------------------------------------------------
Process = EPOCH2_EEMUMU_CPP
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.133087e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372152e-02 +- 3.269516e-06 ) GeV^0
TOTAL : 8.042058 sec
real 0m8.067s
-------------------------------------------------------------------------
Process = EPOCH2_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 6.643386e+08 ) sec^-1
MeanMatrixElemValue = ( 1.372152e-02 +- 3.269516e-06 ) GeV^0
TOTAL : 0.798354 sec
real 0m1.102s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 164
-------------------------------------------------------------------------
Saves four registers somehow, keeping the same speed
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.052308e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372152e-02 +- 3.269516e-06 ) GeV^0
TOTAL : 8.457272 sec
real 0m8.483s
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 6.403422e+08 ) sec^-1
MeanMatrixElemValue = ( 1.372152e-02 +- 3.269516e-06 ) GeV^0
TOTAL : 1.139938 sec
real 0m1.446s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 184
-------------------------------------------------------------------------
Process = EPOCH2_EEMUMU_CPP
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.131357e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372152e-02 +- 3.269516e-06 ) GeV^0
TOTAL : 8.043483 sec
real 0m8.070s
-------------------------------------------------------------------------
Process = EPOCH2_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 6.828939e+08 ) sec^-1
MeanMatrixElemValue = ( 1.372152e-02 +- 3.269516e-06 ) GeV^0
TOTAL : 0.801243 sec
real 0m1.106s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 164
-------------------------------------------------------------------------
…xx functions consistently
This will help in vectorization - but this function is unused!
…xxx (unused function)
…xx (must be cross checked!)
CPPProcess.cc:319:38: warning: ‘*’ in boolean context, suggest ‘&&’ instead [-Wint-in-bool-context]
319 | vc[4] = cxmake( 0., nsvahl * ( pvec3 < 0 ) ? - abs( sqh ) : abs( sqh ) );
| ~~~~~~~^~~~~~~~~~~~~~~
CPPProcess.cc:338:32: warning: ‘*’ in boolean context, suggest ‘&&’ instead [-Wint-in-bool-context]
338 | vc[4] = cxmake( 0, nsv * ( pvec3 < 0 ) ? -abs( sqh ) : abs( sqh ) );
| ~~~~^~~~~~~~~~~~~~~
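The warning above comes from operator precedence: `*` binds more tightly than `?:`, so the product `nsv * ( pvec3 < 0 )` becomes the condition of the ternary. The following is a minimal sketch of the possible readings, not the fix applied in this PR. The helper names are hypothetical and only the shape of the expression is taken from the compiler output; which reading is physically intended is exactly what must be cross-checked.

```cpp
#include <cmath>

// Hypothetical helpers for illustration; they are not part of CPPProcess.cc.

// Reading 1: what the compiler actually parses for
//   nsv * ( pvec3 < 0 ) ? -abs( sqh ) : abs( sqh )
// The product is the ternary condition (grouping and bool test made explicit).
inline double imagAsParsed( int nsv, double pvec3, double sqh )
{
  const bool cond = ( nsv * static_cast<int>( pvec3 < 0 ) ) != 0;
  return cond ? -std::abs( sqh ) : std::abs( sqh );
}

// Reading 2: the '&&' form suggested by the warning.
// It gives the same result as reading 1 for every value of nsv.
inline double imagWithAnd( int nsv, double pvec3, double sqh )
{
  return ( nsv != 0 && pvec3 < 0 ) ? -std::abs( sqh ) : std::abs( sqh );
}

// Reading 3: nsv multiplies the result of the ternary.
// This is a different function: it differs from reading 1 when nsv is -1.
inline double imagScaled( int nsv, double pvec3, double sqh )
{
  return nsv * ( ( pvec3 < 0 ) ? -std::abs( sqh ) : std::abs( sqh ) );
}
```

If reading 1 is the intended one, the `&&` form silences the warning without changing results; if reading 3 is intended, the current code is a real bug for negative `nsv`.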
Will now copy them over to epoch2, and later vectorize them
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.049947e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372152e-02 +- 3.269516e-06 ) GeV^0
TOTAL : 8.483554 sec
real 0m8.510s
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 6.192699e+08 ) sec^-1
MeanMatrixElemValue = ( 1.372152e-02 +- 3.269516e-06 ) GeV^0
TOTAL : 0.814137 sec
real 0m1.127s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 184
-------------------------------------------------------------------------
Process = EPOCH2_EEMUMU_CPP
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.131853e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372152e-02 +- 3.269516e-06 ) GeV^0
TOTAL : 8.055795 sec
real 0m8.082s
-------------------------------------------------------------------------
Process = EPOCH2_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 6.567374e+08 ) sec^-1
MeanMatrixElemValue = ( 1.372152e-02 +- 3.269516e-06 ) GeV^0
TOTAL : 0.816820 sec
real 0m1.123s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 164
-------------------------------------------------------------------------
…of ixxxxx Back to the usual performance. Will copy to epoch2 and later vectorize
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.132769e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372152e-02 +- 3.269516e-06 ) GeV^0
TOTAL : 8.048635 sec
real 0m8.075s
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 6.547247e+08 ) sec^-1
MeanMatrixElemValue = ( 1.372152e-02 +- 3.269516e-06 ) GeV^0
TOTAL : 0.811278 sec
real 0m1.124s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 164
-------------------------------------------------------------------------
Process = EPOCH2_EEMUMU_CPP
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.130969e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372152e-02 +- 3.269516e-06 ) GeV^0
TOTAL : 8.065505 sec
real 0m8.093s
-------------------------------------------------------------------------
Process = EPOCH2_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 6.638448e+08 ) sec^-1
MeanMatrixElemValue = ( 1.372152e-02 +- 3.269516e-06 ) GeV^0
TOTAL : 0.819528 sec
real 0m1.125s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 164
-------------------------------------------------------------------------
…ROM EPOCH1 Same performance of course, code is again almost identical
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.133299e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372152e-02 +- 3.269516e-06 ) GeV^0
TOTAL : 8.039318 sec
real 0m8.066s
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 6.930205e+08 ) sec^-1
MeanMatrixElemValue = ( 1.372152e-02 +- 3.269516e-06 ) GeV^0
TOTAL : 1.030438 sec
real 0m1.349s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 164
-------------------------------------------------------------------------
Process = EPOCH2_EEMUMU_CPP
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.130380e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372152e-02 +- 3.269516e-06 ) GeV^0
TOTAL : 8.059123 sec
real 0m8.086s
-------------------------------------------------------------------------
Process = EPOCH2_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 6.937883e+08 ) sec^-1
MeanMatrixElemValue = ( 1.372152e-02 +- 3.269516e-06 ) GeV^0
TOTAL : 1.013147 sec
real 0m1.321s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 164
-------------------------------------------------------------------------
Rearrange order of 'using std::min' and assumptions comments Prepare to add an event loop brace in all functions
…d an evt loop brace Do this in a consistent way for all xxx functions. This will be needed to vectorize the interface of those that include if branches (while keeping an ieppV loop in a non-vectorized implementation). The brace is not needed for xxx functions that can be vectorized (but note that a brace is already there for those, from my earlier changes...)
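A minimal sketch of the idea, under stated assumptions: the function name, its arguments and the page size `neppV` are hypothetical and are not taken from the epoch1 code. The interface takes a whole page of events, while the body keeps a scalar `ieppV` loop because the per-event `if` branch cannot be written directly as a single vector operation.

```cpp
#include <cmath>

// Hypothetical number of events processed together in one "page"
constexpr int neppV = 4;

// Hypothetical xxx-like helper with a data-dependent branch.
// Interface: operates on neppV events at once (vectorized interface).
// Implementation: plain loop over ieppV (non-vectorized implementation).
inline void signedAbsPage( const double* pz, const double* sqh, double* out )
{
  for ( int ieppV = 0; ieppV < neppV; ++ieppV )
  { // event loop brace: one iteration per event in the page
    if ( pz[ieppV] < 0 )
      out[ieppV] = -std::abs( sqh[ieppV] );
    else
      out[ieppV] = std::abs( sqh[ieppV] );
  }
}
```

Callers written against the page-based signature would not need to change if the loop body were later replaced by a branch-free vector implementation.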
Hel_amps.h is essentially empty but is needed to build in epoch2. Eventually will move back all xxx and FFV functions here, after vectorization
…ings) and copy to epoch1 This solves the remaining 4 diffs in the cc file and a few more in the .h file. The CPPProcess is now IDENTICAL in epoch2 and epoch1
…nk it in both epochs
…elow perf (as epoch1)
…y adds a comment)
…mupmum This is because I modified it in the vectorization branch in epoch1. I want to have both branches identical before I apply the vectorization PR. Note that epoch1 and epoch2 are now IDENTICAL except for epoch_process_id.h
…)! *** Last commit: move epoch_process_id.h to SubProcesses in both epoch1 and epoch2
(*) There is only one difference left:
Output of "diff -r --no-dereference epoch1/cuda/ee_mumu epoch2/cuda/ee_mumu":
diff -r --no-dereference epoch1/cuda/ee_mumu/SubProcesses/epoch_process_id.h epoch2/cuda/ee_mumu/SubProcesses/epoch_process_id.h
4c4
< #define MG_EPOCH_PROCESS_ID EPOCH1_EEMUMU
---
> #define MG_EPOCH_PROCESS_ID EPOCH2_EEMUMU
The following is the BASELINE PERFORMANCE before vectorization. For epoch2 this will remain the same - further changes will only be in epoch1.
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.133317e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372152e-02 +- 3.269516e-06 ) GeV^0
TOTAL : 8.050711 sec
real 0m8.079s
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 6.852279e+08 ) sec^-1
MeanMatrixElemValue = ( 1.372152e-02 +- 3.269516e-06 ) GeV^0
TOTAL : 1.233023 sec
real 0m1.552s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 164
-------------------------------------------------------------------------
Process = EPOCH2_EEMUMU_CPP
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.132827e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372152e-02 +- 3.269516e-06 ) GeV^0
TOTAL : 8.059035 sec
real 0m8.086s
-------------------------------------------------------------------------
Process = EPOCH2_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 6.870531e+08 ) sec^-1
MeanMatrixElemValue = ( 1.372152e-02 +- 3.269516e-06 ) GeV^0
TOTAL : 1.177079 sec
real 0m1.485s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 164
-------------------------------------------------------------------------
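Since `MG_EPOCH_PROCESS_ID` is the only remaining difference between the two trees, a single macro is enough to label the output of otherwise identical code. The sketch below shows one way a tag such as `EPOCH1_EEMUMU_CPP` could be derived from that define. This is an assumption for illustration: the PR does not show how the `Process =` line is actually built, and `EXAMPLE_STR` and `processTag` are hypothetical names.

```cpp
#include <string>

// Taken from the diff above (epoch1 side); epoch2 defines EPOCH2_EEMUMU instead
#define MG_EPOCH_PROCESS_ID EPOCH1_EEMUMU

// Two-level stringification, so that MG_EPOCH_PROCESS_ID is expanded
// to its value before the '#' operator turns it into a string literal
#define EXAMPLE_STR_IMPL( x ) #x
#define EXAMPLE_STR( x ) EXAMPLE_STR_IMPL( x )

// Hypothetical helper: process tag for the C++ build.
// Adjacent string literals are concatenated by the compiler.
inline std::string processTag()
{
  return EXAMPLE_STR( MG_EPOCH_PROCESS_ID ) "_CPP";
}
```

With a single-level `#x` macro the result would be the literal text `MG_EPOCH_PROCESS_ID`, which is why the extra indirection is used.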
I have finally completed this PR #151. Now epoch1 and epoch2 (before vectorization) are strictly identical. I will now go on and develop on top of epoch1 (vectorization and more), while keeping epoch2 as a pre-vectorization reference.
This is a summary of what this contains:
In general
Changes in file directory structure
Changes in file formatting, cosmetics, minor content issues
Code cleanup potentially affecting performance
Changes in XXX functions
Changes in FFV functions
Changes in sigmakin or calculate_wavefunction
Changes in CPPProcess other than XXX, FFV or formatting
Printouts and performance tools
TODO EVENTUALLY (after vectorization: add to the running list from previous PRs)
My CURRENT BASELINE PERFORMANCE (before vectorization) after this PR is described in 1c25007
This is the third (final?) PR for issue #139: the idea is to make sure that epoch1 has the same code as, or at least includes relevant code from, epoch2. This is because all my next PRs (vectorization, heterogeneous) are based on epoch1.
This is the followup to PR #140 and #149.
I keep it as WIP for the moment.