Merge epoch2 and epoch1 - third part (CPPProcess and HelAmps) by valassi · Pull Request #151 · madgraph5/madgraph4gpu

valassi · 2021-04-01T15:39:59Z

This is the third (final?) PR for issue #139: the idea is to make sure that epoch1 has the same code as, or at least includes relevant code from, epoch2. This is because all my next PRs (vectorization, heterogeneous) are based on epoch1.

This is the follow-up to PRs #140 and #149.

I keep it as WIP for the moment.

…ional changes

Also comment out the code in HelAmps_sm.h/cc, but keep the files as they are

…tional changes For the moment: only M-X clean in emacs

…tional changes For the moment: manual change of formatting (pointers, newlines, indentation...)

…s): no functional changes yet

Process = EPOCH1_EEMUMU Process = EPOCH2_EEMUMU

…ode changes

As mentioned before, epoch1 is slightly faster [slower] than epoch2 in C++ [CUDA]

Process                     = EPOCH1_EEMUMU
MatrixElements compiler     = gcc (GCC) 9.2.0
EvtsPerSec[MatrixElems] (3) = ( 1.152193e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     8.005549 sec
real    0m8.038s
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU
MatrixElements compiler     = nvcc 11.0.221
EvtsPerSec[MatrixElems] (3) = ( 5.600203e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     0.899564 sec
real    0m1.208s
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU
MatrixElements compiler     = gcc (GCC) 9.2.0
EvtsPerSec[MatrixElems] (3) = ( 1.096791e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     8.236883 sec
real    0m8.267s
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU
MatrixElements compiler     = nvcc 11.0.221
EvtsPerSec[MatrixElems] (3) = ( 6.087068e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     0.834394 sec
real    0m1.143s
-------------------------------------------------------------------------

…poch1 and epoch2

Process                     = EPOCH1_EEMUMU_CPP
EvtsPerSec[MatrixElems] (3) = ( 1.150654e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     8.016891 sec
real    0m8.048s
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 5.515364e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     0.837137 sec
real    0m1.144s
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CPP
EvtsPerSec[MatrixElems] (3) = ( 1.095637e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     8.274270 sec
real    0m8.304s
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 6.100476e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     1.005009 sec
real    0m1.317s
-------------------------------------------------------------------------

…(remove M0 as in epoch2)

…vxxxxx, sxxxxx, oxxxxx, opzxxx, omzxxx from epoch2 This is meant to ease line-by-line code comparisons NB: opzxxx is used in epoch2 but is unused in epoch1!! (oxzxxx is used twice...?!)

…comments

… in imzxxx, ixzxxx, opzxxx, oxzxxx

Add it also in epoch1 in opzxxx (still commented out) - same indentation eases code comparison
No functional changes so far yet

Process                     = EPOCH1_EEMUMU_CPP
EvtsPerSec[MatrixElems] (3) = ( 1.155993e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     7.980323 sec
real    0m8.011s
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 5.516421e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     1.100832 sec
real    0m1.410s
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CPP
EvtsPerSec[MatrixElems] (3) = ( 1.095220e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     8.271548 sec
real    0m8.302s
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 6.115229e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     1.017229 sec
real    0m1.325s
-------------------------------------------------------------------------

…mment Keep the meaning consistent with what was there in the same epoch

…nges Use the same comments and indentation whenever possible

…2_3, FFV4_0, FFV4_3

…, FFV4_0, FFV4_3 as in epoch2

…nges Use the same comments and code formatting whenever possible

…_getGoodHel() as in epoch1

…l changes Use the same comments and code formatting whenever possible

…ugFinalise() out of the event loop

…anges Use the same comments and code formatting whenever possible

Process                     = EPOCH1_EEMUMU_CPP
EvtsPerSec[MatrixElems] (3) = ( 1.156029e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     7.981517 sec
real    0m8.011s
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 5.544006e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     0.834080 sec
real    0m1.142s
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CPP
EvtsPerSec[MatrixElems] (3) = ( 1.098745e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     8.252090 sec
real    0m8.281s
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 6.074557e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     0.833006 sec
real    0m1.141s
-------------------------------------------------------------------------

…ast (as in epoch2) instead of ME>MELast Minor non functional change in epoch2 too

… 'const int' This is for nhel and nsf in the ixxx and oxxx functions

…mass' by 'const fptype mass' This is both in epoch1 and epoch2, but only in unused commented out code

This may hopefully give some performance, and will help in vectorization
Results are correct, but no performance gain at all

-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.045520e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     8.498546 sec
real    0m8.525s
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 6.185252e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     0.845187 sec
real    0m1.149s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 184
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CPP
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.132483e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     8.033916 sec
real    0m8.059s
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 6.599302e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     0.800913 sec
real    0m1.106s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 164
-------------------------------------------------------------------------

Results are correct, and it is faster even if with more registers?

-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.054736e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     8.438149 sec
real    0m8.464s
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 6.410104e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     1.164075 sec
real    0m1.470s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 188
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CPP
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.133087e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     8.042058 sec
real    0m8.067s
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 6.643386e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     0.798354 sec
real    0m1.102s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 164
-------------------------------------------------------------------------

Saves four registers somehow, keeping the same speed

-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.052308e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     8.457272 sec
real    0m8.483s
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 6.403422e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     1.139938 sec
real    0m1.446s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 184
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CPP
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.131357e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     8.043483 sec
real    0m8.070s
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 6.828939e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     0.801243 sec
real    0m1.106s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 164
-------------------------------------------------------------------------

…xx functions consistently

This will help in vectorization - but this function is unused!

…xxx (unused function)

…xx (must be cross checked!)

CPPProcess.cc:319:38: warning: ‘*’ in boolean context, suggest ‘&&’ instead [-Wint-in-bool-context]
  319 |           vc[4] = cxmake( 0., nsvahl * ( pvec3 < 0 ) ? - abs( sqh ) : abs( sqh ) );
      |                               ~~~~~~~^~~~~~~~~~~~~~~
CPPProcess.cc:338:32: warning: ‘*’ in boolean context, suggest ‘&&’ instead [-Wint-in-bool-context]
  338 |         vc[4] = cxmake( 0, nsv * ( pvec3 < 0 ) ? -abs( sqh ) : abs( sqh ) );
      |                            ~~~~^~~~~~~~~~~~~~~
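The warning above comes from operator precedence: in C++ the `*` binds tighter than `?:`, so the product becomes the condition of the ternary. The sketch below spells out the two possible readings side by side. The function names are hypothetical, plain `double` stands in for the project's types, and which reading was intended in vxxxxx is not stated here (the commit message itself says it must be cross checked).

```cpp
#include <cassert>
#include <cmath>

// Reading 1: how the compiler parses "nsv * ( pvec3 < 0 ) ? -abs( sqh ) : abs( sqh )".
// The product is the condition, so nsv only matters through being zero or non-zero.
// (The explicit "!= 0" keeps the same meaning and avoids -Wint-in-bool-context.)
inline double asParsed( const int nsv, const double pvec3, const double sqh )
{
  return ( ( nsv * ( pvec3 < 0 ) ) != 0 ) ? -std::abs( sqh ) : std::abs( sqh );
}

// Reading 2: the sign factor multiplies the result of the ternary.
// This is a hypothetical alternative, shown only for comparison.
inline double signTimesTernary( const int nsv, const double pvec3, const double sqh )
{
  return nsv * ( pvec3 < 0 ? -std::abs( sqh ) : std::abs( sqh ) );
}
```

The two readings agree for some inputs and differ for others (for example nsv = -1 with pvec3 < 0), which is why a change that only silences the warning still needs a numerical cross check.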

Will now copy them over to epoch2, and later vectorize them

-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.049947e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     8.483554 sec
real    0m8.510s
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 6.192699e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     0.814137 sec
real    0m1.127s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 184
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CPP
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.131853e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     8.055795 sec
real    0m8.082s
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 6.567374e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     0.816820 sec
real    0m1.123s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 164
-------------------------------------------------------------------------

…of ixxxxx

Back to the usual performance. Will copy to epoch2 and later vectorize

-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.132769e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     8.048635 sec
real    0m8.075s
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 6.547247e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     0.811278 sec
real    0m1.124s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 164
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CPP
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.130969e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     8.065505 sec
real    0m8.093s
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 6.638448e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     0.819528 sec
real    0m1.125s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 164
-------------------------------------------------------------------------

…ROM EPOCH1

Same performance of course, code is again almost identical

-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.133299e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     8.039318 sec
real    0m8.066s
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 6.930205e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     1.030438 sec
real    0m1.349s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 164
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CPP
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.130380e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     8.059123 sec
real    0m8.086s
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 6.937883e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     1.013147 sec
real    0m1.321s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 164
-------------------------------------------------------------------------

Rearrange order of 'using std::min' and assumptions comments Prepare to add an event loop brace in all functions

…d an evt loop brace Do this in a consistent way for all xxx functions. This will be needed to vectorize the interface of those including if branches (while keeping an ieppV loop in a non vectorized implementation). The brace is not needed for xxx functions that can be vectorized (but note that a brace is already there for those, from my earlier changes...)

Hel_amps.h is essentially empty but is needed to build in epoch2 Eventually will move back all xxx and FFV functions here, after vectorization

…ings) and copy to epoch1 This solves the remaining 4 diffs in the cc file and a few more in the .h file. The CPPProcess is now IDENTICAL in epoch2 and epoch1

…as in epoch2

…nk it in both epochs

…elow perf (as epoch1)

…y adds a comment)

…mupmum This is because I modified it in the vectorization branch in epoch1. I want to have both branches identical before I apply the vectorization PR. Note that epoch1 and epoch2 are now IDENTICAL except for epoch_process_id.h

…)! ***

Last commit: move epoch_process_id.h to SubProcesses in both epoch1 and epoch2

(*) There is only one difference left:

Output of "diff -r --no-dereference epoch1/cuda/ee_mumu epoch2/cuda/ee_mumu":

diff -r --no-dereference epoch1/cuda/ee_mumu/SubProcesses/epoch_process_id.h epoch2/cuda/ee_mumu/SubProcesses/epoch_process_id.h
4c4
< #define MG_EPOCH_PROCESS_ID EPOCH1_EEMUMU
---
> #define MG_EPOCH_PROCESS_ID EPOCH2_EEMUMU

The following is the BASELINE PERFORMANCE before vectorization.
For epoch2 this will remain the same - further changes will only be in epoch1.

-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.133317e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     8.050711 sec
real    0m8.079s
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 6.852279e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     1.233023 sec
real    0m1.552s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 164
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CPP
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.132827e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     8.059035 sec
real    0m8.086s
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 6.870531e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     1.177079 sec
real    0m1.485s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 164
-------------------------------------------------------------------------

valassi · 2021-04-08T17:24:17Z

I have finally completed this PR #151. Now epoch1 and epoch2 (before vectorization) are strictly identical. I will then go on and develop on top of epoch1 (vectorization and more), while keeping epoch2 as a pre-vectorization reference.

This is a summary of what this contains:

In general

proceed iteratively, with detailed code/performance comparisons at each step
merge into the vectorization ("klas2ep12") branch frequently to prepare for that eventually
first use the same function names and formatting to allow side-by-side comparisons, then modify both a lot
along the way, I rewrote the XXX and FFV functions almost from scratch, starting from those of epoch2
eventually, epoch1 and epoch2 are now IDENTICAL (except for the "epoch_process_id" tag)

Changes in file directory structure

HelAmps_sm.h/cc : epoch2 comment out code but keep files as they were initially, also copy them as-is in epoch1
CPPProcess.cc : epoch2 embedded epoch2 HelAmps_sm.cc initially as-is, then started modifying the contents in parallel to epoch1
performance scripts, headers etc: try to move everything to SubProcesses and add links in PSigma, as in epoch2
(WARNING: not yet done for epoch1 runTest.cc, where I have big changes for vectorization; temporarily moved it also for epoch2)

Changes in file formatting, cosmetics, minor content issues

CPPProcess.cc : epoch2 (and epoch1) add mgDebug calls (plus minor fixes when already there) and returns on void

Code cleanup potentially affecting performance

CPPProcess.cc : epoch2 "clean up" the code, also to simplify vectorization later on
(e.g. define variables only when used, and initialise them immediately; use c++11 zero initialization...)
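As an illustration of the cleanup style described above (declare each variable where it is first used, initialise it immediately, use empty-brace zero initialization), here is a minimal sketch. The names `sumSquares`, `zeroInitializedSum` and `ncolor` are hypothetical and `fptype` is assumed to be `double`; this is not code taken from CPPProcess.cc.

```cpp
#include <cassert>

using fptype = double; // assumption: double precision build

constexpr int ncolor = 3; // hypothetical array size, for illustration only

// Each local is declared at the point of first use and initialised immediately,
// instead of being declared uninitialised at the top of the function.
inline fptype sumSquares( const fptype* values, const int n )
{
  fptype total = 0;
  for ( int i = 0; i < n; ++i )
  {
    const fptype v2 = values[i] * values[i]; // defined only where it is needed
    total += v2;
  }
  return total;
}

// Empty-brace initialization sets every element of the array to zero,
// so no separate zeroing loop is needed.
inline fptype zeroInitializedSum()
{
  fptype jamp2[ncolor] = {};
  return sumSquares( jamp2, ncolor );
}
```

Keeping declarations local and `const` where possible also makes it easier to see which quantities are per-event, which matters when the same body is later wrapped in an event loop.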

Changes in XXX functions

CPPProcess.cc : epoch1 rename imzxxxM0, ixzxxxM0, oxzxxxM0 (remove M0 as in epoch2)
CPPProcess.cc : epoch1 add (commented out) ixxxxx, ipzxxx, vxxxxx, sxxxxx, oxxxxx, opzxxx, omzxxx as-is from epoch2
CPPProcess.cc : epoch1/2, rewrite ASSUMPTIONS comment using the explanation given in epoch2
CPPProcess.cc : epoch2 interface change, replace 'const int&' by 'const int' (nhel and nsf)
CPPProcess.cc : epoch2 interface change, replace 'const fptype&' by 'const fptype' (e.g. masses)
CPPProcess.cc : epoch1/2 interface change, replace 'cxtype fi[6]' by 'cxtype* fi'
CPPProcess.cc : epoch1/2 move the definition of nwf=5 and nw6=6 from mgOnGpuConfig.h to CPPProcess.cc
CPPProcess.cc : epoch1 interface change, replace fis/fos by fi/fo as in epoch2
(generally try to use Olivier's epoch2 naming/structure, eventually rewrite them from scratch starting from epoch2)
CPPProcess.cc : epoch1 use opzxxx instead of oxzxxx as in epoch2!
CPPProcess.cc : epoch1 eventually rewrite all xxx functions from scratch starting from epoch2 modified as above
CPPProcess.cc : epoch1/2 ensure ixxxxx/oxxxxx are ok, test these instead of the mass=0 functions, functionally ok but slower
CPPProcess.cc : epoch1/2 uncomment all xxx functions and ensure they build (including ipzxxx, vxxxxx, sxxxxx, omzxxx)
CPPProcess.cc : epoch1/2 replace p0123 by pvec0123 in all xxx functions consistently
CPPProcess.cc : epoch1/2 remove two build warnings from sxxxxx (unused function)
CPPProcess.cc : epoch1/2 remove two build warnings in vxxxxx (ternary operator), WARNING must be cross checked!
CPPProcess.cc : epoch1/2 (eventually) add START/END LOOP and extra brace in all xxx functions
(the extra brace is because later on in some cases fptype_v is handled by an explicit loop over ieppV without SIMD)
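Two of the interface changes listed above can be checked at compile time. The sketch below uses hypothetical function names and assumes `cxtype` is `std::complex<double>`; it only illustrates the language rules involved, not the actual xxx functions. An array parameter such as `cxtype fi[6]` is adjusted to `cxtype* fi`, so that replacement does not change the function type, whereas replacing `const int&` by `const int` does change it (pass by value instead of by reference).

```cpp
#include <cassert>
#include <complex>
#include <type_traits>

using fptype = double;               // assumption: double precision build
using cxtype = std::complex<fptype>; // assumption: stand-in for the project's complex type

constexpr int nw6 = 6; // value quoted in the list above

// Hypothetical declarations, used only to compare function types.
void withArrayParam( const int nhel, cxtype fi[6] );
void withPointerParam( const int nhel, cxtype* fi );
void withIntReference( const int& nhel, cxtype* fi );

// 'cxtype fi[6]' decays to 'cxtype* fi': the two signatures are the same type.
static_assert( std::is_same<decltype( withArrayParam ), decltype( withPointerParam )>::value,
               "array parameter and pointer parameter give the same function type" );

// 'const int&' versus 'const int' is a real interface change.
static_assert( !std::is_same<decltype( withIntReference ), decltype( withPointerParam )>::value,
               "reference and value parameters give different function types" );

// Hypothetical body using the by-value, pointer-based interface.
inline void fillSketch( const int nhel, const fptype mass, cxtype* fi )
{
  for ( int i = 0; i < nw6; ++i ) fi[i] = cxtype( nhel * mass, i );
}
```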

Changes in FFV functions

CPPProcess.cc : epoch1 add (initially commented out, later uncommented) FFV2_0, FFV2_3, FFV4_0, FFV4_3 as in epoch2
CPPProcess.cc : epoch1/2 eventually ensure all FFV functions build including FFV2_0, FFV2_3, FFV4_0, FFV4_3
CPPProcess.cc : epoch1 use GPU constant memory for couplings as in epoch2
(WARNING: this increases registers without affecting throughput, from a performance view point hardcoding them is better,
but from a usability view point these "constants" must be read from user configuration files - hence do use constant memory)
CPPProcess.cc : epoch2 use fptype (was hardcoded double) for masses/couplings as in epoch1
CPPProcess.cc : epoch1 rewrite all FFV functions from scratch starting from epoch2 (with few modifications as above)
(this includes the FFV1_0, FFV1P0_3, FFV2_4_0 and FFV2_4_3 that I had previously modified in epoch1 - I discard those!)
CPPProcess.cc : epoch1/2 gain 1% by inverting order of two statements in FFV2_4_0 (?!)
CPPProcess.cc : epoch1/2 gain ~1% by inverting order of two statements in FFV2_4_3 (?!)
CPPProcess.cc : epoch1/2 replace ".real()" by "cxreal()", more portable (cucomplex) and makes vectorization easier
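The point of replacing `.real()` by `cxreal()` is that calling code no longer depends on the complex type having member functions. A minimal sketch of the idea follows, for a `std::complex` backend only. `cxreal` and `cxmake` are names that appear in this PR, but the definitions here, and the names `cximag` and `realOfProduct`, are assumptions made for illustration.

```cpp
#include <cassert>
#include <complex>

using fptype = double;               // assumption: double precision build
using cxtype = std::complex<fptype>; // assumption: one possible backend

// Free-function accessors: a different backend (for example a plain struct
// without member functions) would only need its own versions of these.
inline fptype cxreal( const cxtype& c ) { return c.real(); }
inline fptype cximag( const cxtype& c ) { return c.imag(); } // hypothetical companion
inline cxtype cxmake( const fptype r, const fptype i ) { return cxtype( r, i ); }

// Calling code written only in terms of the free functions.
inline fptype realOfProduct( const cxtype& a, const cxtype& b )
{
  return cxreal( a ) * cxreal( b ) - cximag( a ) * cximag( b );
}
```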

Changes in sigmakin or calculate_wavefunction

CPPProcess.cc : epoch1 helicity filtering bug fix ME!=MELast (as in epoch2) instead of ME>MELast
CPPProcess.cc : epoch1 add using namespace MG5_sm in calculate_wavefunction, as in epoch2
CPPProcess.cc : epoch1 use amp[1] instead of amp[2] as in epoch2, add to running sum after each FFV call
(WARNING! this decreases epoch1 C++ performance by ~5% to the same level as epoch2, but is useful
for "2 to many" processes as it probably saves some registers; I was not yet monitoring registers)
CPPProcess.cc : epoch1/2 use "jamp -= amp0" instead of "jamp += (-amp0)"
(surprisingly, this gains back some of the performance previously lost in epoch1)
CPPProcess.cc : epoch2 add OMP multi-threading as in epoch1
CPPProcess.cc : epoch2 use fptype (was hardcoded double) for colors as in epoch1
CPPProcess.cc : epoch1 use opzxxx instead of oxzxxx as in epoch2!
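To make the helicity filtering item above concrete, here is a minimal sketch of a filter that compares a running sum before and after each helicity is added. The function name, the input vector and the loop structure are hypothetical and much simpler than sigmaKin; only the `ME != MELast` comparison is taken from the list above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using fptype = double; // assumption: double precision build

// contributions[ihel] is a hypothetical per-helicity contribution to the running sum.
// A helicity is flagged as "good" if adding it changes the running sum at all.
inline std::vector<bool> goodHelicities( const std::vector<fptype>& contributions )
{
  std::vector<bool> isGood( contributions.size(), false );
  fptype me = 0;
  fptype meLast = 0;
  for ( std::size_t ihel = 0; ihel < contributions.size(); ++ihel )
  {
    me += contributions[ihel];
    if ( me != meLast ) isGood[ihel] = true; // the comparison named in the bug fix
    meLast = me;
  }
  return isGood;
}
```

In this sketch, a test with `>` instead of `!=` would give the same flags as long as every contribution is non-negative, and would differ only if a contribution lowered the running sum.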

Changes in CPPProcess other than XXX, FFV or formatting

CPPProcess.cc : epoch1 remove duplicate code defining tHel in the same way for c++ and cuda
CPPProcess.cc : epoch2 remove 'static' from 'const int tHel' as in epoch1
CPPProcess.cc : epoch2 clean up class structure (fix warnings) and copy to epoch1

Printouts and performance tools

check.cc : epoch1/2, add Epoch/Process/Language printout (using epoch_process_id header)
throughput12.sh : epoch1/2 add script to compare ep2/ep1 performances as the code changes
throughput12.sh : epoch1/2 repeat c++ test also with full OMP threads (if -omp is specified)
throughput12.sh : dump sigmaKin register usage from ncu
profile.sh : epoch1 a few fixes for ncu

TODO EVENTUALLY (after vectorization: add to the running list from previous PRs)

remove XXX and FFV functions from CPPProcess, move them only to HelAmps
move runTests.cc to SubProcesses and add a link in the PSigma directory

valassi · 2021-04-08T17:33:31Z

My CURRENT BASELINE PERFORMANCE (before vectorization) after this PR is described in 1c25007

-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.133317e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     8.050711 sec
real    0m8.079s
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 6.852279e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     1.233023 sec
real    0m1.552s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 164
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CPP
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.132827e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     8.059035 sec
real    0m8.086s
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 6.870531e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     1.177079 sec
real    0m1.485s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 164
-------------------------------------------------------------------------
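For quick comparisons of the baseline numbers above, the throughput ratios can be computed directly. The sketch below hardcodes the four EvtsPerSec values from this log; the constant and function names are hypothetical.

```cpp
#include <cassert>

// EvtsPerSec[MatrixElems] values copied from the baseline log above (sec^-1).
constexpr double ep1Cpp = 1.133317e+06;
constexpr double ep1Cuda = 6.852279e+08;
constexpr double ep2Cpp = 1.132827e+06;
constexpr double ep2Cuda = 6.870531e+08;

// Ratio of two throughputs (a value above 1 means the first is faster).
inline double throughputRatio( const double a, const double b )
{
  return a / b;
}
```

With these values, `throughputRatio( ep1Cpp, ep2Cpp )` and `throughputRatio( ep1Cuda, ep2Cuda )` are both within 1% of unity, consistent with the two epochs now sharing the same code, while `throughputRatio( ep1Cuda, ep1Cpp )` is roughly 600 for this single-threaded C++ run.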

valassi added 7 commits April 1, 2021 14:43

[ep2to2ep1] CPPProcess.h - cosmetics (code formatting) only, no funct…

f20f28a

…ional changes

[ep2to2ep1] CPPProcess.cc - embed HelAmps_sm.cc in ep2 as-is for ep1

d9847f1

Also comment out the code in HelAmps_sm.h/cc, but keep the files as they are

[ep2to2ep1] CPPProcess.cc - fix a build error (missing std:: for string)

7f6f24f

[ep2to2ep1] CPPProcess.cc - cosmetics (code formatting) only, no func…

874ced4

…tional changes For the moment: only M-X clean in emacs

[ep2to2ep1] CPPProcess.cc - cosmetics (code formatting) only, no func…

1dfeb39

…tional changes For the moment: manual change of formatting (pointers, newlines, indentation...)

[ep2to2ep1] CPPProcess.cc - epoch2 add mgDebug (and return in void fn…

9dc7ae2

…s): no functional changes yet

Merge remote-tracking branch 'upstream/master' into ep2to2ep1

8c0ebd7

valassi marked this pull request as draft April 1, 2021 15:40

valassi added 10 commits April 2, 2021 19:12

[ep2to2ep1] CPPProcess.cc - epoch2 Comment out unused functions

f435070

[ep2to2ep1] check.cc Add Process printout in both epoch1 and epoch2

73bafc2

Process = EPOCH1_EEMUMU Process = EPOCH2_EEMUMU

[ep2to2ep1] CPPProcess.cc epoch1 rename imzxxxM0, ixzxxxM0, oxzxxxM0 …

431464c

…(remove M0 as in epoch2)

[ep2to2ep1] CPPProcess.cc epoch1 add (commented out) ixxxxx, ipzxxx, …

e2d41b2

…vxxxxx, sxxxxx, oxxxxx, opzxxx, omzxxx from epoch2 This is meant to ease line-by-line code comparisons NB: opzxxx is used in epoch2 but is unused in epoch1!! (oxzxxx is used twice...?!)

[ep2to2ep1] CPPProcess.cc : epoch1 and epoch2 improve END/START LOOP …

0692409

…comments

[ep2to2ep1] CPPProcess.cc : epoch1 and epoch2, rewrite ASSUMPTIONS co…

8af64b0

…mment Keep the meaning consistent with what was there in the same epoch

[ep2to2ep1] CPPProcess.cc : epoch and epoch2, more non-functional cha…

a921db4

…nges Use the same comments and indentation whenever possible

This was referenced Apr 2, 2021

klas2 (SIMD CPU) + epoch1/epoch2 #152

Closed

het + epoch1/epoch2 #153

Closed

valassi added 10 commits April 4, 2021 18:37

[ep2to2ep1] CPPProcess.cc : epoch2 comment out the unused FFV2_0, FFV…

3520f9b

…2_3, FFV4_0, FFV4_3

[ep2to2ep1] CPPProcess.cc : epoch1 add (commented out) FFV2_0, FFV2_3…

0bb516e

…, FFV4_0, FFV4_3 as in epoch2

[ep2to2ep1] CPPProcess.cc : epoch and epoch2, more non-functional cha…

270451d

…nges Use the same comments and code formatting whenever possible

[ep2to2ep1] CPPProcess.cc : epoch2 move getcompiler() before sigmaKin…

6d9d7c6

…_getGoodHel() as in epoch1

[ep2to2ep1] CPPProcess.cc : epoch and epoch2, even more non-functiona…

b260ae2

…l changes Use the same comments and code formatting whenever possible

[ep2to2ep1] CPPProcess.cc : epoch2 non-functional bug fix, move mgDeb…

1d37eaa

…ugFinalise() out of the event loop

[ep2to2ep1] CPPProcess.cc : epoch1 functional change, bug fix ME!=MEL…

f396c40

…ast (as in epoch2) instead of ME>MELast Minor non functional change in epoch2 too

[ep2to2ep1] CPPProcess.cc : interface change, replace 'const int&' by…

7a4f13b

… 'const int' This is for nhel and nsf in the ixxx and oxxx functions

[ep2to2ep1] CPPProcess.cc : interface change, replace 'const fptype& …

5a32379

…mass' by 'const fptype mass' This is both in epoch1 and epoch2, but only in unused commented out code

valassi added 22 commits April 7, 2021 18:37

[ep2to2ep1] CPPProcess.cc epoch1 : replace p0123 by pvec0123 in all x…

44bd786

…xx functions consistently

[ep2to2ep1] CPPProcess.cc epoch1 : "clean up" vxxxxx a bit

8d42bf5

This will help in vectorization - but this function is unused!

[ep2to2ep1] CPPProcess.cc epoch1 : remove two build warnings from sxx…

34cb983

…xxx (unused function)

[ep2to2ep1] CPPProcess.cc epoch1/2 further cleanup xxx functions

496c269

Rearrange order of 'using std::min' and assumptions comments Prepare to add an event loop brace in all functions

[ep2to2ep1] epoch1 src : copy Hel_amps.h/cc as-is from epoch2

31b061f

Hel_amps.h is essentially empty but is needed to build in epoch2 Eventually will move back all xxx and FFV functions here, after vectorization

[ep2to2ep1] CPPProcess.cc epoch2 : clean up class structure (fix warn…

beac776

…ings) and copy to epoch1 This solves the remaining 4 diffs in the cc file and a few more in the .h file. The CPPProcess is now IDENTICAL in epoch2 and epoch1

[ep2to2ep1] epoch1 : move Memory.h nvtx.h timermap.h to SubProcesses …

bc2cf87

…as in epoch2

[ep2to2ep1] profile.sh epoch2, copy the few latest patches from epoch1

0d52604

[ep2to2ep1] epoch1 : move profile.sh to SubProcesses as in epoch2, li…

5497bd6

…nk it in both epochs

[ep2to2ep1] epoch1/2 : move perf.py to SubProcesses (as epoch2) but b…

d594eae

…elow perf (as epoch1)

[ep2to2ep1] epoch1 param_card.dat : copy the version from epoch2 (onl…

ea44d56

…y adds a comment)

[ep2to2ep1] epoch2 : add the same throughput12.sh script as in epoch1

c91a9da

valassi marked this pull request as ready for review April 8, 2021 16:02

valassi mentioned this pull request Apr 8, 2021

Move latest eemumu developments from epoch1 to epoch2 ("merge" epoch2 into epoch1) #139

Closed

valassi merged commit e6c867f into madgraph5:master Apr 8, 2021

This was referenced Apr 8, 2021

kernel launchers and SIMD vectorization #71

Closed

WIP - het + klas2 + epoch1/epoch2 (Heterogeneous standalone application: GPU + SIMD CPU) #159

Closed

valassi mentioned this pull request Apr 23, 2021

vectorization/SIMD : klas2ep12bis [klas2 (SIMD CPU) + epoch1/epoch2] #171

Merged

valassi merged 97 commits into madgraph5:master from valassi:ep2to2ep1