
Merge epoch2 and epoch1 - third part (CPPProcess and HelAmps)#151

Merged: valassi merged 97 commits into madgraph5:master from valassi:ep2to2ep1 on Apr 8, 2021


@valassi valassi commented Apr 1, 2021

This is the third (final?) PR for issue #139: the idea is to make sure that epoch1 has the same code as, or at least includes relevant code from, epoch2. This is because all my next PRs (vectorization, heterogeneous) are based on epoch1.

This is the followup to PR #140 and #149.

I keep it as WIP for the moment.

@valassi valassi marked this pull request as draft April 1, 2021 15:40
valassi added 10 commits April 2, 2021 19:12
Process                     = EPOCH1_EEMUMU
Process                     = EPOCH2_EEMUMU
…ode changes

As mentioned before, epoch1 is slightly faster[slower] than epoch2 in C++[Cuda]

Process                     = EPOCH1_EEMUMU
MatrixElements compiler     = gcc (GCC) 9.2.0
EvtsPerSec[MatrixElems] (3) = ( 1.152193e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     8.005549 sec
real    0m8.038s
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU
MatrixElements compiler     = nvcc 11.0.221
EvtsPerSec[MatrixElems] (3) = ( 5.600203e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     0.899564 sec
real    0m1.208s
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU
MatrixElements compiler     = gcc (GCC) 9.2.0
EvtsPerSec[MatrixElems] (3) = ( 1.096791e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     8.236883 sec
real    0m8.267s
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU
MatrixElements compiler     = nvcc 11.0.221
EvtsPerSec[MatrixElems] (3) = ( 6.087068e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     0.834394 sec
real    0m1.143s
-------------------------------------------------------------------------
…poch1 and epoch2

Process                     = EPOCH1_EEMUMU_CPP
EvtsPerSec[MatrixElems] (3) = ( 1.150654e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     8.016891 sec
real    0m8.048s
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 5.515364e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     0.837137 sec
real    0m1.144s
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CPP
EvtsPerSec[MatrixElems] (3) = ( 1.095637e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     8.274270 sec
real    0m8.304s
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 6.100476e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     1.005009 sec
real    0m1.317s
-------------------------------------------------------------------------
…vxxxxx, sxxxxx, oxxxxx, opzxxx, omzxxx from epoch2

This is meant to ease line-by-line code comparisons
NB: opzxxx is used in epoch2 but is unused in epoch1!! (oxzxxx is used twice...?!)
… in imzxxx, ixzxxx, opzxxx, oxzxxx

Add it also in epoch1 in opzxxx (still commented out) - same indentation eases code comparison
No functional changes so far yet

Process                     = EPOCH1_EEMUMU_CPP
EvtsPerSec[MatrixElems] (3) = ( 1.155993e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     7.980323 sec
real    0m8.011s
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 5.516421e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     1.100832 sec
real    0m1.410s
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CPP
EvtsPerSec[MatrixElems] (3) = ( 1.095220e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     8.271548 sec
real    0m8.302s
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 6.115229e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     1.017229 sec
real    0m1.325s
-------------------------------------------------------------------------
…mment

Keep the meaning consistent with what was there in the same epoch
…nges

Use the same comments and indentation whenever possible
This was referenced Apr 2, 2021
valassi added 10 commits April 4, 2021 18:37
…nges

Use the same comments and code formatting whenever possible
…l changes

Use the same comments and code formatting whenever possible
…anges

Use the same comments and code formatting whenever possible

Process                     = EPOCH1_EEMUMU_CPP
EvtsPerSec[MatrixElems] (3) = ( 1.156029e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     7.981517 sec
real    0m8.011s
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 5.544006e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     0.834080 sec
real    0m1.142s
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CPP
EvtsPerSec[MatrixElems] (3) = ( 1.098745e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     8.252090 sec
real    0m8.281s
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 6.074557e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     0.833006 sec
real    0m1.141s
-------------------------------------------------------------------------
…ast (as in epoch2) instead of ME>MELast

Minor non functional change in epoch2 too
… 'const int'

This is for nhel and nsf in the ixxx and oxxx functions
…mass' by 'const fptype mass'

This is both in epoch1 and epoch2, but only in unused commented-out code
valassi added 22 commits April 7, 2021 18:37
This may hopefully give some performance, and will help in vectorization

Results are correct, but no performance gain at all

-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.045520e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     8.498546 sec
real    0m8.525s
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 6.185252e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     0.845187 sec
real    0m1.149s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 184
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CPP
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.132483e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     8.033916 sec
real    0m8.059s
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 6.599302e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     0.800913 sec
real    0m1.106s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 164
-------------------------------------------------------------------------
Results are correct, and it is faster even with more registers?

-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.054736e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     8.438149 sec
real    0m8.464s
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 6.410104e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     1.164075 sec
real    0m1.470s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 188
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CPP
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.133087e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     8.042058 sec
real    0m8.067s
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 6.643386e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     0.798354 sec
real    0m1.102s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 164
-------------------------------------------------------------------------
Saves four registers somehow, keeping the same speed

-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.052308e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     8.457272 sec
real    0m8.483s
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 6.403422e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     1.139938 sec
real    0m1.446s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 184
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CPP
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.131357e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     8.043483 sec
real    0m8.070s
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 6.828939e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     0.801243 sec
real    0m1.106s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 164
-------------------------------------------------------------------------
This will help in vectorization - but this function is unused!
…xx (must be cross checked!)

CPPProcess.cc:319:38: warning: ‘*’ in boolean context, suggest ‘&&’ instead [-Wint-in-bool-context]
  319 |           vc[4] = cxmake( 0., nsvahl * ( pvec3 < 0 ) ? - abs( sqh ) : abs( sqh ) );
      |                               ~~~~~~~^~~~~~~~~~~~~~~
CPPProcess.cc:338:32: warning: ‘*’ in boolean context, suggest ‘&&’ instead [-Wint-in-bool-context]
  338 |         vc[4] = cxmake( 0, nsv * ( pvec3 < 0 ) ? -abs( sqh ) : abs( sqh ) );
      |                            ~~~~^~~~~~~~~~~~~~~
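
The two warnings above come from operator precedence: `*` binds more tightly than `?:`, so the product is used as the condition of the ternary. A minimal sketch of the two possible readings follows. The function names are hypothetical and plain `double` stands in for `fptype`; which reading the original code intends is not stated here, which is presumably why the commit asks for a cross check.

```cpp
#include <cmath>

// How the flagged expression
//   nsv * ( pvec3 < 0 ) ? -abs( sqh ) : abs( sqh )
// is parsed by the compiler: the product is the ternary condition.
// (Hypothetical helper name; double stands in for fptype.)
inline double asParsed( const int nsv, const double pvec3, const double sqh )
{
  return ( nsv * ( pvec3 < 0 ) ) != 0 ? -std::abs( sqh ) : std::abs( sqh );
}

// One alternative reading, with the ternary evaluated first and then
// multiplied by nsv. This is an assumption for illustration only.
inline double signTimesValue( const int nsv, const double pvec3, const double sqh )
{
  return nsv * ( ( pvec3 < 0 ) ? -std::abs( sqh ) : std::abs( sqh ) );
}
```

The two readings differ, for instance, for `nsv = -1`, so adding parentheses silences the warning but also pins down one of the two meanings.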
Will now copy them over to epoch2, and later vectorize them

-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.049947e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     8.483554 sec
real    0m8.510s
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 6.192699e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     0.814137 sec
real    0m1.127s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 184
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CPP
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.131853e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     8.055795 sec
real    0m8.082s
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 6.567374e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     0.816820 sec
real    0m1.123s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 164
-------------------------------------------------------------------------
…of ixxxxx

Back to the usual performance. Will copy to epoch2 and later vectorize

-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.132769e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     8.048635 sec
real    0m8.075s
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 6.547247e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     0.811278 sec
real    0m1.124s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 164
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CPP
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.130969e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     8.065505 sec
real    0m8.093s
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 6.638448e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     0.819528 sec
real    0m1.125s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 164
-------------------------------------------------------------------------
…ROM EPOCH1

Same performance of course, code is again almost identical

-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.133299e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     8.039318 sec
real    0m8.066s
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 6.930205e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     1.030438 sec
real    0m1.349s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 164
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CPP
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.130380e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     8.059123 sec
real    0m8.086s
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 6.937883e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     1.013147 sec
real    0m1.321s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 164
-------------------------------------------------------------------------
Rearrange order of 'using std::min' and assumptions comments
Prepare to add an event loop brace in all functions
…d an evt loop brace

Do this in a consistent way for all xxx functions.
This will be needed to vectorize the interface of those including if branches
(while keeping an ieppV loop in a non vectorized implementation)
The brace is not needed for xxx functions that can be vectorized
(but note that a brace is already there for those, from my earlier changes...)
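
A rough sketch of what such an event loop brace could look like for a function with an `if` branch. Everything here is an assumption for illustration: `neppV = 4`, `fptype_v` modelled as a `std::array`, and the function name `signedAbs`. The real vector type and loop markers may differ.

```cpp
#include <array>
#include <cmath>
#include <cstddef>

using fptype = double;
constexpr std::size_t neppV = 4;            // assumed number of events per vector
using fptype_v = std::array<fptype, neppV>; // hypothetical stand-in for a SIMD vector type

// The interface takes whole vectors, but the branch is resolved event by
// event inside an explicit loop on ieppV (no SIMD in the body).
inline fptype_v signedAbs( const fptype_v& pvec3, const fptype_v& sqh )
{
  fptype_v out{};
  // +++ START LOOP ON IEPPV +++
  for( std::size_t ieppV = 0; ieppV < neppV; ++ieppV )
  {
    if( pvec3[ieppV] < 0 )
      out[ieppV] = -std::abs( sqh[ieppV] );
    else
      out[ieppV] = std::abs( sqh[ieppV] );
  }
  // +++ END LOOP ON IEPPV +++
  return out;
}
```

With this shape the callers already pass vectors, so the loop body can later be replaced by a branch-free vectorized expression without touching the interface.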
Hel_amps.h is essentially empty but is needed to build in epoch2

Eventually will move back all xxx and FFV functions here, after vectorization
…ings) and copy to epoch1

This solves the remaining 4 diffs in the cc file and a few more in the .h file.
The CPPProcess is now IDENTICAL in epoch2 and epoch1
…mupmum

This is because I modified it in the vectorization branch in epoch1.
I want to have both branches identical before I apply the vectorization PR.

Note that epoch1 and epoch2 are now IDENTICAL except for epoch_process_id.h
…)! ***

Last commit: move epoch_process_id.h to SubProcesses in both epoch1 and epoch2

(*) There is only one difference left:
Output of "diff -r --no-dereference epoch1/cuda/ee_mumu epoch2/cuda/ee_mumu":
diff -r --no-dereference epoch1/cuda/ee_mumu/SubProcesses/epoch_process_id.h epoch2/cuda/ee_mumu/SubProcesses/epoch_process_id.h
4c4
< #define MG_EPOCH_PROCESS_ID EPOCH1_EEMUMU
---
> #define MG_EPOCH_PROCESS_ID EPOCH2_EEMUMU

The following is the BASELINE PERFORMANCE before vectorization.
For epoch2 this will remain the same - further changes will only be in epoch1.

-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.133317e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     8.050711 sec
real    0m8.079s
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 6.852279e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     1.233023 sec
real    0m1.552s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 164
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CPP
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.132827e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     8.059035 sec
real    0m8.086s
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 6.870531e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     1.177079 sec
real    0m1.485s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 164
-------------------------------------------------------------------------
@valassi valassi marked this pull request as ready for review April 8, 2021 16:02

valassi commented Apr 8, 2021

I have finally completed this PR #151. Now epoch1 and epoch2 (before vectorization) are strictly identical. I will then go on and develop on top of epoch1 (vectorization and more), while keeping epoch2 as a pre-vectorization reference.

This is a summary of what this contains:

In general

  • proceed iteratively, with detailed code/performance comparisons at each step
  • merge into the vectorization ("klas2ep12") branch frequently to prepare for that eventually
  • first use the same function names and formatting to allow side-by-side comparisons, then modify both a lot
  • going on, I almost rewrote from scratch the XXX and FFV functions starting from those of epoch2
  • eventually, epoch1 and epoch2 are now IDENTICAL (except for the "epoch_process_id" tag)

Changes in file directory structure

  • HelAmps_sm.h/cc : epoch2 comment out code but keep files as they were initially, also copy them as-is in epoch1
  • CPPProcess.cc : epoch2 embedded epoch2 HelAmps_sm.cc initially as-is, then started modifying the contents in parallel to epoch1
  • performance scripts, headers etc: try to move everything to SubProcesses and add links in PSigma, as in epoch2
    (WARNING: not yet done for epoch1 runTest.cc, where I have big changes for vectorization; temporarily moved it also for epoch2)

Changes in file formatting, cosmetics, minor content issues

  • CPPProcess.cc : epoch2 (and epoch1) add mgDebug calls (plus minor fixes when already there) and returns on void

Code cleanup potentially affecting performance

  • CPPProcess.cc : epoch2 "clean up" the code, also to simplify vectorization later on
    (e.g. define variables only when used, and initialise them immediately; use c++11 zero initialization...)
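
As a small illustration of that cleanup style, here is a sketch with hypothetical names (`sumSquared`, `ncolor`) and `std::complex` assumed for `cxtype`. It is not the actual CPPProcess.cc code.

```cpp
#include <complex>

using fptype = double;
using cxtype = std::complex<fptype>; // assumption: host C++ complex type

constexpr int ncolor = 1; // hypothetical value for this sketch

// Sum a list of amplitudes into a single colour flow and square it.
inline fptype sumSquared( const cxtype* amps, const int namp )
{
  cxtype jamp[ncolor] = {}; // C++11 empty-brace initialisation: every element is (0,0)
  for( int i = 0; i < namp; ++i ) jamp[0] += amps[i];
  const fptype me = std::norm( jamp[0] ); // defined where first used, initialised immediately
  return me;
}
```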

Changes in XXX functions

  • CPPProcess.cc : epoch1 rename imzxxxM0, ixzxxxM0, oxzxxxM0 (remove M0 as in epoch2)
  • CPPProcess.cc : epoch1 add (commented out) ixxxxx, ipzxxx, vxxxxx, sxxxxx, oxxxxx, opzxxx, omzxxx as-is from epoch2
  • CPPProcess.cc : epoch1/2, rewrite ASSUMPTIONS comment using the explanation given in epoch2
  • CPPProcess.cc : epoch2 interface change, replace 'const int&' by 'const int' (nhel and nsf)
  • CPPProcess.cc : epoch2 interface change, replace 'const fptype&' by 'const fptype' (e.g. masses)
  • CPPProcess.cc : epoch1/2 interface change, replace 'cxtype fi[6]' by 'cxtype* fi'
  • CPPProcess.cc : epoch1/2 move the definition of nwf=5 and nw6=6 from mgOnGpuConfig.h to CPPProcess.cc
  • CPPProcess.cc : epoch1 interface change, replace fis/fos by fi/fo as in epoch2
    (generally try to use Olivier's epoch2 naming/structure, eventually rewrite them from scratch starting from epoch2)
  • CPPProcess.cc : epoch1 use opzxxx instead of oxzxxx as in epoch2!
  • CPPProcess.cc : epoch1 eventually rewrite all xxx functions from scratch starting from epoch2 modified as above
  • CPPProcess.cc : epoch1/2 ensure ixxxxx/oxxxxx are ok, test these instead of the mass=0 functions, functionally ok but slower
  • CPPProcess.cc : epoch1/2 uncomment all xxx functions and ensure they build (including ipzxxx, vxxxxx, sxxxxx, omzxxx)
  • CPPProcess.cc : epoch1/2 replace p0123 by pvec0123 in all xxx functions consistently
  • CPPProcess.cc : epoch1/2 remove two build warnings from sxxxxx (unused function)
  • CPPProcess.cc : epoch1/2 remove two build warnings in vxxxxx (ternary operator), WARNING must be cross checked!
  • CPPProcess.cc : epoch1/2 (eventually) add START/END LOOP and extra brace in all xxx functions
    (the extra brace is because later on in some cases fptype_v is handled by an explicit loop on ieppV without SIMD)
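
For the interface changes listed above, the following sketch shows what the C++ language itself guarantees. The declarations are hypothetical and `std::complex` is assumed for `cxtype`. An array parameter such as `cxtype fi[6]` is adjusted to a pointer, so replacing it by `cxtype* fi` does not change the function type, whereas `const int&` and `const int` are different signatures.

```cpp
#include <complex>
#include <type_traits>

using fptype = double;
using cxtype = std::complex<fptype>; // assumption: host C++ complex type

// Hypothetical declarations, only used to compare function types.
void withArraySyntax( cxtype fi[6] );
void withPointerSyntax( cxtype* fi );
void byReference( const int& nhel );
void byValue( const int nhel );

// The array bound in a parameter is ignored: both are void( cxtype* ).
static_assert( std::is_same<decltype( withArraySyntax ), decltype( withPointerSyntax )>::value,
               "array parameter is adjusted to a pointer" );

// Top-level const on a by-value parameter is not part of the function type.
static_assert( std::is_same<decltype( byValue ), void( int )>::value,
               "top-level const is dropped from the function type" );

// Passing by reference is a different signature from passing by value.
static_assert( !std::is_same<decltype( byReference ), decltype( byValue )>::value,
               "const int& and const int are different parameter types" );
```

So the `cxtype*` change makes explicit what the compiler already did, while the by-value change for `nhel`, `nsf` and the masses does alter the signature.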

Changes in FFV functions

  • CPPProcess.cc : epoch1 add (initially commented out, later uncommented) FFV2_0, FFV2_3, FFV4_0, FFV4_3 as in epoch2
  • CPPProcess.cc : epoch1/2 eventually ensure all FFV functions build including FFV2_0, FFV2_3, FFV4_0, FFV4_3
  • CPPProcess.cc : epoch1 use GPU constant memory for couplings as in epoch2
    (WARNING: this increases registers without affecting throughput, from a performance view point hardcoding them is better,
    but from a usability view point these "constants" must be read from user configuration files - hence do use constant memory)
  • CPPProcess.cc : epoch2 use fptype (was hardcoded double) for masses/couplings as in epoch1
  • CPPProcess.cc : epoch1 rewrite all FFV functions from scratch starting from epoch2 (with few modifications as above)
    (this includes the FFV1_0, FFV1P0_3, FFV2_4_0 and FFV2_4_3 that I had previously modified in epoch1 - I discard those!)
  • CPPProcess.cc : epoch1/2 gain 1% by inverting order of two statements in FFV2_4_0 (?!)
  • CPPProcess.cc : epoch1/2 gain ~1% by inverting order of two statements in FFV2_4_3 (?!)
  • CPPProcess.cc : epoch1/2 replace ".real()" by "cxreal()", more portable (cucomplex) and makes vectorization easier
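
To illustrate the last point, here is a minimal host-only sketch of free-function accessors. The signatures are assumptions (only the names `cxreal` and `cxmake` appear in this PR; `cximag` is added here for symmetry), and `std::complex` is assumed for `cxtype`.

```cpp
#include <complex>

using fptype = double;
using cxtype = std::complex<fptype>; // assumption: host C++ build

// Free functions hide the member-function syntax of std::complex, so
// call sites stay the same if cxtype is switched to another complex
// type that has no .real() member.
inline cxtype cxmake( const fptype r, const fptype i ) { return cxtype( r, i ); }
inline fptype cxreal( const cxtype& c ) { return c.real(); }
inline fptype cximag( const cxtype& c ) { return c.imag(); } // hypothetical counterpart
```

Code written as `cxreal( z )` then only needs a different set of overloads, not a rewrite, when the underlying type changes.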

Changes in sigmakin or calculate_wavefunction

  • CPPProcess.cc : epoch1 helicity filtering bug fix ME!=MELast (as in epoch2) instead of ME>MELast
  • CPPProcess.cc : epoch1 add using namespace MG5_sm in calculate_wavefunction, as in epoch2
  • CPPProcess.cc : epoch1 use amp[1] instead of amp[2] as in epoch2, add to running sum after each FFV call
    (WARNING! this decreases epoch1 C++ performance by ~5% to the same level as epoch2, but is useful
    for "2 to many" processes as it probably saves some registers; I was not yet monitoring registers)
  • CPPProcess.cc : epoch1/2 use "jamp -= amp0" instead of "jamp += (-amp0)"
    (surprisingly, this gains back some of the performance previously lost in epoch1)
  • CPPProcess.cc : epoch2 add OMP multi-threading as in epoch1
  • CPPProcess.cc : epoch2 use fptype (was hardcoded double) for colors as in epoch1
  • CPPProcess.cc : epoch1 use opzxxx instead of oxzxxx as in epoch2!
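
The `amp[1]` running-sum change and the `jamp -= amp0` form can be sketched as follows. The two `fakeFFV` functions are hypothetical stand-ins that just write fixed values, and `std::complex` is assumed for `cxtype`; the real FFV calls take wavefunctions and couplings.

```cpp
#include <complex>

using fptype = double;
using cxtype = std::complex<fptype>; // assumption: host C++ complex type

// Hypothetical stand-ins for two FFV calls, each writing one amplitude.
inline void fakeFFV_a( cxtype* amp ) { *amp = cxtype( 1.0, 2.0 ); }
inline void fakeFFV_b( cxtype* amp ) { *amp = cxtype( 0.5, -1.0 ); }

// Before: store every amplitude in amp[2], then combine at the end.
inline cxtype jampAllThenSum()
{
  cxtype amp[2];
  fakeFFV_a( &amp[0] );
  fakeFFV_b( &amp[1] );
  cxtype jamp = cxtype( 0., 0. );
  jamp += ( -amp[0] );
  jamp += ( -amp[1] );
  return jamp;
}

// After: reuse a single amp[1] buffer and update the running sum
// right after each call, using "-=" instead of "+= (-...)".
inline cxtype jampRunningSum()
{
  cxtype amp[1];
  cxtype jamp = cxtype( 0., 0. );
  fakeFFV_a( amp );
  jamp -= amp[0];
  fakeFFV_b( amp );
  jamp -= amp[0];
  return jamp;
}
```

Both variants give the same `jamp`; the second keeps only one amplitude alive at a time, which matters more as the number of diagrams grows.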

Changes in CPPProcess other than XXX, FFV or formatting

  • CPPProcess.cc : epoch1 remove duplicate code defining tHel in the same way for c++ and cuda
  • CPPProcess.cc : epoch2 remove 'static' from 'const int tHel' as in epoch1
  • CPPProcess.cc : epoch2 clean up class structure (fix warnings) and copy to epoch1

Printouts and performance tools

  • check.cc : epoch1/2, add Epoch/Process/Language printout (using epoch_process_id header)
  • throughput12.sh : epoch1/2 add script to compare ep2/ep1 performances as the code changes
  • throughput12.sh : epoch1/2 repeat c++ test also with full OMP threads (if -omp is specified)
  • throughput12.sh : dump sigmaKin register usage from ncu
  • profile.sh : epoch1 a few fixes for ncu
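
One way the Epoch/Process/Language tag could be built from the `MG_EPOCH_PROCESS_ID` macro is sketched below. The stringification helpers and the `processTag` function are hypothetical; only the macro name and the resulting strings (e.g. `EPOCH1_EEMUMU_CPP`) appear in this PR.

```cpp
#include <string>

// In the real code this comes from epoch_process_id.h; it is defined
// inline here only to keep the sketch self-contained.
#define MG_EPOCH_PROCESS_ID EPOCH1_EEMUMU

// Two-level stringification, so that the macro is expanded before '#'.
#define MG_STRINGIFY_IMPL( x ) #x
#define MG_STRINGIFY( x ) MG_STRINGIFY_IMPL( x )

// Hypothetical helper: append the language suffix to the process id.
inline std::string processTag( const bool isCuda )
{
  return std::string( MG_STRINGIFY( MG_EPOCH_PROCESS_ID ) ) + ( isCuda ? "_CUDA" : "_CPP" );
}
```

Since the header is the only file that differs between the two epochs, the same check.cc source can label its output for either of them.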

TODO EVENTUALLY (after vectorization: add to the running list from previous PRs)

  • remove XXX and FFV functions from CPPProcess, move them only to HelAmps
  • move runTests.cc to SubProcesses and add a link in the PSigma directory


valassi commented Apr 8, 2021

My CURRENT BASELINE PERFORMANCE (before vectorization) after this PR is described in 1c25007

-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.133317e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     8.050711 sec
real    0m8.079s
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 6.852279e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     1.233023 sec
real    0m1.552s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 164
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CPP
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.132827e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     8.059035 sec
real    0m8.086s
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 6.870531e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     1.177079 sec
real    0m1.485s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 164
-------------------------------------------------------------------------
