Heterogeneous MadGraph: parallel CPU+GPU executions #87

Closed
valassi wants to merge 38 commits into madgraph5:master from valassi:het

@valassi valassi commented Dec 4, 2020

See a more complete description in #85

I created a simple prototype. The point here was mainly a proof of concept, also trying to sort out the build (which may be useful for addressing #83).

The current prototype runs exactly the same number of events, with exactly the same random numbers, in parallel on the CPU (with OMP threads) and on the GPU. Both computations give the same sets of events, which are not yet combined. As the GPU is much faster, essentially the net effect is a computation that lasts as long as the CPU version but processes twice as many events (because the same events are also computed on the GPU), so the throughput doubles.

This clearly needs a lot more work (the optimization in particular is tricky), but it is a useful proof of concept.
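As a rough illustration of the idea (not the actual hcheck code), the following C++ sketch runs two workers concurrently on the same number of events with the same random seed and checks that they produce the same result. Everything here is hypothetical: `runBackend`, `runHeterogeneous`, `BackendResult` and `HetResult` are names invented for this sketch, the "GPU" worker is just a second host thread standing in for a CUDA launch, and the toy integrand replaces the real Rambo and matrix element steps.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <future>
#include <random>

// What one backend reports: events processed, a checksum over the toy
// "matrix elements", and the time that backend measured for itself.
struct BackendResult {
  std::uint64_t nevents;
  double sumME;
  double seconds;
};

// Hypothetical stand-in for one backend (CPU or GPU): generate random
// numbers from the given seed and accumulate a toy quantity per event.
BackendResult runBackend(std::uint64_t nevents, std::uint64_t seed) {
  assert(nevents > 0);
  const auto t0 = std::chrono::steady_clock::now();
  std::mt19937_64 rng(seed);
  double sum = 0.0;
  for (std::uint64_t i = 0; i < nevents; ++i) {
    // Map the top 53 bits of the generator output to [0,1).
    const double r = static_cast<double>(rng() >> 11) * (1.0 / 9007199254740992.0);
    sum += r * r;  // toy "matrix element"
  }
  const auto t1 = std::chrono::steady_clock::now();
  return {nevents, sum, std::chrono::duration<double>(t1 - t0).count()};
}

struct HetResult {
  BackendResult gpu;
  BackendResult cpu;
  double wallSeconds;  // wall time of the whole concurrent run
};

// Launch the "GPU" worker asynchronously, run the "CPU" worker on the
// calling thread, then wait for both. Same nevents and same seed on both
// sides, as in the first version of the prototype.
HetResult runHeterogeneous(std::uint64_t nevents, std::uint64_t seed) {
  const auto t0 = std::chrono::steady_clock::now();
  auto gpuFuture = std::async(std::launch::async, runBackend, nevents, seed);
  const BackendResult cpu = runBackend(nevents, seed);
  const BackendResult gpu = gpuFuture.get();
  const auto t1 = std::chrono::steady_clock::now();
  return {gpu, cpu, std::chrono::duration<double>(t1 - t0).count()};
}
```

Calling `runHeterogeneous(n, seed)` returns one result per backend plus the overall wall time. Because both workers consume the same random sequence in the same order, their `sumME` values are identical, which mirrors the "same sets of events" behaviour described above. On some toolchains this needs to be linked with the thread library (for example `-pthread` with gcc).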

PS I forgot to mention: this includes and supersedes #82. As in that one, the build of runTest (#83) is disabled here. I would suggest however to fix it after including these changes, which give a possible direction for how to combine cuda and c++ modules. One option for #83 is to keep a single runTest.exe as it is now, but make it much clearer that real modules are either c++ with gcc or cuda with nvcc, and then only thin layers integrating both are compiled with nvcc. Essentially, I only had to add -lgomp to build the combined module, probably the same is ok for runTest.

./hcheck.exe -p 16384 32 1
***********************************************************************
NumBlocksPerGrid           = 16384
NumThreadsPerBlock         = 32
NumIterations              = 1
-----------------------------------------------------------------------
FP precision               = DOUBLE
Complex type               = THRUST::COMPLEX
RanNumb memory layout      = AOSOA[4]
Momenta memory layout      = AOSOA[4]
Random number generation   = CURAND DEVICE (CUDA code)
Wavefunction GPU memory    = LOCAL
-----------------------------------------------------------------------
NumIterations              = 1
TotalTime[Rnd+Rmb+ME] (123)= ( 7.312938e-03                 )  sec
TotalTime[Rambo+ME]    (23)= ( 6.714818e-03                 )  sec
TotalTime[RndNumGen]    (1)= ( 5.981200e-04                 )  sec
TotalTime[Rambo]        (2)= ( 5.945168e-03                 )  sec
TotalTime[MatrixElems]  (3)= ( 7.696500e-04                 )  sec
MeanTimeInMatrixElems      = ( 7.696500e-04                 )  sec
[Min,Max]TimeInMatrixElems = [ 7.696500e-04 ,  7.696500e-04 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed        = 524288 (nan=0)
EvtsPerSec[Rnd+Rmb+ME](123)= ( 7.169321e+07                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 7.807926e+07                 )  sec^-1
EvtsPerSec[MatrixElems] (3)= ( 6.812032e+08                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)  = 524288
MeanMatrixElemValue        = ( 1.371958e-02 +- 1.132119e-05 )  GeV^0
[Min,Max]MatrixElemValue   = [ 6.071582e-03 ,  3.374915e-02 ]  GeV^0
StdDevMatrixElemValue      = ( 8.197419e-03                 )  GeV^0
MeanWeight                 = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight            = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight               = ( 0.000000e+00                 )
***********************************************************************
(GPU) 00 CudaFree :     0.872603 sec
(GPU) 0a ProcInit :     0.000233 sec
(GPU) 0b MemAlloc :     0.035793 sec
(GPU) 0c GenCreat :     0.009814 sec
(GPU) 0d SGoodHel :     0.001759 sec
(GPU) 1a GenSeed  :     0.000009 sec
(GPU) 1b GenRnGen :     0.000589 sec
(GPU) 2a RamboIni :     0.000022 sec
(GPU) 2b RamboFin :     0.000014 sec
(GPU) 2c CpDTHwgt :     0.000502 sec
(GPU) 2d CpDTHmom :     0.005407 sec
(GPU) 3a SigmaKin :     0.000014 sec
(GPU) 3b CpDTHmes :     0.000756 sec
(GPU) 4a DumpLoop :     0.004765 sec
(GPU) 8a CompStat :     0.003540 sec
(GPU) 9a GenDestr :     0.000048 sec
(GPU) 9b DumpScrn :     0.000044 sec
(GPU) 9c DumpJson :     0.000007 sec
(GPU) TOTAL       :     0.935918 sec
(GPU) TOTAL (123) :     0.007313 sec
(GPU) TOTAL  (23) :     0.006715 sec
(GPU) TOTAL   (1) :     0.000598 sec
(GPU) TOTAL   (2) :     0.005945 sec
(GPU) TOTAL   (3) :     0.000770 sec
***********************************************************************
***********************************************************************
NumBlocksPerGrid           = 16384
NumThreadsPerBlock         = 32
NumIterations              = 1
-----------------------------------------------------------------------
FP precision               = DOUBLE
Complex type               = STD::COMPLEX
RanNumb memory layout      = AOSOA[4]
Momenta memory layout      = AOSOA[4]
Random number generation   = CURAND (C++ code)
OMP threads / maxthreads   = 4 / 4
-----------------------------------------------------------------------
NumIterations              = 1
TotalTime[Rnd+Rmb+ME] (123)= ( 5.725351e-01                 )  sec
TotalTime[Rambo+ME]    (23)= ( 5.449318e-01                 )  sec
TotalTime[RndNumGen]    (1)= ( 2.760323e-02                 )  sec
TotalTime[Rambo]        (2)= ( 9.914417e-02                 )  sec
TotalTime[MatrixElems]  (3)= ( 4.457877e-01                 )  sec
MeanTimeInMatrixElems      = ( 4.457877e-01                 )  sec
[Min,Max]TimeInMatrixElems = [ 4.457877e-01 ,  4.457877e-01 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed        = 524288 (nan=0)
EvtsPerSec[Rnd+Rmb+ME](123)= ( 9.157308e+05                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 9.621167e+05                 )  sec^-1
EvtsPerSec[MatrixElems] (3)= ( 1.176094e+06                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)  = 524288
MeanMatrixElemValue        = ( 1.371958e-02 +- 1.132119e-05 )  GeV^0
[Min,Max]MatrixElemValue   = [ 6.071582e-03 ,  3.374915e-02 ]  GeV^0
StdDevMatrixElemValue      = ( 8.197419e-03                 )  GeV^0
MeanWeight                 = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight            = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight               = ( 0.000000e+00                 )
***********************************************************************
(CPU) 0a ProcInit :     0.000331 sec
(CPU) 0b MemAlloc :     0.025358 sec
(CPU) 0c GenCreat :     0.000915 sec
(CPU) 1a GenSeed  :     0.000009 sec
(CPU) 1b GenRnGen :     0.027595 sec
(CPU) 2a RamboIni :     0.006872 sec
(CPU) 2b RamboFin :     0.092273 sec
(CPU) 3a SigmaKin :     0.445788 sec
(CPU) 4a DumpLoop :     0.004605 sec
(CPU) 8a CompStat :     0.003633 sec
(CPU) 9a GenDestr :     0.000094 sec
(CPU) 9b DumpScrn :     0.004946 sec
(CPU) 9c DumpJson :     0.000008 sec
(CPU) TOTAL       :     0.612425 sec
(CPU) TOTAL (123) :     0.572535 sec
(CPU) TOTAL  (23) :     0.544932 sec
(CPU) TOTAL   (1) :     0.027603 sec
(CPU) TOTAL   (2) :     0.099144 sec
(CPU) TOTAL   (3) :     0.445788 sec
***********************************************************************
-----------------------------------------------------------------------
TotalTime[Rnd+Rmb+ME] (123)= ( 5.798480e-01                 )  sec
TotalTime[Rambo+ME]    (23)= ( 5.516467e-01                 )  sec
TotalTime[RndNumGen]    (1)= ( 2.820135e-02                 )  sec
TotalTime[Rambo]        (2)= ( 1.050893e-01                 )  sec
TotalTime[MatrixElems]  (3)= ( 4.465573e-01                 )  sec
-----------------------------------------------------------------------
TotalEventsComputed        = 1048576
EvtsPerSec[Rnd+Rmb+ME](123)= ( 1.808364e+06                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 1.900811e+06                 )  sec^-1
EvtsPerSec[MatrixElems] (3)= ( 2.348133e+06                 )  sec^-1
-----------------------------------------------------------------------
@valassi valassi marked this pull request as draft December 4, 2020 14:14
This makes me realise the calculation is clearly wrong: one should
add throughputs, not times (the wall time is the same on CPU and GPU!)
Decrease the GPU multiplier from 100 to 70 (itscrd03, with 4 OMP threads).

****************************************************************************
(GPU) NumBlocksPerGrid           = 16384
(GPU) NumThreadsPerBlock         = 32
(GPU) NumIterations              = 700
----------------------------------------------------------------------------
(GPU) FP precision               = DOUBLE
(GPU) Complex type               = THRUST::COMPLEX
(GPU) RanNumb memory layout      = AOSOA[4]
(GPU) Momenta memory layout      = AOSOA[4]
(GPU) Random number generation   = CURAND DEVICE (CUDA code)
(GPU) Wavefunction GPU memory    = LOCAL
----------------------------------------------------------------------------
(GPU) NumIterations              = 700
(GPU) TotalTime[Rnd+Rmb+ME] (123)= ( 5.835859e+00                 )  sec
(GPU) TotalTime[Rambo+ME]    (23)= ( 5.378664e+00                 )  sec
(GPU) TotalTime[RndNumGen]    (1)= ( 4.571957e-01                 )  sec
(GPU) TotalTime[Rambo]        (2)= ( 4.823339e+00                 )  sec
(GPU) TotalTime[MatrixElems]  (3)= ( 5.553249e-01                 )  sec
(GPU) MeanTimeInMatrixElems      = ( 7.933214e-04                 )  sec
(GPU) [Min,Max]TimeInMatrixElems = [ 7.126600e-04 ,  1.121232e-02 ]  sec
----------------------------------------------------------------------------
(GPU) TotalEventsComputed        = 367001600 (nan=0)
(GPU) EvtsPerSec[Rnd+Rmb+ME](123)= ( 6.288733e+07                 )  sec^-1
(GPU) EvtsPerSec[Rmb+ME]     (23)= ( 6.823286e+07                 )  sec^-1
(GPU) EvtsPerSec[MatrixElems] (3)= ( 6.608772e+08                 )  sec^-1
****************************************************************************
(GPU) NumMatrixElements(notNan)  = 367001600
(GPU) MeanMatrixElemValue        = ( 1.371705e-02 +- 4.280686e-07 )  GeV^0
(GPU) [Min,Max]MatrixElemValue   = [ 6.071582e-03 ,  3.374926e-02 ]  GeV^0
(GPU) StdDevMatrixElemValue      = ( 8.200632e-03                 )  GeV^0
(GPU) MeanWeight                 = ( 4.515827e-01 +- 0.000000e+00 )
(GPU) [Min,Max]Weight            = [ 4.515827e-01 ,  4.515827e-01 ]
(GPU) StdDevWeight               = ( 0.000000e+00                 )
****************************************************************************
(GPU) 00 CudaFree :     1.040999 sec
(GPU) 0a ProcInit :     0.000255 sec
(GPU) 0b MemAlloc :     0.043748 sec
(GPU) 0c GenCreat :     0.035068 sec
(GPU) 0d SGoodHel :     0.001759 sec
(GPU) 1a GenSeed  :     0.005864 sec
(GPU) 1b GenRnGen :     0.451332 sec
(GPU) 2a RamboIni :     0.010404 sec
(GPU) 2b RamboFin :     0.008416 sec
(GPU) 2c CpDTHwgt :     0.369433 sec
(GPU) 2d CpDTHmom :     4.435089 sec
(GPU) 3a SigmaKin :     0.010518 sec
(GPU) 3b CpDTHmes :     0.544807 sec
(GPU) 4a DumpLoop :     2.404587 sec
(GPU) 8a CompStat :     2.632091 sec
(GPU) 9a GenDestr :     0.000196 sec
(GPU) 9b DumpScrn :     0.000071 sec
(GPU) 9c DumpJson :     0.000007 sec
(GPU) TOTAL       :    11.994643 sec
(GPU) TOTAL (123) :     5.835862 sec
(GPU) TOTAL  (23) :     5.378666 sec
(GPU) TOTAL   (1) :     0.457196 sec
(GPU) TOTAL   (2) :     4.823342 sec
(GPU) TOTAL   (3) :     0.555325 sec
****************************************************************************
****************************************************************************
(CPU) NumBlocksPerGrid           = 16384
(CPU) NumThreadsPerBlock         = 32
(CPU) NumIterations              = 10
----------------------------------------------------------------------------
(CPU) FP precision               = DOUBLE
(CPU) Complex type               = STD::COMPLEX
(CPU) RanNumb memory layout      = AOSOA[4]
(CPU) Momenta memory layout      = AOSOA[4]
(CPU) Random number generation   = CURAND (C++ code)
(CPU) OMP threads / maxthreads   = 4 / 4
----------------------------------------------------------------------------
(CPU) NumIterations              = 10
(CPU) TotalTime[Rnd+Rmb+ME] (123)= ( 5.245397e+00                 )  sec
(CPU) TotalTime[Rambo+ME]    (23)= ( 4.964864e+00                 )  sec
(CPU) TotalTime[RndNumGen]    (1)= ( 2.805332e-01                 )  sec
(CPU) TotalTime[Rambo]        (2)= ( 1.005407e+00                 )  sec
(CPU) TotalTime[MatrixElems]  (3)= ( 3.959456e+00                 )  sec
(CPU) MeanTimeInMatrixElems      = ( 3.959456e-01                 )  sec
(CPU) [Min,Max]TimeInMatrixElems = [ 2.837697e-01 ,  4.298637e-01 ]  sec
----------------------------------------------------------------------------
(CPU) TotalEventsComputed        = 5242880 (nan=0)
(CPU) EvtsPerSec[Rnd+Rmb+ME](123)= ( 9.995201e+05                 )  sec^-1
(CPU) EvtsPerSec[Rmb+ME]     (23)= ( 1.055997e+06                 )  sec^-1
(CPU) EvtsPerSec[MatrixElems] (3)= ( 1.324141e+06                 )  sec^-1
****************************************************************************
(CPU) NumMatrixElements(notNan)  = 5242880
(CPU) MeanMatrixElemValue        = ( 1.372304e-02 +- 3.581814e-06 )  GeV^0
(CPU) [Min,Max]MatrixElemValue   = [ 6.071582e-03 ,  3.374925e-02 ]  GeV^0
(CPU) StdDevMatrixElemValue      = ( 8.201400e-03                 )  GeV^0
(CPU) MeanWeight                 = ( 4.515827e-01 +- 0.000000e+00 )
(CPU) [Min,Max]Weight            = [ 4.515827e-01 ,  4.515827e-01 ]
(CPU) StdDevWeight               = ( 0.000000e+00                 )
****************************************************************************
(CPU) 0a ProcInit :     0.015581 sec
(CPU) 0b MemAlloc :     0.028460 sec
(CPU) 0c GenCreat :     0.000978 sec
(CPU) 1a GenSeed  :     0.000084 sec
(CPU) 1b GenRnGen :     0.280449 sec
(CPU) 2a RamboIni :     0.072770 sec
(CPU) 2b RamboFin :     0.932637 sec
(CPU) 3a SigmaKin :     3.959456 sec
(CPU) 4a DumpLoop :     0.034727 sec
(CPU) 8a CompStat :     0.038073 sec
(CPU) 9a GenDestr :     0.000141 sec
(CPU) 9b DumpScrn :     0.061204 sec
(CPU) 9c DumpJson :     0.000011 sec
(CPU) TOTAL       :     5.424570 sec
(CPU) TOTAL (123) :     5.245397 sec
(CPU) TOTAL  (23) :     4.964864 sec
(CPU) TOTAL   (1) :     0.280533 sec
(CPU) TOTAL   (2) :     1.005407 sec
(CPU) TOTAL   (3) :     3.959456 sec
****************************************************************************
----------------------------------------------------------------------------
(GPU) TotalEventsComputed        = 367001600
(GPU) TotalTime[Rnd+Rmb+ME] (123)= ( 5.835859e+00                 )  sec
(GPU) TotalTime[Rambo+ME]    (23)= ( 5.378664e+00                 )  sec
(GPU) TotalTime[RndNumGen]    (1)= ( 4.571957e-01                 )  sec
(GPU) TotalTime[Rambo]        (2)= ( 4.823339e+00                 )  sec
(GPU) TotalTime[MatrixElems]  (3)= ( 5.553249e-01                 )  sec
----------------------------------------------------------------------------
(GPU) TotalEventsComputed        = 367001600
(GPU) EvtsPerSec[Rnd+Rmb+ME](123)= ( 6.288733e+07                 )  sec^-1
(GPU) EvtsPerSec[Rmb+ME]     (23)= ( 6.823286e+07                 )  sec^-1
(GPU) EvtsPerSec[MatrixElems] (3)= ( 6.608772e+08                 )  sec^-1
****************************************************************************
----------------------------------------------------------------------------
(CPU) TotalEventsComputed        = 5242880
(CPU) TotalTime[Rnd+Rmb+ME] (123)= ( 5.245397e+00                 )  sec
(CPU) TotalTime[Rambo+ME]    (23)= ( 4.964864e+00                 )  sec
(CPU) TotalTime[RndNumGen]    (1)= ( 2.805332e-01                 )  sec
(CPU) TotalTime[Rambo]        (2)= ( 1.005407e+00                 )  sec
(CPU) TotalTime[MatrixElems]  (3)= ( 3.959456e+00                 )  sec
----------------------------------------------------------------------------
(CPU) TotalEventsComputed        = 5242880
(CPU) EvtsPerSec[Rnd+Rmb+ME](123)= ( 9.995201e+05                 )  sec^-1
(CPU) EvtsPerSec[Rmb+ME]     (23)= ( 1.055997e+06                 )  sec^-1
(CPU) EvtsPerSec[MatrixElems] (3)= ( 1.324141e+06                 )  sec^-1
****************************************************************************
(HET) TotalEventsComputed        = 372244480
(HET) EvtsPerSec[Rnd+Rmb+ME](123)= ( 6.388685e+07                 )  sec^-1
(HET) EvtsPerSec[Rmb+ME]     (23)= ( 6.928886e+07                 )  sec^-1
(HET) EvtsPerSec[MatrixElems] (3)= ( 6.622013e+08                 )  sec^-1
****************************************************************************

valassi commented Dec 4, 2020

I have just pushed more changes.

  • I added a quick hack to process 70 times more events on the GPU, and to use different random seeds
  • I do not compute combined "physics" (average ME)
  • I fixed the calculation of the combined throughput: one must add throughputs, not times and events (the wall time is the same)
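
The arithmetic behind the last point can be sketched in a few lines of C++. This is only an illustration, assuming (as stated above) that the two backends run over the same wall-clock interval. The names `BackendStats`, `eventsComputed`, `naiveCombined` and `combinedThroughput` are invented for this sketch and are not taken from the code in this PR.

```cpp
#include <cassert>
#include <cstdint>

// Per-backend numbers as printed in the logs.
struct BackendStats {
  std::uint64_t nevents;  // TotalEventsComputed
  double seconds;         // TotalTime[Rnd+Rmb+ME] (123)
};

// Events processed = blocks per grid * threads per block * iterations.
constexpr std::uint64_t eventsComputed(std::uint64_t blocks,
                                       std::uint64_t threads,
                                       std::uint64_t iterations) {
  return blocks * threads * iterations;
}

// Throughput of a single backend in events per second.
inline double throughput(const BackendStats& s) {
  assert(s.seconds > 0.0);
  return static_cast<double>(s.nevents) / s.seconds;
}

// Earlier (wrong) combination: sum events and sum times, which treats the
// two backends as if they ran one after the other.
inline double naiveCombined(const BackendStats& gpu, const BackendStats& cpu) {
  return static_cast<double>(gpu.nevents + cpu.nevents) /
         (gpu.seconds + cpu.seconds);
}

// Corrected combination: the backends run side by side, so their
// throughputs add.
inline double combinedThroughput(const BackendStats& gpu, const BackendStats& cpu) {
  return throughput(gpu) + throughput(cpu);
}
```

With the settings of the example run, `eventsComputed(16384, 32, 700)` gives the 367001600 GPU events and `eventsComputed(16384, 32, 10)` gives the 5242880 CPU events. Adding the two throughputs printed in the log, 6.288733e+07 + 9.995201e+05, gives 6.388685e+07 events per second, which is the (HET) value for (123).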

More is needed eventually, but I think this can be considered for merging. I remove the draft status.

Here is an example of the resulting output:

****************************************************************************
----------------------------------------------------------------------------
(GPU) TotalEventsComputed        = 367001600
(GPU) TotalTime[Rnd+Rmb+ME] (123)= ( 5.835859e+00                 )  sec
(GPU) TotalTime[Rambo+ME]    (23)= ( 5.378664e+00                 )  sec
(GPU) TotalTime[RndNumGen]    (1)= ( 4.571957e-01                 )  sec
(GPU) TotalTime[Rambo]        (2)= ( 4.823339e+00                 )  sec
(GPU) TotalTime[MatrixElems]  (3)= ( 5.553249e-01                 )  sec
----------------------------------------------------------------------------
(GPU) TotalEventsComputed        = 367001600
(GPU) EvtsPerSec[Rnd+Rmb+ME](123)= ( 6.288733e+07                 )  sec^-1
(GPU) EvtsPerSec[Rmb+ME]     (23)= ( 6.823286e+07                 )  sec^-1
(GPU) EvtsPerSec[MatrixElems] (3)= ( 6.608772e+08                 )  sec^-1
****************************************************************************
----------------------------------------------------------------------------
(CPU) TotalEventsComputed        = 5242880
(CPU) TotalTime[Rnd+Rmb+ME] (123)= ( 5.245397e+00                 )  sec
(CPU) TotalTime[Rambo+ME]    (23)= ( 4.964864e+00                 )  sec
(CPU) TotalTime[RndNumGen]    (1)= ( 2.805332e-01                 )  sec
(CPU) TotalTime[Rambo]        (2)= ( 1.005407e+00                 )  sec
(CPU) TotalTime[MatrixElems]  (3)= ( 3.959456e+00                 )  sec
----------------------------------------------------------------------------
(CPU) TotalEventsComputed        = 5242880
(CPU) EvtsPerSec[Rnd+Rmb+ME](123)= ( 9.995201e+05                 )  sec^-1
(CPU) EvtsPerSec[Rmb+ME]     (23)= ( 1.055997e+06                 )  sec^-1
(CPU) EvtsPerSec[MatrixElems] (3)= ( 1.324141e+06                 )  sec^-1
****************************************************************************
(HET) TotalEventsComputed        = 372244480
(HET) EvtsPerSec[Rnd+Rmb+ME](123)= ( 6.388685e+07                 )  sec^-1
(HET) EvtsPerSec[Rmb+ME]     (23)= ( 6.928886e+07                 )  sec^-1
(HET) EvtsPerSec[MatrixElems] (3)= ( 6.622013e+08                 )  sec^-1
****************************************************************************

@valassi valassi marked this pull request as ready for review December 4, 2020 17:16

valassi commented Dec 4, 2020

PS Consider approving/merging #84 first (OMP multi threading), as that is included here too.

@valassi valassi closed this Apr 1, 2021
@valassi valassi deleted the het branch April 1, 2021 10:30

valassi commented Apr 1, 2021

Hm, I deleted and recreated the branch; I thought this would be picked up. I am reopening this as follows:

git push -f origin 9026441:het
[reopen PR]
git push -f origin 52edc92:het

valassi added 2 commits April 1, 2021 17:34
Fix conflicts: epoch1/cuda/ee_mumu/SubProcesses/P1_Sigma_sm_epem_mupmum/check.cc

valassi commented Apr 9, 2021

This PR is now obsolete and I will close it.

It is replaced by PR #159.

I copied a few relevant comments to the general issue #85.

@valassi valassi closed this Apr 9, 2021