Heterogeneous MadGraph: parallel CPU+GPU executions #87
Closed
valassi wants to merge 38 commits into
./hcheck.exe -p 16384 32 1
***********************************************************************
NumBlocksPerGrid = 16384
NumThreadsPerBlock = 32
NumIterations = 1
-----------------------------------------------------------------------
FP precision = DOUBLE
Complex type = THRUST::COMPLEX
RanNumb memory layout = AOSOA[4]
Momenta memory layout = AOSOA[4]
Random number generation = CURAND DEVICE (CUDA code)
Wavefunction GPU memory = LOCAL
-----------------------------------------------------------------------
NumIterations = 1
TotalTime[Rnd+Rmb+ME] (123)= ( 7.312938e-03 ) sec
TotalTime[Rambo+ME] (23)= ( 6.714818e-03 ) sec
TotalTime[RndNumGen] (1)= ( 5.981200e-04 ) sec
TotalTime[Rambo] (2)= ( 5.945168e-03 ) sec
TotalTime[MatrixElems] (3)= ( 7.696500e-04 ) sec
MeanTimeInMatrixElems = ( 7.696500e-04 ) sec
[Min,Max]TimeInMatrixElems = [ 7.696500e-04 , 7.696500e-04 ] sec
-----------------------------------------------------------------------
TotalEventsComputed = 524288 (nan=0)
EvtsPerSec[Rnd+Rmb+ME](123)= ( 7.169321e+07 ) sec^-1
EvtsPerSec[Rmb+ME] (23)= ( 7.807926e+07 ) sec^-1
EvtsPerSec[MatrixElems] (3)= ( 6.812032e+08 ) sec^-1
***********************************************************************
NumMatrixElements(notNan) = 524288
MeanMatrixElemValue = ( 1.371958e-02 +- 1.132119e-05 ) GeV^0
[Min,Max]MatrixElemValue = [ 6.071582e-03 , 3.374915e-02 ] GeV^0
StdDevMatrixElemValue = ( 8.197419e-03 ) GeV^0
MeanWeight = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight = [ 4.515827e-01 , 4.515827e-01 ]
StdDevWeight = ( 0.000000e+00 )
***********************************************************************
(GPU) 00 CudaFree : 0.872603 sec
(GPU) 0a ProcInit : 0.000233 sec
(GPU) 0b MemAlloc : 0.035793 sec
(GPU) 0c GenCreat : 0.009814 sec
(GPU) 0d SGoodHel : 0.001759 sec
(GPU) 1a GenSeed : 0.000009 sec
(GPU) 1b GenRnGen : 0.000589 sec
(GPU) 2a RamboIni : 0.000022 sec
(GPU) 2b RamboFin : 0.000014 sec
(GPU) 2c CpDTHwgt : 0.000502 sec
(GPU) 2d CpDTHmom : 0.005407 sec
(GPU) 3a SigmaKin : 0.000014 sec
(GPU) 3b CpDTHmes : 0.000756 sec
(GPU) 4a DumpLoop : 0.004765 sec
(GPU) 8a CompStat : 0.003540 sec
(GPU) 9a GenDestr : 0.000048 sec
(GPU) 9b DumpScrn : 0.000044 sec
(GPU) 9c DumpJson : 0.000007 sec
(GPU) TOTAL : 0.935918 sec
(GPU) TOTAL (123) : 0.007313 sec
(GPU) TOTAL (23) : 0.006715 sec
(GPU) TOTAL (1) : 0.000598 sec
(GPU) TOTAL (2) : 0.005945 sec
(GPU) TOTAL (3) : 0.000770 sec
***********************************************************************
***********************************************************************
NumBlocksPerGrid = 16384
NumThreadsPerBlock = 32
NumIterations = 1
-----------------------------------------------------------------------
FP precision = DOUBLE
Complex type = STD::COMPLEX
RanNumb memory layout = AOSOA[4]
Momenta memory layout = AOSOA[4]
Random number generation = CURAND (C++ code)
OMP threads / maxthreads = 4 / 4
-----------------------------------------------------------------------
NumIterations = 1
TotalTime[Rnd+Rmb+ME] (123)= ( 5.725351e-01 ) sec
TotalTime[Rambo+ME] (23)= ( 5.449318e-01 ) sec
TotalTime[RndNumGen] (1)= ( 2.760323e-02 ) sec
TotalTime[Rambo] (2)= ( 9.914417e-02 ) sec
TotalTime[MatrixElems] (3)= ( 4.457877e-01 ) sec
MeanTimeInMatrixElems = ( 4.457877e-01 ) sec
[Min,Max]TimeInMatrixElems = [ 4.457877e-01 , 4.457877e-01 ] sec
-----------------------------------------------------------------------
TotalEventsComputed = 524288 (nan=0)
EvtsPerSec[Rnd+Rmb+ME](123)= ( 9.157308e+05 ) sec^-1
EvtsPerSec[Rmb+ME] (23)= ( 9.621167e+05 ) sec^-1
EvtsPerSec[MatrixElems] (3)= ( 1.176094e+06 ) sec^-1
***********************************************************************
NumMatrixElements(notNan) = 524288
MeanMatrixElemValue = ( 1.371958e-02 +- 1.132119e-05 ) GeV^0
[Min,Max]MatrixElemValue = [ 6.071582e-03 , 3.374915e-02 ] GeV^0
StdDevMatrixElemValue = ( 8.197419e-03 ) GeV^0
MeanWeight = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight = [ 4.515827e-01 , 4.515827e-01 ]
StdDevWeight = ( 0.000000e+00 )
***********************************************************************
(CPU) 0a ProcInit : 0.000331 sec
(CPU) 0b MemAlloc : 0.025358 sec
(CPU) 0c GenCreat : 0.000915 sec
(CPU) 1a GenSeed : 0.000009 sec
(CPU) 1b GenRnGen : 0.027595 sec
(CPU) 2a RamboIni : 0.006872 sec
(CPU) 2b RamboFin : 0.092273 sec
(CPU) 3a SigmaKin : 0.445788 sec
(CPU) 4a DumpLoop : 0.004605 sec
(CPU) 8a CompStat : 0.003633 sec
(CPU) 9a GenDestr : 0.000094 sec
(CPU) 9b DumpScrn : 0.004946 sec
(CPU) 9c DumpJson : 0.000008 sec
(CPU) TOTAL : 0.612425 sec
(CPU) TOTAL (123) : 0.572535 sec
(CPU) TOTAL (23) : 0.544932 sec
(CPU) TOTAL (1) : 0.027603 sec
(CPU) TOTAL (2) : 0.099144 sec
(CPU) TOTAL (3) : 0.445788 sec
***********************************************************************
-----------------------------------------------------------------------
TotalTime[Rnd+Rmb+ME] (123)= ( 5.798480e-01 ) sec
TotalTime[Rambo+ME] (23)= ( 5.516467e-01 ) sec
TotalTime[RndNumGen] (1)= ( 2.820135e-02 ) sec
TotalTime[Rambo] (2)= ( 1.050893e-01 ) sec
TotalTime[MatrixElems] (3)= ( 4.465573e-01 ) sec
-----------------------------------------------------------------------
TotalEventsComputed = 1048576
EvtsPerSec[Rnd+Rmb+ME](123)= ( 1.808364e+06 ) sec^-1
EvtsPerSec[Rmb+ME] (23)= ( 1.900811e+06 ) sec^-1
EvtsPerSec[MatrixElems] (3)= ( 2.348133e+06 ) sec^-1
-----------------------------------------------------------------------
This was referenced Dec 4, 2020
This makes me realise the calculation is clearly wrong: one should add throughputs, not times (the wall time is the same on CPU and GPU!)
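A minimal sketch of the arithmetic behind this remark, using the Rnd+Rmb+ME (123) numbers from the single-iteration output above. The variable and function names are illustrative only and do not come from check.cc. Dividing the total number of events by the sum of the two times treats the CPU and GPU runs as if they were sequential, and reproduces the combined figure of about 1.8e+06 events/sec printed at the end of that output. Since the two runs overlap in wall time, adding the two throughputs is the more sensible estimate.

```python
# Numbers copied from the single-iteration hcheck.exe output (Rnd+Rmb+ME, "123").
gpu_events, gpu_time = 524288, 7.312938e-03  # events, seconds
cpu_events, cpu_time = 524288, 5.725351e-01  # events, seconds


def throughput(events, seconds):
    """Events per second for one device."""
    return events / seconds


# Questionable: sums the times, as if the CPU and GPU ran one after the other.
rate_from_summed_times = (gpu_events + cpu_events) / (gpu_time + cpu_time)

# Better for concurrent runs: each device contributes its own throughput.
rate_from_summed_throughputs = (
    throughput(gpu_events, gpu_time) + throughput(cpu_events, cpu_time)
)

print(f"summed times:       {rate_from_summed_times:.6e} events/sec")
print(f"summed throughputs: {rate_from_summed_throughputs:.6e} events/sec")
```

With these inputs the summed-times figure is dominated by the slow CPU run, while the summed-throughputs figure is dominated by the GPU, as one would expect when both devices work at the same time.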
Decrease the GPU multiplier from 100 to 70 (itscrd03, with 4 OMP threads).
****************************************************************************
(GPU) NumBlocksPerGrid = 16384
(GPU) NumThreadsPerBlock = 32
(GPU) NumIterations = 700
----------------------------------------------------------------------------
(GPU) FP precision = DOUBLE
(GPU) Complex type = THRUST::COMPLEX
(GPU) RanNumb memory layout = AOSOA[4]
(GPU) Momenta memory layout = AOSOA[4]
(GPU) Random number generation = CURAND DEVICE (CUDA code)
(GPU) Wavefunction GPU memory = LOCAL
----------------------------------------------------------------------------
(GPU) NumIterations = 700
(GPU) TotalTime[Rnd+Rmb+ME] (123)= ( 5.835859e+00 ) sec
(GPU) TotalTime[Rambo+ME] (23)= ( 5.378664e+00 ) sec
(GPU) TotalTime[RndNumGen] (1)= ( 4.571957e-01 ) sec
(GPU) TotalTime[Rambo] (2)= ( 4.823339e+00 ) sec
(GPU) TotalTime[MatrixElems] (3)= ( 5.553249e-01 ) sec
(GPU) MeanTimeInMatrixElems = ( 7.933214e-04 ) sec
(GPU) [Min,Max]TimeInMatrixElems = [ 7.126600e-04 , 1.121232e-02 ] sec
----------------------------------------------------------------------------
(GPU) TotalEventsComputed = 367001600 (nan=0)
(GPU) EvtsPerSec[Rnd+Rmb+ME](123)= ( 6.288733e+07 ) sec^-1
(GPU) EvtsPerSec[Rmb+ME] (23)= ( 6.823286e+07 ) sec^-1
(GPU) EvtsPerSec[MatrixElems] (3)= ( 6.608772e+08 ) sec^-1
****************************************************************************
(GPU) NumMatrixElements(notNan) = 367001600
(GPU) MeanMatrixElemValue = ( 1.371705e-02 +- 4.280686e-07 ) GeV^0
(GPU) [Min,Max]MatrixElemValue = [ 6.071582e-03 , 3.374926e-02 ] GeV^0
(GPU) StdDevMatrixElemValue = ( 8.200632e-03 ) GeV^0
(GPU) MeanWeight = ( 4.515827e-01 +- 0.000000e+00 )
(GPU) [Min,Max]Weight = [ 4.515827e-01 , 4.515827e-01 ]
(GPU) StdDevWeight = ( 0.000000e+00 )
****************************************************************************
(GPU) 00 CudaFree : 1.040999 sec
(GPU) 0a ProcInit : 0.000255 sec
(GPU) 0b MemAlloc : 0.043748 sec
(GPU) 0c GenCreat : 0.035068 sec
(GPU) 0d SGoodHel : 0.001759 sec
(GPU) 1a GenSeed : 0.005864 sec
(GPU) 1b GenRnGen : 0.451332 sec
(GPU) 2a RamboIni : 0.010404 sec
(GPU) 2b RamboFin : 0.008416 sec
(GPU) 2c CpDTHwgt : 0.369433 sec
(GPU) 2d CpDTHmom : 4.435089 sec
(GPU) 3a SigmaKin : 0.010518 sec
(GPU) 3b CpDTHmes : 0.544807 sec
(GPU) 4a DumpLoop : 2.404587 sec
(GPU) 8a CompStat : 2.632091 sec
(GPU) 9a GenDestr : 0.000196 sec
(GPU) 9b DumpScrn : 0.000071 sec
(GPU) 9c DumpJson : 0.000007 sec
(GPU) TOTAL : 11.994643 sec
(GPU) TOTAL (123) : 5.835862 sec
(GPU) TOTAL (23) : 5.378666 sec
(GPU) TOTAL (1) : 0.457196 sec
(GPU) TOTAL (2) : 4.823342 sec
(GPU) TOTAL (3) : 0.555325 sec
****************************************************************************
****************************************************************************
(CPU) NumBlocksPerGrid = 16384
(CPU) NumThreadsPerBlock = 32
(CPU) NumIterations = 10
----------------------------------------------------------------------------
(CPU) FP precision = DOUBLE
(CPU) Complex type = STD::COMPLEX
(CPU) RanNumb memory layout = AOSOA[4]
(CPU) Momenta memory layout = AOSOA[4]
(CPU) Random number generation = CURAND (C++ code)
(CPU) OMP threads / maxthreads = 4 / 4
----------------------------------------------------------------------------
(CPU) NumIterations = 10
(CPU) TotalTime[Rnd+Rmb+ME] (123)= ( 5.245397e+00 ) sec
(CPU) TotalTime[Rambo+ME] (23)= ( 4.964864e+00 ) sec
(CPU) TotalTime[RndNumGen] (1)= ( 2.805332e-01 ) sec
(CPU) TotalTime[Rambo] (2)= ( 1.005407e+00 ) sec
(CPU) TotalTime[MatrixElems] (3)= ( 3.959456e+00 ) sec
(CPU) MeanTimeInMatrixElems = ( 3.959456e-01 ) sec
(CPU) [Min,Max]TimeInMatrixElems = [ 2.837697e-01 , 4.298637e-01 ] sec
----------------------------------------------------------------------------
(CPU) TotalEventsComputed = 5242880 (nan=0)
(CPU) EvtsPerSec[Rnd+Rmb+ME](123)= ( 9.995201e+05 ) sec^-1
(CPU) EvtsPerSec[Rmb+ME] (23)= ( 1.055997e+06 ) sec^-1
(CPU) EvtsPerSec[MatrixElems] (3)= ( 1.324141e+06 ) sec^-1
****************************************************************************
(CPU) NumMatrixElements(notNan) = 5242880
(CPU) MeanMatrixElemValue = ( 1.372304e-02 +- 3.581814e-06 ) GeV^0
(CPU) [Min,Max]MatrixElemValue = [ 6.071582e-03 , 3.374925e-02 ] GeV^0
(CPU) StdDevMatrixElemValue = ( 8.201400e-03 ) GeV^0
(CPU) MeanWeight = ( 4.515827e-01 +- 0.000000e+00 )
(CPU) [Min,Max]Weight = [ 4.515827e-01 , 4.515827e-01 ]
(CPU) StdDevWeight = ( 0.000000e+00 )
****************************************************************************
(CPU) 0a ProcInit : 0.015581 sec
(CPU) 0b MemAlloc : 0.028460 sec
(CPU) 0c GenCreat : 0.000978 sec
(CPU) 1a GenSeed : 0.000084 sec
(CPU) 1b GenRnGen : 0.280449 sec
(CPU) 2a RamboIni : 0.072770 sec
(CPU) 2b RamboFin : 0.932637 sec
(CPU) 3a SigmaKin : 3.959456 sec
(CPU) 4a DumpLoop : 0.034727 sec
(CPU) 8a CompStat : 0.038073 sec
(CPU) 9a GenDestr : 0.000141 sec
(CPU) 9b DumpScrn : 0.061204 sec
(CPU) 9c DumpJson : 0.000011 sec
(CPU) TOTAL : 5.424570 sec
(CPU) TOTAL (123) : 5.245397 sec
(CPU) TOTAL (23) : 4.964864 sec
(CPU) TOTAL (1) : 0.280533 sec
(CPU) TOTAL (2) : 1.005407 sec
(CPU) TOTAL (3) : 3.959456 sec
****************************************************************************
----------------------------------------------------------------------------
(GPU) TotalEventsComputed = 367001600
(GPU) TotalTime[Rnd+Rmb+ME] (123)= ( 5.835859e+00 ) sec
(GPU) TotalTime[Rambo+ME] (23)= ( 5.378664e+00 ) sec
(GPU) TotalTime[RndNumGen] (1)= ( 4.571957e-01 ) sec
(GPU) TotalTime[Rambo] (2)= ( 4.823339e+00 ) sec
(GPU) TotalTime[MatrixElems] (3)= ( 5.553249e-01 ) sec
----------------------------------------------------------------------------
(GPU) TotalEventsComputed = 367001600
(GPU) EvtsPerSec[Rnd+Rmb+ME](123)= ( 6.288733e+07 ) sec^-1
(GPU) EvtsPerSec[Rmb+ME] (23)= ( 6.823286e+07 ) sec^-1
(GPU) EvtsPerSec[MatrixElems] (3)= ( 6.608772e+08 ) sec^-1
****************************************************************************
----------------------------------------------------------------------------
(CPU) TotalEventsComputed = 5242880
(CPU) TotalTime[Rnd+Rmb+ME] (123)= ( 5.245397e+00 ) sec
(CPU) TotalTime[Rambo+ME] (23)= ( 4.964864e+00 ) sec
(CPU) TotalTime[RndNumGen] (1)= ( 2.805332e-01 ) sec
(CPU) TotalTime[Rambo] (2)= ( 1.005407e+00 ) sec
(CPU) TotalTime[MatrixElems] (3)= ( 3.959456e+00 ) sec
----------------------------------------------------------------------------
(CPU) TotalEventsComputed = 5242880
(CPU) EvtsPerSec[Rnd+Rmb+ME](123)= ( 9.995201e+05 ) sec^-1
(CPU) EvtsPerSec[Rmb+ME] (23)= ( 1.055997e+06 ) sec^-1
(CPU) EvtsPerSec[MatrixElems] (3)= ( 1.324141e+06 ) sec^-1
****************************************************************************
(HET) TotalEventsComputed = 372244480
(HET) EvtsPerSec[Rnd+Rmb+ME](123)= ( 6.388685e+07 ) sec^-1
(HET) EvtsPerSec[Rmb+ME] (23)= ( 6.928886e+07 ) sec^-1
(HET) EvtsPerSec[MatrixElems] (3)= ( 6.622013e+08 ) sec^-1
****************************************************************************
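As a rough cross-check of the output above, the sketch below recomputes the event counts from the grid size and iteration counts, checks that the (HET) throughput is the sum of the (GPU) and (CPU) throughputs, and estimates which multiplier would equalise the two wall times. All names are illustrative. The balanced-multiplier estimate assumes that each device's throughput does not depend on the number of iterations, which is a simplification.

```python
# Grid size and iteration counts as printed in the output above.
blocks, threads = 16384, 32
gpu_iterations, cpu_iterations = 700, 10

events_per_iteration = blocks * threads
gpu_events = events_per_iteration * gpu_iterations
cpu_events = events_per_iteration * cpu_iterations
multiplier = gpu_iterations // cpu_iterations  # GPU iterations per CPU iteration

# Rnd+Rmb+ME (123) throughputs from the output, in events/sec.
gpu_rate, cpu_rate = 6.288733e+07, 9.995201e+05

# The heterogeneous throughput is the sum of the two device throughputs.
het_rate = gpu_rate + cpu_rate

# Assumption: throughput is independent of the iteration count, so the
# wall times match when the work is split in the ratio of the throughputs.
balanced_multiplier = gpu_rate / cpu_rate

print(f"GPU events: {gpu_events}, CPU events: {cpu_events}, multiplier: {multiplier}")
print(f"HET rate: {het_rate:.6e} events/sec")
print(f"balanced multiplier: {balanced_multiplier:.1f}")
```

Under that assumption the estimate comes out somewhat below 70, which is consistent with the GPU still taking slightly longer (5.84 s) than the CPU (5.25 s) in this run.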
I have just pushed more changes.
More is needed eventually, but I think this can be considered for merging. I have removed the draft status. This is an example
PS Consider approving/merging #84 first (OMP multithreading), as that is included here too.
Fix conflicts in epoch1/cuda/ee_mumu/SubProcesses/P1_Sigma_sm_epem_mupmum/check.cc
Fix conflicts in epoch1/cuda/ee_mumu/SubProcesses/Makefile
Fix conflicts in epoch1/cuda/ee_mumu/SubProcesses/P1_Sigma_sm_epem_mupmum/check.cc
Fix conflicts: epoch1/cuda/ee_mumu/SubProcesses/P1_Sigma_sm_epem_mupmum/check.cc
Fix conflicts: epoch1/cuda/ee_mumu/SubProcesses/P1_Sigma_sm_epem_mupmum/check.cc
Hm, I deleted and recreated the branch; I thought this would be picked up. I am reopening this as
Fix conflicts: epoch1/cuda/ee_mumu/SubProcesses/P1_Sigma_sm_epem_mupmum/check.cc
This was referenced Apr 2, 2021
See a more complete description in #85
I created a simple prototype. The point here was mainly a proof of concept, also trying to sort out the build (which may be useful for addressing #83).
The current prototype runs exactly the same number of events, with exactly the same random numbers, in parallel on the CPU (with OMP threads) and on the GPU. Both computations produce the same set of events, and the two sets are not yet combined. As the GPU is much faster, the net effect is essentially a computation that lasts as long as the CPU version but processes twice as many events (because the same events are also computed on the GPU), so the throughput doubles.
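The idea can be sketched in a few lines. This is only an illustration under stated assumptions, not the code in this PR: the two workers are plain Python threads standing in for the OMP-threaded CPU path and the GPU path, and `generate_events` is a hypothetical placeholder for random number generation, Rambo and the matrix element calculation.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def generate_events(seed, n_events):
    """Hypothetical stand-in for one device's event loop.

    Each call owns its random generator, so the same seed always
    gives the same sequence of "events".
    """
    rng = random.Random(seed)
    return [rng.random() for _ in range(n_events)]


def run_heterogeneous(seed, n_events):
    """Run the same workload on two workers at the same time."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        cpu_future = pool.submit(generate_events, seed, n_events)  # "CPU" path
        gpu_future = pool.submit(generate_events, seed, n_events)  # "GPU" path
        return cpu_future.result(), gpu_future.result()


cpu_events, gpu_events = run_heterogeneous(seed=1234, n_events=1000)

# Same seed and same count on both sides: the two event sets are identical,
# and they are simply counted twice rather than combined.
identical = cpu_events == gpu_events
total_events = len(cpu_events) + len(gpu_events)
print(f"identical: {identical}, total events counted: {total_events}")
```

Because both workers start from the same seed, the second set of events adds no new information. A real heterogeneous run would presumably give each device a different slice of the random number sequence.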
This clearly needs a lot more work (the optimization in particular is tricky), but it is a useful proof of concept.
PS I forgot to mention: this includes and supersedes #82. As in that one, the build of runTest (#83) is disabled here. I would suggest, however, fixing it after including these changes, which give a possible direction for how to combine cuda and c++ modules. One option for #83 is to keep a single runTest.exe as it is now, but make it much clearer that the real modules are built either as c++ with gcc or as cuda with nvcc, and that only thin layers integrating both are compiled with nvcc. Essentially, I only had to add -lgomp to build the combined module; the same is probably ok for runTest.