Implement multi-threading (#82). Uses OpenMP as suggested by @hageboeck. #84
The build fails because runTest.exe mixes c++ and cuda objects (madgraph5#83):

    /cvmfs/sft.cern.ch/lcg/releases/gcc/8.3.0-cebb0/x86_64-centos7/bin/g++ -O3 -std=c++11 -I. -I../../src -I../../../../../tools/ -I../../../../../tools//googletest/googletest/include/ -DUSE_NVTX -Wall -Wshadow -fopenmp -I/usr/local/cuda-11.1/include/ -c runTest.cc -o runTest.o
    ln -sf runTest.cc runTest_tmp.cu
    /usr/local/cuda-11.1/bin/nvcc -o runTest.exe CPPProcess.o runTest.o gCPPProcess.o runTest_tmp.cu -O3 -std=c++14 -I. -I../../src -I../../../../../tools/ -I../../../../../tools//googletest/googletest/include/ -I/usr/local/cuda-11.1/include/ -DUSE_NVTX -arch=compute_70 -use_fast_math -lineinfo -ldl -L../../lib -lmodel_sm -L../../../../../tools//googletest/build/lib// -lgtest -lgtest_main -L/usr/local/cuda-11.1/lib64/ -lcuda -lcurand -lcuda
    CPPProcess.o: In function `Proc::sigmaKin(double const*, double*, int) [clone ._omp_fn.0]':
    CPPProcess.cc:(.text+0x2312): undefined reference to `omp_get_num_threads'
    CPPProcess.cc:(.text+0x2319): undefined reference to `omp_get_thread_num'
    CPPProcess.o: In function `Proc::sigmaKin(double const*, double*, int)':
    CPPProcess.cc:(.text+0x263f): undefined reference to `GOMP_parallel'
    collect2: error: ld returned 1 exit status
    make: *** [Makefile:98: runTest.exe] Error 1
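For context on the undefined references above: when a file containing an OpenMP pragma is compiled with `-fopenmp`, the compiler outlines the loop body and emits calls into the OpenMP runtime (such as `GOMP_parallel` with GCC), so every link step that includes that object also needs the runtime. With nvcc, one common way is to pass `-Xcompiler -fopenmp` at link time. The sketch below is only an illustration of that kind of loop, not the actual `sigmaKin` code: `toyMatrixElement` and `computeAllEvents` are hypothetical names.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a per-event calculation (not the real matrix element).
inline double toyMatrixElement(double momentum) { return momentum * momentum; }

// Fills out[ievt] for every event. Iterations are independent of each other,
// so a single pragma is enough to spread them over the available threads.
// Without -fopenmp the pragma is ignored and the loop simply runs serially.
void computeAllEvents(const std::vector<double>& momenta, std::vector<double>& out)
{
  assert(momenta.size() == out.size());
  const long nevt = static_cast<long>(momenta.size());
#pragma omp parallel for
  for (long ievt = 0; ievt < nevt; ++ievt)
    out[ievt] = toyMatrixElement(momenta[ievt]);
}

// Reports whether this translation unit was compiled with OpenMP enabled.
// The _OPENMP macro is defined by the compiler only when OpenMP is switched on.
bool builtWithOpenMP()
{
#ifdef _OPENMP
  return true;
#else
  return false;
#endif
}
```

Because the loop does not call any `omp_*` function directly, the same source still compiles and links when OpenMP is switched off.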
This allows testing multithreading (madgraph5#82) using OpenMP as suggested by @hageboeck. Very nice, one line immediately gained a factor 4 in throughput! This is itscrd03 (a VM), note `nproc` is 1, but `nproc --all` is 4 (difference?). Going even further, eg to 8, remains flat in throughput:

    1: EvtsPerSec[MatrixElems] (3)= ( 3.841438e+05 ) sec^-1
    4: EvtsPerSec[MatrixElems] (3)= ( 1.515065e+06 ) sec^-1
    8: EvtsPerSec[MatrixElems] (3)= ( 1.516047e+06 ) sec^-1

export OMP_NUM_THREADS=1; ./check.exe -p 16384 32 1

    ***********************************************************************
    NumBlocksPerGrid = 16384
    NumThreadsPerBlock = 32
    NumIterations = 1
    -----------------------------------------------------------------------
    FP precision = DOUBLE (nan=0)
    Complex type = STD::COMPLEX
    RanNumb memory layout = AOSOA[4]
    Momenta memory layout = AOSOA[4]
    Random number generation = CURAND (C++ code)
    -----------------------------------------------------------------------
    NumberOfEntries = 1
    TotalTime[Rnd+Rmb+ME] (123)= ( 1.491819e+00 ) sec
    TotalTime[Rambo+ME] (23)= ( 1.464077e+00 ) sec
    TotalTime[RndNumGen] (1)= ( 2.774263e-02 ) sec
    TotalTime[Rambo] (2)= ( 9.925424e-02 ) sec
    TotalTime[MatrixElems] (3)= ( 1.364822e+00 ) sec
    MeanTimeInMatrixElems = ( 1.364822e+00 ) sec
    [Min,Max]TimeInMatrixElems = [ 1.364822e+00 , 1.364822e+00 ] sec
    -----------------------------------------------------------------------
    TotalEventsComputed = 524288
    EvtsPerSec[Rnd+Rmb+ME](123)= ( 3.514421e+05 ) sec^-1
    EvtsPerSec[Rmb+ME] (23)= ( 3.581015e+05 ) sec^-1
    EvtsPerSec[MatrixElems] (3)= ( 3.841438e+05 ) sec^-1
    ***********************************************************************
    NumMatrixElements(notNan) = 524288
    MeanMatrixElemValue = ( 1.371958e-02 +- 1.132119e-05 ) GeV^0
    [Min,Max]MatrixElemValue = [ 6.071582e-03 , 3.374915e-02 ] GeV^0
    StdDevMatrixElemValue = ( 8.197419e-03 ) GeV^0
    MeanWeight = ( 4.515827e-01 +- 0.000000e+00 )
    [Min,Max]Weight = [ 4.515827e-01 , 4.515827e-01 ]
    StdDevWeight = ( 0.000000e+00 )
    ***********************************************************************
    0a ProcInit : 0.000371 sec
    0b MemAlloc : 0.027894 sec
    0c GenCreat : 0.000852 sec
    1a GenSeed : 0.000010 sec
    1b GenRnGen : 0.027733 sec
    2a RamboIni : 0.006828 sec
    2b RamboFin : 0.092427 sec
    3a SigmaKin : 1.364822 sec
    4a DumpLoop : 0.004408 sec
    8a CompStat : 0.003551 sec
    9a GenDestr : 0.000109 sec
    9b DumpScrn : 0.000214 sec
    9c DumpJson : 0.000007 sec
    TOTAL : 1.529226 sec
    TOTAL (123) : 1.491819 sec
    TOTAL (23) : 1.464077 sec
    TOTAL (1) : 0.027743 sec
    TOTAL (2) : 0.099254 sec
    TOTAL (3) : 1.364822 sec
    ***********************************************************************

export OMP_NUM_THREADS=4; ./check.exe -p 16384 32 1

    ***********************************************************************
    NumBlocksPerGrid = 16384
    NumThreadsPerBlock = 32
    NumIterations = 1
    -----------------------------------------------------------------------
    FP precision = DOUBLE (nan=0)
    Complex type = STD::COMPLEX
    RanNumb memory layout = AOSOA[4]
    Momenta memory layout = AOSOA[4]
    Random number generation = CURAND (C++ code)
    -----------------------------------------------------------------------
    NumberOfEntries = 1
    TotalTime[Rnd+Rmb+ME] (123)= ( 4.734046e-01 ) sec
    TotalTime[Rambo+ME] (23)= ( 4.457583e-01 ) sec
    TotalTime[RndNumGen] (1)= ( 2.764625e-02 ) sec
    TotalTime[Rambo] (2)= ( 9.970848e-02 ) sec
    TotalTime[MatrixElems] (3)= ( 3.460498e-01 ) sec
    MeanTimeInMatrixElems = ( 3.460498e-01 ) sec
    [Min,Max]TimeInMatrixElems = [ 3.460498e-01 , 3.460498e-01 ] sec
    -----------------------------------------------------------------------
    TotalEventsComputed = 524288
    EvtsPerSec[Rnd+Rmb+ME](123)= ( 1.107484e+06 ) sec^-1
    EvtsPerSec[Rmb+ME] (23)= ( 1.176171e+06 ) sec^-1
    EvtsPerSec[MatrixElems] (3)= ( 1.515065e+06 ) sec^-1
    ***********************************************************************
    NumMatrixElements(notNan) = 524288
    MeanMatrixElemValue = ( 1.371958e-02 +- 1.132119e-05 ) GeV^0
    [Min,Max]MatrixElemValue = [ 6.071582e-03 , 3.374915e-02 ] GeV^0
    StdDevMatrixElemValue = ( 8.197419e-03 ) GeV^0
    MeanWeight = ( 4.515827e-01 +- 0.000000e+00 )
    [Min,Max]Weight = [ 4.515827e-01 , 4.515827e-01 ]
    StdDevWeight = ( 0.000000e+00 )
    ***********************************************************************
    0a ProcInit : 0.000325 sec
    0b MemAlloc : 0.027604 sec
    0c GenCreat : 0.000858 sec
    1a GenSeed : 0.000008 sec
    1b GenRnGen : 0.027638 sec
    2a RamboIni : 0.006976 sec
    2b RamboFin : 0.092732 sec
    3a SigmaKin : 0.346050 sec
    4a DumpLoop : 0.004402 sec
    8a CompStat : 0.003604 sec
    9a GenDestr : 0.000070 sec
    9b DumpScrn : 0.000225 sec
    9c DumpJson : 0.000007 sec
    TOTAL : 0.510500 sec
    TOTAL (123) : 0.473405 sec
    TOTAL (23) : 0.445758 sec
    TOTAL (1) : 0.027646 sec
    TOTAL (2) : 0.099708 sec
    TOTAL (3) : 0.346050 sec
    ***********************************************************************

export OMP_NUM_THREADS=8; ./check.exe -p 16384 32 1

    ***********************************************************************
    NumBlocksPerGrid = 16384
    NumThreadsPerBlock = 32
    NumIterations = 1
    -----------------------------------------------------------------------
    FP precision = DOUBLE (nan=0)
    Complex type = STD::COMPLEX
    RanNumb memory layout = AOSOA[4]
    Momenta memory layout = AOSOA[4]
    Random number generation = CURAND (C++ code)
    -----------------------------------------------------------------------
    NumberOfEntries = 1
    TotalTime[Rnd+Rmb+ME] (123)= ( 4.726587e-01 ) sec
    TotalTime[Rambo+ME] (23)= ( 4.449646e-01 ) sec
    TotalTime[RndNumGen] (1)= ( 2.769402e-02 ) sec
    TotalTime[Rambo] (2)= ( 9.913886e-02 ) sec
    TotalTime[MatrixElems] (3)= ( 3.458258e-01 ) sec
    MeanTimeInMatrixElems = ( 3.458258e-01 ) sec
    [Min,Max]TimeInMatrixElems = [ 3.458258e-01 , 3.458258e-01 ] sec
    -----------------------------------------------------------------------
    TotalEventsComputed = 524288
    EvtsPerSec[Rnd+Rmb+ME](123)= ( 1.109232e+06 ) sec^-1
    EvtsPerSec[Rmb+ME] (23)= ( 1.178269e+06 ) sec^-1
    EvtsPerSec[MatrixElems] (3)= ( 1.516047e+06 ) sec^-1
    ***********************************************************************
    NumMatrixElements(notNan) = 524288
    MeanMatrixElemValue = ( 1.371958e-02 +- 1.132119e-05 ) GeV^0
    [Min,Max]MatrixElemValue = [ 6.071582e-03 , 3.374915e-02 ] GeV^0
    StdDevMatrixElemValue = ( 8.197419e-03 ) GeV^0
    MeanWeight = ( 4.515827e-01 +- 0.000000e+00 )
    [Min,Max]Weight = [ 4.515827e-01 , 4.515827e-01 ]
    StdDevWeight = ( 0.000000e+00 )
    ***********************************************************************
    0a ProcInit : 0.000345 sec
    0b MemAlloc : 0.027377 sec
    0c GenCreat : 0.000850 sec
    1a GenSeed : 0.000008 sec
    1b GenRnGen : 0.027686 sec
    2a RamboIni : 0.006767 sec
    2b RamboFin : 0.092372 sec
    3a SigmaKin : 0.345826 sec
    4a DumpLoop : 0.004346 sec
    8a CompStat : 0.003605 sec
    9a GenDestr : 0.000067 sec
    9b DumpScrn : 0.000208 sec
    9c DumpJson : 0.000008 sec
    TOTAL : 0.509464 sec
    TOTAL (123) : 0.472659 sec
    TOTAL (23) : 0.444965 sec
    TOTAL (1) : 0.027694 sec
    TOTAL (2) : 0.099139 sec
    TOTAL (3) : 0.345826 sec
    ***********************************************************************
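To make the numbers in the logs above easier to compare, here is a small worked calculation of speedup and parallel efficiency. The timings are copied from the `OMP_NUM_THREADS=1` and `OMP_NUM_THREADS=4` runs; the helper names `speedup` and `efficiency` are hypothetical and exist only for this illustration.

```cpp
#include <cassert>

// Speedup of a parallel run relative to a baseline, from elapsed times in seconds.
double speedup(double baselineSec, double parallelSec)
{
  assert(baselineSec > 0 && parallelSec > 0);
  return baselineSec / parallelSec;
}

// Parallel efficiency: speedup divided by the number of threads used.
double efficiency(double baselineSec, double parallelSec, int nThreads)
{
  assert(nThreads > 0);
  return speedup(baselineSec, parallelSec) / nThreads;
}

// Timings copied from the logs above.
const double sigmaKin1 = 1.364822; // "3a SigmaKin", 1 thread
const double sigmaKin4 = 0.346050; // "3a SigmaKin", 4 threads
const double total1 = 1.491819;    // "TOTAL (123)", 1 thread
const double total4 = 0.473405;    // "TOTAL (123)", 4 threads
```

With these inputs the matrix-element step alone speeds up by a factor close to 3.9, while the full Rnd+Rmb+ME chain gains a factor close to 3.2. The random number generation and Rambo steps take roughly the same time in both runs (about 0.13 s combined), which is why the overall gain is smaller. The flat throughput from 4 to 8 threads is at least consistent with `nproc --all` reporting 4.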
This PR includes two things:
It is not surprising that the CI reports a failed check: it tries to execute runTest.exe, but runTest.exe was not built. It would be best to fix #83 first, by building two separate sets of .o and .exe files for C++ and CUDA.
I have now also added a printout. And I disabled OMP if OMP_NUM_THREADS is not set (for backward compatibility, and to make the behaviour more explicit).
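A minimal sketch of the "single-threaded unless OMP_NUM_THREADS is set" policy described above. This is an assumption about how such a fallback could look, not the code in this PR: `resolveNumThreads` and `configureThreads` are hypothetical names, and the parsing rules (non-numeric or non-positive values fall back to 1) are a simplification chosen for the example.

```cpp
#include <cassert>
#include <cstdlib>
#include <limits>

#ifdef _OPENMP
#include <omp.h> // only present when compiling with -fopenmp
#endif

// Returns the thread count to use: the leading integer of the OMP_NUM_THREADS
// value if it is positive, otherwise 1 (assumed fallback for this sketch).
int resolveNumThreads(const char* envValue)
{
  if (envValue == nullptr) return 1; // variable not set: stay single-threaded
  char* end = nullptr;
  const long n = std::strtol(envValue, &end, 10);
  if (end == envValue || n < 1) return 1; // not a number, or not positive
  if (n > std::numeric_limits<int>::max()) return std::numeric_limits<int>::max();
  return static_cast<int>(n);
}

// Applies the policy to the OpenMP runtime (if any) and returns the value used.
int configureThreads()
{
  const int nThreads = resolveNumThreads(std::getenv("OMP_NUM_THREADS"));
  assert(nThreads >= 1);
#ifdef _OPENMP
  omp_set_num_threads(nThreads);
#endif
  return nThreads;
}
```

Calling `configureThreads()` once at startup, before the first parallel region, would be enough for this scheme. As a side note, GNU `nproc` without `--all` is documented to take OMP_NUM_THREADS into account, which may be one reason why `nproc` and `nproc --all` can report different values.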
@valassi to use your own words against you, the x4 speedup is not correct. You have to compare with 4 instances of the single-threaded code :) I believe that is what you said to me when we were discussing my performance figures :)
Ah! Then I understand why we were not understanding each other! ;-) OK: a factor x4 speedup with 4 OMP threads, with respect to a single thread, all in one copy ;-) Anyway, this is precisely the type of study we do in the benchmarking WG, so we will certainly measure both options. The idea is to plot throughput vs "level of parallelism", whichever way you achieve it, be it with 4 1xMT copies, with 1 4xMT copy, or with 2 2xMT copies (then typically the three dots more or less overlap on throughput vs parallelism plots).
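To illustrate the "level of parallelism" idea in the comment above, the following sketch enumerates the ways of splitting a given level of parallelism into independent copies times threads per copy. The function name `parallelConfigs` is hypothetical and the code is only a bookkeeping aid, not part of any benchmarking tool.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Lists every (copies, threadsPerCopy) pair whose product equals the
// requested level of parallelism, in order of increasing number of copies.
std::vector<std::pair<int, int>> parallelConfigs(int parallelism)
{
  assert(parallelism >= 1);
  std::vector<std::pair<int, int>> configs;
  for (int copies = 1; copies <= parallelism; ++copies)
    if (parallelism % copies == 0)
      configs.emplace_back(copies, parallelism / copies);
  return configs;
}
```

For a level of parallelism of 4 this yields the three configurations mentioned above: one copy with 4 threads, two copies with 2 threads each, and four single-threaded copies. Each of them would be one point at the same abscissa on a throughput vs parallelism plot.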
As far as I understand, the OpenMP version will run slightly slower due to the overhead of setting up the threads. In any case, as you pointed out, this is just the multi-core scenario, which should be well understood.
Fix conflicts in epoch1/cuda/ee_mumu/SubProcesses/Makefile (add both -Wextra and -fopenmp)
I have tested this code and it seems to work. I don't get a huge speedup: it is ~8.5s for 4M events with 1 OMP thread and 9.5s with 4 OMP threads. The code currently does not let you run with fewer than 4 threads, so maybe it is using more than 1 thread even if the number of OMP threads is set to 1. Looks good to merge.
It looks like OneAPI now supports OpenMP and offloading to supported devices |
hageboeck left a comment
Hi Andrea,
looks almost good to me.
The comment-out-comment-in commits in the Makefile should be squashed.
Maybe you can also remove all those merge commits, so the feature applies with only three commits or so, but that's mostly cosmetics.
I am finally merging this. Thanks for all the feedback. I also fixed a few bugs at the end. Reminder: if OMP_NUM_THREADS is not set, the number of threads is reset to 1, for backward compatibility. If you want to set it to the maximum, just check "nproc --all".