Implement multi-threading (#82). Uses OpenMP as suggested by @hageboeck. by valassi · Pull Request #84 · madgraph5/madgraph4gpu

valassi · 2020-12-03T16:10:15Z

No description provided.

The build fails because runTest.exe mixes C++ and CUDA objects (madgraph5#83):

/cvmfs/sft.cern.ch/lcg/releases/gcc/8.3.0-cebb0/x86_64-centos7/bin/g++ -O3 -std=c++11 -I. -I../../src -I../../../../../tools/ -I../../../../../tools//googletest/googletest/include/ -DUSE_NVTX -Wall -Wshadow -fopenmp -I/usr/local/cuda-11.1/include/ -c runTest.cc -o runTest.o
ln -sf runTest.cc runTest_tmp.cu
/usr/local/cuda-11.1/bin/nvcc -o runTest.exe CPPProcess.o runTest.o gCPPProcess.o runTest_tmp.cu -O3 -std=c++14 -I. -I../../src -I../../../../../tools/ -I../../../../../tools//googletest/googletest/include/ -I/usr/local/cuda-11.1/include/ -DUSE_NVTX -arch=compute_70 -use_fast_math -lineinfo -ldl -L../../lib -lmodel_sm -L../../../../../tools//googletest/build/lib// -lgtest -lgtest_main -L/usr/local/cuda-11.1/lib64/ -lcuda -lcurand -lcuda
CPPProcess.o: In function `Proc::sigmaKin(double const*, double*, int) [clone ._omp_fn.0]':
CPPProcess.cc:(.text+0x2312): undefined reference to `omp_get_num_threads'
CPPProcess.cc:(.text+0x2319): undefined reference to `omp_get_thread_num'
CPPProcess.o: In function `Proc::sigmaKin(double const*, double*, int)':
CPPProcess.cc:(.text+0x263f): undefined reference to `GOMP_parallel'
collect2: error: ld returned 1 exit status
make: *** [Makefile:98: runTest.exe] Error 1

@hageboeck

This allows testing multithreading (madgraph5#82) using OpenMP as suggested by @hageboeck. Very nice, one line immediately gained a factor 4 in throughput! This is itscrd03 (a VM), note `nproc` is 1, but `nproc --all` is 4 (difference?). Going even further, e.g. to 8, remains flat in throughput:

1: EvtsPerSec[MatrixElems] (3)= ( 3.841438e+05 ) sec^-1
4: EvtsPerSec[MatrixElems] (3)= ( 1.515065e+06 ) sec^-1
8: EvtsPerSec[MatrixElems] (3)= ( 1.516047e+06 ) sec^-1

export OMP_NUM_THREADS=1; ./check.exe -p 16384 32 1
***********************************************************************
NumBlocksPerGrid = 16384
NumThreadsPerBlock = 32
NumIterations = 1
-----------------------------------------------------------------------
FP precision = DOUBLE (nan=0)
Complex type = STD::COMPLEX
RanNumb memory layout = AOSOA[4]
Momenta memory layout = AOSOA[4]
Random number generation = CURAND (C++ code)
-----------------------------------------------------------------------
NumberOfEntries = 1
TotalTime[Rnd+Rmb+ME] (123)= ( 1.491819e+00 ) sec
TotalTime[Rambo+ME] (23)= ( 1.464077e+00 ) sec
TotalTime[RndNumGen] (1)= ( 2.774263e-02 ) sec
TotalTime[Rambo] (2)= ( 9.925424e-02 ) sec
TotalTime[MatrixElems] (3)= ( 1.364822e+00 ) sec
MeanTimeInMatrixElems = ( 1.364822e+00 ) sec
[Min,Max]TimeInMatrixElems = [ 1.364822e+00 , 1.364822e+00 ] sec
-----------------------------------------------------------------------
TotalEventsComputed = 524288
EvtsPerSec[Rnd+Rmb+ME](123)= ( 3.514421e+05 ) sec^-1
EvtsPerSec[Rmb+ME] (23)= ( 3.581015e+05 ) sec^-1
EvtsPerSec[MatrixElems] (3)= ( 3.841438e+05 ) sec^-1
***********************************************************************
NumMatrixElements(notNan) = 524288
MeanMatrixElemValue = ( 1.371958e-02 +- 1.132119e-05 ) GeV^0
[Min,Max]MatrixElemValue = [ 6.071582e-03 , 3.374915e-02 ] GeV^0
StdDevMatrixElemValue = ( 8.197419e-03 ) GeV^0
MeanWeight = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight = [ 4.515827e-01 , 4.515827e-01 ]
StdDevWeight = ( 0.000000e+00 )
***********************************************************************
0a ProcInit : 0.000371 sec
0b MemAlloc : 0.027894 sec
0c GenCreat : 0.000852 sec
1a GenSeed : 0.000010 sec
1b GenRnGen : 0.027733 sec
2a RamboIni : 0.006828 sec
2b RamboFin : 0.092427 sec
3a SigmaKin : 1.364822 sec
4a DumpLoop : 0.004408 sec
8a CompStat : 0.003551 sec
9a GenDestr : 0.000109 sec
9b DumpScrn : 0.000214 sec
9c DumpJson : 0.000007 sec
TOTAL : 1.529226 sec
TOTAL (123) : 1.491819 sec
TOTAL (23) : 1.464077 sec
TOTAL (1) : 0.027743 sec
TOTAL (2) : 0.099254 sec
TOTAL (3) : 1.364822 sec
***********************************************************************

export OMP_NUM_THREADS=4; ./check.exe -p 16384 32 1
***********************************************************************
NumBlocksPerGrid = 16384
NumThreadsPerBlock = 32
NumIterations = 1
-----------------------------------------------------------------------
FP precision = DOUBLE (nan=0)
Complex type = STD::COMPLEX
RanNumb memory layout = AOSOA[4]
Momenta memory layout = AOSOA[4]
Random number generation = CURAND (C++ code)
-----------------------------------------------------------------------
NumberOfEntries = 1
TotalTime[Rnd+Rmb+ME] (123)= ( 4.734046e-01 ) sec
TotalTime[Rambo+ME] (23)= ( 4.457583e-01 ) sec
TotalTime[RndNumGen] (1)= ( 2.764625e-02 ) sec
TotalTime[Rambo] (2)= ( 9.970848e-02 ) sec
TotalTime[MatrixElems] (3)= ( 3.460498e-01 ) sec
MeanTimeInMatrixElems = ( 3.460498e-01 ) sec
[Min,Max]TimeInMatrixElems = [ 3.460498e-01 , 3.460498e-01 ] sec
-----------------------------------------------------------------------
TotalEventsComputed = 524288
EvtsPerSec[Rnd+Rmb+ME](123)= ( 1.107484e+06 ) sec^-1
EvtsPerSec[Rmb+ME] (23)= ( 1.176171e+06 ) sec^-1
EvtsPerSec[MatrixElems] (3)= ( 1.515065e+06 ) sec^-1
***********************************************************************
NumMatrixElements(notNan) = 524288
MeanMatrixElemValue = ( 1.371958e-02 +- 1.132119e-05 ) GeV^0
[Min,Max]MatrixElemValue = [ 6.071582e-03 , 3.374915e-02 ] GeV^0
StdDevMatrixElemValue = ( 8.197419e-03 ) GeV^0
MeanWeight = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight = [ 4.515827e-01 , 4.515827e-01 ]
StdDevWeight = ( 0.000000e+00 )
***********************************************************************
0a ProcInit : 0.000325 sec
0b MemAlloc : 0.027604 sec
0c GenCreat : 0.000858 sec
1a GenSeed : 0.000008 sec
1b GenRnGen : 0.027638 sec
2a RamboIni : 0.006976 sec
2b RamboFin : 0.092732 sec
3a SigmaKin : 0.346050 sec
4a DumpLoop : 0.004402 sec
8a CompStat : 0.003604 sec
9a GenDestr : 0.000070 sec
9b DumpScrn : 0.000225 sec
9c DumpJson : 0.000007 sec
TOTAL : 0.510500 sec
TOTAL (123) : 0.473405 sec
TOTAL (23) : 0.445758 sec
TOTAL (1) : 0.027646 sec
TOTAL (2) : 0.099708 sec
TOTAL (3) : 0.346050 sec
***********************************************************************

export OMP_NUM_THREADS=8; ./check.exe -p 16384 32 1
***********************************************************************
NumBlocksPerGrid = 16384
NumThreadsPerBlock = 32
NumIterations = 1
-----------------------------------------------------------------------
FP precision = DOUBLE (nan=0)
Complex type = STD::COMPLEX
RanNumb memory layout = AOSOA[4]
Momenta memory layout = AOSOA[4]
Random number generation = CURAND (C++ code)
-----------------------------------------------------------------------
NumberOfEntries = 1
TotalTime[Rnd+Rmb+ME] (123)= ( 4.726587e-01 ) sec
TotalTime[Rambo+ME] (23)= ( 4.449646e-01 ) sec
TotalTime[RndNumGen] (1)= ( 2.769402e-02 ) sec
TotalTime[Rambo] (2)= ( 9.913886e-02 ) sec
TotalTime[MatrixElems] (3)= ( 3.458258e-01 ) sec
MeanTimeInMatrixElems = ( 3.458258e-01 ) sec
[Min,Max]TimeInMatrixElems = [ 3.458258e-01 , 3.458258e-01 ] sec
-----------------------------------------------------------------------
TotalEventsComputed = 524288
EvtsPerSec[Rnd+Rmb+ME](123)= ( 1.109232e+06 ) sec^-1
EvtsPerSec[Rmb+ME] (23)= ( 1.178269e+06 ) sec^-1
EvtsPerSec[MatrixElems] (3)= ( 1.516047e+06 ) sec^-1
***********************************************************************
NumMatrixElements(notNan) = 524288
MeanMatrixElemValue = ( 1.371958e-02 +- 1.132119e-05 ) GeV^0
[Min,Max]MatrixElemValue = [ 6.071582e-03 , 3.374915e-02 ] GeV^0
StdDevMatrixElemValue = ( 8.197419e-03 ) GeV^0
MeanWeight = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight = [ 4.515827e-01 , 4.515827e-01 ]
StdDevWeight = ( 0.000000e+00 )
***********************************************************************
0a ProcInit : 0.000345 sec
0b MemAlloc : 0.027377 sec
0c GenCreat : 0.000850 sec
1a GenSeed : 0.000008 sec
1b GenRnGen : 0.027686 sec
2a RamboIni : 0.006767 sec
2b RamboFin : 0.092372 sec
3a SigmaKin : 0.345826 sec
4a DumpLoop : 0.004346 sec
8a CompStat : 0.003605 sec
9a GenDestr : 0.000067 sec
9b DumpScrn : 0.000208 sec
9c DumpJson : 0.000008 sec
TOTAL : 0.509464 sec
TOTAL (123) : 0.472659 sec
TOTAL (23) : 0.444965 sec
TOTAL (1) : 0.027694 sec
TOTAL (2) : 0.099139 sec
TOTAL (3) : 0.345826 sec
***********************************************************************

valassi · 2020-12-03T16:34:41Z

This PR includes two things:

* a simple implementation of multithreading as proposed in "Implement parallel loops in C++ version of CUDA code" (#82): one line in cpp and one line in Makefile! It immediately achieves a x4 speedup on a 4-core machine, using OpenMP as suggested by @hageboeck with the build instructions of @lfield
* a workaround for "Improve #52: separate test executable for c++ and cuda" (#83), commenting out the build of runTest.exe
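The diff itself is not shown in this thread, so the following is only a minimal sketch of what a "one line" OpenMP change could look like, assuming the parallelised loop iterates over independent events (the link log above does show an OpenMP region inside `Proc::sigmaKin`). The names `computeOneEvent` and `computeAllEvents` are hypothetical and do not come from the repository.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the per-event matrix element calculation.
inline double computeOneEvent(double momentum)
{
  return momentum * momentum;
}

// Fill one output value per event. The events are independent, so the loop
// iterations can be shared among OpenMP threads with a single pragma.
// When the file is compiled without -fopenmp the pragma is ignored and the
// loop simply runs serially.
void computeAllEvents(const std::vector<double>& momenta,
                      std::vector<double>& matrixElements)
{
  assert(momenta.size() == matrixElements.size());
  const int nevt = static_cast<int>(momenta.size());
#pragma omp parallel for
  for (int ievt = 0; ievt < nevt; ++ievt)
  {
    matrixElements[ievt] = computeOneEvent(momenta[ievt]);
  }
}
```

Each iteration writes only to its own output slot, which is what makes the loop safe to parallelise without any locking.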

It is not surprising that the CI reports a failed check: it tries to execute runTest.exe, but runTest.exe was not built.

It would be best to fix #83 first, separating two sets of .o and .exe for c++ and cuda.

valassi · 2020-12-03T17:43:23Z

I have now also added a printout.

And I disabled OMP if OMP_NUM_THREADS is not set (for backward compatibility, and to make the behaviour more explicit).

lfield · 2020-12-03T19:57:48Z

* a simple implementation of multithreading proposed in #82 (one line in cpp and one line in Makefile!), achieving immediately a x4 speedup on a 4-core machine, using the openmp suggested by @hageboeck with the build instructions of @lfield

@valassi to use your own words against you, the x4 speedup is not correct. You have to compare with 4 instances of the single threaded code :) I believe that is what you said to me when we were discussing my performance figures :)

valassi · 2020-12-03T20:51:15Z

* a simple implementation of multithreading proposed in #82 (one line in cpp and one line in Makefile!), achieving immediately a x4 speedup on a 4-core machine, using the openmp suggested by @hageboeck with the build instructions of @lfield
@valassi to use your own words against you, the x4 speedup is not correct. You have to compare with 4 instances of the single threaded code :) I believe that is what you said to me when we were discussing my performance figures :)

Ah! Then I understand why we were not understanding each other! ;-)
No, I do not think I ever meant to say this to you, but we certainly misunderstood each other then.

OK: a factor x4 speedup with 4 OMP threads, with respect to a single thread, all in one copy ;-)
Then I assume we would get the same factor x4, maybe a bit less, by running 4 instances of 1xMT.

Anyway, this is precisely the type of study we do in the benchmarking WG, so we will certainly test both options. The idea is to plot throughput vs "level of parallelism", whichever way you achieve it, be it with 4 1xMT copies, with 1 4xMT copy, or with 2 2xMT copies (then typically the three dots more or less overlap on throughput vs parallelism plots).
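As a rough illustration of the throughput vs parallelism comparison, the sketch below computes speedup and per-thread efficiency relative to the single-thread run. The struct and function names are made up for this example; the throughput figures to feed in are the EvtsPerSec[MatrixElems] values quoted in the first comment.

```cpp
#include <cassert>

// Speedup relative to a baseline, and efficiency per unit of parallelism.
struct Scaling
{
  double speedup;
  double efficiency;
};

// Hypothetical helper: baselineThroughput is the single-thread events/sec,
// throughput is the events/sec measured at the given level of parallelism.
Scaling scalingVsBaseline(double baselineThroughput, double throughput, int parallelism)
{
  assert(baselineThroughput > 0.0 && parallelism > 0);
  const double speedup = throughput / baselineThroughput;
  return {speedup, speedup / parallelism};
}
```

With the numbers above, `scalingVsBaseline(3.841438e+05, 1.515065e+06, 4)` gives a speedup of about 3.94, while the 8-thread figure of 1.516047e+06 gives about 3.95 and therefore an efficiency just under 0.5, consistent with the flat throughput beyond 4 threads on this 4-core VM.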

lfield · 2020-12-03T21:36:07Z

As far as I understand, the OpenMP version will run slightly slower due to the overhead of setting up the threads. In any case, as you pointed out, this is just the multi-core scenario, which should be well understood.

valassi · 2020-12-04T15:33:48Z

I reenabled the tests. It was as simple as adding -lgomp. This is something that nvcc digests, while it does not understand -fopenmp. (This is the solution I found for implementing heterogeneous MadGraph in PR #87, which includes this PR #84).

This is now ready to be merged. Any objections? Thanks

lfield · 2020-12-07T08:57:15Z

I have tested this code and it seems to work. I don't get a huge speedup: it is ~8.5s for 4M events with 1 OMP thread and 9.5s with 4 OMP threads. The code currently does not let you run with fewer than 4 threads, so maybe it is using more than 1 thread even if the number of OMP threads is set to 1. Looks good to merge.
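One way to check the hypothesis that more than one thread runs even when a single one was requested is to count the threads from inside a parallel region. This is only a sketch with a hypothetical function name, not code from the PR; it uses the standard OpenMP runtime calls `omp_set_num_threads` and `omp_get_num_threads`.

```cpp
#include <cassert>
#ifdef _OPENMP
#include <omp.h>
#endif

// Request a number of threads and report how many a parallel region
// actually gets. Returns 1 when compiled without OpenMP support.
int threadsActuallyUsed(int requested)
{
  assert(requested > 0);
  int used = 1;
#ifdef _OPENMP
  omp_set_num_threads(requested);
#pragma omp parallel
  {
    // Exactly one thread of the team records the team size.
#pragma omp single
    used = omp_get_num_threads();
  }
#endif
  return used;
}
```

The runtime may grant fewer threads than requested, but not more, so `threadsActuallyUsed(1)` returning anything other than 1 would point to a problem elsewhere.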

lfield · 2020-12-08T14:34:45Z

It looks like oneAPI now supports OpenMP and offloading to supported devices:

-fopenmp-targets=spir64
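For context, device offloading in OpenMP is expressed with `target` directives. The sketch below is a generic illustration with a hypothetical function name, not something taken from this PR or tested with oneAPI; the full compile line needed alongside the flag above is not given in this thread. If no device is available, or if the pragma is ignored, the loop runs on the host.

```cpp
#include <cassert>

// Square each input value, asking OpenMP to offload the loop to a device.
// The map clauses copy the input to the device and the output back.
void squareOnDevice(const double* in, double* out, int n)
{
  assert(n >= 0);
#pragma omp target teams distribute parallel for map(to: in[0:n]) map(from: out[0:n])
  for (int i = 0; i < n; ++i)
  {
    out[i] = in[i] * in[i];
  }
}
```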

hageboeck

Hi Andrea,
looks almost good to me.

The comment-out-comment-in commits in the Makefile should be squashed.
Maybe you can also remove all those merge commits, so the feature applies with only three commits or so, but that's mostly cosmetics.

valassi · 2020-12-09T20:01:56Z

I am finally merging this. Thanks for all the feedback. I also fixed a few bugs at the end.

Reminder: if OMP_NUM_THREADS is not set, this is reset to 1, for backward compatibility. If you want to set it to the maximum, just check "nproc --all".
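The behaviour described in this reminder and in the commit messages (fall back to 1 thread when the variable is unset; otherwise require a string of digits that is strictly > 0; apply it with `omp_set_num_threads`) could be sketched as follows. The function names are hypothetical, and throwing an exception on invalid input is an assumption: the thread does not say how the merged code reacts.

```cpp
#include <cassert>
#include <cctype>
#include <cstdlib>
#include <stdexcept>
#include <string>
#ifdef _OPENMP
#include <omp.h>
#endif

// Interpret a candidate OMP_NUM_THREADS value.
// nullptr (variable not set) means "use 1 thread".
int parseNumThreads(const char* value)
{
  if (value == nullptr) return 1;
  const std::string s(value);
  if (s.empty()) throw std::invalid_argument("OMP_NUM_THREADS is empty");
  for (char c : s)
  {
    if (!std::isdigit(static_cast<unsigned char>(c)))
      throw std::invalid_argument("OMP_NUM_THREADS must contain only digits");
  }
  const int n = std::stoi(s); // throws std::out_of_range for huge values
  if (n <= 0) throw std::invalid_argument("OMP_NUM_THREADS must be > 0");
  return n;
}

// Read the environment once at startup and apply the result.
int configureOpenMP()
{
  const int n = parseNumThreads(std::getenv("OMP_NUM_THREADS"));
  assert(n > 0);
#ifdef _OPENMP
  omp_set_num_threads(n);
#endif
  return n;
}
```

Keeping the parsing separate from the `getenv` call makes the validation rules easy to exercise without touching the process environment.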

valassi added 2 commits December 3, 2020 16:55

valassi requested review from hageboeck, lfield and roiser December 3, 2020 16:29

valassi marked this pull request as draft December 3, 2020 16:31

valassi added 3 commits December 3, 2020 18:03

Minor fix: avoid echoing "echo" from "make info"

6673271

Add a printout for ${OMP_NUM_THREADS} and $(nproc --all)

53f330e

Disable OMP if OMP_NUM_THREADS is not set. Improve result dump and usage printout.

1d39f23

valassi added 3 commits December 4, 2020 16:05

Reenable tests. Fix link by adding -lgomp. Addresses issue madgraph5#83.

c1e580d

Minor improvement: remove trailing "/" from TOOLSDIR to avoid double "//" in paths

16a1cb7

Improve Makefile: remove undefined/unneeded CPPFLAGS

637271c

valassi changed the title ~~Draft PR for multi-threading (#82). Uses OpenMP as suggested by @hageboeck. Contains a workaround, diabling runTest.exe(#83).~~ Implement multi-threading (#82). Uses OpenMP as suggested by @hageboeck. Dec 4, 2020

valassi marked this pull request as ready for review December 4, 2020 15:31

valassi requested a review from oliviermattelaer December 4, 2020 15:32

valassi self-assigned this Dec 4, 2020

valassi mentioned this pull request Dec 4, 2020

Heterogeneous MadGraph: parallel CPU+GPU executions #87

Closed

valassi added 2 commits December 6, 2020 16:36

Merge remote-tracking branch 'upstream/master' into issue83

389abff

Merge remote-tracking branch 'upstream/master' into issue83

e5abd92

Fix conflicts in epoch1/cuda/ee_mumu/SubProcesses/Makefile (add both -Wextra and -fopenmp)

lfield approved these changes Dec 7, 2020

valassi added 2 commits December 8, 2020 20:35

Merge remote-tracking branch 'upstream/master' into issue83

8ee40bd

Merge remote-tracking branch 'upstream/master' into issue83

9619b3e

Merge remote-tracking branch 'glav/issue83' into issue83

7aab94b

hageboeck requested changes Dec 9, 2020

Comment thread epoch1/cuda/ee_mumu/SubProcesses/Makefile

valassi added 11 commits December 9, 2020 17:16

Add -Wextra also in src/Makefile

4c8aa77

Add -fopenmp also in src/Makefile

3067437

BUG FIXES: use omp_set_num_threads instead of setenv; move this to the start

dc6228e

BUG FIX in OMP parameter printout

dce3080

If OMP_NUM_THREADS is set, require a string with only digits and strictly > 0

23b893f

Improve the printouts

e5a3391

Merge remote-tracking branch 'glav/issue83' into issue83

3b6f150

Add back CPPFLAGS as suggested by @hageboek

37666e2

Merge remote-tracking branch 'glav/issue83' into issue83

c398447

BUG FIX in getenv printout

b2103a5

Merge remote-tracking branch 'glav/issue83' into issue83

6c2668f

valassi merged commit 504de38 into madgraph5:master Dec 9, 2020

valassi deleted the issue83 branch December 9, 2020 20:06

valassi merged 24 commits into madgraph5:master from valassi:issue83

4 participants
