
Implement multi-threading (#82). Uses OpenMP as suggested by @hageboeck. #84

Merged
valassi merged 24 commits into madgraph5:master from valassi:issue83
Dec 9, 2020

@valassi
Member

@valassi valassi commented Dec 3, 2020

No description provided.

The build fails because runTest.exe mixes c++ and cuda objects (madgraph5#83):

/cvmfs/sft.cern.ch/lcg/releases/gcc/8.3.0-cebb0/x86_64-centos7/bin/g++  -O3 -std=c++11 -I. -I../../src -I../../../../../tools/ -I../../../../../tools//googletest/googletest/include/ -DUSE_NVTX -Wall -Wshadow -fopenmp  -I/usr/local/cuda-11.1/include/ -c runTest.cc -o runTest.o
ln -sf runTest.cc runTest_tmp.cu
/usr/local/cuda-11.1/bin/nvcc -o runTest.exe CPPProcess.o runTest.o gCPPProcess.o runTest_tmp.cu  -O3 -std=c++14 -I. -I../../src -I../../../../../tools/ -I../../../../../tools//googletest/googletest/include/ -I/usr/local/cuda-11.1/include/ -DUSE_NVTX -arch=compute_70 -use_fast_math -lineinfo  -ldl -L../../lib -lmodel_sm -L../../../../../tools//googletest/build/lib// -lgtest -lgtest_main -L/usr/local/cuda-11.1/lib64/ -lcuda -lcurand -lcuda
CPPProcess.o: In function `Proc::sigmaKin(double const*, double*, int) [clone ._omp_fn.0]':
CPPProcess.cc:(.text+0x2312): undefined reference to `omp_get_num_threads'
CPPProcess.cc:(.text+0x2319): undefined reference to `omp_get_thread_num'
CPPProcess.o: In function `Proc::sigmaKin(double const*, double*, int)':
CPPProcess.cc:(.text+0x263f): undefined reference to `GOMP_parallel'
collect2: error: ld returned 1 exit status
make: *** [Makefile:98: runTest.exe] Error 1
This allows testing multithreading (madgraph5#82) using OpenMP as suggested by @hageboeck.

Very nice, one line immediately gained a factor 4 in throughput!
This is itscrd03 (a VM), note `nproc` is 1, but `nproc --all` is 4 (difference?).
Going even further, e.g. to 8 threads, throughput remains flat (the leading number on each line is OMP_NUM_THREADS):

1: EvtsPerSec[MatrixElems] (3)= ( 3.841438e+05                 )  sec^-1
4: EvtsPerSec[MatrixElems] (3)= ( 1.515065e+06                 )  sec^-1
8: EvtsPerSec[MatrixElems] (3)= ( 1.516047e+06                 )  sec^-1
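
For readers who have not seen the pattern, here is a minimal sketch of what a one-line OpenMP change of this kind can look like. It is not the actual CPPProcess.cc code: `computeAll` and `fakeMatrixElement` are hypothetical names, and the only thing taken from this PR is the idea of a loop over independent events with a signature similar to `sigmaKin(double const*, double*, int)`.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-in for an expensive per-event calculation.
inline double fakeMatrixElement(double x) { return std::sqrt(x * x + 1.0); }

// Computes out[i] from in[i] for n independent events.
// The pragma is the "one line": each iteration touches only its own
// element, so OpenMP may split the loop across threads.
// Built without -fopenmp, the pragma is ignored and the loop runs serially.
void computeAll(const double* in, double* out, int n) {
  assert(in != nullptr && out != nullptr);
#pragma omp parallel for
  for (int i = 0; i < n; ++i) {
    out[i] = fakeMatrixElement(in[i]);
  }
}
```

The result is the same with or without threading because no iteration reads or writes another iteration's data; a loop with shared accumulators would need a reduction clause instead.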

export OMP_NUM_THREADS=1; ./check.exe -p 16384 32 1
***********************************************************************
NumBlocksPerGrid           = 16384
NumThreadsPerBlock         = 32
NumIterations              = 1
-----------------------------------------------------------------------
FP precision               = DOUBLE (nan=0)
Complex type               = STD::COMPLEX
RanNumb memory layout      = AOSOA[4]
Momenta memory layout      = AOSOA[4]
Random number generation   = CURAND (C++ code)
-----------------------------------------------------------------------
NumberOfEntries            = 1
TotalTime[Rnd+Rmb+ME] (123)= ( 1.491819e+00                 )  sec
TotalTime[Rambo+ME]    (23)= ( 1.464077e+00                 )  sec
TotalTime[RndNumGen]    (1)= ( 2.774263e-02                 )  sec
TotalTime[Rambo]        (2)= ( 9.925424e-02                 )  sec
TotalTime[MatrixElems]  (3)= ( 1.364822e+00                 )  sec
MeanTimeInMatrixElems      = ( 1.364822e+00                 )  sec
[Min,Max]TimeInMatrixElems = [ 1.364822e+00 ,  1.364822e+00 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed        = 524288
EvtsPerSec[Rnd+Rmb+ME](123)= ( 3.514421e+05                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 3.581015e+05                 )  sec^-1
EvtsPerSec[MatrixElems] (3)= ( 3.841438e+05                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)  = 524288
MeanMatrixElemValue        = ( 1.371958e-02 +- 1.132119e-05 )  GeV^0
[Min,Max]MatrixElemValue   = [ 6.071582e-03 ,  3.374915e-02 ]  GeV^0
StdDevMatrixElemValue      = ( 8.197419e-03                 )  GeV^0
MeanWeight                 = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight            = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight               = ( 0.000000e+00                 )
***********************************************************************
0a ProcInit :     0.000371 sec
0b MemAlloc :     0.027894 sec
0c GenCreat :     0.000852 sec
1a GenSeed  :     0.000010 sec
1b GenRnGen :     0.027733 sec
2a RamboIni :     0.006828 sec
2b RamboFin :     0.092427 sec
3a SigmaKin :     1.364822 sec
4a DumpLoop :     0.004408 sec
8a CompStat :     0.003551 sec
9a GenDestr :     0.000109 sec
9b DumpScrn :     0.000214 sec
9c DumpJson :     0.000007 sec
TOTAL       :     1.529226 sec
TOTAL (123) :     1.491819 sec
TOTAL  (23) :     1.464077 sec
TOTAL   (1) :     0.027743 sec
TOTAL   (2) :     0.099254 sec
TOTAL   (3) :     1.364822 sec
***********************************************************************

export OMP_NUM_THREADS=4; ./check.exe -p 16384 32 1
***********************************************************************
NumBlocksPerGrid           = 16384
NumThreadsPerBlock         = 32
NumIterations              = 1
-----------------------------------------------------------------------
FP precision               = DOUBLE (nan=0)
Complex type               = STD::COMPLEX
RanNumb memory layout      = AOSOA[4]
Momenta memory layout      = AOSOA[4]
Random number generation   = CURAND (C++ code)
-----------------------------------------------------------------------
NumberOfEntries            = 1
TotalTime[Rnd+Rmb+ME] (123)= ( 4.734046e-01                 )  sec
TotalTime[Rambo+ME]    (23)= ( 4.457583e-01                 )  sec
TotalTime[RndNumGen]    (1)= ( 2.764625e-02                 )  sec
TotalTime[Rambo]        (2)= ( 9.970848e-02                 )  sec
TotalTime[MatrixElems]  (3)= ( 3.460498e-01                 )  sec
MeanTimeInMatrixElems      = ( 3.460498e-01                 )  sec
[Min,Max]TimeInMatrixElems = [ 3.460498e-01 ,  3.460498e-01 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed        = 524288
EvtsPerSec[Rnd+Rmb+ME](123)= ( 1.107484e+06                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 1.176171e+06                 )  sec^-1
EvtsPerSec[MatrixElems] (3)= ( 1.515065e+06                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)  = 524288
MeanMatrixElemValue        = ( 1.371958e-02 +- 1.132119e-05 )  GeV^0
[Min,Max]MatrixElemValue   = [ 6.071582e-03 ,  3.374915e-02 ]  GeV^0
StdDevMatrixElemValue      = ( 8.197419e-03                 )  GeV^0
MeanWeight                 = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight            = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight               = ( 0.000000e+00                 )
***********************************************************************
0a ProcInit :     0.000325 sec
0b MemAlloc :     0.027604 sec
0c GenCreat :     0.000858 sec
1a GenSeed  :     0.000008 sec
1b GenRnGen :     0.027638 sec
2a RamboIni :     0.006976 sec
2b RamboFin :     0.092732 sec
3a SigmaKin :     0.346050 sec
4a DumpLoop :     0.004402 sec
8a CompStat :     0.003604 sec
9a GenDestr :     0.000070 sec
9b DumpScrn :     0.000225 sec
9c DumpJson :     0.000007 sec
TOTAL       :     0.510500 sec
TOTAL (123) :     0.473405 sec
TOTAL  (23) :     0.445758 sec
TOTAL   (1) :     0.027646 sec
TOTAL   (2) :     0.099708 sec
TOTAL   (3) :     0.346050 sec
***********************************************************************

export OMP_NUM_THREADS=8; ./check.exe -p 16384 32 1
***********************************************************************
NumBlocksPerGrid           = 16384
NumThreadsPerBlock         = 32
NumIterations              = 1
-----------------------------------------------------------------------
FP precision               = DOUBLE (nan=0)
Complex type               = STD::COMPLEX
RanNumb memory layout      = AOSOA[4]
Momenta memory layout      = AOSOA[4]
Random number generation   = CURAND (C++ code)
-----------------------------------------------------------------------
NumberOfEntries            = 1
TotalTime[Rnd+Rmb+ME] (123)= ( 4.726587e-01                 )  sec
TotalTime[Rambo+ME]    (23)= ( 4.449646e-01                 )  sec
TotalTime[RndNumGen]    (1)= ( 2.769402e-02                 )  sec
TotalTime[Rambo]        (2)= ( 9.913886e-02                 )  sec
TotalTime[MatrixElems]  (3)= ( 3.458258e-01                 )  sec
MeanTimeInMatrixElems      = ( 3.458258e-01                 )  sec
[Min,Max]TimeInMatrixElems = [ 3.458258e-01 ,  3.458258e-01 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed        = 524288
EvtsPerSec[Rnd+Rmb+ME](123)= ( 1.109232e+06                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 1.178269e+06                 )  sec^-1
EvtsPerSec[MatrixElems] (3)= ( 1.516047e+06                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)  = 524288
MeanMatrixElemValue        = ( 1.371958e-02 +- 1.132119e-05 )  GeV^0
[Min,Max]MatrixElemValue   = [ 6.071582e-03 ,  3.374915e-02 ]  GeV^0
StdDevMatrixElemValue      = ( 8.197419e-03                 )  GeV^0
MeanWeight                 = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight            = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight               = ( 0.000000e+00                 )
***********************************************************************
0a ProcInit :     0.000345 sec
0b MemAlloc :     0.027377 sec
0c GenCreat :     0.000850 sec
1a GenSeed  :     0.000008 sec
1b GenRnGen :     0.027686 sec
2a RamboIni :     0.006767 sec
2b RamboFin :     0.092372 sec
3a SigmaKin :     0.345826 sec
4a DumpLoop :     0.004346 sec
8a CompStat :     0.003605 sec
9a GenDestr :     0.000067 sec
9b DumpScrn :     0.000208 sec
9c DumpJson :     0.000008 sec
TOTAL       :     0.509464 sec
TOTAL (123) :     0.472659 sec
TOTAL  (23) :     0.444965 sec
TOTAL   (1) :     0.027694 sec
TOTAL   (2) :     0.099139 sec
TOTAL   (3) :     0.345826 sec
***********************************************************************
@valassi valassi marked this pull request as draft December 3, 2020 16:31
@valassi
Member Author

valassi commented Dec 3, 2020

This PR includes two things:

It is not surprising that the CI reports a failed check: it tries to execute runTest.exe, but runTest.exe was not built.

It would be best to fix #83 first, separating two sets of .o and .exe for c++ and cuda.

@valassi
Member Author

valassi commented Dec 3, 2020

I have now also added a printout.

And I disabled OMP if OMP_NUM_THREADS is not set (for backward compatibility, and to make the behaviour more explicit).
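
A minimal sketch of how such a default could be implemented, assuming the variable is read with `std::getenv` and applied with `omp_set_num_threads`. The helper names `resolveNumThreads` and `applyNumThreads` are hypothetical and the code merged in this PR may do this differently.

```cpp
#include <cassert>
#include <climits>
#include <cstdlib>
#ifdef _OPENMP
#include <omp.h>
#endif

// Returns the thread count to use for a given OMP_NUM_THREADS string:
// a leading positive integer is honoured, anything else (unset, empty,
// non-numeric, zero or negative) falls back to 1 thread.
int resolveNumThreads(const char* envValue) {
  if (envValue == nullptr) return 1;
  char* end = nullptr;
  const long parsed = std::strtol(envValue, &end, 10);
  if (end == envValue || parsed < 1) return 1;
  return parsed > INT_MAX ? INT_MAX : static_cast<int>(parsed);
}

// Reads the environment and applies the result to the OpenMP runtime,
// if the code was built with OpenMP support. Returns the value applied.
int applyNumThreads() {
  const int n = resolveNumThreads(std::getenv("OMP_NUM_THREADS"));
  assert(n >= 1);
#ifdef _OPENMP
  omp_set_num_threads(n);
#endif
  return n;
}
```

Without an override like this, an OpenMP runtime typically chooses its own default thread count when the variable is unset, which is why an explicit fallback to 1 keeps the old single-threaded behaviour.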

@lfield
Contributor

lfield commented Dec 3, 2020

> * a simple implementation of multithreading proposed in #82 (one line in cpp and one line in Makefile!), achieving immediately a x4 speedup on a 4-core machine, using the openmp suggested by @hageboeck with the build instructions of @lfield

@valassi to use your own words against you, the x4 speedup is not correct. You have to compare with 4 instances of the single threaded code :) I believe that is what you said to me when we were discussing my performance figures :)

@valassi
Member Author

valassi commented Dec 3, 2020

> > * a simple implementation of multithreading proposed in #82 (one line in cpp and one line in Makefile!), achieving immediately a x4 speedup on a 4-core machine, using the openmp suggested by @hageboeck with the build instructions of @lfield
>
> @valassi to use your own words against you, the x4 speedup is not correct. You have to compare with 4 instances of the single threaded code :) I believe that is what you said to me when we were discussing my performance figures :)

Ah! Then I understand why we were not understanding each other! ;-)
No, I do not think I ever meant to say this to you, but we certainly misunderstood each other then.

OK: a factor x4 speedup with 4 OMP threads, with respect to a single thread, all in one copy ;-)
Then I assume we would get the same factor x4, maybe a bit less, by running 4 1xMT instances.

Anyway, this is precisely the type of study we do in the benchmarking WG, so we will certainly try both options. The idea is to plot throughput vs "level of parallelism", whichever way you achieve it, be it with 4 1xMT copies, with 1 4xMT copy, or with 2 2xMT copies (typically the three dots then more or less overlap on throughput vs parallelism plots).
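
To make the "x4" figure concrete, here is a small sketch of the speedup arithmetic. The `Scaling` struct and `scaling` function are hypothetical helpers; the throughput numbers in the note below are the ones printed by check.exe in the PR description.

```cpp
#include <cassert>

// Speedup relative to the 1-thread run, and efficiency per thread.
struct Scaling {
  double speedup;     // throughput(N threads) / throughput(1 thread)
  double efficiency;  // speedup / N; 1.0 would be perfect scaling
};

Scaling scaling(double throughput1, double throughputN, int nThreads) {
  assert(throughput1 > 0.0 && nThreads > 0);
  const double s = throughputN / throughput1;
  return {s, s / nThreads};
}
```

With the matrix-element throughputs (3.841438e+05 and 1.515065e+06 events/s), `scaling(3.841438e5, 1.515065e6, 4)` gives a speedup of about 3.94. With the end-to-end "Rnd+Rmb+ME" throughputs (3.514421e+05 and 1.107484e+06 events/s) the speedup is only about 3.15, because in these logs the random-number and Rambo timings are essentially unchanged between 1 and 4 threads.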

@lfield
Contributor

lfield commented Dec 3, 2020

As far as I understand, the OpenMP version will run slightly slower due to the overhead of setting up the threads. In any case, as you pointed out, this is just the multi-core scenario, which should be well understood.

@valassi valassi changed the title Draft PR for multi-threading (#82). Uses OpenMP as suggested by @hageboeck. Contains a workaround, diabling runTest.exe(#83). Implement multi-threading (#82). Uses OpenMP as suggested by @hageboeck. Dec 4, 2020
@valassi valassi marked this pull request as ready for review December 4, 2020 15:31
@valassi valassi self-assigned this Dec 4, 2020
@valassi
Member Author

valassi commented Dec 4, 2020

I re-enabled the tests. It was as simple as adding -lgomp to the link command: nvcc accepts -lgomp, whereas it does not understand -fopenmp. (This is the solution I found for implementing heterogeneous Madgraph in PR #87, which includes this PR #84).

This is now ready to be merged. Any objections? Thanks

Fix conflicts in epoch1/cuda/ee_mumu/SubProcesses/Makefile (add both -Wextra and -fopenmp)
@lfield
Contributor

lfield commented Dec 7, 2020

I have tested this code and it seems to work. I don't get a huge speedup: it is ~8.5s for 4M events with 1 OMP thread and 9.5s with 4 OMP threads. The code currently does not let you run with fewer than 4 threads, so maybe it is using more than 1 thread even if the number of OMP threads is set to 1. Looks good to merge.

@lfield
Contributor

lfield commented Dec 8, 2020

It looks like OneAPI now supports OpenMP and offloading to supported devices:

-fopenmp-targets=spir64
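
For context, a minimal sketch of what an OpenMP offload loop looks like in source code. This is generic OpenMP `target` syntax, not code from this repository, and `scaleOnDevice` is a hypothetical name. Whether it actually runs on a device depends on the compiler and flags used; without offload support it falls back to the host.

```cpp
#include <cassert>

// Doubles each input value. With an offload-capable compiler the loop
// may run on an accelerator: map(to:) copies the inputs to the device,
// map(from:) copies the results back. Without OpenMP support the
// pragma is ignored and the loop runs serially on the host.
void scaleOnDevice(const double* in, double* out, int n) {
  assert(in != nullptr && out != nullptr && n > 0);
#pragma omp target teams distribute parallel for map(to: in[0:n]) map(from: out[0:n])
  for (int i = 0; i < n; ++i) {
    out[i] = 2.0 * in[i];
  }
}
```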

Member

@hageboeck hageboeck left a comment

Hi Andrea,
looks almost good to me.

The comment-out-comment-in commits in the Makefile should be squashed.
Maybe you can also remove all those merge commits, so the feature applies with only three commits or so, but that's mostly cosmetics.

Comment thread epoch1/cuda/ee_mumu/SubProcesses/Makefile
@valassi
Member Author

valassi commented Dec 9, 2020

I am finally merging this. Thanks for all the feedback. I also fixed a few bugs at the end.

Reminder: if OMP_NUM_THREADS is not set, the number of threads is reset to 1, for backward compatibility. If you want to use the maximum, check "nproc --all" and set OMP_NUM_THREADS to that value.

@valassi valassi merged commit 504de38 into madgraph5:master Dec 9, 2020
@valassi valassi deleted the issue83 branch December 9, 2020 20:06
4 participants