Add feature to produce common random numbers. by hageboeck · Pull Request #45 · madgraph5/madgraph4gpu

hageboeck · 2020-11-06T14:28:23Z

Add a header in tools/ for creating reproducible common random numbers using c++11.
Modify eemumu_AV such that it can make use of those.

Note that to increase the compiler coverage and to decrease the probability of errors, this uses

#define FLAG true/false
...
if (FLAG)

This way, the compiler checks both code paths for correct syntax, but no assembly is generated for the inactive path.

- Add CommonRandomNumbers.h, a header to create deterministic sequences of random numbers for comparing different abstraction frameworks. - Implement parallel/parallel+asynchronous generation.

In order to introduce common random numbers for all abstraction frameworks, it is beneficial to move code out of some ifdef blocks. This increases the compiler coverage (less mistakes in deactivated paths), and reduces the number of blocks that will have to be added later.

An option to use common c++11 random numbers was added. These can be generated in parallel on the CPU while the GPU is being set up. This features the macro MGONGPU_COMMONRAND_ONHOST, which is either defined to true or false. This way, the compiler always checks both code paths, but for the inactive path, no assembly is generated.

roiser

alles gut

valassi · 2020-11-10T12:45:27Z

Hi @hageboeck and @roiser thanks for the commit!

One point, I would prefer to have all #defines to generate #ifdefs in the code, rather than if(true) or if(false), at this stage. I do not claim that what I have done so far is the best option, but I'd like to have a consistent way to set this: so far, this is choosing one and only one of the #defines for each option in the config file.

Is it ok for you if I modify that in this direction? (By the way I have made quite a few changes that I'm about to push, so I merged already a minor conflict).

Thanks
Andrea

valassi · 2020-11-10T14:54:33Z

One other point, I'd rather keep some naming consistent (sorry, my definition of consistent ;-).

For instance I'd stick with hstRnarray as the thing that is passed between random number generation and rambo. Now there is an additional hstRn, I thought that was used to populate hstRnarray but it is a bit the opposite. I would also change that.

valassi · 2020-11-10T16:51:58Z

Sorry, one more point. Especially as we know this is not meant to be a production system but a solution to validate random numbers on host and device, I would favor clarity and metric calculation over speed. I mean, I am not sure I would use a vector that gets precomputed in async mode and has the size of #iterations. This is actually a nice smart trick! But I prefer to get it much clearer in the code how much time is spent in each phase. Eventually I may need to eat this back... but for now I'd stay with simpler code and generate random numbers at the beginning of each iteration.

Or maybe not. I need to think about it...

One possible issue is that these arrays can easily get far too large if they cover all iterations. Starting a new sequence (and reseeding) in each iteration would consistently follow what is done in the curand solution.

hageboeck · 2020-11-10T17:42:41Z

Hi @hageboeck and @roiser thanks for the commit!

One point, I would prefer to have all #defines to generate #ifdefs in the code, rather than if(true) or if(false), at this stage. I do not claim that what I have done so far is the best option, but I'd like to have a consistent way to set this: so far, this is choosing one and only one of the #defines for each option in the config file.

Is it ok for you if I modify that in this direction? (By the way I have made quite a few changes that I'm about to push, so I merged already a minor conflict).

Thanks
Andrea

Sure, I actually tried that but gave up, because both CPU and GPU work with/without common numbers. That is, you need 4 instead of 2 cases to be defined or not, which means that 3/4 of the cases are not checked by the compiler, and that you have to cycle through all four until you know that the code compiles.
Can be done, though.

On a related note:
Would constexpr bool someFlag = true/false be on option?

The code of both branches is checked by the compiler.
Assembly is only generated for the active branch.

hageboeck · 2020-11-10T17:44:01Z

One other point, I'd rather keep some naming consistent (sorry, my definition of consistent ;-).

For instance I'd stick with hstRnarray as the thing that is passed between random number generation and rambo. Now there is an additional hstRn, I thought that was used to populate hstRnarray but it is a bit the opposite. I would also change that.

This can also be done, but it needs more ifdefs. I had also done that, but when I realised that more and more code hides behind ifdefs, I pulled it out.
... unless I misunderstand and you just want a variable to be renamed.

hageboeck · 2020-11-10T17:56:43Z

Sorry, one more point. Especially as we know this is not meant to be a production system but a solution to validate random numbers on host and device, I would favor clarity and metric calculation over speed. I mean, I am not sure I would use a vector that gets precomputed in async mode and has the size of #iterations. This is actually a nice smart trick! But I prefer to get it much clearer in the code how much time is spent in each phase. Eventually I may need to eat this back... but for now I'd stay with simpler code and generate random numbers at the beginning of each iteration.

Or maybe not. I need to think about it...

One possible issue is that these arrays can easily get far too large if they cover all iterations. Starting a new sequence (and reseeding) in each iteration would consistently follow what is done in the curand solution.

Yes, can also be done, but I was asked specifically if it's possible to do it in parallel. It's changing three lines of code or so, because CommonRandomNumbers.h exposes the function that creates a batch of random numbers.

valassi · 2020-11-10T17:59:52Z

Hi Stephan, thanks... and sorry for being a pain :-)
Before I comment on your replies, I noticed one more fundamental issue (while fixing ifdefs, which indeed gets complex!): when using curand on host, hstRnarray is a pinned memory array (using cudaMalloc), while using the common random generator essentially an std::vector is created and then its data() is used as C-array to be copied over to CUDA. In this case, the memory copy from host to device does an intermediate copy to pinned memory in any case, see https://developer.nvidia.com/blog/how-optimize-data-transfers-cuda-cc/. After looking this up, I find it easier to just use hstRnarray as pinned memory, and then memcpy the vector's data() to that pinned host array: this is what would in any case happen behind the scenes if we copied directly data() to the GPU, but at least it allows a simplifaction of the code about what hstRnarray is. This extra memcpy is very ugly, but I think is unavoidable. A nice improvement in your algo, if possible, would be to fill the preallocated rnArray rather than creating a new vector, maybe this is possible? For this validation code it is not essential, but I think it helps to think in terms of pinned memory even here.

That said, I saw your PR #47 which has many interesting and relevant points! But I'd like to complete this one first...

Thanks! Lets discuss maybe at coffee tomorrow morning
Andrea

valassi · 2020-11-10T18:04:00Z

About your replies now:

I think the ifdefs are indeed a mess. Again, not sure this is the best, but I'd like to stay consisten with teh rest. A constexpr? I have not thought of that. Maybe I suggest to stay with the options we have now, and reassess after tomorrow's LHCb/ALICE talks, eventually we may need ifdefs also for AMD etc etc
About the async and parallel, you ar eright, in the end I am keeping that in. Maybe the timers get some slightly changed meanings, but ok. I am trying to make sure the timers are relatively consistent with the rest anyway (this is one thing I was working on this morning in PR Fix throughput metrics and improve float printouts. Merge (fixing conflicts) with @shageboe's common random nubers on host. #46

A demain
Andrea

…rray usage. ./gcheck.exe -p 16384 32 12 *********************************************************************** NumBlocksPerGrid = 16384 NumThreadsPerBlock = 32 NumIterations = 12 ----------------------------------------------------------------------- FP precision = DOUBLE (nan=0) Complex type = THRUST::COMPLEX RanNumb memory layout = AOSOA[4] Momenta memory layout = AOSOA[4] Wavefunction GPU memory = LOCAL Random number generation = CURAND DEVICE (CUDA code) ----------------------------------------------------------------------- NumberOfEntries = 12 TotalTime[Rnd+Rmb+ME] (123)= ( 8.692473e-02 ) sec TotalTime[Rambo+ME] (23)= ( 7.913799e-02 ) sec TotalTime[RndNumGen] (1)= ( 7.786739e-03 ) sec TotalTime[Rambo] (2)= ( 7.001756e-02 ) sec TotalTime[MatrixElems] (3)= ( 9.120431e-03 ) sec MeanTimeInMatrixElems = ( 7.600359e-04 ) sec [Min,Max]TimeInMatrixElems = [ 7.568000e-04 , 7.671370e-04 ] sec ----------------------------------------------------------------------- TotalEventsComputed = 6291456 EvtsPerSec[Rnd+Rmb+ME](123)= ( 7.237821e+07 ) sec^-1 EvtsPerSec[Rmb+ME] (23)= ( 7.949982e+07 ) sec^-1 EvtsPerSec[MatrixElems] (3)= ( 6.898200e+08 ) sec^-1 *********************************************************************** NumMatrixElements(notNan) = 6291456 MeanMatrixElemValue = ( 1.372152e-02 +- 3.269516e-06 ) GeV^0 [Min,Max]MatrixElemValue = [ 6.071582e-03 , 3.374925e-02 ] GeV^0 StdDevMatrixElemValue = ( 8.200854e-03 ) GeV^0 MeanWeight = ( 4.515827e-01 +- 0.000000e+00 ) [Min,Max]Weight = [ 4.515827e-01 , 4.515827e-01 ] StdDevWeight = ( 0.000000e+00 ) *********************************************************************** 00 CudaFree : 0.877303 sec 0a ProcInit : 0.000488 sec 0b MemAlloc : 0.061239 sec 0c GenCreat : 0.009764 sec 0d SGoodHel : 0.001760 sec 1a GenSeed : 0.000101 sec 1b GenRnGen : 0.007686 sec 2a RamboIni : 0.000178 sec 2b RamboFin : 0.000138 sec 2c CpDTHwgt : 0.006028 sec 2d CpDTHmom : 0.063673 sec 3a SigmaKin : 0.000165 sec 3b CpDTHmes : 0.008956 sec 4a DumpLoop : 0.023916 sec 8a CompStat : 0.045670 sec 9a GenDestr : 0.000054 sec 9b MemFree : 0.012793 sec 9c CudReset : 0.049363 sec 9d DumpScrn : 0.000221 sec 9e DumpJson : 0.000008 sec TOTAL : 1.169505 sec TOTAL (123) : 0.086925 sec TOTAL (23) : 0.079138 sec TOTAL (1) : 0.007787 sec TOTAL (2) : 0.070018 sec TOTAL (3) : 0.009120 sec *********************************************************************** ./check.exe -p 16384 32 12 *********************************************************************** NumBlocksPerGrid = 16384 NumThreadsPerBlock = 32 NumIterations = 12 ----------------------------------------------------------------------- FP precision = DOUBLE (nan=0) Complex type = STD::COMPLEX RanNumb memory layout = AOSOA[4] Momenta memory layout = AOSOA[4] Random number generation = CURAND (C++ code) ----------------------------------------------------------------------- NumberOfEntries = 12 TotalTime[Rnd+Rmb+ME] (123)= ( 1.787162e+01 ) sec TotalTime[Rambo+ME] (23)= ( 1.753950e+01 ) sec TotalTime[RndNumGen] (1)= ( 3.321243e-01 ) sec TotalTime[Rambo] (2)= ( 1.204920e+00 ) sec TotalTime[MatrixElems] (3)= ( 1.633458e+01 ) sec MeanTimeInMatrixElems = ( 1.361215e+00 ) sec [Min,Max]TimeInMatrixElems = [ 1.360797e+00 , 1.361576e+00 ] sec ----------------------------------------------------------------------- TotalEventsComputed = 6291456 EvtsPerSec[Rnd+Rmb+ME](123)= ( 3.520361e+05 ) sec^-1 EvtsPerSec[Rmb+ME] (23)= ( 3.587021e+05 ) sec^-1 EvtsPerSec[MatrixElems] (3)= ( 3.851618e+05 ) sec^-1 *********************************************************************** NumMatrixElements(notNan) = 6291456 MeanMatrixElemValue = ( 1.372152e-02 +- 3.269516e-06 ) GeV^0 [Min,Max]MatrixElemValue = [ 6.071582e-03 , 3.374925e-02 ] GeV^0 StdDevMatrixElemValue = ( 8.200854e-03 ) GeV^0 MeanWeight = ( 4.515827e-01 +- 0.000000e+00 ) [Min,Max]Weight = [ 4.515827e-01 , 4.515827e-01 ] StdDevWeight = ( 0.000000e+00 ) *********************************************************************** 0a ProcInit : 0.000319 sec 0b MemAlloc : 0.050883 sec 0c GenCreat : 0.000824 sec 1a GenSeed : 0.000105 sec 1b GenRnGen : 0.332019 sec 2a RamboIni : 0.083518 sec 2b RamboFin : 1.121402 sec 3a SigmaKin : 16.334579 sec 4a DumpLoop : 0.020090 sec 8a CompStat : 0.037017 sec 9a GenDestr : 0.000081 sec 9b MemFree : 0.001111 sec 9d DumpScrn : 0.000187 sec 9e DumpJson : 0.000007 sec TOTAL : 17.982145 sec TOTAL (123) : 17.871624 sec TOTAL (23) : 17.539499 sec TOTAL (1) : 0.332124 sec TOTAL (2) : 1.204920 sec TOTAL (3) : 16.334579 sec ***********************************************************************

Improvements over PR #45. More consistent #ifdefs and hstRnarray usage.

valassi · 2020-11-10T18:49:27Z

Ok I have merged the changes I was discussing above, as PR #48
Lets discuss tomorrow

hageboeck added 4 commits November 5, 2020 14:15

[Tools] Add header to create common random numbers.

0a77973

- Add CommonRandomNumbers.h, a header to create deterministic sequences of random numbers for comparing different abstraction frameworks. - Implement parallel/parallel+asynchronous generation.

[eemm_AV] Unify -I in Makefile.

b7a54be

hageboeck requested a review from roiser November 9, 2020 12:39

hageboeck self-assigned this Nov 9, 2020

roiser approved these changes Nov 9, 2020

View reviewed changes

hageboeck merged commit 599e5c7 into madgraph5:master Nov 10, 2020

hageboeck deleted the randomNumbers branch November 10, 2020 11:02

valassi added a commit that referenced this pull request Nov 10, 2020

Merge pull request #48 from valassi/master

6360c71

Improvements over PR #45. More consistent #ifdefs and hstRnarray usage.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add feature to produce common random numbers.#45

Add feature to produce common random numbers.#45
hageboeck merged 4 commits into
madgraph5:masterfrom
hageboeck:randomNumbers

hageboeck commented Nov 6, 2020

Uh oh!

roiser left a comment

Uh oh!

valassi commented Nov 10, 2020

Uh oh!

valassi commented Nov 10, 2020

Uh oh!

valassi commented Nov 10, 2020 •

edited

Loading

Uh oh!

hageboeck commented Nov 10, 2020

Uh oh!

hageboeck commented Nov 10, 2020

Uh oh!

hageboeck commented Nov 10, 2020

Uh oh!

valassi commented Nov 10, 2020

Uh oh!

valassi commented Nov 10, 2020

Uh oh!

valassi commented Nov 10, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

hageboeck commented Nov 6, 2020

Uh oh!

roiser left a comment

Choose a reason for hiding this comment

Uh oh!

valassi commented Nov 10, 2020

Uh oh!

valassi commented Nov 10, 2020

Uh oh!

valassi commented Nov 10, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

hageboeck commented Nov 10, 2020

Uh oh!

hageboeck commented Nov 10, 2020

Uh oh!

hageboeck commented Nov 10, 2020

Uh oh!

valassi commented Nov 10, 2020

Uh oh!

valassi commented Nov 10, 2020

Uh oh!

valassi commented Nov 10, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

valassi commented Nov 10, 2020 •

edited

Loading