Skip to content

Add feature to produce common random numbers.#45

Merged
hageboeck merged 4 commits into
madgraph5:masterfrom
hageboeck:randomNumbers
Nov 10, 2020
Merged

Add feature to produce common random numbers.#45
hageboeck merged 4 commits into
madgraph5:masterfrom
hageboeck:randomNumbers

Conversation

@hageboeck
Copy link
Copy Markdown
Member

  • Add a header in tools/ for creating reproducible common random numbers using c++11.
  • Modify eemumu_AV such that it can make use of those.

Note that to increase the compiler coverage and to decrease the probability of errors, this uses

#define FLAG true/false
...
if (FLAG)

This way, the compiler checks both code paths for correct syntax, but no assembly is generated for the inactive path.

- Add CommonRandomNumbers.h, a header to create deterministic sequences
of random numbers for comparing different abstraction frameworks.
- Implement parallel/parallel+asynchronous generation.
In order to introduce common random numbers for all abstraction frameworks,
it is beneficial to move code out of some ifdef blocks. This increases the
compiler coverage (less mistakes in deactivated paths), and reduces the
number of blocks that will have to be added later.
An option to use common c++11 random numbers was added. These can
be generated in parallel on the CPU while the GPU is being set up.

This features the macro MGONGPU_COMMONRAND_ONHOST, which is either
defined to true or false. This way, the compiler always checks both
code paths, but for the inactive path, no assembly is generated.
@hageboeck hageboeck requested a review from roiser November 9, 2020 12:39
@hageboeck hageboeck self-assigned this Nov 9, 2020
Copy link
Copy Markdown
Member

@roiser roiser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

alles gut

@hageboeck hageboeck merged commit 599e5c7 into madgraph5:master Nov 10, 2020
@hageboeck hageboeck deleted the randomNumbers branch November 10, 2020 11:02
@valassi
Copy link
Copy Markdown
Member

valassi commented Nov 10, 2020

Hi @hageboeck and @roiser thanks for the commit!

One point, I would prefer to have all #defines to generate #ifdefs in the code, rather than if(true) or if(false), at this stage. I do not claim that what I have done so far is the best option, but I'd like to have a consistent way to set this: so far, this is choosing one and only one of the #defines for each option in the config file.

Is it ok for you if I modify that in this direction? (By the way I have made quite a few changes that I'm about to push, so I merged already a minor conflict).

Thanks
Andrea

@valassi
Copy link
Copy Markdown
Member

valassi commented Nov 10, 2020

One other point, I'd rather keep some naming consistent (sorry, my definition of consistent ;-).

For instance I'd stick with hstRnarray as the thing that is passed between random number generation and rambo. Now there is an additional hstRn, I thought that was used to populate hstRnarray but it is a bit the opposite. I would also change that.

@valassi
Copy link
Copy Markdown
Member

valassi commented Nov 10, 2020

Sorry, one more point. Especially as we know this is not meant to be a production system but a solution to validate random numbers on host and device, I would favor clarity and metric calculation over speed. I mean, I am not sure I would use a vector that gets precomputed in async mode and has the size of #iterations. This is actually a nice smart trick! But I prefer to get it much clearer in the code how much time is spent in each phase. Eventually I may need to eat this back... but for now I'd stay with simpler code and generate random numbers at the beginning of each iteration.

Or maybe not. I need to think about it...

One possible issue is that these arrays can easily get far too large if they cover all iterations. Starting a new sequence (and reseeding) in each iteration would consistently follow what is done in the curand solution.

@hageboeck
Copy link
Copy Markdown
Member Author

Hi @hageboeck and @roiser thanks for the commit!

One point, I would prefer to have all #defines to generate #ifdefs in the code, rather than if(true) or if(false), at this stage. I do not claim that what I have done so far is the best option, but I'd like to have a consistent way to set this: so far, this is choosing one and only one of the #defines for each option in the config file.

Is it ok for you if I modify that in this direction? (By the way I have made quite a few changes that I'm about to push, so I merged already a minor conflict).

Thanks
Andrea

Sure, I actually tried that but gave up, because both CPU and GPU work with/without common numbers. That is, you need 4 instead of 2 cases to be defined or not, which means that 3/4 of the cases are not checked by the compiler, and that you have to cycle through all four until you know that the code compiles.
Can be done, though.

On a related note:
Would constexpr bool someFlag = true/false be on option?

  • The code of both branches is checked by the compiler.
  • Assembly is only generated for the active branch.

@hageboeck
Copy link
Copy Markdown
Member Author

One other point, I'd rather keep some naming consistent (sorry, my definition of consistent ;-).

For instance I'd stick with hstRnarray as the thing that is passed between random number generation and rambo. Now there is an additional hstRn, I thought that was used to populate hstRnarray but it is a bit the opposite. I would also change that.

This can also be done, but it needs more ifdefs. I had also done that, but when I realised that more and more code hides behind ifdefs, I pulled it out.
... unless I misunderstand and you just want a variable to be renamed.

@hageboeck
Copy link
Copy Markdown
Member Author

Sorry, one more point. Especially as we know this is not meant to be a production system but a solution to validate random numbers on host and device, I would favor clarity and metric calculation over speed. I mean, I am not sure I would use a vector that gets precomputed in async mode and has the size of #iterations. This is actually a nice smart trick! But I prefer to get it much clearer in the code how much time is spent in each phase. Eventually I may need to eat this back... but for now I'd stay with simpler code and generate random numbers at the beginning of each iteration.

Or maybe not. I need to think about it...

One possible issue is that these arrays can easily get far too large if they cover all iterations. Starting a new sequence (and reseeding) in each iteration would consistently follow what is done in the curand solution.

Yes, can also be done, but I was asked specifically if it's possible to do it in parallel. It's changing three lines of code or so, because CommonRandomNumbers.h exposes the function that creates a batch of random numbers.

@valassi
Copy link
Copy Markdown
Member

valassi commented Nov 10, 2020

Hi Stephan, thanks... and sorry for being a pain :-)
Before I comment on your replies, I noticed one more fundamental issue (while fixing ifdefs, which indeed gets complex!): when using curand on host, hstRnarray is a pinned memory array (using cudaMalloc), while using the common random generator essentially an std::vector is created and then its data() is used as C-array to be copied over to CUDA. In this case, the memory copy from host to device does an intermediate copy to pinned memory in any case, see https://developer.nvidia.com/blog/how-optimize-data-transfers-cuda-cc/. After looking this up, I find it easier to just use hstRnarray as pinned memory, and then memcpy the vector's data() to that pinned host array: this is what would in any case happen behind the scenes if we copied directly data() to the GPU, but at least it allows a simplifaction of the code about what hstRnarray is. This extra memcpy is very ugly, but I think is unavoidable. A nice improvement in your algo, if possible, would be to fill the preallocated rnArray rather than creating a new vector, maybe this is possible? For this validation code it is not essential, but I think it helps to think in terms of pinned memory even here.

That said, I saw your PR #47 which has many interesting and relevant points! But I'd like to complete this one first...

Thanks! Lets discuss maybe at coffee tomorrow morning
Andrea

@valassi
Copy link
Copy Markdown
Member

valassi commented Nov 10, 2020

About your replies now:

  • I think the ifdefs are indeed a mess. Again, not sure this is the best, but I'd like to stay consisten with teh rest. A constexpr? I have not thought of that. Maybe I suggest to stay with the options we have now, and reassess after tomorrow's LHCb/ALICE talks, eventually we may need ifdefs also for AMD etc etc

  • About the async and parallel, you ar eright, in the end I am keeping that in. Maybe the timers get some slightly changed meanings, but ok. I am trying to make sure the timers are relatively consistent with the rest anyway (this is one thing I was working on this morning in PR Fix throughput metrics and improve float printouts. Merge (fixing conflicts) with @shageboe's common random nubers on host. #46

A demain
Andrea

valassi added a commit to valassi/madgraph4gpu that referenced this pull request Nov 10, 2020
…rray usage.

./gcheck.exe -p 16384 32 12
***********************************************************************
NumBlocksPerGrid           = 16384
NumThreadsPerBlock         = 32
NumIterations              = 12
-----------------------------------------------------------------------
FP precision               = DOUBLE (nan=0)
Complex type               = THRUST::COMPLEX
RanNumb memory layout      = AOSOA[4]
Momenta memory layout      = AOSOA[4]
Wavefunction GPU memory    = LOCAL
Random number generation   = CURAND DEVICE (CUDA code)
-----------------------------------------------------------------------
NumberOfEntries            = 12
TotalTime[Rnd+Rmb+ME] (123)= ( 8.692473e-02                 )  sec
TotalTime[Rambo+ME]    (23)= ( 7.913799e-02                 )  sec
TotalTime[RndNumGen]    (1)= ( 7.786739e-03                 )  sec
TotalTime[Rambo]        (2)= ( 7.001756e-02                 )  sec
TotalTime[MatrixElems]  (3)= ( 9.120431e-03                 )  sec
MeanTimeInMatrixElems      = ( 7.600359e-04                 )  sec
[Min,Max]TimeInMatrixElems = [ 7.568000e-04 ,  7.671370e-04 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed        = 6291456
EvtsPerSec[Rnd+Rmb+ME](123)= ( 7.237821e+07                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 7.949982e+07                 )  sec^-1
EvtsPerSec[MatrixElems] (3)= ( 6.898200e+08                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)  = 6291456
MeanMatrixElemValue        = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
[Min,Max]MatrixElemValue   = [ 6.071582e-03 ,  3.374925e-02 ]  GeV^0
StdDevMatrixElemValue      = ( 8.200854e-03                 )  GeV^0
MeanWeight                 = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight            = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight               = ( 0.000000e+00                 )
***********************************************************************
00 CudaFree :     0.877303 sec
0a ProcInit :     0.000488 sec
0b MemAlloc :     0.061239 sec
0c GenCreat :     0.009764 sec
0d SGoodHel :     0.001760 sec
1a GenSeed  :     0.000101 sec
1b GenRnGen :     0.007686 sec
2a RamboIni :     0.000178 sec
2b RamboFin :     0.000138 sec
2c CpDTHwgt :     0.006028 sec
2d CpDTHmom :     0.063673 sec
3a SigmaKin :     0.000165 sec
3b CpDTHmes :     0.008956 sec
4a DumpLoop :     0.023916 sec
8a CompStat :     0.045670 sec
9a GenDestr :     0.000054 sec
9b MemFree  :     0.012793 sec
9c CudReset :     0.049363 sec
9d DumpScrn :     0.000221 sec
9e DumpJson :     0.000008 sec
TOTAL       :     1.169505 sec
TOTAL (123) :     0.086925 sec
TOTAL  (23) :     0.079138 sec
TOTAL   (1) :     0.007787 sec
TOTAL   (2) :     0.070018 sec
TOTAL   (3) :     0.009120 sec
***********************************************************************

./check.exe -p 16384 32 12
***********************************************************************
NumBlocksPerGrid           = 16384
NumThreadsPerBlock         = 32
NumIterations              = 12
-----------------------------------------------------------------------
FP precision               = DOUBLE (nan=0)
Complex type               = STD::COMPLEX
RanNumb memory layout      = AOSOA[4]
Momenta memory layout      = AOSOA[4]
Random number generation   = CURAND (C++ code)
-----------------------------------------------------------------------
NumberOfEntries            = 12
TotalTime[Rnd+Rmb+ME] (123)= ( 1.787162e+01                 )  sec
TotalTime[Rambo+ME]    (23)= ( 1.753950e+01                 )  sec
TotalTime[RndNumGen]    (1)= ( 3.321243e-01                 )  sec
TotalTime[Rambo]        (2)= ( 1.204920e+00                 )  sec
TotalTime[MatrixElems]  (3)= ( 1.633458e+01                 )  sec
MeanTimeInMatrixElems      = ( 1.361215e+00                 )  sec
[Min,Max]TimeInMatrixElems = [ 1.360797e+00 ,  1.361576e+00 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed        = 6291456
EvtsPerSec[Rnd+Rmb+ME](123)= ( 3.520361e+05                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 3.587021e+05                 )  sec^-1
EvtsPerSec[MatrixElems] (3)= ( 3.851618e+05                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)  = 6291456
MeanMatrixElemValue        = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
[Min,Max]MatrixElemValue   = [ 6.071582e-03 ,  3.374925e-02 ]  GeV^0
StdDevMatrixElemValue      = ( 8.200854e-03                 )  GeV^0
MeanWeight                 = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight            = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight               = ( 0.000000e+00                 )
***********************************************************************
0a ProcInit :     0.000319 sec
0b MemAlloc :     0.050883 sec
0c GenCreat :     0.000824 sec
1a GenSeed  :     0.000105 sec
1b GenRnGen :     0.332019 sec
2a RamboIni :     0.083518 sec
2b RamboFin :     1.121402 sec
3a SigmaKin :    16.334579 sec
4a DumpLoop :     0.020090 sec
8a CompStat :     0.037017 sec
9a GenDestr :     0.000081 sec
9b MemFree  :     0.001111 sec
9d DumpScrn :     0.000187 sec
9e DumpJson :     0.000007 sec
TOTAL       :    17.982145 sec
TOTAL (123) :    17.871624 sec
TOTAL  (23) :    17.539499 sec
TOTAL   (1) :     0.332124 sec
TOTAL   (2) :     1.204920 sec
TOTAL   (3) :    16.334579 sec
***********************************************************************
valassi added a commit that referenced this pull request Nov 10, 2020
Improvements over PR #45. More consistent #ifdefs and hstRnarray usage.
@valassi
Copy link
Copy Markdown
Member

valassi commented Nov 10, 2020

Ok I have merged the changes I was discussing above, as PR #48
Lets discuss tomorrow

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants