Add feature to produce common random numbers. #45
- Add CommonRandomNumbers.h, a header to create deterministic sequences of random numbers for comparing different abstraction frameworks.
- Implement parallel/parallel+asynchronous generation.
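To illustrate what deterministic, asynchronous generation could look like, here is a minimal sketch using only the C++11 standard library. It is not the content of CommonRandomNumbers.h: the function names `generateNumbers` and `startGeneration`, the choice of `std::mt19937_64`, and the `baseSeed + i` seeding scheme are all assumptions made for the example. Note also that the output of `std::uniform_real_distribution` is only guaranteed to be reproducible for a given standard library implementation.

```cpp
#include <cstddef>
#include <future>
#include <random>
#include <vector>

// Hypothetical helper: returns n uniform doubles in [0, 1),
// fully determined by the seed.
std::vector<double> generateNumbers(std::size_t n, unsigned seed)
{
  std::mt19937_64 engine(seed);
  std::uniform_real_distribution<double> dist(0.0, 1.0);
  std::vector<double> out(n);
  for (double& x : out) x = dist(engine);
  return out;
}

// Hypothetical helper: launches one asynchronous task per iteration.
// Each task owns its engine and seed, so the numbers do not depend on
// how the tasks are scheduled.
std::vector<std::future<std::vector<double>>>
startGeneration(std::size_t nIterations, std::size_t nPerIteration, unsigned baseSeed)
{
  std::vector<std::future<std::vector<double>>> futures;
  futures.reserve(nIterations);
  for (std::size_t i = 0; i < nIterations; ++i)
  {
    // Assumed seeding scheme: one distinct seed per iteration.
    const unsigned seed = baseSeed + static_cast<unsigned>(i);
    futures.push_back(std::async(std::launch::async, generateNumbers, nPerIteration, seed));
  }
  return futures;
}
```

A caller would start the tasks before setting up the device and then call `get()` on the future of each iteration when its numbers are needed.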
To introduce common random numbers for all abstraction frameworks, it is beneficial to move code out of some #ifdef blocks. This increases the compiler coverage (fewer mistakes in deactivated paths) and reduces the number of blocks that will have to be added later.
An option to use common C++11 random numbers was added. These can be generated in parallel on the CPU while the GPU is being set up. The option is controlled by the macro MGONGPU_COMMONRAND_ONHOST, which is always defined, to either true or false. This way, the compiler always checks both code paths, but for the inactive path, no assembly is generated.
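The pattern described here can be sketched as follows. This is an illustration under stated assumptions, not the code from the PR: only the macro name MGONGPU_COMMONRAND_ONHOST comes from the description, while `generateCommonOnHost`, `generateFallback` and `generate` are hypothetical names, and the fallback generator is just a stand-in for the other path.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Assumption: the config header always defines the macro, to true or false.
#define MGONGPU_COMMONRAND_ONHOST true

// Hypothetical path 1: common C++11 random numbers generated on the host.
std::vector<double> generateCommonOnHost(std::size_t n, unsigned seed)
{
  std::mt19937_64 engine(seed);
  std::uniform_real_distribution<double> dist(0.0, 1.0);
  std::vector<double> out(n);
  for (double& x : out) x = dist(engine);
  return out;
}

// Hypothetical path 2: stand-in for the framework-specific generator.
std::vector<double> generateFallback(std::size_t n, unsigned seed)
{
  std::minstd_rand engine(seed);
  std::uniform_real_distribution<double> dist(0.0, 1.0);
  std::vector<double> out(n);
  for (double& x : out) x = dist(engine);
  return out;
}

std::vector<double> generate(std::size_t n, unsigned seed)
{
  // A plain `if` on a compile-time constant: both branches are parsed and
  // type-checked, and an optimizing compiler typically drops the dead one.
  if (MGONGPU_COMMONRAND_ONHOST)
    return generateCommonOnHost(n, seed);
  else
    return generateFallback(n, seed);
}
```

In contrast to an #ifdef, a typo in the inactive branch still produces a compile error here, which is the coverage benefit the description refers to.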
|
Hi @hageboeck and @roiser thanks for the commit! One point, I would prefer to have all #defines to generate #ifdefs in the code, rather than if(true) or if(false), at this stage. I do not claim that what I have done so far is the best option, but I'd like to have a consistent way to set this: so far, this is choosing one and only one of the #defines for each option in the config file. Is it ok for you if I modify that in this direction? (By the way I have made quite a few changes that I'm about to push, so I merged already a minor conflict). Thanks |
|
One other point, I'd rather keep some naming consistent (sorry, my definition of consistent ;-). For instance I'd stick with hstRnarray as the thing that is passed between random number generation and rambo. Now there is an additional hstRn, I thought that was used to populate hstRnarray but it is a bit the opposite. I would also change that. |
|
Sorry, one more point. Especially as we know this is not meant to be a production system but a solution to validate random numbers on host and device, I would favor clarity and metric calculation over speed. I mean, I am not sure I would use a vector that gets precomputed in async mode and has the size of #iterations. This is actually a nice smart trick! But I prefer to get it much clearer in the code how much time is spent in each phase. Eventually I may need to eat this back... but for now I'd stay with simpler code and generate random numbers at the beginning of each iteration. Or maybe not. I need to think about it... One possible issue is that these arrays can easily get far too large if they cover all iterations. Starting a new sequence (and reseeding) in each iteration would consistently follow what is done in the curand solution. |
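For reference, the alternative described above (a new sequence, reseeded at the beginning of each iteration, with a buffer that holds one iteration only) could look roughly like this. It is a sketch, not the code in the PR: `hstRnarray` is the name used in this thread, but `fillIteration` and the `baseSeed + iteration` seeding scheme are assumptions.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper: refills hstRnarray in place at the start of an iteration.
// The buffer size stays fixed, independent of the number of iterations.
void fillIteration(std::vector<double>& hstRnarray, unsigned baseSeed, unsigned iteration)
{
  // Assumed scheme: reseed in every iteration so that each iteration's
  // numbers depend only on (baseSeed, iteration).
  std::mt19937_64 engine(baseSeed + iteration);
  std::uniform_real_distribution<double> dist(0.0, 1.0);
  for (double& x : hstRnarray) x = dist(engine);
}
```

With this layout the time spent in generation is paid, and can be measured, inside each iteration, at the cost of not overlapping it with the device setup.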
Sure, I actually tried that but gave up, because both CPU and GPU work with/without common numbers. That is, you need 4 instead of 2 cases to be On a related note:
|
This can also be done, but it needs more |
Yes, can also be done, but I was asked specifically if it's possible to do it in parallel. It's changing three lines of code or so, because |
|
Hi Stephan, thanks... and sorry for being a pain :-) That said, I saw your PR #47 which has many interesting and relevant points! But I'd like to complete this one first... Thanks! Let's discuss maybe at coffee tomorrow morning |
|
About your replies now:
See you tomorrow |
…rray usage.

./gcheck.exe -p 16384 32 12
***********************************************************************
NumBlocksPerGrid            = 16384
NumThreadsPerBlock          = 32
NumIterations               = 12
-----------------------------------------------------------------------
FP precision                = DOUBLE (nan=0)
Complex type                = THRUST::COMPLEX
RanNumb memory layout       = AOSOA[4]
Momenta memory layout       = AOSOA[4]
Wavefunction GPU memory     = LOCAL
Random number generation    = CURAND DEVICE (CUDA code)
-----------------------------------------------------------------------
NumberOfEntries             = 12
TotalTime[Rnd+Rmb+ME] (123)= ( 8.692473e-02 ) sec
TotalTime[Rambo+ME]    (23)= ( 7.913799e-02 ) sec
TotalTime[RndNumGen]    (1)= ( 7.786739e-03 ) sec
TotalTime[Rambo]        (2)= ( 7.001756e-02 ) sec
TotalTime[MatrixElems]  (3)= ( 9.120431e-03 ) sec
MeanTimeInMatrixElems       = ( 7.600359e-04 ) sec
[Min,Max]TimeInMatrixElems  = [ 7.568000e-04 , 7.671370e-04 ] sec
-----------------------------------------------------------------------
TotalEventsComputed         = 6291456
EvtsPerSec[Rnd+Rmb+ME](123)= ( 7.237821e+07 ) sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 7.949982e+07 ) sec^-1
EvtsPerSec[MatrixElems] (3)= ( 6.898200e+08 ) sec^-1
***********************************************************************
NumMatrixElements(notNan)   = 6291456
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 ) GeV^0
[Min,Max]MatrixElemValue    = [ 6.071582e-03 , 3.374925e-02 ] GeV^0
StdDevMatrixElemValue       = ( 8.200854e-03 ) GeV^0
MeanWeight                  = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight             = [ 4.515827e-01 , 4.515827e-01 ]
StdDevWeight                = ( 0.000000e+00 )
***********************************************************************
00 CudaFree : 0.877303 sec
0a ProcInit : 0.000488 sec
0b MemAlloc : 0.061239 sec
0c GenCreat : 0.009764 sec
0d SGoodHel : 0.001760 sec
1a GenSeed  : 0.000101 sec
1b GenRnGen : 0.007686 sec
2a RamboIni : 0.000178 sec
2b RamboFin : 0.000138 sec
2c CpDTHwgt : 0.006028 sec
2d CpDTHmom : 0.063673 sec
3a SigmaKin : 0.000165 sec
3b CpDTHmes : 0.008956 sec
4a DumpLoop : 0.023916 sec
8a CompStat : 0.045670 sec
9a GenDestr : 0.000054 sec
9b MemFree  : 0.012793 sec
9c CudReset : 0.049363 sec
9d DumpScrn : 0.000221 sec
9e DumpJson : 0.000008 sec
TOTAL       : 1.169505 sec
TOTAL (123) : 0.086925 sec
TOTAL  (23) : 0.079138 sec
TOTAL   (1) : 0.007787 sec
TOTAL   (2) : 0.070018 sec
TOTAL   (3) : 0.009120 sec
***********************************************************************

./check.exe -p 16384 32 12
***********************************************************************
NumBlocksPerGrid            = 16384
NumThreadsPerBlock          = 32
NumIterations               = 12
-----------------------------------------------------------------------
FP precision                = DOUBLE (nan=0)
Complex type                = STD::COMPLEX
RanNumb memory layout       = AOSOA[4]
Momenta memory layout       = AOSOA[4]
Random number generation    = CURAND (C++ code)
-----------------------------------------------------------------------
NumberOfEntries             = 12
TotalTime[Rnd+Rmb+ME] (123)= ( 1.787162e+01 ) sec
TotalTime[Rambo+ME]    (23)= ( 1.753950e+01 ) sec
TotalTime[RndNumGen]    (1)= ( 3.321243e-01 ) sec
TotalTime[Rambo]        (2)= ( 1.204920e+00 ) sec
TotalTime[MatrixElems]  (3)= ( 1.633458e+01 ) sec
MeanTimeInMatrixElems       = ( 1.361215e+00 ) sec
[Min,Max]TimeInMatrixElems  = [ 1.360797e+00 , 1.361576e+00 ] sec
-----------------------------------------------------------------------
TotalEventsComputed         = 6291456
EvtsPerSec[Rnd+Rmb+ME](123)= ( 3.520361e+05 ) sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 3.587021e+05 ) sec^-1
EvtsPerSec[MatrixElems] (3)= ( 3.851618e+05 ) sec^-1
***********************************************************************
NumMatrixElements(notNan)   = 6291456
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 ) GeV^0
[Min,Max]MatrixElemValue    = [ 6.071582e-03 , 3.374925e-02 ] GeV^0
StdDevMatrixElemValue       = ( 8.200854e-03 ) GeV^0
MeanWeight                  = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight             = [ 4.515827e-01 , 4.515827e-01 ]
StdDevWeight                = ( 0.000000e+00 )
***********************************************************************
0a ProcInit : 0.000319 sec
0b MemAlloc : 0.050883 sec
0c GenCreat : 0.000824 sec
1a GenSeed  : 0.000105 sec
1b GenRnGen : 0.332019 sec
2a RamboIni : 0.083518 sec
2b RamboFin : 1.121402 sec
3a SigmaKin : 16.334579 sec
4a DumpLoop : 0.020090 sec
8a CompStat : 0.037017 sec
9a GenDestr : 0.000081 sec
9b MemFree  : 0.001111 sec
9d DumpScrn : 0.000187 sec
9e DumpJson : 0.000007 sec
TOTAL       : 17.982145 sec
TOTAL (123) : 17.871624 sec
TOTAL  (23) : 17.539499 sec
TOTAL   (1) : 0.332124 sec
TOTAL   (2) : 1.204920 sec
TOTAL   (3) : 16.334579 sec
***********************************************************************
Improvements over PR #45. More consistent #ifdefs and hstRnarray usage.
|
Ok I have merged the changes I was discussing above, as PR #48 |
tools/for creating reproducible common random numbers using C++11. Note that to increase the compiler coverage and to decrease the probability of errors, this uses
This way, the compiler checks both code paths for correct syntax, but no assembly is generated for the inactive path.