WIP - het + klas2 + epoch1/epoch2 (Heterogeneous standalone application: GPU + SIMD CPU)#159
Closed
valassi wants to merge 127 commits into
./hcheck.exe -p 16384 32 1
***********************************************************************
NumBlocksPerGrid = 16384
NumThreadsPerBlock = 32
NumIterations = 1
-----------------------------------------------------------------------
FP precision = DOUBLE
Complex type = THRUST::COMPLEX
RanNumb memory layout = AOSOA[4]
Momenta memory layout = AOSOA[4]
Random number generation = CURAND DEVICE (CUDA code)
Wavefunction GPU memory = LOCAL
-----------------------------------------------------------------------
NumIterations = 1
TotalTime[Rnd+Rmb+ME] (123)= ( 7.312938e-03 ) sec
TotalTime[Rambo+ME] (23)= ( 6.714818e-03 ) sec
TotalTime[RndNumGen] (1)= ( 5.981200e-04 ) sec
TotalTime[Rambo] (2)= ( 5.945168e-03 ) sec
TotalTime[MatrixElems] (3)= ( 7.696500e-04 ) sec
MeanTimeInMatrixElems = ( 7.696500e-04 ) sec
[Min,Max]TimeInMatrixElems = [ 7.696500e-04 , 7.696500e-04 ] sec
-----------------------------------------------------------------------
TotalEventsComputed = 524288 (nan=0)
EvtsPerSec[Rnd+Rmb+ME](123)= ( 7.169321e+07 ) sec^-1
EvtsPerSec[Rmb+ME] (23)= ( 7.807926e+07 ) sec^-1
EvtsPerSec[MatrixElems] (3)= ( 6.812032e+08 ) sec^-1
***********************************************************************
NumMatrixElements(notNan) = 524288
MeanMatrixElemValue = ( 1.371958e-02 +- 1.132119e-05 ) GeV^0
[Min,Max]MatrixElemValue = [ 6.071582e-03 , 3.374915e-02 ] GeV^0
StdDevMatrixElemValue = ( 8.197419e-03 ) GeV^0
MeanWeight = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight = [ 4.515827e-01 , 4.515827e-01 ]
StdDevWeight = ( 0.000000e+00 )
***********************************************************************
(GPU) 00 CudaFree : 0.872603 sec
(GPU) 0a ProcInit : 0.000233 sec
(GPU) 0b MemAlloc : 0.035793 sec
(GPU) 0c GenCreat : 0.009814 sec
(GPU) 0d SGoodHel : 0.001759 sec
(GPU) 1a GenSeed : 0.000009 sec
(GPU) 1b GenRnGen : 0.000589 sec
(GPU) 2a RamboIni : 0.000022 sec
(GPU) 2b RamboFin : 0.000014 sec
(GPU) 2c CpDTHwgt : 0.000502 sec
(GPU) 2d CpDTHmom : 0.005407 sec
(GPU) 3a SigmaKin : 0.000014 sec
(GPU) 3b CpDTHmes : 0.000756 sec
(GPU) 4a DumpLoop : 0.004765 sec
(GPU) 8a CompStat : 0.003540 sec
(GPU) 9a GenDestr : 0.000048 sec
(GPU) 9b DumpScrn : 0.000044 sec
(GPU) 9c DumpJson : 0.000007 sec
(GPU) TOTAL : 0.935918 sec
(GPU) TOTAL (123) : 0.007313 sec
(GPU) TOTAL (23) : 0.006715 sec
(GPU) TOTAL (1) : 0.000598 sec
(GPU) TOTAL (2) : 0.005945 sec
(GPU) TOTAL (3) : 0.000770 sec
***********************************************************************
***********************************************************************
NumBlocksPerGrid = 16384
NumThreadsPerBlock = 32
NumIterations = 1
-----------------------------------------------------------------------
FP precision = DOUBLE
Complex type = STD::COMPLEX
RanNumb memory layout = AOSOA[4]
Momenta memory layout = AOSOA[4]
Random number generation = CURAND (C++ code)
OMP threads / maxthreads = 4 / 4
-----------------------------------------------------------------------
NumIterations = 1
TotalTime[Rnd+Rmb+ME] (123)= ( 5.725351e-01 ) sec
TotalTime[Rambo+ME] (23)= ( 5.449318e-01 ) sec
TotalTime[RndNumGen] (1)= ( 2.760323e-02 ) sec
TotalTime[Rambo] (2)= ( 9.914417e-02 ) sec
TotalTime[MatrixElems] (3)= ( 4.457877e-01 ) sec
MeanTimeInMatrixElems = ( 4.457877e-01 ) sec
[Min,Max]TimeInMatrixElems = [ 4.457877e-01 , 4.457877e-01 ] sec
-----------------------------------------------------------------------
TotalEventsComputed = 524288 (nan=0)
EvtsPerSec[Rnd+Rmb+ME](123)= ( 9.157308e+05 ) sec^-1
EvtsPerSec[Rmb+ME] (23)= ( 9.621167e+05 ) sec^-1
EvtsPerSec[MatrixElems] (3)= ( 1.176094e+06 ) sec^-1
***********************************************************************
NumMatrixElements(notNan) = 524288
MeanMatrixElemValue = ( 1.371958e-02 +- 1.132119e-05 ) GeV^0
[Min,Max]MatrixElemValue = [ 6.071582e-03 , 3.374915e-02 ] GeV^0
StdDevMatrixElemValue = ( 8.197419e-03 ) GeV^0
MeanWeight = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight = [ 4.515827e-01 , 4.515827e-01 ]
StdDevWeight = ( 0.000000e+00 )
***********************************************************************
(CPU) 0a ProcInit : 0.000331 sec
(CPU) 0b MemAlloc : 0.025358 sec
(CPU) 0c GenCreat : 0.000915 sec
(CPU) 1a GenSeed : 0.000009 sec
(CPU) 1b GenRnGen : 0.027595 sec
(CPU) 2a RamboIni : 0.006872 sec
(CPU) 2b RamboFin : 0.092273 sec
(CPU) 3a SigmaKin : 0.445788 sec
(CPU) 4a DumpLoop : 0.004605 sec
(CPU) 8a CompStat : 0.003633 sec
(CPU) 9a GenDestr : 0.000094 sec
(CPU) 9b DumpScrn : 0.004946 sec
(CPU) 9c DumpJson : 0.000008 sec
(CPU) TOTAL : 0.612425 sec
(CPU) TOTAL (123) : 0.572535 sec
(CPU) TOTAL (23) : 0.544932 sec
(CPU) TOTAL (1) : 0.027603 sec
(CPU) TOTAL (2) : 0.099144 sec
(CPU) TOTAL (3) : 0.445788 sec
***********************************************************************
-----------------------------------------------------------------------
TotalTime[Rnd+Rmb+ME] (123)= ( 5.798480e-01 ) sec
TotalTime[Rambo+ME] (23)= ( 5.516467e-01 ) sec
TotalTime[RndNumGen] (1)= ( 2.820135e-02 ) sec
TotalTime[Rambo] (2)= ( 1.050893e-01 ) sec
TotalTime[MatrixElems] (3)= ( 4.465573e-01 ) sec
-----------------------------------------------------------------------
TotalEventsComputed = 1048576
EvtsPerSec[Rnd+Rmb+ME](123)= ( 1.808364e+06 ) sec^-1
EvtsPerSec[Rmb+ME] (23)= ( 1.900811e+06 ) sec^-1
EvtsPerSec[MatrixElems] (3)= ( 2.348133e+06 ) sec^-1
-----------------------------------------------------------------------
This makes me realise the calculation is clearly wrong: one should add throughputs, not times (the wall time is the same on CPU and GPU!)
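The difference between the two ways of combining the numbers can be checked with a few lines of Python. This is only a sketch using the (123) figures from the 1-iteration log above; the variable and function names are illustrative and not taken from check.cc, and the "summed throughputs" result assumes that the GPU and CPU parts run concurrently over a comparable wall-clock window, as the message states.

```python
# Figures copied from the 1-iteration log above (each device computes 524288 events).
gpu_events, gpu_time = 524288, 7.312938e-03  # GPU TotalTime[Rnd+Rmb+ME] (123), seconds
cpu_events, cpu_time = 524288, 5.725351e-01  # CPU TotalTime[Rnd+Rmb+ME] (123), seconds


def throughput(events, seconds):
    """Events per second for one device."""
    return events / seconds


# What the combined summary printed: total events divided by the SUM of the times.
summed_times = (gpu_events + cpu_events) / (gpu_time + cpu_time)

# What the message proposes instead: add the per-device throughputs
# (assumption: both devices work concurrently, so the wall time is shared).
summed_throughputs = throughput(gpu_events, gpu_time) + throughput(cpu_events, cpu_time)

print(f"total events / summed times : {summed_times:.6e} events/s")
print(f"sum of device throughputs   : {summed_throughputs:.6e} events/s")
```

Dividing by the summed times gives a figure below the GPU throughput alone, which is the symptom that exposes the bug: adding a CPU cannot make a concurrent heterogeneous run slower than its GPU part.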
Decrease the GPU multiplier from 100 to 70 (itscrd03, with 4 OMP threads).
****************************************************************************
(GPU) NumBlocksPerGrid = 16384
(GPU) NumThreadsPerBlock = 32
(GPU) NumIterations = 700
----------------------------------------------------------------------------
(GPU) FP precision = DOUBLE
(GPU) Complex type = THRUST::COMPLEX
(GPU) RanNumb memory layout = AOSOA[4]
(GPU) Momenta memory layout = AOSOA[4]
(GPU) Random number generation = CURAND DEVICE (CUDA code)
(GPU) Wavefunction GPU memory = LOCAL
----------------------------------------------------------------------------
(GPU) NumIterations = 700
(GPU) TotalTime[Rnd+Rmb+ME] (123)= ( 5.835859e+00 ) sec
(GPU) TotalTime[Rambo+ME] (23)= ( 5.378664e+00 ) sec
(GPU) TotalTime[RndNumGen] (1)= ( 4.571957e-01 ) sec
(GPU) TotalTime[Rambo] (2)= ( 4.823339e+00 ) sec
(GPU) TotalTime[MatrixElems] (3)= ( 5.553249e-01 ) sec
(GPU) MeanTimeInMatrixElems = ( 7.933214e-04 ) sec
(GPU) [Min,Max]TimeInMatrixElems = [ 7.126600e-04 , 1.121232e-02 ] sec
----------------------------------------------------------------------------
(GPU) TotalEventsComputed = 367001600 (nan=0)
(GPU) EvtsPerSec[Rnd+Rmb+ME](123)= ( 6.288733e+07 ) sec^-1
(GPU) EvtsPerSec[Rmb+ME] (23)= ( 6.823286e+07 ) sec^-1
(GPU) EvtsPerSec[MatrixElems] (3)= ( 6.608772e+08 ) sec^-1
****************************************************************************
(GPU) NumMatrixElements(notNan) = 367001600
(GPU) MeanMatrixElemValue = ( 1.371705e-02 +- 4.280686e-07 ) GeV^0
(GPU) [Min,Max]MatrixElemValue = [ 6.071582e-03 , 3.374926e-02 ] GeV^0
(GPU) StdDevMatrixElemValue = ( 8.200632e-03 ) GeV^0
(GPU) MeanWeight = ( 4.515827e-01 +- 0.000000e+00 )
(GPU) [Min,Max]Weight = [ 4.515827e-01 , 4.515827e-01 ]
(GPU) StdDevWeight = ( 0.000000e+00 )
****************************************************************************
(GPU) 00 CudaFree : 1.040999 sec
(GPU) 0a ProcInit : 0.000255 sec
(GPU) 0b MemAlloc : 0.043748 sec
(GPU) 0c GenCreat : 0.035068 sec
(GPU) 0d SGoodHel : 0.001759 sec
(GPU) 1a GenSeed : 0.005864 sec
(GPU) 1b GenRnGen : 0.451332 sec
(GPU) 2a RamboIni : 0.010404 sec
(GPU) 2b RamboFin : 0.008416 sec
(GPU) 2c CpDTHwgt : 0.369433 sec
(GPU) 2d CpDTHmom : 4.435089 sec
(GPU) 3a SigmaKin : 0.010518 sec
(GPU) 3b CpDTHmes : 0.544807 sec
(GPU) 4a DumpLoop : 2.404587 sec
(GPU) 8a CompStat : 2.632091 sec
(GPU) 9a GenDestr : 0.000196 sec
(GPU) 9b DumpScrn : 0.000071 sec
(GPU) 9c DumpJson : 0.000007 sec
(GPU) TOTAL : 11.994643 sec
(GPU) TOTAL (123) : 5.835862 sec
(GPU) TOTAL (23) : 5.378666 sec
(GPU) TOTAL (1) : 0.457196 sec
(GPU) TOTAL (2) : 4.823342 sec
(GPU) TOTAL (3) : 0.555325 sec
****************************************************************************
****************************************************************************
(CPU) NumBlocksPerGrid = 16384
(CPU) NumThreadsPerBlock = 32
(CPU) NumIterations = 10
----------------------------------------------------------------------------
(CPU) FP precision = DOUBLE
(CPU) Complex type = STD::COMPLEX
(CPU) RanNumb memory layout = AOSOA[4]
(CPU) Momenta memory layout = AOSOA[4]
(CPU) Random number generation = CURAND (C++ code)
(CPU) OMP threads / maxthreads = 4 / 4
----------------------------------------------------------------------------
(CPU) NumIterations = 10
(CPU) TotalTime[Rnd+Rmb+ME] (123)= ( 5.245397e+00 ) sec
(CPU) TotalTime[Rambo+ME] (23)= ( 4.964864e+00 ) sec
(CPU) TotalTime[RndNumGen] (1)= ( 2.805332e-01 ) sec
(CPU) TotalTime[Rambo] (2)= ( 1.005407e+00 ) sec
(CPU) TotalTime[MatrixElems] (3)= ( 3.959456e+00 ) sec
(CPU) MeanTimeInMatrixElems = ( 3.959456e-01 ) sec
(CPU) [Min,Max]TimeInMatrixElems = [ 2.837697e-01 , 4.298637e-01 ] sec
----------------------------------------------------------------------------
(CPU) TotalEventsComputed = 5242880 (nan=0)
(CPU) EvtsPerSec[Rnd+Rmb+ME](123)= ( 9.995201e+05 ) sec^-1
(CPU) EvtsPerSec[Rmb+ME] (23)= ( 1.055997e+06 ) sec^-1
(CPU) EvtsPerSec[MatrixElems] (3)= ( 1.324141e+06 ) sec^-1
****************************************************************************
(CPU) NumMatrixElements(notNan) = 5242880
(CPU) MeanMatrixElemValue = ( 1.372304e-02 +- 3.581814e-06 ) GeV^0
(CPU) [Min,Max]MatrixElemValue = [ 6.071582e-03 , 3.374925e-02 ] GeV^0
(CPU) StdDevMatrixElemValue = ( 8.201400e-03 ) GeV^0
(CPU) MeanWeight = ( 4.515827e-01 +- 0.000000e+00 )
(CPU) [Min,Max]Weight = [ 4.515827e-01 , 4.515827e-01 ]
(CPU) StdDevWeight = ( 0.000000e+00 )
****************************************************************************
(CPU) 0a ProcInit : 0.015581 sec
(CPU) 0b MemAlloc : 0.028460 sec
(CPU) 0c GenCreat : 0.000978 sec
(CPU) 1a GenSeed : 0.000084 sec
(CPU) 1b GenRnGen : 0.280449 sec
(CPU) 2a RamboIni : 0.072770 sec
(CPU) 2b RamboFin : 0.932637 sec
(CPU) 3a SigmaKin : 3.959456 sec
(CPU) 4a DumpLoop : 0.034727 sec
(CPU) 8a CompStat : 0.038073 sec
(CPU) 9a GenDestr : 0.000141 sec
(CPU) 9b DumpScrn : 0.061204 sec
(CPU) 9c DumpJson : 0.000011 sec
(CPU) TOTAL : 5.424570 sec
(CPU) TOTAL (123) : 5.245397 sec
(CPU) TOTAL (23) : 4.964864 sec
(CPU) TOTAL (1) : 0.280533 sec
(CPU) TOTAL (2) : 1.005407 sec
(CPU) TOTAL (3) : 3.959456 sec
****************************************************************************
----------------------------------------------------------------------------
(GPU) TotalEventsComputed = 367001600
(GPU) TotalTime[Rnd+Rmb+ME] (123)= ( 5.835859e+00 ) sec
(GPU) TotalTime[Rambo+ME] (23)= ( 5.378664e+00 ) sec
(GPU) TotalTime[RndNumGen] (1)= ( 4.571957e-01 ) sec
(GPU) TotalTime[Rambo] (2)= ( 4.823339e+00 ) sec
(GPU) TotalTime[MatrixElems] (3)= ( 5.553249e-01 ) sec
----------------------------------------------------------------------------
(GPU) TotalEventsComputed = 367001600
(GPU) EvtsPerSec[Rnd+Rmb+ME](123)= ( 6.288733e+07 ) sec^-1
(GPU) EvtsPerSec[Rmb+ME] (23)= ( 6.823286e+07 ) sec^-1
(GPU) EvtsPerSec[MatrixElems] (3)= ( 6.608772e+08 ) sec^-1
****************************************************************************
----------------------------------------------------------------------------
(CPU) TotalEventsComputed = 5242880
(CPU) TotalTime[Rnd+Rmb+ME] (123)= ( 5.245397e+00 ) sec
(CPU) TotalTime[Rambo+ME] (23)= ( 4.964864e+00 ) sec
(CPU) TotalTime[RndNumGen] (1)= ( 2.805332e-01 ) sec
(CPU) TotalTime[Rambo] (2)= ( 1.005407e+00 ) sec
(CPU) TotalTime[MatrixElems] (3)= ( 3.959456e+00 ) sec
----------------------------------------------------------------------------
(CPU) TotalEventsComputed = 5242880
(CPU) EvtsPerSec[Rnd+Rmb+ME](123)= ( 9.995201e+05 ) sec^-1
(CPU) EvtsPerSec[Rmb+ME] (23)= ( 1.055997e+06 ) sec^-1
(CPU) EvtsPerSec[MatrixElems] (3)= ( 1.324141e+06 ) sec^-1
****************************************************************************
(HET) TotalEventsComputed = 372244480
(HET) EvtsPerSec[Rnd+Rmb+ME](123)= ( 6.388685e+07 ) sec^-1
(HET) EvtsPerSec[Rmb+ME] (23)= ( 6.928886e+07 ) sec^-1
(HET) EvtsPerSec[MatrixElems] (3)= ( 6.622013e+08 ) sec^-1
****************************************************************************
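As a rough cross-check of the chosen multiplier, the per-iteration times in the log above can be used to estimate which GPU/CPU iteration ratio would make the two (123) wall times equal. This is a hypothetical back-of-the-envelope sketch, not something the PR code does: it reads the "multiplier" as the ratio of GPU to CPU iterations (700 / 10 = 70 in this run), which is an inference from the log, and it ignores the phases outside (123) such as DumpLoop and CompStat.

```python
# Values copied from the 700/10-iteration log above.
gpu_iterations, gpu_time = 700, 5.835859  # (GPU) NumIterations, TotalTime (123) in seconds
cpu_iterations, cpu_time = 10, 5.245397   # (CPU) NumIterations, TotalTime (123) in seconds

# Time spent per iteration on each device (same grid size on both: 16384 x 32 events).
gpu_seconds_per_iteration = gpu_time / gpu_iterations
cpu_seconds_per_iteration = cpu_time / cpu_iterations

# Assumption: the multiplier is the number of GPU iterations per CPU iteration.
# Equal (123) wall times would then need this ratio:
balanced_multiplier = cpu_seconds_per_iteration / gpu_seconds_per_iteration

print(f"GPU: {gpu_seconds_per_iteration:.6f} s/iteration")
print(f"CPU: {cpu_seconds_per_iteration:.6f} s/iteration")
print(f"multiplier for equal (123) times: {balanced_multiplier:.1f}")
```

Under these assumptions the estimate lands somewhat below 70, which is in line with the GPU (123) time in this run being slightly longer than the CPU one.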
Fix conflicts in epoch1/cuda/ee_mumu/SubProcesses/P1_Sigma_sm_epem_mupmum/check.cc
Fix conflicts in epoch1/cuda/ee_mumu/SubProcesses/Makefile
Fix conflicts in epoch1/cuda/ee_mumu/SubProcesses/P1_Sigma_sm_epem_mupmum/check.cc
Re-implement the SEP79 solution in hetklas as in klas3
[hetklas] fix conflicts and enhance throughput12.sh for hetklas
Fix conflicts: epoch1/cuda/ee_mumu/SubProcesses/P1_Sigma_sm_epem_mupmum/throughput12.sh
Fix conflicts: epoch1/cuda/ee_mumu/SubProcesses/timermap.h
The printout is wrong...
(CPU) Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
(CPU) FP precision = DOUBLE (NaN/abnormal=0, zero=0)
(CPU) Internal loops fptype_sv = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
(CPU) OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 4.894957e+06 ) sec^-1
(CPU) MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
(CPU) TOTAL : 3.570639 sec
----- 9,041,553,537 cycles # 2.527 GHz
----- 16,426,678,315 instructions # 1.82 insn per cycle
----- 3.580557991 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 2572) (512y: 95) (512z: 0)
Still three issues: one error, no combined het throughput, and missing 3 in het (only 3a)
=========================================================================
(GPU) TOTAL : 2.721498 sec
(CPU) TOTAL : 3.118861 sec
(GPU) Process = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
(CPU) Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
(CPU) OMP threads / `nproc --all` = 4 / 4
(GPU) MeanMatrixElemValue = ( 1.371821e-02 +- 9.438398e-07 ) GeV^0
(CPU) MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
(CPU) Internal loops fptype_sv = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
(HET) EvtsPerSec[MatrixElems] (3) = ( 7.883282e+08 ) sec^-1
(GPU) EvtsPerSec[MECalcOnly] (3a) = ( 1.608714e+09 ) sec^-1
(GPU) EvtsPerSec[MatrixElems] (3) = ( 7.798953e+08 ) sec^-1
(CPU) EvtsPerSec[MECalcOnly] (3a) = ( 8.432842e+06 ) sec^-1
(CPU) EvtsPerSec[MatrixElems] (3) = ( 8.432842e+06 ) sec^-1
----- 28,771,618,960 instructions # 1.53 insn per cycle
----- 18,804,899,462 cycles # 2.585 GHz
----- 3.382948298 seconds time elapsed
=========================================================================
…script
-------------------------------------------------------------------------
(CPU) Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
(CPU) FP precision = DOUBLE (NaN/abnormal=0, zero=0)
(CPU) Internal loops fptype_sv = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
(CPU) OMP threads / `nproc --all` = 1 / 4
(CPU) EvtsPerSec[MECalcOnly] (3a) = ( 4.901010e+06 ) sec^-1
(CPU) MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
(CPU) TOTAL : 3.572420 sec
----- 9,052,563,886 cycles # 2.529 GHz
----- 16,426,685,901 instructions # 1.81 insn per cycle
----- 3.582549083 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 2572) (512y: 95) (512z: 0)
-------------------------------------------------------------------------
(CPU) Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
(CPU) FP precision = DOUBLE (NaN/abnormal=0, zero=0)
(CPU) Internal loops fptype_sv = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
(CPU) OMP threads / `nproc --all` = 4 / 4
(CPU) EvtsPerSec[MECalcOnly] (3a) = ( 1.856356e+07 ) sec^-1
(CPU) MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
(CPU) TOTAL : 2.619354 sec
----- 9,288,447,580 cycles # 2.523 GHz
----- 16,522,745,522 instructions # 1.78 insn per cycle
----- 2.629579127 seconds time elapsed
=========================================================================
…put - all ok now
On itscrd70.cern.ch (V100S-PCIE-32GB):
=========================================================================
(GPU) Process = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
(GPU) FP precision = DOUBLE (NaN/abnormal=0, zero=0)
(GPU) EvtsPerSec[MatrixElems] (3) = ( 7.279840e+08 ) sec^-1
(GPU) EvtsPerSec[MECalcOnly] (3a) = ( 1.376418e+09 ) sec^-1
(GPU) MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
(GPU) TOTAL : 0.981506 sec
----- 2,909,979,674 cycles # 2.647 GHz
----- 4,053,849,598 instructions # 1.39 insn per cycle
----- 1.274880943 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
=========================================================================
(CPU) Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
(CPU) FP precision = DOUBLE (NaN/abnormal=0, zero=0)
(CPU) Internal loops fptype_sv = SCALAR ('none': ~vector[1], no SIMD)
(CPU) OMP threads / `nproc --all` = 1 / 4
(CPU) EvtsPerSec[MECalcOnly] (3a) = ( 1.308184e+06 ) sec^-1
(CPU) MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
(CPU) TOTAL : 7.152095 sec
----- 19,148,855,993 cycles # 2.675 GHz
----- 48,541,498,678 instructions # 2.53 insn per cycle
----- 7.162556414 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 614) (avx2: 0) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
(CPU) Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
(CPU) FP precision = DOUBLE (NaN/abnormal=0, zero=0)
(CPU) Internal loops fptype_sv = SCALAR ('none': ~vector[1], no SIMD)
(CPU) OMP threads / `nproc --all` = 4 / 4
(CPU) EvtsPerSec[MECalcOnly] (3a) = ( 5.090300e+06 ) sec^-1
(CPU) MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
(CPU) TOTAL : 3.599609 sec
----- 19,594,680,071 cycles # 2.672 GHz
----- 48,686,184,693 instructions # 2.48 insn per cycle
----- 3.609641369 seconds time elapsed
-------------------------------------------------------------------------
(CPU) Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
(CPU) FP precision = DOUBLE (NaN/abnormal=0, zero=0)
(CPU) Internal loops fptype_sv = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
(CPU) OMP threads / `nproc --all` = 1 / 4
(CPU) EvtsPerSec[MECalcOnly] (3a) = ( 4.902057e+06 ) sec^-1
(CPU) MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
(CPU) TOTAL : 3.573937 sec
----- 9,045,687,703 cycles # 2.526 GHz
----- 16,426,907,941 instructions # 1.82 insn per cycle
----- 3.584309517 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 2572) (512y: 95) (512z: 0)
-------------------------------------------------------------------------
(CPU) Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
(CPU) FP precision = DOUBLE (NaN/abnormal=0, zero=0)
(CPU) Internal loops fptype_sv = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
(CPU) OMP threads / `nproc --all` = 4 / 4
(CPU) EvtsPerSec[MECalcOnly] (3a) = ( 1.856200e+07 ) sec^-1
(CPU) MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
(CPU) TOTAL : 2.618127 sec
----- 9,274,911,378 cycles # 2.521 GHz
----- 16,520,288,091 instructions # 1.78 insn per cycle
----- 2.628126061 seconds time elapsed
=========================================================================
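The four CPU results above (scalar vs '512y' builds, 1 vs 4 OMP threads) can be reduced to speedup factors. The snippet below is a small sketch that only divides the EvtsPerSec[MECalcOnly] (3a) values quoted in the log; the dictionary keys and variable names are illustrative.

```python
# EvtsPerSec[MECalcOnly] (3a) values from the itscrd70 CPU runs above,
# keyed by (build, number of OMP threads).
me_throughput = {
    ("none", 1): 1.308184e06,
    ("none", 4): 5.090300e06,
    ("512y", 1): 4.902057e06,
    ("512y", 4): 1.856200e07,
}

# Gain from SIMD alone (single thread) and from OMP alone (per build).
simd_speedup = me_throughput[("512y", 1)] / me_throughput[("none", 1)]
omp_speedup_scalar = me_throughput[("none", 4)] / me_throughput[("none", 1)]
omp_speedup_simd = me_throughput[("512y", 4)] / me_throughput[("512y", 1)]

# Combined gain of SIMD plus 4 threads over the scalar single-thread baseline.
combined_speedup = me_throughput[("512y", 4)] / me_throughput[("none", 1)]

print(f"SIMD ('512y' vs 'none', 1 thread): x{simd_speedup:.2f}")
print(f"OMP 4 threads, scalar build      : x{omp_speedup_scalar:.2f}")
print(f"OMP 4 threads, '512y' build      : x{omp_speedup_simd:.2f}")
print(f"SIMD + 4 threads vs scalar 1 thr : x{combined_speedup:.2f}")
```

Both effects are close to a factor 4 on their own, and they roughly multiply when combined.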
(GPU) TOTAL : 2.873521 sec
(CPU) TOTAL : 3.185181 sec
(GPU) Process = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
(CPU) Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
(CPU) OMP threads / `nproc --all` = 4 / 4
(GPU) MeanMatrixElemValue = ( 1.371821e-02 +- 9.438398e-07 ) GeV^0
(CPU) MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
(CPU) Internal loops fptype_sv = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
(HET) EvtsPerSec[MECalcOnly] (3a) = ( 1.625534e+09 ) sec^-1
(HET) EvtsPerSec[MatrixElems] (3) = ( 7.636379e+08 ) sec^-1
(GPU) EvtsPerSec[MECalcOnly] (3a) = ( 1.617340e+09 ) sec^-1
(GPU) EvtsPerSec[MatrixElems] (3) = ( 7.554441e+08 ) sec^-1
(CPU) EvtsPerSec[MECalcOnly] (3a) = ( 8.193795e+06 ) sec^-1
(CPU) EvtsPerSec[MatrixElems] (3) = ( 8.193795e+06 ) sec^-1
----- 28,916,334,658 instructions # 1.52 insn per cycle
----- 19,055,815,350 cycles # 2.564 GHz
----- 3.441221844 seconds time elapsed
=========================================================================
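A quick way to confirm that the (HET) lines are now the sum of the (GPU) and (CPU) throughputs is to parse them back out of the printout. The following is a minimal sketch, assuming the line format shown above; the regular expression and helper names are illustrative and are not part of throughput12.sh, and the pattern only handles labels without '+' characters such as MECalcOnly and MatrixElems.

```python
import re

# Throughput lines copied from the heterogeneous run above.
LOG = """\
(HET) EvtsPerSec[MECalcOnly] (3a) = ( 1.625534e+09 ) sec^-1
(HET) EvtsPerSec[MatrixElems] (3) = ( 7.636379e+08 ) sec^-1
(GPU) EvtsPerSec[MECalcOnly] (3a) = ( 1.617340e+09 ) sec^-1
(GPU) EvtsPerSec[MatrixElems] (3) = ( 7.554441e+08 ) sec^-1
(CPU) EvtsPerSec[MECalcOnly] (3a) = ( 8.193795e+06 ) sec^-1
(CPU) EvtsPerSec[MatrixElems] (3) = ( 8.193795e+06 ) sec^-1
"""

# Captures: device tag, label, timer key such as "3a", and the numeric value.
PATTERN = re.compile(
    r"\((GPU|CPU|HET)\)\s+EvtsPerSec\[(\w+)\]\s+\((\w+)\)\s*=\s*\(\s*([0-9.eE+-]+)\s*\)"
)


def parse_throughputs(log):
    """Return {(device, key): events per second} for every matching line."""
    rates = {}
    for match in PATTERN.finditer(log):
        device, _label, key, value = match.groups()
        rates[(device, key)] = float(value)
    return rates


def relative_mismatch(rates, key):
    """Relative difference between HET and GPU + CPU for one timer key."""
    expected = rates[("GPU", key)] + rates[("CPU", key)]
    return abs(expected - rates[("HET", key)]) / rates[("HET", key)]


rates = parse_throughputs(LOG)
for key in ("3a", "3"):
    print(f"({key}) HET vs GPU+CPU relative mismatch: {relative_mismatch(rates, key):.1e}")
```

The residual mismatch comes only from the seven significant digits kept in the printout.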
Fix conflicts: SubProcesses/Makefile, SubProcesses/P1_Sigma_sm_epem_mupmum/throughput12.sh
…ging [fpsingle]
On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: NVIDIA Tesla V100S-PCIE-32GB]:
=========================================================================
(GPU) Process = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
(GPU) FP precision = DOUBLE (NaN/abnormal=0, zero=0)
(GPU) EvtsPerSec[MatrixElems] (3) = ( 6.457473e+08 ) sec^-1
(GPU) EvtsPerSec[MECalcOnly] (3a) = ( 1.359578e+09 ) sec^-1
(GPU) MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
(GPU) TOTAL : 0.761059 sec
----- 2,675,916,348 cycles # 2.649 GHz
----- 3,591,247,699 instructions # 1.34 insn per cycle
----- 1.069993088 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
=========================================================================
(CPU) Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
(CPU) FP precision = DOUBLE (NaN/abnormal=0, zero=0)
(CPU) Internal loops fptype_sv = SCALAR ('none': ~vector[1], no SIMD)
(CPU) OMP threads / `nproc --all` = 1 / 4
(CPU) EvtsPerSec[MECalcOnly] (3a) = ( 1.307903e+06 ) sec^-1
(CPU) MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
(CPU) TOTAL : 7.156365 sec
----- 19,161,936,422 cycles # 2.674 GHz
----- 48,541,618,465 instructions # 2.53 insn per cycle
----- 7.168349564 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 614) (avx2: 0) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
(CPU) Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
(CPU) FP precision = DOUBLE (NaN/abnormal=0, zero=0)
(CPU) Internal loops fptype_sv = SCALAR ('none': ~vector[1], no SIMD)
(CPU) OMP threads / `nproc --all` = 4 / 4
(CPU) EvtsPerSec[MECalcOnly] (3a) = ( 5.094698e+06 ) sec^-1
(CPU) MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
(CPU) TOTAL : 3.573914 sec
----- 19,516,575,155 cycles # 2.672 GHz
----- 48,677,989,536 instructions # 2.49 insn per cycle
----- 3.584290977 seconds time elapsed
-------------------------------------------------------------------------
(CPU) Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
(CPU) FP precision = DOUBLE (NaN/abnormal=0, zero=0)
(CPU) Internal loops fptype_sv = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
(CPU) OMP threads / `nproc --all` = 1 / 4
(CPU) EvtsPerSec[MECalcOnly] (3a) = ( 4.860371e+06 ) sec^-1
(CPU) MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
(CPU) TOTAL : 3.585583 sec
----- 9,077,576,394 cycles # 2.526 GHz
----- 16,426,539,381 instructions # 1.81 insn per cycle
----- 3.596330394 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 2572) (512y: 95) (512z: 0)
-------------------------------------------------------------------------
(CPU) Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
(CPU) FP precision = DOUBLE (NaN/abnormal=0, zero=0)
(CPU) Internal loops fptype_sv = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
(CPU) OMP threads / `nproc --all` = 4 / 4
(CPU) EvtsPerSec[MECalcOnly] (3a) = ( 1.854829e+07 ) sec^-1
(CPU) MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
(CPU) TOTAL : 2.629101 sec
----- 9,314,239,522 cycles # 2.524 GHz
----- 16,523,257,965 instructions # 1.77 insn per cycle
----- 2.640251813 seconds time elapsed
=========================================================================
(GPU) TOTAL : 3.686469 sec
(CPU) TOTAL : 3.228275 sec
(GPU) Process = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
(CPU) Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
(CPU) OMP threads / `nproc --all` = 4 / 4
(GPU) MeanMatrixElemValue = ( 1.371821e-02 +- 9.438398e-07 ) GeV^0
(CPU) MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
(CPU) Internal loops fptype_sv = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
(HET) EvtsPerSec[MECalcOnly] (3a) = ( 1.639484e+09 ) sec^-1
(HET) EvtsPerSec[MatrixElems] (3) = ( 6.908414e+08 ) sec^-1
(GPU) EvtsPerSec[MECalcOnly] (3a) = ( 1.631750e+09 ) sec^-1
(GPU) EvtsPerSec[MatrixElems] (3) = ( 6.831078e+08 ) sec^-1
(CPU) EvtsPerSec[MECalcOnly] (3a) = ( 7.733656e+06 ) sec^-1
(CPU) EvtsPerSec[MatrixElems] (3) = ( 7.733656e+06 ) sec^-1
----- 30,229,999,093 instructions # 1.46 insn per cycle
----- 20,707,802,527 cycles # 2.479 GHz
----- 4.051993237 seconds time elapsed
=========================================================================
=========================================================================
(GPU) TOTAL : 3.311079 sec
(CPU) TOTAL : 3.204850 sec
(GPU) Process = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
(CPU) Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
(CPU) OMP threads / `nproc --all` = 4 / 4
(GPU) MeanMatrixElemValue = ( 1.371821e-02 +- 9.438398e-07 ) GeV^0
(CPU) MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
(CPU) Internal loops fptype_sv = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
(HET) EvtsPerSec[MECalcOnly] (3a) = ( 1.639362e+09 ) sec^-1
(GPU) EvtsPerSec[MECalcOnly] (3a) = ( 1.631746e+09 ) sec^-1
(CPU) EvtsPerSec[MECalcOnly] (3a) = ( 7.616442e+06 ) sec^-1
(HET) EvtsPerSec[MatrixElems] (3) = ( 6.722884e+08 ) sec^-1
(GPU) EvtsPerSec[MatrixElems] (3) = ( 6.646720e+08 ) sec^-1
(CPU) EvtsPerSec[MatrixElems] (3) = ( 7.616442e+06 ) sec^-1
----- 3.685237043 seconds time elapsed
----- 30,318,258,978 instructions # 1.48 insn per cycle
----- 20,477,280,922 cycles # 2.569 GHz
=========================================================================
Fix conflicts: epoch1/cuda/ee_mumu/SubProcesses/P1_Sigma_sm_epem_mupmum/check.cc
On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: NVIDIA Tesla V100S-PCIE-32GB]:
=========================================================================
(GPU) Process = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221 (gcc 9.2.0)]
(GPU) FP precision = DOUBLE (NaN/abnormal=0, zero=0)
(GPU) EvtsPerSec[MatrixElems] (3) = ( 7.249753e+08 ) sec^-1
(GPU) EvtsPerSec[MECalcOnly] (3a) = ( 1.368568e+09 ) sec^-1
(GPU) MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
(GPU) TOTAL : 0.723287 sec
----- 2,548,040,683 cycles # 2.654 GHz
----- 3,487,699,528 instructions # 1.37 insn per cycle
----- 1.020851355 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
=========================================================================
(CPU) Process = EPOCH1_EEMUMU_CPP [gcc 9.2.0]
(CPU) FP precision = DOUBLE (NaN/abnormal=0, zero=0)
(CPU) Internal loops fptype_sv = SCALAR ('none': ~vector[1], no SIMD)
(CPU) OMP threads / `nproc --all` = 1 / 4
(CPU) EvtsPerSec[MECalcOnly] (3a) = ( 1.310905e+06 ) sec^-1
(CPU) MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
(CPU) TOTAL : 7.163021 sec
----- 19,171,190,104 cycles # 2.674 GHz
----- 48,541,549,607 instructions # 2.53 insn per cycle
----- 7.173169402 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 614) (avx2: 0) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
(CPU) Process = EPOCH1_EEMUMU_CPP [gcc 9.2.0]
(CPU) FP precision = DOUBLE (NaN/abnormal=0, zero=0)
(CPU) Internal loops fptype_sv = SCALAR ('none': ~vector[1], no SIMD)
(CPU) OMP threads / `nproc --all` = 4 / 4
(CPU) EvtsPerSec[MECalcOnly] (3a) = ( 5.096624e+06 ) sec^-1
(CPU) MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
(CPU) TOTAL : 3.570581 sec
----- 19,503,007,815 cycles # 2.672 GHz
----- 48,679,307,926 instructions # 2.50 insn per cycle
----- 3.580597835 seconds time elapsed
-------------------------------------------------------------------------
(CPU) Process = EPOCH1_EEMUMU_CPP [gcc 9.2.0]
(CPU) FP precision = DOUBLE (NaN/abnormal=0, zero=0)
(CPU) Internal loops fptype_sv = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
(CPU) OMP threads / `nproc --all` = 1 / 4
(CPU) EvtsPerSec[MECalcOnly] (3a) = ( 4.928517e+06 ) sec^-1
(CPU) MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
(CPU) TOTAL : 3.550336 sec
----- 9,000,866,949 cycles # 2.530 GHz
----- 16,427,363,324 instructions # 1.83 insn per cycle
----- 3.560183214 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 2572) (512y: 95) (512z: 0)
-------------------------------------------------------------------------
(CPU) Process = EPOCH1_EEMUMU_CPP [gcc 9.2.0]
(CPU) FP precision = DOUBLE (NaN/abnormal=0, zero=0)
(CPU) Internal loops fptype_sv = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
(CPU) OMP threads / `nproc --all` = 4 / 4
(CPU) EvtsPerSec[MECalcOnly] (3a) = ( 1.859750e+07 ) sec^-1
(CPU) MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
(CPU) TOTAL : 2.615760 sec
----- 9,265,829,089 cycles # 2.522 GHz
----- 16,521,299,678 instructions # 1.78 insn per cycle
----- 2.625727640 seconds time elapsed
=========================================================================
(GPU) TOTAL : 3.056203 sec
(CPU) TOTAL : 3.207751 sec
(GPU) Process = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221 (gcc 9.2.0)]
(CPU) Process = EPOCH1_EEMUMU_CPP [gcc 9.2.0]
(CPU) OMP threads / `nproc --all` = 4 / 4
(GPU) MeanMatrixElemValue = ( 1.371821e-02 +- 9.438398e-07 ) GeV^0
(CPU) MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
(CPU) Internal loops fptype_sv = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
(HET) EvtsPerSec[MECalcOnly] (3a) = ( 1.634803e+09 ) sec^-1
(GPU) EvtsPerSec[MECalcOnly] (3a) = ( 1.627118e+09 ) sec^-1
(CPU) EvtsPerSec[MECalcOnly] (3a) = ( 7.684278e+06 ) sec^-1
(HET) EvtsPerSec[MatrixElems] (3) = ( 8.016014e+08 ) sec^-1
(GPU) EvtsPerSec[MatrixElems] (3) = ( 7.939171e+08 ) sec^-1
(CPU) EvtsPerSec[MatrixElems] (3) = ( 7.684278e+06 ) sec^-1
----- 3.463894706 seconds time elapsed
----- 29,444,932,152 instructions # 1.49 insn per cycle
----- 19,727,119,505 cycles # 2.559 GHz
=========================================================================
Fix conflicts: epoch1/cuda/ee_mumu/SubProcesses/Makefile
Fix conflicts: epoch1/cuda/ee_mumu/SubProcesses/P1_Sigma_sm_epem_mupmum/check.cc, throughput12.sh
On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: 1x Tesla V100S-PCIE-32GB]:
=========================================================================
(GPU) Process = EPOCH1_EEMUMU_CUDA [nvcc 11.4.48 (gcc 9.2.0)]
(GPU) FP precision = DOUBLE (NaN/abnormal=0, zero=0)
(GPU) EvtsPerSec[MatrixElems] (3) = ( 7.018859e+08 ) sec^-1
(GPU) EvtsPerSec[MECalcOnly] (3a) = ( 1.365012e+09 ) sec^-1
(GPU) MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
(GPU) TOTAL : 0.973702 sec
----- 3,064,269,630 cycles # 2.627 GHz
----- 4,383,344,946 instructions # 1.43 insn per cycle
----- 1.279814223 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 122
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
=========================================================================
(CPU) Process = EPOCH1_EEMUMU_CPP [gcc 9.2.0]
(CPU) FP precision = DOUBLE (NaN/abnormal=0, zero=0)
(CPU) Internal loops fptype_sv = SCALAR ('none': ~vector[1], no SIMD)
(CPU) OMP threads / `nproc --all` = 1 / 4
(CPU) EvtsPerSec[MECalcOnly] (3a) = ( 1.317277e+06 ) sec^-1
(CPU) MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
(CPU) TOTAL : 7.194522 sec
----- 19,294,801,502 cycles # 2.674 GHz
----- 48,715,852,452 instructions # 2.52 insn per cycle
----- 7.219185289 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 614) (avx2: 0) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
(CPU) Process = EPOCH1_EEMUMU_CPP [gcc 9.2.0]
(CPU) FP precision = DOUBLE (NaN/abnormal=0, zero=0)
(CPU) Internal loops fptype_sv = SCALAR ('none': ~vector[1], no SIMD)
(CPU) OMP threads / `nproc --all` = 4 / 4
(CPU) EvtsPerSec[MECalcOnly] (3a) = ( 5.139811e+06 ) sec^-1
(CPU) MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
(CPU) TOTAL : 3.640767 sec
----- 19,638,855,299 cycles # 2.671 GHz
----- 48,838,525,937 instructions # 2.49 insn per cycle
----- 3.665748421 seconds time elapsed
-------------------------------------------------------------------------
(CPU) Process = EPOCH1_EEMUMU_CPP [gcc 9.2.0]
(CPU) FP precision = DOUBLE (NaN/abnormal=0, zero=0)
(CPU) Internal loops fptype_sv = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
(CPU) OMP threads / `nproc --all` = 1 / 4
(CPU) EvtsPerSec[MECalcOnly] (3a) = ( 4.913066e+06 ) sec^-1
(CPU) MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
(CPU) TOTAL : 3.633715 sec
----- 9,256,697,481 cycles # 2.533 GHz
----- 16,602,346,820 instructions # 1.79 insn per cycle
----- 3.658132763 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 2572) (512y: 95) (512z: 0)
-------------------------------------------------------------------------
(CPU) Process = EPOCH1_EEMUMU_CPP [gcc 9.2.0]
(CPU) FP precision = DOUBLE (NaN/abnormal=0, zero=0)
(CPU) Internal loops fptype_sv = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
(CPU) OMP threads / `nproc --all` = 4 / 4
(CPU) EvtsPerSec[MECalcOnly] (3a) = ( 1.875032e+07 ) sec^-1
(CPU) MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
(CPU) TOTAL : 2.694719 sec
----- 9,506,185,230 cycles # 2.529 GHz
----- 16,697,940,025 instructions # 1.76 insn per cycle
----- 2.719416431 seconds time elapsed
=========================================================================
(GPU) TOTAL : 3.718765 sec
(CPU) TOTAL : 3.258146 sec
(GPU) Process = EPOCH1_EEMUMU_CUDA [nvcc 11.4.48 (gcc 9.2.0)]
(CPU) Process = EPOCH1_EEMUMU_CPP [gcc 9.2.0]
(CPU) OMP threads / `nproc --all` = 4 / 4
(GPU) MeanMatrixElemValue = ( 1.371821e-02 +- 9.438398e-07 ) GeV^0
(CPU) MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
(CPU) Internal loops fptype_sv = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
(HET) EvtsPerSec[MECalcOnly] (3a) = ( 1.498103e+09 ) sec^-1
(GPU) EvtsPerSec[MECalcOnly] (3a) = ( 1.489995e+09 ) sec^-1
(CPU) EvtsPerSec[MECalcOnly] (3a) = ( 8.107556e+06 ) sec^-1
(HET) EvtsPerSec[MatrixElems] (3) = ( 7.484821e+08 ) sec^-1
(GPU) EvtsPerSec[MatrixElems] (3) = ( 7.403745e+08 ) sec^-1
(CPU) EvtsPerSec[MatrixElems] (3) = ( 8.107556e+06 ) sec^-1
----- 4.104060784 seconds time elapsed
----- 29,871,816,625 instructions # 1.46 insn per cycle
----- 20,504,834,952 cycles # 2.429 GHz
=========================================================================
Fix conflicts: epoch1/cuda/ee_mumu/SubProcesses/P1_Sigma_sm_epem_mupmum/check.cc, throughput12.sh
On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: 1x Tesla V100S-PCIE-32GB]:
=========================================================================
(GPU) Process = EPOCH1_EEMUMU_CUDA [nvcc 11.4.48 (gcc 9.2.0)] [inlineHel=0]
(GPU) FP precision = DOUBLE (NaN/abnormal=0, zero=0)
(GPU) EvtsPerSec[MatrixElems] (3) = ( 6.778464e+08 ) sec^-1
(GPU) EvtsPerSec[MECalcOnly] (3a) = ( 1.364384e+09 ) sec^-1
(GPU) MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
(GPU) TOTAL : 1.154516 sec
----- 3,125,724,102 cycles # 2.615 GHz
----- 4,464,428,917 instructions # 1.43 insn per cycle
----- 1.461968802 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 122
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
=========================================================================
(CPU) Process = EPOCH1_EEMUMU_CPP [gcc 9.2.0] [inlineHel=0]
(CPU) FP precision = DOUBLE (NaN/abnormal=0, zero=0)
(CPU) Internal loops fptype_sv = SCALAR ('none': ~vector[1], no SIMD)
(CPU) OMP threads / `nproc --all` = 1 / 4
(CPU) EvtsPerSec[MECalcOnly] (3a) = ( 1.316325e+06 ) sec^-1
(CPU) MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
(CPU) TOTAL : 7.226542 sec
----- 19,379,644,866 cycles # 2.673 GHz
----- 48,715,983,967 instructions # 2.51 insn per cycle
----- 7.252093290 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 614) (avx2: 0) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
(CPU) Process = EPOCH1_EEMUMU_CPP [gcc 9.2.0] [inlineHel=0]
(CPU) FP precision = DOUBLE (NaN/abnormal=0, zero=0)
(CPU) Internal loops fptype_sv = SCALAR ('none': ~vector[1], no SIMD)
(CPU) OMP threads / `nproc --all` = 4 / 4
(CPU) EvtsPerSec[MECalcOnly] (3a) = ( 5.148972e+06 ) sec^-1
(CPU) MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
(CPU) TOTAL : 3.690117 sec
----- 19,747,504,200 cycles # 2.670 GHz
----- 48,842,617,925 instructions # 2.47 insn per cycle
----- 3.716313535 seconds time elapsed
-------------------------------------------------------------------------
Fix conflicts in epoch1/cuda/ee_mumu/SubProcesses/P1_Sigma_sm_epem_mupmum/throughput12.sh
…/without inlining)
On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: 1x Tesla V100S-PCIE-32GB]:
=========================================================================
(GPU) Process = EPOCH1_EEMUMU_CUDA [nvcc 11.1.105 (gcc 10.2.0)] [inlineHel=0]
(GPU) FP precision = DOUBLE (NaN/abnormal=0, zero=0)
(GPU) EvtsPerSec[MatrixElems] (3) = ( 7.396100e+08 ) sec^-1
(GPU) EvtsPerSec[MECalcOnly] (3a) = ( 1.365518e+09 ) sec^-1
(GPU) MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
(GPU) TOTAL : 1.008505 sec
----- 965,476,144 cycles:u # 0.825 GHz
----- 1,911,254,017 instructions:u # 1.98 insn per cycle
----- 1.300031388 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
=========================================================================
(CPU) Process = EPOCH1_EEMUMU_CPP [gcc 10.2.0] [inlineHel=0]
(CPU) FP precision = DOUBLE (NaN/abnormal=0, zero=0)
(CPU) Internal loops fptype_sv = SCALAR ('none': ~vector[1], no SIMD)
(CPU) OMP threads / `nproc --all` = 1 / 4
(CPU) EvtsPerSec[MECalcOnly] (3a) = ( 1.301439e+06 ) sec^-1
(CPU) MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
(CPU) TOTAL : 7.179015 sec
----- 19,035,950,018 cycles:u # 2.650 GHz
----- 48,630,925,016 instructions:u # 2.55 insn per cycle
----- 7.187385177 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 638) (avx2: 0) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
(CPU) Process = EPOCH1_EEMUMU_CPP [gcc 10.2.0] [inlineHel=0]
(CPU) FP precision = DOUBLE (NaN/abnormal=0, zero=0)
(CPU) Internal loops fptype_sv = SCALAR ('none': ~vector[1], no SIMD)
(CPU) OMP threads / `nproc --all` = 4 / 4
(CPU) EvtsPerSec[MECalcOnly] (3a) = ( 4.714571e+06 ) sec^-1
(CPU) MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
(CPU) TOTAL : 3.675687 sec
----- 19,533,551,567 cycles:u # 2.644 GHz
----- 48,834,661,471 instructions:u # 2.50 insn per cycle
----- 3.684041956 seconds time elapsed
-------------------------------------------------------------------------
(CPU) Process = EPOCH1_EEMUMU_CPP [gcc 10.2.0] [inlineHel=0]
(CPU) FP precision = DOUBLE (NaN/abnormal=0, zero=0)
(CPU) Internal loops fptype_sv = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
(CPU) OMP threads / `nproc --all` = 1 / 4
(CPU) EvtsPerSec[MECalcOnly] (3a) = ( 4.857095e+06 ) sec^-1
(CPU) MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
(CPU) TOTAL : 3.590117 sec
----- 8,937,713,218 cycles:u # 2.486 GHz
----- 16,377,460,972 instructions:u # 1.83 insn per cycle
----- 3.598792181 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 2690) (512y: 51) (512z: 0)
-------------------------------------------------------------------------
(CPU) Process = EPOCH1_EEMUMU_CPP [gcc 10.2.0] [inlineHel=0]
(CPU) FP precision = DOUBLE (NaN/abnormal=0, zero=0)
(CPU) Internal loops fptype_sv = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
(CPU) OMP threads / `nproc --all` = 4 / 4
(CPU) EvtsPerSec[MECalcOnly] (3a) = ( 1.449487e+07 ) sec^-1
(CPU) MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
(CPU) TOTAL : 2.724771 sec
----- 9,392,150,099 cycles:u # 2.483 GHz
----- 16,574,584,936 instructions:u # 1.76 insn per cycle
----- 2.732934882 seconds time elapsed
=========================================================================
(GPU) TOTAL : 2.496799 sec
(CPU) TOTAL : 3.082594 sec
(GPU) Process = EPOCH1_EEMUMU_CUDA [nvcc 11.1.105 (gcc 10.2.0)] [inlineHel=0]
(CPU) Process = EPOCH1_EEMUMU_CPP [gcc 10.2.0] [inlineHel=0]
(CPU) OMP threads / `nproc --all` = 4 / 4
(GPU) MeanMatrixElemValue = ( 1.371821e-02 +- 9.438398e-07 ) GeV^0
(CPU) MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
(CPU) Internal loops fptype_sv = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
(HET) EvtsPerSec[MECalcOnly] (3a) = ( 1.619359e+09 ) sec^-1
(GPU) EvtsPerSec[MECalcOnly] (3a) = ( 1.610353e+09 ) sec^-1
(CPU) EvtsPerSec[MECalcOnly] (3a) = ( 9.006560e+06 ) sec^-1
(HET) EvtsPerSec[MatrixElems] (3) = ( 8.145232e+08 ) sec^-1
(GPU) EvtsPerSec[MatrixElems] (3) = ( 8.055166e+08 ) sec^-1
(CPU) EvtsPerSec[MatrixElems] (3) = ( 9.006560e+06 ) sec^-1
----- 3.340816454 seconds time elapsed
----- 23,689,488,998 instructions:u # 1.74 insn per cycle
----- 13,628,749,532 cycles:u # 1.957 GHz
=========================================================================
…rt - also clean up the code cosmetics
Works out of the box without conflicts and without the need for further changes
On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: 1x Tesla V100S-PCIE-32GB]:
=========================================================================
(GPU) Process = EPOCH1_EEMUMU_CUDA [nvcc 11.1.105 (gcc 10.2.0)] [inlineHel=0]
(GPU) FP precision = DOUBLE (NaN/abnormal=0, zero=0)
(GPU) EvtsPerSec[MatrixElems] (3) = ( 6.610417e+08 ) sec^-1
(GPU) EvtsPerSec[MECalcOnly] (3a) = ( 1.360324e+09 ) sec^-1
(GPU) MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
(GPU) TOTAL : 0.926625 sec
----- 667,682,753 cycles:u # 0.610 GHz
----- 1,269,068,165 instructions:u # 1.90 insn per cycle
----- 1.228634507 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
=========================================================================
(CPU) Process = EPOCH1_EEMUMU_CPP [gcc 10.2.0] [inlineHel=0]
(CPU) FP precision = DOUBLE (NaN/abnormal=0, zero=0)
(CPU) Internal loops fptype_sv = SCALAR ('none': ~vector[1], no SIMD)
(CPU) OMP threads / `nproc --all` = 1 / 4
(CPU) EvtsPerSec[MECalcOnly] (3a) = ( 1.292733e+06 ) sec^-1
(CPU) MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
(CPU) TOTAL : 7.196342 sec
----- 19,079,313,321 cycles:u # 2.650 GHz
----- 48,630,931,774 instructions:u # 2.55 insn per cycle
----- 7.205396444 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 636) (avx2: 0) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
(CPU) Process = EPOCH1_EEMUMU_CPP [gcc 10.2.0] [inlineHel=0]
(CPU) FP precision = DOUBLE (NaN/abnormal=0, zero=0)
(CPU) Internal loops fptype_sv = SCALAR ('none': ~vector[1], no SIMD)
(CPU) OMP threads / `nproc --all` = 4 / 4
(CPU) EvtsPerSec[MECalcOnly] (3a) = ( 4.792919e+06 ) sec^-1
(CPU) MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
(CPU) TOTAL : 3.703715 sec
----- 19,961,984,598 cycles:u # 2.645 GHz
----- 49,005,630,913 instructions:u # 2.45 insn per cycle
----- 3.712541751 seconds time elapsed
-------------------------------------------------------------------------
(CPU) Process = EPOCH1_EEMUMU_CPP [gcc 10.2.0] [inlineHel=0]
(CPU) FP precision = DOUBLE (NaN/abnormal=0, zero=0)
(CPU) Internal loops fptype_sv = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
(CPU) OMP threads / `nproc --all` = 1 / 4
(CPU) EvtsPerSec[MECalcOnly] (3a) = ( 4.922604e+06 ) sec^-1
(CPU) MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
(CPU) TOTAL : 3.572791 sec
----- 8,898,911,817 cycles:u # 2.487 GHz
----- 16,362,519,365 instructions:u # 1.84 insn per cycle
----- 3.581352506 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 2696) (512y: 52) (512z: 0)
-------------------------------------------------------------------------
(CPU) Process = EPOCH1_EEMUMU_CPP [gcc 10.2.0] [inlineHel=0]
(CPU) FP precision = DOUBLE (NaN/abnormal=0, zero=0)
(CPU) Internal loops fptype_sv = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
(CPU) OMP threads / `nproc --all` = 4 / 4
(CPU) EvtsPerSec[MECalcOnly] (3a) = ( 1.477452e+07 ) sec^-1
(CPU) MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
(CPU) TOTAL : 2.723580 sec
----- 9,382,125,796 cycles:u # 2.482 GHz
----- 16,562,376,736 instructions:u # 1.77 insn per cycle
----- 2.732325154 seconds time elapsed
=========================================================================
(GPU) TOTAL : 3.361818 sec
(CPU) TOTAL : 3.203409 sec
(GPU) Process = EPOCH1_EEMUMU_CUDA [nvcc 11.1.105 (gcc 10.2.0)] [inlineHel=0]
(CPU) Process = EPOCH1_EEMUMU_CPP [gcc 10.2.0] [inlineHel=0]
(CPU) OMP threads / `nproc --all` = 4 / 4
(GPU) MeanMatrixElemValue = ( 1.371821e-02 +- 9.438398e-07 ) GeV^0
(CPU) MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
(CPU) Internal loops fptype_sv = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
(HET) EvtsPerSec[MECalcOnly] (3a) = ( 1.645827e+09 ) sec^-1
(GPU) EvtsPerSec[MECalcOnly] (3a) = ( 1.637887e+09 ) sec^-1
(CPU) EvtsPerSec[MECalcOnly] (3a) = ( 7.939933e+06 ) sec^-1
(HET) EvtsPerSec[MatrixElems] (3) = ( 6.480231e+08 ) sec^-1
(GPU) EvtsPerSec[MatrixElems] (3) = ( 6.400832e+08 ) sec^-1
(CPU) EvtsPerSec[MatrixElems] (3) = ( 7.939933e+06 ) sec^-1
----- 3.710137015 seconds time elapsed
----- 24,631,025,349 instructions:u # 1.71 insn per cycle
----- 14,415,712,621 cycles:u # 1.806 GHz
=========================================================================
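A note on reading the (HET) lines in the logs above: in each block the reported (HET) throughput matches the sum of the (GPU) and (CPU) throughputs to within rounding. The sketch below checks this for the last block and computes the CPU share of the combined rate. The additive model and the helper names `combined_throughput` and `cpu_share` are assumptions made for illustration; they are not taken from the hcheck code.

```python
# Values copied from the last log block above (events per second).
GPU_ME = 1.637887e9  # (GPU) EvtsPerSec[MECalcOnly] (3a)
CPU_ME = 7.939933e6  # (CPU) EvtsPerSec[MECalcOnly] (3a)
HET_ME = 1.645827e9  # (HET) EvtsPerSec[MECalcOnly] (3a)


def combined_throughput(gpu_rate: float, cpu_rate: float) -> float:
    """Assumed model: both devices run concurrently, so their rates add."""
    return gpu_rate + cpu_rate


def cpu_share(gpu_rate: float, cpu_rate: float) -> float:
    """Fraction of the combined rate contributed by the CPU."""
    return cpu_rate / combined_throughput(gpu_rate, cpu_rate)


if __name__ == "__main__":
    total = combined_throughput(GPU_ME, CPU_ME)
    print(f"GPU + CPU      = {total:.6e} sec^-1")
    print(f"reported (HET) = {HET_ME:.6e} sec^-1")
    print(f"CPU share      = {100 * cpu_share(GPU_ME, CPU_ME):.2f} %")
```

With these numbers the CPU contributes roughly 0.5% of the combined matrix element throughput.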
This is a very old MR about heterogeneous processing for the standalone application, with some MEs on the CPU and some MEs on the GPU. It is so old that it would probably need significant reshuffling to resolve conflicts. Now that we are focusing on madevent, and that we know the CPU bottleneck is the non-ME Fortran component of madevent, heterogeneous strategies are better focused on CPU multithreading (MT) for that Fortran part. Closing this as unmerged, as suggested by @roiser.
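The bottleneck argument can be illustrated with Amdahl's law: if only the ME fraction of the run time is accelerated, the overall speedup is capped by the part that is left untouched. This is a minimal sketch; the function name `amdahl_speedup` and the fractions in the example are hypothetical and are not measurements from this PR or from madevent.

```python
def amdahl_speedup(accelerated_fraction: float, factor: float) -> float:
    """Overall speedup when a fraction of the run time is sped up by `factor`.

    Amdahl's law: S = 1 / ((1 - f) + f / s).
    """
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be in [0, 1]")
    if factor <= 0.0:
        raise ValueError("factor must be positive")
    serial_part = 1.0 - accelerated_fraction
    return 1.0 / (serial_part + accelerated_fraction / factor)


if __name__ == "__main__":
    # Hypothetical example: the ME part is 90% of the run time.
    for factor in (10.0, 100.0, 1000.0):
        print(f"ME speedup x{factor:g} -> overall x{amdahl_speedup(0.9, factor):.2f}")
```

Under that hypothetical 90% split, the overall speedup stays below 10 however fast the ME part becomes, which is why the remaining non-ME part is the natural target.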
I am creating this new DRAFT PR about heterogeneous GPU+CPU processing (issue #85).
This is essentially about the (very few!) changes initially added in PR #87 (het), which were later merged with the epoch12 work (PR #151) into a new PR #153 (hetep12).
This new PR replaces both #87 and #153, which will be closed. It includes not only the het and hetep12 (het + ep12) changes, but also the vectorization work (presently in klasep12 PR #152). It is based on a new branch hetklas, which merges hetep12 into klasep12.
I will eventually move here a few comments from the old het/hetep12 PRs.