Fix intermittent FPE crashes for SIMD (add volatile) and allow no-OpenMP builds#874
Conversation
…f OMPFLAGS is set externally to an empty string (madgraph5#758)
…setting OMPFLAGS= as implemented in madgraph5#758)
… of known issues: the CI will fail
…externally to an empty string (madgraph5#758)
…ff OpenMP, to debug madgraph5#845 As previously observed, this crashes immediately (NB: it only crashes with AVX512 in '512z' mode!) gdb ./madevent_cpp -ex 'set pagination off' -ex 'set confirm off' -ex 'set trace-commands on' \ -ex 'run < /tmp/avalassi/input_gqttq_x1_cudacpp' -ex where -ex l -ex 'p okcol' -ex quit Program received signal SIGFPE, Arithmetic exception. 0x00007ffff7f98dbd in mg5amcCpu::sigmaKin (allmomenta=0x7ffff76bf040, allcouplings=0x7ffff7b57040, allrndhel=<optimized out>, allrndcol=0x6300d00, allMEs=0x6310d80, channelId=channelId@entry=1, allNumerators=0x6341000, allDenominators=0x6351080, allselhel=0x6320e00, allselcol=0x6330e80, nevt=16384) at CPPProcess.cc:1197 1197 if( okcol ) +where 0 0x00007ffff7f98dbd in mg5amcCpu::sigmaKin (allmomenta=0x7ffff76bf040, allcouplings=0x7ffff7b57040, allrndhel=<optimized out>, allrndcol=0x6300d00, allMEs=0x6310d80, channelId=channelId@entry=1, allNumerators=0x6341000, allDenominators=0x6351080, allselhel=0x6320e00, allselcol=0x6330e80, nevt=16384) at CPPProcess.cc:1197 1 0x00007ffff7f9fa3e in mg5amcCpu::MatrixElementKernelHost::computeMatrixElements (this=0x6340ee0, channelId=channelId@entry=1) at MatrixElementKernels.cc:115 2 0x00007ffff7fa52d2 in mg5amcCpu::Bridge<double>::cpu_sequence (goodHelOnly=false, selcol=0x7fffffc1cc70, selhel=0x7fffffc2cc70, mes=0x7fffffc3cc70, channelId=1, rndcol=0x7fffffc9cfd0, rndhel=0x7fffffcbcfd0, gs=0x1d35a68 <strong_+8>, momenta=<optimized out>, this=0x62e0a70) at /usr/include/c++/11/bits/unique_ptr.h:173 3 fbridgesequence_ (ppbridge=<optimized out>, momenta=<optimized out>, gs=0x1d35a68 <strong_+8>, rndhel=0x7fffffcbcfd0, rndcol=0x7fffffc9cfd0, pchannelId=<optimized out>, mes=0x7fffffc3cc70, selhel=0x7fffffc2cc70, selcol=0x7fffffc1cc70) at fbridge.cc:106 4 0x00000000004300ec in smatrix1_multi (p_multi=<error reading variable: value requires 2621440 bytes, which is more than max-value-size>, hel_rand=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, col_rand=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, channel=1, out=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, selected_hel=..., selected_col=..., vecsize_used=16384) at auto_dsig1.f:618 5 0x0000000000431c71 in dsig1_vec (all_pp=<error reading variable: value requires 2621440 bytes, which is more than max-value-size>, all_xbk=<error reading variable: value requires 262144 bytes, which is more than max-value-size>, all_q2fact=<error reading variable: value requires 262144 bytes, which is more than max-value-size>, all_cm_rap=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, all_wgt=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, imode=0, all_out=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, vecsize_used=16384) at auto_dsig1.f:445 6 0x0000000000432da8 in dsigproc_vec (all_p=..., all_xbk=<error reading variable: value requires 262144 bytes, which is more than max-value-size>, all_q2fact=<error reading variable: value requires 262144 bytes, which is more than max-value-size>, all_cm_rap=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, iconf=1, iproc=1, imirror=1, symconf=..., confsub=..., all_wgt=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, imode=0, all_out=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, vecsize_used=16384) at auto_dsig.f:1034 7 0x0000000000433b7f in dsig_vec (all_p=..., all_wgt=..., all_xbk=..., all_q2fact=..., all_cm_rap=..., iconf=1, iproc=1, imirror=1, all_out=..., vecsize_used=16384) at auto_dsig.f:327 8 0x000000000044a9c2 in sample_full (ndim=7, ncall=8192, itmax=1, itmin=1, dsig=0x433d70 <dsig>, ninvar=7, nconfigs=1, vecsize_used=16384) at dsample.f:208 9 0x000000000042ebe0 in driver () at driver.f:257 10 0x000000000040371f in main (argc=<optimized out>, argv=<optimized out>) at driver.f:302 11 0x00007ffff743feb0 in __libc_start_call_main () from /lib64/libc.so.6 12 0x00007ffff743ff60 in __libc_start_main_impl () from /lib64/libc.so.6 13 0x0000000000403845 in _start () +l 1192 #if defined MGONGPU_CPPSIMD 1193 const bool okcol = allrndcol[ievt] < ( targetamp[icolC][ieppV] / targetamp[ncolor - 1][ieppV] ); 1194 #else 1195 const bool okcol = allrndcol[ievt] < ( targetamp[icolC] / targetamp[ncolor - 1] ); 1196 #endif 1197 if( okcol ) 1198 { 1199 allselcol[ievt] = icolC + 1; // NB Fortran [1,ncolor], cudacpp [0,ncolor-1] 1200 break; 1201 } +p okcol $1 = <optimized out>
|
Note: even after disabling OMP in the CI tests, issue #845 has not shown up. The only 6 usual errors are from #826 and #872. I think that I understand this better now: this issue ONLY happens for AVX512 (and specifically 512z with FPTYPE=f). Our CI on standard github nodes does not offer AVX512 (and it would be an overkill to set up our own nodes just for that). So I will debug this in manual tests. |
…adgraph5#845) by adding 'volatile', again This no longer crashes gdb ./madevent_cpp -ex 'set pagination off' -ex 'set confirm off' -ex 'set trace-commands on' \ -ex 'run < /tmp/avalassi/input_gqttq_x1_cudacpp' -ex where -ex l -ex 'p okcol' -ex quit
… FPTYPE=f 512z builds (madgraph5#845) by adding 'volatile', again
…FPE crashes madgraph5#845 in 512z/f (and the option to hardcode away OMP)
…g of known issues: the CI will succeed Revert "[fpe] in .github/workflows/testsuite_oneprocess.sh, disable bypassing of known issues: the CI will fail" This reverts commit a5b7c42.
|
Hi @oliviermattelaer yet another one (we are getting there...). I have fixed this intermittent crash #845 again by using 'volatile'. This time it should be less controversial I hope. The crash happens deep in SIMD code and only in specific configurations like 512z and fptype=f. So I am quite confident that disabling optimizations with volatile makes sense. Note this interesting post on SIMD and float, https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90993 En passant I added the option (that you had asked for in #758) to disable OpenMP. You just need to export OMPFLAGS= (set an empty string: this is different from unsetting it). Can you please review? Thanks! |
STARTED AT Sat Jun 29 01:07:28 PM CEST 2024 ./tput/teeThroughputX.sh -mix -hrd -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -makeclean ENDED(1) AT Sat Jun 29 02:40:30 PM CEST 2024 [Status=0] ./tput/teeThroughputX.sh -flt -hrd -makej -eemumu -ggtt -ggttgg -inlonly -makeclean ENDED(2) AT Sat Jun 29 02:55:58 PM CEST 2024 [Status=0] ./tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -gqttq -ggttgg -ggttggg -flt -bridge -makeclean ENDED(3) AT Sat Jun 29 03:03:51 PM CEST 2024 [Status=0] ./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -rmbhst ENDED(4) AT Sat Jun 29 03:06:32 PM CEST 2024 [Status=0] ./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -curhst ENDED(5) AT Sat Jun 29 03:09:12 PM CEST 2024 [Status=0] ./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -common ENDED(6) AT Sat Jun 29 03:11:56 PM CEST 2024 [Status=0] ./tput/teeThroughputX.sh -mix -hrd -makej -susyggtt -susyggt1t1 -smeftggtttt -heftggbb -makeclean ENDED(7) AT Sat Jun 29 03:23:34 PM CEST 2024 [Status=0]
…ected (failures in heft madgraph5#833 and susy madgraph5#826 - but intermittent gqttq madgraph5#845 is fixed) Note two points: - gqttq madgraph5#845 was intermittent, so the fact that it has disappeared could be casual: but I actually think it is fixed - the tmad CI also shows pptt012j madgraph5#872, but I am not running pptt012j tests in the tmad suite yet STARTED AT Sat Jun 29 03:23:34 PM CEST 2024 (SM tests) ENDED(1) AT Sat Jun 29 07:44:46 PM CEST 2024 [Status=0] (BSM tests) ENDED(1) AT Sat Jun 29 07:54:26 PM CEST 2024 [Status=0] 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_d_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_f_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_m_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_f_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_m_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_m_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_d_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_f_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_m_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_d_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_f_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_m_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_m_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_d_inl0_hrd0.txt 1 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_f_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_m_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_d_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_f_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_m_inl0_hrd0.txt 0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_d_inl0_hrd0.txt 0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_f_inl0_hrd0.txt 0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_m_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_d_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_f_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_m_inl0_hrd0.txt
|
I have rerun the manual tests. This is now ready to be merged (but it would be best to first agree on and merge mg5amcnlo/mg5amcnlo#115 and #873) |
… merging git checkout upstream/master $(git ls-tree --name-only HEAD */CODEGEN*txt)
…er merging git checkout upstream/master $(git ls-tree --name-only HEAD tput/logs* tmad/logs*)
…adgraph5#852 and color2 PR madgraph5#877, instead of color PR madgraph5#873) into fpe
…easier merging git checkout upstream/master $(git ls-tree --name-only HEAD */CODEGEN*txt)
… and madgraph5#881 fixes for madgraph5#856) into fpe
… to be sure - no change in code (two matrix1.pdf changed but content looks the same?)
|
Hi @oliviermattelaer I have updated this again with upstream/master, it is ready to be merged
Can you please review? (when you have time, not urgent) Thanks |
|
Perfect to be merged (but I would like to keep #758 open, since I think that we should have an agreement on how we handle OpenMP by default from the madgraph5. (i.e. should I set OMPFLAGS= by default or not, should we have an entry in the run_card to control that /...) -> will copy-paste this message in #758 (since this is the proper place to discuss that). Thanks a lot, Olivier |
|
Very good, thanks Olivier! I will merge this then, and reopen #758 Note, there are 6 errors in the CI but they are the usual failing ones |
…from fpe PR madgraph5#874 and omp PR madgraph5#900 for OpenMP issue madgraph5#758 - bldall and runTest succeds

This is a WIP PR to allow no-OpenMP builds (#758) and to debug intermittent FPEs (#845).
This is starting from MR #873 which fixes color mappings and related issues.
The reason why OMP and FPEs are related is that I had some indication that the FPE crash is intermittent with OMP, but happens all the time without OMP. So disabling OMP might trigger the issue more often.
See f1e0d42