Skip to content

Fix intermittent FPE crashes for SIMD (add volatile) and allow no-OpenMP builds#874

Merged
valassi merged 22 commits into
madgraph5:masterfrom
valassi:fpe
Jul 4, 2024
Merged

Fix intermittent FPE crashes for SIMD (add volatile) and allow no-OpenMP builds#874
valassi merged 22 commits into
madgraph5:masterfrom
valassi:fpe

Conversation

@valassi
Copy link
Copy Markdown
Member

@valassi valassi commented Jun 28, 2024

This is a WIP PR to allow no-OpenMP builds (#758) and to debug intermittent FPEs (#845).

This is starting from MR #873 which fixes color mappings and related issues.

The reason why OMP and FPEs are related is that I had some indication that the FPE crash is intermittent with OMP, but happens all the time without OMP. So disabling OMP might trigger the issue more often.
See f1e0d42

@valassi valassi requested a review from a team as a code owner June 28, 2024 17:47
@valassi valassi marked this pull request as draft June 28, 2024 17:47
valassi added 3 commits June 28, 2024 20:37
…ff OpenMP, to debug madgraph5#845

As previously observed, this crashes immediately (NB: it only crashes with AVX512 in '512z' mode!)

gdb ./madevent_cpp -ex 'set pagination off' -ex 'set confirm off' -ex 'set trace-commands on' \
  -ex 'run < /tmp/avalassi/input_gqttq_x1_cudacpp' -ex where -ex l -ex 'p okcol' -ex quit

Program received signal SIGFPE, Arithmetic exception.
0x00007ffff7f98dbd in mg5amcCpu::sigmaKin (allmomenta=0x7ffff76bf040, allcouplings=0x7ffff7b57040, allrndhel=<optimized out>, allrndcol=0x6300d00, allMEs=0x6310d80, channelId=channelId@entry=1, allNumerators=0x6341000, allDenominators=0x6351080, allselhel=0x6320e00, allselcol=0x6330e80, nevt=16384) at CPPProcess.cc:1197
1197                if( okcol )
+where
 0  0x00007ffff7f98dbd in mg5amcCpu::sigmaKin (allmomenta=0x7ffff76bf040, allcouplings=0x7ffff7b57040, allrndhel=<optimized out>, allrndcol=0x6300d00, allMEs=0x6310d80, channelId=channelId@entry=1, allNumerators=0x6341000, allDenominators=0x6351080, allselhel=0x6320e00, allselcol=0x6330e80, nevt=16384) at CPPProcess.cc:1197
 1  0x00007ffff7f9fa3e in mg5amcCpu::MatrixElementKernelHost::computeMatrixElements (this=0x6340ee0, channelId=channelId@entry=1) at MatrixElementKernels.cc:115
 2  0x00007ffff7fa52d2 in mg5amcCpu::Bridge<double>::cpu_sequence (goodHelOnly=false, selcol=0x7fffffc1cc70, selhel=0x7fffffc2cc70, mes=0x7fffffc3cc70, channelId=1, rndcol=0x7fffffc9cfd0, rndhel=0x7fffffcbcfd0, gs=0x1d35a68 <strong_+8>, momenta=<optimized out>, this=0x62e0a70) at /usr/include/c++/11/bits/unique_ptr.h:173
 3  fbridgesequence_ (ppbridge=<optimized out>, momenta=<optimized out>, gs=0x1d35a68 <strong_+8>, rndhel=0x7fffffcbcfd0, rndcol=0x7fffffc9cfd0, pchannelId=<optimized out>, mes=0x7fffffc3cc70, selhel=0x7fffffc2cc70, selcol=0x7fffffc1cc70) at fbridge.cc:106
 4  0x00000000004300ec in smatrix1_multi (p_multi=<error reading variable: value requires 2621440 bytes, which is more than max-value-size>, hel_rand=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, col_rand=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, channel=1, out=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, selected_hel=..., selected_col=..., vecsize_used=16384) at auto_dsig1.f:618
 5  0x0000000000431c71 in dsig1_vec (all_pp=<error reading variable: value requires 2621440 bytes, which is more than max-value-size>, all_xbk=<error reading variable: value requires 262144 bytes, which is more than max-value-size>, all_q2fact=<error reading variable: value requires 262144 bytes, which is more than max-value-size>, all_cm_rap=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, all_wgt=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, imode=0, all_out=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, vecsize_used=16384) at auto_dsig1.f:445
 6  0x0000000000432da8 in dsigproc_vec (all_p=..., all_xbk=<error reading variable: value requires 262144 bytes, which is more than max-value-size>, all_q2fact=<error reading variable: value requires 262144 bytes, which is more than max-value-size>, all_cm_rap=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, iconf=1, iproc=1, imirror=1, symconf=..., confsub=..., all_wgt=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, imode=0, all_out=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, vecsize_used=16384) at auto_dsig.f:1034
 7  0x0000000000433b7f in dsig_vec (all_p=..., all_wgt=..., all_xbk=..., all_q2fact=..., all_cm_rap=..., iconf=1, iproc=1, imirror=1, all_out=..., vecsize_used=16384) at auto_dsig.f:327
 8  0x000000000044a9c2 in sample_full (ndim=7, ncall=8192, itmax=1, itmin=1, dsig=0x433d70 <dsig>, ninvar=7, nconfigs=1, vecsize_used=16384) at dsample.f:208
 9  0x000000000042ebe0 in driver () at driver.f:257
 10 0x000000000040371f in main (argc=<optimized out>, argv=<optimized out>) at driver.f:302
 11 0x00007ffff743feb0 in __libc_start_call_main () from /lib64/libc.so.6
 12 0x00007ffff743ff60 in __libc_start_main_impl () from /lib64/libc.so.6
 13 0x0000000000403845 in _start ()
+l
1192    #if defined MGONGPU_CPPSIMD
1193                const bool okcol = allrndcol[ievt] < ( targetamp[icolC][ieppV] / targetamp[ncolor - 1][ieppV] );
1194    #else
1195                const bool okcol = allrndcol[ievt] < ( targetamp[icolC] / targetamp[ncolor - 1] );
1196    #endif
1197                if( okcol )
1198                {
1199                  allselcol[ievt] = icolC + 1; // NB Fortran [1,ncolor], cudacpp [0,ncolor-1]
1200                  break;
1201                }
+p okcol
$1 = <optimized out>
@valassi
Copy link
Copy Markdown
Member Author

valassi commented Jun 28, 2024

Note: even after disabling OMP in the CI tests, issue #845 has not shown up. The only 6 usual errors are from #826 and #872.

I think that I understand this better now: this issue ONLY happens for AVX512 (and specifically 512z with FPTYPE=f). Our CI on standard github nodes does not offer AVX512 (and it would be an overkill to set up our own nodes just for that). So I will debug this in manual tests.

valassi added 5 commits June 28, 2024 21:47
…adgraph5#845) by adding 'volatile', again

This no longer crashes

gdb ./madevent_cpp -ex 'set pagination off' -ex 'set confirm off' -ex 'set trace-commands on' \
  -ex 'run < /tmp/avalassi/input_gqttq_x1_cudacpp' -ex where -ex l -ex 'p okcol' -ex quit
… FPTYPE=f 512z builds (madgraph5#845) by adding 'volatile', again
…FPE crashes madgraph5#845 in 512z/f (and the option to hardcode away OMP)
…g of known issues: the CI will succeed

Revert "[fpe] in .github/workflows/testsuite_oneprocess.sh, disable bypassing of known issues: the CI will fail"
This reverts commit a5b7c42.
@valassi valassi changed the title WIP: allow no-OpenMP builds and debug intermittent FPEs Fix intermittent FPE crashes for SIMD (add volatile) and allow no-OpenMP builds and Jun 28, 2024
@valassi valassi marked this pull request as ready for review June 28, 2024 19:53
@valassi valassi requested a review from oliviermattelaer June 28, 2024 19:53
@valassi
Copy link
Copy Markdown
Member Author

valassi commented Jun 28, 2024

Hi @oliviermattelaer yet another one (we are getting there...).

I have fixed this intermittent crash #845 again by using 'volatile'. This time it should be less controversial I hope. The crash happens deep in SIMD code and only in specific configurations like 512z and fptype=f. So I am quite confident that disabling optimizations with volatile makes sense. Note this interesting post on SIMD and float, https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90993

En passant I added the option (that you had asked for in #758) to disable OpenMP. You just need to export OMPFLAGS= (set an empty string: this is different from unsetting it).

Can you please review?
NB This is based on #873, so I would FIRST review and merge that, and THEN review this (so it becomes easier to see only the additional differences).

Thanks!
Andrea

valassi added 4 commits June 29, 2024 13:00
STARTED  AT Sat Jun 29 01:07:28 PM CEST 2024
./tput/teeThroughputX.sh -mix -hrd -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -makeclean
ENDED(1) AT Sat Jun 29 02:40:30 PM CEST 2024 [Status=0]
./tput/teeThroughputX.sh -flt -hrd -makej -eemumu -ggtt -ggttgg -inlonly -makeclean
ENDED(2) AT Sat Jun 29 02:55:58 PM CEST 2024 [Status=0]
./tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -gqttq -ggttgg -ggttggg -flt -bridge -makeclean
ENDED(3) AT Sat Jun 29 03:03:51 PM CEST 2024 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -rmbhst
ENDED(4) AT Sat Jun 29 03:06:32 PM CEST 2024 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -curhst
ENDED(5) AT Sat Jun 29 03:09:12 PM CEST 2024 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -common
ENDED(6) AT Sat Jun 29 03:11:56 PM CEST 2024 [Status=0]
./tput/teeThroughputX.sh -mix -hrd -makej -susyggtt -susyggt1t1 -smeftggtttt -heftggbb -makeclean
ENDED(7) AT Sat Jun 29 03:23:34 PM CEST 2024 [Status=0]
…ected (failures in heft madgraph5#833 and susy madgraph5#826 - but intermittent gqttq madgraph5#845 is fixed)

Note two points:
- gqttq madgraph5#845 was intermittent, so the fact that it has disappeared could be casual: but I actually think it is fixed
- the tmad CI also shows pptt012j madgraph5#872, but I am not running pptt012j tests in the tmad suite yet

STARTED  AT Sat Jun 29 03:23:34 PM CEST 2024
(SM tests)
ENDED(1) AT Sat Jun 29 07:44:46 PM CEST 2024 [Status=0]
(BSM tests)
ENDED(1) AT Sat Jun 29 07:54:26 PM CEST 2024 [Status=0]

24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_d_inl0_hrd0.txt
1 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_m_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_d_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_f_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_m_inl0_hrd0.txt
@valassi
Copy link
Copy Markdown
Member Author

valassi commented Jul 1, 2024

I have rerun the manual tests. This is now ready to be merged (but it would be best to first agree on and merge mg5amcnlo/mg5amcnlo#115 and #873)

@valassi valassi marked this pull request as draft July 1, 2024 14:16
@valassi
Copy link
Copy Markdown
Member Author

valassi commented Jul 1, 2024

Putting this back to draft for the moment until we agree on #873 and #877

valassi added 4 commits July 3, 2024 15:25
… merging

git checkout upstream/master $(git ls-tree --name-only HEAD */CODEGEN*txt)
…er merging

git checkout upstream/master $(git ls-tree --name-only HEAD tput/logs* tmad/logs*)
@valassi valassi changed the title Fix intermittent FPE crashes for SIMD (add volatile) and allow no-OpenMP builds and Fix intermittent FPE crashes for SIMD (add volatile) and allow no-OpenMP builds Jul 3, 2024
@valassi valassi marked this pull request as ready for review July 3, 2024 13:35
@valassi
Copy link
Copy Markdown
Member Author

valassi commented Jul 3, 2024

I have merged the latest upstream/master (including PR #852 and PR #877) into this and marked it again as ready for review.

valassi added 3 commits July 4, 2024 15:37
…easier merging

git checkout upstream/master $(git ls-tree --name-only HEAD */CODEGEN*txt)
… to be sure - no change in code (two matrix1.pdf changed but content looks the same?)
@valassi
Copy link
Copy Markdown
Member Author

valassi commented Jul 4, 2024

Hi @oliviermattelaer I have updated this again with upstream/master, it is ready to be merged

Can you please review? (when you have time, not urgent)

Thanks
Andrea

@oliviermattelaer
Copy link
Copy Markdown
Member

Perfect to be merged (but I would like to keep #758 open, since I think that we should have an agreement on how we handle OpenMP by default from the madgraph5. (i.e. should I set OMPFLAGS= by default or not, should we have an entry in the run_card to control that /...) -> will copy-paste this message in #758 (since this is the proper place to discuss that).

Thanks a lot,

Olivier

@valassi
Copy link
Copy Markdown
Member Author

valassi commented Jul 4, 2024

Very good, thanks Olivier!

I will merge this then, and reopen #758

Note, there are 6 errors in the CI but they are the usual failing ones
image

@valassi valassi merged commit c5805fe into madgraph5:master Jul 4, 2024
valassi added a commit to valassi/madgraph4gpu that referenced this pull request Jul 17, 2024
…from fpe PR madgraph5#874 and omp PR madgraph5#900 for OpenMP issue madgraph5#758 - bldall and runTest succeds
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

2 participants