Fix runTest segfault (remove cudaDeviceReset) and simplify googletest template usage #909
Merged
…n both CPU and GPU (prepare for madgraph5#896) - the C++ tests succeed but the CUDA tests segfault madgraph5#903
…from release-1.11.0 to v1.14.0 to solve madgraph5#903, but the segfault remains - will revert
…ase-1.11.0
Revert "[gtest/june24] in CODEGEN cudacpp_test.mk, try to upgrade googletest from release-1.11.0 to v1.14.0 to solve madgraph5#903, but the segfault remains - will revert"
This reverts commit 34cd623.
…cc build in CUDA while debugging madgraph5#903
With testmisc.cc, valgrind gives a confusing error
==2887713== Stack overflow in thread #1: can't grow stack to 0x1ffe801000
==2887713==
==2887713== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==2887713==  Access not within mapped region at address 0x1FFE801FF8
==2887713== Stack overflow in thread #1: can't grow stack to 0x1ffe801000
==2887713==    at 0x449C06: mg5amcGpu::constexpr_sin_quad(long double, bool) (constexpr_math.h:156)
==2887713==  If you believe this happened as a result of a stack
==2887713==  overflow in your program's main thread (unlikely but
==2887713==  possible), you can try to increase the size of the
==2887713==  main thread stack using the --main-stacksize= flag.
==2887713==  The main thread stack size used in this run was 8388608.
==2887713==
==2887713== HEAP SUMMARY:
==2887713==     in use at exit: 21,309,363 bytes in 13,995 blocks
==2887713==   total heap usage: 18,083 allocs, 4,088 frees, 51,971,780 bytes allocated
==2887713==
==2887713== LEAK SUMMARY:
==2887713==    definitely lost: 0 bytes in 0 blocks
==2887713==    indirectly lost: 0 bytes in 0 blocks
==2887713==      possibly lost: 2,599,608 bytes in 825 blocks
==2887713==    still reachable: 18,709,755 bytes in 13,170 blocks
==2887713==         suppressed: 0 bytes in 0 blocks
==2887713== Rerun with --leak-check=full to see details of leaked memory
==2887713==
==2887713== For lists of detected and suppressed errors, rerun with: -s
==2887713== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
Segmentation fault (core dumped)
Without testmisc.cc instead
[ RUN      ] SIGMA_SM_GG_TTX_GPU2/MadgraphTest.CompareMomentaAndME/0
INFO: Opening reference file ../../test/ref/dump_CPUTest.Sigma_sm_gg_ttx.txt
==2889432== Invalid write of size 8
==2889432==    at 0x484E2DB: memmove (vg_replace_strmem.c:1385)
==2889432==    by 0x41A6EA: double* std::__copy_move<false, true, std::random_access_iterator_tag>::__copy_m<double>(double const*, double const*, double*) (stl_algobase.h:431)
==2889432==    by 0x41A49B: double* std::__copy_move_a2<false, double*, double*>(double*, double*, double*) (stl_algobase.h:494)
==2889432==    by 0x41A1A5: double* std::__copy_move_a1<false, double*, double*>(double*, double*, double*) (stl_algobase.h:522)
==2889432==    by 0x419F4D: double* std::__copy_move_a<false, __gnu_cxx::__normal_iterator<double*, std::vector<double, std::allocator<double> > >, double*>(__gnu_cxx::__normal_iterator<double*, std::vector<double, std::allocator<double> > >, __gnu_cxx::__normal_iterator<double*, std::vector<double, std::allocator<double> > >, double*) (stl_algobase.h:529)
==2889432==    by 0x419D0C: double* std::copy<__gnu_cxx::__normal_iterator<double*, std::vector<double, std::allocator<double> > >, double*>(__gnu_cxx::__normal_iterator<double*, std::vector<double, std::allocator<double> > >, __gnu_cxx::__normal_iterator<double*, std::vector<double, std::allocator<double> > >, double*) (stl_algobase.h:619)
==2889432==    by 0x419950: mg5amcGpu::CommonRandomNumberKernel::generateRnarray() (CommonRandomNumberKernel.cc:34)
==2889432==    by 0x44443D: CUDATest::prepareRandomNumbers(unsigned int) (runTest.cc:202)
==2889432==    by 0x440D98: MadgraphTest_CompareMomentaAndME_Test::TestBody() (MadgraphTest.h:253)
==2889432==    by 0x48790F: void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (gtest.cc:2607)
==2889432==    by 0x480EF8: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (gtest.cc:2643)
==2889432==    by 0x459587: testing::Test::Run() (gtest.cc:2682)
==2889432==  Address 0x2fc0f200 is not stack'd, malloc'd or (recently) free'd
==2889432==
==2889432==
==2889432== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==2889432==  Access not within mapped region at address 0x2FC0F200
==2889432==    at 0x484E2DB: memmove (vg_replace_strmem.c:1385)
...
Segmentation fault (core dumped)
…cc build while debugging madgraph5#903 also for C++
The test does not segfault without valgrind, but it does segfault in valgrind! (NB this is all related to debug builds, in C++ and in CUDA)
And with testmisc.cc, valgrind gives a confusing error for C++ (cppnone here) as in CUDA:
==2893804== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==2893804==  Access not within mapped region at address 0x1FFE801FF8
==2893804== Stack overflow in thread #1: can't grow stack to 0x1ffe801000
==2893804==    at 0x431835: mg5amcCpu::constexpr_sin_quad(long double, bool) (in /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_tt.mad/SubProcesses/P1_gg_ttx/runTest_cpp.exe)
So I disable testmisc but now the C++ test (cppnone here) no longer segfaults...?!
…pp.exe by adding -no-pie madgraph5#904
…ng OMP only for clang16 madgraph5#904
…6 builds madgraph5#904 (disabling OMP only for clang16; add -no-pie for fcheck_cpp.exe)
…pp.exe by adding -no-pie madgraph5#904
…ng OMP only for clang16 madgraph5#904
…6 builds madgraph5#904 (disabling OMP only for clang16; add -no-pie for fcheck_cpp.exe)
Revert "[gtest/june24] in gg_tt.mad cudacpp.mk, TEMPORARILY disable testmisc.cc build while debugging madgraph5#903 also for C++"
This reverts commit 944caab.
Will now test with clang16 (after recent fixes) and valgrind (after upgrading to 3.23)
…ster for easier merging
git checkout upstream/master $(git ls-tree --name-only HEAD */CODEGEN*txt)
…ng OMP only for clang17 madgraph5#904
…7 builds madgraph5#904 (disable OMP also for clang17)
…ng OMP only for clang17 madgraph5#904
…7 builds madgraph5#904 (disable OMP also for clang17)
…ster for easier merging
git checkout upstream/master $(git ls-tree --name-only HEAD */CODEGEN*txt)
…ph5#904: remove link-time -no-pie, add compiler-time -fPIC to fortran
…5#904: remove link-time -no-pie, add compiler-time -fPIC to fortran
…adgraph5#904, adding -fPIC to fortran compilation
…ODEGEN logs from the latest upstream/master for easier merging
git checkout upstream/master $(git ls-tree --name-only HEAD */CODEGEN*txt)
…g constexpr_sin: now valgrind on c++ runTest succeeds again?! However cuda still fails (even without valgrind) madgraph5#903
… now valgrind runTest_cpp.exe will fail
Revert "[gtest/june24] in gg_tt.mad testmisc.cc, comment out the section using constexpr_sin: now valgrind on c++ runTest succeeds again?!"
This reverts commit 975f7aacb8661807a329ec1f51b2d7d8dba45167.
…ph5#904: remove link-time -no-pie, add compiler-time -fPIC to fortran
…5#904: remove link-time -no-pie, add compiler-time -fPIC to fortran
…et() at the end, but an abort reappears
INFO: The following Floating Point Exceptions will cause SIGFPE program aborts: FE_DIVBYZERO, FE_INVALID, FE_OVERFLOW
[==========] Running 4 tests from 4 test suites.
[----------] Global test environment set-up.
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_XXX
[ RUN      ] SIGMA_SM_GG_TTX_GPU_XXX.testxxx
[       OK ] SIGMA_SM_GG_TTX_GPU_XXX.testxxx (1 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_XXX (1 ms total)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MISC
[ RUN      ] SIGMA_SM_GG_TTX_GPU_MISC.testmisc
[       OK ] SIGMA_SM_GG_TTX_GPU_MISC.testmisc (14 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MISC (14 ms total)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MADGRAPH1
[ RUN      ] SIGMA_SM_GG_TTX_GPU_MADGRAPH1.compareMomAndME
INFO: Opening reference file ../../test/ref/dump_CPUTest.Sigma_sm_gg_ttx.txt
[       OK ] SIGMA_SM_GG_TTX_GPU_MADGRAPH1.compareMomAndME (194 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MADGRAPH1 (194 ms total)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MADGRAPH2
[ RUN      ] SIGMA_SM_GG_TTX_GPU_MADGRAPH2.compareMomAndME
INFO: Opening reference file ../../test/ref/dump_CPUTest.Sigma_sm_gg_ttx.txt
[       OK ] SIGMA_SM_GG_TTX_GPU_MADGRAPH2.compareMomAndME (174 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MADGRAPH2 (174 ms total)
[----------] Global test environment tear-down
[==========] 4 tests from 4 test suites ran. (384 ms total)
[  PASSED  ] 4 tests.
INFO: No Floating Point Exceptions have been reported
ERROR! assertGpu: 'invalid argument' (1) in MemoryBuffers.h:155
runTest_cuda.exe: GpuRuntime.h:26: void assertGpu(cudaError_t, const char*, int, bool): Assertion `code == gpuSuccess' failed.
Aborted (core dumped)
…st.cc to the main in testxxx.cc, but an abort reappears
INFO: The following Floating Point Exceptions will cause SIGFPE program aborts: FE_DIVBYZERO, FE_INVALID, FE_OVERFLOW
[==========] Running 4 tests from 4 test suites.
[----------] Global test environment set-up.
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_XXX
[ RUN      ] SIGMA_SM_GG_TTX_GPU_XXX.testxxx
[       OK ] SIGMA_SM_GG_TTX_GPU_XXX.testxxx (1 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_XXX (1 ms total)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MISC
[ RUN      ] SIGMA_SM_GG_TTX_GPU_MISC.testmisc
[       OK ] SIGMA_SM_GG_TTX_GPU_MISC.testmisc (14 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MISC (14 ms total)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MADGRAPH1
[ RUN      ] SIGMA_SM_GG_TTX_GPU_MADGRAPH1.compareMomAndME
INFO: Opening reference file ../../test/ref/dump_CPUTest.Sigma_sm_gg_ttx.txt
[       OK ] SIGMA_SM_GG_TTX_GPU_MADGRAPH1.compareMomAndME (198 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MADGRAPH1 (198 ms total)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MADGRAPH2
[ RUN      ] SIGMA_SM_GG_TTX_GPU_MADGRAPH2.compareMomAndME
INFO: Opening reference file ../../test/ref/dump_CPUTest.Sigma_sm_gg_ttx.txt
[       OK ] SIGMA_SM_GG_TTX_GPU_MADGRAPH2.compareMomAndME (180 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MADGRAPH2 (180 ms total)
[----------] Global test environment tear-down
[==========] 4 tests from 4 test suites ran. (395 ms total)
[  PASSED  ] 4 tests.
INFO: No Floating Point Exceptions have been reported
ERROR! assertGpu: 'invalid argument' (1) in MemoryBuffers.h:155
runTest_cuda.exe: GpuRuntime.h:26: void assertGpu(cudaError_t, const char*, int, bool): Assertion `code == gpuSuccess' failed.
Aborted (core dumped)
… to the atexit function, but this STILL crashes! madgraph5#907
WILL THEREFORE COMMENT OUT THIS CALL...
INFO: The following Floating Point Exceptions will cause SIGFPE program aborts: FE_DIVBYZERO, FE_INVALID, FE_OVERFLOW
[==========] Running 4 tests from 4 test suites.
[----------] Global test environment set-up.
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_XXX
[ RUN      ] SIGMA_SM_GG_TTX_GPU_XXX.testxxx
[       OK ] SIGMA_SM_GG_TTX_GPU_XXX.testxxx (1 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_XXX (1 ms total)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MISC
[ RUN      ] SIGMA_SM_GG_TTX_GPU_MISC.testmisc
[       OK ] SIGMA_SM_GG_TTX_GPU_MISC.testmisc (14 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MISC (14 ms total)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MADGRAPH1
[ RUN      ] SIGMA_SM_GG_TTX_GPU_MADGRAPH1.compareMomAndME
INFO: Opening reference file ../../test/ref/dump_CPUTest.Sigma_sm_gg_ttx.txt
[       OK ] SIGMA_SM_GG_TTX_GPU_MADGRAPH1.compareMomAndME (198 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MADGRAPH1 (198 ms total)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MADGRAPH2
[ RUN      ] SIGMA_SM_GG_TTX_GPU_MADGRAPH2.compareMomAndME
INFO: Opening reference file ../../test/ref/dump_CPUTest.Sigma_sm_gg_ttx.txt
[       OK ] SIGMA_SM_GG_TTX_GPU_MADGRAPH2.compareMomAndME (179 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MADGRAPH2 (179 ms total)
[----------] Global test environment tear-down
[==========] 4 tests from 4 test suites ran. (393 ms total)
[  PASSED  ] 4 tests.
INFO: No Floating Point Exceptions have been reported
ERROR! assertGpu: 'invalid argument' (1) in MemoryBuffers.h:155
runTest_cuda.exe: GpuRuntime.h:26: void assertGpu(cudaError_t, const char*, int, bool): Assertion `code == gpuSuccess' failed.
Aborted (core dumped)
… to avoid all crashes madgraph5#907 (FIXME? avoid cuda api calls in dtors?)
INFO: The following Floating Point Exceptions will cause SIGFPE program aborts: FE_DIVBYZERO, FE_INVALID, FE_OVERFLOW
[==========] Running 4 tests from 4 test suites.
[----------] Global test environment set-up.
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_XXX
[ RUN      ] SIGMA_SM_GG_TTX_GPU_XXX.testxxx
[       OK ] SIGMA_SM_GG_TTX_GPU_XXX.testxxx (1 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_XXX (1 ms total)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MISC
[ RUN      ] SIGMA_SM_GG_TTX_GPU_MISC.testmisc
[       OK ] SIGMA_SM_GG_TTX_GPU_MISC.testmisc (14 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MISC (14 ms total)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MADGRAPH1
[ RUN      ] SIGMA_SM_GG_TTX_GPU_MADGRAPH1.compareMomAndME
INFO: Opening reference file ../../test/ref/dump_CPUTest.Sigma_sm_gg_ttx.txt
[       OK ] SIGMA_SM_GG_TTX_GPU_MADGRAPH1.compareMomAndME (199 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MADGRAPH1 (199 ms total)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MADGRAPH2
[ RUN      ] SIGMA_SM_GG_TTX_GPU_MADGRAPH2.compareMomAndME
INFO: Opening reference file ../../test/ref/dump_CPUTest.Sigma_sm_gg_ttx.txt
[       OK ] SIGMA_SM_GG_TTX_GPU_MADGRAPH2.compareMomAndME (181 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MADGRAPH2 (181 ms total)
[----------] Global test environment tear-down
[==========] 4 tests from 4 test suites ran. (396 ms total)
[  PASSED  ] 4 tests.
INFO: No Floating Point Exceptions have been reported
INFO: No Floating Point Exceptions have been reported
…ting::Test argument to the compareME function, to allow the use of HasFailure
This essentially COMPLETES the fixes for madgraph5#907 and preparatory work for madgraph5#896
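As a rough illustration of why handing the test object to a comparison helper can be useful: the helper can then ask whether a failure has already been recorded and stop early instead of reporting every later event. The sketch below does not use googletest and is not the project's code. `FakeTest`, `expectNear` and this `compareME` signature are hypothetical stand-ins chosen only to show the pattern.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a test object (NOT googletest's class):
// it only counts how many expectations have failed so far.
class FakeTest
{
public:
  void expectNear( double actual, double expected, double tolerance )
  {
    if( std::fabs( actual - expected ) > tolerance ) ++m_failures;
  }
  bool HasFailure() const { return m_failures > 0; }
private:
  int m_failures = 0;
};

// Illustrative helper: compares computed values against reference values and
// returns the index of the first failing event, or -1 if all events agree.
// Receiving the test object is what lets the helper query HasFailure().
int compareME( FakeTest& test,
               const std::vector<double>& computed,
               const std::vector<double>& reference,
               double tolerance )
{
  assert( computed.size() == reference.size() );
  for( std::size_t ievt = 0; ievt < computed.size(); ++ievt )
  {
    test.expectNear( computed[ievt], reference[ievt], tolerance );
    if( test.HasFailure() ) return static_cast<int>( ievt ); // stop at the first failure
  }
  return -1;
}
```

Calling `compareME` with a fresh `FakeTest` and two identical vectors returns -1 and leaves `HasFailure()` false. A mismatch in one event returns that event's index.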
…pare to comment out test2 (preparatory work for madgraph5#896) All tests succeed on cuda and all simd
…ry work for madgraph5#896) All tests succeed on cuda and all simd - will backport to CODEGEN now
…st.cc, testxxx.cc: simplify gtest templates, remove cudaDeviceReset to fix madgraph5#907, complete preparation of two-test infrastructure madgraph5#896
More in detail:
- move to the simplest "TEST(" use case of Google tests in MadgraphTest.h and runTest.cc (remove unnecessary levels of templating)
- move gpuDeviceReset() to an atexit function of main in testxxx and comment it out anyway, to fix the segfaults madgraph5#907 (eventually it may be necessary to remove all CUDA API calls from destructors, if ever we need to put this back in)
- in runTest.cc, complete a proof of concept for adding two separate tests (without/with multichannel madgraph5#896)
Fix some clang formatting issues with respect to the last gg_tt.mad
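One plausible reading of the 'invalid argument' abort reported from MemoryBuffers.h is an ordering problem: a buffer destructor calls into the GPU runtime after the device has already been reset. The sketch below models that ordering in plain C++, without any CUDA call. `FakeDevice` and `FakeBuffer` are hypothetical stand-ins, not the project's classes and not the CUDA API, and the diagnosis itself is an assumption rather than something these commits confirm.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for a GPU runtime (NOT the CUDA API): it only tracks
// which allocation handles are still valid.
class FakeDevice
{
public:
  int allocate() { m_live.insert( m_next ); return m_next++; }
  // Returns false for an unknown handle, mimicking an 'invalid argument' error
  bool release( int handle ) { return m_live.erase( handle ) == 1; }
  // Invalidates every outstanding allocation, as a device reset would
  void reset() { m_live.clear(); }
private:
  std::set<int> m_live;
  int m_next = 1;
};

// RAII buffer whose destructor calls into the (fake) runtime
class FakeBuffer
{
public:
  FakeBuffer( FakeDevice& device, std::vector<std::string>& log )
    : m_device( device ), m_log( log ), m_handle( device.allocate() )
  {
    assert( m_handle > 0 );
  }
  FakeBuffer( const FakeBuffer& ) = delete;
  FakeBuffer& operator=( const FakeBuffer& ) = delete;
  ~FakeBuffer()
  {
    m_log.push_back( m_device.release( m_handle ) ? "free ok" : "free failed" );
  }
private:
  FakeDevice& m_device;
  std::vector<std::string>& m_log;
  int m_handle;
};

// Problematic order: the reset happens while the buffer is still alive
std::vector<std::string> resetWhileBufferAlive()
{
  std::vector<std::string> log;
  FakeDevice device;
  {
    FakeBuffer buffer( device, log );
    device.reset();
    log.push_back( "reset" );
  } // buffer destructor runs here, after the reset
  return log;
}

// Safe order: the buffer is destroyed first, then the device is reset
std::vector<std::string> resetAfterBufferDestroyed()
{
  std::vector<std::string> log;
  FakeDevice device;
  {
    FakeBuffer buffer( device, log );
  } // buffer destructor runs here, before the reset
  device.reset();
  log.push_back( "reset" );
  return log;
}
```

In this model `resetWhileBufferAlive()` records "reset" followed by "free failed", while `resetAfterBufferDestroyed()` records "free ok" followed by "reset". Dropping the reset call entirely, as this PR does, removes the ordering constraint altogether.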
… the latest upstream/master for easier merging
git checkout upstream/master $(git ls-tree --name-only HEAD */CODEGEN*txt)
oliviermattelaer approved these changes on Jul 16, 2024
…h5#900 and submod madgraph5#897) into clang
…r if OpenMP builds are attempted on clang16/17 (as discussed with Olivier in madgraph5#905)
…s from the latest upstream/master for easier merging
git checkout upstream/master $(git ls-tree --name-only HEAD */CODEGEN*txt)
…aster with OMP madgraph5#900 and submod madgraph5#897) into gtest
Fix conflicts in epochX/cudacpp/gg_tt.mad/CODEGEN_mad_gg_tt_log.txt
git checkout clang gg_tt.mad/CODEGEN_mad_gg_tt_log.txt
Note: MG5AMC has been updated including mg5amcnlo#107
…s from the latest upstream/master for easier merging
git checkout upstream/master $(git ls-tree --name-only HEAD */CODEGEN*txt)
…aster with clang madgraph5#905, OMP madgraph5#900 and submod madgraph5#897) into gtest
Fix conflicts in epochX/cudacpp/gg_tt.mad/CODEGEN_mad_gg_tt_log.txt
git checkout clang gg_tt.mad/CODEGEN_mad_gg_tt_log.txt
Note: MG5AMC has been updated including mg5amcnlo#107
…ogs from the latest upstream/master for easier merging
git checkout upstream/master $(git ls-tree --name-only HEAD */CODEGEN*txt)
Thanks Olivier :-) I again updated this and regenerated as a check. Will run the CI then merge. Andrea
The CI completed. Merging now.
valassi added a commit to valassi/madgraph4gpu that referenced this pull request on Jul 17, 2024
…ion, removing the attempts to add two tests madgraph5#896
My last commit was showing the segfault issue madgraph5#907 solved in upcoming PR madgraph5#909 (and bits of madgraph5#908).
I will cherry pick the CODEGEN from madgraph5#909 (and madgraph5#908) first and try again.
git checkout 3eb4c29 gg_tt.mad/SubProcesses/runTest.cc
valassi added a commit to valassi/madgraph4gpu that referenced this pull request on Jul 17, 2024
…ng PR madgraph5#905, constexpr_math.h PR madgraph5#908 and runTest/cudaDeviceReset PR madgraph5#909
Add valgrind.h and its symlink in the repo for gg_tt.mad
The new runTest.cc template now has a (commented out) proof of concept for including two tests (with/without multichannel) madgraph5#896, I will resume from there
After building bldall, the following succeeds
for bck in none sse4 avx2 512y 512z cuda; do echo $bck; ./build.${bck}_d_inl0_hrd0/runTest_*.exe; done
This instead is crashing (again?) for some AVX values
for bck in none sse4 avx2 512y 512z cuda; do echo $bck; valgrind ./build.${bck}_d_inl0_hrd0/runTest_*.exe; done
On closer inspection, this is because valgrind does not support AVX512, so this is ok
This is a PR to fix #907, finally removing the issues blocking the implementation of two tests #896. This also implied a simplification in the usage of googletest templates in our code. I made a successful proof of concept of adding two tests, which I want for #896 in the work I am doing on master_june24 for channelids (previously this was blocked by bug #907).
@oliviermattelaer can you please review? Note, this PR sits on top of PR #908, which itself sits on top of #900 and #905. So I suggest reviewing and merging from the bottom up: #900 and #905 first, then #908, then this PR.
Thanks
Andrea