util: Improve thread safety of memhook monitor#11866
util: Improve thread safety of memhook monitor#11866jiaxiyan wants to merge 2 commits intoofiwg:mainfrom
Conversation
|
Attempt to fix #10943 |
|
spacing in this patch is off -- using spaces instead of tabs |
Add mutex lock and reference counting to ensure only one thread can install/remove patch. This avoids the memmove_evex_unaligned_erms segfault when multiple threads simultaneously patching glibc functions. Signed-off-by: Jessie Yang <jiaxiyan@amazon.com>
Every start needs to be followed by a stop. Signed-off-by: Jessie Yang <jiaxiyan@amazon.com>
|
@shefty Can you take another look? |
|
Something is off, and it's been too long since I've looked at this code. There are a bunch of monitors in the code. Why is special protection only being applied to one of them? This also changes the behavior such that it now requires pairing start/stop calls. struct ofi_mem_monitor maintains a state that indicates if a monitor is idle, running, stopping, etc. That state is protected by an mm_state_lock. Why are providers poking directly into this structure, rather than calling the APIs exposed in the header file, such as ofi_monitors_add_cache() or ofi_monitors_del_cache()? |
Memhook monitor is special because the symbol interception modifies global glibc functions. In #10943, when the memhook is released (e.g., on domain destruction), another unrelated thread might be executing the assembler while it is being reverted to its original state by libfabric.
We have some legacy code to test compatibility with OpenMPI who also does memory patching. |
|
The problem seems to be that the provider is calling the wrong functions, breaking the protections that already exist. ofi_monitors_update() calls start() once. It's not just serializing the call, it prevents it from being called a second time until it transitions back to idle. We already have a lock to protect against multiple threads. We already have state checking. Adding a second lock and second state for 1 specific monitor still seems wrong. All providers have to use the same entry points, or they can't work together. |
|
#10943 said it is an issue of all providers, not just efa provider. |
|
Providers shouldn't be calling ofi_memhooks_start() directly. That seems to be the issue. They should call ofi_monitors_add_cache() / ofi_monitors_del_cache(). The monitor's lifetime isn't associated with threads. It should be based on whether or not there is an active cache. |
|
The issue in #10943 is not due to multiple threads trying to patch the code concurrently. It's that thread A is undoing the patch while thread B is in the middle of executing the patched code. Thread B is totally unaware of the patching itself. |
|
@j-xiong I believe that's because the provider is bypassing the locks and state checks that already exist. Or there's some other issue I'm missing. |
|
@shefty Do you mean the |
|
@j-xiong Yes, that's what I meant, but it doesn't provide the right protection. I don't know if we can ever protect against the problem, which might mean that the act of trying to intercept the calls is also hopelessly broken. (That wouldn't be surprising given how completely hacky the approach is.) Maybe the best option is make 'stop' a no-op and never revert the calls, and hope that the intercept part mostly works. |
|
@shefty Thank, that matches my understanding. Indeed, not reverting the patch may be a viable solution. In addition, we could use a flag to determine whether the interception handler needs to be called. |
Add mutex lock and reference counting to ensure only one thread can install/remove patch. This avoids the memmove_evex_unaligned_erms segfault when multiple threads simultaneously patching glibc functions.