util: Improve thread safety of memhook monitor by jiaxiyan · Pull Request #11866 · ofiwg/libfabric

jiaxiyan · 2026-02-04T00:45:27Z

Add a mutex lock and reference counting to ensure that only one thread can install or remove the patch. This avoids the memmove_evex_unaligned_erms segfault that occurs when multiple threads patch glibc functions simultaneously.

jiaxiyan · 2026-02-04T00:46:04Z

Attempt to fix #10943

shefty · 2026-02-04T01:12:10Z

spacing in this patch is off -- using spaces instead of tabs

prov/util/src/util_mem_hooks.c

Add a mutex lock and reference counting to ensure that only one thread can install or remove the patch. This avoids the memmove_evex_unaligned_erms segfault that occurs when multiple threads patch glibc functions simultaneously. Signed-off-by: Jessie Yang <jiaxiyan@amazon.com>

Every start needs to be followed by a stop. Signed-off-by: Jessie Yang <jiaxiyan@amazon.com>

jiaxiyan · 2026-02-06T18:31:11Z

@shefty Can you take another look?

shefty · 2026-02-06T19:49:58Z

Something is off, and it's been too long since I've looked at this code.

There are a bunch of monitors in the code. Why is special protection only being applied to one of them? This also changes the behavior such that it now requires pairing start/stop calls.

struct ofi_mem_monitor maintains a state that indicates if a monitor is idle, running, stopping, etc. That state is protected by an mm_state_lock. Why are providers poking directly into this structure, rather than calling the APIs exposed in the header file, such as ofi_monitors_add_cache() or ofi_monitors_del_cache()?

jiaxiyan · 2026-02-06T22:35:24Z

> Why is special protection only being applied to one of them? This also changes the behavior such that it now requires pairing start/stop calls.

The memhook monitor is special because its symbol interception modifies global glibc functions. In #10943, when the memhook is released (e.g., on domain destruction), another, unrelated thread might be executing the patched code while libfabric is reverting it to its original state. mm_state_lock ensures that ofi_memhooks_start/stop calls are serialized, but reference counting is needed to make sure the patch is applied only once and restored only once, after all threads are done.
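A minimal sketch of the mutex-plus-reference-count idea described above. All names here (hooks_start, hooks_stop, install_patch, restore_patch) are hypothetical stand-ins, not the functions in util_mem_hooks.c, and the "patch" is simulated with a flag:

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-ins for the real patch install/restore routines. */
static int patch_installed;	/* 1 while the simulated patch is applied */
static int install_calls;	/* number of times the patch was applied */
static int restore_calls;	/* number of times the patch was reverted */

static void install_patch(void) { patch_installed = 1; install_calls++; }
static void restore_patch(void) { patch_installed = 0; restore_calls++; }

static pthread_mutex_t hook_lock = PTHREAD_MUTEX_INITIALIZER;
static int hook_refcnt;		/* protected by hook_lock */

/* The first caller installs the patch; later callers only take a reference. */
int hooks_start(void)
{
	pthread_mutex_lock(&hook_lock);
	if (hook_refcnt++ == 0)
		install_patch();
	pthread_mutex_unlock(&hook_lock);
	return 0;
}

/* The last caller restores the original code; every start needs one stop. */
void hooks_stop(void)
{
	pthread_mutex_lock(&hook_lock);
	assert(hook_refcnt > 0);
	if (--hook_refcnt == 0)
		restore_patch();
	pthread_mutex_unlock(&hook_lock);
}
```

Calling hooks_start() twice applies the simulated patch once, and only the second hooks_stop() reverts it. Note that this only serializes install and restore; it does not track threads that are currently executing a patched function.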

> Why are providers poking directly into this structure, rather than calling the APIs exposed in the header file, such as ofi_monitors_add_cache() or ofi_monitors_del_cache()?

We have some legacy code to test compatibility with Open MPI, which also does memory patching.

shefty · 2026-02-07T00:08:47Z

The problem seems to be that the provider is calling the wrong functions, breaking the protections that already exist. ofi_monitors_update() calls start() once. It doesn't just serialize the call; it also prevents start() from being called a second time until the monitor transitions back to idle. We already have a lock to protect against multiple threads. We already have state checking. Adding a second lock and a second state for one specific monitor still seems wrong. All providers have to use the same entry points, or they can't work together.
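To illustrate the kind of state check described above, here is a rough sketch of a lock-protected monitor state that lets start run only from the idle state. The struct, enum, and function names are hypothetical and much simpler than struct ofi_mem_monitor and its real states:

```c
#include <pthread.h>

/* Hypothetical, simplified monitor states. */
enum mon_state { MON_IDLE, MON_RUNNING };

struct monitor {
	pthread_mutex_t state_lock;	/* protects state */
	enum mon_state state;
	int start_calls;		/* counts real starts, for illustration */
	int stop_calls;			/* counts real stops, for illustration */
};

#define MONITOR_INIT { PTHREAD_MUTEX_INITIALIZER, MON_IDLE, 0, 0 }

/* Returns 1 if this call started the monitor, 0 if it was already running. */
int monitor_start(struct monitor *m)
{
	int started = 0;

	pthread_mutex_lock(&m->state_lock);
	if (m->state == MON_IDLE) {
		m->start_calls++;	/* the monitor's start() work would go here */
		m->state = MON_RUNNING;
		started = 1;
	}
	pthread_mutex_unlock(&m->state_lock);
	return started;
}

/* Returns 1 if this call stopped the monitor, 0 if it was already idle. */
int monitor_stop(struct monitor *m)
{
	int stopped = 0;

	pthread_mutex_lock(&m->state_lock);
	if (m->state == MON_RUNNING) {
		m->stop_calls++;	/* the monitor's stop() work would go here */
		m->state = MON_IDLE;
		stopped = 1;
	}
	pthread_mutex_unlock(&m->state_lock);
	return stopped;
}
```

With a guard like this, a second monitor_start() on a running monitor is a no-op, which is why callers that go through a common entry point would not need an extra lock or counter of their own.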

jiaxiyan · 2026-02-07T00:33:51Z

#10943 says this is an issue for all providers, not just the efa provider. ofi_memhooks_start prevents reentrancy now, but ofi_restore_intercepts in ofi_memhooks_stop needs to be called last, after all threads complete.

shefty · 2026-02-07T00:40:34Z

Providers shouldn't be calling ofi_memhooks_start() directly. That seems to be the issue. They should call ofi_monitors_add_cache() / ofi_monitors_del_cache(). The monitor's lifetime isn't associated with threads. It should be based on whether or not there is an active cache.

j-xiong · 2026-02-07T02:18:41Z

The issue in #10943 is not due to multiple threads trying to patch the code concurrently. It's that thread A is undoing the patch while thread B is in the middle of executing the patched code. Thread B is totally unaware of the patching itself.

shefty · 2026-02-07T04:13:38Z

@j-xiong I believe that's because the provider is bypassing the locks and state checks that already exist. Or there's some other issue I'm missing.

j-xiong · 2026-02-07T04:37:59Z

@shefty Do you mean the mm_lock, mm_state_lock, and mm_list_rwlock? Those protect the memory monitor data structures, but do they protect calls to the intercepted functions? For example, how does a thread know whether another thread is calling malloc when it plans to call ofi_monitor_cleanup, which will undo the patching?

shefty · 2026-02-07T06:47:58Z

@j-xiong Yes, that's what I meant, but it doesn't provide the right protection. I don't know if we can ever protect against the problem, which might mean that the act of trying to intercept the calls is also hopelessly broken. (That wouldn't be surprising given how completely hacky the approach is.)

Maybe the best option is to make 'stop' a no-op and never revert the calls, and hope that the intercept part mostly works.

j-xiong · 2026-02-09T17:29:10Z

@shefty Thanks, that matches my understanding.

Indeed, not reverting the patch may be a viable solution. In addition, we could use a flag to determine whether the interception handler needs to be called.
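A rough sketch of what that could look like: the patch is installed once and never reverted, and an atomic flag decides whether the interception handler runs. Every name below (hooks_start, hooks_stop, intercepted_munmap, handler) is hypothetical, and the patching itself is simulated with a flag:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

static atomic_bool hooks_active;	/* should the handler be called? */
static int patch_installed;		/* simulated: set once, never cleared */
static int handler_calls;		/* counts handler invocations */

/* Hypothetical interception handler (e.g., cache invalidation). */
static void handler(const void *addr, size_t len)
{
	(void) addr;
	(void) len;
	handler_calls++;
}

/* Install the patch the first time only, then enable the handler. */
void hooks_start(void)
{
	if (!patch_installed)
		patch_installed = 1;	/* real code would patch here, under a lock */
	atomic_store(&hooks_active, true);
}

/* 'stop' leaves the patch in place and only disables the handler. */
void hooks_stop(void)
{
	atomic_store(&hooks_active, false);
}

/* What a patched entry point might do on each call. */
int intercepted_munmap(void *addr, size_t len)
{
	if (atomic_load(&hooks_active))
		handler(addr, len);
	return 0;	/* real code would forward to the original munmap */
}
```

Because the patched instructions are never rewritten after the first install, a thread that is in the middle of an intercepted call when hooks_stop() runs keeps executing valid code; at worst the handler runs one extra time.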

jiaxiyan requested a review from a team February 4, 2026 00:45

jiaxiyan requested a review from shefty February 4, 2026 00:47

alekswn reviewed Feb 4, 2026

jiaxiyan force-pushed the memhook branch from 3cc86cf to 0c6877d Compare February 5, 2026 00:38

alekswn previously approved these changes Feb 5, 2026

jiaxiyan added 2 commits February 5, 2026 11:33

prov/efa: Stop memhooks_monitor after starting it

e0a82d8

jiaxiyan dismissed alekswn’s stale review via e0a82d8 February 6, 2026 18:18

jiaxiyan force-pushed the memhook branch from 0c6877d to e0a82d8 Compare February 6, 2026 18:18

4 participants