prov/psm3: freezing due to lack of resources #11893

@tatarintsevsv

If the EP on the remote side has been closed but we still try to perform an RMA operation, the provider lacks the resources to initiate the operation and freezes inside AMSH_POLL_UNTIL() (intra-node communication) or inside PSMI_BLOCKUNTIL() (cross-node communication).
Reproducer: https://github.com/tatarintsevsv/ofi_psm3_freeze

$ ./ofi_psm3_freeze -h 127.0.0.1
inserted address into av: fi_addr_psmx3://c0a801270000017c:e9d0c8db:0:0
client started
wanna perform 1000 iters

[writemsg 0] ===> CQ: Op completed len=32768 flags=516
[writemsg 1] ===> CQ: Op completed len=32768 flags=516
[writemsg 2] ===> CQ: Op completed len=32768 flags=516
[writemsg 3] ===> CQ: Op completed len=32768 flags=516
[writemsg 4] ===> CQ: Op completed len=32768 flags=516
[writemsg 5]
[writemsg 6]
[writemsg 7]
[writemsg 8]
[writemsg 9]
[writemsg 10]
[writemsg 11]^C

The situation is as follows: both sides are running and data exchange proceeds correctly, but at some point the receiver side terminates. The sender continues to issue operations, but due to a lack of resources it gets stuck deep inside the library.
I found that there is an infinite loop that keeps trying to acquire resources.
There are two places where we can get stuck. In intra-node communication (AFAIK shared memory is used here?) the infinite loop is in the AMSH_POLL_UNTIL() macro. In cross-node communication (tested with the sockets HAL) the infinite loop is in PSMI_BLOCKUNTIL().
I think we should use a timer to prevent the freeze and return something like PSM2_EP_NO_RESOURCES on timeout (and then return -FI_EAGAIN from the initiated operation to the client).
Unfortunately, I'm not sure how to use the psm3_timer API to do this correctly. The lack of documentation and comments is sometimes painful :(
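
To illustrate the idea of a bounded wait, here is a minimal sketch in plain C. It does not use the psm3_timer API or any other PSM3 internals: the names poll_cond_fn, now_ns, poll_until_timeout, never_ready and ready_after_n are hypothetical, the deadline comes from POSIX clock_gettime(), and -EAGAIN only stands in for whatever the provider would actually return (for example PSM2_EP_NO_RESOURCES translated to -FI_EAGAIN).

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical predicate: returns true once the awaited resource is available. */
typedef bool (*poll_cond_fn)(void *ctx);

/* Current monotonic time in nanoseconds. */
static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/*
 * Bounded stand-in for an unbounded "poll until" loop.
 * Returns 0 as soon as cond() is true, or -EAGAIN if timeout_ns
 * elapses first, so the caller can report the failure instead of hanging.
 */
static int poll_until_timeout(poll_cond_fn cond, void *ctx, uint64_t timeout_ns)
{
    assert(cond != NULL);
    const uint64_t deadline = now_ns() + timeout_ns;

    for (;;) {
        if (cond(ctx))
            return 0;
        if (now_ns() >= deadline)
            return -EAGAIN;
        /* A real provider would drive its progress engine here (assumption). */
    }
}

/* Demo predicate: models a peer that is gone, so the resource never frees up. */
static bool never_ready(void *ctx)
{
    (void)ctx;
    return false;
}

/* Demo predicate: becomes true on the N-th poll, N stored in *ctx. */
static bool ready_after_n(void *ctx)
{
    int *remaining = ctx;
    return --(*remaining) <= 0;
}
```

With never_ready the call returns -EAGAIN once the timeout expires rather than spinning forever; with ready_after_n it returns 0 as soon as the condition holds. Where such a deadline check would belong inside AMSH_POLL_UNTIL() and PSMI_BLOCKUNTIL(), and what timeout value is reasonable, is still an open question.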
