Skip to content

Fix data race when terminating or waiting for a thread.#1924

Merged
wenyongh merged 2 commits into
bytecodealliance:mainfrom
loganek:loganek/fix-data-race
Feb 6, 2023
Merged

Fix data race when terminating or waiting for a thread.#1924
wenyongh merged 2 commits into
bytecodealliance:mainfrom
loganek:loganek/fix-data-race

Conversation

@loganek
Copy link
Copy Markdown
Contributor

@loganek loganek commented Jan 31, 2023

There is a possibility that while iterating through a list of threads, the threads finish by themselves, making a list of exec_env invalid.

@loganek
Copy link
Copy Markdown
Contributor Author

loganek commented Jan 31, 2023

@xujuntwt95329 I'd appreciate your feedback as I think it's related to your recent changes in thread manager: #1889

@loganek loganek force-pushed the loganek/fix-data-race branch from 42bf4bd to d918ade Compare January 31, 2023 17:20
Comment thread core/iwasm/libraries/thread-mgr/thread_manager.c Outdated
@loganek loganek force-pushed the loganek/fix-data-race branch from d918ade to 0d3e236 Compare February 1, 2023 10:15

next = bh_list_elem_next(node);
os_mutex_unlock(&cluster->lock);
visitor(node, user_data);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what keeps "node" from being freed while we release cluster->lock here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's visitor's responsibility. See wait_for_thread_visitor and terminate_thread_visitor - they both check if the exec_env if clusters_have_exec_env() while keeping the lock.

if (already_processed) {
node = bh_list_elem_next(node);
continue;
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i guess for many (all?) of usage it's ok to visit nodes multiple times and thus it isn't worth to have this O(x^2) check.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, most likely that won't be an issue; having said that, I don't think x is going to be large enough to worry about that (so I'd rather keep the implementation slow but always correct, than making it a bit faster but risk somebody will use it incorrectly). If you think that will have a significant performance impact (because x can be large), I can update the code.

@loganek loganek requested review from xujuntwt95329 and yamt February 2, 2023 09:01
Copy link
Copy Markdown
Collaborator

@xujuntwt95329 xujuntwt95329 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

while (node) {
bool already_processed = false;
void *proc_node;
for (size_t i = 0; i < bh_vector_size(&proc_nodes); i++) {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had better declare size_t i; with a single line, here it it C source code, we usually keep the coding style of C.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I'll update the code. It'd be good to have some sort of CI check to enforce that though (I guess compile with -std=c89 would do, but is the whole codebase c89 compliant?)

Comment thread core/iwasm/libraries/thread-mgr/thread_manager.c Outdated
@loganek loganek requested review from wenyongh and removed request for yamt February 3, 2023 10:59
@loganek loganek force-pushed the loganek/fix-data-race branch from 33d4bcc to 0d3e236 Compare February 3, 2023 12:52
goto final;
}

node = bh_list_first_elem(&cluster->exec_env_list);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will make it traversing from the first node again, nodes from the list head to current will be tested again whether they are in the vector, it is nearly O(n^2), and besides the top while (node) loop, the algorithm's complexity is nearly O(n^3). Not very efficient.

I think that we can check whether the most recent visited node is still in the exec_env list, if yes, we can get its next, and then continue to traverse, if not, traversing from the head of the list:

static bool
safe_traverse_exec_env_list(WASMCluster *cluster, list_visitor visitor,
                            void *user_data)
{
    Vector proc_nodes;
    void *node;
    bool ret = false;

    if (!bh_vector_init(&proc_nodes, cluster->exec_env_list.len, sizeof(void *),
                        false)) {
        goto final;
    }

    node = bh_list_first_elem(&cluster->exec_env_list);

    while (node) {
        void *proc_node;
        bool found;
        int i;

        os_mutex_unlock(&cluster->lock);
        visitor(node, user_data);
        os_mutex_lock(&cluster->lock);

        if (!bh_vector_append(&proc_nodes, &node)) {
            goto final;
        }

        found = false;
        /* Find the most recent visited and still existing node */
        for (i = (int)bh_vector_size(&proc_nodes) - 1; i >= 0; i--) {
            if (!bh_vector_get(&proc_nodes, i, &proc_node)) {
                goto final;
            }
            node = bh_list_first_elem(&cluster->exec_env_list);
            while (node) {
                if (proc_node == node) {
                    found = true;
                    break;
                }
                node = bh_list_elem_next(node);
            }
            if (found)
                break;
            else {
                /* The node has been removed from cluster->exec_env_list */
                (void)bh_vector_remove(&proc_nodes, i, NULL);
            }
        }

        if (found)
            node = bh_list_elem_next(node);
        else
            node = bh_list_first_elem(&cluster->exec_env_list);
    }

    ret = true;
final:
    bh_vector_destroy(&proc_nodes);

    return ret;
}

Could you help check whether it works?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is still not very efficient, seems to be O(n^2).
Wondering why we need to unlock and lock again when visiting the node? Will the visiting lock the cluster->lock again?

Copy link
Copy Markdown
Contributor Author

@loganek loganek Feb 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wenyongh please note the critical operation in the loop is:

os_mutex_unlock(&cluster->lock);
visitor(node, user_data);
os_mutex_lock(&cluster->lock);

and that in either of the implementations is executed only N times. Also, I think in the current implementation we only go beyond for loop (and therefore, goes at the beginning of the list) if the node is not in the vector, so the actual complexity is quadratic. I think there's a room for improvement (not only in the iteration itself, but e.g. by using hash set instead of vector), but I don't think this will significantly improve performance of this function.

Wondering why we need to unlock and lock again when visiting the node? Will the visiting lock the cluster->lock again?

Yes, they lock the cluster's lock internally.

My suggestion is to push the fix as it is (unless you see bugs in it) to not block potential customers (please note the problem is also with the main branch, so I'm updating the PR to point to main) and we can tune the performance of the function later if needed. What do you think?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, let's merge this PR first.

There is a possibility that while iterating through a list of threads, the threads finish by themselves, making a list of exec_env invalid.
…art iteration after processing every node.
@loganek loganek force-pushed the loganek/fix-data-race branch from 26ab559 to 1ae3042 Compare February 6, 2023 10:23
@loganek loganek changed the base branch from dev/wasi_threads to main February 6, 2023 10:23
@wenyongh wenyongh merged commit e2c754f into bytecodealliance:main Feb 6, 2023
eloparco pushed a commit to eloparco/wasm-micro-runtime that referenced this pull request Feb 13, 2023
…nce#1924)

There is a possibility that while iterating through a list of threads, the threads
finish by themselves, making a list of exec_env invalid.
vickiegpt pushed a commit to vickiegpt/wamr-aot-gc-checkpoint-restore that referenced this pull request May 27, 2024
…nce#1924)

There is a possibility that while iterating through a list of threads, the threads
finish by themselves, making a list of exec_env invalid.
@loganek loganek deleted the loganek/fix-data-race branch June 10, 2024 12:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants