Fix data race when terminating or waiting for a thread. by loganek · Pull Request #1924 · bytecodealliance/wasm-micro-runtime

loganek · 2023-01-31T16:45:23Z

There is a possibility that while iterating through a list of threads, the threads finish by themselves, making a list of exec_env invalid.

loganek · 2023-01-31T16:46:26Z

@xujuntwt95329 I'd appreciate your feedback as I think it's related to your recent changes in thread manager: #1889

yamt · 2023-02-01T10:31:28Z

+
+        next = bh_list_elem_next(node);
+        os_mutex_unlock(&cluster->lock);
+        visitor(node, user_data);


what keeps "node" from being freed while we release cluster->lock here?

It's visitor's responsibility. See wait_for_thread_visitor and terminate_thread_visitor - they both check if the exec_env if clusters_have_exec_env() while keeping the lock.

yamt · 2023-02-01T10:36:33Z

+        if (already_processed) {
+            node = bh_list_elem_next(node);
+            continue;
+        }


i guess for many (all?) of usage it's ok to visit nodes multiple times and thus it isn't worth to have this O(x^2) check.

Yeah, most likely that won't be an issue; having said that, I don't think x is going to be large enough to worry about that (so I'd rather keep the implementation slow but always correct, than making it a bit faster but risk somebody will use it incorrectly). If you think that will have a significant performance impact (because x can be large), I can update the code.

xujuntwt95329

LGTM

wenyongh · 2023-02-03T02:25:16Z

+    while (node) {
+        bool already_processed = false;
+        void *proc_node;
+        for (size_t i = 0; i < bh_vector_size(&proc_nodes); i++) {


Had better declare size_t i; with a single line, here it it C source code, we usually keep the coding style of C.

Ok, I'll update the code. It'd be good to have some sort of CI check to enforce that though (I guess compile with -std=c89 would do, but is the whole codebase c89 compliant?)

wenyongh · 2023-02-04T01:19:26Z

+            goto final;
+        }
+
+        node = bh_list_first_elem(&cluster->exec_env_list);


This will make it traversing from the first node again, nodes from the list head to current will be tested again whether they are in the vector, it is nearly O(n^2), and besides the top while (node) loop, the algorithm's complexity is nearly O(n^3). Not very efficient.

I think that we can check whether the most recent visited node is still in the exec_env list, if yes, we can get its next, and then continue to traverse, if not, traversing from the head of the list:

static bool safe_traverse_exec_env_list(WASMCluster *cluster, list_visitor visitor, void *user_data) { Vector proc_nodes; void *node; bool ret = false; if (!bh_vector_init(&proc_nodes, cluster->exec_env_list.len, sizeof(void *), false)) { goto final; } node = bh_list_first_elem(&cluster->exec_env_list); while (node) { void *proc_node; bool found; int i; os_mutex_unlock(&cluster->lock); visitor(node, user_data); os_mutex_lock(&cluster->lock); if (!bh_vector_append(&proc_nodes, &node)) { goto final; } found = false; /* Find the most recent visited and still existing node */ for (i = (int)bh_vector_size(&proc_nodes) - 1; i >= 0; i--) { if (!bh_vector_get(&proc_nodes, i, &proc_node)) { goto final; } node = bh_list_first_elem(&cluster->exec_env_list); while (node) { if (proc_node == node) { found = true; break; } node = bh_list_elem_next(node); } if (found) break; else { /* The node has been removed from cluster->exec_env_list */ (void)bh_vector_remove(&proc_nodes, i, NULL); } } if (found) node = bh_list_elem_next(node); else node = bh_list_first_elem(&cluster->exec_env_list); } ret = true; final: bh_vector_destroy(&proc_nodes); return ret; }

Could you help check whether it works?

It is still not very efficient, seems to be O(n^2).
Wondering why we need to unlock and lock again when visiting the node? Will the visiting lock the cluster->lock again?

@wenyongh please note the critical operation in the loop is:

os_mutex_unlock(&cluster->lock); visitor(node, user_data); os_mutex_lock(&cluster->lock);

and that in either of the implementations is executed only N times. Also, I think in the current implementation we only go beyond for loop (and therefore, goes at the beginning of the list) if the node is not in the vector, so the actual complexity is quadratic. I think there's a room for improvement (not only in the iteration itself, but e.g. by using hash set instead of vector), but I don't think this will significantly improve performance of this function.

Wondering why we need to unlock and lock again when visiting the node? Will the visiting lock the cluster->lock again?

Yes, they lock the cluster's lock internally.

My suggestion is to push the fix as it is (unless you see bugs in it) to not block potential customers (please note the problem is also with the main branch, so I'm updating the PR to point to main) and we can tune the performance of the function later if needed. What do you think?

OK, let's merge this PR first.

There is a possibility that while iterating through a list of threads, the threads finish by themselves, making a list of exec_env invalid.

…art iteration after processing every node.

…nce#1924) There is a possibility that while iterating through a list of threads, the threads finish by themselves, making a list of exec_env invalid.

loganek force-pushed the loganek/fix-data-race branch from 42bf4bd to d918ade Compare January 31, 2023 17:20

xujuntwt95329 reviewed Feb 1, 2023

View reviewed changes

Comment thread core/iwasm/libraries/thread-mgr/thread_manager.c Outdated

loganek force-pushed the loganek/fix-data-race branch from d918ade to 0d3e236 Compare February 1, 2023 10:15

yamt reviewed Feb 1, 2023

View reviewed changes

loganek requested review from xujuntwt95329 and yamt February 2, 2023 09:01

xujuntwt95329 reviewed Feb 2, 2023

View reviewed changes

wenyongh reviewed Feb 3, 2023

View reviewed changes

loganek requested review from wenyongh and removed request for yamt February 3, 2023 10:59

loganek force-pushed the loganek/fix-data-race branch from 33d4bcc to 0d3e236 Compare February 3, 2023 12:52

wenyongh reviewed Feb 4, 2023

View reviewed changes

loganek added 2 commits February 6, 2023 10:22

Fix data race when terminating or waiting for a thread.

cc27e8d

There is a possibility that while iterating through a list of threads, the threads finish by themselves, making a list of exec_env invalid.

Don't use element_removed flag to avoid multi-threading issue; rest…

1ae3042

…art iteration after processing every node.

loganek force-pushed the loganek/fix-data-race branch from 26ab559 to 1ae3042 Compare February 6, 2023 10:23

loganek changed the base branch from dev/wasi_threads to main February 6, 2023 10:23

wenyongh merged commit e2c754f into bytecodealliance:main Feb 6, 2023

wenyongh mentioned this pull request Feb 7, 2023

Allow interrupting blocking instructions #1930

Closed

loganek deleted the loganek/fix-data-race branch June 10, 2024 12:48

Conversation

loganek commented Jan 31, 2023

Uh oh!

loganek commented Jan 31, 2023

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

xujuntwt95329 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

loganek Feb 6, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

loganek Feb 6, 2023 •

edited

Loading