Fix data race when terminating or waiting for a thread.#1924
Conversation
|
@xujuntwt95329 I'd appreciate your feedback as I think it's related to your recent changes in thread manager: #1889 |
42bf4bd to
d918ade
Compare
d918ade to
0d3e236
Compare
|
|
||
| next = bh_list_elem_next(node); | ||
| os_mutex_unlock(&cluster->lock); | ||
| visitor(node, user_data); |
There was a problem hiding this comment.
what keeps "node" from being freed while we release cluster->lock here?
There was a problem hiding this comment.
It's visitor's responsibility. See wait_for_thread_visitor and terminate_thread_visitor - they both check if the exec_env if clusters_have_exec_env() while keeping the lock.
| if (already_processed) { | ||
| node = bh_list_elem_next(node); | ||
| continue; | ||
| } |
There was a problem hiding this comment.
i guess for many (all?) of usage it's ok to visit nodes multiple times and thus it isn't worth to have this O(x^2) check.
There was a problem hiding this comment.
Yeah, most likely that won't be an issue; having said that, I don't think x is going to be large enough to worry about that (so I'd rather keep the implementation slow but always correct, than making it a bit faster but risk somebody will use it incorrectly). If you think that will have a significant performance impact (because x can be large), I can update the code.
| while (node) { | ||
| bool already_processed = false; | ||
| void *proc_node; | ||
| for (size_t i = 0; i < bh_vector_size(&proc_nodes); i++) { |
There was a problem hiding this comment.
Had better declare size_t i; with a single line, here it it C source code, we usually keep the coding style of C.
There was a problem hiding this comment.
Ok, I'll update the code. It'd be good to have some sort of CI check to enforce that though (I guess compile with -std=c89 would do, but is the whole codebase c89 compliant?)
33d4bcc to
0d3e236
Compare
| goto final; | ||
| } | ||
|
|
||
| node = bh_list_first_elem(&cluster->exec_env_list); |
There was a problem hiding this comment.
This will make it traversing from the first node again, nodes from the list head to current will be tested again whether they are in the vector, it is nearly O(n^2), and besides the top while (node) loop, the algorithm's complexity is nearly O(n^3). Not very efficient.
I think that we can check whether the most recent visited node is still in the exec_env list, if yes, we can get its next, and then continue to traverse, if not, traversing from the head of the list:
static bool
safe_traverse_exec_env_list(WASMCluster *cluster, list_visitor visitor,
void *user_data)
{
Vector proc_nodes;
void *node;
bool ret = false;
if (!bh_vector_init(&proc_nodes, cluster->exec_env_list.len, sizeof(void *),
false)) {
goto final;
}
node = bh_list_first_elem(&cluster->exec_env_list);
while (node) {
void *proc_node;
bool found;
int i;
os_mutex_unlock(&cluster->lock);
visitor(node, user_data);
os_mutex_lock(&cluster->lock);
if (!bh_vector_append(&proc_nodes, &node)) {
goto final;
}
found = false;
/* Find the most recent visited and still existing node */
for (i = (int)bh_vector_size(&proc_nodes) - 1; i >= 0; i--) {
if (!bh_vector_get(&proc_nodes, i, &proc_node)) {
goto final;
}
node = bh_list_first_elem(&cluster->exec_env_list);
while (node) {
if (proc_node == node) {
found = true;
break;
}
node = bh_list_elem_next(node);
}
if (found)
break;
else {
/* The node has been removed from cluster->exec_env_list */
(void)bh_vector_remove(&proc_nodes, i, NULL);
}
}
if (found)
node = bh_list_elem_next(node);
else
node = bh_list_first_elem(&cluster->exec_env_list);
}
ret = true;
final:
bh_vector_destroy(&proc_nodes);
return ret;
}Could you help check whether it works?
There was a problem hiding this comment.
It is still not very efficient, seems to be O(n^2).
Wondering why we need to unlock and lock again when visiting the node? Will the visiting lock the cluster->lock again?
There was a problem hiding this comment.
@wenyongh please note the critical operation in the loop is:
os_mutex_unlock(&cluster->lock);
visitor(node, user_data);
os_mutex_lock(&cluster->lock);
and that in either of the implementations is executed only N times. Also, I think in the current implementation we only go beyond for loop (and therefore, goes at the beginning of the list) if the node is not in the vector, so the actual complexity is quadratic. I think there's a room for improvement (not only in the iteration itself, but e.g. by using hash set instead of vector), but I don't think this will significantly improve performance of this function.
Wondering why we need to unlock and lock again when visiting the node? Will the visiting lock the cluster->lock again?
Yes, they lock the cluster's lock internally.
My suggestion is to push the fix as it is (unless you see bugs in it) to not block potential customers (please note the problem is also with the main branch, so I'm updating the PR to point to main) and we can tune the performance of the function later if needed. What do you think?
There was a problem hiding this comment.
OK, let's merge this PR first.
There is a possibility that while iterating through a list of threads, the threads finish by themselves, making a list of exec_env invalid.
…art iteration after processing every node.
26ab559 to
1ae3042
Compare
…nce#1924) There is a possibility that while iterating through a list of threads, the threads finish by themselves, making a list of exec_env invalid.
…nce#1924) There is a possibility that while iterating through a list of threads, the threads finish by themselves, making a list of exec_env invalid.
There is a possibility that while iterating through a list of threads, the threads finish by themselves, making a list of exec_env invalid.