fix compiler optimize thread local variable access#2156
Conversation
能否自动根据当前编译器和平台来决定是否开启这个宏? |
done, 根据issue的反馈信息,目前仅针对 Clang/AppleClang 并且是 ARM/AArch 架构平台关闭优化。 |
| endif() | ||
| endif() | ||
|
|
||
| if(CMAKE_CXX_COMPILER_ID MATCHES "(AppleClang)|(Clang)" AND CMAKE_SYSTEM_PROCESSOR MATCHES "(aarch64)|(arm64)") |
There was a problem hiding this comment.
在这里判断的话只对cmake生效,我们还有make和bazel两种编译方式
是不是在头文件里进行判断,普适性更好一些
|
LGTM |
* fix compiler optimize thread local variable access * change __thread to BAIDU_THREAD_LOCAL * Add NOT_ALLOW_OPTIMIZE_THREAD_LOCAL_ACCESS according to compiler and arch * move thread local access optimization condition to thread_local.h
* fix compiler optimize thread local variable access * change __thread to BAIDU_THREAD_LOCAL * Add NOT_ALLOW_OPTIMIZE_THREAD_LOCAL_ACCESS according to compiler and arch * move thread local access optimization condition to thread_local.h
是在哪个平台上出现呢?arm 还是 x86 ? |
x86, core在了其他地方,找了下调用栈,有地方是这么用的 我之前没用过brpc,如何能够构造一个,问题里提到到示例呢? |
如果使用 thread_local 变量时,前后两次访问该变量的中间阶段发生了 bthread 的切换(由于
这个得分情况:如果想要在 bthread 中使用类似 "thread_local" 的行为,即无论 bthread 被调度到哪,都获得相同的 thread_local 值,建议使用 如果代码就是想要获得当前被调度运行的 pthread 线程的 |
|
感觉还是 thread_local 读写的问题, 请问下除了这个patch外,后面还有其他关于这个问题的修复吗? task_runner 循环外的 tls,也加上 BAIDU_GET_VOLATILE_THREAD_LOCAL,后似乎问题不再出现了。 |
后续还有3个改动:
怀疑可能更跟 #2262 这个修改有关。 |
|
另外还有一个疑问是,函数 task_runner 和 tls_task_group 不都是在一个编译单元内吗?为什么是开启lto后才会触发这个问题呢。 |
不开启 |
|
目前从clang11 切换 到clang16 遇到了一些新的问题。和 #1776 提到的问题应该是一样的,不过我看1776对应的代码并没有合入, 是有做了其他修复吗? 对比了汇编,这样差异在于326-331 行这段代码(version 0.96), 并且把这段代码拆到一个函数里,加上禁用优化的标注,就没问题了。 // Clean tls variables, must be done before changing version_butex
// otherwise another thread just joined this thread may not see side
// effects of destructing tls variables.
KeyTable* kt = tls_bls.keytable;
if (kt != NULL) {
return_keytable(m->attr.keytable_pool, kt);
// After deletion: tls may be set during deletion.
tls_bls.keytable = NULL;
m->local_storage.keytable = NULL; // optional
}clang11 对于 tls_bls 成员变量的读写,每次都会经过一次call。而clang16会在 ;third_party/brpc-0.9.6/src/bthread/task_group.cpp:325
# KeyTable* kt = tls_bls.keytable;
a9e: 66 48 8d 3d 00 00 00 data16 lea 0x0(%rip),%rdi # aa6 <bthread::TaskGroup::task_runner(long)+0x196>
aa5: 00
aa6: 66 66 48 e8 00 00 00 data16 data16 callq aae <bthread::TaskGroup::task_runner(long)+0x19e>
aad: 00
aae: 48 8b 30 mov (%rax),%rsi
third_party/brpc-0.9.6/src/bthread/task_group.cpp:326
ab1: 48 85 f6 test %rsi,%rsi
ab4: 74 28 je ade <bthread::TaskGroup::task_runner(long)+0x1ce>
...
third_party/brpc-0.9.6/src/bthread/task_group.cpp:329
# tls_bls.keytable = NULL;
abf: 66 48 8d 3d 00 00 00 data16 lea 0x0(%rip),%rdi # ac7 <bthread::TaskGroup::task_runner(long)+0x1b7>
ac6: 00
ac7: 66 66 48 e8 00 00 00 data16 data16 callq acf <bthread::TaskGroup::task_runner(long)+0x1bf>
ace: 00
acf: 48 c7 00 00 00 00 00 movq $0x0,(%rax)clang16 ;third_party/brpc-0.9.6/src/bthread/task_group.cpp:265
838: 48 8b 40 30 mov 0x30(%rax),%rax
83c: 48 85 c0 test %rax,%rax
83f: 75 df jne 820 <bthread::TaskGroup::task_runner(long)+0x30>
841: 4c 89 65 d0 mov %r12,-0x30(%rbp)
845: 66 48 8d 3d 00 00 00 data16 lea 0x0(%rip),%rdi # 84d <bthread::TaskGroup::task_runner(long)+0x5d>
84c: 00
84d: 66 66 48 e8 00 00 00 data16 data16 callq 855 <bthread::TaskGroup::task_runner(long)+0x65>
854: 00
855: 48 89 c3 mov %rax,%rbx # do 循环外保存 tls_bls 的地址到rbx
...
;third_party/brpc-0.9.6/src/bthread/task_group.cpp:325
# KeyTable* kt = tls_bls.keytable;
96e: 48 8b 33 mov (%rbx),%rsi
;third_party/brpc-0.9.6/src/bthread/task_group.cpp:326
971: 48 85 f6 test %rsi,%rsi
974: 74 18 je 98e <bthread::TaskGroup::task_runner(long)+0x19e>
...
;third_party/brpc-0.9.6/src/bthread/task_group.cpp:329
# tls_bls.keytable = NULL;
97f: 48 c7 03 00 00 00 00 movq $0x0,(%rbx) |
这个差异找到了(定位不易,踩坑的场景比较特殊,不知道有没有人会遇到)。和clang对tls变量访问的选择有关。如果使用-fPIC编译 + 分布式 thinlto ,clang-16之前都会出现这种行为差异,具体原因。 |
`bthread_usleep`/`bthread_yield` contains access to TLS variables. In LTO mode, exceptions may occur due to cross-module optimization. For example, bthread_usleep is inlined into WatchConnections, the compiler(clang-17.0.6) cache the address outside the loop, triggering the error mentioned in apache#2156.
`bthread_usleep`/`bthread_yield` contains access to TLS variables. In LTO mode, exceptions may occur due to cross-module optimization. For example, bthread_usleep is inlined into WatchConnections, the compiler(clang-17.0.6) cache the address outside the loop, triggering the error mentioned in apache#2156.
`bthread_usleep`/`bthread_yield` contains access to TLS variables. In LTO mode, exceptions may occur due to cross-module optimization. For example, bthread_usleep is inlined into WatchConnections, the compiler(clang-17.0.6) cache the address outside the loop, triggering the error mentioned in apache#2156.
`bthread_usleep`/`bthread_yield` contains access to TLS variables. In LTO mode, exceptions may occur due to cross-module optimization. For example, bthread_usleep is inlined into WatchConnections, the compiler(clang-17.0.6) cache the address outside the loop, triggering the error mentioned in apache#2156.
`bthread_usleep`/`bthread_yield` contains access to TLS variables. In LTO mode, exceptions may occur due to cross-module optimization. For example, bthread_usleep is inlined into WatchConnections, the compiler(clang-17.0.6) cache the address outside the loop, triggering the error mentioned in apache#2156.
What problem does this PR solve?
Issue Number:#1776, #1407, #845, #1860, apache/doris#13270
Problem Summary:
一些编译器会优化 thread_local 的变量的读写,在同一个函数中间如果再次获取相同变量的值时,会直接从缓存的地址里面进行加载。但是如果在函数过程中间,线程发生切换( bthread 切换到另一个线程执行),那么 thread_local 变量需要重新获当前运行线程的,而不应该使用之前的旧值。
经测试,Apple M1 Apple clang version 14.0.0, 编译依然存在该问题:
task_runner 函数的部分汇编代码:
1933 处首次读取了 tls_task_group,并放入了
[sp, #96]位置,2043 行时,线程可能已经切换,但是依然从[sp, #96]处中加载 tls_task_group ,导致读取到了上一个旧值。导致错误发生。gcc 通过开启 fno-gcse 选项禁止该优化,clang 未能找到对应控制选项。
参考:
What is changed and the side effects?
Changed:
NOT_ALLOW_OPTIMIZE_THREAD_LOCAL_ACCESS用于控制是否允许编译器优化,如果为 true 是,对应的 thread local 变量必须通过 noinline 的函数调用方式来获取最新值,绕开编译器的优化。Side effects:
如果开启
NOT_ALLOW_OPTIMIZE_THREAD_LOCAL_ACCESS, 会造成多一次的函数调用开销。无
Check List: