Re: [PATCH v1 2/2] perf auxtrace: Optimize barriers with load-acquire and store-release

From: Adrian Hunter
Date: Tue Jun 01 2021 - 05:07:22 EST


On 1/06/21 9:33 am, Leo Yan wrote:
> On Mon, May 31, 2021 at 10:03:33PM +0300, Adrian Hunter wrote:
>> On 31/05/21 6:10 pm, Leo Yan wrote:
>>> Hi Peter, Adrian,
>>>
>>> On Wed, May 19, 2021 at 10:03:19PM +0800, Leo Yan wrote:
>>>> Load-acquire and store-release are one-way permeable barriers, which can
>>>> be used to guarantee the memory ordering between accessing the buffer
>>>> data and the buffer's head / tail.
>>>>
>>>> This patch optimizes the memory ordering with the load-acquire and
>>>> store-release barriers.
>>>
>>> Is this patch okay for you?
>>>
>>> Besides this patch, I have an extra question. You could see for
>>> accessing the AUX buffer's head and tail, it also support to use
>>> compiler build-in functions for atomicity accessing:
>>>
>>> __sync_val_compare_and_swap()
>>> __sync_bool_compare_and_swap()
>>>
>>> Since now we have READ_ONCE()/WRITE_ONCE(), do you think we still need
>>> to support __sync_xxx_compare_and_swap() atomicity?
>>
>> I don't remember, but it seems to me atomicity is needed only
>> for a 32-bit perf running with a 64-bit kernel.
>
> 32-bit perf wants to access 64-bit value atomically, I think it tries to
> avoid the issue caused by scenario:
>
> CPU0 (64-bit kernel) CPU1 (32-bit user)
>
> read head_lo
> WRITE_ONCE(head)
> read head_hi
>
>
> I dumped the disassembly for reading 64-bit value for perf Arm32 and get
> below results:
>
> perf Arm32 for READ_ONCE():
>
> case 8: *(__u64_alias_t *) res = *(volatile __u64_alias_t *) p; break;
> 84a: 68fb ldr r3, [r7, #12]
> 84c: e9d3 2300 ldrd r2, r3, [r3]
> 850: 6939 ldr r1, [r7, #16]
> 852: e9c1 2300 strd r2, r3, [r1]
> 856: e007 b.n 868 <auxtrace_mmap__read_head+0xb0>
>
> It uses the instruction ldrd which is "Load Register Dual (register)",
> but this doesn't mean the instruction is atomic, especially based on
> the comment in the kernel header include/asm-generic/rwonce.h, I think
> the instruction ldrd/strd will be "atomic in some cases (namely Armv7 +
> LPAE), but for others we rely on the access being split into 2x32-bit
> accesses".
>
>
> perf Arm32 for __sync_val_compare_and_swap():
>
> u64 head = __sync_val_compare_and_swap(&pc->aux_head, 0, 0);
> 7d6: 68fb ldr r3, [r7, #12]
> 7d8: f503 6484 add.w r4, r3, #1056 ; 0x420
> 7dc: f04f 0000 mov.w r0, #0
> 7e0: f04f 0100 mov.w r1, #0
> 7e4: f3bf 8f5b dmb ish
> 7e8: e8d4 237f ldrexd r2, r3, [r4]
> 7ec: ea52 0c03 orrs.w ip, r2, r3
> 7f0: d106 bne.n 800 <auxtrace_mmap__read_head+0x48>
> 7f2: e8c4 017c strexd ip, r0, r1, [r4]
> 7f6: f1bc 0f00 cmp.w ip, #0
> 7fa: f1bc 0f00 cmp.w ip, #0
> 7fe: d1f3 bne.n 7e8 <auxtrace_mmap__read_head+0x30>
> 800: f3bf 8f5b dmb ish
> 804: e9c7 2304 strd r2, r3, [r7, #16]
>
> For __sync_val_compare_and_swap(), it uses the instructions
> ldrexd/ldrexd, these two instructions rely on the exclusive monitor
> for accessing 64-bit value, so seems to me this is more reliable way
> for accessing 64-bit value in CPU's 32-bit mode.
>
> Conclusion: seems to me __sync_xxx_compare_and_swap() should be kept
> in this case rather than using READ_ONCE() for 32-bit building. Or
> any other suggestions? Thanks!

__sync_xxx_compare_and_swap is out-of-date now. This page:

https://gcc.gnu.org/onlinedocs/gcc/_005f_005fsync-Builtins.html#g_t_005f_005fsync-Builtins

recommends '__atomic' builtins instead.

Since atomics are needed only for the "compat" case (i.e. 32-bit perf with 64-bit kernel)
you could try to find an elegant way to check for a 64-bit kernel, and avoid the atomics
for a 32-bit perf with 32-bit kernel.