Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

From: Uros Bizjak
Date: Tue Oct 10 2023 - 14:23:00 EST


On Tue, Oct 10, 2023 at 7:32 PM Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> On Tue, 10 Oct 2023 at 09:43, Uros Bizjak <ubizjak@xxxxxxxxx> wrote:
> >
> > Implementing arch_raw_cpu_ptr() in C, allows the compiler to perform
> > better optimizations, such as setting an appropriate base to compute
> > the address instead of an add instruction.
>
> Hmm. I wonder..
>
> > + tcp_ptr__ = __raw_cpu_read(, this_cpu_off) + (unsigned long)(ptr); \
>
> Do we really even want to use __raw_cpu_read(this_cpu_off) at all?

Please note that besides propagation of the addition into address, the
patch also exposes memory load to the compiler, with the anticipation
that the compiler CSEs the load from this_cpu_off from eventual
multiple addresses. For this to work, we have to get rid of the asms.
It is important that the compiler knows that this is a memory load, so
it can also apply other compiler magic to it.

BTW: A follow-up patch will also use__raw_cpu_read to implement
this_cpu_read_stable. We can then read "const aliased" current_task to
CSE the load even more, something similar to [1].

[1] https://lore.kernel.org/lkml/20190823224424.15296-8-namit@xxxxxxxxxx/

Uros.

> On my machines (I tested an Intel 8th gen laptop, and my AMD Zen 2
> Threadripper machine), 'rdgsbase' seems to be basically two cycles.
>
> I wonder if we'd be better off using that, rather than doing the load.
>
> Yes, a load that hits in L1D$ will schedule better, so it's "cheaper"
> in that sense. The rdgsbase instruction *probably* will end up only
> decoding in the first decoder etc. But we're talking single-cycle kind
> of effects, and the rdgsbase case should be much better from a cache
> perspective and might use fewer memory pipeline resources to offset
> the fact that it uses an unusual front end decoder resource...
>
> Yes, yes, we'd have to make it an alternative, and do something like
>
> static __always_inline unsigned long new_cpu_offset(void)
> {
> unsigned long res;
> asm(ALTERNATIVE(
> "movq %%gs:this_cpu_off,%0",
> "rdgsbase %0",
> X86_FEATURE_FSGSBASE)
> : "=r" (res));
> return res;
> }
>
> but it still seems fairly straightforward. Hmm?
>
> Linus