Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

From: Linus Torvalds
Date: Tue Oct 10 2023 - 13:32:45 EST


On Tue, 10 Oct 2023 at 09:43, Uros Bizjak <ubizjak@xxxxxxxxx> wrote:
>
> Implementing arch_raw_cpu_ptr() in C, allows the compiler to perform
> better optimizations, such as setting an appropriate base to compute
> the address instead of an add instruction.

Hmm. I wonder..

> + tcp_ptr__ = __raw_cpu_read(, this_cpu_off) + (unsigned long)(ptr); \

Do we really even want to use __raw_cpu_read(this_cpu_off) at all?

On my machines (I tested an Intel 8th gen laptop, and my AMD Zen 2
Threadripper machine), 'rdgsbase' seems to be basically two cycles.

I wonder if we'd be better off using that, rather than doing the load.

Yes, a load that hits in L1D$ will schedule better, so it's "cheaper"
in that sense. The rdgsbase instruction *probably* will end up only
decoding in the first decoder etc. But we're talking single-cycle kind
of effects, and the rdgsbase case should be much better from a cache
perspective and might use fewer memory pipeline resources to offset
the fact that it uses an unusual front end decoder resource...

Yes, yes, we'd have to make it an alternative, and do something like

static __always_inline unsigned long new_cpu_offset(void)
{
unsigned long res;
asm(ALTERNATIVE(
"movq %%gs:this_cpu_off,%0",
"rdgsbase %0",
X86_FEATURE_FSGSBASE)
: "=r" (res));
return res;
}

but it still seems fairly straightforward. Hmm?

Linus