Re: [PATCH tip] x86/percpu: Rewrite arch_raw_cpu_ptr()

From: Uros Bizjak
Date: Sat Oct 14 2023 - 06:35:11 EST


On Sat, Oct 14, 2023 at 12:04 PM Ingo Molnar <mingo@xxxxxxxxxx> wrote:
>
>
> * Uros Bizjak <ubizjak@xxxxxxxxx> wrote:
>
> > Implement arch_raw_cpu_ptr() as a load from this_cpu_off and then
> > add the ptr value to the base. This way, the compiler can propagate
> > addend to the following instruction and simplify address calculation.
> >
> > E.g.: address calculation in amd_pmu_enable_virt() improves from:
> >
> > 48 c7 c0 00 00 00 00 mov $0x0,%rax
> > 87b7: R_X86_64_32S cpu_hw_events
> >
> > 65 48 03 05 00 00 00 add %gs:0x0(%rip),%rax
> > 00
> > 87bf: R_X86_64_PC32 this_cpu_off-0x4
> >
> > 48 c7 80 28 13 00 00 movq $0x0,0x1328(%rax)
> > 00 00 00 00
> >
> > to:
> >
> > 65 48 8b 05 00 00 00 mov %gs:0x0(%rip),%rax
> > 00
> > 8798: R_X86_64_PC32 this_cpu_off-0x4
> > 48 c7 80 00 00 00 00 movq $0x0,0x0(%rax)
> > 00 00 00 00
> > 87a6: R_X86_64_32S cpu_hw_events+0x1328
> >
> > The compiler can also eliminate redundant loads from this_cpu_off,
> > reducing the number of percpu offset reads (either from this_cpu_off
> > or with rdgsbase) from 1663 to 1571.
> >
> > Additionally, the patch introduces an 'rdgsbase' alternative for CPUs with
> > X86_FEATURE_FSGSBASE. The rdgsbase instruction *probably* will end up
> > only decoding in the first decoder etc. But we're talking single-cycle
> > kind of effects, and the rdgsbase case should be much better from
> > a cache perspective and might use fewer memory pipeline resources to
> > offset the fact that it uses an unusual front end decoder resource...
>
> So the 'additionally' wording in the changelog should have been a big hint
> already that the introduction of RDGSBASE usage needs to be a separate
> patch. ;-)

Indeed. I think that the first part should be universally beneficial,
as it converts

mov symbol, %rax
add %gs:this_cpu_off, %rax

to:

mov %gs:this_cpu_off, %rax
add symbol, %rax

and allows the compiler to propagate the addition into the address
calculation (the resulting code is also similar to what the __seg_gs
approach generates).
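
In macro form, the change is roughly the following (a simplified
sketch of the patch; the real macro keeps the sparse __kernel/__force
annotations on the final cast):

#define arch_raw_cpu_ptr(ptr)                                   \
({                                                              \
        unsigned long tcp_ptr__;                                \
        /* the asm computes base + offset; the sum is opaque */ \
        asm ("add " __percpu_arg(1) ", %0"                      \
             : "=r" (tcp_ptr__)                                 \
             : "m" (this_cpu_off), "0" (ptr));                  \
        (typeof(*(ptr)) *)tcp_ptr__;                            \
})

becomes:

#define arch_raw_cpu_ptr(ptr)                                   \
({                                                              \
        unsigned long tcp_ptr__;                                \
        /* the asm only loads the percpu base ... */            \
        asm ("mov " __percpu_arg(1) ", %0"                      \
             : "=r" (tcp_ptr__)                                 \
             : "m" (this_cpu_off));                             \
        /* ... and the add is visible C, so a constant ptr      \
           folds into the addressing mode of later accesses */  \
        tcp_ptr__ += (unsigned long)(ptr);                      \
        (typeof(*(ptr)) *)tcp_ptr__;                            \
})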

At this point, the "experimental" part could either

a) introduce RDGSBASE:

As discussed with Sean, this could be problematic, at least with KVM,
and has some other drawbacks (e.g. larger binary size, limited CSE of
asm).

b) move to the __seg_gs approach via _raw_cpu_read [1]:

This approach solves the "limited CSE with assembly" compiler issue,
since it exposes the load to the compiler, and has greater
optimization potential (see the sketch below).

[1] https://lore.kernel.org/lkml/20231010164234.140750-1-ubizjak@xxxxxxxxx/
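
As a minimal illustration of why b) helps (a user-space-flavored
sketch; the kernel's real macros in [1] are more involved, and the
declaration below is illustrative only):

/* GCC x86 named address space: accesses get a %gs: prefix */
extern unsigned long __seg_gs this_cpu_off;

/* illustrative helper, not the kernel's arch_raw_cpu_ptr() */
static inline void *my_cpu_ptr(void *ptr)
{
        /* an ordinary C load, visible to the optimizer, so
           repeated reads of this_cpu_off can be CSEd */
        return (void *)(this_cpu_off + (unsigned long)ptr);
}

Two back-to-back my_cpu_ptr() calls then need only one %gs load,
which an asm()-based template cannot guarantee.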

Unfortunately, these two are mutually exclusive, since RDGSBASE has
to be implemented as inline asm, which hides the load from the
compiler again.

To move things forward, I propose to proceed conservatively with the
original patch [1], split into two parts: the first will introduce
the switch to MOV with tcp_ptr__ += (unsigned long)(ptr), and the
second will add the __seg_gs part.

At this point, we can experiment with RDGSBASE and compare it against
both approaches (with and without __seg_gs) by just changing the asm
template to:

+       asm (ALTERNATIVE("mov " __percpu_arg(1) ", %0",         \
+                        "rdgsbase %0",                         \
+                        X86_FEATURE_FSGSBASE)                  \
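
For completeness, an untested sketch of how that template would sit
in the MOV-based macro (the "m" operand simply goes unused when the
rdgsbase form is patched in, since this_cpu_off holds the GS base on
x86-64):

#define arch_raw_cpu_ptr(ptr)                                   \
({                                                              \
        unsigned long tcp_ptr__;                                \
        asm (ALTERNATIVE("mov " __percpu_arg(1) ", %0",         \
                         "rdgsbase %0",                         \
                         X86_FEATURE_FSGSBASE)                  \
             : "=r" (tcp_ptr__)                                 \
             : "m" (this_cpu_off));                             \
        tcp_ptr__ += (unsigned long)(ptr);                      \
        (typeof(*(ptr)) *)tcp_ptr__;                            \
})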

Uros.