Re: [PATCH tip] x86/percpu: Rewrite arch_raw_cpu_ptr()

From: Uros Bizjak
Date: Fri Oct 13 2023 - 15:40:25 EST


On Fri, Oct 13, 2023 at 6:04 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>
> On Wed, Oct 11, 2023, Uros Bizjak wrote:
> > Additionally, the patch introduces a 'rdgsbase' alternative for CPUs with
> > X86_FEATURE_FSGSBASE. The rdgsbase instruction *probably* will end up
> > only decoding in the first decoder etc. But we're talking single-cycle
> > kind of effects, and the rdgsbase case should be much better from
> > a cache perspective and might use fewer memory pipeline resources to
> > offset the fact that it uses an unusual front end decoder resource...
>
> The switch to RDGSBASE should be a separate patch, and should come with actual
> performance numbers.

This *is* the patch to switch to RDGSBASE. The propagation of
arguments is a nice side effect of the patch, due to the explicit
addition of the offset addend to the %gs base. This patch is an
alternative implementation of [1].

[1] x86/percpu: Use C for arch_raw_cpu_ptr(),
https://lore.kernel.org/lkml/20231010164234.140750-1-ubizjak@xxxxxxxxx/
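
For reference, the core of the rewritten accessor is roughly the
following (an abridged sketch from memory; please see the posted
patch for the exact form):

/*
 * Load the per-CPU offset from %gs-relative memory, or read the
 * GS base directly on FSGSBASE hardware, then add the pointer
 * explicitly.  The explicit addition is what lets the compiler
 * propagate (ptr) into later address calculations.
 */
#define arch_raw_cpu_ptr(ptr)					\
({								\
	unsigned long tcp_ptr__;				\
	asm (ALTERNATIVE("movq " __percpu_arg(1) ", %0",	\
			 "rdgsbase %0",				\
			 X86_FEATURE_FSGSBASE)			\
	     : "=r" (tcp_ptr__)					\
	     : "m" (this_cpu_off));				\
								\
	tcp_ptr__ += (unsigned long)(ptr);			\
	(typeof(*(ptr)) __kernel __force *)tcp_ptr__;		\
})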

Unfortunately, I have no idea how to measure the impact of such a
low-level feature, so I'll at least need some guidance. The "gut
feeling" says that a special instruction, intended to support the
feature, is always better than emulating said feature with a memory
access.
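
The best I can come up with on my own is a crude userspace
throughput sketch like the one below. It only compares raw
instruction cost and says nothing about the cache or decoder
effects mentioned above; userspace RDGSBASE needs an
FSGSBASE-capable CPU and a kernel that enables it (v5.9+),
otherwise the second loop raises SIGILL:

#include <stdio.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <asm/prctl.h>

#define LOOPS 100000000UL

static inline uint64_t rdtsc(void)
{
	uint32_t lo, hi;

	asm volatile ("rdtsc" : "=a" (lo), "=d" (hi));
	return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
	static unsigned long buf[8];
	uint64_t t0, t1, t2;
	unsigned long v = 0, i;

	/* Make %gs:0 hold the GS base itself, like this_cpu_off. */
	buf[0] = (unsigned long)buf;
	if (syscall(SYS_arch_prctl, ARCH_SET_GS, buf)) {
		perror("arch_prctl");
		return 1;
	}

	t0 = rdtsc();
	for (i = 0; i < LOOPS; i++)
		asm volatile ("movq %%gs:0, %0" : "=r" (v));
	t1 = rdtsc();
	for (i = 0; i < LOOPS; i++)
		asm volatile ("rdgsbase %0" : "=r" (v));
	t2 = rdtsc();

	/* Both loops should leave the same value in v. */
	printf("sanity:   %#lx == %#lx\n", v, (unsigned long)buf);
	printf("gs load:  %.2f cycles/op\n", (double)(t1 - t0) / LOOPS);
	printf("rdgsbase: %.2f cycles/op\n", (double)(t2 - t1) / LOOPS);
	return 0;
}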

> A significant percentage of data accesses in Intel's TDX-Module[*] use this
> pattern, e.g. even global data is relative to GS.base in the module due to its rather
> odd and restricted environment. Back in the early days of TDX, the module used
> RD{FS,GS}BASE instead of prefixes to get pointers to per-CPU and global data
> structures in the TDX-Module. It's been a few years so I forget the exact numbers,
> but at the time a single transition between guest and host would have something
> like ~100 reads of FS.base or GS.base. Switching from RD{FS,GS}BASE to prefixed
> accesses reduced the latency for a guest<->host transition through the TDX-Module
> by several thousand cycles, as every RD{FS,GS}BASE had a latency of ~18 cycles
> (again, going off 3+ year old memories).
>
> The TDX-Module code is pretty much a pathological worst-case scenario, but I
> suspect its usage is very similar to most usage of raw_cpu_ptr(), e.g. get a
> pointer to some data structure and then do multiple reads/writes from/to that
> data structure.
>
> The other wrinkle with RD{FS,GS}BASE is that they are trivially easy to emulate.
> If a hypervisor/VMM is advertising FSGSBASE even when it's not supported by
> hardware, e.g. to migrate VMs to older hardware, then every RDGSBASE will end up
> taking a few thousand cycles (#UD -> VM-Exit -> emulate). I would be surprised
> if any hypervisor actually does this as it would be easier/smarter to simply not
> advertise FSGSBASE if migrating to older hardware might be necessary, e.g. KVM
> doesn't support emulating RD{FS,GS}BASE. But at the same time, the whole reason
> I stumbled on the TDX-Module's sub-optimal RD{FS,GS}BASE usage was because I had
> hacked KVM to emulate RD{FS,GS}BASE so that I could do KVM TDX development on older
> hardware. I.e. it's not impossible that this code could run on hardware where
> RDGSBASE is emulated in software.

There are some other issues when memory accesses to the percpu area
are implemented with asm. An ongoing analysis shows that compilers
can't CSE asm over basic-block boundaries; CSE of asm is only a very
simple pattern match within one basic block. So, taking into account
all the facts presented above and the noticeable increase in binary
size due to the use of alternatives, I'm leaning towards using C for
arch_raw_cpu_ptr(). Compilers are extremely good at optimizing memory
accesses, so I guess the benefits of the original patch [1] outweigh
the trouble with RDGSBASE, unless someone proves that RDGSBASE is
noticeably *better* than the current (or a future optimized)
implementation.
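
To illustrate the CSE point, here is a standalone sketch (assuming
GCC with x86 named address space support, i.e. __seg_gs; pcpu_off is
a made-up stand-in for this_cpu_off, not actual kernel code):

/* The symbol's address is interpreted as a GS-relative offset,
 * mimicking how kernel per-CPU variables are addressed. */
extern unsigned long pcpu_off;

static inline unsigned long gs_read_c(void)
{
	/* An ordinary load to the optimizer, only in the GS
	 * address space, so it is subject to CSE like any load. */
	return *(__seg_gs unsigned long *)(unsigned long)&pcpu_off;
}

static inline unsigned long gs_read_asm(void)
{
	unsigned long val;

	/* Opaque to the optimizer beyond matching of literally
	 * identical asm within a single basic block. */
	asm ("movq %%gs:pcpu_off, %0" : "=r" (val));
	return val;
}

unsigned long twice_c(int cond)
{
	unsigned long a = gs_read_c();

	if (cond)			/* basic-block boundary */
		a++;
	return a + gs_read_c();		/* can reuse the first load */
}

unsigned long twice_asm(int cond)
{
	unsigned long a = gs_read_asm();

	if (cond)			/* basic-block boundary */
		a++;
	return a + gs_read_asm();	/* second movq stays */
}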

> [*] https://www.intel.com/content/www/us/en/download/738875/intel-trust-domain-extension-intel-tdx-module.html

Thanks,
Uros.