Re: [PATCH v5 7/8] rust: percpu: Add pin-hole optimizations for numerics

From: Yury Norov

Date: Wed Apr 15 2026 - 17:58:06 EST

On Wed, Apr 15, 2026 at 01:34:28PM -0700, Mitchell Levy wrote:
> On Fri, Apr 10, 2026 at 11:06:22PM -0400, Yury Norov wrote:
> > On Fri, Apr 10, 2026 at 02:35:37PM -0700, Mitchell Levy wrote:
> > > The C implementations of `this_cpu_add`, `this_cpu_sub`, etc., are
> > > optimized to save an instruction by avoiding having to compute
> > > `this_cpu_ptr(&x)` for some per-CPU variable `x`. For example, rather
> > > than
> > >
> > > u64 *x_ptr = this_cpu_ptr(&x);
> > > *x_ptr += 5;
> > >
> > > the implementation of `this_cpu_add` is clever enough to make use of the
> > > fact that per-CPU variables are implemented on x86 via segment
> > > registers, and so we can use only a single instruction (where we assume
> > > `&x` is already in `rax`)
> > >
> > > add gs:[rax], 5
> > >
> > > Add this optimization via a `PerCpuNumeric` type to enable code-reuse
> > > between `DynamicPerCpu` and `StaticPerCpu`.
> > >
> > > Signed-off-by: Mitchell Levy <levymitchell0@xxxxxxxxx>
> > > ---
> > > rust/kernel/percpu.rs | 1 +
> > > rust/kernel/percpu/dynamic.rs | 10 ++-
> > > rust/kernel/percpu/numeric.rs | 138 ++++++++++++++++++++++++++++++++++++++++++
> > > samples/rust/rust_percpu.rs | 36 +++++++++++
> > > 4 files changed, 184 insertions(+), 1 deletion(-)
> > >

...

> > > + impl PerCpuNumeric<'_, $ty> {
> > > + /// Adds `rhs` to the per-CPU variable.
> > > + #[inline]
> > > + pub fn add(&mut self, rhs: $ty) {
> > > + // SAFETY: `self.ptr.0` is a valid offset into the per-CPU area (i.e., valid as a
> > > + // pointer relative to the `gs` segment register) by the invariants of this type.
> > > + unsafe {
> > > + asm!(
> > > + concat!("add gs:[{off}], {val:", $reg, "}"),
> > > + off = in(reg) self.ptr.0.cast::<$ty>(),
> > > + val = in(reg) rhs,
> >
> > So, every user of .add() now will be only compilable against x86_64?
> > I don't think it's right. Can you make it in a more convenient way:
> > implement a generic version, and then an x86_64-optimized.
> >
> > How bad the generic x86_64 version looks comparing to the optimized
> > one?
>
> Currently, all of `mod percpu` is behind `#[cfg(X86_64)]`, so usage of
> per-CPU variables in general is only compatible against x86_64.

Yes, and seemingly for no good reason. The only assembler function in
that patch (#4) is get_ptr(), and to me it looks quite easy to
implement in C.

> I believe a generic implementation would require implicitly creating a
> `CpuGuard` since in general you require two steps: computing the pointer
> to the per-CPU variable's slot in the current CPU's area and actually
> doing the write. On x86_64 we can get around this because segment
> register relative writes let us combine these two ops into one
> instruction which can't be torn across CPUs. But in the general case you
> could have the task get preempted between those two operations and end
> up with a data race.

Yes, you need a CpuGuard to protect the .add(), and there's nothing
wrong with that. Especially if you provide a better alternative for
x86.
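
To illustrate the generic path, here is a minimal userspace sketch (not
kernel code) of a two-step add protected by a guard. The `CpuGuard`
below is a hypothetical stand-in that only tracks a nesting depth, and
`PerCpuU64` with its explicit `cpu` argument is made up for
illustration; the real types in the series may look different.

```rust
use std::cell::Cell;

/// Hypothetical stand-in for a preemption-disable guard. In this
/// userspace model it only bumps a nesting counter while it is alive.
pub struct CpuGuard<'a> {
    depth: &'a Cell<u32>,
}

impl<'a> CpuGuard<'a> {
    pub fn new(depth: &'a Cell<u32>) -> Self {
        depth.set(depth.get() + 1);
        CpuGuard { depth }
    }
}

impl Drop for CpuGuard<'_> {
    fn drop(&mut self) {
        // Leaving the guarded section: "re-enable preemption".
        self.depth.set(self.depth.get() - 1);
    }
}

/// Model of a per-CPU u64: one slot per (simulated) CPU.
pub struct PerCpuU64 {
    slots: Vec<Cell<u64>>,
}

impl PerCpuU64 {
    pub fn new(nr_cpus: usize) -> Self {
        Self { slots: (0..nr_cpus).map(|_| Cell::new(0)).collect() }
    }

    /// Generic add. Step 1 resolves this CPU's slot, step 2 does the
    /// write. Requiring a `&CpuGuard` means the caller cannot be
    /// "migrated" between the two steps in this model.
    pub fn add(&self, _guard: &CpuGuard<'_>, cpu: usize, rhs: u64) {
        let slot = &self.slots[cpu]; // step 1: compute the pointer
        slot.set(slot.get().wrapping_add(rhs)); // step 2: update
    }

    pub fn get(&self, cpu: usize) -> u64 {
        self.slots[cpu].get()
    }
}
```

The point of taking the guard by reference is that the type system,
not the caller's discipline, enforces that both steps happen inside
the same guarded section.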

> As I understand it, x86 is the only arch where this is possible, so even
> once `mod percpu` supports more architectures, I think it'd still make
> some sense to have `PerCpuNumeric` specifically be x86 exclusive. This
> means that the user must always explicitly disable preemption rather
> than having a `PerCpuNumeric` type that sometimes does and sometimes
> doesn't.

The GS register is an x86-only feature, but per-CPU variables are not.
In the mother kernel, we implement them for all architectures. For x86,
we've got an efficient arch-specific implementation where needed. But
the generic one comes first and foremost. See raw_cpu_ptr() for
example.
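
Roughly the layering I have in mind, as a userspace sketch rather than
a proposal for the actual API: a generic implementation that always
exists, plus an arch override selected with `cfg`. All function names
here are hypothetical, and the x86_64 variant uses a plain memory
operand instead of a `gs`-relative one so that it can run outside the
kernel.

```rust
/// Generic version: available on every architecture.
pub fn add_generic(slot: &mut u64, rhs: u64) {
    *slot = slot.wrapping_add(rhs);
}

/// x86_64 override: a single `add` with a memory destination.
/// (Kernel code would address the slot relative to `gs` instead.)
#[cfg(target_arch = "x86_64")]
pub fn add_arch(slot: &mut u64, rhs: u64) {
    let p: *mut u64 = slot;
    // SAFETY: `p` comes from a valid, exclusive `&mut u64`, so the
    // instruction reads and writes exactly 8 bytes of memory we own.
    unsafe {
        std::arch::asm!(
            "add qword ptr [{p}], {v}",
            p = in(reg) p,
            v = in(reg) rhs,
            options(nostack),
        );
    }
}

/// Every other architecture falls back to the generic version.
#[cfg(not(target_arch = "x86_64"))]
pub fn add_arch(slot: &mut u64, rhs: u64) {
    add_generic(slot, rhs);
}

/// Single entry point for callers; they never see the `cfg` split.
pub fn percpu_add(slot: &mut u64, rhs: u64) {
    add_arch(slot, rhs);
}
```

Callers only ever use `percpu_add()`, so code written against it
builds on any architecture and simply gets the faster path on x86_64.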

Have you checked the performance difference between the generic and
arch per-CPU APIs on your workload? Is there any measurable difference?
If not, I'd rather focus on the generic version.

If you prefer the arch one, it's OK, after all. But can you please put
it under rust/arch/x86_64? And your driver, correspondingly, would be
hosted there as well.

Looking at the Rust codebase, I see that it uses inline assembly for
things like compile-time barriers, labels, and sections. Those are
features of the assembler language itself, which is generic.

You're the first to try adding something really arch-specific, unless
I overlooked something. So please, either stay (happy) on generic
ground, or create all the required infrastructure for architectures,
like the mother kernel does. In particular, something as basic as the
per-CPU API should be available for all architectures.

Thanks,
Yury