Re: [RFC v1 PATCH 0/11] Optimize this_cpu_*() ops for non-x86 (ARM64 for this series)

From: Yang Shi

Date: Wed May 13 2026 - 20:00:50 EST




On 5/12/26 2:02 AM, David Hildenbrand (Arm) wrote:
=========
The benchmarks are done on 160 core AmpereOne machine. The baseline is
v7.1-rc1 kernel.

1. Kernel Build
---------------
Run kernel build (make -j160) with the default Fedora kernel config in a
memcg.
13% - 18% sys time improvement
3% - 7% wall time improvement
This is pretty impressive!

Thank you.


There was quite some feedback during the LSF/MM session, what's the current plan?

We didn't talk about the plan in the LSFMM session because time ran out. I had a hallway conversation with Ryan. He said he will try to replicate the performance benchmarks on some other ARM64 machines.

He raised a concern about CNP (Common not Private), but neither he nor I can find machines with a shared TLB. We do need some help running the patchset on those machines, because disabling CNP may have some performance implications.

I plan to polish up the patchset. There is still a lot of work to do to get it into better shape. Sounds like a plan?

I'm not sure whether the S390 folks will implement this on S390 or not; anyway, they are cc'ed.


Also, it was raised that Linus so far didn't enjoy per-process page tables. Is
there a way forward?

Yeah, it was discussed. My point is that it makes some sense for x86 not to have per-CPU page tables, because userspace and the kernel share the same page table on x86, so the number of kernel page tables is actually unbounded. But ARM64 is different. The hardware supports separate userspace and kernel page tables, so the number of kernel page tables is bounded by the number of CPUs. And my regression tests didn't show any noticeable regression from setting up the percpu local mapping for 160 cores (that is, 160 kernel page tables).

So we should maximize the hardware benefit IMHO. And it should be up to the architecture maintainers.



Finally, in the LSF/MM session, there was the question why the preemption
handling is even required. Can you describe what the problem is?

Someone asked why we don't just remove preempt_disable/enable, since we only care about the sum of the counters. That may be OK for some cases, such as simple statistics, but it may cause problems for a lot of use cases, for example:
    - __this_cpu_*() ops don't use atomic instructions. If they happen to access the same counter concurrently with this_cpu_*() ops, the counter may be corrupted.
    - this_cpu_write() may write a value or a pointer, so it may corrupt the remote CPU's copy.
    - The percpu counter may call into the slow path to flush the per-CPU counters to a global counter when some threshold is reached. An imprecise per-CPU counter may result in suboptimal behavior, for example, calling into the slow path more often than necessary.
    - The statistics may go out of sync or deviate more than expected, because the counter flush is skipped when the threshold is compared against the wrong value.
    - AFAIK, the scheduler may use percpu counters for some percpu locks; an imprecise counter may cause lockups and misbehavior.
    - Some subsystems maintain percpu state and then make decisions based on it. Corrupted percpu state may cause various problems.
    - this_cpu_cmpxchg() may compare against the remote CPU's value and end up in an indefinite loop.

There are a lot of other cases that I may not be aware of, because percpu is widely used by various subsystems. Anyway, the spec is that this_cpu_*() ops may only access the local CPU's copy. Accessing a remote CPU's data is definitely not expected and may cause various problems.

Thanks,
Yang