Re: [PATCH] arm64: remove HAVE_CMPXCHG_LOCAL

From: Catalin Marinas

Date: Tue Feb 17 2026 - 08:53:34 EST


On Mon, Feb 16, 2026 at 08:59:17PM +0530, Dev Jain wrote:
> On 16/02/26 4:30 pm, Will Deacon wrote:
> > On Sun, Feb 15, 2026 at 11:39:44AM +0800, Jisheng Zhang wrote:
> >> It turns out the generic irq disable/enable this_cpu_cmpxchg
> >> implementation is faster than the LL/SC or LSE implementations. Remove
> >> HAVE_CMPXCHG_LOCAL for better performance on arm64.
> >>
> >> Tested on Quad 1.9GHz CA55 platform:
> >> average mod_node_page_state() cost decreases from 167ns to 103ns
> >> the spawn (30 duration) benchmark in unixbench is improved
> >> from 147494 lps to 150561 lps, improved by 2.1%
> >>
> >> Tested on Quad 2.1GHz CA73 platform:
> >> average mod_node_page_state() cost decreases from 113ns to 85ns
> >> the spawn (30 duration) benchmark in unixbench is improved
> >> from 209844 lps to 212581 lps, improved by 1.3%
> >>
> >> Signed-off-by: Jisheng Zhang <jszhang@xxxxxxxxxx>
> >> ---
> >> arch/arm64/Kconfig | 1 -
> >> arch/arm64/include/asm/percpu.h | 24 ------------------------
> >> 2 files changed, 25 deletions(-)
> > That is _entirely_ dependent on the system, so this isn't the right
> > approach. I also don't think it's something we particularly want to
> > micro-optimise to accommodate systems that suck at atomics.
>
> Hi Will,
>
> As I mentioned in the other email, the suspect is not the atomics, but
> preempt_disable(). On Apple M3, the regression reported in [1] resolves
> by removing preempt_disable/enable in _pcp_protect_return. To prove
> this another way, I disabled CONFIG_ARM64_HAS_LSE_ATOMICS and the
> regression worsened, indicating that at least on Apple M3 the
> atomics are faster.

Then why don't we replace the preempt disabling with local_irq_save()
in the arm64 code and still use the LSE atomics?
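
A rough userspace model of that idea, for illustration only. The names
model_irq_save(), model_irq_restore() and pcp_cmpxchg_irqsafe() are
hypothetical stand-ins, not kernel APIs, and a C11 compare-exchange plays
the role of the LSE CAS. It only shows the shape of the wrapper: mask
interrupts, run the atomic operation, restore the previous state.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Models the interrupt mask of the current CPU; not a kernel API. */
static bool model_irqs_off;

/* Stand-in for local_irq_save(): mask interrupts, return the prior state. */
static bool model_irq_save(void)
{
	bool was_off = model_irqs_off;

	model_irqs_off = true;
	return was_off;
}

/* Stand-in for local_irq_restore(): put the saved state back. */
static void model_irq_restore(bool was_off)
{
	model_irqs_off = was_off;
}

/*
 * Hypothetical wrapper: the compare-exchange itself stays atomic (as it
 * would with LSE); only the protection around it changes from disabling
 * preemption to saving and masking interrupts.
 * Returns the value found in *ptr, like cmpxchg() does.
 */
static long pcp_cmpxchg_irqsafe(atomic_long *ptr, long oldval, long newval)
{
	bool flags = model_irq_save();
	long seen = oldval;

	/* On failure, 'seen' is updated to the value currently in *ptr. */
	atomic_compare_exchange_strong(ptr, &seen, newval);
	model_irq_restore(flags);
	return seen;
}
```

Whether this is actually cheaper than the preempt_disable() variant on a
given CPU would still have to be measured.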

IIUC (lots of macro indirection), the generic cmpxchg is not atomic, so
another CPU is allowed to mess this up if it accesses the current CPU's
variable via per_cpu_ptr().
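
To make the concern concrete, here is a small single-threaded model of
that interleaving. It is a sketch under assumptions:
generic_style_cmpxchg() and remote_add() are made-up names, the "remote
CPU" is just a callback invoked between the load and the store, and plain
C stands in for the per-CPU accessors.

```c
#include <stddef.h>

/* Hypothetical remote CPU writing the same variable via per_cpu_ptr(). */
static void remote_add(long *ptr)
{
	*ptr += 100;
}

/*
 * Models a compare-and-exchange done as separate load, compare and store.
 * 'remote' runs in the window between the load and the store; disabling
 * local interrupts would not close that window for another CPU.
 */
static long generic_style_cmpxchg(long *ptr, long oldval, long newval,
				  void (*remote)(long *))
{
	long seen = *ptr;

	if (remote)
		remote(ptr);
	if (seen == oldval)
		*ptr = newval;	/* overwrites whatever the remote CPU stored */
	return seen;
}
```

Starting from 0 and calling generic_style_cmpxchg(&v, 0, 1, remote_add)
reports success (returns 0) and leaves v at 1, so the remote +100 is
silently lost. An atomic compare-exchange would either fail after
observing 100, or the remote add would land on top of the new value.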

--
Catalin