Re: [PATCH v3 0/9] s390: Improve this_cpu operations

From: Yang Shi

Date: Thu May 21 2026 - 16:48:10 EST




On 5/21/26 10:55 AM, David Laight wrote:
On Thu, 21 May 2026 09:57:37 -0700
Yang Shi <yang@xxxxxxxxxxxxxxxxxxxxxx> wrote:

On 5/21/26 3:17 AM, David Laight wrote:
On Wed, 20 May 2026 17:23:37 -0700
Yang Shi <yang@xxxxxxxxxxxxxxxxxxxxxx> wrote:
On 5/20/26 3:34 PM, David Laight wrote:
...
But I'm sure I remember that some cpu don't like having the same
physical address at different virtual addresses (and not just those
with VIVT caches like some sparc cpu).
Yeah, VIVT cache doesn't like it due to cache alias. But the mapping is
really percpu, so the mapping to the physical address belonging to
another CPU should never pollute the current CPU's cache if I don't miss
something.
I'm sure code can end up accessing the current cpu's percpu data
using the same address that other cpu use - there are definitely
places where it needs that address.
No, it is not. In the percpu page table approach, we use different
address for this_cpu_*() and per_cpu_ptr() which is mainly used to
initialize percpu data for all CPUs.
You missed something.

Look, for example, at kernel/locking/osq_lock.c
The code uses this_cpu_ptr() and then both dereferences the pointer
and writes it to places that other cpu will use.
It also uses per_cpu_ptr() to get an address it can use for the per-cpu
data of another cpu.
(That code all assumes preemption is disabled.)
this_cpu_ptr() uses different addresses for different CPUs. It is a
special case, it is not due to VIVT, but because it may confuse list
API. Because list API determines list is empty by comparing pointers
(head->next == head). this_cpu_read/write/add/sub, etc, are fine.
But you could quite easily get code that manages to mix accesses through
this_cpu_ptr() with direct accesses to per-cpu variables in the same
cache line.

I can see potential cache alias issue with VIVT cache with the below pattern:

for_each_cpu(cpu)
    per_cpu_ptr(cpu) <-- Initialize per cpu data

this_cpu_inc(current_cpu) <-- Inc the current cpu copy

this_cpu_inc() may see stale copy (uninitialized) if there is no cache flush after initialization.

I'm sure some arm cpu really don't like you doing that.
(But it is a foggy memory from somewhere.)

ARMv8 requires PIPT if I remember correctly. Some old 32 bit arm machines may have VIVT cache, but 32 bit arm is not the target user TBH. I can see some potential issues with VIVT cache, I don't think they are the target users and VIVT cache is rare now.


You can use per-cpu page tables, but it really only helps for a
few items.
Anything that is RMW (eg add on pretty much everything except x86)
either has to disable preemption or use a compare and exchange loop.

It only helps this_cpu ops because they use atomic instructions (at least on ARM64). __this_cpu ops still require preemption disabled. But the performance improvement is still impressive even though it just can help this_cpu ops.

Variables like 'current' can be written into the per-cpu page table
data area by the process switch code (as I believe s390 does).

That may be a useful usecase.


The 'trick' here will work for reading/writing values if you don't
care that the value read is stale (or might have been written to
the memory for a different cpu).

If you don't care the stale value, you can just call __this_cpu_read().

Thanks,
Yang

It might work for updating the preemption disable count - because
you can only be preempted while it is zero.

-- David

And per_cpu_ptr() also uses different addresses for different CPUs.

The lwn article explained it.
https://lwn.net/SubscriberLink/1073395/12c08f128e515809/

Thanks,
Yang

-- David
Thanks,
Yang