Re: [RFC v4 3/5] atomic: Specify alignment for atomic_t and atomic64_t
From: Finn Thain
Date: Mon Nov 24 2025 - 22:52:30 EST
On Mon, 24 Nov 2025, Daniel Palmer wrote:
> On Tue, 21 Oct 2025 at 07:39, Finn Thain <fthain@xxxxxxxxxxxxxx> wrote:
> >
> > Some recent commits incorrectly assumed 4-byte alignment of locks.
> > That assumption fails on Linux/m68k (and, interestingly, would have
> > failed on Linux/cris also). Specify the minimum alignment of atomic
> > variables for fewer surprises and (hopefully) better performance.
>
> FWIW I implemented jump labels for m68k and I think there is a problem
> with this in there too.
> jump_label_init() calls static_key_set_entries() and setting
> key->entries in there is corrupting 'atomic_t enabled' at the start of
> key.
>
> With this patch the problem goes away.
>
That's interesting. I wonder whether the alignment requirements of machine
instructions permitted the "appropriation" of the low bits from those
pointers...
In any case, a modified jump label algorithm that did not use/abuse pointer
bits would need to execute as fast as the existing implementation. And
that might be quite difficult (especially for a portable algorithm).
Recently I had an opportunity to do some performance measurements on m68k
for this atomic_t alignment patch. I tested some kernel stressors on an
AWS 95 (33 MHz 68040, 128 MB RAM, 512 KiB L2$) and also on a Mac IIfx (40
MHz 68030, 80 MB RAM, 32 KiB L2$).
The patch makes the kernel faster or slower, depending on the workload. For
example, the fifo, futex and shm stressors were consistently faster
whereas the splice, signal and msg stressors were consistently slower.
These CPUs have no hardware counters, so I can't tell whether cache misses
account for part of the slowdown. OTOH, alignment also reduces the number
of locks split across page boundaries, which might account for the
speed-up. (I didn't look at VM performance counters.)
Finally, I should note that the stress-ng man page says "do NOT use" it as
a benchmark. OK, well, if anyone wishes to reproduce my results, I can send
you the statically linked binary I used. The job file is attached.
I wonder whether others have done any throughput measurement for this
patch, using their favourite workloads?

run sequential
metrics-brief
timeout 180s
no-rand-seed
oomable
temp-path /tmp
clone 1
clone-ops 4
dentry 1
dentry-ops 8192
#dev 1
#dev-ops 300
dev-shm 1
dev-shm-ops 20
dnotify 1
dnotify-ops 1200
fault 1
fault-ops 8000
fifo 1
fifo-ops 24000
file-ioctl 1
file-ioctl-ops 20000
futex 1
futex-ops 40000
get 1
get-ops 3000
getdent 1
getdent-ops 10000
icmp-flood 1
icmp-flood-ops 40000
inotify 1
inotify-ops 400
ioprio 1
ioprio-ops 8000
kill 1
kill-ops 150000
memfd 1
memfd-bytes 32m
memfd-ops 8
mmapfork 1
mmapfork-ops 4
msg 1
msg-ops 300000
nop 1
nop-ops 3000
poll 1
poll-ops 8000
ptrace 1
ptrace-ops 50000
pty 1
pty-ops 2
rawpkt 1
rawpkt-ops 80000
rawudp 1
rawudp-ops 15000
resources 1
resources-ops 300
revio 1
revio-ops 50000
seek 1
seek-ops 12000
#sem 1
#sem-ops 4000
sem-sysv 1
sem-sysv-ops 300000
sendfile 1
sendfile-ops 1500
set 1
set-ops 20000
shm 1
shm-ops 15
sigchld 1
sigchld-ops 5000
signal 1
signal-ops 150000
sigsegv 1
sigsegv-ops 100000
sock 1
sock-ops 50
splice 1
splice-ops 10000
tee 1
tee-ops 1500
udp 1
udp-ops 30000
utime 1
utime-ops 4000
vm 1
vm-ops 2500