Re: [PATCH 0/1] mm: improve folio refcount scalability

From: Pedro Falcato

Date: Sun Mar 01 2026 - 15:27:05 EST

On Sun, Mar 01, 2026 at 10:52:57AM -0800, Linus Torvalds wrote:
> On Sat, 28 Feb 2026 at 19:27, Linus Torvalds
> <torvalds@xxxxxxxxxxxxxxxxxxx> wrote:
> >
> > This attached patch is ENTIRELY UNTESTED.
>
> Here's a slightly cleaned up and further simplified version, which is
> also actually tested, although only in the "it boots for me" sense.
>
> It generates good code at least with clang:
>
> .LBB76_7:
>         movl    $1, %eax
> .LBB76_8:
>         leal    1(%rax), %ecx
>         lock    cmpxchgl %ecx, 52(%rdi)
>         sete    %cl
>         je      .LBB76_10
>         testl   %eax, %eax
>         jne     .LBB76_8
> .LBB76_10:
>
> which actually looks both simple and fairly optimal for that sequence.
>
> Of course, since this is very much about cacheline access patterns,
> actual performance will depend on random microarchitectural issues
> (and not just the CPU core, but the whole memory subsystem).
>
> Can somebody with a good - and relevant - benchmark system try this out?
>
> Linus

Here are some perhaps interesting numbers from an extremely synthetic
benchmark[1] I wrote just now:

Note: xadd_bench is a plain lock addl, cmpxchg_bench is the typical load +
lock cmpxchg loop, and optimistic_cmpxchg_bench is similar to what you wrote:
we first assume the count is 1 and only fall back to the actual loop if that
cmpxchg fails. I don't claim this is representative of page cache performance,
but it is a lot simpler to set up and play around with.
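
For readers who don't want to open the gist, here is a minimal sketch of the
three increment strategies described above. It is not the code from the gist:
the function names are hypothetical, and the "refuse to increment from zero"
check is an assumption taken from the shape of the quoted assembly.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical names, for illustration only (not taken from the gist).

// "xadd" style: a single unconditional atomic read-modify-write.
inline void get_xadd(std::atomic<uint32_t>& ref) {
    ref.fetch_add(1);
}

// "cmpxchg" style: load the current value first, then retry the
// compare-exchange until it succeeds. Fails if the count is zero.
inline bool get_cmpxchg(std::atomic<uint32_t>& ref) {
    uint32_t old = ref.load();
    while (old != 0) {
        // On failure, 'old' is refreshed with the value currently in memory.
        if (ref.compare_exchange_weak(old, old + 1))
            return true;
    }
    return false;
}

// "optimistic" style: skip the initial load and guess that the count is 1.
// Only if that guess is wrong do we loop with the value the failed
// compare-exchange handed back. Fails if the count is zero.
inline bool get_optimistic(std::atomic<uint32_t>& ref) {
    uint32_t old = 1;
    do {
        if (ref.compare_exchange_weak(old, old + 1))
            return true;
    } while (old != 0);
    return false;
}
```

Each benchmark would then just call one of these in a loop on a shared
counter from N threads.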

On my zen4 AMD Ryzen 7 PRO 7840U laptop:
------------------------------------------------------------------------
Benchmark                                  Time         CPU   Iterations
------------------------------------------------------------------------
xadd_bench/threads:1                    2.76 ns     2.76 ns    250435782
xadd_bench/threads:4                    42.1 ns     42.1 ns     15969296
xadd_bench/threads:8                    84.8 ns     84.8 ns      8920800
xadd_bench/threads:16                    226 ns      211 ns      2446928
cmpxchg_bench/threads:1                 3.12 ns     3.12 ns    220339301
cmpxchg_bench/threads:4                 51.1 ns     51.1 ns     12372808
cmpxchg_bench/threads:8                  112 ns      112 ns      6228056
cmpxchg_bench/threads:16                 679 ns      648 ns       930832
optimistic_cmpxchg_bench/threads:1      2.95 ns     2.95 ns    233704391
optimistic_cmpxchg_bench/threads:4      56.2 ns     56.2 ns     11780588
optimistic_cmpxchg_bench/threads:8       140 ns      140 ns      4606440
optimistic_cmpxchg_bench/threads:16      789 ns      746 ns       806400

Here we can see that the optimistic cmpxchg still can't match the xadd/lock addl
performance single-threaded (2.95 ns vs 2.76 ns), and under load it degrades
faster than the straight-up cmpxchg loop (presumably because the initial
cmpxchg misses).

On our internal large 160-core Intel(R) Xeon(R) CPU E7-8891 v4 (older uarch,
sad) machine:
------------------------------------------------------------------------
Benchmark                                  Time         CPU   Iterations
------------------------------------------------------------------------
xadd_bench/threads:1                    13.6 ns     13.6 ns     51445934
xadd_bench/threads:4                    41.4 ns      166 ns      4211940
xadd_bench/threads:8                    30.3 ns      242 ns      2190488
xadd_bench/threads:16                   37.3 ns      596 ns      1162336
xadd_bench/threads:64                   24.9 ns     1376 ns       640000
xadd_bench/threads:128                  27.3 ns     3108 ns      1054592
cmpxchg_bench/threads:1                 17.9 ns     17.9 ns     38992029
cmpxchg_bench/threads:4                 54.8 ns      219 ns      3431076
cmpxchg_bench/threads:8                 39.0 ns      312 ns      1698712
cmpxchg_bench/threads:16                62.2 ns      994 ns       530672
cmpxchg_bench/threads:64                28.5 ns     1479 ns       665280
cmpxchg_bench/threads:128               17.2 ns     1838 ns       517376
optimistic_cmpxchg_bench/threads:1      13.6 ns     13.6 ns     51384286
optimistic_cmpxchg_bench/threads:4      70.2 ns      281 ns      2585092
optimistic_cmpxchg_bench/threads:8      58.1 ns      465 ns      1598592
optimistic_cmpxchg_bench/threads:16      106 ns     1694 ns       420832
optimistic_cmpxchg_bench/threads:64     30.8 ns     1767 ns       499264
optimistic_cmpxchg_bench/threads:128    39.3 ns     4632 ns       447104

Here, optimistic matches xadd single-threaded, but then degrades very quickly.
In general optimistic_cmpxchg seems to degrade worse than cmpxchg, but there is
a lot of variance here (and other users are lightly using the machine), so the
results, particularly those with higher thread counts, should be taken with a
grain of salt. For example, lock add scaling drastically worse than cmpxchg at
128 threads seems to be a fluke.

TL;DR I don't think the idea quite works, particularly when a folio is under
contention, because if you have traffic on a cacheline then you certainly have
a couple of threads trying to grab a refcount. And doing two cmpxchgs just
increases traffic and pessimises things. Also perhaps worth noting that neither
solution scales in any way.
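
To make the "two cmpxchgs" point concrete, here is a small single-threaded
sketch that counts how many compare-exchange attempts each variant issues for
a given starting count. The helper names are hypothetical and this is not part
of the benchmark; it uses compare_exchange_strong so the counts are
deterministic when nothing else touches the counter.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical helpers: each returns the number of compare-exchange
// attempts it needed to take one reference (0 if the count was zero).

// Load first, then compare-exchange against the loaded value.
inline int attempts_load_then_cmpxchg(std::atomic<uint32_t>& ref) {
    int attempts = 0;
    uint32_t old = ref.load();
    while (old != 0) {
        ++attempts;
        if (ref.compare_exchange_strong(old, old + 1))
            break;
    }
    return attempts;
}

// Guess that the count is 1, and retry with the real value if it wasn't.
inline int attempts_optimistic(std::atomic<uint32_t>& ref) {
    int attempts = 0;
    uint32_t old = 1;
    do {
        ++attempts;
        if (ref.compare_exchange_strong(old, old + 1))
            break;
    } while (old != 0);
    return attempts;
}
```

With a starting count of 1 both variants need a single attempt. With any other
nonzero count (which is what you would expect on a folio several threads are
already holding), the load + cmpxchg loop still needs one attempt while the
optimistic variant needs two: the failed guess plus the real one.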

[1] https://gist.github.com/heatd/2a6e6c778c3cfd4aa6804b2d598c7a4c (excuse my C++)
--
Pedro