Re: [PATCH 0/1] mm: improve folio refcount scalability

From: Linus Torvalds

Date: Sun Mar 01 2026 - 13:53:31 EST


On Sat, 28 Feb 2026 at 19:27, Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxx> wrote:
>
> This attached patch is ENTIRELY UNTESTED.

Here's a slightly cleaned up and further simplified version, which is
also actually tested, although only in the "it boots for me" sense.

It generates good code at least with clang:

.LBB76_7:
	movl	$1, %eax
.LBB76_8:
	leal	1(%rax), %ecx
	lock cmpxchgl	%ecx, 52(%rdi)
	sete	%cl
	je	.LBB76_10
	testl	%eax, %eax
	jne	.LBB76_8
.LBB76_10:

which actually looks both simple and fairly optimal for that sequence.
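
For anyone who wants to poke at the idea outside the kernel, here is a
minimal user-space sketch of the same loop using C11 atomics. The
function name is made up for illustration, and
atomic_compare_exchange_strong() only stands in for the kernel's
atomic_try_cmpxchg(); treat it as an approximation of the pattern, not
as the kernel code.

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Hypothetical user-space analogue of the loop in the patch below.
 * Takes one reference unless the count is zero, guessing that the
 * current count is 1 instead of loading it first.
 */
static bool ref_inc_unless_zero_guess_one(atomic_int *count)
{
	int old = 1;	/* optimistic guess, no initial load */

	do {
		/* On failure, 'old' is updated to the value actually seen. */
		if (atomic_compare_exchange_strong(count, &old, old + 1))
			return true;
	} while (old);

	return false;	/* count was zero, no reference taken */
}
```

If the guess is right, the first compare-and-exchange succeeds without
a separate read of the counter. If it is wrong, the failed exchange
reports the real value, so the retry costs no extra load either.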

Of course, since this is very much about cacheline access patterns,
actual performance will depend on random microarchitectural issues
(and not just the CPU core, but the whole memory subsystem).

Can somebody with a good - and relevant - benchmark system try this out?

Linus
 include/linux/page_ref.h | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h
index 544150d1d5fd..d8e4f175f74c 100644
--- a/include/linux/page_ref.h
+++ b/include/linux/page_ref.h
@@ -234,8 +234,15 @@ static inline bool page_ref_add_unless(struct page *page, int nr, int u)

 	rcu_read_lock();
 	/* avoid writing to the vmemmap area being remapped */
-	if (page_count_writable(page, u))
-		ret = atomic_add_unless(&page->_refcount, nr, u);
+	if (page_count_writable(page, u)) {
+		/* Assume count == 1, don't read it! */
+		int old = 1;
+		do {
+			ret = atomic_try_cmpxchg(&page->_refcount, &old, old+1);
+			if (likely(ret))
+				break;
+		} while (old);
+	}
 	rcu_read_unlock();

 	if (page_ref_tracepoint_active(page_ref_mod_unless))