[RFC] making nested spin_trylock() work on UP?

From: Vlastimil Babka

Date: Fri Feb 13 2026 - 06:57:54 EST

Hi,

this is not a real RFC patch, more a discussion about a possible
direction. I wanted to have a patch at hand, but the layers of spinlock APIs
are rather complex for me to untangle, so I'd rather know first if it's even
worth trying.

The page allocator has been using a locking scheme for its percpu page
caches (pcp) for years now, based on spin_trylock() with no _irqsave() part.
The point is that if an interrupt arrives inside the locked section and its
handler tries to take the same lock, the trylock fails and the handler just
falls back to something more expensive. That is rare, so we don't need to
pay the irqsave cost all the time in the fastpaths.
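
As a rough illustration of the scheme above, here is a minimal single-threaded
userspace sketch, not kernel code. All names (model_lock, model_pcp,
model_alloc and so on) are hypothetical, and the "interrupt" is simulated by
calling the allocation path again while the lock is held.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model of a lock with state; not spinlock_t. */
struct model_lock {
	bool locked;
};

static bool model_trylock(struct model_lock *l)
{
	if (l->locked)
		return false;	/* already held: the caller must fall back */
	l->locked = true;
	return true;
}

static void model_unlock(struct model_lock *l)
{
	assert(l->locked);	/* unlocking a free lock would be a bug */
	l->locked = false;
}

/* Hypothetical stand-in for a percpu page cache. */
struct model_pcp {
	struct model_lock lock;
	int count;		/* pages currently cached */
};

enum alloc_path { ALLOC_FAST, ALLOC_SLOW };

/* Fast path: take a page from the cache only if the trylock succeeds. */
static enum alloc_path model_alloc(struct model_pcp *pcp)
{
	if (!model_trylock(&pcp->lock))
		return ALLOC_SLOW;	/* stands for the more expensive fallback */
	pcp->count--;
	model_unlock(&pcp->lock);
	return ALLOC_FAST;
}

/*
 * Same fast path, but with a simulated interrupt inside the locked
 * section. The nested attempt sees the lock held and reports the
 * fallback in *irq_result; the outer section is left undisturbed.
 */
static enum alloc_path model_alloc_interrupted(struct model_pcp *pcp,
					       enum alloc_path *irq_result)
{
	if (!model_trylock(&pcp->lock))
		return ALLOC_SLOW;
	*irq_result = model_alloc(pcp);	/* "interrupt handler" runs here */
	pcp->count--;
	model_unlock(&pcp->lock);
	return ALLOC_FAST;
}
```

In this model the nested attempt never touches pcp->count, which is the
property the trylock-without-irqsave scheme relies on.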

It's similar to, but not exactly, local_trylock_t (which is also newer
anyway), because in some cases we do lock the pcp of a non-local CPU to flush
it, in a way that's cheaper than an IPI or queue_work_on().

The complication of this scheme has been the UP non-debug spinlock
implementation, which assumes spin_trylock() can't fail on UP and has no
state to track it. It just doesn't anticipate this usage scenario. So to
work around that we disable IRQs on UP, complicating the implementation.
Also, we recently found a years-old bug in the implementation - see
038a102535eb ("mm/page_alloc: prevent pcp corruption with SMP=n").
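
To make the problem concrete, here is a small userspace model, with
hypothetical names throughout, of a trylock that keeps no state and therefore
always reports success. It is only a sketch of the behaviour described above,
not a copy of the kernel's UP code. The simulated interrupt enters the
critical section a second time while the outer section is still active.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: a lock type that carries no usable state. */
struct up_model_lock {
	char unused;
};

/* With no state to consult, trylock can only ever report success. */
static bool up_model_trylock(struct up_model_lock *l)
{
	(void)l;
	return true;
}

static void up_model_unlock(struct up_model_lock *l)
{
	(void)l;
}

/* Tracks how many times the "protected" section is entered at once. */
struct up_model_pcp {
	struct up_model_lock lock;
	int depth;
	int max_depth;
};

static bool up_model_enter(struct up_model_pcp *pcp)
{
	if (!up_model_trylock(&pcp->lock))
		return false;
	pcp->depth++;
	if (pcp->depth > pcp->max_depth)
		pcp->max_depth = pcp->depth;
	return true;
}

static void up_model_exit(struct up_model_pcp *pcp)
{
	assert(pcp->depth > 0);	/* exit without a matching enter is a bug */
	pcp->depth--;
	up_model_unlock(&pcp->lock);
}

/*
 * Enter the section, then simulate an interrupt that tries to enter it
 * again. Returns true if the nested entry was (wrongly) allowed.
 */
static bool up_model_nested_entry(struct up_model_pcp *pcp)
{
	bool nested = false;

	if (!up_model_enter(pcp))
		return false;
	if (up_model_enter(pcp)) {	/* "interrupt handler" runs here */
		nested = true;
		up_model_exit(pcp);
	}
	up_model_exit(pcp);
	return nested;
}
```

A max_depth of 2 after one call means both contexts were inside the section
at the same time, which is what disabling IRQs on UP is there to prevent.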

So my question is whether we could have a spinlock implementation supporting
this nested spin_trylock() usage, or whether the UP optimization is still
considered too important to lose. I was thinking:

- remove the UP implementation completely - would it increase the overhead
on SMP=n systems too much, and do we still care?

- make the non-debug implementation a bit like the debug one, so we do have
the 'locked' state (see include/linux/spinlock_up.h and lock->slock). This
also adds some overhead, but not as much as the full SMP implementation?
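
The second option could look roughly like the userspace sketch below. It
assumes a one-byte state where 1 means unlocked and 0 means locked, which is
my reading of the slock convention in the debug variant and should be checked
against the header. The names (up_state_lock and friends) are hypothetical
and none of this is meant as the actual kernel change.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical UP lock with one byte of state: 1 = unlocked, 0 = locked. */
struct up_state_lock {
	volatile char slock;
};

#define UP_STATE_LOCK_INIT { 1 }

/* Mark the lock as taken and report whether it was free beforehand. */
static bool up_state_trylock(struct up_state_lock *l)
{
	char old = l->slock;

	l->slock = 0;
	return old > 0;
}

static void up_state_unlock(struct up_state_lock *l)
{
	assert(l->slock == 0);	/* must be held when unlocking */
	l->slock = 1;
}

/*
 * Take the lock, then simulate an interrupt that tries to take it again.
 * Returns whether the nested trylock succeeded; with state it should not.
 */
static bool up_state_nested_trylock(struct up_state_lock *l)
{
	bool nested;

	if (!up_state_trylock(l))
		return false;
	nested = up_state_trylock(l);	/* "interrupt handler" runs here */
	up_state_unlock(l);
	return nested;
}
```

The cost in this sketch is one load and one store per trylock and one store
per unlock, with no atomic operations. Whether that stays true in the kernel
once barriers are accounted for is part of the open question.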

Below is how this would simplify page_alloc.c.

Thanks,
Vlastimil