Re: [PATCH 0/1] sched: Restore PREEMPT_NONE as default

From: Andres Freund

Date: Sat Apr 04 2026 - 21:40:42 EST


Hi,

On 2026-04-04 13:42:22 -0400, Andres Freund wrote:
> On 2026-04-03 23:32:07 +0200, Peter Zijlstra wrote:
> > On Fri, Apr 03, 2026 at 07:19:36PM +0000, Salvatore Dipietro wrote:
> I'm not quite sure I understand why the spinlock in Salvatore's benchmark
> shows up this heavily:
>
> - For something like the benchmark here, it should only be used until
> postgres' buffer pool is fully used, as the freelist only contains buffers
> not in use, and we check without a lock whether it contains buffers. Once
> running, buffers are only added to the freelist if tables/indexes are
> dropped/truncated. And the benchmark seems to run long enough that we
> should actually reach the point where the freelist is empty?
>
> - The section covered by the spinlock is only a few instructions long and it
> is only hit if we have to do a somewhat heavyweight operation afterwards
> (read a page into the buffer pool), so it seems surprising that this short
> section gets interrupted frequently enough to cause a regression of this
> magnitude.
>
> For a moment I thought it might be because, while holding the spinlock, some
> memory is touched for the first time, but that is actually not the case.
>

I tried to reproduce the regression on a 2x Xeon Gold 6442Y with 256GB of
memory, running 3aae9383f42f (7.0.0-rc6 + some). That's just 48 cores / 96
threads, so it's smaller, and it's x86, not arm, but it's what I can quickly
update to an unreleased kernel.


So far I don't see such a regression, and I see basically no time spent in
GetVictimBuffer()->StrategyGetBuffer()->s_lock() (< 0.2%).

Which I don't find surprising: this workload doesn't read enough to have
contention in there. Salvatore reported on the order of 100k transactions/sec
(with one update, one read and one insert). Even if just about all of those
were misses - and they shouldn't be, with 25% of 384G as postgres'
shared_buffers as the script indicates, and we know that s_b is not full due
to even hitting GetVictimBuffer() - that'd just be ~200k IOs/sec from the
page cache. That's not that much.


Now, this machine is smaller and a different arch, so who knows.

The 7.0-rc numbers I am getting are higher than what Salvatore reported on a
bigger machine. It's hard to compare though, as I am testing with local
storage, and this workload should be extremely write latency bound (but my
storage has crappy fsync latency, so ...).


I *do* see some contention where it's conceivable that rseq slice extension
could help some, but

a) It's a completely different lock: the WALWrite lock

Which is precisely the lock you'd expect in a commit latency bound workload
with a lot of clients (the lock is used to wait for an in-flight WAL flush
to complete).

b) So far I have not observed a regression from 6.18.


For me a profile looks like this:
- 60.99% 0.95% postgres postgres [.] PostgresMain
- 60.04% PostgresMain
- 22.57% PortalRun
- 20.88% PortalRunMulti
- 16.70% standard_ExecutorRun
- 16.55% ExecModifyTable
+ 10.78% ExecScan
+ 3.19% ExecUpdate
+ 1.53% ExecInsert
+ 2.94% standard_ExecutorStart
0.54% standard_ExecutorEnd
+ 1.60% PortalRunSelect
- 15.89% CommitTransactionCommand
- 15.50% CommitTransaction
- 11.90% XLogFlush
- 7.66% LWLockAcquireOrWait
6.70% LWLockQueueSelf
0.57% perform_spin_delay

Which is about what I would expect.


Salvatore, is there a chance your profile is corrupted and you did observe
contention, but on a different lock? E.g. due to out-of-date debug symbols or
such?


Could you run something like the following while the benchmark is running:

SELECT backend_type, wait_event_type, wait_event, state, count(*)
FROM pg_stat_activity
WHERE wait_event_type NOT IN ('Activity')
GROUP BY backend_type, wait_event_type, wait_event, state
ORDER BY count(*) DESC \watch 1

and show what you see at the time your profile shows the bad contention?



On 2026-04-04 13:42:22 -0400, Andres Freund wrote:
> On 2026-04-03 23:32:07 +0200, Peter Zijlstra wrote:
> > The fix here is to make PostgreSQL make use of rseq slice extension:
> >
> > https://lkml.kernel.org/r/20251215155615.870031952@xxxxxxxxxxxxx
> >
> > That should limit the exposure to lock holder preemption (unless
> > PostgreSQL is doing seriously egregious things).
>
> Maybe we should, but requiring the use of a new low level facility that was
> introduced in the 7.0 kernel, to address a regression that exists only in
> 7.0+, seems not great.
>
> It's not like it's a completely trivial thing to add support for either, so I
> doubt it'll be the right thing to backpatch it into already released major
> versions of postgres.

It's not even suggested to be enabled by default:

CONFIG_RSEQ_SLICE_EXTENSION:

Allows userspace to request a limited time slice extension when
returning from an interrupt to user space via the RSEQ shared
data ABI. If granted, that allows to complete a critical section,
so that other threads are not stuck on a conflicted resource,
while the task is scheduled out.

If unsure, say N.

And enabling it requires EXPERT=1.

If this somehow does end up being a reproducible performance issue (I still
suspect something more complicated is going on), I don't see how userspace
could be expected to work around a substantial perf regression in 7.0 whose
only mitigation is default-off, non-trivial functionality also introduced
in 7.0.

Greetings,

Andres Freund