Re: [PATCH 0/1] sched: Restore PREEMPT_NONE as default
From: Andres Freund
Date: Sat Apr 04 2026 - 13:42:31 EST
Hi,
On 2026-04-03 23:32:07 +0200, Peter Zijlstra wrote:
> On Fri, Apr 03, 2026 at 07:19:36PM +0000, Salvatore Dipietro wrote:
> > We are reporting a throughput and latency regression on PostgreSQL
> > pgbench (simple-update) on arm64 caused by commit 7dadeaa6e851
> > ("sched: Further restrict the preemption modes") introduced in
> > v7.0-rc1.
> >
> > The regression manifests as a 0.51x throughput drop on a pgbench
> > simple-update workload with 1024 clients on a 96-vCPU
> > (AWS EC2 m8g.24xlarge) Graviton4 arm64 system. Perf profiling
> > shows 55% of CPU time is consumed spinning in PostgreSQL's
> > userspace spinlock (s_lock()) under PREEMPT_LAZY:
> >
> > |- 56.03% - StartReadBuffer
> > |- 55.93% - GetVictimBuffer
> > |- 55.93% - StrategyGetBuffer
> > |- 55.60% - s_lock <<<< 55% of time
> > | |- 0.39% - el0t_64_irq
> > | |- 0.10% - perform_spin_delay
> > |- 0.08% - LockBufHdr
> > |- 0.07% - hash_search_with_hash_value
> > |- 0.40% - WaitReadBuffers
>
> The fix here is to make PostgreSQL make use of rseq slice extension:
>
> https://lkml.kernel.org/r/20251215155615.870031952@xxxxxxxxxxxxx
>
> That should limit the exposure to lock holder preemption (unless
> PostgreSQL is doing seriously egregious things).
Maybe we should, but requiring the use of a new low-level facility introduced
in the 7.0 kernel to address a regression that exists only in 7.0+ seems not
great.
It's not like it's a completely trivial thing to add support for, either, so I
doubt it'd be the right thing to backpatch into already-released major
versions of postgres.
This specific spinlock doesn't actually exist anymore in postgres' trunk
(feature freeze in a few days, release early autumn). But there is at least
one other one that can often be quite hotly contended (although there is a
relatively low limit to the number of backends that can acquire it
concurrently, which might be the saving grace here).
I'm not quite sure I understand why the spinlock in Salvatore's benchmark
shows up this heavily:
- For something like the benchmark here, it should only be used until
postgres' buffer pool is fully used, as the freelist only contains buffers
not in use, and we check without a lock whether it contains any. Once
running, buffers are only added to the freelist if tables/indexes are
dropped/truncated. And the benchmark seems to run long enough that we
should actually reach the point where the freelist is empty?
- The section covered by the spinlock is only a few instructions long, and it
is only hit if we have to do a somewhat heavyweight operation afterwards
(reading a page into the buffer pool), so it seems surprising that this
short section gets interrupted frequently enough to cause a regression of
this magnitude.
For a moment I thought it might be because, while holding the spinlock, some
memory is touched for the first time, but that is actually not the case.
The benchmark script seems to indicate that huge pages aren't in use:
https://github.com/aws/repro-collection/blob/main/workloads/postgresql/main.sh#L15
I wonder if somehow the pages underlying the portions of postgres' shared
memory are getting paged out for some reason, leading to page faults while
holding the spinlock?
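One rough way to sanity-check that hypothesis would be to sum up the Rss/Swap
accounting for a backend's mappings (the process selection below is an
assumption, any backend PID would do; this counts all mappings, not just the
shared memory, but on a backend most of a large Rss should be shared buffers):

```shell
# Sketch: is postgres' memory resident, or has some of it been swapped out?
# pgrep pattern is an assumption; substitute any backend PID.
PID=$(pgrep -o -x postgres)
awk '/^(Rss|Swap):/ { sum[$1] += $2 }
     END { for (k in sum) print k, sum[k], "kB" }' "/proc/$PID/smaps"
```

A nonzero Swap total while the benchmark runs would point towards page faults
being taken while the spinlock is held.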
Salvatore, could you repeat that benchmark with some variations?
1) Use huge pages
2) 1) + prewarm the buffer pool before running the benchmark:
CREATE EXTENSION pg_prewarm;
-- prewarm table data
SELECT pg_prewarm(oid) FROM pg_class WHERE relname LIKE 'pgbench_accounts%' AND relkind = 'r';
-- prewarm indexes; do so after the tables, as indexes are more important and
-- the buffer pool might not be big enough for both
SELECT pg_prewarm(oid) FROM pg_class WHERE relname LIKE 'pgbench_accounts%' AND relkind = 'i';
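For 1), a rough sizing sketch (the shared_buffers value, the 2MB huge page
size, and the slack amount are all assumptions; adjust to the actual setup):

```shell
# Reserve enough 2MB huge pages for postgres' shared memory (assumed values).
SHARED_BUFFERS_MB=16384
HUGEPAGE_SIZE_MB=2
# round up, plus some slack for postgres' shared memory beyond shared_buffers
NR_HUGEPAGES=$(( (SHARED_BUFFERS_MB + HUGEPAGE_SIZE_MB - 1) / HUGEPAGE_SIZE_MB + 64 ))
echo "vm.nr_hugepages = $NR_HUGEPAGES"
# as root: sysctl -w vm.nr_hugepages="$NR_HUGEPAGES"
# and in postgresql.conf: huge_pages = on
```

With huge_pages = on (rather than the default "try") postgres refuses to start
if the reservation is insufficient, which makes it obvious whether huge pages
are actually in use.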
I assume postgres was built with an -march sufficient to use LSE atomic
instructions (i.e. -march=armv8.1-a or such) instead of ll/sc? Or that at
least -moutline-atomics was used?
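In case it's useful, a quick way to check an existing build (the binary path
is an assumption; LSE atomics show up as ldadd/ldclr/ldset/ldeor/swp/cas*,
ll/sc as ldxr/stxr style pairs):

```shell
# Count LSE atomics vs. LL/SC instructions in the postgres binary (arm64).
# The path is an assumption; adjust to the actual install.
BIN=/usr/local/pgsql/bin/postgres
echo "LSE atomics: $(objdump -d "$BIN" | grep -Ec 'ldadd|ldclr|ldset|ldeor|swp|cas')"
echo "LL/SC pairs: $(objdump -d "$BIN" | grep -Ec 'ldaxr|stlxr|ldxr|stxr')"
```

Note that with -moutline-atomics both kinds will appear, since the libgcc
helpers contain both variants and dispatch at runtime.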
Greetings,
Andres Freund