Re: [PATCH 0/1] sched: Restore PREEMPT_NONE as default

From: Andres Freund

Date: Sun Apr 05 2026 - 00:22:05 EST


Hi,

On 2026-04-04 21:40:29 -0400, Andres Freund wrote:
> On 2026-04-04 13:42:22 -0400, Andres Freund wrote:
> > On 2026-04-03 23:32:07 +0200, Peter Zijlstra wrote:
> > > On Fri, Apr 03, 2026 at 07:19:36PM +0000, Salvatore Dipietro wrote:
> > I'm not quite sure I understand why the spinlock in Salvatore's benchmark
> > shows up this heavily:
> >
> > - For something like the benchmark here, it should only be used until
> > postgres' buffer pool is fully used, as the freelist only contains buffers
> > not in use, and we check without a lock whether it contains buffers. Once
> > running, buffers are only added to the freelist if tables/indexes are
> > dropped/truncated. And the benchmark seems like it runs long enough that we
> > should actually reach the point where the freelist is empty?
> >
> > - The section covered by the spinlock is only a few instructions long and it
> > is only hit if we have to do a somewhat heavyweight operation afterwards
> > (reading a page into the buffer pool), so it seems surprising that this short
> > section gets interrupted frequently enough to cause a regression of this
> > magnitude.
> >
> > For a moment I thought it might be because, while holding the spinlock, some
> > memory is touched for the first time, but that is actually not the case.
> >
>
> I tried to reproduce the regression on a 2x Xeon Gold 6442Y with 256GB of
> memory, running 3aae9383f42f (7.0.0-rc6 + some). That's just 48 cores / 96
> threads, so it's smaller, and it's x86, not arm, but it's what I can quickly
> update to an unreleased kernel.
>
>
> So far I don't see such a regression, and I basically see no time spent in
> GetVictimBuffer()->StrategyGetBuffer()->s_lock() (< 0.2%).
>
> Which I don't find surprising: this workload doesn't read enough to have
> contention in there. Salvatore reported on the order of 100k transactions/sec
> (with one update, one read and one insert). Even if just about all of those
> were misses - and they shouldn't be with 25% of 384G as postgres'
> shared_buffers as the script indicates, and we know that s_b is not full due
> to even hitting GetVictimBuffer() - that'd be just ~200k IOs/sec from the
> page cache. That's not that much.


> The benchmark script seems to indicate that huge pages aren't in use:
> https://github.com/aws/repro-collection/blob/main/workloads/postgresql/main.sh#L15
>
>
> I wonder if somehow the pages underlying the portions of postgres' shared
> memory are getting paged out for some reason, leading to page faults while
> holding the spinlock?

Hah. I had reflexively used huge_pages=on - as that is the only sane thing to
do with 10s to 100s of GB of shared memory and thus part of all my
benchmarking infrastructure - during the benchmark runs mentioned above.

Turns out, if I *disable* huge pages, I actually can reproduce the contention
that Salvatore reported (though I didn't check whether it's a regression for
me). Not anywhere close to the same degree, because the bottleneck for me is
the writes.

If I change the workload to a read-only benchmark, which obviously reads a lot
more due to not being bottlenecked by durable-write latency, I see more
contention:

-   12.76%  postgres  postgres  [.] s_lock
   - 12.75% s_lock
      - 12.69% StrategyGetBuffer
           GetVictimBuffer
         - StartReadBuffer
            - 12.69% ReleaseAndReadBuffer
               + 12.65% heapam_index_fetch_tuple


While what I said above is true - the memory touched at the time of contention
isn't the first access to the relevant shared memory, i.e. it is already
backed by memory - in this workload GetVictimBuffer()->StrategyGetBuffer()
will be the first access by the connection processes to the relevant 4kB
pages.

Thus there will be a *lot* of minor faults and TLB misses while holding a
spinlock. Unsurprisingly, that's bad for performance.


I don't see a reason to particularly care about the regression if that's the
sole way to trigger it. Using a buffer pool of ~100GB without huge pages is
not an interesting workload. With a smaller buffer pool the problem would not
happen either.
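For anyone trying to reproduce this, a quick way to verify whether huge pages
are actually configured and consumed on the host (my own check, not from the
benchmark script):

```shell
# Host side: are huge pages configured, and are any actually in use?
grep -E 'HugePages_(Total|Free)|Hugepagesize' /proc/meminfo

# PostgreSQL side (huge_pages_status requires PG 17+):
#   psql -Atc "SHOW huge_pages" -c "SHOW huge_pages_status"
```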

Note that the performance effect of not using huge pages is terrible
*regardless* of the spinlock. PG 19 does not have the spinlock in this path
anymore, but not using huge pages is still utterly terrible (like 1/3 of the
throughput).


I did run some benchmarks here and I don't see a clearly reproducible
regression with huge pages.


Greetings,

Andres Freund