Re: [PATCH 0/1] sched: Restore PREEMPT_NONE as default
From: IBM
Date: Sun Apr 05 2026 - 04:04:55 EST
Andres Freund <andres@xxxxxxxxxxx> writes:
> Hi,
>
> On 2026-04-04 21:40:29 -0400, Andres Freund wrote:
>> On 2026-04-04 13:42:22 -0400, Andres Freund wrote:
>> > On 2026-04-03 23:32:07 +0200, Peter Zijlstra wrote:
>> > > On Fri, Apr 03, 2026 at 07:19:36PM +0000, Salvatore Dipietro wrote:
>> > I'm not quite sure I understand why the spinlock in Salvatore's benchmark
>> > shows up this heavily:
>> >
>> > - For something like the benchmark here, it should only be used until
>> > postgres' buffer pool is fully used, as the freelist only contains buffers
>> > not in use, and we check without a lock whether it contains buffers. Once
>> > running, buffers are only added to the freelist if tables/indexes are
>> > dropped/truncated. And the benchmark seems to run long enough that we
>> > should actually reach the point where the freelist is empty?
>> >
>> > - The section covered by the spinlock is only a few instructions long and it
>> > is only hit if we have to do a somewhat heavyweight operation afterwards
>> > (reading a page into the buffer pool), so it seems surprising that this short
>> > section gets interrupted frequently enough to cause a regression of this
>> > magnitude.
>> >
>> > For a moment I thought it might be because, while holding the spinlock, some
>> > memory is touched for the first time, but that is actually not the case.
>> >
>>
>> I tried to reproduce the regression on a 2x Xeon Gold 6442Y with 256GB of
>> memory, running 3aae9383f42f (7.0.0-rc6 + some). That's just 48 cores / 96
>> threads, so it's smaller, and it's x86, not arm, but it's what I can quickly
>> update to an unreleased kernel.
>>
>>
>> So far I don't see such a regression and I see basically no time spent in
>> GetVictimBuffer()->StrategyGetBuffer()->s_lock() (< 0.2%).
>>
>> Which I don't find surprising: this workload doesn't read enough to have
>> contention in there. Salvatore reported on the order of 100k transactions/sec
>> (with one update, one read and one insert). Even if just about all of those
>> were misses - and they shouldn't be, with 25% of 384G as postgres'
>> shared_buffers as the script indicates, and we know that s_b is not full due
>> to even hitting GetVictimBuffer() - that'd just be ~200k IOs/sec from the
>> page cache. That's not that much.
>
>
>> The benchmark script seems to indicate that huge pages aren't in use:
>> https://github.com/aws/repro-collection/blob/main/workloads/postgresql/main.sh#L15
>>
>>
>> I wonder if somehow the pages underlying the portions of postgres' shared
>> memory are getting paged out for some reason, leading to page faults while
>> holding the spinlock?
>
> Hah. I had reflexively used huge_pages=on - as that is the only sane thing to
> do with 10s to 100s of GB of shared memory and thus part of all my
> benchmarking infrastructure - during the benchmark runs mentioned above.
>
> Turns out, if I *disable* huge pages, I actually can reproduce the contention
> that Salvatore reported (didn't see whether it's a regression for me
> though). Not anywhere close to the same degree, because the bottleneck for me
> is the writes.
>
> If I change the workload to a read-only benchmark, which obviously reads a lot
> more due to not being bottlenecked by durable-write latency, I see more
> contention:
>
> - 12.76% postgres postgres [.] s_lock
> - 12.75% s_lock
> - 12.69% StrategyGetBuffer
> GetVictimBuffer
> - StartReadBuffer
> - 12.69% ReleaseAndReadBuffer
> + 12.65% heapam_index_fetch_tuple
>
>
> While what I said above is true - the memory touched at the time of contention
> isn't the first access to the relevant shared memory (i.e. it is already
> backed by memory) - in this workload GetVictimBuffer()->StrategyGetBuffer()
> will be the first access of the connection processes to the relevant 4kB
> pages.
>
> Thus there will be a *lot* of minor faults and TLB misses while holding a
> spinlock. Unsurprisingly that's bad for performance.
>
>
> I don't see a reason to particularly care about the regression if that's the
> sole way to trigger it. Using a buffer pool of ~100GB without huge pages is
> not an interesting workload. With a smaller buffer pool the problem would not
> happen either.
>
> Note that the performance effect of not using huge pages is terrible
> *regardless of* the spinlock. PG 19 doesn't have the spinlock in this path
> anymore, but not using huge pages is still utterly terrible (like 1/3 of the
> throughput).
>
>
> I did run some benchmarks here and I don't see a clearly reproducible
> regression with huge pages.
>
However, out of curiosity, I was hoping someone more familiar with the
scheduler area could explain why PREEMPT_LAZY vs. PREEMPT_NONE causes a
performance regression without huge pages.
Minor page fault handling has microsecond latency, whereas scheduler ticks
are on the order of milliseconds. Besides, both preemption models should
anyway schedule() if TIF_NEED_RESCHED is set on return to userspace, right?
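(For reference, my mental model of that path - a simplified pseudocode sketch, not the exact code in kernel/entry/common.c:)

```
/* Simplified sketch of the exit-to-user loop: both PREEMPT_NONE and
 * PREEMPT_LAZY end up calling schedule() here once a resched flag is set. */
while (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY)) {
        schedule();
        ti_work = read_thread_flags();
}
```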
So I was curious to understand how the preemption model causes a
performance regression with no hugepages in this case?
-ritesh