Re: [PATCH 0/1] sched: Restore PREEMPT_NONE as default

From: Andres Freund

Date: Sun Apr 05 2026 - 10:09:52 EST


Hi,

On 2026-04-05 11:38:59 +0530, Ritesh Harjani wrote:
> Andres Freund <andres@xxxxxxxxxxx> writes:
> > Hah. I had reflexively used huge_pages=on - as that is the only sane thing to
> > do with 10s to 100s of GB of shared memory and thus part of all my
> > benchmarking infrastructure - during the benchmark runs mentioned above.
> >
> > Turns out, if I *disable* huge pages, I actually can reproduce the contention
> > that Salvatore reported (didn't see whether it's a regression for me
> > though). Not anywhere close to the same degree, because the bottleneck for me
> > is the writes.
> >
> > If I change the workload to a read-only benchmark, which obviously reads a
> > lot more due to not being bottlenecked by durable-write latency, I see more
> > contention:
> >
> > - 12.76% postgres postgres [.] s_lock
> >    - 12.75% s_lock
> >       - 12.69% StrategyGetBuffer
> >            GetVictimBuffer
> >          - StartReadBuffer
> >             - 12.69% ReleaseAndReadBuffer
> >                + 12.65% heapam_index_fetch_tuple
> >
> >
> > While what I said above is true (the memory touched at the time of
> > contention isn't the first access to the relevant shared memory, i.e. it
> > is already backed by memory), in this workload
> > GetVictimBuffer()->StrategyGetBuffer() will be the first access by the
> > connection processes to the relevant 4kB pages.
> >
> > Thus there will be a *lot* of minor faults and tlb misses while holding a
> > spinlock. Unsurprisingly that's bad for performance.
> >
> >
> > I don't see a reason to particularly care about the regression if that's the
> > sole way to trigger it. Using a buffer pool of ~100GB without huge pages is
> > not an interesting workload. With a smaller buffer pool the problem would not
> > happen either.
> >
> > Note that the performance effect of not using huge pages is terrible
> > *regardless* of the spinlock. PG 19 doesn't have the spinlock in this path
> > anymore, but not using huge pages is still utterly terrible (like 1/3 of
> > the throughput).
> >
> >
> > I did run some benchmarks here and I don't see a clearly reproducible
> > regression with huge pages.
> >
>
> However, out of curiosity, I was hoping someone more familiar with the
> scheduler area could explain why PREEMPT_LAZY vs. PREEMPT_NONE causes a
> performance regression without huge pages?
>
> Minor page fault handling has microsecond latency, whereas the sched tick
> is in milliseconds. Besides, both preemption models should schedule()
> anyway if TIF_NEED_RESCHED is set on return to userspace, right?
>
> So I was curious to understand how the preemption model causes a
> performance regression with no hugepages in this case?

An attempt at answering that, albeit not from the angle of somebody knowing
the scheduler code to a meaningful degree:

I think the effect of 4kB pages (and the associated minor faults and TLB
misses) is just to create contention on a spinlock that would normally never
be contended, because the faults occur while the spinlock is held in this
quite extreme workload. This contention happens with PREEMPT_NONE as well -
the performance is quite bad compared to when using huge pages.

My guess is that PREEMPT_LAZY just exacerbates the terrible contention by
scheduling out the lock holder more often. But you're already in deep trouble
at that point, even without PREEMPT_LAZY making it "worse".

On my machine (smaller than Salvatore's) PREEMPT_LAZY is worse, but not by
that much. I suspect that for Salvatore PREEMPT_LAZY just made already
terrible contention worse. I think we really need a comparison run from
Salvatore with huge pages under both PREEMPT_NONE and PREEMPT_LAZY.


FWIW, from what I can tell, the whole "WHAA, it's a userspace spinlock" part
has just about nothing to do with the problem. My understanding is that
default futexes don't transfer the lock waiter's scheduler slice to the lock
holder (there's no information about who the lock holder is unless it's a PI
futex). Postgres' spinlocks have randomized exponential backoff, and the
amount of spinning is adjusted over time, so you don't actually end up with
spinlock waiters preventing the lock owner from getting scheduled to a
significant degree.

Greetings,

Andres Freund