Re: [PATCH 0/1] sched: Restore PREEMPT_NONE as default
From: Mitsumasa KONDO
Date: Fri Apr 10 2026 - 11:43:32 EST
Hi,
Thank you, Salvatore, for the thorough cross-architecture benchmarks
and pg_stat_activity data, and Andres for the insightful analysis
of the spinlock and huge page behavior.
Apologies for yet another hypothesis, but I suspect the regression
may not be limited to spinlock contention alone -- it could be
triggering a secondary feedback loop in the kernel's Buffered I/O
throttling:
PREEMPT_LAZY -> lock holders preempted during page faults
-> spinlock contention (838/1024 backends stalled)
-> dirty page generation rate drops
-> bdi->write_bandwidth estimate converges to artificially
low value (exponential smoothing makes recovery slow)
-> balance_dirty_pages() over-throttles on next burst
-> throughput cannot recover -> 0.51x
Huge pages break this loop at the entry point: fewer TLB misses
mean fewer page faults while holding spinlocks, so contention
never escalates and the bandwidth estimator stays calibrated.
Note that the rseq slice extension was validated with Oracle,
which uses Direct I/O, bypassing the page cache and
balance_dirty_pages() entirely. PostgreSQL uses Buffered I/O
by default, so this class of regression would not have been
caught in that validation.
My worst-case concern is that this loop could affect any
Buffered I/O workload where PREEMPT_LAZY disrupts dirty page
generation patterns, not just PostgreSQL.
Regards,
--
Mitsumasa KONDO
NTT Software Innovation Center