Re: [PATCH 0/1] sched: Restore PREEMPT_NONE as default
From: Mitsumasa KONDO
Date: Fri Apr 10 2026 - 11:43:32 EST
Hi,
Thank you, Salvatore, for the thorough cross-architecture benchmarks
and pg_stat_activity data, and Andres for the insightful analysis
of the spinlock and huge page behavior.
Apologies for yet another hypothesis, but I suspect the regression
may not be limited to spinlock contention alone -- it could be
triggering a secondary feedback loop in the kernel's Buffered I/O
throttling:
PREEMPT_LAZY -> lock holders preempted during page faults
-> spinlock contention (838/1024 backends stalled)
-> dirty page generation rate drops
-> bdi->write_bandwidth estimate converges to artificially
low value (exponential smoothing makes recovery slow)
-> balance_dirty_pages() over-throttles on next burst
-> throughput cannot recover -> 0.51x
Huge pages break this loop at the entry point: fewer TLB misses
mean fewer page faults while holding spinlocks, so contention
never escalates and the bandwidth estimator stays calibrated.
Note that the rseq slice extension was validated with Oracle,
which uses Direct I/O, bypassing the page cache and
balance_dirty_pages() entirely. PostgreSQL uses Buffered I/O
by default, so this class of regression would not have been
caught in that validation.
My worst-case concern is that this loop could affect any
Buffered I/O workload where PREEMPT_LAZY disrupts dirty page
generation patterns, not just PostgreSQL.
Regards,
--
Mitsumasa KONDO
NTT Software Innovation Center