Re: [PATCH 0/1] sched: Restore PREEMPT_NONE as default

From: Mitsumasa KONDO

Date: Sun Apr 05 2026 - 21:46:33 EST


Hi Andres,

Thank you for testing this.

On 2026-04-06, Andres Freund wrote:
> It's not sustained, the spinning just lasts between 10 and 1000
> iterations, after that there's randomized exponential backoff using
> nanosleep.
> Which actually will happen after a smaller number of cycles with a
> shorter SPIN_DELAY.

> If I remove the rep nop on x86-64, the performance of the 4kB pages
> workload is basically unaffected, even with PREEMPT_LAZY.

The fact that removing rep nop made no difference suggests that the
spinlock is not the bottleneck in your environment. Could you share
your storage configuration? Salvatore's setup uses 12x 1TB AWS io2 at
32000 IOPS each (384K IOPS total in RAID0), which effectively
eliminates WAL fsync as a bottleneck. In a storage-limited
environment, changes to spin delay behavior would naturally be
invisible because throughput is capped by I/O before spinlock
contention becomes material.
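For context, here is a simplified sketch of the spin-then-backoff
pattern being discussed (modeled loosely on PostgreSQL's s_lock.c;
the constants and function names here are mine, not from any tree):

```c
#include <stdatomic.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical bounds, in the spirit of s_lock.c's spins-per-delay
 * adaptation; the real values and tuning live in PostgreSQL. */
#define MIN_SPINS_PER_DELAY 10
#define MAX_SPINS_PER_DELAY 1000

static inline void cpu_relax(void)
{
#if defined(__x86_64__) || defined(__i386__)
	__asm__ __volatile__("rep; nop");	/* PAUSE: spin-wait hint */
#endif
}

/* Spin a bounded number of times, then fall back to a randomized
 * exponential nanosleep backoff, as described in the quoted mail. */
static void spin_until_free(atomic_flag *lock, int spins_per_delay)
{
	long delay_ns = 1000;			/* 1 us initial backoff */

	for (;;) {
		for (int i = 0; i < spins_per_delay; i++) {
			if (!atomic_flag_test_and_set_explicit(lock,
						memory_order_acquire))
				return;		/* acquired */
			cpu_relax();		/* the rep nop in question */
		}
		/* Randomized exponential backoff via nanosleep. */
		struct timespec ts = { 0, delay_ns + rand() % delay_ns };
		nanosleep(&ts, NULL);
		if (delay_ns < 1000000)		/* cap around 1 ms */
			delay_ns *= 2;
	}
}
```

The point is that once the loop reaches the nanosleep path, the cost of
the cpu_relax() hint is already secondary to the sleep itself, which
matches your observation that removing it changed little.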

Also worth noting: Salvatore's environment is an EC2 instance
(m8g.24xlarge), not bare metal. Hypervisor-level vCPU scheduling
adds another layer on top of PREEMPT_LAZY -- a lock holder can be
descheduled not only by the kernel scheduler but also by the
hypervisor, and the guest kernel has no visibility into this. This
could amplify the regression in ways that are not reproducible on
bare-metal systems, regardless of architecture.

If you want to isolate the effect of SPIN_DELAY on throughput
under PREEMPT_LAZY, I would suggest:

1. Use synchronous_commit = off or unlogged tables to remove
I/O from the critical path entirely.
2. Use a read-only workload (pgbench -S) with shared_buffers
sized small enough to force buffer eviction contention.
3. Run on a high-core-count system with all CPUs saturated
under PREEMPT_LAZY.

This should expose the pure impact of spin loop behavior without
I/O or WAL masking the results.

> The spinning helps with workloads that are contended for very short
> amounts of time. But that's not the case in this workload without
> huge pages, instead of low 10s of cycles, we regularly spend a few
> orders of magnitude more cycles holding the lock.

I agree that the 4kB page / huge page difference is significant.
But even when individual spin durations are short, the cumulative
effect across hundreds of backends matters. Small per-iteration
overhead in the spin loop, multiplied by high concurrency, can
add up to measurable throughput loss -- an effect that becomes
visible only when I/O is not the dominant bottleneck.
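The per-iteration cost is easy to measure directly. A quick x86-only
microbenchmark (function names are mine) times a burst of rep nop
with RDTSC; note that Intel raised the PAUSE latency substantially
starting with Skylake, so the per-iteration cost varies a lot by core:

```c
#include <stdint.h>

/* Read the time-stamp counter (x86 only). */
static inline uint64_t rdtsc(void)
{
	uint32_t lo, hi;
	__asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
	return ((uint64_t)hi << 32) | lo;
}

/* Average TSC cycles per "rep; nop" (PAUSE) over a burst. */
static double cycles_per_pause(int iters)
{
	uint64_t start = rdtsc();

	for (int i = 0; i < iters; i++)
		__asm__ __volatile__("rep; nop");
	return (double)(rdtsc() - start) / iters;
}
```

Printing cycles_per_pause(100000) on the machines in question would
tell us how much of the spin loop's footprint is the hint itself
versus the cache-line bouncing on the lock word.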

Regards,
--
Mitsumasa KONDO
NTT Software Innovation Center