Re: [PATCH] sched: Further restrict the preemption modes
From: Ilya Leoshkevich
Date: Tue Feb 24 2026 - 21:31:40 EST
On 2/24/26 16:45, Ciunas Bennett wrote:
On 19/12/2025 10:15, Peter Zijlstra wrote:
Hi Peter,
We are observing a performance regression on s390 since enabling PREEMPT_LAZY.
Test Environment
Architecture: s390
Setup:
Single KVM host running two identical guests
Guests are connected virtually via Open vSwitch
Workload: uperf streaming read test with 50 parallel connections
One guest acts as the uperf client, the other as the server
Open vSwitch configuration:
OVS bridge with two ports
Guests attached via virtio-net
Each guest configured with 4 vhost-queues
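For anyone wanting to reproduce the workload, a uperf profile along these lines should be close; the profile name, duration, and read size here are illustrative assumptions, not the exact configuration used:

```xml
<?xml version="1.0"?>
<!-- Hypothetical streaming-read profile: 50 parallel TCP connections,
     client reads from the server guest ($h is resolved by uperf from
     the environment). -->
<profile name="stream-read">
  <group nthreads="50">
    <transaction iterations="1">
      <flowop type="connect" options="remotehost=$h protocol=tcp"/>
    </transaction>
    <transaction duration="120s">
      <flowop type="read" options="size=64k"/>
    </transaction>
    <transaction iterations="1">
      <flowop type="disconnect"/>
    </transaction>
  </group>
</profile>
```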
Problem Description
When comparing PREEMPT_LAZY against full PREEMPT, we see a substantial drop in throughput, up to 50% on some systems.
Observed Behaviour
By tracing packets inside Open vSwitch (ovs_do_execute_action), we see:
Packet drops
Retransmissions
Reductions in packet size (from 64K down to 32K)
Capturing traffic inside the VM and inspecting it in Wireshark shows the following TCP-level differences between PREEMPT_FULL and PREEMPT_LAZY:
|--------------------------------------+--------------+--------------+------------------|
| Wireshark Warning / Note | PREEMPT_FULL | PREEMPT_LAZY | (lazy vs full) |
|--------------------------------------+--------------+--------------+------------------|
| D-SACK Sequence | 309 | 2603 | ×8.4 |
| Partial Acknowledgement of a segment | 54 | 279 | ×5.2 |
| Ambiguous ACK (Karn) | 32 | 747 | ×23 |
| (Suspected) spurious retransmission | 205 | 857 | ×4.2 |
| (Suspected) fast retransmission | 54 | 1622 | ×30 |
| Duplicate ACK | 504 | 3446 | ×6.8 |
| Packet length exceeds MSS (TSO/GRO) | 13172 | 34790 | ×2.6 |
| Previous segment(s) not captured | 9205 | 6730 | -27% |
| ACKed segment that wasn't captured | 7022 | 8272 | +18% |
| (Suspected) out-of-order segment | 436 | 303 | -31% |
|--------------------------------------+--------------+--------------+------------------|
This pattern indicates reordering, loss, or scheduling-related delays, but it is still unclear why PREEMPT_LAZY is causing this behaviour in this workload.
Additional observations:
Monitoring the guest CPU run time shows that it drops from 16% with PREEMPT_FULL to 9% with PREEMPT_LAZY.
The workload is dominated by voluntary preemption (schedule()), and PREEMPT_LAZY is, as far as I understand, mainly concerned with forced preemption.
It is therefore not obvious why PREEMPT_LAZY has an impact here.
Changing guest configuration to disable mergeable RX buffers:
<host mrg_rxbuf="off"/>
had a clear effect on throughput:
PREEMPT_LAZY: throughput improved from 40 Gb/s -> 60 Gb/s
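In libvirt terms that attribute sits on the interface's driver element; a sketch of the relevant guest XML, where the bridge name is a placeholder and the 4 queues match the vhost-queue setup above:

```xml
<interface type='bridge'>
  <source bridge='ovsbr0'/>          <!-- placeholder OVS bridge name -->
  <virtualport type='openvswitch'/>
  <model type='virtio'/>
  <driver name='vhost' queues='4'>   <!-- 4 vhost-queues as configured -->
    <host mrg_rxbuf='off'/>          <!-- disable mergeable RX buffers -->
  </driver>
</interface>
```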
When I look at top sched_switch kstacks on s390 with this workload, 20% of them are worker_thread() -> schedule(), both with CONFIG_PREEMPT and CONFIG_PREEMPT_LAZY. The others are vhost and idle.
On x86 I see only vhost and idle, but not worker_thread().
According to runqlat.bt, average run queue latency goes up from 4us to 18us when switching from CONFIG_PREEMPT to CONFIG_PREEMPT_LAZY.
I modified the script to show per-comm latencies, and it shows that worker_thread() is disproportionately penalized: the latency increases from 2us to 60us!
For vhost it's better: 5us -> 2us, and for KVM it's better too: 8us -> 2us.
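For reproducibility, the per-comm change amounts to keying the histogram on the incoming task's comm; a sketch based on the stock runqlat.bt from the bpftrace tools (the only modification is the map key):

```bpftrace
#ifndef BPFTRACE_HAVE_BTF
#include <linux/sched.h>
#endif

tracepoint:sched:sched_wakeup,
tracepoint:sched:sched_wakeup_new
{
	@qtime[args->pid] = nsecs;
}

tracepoint:sched:sched_switch
{
	// A still-runnable prev task was preempted and re-enters the queue.
	if (args->prev_state == TASK_RUNNING) {
		@qtime[args->prev_pid] = nsecs;
	}

	$ns = @qtime[args->next_pid];
	if ($ns) {
		// Keyed by comm instead of one global histogram.
		@usecs[args->next_comm] = hist((nsecs - $ns) / 1000);
	}
	delete(@qtime[args->next_pid]);
}
```

The sched_switch kstack breakdown mentioned above can be obtained the same way, e.g. with `@[kstack] = count();` in the sched_switch probe.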
Finally, what is the worker doing? I looked at __queue_work() kstacks, and they all come from irqfd_wakeup().
irqfd_wakeup() calls the arch-specific kvm_arch_set_irq_inatomic(), which is implemented on x86 but not on s390, so on s390 the injection always falls back to the workqueue.
This may explain why we are the first to see this on s390.
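To make the fallback concrete, here is a small userspace model of the decision; the names are hypothetical stand-ins, not the actual virt/kvm/eventfd.c code. When the arch hook cannot inject from atomic context it returns -EWOULDBLOCK, and irqfd_wakeup() then resorts to schedule_work(), which is exactly the worker_thread() activity showing up in the s390 traces:

```c
#include <errno.h>

/* Stand-in for the generic weak stub: architectures without an
 * implementation (s390 today) cannot inject from atomic context. */
static int set_irq_inatomic_unimplemented(void)
{
	return -EWOULDBLOCK;
}

/* Stand-in for an x86-style implementation that can inject inline. */
static int set_irq_inatomic_fastpath(void)
{
	return 0;
}

/* Model of the irqfd_wakeup() decision: try the atomic fast path,
 * and only queue work (waking a kworker) when it is unavailable. */
static const char *irqfd_wakeup_model(int (*set_irq_inatomic)(void))
{
	if (set_irq_inatomic() == -EWOULDBLOCK)
		return "schedule_work";	/* slow path via worker_thread() */
	return "inline";		/* fast path, no kworker involved */
}
```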
Christian, do you think it would make sense to implement kvm_arch_set_irq_inatomic() on s390?