Re: [PATCH] sched: Further restrict the preemption modes

From: Ciunas Bennett

Date: Fri Jun 05 2026 - 06:47:50 EST

On 03/03/2026 11:52, Peter Zijlstra wrote:

This has two ramifications:

1) some ping-pong workloads will turn into block+wakeup, adding
overhead.

FULL: running your task A, an interrupt would come in, wake task B and
set Need Resched and the interrupt return path calls schedule() and
you're task B. B does its thing, 'wakes' A and blocks.

LAZY: running your task A, an interrupt would come in, wake task B (no
NR set), you continue running A, A blocks for it needs something of B,
now you schedule() [*] B runs, does its thing, does an actual wakeup of
A and blocks.

The distinct difference here is that LAZY does a block of A and
consequently B has to do a full wakeup of A, whereas FULL doesn't do a
block of A, and hence the wakeup of A is NOP as well.

2) Since the schedule() is delayed, it might happen that by the time it
does get around to it, your task B is no longer the most eligible
option.

Same as above, except now, C is also woken, and the schedule marked with
[*] picks C, this then results in a detour, delaying things further.

Hi Peter,
I wanted to share the findings from my investigation of the issue described above.

Quick refresh:
Workload: uperf sending TCP data between two VMs (client and server), each configured with a single vhost queue (the minimum number of vhost queues, for testing)
Issue: With lazy preemption as the default preemption mode, where it was previously full preemption, this workload shows a significant drop in performance

Simplification of the issue
We have two tasks:

TaskA produces data
TaskB consumes the data produced by TaskA

Notification path: TaskA informs TaskB that new data is available by adding a new item to a workqueue. This triggers a kworker which runs and notifies TaskB.

Issue
TaskA is configured to use schedule_work(). Internally, schedule_work() uses system_percpu_wq, which is configured as:
<WQ_PERCPU = 1 << 8, /* bound to a specific cpu */>

This means the kworker that executes the work item is woken up and runs on the same CPU that queued the work.
If the task that queues the work (TaskA) is a long-running task with limited opportunities to call schedule(), then the kworker may be delayed significantly before it gets CPU time.
In our scenario:

TaskA continuously produces data
There is no dependency requiring TaskA to yield due to TaskB
As a result, TaskA can occupy the CPU for an entire tick before being preempted by the kworker
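
To make the scenario above concrete, here is a toy userspace model (not kernel code) of how long the kworker waits for the CPU when TaskA never blocks. Everything in it is an assumption made for illustration: the names are hypothetical, the tick length is a free parameter, full preemption is modelled as immediate preemption on wakeup, and an unbound kworker is assumed to find an idle CPU at once. It is a sketch of the reasoning only, not a reproduction of the measurements.

```c
/* Toy model only; none of these names exist in the kernel. */
enum toy_preempt { TOY_FULL, TOY_LAZY };
enum toy_wq { TOY_WQ_PERCPU, TOY_WQ_UNBOUND };

/*
 * Microseconds between TaskA queueing work at queued_at_us and the
 * kworker getting a CPU, assuming TaskA never blocks or yields.
 */
static long toy_kworker_delay_us(enum toy_preempt mode, enum toy_wq wq,
                                 long queued_at_us, long tick_us)
{
    /* Assumption: an unbound kworker finds an idle CPU immediately. */
    if (wq == TOY_WQ_UNBOUND)
        return 0;

    /* Assumption: under full preemption the wakeup preempts TaskA at once. */
    if (mode == TOY_FULL)
        return 0;

    /* Lazy + per-CPU: TaskA keeps the CPU until the next tick boundary. */
    return tick_us - (queued_at_us % tick_us);
}
```

With a hypothetical 1000 us tick, calling toy_kworker_delay_us(TOY_LAZY, TOY_WQ_PERCPU, 0, 1000) models the worst case where the work is queued right at a tick boundary and the kworker waits the entire tick. The same call with TOY_FULL or TOY_WQ_UNBOUND returns 0 in this model.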

Observed behavior
This is exactly what we observe in practice:

TaskB corresponds to the VM consuming data generated by our vhost task
When running uperf, this behavior leads to a significant drop in throughput (Gb/s)
The VM is unable to consume data in a timely manner
When it is finally notified of new data, the delayed signaling introduces jitter
This causes TCP issues, including retransmissions and out-of-order packets

Results:
|--------------+------+------------------+--------------------------|
| preempt mode | Gb/s | workqueue pool   | kworker latency avg (ms) |
|--------------+------+------------------+--------------------------|
| full         | ~50  | system_percpu_wq | 0.002                    |
| lazy         | ~13  | system_percpu_wq | 0.721                    |
| lazy         | ~50  | system_dfl_wq    | 0.005                    |
|--------------+------+------------------+--------------------------|

I did some more testing: when the work is queued on a different workqueue, system_dfl_wq, the throughput recovers, as the results table shows.
Since that kworker is not bound to a specific CPU, the scheduler is free to select a more suitable CPU for it.

/* system_dfl_wq is unbound workqueue. Workers are not bound to
* any specific CPU, not concurrency managed, and all queued works are
* executed immediately as long as max_active limit is not reached and
* resources are available. */
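
For illustration, the difference between the per-CPU and unbound cases comes down to which workqueue the work item is queued on. The fragment below is a rough sketch, not the actual vhost/KVM call site: notify_fn, notify_work and the two kick helpers are hypothetical names, and it assumes a kernel that provides system_dfl_wq. It is a kernel fragment and will not build as a userspace program.

```c
#include <linux/workqueue.h>

/* Hypothetical handler: would notify the consumer (TaskB). */
static void notify_fn(struct work_struct *work)
{
	/* ... notify TaskB that new data is available ... */
}

static DECLARE_WORK(notify_work, notify_fn);

/* Per-CPU behaviour: the work runs on the CPU that queued it. */
static void kick_percpu(void)
{
	schedule_work(&notify_work);
}

/* Unbound alternative: the scheduler may place the kworker on any CPU. */
static void kick_unbound(void)
{
	queue_work(system_dfl_wq, &notify_work);
}
```

Whether an unbound workqueue is appropriate for a given call site depends on whether the work relies on running on the queueing CPU, which this sketch does not address.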

Given this understanding, what would be the best approach here? Should we consider changing the workqueue usage in the KVM code, or do you see an alternative way to address this issue?