Re: [RFC][PATCH 08/10] sched/fair: Implement delayed dequeue

From: Chen Yu
Date: Sat Apr 06 2024 - 05:25:01 EST


On 2024-04-05 at 12:28:02 +0200, Peter Zijlstra wrote:
> Extend / fix 86bfbb7ce4f6 ("sched/fair: Add lag based placement") by
> noting that lag is fundamentally a temporal measure. It should not be
> carried around indefinitely.
>
> OTOH it should also not be instantly discarded, doing so will allow a
> task to game the system by purposefully (micro) sleeping at the end of
> its time quantum.
>
> Since lag is intimately tied to the virtual time base, a wall-time
> based decay is also insufficient, notably competition is required for
> any of this to make sense.
>
> Instead, delay the dequeue and keep the 'tasks' on the runqueue,
> competing until they are eligible.
>
> Strictly speaking, we only care about keeping them until the 0-lag
> point, but that is a difficult proposition, instead carry them around
> until they get picked again, and dequeue them at that point.
>
> Since we should have dequeued them at the 0-lag point, truncate lag
> (eg. don't let them earn positive lag).
>
> XXX test the cfs-throttle stuff
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
> ---

I tested schbench on a Xeon server with 240 CPUs across 2 sockets:
schbench -m 2 -r 100
The results look OK to me.

baseline:
NO_DELAY_DEQUEUE
NO_DELAY_ZERO
Wakeup Latencies percentiles (usec) runtime 100 (s) (1658446 total samples)
50.0th: 5 (361126 samples)
90.0th: 11 (654121 samples)
* 99.0th: 25 (123032 samples)
99.9th: 673 (13845 samples)
min=1, max=8337
Request Latencies percentiles (usec) runtime 100 (s) (1662381 total samples)
50.0th: 14992 (524771 samples)
90.0th: 15344 (657370 samples)
* 99.0th: 15568 (129769 samples)
99.9th: 15888 (10017 samples)
min=3529, max=43841
RPS percentiles (requests) runtime 100 (s) (101 total samples)
20.0th: 16544 (37 samples)
* 50.0th: 16608 (30 samples)
90.0th: 16736 (31 samples)
min=16403, max=17698
average rps: 16623.81


DELAY_DEQUEUE
DELAY_ZERO
Wakeup Latencies percentiles (usec) runtime 100 (s) (1668161 total samples)
50.0th: 6 (394867 samples)
90.0th: 12 (653021 samples)
* 99.0th: 31 (142636 samples)
99.9th: 755 (14547 samples)
min=1, max=5226
Request Latencies percentiles (usec) runtime 100 (s) (1671859 total samples)
50.0th: 14384 (511809 samples)
90.0th: 14992 (653508 samples)
* 99.0th: 15408 (149257 samples)
99.9th: 15984 (12090 samples)
min=3546, max=38360
RPS percentiles (requests) runtime 100 (s) (101 total samples)
20.0th: 16672 (45 samples)
* 50.0th: 16736 (52 samples)
90.0th: 16736 (0 samples)
min=16629, max=16800
average rps: 16718.59


The 99th percentile wakeup latency increases a little (25 -> 31 us), which
should be within the acceptable range. Meanwhile the throughput increases
slightly (average rps 16623.81 -> 16718.59). Here are the possible reasons I
can think of:

1. wakeup latency: Finding an eligible entity in the tree during wakeup
might take longer if there are more delayed-dequeue tasks in the tree.
2. throughput: Inhibiting task dequeue reduces how often the task group's
load_avg is touched via dequeue_entity() -> { update_load_avg(), update_cfs_group() },
which reduces cache contention among CPUs and improves throughput.
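
To put numbers on the comparison above, here is a small userspace sketch that
computes the relative change between the two runs. The helper names
(pct_change, print_deltas) are made up for illustration; the input values are
the ones from the schbench output quoted in this mail.

```c
#include <stdio.h>

/* Relative change from base to val, in percent. */
static double pct_change(double base, double val)
{
	return (val - base) / base * 100.0;
}

/* Print the deltas for the figures quoted from the two schbench runs. */
static void print_deltas(void)
{
	/* baseline (NO_DELAY_DEQUEUE) vs. DELAY_DEQUEUE */
	printf("99th wakeup latency:  %+.2f%%\n", pct_change(25.0, 31.0));
	printf("99th request latency: %+.2f%%\n", pct_change(15568.0, 15408.0));
	printf("average rps:          %+.2f%%\n", pct_change(16623.81, 16718.59));
}
```

Calling print_deltas() from a main() gives roughly +24% for the 99th wakeup
latency, about -1% for the 99th request latency, and about +0.6% for the
average rps. This is a single run per configuration, so the smaller deltas may
be within noise.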


> + } else {
> + bool sleep = flags & DEQUEUE_SLEEP;
> +
> + SCHED_WARN_ON(sleep && se->sched_delayed);
> + update_curr(cfs_rq);
> +
> + if (sched_feat(DELAY_DEQUEUE) && sleep &&
> + !entity_eligible(cfs_rq, se)) {

Regarding the eligible check, it was found that there could be an overflow
issue which causes a false negative from entity_eligible(). It was described here:
https://lore.kernel.org/lkml/20240226082349.302363-1-yu.c.chen@xxxxxxxxx/
and was also reported on another machine:
https://lore.kernel.org/lkml/ZeCo7STWxq+oyN2U@xxxxxxxxx/
I don't have a good idea of how to avoid that overflow properly. While I'm
trying to reproduce it locally, do you have any guidance on how to address it?
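
To make the question concrete, below is a rough userspace model of the kind of
overflow I mean. It is not kernel code: struct rq_model, eligible_naive() and
eligible_checked() are hypothetical names, the curr entity's contribution is
ignored, and avg_load is assumed to be positive. The idea is that eligibility
boils down to avg_vruntime >= key * avg_load, and the s64 product can wrap.
The checked variant uses the compiler's __builtin_mul_overflow() to detect the
wrap and falls back to the sign of the key.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace model of the per-runqueue sums; not kernel code. */
struct rq_model {
	int64_t  avg_vruntime;	/* sum of (v_i - min_vruntime) * w_i */
	int64_t  avg_load;	/* sum of w_i, assumed > 0 here */
	uint64_t min_vruntime;
};

/* Wrapping multiply, mimicking what happens when the s64 product overflows. */
static bool eligible_naive(const struct rq_model *rq, uint64_t vruntime)
{
	int64_t key = (int64_t)(vruntime - rq->min_vruntime);
	int64_t prod = (int64_t)((uint64_t)key * (uint64_t)rq->avg_load);

	return rq->avg_vruntime >= prod;
}

/* Same comparison, but detect the overflow instead of trusting the result. */
static bool eligible_checked(const struct rq_model *rq, uint64_t vruntime)
{
	int64_t key = (int64_t)(vruntime - rq->min_vruntime);
	int64_t prod;

	if (__builtin_mul_overflow(key, rq->avg_load, &prod)) {
		/*
		 * The true product lies outside the s64 range. With
		 * avg_load > 0 it is hugely negative when key < 0 (so the
		 * entity is eligible) and hugely positive otherwise.
		 */
		return key < 0;
	}

	return rq->avg_vruntime >= prod;
}
```

With, say, min_vruntime = 2^41, vruntime = 2^40, avg_load = 2^30 + 1 and
avg_vruntime = -2^50, the wrapped product makes eligible_naive() report the
entity as not eligible, while eligible_checked() reports it as eligible. I'm
not sure a check like this is acceptable in the fast path, it is only meant to
illustrate the false negative.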

thanks,
Chenyu