Re: [PATCH V2] sched/fair: Dequeue sched_delayed tasks when waking to a busy CPU

From: K Prateek Nayak
Date: Tue Nov 26 2024 - 00:33:12 EST


Hello Mike,

On 11/23/2024 2:14 PM, Mike Galbraith wrote:
[..snip..]

Question: did wiping off the evil leave any meaningful goodness behind?

---

sched/fair: Dequeue sched_delayed tasks when waking to a busy CPU

Phil Auld (Redhat) reported an fio benchmark regression having been found
to have been caused by addition of the DELAY_DEQUEUE feature, suggested it
may be related to wakees losing the ability to migrate, and confirmed that
restoration of same indeed did restore previous performance.

V2: do not rip buddies apart, convenient on/off switch

Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue")
Signed-off-by: Mike Galbraith <efault@xxxxxx>
---
kernel/sched/core.c | 51 ++++++++++++++++++++++++++++++------------------
kernel/sched/features.h | 5 ++++
kernel/sched/sched.h | 5 ++++
3 files changed, 42 insertions(+), 19 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3783,28 +3783,41 @@ ttwu_do_activate(struct rq *rq, struct t
*/
static int ttwu_runnable(struct task_struct *p, int wake_flags)
{
- struct rq_flags rf;
- struct rq *rq;
- int ret = 0;
-
- rq = __task_rq_lock(p, &rf);
- if (task_on_rq_queued(p)) {
- update_rq_clock(rq);
- if (p->se.sched_delayed)
- enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
- if (!task_on_cpu(rq, p)) {
- /*
- * When on_rq && !on_cpu the task is preempted, see if
- * it should preempt the task that is current now.
- */
- wakeup_preempt(rq, p, wake_flags);
+ CLASS(__task_rq_lock, rq_guard)(p);
+ struct rq *rq = rq_guard.rq;
+
+ if (!task_on_rq_queued(p))
+ return 0;
+
+ update_rq_clock(rq);
+ if (p->se.sched_delayed) {
+ int queue_flags = ENQUEUE_DELAYED | ENQUEUE_NOCLOCK;
+ int dequeue = sched_feat(DEQUEUE_DELAYED);
+
+ /*
+ * Since sched_delayed means we cannot be current anywhere,
+ * dequeue it here and have it fall through to the
+ * select_task_rq() case further along in ttwu() path.
+ * Note: Do not rip buddies apart else chaos follows.
+ */
+ if (dequeue && rq->nr_running > 1 && p->nr_cpus_allowed > 1 &&

Do we really care if DEQUEUE_DELAYED is enabled / disabled here when we
encounter a delayed task?

+ !(rq->curr->last_wakee == p || p->last_wakee == rq->curr)) {

Technically, we are still looking at the last wakeup here since
record_wakee() is only called later. If we care about 1:1 buddies,
should we just see "current == p->last_wakee", otherwise, there is a
good chance "p" has a m:n waker-wakee relationship in which case
perhaps a "want_affine" like heuristic can help?

For science, I was wondering if the decision to dequeue + migrate or
requeue the delayed task can be put off until after the whole
select_task_rq() target selection (note: without the h_nr_delayed
stuff, some of that wake_affine_idle() logic falls apart). Hackbench
(which saw some regression with EEVDF Complete) seem to like it
somewhat, but it still falls behind NO_DELAY_DEQUEUE.

==================================================================
Test : hackbench
Units : Normalized time in seconds
Interpretation: Lower is better
Statistic : AMean
==================================================================
NO_DELAY_DEQUEUE Mike's v2 Full ttwu + requeue/migrate
5.76 5.72 ( 1% ) 5.82 ( -1% )
6.53 6.56 ( 0% ) 6.65 ( -2% )
6.79 7.04 ( -4% ) 7.02 ( -3% )
6.91 7.04 ( -2% ) 7.03 ( -2% )
7.63 8.05 ( -6% ) 7.88 ( -3% )

Only subtle changes in IBS profiles; there aren't any obvious shift
in hotspots with hackbench at least. Not sure if it is just the act of
needing to do a dequeue + enqueue from the wakeup context that adds to
the overall regression.

--
Thanks and Regards,
Prateek

+ dequeue_task(rq, p, DEQUEUE_SLEEP | queue_flags);
+ return 0;
}
- ttwu_do_wakeup(p);
- ret = 1;
+
+ enqueue_task(rq, p, queue_flags);
+ }
+ if (!task_on_cpu(rq, p)) {
+ /*
+ * When on_rq && !on_cpu the task is preempted, see if
+ * it should preempt the task that is current now.
+ */
+ wakeup_preempt(rq, p, wake_flags);
}
- __task_rq_unlock(rq, &rf);
+ ttwu_do_wakeup(p);

- return ret;
+ return 1;
}

#ifdef CONFIG_SMP
[..snip..]