Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()

From: Gautham R. Shenoy
Date: Fri Sep 15 2023 - 11:18:58 EST


Hello Chen Yu,

On Thu, Sep 14, 2023 at 08:09:26PM +0800, Chen Yu wrote:
[..snip..]

> >
> > So despite "reserving" the CPU for p1, which is likely to have its
> > data still hot in the case, we would have scheduled p1', thus
> > defeating the whole purpose of reservation.
> >
>
> I see. So you mean, although we reserve the CPU for the wakee,
> the wakee might not choose its previous CPU, which is against our
> goal.


Yes, but only because some other task could have run on the previous
CPU. That other task could be something that was woken up on that CPU
due to:

1) wake-affine choosing that CPU
2) newidle-balance pulling the other task on that CPU
3) !wake-affine && that CPU was also the other task's previous CPU

It could also be due to this wakee task being woken up on the waker
CPU due to wake-affine.

>
> The reason to prevent the wakee choosing its previous CPU could be:
> 1. wake_affine() choose the waker's CPU rather the wakee's previous CPU, or
> 2. the wakee's CPU has already been taken by someone else, via newidle_balance().
>


> For 1, I think Prateek has expressed the concern. One mitigation method could be
> that, we give penalty to that wakee, if it decides not to choose its previous CPU:

We would be penalizing the task for something that the scheduler
decides :-)

As you point out below, in the presence of the WF_SYNC flag,
wake_affine_idle() prefer the waker CPU over the previous CPU when
they are on different LLCs and when the waker is the only task.

This strategy makes sense for two reasons:

1) The wakee may be consuming the data produced by the waker.
2) Since the wakeup will happen on the local CPU, there is no risk of
task-stacking, exactly what your SIS_CURRENT patchset was
attempting.

But this strategy would also result in increased task-migration. Which
both Mattieu and you have found is not so beneficial for workloads
such as hackbench. Is it only because task's data is still hot in the
previous CPU's cache ? Or is there more to it ?


It would be good to confirm if this is why lower migration is better
for these kinds of workloads.

>
> "
> new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
> if (new_cpu != prev_cpu)
> p->burst_sleep_avg >>= 1;
> So the duration of reservation could be shrinked.
> "
>
> For 2, maybe inhit the newidle balance, something in my mind:
>
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -12022,6 +12022,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> u64 t0, t1, curr_cost = 0;
> struct sched_domain *sd;
> int pulled_task = 0;
> + bool cache_hot = false;
>
> update_misfit_status(NULL, this_rq);
>
> @@ -12055,8 +12056,19 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> rcu_read_lock();
> sd = rcu_dereference_check_sched_domain(this_rq->sd);
>
> + if (sched_feat(SIS_CACHE)) {
> + s64 delta = this_rq->cache_hot_timeout - sched_clock_cpu(this_cpu);
> +
> + /*
> + * If a short time later, a short sleeping task will be woken up
> + * on this idle CPU, do not launch the newidle balance.
> + */
> + if (delta > 0 && delta < this_rq->max_idle_balance_cost)
> + cache_hot = true;
> + }
> +
> if (!READ_ONCE(this_rq->rd->overload) ||
> - (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
> + (sd && this_rq->avg_idle < sd->max_newidle_lb_cost) || cache_hot) {

>
> if (sd)
> update_next_balance(sd, &next_balance);

If the benefit that the workload obtains is really due to the data
being hot near its previous CPU, then this seems a sensible thing to
do.

It would be good to confirm this. Let me get some IBS data for
hackbench which is the workload which likes a sticky wakeup.

--
Thanks and Regards
gautham.



>
>
> > To be honest, this isn't so bad, because we have been able to avoid a
> > migration in this case.
> >
> > >
> > > Or do you mean, in select_idle_cpu(), we will re-check p1's previous
> > > CPU but it is skipped due to cache-hot?
> >
> > I had originally thought about this, but then as you pointed out we
> > have an opportunity to pick the previous cpu in the early checks
> > inside select_idle_sibling().
> >
> > >
> > > > Have you considered recording p1's identity in the
> > > > rq->cache_hot_sleeper so that in select_task_rq_fair(), we can simply
> > > > return the previous CPU if it is cache hot and the wakee is
> > > > rq->cache_hot_sleeper, thus avoiding the whole select_idle_sibling
> > > > scan.
> > > >
> > >
> > > Yes this seems to be donable, and one problem would be, if there are
> > > more than 2 dequeued tasks prefer the same (previous) CPU, which task
> > > should be the rq->cache_hot_sleeper. And per Mathieu's feedback[1], we
> > > want to deal with multiple dequeued tasks. If we record all of them,
> > > it might be costly.
> >
> > If there are multiple dequeued tasks, then it doesn't make sense to
> > record the identity of the tasks. However, we need the bail out to be
> > much earlier, in select_task_rq_fair(), perhaps even before the
> > want_affine() checks.
> >
> > After all, if the previous CPU is idle, and its cache_hot_timeout
> > hasn't expired, and if the wakee's sleep duration is less than the
> > cache_hot_timeout, why don't we just pick it here and be done with it?
> >
>
> Yes we can return the previous CPU earlier, one concern is that, should
> we honor WF_SYNC flag or not, because in wake_affine_idle(), WF_SYNC
> seems to have a higher priority than available_idle_cpu(prev_cpu). Say,
> if the current CPU has 1 running task, and the previous CPU is idle,
> wake_affine_idle() still prefers the current CPU.
>
> thanks,
> Chenyu