Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()

From: Gautham R. Shenoy
Date: Thu Sep 14 2023 - 03:07:07 EST


Hello Chen Yu,

On Wed, Sep 13, 2023 at 03:25:14PM +0800, Chen Yu wrote:
> Hi Gautham,
>
> thanks for the review,
>
> On 2023-09-13 at 11:52:14 +0530, Gautham R. Shenoy wrote:
> > On Mon, Sep 11, 2023 at 04:40:02PM +0800, Chen Yu wrote:
> > > Hi Aaron,
> > >
> > > thanks for the review,
> > >
> > > On 2023-09-11 at 15:26:29 +0800, Aaron Lu wrote:
> >
> > [..snip..]
> >
> > > > > @@ -6982,8 +6997,13 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
> > > > > static inline int __select_idle_cpu(int cpu, struct task_struct *p)
> > > > > {
> > > > > if ((available_idle_cpu(cpu) || sched_idle_cpu(cpu)) &&
> > > > > - sched_cpu_cookie_match(cpu_rq(cpu), p))
> > > > > + sched_cpu_cookie_match(cpu_rq(cpu), p)) {
> > > > > + if (sched_feat(SIS_CACHE) &&
> > > > > + sched_clock_cpu(cpu) < cpu_rq(cpu)->cache_hot_timeout)
> > > > > + return -1;
> > > > > +
> > > >
> > > > Maybe introduce a new function that also considers rq->cache_hot_timeout,
> > > > like available_idle_cpu_migrate() so that above and below logic can be
> > > > simplified a bit?
> > > >
> > >
> > > Yes, that would be simpler, I'll do in next version.
> > >
> > > > I was thinking to simply add that rq->cache_hot_timeout check to
> > > > available_idle_cpu() but then a long sleeping task could be forced to
> > > > migrate if its prev_cpu happens to just deschedule a task that sets rq's
> > > > cache_hot_timeout. I guess that's why you chose to only change the idle
> > > > semantic in select_idle_cpu() but not in select_idle_sibling()?
> > > >
> > >
> > > Yes, sort of. And the reason I did not put this cache hot check in available_idle_cpu()
> > > or idle_cpu() was mainly because these APIs are generic and could be invoked by select_idle_sibling().
> > > If the task fall asleep and woken up quickly, its previous idle CPU will also be skipped,
> > > thus no one could use this CPU within the cache hot period, including the cache-hot task
> > > itself.
> >
> > This happens even with this patch right? It is possible for a task p1
> > whose avg sleep time is "t" to go to sleep which causes its CPU to go
> > idle. When it wakes up after a time t' < t, the logic above skips the
> > idle CPU because it is still cache-hot, despite the fact that it is
> > cache hot for p1!
> >
> Not sure if I understand correctly, in select_idle_sibling() p1's previous
> CPU will be checked first, and that check does not involve cache-hot. So if
> p1's previous CPU is idle, it will be picked, no?
>
> if (prev != target && cpus_share_cache(prev, target) &&
> (available_idle_cpu(prev) || sched_idle_cpu(prev)) &&
> asym_fits_cpu(task_util, util_min, util_max, prev))
> return prev;


You are right, but the "if" condition is the one prior to "if (prev !=
target ...)" which causes p1 can pick the previous CPU here. However,
note that it is not just p1 which can pick the previous CPU as
intended by your patch.

Consider the following case:

* p1' ran on CPUX, and went to sleep. We now update the
cache_hot_timeout for CPUX based on the sleep-avg of p1'

* When the CPU goes idle, during CPU_NEWLY_IDLE load balancing, (or
due to shared_runq search), the CPU could pull p1 from another CPU
Y.

* p1 now runs on CPUX, and goes to sleep.

* We update the cache_hot_timeout for CPUX based on the sleep-avg of p1.

* p1' wakesup and picks prev as the target after doing wake_affine()
check in select_task_rq_fair().

* In select_idle_sibling() it encounters the following, which checks
out and thus returns from target.

if ((available_idle_cpu(target) || sched_idle_cpu(target)) &&
asym_fits_cpu(task_util, util_min, util_max, target))
return target;

* p1 now wakes up, picks the previous cpu as the target. However,
available_idle_cpu() is no longer true.

So despite "reserving" the CPU for p1, which is likely to have its
data still hot in the case, we would have scheduled p1', thus
defeating the whole purpose of reservation.

To be honest, this isn't so bad, because we have been able to avoid a
migration in this case.

>
> Or do you mean, in select_idle_cpu(), we will re-check p1's previous
> CPU but it is skipped due to cache-hot?

I had originally thought about this, but then as you pointed out we
have an opportunity to pick the previous cpu in the early checks
inside select_idle_sibling().

>
> > Have you considered recording p1's identity in the
> > rq->cache_hot_sleeper so that in select_task_rq_fair(), we can simply
> > return the previous CPU if it is cache hot and the wakee is
> > rq->cache_hot_sleeper, thus avoiding the whole select_idle_sibling
> > scan.
> >
>
> Yes this seems to be donable, and one problem would be, if there are
> more than 2 dequeued tasks prefer the same (previous) CPU, which task
> should be the rq->cache_hot_sleeper. And per Mathieu's feedback[1], we
> want to deal with multiple dequeued tasks. If we record all of them,
> it might be costly.

If there are multiple dequeued tasks, then it doesn't make sense to
record the identity of the tasks. However, we need the bail out to be
much earlier, in select_task_rq_fair(), perhaps even before the
want_affine() checks.

After all, if the previous CPU is idle, and its cache_hot_timeout
hasn't expired, and if the wakee's sleep duration is less than the
cache_hot_timeout, why don't we just pick it here and be done with it?

>
> [1] https://lore.kernel.org/lkml/2a47ae82-b8cd-95db-9f48-82b3df0730f3@xxxxxxxxxxxx/
>
> thanks,
> Chenyu

--
Thanks and Regards
gautham.