Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()

From: Chen Yu
Date: Thu Sep 14 2023 - 08:10:02 EST


Hi Gautham,

On 2023-09-14 at 12:36:33 +0530, Gautham R. Shenoy wrote:
> Hello Chen Yu,
>
> On Wed, Sep 13, 2023 at 03:25:14PM +0800, Chen Yu wrote:
> > Hi Gautham,
> >
> > thanks for the review,
> >
> > On 2023-09-13 at 11:52:14 +0530, Gautham R. Shenoy wrote:
> > > On Mon, Sep 11, 2023 at 04:40:02PM +0800, Chen Yu wrote:
> > > > Hi Aaron,
> > > >
> > > > thanks for the review,
> > > >
> > > > On 2023-09-11 at 15:26:29 +0800, Aaron Lu wrote:
> > >
> > > [..snip..]
> > >
> > > > > > @@ -6982,8 +6997,13 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
> > > > > > static inline int __select_idle_cpu(int cpu, struct task_struct *p)
> > > > > > {
> > > > > > if ((available_idle_cpu(cpu) || sched_idle_cpu(cpu)) &&
> > > > > > - sched_cpu_cookie_match(cpu_rq(cpu), p))
> > > > > > + sched_cpu_cookie_match(cpu_rq(cpu), p)) {
> > > > > > + if (sched_feat(SIS_CACHE) &&
> > > > > > + sched_clock_cpu(cpu) < cpu_rq(cpu)->cache_hot_timeout)
> > > > > > + return -1;
> > > > > > +
> > > > >
> > > > > Maybe introduce a new function that also considers rq->cache_hot_timeout,
> > > > > like available_idle_cpu_migrate() so that above and below logic can be
> > > > > simplified a bit?
> > > > >
> > > >
> > > > Yes, that would be simpler, I'll do in next version.
> > > >
> > > > > I was thinking to simply add that rq->cache_hot_timeout check to
> > > > > available_idle_cpu() but then a long sleeping task could be forced to
> > > > > migrate if its prev_cpu happens to just deschedule a task that sets rq's
> > > > > cache_hot_timeout. I guess that's why you chose to only change the idle
> > > > > semantic in select_idle_cpu() but not in select_idle_sibling()?
> > > > >
> > > >
> > > > Yes, sort of. And the reason I did not put this cache hot check in available_idle_cpu()
> > > > or idle_cpu() was mainly because these APIs are generic and could be invoked by select_idle_sibling().
> > > > If the task fall asleep and woken up quickly, its previous idle CPU will also be skipped,
> > > > thus no one could use this CPU within the cache hot period, including the cache-hot task
> > > > itself.
> > >
> > > This happens even with this patch right? It is possible for a task p1
> > > whose avg sleep time is "t" to go to sleep which causes its CPU to go
> > > idle. When it wakes up after a time t' < t, the logic above skips the
> > > idle CPU because it is still cache-hot, despite the fact that it is
> > > cache hot for p1!
> > >
> > Not sure if I understand correctly, in select_idle_sibling() p1's previous
> > CPU will be checked first, and that check does not involve cache-hot. So if
> > p1's previous CPU is idle, it will be picked, no?
> >
> > if (prev != target && cpus_share_cache(prev, target) &&
> > (available_idle_cpu(prev) || sched_idle_cpu(prev)) &&
> > asym_fits_cpu(task_util, util_min, util_max, prev))
> > return prev;
>
>
> You are right, but the "if" condition is the one prior to "if (prev !=
> target ...)" which causes p1 can pick the previous CPU here. However,
> note that it is not just p1 which can pick the previous CPU as
> intended by your patch.
>
> Consider the following case:
>
> * p1' ran on CPUX, and went to sleep. We now update the
> cache_hot_timeout for CPUX based on the sleep-avg of p1'
>
> * When the CPU goes idle, during CPU_NEWLY_IDLE load balancing, (or
> due to shared_runq search), the CPU could pull p1 from another CPU
> Y.

Agree, newidle balance could scribble the cache-hot CPU.

>
> * p1 now runs on CPUX, and goes to sleep.
>
> * We update the cache_hot_timeout for CPUX based on the sleep-avg of p1.
>
> * p1' wakesup and picks prev as the target after doing wake_affine()
> check in select_task_rq_fair().
>
> * In select_idle_sibling() it encounters the following, which checks
> out and thus returns from target.
>
> if ((available_idle_cpu(target) || sched_idle_cpu(target)) &&
> asym_fits_cpu(task_util, util_min, util_max, target))
> return target;
>
> * p1 now wakes up, picks the previous cpu as the target. However,
> available_idle_cpu() is no longer true.
>
> So despite "reserving" the CPU for p1, which is likely to have its
> data still hot in the case, we would have scheduled p1', thus
> defeating the whole purpose of reservation.
>

I see. So you mean, although we reserve the CPU for the wakee,
the wakee might not choose its previous CPU, which is against our
goal.

The reason to prevent the wakee choosing its previous CPU could be:
1. wake_affine() choose the waker's CPU rather the wakee's previous CPU, or
2. the wakee's CPU has already been taken by someone else, via newidle_balance().

For 1, I think Prateek has expressed the concern. One mitigation method could be
that, we give penalty to that wakee, if it decides not to choose its previous CPU:

"
new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
if (new_cpu != prev_cpu)
p->burst_sleep_avg >>= 1;
So the duration of reservation could be shrinked.
"

For 2, maybe inhit the newidle balance, something in my mind:


--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12022,6 +12022,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
u64 t0, t1, curr_cost = 0;
struct sched_domain *sd;
int pulled_task = 0;
+ bool cache_hot = false;

update_misfit_status(NULL, this_rq);

@@ -12055,8 +12056,19 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
rcu_read_lock();
sd = rcu_dereference_check_sched_domain(this_rq->sd);

+ if (sched_feat(SIS_CACHE)) {
+ s64 delta = this_rq->cache_hot_timeout - sched_clock_cpu(this_cpu);
+
+ /*
+ * If a short time later, a short sleeping task will be woken up
+ * on this idle CPU, do not launch the newidle balance.
+ */
+ if (delta > 0 && delta < this_rq->max_idle_balance_cost)
+ cache_hot = true;
+ }
+
if (!READ_ONCE(this_rq->rd->overload) ||
- (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
+ (sd && this_rq->avg_idle < sd->max_newidle_lb_cost) || cache_hot) {

if (sd)
update_next_balance(sd, &next_balance);


> To be honest, this isn't so bad, because we have been able to avoid a
> migration in this case.
>
> >
> > Or do you mean, in select_idle_cpu(), we will re-check p1's previous
> > CPU but it is skipped due to cache-hot?
>
> I had originally thought about this, but then as you pointed out we
> have an opportunity to pick the previous cpu in the early checks
> inside select_idle_sibling().
>
> >
> > > Have you considered recording p1's identity in the
> > > rq->cache_hot_sleeper so that in select_task_rq_fair(), we can simply
> > > return the previous CPU if it is cache hot and the wakee is
> > > rq->cache_hot_sleeper, thus avoiding the whole select_idle_sibling
> > > scan.
> > >
> >
> > Yes this seems to be donable, and one problem would be, if there are
> > more than 2 dequeued tasks prefer the same (previous) CPU, which task
> > should be the rq->cache_hot_sleeper. And per Mathieu's feedback[1], we
> > want to deal with multiple dequeued tasks. If we record all of them,
> > it might be costly.
>
> If there are multiple dequeued tasks, then it doesn't make sense to
> record the identity of the tasks. However, we need the bail out to be
> much earlier, in select_task_rq_fair(), perhaps even before the
> want_affine() checks.
>
> After all, if the previous CPU is idle, and its cache_hot_timeout
> hasn't expired, and if the wakee's sleep duration is less than the
> cache_hot_timeout, why don't we just pick it here and be done with it?
>

Yes we can return the previous CPU earlier, one concern is that, should
we honor WF_SYNC flag or not, because in wake_affine_idle(), WF_SYNC
seems to have a higher priority than available_idle_cpu(prev_cpu). Say,
if the current CPU has 1 running task, and the previous CPU is idle,
wake_affine_idle() still prefers the current CPU.

thanks,
Chenyu