Re: [RFC PATCH v5] sched/fair: select idle cpu from idle cpumask for task wakeup

From: Vincent Guittot
Date: Thu Nov 26 2020 - 04:49:43 EST


On Thu, 26 Nov 2020 at 10:35, Li, Aubrey <aubrey.li@xxxxxxxxxxxxxxx> wrote:
>
> On 2020/11/26 16:14, Vincent Guittot wrote:
> > On Wed, 25 Nov 2020 at 14:37, Li, Aubrey <aubrey.li@xxxxxxxxxxxxxxx> wrote:
> >>
> >> On 2020/11/25 16:31, Vincent Guittot wrote:
> >>> On Wed, 25 Nov 2020 at 03:03, Li, Aubrey <aubrey.li@xxxxxxxxxxxxxxx> wrote:
> >>>>
> >>>> On 2020/11/25 1:01, Vincent Guittot wrote:
> >>>>> Hi Aubrey,
> >>>>>
> >>>>> Le mardi 24 nov. 2020 à 15:01:38 (+0800), Li, Aubrey a écrit :
> >>>>>> Hi Vincent,
> >>>>>>
> >>>>>> On 2020/11/23 17:27, Vincent Guittot wrote:
> >>>>>>> Hi Aubrey,
> >>>>>>>
> >>>>>>> On Thu, 19 Nov 2020 at 13:15, Aubrey Li <aubrey.li@xxxxxxxxxxxxxxx> wrote:
> >>>>>>>>
> >>>>>>>> Add idle cpumask to track idle cpus in sched domain. When a CPU
> >>>>>>>> enters idle, if the idle driver indicates to stop tick, this CPU
> >>>>>>>> is set in the idle cpumask to be a wakeup target. And if the CPU
> >>>>>>>> is not in idle, the CPU is cleared in idle cpumask during scheduler
> >>>>>>>> tick to ratelimit idle cpumask update.
> >>>>>>>>
> >>>>>>>> When a task wakes up to select an idle cpu, scanning idle cpumask
> >>>>>>>> has low cost than scanning all the cpus in last level cache domain,
> >>>>>>>> especially when the system is heavily loaded.
> >>>>>>>>
> >>>>>>>> Benchmarks were tested on a x86 4 socket system with 24 cores per
> >>>>>>>> socket and 2 hyperthreads per core, total 192 CPUs. Hackbench and
> >>>>>>>> schbench have no notable change, uperf has:
> >>>>>>>>
> >>>>>>>> uperf throughput: netperf workload, tcp_nodelay, r/w size = 90
> >>>>>>>>
> >>>>>>>> threads baseline-avg %std patch-avg %std
> >>>>>>>> 96 1 0.83 1.23 3.27
> >>>>>>>> 144 1 1.03 1.67 2.67
> >>>>>>>> 192 1 0.69 1.81 3.59
> >>>>>>>> 240 1 2.84 1.51 2.67
> >>>>>>>>
> >>>>>>>> v4->v5:
> >>>>>>>> - add update_idle_cpumask for s2idle case
> >>>>>>>> - keep the same ordering of tick_nohz_idle_stop_tick() and update_
> >>>>>>>> idle_cpumask() everywhere
> >>>>>>>>
> >>>>>>>> v3->v4:
> >>>>>>>> - change setting idle cpumask from every idle entry to tickless idle
> >>>>>>>> if cpu driver is available.
> >>>>>>>
> >>>>>>> Could you remind me why you did this change ? Clearing the cpumask is
> >>>>>>> done during the tick to rate limit the number of updates of the
> >>>>>>> cpumask but It's not clear for me why you have associated the set with
> >>>>>>> the tick stop condition too.
> >>>>>>
> >>>>>> I found the current implementation has better performance at a more
> >>>>>> suitable load range.
> >>>>>>
> >>>>>> The two kinds of implementions(v4 and v5) have the same rate(scheduler
> >>>>>> tick) to shrink idle cpumask when the system is busy, but
> >>>>>
> >>>>> I'm ok with the part above
> >>>>>
> >>>>>>
> >>>>>> - Setting the idle mask everytime the cpu enters idle requires a much
> >>>>>> heavier load level to preserve the idle cpumask(not call into idle),
> >>>>>> otherwise the bits cleared in scheduler tick will be restored when the
> >>>>>> cpu enters idle. That is, idle cpumask is almost equal to the domain
> >>>>>> cpumask during task wakeup if the system load is not heavy enough.
> >>>>>
> >>>>> But setting the idle cpumask is useful because it helps to select an idle
> >>>>> cpu at wake up instead of waiting ifor ILB to fill the empty CPU. IMO,
> >>>>> the idle cpu mask is useful in heavy cases because a system, which is
> >>>>> already fully busy with work, doesn't want to waste time looking for an
> >>>>> idle cpu that doesn't exist.
> >>>>
> >>>> Yes, this is what v3 does.
> >>>>
> >>>>> But if there is an idle cpu, we should still looks for it.
> >>>>
> >>>> IMHO, this is a potential opportunity can be improved. The idle cpu could be
> >>>> in different idle state, the idle duration could be long or could be very short.
> >>>> For example, if there are two idle cpus:
> >>>>
> >>>> - CPU1 is very busy, the pattern is 50us idle and 950us work.
> >>>> - CPU2 is in idle for a tick length and wake up to do the regular work
> >>>>
> >>>> If both added to the idle cpumask, we want the latter one, or we can just add
> >>>> the later one into the idle cpumask. That's why I want to associate tick stop
> >>>> signal with it.
> >>>>
> >>>>>
> >>>>>>
> >>>>>>
> >>>>>> - Associating with tick stop tolerates idle to preserve the idle cpumask
> >>>>>> but only short idle, which causes tick retains. This is more fitable for
> >>>>>> the real workload.
> >>>>>
> >>>>> I don't agree with this and real use cases with interaction will probably
> >>>>> not agree as well as they want to run on an idle cpu if any but not wait
> >>>>> on an already busy one.
> >>>>
> >>>> The problem is scan overhead, scanning idle cpu need time. If an idle cpu
> >>>> is in the short idle mode, it's very likely that when it's picked up for a
> >>>> wakeup task, it goes back to work again, and the wakeup task has to wait too,
> >>>> maybe longer because the running task just starts.
> >>>>
> >>>> One benefit of waiting on the previous one is warm cache.
> >>>>
> >>>>> Also keep in mind that a tick can be up to 10ms long
> >>>>
> >>>> Right, but the point here is, if this 10ms tick retains, the CPU should be
> >>>> in the short idle mode.
> >>>
> >>> But 10, 4 or even 1ms is quite long for a system and that's even more
> >>> true compared to scanning the idle cpu mask
> >>>
> >>>>
> >>>>>
> >>>>>>
> >>>>>>>
> >>>>>>> This change means that a cpu will not be part of the idle mask if the
> >>>>>>> tick is not stopped. On some arm/arm64 platforms, the tick stops only
> >>>>>>> if the idle duration is expected to be higher than 1-2ms which starts
> >>>>>>> to be significantly long. Also, the cpuidle governor can easily
> >>>>>>> mis-predict a short idle duration whereas it will be finally a long
> >>>>>>> idle duration; In this case, the next tick will correct the situation
> >>>>>>> and select a deeper state, but this can happen up to 4ms later on
> >>>>>>> arm/arm64.
> >>>>>>
> >>>>>> Yes this is intented. If the tick is not stopped, that indicates the
> >>>>>> CPU is very busy, cpu idle governor selected the polling idle state, and/or
> >>>>>> the expected idle duration is shorter than the tick period length. For
> >>>>>
> >>>>> As mentioned above a tick can be up to 10ms long which is not a short idle
> >>>>> duration.
> >>>>
> >>>> Usually when the tick retains, the CPU is in the short idle mode or even polling
> >>>> instead of idle.
> >>>
> >>> Also keep in mind that cpuidle can select a shallow state and retains
> >>> tick because of the wake up latency constraint and not the idle
> >>> duration. So you can't really make the assumption that retaining tick
> >>> means short idle duration
> >>>
> >> idle governor has short idle information, probably can let idle governor
> >> expose a short idle indicator?
> >>
> >>>>
> >>>>>
> >>>>> Then the governor also mispredicts the idle duration and this is one
> >>>>> reason that the tick is not stopped because it will give the opportunity
> >>>>> to reevaluate the idle state in case of misprediction.
> >>>>>
> >>>> We always predict the next state based on the past states, so misprediction
> >>>> does happen. This is not what this patch is trying to solve. I'm certainly
> >>>
> >>> My point here was to say that one original goal of cpuidle for
> >>> retaining the tick was to handle case where the governor mispredicts a
> >>> short idle time. Retaining the tick prevents the cpu to stay too long
> >>> in this shallow idle state and to waste power which seems to happen
> >>> often enough to be raised by people
> >>
> >> I see, thanks!
> >>
> >>>
> >>>> open if there is a better signal instead of stop_tick from idle governor.
> >>>>
> >>>>
> >>>>>> example, uperf enters and exits 80 times between two ticks when utilizes
> >>>>>> 100% CPU, and the average idle residency < 50us.
> >>>>>
> >>>>> But scheduler looks for idle state of prev cpu before looping the idle cpu
> >>>>> mask so i'm not sure that uperf is impacted in this case because scheduler
> >>>>> will select prev cpu before loop idle cpu mask.
> >>>>>
> >>>>>>
> >>>>>> If this CPU is added to idle cpumask, the wakeup task likely needs to
> >>>>>> wait in the runqueue as this CPU will run its current task very soon.
> >>>>>>
> >>>>>>>
> >>>>>>> So I would prefer to keep trying to set the idle mask everytime the
> >>>>>>> cpu enters idle. If a tick has not happened between 2 idle phases, the
> >>>>>>> cpumask will not be updated and the overhead will be mostly testing if
> >>>>>>> (rq->last_idle_state == idle_state).
> >>>>>>
> >>>>>> Not sure if I addressed your concern, did you see any workloads any cases
> >>>>>> v4 performs better than v5?
> >>>>>
> >>>>> Yes, I see some perf regression on my octo arm64 system for hackbench with
> >>>>> only 1 group (and for few ther ones but it's less obvious). There is no
> >>>>> perf impact with more groups most probably because the cpus are no more idle.
> >>>>>
> >>>>> The regression happens even if the shallowest idle state is the only one to
> >>>>> be enabled.
> >>>>
> >>>> Thanks for the data.
> >>>>
> >>>>>
> >>>>> - 2 x 4 cores arm64 system
> >>>>>
> >>>>> 12 iterations of hackbench -l (256000/#grp) -g #grp
> >>>>>
> >>>>> Only the shallowest state enabled
> >>>>
> >>>>> (as a sidenote, we don't have polling mode on arm64)
> >>>> Okay, this might be the cause of the difference between yours and mine. So do you
> >>>> think if it makes sense to let idle governor to return a polling flag and associate
> >>>> it with idle cpumask update instead of stop_tick? A CPU is idle but actually polling
> >>>> may not be suitable for the wake up target.
> >>>
> >>> I don't know much about polling but can't this mode be used up to a tick too ?
> >>> I think so. So short idle need a definition. I'm not sure if it's a good idea to define
> >> the short idle as a tunable and default set it to tick >> 2?
> >
> > I have never been fond of heuristic like tick << 2 or sys tunable
> >
> > TBH, I'm not sure that using the tick is a good idea. And such kind of
> > parameter need more thought
> >
> >>
> >> Updating idle cpumask everytime cpu enters idle works for me, as we have state change
> >> check, so we won't actually update idle cpumask everytime the cpu enters idle.
> >
> > Yes, In this case, the overhead stays reasonable and is limited to the
> > test of a rq->last_idle_state
> > This will benefit heavy use a case by reducing the scanning time and
> > will not regress other use case.
> >
> >>
> >> But I'm still willing to exclude short idle case, what do you think?
> >
> > something similar to patch v3 or patch v5 + my changes seems to be a
> > good 1st step that will benefit heavy use cases with regressing other
> > ones.
> >
> > Trying to exclude short idle case will need more thoughts and changes
> > especially about to how to get this information and if it is reliable.
> >
>
> okay, I'll post a v6 with v5 + your change below after data measurement.
> May I add you a signed-off-by to the patch?

my signed-off is not needed. My changes should be considered as
comments of your patch but sometimes a code example is easier than a
long comment

>
> Thanks,
> -Aubrey
>
> ---
> kernel/sched/idle.c | 4 +---
> 1 file changed, 1 insertion(+), 3 deletions(-)
>
> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index a38d8822ce0d..ca32197778b0 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -156,6 +156,7 @@ static void cpuidle_idle_call(void)
> return;
> }
>
> + update_idle_cpumask(this_rq(), true);
> /*
> * The RCU framework needs to be told that we are entering an idle
> * section, so no more rcu read side critical sections and one more
> @@ -163,7 +164,6 @@ static void cpuidle_idle_call(void)
> */
>
> if (cpuidle_not_available(drv, dev)) {
> - update_idle_cpumask(this_rq(), true);
> tick_nohz_idle_stop_tick();
>
> default_idle_call();
> @@ -194,7 +194,6 @@ static void cpuidle_idle_call(void)
> max_latency_ns = dev->forced_idle_latency_limit_ns;
> }
>
> - update_idle_cpumask(this_rq(), true);
> tick_nohz_idle_stop_tick();
>
> next_state = cpuidle_find_deepest_state(drv, dev, max_latency_ns);
> @@ -208,7 +207,6 @@ static void cpuidle_idle_call(void)
> next_state = cpuidle_select(drv, dev, &stop_tick);
>
> if (stop_tick || tick_nohz_tick_stopped()) {
> - update_idle_cpumask(this_rq(), true);
> tick_nohz_idle_stop_tick();
> } else {
> tick_nohz_idle_retain_tick();
> --
> 2.17.1
>