RE: [PATCH 7/9] sched/fair:

From: Song Bao Hua (Barry Song)
Date: Mon Aug 02 2021 - 06:45:01 EST




> -----Original Message-----
> From: Mel Gorman [mailto:mgorman@xxxxxxxxxxxxxxxxxxx]
> Sent: Monday, July 26, 2021 10:23 PM
> To: LKML <linux-kernel@xxxxxxxxxxxxxxx>
> Cc: Ingo Molnar <mingo@xxxxxxxxxx>; Peter Zijlstra <peterz@xxxxxxxxxxxxx>;
> Vincent Guittot <vincent.guittot@xxxxxxxxxx>; Valentin Schneider
> <valentin.schneider@xxxxxxx>; Aubrey Li <aubrey.li@xxxxxxxxxxxxxxx>; Mel
> Gorman <mgorman@xxxxxxxxxxxxxxxxxxx>
> Subject: [PATCH 7/9] sched/fair:
>
> When scanning for a single CPU, the scan is limited based on the estimated
> average idle time for a domain to reduce the risk that more time is spent
> scanning for idle CPUs than we are idle for.
>
> With SMT, if an idle core is expected to exist there is no scan depth
> limits so the scan depth may or may not be related to average idle time.
> Unfortunately has_idle_cores can be very inaccurate when workloads are
> rapidly entering/exiting idle (e.g. hackbench).
>
> As the scan depth is now proportional to cores and not CPUs, enforce
> SIS_PROP for idle core scans.
>
> The performance impact of this is variable and is neither a universal
> gain nor loss. In some cases, has_idle_cores will be cleared prematurely
> because the whole domain was not scanned but has_idle_cores is already
> known to be an inaccurate heuristic. There is also additional cost because
> time calculations are made even for an idle core scan and the delta is
> calculated for both scan successes and failures. Finally, SMT siblings
> may be used prematurely due to scan depth limitations.
>
> On the flip side, scan depth is now consistent for both core and smt
> scans. The reduction in scan depth improves performance in some cases
> and wakeup latency is reduced in some cases.
>
> There were few changes identified in the SIS statistics but notably,
> "SIS Core Hit" was slightly reduced in tbench as thread counts increased,
> presumably due to the core search depth being throttled.
>
> Signed-off-by: Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx>
> ---
> kernel/sched/fair.c | 33 +++++++++++++++++++--------------
> 1 file changed, 19 insertions(+), 14 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 20b9255ebf97..b180205e6b25 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6232,7 +6232,7 @@ static int select_idle_cpu(struct task_struct *p, struct
> sched_domain *sd, bool
>
> cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
>
> - if (sched_feat(SIS_PROP) && !has_idle_core) {
> + if (sched_feat(SIS_PROP)) {
> u64 avg_cost, avg_idle, span_avg;
> unsigned long now = jiffies;
>
> @@ -6265,30 +6265,35 @@ static int select_idle_cpu(struct task_struct *p, struct
> sched_domain *sd, bool
> if (has_idle_core) {
> i = select_idle_core(p, cpu, cpus, &idle_cpu);
> if ((unsigned int)i < nr_cpumask_bits)
> - return i;
> + break;
>
> + nr -= sched_smt_weight;
> } else {
> - if (!--nr)
> - return -1;
> idle_cpu = __select_idle_cpu(cpu, p);
> if ((unsigned int)idle_cpu < nr_cpumask_bits)
> break;
> + nr--;
> }
> +
> + if (nr < 0)
> + break;
> }
>
> - if (has_idle_core)
> - set_idle_cores(target, false);
> + if ((unsigned int)idle_cpu < nr_cpumask_bits) {
> + if (has_idle_core)
> + set_idle_cores(target, false);

For example, if we have 16 cpus(8 SMT2 cores). In case core7 is idle,
we only have scanned core0+core1(cpu0-cpu3) and if these two cores
are not idle, but here we set has_idle_cores to false while core7 is
idle. It seems incorrect.

>
> - if (sched_feat(SIS_PROP) && !has_idle_core) {
> - time = cpu_clock(this) - time;
> + if (sched_feat(SIS_PROP)) {
> + time = cpu_clock(this) - time;
>
> - /*
> - * Account for the scan cost of wakeups against the average
> - * idle time.
> - */
> - this_rq->wake_avg_idle -= min(this_rq->wake_avg_idle, time);
> + /*
> + * Account for the scan cost of wakeups against the average
> + * idle time.
> + */
> + this_rq->wake_avg_idle -= min(this_rq->wake_avg_idle, time);
>
> - update_avg(&this_sd->avg_scan_cost, time);
> + update_avg(&this_sd->avg_scan_cost, time);
> + }
> }
>
> return idle_cpu;
> --
> 2.26.2

Thanks
Barry