Re: [RFC PATCH] sched/fair: Make tg->load_avg per node

From: Aaron Lu
Date: Tue Apr 04 2023 - 11:16:07 EST


On Mon, Mar 27, 2023 at 01:39:55PM +0800, Aaron Lu wrote:
[...]
> Another observation of this workload is: it has a lot of wakeup time
> task migrations and that is the reason why update_load_avg() and
> update_cfs_group() shows noticeable cost. Running this workload in N
> instances setup where N >= 2 with sysbench's nr_threads set to 1/N nr_cpu,
> task migrations on wake up time are greatly reduced and the overhead from
> the two above mentioned functions also dropped a lot. It's not clear to
> me why running in multiple instances can reduce task migrations on
> wakeup path yet.

I have some findings on this observation. TL;DR: overall CPU utilization
in the 1-instance setup is lower than in the N >= 2 instance setups. As a
result, in the 1-instance setup sis() (select_idle_sibling()) is more
likely to find idle CPUs, and that is why the 1-instance setup has more
migrations.

More details:

For the 1-instance setup with nr_threads=nr_cpu=224, during a 5s window
there were 10 million calls to select_idle_sibling() and 6.1 million
migrations. Of these migrations, 4.6 million came from select_idle_cpu()
and 1.3 million came from recent_cpu.
mpstat output for this time window:
Average:  NODE    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:   all   45.15    0.00   18.59    0.00    0.00   17.29    0.00    0.00    0.00   18.98
Average:     0   38.14    0.00   17.29    0.00    0.00   14.77    0.00    0.00    0.00   29.80
Average:     1   52.07    0.00   19.88    0.00    0.00   19.78    0.00    0.00    0.00    8.28


For the 4-instance setup with nr_threads=56, during a 5s window there
were 15 million calls to select_idle_sibling() and only 30k migrations.
select_idle_cpu() was called 15 million times, but only 23k of those calls
passed the sd_share->nr_idle_scan != 0 test.
mpstat output for this time window:
Average:  NODE    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:   all   68.54    0.00   21.54    0.00    0.00    8.35    0.00    0.00    0.00    1.58
Average:     0   70.05    0.00   20.92    0.00    0.00    8.17    0.00    0.00    0.00    0.87
Average:     1   67.03    0.00   22.16    0.00    0.00    8.53    0.00    0.00    0.00    2.29
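
For context on why the nr_idle_scan != 0 test fails so often once the
machine is nearly saturated, here is a rough userspace model of how
SIS_UTIL derives nr_idle_scan from LLC utilization, based on my reading
of update_idle_cpu_scan(). The helper name, the imbalance_pct of 117 and
the 112-CPU LLC used in the note below are assumptions for illustration,
and mpstat's busy percentage is only a loose proxy for the util_avg sum
the kernel actually uses.

```c
#include <assert.h>

#define CAPACITY_SCALE 1024ULL	/* stands in for SCHED_CAPACITY_SCALE */

/*
 * Hypothetical helper, not a kernel function: models how the scan
 * budget shrinks as average LLC utilization rises.
 *   sum_util:   summed utilization of all CPUs in the LLC (0..1024 each)
 *   llc_weight: number of CPUs in the LLC
 *   pct:        imbalance_pct of the LLC domain (117 assumed)
 */
static unsigned int model_nr_idle_scan(unsigned long long sum_util,
				       unsigned int llc_weight,
				       unsigned int pct)
{
	unsigned long long x, tmp, y;

	assert(llc_weight > 0);

	x = sum_util / llc_weight;		/* average util per CPU */
	tmp = x * x * pct * pct / (10000ULL * CAPACITY_SCALE);
	if (tmp > CAPACITY_SCALE)		/* clamp so y cannot underflow */
		tmp = CAPACITY_SCALE;
	y = CAPACITY_SCALE - tmp;		/* remaining headroom */

	return (unsigned int)(y * llc_weight / CAPACITY_SCALE);
}
```

Under these assumptions, model_nr_idle_scan(0, 112, 117) returns the full
budget of 112, while a fully utilized LLC (sum_util = 112 * 1024) returns
0, which is the case where select_idle_cpu() bails out without scanning.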

For the 8-instance setup with nr_threads=28, during a 5s window there
were 16 million calls to select_idle_sibling() and 9.6k migrations.
select_idle_cpu() was called 16 million times, but none of those calls
passed the sd_share->nr_idle_scan != 0 test.
mpstat output for this time window:
Average:  NODE    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:   all   70.29    0.00   20.99    0.00    0.00    8.28    0.00    0.00    0.00    0.43
Average:     0   71.58    0.00   19.98    0.00    0.00    8.04    0.00    0.00    0.00    0.40
Average:     1   69.00    0.00   22.01    0.00    0.00    8.52    0.00    0.00    0.00    0.47
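
To put the three runs side by side, the migration counts can be
normalized by the number of select_idle_sibling() calls. A small sketch
(the helper name is hypothetical; the inputs are the counts reported
above):

```c
#include <assert.h>

/*
 * Hypothetical helper: migrations per one million
 * select_idle_sibling() calls, using integer arithmetic.
 */
static unsigned long long migrations_per_million_calls(
		unsigned long long migrations, unsigned long long calls)
{
	assert(calls > 0);

	return migrations * 1000000ULL / calls;
}
```

With the numbers above this gives 610,000 for the 1-instance run
(6.1 million / 10 million), 2,000 for the 4-instance run (30k / 15
million) and 600 for the 8-instance run (9.6k / 16 million).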

On a side note: when sd_share->nr_idle_scan > 0 and has_idle_core is true,
sd_share->nr_idle_scan is not actually respected. Is this intended?
It seems to say: if there is an idle core, try hard to find it and ignore
the SIS_UTIL scan limit, right?
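
To make the question concrete, here is a toy userspace model of the scan
loop in select_idle_cpu() as I understand it. It is not the kernel code:
the function name and the two bool arrays are made up for illustration,
and details such as cpumask handling and SMT sibling checks are left out.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model, not kernel code. cpu_idle[i] marks an idle CPU and
 * core_idle[i] marks a CPU whose whole core is idle. nr_idle_scan
 * plays the role of sd_share->nr_idle_scan.
 * Returns the chosen CPU, or -1 if none is found.
 */
static int model_select_idle_cpu(const bool *cpu_idle, const bool *core_idle,
				 int nr_cpus, int target,
				 bool has_idle_core, int nr_idle_scan)
{
	int nr = nr_idle_scan + 1;	/* "!--nr" below stops the scan */
	int idle_cpu = -1;

	assert(nr_cpus > 0 && target >= 0 && target < nr_cpus);

	if (nr == 1)			/* nr_idle_scan == 0: no scan at all */
		return -1;

	for (int n = 1; n <= nr_cpus; n++) {
		int cpu = (target + n) % nr_cpus;	/* wrap from target + 1 */

		if (has_idle_core) {
			/* the budget is never decremented on this path */
			if (core_idle[cpu])
				return cpu;
			if (idle_cpu < 0 && cpu_idle[cpu])
				idle_cpu = cpu;	/* remember a fallback */
		} else {
			if (!--nr)
				return -1;
			if (cpu_idle[cpu])
				return cpu;
		}
	}

	return idle_cpu;
}
```

In this model, with 8 CPUs, target 0, nr_idle_scan = 2 and the only idle
core at CPU 6, the has_idle_core = true case walks past the budget and
returns 6, while the has_idle_core = false case gives up and returns -1.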