Re: [RFC PATCH] sched/fair: Make tg->load_avg per node

From: Chen Yu
Date: Tue Apr 04 2023 - 11:39:11 EST


On 2023-04-04 at 23:15:40 +0800, Aaron Lu wrote:
> On Mon, Mar 27, 2023 at 01:39:55PM +0800, Aaron Lu wrote:
> [...]
> > Another observation about this workload: it has a lot of wakeup-time
> > task migrations, and that is why update_load_avg() and
> > update_cfs_group() show noticeable cost. When running this workload in
> > an N-instance setup where N >= 2 and sysbench's nr_threads is set to 1/N
> > of nr_cpu, task migrations at wakeup time are greatly reduced and the
> > overhead from the two functions mentioned above also drops a lot. It's
> > not yet clear to me why running multiple instances reduces task
> > migrations on the wakeup path.
>
> Regarding this observation, I have some findings. The TL;DR is: the
> 1-instance setup's overall CPU utilization is lower than that of the
> N >= 2 instance setups. As a result, under the 1-instance setup, sis()
> (select_idle_sibling()) is more likely to find idle CPUs, and that is
> why the 1-instance setup has more migrations.
>
> More details:
>
> For the 1-instance setup with nr_thread=nr_cpu=224, during a 5s window,
> there are 10 million calls of select_idle_sibling() and 6.1 million
> migrations. Of these migrations, 4.6 million come from select_idle_cpu()
> and 1.3 million come from recent_cpu.
> mpstat of this time window:
> Average:  NODE   %usr  %nice   %sys %iowait   %irq  %soft %steal %guest %gnice  %idle
> Average:   all  45.15   0.00  18.59    0.00   0.00  17.29   0.00   0.00   0.00  18.98
> Average:     0  38.14   0.00  17.29    0.00   0.00  14.77   0.00   0.00   0.00  29.80
> Average:     1  52.07   0.00  19.88    0.00   0.00  19.78   0.00   0.00   0.00   8.28
>
>
> For the 4-instance setup with nr_thread=56, during a 5s window, there
> are 15 million calls of select_idle_sibling() and only 30k migrations.
> select_idle_cpu() is called 15 million times but only 23k of those calls
> passed the sd_share->nr_idle_scan != 0 test.
> mpstat of this time window:
> Average:  NODE   %usr  %nice   %sys %iowait   %irq  %soft %steal %guest %gnice  %idle
> Average:   all  68.54   0.00  21.54    0.00   0.00   8.35   0.00   0.00   0.00   1.58
> Average:     0  70.05   0.00  20.92    0.00   0.00   8.17   0.00   0.00   0.00   0.87
> Average:     1  67.03   0.00  22.16    0.00   0.00   8.53   0.00   0.00   0.00   2.29
>
> For the 8-instance setup with nr_thread=28, during a 5s window, there
> are 16 million calls of select_idle_sibling() and 9.6k migrations.
> select_idle_cpu() is called 16 million times but none of those calls
> passed the sd_share->nr_idle_scan != 0 test.
> mpstat of this time window:
> Average:  NODE   %usr  %nice   %sys %iowait   %irq  %soft %steal %guest %gnice  %idle
> Average:   all  70.29   0.00  20.99    0.00   0.00   8.28   0.00   0.00   0.00   0.43
> Average:     0  71.58   0.00  19.98    0.00   0.00   8.04   0.00   0.00   0.00   0.40
> Average:     1  69.00   0.00  22.01    0.00   0.00   8.52   0.00   0.00   0.00   0.47
>
> On a side note: when sd_share->nr_idle_scan > 0 and has_idle_core is
> true, sd_share->nr_idle_scan is not actually respected. Is this intended?
> It seems to say: if there is an idle core, then let's try hard and ignore
> SIS_UTIL to find that idle core, right?
Yes, SIS_UTIL inherits the logic of SIS_PROP, which honors has_idle_core and
keeps scanning for the idle core at any cost, regardless of the scan limit.
Abel previously proposed a patch to make this more aggressive: with
has_idle_core set, SIS_UTIL would not take effect even when the system is
overloaded.
https://lore.kernel.org/lkml/20221019122859.18399-3-wuyun.abel@xxxxxxxxxxxxx/
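
To illustrate the behavior discussed above, here is a rough userspace model
of how the scan limit and has_idle_core interact, as I understand the current
code. It is only a sketch and not the kernel implementation:
max_cpus_scanned() and its llc_cpus parameter are made-up names for this
example, and the model only counts how many CPUs the scan may visit. It does
not model the proposed patch.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical model (not kernel code): upper bound on the number of
 * CPUs the idle scan may visit in one wakeup.
 *
 *   nr_idle_scan  - scan limit derived from LLC utilization (SIS_UTIL)
 *   has_idle_core - hint that an idle core may exist in the LLC
 *   llc_cpus      - number of candidate CPUs in the LLC domain
 */
static int max_cpus_scanned(int nr_idle_scan, bool has_idle_core, int llc_cpus)
{
	assert(nr_idle_scan >= 0 && llc_cpus >= 0);

	/* Overloaded LLC: the scan is skipped entirely, hint or not. */
	if (nr_idle_scan == 0)
		return 0;

	/*
	 * With the idle-core hint set, the limit is not decremented,
	 * so the scan may walk every candidate CPU.
	 */
	if (has_idle_core)
		return llc_cpus;

	/* Otherwise the scan stops once the limit is used up. */
	return nr_idle_scan < llc_cpus ? nr_idle_scan : llc_cpus;
}
```

In this model, max_cpus_scanned(4, true, 56) is 56 while
max_cpus_scanned(4, false, 56) is 4, which is the "limit not respected" case
from the side note. With nr_idle_scan == 0 the result is 0 in both cases,
matching the 8-instance numbers where no call passed the test.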

thanks,
Chenyu