Re: [RFC PATCH] sched/fair: Make tg->load_avg per node

From: Aaron Lu
Date: Tue Apr 04 2023 - 09:34:46 EST


On Tue, Apr 04, 2023 at 04:25:04PM +0800, Chen Yu wrote:
> On 2023-03-27 at 13:39:55 +0800, Aaron Lu wrote:
> > When using sysbench to benchmark Postgres in a single docker instance
> > with sysbench's nr_threads set to nr_cpu, it is observed that at times
> > update_cfs_group() and update_load_avg() show noticeable overhead on
> > cpus of one node of a 2sockets/112core/224cpu Intel Sapphire Rapids:
> >
> > 10.01% 9.86% [kernel.vmlinux] [k] update_cfs_group
> > 7.84% 7.43% [kernel.vmlinux] [k] update_load_avg
> >
> > While cpus of the other node normally see a lower cycle percent:
> >
> > 4.46% 4.36% [kernel.vmlinux] [k] update_cfs_group
> > 4.02% 3.40% [kernel.vmlinux] [k] update_load_avg
> >
> > Annotation shows the cycles are mostly spent on accessing tg->load_avg,
> > with update_load_avg() being the write side and update_cfs_group() being
> > the read side.
> >
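(For reference, the contended accesses are roughly the following, trimmed
from kernel/sched/fair.c; the surrounding logic is omitted:)

/* Write side, reached from update_load_avg(): each cfs_rq folds its
 * load_avg delta into the single shared atomic counter tg->load_avg. */
static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
{
	long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;

	if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
		atomic_long_add(delta, &cfs_rq->tg->load_avg);
		cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
	}
}

/* Read side, in calc_group_shares() called from update_cfs_group():
 * every cpu recomputing a group entity's shares reads that same counter. */
	tg_weight = atomic_long_read(&tg->load_avg);
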
> > The reason why only cpus of one node have the bigger overhead is:
> > task_group is allocated on demand from a slab, and whichever cpu happens
> > to do the allocation, the allocated tg will be located on that node;
> > accessing tg->load_avg then has a lower cost for cpus on the same node
> > and a higher cost for cpus on the remote node.
> >
> > Tim Chen told me that PeterZ once mentioned a way to solve a similar
> > problem by making a counter per node, so do the same for tg->load_avg.
> > After this change, the worst numbers I saw during a 5 minute run on
> > both nodes are:
> >
> > 2.77% 2.11% [kernel.vmlinux] [k] update_load_avg
> > 2.72% 2.59% [kernel.vmlinux] [k] update_cfs_group
> >
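(A minimal sketch of the per-node counter idea follows; the field and
helper names are illustrative, not necessarily what the patch uses:)

/* Illustrative only: shard the shared counter so that writers mostly
 * touch their own node's copy and readers sum the shards. */
struct task_group {
	/* ... */
	atomic_long_t	*node_load_avg;	/* one counter per NUMA node */
};

static inline void tg_add_load_avg(struct task_group *tg, long delta)
{
	/* writers update the counter of the node they are running on */
	atomic_long_add(delta, &tg->node_load_avg[numa_node_id()]);
}

static inline long tg_read_load_avg(struct task_group *tg)
{
	long sum = 0;
	int node;

	/* readers pay for a walk across nodes, but avoid the constant
	 * cross-node bouncing of a single hot cacheline */
	for_each_node(node)
		sum += atomic_long_read(&tg->node_load_avg[node]);

	return sum;
}

The per-node counters would also need to be placed on their own node and
padded to separate cachelines for this to help; that detail is left out
above.
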
> The same issue was found when running netperf on this platform.
> According to the perf profile:

Thanks for the info!

>
> 11.90% 11.84% swapper [kernel.kallsyms] [k] update_cfs_group
> 9.79% 9.43% swapper [kernel.kallsyms] [k] update_load_avg
>
> these two functions took quite some cycles.
>
> 1. cpufreq governor set to performance, turbo disabled, C6 disabled

I didn't change any of the above settings, then tried netperf as you
described below using UDP_RR, and the cycle percent of update_cfs_group()
is even worse on my SPR system:

v6.3-rc5:
update_cfs_group()%: 27.39% on node0, 31.18% on node1

wakeups[0]: 5623199
wakeups[1]: 7919937
migrations[0]: 3871773
migrations[1]: 5606894

v6.3-rc5 + this_patch:
update_cfs_group()%: 24.12% on node0, 26.15% on node1

wakeups[0]: 13575203
wakeups[1]: 10749893
migrations[0]: 9153060
migrations[1]: 7508095

This patch helps a little bit, but not much. Will take a closer look.

> 2. launches 224 instances of netperf, and each instance is:
> netperf -4 -H 127.0.0.1 -t UDP_RR/TCP_RR -c -C -l 100 &
> 3. perf record -ag sleep 4
>
> Also, the test script can be downloaded from
> https://github.com/yu-chen-surf/schedtests.git

Thanks,
Aaron