Re: [RFC PATCH] sched/fair: Make tg->load_avg per node

From: Chen Yu
Date: Tue Apr 04 2023 - 04:27:46 EST


On 2023-03-27 at 13:39:55 +0800, Aaron Lu wrote:
> When using sysbench to benchmark Postgres in a single docker instance
> with sysbench's nr_threads set to nr_cpu, it is observed there are times
> update_cfs_group() and update_load_avg() shows noticeable overhead on
> cpus of one node of a 2sockets/112core/224cpu Intel Sapphire Rapids:
>
> 10.01% 9.86% [kernel.vmlinux] [k] update_cfs_group
> 7.84% 7.43% [kernel.vmlinux] [k] update_load_avg
>
> While cpus of the other node normally sees a lower cycle percent:
>
> 4.46% 4.36% [kernel.vmlinux] [k] update_cfs_group
> 4.02% 3.40% [kernel.vmlinux] [k] update_load_avg
>
> Annotate shows the cycles are mostly spent on accessing tg->load_avg
> with update_load_avg() being the write side and update_cfs_group() being
> the read side.
>
> The reason why only cpus of one node has bigger overhead is: task_group
> is allocated on demand from a slab and whichever cpu happens to do the
> allocation, the allocated tg will be located on that node and accessing
> to tg->load_avg will have a lower cost for cpus on the same node and
> a higher cost for cpus of the remote node.
>
> Tim Chen told me that PeterZ once mentioned a way to solve a similar
> problem by making a counter per node so do the same for tg->load_avg.
> After this change, the worst number I saw during a 5 minutes run from
> both nodes are:
>
> 2.77% 2.11% [kernel.vmlinux] [k] update_load_avg
> 2.72% 2.59% [kernel.vmlinux] [k] update_cfs_group
>
The same issue shows up when running netperf on this platform.
According to the perf profile:

11.90% 11.84% swapper [kernel.kallsyms] [k] update_cfs_group
9.79% 9.43% swapper [kernel.kallsyms] [k] update_load_avg

these two functions together account for more than 20% of the cycles.
The test was run as follows:

1. cpufreq governor set to performance, turbo disabled, C6 disabled
2. launch 224 instances of netperf, each started as (with -t being
   either UDP_RR or TCP_RR):
   netperf -4 -H 127.0.0.1 -t UDP_RR/TCP_RR -c -C -l 100 &
3. while they are running, collect the profile with:
   perf record -ag sleep 4
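
For anyone who wants to reproduce this without the full test suite, the
launch step could be sketched roughly as below. This is only an
illustration, not the script from the repository: the helper name
gen_netperf_cmds and its two arguments are made up here, and it assumes
netperf is installed with a netserver already listening on 127.0.0.1.
The helper only prints the command lines so they can be inspected first.

```shell
#!/bin/sh
# Sketch only: print the netperf command line for each instance.
# gen_netperf_cmds is a hypothetical helper, not part of schedtests.
#
# usage: gen_netperf_cmds <nr_instances> <test_type>
#   nr_instances: number of netperf instances (224 in the run above)
#   test_type:    UDP_RR or TCP_RR
gen_netperf_cmds() {
    nr_instances=$1
    test_type=$2
    i=0
    while [ "$i" -lt "$nr_instances" ]; do
        # Same options as in step 2, with one concrete test type.
        echo "netperf -4 -H 127.0.0.1 -t $test_type -c -C -l 100"
        i=$((i + 1))
    done
}
```

Each printed line can then be started in the background, for example
with "gen_netperf_cmds 224 TCP_RR | while read -r cmd; do $cmd & done",
and perf record -ag sleep 4 run while the instances are active.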

The test script can also be downloaded from
https://github.com/yu-chen-surf/schedtests.git


thanks,
Chenyu