Re: [RFC PATCH] sched/fair: Make tg->load_avg per node

From: Aaron Lu
Date: Tue Mar 28 2023 - 08:57:04 EST


Hi Dietmar,

Thanks for taking a look.

On Tue, Mar 28, 2023 at 02:09:39PM +0200, Dietmar Eggemann wrote:
> On 27/03/2023 07:39, Aaron Lu wrote:
> > When using sysbench to benchmark Postgres in a single docker instance
> > with sysbench's nr_threads set to nr_cpu, it is observed that at times
> > update_cfs_group() and update_load_avg() show noticeable overhead on
> > cpus of one node of a 2-socket/112-core/224-cpu Intel Sapphire Rapids:
> >
> > 10.01% 9.86% [kernel.vmlinux] [k] update_cfs_group
> > 7.84% 7.43% [kernel.vmlinux] [k] update_load_avg
> >
> > While cpus of the other node normally see a lower cycle percentage:
> >
> > 4.46% 4.36% [kernel.vmlinux] [k] update_cfs_group
> > 4.02% 3.40% [kernel.vmlinux] [k] update_load_avg
> >
> > perf annotate shows the cycles are mostly spent accessing tg->load_avg,
> > with update_load_avg() being the write side and update_cfs_group()
> > being the read side.
> >
> > The reason why only cpus of one node have bigger overhead is:
> > task_group is allocated on demand from a slab, and whichever cpu
> > happens to do the allocation, the allocated tg will be located on that
> > cpu's node; accessing tg->load_avg then has a lower cost for cpus on
> > the same node and a higher cost for cpus on the remote node.
> >
> > Tim Chen told me that PeterZ once mentioned a way to solve a similar
> > problem by making a counter per node, so do the same for tg->load_avg.
> > After this change, the worst numbers I saw during a 5 minute run on
> > both nodes are:
> >
> > 2.77% 2.11% [kernel.vmlinux] [k] update_load_avg
> > 2.72% 2.59% [kernel.vmlinux] [k] update_cfs_group
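
To make the idea above concrete, here is a rough sketch of what a
per-node counter can look like. This is illustrative only, not the
actual patch: the names node_load[], tg_add_load_avg() and
tg_read_load_avg() are made up and details such as how the per-node
array is allocated differ in the real change. Writers only touch the
slot of the node they run on and the read side sums all slots, so the
hot cacheline is no longer shared across nodes:

/*
 * Sketch only. Today every cpu touches one shared counter:
 *     atomic_long_add(delta, &tg->load_avg);        (write side)
 *     tg_weight = atomic_long_read(&tg->load_avg);  (read side)
 * so cpus on the remote node pay cross-node cacheline traffic for
 * both the update and the read.
 */
struct tg_node_load {
	atomic_long_t	load_avg;
} ____cacheline_aligned_in_smp;	/* avoid false sharing between slots */

struct task_group {
	/* ... existing fields ... */
	struct tg_node_load	node_load[MAX_NUMNODES];
};

/* write side, from update_load_avg() -> update_tg_load_avg(): only
 * the local node's slot is modified. */
static inline void tg_add_load_avg(struct task_group *tg, long delta)
{
	atomic_long_add(delta, &tg->node_load[numa_node_id()].load_avg);
}

/* read side, from update_cfs_group() -> calc_group_shares(): sum the
 * per-node slots to get the tg wide value. */
static inline long tg_read_load_avg(struct task_group *tg)
{
	long sum = 0;
	int node;

	for_each_node(node)
		sum += atomic_long_read(&tg->node_load[node].load_avg);

	return sum;
}

The read side has to walk nr_node slots now, but in this sketch that is
the trade made for not bouncing a single hot cacheline between sockets
on every update.
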
> >
> > Another observation of this workload: it has a lot of wakeup-time
> > task migrations, which is why update_load_avg() and update_cfs_group()
> > show noticeable cost. When running this workload in an N-instance
> > setup, where N >= 2 and sysbench's nr_threads is set to 1/N of nr_cpu,
> > wakeup-time task migrations are greatly reduced and the overhead of
> > the two functions mentioned above also drops a lot. It is not yet
> > clear to me why running multiple instances reduces task migrations on
> > the wakeup path.
> >
> > Reported-by: Nitin Tekchandani <nitin.tekchandani@xxxxxxxxx>
> > Signed-off-by: Aaron Lu <aaron.lu@xxxxxxxxx>
>
> I'm so far not seeing this issue on my Arm64 server.
>
> $ numactl -H
> available: 4 nodes (0-3)
> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
> node 1 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43
> 44 45 46 47
> node 2 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67
> 68 69 70 71
> node 3 cpus: 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91
> 92 93 94 95
> node distances:
> node 0 1 2 3
> 0: 10 12 20 22
> 1: 12 10 22 24
> 2: 20 22 10 12
> 3: 22 24 12 10
>
> sysbench --table-size=100000 --tables=24 --threads=96 ...
> /usr/share/sysbench/oltp_read_write.lua run
>
> perf report | grep kernel | head
>
> 9.12% sysbench [kernel.vmlinux] [k] _raw_spin_unlock_irqrestore
> 5.26% sysbench [kernel.vmlinux] [k] finish_task_switch
> 1.56% sysbench [kernel.vmlinux] [k] __do_softirq
> 1.22% sysbench [kernel.vmlinux] [k] arch_local_irq_restore
> 1.12% sysbench [kernel.vmlinux] [k] __arch_copy_to_user
> 1.12% sysbench [kernel.vmlinux] [k] el0_svc_common.constprop.1
> 0.95% sysbench [kernel.vmlinux] [k] __fget_light
> 0.94% sysbench [kernel.vmlinux] [k] rwsem_spin_on_owner
> 0.85% sysbench [kernel.vmlinux] [k] tcp_ack
> 0.56% sysbench [kernel.vmlinux] [k] do_sys_poll

Did you test with a v6.3-rc based kernel?
I encountered another problem on those kernels and had to temporarily
use a v6.2 based kernel; maybe you have to do the same:
https://lore.kernel.org/lkml/20230327080502.GA570847@ziqianlu-desk2/

>
> Is your postgres/sysbench running in a cgroup with cpu controller
> attached? Mine isn't.

Yes, I had postgres and sysbench running in the same cgroup with the
cpu controller enabled. Docker created the cgroup directory under
/sys/fs/cgroup/system.slice/docker-XXX and cgroup.controllers there
has cpu in it.

>
> Maybe I'm doing something else differently?

Maybe. You didn't mention how you started postgres; if you started it
from the same session as sysbench and autogroup is enabled, then all
those tasks would be in the same autogroup task group, which should
have the same effect as my setup.

Anyway, you can try the following steps to see if you can reproduce this
problem on your Arm64 server:

1 docker pull postgres
2 sudo docker run --rm --name postgres-instance -e POSTGRES_PASSWORD=mypass -e POSTGRES_USER=sbtest -d postgres -c shared_buffers=80MB -c max_connections=250
3 go inside the container:
  sudo docker exec -it $the_just_started_container_id bash
4 install sysbench inside the container:
  apt update && apt install sysbench
5 prepare
root@container:/# sysbench --db-driver=pgsql --pgsql-user=sbtest --pgsql_password=mypass --pgsql-db=sbtest --pgsql-port=5432 --tables=16 --table-size=10000 --threads=224 --time=60 --report-interval=2 /usr/share/sysbench/oltp_read_only.lua prepare
6 run
root@container:/# sysbench --db-driver=pgsql --pgsql-user=sbtest --pgsql_password=mypass --pgsql-db=sbtest --pgsql-port=5432 --tables=16 --table-size=10000 --threads=224 --time=60 --report-interval=2 /usr/share/sysbench/oltp_read_only.lua run

Note that I used 224 threads, where this problem is visible. I also
tried 96 threads, and then update_cfs_group() and update_load_avg()
cost about 1% of cycles.