Re: [RFC PATCH] sched/fair: Make tg->load_avg per node

From: Aaron Lu
Date: Wed Mar 29 2023 - 09:55:45 EST


On Wed, Mar 29, 2023 at 02:36:44PM +0200, Dietmar Eggemann wrote:
> On 28/03/2023 14:56, Aaron Lu wrote:
> > Hi Dietmar,
> >
> > Thanks for taking a look.
> >
> > On Tue, Mar 28, 2023 at 02:09:39PM +0200, Dietmar Eggemann wrote:
> >> On 27/03/2023 07:39, Aaron Lu wrote:
>
> [...]
>
> > Did you test with a v6.3-rc based kernel?
> > I encountered another problem on those kernels and had to temporarily use
> > a v6.2 based kernel, maybe you have to do the same:
> > https://lore.kernel.org/lkml/20230327080502.GA570847@ziqianlu-desk2/
>
> No, I'm also on v6.2.
>
> >> Is your postgres/sysbench running in a cgroup with cpu controller
> >> attached? Mine isn't.
> >
> > Yes, I had postgres and sysbench running in the same cgroup with cpu
> > controller enabled. docker created the cgroup directory under
> > /sys/fs/cgroup/system.slice/docker-XXX and cgroup.controllers has cpu
> > there.
>
> I'm running postgresql service directly on the machine. I boot now with
> 'cgroup_no_v1=all systemd.unified_cgroup_hierarchy=1' so I can add the
> cpu controller to:
>
> system.slice/system-postgresql.slice/postgresql@11-main.service
>
> where the 96 postgres threads run and to
>
> user.slice/user-1005.slice/session-4.scope
>
> where the 96 sysbench threads run.
>
> Checked with systemd-cgls and `cat /sys/kernel/debug/sched/debug` that
> those threads are really running there.
>
> Still not seeing `update_load_avg` or `update_cfs_group` in perf report,
> only some very low values for `update_blocked_averages`.
>
> Also added CFS BW throttling to both cgroups. No change.
>
> Then I moved session-4.scope's shell into `postgresql@11-main.service`
> so that `postgres` and `sysbench` threads run in the same cgroup.
>
> Didn't change much.
>
> >> Maybe I'm doing something else differently?
> >
> > Maybe. You didn't mention how you started postgres; if you start it from
> > the same session as sysbench and autogroup is enabled, then all those
> > tasks would be in the same autogroup taskgroup, which should have the
> > same effect as my setup.
>
> This should be now close to my setup running `postgres` and `sysbench`
> in `postgresql@11-main.service`.

Yes.

>
> > Anyway, you can try the following steps to see if you can reproduce this
> > problem on your Arm64 server:
> >
> > 1 docker pull postgres
> > 2 sudo docker run --rm --name postgres-instance -e POSTGRES_PASSWORD=mypass -e POSTGRES_USER=sbtest -d postgres -c shared_buffers=80MB -c max_connections=250
> > 3 go inside the container
> > sudo docker exec -it $the_just_started_container_id bash
> > 4 install sysbench inside container
> > apt update and apt install sysbench
> > 5 prepare
> > root@container:/# sysbench --db-driver=pgsql --pgsql-user=sbtest --pgsql_password=mypass --pgsql-db=sbtest --pgsql-port=5432 --tables=16 --table-size=10000 --threads=224 --time=60 --report-interval=2 /usr/share/sysbench/oltp_read_only.lua prepare
> > 6 run
> > root@container:/# sysbench --db-driver=pgsql --pgsql-user=sbtest --pgsql_password=mypass --pgsql-db=sbtest --pgsql-port=5432 --tables=16 --table-size=10000 --threads=224 --time=60 --report-interval=2 /usr/share/sysbench/oltp_read_only.lua run
>
> I would have to find time to learn how to set up docker on my machine
> ... But I use very similar values for the setup and sysbench test.

Agreed. Docker just made running this workload easier; since you already
grouped all the tasks in the same taskgroup, there is no need to mess
with docker.

>
> > Note that I used 224 threads where this problem is visible. I also tried
> > 96 and update_cfs_group() and update_load_avg() cost about 1% cycles then.
>
> True, I was hoping to see at least the 1% ;-)

One more question: when you do 'perf report', did you use
--sort=dso,symbol to aggregate different call paths to the same symbol?
Maybe you have already done this; I just want to confirm :-)
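
For reference, what I have in mind is something like the following,
where perf.data is just whatever perf record produced:

  # perf report --sort=dso,symbol -i perf.data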

And I'm not sure if you profiled the nodes separately? I normally choose
4 cpus on each node and do 'perf record -C' with them, to get an idea
of how the different nodes behave and also to reduce the record size.
Normally, when the tg is allocated on node 0, node 1's profile shows
higher cycles for update_cfs_group() and update_load_avg().

Another thing worth mentioning about this workload: it has a lot of
wakeups and migrations during roughly the initial 2 minutes, and the
many migrations are the reason for the increased cost of
update_cfs_group() and update_load_avg(). On my side, with sysbench's
nr_thread=224, the wakeup and migration numbers during a 5s window are
(recorded about 1 minute after the workload is started):
@migrations[1]: 1821379
@migrations[0]: 4482989
@wakeups[1]: 3036473
@wakeups[0]: 6504496

The above numbers are derived from the below bpftrace script:

kretfunc:select_task_rq_fair
{
	/* count every wakeup handled on this node */
	@wakeups[numaid] = count();
	/* chosen cpu differs from the task's previous cpu: a migration */
	if (args->p->thread_info.cpu != retval) {
		@migrations[numaid] = count();
	}
}

interval:s:5
{
	exit();
}
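
In case you want to try it, saving the above as e.g. migrations.bt (the
file name is arbitrary) and running it as root should work:

  # bpftrace migrations.bt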

And during this time window, node 1's profile shows update_cfs_group()
taking 12.45% of cycles and update_load_avg() taking 7.99%.

I guess your setup may have a much lower migration number?