Re: [RFC PATCH] sched/fair: Make tg->load_avg per node

From: Daniel Jordan
Date: Thu Mar 30 2023 - 13:46:44 EST


Hi Aaron,

On Wed, Mar 29, 2023 at 09:54:55PM +0800, Aaron Lu wrote:
> On Wed, Mar 29, 2023 at 02:36:44PM +0200, Dietmar Eggemann wrote:
> > On 28/03/2023 14:56, Aaron Lu wrote:
> > > On Tue, Mar 28, 2023 at 02:09:39PM +0200, Dietmar Eggemann wrote:
> > >> On 27/03/2023 07:39, Aaron Lu wrote:
> And not sure if you did the profile on different nodes? I normally choose
> 4 cpus on each node and do 'perf record -C' with them, to get an idea
> of how the different nodes behave and also to reduce the record size.
> Normally, when the tg is allocated on node 0, node 1's profile would
> show higher cycles for update_cfs_group() and update_load_avg().

Wouldn't the choice of CPUs have a big effect on the data, depending on
where sysbench or postgres tasks run?
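For concreteness, the per-node sampling described above might look roughly like the sketch below. The CPU lists are made up; on a real machine they'd be picked from /sys/devices/system/node/node*/cpulist.

```shell
# Sketch: sample 4 CPUs from each node instead of the whole system,
# to compare per-node behavior and keep perf.data small.
# CPU numbers below are illustrative only.
perf record -C 0-3   -o perf.node0.data -- sleep 5   # 4 CPUs on node 0
perf record -C 64-67 -o perf.node1.data -- sleep 5   # 4 CPUs on node 1
perf report -i perf.node1.data --sort=dso,symbol
```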

> I guess your setup may have a much lower migration number?

I also tried this and sure enough didn't see as many migrations on
either of the two systems I used. I ran a container following your
steps on a plain 6.2 kernel, with the cpu controller enabled (weight
only). I increased connections and buffer size to suit each machine,
and took Chen's suggestion to try without numa balancing.

AMD EPYC 7J13 64-Core Processor
2 sockets * 64 cores * 2 threads = 256 CPUs

sysbench: nr_threads=256

All observability data was taken at the one-minute mark, using one tool
at a time.

@migrations[1]: 1113
@migrations[0]: 6152
@wakeups[1]: 8871744
@wakeups[0]: 9773321
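(For what it's worth, counts like these can be gathered with something along the lines of the bpftrace sketch below; I'm not claiming this is the exact script used. bpftrace has no cpu-to-node helper, so this keys by CPU, and the per-node totals would be summed afterwards from /sys/devices/system/node/node*/cpulist.)

```shell
# Hypothetical sketch: count migrations and wakeups keyed by the
# destination/target CPU; aggregate per-CPU counts into per-node
# totals offline using the node cpulists.
bpftrace -e '
  tracepoint:sched:sched_migrate_task { @migrations[args->dest_cpu] = count(); }
  tracepoint:sched:sched_wakeup       { @wakeups[args->target_cpu]  = count(); }
  interval:s:60 { exit(); }
'
```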

# profiled the whole system for 5 seconds, reported w/ --sort=dso,symbol
0.38% update_load_avg
0.13% update_cfs_group

Using higher (nr_threads=380) and lower (nr_threads=128) load doesn't
change these numbers much.
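The whole-system profile above was taken with something like:

```shell
# Sketch: profile all CPUs for 5 seconds, then report grouped by
# dso and symbol (how the percentages quoted above were produced).
perf record -a -- sleep 5
perf report --sort=dso,symbol
```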

The topology of my machine is different from yours, but it's the biggest
I have, and I'm assuming cpu count is more important than topology when
reproducing the remote accesses. I also tried on

Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz
2 sockets * 32 cores * 2 threads = 128 CPUs

with nr_threads=128 and got similar results.

I'm guessing you've left all sched knobs alone? Maybe sharing those and
the kconfig would help close the gap. Migrations do increase to near
what you were seeing when I disable SIS_UTIL (with SIS_PROP already off)
on the Xeon, and I see 4-5% apiece for the functions you mention when
profiling, but turning SIS_UTIL off is an odd thing to do.