Re: [RFC PATCH] sched/fair: Make tg->load_avg per node

From: Daniel Jordan
Date: Wed Apr 05 2023 - 17:05:00 EST


On Fri, Mar 31, 2023 at 12:06:09PM +0800, Aaron Lu wrote:
> Hi Daniel,
>
> Thanks for taking a look.
>
> On Thu, Mar 30, 2023 at 03:51:57PM -0400, Daniel Jordan wrote:
> > On Thu, Mar 30, 2023 at 01:46:02PM -0400, Daniel Jordan wrote:
> > > Hi Aaron,
> > >
> > > On Wed, Mar 29, 2023 at 09:54:55PM +0800, Aaron Lu wrote:
> > > > On Wed, Mar 29, 2023 at 02:36:44PM +0200, Dietmar Eggemann wrote:
> > > > > On 28/03/2023 14:56, Aaron Lu wrote:
> > > > > > On Tue, Mar 28, 2023 at 02:09:39PM +0200, Dietmar Eggemann wrote:
> > > > > >> On 27/03/2023 07:39, Aaron Lu wrote:
> > > > And not sure if you did the profile on different nodes? I normally chose
> > > > 4 cpus of each node and do 'perf record -C' with them, to get an idea
> > > > of how different node behaves and also to reduce the record size.
> > > > Normally, when tg is allocated on node 0, then node 1's profile would
> > > > show higher cycles for update_cfs_group() and update_load_avg().
> > >
> > > Wouldn't the choice of CPUs have a big effect on the data, depending on
> > > where sysbench or postgres tasks run?
> >
> > Oh, probably not with NCPU threads though, since the load would be
> > pretty even, so I think I see where you're coming from.
>
> Yes I expect the load to be pretty even within the same node so didn't
> do the full cpu record. I used to only record a single cpu on each node
> to get a fast report time but settled on using 4 due to being paranoid :-)

Mhm :-) My 4-cpu profiles do look about the same as my all-system one.
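
In case it helps anyone reproduce the per-node sampling, here is a rough
sketch of building the 'perf record -C' list from the first few cpus of
each node. It assumes the usual /sys/devices/system/node/node*/cpulist
layout. The helper names are made up for illustration, and the perf
invocation in the last comment is only an example, not the exact command
either of us ran.

```shell
#!/bin/sh
# Expand a sysfs cpulist such as "0-3,8" into one cpu id per line.
expand_cpulist() {
    echo "$1" | tr ',' '\n' | while IFS=- read -r lo hi; do
        # A single cpu ("8") has no upper bound, so fall back to lo.
        seq "$lo" "${hi:-$lo}"
    done
}

# First N cpus of a cpulist, comma-separated as 'perf record -C' expects.
first_n_cpus() {
    expand_cpulist "$1" | awk -v n="$2" 'NR <= n' | paste -sd, -
}

# Hypothetical helper: N cpus (default 4) from every NUMA node found in sysfs.
perf_cpu_arg() {
    n=${1:-4}
    for f in /sys/devices/system/node/node[0-9]*/cpulist; do
        if [ -r "$f" ]; then
            first_n_cpus "$(cat "$f")" "$n"
        fi
    done | paste -sd, -
}

# Example use (not run here):
#   perf record -C "$(perf_cpu_arg 4)" -g -- sleep 10
```

Taking the first cpus of each node is arbitrary; with NCPU threads the
load should be even enough within a node that the choice matters little.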

> I vaguely remember that AMD machines have a smaller LLC, with not many
> cpus sharing each LLC, 8-16?

Yep, 16 cpus per LLC, and each LLC is 32M.

> I tend to think the number of cpus per LLC plays a role here since that's
> the domain where an idle cpu is searched for at task wakeup time.

That's true, I hadn't thought of that.
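
For comparing LLC domains across machines, a minimal sketch that reports
the size of cpu0's last-level cache and how many cpus share it. It assumes
the standard sysfs cache directory (level, size, shared_cpu_list) is
present; the function names are hypothetical.

```shell
#!/bin/sh
# Count the cpus named by a sysfs cpulist such as "0-7,64-71".
count_cpulist() {
    echo "$1" | awk -F, '{
        n = 0
        for (i = 1; i <= NF; i++) {
            k = split($i, r, "-")
            # "a-b" covers b-a+1 cpus, a bare id covers one.
            n += (k == 2) ? r[2] - r[1] + 1 : 1
        }
        print n
    }'
}

# Report cpu0's highest-level cache: its level, size and sharing cpu count.
llc_info() {
    best=""
    bestlvl=0
    for d in /sys/devices/system/cpu/cpu0/cache/index[0-9]*; do
        [ -r "$d/level" ] || continue
        lvl=$(cat "$d/level")
        if [ "$lvl" -gt "$bestlvl" ]; then
            bestlvl=$lvl
            best=$d
        fi
    done
    # Print nothing if the cache directory is not exposed.
    [ -n "$best" ] || return 0
    echo "L$bestlvl size=$(cat "$best/size")" \
         "cpus=$(count_cpulist "$(cat "$best/shared_cpu_list")")"
}
```

Running llc_info on both machines gives the two numbers discussed above
side by side.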

> > > I'm guessing you've left all sched knobs alone? Maybe sharing those and
>
> Yes I've left all knobs alone. The server I have access to has Ubuntu
> 22.04.1 installed and here are the values of these knobs:
> root@a4bf01924c30:/sys/kernel/debug/sched# sysctl -a |grep sched
> kernel.sched_autogroup_enabled = 1
> kernel.sched_cfs_bandwidth_slice_us = 5000
> kernel.sched_child_runs_first = 0
> kernel.sched_deadline_period_max_us = 4194304
> kernel.sched_deadline_period_min_us = 100
> kernel.sched_energy_aware = 1
> kernel.sched_rr_timeslice_ms = 100
> kernel.sched_rt_period_us = 1000000
> kernel.sched_rt_runtime_us = 950000
> kernel.sched_schedstats = 0
> kernel.sched_util_clamp_max = 1024
> kernel.sched_util_clamp_min = 1024
> kernel.sched_util_clamp_min_rt_default = 1024
>
> root@a4bf01924c30:/sys/kernel/debug/sched# for i in `ls features *_ns *_ms preempt`; do echo "$i: `cat $i`"; done
> features: GENTLE_FAIR_SLEEPERS START_DEBIT NO_NEXT_BUDDY LAST_BUDDY CACHE_HOT_BUDDY WAKEUP_PREEMPTION NO_HRTICK NO_HRTICK_DL NO_DOUBLE_TICK NONTASK_CAPACITY TTWU_QUEUE NO_SIS_PROP SIS_UTIL NO_WARN_DOUBLE_CLOCK RT_PUSH_IPI NO_RT_RUNTIME_SHARE NO_LB_MIN ATTACH_AGE_LOAD WA_IDLE WA_WEIGHT WA_BIAS UTIL_EST UTIL_EST_FASTUP NO_LATENCY_WARN ALT_PERIOD BASE_SLICE
> idle_min_granularity_ns: 750000
> latency_ns: 24000000
> latency_warn_ms: 100
> migration_cost_ns: 500000
> min_granularity_ns: 3000000
> preempt: none (voluntary) full
> wakeup_granularity_ns: 4000000

Right, figures, all the same on my machines.
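
To compare the debugfs knobs mechanically rather than by eye, something
like the sketch below could be run on each machine and the outputs diffed.
The function name is made up, and the default directory is just the one
from your shell prompt above.

```shell
#!/bin/sh
# Print "name: value" for each sched debugfs knob in a directory, sorted by
# name so that dumps from two machines line up under diff.
dump_sched_knobs() {
    dir=${1:-/sys/kernel/debug/sched}
    for f in "$dir"/features "$dir"/*_ns "$dir"/*_ms "$dir"/preempt; do
        # Skip unmatched globs and anything that is not a readable file.
        [ -f "$f" ] && [ -r "$f" ] || continue
        echo "$(basename "$f"): $(cat "$f")"
    done | sort
}

# Example use (needs root for debugfs):
#   dump_sched_knobs > knobs.$(hostname)
#   diff knobs.hostA knobs.hostB
```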

> And attached kconfig, it's basically what the distro provided except I
> had to disable some configs related to module sign or something like
> that.

Thanks for all the info. With your kconfig I got the same low perf
percentages as before (<0.50% each for update_cfs_group() and
update_load_avg()), so maybe this just takes a big machine with big
LLCs, which sadly I haven't got.