RE: [PATCH v2] sched/topology: fix the issue groups don't span domain->span for NUMA diameter > 2

From: Song Bao Hua (Barry Song)
Date: Wed Feb 03 2021 - 16:32:14 EST




> -----Original Message-----
> From: Meelis Roos [mailto:mroos@xxxxxxxx]
> Sent: Thursday, February 4, 2021 12:58 AM
> To: Song Bao Hua (Barry Song) <song.bao.hua@xxxxxxxxxxxxx>;
> valentin.schneider@xxxxxxx; vincent.guittot@xxxxxxxxxx; mgorman@xxxxxxx;
> mingo@xxxxxxxxxx; peterz@xxxxxxxxxxxxx; dietmar.eggemann@xxxxxxx;
> morten.rasmussen@xxxxxxx; linux-kernel@xxxxxxxxxxxxxxx
> Cc: linuxarm@xxxxxxxxxxxxx; xuwei (O) <xuwei5@xxxxxxxxxx>; Liguozhu (Kenneth)
> <liguozhu@xxxxxxxxxxxxx>; tiantao (H) <tiantao6@xxxxxxxxxxxxx>; wanghuiqiang
> <wanghuiqiang@xxxxxxxxxx>; Zengtao (B) <prime.zeng@xxxxxxxxxxxxx>; Jonathan
> Cameron <jonathan.cameron@xxxxxxxxxx>; guodong.xu@xxxxxxxxxx
> Subject: Re: [PATCH v2] sched/topology: fix the issue groups don't span
> domain->span for NUMA diameter > 2
>
> 03.02.21 13:12 Barry Song wrote:
> > kernel/sched/topology.c | 85 +++++++++++++++++++++++++----------------
> > 1 file changed, 53 insertions(+), 32 deletions(-)
> >
> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index 5d3675c7a76b..964ed89001fe 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
>
> This one still works on the Sun X4600-M2, on top of v5.11-rc6-55-g3aaf0a27ffc2.
>
>
> Performance-wise - is the some simple benhmark to run to meaure the impact?
> Compared to what - 5.10.0 or the kernel with the warning?

Hi Meelis,
Thanks for retesting.

Comparing to the kernel with the warning is enough. As I mentioned here:
https://lore.kernel.org/lkml/20210115203632.34396-1-song.bao.hua@xxxxxxxxxxxxx/

I have seen two major issues caused by the broken sched_group:

* in load_balance() and find_busiest_group()
The kernel calculates avg_load and group_type from:

sum(load of cpus within sched_domain)
------------------------------------
capacity of the whole sched_group

Since the sched_group isn't a subset of the sched_domain, the load
of the problematic group is severely underestimated.

            sched_domain

+----------------------------------+
|                                  |
|       +-------------------------------------------+
|       |  +-------+     +------+  |                |
|       |  | cpu0  |     | cpu1 |  |                |
|       |  +-------+     +------+  |                |
+----------------------------------+                |
        |                                           |
        |                      +-------+  +-------+ |
        |                      |cpu2   |  |cpu3   | |
        |                      +-------+  +-------+ |
        |                                           |
        +-------------------------------------------+
                   problematic sched_group


In the above example, the kernel will divide "the summed load of
cpu0 and cpu1" by "the capacity of the whole group, including
cpu0, 1, 2 and 3".

* in select_task_rq_fair() and find_idlest_group()
The kernel could push a forked or exec-ed task outside the
sched_domain but still inside the sched_group. In the above
diagram, when the kernel wants to find the idlest cpu in the
sched_domain, it can end up picking cpu2 or cpu3.
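
To make the two issues concrete, here is a toy userspace model in C. It
is only a sketch and not the code in kernel/sched/fair.c: struct toy_cpu,
toy_avg_load(), toy_idlest_cpu() and the clamp_to_domain flag are all
hypothetical names made up for illustration. Only SCHED_CAPACITY_SCALE
(1024) matches the kernel's value.

```c
#include <assert.h>
#include <stddef.h>

#define SCHED_CAPACITY_SCALE 1024UL /* kernel's value for one full CPU */

/* Toy per-CPU state; hypothetical, not the kernel's struct rq. */
struct toy_cpu {
	int id;
	unsigned long load;
	unsigned long capacity;
	int in_domain; /* nonzero if the CPU is inside the sched_domain span */
};

/*
 * Issue 1: load is summed only over the CPUs inside the domain, while
 * capacity covers the whole group unless clamp_to_domain is set.
 */
unsigned long toy_avg_load(const struct toy_cpu *grp, size_t n,
			   int clamp_to_domain)
{
	unsigned long load = 0, capacity = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (grp[i].in_domain)
			load += grp[i].load;
		if (grp[i].in_domain || !clamp_to_domain)
			capacity += grp[i].capacity;
	}
	assert(capacity > 0);
	return load * SCHED_CAPACITY_SCALE / capacity;
}

/*
 * Issue 2: pick the least loaded CPU of the group. Without
 * clamp_to_domain the winner may sit outside the domain.
 * Returns the CPU id, or -1 if there is no candidate.
 */
int toy_idlest_cpu(const struct toy_cpu *grp, size_t n, int clamp_to_domain)
{
	int best = -1;
	unsigned long best_load = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (clamp_to_domain && !grp[i].in_domain)
			continue;
		if (best < 0 || grp[i].load < best_load) {
			best = grp[i].id;
			best_load = grp[i].load;
		}
	}
	return best;
}
```

Take the group from the diagram with cpu0 and cpu1 fully loaded (load
1024 each), cpu2 at load 100, cpu3 idle, and every cpu at capacity 1024.
Without clamping, toy_avg_load() reports 512 although the two cpus in
the domain are saturated, and toy_idlest_cpu() returns cpu3, which is
outside the domain. With the group clamped to the domain, the results
are 1024 and cpu0.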

I guess these two issues can potentially affect many benchmarks.
Our team has seen a 5% unixbench score increase with the fix on
some machines, though the real impact will vary case by case.

>
> drop caches and time the build time of linux kernel with make -j64?
>
> --
> Meelis Roos

Thanks
Barry