RE: [PATCH] cpu-topology: warn if NUMA configurations conflicts with lower layer

From: Zengtao (B)
Date: Thu Jan 02 2020 - 07:47:18 EST


> -----Original Message-----
> From: Sudeep Holla [mailto:sudeep.holla@xxxxxxx]
> Sent: Thursday, January 02, 2020 7:30 PM
> To: Zengtao (B)
> Cc: Linuxarm; Greg Kroah-Hartman; Rafael J. Wysocki;
> linux-kernel@xxxxxxxxxxxxxxx; Morten Rasmussen
> Subject: Re: [PATCH] cpu-topology: warn if NUMA configurations conflicts
> with lower layer
>
> On Thu, Jan 02, 2020 at 03:05:40AM +0000, Zengtao (B) wrote:
> > Hi Sudeep:
> >
> > Thanks for your reply.
> >
> > > -----Original Message-----
> > > From: Sudeep Holla [mailto:sudeep.holla@xxxxxxx]
> > > Sent: Wednesday, January 01, 2020 12:41 AM
> > > To: Zengtao (B)
> > > Cc: Linuxarm; Greg Kroah-Hartman; Rafael J. Wysocki;
> > > linux-kernel@xxxxxxxxxxxxxxx; Sudeep Holla; Morten Rasmussen
> > > Subject: Re: [PATCH] cpu-topology: warn if NUMA configurations
> conflicts
> > > with lower layer
> > >
> > > On Mon, Dec 23, 2019 at 04:16:19PM +0800, z00214469 wrote:
> > > > As we know, from sched domain's perspective, the DIE layer should
> be
> > > > larger than or at least equal to the MC layer, and in some cases, MC
> > > > is defined by the arch specified hardware, MPIDR for example, but
> > > NUMA
> > > > can be defined by users,
> > >
> > > Who are the users you are referring above ?
> > For example, when I use QEMU to start a guest linux, I can define the
> > NUMA topology of the guest linux whatever i want.
>
> OK and how is the information passed to the kernel ? DT or ACPI ?
> We need to fix the miss match if any during the initial parse of those
> information.
>

Both, For the current QEMU, we don't have the correct cpu topology
passed to linux. Luckily drjones planed to deal with the issue.
https://patchwork.ozlabs.org/cover/939301/

> > > > with the following system configrations:
> > >
> > > Do you mean ACPI tables or DT or some firmware tables ?
> > >
> > > > *************************************
> > > > NUMA: 0-2, 3-7
> > >
> > > Is the above simply wrong with respect to hardware and it actually
> match
> > > core_siblings ?
> > >
> > Actually, we can't simply say this is wrong, i just want to show an
> example.
> > And this example also can be:
> > NUMA: 0-23, 24-47
> > core_siblings: 0-15, 16-31, 32-47
> >
>
> Are you sure of the above ? Possible values w.r.t hardware config:
> core_siblings: 0-15, 16-23, 24-31, 32-47
>
> But what you have specified above is still wrong core_siblings IMO.
>
It depends on the hardware, on my platform, 16 cores per cluster.

>
> [...]
>
> > > > diff --git a/drivers/base/arch_topology.c
> b/drivers/base/arch_topology.c
> > > > index 1eb81f11..5fe44b3 100644
> > > > --- a/drivers/base/arch_topology.c
> > > > +++ b/drivers/base/arch_topology.c
> > > > @@ -439,10 +439,18 @@ const struct cpumask
> > > *cpu_coregroup_mask(int cpu)
> > > > if (cpumask_subset(&cpu_topology[cpu].core_sibling,
> core_mask)) {
> > > > /* not numa in package, lets use the package siblings */
> > > > core_mask = &cpu_topology[cpu].core_sibling;
> > > > - }
> > > > + } else
> > > > + pr_warn_once("Warning: suspicous broken topology:
> cpu:[%d]'s
> > > core_sibling:[%*pbl] not a subset of numa node:[%*pbl]\n",
> > > > + cpu,
> cpumask_pr_args(&cpu_topology[cpu].core_sibling),
> > > > + cpumask_pr_args(core_mask));
> > > > +
> > >
> > > Won't this print warning on all systems that don't have numa within a
> > > package ? What are you trying to achieve here ?
> >
> > Since in my case, when this corner case happens, the linux kernel just fall
> into
> > dead loop with no prompt, here this is a helping message will help a lot.
> >
>
> As I said, wrong configurations need to be detected when generating
> DT/ACPI if possible. The above will print warning on systems with NUMA
> within package.
>
> NUMA: 0-7, 8-15
> core_siblings: 0-15
>
> The above is the example where the die has 16 CPUs and 2 NUMA nodes
> within a package, your change throws error to the above config which is
> wrong.
>
>From your example, the core 7 and core 8 has got different LLC but the same Low
Level cache?
>From schedule view of point, lower level sched domain should be a subset of higher
Level sched domain.

> > >
> > > > if (cpu_topology[cpu].llc_id != -1) {
> > > > if (cpumask_subset(&cpu_topology[cpu].llc_sibling,
> core_mask))
> > > > core_mask = &cpu_topology[cpu].llc_sibling;
> > > > + else
> > > > + pr_warn_once("Warning: suspicous broken topology:
> > > cpu:[%d]'s llc_sibling:[%*pbl] not a subset of numa node:[%*pbl]\n",
> > > > + cpu,
> > > cpumask_pr_args(&cpu_topology[cpu].llc_sibling),
> > > > + cpumask_pr_args(core_mask));
> > > > }
> > > >
> > >
> > > This will trigger warning on all systems that lack cacheinfo topology.
> > > I don't understand the intent of this patch at all. Can you explain
> > > all the steps you follow and the issue you face ?
> >
> > Can you show me an example, what I really want to warn is the case that
> > NUMA topology conflicts with lower level.
> >
>
> I was wrong here, I mis-read this section. I still fail to understand
> why the above change is needed. I understood the QEMU example, but you
> haven't specified how cacheinfo looks like there.
>

My idea:
(1) If the dtb/acpi is not legal, the linux kernel should check or prompt to help debug.
(2)From scheduler view of point, low level sched domain should be a subset of high level
Sched domain.

I am very sure about (2), any example to show me (2) is wrong?

Thanks.
Zengtao