Re: [RFC PATCH] topology: Represent clusters of CPUs within a die.

From: Jonathan Cameron
Date: Mon Oct 19 2020 - 10:27:20 EST


On Mon, 19 Oct 2020 14:48:02 +0100
Valentin Schneider <valentin.schneider@xxxxxxx> wrote:

> +Cc Jeremy
>
> On 19/10/20 14:10, Morten Rasmussen wrote:
> > Hi Jonathan,
> > The problem I see is that the benefit of keeping tasks together due to
> > the interconnect layout might vary significantly between systems. So if
> > we introduce a new cpumask for clusters, it has to represent roughly
> > the same system properties otherwise generic software consuming this
> > information could be tricked.
> >
> > If there is a provable benefit of having interconnect grouping
> > information, I think it would be better represented by a distance matrix
> > like we have for NUMA.
> >
> > Morten
>
> That's my cue to paste some of that stuff I've been rambling on and off
> about!
>
> With regard to cache / interconnect layout, I do believe that if we
> want to support it in the scheduler itself, then we should leverage some
> distance table rather than create X extra scheduler topology levels.
>
> I had a chat with Jeremy on the ACPI side of that sometime ago. IIRC given
> that SLIT gives us a distance value between any two PXM, we could directly
> express core-to-core distance in that table. With that (and if that still
> lets us properly discover NUMA node spans), we could let the scheduler
> build dynamic NUMA-like topology levels representing the inner quirks of
> the cache / interconnect layout.

You would rapidly run into the problem SLIT had for NUMA node description.
There is no consistent definition of distance, and except in the vaguest
sense of 'nearer' it wasn't any use for anything. That is why HMAT
came along. It's far from perfect but it is a step up.

I can't see how you'd generalize those particular tables to do anything
for intercore comms without breaking their use for NUMA, but something
a bit similar might work.
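
For reference, the "NUMA-like levels from a distance table" idea being
discussed amounts to roughly the following. This is only a sketch: the
function name and the 4-CPU table are made up for illustration, and the
distance values are arbitrary, which is exactly the weakness noted above
(they only order CPUs as 'nearer' or 'farther').

```python
def levels_from_distances(dist):
    """Derive nested CPU groupings, one level per unique distance value.

    dist is a square matrix where dist[i][j] is an abstract distance
    between CPU i and CPU j (hypothetical values, SLIT-style).
    """
    n = len(dist)
    # Unique off-diagonal distances, nearest first.
    unique = sorted({dist[i][j] for i in range(n) for j in range(n) if i != j})
    levels = []
    for d in unique:
        # The span of CPU i at this level is every CPU within distance d.
        spans = {
            frozenset(j for j in range(n) if i == j or dist[i][j] <= d)
            for i in range(n)
        }
        levels.append((d, sorted(sorted(s) for s in spans)))
    return levels


# Hypothetical table: CPUs 0-1 and CPUs 2-3 each form a cluster.
example = [
    [10, 15, 30, 30],
    [15, 10, 30, 30],
    [30, 30, 10, 15],
    [30, 30, 15, 10],
]

if __name__ == "__main__":
    for distance, groups in levels_from_distances(example):
        print(distance, groups)
```

The grouping falls out of the ordering of the values alone, so two
systems with the same table could still behave very differently.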

A lot of thought (and meeting time) has gone into trying to improve the
situation for complex topology around NUMA. Whilst there are differences
in representing the internal interconnects and caches, it seems like a somewhat
similar problem. The issue there is that it is really, really hard to describe
this stuff with enough detail to be useful, but simple enough to be usable.

https://lore.kernel.org/linux-mm/20181203233509.20671-1-jglisse@xxxxxxxxxx/

>
> It's mostly pipe dreams for now, but there seems to be more and more
> hardware where that would make sense; somewhat recently the PowerPC guys
> added something to their arch-specific code in that regard.

Pipe dream == something to work on ;)

ACPI has a nice code-first model of updating the spec now, so we can discuss
this one in public, and propose spec changes only once we have a proven
implementation.

Note I'm not proposing we put the cluster stuff in the scheduler, just
provide it as a hint to userspace.
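
As a rough illustration of what "a hint to userspace" looks like from
the consuming side, the sketch below reads one of the existing sysfs
topology files and parses the cpulist format ("0-3,8-11"). The helper
names are made up, and it reads core_siblings_list only because that
file exists today; a cluster-level file would be read the same way, but
its name is an assumption and not defined here.

```python
def parse_cpulist(text):
    """Parse a sysfs cpulist string such as "0-3,8-11" into a list of ints."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue  # empty file or stray comma
        lo, _, hi = part.partition("-")
        # A single CPU ("5") has no upper bound, so reuse the lower one.
        cpus.extend(range(int(lo), int(hi or lo) + 1))
    return cpus


def read_siblings(cpu, name="core_siblings_list"):
    """Read a topology cpulist for one CPU (Linux sysfs path)."""
    path = f"/sys/devices/system/cpu/cpu{cpu}/topology/{name}"
    with open(path) as f:
        return parse_cpulist(f.read())
```

Userspace can then treat the returned list purely as a placement hint,
for example when choosing CPUs for threads that share data.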

Jonathan