Re: [PATCH v3 7/7] sched/topology: Use prebuilt SD flag degeneration mask

From: Peter Zijlstra
Date: Mon Jul 13 2020 - 09:43:55 EST


On Mon, Jul 13, 2020 at 02:28:29PM +0100, Valentin Schneider wrote:
>
> On 13/07/20 13:55, Peter Zijlstra wrote:
> > On Wed, Jul 01, 2020 at 08:06:55PM +0100, Valentin Schneider wrote:
> >> Leverage SD_DEGENERATE_GROUPS_MASK in sd_degenerate() and
> >> sd_degenerate_parent().
> >>
> >> Note that this changes sd_degenerate() somewhat: I'm using the negation of
> >> SD_DEGENERATE_GROUPS_MASK as the mask of flags not requiring groups, which
> >> is equivalent to:
> >>
> >> SD_WAKE_AFFINE | SD_SERIALIZE | SD_NUMA
> >>
> >> whereas the current mask for that is simply
> >>
> >> SD_WAKE_AFFINE
> >>
> >> I played with a few toy NUMA topologies on QEMU and couldn't cause a
> >> different degeneration than what mainline does currently. If that is deemed
> >> too risky, we can go back to using SD_WAKE_AFFINE explicitly.
> >
> > Arguably SD_SERIALIZE needs groups, note how we're only having that
> > effective for machines with at least 2 nodes. It's a bit shit how we end
> > up there, but IIRC that's what it ends up as.
> >
>
> Right, AFAICT we get SD_SERIALIZE wherever we have SD_NUMA, which is any
> level above NODE.

Oh, right, I forgot we have NODE, d'0h. But in that case these lines:

if (nr_node_ids == 1)
pflags &= ~SD_SERIALIZE;

are dead code, right?

> > SD_NUMA is descriptive, and not marking a group as degenerates because
> > it has SD_NUMA seems a bit silly.
>
> It does, although we can still degenerate it, see below.
>
> > But then, it would be the top domain
> > and would survive anyway?
>
> So from what I've tested we still get rid of those via
> sd_parent_degenerate(): child and parent have the same flags and same span,
> so parent goes out.
>
> That happens in the middle of the NUMA topology levels on that borked
> topology with weird distances, aka
>
> node distances:
> node 0 1 2 3
> 0: 10 12 20 22
> 1: 12 10 22 24
> 2: 20 22 10 12
> 3: 22 24 12 10
>
> which ought to look something like (+local distance to end result)
>
> 2 10 2
> 1 <---> 0 <---> 2 <---> 3
>
> We end up with the following NUMA levels (i.e. deduplicated distances)
> NUMA (<= 12)
> NUMA (<= 20)
> NUMA (<= 22)
> NUMA (<= 24)
>
> For e.g. any CPU of node1, NUMA(<=20) is gonna have the same span as
> NUMA(<=12), so we'll degenerate it.

Man, that's horrible :-) OK, fair enough, keep it as is, we'll see what
if anything breaks.