Re: [PATCH 4/4] sched,fair: remove effective_load

From: Peter Zijlstra
Date: Tue Jun 27 2017 - 01:39:25 EST


On Mon, Jun 26, 2017 at 03:34:49PM -0400, Rik van Riel wrote:
> On Mon, 2017-06-26 at 18:12 +0200, Peter Zijlstra wrote:
> > On Mon, Jun 26, 2017 at 11:20:54AM -0400, Rik van Riel wrote:
> >
> > > Oh, indeed.  I guess in wake_affine() we should test
> > > whether the CPUs are in the same NUMA node, rather than
> > > doing cpus_share_cache() ?
> >
> > Well, since select_idle_sibling() works at the LLC level, the early
> > test on cpus_share_cache(prev,this) seems to actually make sense.
> >
> > But then cutting out all the other bits seems wrong. Not in the least
> > because !NUMA_BALANCING should also still keep working.
>
> Even when !NUMA_BALANCING, I suspect it makes little sense
> to compare the loads of just the two cores in question, since
> select_idle_sibling() will likely move the task somewhere
> else.
>
> I suspect we want to compare the load on the whole LLC
> for that reason, even with NUMA_BALANCING disabled.

But we don't have that data around :/ One thing we could do is keep a
copy of the last s*_lb_stats in the sched_domain_shared stuff, or
something like it, and use that.
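
Something like this, say (llc_stats is a made-up name, and sg_lb_stats
would need to move out of fair.c):

	struct sched_domain_shared {
		atomic_t		ref;
		atomic_t		nr_busy_cpus;
		int			has_idle_cores;

		/* snapshot of the last load-balance pass over this LLC */
		struct sg_lb_stats	llc_stats;
	};

wake_affine() could then compare prev's and this's llc_stats instead
of the two rq loads.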

That way we can strictly keep things at the LLC level and not confuse
things with NUMA.

Similarly, we could then use that same data to avoid re-computing
things for the NUMA domain as well, and do away with numa_stats.

> > > Or, alternatively, have an update_numa_stats() variant
> > > for numa_wake_affine() that works on the LLC level?
> >
> > I think we want to retain the existing behaviour for everything
> > larger than LLC, and when NUMA_BALANCING, smaller than NUMA.
>
> What do you mean by this, exactly?

As you noted, when prev and this are in the same LLC, it doesn't matter
and select_idle_sibling() will do its thing. So anything smaller than
the LLC need not do anything.
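
That is, retain an early exit along these lines (roughly the shape
your patch has):

	static int wake_affine(struct sched_domain *sd, struct task_struct *p,
			       int prev_cpu, int sync)
	{
		int this_cpu = smp_processor_id();

		/*
		 * If prev and this share an LLC, select_idle_sibling()
		 * gets to pick the actual CPU either way, so comparing
		 * loads here buys us nothing.
		 */
		if (cpus_share_cache(prev_cpu, this_cpu))
			return true;

		...
	}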

When NUMA_BALANCING is enabled we have the numa_stats thing and we
can, as you propose, use that.

If LLC < NUMA or !NUMA_BALANCING we have a region that needs to do
_something_.
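
Schematically, with numa_wake_affine() being your new helper:

	if (cpus_share_cache(prev_cpu, this_cpu))
		return true;		/* select_idle_sibling() sorts it */

	if (static_branch_likely(&sched_numa_balancing) &&
	    cpu_to_node(prev_cpu) != cpu_to_node(this_cpu))
		return numa_wake_affine(...);	/* node-level numa_stats */

	/* same node but different LLC, or !NUMA_BALANCING */
	return ...;			/* the bit that needs _something_ */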

> How does the "existing behaviour" of only looking at
> the load on two cores make sense when doing LLC-level
> task placement?

Right, it might not be ideal, but it's what we have now. Supposedly
it's better than not doing anything at all.

But see above for other ideas.

> > Also note that your use of task_h_load() in the new numa thing
> > suffers
> > from exactly the problem effective_load() is trying to solve.
>
> Are you saying task_h_load is wrong in task_numa_compare()
> too, then? Should both use effective_load()?

I need more than the few minutes I currently have, but probably. The
question is, of course: how much does it matter, and how painful will
it be to do better?
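
To spell out the problem: task_h_load() is roughly

	static unsigned long task_h_load(struct task_struct *p)
	{
		struct cfs_rq *cfs_rq = task_cfs_rq(p);

		update_cfs_rq_h_load(cfs_rq);

		/*
		 * Scale the task's load by the cfs_rq's current
		 * contribution to the root runqueue load.
		 */
		return div64_ul(p->se.avg.load_avg * cfs_rq->h_load,
				cfs_rq_load_avg(cfs_rq) + 1);
	}

which uses the hierarchy's *current* weights. But moving the task
changes the group weights on both the source and destination
runqueues, and estimating that delta is exactly what effective_load()
was for.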