Re: [PATCH 3/4] sched/fair: Add REBALANCE_AFFINITY rebalancing code

From: Peter Zijlstra
Date: Fri Jul 01 2016 - 10:59:49 EST


On Fri, Jul 01, 2016 at 09:15:55AM -0500, James Hartsock wrote:
> On Fri, Jul 1, 2016 at 3:24 AM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> > On Fri, Jul 01, 2016 at 09:35:46AM +0200, Jiri Olsa wrote:
> > > well this is issue our partner met in the setup,
> > > and I'm not sure what was their motivation for that,
> > > perhaps James could clarify in here..
> > >
> > > I tried to make the 'scratch that itch' solution as
> > > mentioned in earlier discussion.
> > >
> > > So IIUIC what you say is that it needs to be more generic
> > > solution..? I'll continue staring at it then ;-)
> >
> > I just want to know what problem we're trying to solve..
> >
> > Because it appears this is running 1 task each on a 'weird' subset of
> > cpus and things are going badly. If that really is the case, then
> > teaching active balance to only move tasks to idle cpus, or something
> > along those lines, should also cure things.
> >
> > Also, I'm curious why people set such weird masks.
> >
>
> I think the original issue was reported/seen when using a straight range
> of CPUs, but in that range they crossed NUMA nodes. Then, not knowing
> what was triggering the issue and trying to reproduce it, we started
> trying some crazy masks.
>
> The work-around the customer has been using is SCHED_RR, as it doesn't
> have this balance-across-NUMA issue. But it is also the fact that
> SCHED_RR doesn't have this issue that makes it "seem" like a defect in
> SCHED_OTHER. I have shared with the customer that this is triggered by
> the taskset crossing NUMA nodes, so they are aware of that. But if this
> is seen as a limitation of SCHED_OTHER and not something reasonable to
> address upstream, I think it is at least something we should get
> documented.

But what exact usecase? A single task per cpu, or something else?

Note that RR has different constraints than OTHER, but in both cases
having skewed masks across a topology divide is unlikely to be good for
performance.
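
As a rough illustration of what such a skewed mask looks like (the CPU
numbers below are made up for the example, not taken from the report),
something along these lines pins a task to a range that straddles two
nodes; the SCHED_RR work-around mentioned above would amount to an extra
sched_setscheduler() call on the same tasks:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	cpu_set_t set;
	int cpu;

	CPU_ZERO(&set);
	/* e.g. CPUs 0-3 on node 0 plus CPU 4 on node 1 -- numbers are made up */
	for (cpu = 0; cpu <= 4; cpu++)
		CPU_SET(cpu, &set);

	if (sched_setaffinity(0, sizeof(set), &set)) {
		perror("sched_setaffinity");
		exit(1);
	}

	/* ... spawn one busy task per allowed CPU here, as in the reported setup ... */
	return 0;
}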

Esp. in the extreme case reported here, where one task is on an entirely
different node than the rest of them, that task will cause cacheline
transfers between the nodes, slowing down both itself and all the other
tasks that have to pull those lines back in.
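
FWIW, a quick way to see when an affinity mask straddles nodes like that
is to walk the mask and print each CPU's node. A minimal sketch, assuming
libnuma is available (link with -lnuma):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <numa.h>

int main(void)
{
	cpu_set_t set;
	int cpu;

	/* bail out if NUMA is unavailable or we can't read our affinity mask */
	if (numa_available() < 0 || sched_getaffinity(0, sizeof(set), &set))
		return 1;

	/* print the node of every CPU we are allowed to run on */
	for (cpu = 0; cpu < CPU_SETSIZE; cpu++)
		if (CPU_ISSET(cpu, &set))
			printf("cpu %d -> node %d\n", cpu, numa_node_of_cpu(cpu));

	return 0;
}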