Re: [ISSUE] sched/cgroup: Does cpu-cgroup still work fine nowadays?

From: Peter Zijlstra
Date: Tue Jun 10 2014 - 08:12:34 EST


On Tue, Jun 10, 2014 at 04:56:12PM +0800, Michael wang wrote:
> On 05/16/2014 03:54 PM, Peter Zijlstra wrote:
> [snip]
> >
> > Hmm, that _should_ more or less work and does indeed suggest there's
> > something iffy.
> >
>
> I think we've located the reason why cpu-cgroup doesn't work well on
> dbench now... finally.
>
> Since a long time has passed, I'd like to link to the way to reproduce
> the issue here...
>
> https://lkml.org/lkml/2014/5/16/4
>
> Now here is the analysis:
>
> So our problem is: when tasks like dbench, which sleep and wake each
> other frequently, are put into a deep group, they gather on the same
> CPU while a workload like stress is running, with the result that the
> whole group can gain no more than one CPU.
>
> Basically there are two key points here: load-balance and wake-affine.
>
> Wake-affine certainly pulls tasks together for a workload like dbench;
> what makes the difference when dbench is put into a group one level
> deeper is load-balance, which happens less.

Do we load-balance less (frequently), or do we migrate fewer tasks due
to load-balancing?

> Usually, when the system is busy and we cannot locate an idle cpu
> during the wakeup, we pick the search point instead, no matter how
> busy it is, since we count on the balance routine to help balance the
> load later.

But above you said that dbench usually triggers the wake-affine logic,
but now you say it doesn't and we rely on select_idle_sibling?
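
For reference, the fallback being described is roughly the shape below.
This is a drastically simplified user-space model, not the kernel's
actual select_idle_sibling(); NR_CPUS and the cpu_idle[] state are
invented for illustration:

#include <stdbool.h>

#define NR_CPUS 8

static bool cpu_idle[NR_CPUS];	/* stand-in for real idle state */

/* Prefer an idle CPU near @target; when everything is busy, fall
 * back to @target itself regardless of its load. */
static int select_idle_cpu_model(int target)
{
	int cpu;

	if (cpu_idle[target])
		return target;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		if (cpu_idle[cpu])
			return cpu;

	/* nothing idle: pick the search point anyway and count on
	 * the periodic load balancer to sort it out later */
	return target;
}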

Note that the comparison isn't fair: comparing dbench on an idle system
vs dbench on a busy system is the first step.

The second is adding the cgroup crap on top.

> However, in our case load balance could not help with that, since the
> deeper the group is, the less its load means to the root group.

But since all actual load is at the same depth, the relative threshold
(imbalance pct) should work the same; the absolute size of the values
doesn't matter, the relative ratios do.
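
To make the disputed depth argument concrete, here is a back-of-envelope,
h_load-style product of per-level share ratios. The numbers are invented:
default shares of 1024, competing with an equal 1024 of sibling load at
every level of nesting:

#include <stdio.h>

int main(void)
{
	double h_load = 1024.0;	/* one task of nice-0 weight */
	int level, depth = 2;	/* e.g. a group at /cgroup/a/b */

	/* at each level the group entity owns half the cfs_rq
	 * weight, so the task's contribution to root halves */
	for (level = 0; level < depth; level++)
		h_load *= 1024.0 / (1024.0 + 1024.0);

	printf("effective load at root: %.0f\n", h_load);	/* 256 */
	return 0;
}

Whether that shrinking absolute value matters, or only the ratios the
balancer compares, is exactly the point of contention here.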

> Which means that even when tasks in a deep group are all gathered on
> one CPU, the load can still look balanced from the view of the root
> group, and the tasks lose their only chance (balance) to spread once
> they are already on the same CPU...

Sure, but see above.

> Furthermore, for tasks that flip between sleeping and running as
> frequently as dbench does, it becomes far harder for load balance to
> help; it can rarely even catch them on a rq.

And I suspect that is the main problem; so see what it does on a busy
system without cgroups: nr_cpus busy loops + dbench. Because that's your
baseline for adding cgroups, the cgroup can only shift that behaviour
around.

> So in such cases, the only chance to balance these tasks is during
> the wakeup; however, that will be expensive...
>
> Thus the cheaper way is something just like select_idle_sibling();
> the only difference is that now we balance tasks inside the group to
> prevent them from gathering.
>
> The patch below solved the problem during testing. I'd like to do
> more testing on other benchmarks before sending out the formal patch;
> any comments are welcome ;-)
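
The patch itself is not quoted in this reply; from the description, its
shape is presumably something like the following user-space model, where
NR_CPUS and group_nr_running[] are invented stand-ins for the real
per-group runqueue state:

#define NR_CPUS 8

static int group_nr_running[NR_CPUS];	/* this group's tasks per cpu */

/* Same shape as select_idle_sibling(), except "idle" now means "runs
 * no task of this group" rather than "runs no task at all". */
static int select_group_idle_cpu_model(int target)
{
	int cpu;

	if (group_nr_running[target] == 0)
		return target;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		if (group_nr_running[cpu] == 0)
			return cpu;

	return target;
}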

So I think that approach is wrong; select_idle_sibling() works because
we want to keep CPUs from being idle, but if they're not actually idle,
pretending they are (in a cgroup) is actively wrong and can skew load
pretty badly.

Furthermore, if, as I expect, dbench sucks on a busy system, then the
proposed cgroup thing is wrong, as a cgroup isn't supposed to radically
alter behaviour like that.

Moreover, I suspect that patch will tend to overload cpu0 (and lower cpu
numbers in general -- because it's scanning in the same direction for
each cgroup) for other workloads. You can't just go pile more and more
work on cpu0 just because there's nothing running in this particular
cgroup.
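
That bias follows directly from the scan order; a toy model (all names
and sizes invented) shows unrelated groups piling onto the same
low-numbered CPUs:

#include <stdio.h>

#define NR_CPUS 4

/* per-group view: the first CPU running none of *this* group's tasks */
static int first_group_idle(const int mine[NR_CPUS])
{
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		if (mine[cpu] == 0)
			return cpu;
	return 0;
}

int main(void)
{
	int total[NR_CPUS] = { 0 };
	int g;

	/* three unrelated groups, each with an empty placement map:
	 * every one picks cpu0, blind to what the others put there */
	for (g = 0; g < 3; g++) {
		int mine[NR_CPUS] = { 0 };
		total[first_group_idle(mine)]++;
	}

	printf("cpu0: %d wakeups, cpu%d: %d\n",
	       total[0], NR_CPUS - 1, total[NR_CPUS - 1]);
	return 0;
}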

So dbench is very sensitive to queueing, and select_idle_sibling()
avoids a lot of queueing on an idle system. I don't think that's
something we should fix with cgroups.
