Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?

From: Michael wang
Date: Wed Jun 11 2014 - 02:14:01 EST


Hi, Peter

Thanks for the reply :)

On 06/10/2014 08:12 PM, Peter Zijlstra wrote:
[snip]
>> Wake-affine certainly pulls tasks together for workloads like dbench;
>> what makes the difference when dbench is put into a group one level
>> deeper is load balancing, which happens less often.
>
> We load-balance less (frequently) or we migrate less tasks due to
> load-balancing ?

IMHO, when we put the tasks one group deeper, in other words when the
total weight of these tasks becomes 1024 (previously 3072), the load
looks more balanced at root, which makes the balance check consider the
system balanced, so we migrate less in the lb-routine (load-balance
routine).

>
>> Usually, when the system is busy and we cannot locate an idle cpu
>> during wakeup, we pick the search point instead, however busy it is,
>> since we count on the balance routine to help spread the load later.
>
> But above you said that dbench usually triggers the wake-affine logic,
> but now you say it doesn't and we rely on select_idle_sibling?

During wakeup it triggers wake-affine; after that we go inside
select_idle_sibling(), and when no idle cpu is found we pick the search
point instead (the curr cpu if wake-affine won, the prev cpu if not).
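
To illustrate, a condensed standalone sketch of that fallback (not the
actual fair.c code; the idle[] array here is a hypothetical stand-in
for the real per-cpu idle check):

#include <stdbool.h>

#define NR_CPUS 12

/* 'target' is the curr cpu after a successful wake-affine,
 * otherwise the prev cpu */
static int select_idle_sibling_sketch(int target, const bool idle[NR_CPUS])
{
	int cpu;

	if (idle[target])
		return target;

	/* scan the LLC domain (here simply: every cpu) for an idle one */
	for (cpu = 0; cpu < NR_CPUS; cpu++)
		if (idle[cpu])
			return cpu;

	/* no idle cpu at all: fall back to the (possibly very busy)
	 * search point and count on the lb-routine to spread the
	 * load later */
	return target;
}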

>
> Note that the comparison isn't fair, running dbench on an idle system vs
> running dbench on a busy system is the first step.

Our comparison is based on the same busy system; both cases run the
same workload (dbench + stress), the only difference being that we put
the workload one group deeper. It looks like:

Good case:

        root
      /   |   \
   l1-A  l1-B  l1-C
  dbench stress stress

results:
  dbench got around 300%
  each stress got around 450%

Bad case:

        root
          |
         l1
      /   |   \
   l2-A  l2-B  l2-C
  dbench stress stress

results:
  dbench got around 100% (throughput dropped too)
  each stress got around 550%

Although the l1 group gains the same total resources (1200%), it does
not distribute them to l2-A/B/C according to their shares the way the
root group did for l1-A/B/C.

>
> The second is adding the cgroup crap on.
>
>> However, in our case load balancing cannot help with that, since the
>> deeper the group is, the less its load means to the root group.
>
> But since all actual load is on the same depth, the relative threshold
> (imbalance pct) should work the same, the size of the values don't
> matter, the relative ratios do.

Exactly; however, when the group is deep, the chance that it makes the
root look imbalanced is reduced. In the good case, the group gathered
on one cpu means 1024 load there, while in the bad case that drops to
1024/3 ideally, which makes it harder to trigger an imbalance and gain
help from the routine. Please note that although dbench and stress are
the only workload we run, there are still other tasks serving the
system that need to be woken up (some very actively, because of the
dbench...); compared to them, a deep group's load means nothing...
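
A toy calculation of that effect (hypothetical numbers; the check is
modeled on the relative test in find_busiest_group(), where the system
counts as balanced unless the busiest cpu exceeds the local load by
imbalance_pct percent):

#include <stdio.h>

#define IMBALANCE_PCT 125	/* a typical sd->imbalance_pct */

static int balanced(unsigned long busiest, unsigned long local)
{
	return 100 * busiest <= IMBALANCE_PCT * local;
}

int main(void)
{
	unsigned long other = 1500;	/* assumed per-cpu root load from
					 * the rest of the system */

	/* good case: the gathered group adds its full 1024 share */
	printf("l1 gathered: %s\n",
	       balanced(other + 1024, other) ? "balanced" : "imbalanced");

	/* bad case: the gathered l2 group adds only ~1024/3 at root */
	printf("l2 gathered: %s\n",
	       balanced(other + 1024 / 3, other) ? "balanced" : "imbalanced");

	return 0;
}

With these (made up) numbers the good case still trips the imbalance
check, while the bad case already looks balanced.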

>
>> Which means that even if the tasks in a deep group all gather on one
>> CPU, the load can still look balanced from the root group's view, and
>> the tasks lose their only chance (load balance) to spread once they
>> are already on the same CPU...
>
> Sure, but see above.

The lb-routine cannot provide enough help for a deep group, since an
imbalance inside the group does not cause an imbalance at root: ideally
each l2 task gains 1024/18 ~= 56 root load, which is easily ignored,
while inside the l2 group the gathered case can already mean an
imbalance like (1024 * 5) : 1024.
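
The arithmetic behind those numbers, assuming the default cpu.shares of
1024 per group and 6 tasks per leaf group (18 tasks below l1 in total,
which is where the 1024/18 comes from) -- a sketch of the scaling only,
not the kernel's task_h_load():

#include <stdio.h>

int main(void)
{
	unsigned long shares = 1024;	/* default cpu.shares */

	/* good case: l1-A sits directly under root, its 6 tasks
	 * split the full 1024 share */
	printf("l1 task root load ~%lu\n", shares / 6);		/* ~170 */

	/* bad case: l1 still weighs 1024 at root, but that share
	 * is now diluted over all 18 tasks below it */
	printf("l2 task root load ~%lu\n", shares / 18);	/* ~56 */

	return 0;
}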

>
>> Furthermore, for tasks that flip frequently like dbench, it becomes
>> far harder for load balancing to help; it may rarely even catch them
>> on the rq.
>
> And I suspect that is the main problem; so see what it does on a busy
> system: !cgroup: nr_cpus busy loops + dbench, because that's your
> benchmark for adding cgroups, the cgroup can only shift that behaviour
> around.

There are busy loops in the good case too, and the dbench behaviour in
the l1 groups should not change after we move them into l2 groups; what
makes things worse is that their chance to spread after gathering
becomes smaller.

>
[snip]
>> The patch below solved the problem during testing; I'd like to do
>> more testing on other benchmarks before sending out the formal patch.
>> Any comments are welcome ;-)
>
> So I think that approach is wrong, select_idle_siblings() works because
> we want to keep CPUs from being idle, but if they're not actually idle,
> pretending like they are (in a cgroup) is actively wrong and can skew
> load pretty bad.

We only choose that timing when no idle cpu can be located, the flips
are somewhat high, and the group is deep.

In such cases select_idle_sibling() doesn't work anyway; it returns the
target even if it is very busy. We just check twice to prevent it from
making some obviously bad decision ;-)
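
Roughly, the idea is something like the sketch below (a standalone
illustration, not the actual patch; grp_load[] is a hypothetical
stand-in for the group's per-cpu load):

#define NR_CPUS 12

/* hypothetical per-cpu load contributed by the waking task's group */
static unsigned long grp_load[NR_CPUS];

/* called only after select_idle_sibling() failed to find a truly idle
 * cpu, and only for frequently-flipping tasks in deep groups */
static int group_idle_cpu(int target)
{
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		if (grp_load[cpu] == 0)
			return cpu;	/* idle from the group's view */

	return target;			/* no better choice, keep target */
}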

>
> Furthermore, if as I expect, dbench sucks on a busy system, then the
> proposed cgroup thing is wrong, as a cgroup isn't supposed to radically
> alter behaviour like that.

That's true, and that's why we currently still need to turn off the
GENTLE_FAIR_SLEEPERS feature, but that's another problem to solve
later...

What we currently expect is that the cgroup assigns resources according
to the shares; that works well for the l1 groups, so we expect it to
work just as well for the l2 groups...

>
> More so, I suspect that patch will tend to overload cpu0 (and lower cpu
> numbers in general -- because its scanning in the same direction for
> each cgroup) for other workloads. You can't just go pile more and more
> work on cpu0 just because there's nothing running in this particular
> cgroup.

That's a good point...

However, during the testing this didn't happen across the 3 groups;
tasks stayed on the high cpus as often as on the low cpus. IMHO, the
key point here is that the lb-routine still works, although much less
often than before.

So the fix just makes the result of the lb-routine last longer, since
the higher cpu it picks is usually idle within the group (and so gets
picked directly later); in other words, tasks on a high cpu are now
harder to wake-affine back to a low cpu than before.

And when this applies to all the groups, each of them will be balanced
both internally and externally, and then we will see an equal number of
tasks on each cpu.

select_idle_sibling() does pick low cpus more often, and combined with
wake-affine and too little load balancing, tasks will gather on low
cpus more often. But our solution makes the rarer load balancing more
valuable (when it is needed); IMHO, it could even contribute to the
balance work in some cases...

>
> So dbench is very sensitive to queueing, and select_idle_siblings()
> avoids a lot of queueing on an idle system. I don't think that's
> something we should fix with cgroups.

It has to queue anyway after the wakeup, doesn't it? We just want a
good candidate that won't make things too bad inside the group, and we
only do this when select_idle_sibling() has given up searching...

Regards,
Michael Wang

>

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/