Re: [PATCH] sched: select 'idle' cfs_rq per task-group to prevent tg-internal imbalance

From: Mike Galbraith
Date: Mon Jun 30 2014 - 05:27:40 EST


On Mon, 2014-06-30 at 16:47 +0800, Michael wang wrote:
> Hi, Mike :)
>
> On 06/30/2014 04:06 PM, Mike Galbraith wrote:
> > On Mon, 2014-06-30 at 15:36 +0800, Michael wang wrote:
> >> On 06/18/2014 12:50 PM, Michael wang wrote:
> >>> By testing we found that after putting a benchmark (dbench) into a deep
> >>> cpu-group, its tasks (dbench routines) start to gather on one CPU, so the
> >>> benchmark can only get around 100% CPU no matter how big its task-group's
> >>> share is. Here is the link to the way to reproduce the issue:
> >>
> >> Hi, Peter
> >>
> >> We thought that involving too many factors would make things too
> >> complicated, so we are trying to start over and get rid of the concepts
> >> of 'deep-group' and 'GENTLE_FAIR_SLEEPERS' in the idea; we hope this
> >> makes things easier...
> >
> > While you're getting rid of the concept of 'GENTLE_FAIR_SLEEPERS', don't
> > forget to also get rid of the concept of 'over-scheduling' :)
>
> I'm new to this word... could you give more details on that?

Massive context switching. When heavily overloaded, wakeup preemption
tends to hurt. Trouble is, overload is also exactly when fast/light
tasks most need to get in and back out quickly.

> > That gentle thing isn't perfect (is the enemy of good), but preemption
> > model being based upon sleep, while nice and simple, has the unfortunate
> > weakness that as contention increases, so does the quantity of sleep in
> > the system. Would be nice to come up with an alternative preemption
> > model as dirt simple as this one, but lacking the inherent weakness.
>
> The preemption based on vruntime sounds fair enough, but the vruntime bonus
> for the wakee does need some more thought... although I don't want to count
> the gentle stuff in any more, disabling it does help dbench a lot...

It's scaled, but that's not really enough. A zillion tasks can sleep in
parallel, and when they are doing that, sleep time becomes a rather
meaningless preemption yardstick. It's only meaningful when there is a
significant delta between task behaviors. When running a homogeneous
load of sleepers, eg a zillion java threads all doing the same damn
thing, you're better off turning wakeup preemption off, because trying
to smooth out microscopic vruntime deltas via wakeup preemption then
does nothing but trash caches.
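
To make that concrete, here is a rough userspace sketch of the two pieces
involved: the sleeper credit a waking task gets, and the wakeup preemption
check. It is a simplified model, not the kernel code. The function names
and the latency/granularity constants are assumptions picked for
illustration, and load weights are ignored (think all tasks at nice 0).

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed illustrative values in nanoseconds, not read from any kernel. */
#define SCHED_LATENCY_NS      6000000LL  /* hypothetical scheduling latency */
#define WAKEUP_GRANULARITY_NS 1000000LL  /* hypothetical wakeup granularity */

/*
 * Model of sleeper placement: a waking task is credited at most 'thresh'
 * behind the queue's min_vruntime.  With the "gentle" option the credit
 * is halved.  A task is never moved backwards, so a short sleeper simply
 * keeps its own vruntime.
 */
static int64_t place_sleeper(int64_t se_vruntime, int64_t min_vruntime,
                             bool gentle)
{
    int64_t thresh = SCHED_LATENCY_NS;
    int64_t placed;

    if (gentle)
        thresh >>= 1;               /* halve the sleeper credit */

    placed = min_vruntime - thresh;
    return se_vruntime > placed ? se_vruntime : placed;
}

/*
 * Model of the wakeup preemption check: the wakee preempts only if it is
 * behind the running task by more than the wakeup granularity.
 */
static bool should_preempt(int64_t curr_vruntime, int64_t wakee_vruntime)
{
    int64_t vdiff = curr_vruntime - wakee_vruntime;

    return vdiff > WAKEUP_GRANULARITY_NS;
}
```

In this model, a wakee that slept long enough lands at min_vruntime minus
the (halved) credit. With the assumed numbers that is 3ms behind, which
beats the 1ms granularity, so it preempts a current task sitting at
min_vruntime. When every task in the box sleeps the same way, every wakeup
earns that same credit, so the check says nothing useful about who
deserves the CPU.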

-Mike
