Houston, I think we have a problem

From: Mike Galbraith (efault@gmx.de)
Date: Sun Apr 27 2003 - 05:52:49 EST


<SQUEAK! SQUEAK! SQUEAK!>

Hi Folks,

I don't generally squeak unless I'm pretty darn sure I see a genuine
problem. I think I see one right now, so here I am squeaking my little
lungs out ;-) Perhaps I'm being stupid, and if that's the case, someone
please apply a size 15EE boot vigorously to my tail-feathers (jump-start
brain), and I'll shut up.

The problem I see is terrible terrible semaphore starvation. It comes in
two varieties, and might apply to other locks as well [1]. Variety 1 is
owners of semaphores being sent off to the expired array, which happens
with remarkable frequency. This variant is the lesser of the two evils,
because here at least you have _some_ protection via EXPIRED_STARVING(),
even if you have interactive tasks doing round robin. The worst variant is
when you have a steady stream of tasks being upgraded to TASK_INTERACTIVE()
while someone of low/modest priority has a semaphore downed... the poor guy
can (seemingly) wait for _ages_ to get a chance to release it, and will
starve all comers in the meantime. I regularly see a SCHED_RR, mlockall()'d
vmstat stall for several seconds, and _sometimes_ my poor little box goes
utterly insane and stalls vmstat for over a MINUTE [2].

To reproduce this 100% of the time, simply compile virgin 2.5.68
up/preempt, reduce your ram to 128mb, and using gcc-2.95.3 so as not to
overload the vm, run a make -j30 bzImage in an ext3 partition on a P3/500
single ide disk box. No, you don't really need to meet all of those
restrictions... you'll see the problem on a big hairy chested box as well,
just not as bad as I see it on my little box. The first symptom of the
problem you will notice is a complete lack of swap activity, along with
quantities of unused ram that would be highly improbable were all those
hungry cc1's getting regular CPU feedings.
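The recipe above boils down to something like the following. The command names are real, but the mem= boot parameter and the paths are illustrative, and it obviously assumes a 2003-vintage tree and toolchain:

```shell
# Boot the freshly built 2.5.68 (CONFIG_PREEMPT=y, uniprocessor)
# with RAM capped, e.g. via the boot loader:
#   kernel /boot/vmlinuz-2.5.68 mem=128M ...
# Then, in an ext3 partition on the single-disk box:
cd /usr/src/linux-2.5.68
make CC=gcc-2.95.3 -j30 bzImage &
# Watch for the telltale stalls: multi-second gaps in vmstat's
# once-per-second output, no swap traffic, and free RAM that
# should not exist under a -j30 load.
vmstat 1
```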

If the huge increase in hold time (induced by a stream of elevated-priority
tasks, which may even achieve their elevated status via _one_ wakeup) is the
desired behavior now, so be it. If that's the case, someone please say so,
that I may cease and desist fighting with the dang thing. I'm having lots
of fun mind you, but testing is supposed to be mind-numbingly boring ;-)

Anyway, grep for pid:prio pair 301:-2 in the attached log to see vmstat
being nailed for over 8 seconds. Then, grep for pid:prio pair 1119:23 to
see a task holding up a parade for 7 seconds. The patch I used to generate
this log is also attached for idiot-reproachment purposes.

(um, don't anyone try running it on an SMP or NUMA beast [those folks would
surely know better, but...] as it's highly likely to explode violently)

        halbaderi,

        -Mike

1. I'm pretty sure it does... might really be that Heisenberg fellow
messing with me again.

2. The 100% simple and effective way to "fix" this problem for this
workload is to "just say no" to coughing up more than HZ worth of cpu time
in activate_task(). This seems perfectly obvious and correct to me... though
I'll admit it would seem much _more_ perfectly obvious and correct if
MAX_SLEEP_AVG were 11000 instead of 10000... or maybe even
40000. Whatever. I posted one X-patch that worked pretty darn well, but
nobody tried it. Not even the folks who were _griping_ about
interactivity, fairness and whatnot. How boring.

btw, what happens when kjournald yields and goes off to expired land? see
log2.txt









This archive was generated by hypermail 2b29 : Wed Apr 30 2003 - 22:00:26 EST