Re: [ANNOUNCE] 3.0.14-rt31

From: Tim Sander
Date: Thu Jan 12 2012 - 11:58:13 EST


Hi Steven

Thanks for your reply.

> > I have just tested 3.0.14 with some local adaptations. Unfortunately there
> > are still two errors here:
> > Reboot (from upstart) sometimes fails with the following message:
> > "reboot: Unable to execute shutdown: Bad address"
>
> What's the bad address? Was there a kernel oops?
Well, I traced it down to the error message printed by upstart's reboot
(upstart-1.3/util/reboot.c:211). As far as the developer who reported it
remembers, it was presumably a "bad page" error. Unfortunately this error does
not happen very often. I am not aware of a kernel oops.
(My other mail to this list: https://lkml.org/lkml/2011/12/7/657)
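"Bad address" is the C library's message for errno EFAULT. The sketch below shows one way to record the exact errno when exec'ing shutdown fails, combined with the bounded retry considered as a workaround in the original report. It is a hypothetical helper, not upstart's actual code: it retries only on EFAULT, so unrelated errors are reported at once instead of being masked, and it logs every failed attempt so retries are not silent.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper (not upstart's code). execv() only returns on failure.
 * Retry up to max_attempts times, but only while the failure is EFAULT
 * ("Bad address"); any other errno stops the loop immediately.
 * Returns the last errno and leaves a description of it in msg. */
int exec_with_retry(const char *path, char *const argv[], int max_attempts,
                    char *msg, size_t len)
{
    int err = 0;
    if (max_attempts < 1)
        max_attempts = 1;
    for (int attempt = 1; attempt <= max_attempts; attempt++) {
        execv(path, argv);
        err = errno; /* save it before any other library call can change it */
        snprintf(msg, len, "execv(%s) attempt %d failed: errno=%d (%s)",
                 path, attempt, err, strerror(err));
        fprintf(stderr, "%s\n", msg); /* keep a trace of every failure */
        if (err != EFAULT)
            break;
    }
    return err;
}
```

In a reboot path this would be called with the shutdown binary (for example an assumed /sbin/shutdown) and a small attempt count; if the function returns at all, every attempt failed and msg holds the last errno.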

> > This problem can probably be worked around easily by catching a failed
> > execution and retrying, but I am afraid that execution fails more often in
> > other places and leads to silent functionality failures.
> >
> > and a runaway ksoftirqd/0, most probably after the kernel message:
> > "sched: RT throttling activated"
>
> Hmm, that's not good. It means that an RT task is spinning too much.
Hmm, sorry, I was too terse on that. This only happens on the first boot after
a UBIFS update, but it shows that there seems to be a corner case when
throttling is activated, since this seems to be the reason for ksoftirqd/0
using as much CPU as it can get. For now I have just patched out the switch to
RT throttling, and I will ask the MTD guys about the work they presumably do in
interrupt context, which causes this throttling in the first place.
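RT throttling is controlled by two sysctls, /proc/sys/kernel/sched_rt_period_us and /proc/sys/kernel/sched_rt_runtime_us (defaults 1000000 and 950000, i.e. realtime tasks may use 95% of each period before "sched: RT throttling activated" is printed). Writing -1 to sched_rt_runtime_us disables throttling at runtime, which may be a less invasive way to experiment than patching the kernel. A minimal sketch for reading and interpreting these values; both helper names are hypothetical.

```c
#include <stdio.h>

/* Read a single integer from a sysctl file such as
 * /proc/sys/kernel/sched_rt_runtime_us. Returns 0 on success, -1 on error. */
int read_long(const char *path, long *out)
{
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    int ok = (fscanf(f, "%ld", out) == 1);
    fclose(f);
    return ok ? 0 : -1;
}

/* Percentage of each period that realtime tasks may consume before they are
 * throttled. A runtime of -1 means throttling is disabled (reported as 100).
 * Returns -1.0 for an invalid (non-positive) period. */
double rt_bandwidth_percent(long runtime_us, long period_us)
{
    if (period_us <= 0)
        return -1.0;
    if (runtime_us < 0)
        return 100.0;
    return 100.0 * (double)runtime_us / (double)period_us;
}
```

On a target, read_long() would be pointed at the two files above and the results passed to rt_bandwidth_percent(); with the defaults this yields 95.0.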

> > It also seems as if the system looks up after running ifconfig. But the
> > error seems to show up mostly when I am not
> > around.
>
> s/looks/locks/ ?
Oops, yes.

> If it happens after ifconfig, then obviously that looks to be something
> to do with either the network driver or the network stack.
Strangely, top and dmesg still work. We have a second, avahi autoip network
interface (eth0:avahi). Perhaps there is something in this code path.

> But there's really nothing I can do to look into this without more
> information.
It would be good if I could get some advice on how to get useful information
out of the system to pinpoint these errors (e.g. special SysRq requests or
something like that). For the reboot case I will try to find the exact return
value on failure, but for the ksoftirqd/0 case I am currently only working
around the problem (by patching out the switch to throttling), and I don't see
a way to tackle the root cause of this behaviour.
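Since top and dmesg still respond when the system hangs, one low-effort way to collect state is the magic SysRq interface: writing a command key to /proc/sysrq-trigger (requires root and CONFIG_MAGIC_SYSRQ) makes the kernel print the requested dump to the kernel log, readable with dmesg. Key 't' lists all tasks and their state, 'w' lists tasks blocked in uninterruptible sleep, and 'l' prints backtraces of the active CPUs where the architecture supports it, which should show what ksoftirqd/0 is busy with. A minimal sketch; the helper names are hypothetical, and from a shell the equivalent is echo t > /proc/sysrq-trigger.

```c
#include <stdio.h>

/* Write one SysRq command key to the trigger file. On a real system the
 * path is /proc/sysrq-trigger; the resulting dump goes to the kernel log.
 * Returns 0 on success, -1 on failure. */
int sysrq_trigger(const char *path, char key)
{
    FILE *f = fopen(path, "w");
    if (!f)
        return -1;
    int rc = (fputc(key, f) == EOF) ? -1 : 0;
    if (fclose(f) != 0)
        rc = -1;
    return rc;
}

/* Hypothetical convenience helper: request the dumps most relevant to a
 * lock-up or a runaway softirq thread: all tasks ('t'), blocked tasks ('w')
 * and per-CPU backtraces ('l'). Returns the number of keys that failed. */
int dump_debug_state(const char *path)
{
    const char keys[] = { 't', 'w', 'l' };
    int failures = 0;
    for (size_t i = 0; i < sizeof keys; i++)
        if (sysrq_trigger(path, keys[i]) != 0)
            failures++;
    return failures;
}
```

Calling dump_debug_state("/proc/sysrq-trigger") while ksoftirqd/0 is spinning, then saving the dmesg output, would give a snapshot that could be attached to a follow-up mail.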

Best regards
Tim

Please ignore:

Hottinger Baldwin Messtechnik GmbH, Im Tiefen See 45, 64293 Darmstadt, Germany | www.hbm.com

Registered as GmbH (German limited liability corporation) in the commercial register at the local court of Darmstadt, HRB 1147
Company domiciled in Darmstadt | CEO: Andreas Huellhorst | Chairman of the board: James Charles Webster

The information in this email is confidential. It is intended solely for the addressee. If you are not the intended recipient, please let me know and delete this email.