Re: 2.6.25-rc5-mm1: "consolechars" hangs on boot

From: Oleg Nesterov
Date: Sat Mar 15 2008 - 07:59:23 EST


On 03/14, Laurent Riffard wrote:
>
> >>>With 2.6.25-rc5-mm1, my system (Ubuntu 7.10/Gutsy) reliably hangs on
> >>>boot. Sysrq-T shows 12 "consolechars" processes stuck in do_exit call.
> >>>
> >>>The bisection said "Sucker is
> >>>patches/signals-send_signal-factor-out-signal_group_exit-checks.patch"
> >>>
> consolechars ? de8925bc 3432 2795 1
> .
> .
> .
> Call Trace:
> do_exit+0x5dd/0x5e1
> do_group_exit+0x5e/0x86
> sys_exit_group+0xf/0x11
> sysenter_past_esp+0x5f/0xa5

Aha, the task doesn't hang, it has exited (as expected), please see below...

> >And. While doing this patch I forgot we should fix the bugs with init
> >first!
> >(will try to make the patch soon).
> >
> >Laurent, any chance you can try 2.6.25-rc5-mm1 + the patch below?
> >Unlikely it can help, but would be great to be sure.
>
> Yes it does help ! Thanks.
>
> Despite a big ERR in dmesg, the system now runs fine.
>
> [ 26.780261] ERR!! init is killed by 10

Great. Thanks a lot Laurent!

So what happens is:

We have the very old bug (bugs, actually) with the global init && signals
which I tried to fix many times but can't find a simple solution. The fatal
signal sent to init doesn't really kill it (we have the check in
get_signal_to_deliver) but it sets SIGNAL_GROUP_EXIT. This is wrong, now
init can't exec, this has other bad implications, and this is just insane.

With the signals-send_signal-factor-out-signal_group_exit-checks.patch the
task with SIGNAL_GROUP_EXIT doesn't recieve the signals. While this change
itself is (I hope) correct, the "killed" /sbin/init now can't see SIGCHLD
and the system hangs on boot.

> [ 26.781709] [<c0120fa9>] complete_signal+0x163/0x1eb
> [ 26.781719] [<c01211d4>] send_signal+0x1a3/0x1cf
> [ 26.781729] [<c0121216>] __group_send_sig_info+0xa/0xc
> [ 26.781737] [<c01217cc>] group_send_sig_info+0x44/0x62
> [ 26.781747] [<c0121de4>] kill_pid_info+0x33/0x47
> [ 26.781757] [<c0122443>] sys_kill+0x73/0x145
> [ 26.781767] [<c014c655>] ? handle_mm_fault+0x21d/0x4f6
> [ 26.781791] [<c012af3c>] ? up_read+0x16/0x2a
> [ 26.781803] [<c011214c>] ? do_page_fault+0x25a/0x4da
> [ 26.781815] [<c0103906>] sysenter_past_esp+0x5f/0xa5

Not a kernel problem, but this looks a bit strange to me.

init has SIG_DFL for SIGUSR1, and someone does kill(1, SIGUSR1).
Note that init was explicitly targeted, the signal was not sent
to prgp or -1.

Most likely Ubuntu knows what it does, and I can't find any email
at ubuntu.com to cc...

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/