Re: Dumb question: BKL on reboot ?

From: Andrea Arcangeli
Date: Sun Aug 24 2003 - 16:46:39 EST


Hi Hannes,

On Fri, Aug 22, 2003 at 08:39:03AM +0200, Hannes Reinecke wrote:
> Agreed, this smp_processor_id() == 0 thing is interesting. I'll try you
> suggestion and see how far I'll progress.

Ulrich pointed me out that only cpu0 can call into the VM (thanks!), and
in turn the s390 code I tocuehd was infact correct (despite it looked
suspect given the trace and the special ==0 case). After a closer
inspection my (last ;) conclusion is this (cut-and-past from bugzilla)

----------------------
[..]
But the data.started counter is already enforcing a good enough
guarantee, that the CPU1 will execute
"signal_processor(smp_processor_id(), sigp_stop);" only when the CPU0 is
already executing inside the IPI handler. So I can't imagine the IPI on
CPU0 getting lost.

the last explanation I can think, is that the CPU0 executes the IPI,
waits for cpu_restart_map to become 0 (i.e. CPU1 offline), and
eventually executes this code correctly:

do_store_status();
/*
* Finally call reipl. Because we waited for all other
* cpus to enter this function we know that they do
* not hold any s390irq-locks (the cpus have been
* interrupted by an external interrupt and s390irq
* locks are always held disabled).
*/
reipl(S390_lowcore.ipl_device);
}
signal_processor(smp_processor_id(), sigp_stop);

but the box doesn't reboot and the CPU0 returns from the IPI and goes
idle until kupdate kicks in (finds the first idle cpu and tries to take
the big kernel lock that is correctly held by CPU1 in sys_reboot).

This seems a bug in the s390 lowlevel code above, it just doesn't reboot
the machine.
----------------------

Maybe it's related to signal_processor(smp_processor_id(), sigp_stop)
failing inside IPI handlers or whatever similar arch specific, dunno.

The patch removing the BKL from sys_reboot shouldn't help either if my
above theory is correct: when the IPI runs in cpu0, it's not even
running on top of the BKL. It's just that the machine not rebooting,
eventually will have the first idle cpu (cpu0) execute kupdate in mean
in 2.5 sec, and it will eventually go in the state you found in lkcd
since the BKL is held by cpu1 in sys_reboot that is already offline (no
valid EIP in lkcd).

So if you want to test, probably one interesting info you could generate
to confirm my above theory, is to reproduce the very same deadlock even
with the BKL removal patch applied to sys_reboot. Not sure how easy it
is to reproduce it. If you can hang it again, it would not deadlock in
kupdate anymore: you should find cpu0 stuck in the idle loop instead of
sync_old_buffers.

It maybe something slightly different going wrong too, but still I'm
convinced the lock_kernel in sys_reboot is absoltely innocent and needed
(at least in 2.4).

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/