Re: [PATCH v7 2/2] x86 mce: use new printk recursion disablinginterface

From: Ingo Molnar
Date: Mon Jun 25 2012 - 05:18:18 EST



* ShuoX Liu <shuox.liu@xxxxxxxxx> wrote:

> From: ShuoX Liu <shuox.liu@xxxxxxxxx>
>
> On x86 machines, some times MCE happens just when kernel calls printk
> to output some log info to serial console, while usually MCE module in
> kernel is used to print out some hardware error information, such like
> bad cache or bad memory bank. That causes printk recursion and printk
> would omit MCE printk output.
>
> We hit it when running MTBF testing on Android ATOM mobiles.
>
> Here in print_mce, we choose to disable printk recursion to make sure
> MCE logs printed out.
>
> Signed-off-by: Yanmin Zhang <yanmin_zhang@xxxxxxxxxxxxxxx>
> Signed-off-by: ShuoX Liu <shuox.liu@xxxxxxxxx>
> ---
> arch/x86/kernel/cpu/mcheck/mce.c | 6 +++++-
> 1 files changed, 5 insertions(+), 1 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
> index 2afcbd2..6056e94 100644
> --- a/arch/x86/kernel/cpu/mcheck/mce.c
> +++ b/arch/x86/kernel/cpu/mcheck/mce.c
> @@ -242,6 +242,7 @@ static void print_mce(struct mce *m)
> {
> int ret = 0;
>
> + printk_recursion_check_disable();
> pr_emerg(HW_ERR "CPU %d: Machine Check Exception: %Lx Bank %d: %016Lx\n",
> m->extcpu, m->mcgstatus, m->bank, m->status);
>
> @@ -275,10 +276,13 @@ static void print_mce(struct mce *m)
> * (if the CPU has an implementation for that)
> */
> ret = atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, m);
> - if (ret == NOTIFY_STOP)
> + if (ret == NOTIFY_STOP) {
> + printk_recursion_check_enable();
> return;
> + }
>
> pr_emerg_ratelimited(HW_ERR "Run the above through 'mcelog --ascii'\n");
> + printk_recursion_check_enable();

Ok, this looks useful and it solves a real problem, but I'd
prefer a better interface: instead of exposing the guts of
printk to drivers in an unsafe manner (and allowing them to keep
printk in an unsafe state indefinitely), shouldn't we instead
introduce printk_emergency() (and variants) that just disable
the recursion check, do the printk and then enable them?

That way drivers cannot possibly leave the recursion check
disabled permanently and it would also be much more obvious
*which* actual printk sites are affected by this exception.

In theory this could be achieved via a new, super-high-prio
printk level: KERN_CRASH or so, which the core printk code could
check - without introducing a new side-facility.

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/