Re: [tip:core/debug] debug lockups: Improve lockup detection

From: Ingo Molnar
Date: Sun Aug 02 2009 - 15:27:40 EST



* Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> wrote:

> On Sun, 2 Aug 2009 13:09:34 GMT tip-bot for Ingo Molnar <mingo@xxxxxxx> wrote:
>
> > Commit-ID: c1dc0b9c0c8979ce4d411caadff5c0d79dee58bc
> > Gitweb: http://git.kernel.org/tip/c1dc0b9c0c8979ce4d411caadff5c0d79dee58bc
> > Author: Ingo Molnar <mingo@xxxxxxx>
> > AuthorDate: Sun, 2 Aug 2009 11:28:21 +0200
> > Committer: Ingo Molnar <mingo@xxxxxxx>
> > CommitDate: Sun, 2 Aug 2009 13:27:17 +0200
> >
> > --- a/drivers/char/sysrq.c
> > +++ b/drivers/char/sysrq.c
> > @@ -24,6 +24,7 @@
> > #include <linux/sysrq.h>
> > #include <linux/kbd_kern.h>
> > #include <linux/proc_fs.h>
> > +#include <linux/nmi.h>
> > #include <linux/quotaops.h>
> > #include <linux/perf_counter.h>
> > #include <linux/kernel.h>
> > @@ -222,12 +223,7 @@ static DECLARE_WORK(sysrq_showallcpus, sysrq_showregs_othercpus);
> >
> > static void sysrq_handle_showallcpus(int key, struct tty_struct *tty)
> > {
> > -	struct pt_regs *regs = get_irq_regs();
> > -	if (regs) {
> > -		printk(KERN_INFO "CPU%d:\n", smp_processor_id());
> > -		show_regs(regs);
> > -	}
> > -	schedule_work(&sysrq_showallcpus);
> > +	trigger_all_cpu_backtrace();
> > }
>
> I think this just broke all non-x86 non-sparc SMP architectures.

Yeah, it 'broke' them only in the sense that they never had a working
trigger_all_cpu_backtrace() implementation in the first place. (That
already breaks or degrades spinlock debugging on those architectures,
so it's an existing problem.)

One solution would be a generic trigger_all_cpu_backtrace()
implementation that falls back to the schedule_work() approach above.
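A minimal userspace model of that kind of fallback, in C for illustration only. It assumes trigger_all_cpu_backtrace() reports whether an architecture-specific (NMI-based) mechanism handled the request, and that the caller otherwise falls back to the IPI/workqueue path. The model_* names, the CPU count, and the irqs_off/arch_has_nmi_backtrace flags are hypothetical; none of this is actual kernel API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

#define MODEL_NR_CPUS 4 /* hypothetical CPU count for this model */

/* Hypothetical per-CPU state: a CPU spinning with interrupts off cannot
 * service an ordinary IPI, but an NMI still gets through. */
static bool irqs_off[MODEL_NR_CPUS];
static bool dumped[MODEL_NR_CPUS];

/* Models whether the architecture provides an NMI-based backtrace. */
static bool arch_has_nmi_backtrace;

static void dump_cpu(int cpu)
{
	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	dumped[cpu] = true;
	printf("CPU%d: backtrace\n", cpu);
}

/* Returns false when no arch-specific mechanism exists, so the caller
 * can fall back instead of silently printing nothing. */
static bool model_trigger_all_cpu_backtrace(void)
{
	if (!arch_has_nmi_backtrace)
		return false;
	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		dump_cpu(cpu);		/* NMIs reach every CPU */
	return true;
}

/* Stand-in for the schedule_work()/IPI path: the requesting CPU dumps
 * itself; other CPUs respond only if their interrupts are enabled. */
static void model_ipi_fallback(int self)
{
	dump_cpu(self);
	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		if (cpu != self && !irqs_off[cpu])
			dump_cpu(cpu);
}

/* Modelled sysrq handler: try the arch mechanism, else fall back. */
static void model_sysrq_showallcpus(int self)
{
	if (!model_trigger_all_cpu_backtrace())
		model_ipi_fallback(self);
}

static int model_count_dumped(void)
{
	int n = 0;
	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		n += dumped[cpu];
	return n;
}

static void model_reset(void)
{
	arch_has_nmi_backtrace = false;
	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		irqs_off[cpu] = dumped[cpu] = false;
}
```

In this model a CPU spinning with interrupts disabled is skipped by the fallback path but still reached by the NMI path, which is why the NMI mechanism is preferable where an architecture has one, while the fallback keeps the other architectures from printing nothing.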

I never understood why we proliferated all these different
backtrace-triggering mechanisms instead of settling on one good
approach that everything uses.

> > static struct sysrq_key_op sysrq_showallcpus_op = {
> > diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> > index 7717b95..9c5fa9f 100644
> > --- a/kernel/rcutree.c
> > +++ b/kernel/rcutree.c
> > @@ -35,6 +35,7 @@
> > #include <linux/rcupdate.h>
> > #include <linux/interrupt.h>
> > #include <linux/sched.h>
> > +#include <linux/nmi.h>
> > #include <asm/atomic.h>
> > #include <linux/bitops.h>
> > #include <linux/module.h>
> > @@ -469,6 +470,8 @@ static void print_other_cpu_stall(struct rcu_state *rsp)
> > }
> > printk(" (detected by %d, t=%ld jiffies)\n",
> > smp_processor_id(), (long)(jiffies - rsp->gp_start));
> > + trigger_all_cpu_backtrace();
>
> Be aware that trigger_all_cpu_backtrace() is a PITA when you have
> a lot of CPUs.
>
> If a callsite is careful to ensure that the most important
> information is emitted last then that might improve things.
>
> otoh, log buffer overflow will truncate, I think. So that info
> needs to be emitted first too ;)
>
> It's a PITA.

Yeah, it is. I'd expect larger systems to have larger log buffers.
Getting no information at all was obviously the highest-priority
showstopper.
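To make the log-buffer concern concrete, here is a simplified userspace model of a circular log buffer that overwrites its oldest characters on overflow. This is an assumption about the printk buffer's behavior, modelled loosely on its log_start/log_end indices; on a real kernel the size comes from CONFIG_LOG_BUF_SHIFT or the log_buf_len= boot parameter. MODEL_LOG_BUF_LEN and the model_* functions are hypothetical. With such a buffer, a stall message printed before many CPUs' backtraces is what gets lost, while one printed last survives.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, deliberately tiny capacity so overflow is easy to see. */
#define MODEL_LOG_BUF_LEN 64

static char log_buf[MODEL_LOG_BUF_LEN];
static unsigned long log_start, log_end;	/* free-running indices */

/* Append one character; once more than MODEL_LOG_BUF_LEN characters
 * are pending, the oldest ones are dropped (log_start advances). */
static void emit_char(char c)
{
	log_buf[log_end % MODEL_LOG_BUF_LEN] = c;
	log_end++;
	if (log_end - log_start > MODEL_LOG_BUF_LEN)
		log_start = log_end - MODEL_LOG_BUF_LEN;
}

static void model_printk(const char *s)
{
	while (*s)
		emit_char(*s++);
}

/* Copy whatever survives in the buffer (oldest first) into out,
 * NUL-terminated and truncated to fit outlen. */
static void model_dmesg(char *out, size_t outlen)
{
	size_t n = 0;

	assert(out != NULL && outlen > 0);
	for (unsigned long i = log_start; i < log_end && n + 1 < outlen; i++)
		out[n++] = log_buf[i % MODEL_LOG_BUF_LEN];
	out[n] = '\0';
}

static void model_log_reset(void)
{
	memset(log_buf, 0, sizeof(log_buf));
	log_start = log_end = 0;
}
```

In this model, sizing the buffer to hold roughly nr_cpus times the length of one backtrace plus the header avoids the overwrite regardless of the order in which things are printed.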

Ingo