Re: Badness at kernel/rcutree.c:1228

From: kordex -
Date: Wed Dec 09 2009 - 13:36:47 EST


I will turn debugging options on after
http://lkml.org/lkml/2009/12/9/44 gets traced down so I can do that.

--Mikko Kortelainen

2009/12/9 Paul E. McKenney <paulmck@xxxxxxxxxxxxxxxxxx>:
> In that case, the best thing would be to drop the warning into the
> beginnings and ends of processing for system calls used by your workload.
> The hope would be to find it triggering at the end of a given syscall,
> permitting you to binary search the intervening code.
>
> Given that I cannot reproduce this, I cannot do much more than to offer
> random hints.
>
> Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â ÂThanx, Paul
>
> On Wed, Dec 09, 2009 at 07:35:44PM +0200, kordex - wrote:
>> It did actually show the Badness after system had been running a long
>> time. And this cut from it shows that system was fully done kernel
>> init routines as there is ntpd running:
>>
>> warning: `ntpd' uses 32-bit capabilities (legacy support in use)
>> ------------[ cut here ]------------
>> Badness at kernel/rcutree.c:1228
>> NIP: c004ecbc LR: c004f14c CTR: c007bd70
>> REGS: df34dde0 TRAP: 0700 Â Not tainted Â(2.6.32)
>>
>> --Mikko Kortelainen
>>
>> 2009/12/9 Paul E. McKenney <paulmck@xxxxxxxxxxxxxxxxxx>:
>> > Hmmm... ÂDidn't the first console output you sent me show the beenonline
>> > WARN_ON_ONCE() triggering during late boot? ÂYes, you had other failures
>> > later, but it might be that whatever is triggering this warning is
>> > related to those failures, right?
>> >
>> > Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â ÂThanx, Paul
>> >
>> > On Wed, Dec 09, 2009 at 04:57:54PM +0200, kordex - wrote:
>> >> Sorry but,
>> >>
>> >> Where actually this "down nearer to the point in the boot-up sequence"
>> >> would be as I encountered the errors while the system was running (had
>> >> been for days).
>> >>
>> >> --Mikko Kortelainen
>> >>
>> >> 2009/12/9 Paul E. McKenney <paulmck@xxxxxxxxxxxxxxxxxx>:
>> >> > On Wed, Dec 09, 2009 at 11:15:17AM +0200, kordex - wrote:
>> >> >> Hey,
>> >> >>
>> >> >> I hope it's in the right place.
>> >> >
>> >> > Looks fine to me.
>> >> >
>> >> > And the fact that you did -not- see anything in your dmesg indicates
>> >> > that the beenonline fields are set correctly at that point, as expected.
>> >> > You will only see a complaint if the beenonline fields have been
>> >> > corrupted.
>> >> >
>> >> > Please move them down nearer to the point in the boot-up sequence where
>> >> > you were seeing the failure. ÂPlease note that interrupts had been on
>> >> > for one good long time when your original kernel complained, so there
>> >> > had been a very large number of executions with beenonline set
>> >> > correctly.
>> >> >
>> >> > So it will probably be faster to start at the original failure
>> >> > and move towards boot rather than vice versa.
>> >> >
>> >> > Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â ÂThanx, Paul
>> >> >
>> >> >> --Mikko Kortelainen
>> >> >>
>> >> >> navi:/usr/src# diff -Naur a/init/main.c b/init/main.c
>> >> >> --- a/init/main.c    2009-12-03 05:51:21.000000000 +0200
>> >> >> +++ b/init/main.c    2009-12-09 03:22:15.000000000 +0200
>> >> >> @@ -81,6 +81,9 @@
>> >> >> Â#include <asm/smp.h>
>> >> >> Â#endif
>> >> >>
>> >> >> +/* DEBUG STATEMENT 2009/12/08 */
>> >> >> +#include <linux/rcutree.h>
>> >> >> +
>> >> >> Âstatic int kernel_init(void *);
>> >> >>
>> >> >> Âextern void init_IRQ(void);
>> >> >> @@ -589,6 +592,10 @@
>> >> >> Â Â Â Â Â Â Â Â local_irq_disable();
>> >> >> Â Â Â Â }
>> >> >> Â Â Â Â rcu_init();
>> >> >> +
>> >> >> + Â Â Â /* DEBUG STATEMENT 2009/12/08 */
>> >> >> + Â Â Â WARN_ON_ONCE(rcu_check_beenonline());
>> >> >> +
>> >> >> Â Â Â Â /* init some links before init_ISA_irqs() */
>> >> >> Â Â Â Â early_irq_init();
>> >> >> Â Â Â Â init_IRQ();
>> >> >>
>> >> >>
>> >> >>
>> >> >> 2009/12/9 Paul E. McKenney <paulmck@xxxxxxxxxxxxxxxxxx>:
>> >> >> > On Wed, Dec 09, 2009 at 03:41:00AM +0200, kordex - wrote:
>> >> >> >> Hey,
>> >> >> >>
>> >> >> >> I put the debug function under init/main.c after rcu_init(); but there
>> >> >> >> is no output on dmesg which means that it receives zero value.
>> >> >> >>
>> >> >> >> Full dmesg: http://xnet.fi/opt/apps/lkml-2.6.32-vanilla.dmesg.rcu-init.txt
>> >> >> >
>> >> >> > Could you please send the patch you applied to, as you said, put the
>> >> >> > debug function under init/main.c after rcu_init()?
>> >> >> >
>> >> >> > Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â ÂThanx, Paul
>> >> >> >
>> >> >> >> --Mikko Kortelainen
>> >> >> >>
>> >> >> >> 2009/12/8 Paul E. McKenney <paulmck@xxxxxxxxxxxxxxxxxx>:
>> >> >> >> > On Tue, Dec 08, 2009 at 11:22:07AM -0800, Paul E. McKenney wrote:
>> >> >> >> >> At this point, I must defer to those more skilled than I at diagnosing
>> >> >> >> >> early-boot problems.
>> >> >> >> >
>> >> >> >> > Well, that is silly on my part -- the problem seems to appear late in
>> >> >> >> > boot, and you had no problem capturing that portion of the boot log.
>> >> >> >> >
>> >> >> >> > So please see below for a patch providing a rcu_check_beenonline()
>> >> >> >> > function that, when called after rcu_init(), returns non-zero if the
>> >> >> >> > beenonline fields have become corrupted. ÂSo put calls of the form:
>> >> >> >> >
>> >> >> >> > Â Â Â ÂWARN_ON_ONCE(rcu_check_beenonline());
>> >> >> >> >
>> >> >> >> > in the initialization code path preceding the problem. ÂEither #include
>> >> >> >> > rcupdate.h or explicitly declare the function as appropriate.
>> >> >> >> >
>> >> >> >> > Signed-off-by: Paul E. McKenney <paulmck@xxxxxxxxxxxxxxxxxx>
>> >> >> >> > ---
>> >> >> >> >
>> >> >> >> > diff --git a/include/linux/rcutree.h b/include/linux/rcutree.h
>> >> >> >> > index 9642c6b..190a687 100644
>> >> >> >> > --- a/include/linux/rcutree.h
>> >> >> >> > +++ b/include/linux/rcutree.h
>> >> >> >> > @@ -39,6 +39,8 @@ extern int rcu_cpu_notify(struct notifier_block *self,
>> >> >> >> > Âextern int rcu_needs_cpu(int cpu);
>> >> >> >> > Âextern int rcu_expedited_torture_stats(char *page);
>> >> >> >> >
>> >> >> >> > +extern int rcu_check_beenonline(void);
>> >> >> >> > +
>> >> >> >> > Â#ifdef CONFIG_TREE_PREEMPT_RCU
>> >> >> >> >
>> >> >> >> > Âextern void __rcu_read_lock(void);
>> >> >> >> > diff --git a/kernel/rcutree.c b/kernel/rcutree.c
>> >> >> >> > index 207125b..27d3722 100644
>> >> >> >> > --- a/kernel/rcutree.c
>> >> >> >> > +++ b/kernel/rcutree.c
>> >> >> >> > @@ -77,6 +77,17 @@ DEFINE_PER_CPU(struct rcu_data, rcu_sched_data);
>> >> >> >> > Âstruct rcu_state rcu_bh_state = RCU_STATE_INITIALIZER(rcu_bh_state);
>> >> >> >> > ÂDEFINE_PER_CPU(struct rcu_data, rcu_bh_data);
>> >> >> >> >
>> >> >> >> > +/*
>> >> >> >> > + * Ad-hoc diagnostic function, for use only after rcu_init() has
>> >> >> >> > + * returned. ÂAssumes that the boot CPU is CPU 0. ÂAssumes that
>> >> >> >> > + * the kernel has been built with CONFIG_TREE_RCU. ÂNot for inclusion.
>> >> >> >> > + * Usage: "WARN_ON_ONCE(rcu_check_beenonline());"
>> >> >> >> > + */
>> >> >> >> > +int rcu_check_beenonline(void)
>> >> >> >> > +{
>> >> >> >> > + Â Â Â return !per_cpu(rcu_sched_data, 0).beenonline ||
>> >> >> >> > + Â Â Â Â Â Â Â!per_cpu(rcu_bh_data, 0).beenonline;
>> >> >> >> > +}
>> >> >> >> >
>> >> >> >> > Â/*
>> >> >> >> > Â* Return true if an RCU grace period is in progress. ÂThe ACCESS_ONCE()s
>> >> >> >> >
>> >> >> >
>> >> >
>> >
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/