Re: rcu self-detected stall messages on OMAP3, 4 boards

From: Paul E. McKenney
Date: Fri Sep 21 2012 - 20:05:32 EST


On Fri, Sep 21, 2012 at 10:41:14PM +0000, Paul Walmsley wrote:
> On Fri, 21 Sep 2012, Paul E. McKenney wrote:
>
> > On Fri, Sep 21, 2012 at 05:47:31PM +0000, Paul Walmsley wrote:
> >
> > > I built an OMAP kernel from Linus' commit
> > > 4651afbbae968772efd6dc4ba461cba9b49bb9d8 ("Merge branch 'for-3.6-fixes' of
> > > git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq"). The config was
> > > 'omap2plus_defconfig' with CONFIG_CPU_IDLE enabled by hand. I booted it
> > > on a Pandaboard (OMAP4430ES2) into a very minimal Debian rootfs.
> >
> > Did you have the patch at https://lkml.org/lkml/2012/8/30/290 applied?
>
> No, it's just as described above.
>
> > If not, could you please try it? (This patch cleared up a similar
> > problem for Becky, also on OMAP.)
>
> Did not seem to help, either with or without CONFIG_CPU_IDLE.

I was hoping! ;-)

And my init=/bin/sh kernel ran idle for more than an hour without
any RCU CPU stall warnings...

I am wondering if your system somehow figured out how to start a grace
period that had no RCU callbacks waiting for it. If that happened,
then a CONFIG_NO_HZ=y system could in theory get into a state where all
CPUs are in dyntick-idle mode, so that none of them is doing anything
to force the grace period to complete.
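
As a rough illustration of that state, here is a user-space toy model,
not kernel code. Every toy_* name and the CPU count are made up for the
sketch; only the gpnum, completed, and qlen field names mirror the real
structures. It flags the case where a grace period is in progress, no
callbacks are queued anywhere, and every CPU is dyntick-idle.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_NR_CPUS 2	/* hypothetical CPU count for this sketch */

/* Toy stand-in for the per-CPU RCU data. */
struct toy_rcu_data {
	long qlen;		/* callbacks queued on this CPU */
	bool dyntick_idle;	/* true if this CPU is in dyntick-idle mode */
};

/* Toy stand-in for the global RCU state. */
struct toy_rcu_state {
	unsigned long gpnum;		/* most recently started grace period */
	unsigned long completed;	/* most recently completed grace period */
	struct toy_rcu_data rda[TOY_NR_CPUS];
};

/* Sum the queued callbacks over all CPUs. */
static long toy_total_qlen(const struct toy_rcu_state *rsp)
{
	long totqlen = 0;

	for (size_t cpu = 0; cpu < TOY_NR_CPUS; cpu++)
		totqlen += rsp->rda[cpu].qlen;
	return totqlen;
}

/*
 * Return true if a grace period is in progress, no callbacks are
 * waiting for it, and every CPU is dyntick-idle, so that nothing in
 * this model would push the grace period to completion.
 */
static bool toy_gp_may_hang(const struct toy_rcu_state *rsp)
{
	if (rsp->gpnum == rsp->completed)
		return false;	/* no grace period in progress */
	if (toy_total_qlen(rsp) != 0)
		return false;	/* some CPU has callbacks waiting */
	for (size_t cpu = 0; cpu < TOY_NR_CPUS; cpu++)
		if (!rsp->rda[cpu].dyntick_idle)
			return false;	/* an active CPU can still make progress */
	return true;
}
```

In this model, a state with gpnum one ahead of completed, all qlen
values zero, and all CPUs idle makes toy_gp_may_hang() return true;
queuing a callback or waking any CPU makes it return false.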

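That should be easy to diagnose, anyway. Please try the patch below,
which includes the earlier diagnostic patch. With it applied, the stall
warnings also print the grace-period numbers (g=, c=) and the total
number of queued callbacks (q=).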

Thanx, Paul

------------------------------------------------------------------------

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 307caf1..696f189 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -879,6 +879,7 @@ static void print_other_cpu_stall(struct rcu_state *rsp)
 	unsigned long flags;
 	int ndetected = 0;
 	struct rcu_node *rnp = rcu_get_root(rsp);
+	long totqlen = 0;

 	/* Only let one CPU complain about others per time interval. */

@@ -923,8 +924,11 @@ static void print_other_cpu_stall(struct rcu_state *rsp)
 	raw_spin_unlock_irqrestore(&rnp->lock, flags);

 	print_cpu_stall_info_end();
-	printk(KERN_CONT "(detected by %d, t=%ld jiffies)\n",
-	       smp_processor_id(), (long)(jiffies - rsp->gp_start));
+	for_each_possible_cpu(cpu)
+		totqlen += per_cpu_ptr(rsp->rda, cpu)->qlen;
+	pr_cont("(detected by %d, t=%ld jiffies, g=%lu, c=%lu, q=%lu)\n",
+	       smp_processor_id(), (long)(jiffies - rsp->gp_start),
+	       rsp->gpnum, rsp->completed, totqlen);
 	if (ndetected == 0)
 		printk(KERN_ERR "INFO: Stall ended before state dump start\n");
 	else if (!trigger_all_cpu_backtrace())
@@ -939,8 +943,10 @@ static void print_other_cpu_stall(struct rcu_state *rsp)

 static void print_cpu_stall(struct rcu_state *rsp)
 {
+	int cpu;
 	unsigned long flags;
 	struct rcu_node *rnp = rcu_get_root(rsp);
+	long totqlen = 0;

 	/*
 	 * OK, time to rat on ourselves...
@@ -951,7 +957,10 @@ static void print_cpu_stall(struct rcu_state *rsp)
 	print_cpu_stall_info_begin();
 	print_cpu_stall_info(rsp, smp_processor_id());
 	print_cpu_stall_info_end();
-	printk(KERN_CONT " (t=%lu jiffies)\n", jiffies - rsp->gp_start);
+	for_each_possible_cpu(cpu)
+		totqlen += per_cpu_ptr(rsp->rda, cpu)->qlen;
+	pr_cont(" (t=%lu jiffies g=%lu c=%lu q=%lu)\n",
+		jiffies - rsp->gp_start, rsp->gpnum, rsp->completed, totqlen);
 	if (!trigger_all_cpu_backtrace())
 		dump_stack();
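
For anyone reading the resulting console output, here is a small
user-space sketch of one way to interpret the new fields. It is not part
of the patch: classify_stall_line() and the enum names are hypothetical,
and it handles only the self-detected format printed by
print_cpu_stall() above. It assumes g != c means a grace period is in
progress, in which case q == 0 would be consistent with the
no-callbacks theory.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical classification of one self-detected stall line. */
enum stall_hint {
	STALL_PARSE_ERROR,		/* line does not carry the new fields */
	STALL_NO_GP,			/* g == c: no grace period in progress */
	STALL_GP_NO_CALLBACKS,		/* g != c and q == 0 */
	STALL_GP_WITH_CALLBACKS,	/* g != c and q > 0 */
};

/*
 * Find the "(t=... g=... c=... q=...)" tail in a console line and
 * classify it. The format string mirrors the pr_cont() call in
 * print_cpu_stall(): t = jiffies since the grace period started,
 * g = rsp->gpnum, c = rsp->completed, q = total queued callbacks.
 */
static enum stall_hint classify_stall_line(const char *line)
{
	unsigned long t, g, c, q;
	const char *tail = strstr(line, "(t=");

	if (!tail)
		return STALL_PARSE_ERROR;
	if (sscanf(tail, "(t=%lu jiffies g=%lu c=%lu q=%lu)",
		   &t, &g, &c, &q) != 4)
		return STALL_PARSE_ERROR;
	if (g == c)
		return STALL_NO_GP;
	return q == 0 ? STALL_GP_NO_CALLBACKS : STALL_GP_WITH_CALLBACKS;
}
```

A line that ends in "q=0" with differing g= and c= values would come
back as STALL_GP_NO_CALLBACKS. The old-format message, which lacks the
g=, c=, and q= fields, is reported as STALL_PARSE_ERROR.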

