Re: rcu_sched stalls in idle task introduced in pre-4.8?

From: Paul E. McKenney
Date: Tue Aug 02 2016 - 18:08:57 EST


On Tue, Aug 02, 2016 at 03:20:37PM -0400, Rich Felker wrote:
> On Tue, Aug 02, 2016 at 11:16:36AM -0700, Paul E. McKenney wrote:
> > On Tue, Aug 02, 2016 at 01:04:15PM -0400, Rich Felker wrote:
> > > Hi Paul,
> > >
> > > As I mentioned on Twitter, I'm experiencing (otherwise benign) rcu
> > > stall messages since rebasing my tree on Linus's in-progress merge for
> > > 4.8, as part of updating both the arch/sh tree and drivers going
> > > upstream through other maintainers to account for changes made to
> > > infrastructure they depend on. The revisions I've tried rebasing on
> > > are 194dc870a5890e855ecffb30f3b80ba7c88f96d6 and
> > > 7f155c702677d057d03b192ce652311de5434697 (now using the latter).
> >
> > Very helpful, thank you!
> >
> > Oh, and you can boot with rcupdate.rcu_cpu_stall_suppress to prevent
> > RCU CPU stall warnings from appearing, though that of course won't
> > help with whatever the underlying problem might be.
>
> Indeed, but I have found them quite useful in the past for tracking
> down actual deadlocks, so it's nice to keep them enabled.

Glad to hear that they have been helpful in the past!

> > > The system I'm testing on is an SMP configuration of the J2 SoC on
> > > FPGA (see: j-core.org), and the SMP support and timer driver (most
> > > relevant, imo) are new, but never exhibited this problem on 4.6 or
> > > 4.7. A tree with all drivers merged is available in the development
> > > branch of git://git.libc.org/linux-sh; it's regularly rebased but
> > > there's a perma-tag devel-20160801 you can check out instead. Fetching
> > > into an existing clone of Linus's tree is recommended since the server
> > > has limited resources.
> > >
> > > I doubt the code from my devel branch will be useful without hardware
> > > to test on, though. Hopefully the issue is reproducible on other
> > > systems. Very low cpu speed (50 MHz) could be a factor. The problem
> > > seems to go away with dyntick disabled.
> >
> > Indeed, it is quite possible that RCU's self-tuning needs a bit of help
> > for CPUs below a few hundred MHz. As might other pieces of the kernel.
> > Should not be a big deal (famous last words...). However, given that
> > you are running HZ=100, it should be reasonably close. So first, let's
> > see what is going on here...
> >
> > The following dmesg line usually indicates that RCU's kernel thread is
> > stuck for some reason.
> >
> > [42039.188293] rcu_sched kthread starved for 2152 jiffies! g70249 c70248 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x1
> >
> > The RCU_GP_WAIT_FQS indicates that it is waiting for a timer to expire.
> > The task is in TASK_INTERRUPTIBLE state, but it should have been awakened
> > after only a few jiffies sleep time. The stack trace looks like there
> > might be trouble canceling the timer.
>
> Can you offer any advice on tracking down what it's waiting for, or
> what's going on with the timer? Is any arch- or clockevent-specific
> stuff that might be my fault involved here?

It is in an swait_event_interruptible_timeout(), most likely waiting on
a one-jiffy timer, as in __mod_timer(). If I understood your earlier
email correctly, setting CONFIG_HZ_PERIODIC avoided this problem.
If my understanding is correct, it might be worth reviewing any timer
code that might be pushed along by the once-per-jiffy scheduler tick.
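
Roughly speaking (a simplified sketch of the 4.8-era wait in
kernel/rcu/tree.c, with surrounding details omitted), the grace-period
kthread does:

	/* Sketch: wait up to j jiffies for an FQS request or the timeout. */
	rsp->gp_state = RCU_GP_WAIT_FQS;
	ret = swait_event_interruptible_timeout(rsp->gp_wq,
			rcu_gp_fqs_check_wake(rsp, &gf), j);

where j is derived from rcutree.jiffies_till_first_fqs, so everything
hinges on that short timer actually firing and waking the kthread up.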

Does the problem reproduce easily? A bisection might be very helpful.

> > Is it possible that the timer
> > handler is being prevented from making progress due to heavy interrupt
> > load or some such?
>
> Actually the stall messages are only appearing when the system is idle
> and interrupts are infrequent. That's what leads me to believe it
> might be a situation where, on every (infrequent) wake, both the rcu
> task and something blocking its progress happen at roughly the same
> time, and the slowness of the cpu leads the rcu task to decide to wait
> for the next timer/schedule event and try again -- over and over. Of
> course this hypothesis is just based on a very high-level
> understanding of what's going on; I'm not sufficiently familiar with
> the implementation specifics to offer more.
>
> One observation I just made versus old kernels where this was not
> happening: I've got a lot more timer interrupts happening than before.
> In 6 minutes of uptime:
>
>           CPU0       CPU1
>  17:        38          0  AIC  17  abce0000.ethernet
>  18:       811          0  AIC  18  uartlite
>  20:       188        121  AIC  20  ipi
>  72:     11295       6499  AIC  72  jcore_pit
>
> That's 30+ timer interrupts per sec on cpu0. Previously it was more
> like 1-5 per second when idle (with dyntick).

That is interesting, but I don't know what to make of it. RCU does depend
heavily on timed sleeps to make forward progress, so if timer interrupts
were lost, that would be a smoking gun. But you actually appear to have
-more- timer interrupts in the failure case rather than fewer!

> > > My .config and message log are below. Let me know if there are any
> > > other details I can provide that would be helpful.
> >
> > In case the timer issue I am suspecting is a red herring...
> >
> > RCU does tune itself for different values of HZ, but it might need more
> > help for 50MHz CPUs. (Although if this were the problem, I would expect
> > you to be seeing excessive CPU consumption by the rcu_sched kthread.)
> > But it might be worthwhile booting with rcutree.jiffies_till_sched_qs=10
> > and rcutree.jiffies_till_first_fqs=10 (or thereabouts). It might also
> > be necessary to use a longer RCU CPU stall warning timeout (assuming
> > you don't just suppress it as above), for example, by booting with (say)
> > rcupdate.rcu_cpu_stall_timeout=60.
>
> With that in mind, it seems plausible that the problem existed before
> and just went over the threshold for reporting the stall with a
> non-bug change going into 4.8. Would building an older known-good
> kernel with TREE_RCU_TRACE enabled and checking debugfs
> rcu/rcu_sched/rcugp for max=??? be informative?

It might be. However, if ftrace is enabled on your platform, getting
a trace of the RCU-related events would be even more likely to be
helpful. And for that matter, the timer-related events.
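
Assuming debugfs is mounted at /sys/kernel/debug and event tracing is
configured into your kernel, something along these lines would capture
both sets of events (tracefs may instead be at /sys/kernel/tracing,
depending on your setup):

	echo 1 > /sys/kernel/debug/tracing/events/rcu/enable
	echo 1 > /sys/kernel/debug/tracing/events/timer/enable
	cat /sys/kernel/debug/tracing/trace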

One way to determine whether we were just on the edge on the
older systems would be to boot them with smaller values of
rcupdate.rcu_cpu_stall_timeout. The default values of the other two
boot parameters look to already be at the minimum useful value, though.
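
For example (picking an arbitrary small value; anything well below the
60 seconds suggested above should do):

	rcupdate.rcu_cpu_stall_timeout=5

If the older kernels then start complaining as well, you were probably
just under the reporting threshold all along.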

> > Is there any way for RCU to determine at boot time what the nominal CPU
> > speed is? I suppose I could look at BogoMIPs, but is there something
> > better?
>
> I'm not aware of any clean cross-arch way to do it.

If it turns out that the problem is that RCU has to configure itself
differently on 50MHz systems, we will need to revisit this.

Thanx, Paul