Re: rcu_sched stalls in idle task introduced in pre-4.8?
From: Rich Felker
Date: Tue Aug 02 2016 - 16:35:27 EST
On Tue, Aug 02, 2016 at 12:48:02PM -0700, Paul E. McKenney wrote:
> > > > The system I'm testing on is an SMP configuration of the J2 SoC on
> > > > FPGA (see: j-core.org), and the SMP support and timer driver (most
> > > > relevant, imo) are new, but never exhibited this problem on 4.6 or
> > > > 4.7. A tree with all drivers merged is available in the development
> > > > branch of git://git.libc.org/linux-sh; it's regularly rebased but
> > > > there's a perma-tag devel-20160801 you can check out instead. Fetching
> > > > into an existing clone of Linus's tree is recommended since the server
> > > > has limited resources.
> > > >
> > > > I doubt the code from my devel branch will be useful without hardware
> > > > to test on, though. Hopefully the issue is reproducible on other
> > > > systems. Very low cpu speed (50 MHz) could be a factor. The problem
> > > > seems to go away with dyntick disabled.
> > >
> > > Indeed, it is quite possible that RCU's self-tuning needs a bit of help
> > > for CPUs below a few hundred MHz. As might other pieces of the kernel.
> > > Should not be a big deal (famous last words...). However, given that
> > > you are running HZ=100, it should be reasonably close. So first, let's
> > > see what is going on here...
> > >
> > > The following dmesg line usually indicates that RCU's kernel thread is
> > > stuck for some reason.
> > >
> > > [42039.188293] rcu_sched kthread starved for 2152 jiffies! g70249 c70248 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x1
> > >
> > > The RCU_GP_WAIT_FQS indicates that it is waiting for a timer to expire.
> > > The task is in TASK_INTERRUPTIBLE state, but it should have been awakened
> > > after only a few jiffies sleep time. The stack trace looks like there
> > > might be trouble canceling the timer.
> >
> > Can you offer any advice on tracking down what it's waiting for, or
> > what's going on with the timer? Is any arch- or clockevent-specific
> > stuff that might be my fault involved here?
>
> It is in an swait_event_interruptible_timeout(), most likely waiting on
> a one-jiffy timer, as in __mod_timer(). If I understood your earlier
> email correctly, setting CONFIG_HZ_PERIODIC avoided this problem.
> If my understanding is correct, it might be worth reviewing any timer
> code that might be pushed along by the once-per-jiffy scheduler tick.
Is it possible that something wacky is going on with broadcast timers?
A dummy broadcast timer device gets initialized momentatily before the
percpu timer for the secondary cpu is initialized, and in my
experience these broadcast timers don't work when the real timer
device has a cpumask limiting it to a specific cpu (which seems like a
bug, but one I haven't had to care about since I'm not using it).
> Does the problem reproduces easily?
Yes, it happens right after boot and repeats every 30-90 seconds or
so.
> A bisection might be very helpful.
Bisection would require some manual work to setup because the whole
reason I was rebasing on Linus's tree was to adapt the drivers to
upstream infrastructure changes (the new cpuhp stuff replacing
notifier for cpu starting). The unfortunate way it was done, each
driver adds an enum to linux/cpuhotplug.h so all the patches have
gratuitous conflicts. In addition, for older revisions in Linus's
tree, there's at least one show-stopping (hang during boot) bug that
needs a cherry-pick to fix. There may be other small issues too. I
don't think they're at all insurmountible but it requires an annoying
amount of scripting.
> > > Is it possible that the timer
> > > handler is being prevented from making progress due to heavy interrupt
> > > load or some such?
> >
> > Actually the stall messages are only appearing when the system is idle
> > and interrupts are infrequent. That's what leads me to believe it
> > might be a situation where, on every (infrequent) wake, both the rcu
> > task and something blocking its progress happen at roughly the same
> > time, and the slowness of the cpu leads the rcu task to decide to wait
> > for the next timer/schedule event and try again -- over and over. Of
> > course this hypothesis is just based on a very high-level
> > understanding of what's going on; I'm not sufficiently familiar wilth
> > the implementation specifics to offer more.
> >
> > One observation I just made versus old kernels where this was not
> > happening: I've got a lot more timer interrupts happening than before.
> > In 6 minutes of uptime:
> >
> > CPU0 CPU1
> > 17: 38 0 AIC 17 abce0000.ethernet
> > 18: 811 0 AIC 18 uartlite
> > 20: 188 121 AIC 20 ipi
> > 72: 11295 6499 AIC 72 jcore_pit
> >
> > That's 30+ timer interrupts per sec on cpu0. Previously it was more
> > like 1-5 per second when idle (with dyntick).
>
> That is interesting, but I don't know what to make of it. RCU does depend
> heavily on timed sleeps to make forward progress, so if timer interrupts
> were lost, that would be a smoking gun. But you actually appear to have
> -more- timer interrupts in the failure case rather than fewer!
Could the extra interrupts be a symptom of the rcu task doing a
very-short sleep to retry over and over?
> > > > My .config and message log are below. Let me know if there are any
> > > > other details I can provide that would be helpful.
> > >
> > > In case the timer issue I am suspecting is a red herring...
> > >
> > > RCU does tune itself for different values of HZ, but it might need more
> > > help for 50MHz CPUs. (Although if this were the problem, I would expect
> > > you to be seeing excessive CPU consumption by the rcu_sched kthread.)
> > > But it might be worthwhile booting with rcutree.jiffies_till_sched_qs=10
> > > and rcutree.jiffies_till_first_fqs=10 (or thereabouts). It might also
> > > be necessary to use a longer RCU CPU stall warning timeout (assuming
> > > you don't just suppress it as above), for example, by booting with (say)
> > > rcupdate.rcu_cpu_stall_timeout=60.
> >
> > With that in mind, it seems plausible that the problem existed before
> > and just went over the threshold for reporting the stall with a
> > non-bug change going into 4.8. Would building an older known-good
> > kernel with TREE_RCU_TRACE enabled and checking debugfs
> > rcu/rcu_sched/rcugp for max=??? be informative?
>
> It might be. However, if ftrace is enabled on your platform, getting
> a trace of the RCU-related events would be even more likely to be
> helpful. And for that matter, the timer-related events.
It looks like ftrace is supported on sh. Looks like I have some
reading to do -- this could be very useful.
> One way to determine whether we were just on the edge on
> the older systems would be to boot them with smaller values of
> rcupdate.rcu_cpu_stall_timeout=60. The default values of the other two
> boot parameters looks to already be at the minimum usefule value, though.
CONFIG_RCU_CPU_STALL_TIMEOUT=21 seems to be what it's hitting.
Rich