Re: Bug report for RCU stalled warning [3.10.69]

From: Paul E. McKenney
Date: Sat Oct 14 2017 - 08:51:43 EST


On Thu, Oct 12, 2017 at 01:38:24PM -0700, Paul E. McKenney wrote:
> [ Adding LKML on CC so that others can find this. ]
>
> On Wed, Oct 11, 2017 at 12:21:39PM +0800, Wang YanQing wrote:
> > Hi, Paul McKenney.
> >
> > I have received many reports of machines that stopped responding. After
> > rebooting and inspecting the messages, all of them show RCU stalls, but
> > I can't figure out how to fix it. I can't update the kernel, which is the
> > painful point, so I need to fix it in 3.10. I have attached four messages
> > that come from different CPUs and boards (so I guess it is a bug rather
> > than a hardware fault). Any suggestion is welcome.
>
> The first step is of course to report this to your distro, as they are
> the ones who do the care and feeding of such old kernels. Please include
> the information below in that report, as it might help your distro find
> and fix the problem.
>
> It looks like the stalled CPU is idle, and that the activity resulting
> from the stall-warning message gets things going again. Callbacks are
> being processed, so no OOM. But you are getting the splat every 60
> seconds. The system has only two CPUs, and is x86.
>
> If you cannot upgrade the kernel, my ability to help is limited. And the
> diagnostics printed with the v3.10 CPU stall warnings are also quite
> limited. However, there are some things you could try as workarounds:
>
> 1. Check to make sure that the rcu_sched kthread is getting
> the CPU time that it needs. Preventing this kthread from
> running would create exactly this output, assuming that
> the stall warning got it going again temporarily.
>
> 2. It looks like the disturbance of the RCU CPU stall warning
> is getting things going again. Try artificially providing
> this disturbance, for example, by running a usermode program
> or script that runs on each CPU in turn, then sleeps for
> (say) five seconds.
>
> 3. If you can reconfigure your kernel, try building with
> CONFIG_RCU_FAST_NO_HZ=n.

And if you can reconfigure your kernel, then in v3.10, building with
CONFIG_RCU_CPU_STALL_INFO and CONFIG_RCU_CPU_STALL_VERBOSE enabled will
provide more information on the CPUs and tasks stalling the grace period.

Thanx, Paul

> 4. Was the system running reliably on some earlier version?
> If so, consider reverting back to that version, and include
> the version information in your report to your distro. If
> your distro provides individual patches, you should consider
> bisecting so as to locate the offending patch.
>
> Good luck with it!
>
> Thanx, Paul
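
For workaround 2 above, a minimal sketch of such a usermode program might
look like the following. It assumes Python 3.3 or later on Linux (for
os.sched_setaffinity), and the names poke_each_cpu and run are made up for
illustration. Whether this is enough of a disturbance to get the grace
period moving again on an affected system is an assumption that would need
to be tested.

```python
import os
import time


def poke_each_cpu(cpus, work=lambda: None):
    """Run briefly on each CPU in turn and return the CPUs visited.

    Hypothetical helper: pins the calling process to one CPU at a time,
    which forces it to migrate there, then calls work() on that CPU.
    """
    visited = []
    original = os.sched_getaffinity(0)  # remember the starting affinity mask
    try:
        for cpu in sorted(cpus):
            try:
                os.sched_setaffinity(0, {cpu})
            except OSError:
                continue  # CPU offline or not permitted; skip it
            work()
            visited.append(cpu)
    finally:
        os.sched_setaffinity(0, original)  # restore the original mask
    return visited


def run(interval_s=5.0, rounds=None):
    """Visit every allowed CPU, sleep, and repeat.

    rounds=None loops forever; a number limits the loop (handy for testing).
    Returns the number of rounds completed.
    """
    done = 0
    while rounds is None or done < rounds:
        poke_each_cpu(os.sched_getaffinity(0))
        time.sleep(interval_s)
        done += 1
    return done
```

Calling run() from a script started at boot would give the five-second
cadence suggested above. The interval is only a starting point and can be
shortened if the stall warnings continue.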