Re: rcu_preempt self-detected stall on CPU from 4.5-rc3, since 3.17
From: Paul E. McKenney
Date: Sat Mar 26 2016 - 14:49:57 EST
On Sat, Mar 26, 2016 at 08:28:16AM -0700, Paul E. McKenney wrote:
> On Sat, Mar 26, 2016 at 12:29:31PM +0000, Mathieu Desnoyers wrote:
> > ----- On Mar 25, 2016, at 5:46 PM, Paul E. McKenney paulmck@xxxxxxxxxxxxxxxxxx wrote:
> >
> > > On Fri, Mar 25, 2016 at 09:24:14PM +0000, Chatre, Reinette wrote:
> > >> Hi Paul,
> > >>
> > >> On 2016-03-23, Paul E. McKenney wrote:
> > >> > Please boot with the following parameters:
> > >> >
> > >> > rcu_tree.rcu_kick_kthreads ftrace
> > >> > trace_event=sched_waking,sched_wakeup,sched_wake_idle_without_ipi
> > >>
> > >> With these parameters I expected more details to show up in the kernel logs but
> > >> cannot find any. Even so, today I left the machine running again and when this
> > >> happened I think I was able to capture the trace data for the event. Please
> > >> find attached the trace information for the kernel message below. Since the
> > >> complete trace file is very big I trimmed it to show the time around this event
> > >> - hopefully this will contain the information you need. I would also like to
> > >> provide some additional information. The system on which I see these events had
> > >> a time that was _very_ wrong. I noticed that this issue occurs when
> > >> system-timesynd was one of the tasks calling the functions of interest to your
> > >> tracing and am wondering if a very out of sync time in process of being
> > >> corrected could be the cause of this issue? As an experiment I ensured the
> > >> system time was accurate before leaving the system idle overnight and I did not
> > >> see the issue the next morning.
> > >
> > > Ah! Yes, a sudden jump in time or a disagreement about the time among
> > > different components of the system can definitely cause these symptoms.
> > > We have sometimes seen these problems occur when a pair of CPUs have
> > > wildly different ideas about what time it is, for example. Please let
> > > me know how it goes.
> > >
> > > Also, in your trace, there are no sched_waking events for the rcu_preempt
> > > process that are not immediately followed by sched_wakeup, so your trace
> > > isn't showing the problem that I am seeing.
> >
> > This is interesting.
> >
> > Perhaps we could try with those commits reverted ?
> >
> > commit e3baac47f0e82c4be632f4f97215bb93bf16b342
> > Author: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> > Date: Wed Jun 4 10:31:18 2014 -0700
> >
> > sched/idle: Optimize try-to-wake-up IPI
> >
> > commit fd99f91aa007ba255aac44fe6cf21c1db398243a
> > Author: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> > Date: Wed Apr 9 15:35:08 2014 +0200
> >
> > sched/idle: Avoid spurious wakeup IPIs
> >
> > They appeared in 3.16.
>
> At this point, I am up for trying pretty much anything. ;-)
>
> Will give it a go.
And those certainly don't revert cleanly! Would patching the kernel
to remove the definition of TIF_POLLING_NRFLAG be useful? Or, more
to the point, is there some other course of action that would be more
useful? At this point, the test times are measured in weeks...
Thanx, Paul