Re: WARN_ON_ONCE() in process_one_work()?
From: Paul E. McKenney
Date: Tue Jun 20 2017 - 12:45:38 EST

On Sun, Jun 18, 2017 at 06:40:00AM -0400, Tejun Heo wrote:
> Hello,
>
> On Sat, Jun 17, 2017 at 10:31:05AM -0700, Paul E. McKenney wrote:
> > On Sat, Jun 17, 2017 at 07:53:14AM -0400, Tejun Heo wrote:
> > > Hello,
> > >
> > > On Fri, Jun 16, 2017 at 10:36:58AM -0700, Paul E. McKenney wrote:
> > > > And no test failures from yesterday evening. So it looks like we get
> > > > somewhere on the order of one failure per 138 hours of TREE07 rcutorture
> > > > runtime with your printk() in the mix.
> > > >
> > > > Was the above output from your printk() output of any help?
> > >
> > > Yeah, if my suspicion is correct, it'd require new kworker creation
> > > racing against CPU offline, which would explain why it's so difficult
> > > to repro. Can you please see whether the following patch resolves the
> > > issue?
> >
> > That could explain why only Steve Rostedt and I saw the issue. As far
> > as I know, we are the only ones who regularly run CPU-hotplug stress
> > tests. ;-)
>
> I was a bit confused. It has to be racing against either new kworker
> being created on the wrong CPU or rescuer trying to migrate to the
> CPU, and it looks like we're mostly seeing the rescuer condition, but,
> yeah, this would only get triggered rarely. Another contributing
> factor could be the vmstat work being put on a workqueue w/ rescuer
> recently. It runs quite often, so probably has increased the chance
> of hitting the right condition.

Sounds like too much fun! ;-)

But more constructively... If I understand correctly, it is now possible
to take a CPU partially offline and put it back online again. This should
allow much more intense testing of this sort of interaction.

And no, I haven't yet tried this with RCU because I would probably need
to do some mix of just-RCU online/offline and full-up online-offline.
Plus RCU requires pretty much a full online/offline cycle to fully
exercise it. :-/

> > I have a weekend-long run going, but will give this a shot overnight on
> > Monday, Pacific Time. Thank you for putting it together, looking forward
> > to seeing what it does!
>
> Thanks a lot for the testing and patience. Sorry that it took so
> long. I'm not completely sure the patch is correct. It might have to
> be more specific about which type of migration or require further
> synchronization around migration, but hopefully it'll at least be able
> to show that this was the cause of the problem.

And last night's tests had no failures. Which might actually mean
something, will get more info when I run without your patch this
evening. ;-)

Thanx, Paul