Re: frequent lockups in 3.18rc4

From: Dave Jones
Date: Thu Nov 27 2014 - 17:56:51 EST


On Thu, Nov 27, 2014 at 11:17:16AM -0800, Linus Torvalds wrote:
> On Wed, Nov 26, 2014 at 2:57 PM, Dave Jones <davej@xxxxxxxxxx> wrote:
> >
> > So 3.17 also has this problem.
> > Good news I guess in that it's not a regression, but damn I really didn't
> > want to have to go digging through the mists of time to find the last 'good' point.
>
> So I'm looking at the watchdog code, and it seems racy wrt parking and startup.
>
> In particular, it sets the high priority *after* starting the hrtimer,
> and it goes back to SCHED_NORMAL *before* canceling the timer.
>
> Which seems completely ass-backwards. And the smp_hotplug_thread stuff
> explicitly enables preemption around the setup/cleanup/park/unpark
> operations.
>
> However, that would be an issue only if trinity might be doing things
> that enable and disable the watchdog. And doing so under insane loads.
> Even then it seems unlikely.
>
> The insane loads you have. But even then, could a load average of 169
> possibly delay running a non-RT process for 22 seconds? Doubtful.
>
> But just in case: do you do cpu hotplug events (that will disable and
> re-enable the watchdog process)? Anything else that will park/unpark
> the hotplug thread?

That's root-only iirc, and I'm not running trinity as root, so that
shouldn't be happening. There's also no sign of such behaviour in dmesg
when the problem occurs.

> Quite frankly, I'm just grasping for straws here, but a lot of the
> watchdog traces really have seemed spurious...

Agreed.

Currently leaving 3.16 running. 21hrs so far.

Dave
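
The ordering concern quoted above (priority raised only after the hrtimer
is started, and dropped before the timer is cancelled) can be illustrated
with a small user-space model. This is only a toy sketch under stated
assumptions, not the code in kernel/watchdog.c: the struct, the function
names and the notion of a "step" (standing in for a point where the thread
could be preempted) are all made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of one per-CPU watchdog thread. All names are hypothetical. */
struct toy_watchdog {
    bool timer_armed;   /* models the hrtimer having been started */
    bool rt_priority;   /* models high (RT) priority vs SCHED_NORMAL */
    int exposed_steps;  /* steps where the timer ran without RT priority */
};

/* One "step" stands in for a point where preemption could happen. */
void toy_step(struct toy_watchdog *w)
{
    assert(w != NULL);
    if (w->timer_armed && !w->rt_priority)
        w->exposed_steps++;
}

/* Ordering as described in the quoted mail: timer first, priority second,
 * then priority dropped before the timer is cancelled. */
int toy_order_as_described(void)
{
    struct toy_watchdog w = { false, false, 0 };

    w.timer_armed = true;   toy_step(&w);  /* timer armed, still normal prio */
    w.rt_priority = true;   toy_step(&w);
    w.rt_priority = false;  toy_step(&w);  /* normal prio, timer still armed */
    w.timer_armed = false;  toy_step(&w);
    return w.exposed_steps;
}

/* Reversed ordering: raise priority before arming, cancel before lowering. */
int toy_order_reversed(void)
{
    struct toy_watchdog w = { false, false, 0 };

    w.rt_priority = true;   toy_step(&w);
    w.timer_armed = true;   toy_step(&w);
    w.timer_armed = false;  toy_step(&w);
    w.rt_priority = false;  toy_step(&w);
    return w.exposed_steps;
}
```

In this model the described ordering leaves two steps in which the timer is
armed while the thread runs at normal priority (one during enable, one
during disable), and the reversed ordering leaves none. It says nothing
about how long such a window could last on a real system, which is the
open question in the thread.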
