Re: nohz fail (was: perf related boot hang.)

From: Frederic Weisbecker
Date: Thu Sep 04 2014 - 17:29:37 EST

On Thu, Sep 04, 2014 at 11:05:02PM +0200, Catalin Iacob wrote:
> On Thu, Sep 4, 2014 at 10:17 PM, Frederic Weisbecker <fweisbec@xxxxxxxxx> wrote:
> > Yeah, that's expected. You need to apply the nine patches on top of -rc1:
> >
> > git://
> > nohz/fixes
> >
> > "nohz: Restore NMI safe local irq work for local nohz kick" only fixes
> > part of the issue.
> Ok, but if the whole series is needed, isn't it better if it all goes
> into 3.17? Otherwise 3.17 is a clear regression for some users; it
> definitely is for me, since before 3.17-rc1 I never saw this bug and now I
> see it every time I do something CPU intensive. Maybe the regression
> is acceptable because it's confined to some CONFIG_NO_HZ_*
> combination (I think) which is still rather experimental. That's your
> call to make, but it's still a regression.

Yeah, the bug has been there for a while, but likely something got merged in
the latest -rc1 that made it more likely to happen.

This is probably because we converted the remote nohz kick to use irq work
instead of the scheduler IPI. So it is more likely to fire, and if we are
unlucky enough, some tick sees the irq work before the irq work IPI can
fire.

Or some code enqueues that irq work from the tick itself.
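
To make the ordering problem concrete, here is a toy userspace model in C.
It is only a sketch, not kernel code: every name in it (toy_cpu,
toy_queue_work, toy_tick, toy_ipi) is made up for illustration, and it
assumes, as described above, that both the tick and the irq work IPI can
end up running pending irq work.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only; none of these names exist in the kernel. */
enum drained_by { DRAINED_BY_NONE, DRAINED_BY_TICK, DRAINED_BY_IPI };

struct toy_cpu {
	bool work_pending;    /* an irq work item has been queued */
	bool ipi_in_flight;   /* the IPI was raised but has not arrived yet */
	enum drained_by last; /* which context ran the most recent work item */
};

/* Queue a work item and raise the IPI that is meant to run it. */
void toy_queue_work(struct toy_cpu *cpu)
{
	cpu->work_pending = true;
	cpu->ipi_in_flight = true;
}

/* Run the pending item, if any, and record which context ran it. */
void toy_run_pending(struct toy_cpu *cpu, enum drained_by who)
{
	if (!cpu->work_pending)
		return;
	cpu->work_pending = false;
	cpu->last = who;
}

/* The tick also looks at pending work, so it can get there first. */
void toy_tick(struct toy_cpu *cpu)
{
	toy_run_pending(cpu, DRAINED_BY_TICK);
}

/* The IPI arrives; the work may already be gone by now. */
void toy_ipi(struct toy_cpu *cpu)
{
	assert(cpu->ipi_in_flight); /* an IPI must have been raised first */
	cpu->ipi_in_flight = false;
	toy_run_pending(cpu, DRAINED_BY_IPI);
}
```

Calling toy_queue_work() and then toy_tick() before toy_ipi() leaves
last == DRAINED_BY_TICK: that is the unlucky ordering, where the work runs
from the tick and the IPI later finds nothing to do. With toy_ipi() first,
last == DRAINED_BY_IPI, which is the intended path.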

Anyway, you're right that it belongs to the category of regressions.
Unfortunately the fix is invasive.

Also, I don't know of many users of nohz full, so this probably won't
have much impact. Or this could be a good way to find out who uses this
feature after all :o)

I'm not sure what I should do. Let's see what the final fix looks like; Peter
is proposing some simplifications. Then we'll know better.

BTW, do you run any specific workload to trigger this?
