Re: perf: WARNING perfevents: irq loop stuck!

From: Ingo Molnar
Date: Fri May 08 2015 - 03:53:57 EST



* Vince Weaver <vincent.weaver@xxxxxxxxx> wrote:

> On Fri, 1 May 2015, Ingo Molnar wrote:
>
> > So 0000fffffffffffe corresponds to 2 events left until overflow,
> > right? And on Haswell we don't set x86_pmu.limit_period AFAICS, so we
> > allow these super short periods.
> >
> > Maybe like on Broadwell we need a quirk on Nehalem/Haswell as well,
> > one similar to bdw_limit_period()? Something like the patch below?
> >
> > Totally untested and such. I picked 128 because of Broadwell, but
> > lower values might work as well. You could try to increase it to 3 and
> > upwards and see which one stops triggering stuck NMI loops?
>
> I spent a lot of time trying to come up with a test case that
> triggered this more reliably but failed.
>
> It definitely is an issue with PMC0 being -2, causing the PMC0 bit in
> the status register to get stuck and not clear. Often there is
> also a PEBS event active at the same time but that might be
> coincidence.
>
> With your patch applied I can't trigger the issue. I haven't tried
> narrowing down the exact value yet.

So how about I change it from 128U to 2U and apply it upstream?

I.e. use the minimal threshold that we have observed to cause
problems. That way, if the problem ever shows up under different
circumstances, we'll eventually reproduce it or hear about it.
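
For reference, the arithmetic behind the numbers above can be sketched in
a few lines of standalone C. This is not the kernel code and not the
patch itself: the function names are made up for illustration, and the
48-bit counter width is an assumption read off the 0000fffffffffffe value
quoted earlier. The clamp follows the general shape of a limit_period
style quirk (raise periods that are too short up to a minimum), with the
minimum left as a parameter since the right value is still being worked
out.

```c
#include <stdint.h>

/* Assumption: counters are 48 bits wide, as 0000fffffffffffe suggests. */
#define SKETCH_CNTVAL_BITS 48
#define SKETCH_CNTVAL_MASK ((UINT64_C(1) << SKETCH_CNTVAL_BITS) - 1)

/*
 * Hypothetical helper: the raw value to program into a counter so that
 * it overflows after 'left' more events, i.e. -left truncated to the
 * counter width. Unsigned arithmetic keeps the negation well defined.
 */
static uint64_t sketch_counter_start(uint64_t left)
{
	return (UINT64_C(0) - left) & SKETCH_CNTVAL_MASK;
}

/*
 * Hypothetical quirk in the style of a limit_period callback: never
 * let the remaining period drop below 'min_period' events.
 */
static uint64_t sketch_limit_period(uint64_t left, uint64_t min_period)
{
	return left < min_period ? min_period : left;
}
```

With these definitions, sketch_counter_start(2) gives 0x0000fffffffffffe,
the value seen in the stuck case, and sketch_limit_period(1, 128) raises
a one-event period to 128 while leaving longer periods untouched.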

Thanks,

Ingo