Re: [BUG] Core2 cpu triggers hard lockup with perf test

From: Peter Zijlstra
Date: Tue Mar 01 2016 - 04:17:31 EST


On Mon, Feb 29, 2016 at 10:12:08PM +0000, Liang, Kan wrote:

> In SDM "18.4.4.4 Re-configuring PEBS Facilities" it is mentioned that
> a quiescent period is needed between stopping the prior event counting and
> setting up a new PEBS event when software needs to reconfigure PEBS facilities.
> The quiescent period is to allow any latent residual PEBS records to complete
> their capture at their previously specified buffer address.

> That requirement can only be found for the Core microarchitecture.

But that should apply to all (PEBS) event scheduling, not just the
multi thing.

Also very convenient that the quiescent period is so well defined. How long
should we wait, a day?

> I think it may imply that there is some observed delay in writing the PEBS buffer.

Doesn't it explicitly state just that?

> So if perf record samples a precise hw event with a very small period, the slow
> PEBS writing may lock up the CPU.

And I still don't see how this would explain a lockup in the MSR writes.

[ Jiri, can you disable that stupid panic on hard lockup and let it run
for a while, see if all the lockup msgs hit the same IP? Also, can you
look where exactly that IP lives in the code? ]
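
A minimal sketch of the same-IP check, assuming the lockup reports have been
saved to a text file and that each register dump carries a line shaped like
"RIP: 0010:[<ffffffff8100b0a5>] ...". That line format, the helper names
tally_rips() and all_same_ip(), and the sample address are assumptions for
illustration, not taken from the actual report:

```python
import re
from collections import Counter

# Assumed line shape: "RIP: 0010:[<16 hex digits>] ...".
# Adjust the pattern if the kernel in use prints registers differently.
RIP_RE = re.compile(r"RIP: [0-9a-f]{4}:\[<([0-9a-f]{16})>\]")


def tally_rips(log_text):
    """Count how often each RIP address appears in the log text."""
    return Counter(m.group(1) for m in RIP_RE.finditer(log_text))


def all_same_ip(log_text):
    """True only if at least one RIP was found and all of them are identical."""
    return len(tally_rips(log_text)) == 1


if __name__ == "__main__":
    import sys

    # Usage: python tally_rips.py saved-dmesg.txt
    with open(sys.argv[1]) as f:
        for addr, count in tally_rips(f.read()).most_common():
            print(addr, count)
```

If every report lands on one address, that address can then be resolved
against the kernel image to see which instruction it corresponds to.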

So I suspect it actually just did the PERF_GLOBAL_CTRL write; how else
would the hardware watchdog trigger on that same CPU?

After that, there's only BTS muck, which you're not using, so WTH is it
actually stuck on?

> If so, I think disabling multiple PEBS should be a good approach.

As said, this should affect any and all PEBS event scheduling, not just
the multi stuff.