Re: [PATCH] perf/x86/amd: Do not WARN on every IRQ

From: Breno Leitao
Date: Wed Sep 13 2023 - 12:24:15 EST


Hi Peter,

On Fri, Jun 16, 2023 at 03:29:54PM +0200, Peter Zijlstra wrote:
> On Fri, Jun 16, 2023 at 04:53:15AM -0700, Breno Leitao wrote:
> > On some systems, the Performance Counter Global Status Register is
> > coming with reserved bits set, which causes the system to be unusable
> > if a simple `perf top` runs. The system hits the WARN() thousands times
> > while perf runs.
> >
> > WARNING: CPU: 18 PID: 20608 at arch/x86/events/amd/core.c:944 amd_pmu_v2_handle_irq+0x1be/0x2b0
> >
> > This happens because the "Performance Counter Global Status Register"
> > (PerfCntGlobalStatus) MSR has bit 7 set. Bit 7 should be reserved according
> > to the documentation (Figure 13-12 from "AMD64 Architecture Programmer’s
> > Manual, Volume 2: System Programming, 24593"[1]
>
> Would it then not make more sense to mask out bit7 before:
>
> + status &= ~AMD_PMU_V2_GLOBAL_STATUS_RESERVED;
> if (!status)
> goto done;

Instead of masking `status` against AMD_PMU_V2_GLOBAL_STATUS_RESERVED
(AMD64_NUM_COUNTERS?), I opted for using the `amd_pmu_global_cntr_mask`
global variable because it seems to represent what the loop below is
iterating over:

/* PMC Enable and Overflow bits for PerfCntrGlobal* registers */
static u64 amd_pmu_global_cntr_mask __read_mostly;

Also, I think we want to WARN_ON_ONCE() if we see this problem. Right
now, it warns at every time we call this function, which makes the
machine unusable, but, warning it once could be helpful to figure out
there is something wrong with the machine/firmware.

Anyway, please let me know whatever is your preferred way and I will
submit a v2.