Re: Error reports at boot time in Ampere Altra machines since c733ebb7c

From: Darren Hart
Date: Fri Mar 03 2023 - 15:23:42 EST


On Fri, Mar 03, 2023 at 08:10:17PM +0000, Marc Zyngier wrote:
> On Fri, 03 Mar 2023 19:38:40 +0000,
> Darren Hart <darren@xxxxxxxxxxxxxxxxxxxxxx> wrote:
> >
> > On Thu, Mar 02, 2023 at 11:25:37PM +0000, Marc Zyngier wrote:
> > > On Thu, 02 Mar 2023 20:17:32 +0000,
> > > Aristeu Rozanski <aris@xxxxxxxxxx> wrote:
> > > >
> > > > Hi Marc,
> > > >
> > > > Since c733ebb7cb67d ("irqchip/gic-v3-its: Reset each ITS's BASERn
> > > > register before probe"), Ampere Altra machines are reporting corrected
> > > > errors during boot:
> > > >
> > > > [ 0.294334] HEST: Table parsing has been initialized.
> > > > [ 0.294397] sdei: SDEIv1.0 (0x0) detected in firmware.
> > > > [ 0.299622] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
> > > > [ 0.299626] {1}[Hardware Error]: event severity: recoverable
> > > > [ 0.299629] {1}[Hardware Error]: Error 0, type: recoverable
> > > > [ 0.299633] {1}[Hardware Error]: section type: unknown, e8ed898d-df16-43cc-8ecc-54f060ef157f
> > > > [ 0.299638] {1}[Hardware Error]: section length: 0x30
> > > > [ 0.299645] {1}[Hardware Error]: 00000000: 00000005 ec30000e 00080110 80001001 ......0.........
> > > > [ 0.299648] {1}[Hardware Error]: 00000010: 00000300 00000000 00000000 00000000 ................
> > > > [ 0.299650] {1}[Hardware Error]: 00000020: 00000000 00000000 00000000 00000000 ................
> > > > [ 0.299714] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 3
> > > > [ 0.299716] {2}[Hardware Error]: event severity: recoverable
> > > > [ 0.299717] {2}[Hardware Error]: Error 0, type: recoverable
> > > > [ 0.299718] {2}[Hardware Error]: section type: unknown, e8ed898d-df16-43cc-8ecc-54f060ef157f
> > > > [ 0.299720] {2}[Hardware Error]: section length: 0x30
> > > > [ 0.299722] {2}[Hardware Error]: 00000000: 40000005 ec30000e 00080110 80005001 ...@..0......P..
> > > > [ 0.299724] {2}[Hardware Error]: 00000010: 00000300 00000000 00000000 00000000 ................
> > > > [ 0.299726] {2}[Hardware Error]: 00000020: 00000000 00000000 00000000 00000000 ................
> > > > [ 0.299912] GHES: APEI firmware first mode is enabled by APEI bit.
> > > >
> > > > Because the errors are being reported later in boot, it's hard to
> > > > pinpoint exactly what's causing them without decoding the error
> > > > information, which I currently don't know how to do.
> > >
> > > + Darren
> > >
> > > Hopefully someone at Ampere can decode this and tell us what is happening.
> >
> > Hi Marc,
> >
> > + D Scott
> >
> > Thanks for the connection.
> >
> > This is reporting that something attempted to access GITS_BASER2, the base
> > register for the gicv4 vcpu table. Altra doesn't support gicv4. Is c733ebb7c
> > assuming GITS_BASER2 should be accessible on gicv3?
>
> All the GITS_BASERn registers should be RES0 if not implemented, as
> per the spec (12.19.1 GITS_BASER<n>, ITS Translation Table
> Descriptors, n = 0 - 7)
>
> <quote>
> A maximum of 8 GITS_BASER<n> registers can be provided. Unimplemented
> registers are RES 0.
> </quote>
>
> Returning an error on access is thus definitely a violation of the
> spec.
>
> So either the GIC implementation you are using is buggy, or you have
> some sort of HW firewalling between the CPU and the GIC that is
> trigger happy. My hunch is that this is the latter, as buggy
> implementations tend to return an SError when missing this sort of
> detail.

Thanks for the detail, Marc. Let me see what I can learn, and I will follow up.
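
For anyone following along, here is a minimal user-space sketch of the
behaviour under discussion: a reset loop that writes zero to all eight
GITS_BASER<n> slots, against a model in which unimplemented registers are
RES0 (writes ignored, reads return zero). It is not the code from c733ebb7c.
The struct fake_its type and the fake_its_*() and reset_all_basers() helpers
are hypothetical names made up for illustration. The only architectural
details assumed are that GITS_BASER0 sits at offset 0x0100 of the ITS control
frame and that each register is 8 bytes wide, which puts GITS_BASER2 at 0x0110.

```c
#include <stdint.h>

#define GITS_BASER_BASE   0x0100u  /* offset of GITS_BASER0 in the ITS frame */
#define GITS_BASER_COUNT  8u       /* the spec allows at most 8 registers */

/* Hypothetical stand-in for an ITS register frame (not a kernel type). */
struct fake_its {
	uint64_t baser[GITS_BASER_COUNT];
	uint8_t implemented;           /* bit n set: GITS_BASER<n> exists */
};

/* Byte offset of GITS_BASER<n> within the ITS control frame. */
static uint32_t gits_baser_offset(unsigned int n)
{
	return GITS_BASER_BASE + 8u * n;
}

/* Map a frame offset back to a register index, or -1 if out of range. */
static int baser_index(uint32_t offset)
{
	if (offset < GITS_BASER_BASE || (offset & 7u))
		return -1;
	offset = (offset - GITS_BASER_BASE) / 8u;
	return offset < GITS_BASER_COUNT ? (int)offset : -1;
}

/* RES0 model: a write to an unimplemented register is silently dropped. */
static void fake_its_write(struct fake_its *its, uint32_t offset, uint64_t val)
{
	int n = baser_index(offset);

	if (n >= 0 && (its->implemented & (1u << n)))
		its->baser[n] = val;
}

/* RES0 model: a read of an unimplemented register returns zero. */
static uint64_t fake_its_read(const struct fake_its *its, uint32_t offset)
{
	int n = baser_index(offset);

	if (n >= 0 && (its->implemented & (1u << n)))
		return its->baser[n];
	return 0;
}

/* Write zero to every GITS_BASER<n> slot, implemented or not. */
static void reset_all_basers(struct fake_its *its)
{
	unsigned int n;

	for (n = 0; n < GITS_BASER_COUNT; n++)
		fake_its_write(its, gits_baser_offset(n), 0);
}
```

With this model, touching slot 2 on an ITS that only implements slots 0 and 1
is harmless, which is what the quoted spec text implies. Anything that raises
an error on that access is behaving differently from the model.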

--
Darren Hart
Ampere Computing / OS and Kernel