Re: [PATCH v9 11/12] target/arm: add an experimental mpidr arm cpu property object

From: Mauro Carvalho Chehab
Date: Sun Sep 01 2024 - 02:40:17 EST


Em Fri, 30 Aug 2024 17:27:27 +0100
Peter Maydell <peter.maydell@xxxxxxxxxx> escreveu:

> On Mon, 26 Aug 2024 at 04:12, Mauro Carvalho Chehab
> <mchehab+huawei@xxxxxxxxxx> wrote:
> >
> > Em Sun, 25 Aug 2024 12:34:14 +0100
> > Peter Maydell <peter.maydell@xxxxxxxxxx> escreveu:
> >
> > > On Sun, 25 Aug 2024 at 04:46, Mauro Carvalho Chehab
> > > <mchehab+huawei@xxxxxxxxxx> wrote:
> > > >
> > > > Accurately injecting an ARM Processor error ACPI/APEI GHES
> > > > error record requires the value of the ARM Multiprocessor
> > > > Affinity Register (mpidr).
> > > >
> > > > While ARM implements it, this is currently not visible.
> > > >
> > > > Add a field at CPU storing it, and place it at arm_cpu_properties
> > > > as experimental, thus allowing it to be queried via QMP using
> > > > qom-get function.
> > >
> > > > static Property arm_cpu_properties[] = {
> > > > DEFINE_PROP_UINT64("midr", ARMCPU, midr, 0),
> > > > + DEFINE_PROP_UINT64("x-mpidr", ARMCPU, mpidr, 0),
> > > > DEFINE_PROP_UINT64("mp-affinity", ARMCPU,
> > > > mp_affinity, ARM64_AFFINITY_INVALID),
> > > > DEFINE_PROP_INT32("node-id", ARMCPU, node_id, CPU_UNSET_NUMA_NODE_ID),
> > >
> > > Why do we need this?
> >
> > The ACPI HEST tables, in particular when using GHESv2 provide
> > several kinds of errors. Among them, we have ARM Processor Error,
> > as defined at UEFI 2.10 spec (and earlier versions), the Common
> > Platform Error Record (CPER) is defined as:
> >
> > https://uefi.org/specs/UEFI/2.10/Apx_N_Common_Platform_Error_Record.html?highlight=ghes#arm-processor-error-section
> >
> > There are two fields that are part of the CPER record. One of them is
> > mandatory (MIDR); the other one is optional, but needed to decode another
> > field.
> >
> > So, basically those errors need them.
>
> OK, but why do scripts outside of QEMU need the information,
> as opposed to telling QEMU "hey, generate an error" and
> QEMU knowing the format to use? Do we have any other
> QMP APIs where something external provides raw ACPI
> data like this?

This was discussed during the review of this patch series.

See, the ACPI Platform Error Interfaces (APEI) code currently in QEMU
implements limited support for ACPI HEST - Hardware Error Source Table [1].

[1] https://uefi.org/specs/ACPI/6.5/18_Platform_Error_Interfaces.html#acpi-error-source

HEST consists of, currently, 9 error types (plus 3 obsoleted ones). Among
them, there is support for generic errors via GHES and GHESv2 types.
While not officially obsoleted, GHES is superseded by GHESv2.

GHESv2 (and GHES) has a section type field to identify which error type it
is [2]. Currently, there are +10 defined UUIDs for the section type.

[2] https://uefi.org/specs/UEFI/2.10/Apx_N_Common_Platform_Error_Record.html#section-descriptor

The current code on ghes.c implements GHESv2 support for a single
type (memory error), received from the host OS via SIGBUS.

Testing such code and injecting such error is not easy, as the host OS needs
to send a SIGBUS to the guest, this reflecting an error at the main OS.
Such code also has several limitations.

-

At the first three versions of this patch set, the code was just doing
like what you said: it was adding an error injection for a HEST GHESv2
ARM Processor Error. So the error record (CPER) were produced in QEMU using
some optional parameters passed via QMP to change fields when needed.
With such approach, QEMU could use directly the value from MIDR and MPIDR.

The main disadvantage is that, to make full support of HEST, a lot
of code will be needed to add support for every GHESv2 type and for
every GHESv2 section type. So, the feedback we had were to re-implement
it into a generic way.

The generic CPER error inject approach (since v4 of this series), has
soma advantages:

- it is easy to do fuzz testing, as the entire CPER is built via a python
script;
- no need to modify QEMU to support other GHESv2 types of record and
to support other types of processors;
- GHESv2 fields can also be dynamically generated;
- It shouldn't be hard to change the code to support other types of
HEST table (currently, only GHESv2 is supported).

The disadvantage is that queries are needed to pick configuration and
register values from the current emulation to do error injection. For
ARM Processor Error, it means that MPIDR and MIDR, are needed. Other
processors and other error types will also require to query other data
from QEMU, either using already-existing QMP code or by adding new ones.

Yet, the amount of code for such queries seem to be smaller than the
amount of code to be added for every single GHESv2/HEST type.

-

Worth saying that QEMU may still require internal HEST/GHES errors to be
able to reflect at the guests hardware problems detected at the host OS.

So, for instance, if a host OS memory is poisoned due to hardware errors,
QEMU and guests need to know, in order to kill processes affected
by a bad memory.

Regards,
Mauro