Re: [PATCH v3 4/7] acpi/ghes: Add a logic to handle block addresses and FW first ARM processor error injection

From: Mauro Carvalho Chehab
Date: Mon Jul 29 2024 - 08:22:21 EST


Em Thu, 25 Jul 2024 11:48:12 +0200
Markus Armbruster <armbru@xxxxxxxxxx> escreveu:

> Mauro Carvalho Chehab <mchehab+huawei@xxxxxxxxxx> writes:
>
> > From: Jonathan Cameron <Jonathan.Cameron@xxxxxxxxxx>
> >
> > 1. Some GHES functions require handling addresses. Add a helper function
> > to support it.
> >
> > 2. Add support for ACPI CPER (firmware-first) ARM processor error injection.
> >
> > Compliance with N.2.4.4 ARM Processor Error Section in UEFI 2.6 and
> > upper specs, using error type bit encoding as detailed at UEFI 2.9A
> > errata.
> >
> > Error injection examples:
> >
> > { "execute": "qmp_capabilities" }
> >
> > { "execute": "arm-inject-error",
> > "arguments": {
> > "errortypes": ['cache-error']
> > }
> > }
> >
> > { "execute": "arm-inject-error",
> > "arguments": {
> > "errortypes": ['tlb-error']
> > }
> > }
> >
> > { "execute": "arm-inject-error",
> > "arguments": {
> > "errortypes": ['bus-error']
> > }
> > }
> >
> > { "execute": "arm-inject-error",
> > "arguments": {
> > "errortypes": ['cache-error', 'tlb-error']
> > }
> > }
> >
> > { "execute": "arm-inject-error",
> > "arguments": {
> > "errortypes": ['cache-error', 'tlb-error', 'bus-error', 'micro-arch-error']
> > }
> > }
> > ...
> >
> > Co-authored-by: Mauro Carvalho Chehab <mchehab+huawei@xxxxxxxxxx>
> > Co-authored-by: Shiju Jose <shiju.jose@xxxxxxxxxx>
> > For Add a logic to handle block addresses,
> > Signed-off-by: Jonathan Cameron <Jonathan.Cameron@xxxxxxxxxx>
> > Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@xxxxxxxxxx>
> > For FW first ARM processor error injection,
> > Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@xxxxxxxxxx>
> > Signed-off-by: Shiju Jose <shiju.jose@xxxxxxxxxx>
> > ---
> > configs/targets/aarch64-softmmu.mak | 1 +
> > hw/acpi/ghes.c | 258 ++++++++++++++++++++++++++--
> > hw/arm/Kconfig | 4 +
> > hw/arm/arm_error_inject.c | 35 ++++
> > hw/arm/arm_error_inject_stubs.c | 18 ++
> > hw/arm/meson.build | 3 +
> > include/hw/acpi/ghes.h | 2 +
> > qapi/arm-error-inject.json | 49 ++++++
> > qapi/meson.build | 1 +
> > qapi/qapi-schema.json | 1 +
> > 10 files changed, 361 insertions(+), 11 deletions(-)
> > create mode 100644 hw/arm/arm_error_inject.c
> > create mode 100644 hw/arm/arm_error_inject_stubs.c
> > create mode 100644 qapi/arm-error-inject.json
>
> Since the new file not covered in MAINTAINERS, get_maintainer.pl will
> blame it on the QAPI maintainers alone. No good.

Added myself there:

diff --git a/MAINTAINERS b/MAINTAINERS
index 98eddf7ae155..713a104ef901 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2075,6 +2075,13 @@ F: hw/acpi/ghes.c
F: include/hw/acpi/ghes.h
F: docs/specs/acpi_hest_ghes.rst

+ACPI/HEST/GHES/ARM processor CPER
+R: Mauro Carvalho Chehab <mchehab+huawei@xxxxxxxxxx>
+S: Maintained
+F: hw/arm/arm_error_inject.c
+F: hw/arm/arm_error_inject_stubs.c
+F: qapi/arm-error-inject.json
+
ppc4xx
L: qemu-ppc@xxxxxxxxxx
S: Orphan

>
> [...]
>
> > diff --git a/qapi/arm-error-inject.json b/qapi/arm-error-inject.json
> > new file mode 100644
> > index 000000000000..430e6cea6b60
> > --- /dev/null
> > +++ b/qapi/arm-error-inject.json
> > @@ -0,0 +1,49 @@
> > +# -*- Mode: Python -*-
> > +# vim: filetype=python
> > +
> > +##
> > +# = ARM Processor Errors
> > +##
> > +
> > +##
> > +# @ArmProcessorErrorType:
> > +#
> > +# Type of ARM processor error to inject
> > +#
> > +# @unknown-error: Unknown error
>
> Removed in PATCH 7, and unused until then. Why add it in the first
> place?

I folded this with patch 7, so this was gone now.

>
> > +#
> > +# @cache-error: Cache error
> > +#
> > +# @tlb-error: TLB error
> > +#
> > +# @bus-error: Bus error.
> > +#
> > +# @micro-arch-error: Micro architectural error.
> > +#
> > +# Since: 9.1
> > +##
> > +{ 'enum': 'ArmProcessorErrorType',
> > + 'data': ['unknown-error',
> > + 'cache-error',
>
> Tab in this line. Please convert to spaces.

Ok.

>
> > + 'tlb-error',
> > + 'bus-error',
> > + 'micro-arch-error']
> > +}
> > +
> > +##
> > +# @arm-inject-error:
> > +#
> > +# Inject ARM Processor error.
> > +#
> > +# @errortypes: ARM processor error types to inject
> > +#
> > +# Features:
> > +#
> > +# @unstable: This command is experimental.
> > +#
> > +# Since: 9.1
> > +##
> > +{ 'command': 'arm-inject-error',
> > + 'data': { 'errortypes': ['ArmProcessorErrorType'] },
>
> Please separate words with dashes: 'error-types'.

Done.

Folding with patch 7 broke it on two separate fields: error and
type.

>
> > + 'features': [ 'unstable' ]
> > +}
>
> Is this used only with TARGET_ARM?

Yes, as this CPER record is defined only for arm. There are three other
processor error info:
- for x86;
- for ia32;
- for "generic cpu".

They have different structures, with different fields.

> Why is being able to inject multiple error types at once useful?

The CPER ARM Processor record is defined at UEFI spec as having from 1 to
255 errors, that can be using the same type or not. The idea behind UEFI
spec is that a single root error may be reflected on multiple errors.

It may also help to reduce BIOS interrupts to OS, by merging errors
altogether, as memory errors usually happen in bursts.

Due to that, a single Processor Error Information inside a CPER record
for ARM processor can, according with UEFI spec, contain more than one
of the following bits set:

+-----|---------------------------+
| Bit | Meaning |
+=====+===========================+
| 1 | Cache Error |
| 2 | TLB Error |
| 3 | Bus Error |
| 4 | Micro-architectural Error |
+-----|---------------------------+

So, the spec allows, for instance, to have a single Processor Error
Information (PEI) with micro-arch and tlb-error flags raised at the
same time.

We need the capability of testing multiple error types in order to check
if OS implementation is decoding it the right way. In particular, Linux
was not doing it right, as the CPER ARM Processor record handler was
written at the time UEFI 2.6 spec was written, while the actual encoding
for the error type was only defined at UEFI 2.9A errata and newer.

> I'd expect at least some of these errors to come with additional
> information. For instance, I imagine a bus error is associated with
> some address.

It actually depends on the ARM and PEI valid fields: the address may or
may not be present, depending if the phy/logical address valid field bit
is set or not.

>
> If we encode the the error to inject as an enum value, adding more will
> be hard.
>
> If we wrap the enum in a struct
>
> { 'struct': 'ArmProcessorError',
> 'data': { 'type': 'ArmProcessorErrorType' } }
>
> we can later extend it like
>
> { 'union': 'ArmProcessorError',
> 'base: { 'type': 'ArmProcessorErrorType' }
> 'data': {
> 'bus-error': 'ArmProcessorBusErrorData' } }
>
> { 'struct': 'ArmProcessorBusErrorData',
> 'data': ... }

I don't see this working as one might expect. See, the ARM error
information data can be repeated from 1 to 255 times. It is given
by this struct (see patch 7):

{ 'struct': 'ArmProcessorErrorInformation',
'data': { '*validation': ['ArmPeiValidationBits'],
'type': ['ArmProcessorErrorType'],
'*multiple-error': 'uint16',
'*flags': ['ArmProcessorFlags'],
'*error-info': 'uint64',
'*virt-addr': 'uint64',
'*phy-addr': 'uint64'}
}

According with the UEFI spec, the type is always be present.
The other fields are marked as valid or not via the field
"validation". So, there's one bit indicating what is valid between
the fields at the PEI structure, e. g.:

- multiple-error: multiple occurrences of the error;
- flags;
- error-info: error information;
- virt-addr: virtual address;
- phy-addr: physical address.

There are also other fields that are global for the entire record,
also marked as valid or not via another bitmask.

The contents of almost all those fields are independent of the error
type. The only field which content is affected by the error type is
"error-info", and the definition of such field is not fully specified.

So, currently, UEFI spec only defines it when:

1. the error type has just one bit set;
2. the error type is either cache, TLB or bus error[1].
If type is micro-arch-specific error, the spec doesn't tell how this
field if filled.

To make the API simple (yet powerful), I opted to not enforce any encoding
for error-info: let userspace fill it as required and use some default
that would make sense, if this is not passed via QMP.

[1] See https://uefi.org/specs/UEFI/2.10/Apx_N_Common_Platform_Error_Record.html#arm-processor-error-information

> > diff --git a/qapi/meson.build b/qapi/meson.build
> > index e7bc54e5d047..5927932c4be3 100644
> > --- a/qapi/meson.build
> > +++ b/qapi/meson.build
> > @@ -22,6 +22,7 @@ if have_system or have_tools or have_ga
> > endif
> >
> > qapi_all_modules = [
> > + 'arm-error-inject',
> > 'authz',
> > 'block',
> > 'block-core',
> > diff --git a/qapi/qapi-schema.json b/qapi/qapi-schema.json
> > index b1581988e4eb..479a22de7e43 100644
> > --- a/qapi/qapi-schema.json
> > +++ b/qapi/qapi-schema.json
> > @@ -81,3 +81,4 @@
> > { 'include': 'vfio.json' }
> > { 'include': 'cryptodev.json' }
> > { 'include': 'cxl.json' }
> > +{ 'include': 'arm-error-inject.json' }
>

Thanks,
Mauro