Re: [PATCH v3 4/7] acpi/ghes: Add a logic to handle block addresses and FW first ARM processor error injection

From: Markus Armbruster
Date: Mon Jul 29 2024 - 10:37:52 EST


Mauro Carvalho Chehab <mchehab+huawei@xxxxxxxxxx> writes:

> Em Thu, 25 Jul 2024 11:48:12 +0200
> Markus Armbruster <armbru@xxxxxxxxxx> escreveu:
>
>> Mauro Carvalho Chehab <mchehab+huawei@xxxxxxxxxx> writes:
>>
>> > From: Jonathan Cameron <Jonathan.Cameron@xxxxxxxxxx>
>> >
>> > 1. Some GHES functions require handling addresses. Add a helper function
>> > to support it.
>> >
>> > 2. Add support for ACPI CPER (firmware-first) ARM processor error injection.
>> >
>> > Compliance with N.2.4.4 ARM Processor Error Section in UEFI 2.6 and
>> > upper specs, using error type bit encoding as detailed at UEFI 2.9A
>> > errata.
>> >
>> > Error injection examples:
>> >
>> > { "execute": "qmp_capabilities" }
>> >
>> > { "execute": "arm-inject-error",
>> > "arguments": {
>> > "errortypes": ['cache-error']
>> > }
>> > }
>> >
>> > { "execute": "arm-inject-error",
>> > "arguments": {
>> > "errortypes": ['tlb-error']
>> > }
>> > }
>> >
>> > { "execute": "arm-inject-error",
>> > "arguments": {
>> > "errortypes": ['bus-error']
>> > }
>> > }
>> >
>> > { "execute": "arm-inject-error",
>> > "arguments": {
>> > "errortypes": ['cache-error', 'tlb-error']
>> > }
>> > }
>> >
>> > { "execute": "arm-inject-error",
>> > "arguments": {
>> > "errortypes": ['cache-error', 'tlb-error', 'bus-error', 'micro-arch-error']
>> > }
>> > }
>> > ...
>> >
>> > Co-authored-by: Mauro Carvalho Chehab <mchehab+huawei@xxxxxxxxxx>
>> > Co-authored-by: Shiju Jose <shiju.jose@xxxxxxxxxx>
>> > For Add a logic to handle block addresses,
>> > Signed-off-by: Jonathan Cameron <Jonathan.Cameron@xxxxxxxxxx>
>> > Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@xxxxxxxxxx>
>> > For FW first ARM processor error injection,
>> > Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@xxxxxxxxxx>
>> > Signed-off-by: Shiju Jose <shiju.jose@xxxxxxxxxx>
>> > ---
>> > configs/targets/aarch64-softmmu.mak | 1 +
>> > hw/acpi/ghes.c | 258 ++++++++++++++++++++++++++--
>> > hw/arm/Kconfig | 4 +
>> > hw/arm/arm_error_inject.c | 35 ++++
>> > hw/arm/arm_error_inject_stubs.c | 18 ++
>> > hw/arm/meson.build | 3 +
>> > include/hw/acpi/ghes.h | 2 +
>> > qapi/arm-error-inject.json | 49 ++++++
>> > qapi/meson.build | 1 +
>> > qapi/qapi-schema.json | 1 +
>> > 10 files changed, 361 insertions(+), 11 deletions(-)
>> > create mode 100644 hw/arm/arm_error_inject.c
>> > create mode 100644 hw/arm/arm_error_inject_stubs.c
>> > create mode 100644 qapi/arm-error-inject.json
>>
>> Since the new file not covered in MAINTAINERS, get_maintainer.pl will
>> blame it on the QAPI maintainers alone. No good.
>
> Added myself there:
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 98eddf7ae155..713a104ef901 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -2075,6 +2075,13 @@ F: hw/acpi/ghes.c
> F: include/hw/acpi/ghes.h
> F: docs/specs/acpi_hest_ghes.rst
>
> +ACPI/HEST/GHES/ARM processor CPER
> +R: Mauro Carvalho Chehab <mchehab+huawei@xxxxxxxxxx>
> +S: Maintained
> +F: hw/arm/arm_error_inject.c
> +F: hw/arm/arm_error_inject_stubs.c
> +F: qapi/arm-error-inject.json
> +
> ppc4xx
> L: qemu-ppc@xxxxxxxxxx
> S: Orphan
>
>>
>> [...]
>>
>> > diff --git a/qapi/arm-error-inject.json b/qapi/arm-error-inject.json
>> > new file mode 100644
>> > index 000000000000..430e6cea6b60
>> > --- /dev/null
>> > +++ b/qapi/arm-error-inject.json
>> > @@ -0,0 +1,49 @@
>> > +# -*- Mode: Python -*-
>> > +# vim: filetype=python
>> > +
>> > +##
>> > +# = ARM Processor Errors
>> > +##
>> > +
>> > +##
>> > +# @ArmProcessorErrorType:
>> > +#
>> > +# Type of ARM processor error to inject
>> > +#
>> > +# @unknown-error: Unknown error
>>
>> Removed in PATCH 7, and unused until then. Why add it in the first
>> place?
>
> I folded this with patch 7, so this was gone now.
>
>>
>> > +#
>> > +# @cache-error: Cache error
>> > +#
>> > +# @tlb-error: TLB error
>> > +#
>> > +# @bus-error: Bus error.
>> > +#
>> > +# @micro-arch-error: Micro architectural error.
>> > +#
>> > +# Since: 9.1
>> > +##
>> > +{ 'enum': 'ArmProcessorErrorType',
>> > + 'data': ['unknown-error',
>> > + 'cache-error',
>>
>> Tab in this line. Please convert to spaces.
>
> Ok.
>
>>
>> > + 'tlb-error',
>> > + 'bus-error',
>> > + 'micro-arch-error']
>> > +}
>> > +
>> > +##
>> > +# @arm-inject-error:
>> > +#
>> > +# Inject ARM Processor error.
>> > +#
>> > +# @errortypes: ARM processor error types to inject
>> > +#
>> > +# Features:
>> > +#
>> > +# @unstable: This command is experimental.
>> > +#
>> > +# Since: 9.1
>> > +##
>> > +{ 'command': 'arm-inject-error',
>> > + 'data': { 'errortypes': ['ArmProcessorErrorType'] },
>>
>> Please separate words with dashes: 'error-types'.
>
> Done.
>
> Folding with patch 7 broke it on two separate fields: error and
> type.
>
>>
>> > + 'features': [ 'unstable' ]
>> > +}
>>
>> Is this used only with TARGET_ARM?
>
> Yes, as this CPER record is defined only for arm. There are three other
> processor error info:
> - for x86;
> - for ia32;
> - for "generic cpu".
>
> They have different structures, with different fields.

A generic inject-error command feels nicer, but coding its arguments in
the schema could be more trouble than it's worth. I'm not asking you to
try.

A target-specific command like this one should be conditional. Try
this:

{ 'command': 'arm-inject-error',
'data': { 'errortypes': ['ArmProcessorErrorType'] },
'features': [ 'unstable' ],
'if': 'TARGET_ARM' }

No need to provide a qmp_arm_inject_error() stub then.

>> Why is being able to inject multiple error types at once useful?
>
> The CPER ARM Processor record is defined at UEFI spec as having from 1 to
> 255 errors, that can be using the same type or not. The idea behind UEFI
> spec is that a single root error may be reflected on multiple errors.
>
> It may also help to reduce BIOS interrupts to OS, by merging errors
> altogether, as memory errors usually happen in bursts.
>
> Due to that, a single Processor Error Information inside a CPER record
> for ARM processor can, according with UEFI spec, contain more than one
> of the following bits set:
>
> +-----|---------------------------+
> | Bit | Meaning |
> +=====+===========================+
> | 1 | Cache Error |
> | 2 | TLB Error |
> | 3 | Bus Error |
> | 4 | Micro-architectural Error |
> +-----|---------------------------+
>
> So, the spec allows, for instance, to have a single Processor Error
> Information (PEI) with micro-arch and tlb-error flags raised at the
> same time.
>
> We need the capability of testing multiple error types in order to check
> if OS implementation is decoding it the right way. In particular, Linux
> was not doing it right, as the CPER ARM Processor record handler was
> written at the time UEFI 2.6 spec was written, while the actual encoding
> for the error type was only defined at UEFI 2.9A errata and newer.

I see.

>> I'd expect at least some of these errors to come with additional
>> information. For instance, I imagine a bus error is associated with
>> some address.
>
> It actually depends on the ARM and PEI valid fields: the address may or
> may not be present, depending if the phy/logical address valid field bit
> is set or not.
>
>>
>> If we encode the the error to inject as an enum value, adding more will
>> be hard.
>>
>> If we wrap the enum in a struct
>>
>> { 'struct': 'ArmProcessorError',
>> 'data': { 'type': 'ArmProcessorErrorType' } }
>>
>> we can later extend it like
>>
>> { 'union': 'ArmProcessorError',
>> 'base: { 'type': 'ArmProcessorErrorType' }
>> 'data': {
>> 'bus-error': 'ArmProcessorBusErrorData' } }
>>
>> { 'struct': 'ArmProcessorBusErrorData',
>> 'data': ... }
>
> I don't see this working as one might expect. See, the ARM error
> information data can be repeated from 1 to 255 times. It is given
> by this struct (see patch 7):
>
> { 'struct': 'ArmProcessorErrorInformation',
> 'data': { '*validation': ['ArmPeiValidationBits'],
> 'type': ['ArmProcessorErrorType'],
> '*multiple-error': 'uint16',
> '*flags': ['ArmProcessorFlags'],
> '*error-info': 'uint64',
> '*virt-addr': 'uint64',
> '*phy-addr': 'uint64'}
> }
>
> According with the UEFI spec, the type is always be present.
> The other fields are marked as valid or not via the field
> "validation". So, there's one bit indicating what is valid between
> the fields at the PEI structure, e. g.:
>
> - multiple-error: multiple occurrences of the error;
> - flags;
> - error-info: error information;
> - virt-addr: virtual address;
> - phy-addr: physical address.
>
> There are also other fields that are global for the entire record,
> also marked as valid or not via another bitmask.
>
> The contents of almost all those fields are independent of the error
> type. The only field which content is affected by the error type is
> "error-info", and the definition of such field is not fully specified.
>
> So, currently, UEFI spec only defines it when:
>
> 1. the error type has just one bit set;
> 2. the error type is either cache, TLB or bus error[1].
> If type is micro-arch-specific error, the spec doesn't tell how this
> field if filled.
>
> To make the API simple (yet powerful), I opted to not enforce any encoding
> for error-info: let userspace fill it as required and use some default
> that would make sense, if this is not passed via QMP.
>
> [1] See https://uefi.org/specs/UEFI/2.10/Apx_N_Common_Platform_Error_Record.html#arm-processor-error-information

I asked because designing for extensibility is good practice.

It's not a hard requirement here, because feature 'unstable' gives us
lincense to change the interface incompatibly.

[...]