Re: [PATCH v3 4/7] acpi/ghes: Add a logic to handle block addresses and FW first ARM processor error injection

From: Mauro Carvalho Chehab
Date: Wed Jul 31 2024 - 03:11:46 EST


Em Tue, 30 Jul 2024 13:17:09 +0200
Igor Mammedov <imammedo@xxxxxxxxxx> escreveu:

> On Mon, 22 Jul 2024 08:45:56 +0200
> Mauro Carvalho Chehab <mchehab+huawei@xxxxxxxxxx> wrote:
>
> that's quite a bit of code that in 99% won't ever be used
> (assuming error injection testing scenario),
> not to mention it's a hw depended one and governed by different specs.
>
> Essentially we would need to create _whole_ lot of QAPI
> commands to cover possible errors for no benefit to QEMU.
>
> Let take for example very simple _OST status reporting,
> QEMU of cause can decode values and present it to users in
> more 'presentable' form. However instead of translating
> numbers (aka. spec language) into a made up QEMU language,
> QEMU just passes values up the stack and users can use
> well defined spec to interpret its meaning.
>
> benefits are: QEMU doesn't have to maintain translation
> code and QAPI ABI is limited to passing raw values.
>
> Can we do similar thing here as well?
> i.e. simplify error injection commands to
> a command that takes raw value and passes it
> to guest (QEMU here acts as proxy, if I'm not
> mistaken)?
>
> Preferably make it generic enough to handle
> not only ARM but other error formats HEST is
> able to handle.

A too generic interface doesn't sound feasible to me, as the
EINJ code needs to check QEMU implementation details before
doing the error inject.

See, processor is probably the simplest error injection
source, as most of the fields there aren't related to how
the hardware simulation is done.

Yet, if you see patch 7 of this series, you'll notice that some
fields should actually be filled based on the emulation.

On ARM, we have some IDs that depend on the emulation
(MIDR, MPIDR, power state). Doing that on userspace may require
a QAPI to query them.

The memory layout, however, is the most complex one. Even for
an ARM processor CPER (which is the simplest scenario), the
physical/virtual address need to be checked against the emulation
environment.

Other error sources (like memory errors, CXL, etc) will require
a deep knowledge about how QEMU mapped such devices.

So, in practice, if we move this to an EINJ script, we'll need
to add a probably more complex QAPI to allow querying the memory
layout and other device and CPU specific bindings.

Also, we don't know what newer versions of ACPI spec will reserve
us. See, even the HEST table contents is dependent of the HEST
revision number, as made clear at the ACPI 6.5 notes:

https://uefi.org/specs/ACPI/6.5/18_Platform_Error_Interfaces.html#acpi-error-source

and at:

https://uefi.org/specs/ACPI/6.5/18_Platform_Error_Interfaces.html#error-source-structure-header-type-12-onward

So, if we're willing to add support for a more generic "raw data"
QAPI, I would still do it per-type, and for the fields that won't
require knowledge of the device-emulation details.

Btw, my proposal on patch 7 of this series is to have raw data
for:
- the error-info field;
- registers dump;
- micro-architecture specific data.

I don't mind trying to have more raw data there as I see (marginal)
benefits of allowing to generate CPER invalid records [1], but some of
those fields need to be validated and/or filled internally at QEMU - if
not forced to an specific value by the caller.

[1] a raw data EINJ can be useful for fuzzy logic fault detection to
check if badly formed packages won't cause a Kernel panic or be
an exploit. Yet, not really a concern for APEI, as if the hardware
is faulty, a Kernel panic is not out of the table. Also, if the
the BIOS is already compromised and has malicious code on it,
the EINJ interface is not the main concern.

> PS:
> For user convenience, QEMU can carry a script that
> could help generate this raw value in user friendly way
> but at the same time it won't put maintenance
> burden on QEMU itself.

The script will still require reviews, and the same code will
be there. So, from maintenance burden, there won't be much
difference.

Btw, I'm actually using myself a script to test it, currently
sitting together with rasdaemon - which is the Linux tool to detect
and handle hardware errors:

https://github.com/mchehab/rasdaemon/blob/master/contrib/qemu_einj.py

as it helps a lot when trying to simulate more complex errors.

Once QEMU gains support to inject processor errors, I can prepare a
separate patch to move it to QEMU.

Thanks,
Mauro