Re: [PATCH v10 07/21] acpi/ghes: rework the logic to handle HEST source ID
From: Mauro Carvalho Chehab
Date: Tue Oct 01 2024 - 07:58:11 EST
Em Tue, 17 Sep 2024 13:59:34 +0200
Igor Mammedov <imammedo@xxxxxxxxxx> escreveu:
> On Sat, 14 Sep 2024 08:13:28 +0200
> Mauro Carvalho Chehab <mchehab+huawei@xxxxxxxxxx> wrote:
>
> > The current logic is based on a lot of duct tape, with
> > offsets calculated based on one define with the number of
> > source IDs and an enum.
> >
> > Rewrite the logic in a way that it would be more resilient
> > of code changes, by moving the source ID count to an enum
> > and make the offset calculus more explicit.
> >
> > Such change was inspired on a patch from Jonathan Cameron
> > splitting the logic to get the CPER address on a separate
> > function, as this will be needed to support generic error
> > injection.
>
> so this patch switches to using HEST to lookup error status block
> by source id, though nothing in commit message mentions that.
> Perhaps it's time to rewrite commit message to be more
> specific/clear on what it's doing.
>
> now, I'd split this on several patches that should also take care of
> wiring needed to preserve old lookup to keep migration with 9.1 machines
> working:
>
> 1. extract error status block read_ack addresses calculation/lookup
> into a separate function (just current code, without HEST lookup).
Done. This is at the cleanup series.
> 2. since old lookup code handles only 1 source and won't ever see multiple
> error sources, simplify that as much as possible to keep old way simple
> (and mention that in commit message).
> it basically should take 1st error status block/read ack addresses from
> hardware_errors
Done there too.
>
> 3. add to AcpiGedState a callback to with signature of #1 lookup func.
> (which will be used to switch between old/new code), by default set to
> old lookup func
>
> 4. add to GED a bool property x-has-hardware_errors_addr = 1
> and use it in acpi_ged_realize() to set callback
>
> if x-has-hardware_errors_addr == 1
> callback = old_lookup
Instead of a callback, I opted to define a boolean:
DEFINE_PROP_BOOL("x-has-hest-addr", AcpiGedState, ghes_state.hest_lookup, true),
That avoided the need of exporting the math logic and deal with
callbacks. It also avoided the need of touching acpi_ged_realize().
>
> 5. add new HEST lookup func (not used yet with mentioning that follow up patch
> will use it)
>
> 6. cleanup fwcfg based on x-has-hardware_errors_addr,
> i.e. for 'true':
> ask for write pointer to hardware_errors like it's done in current code
> and don't register hest_addr write pointer
> while for 'false'
> do opposite of above.
This doesn't work. without the fw_cfg logic for both, QEMU/BIOS won't boot
and/or the hardware_errors won't work, causing ghes to do nothing.
> 7. flip x-has-hardware_errors_addr default to 0 and add to hw/core/machine.c
> hw_compat_9_1[] = {
> {ged type, 'x-has-hardware_errors_addr', 'true'},
> }
> this will make 9.1 and older machine types to use old lookup callback
> while newer machines will use default new lookup
I opted to add the changes on machine.c at the same patch.
See the RFC series I posted.
> that should take care of switching between new and old ABI,
> i.e. which code qemu and guest will use/see.
With the new series, both virt-9.1 and virt-9.2 are doing the
right thing using either the legacy or the new math. I double
checked with a small debug patch, and changing the virt-9.1
code to use GPIO, as it is a way easier to inject errors than
via SEA (SEA requires the host OS to generate a bus error on
a hw address used by the VM).
>
> > Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@xxxxxxxxxx>
> >
> > ---
> >
> > Changes from v9:
> > - patch split on multiple patches to avoid multiple changes
> > at the same patch.
> >
> > Changes from v8:
> > - Non-rename/cleanup changes merged altogether;
> > - source ID is now more generic, defined per guest target.
> > That should make easier to add support for 86.
> >
> > Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@xxxxxxxxxx>
> > ---
> > hw/acpi/ghes.c | 95 ++++++++++++++++++++++++++++++------------
> > include/hw/acpi/ghes.h | 6 ++-
> > 2 files changed, 73 insertions(+), 28 deletions(-)
> >
> > diff --git a/hw/acpi/ghes.c b/hw/acpi/ghes.c
> > index 36fe5f68782f..6e5f0e8cf0c9 100644
> > --- a/hw/acpi/ghes.c
> > +++ b/hw/acpi/ghes.c
> > @@ -61,6 +61,23 @@
> > */
> > #define ACPI_GHES_GESB_SIZE 20
> >
> > +/*
> > + * Offsets with regards to the start of the HEST table stored at
> > + * ags->hest_addr_le, according with the memory layout map at
> > + * docs/specs/acpi_hest_ghes.rst.
> > + */
> > +
> > +/* ACPI 6.2: 18.3.2.8 Generic Hardware Error Source version 2
> > + * Table 18-382 Generic Hardware Error Source version 2 (GHESv2) Structure
> > + */
> > +#define HEST_GHES_V2_TABLE_SIZE 92
> > +#define GHES_ACK_OFFSET (64 + GAS_ADDR_OFFSET)
> > +
> > +/* ACPI 6.2: 18.3.2.7: Generic Hardware Error Source
> > + * Table 18-380: 'Error Status Address' field
> > + */
> > +#define GHES_ERR_ST_ADDR_OFFSET (20 + GAS_ADDR_OFFSET)
> > +
> > /*
> > * Values for error_severity field
> > */
> > @@ -401,11 +418,13 @@ void acpi_ghes_add_fw_cfg(AcpiGhesState *ags, FWCfgState *s,
> >
> > int acpi_ghes_record_errors(uint8_t source_id, uint64_t physical_address)
> > {
> > - uint64_t error_block_addr, read_ack_register_addr, read_ack_register = 0;
> > - uint64_t start_addr;
> > - bool ret = -1;
> > + uint64_t hest_read_ack_start_addr, read_ack_start_addr;
> > + uint64_t hest_addr, cper_addr, err_source_struct;
> > + uint64_t hest_err_block_addr, error_block_addr;
> > + uint32_t i, ret;
> > AcpiGedState *acpi_ged_state;
> > AcpiGhesState *ags;
> > + uint64_t read_ack;
> >
> > assert(source_id < ACPI_GHES_ERROR_SOURCE_COUNT);
> >
> > @@ -414,44 +433,66 @@ int acpi_ghes_record_errors(uint8_t source_id, uint64_t physical_address)
> > g_assert(acpi_ged_state);
> > ags = &acpi_ged_state->ghes_state;
> >
> > - start_addr = le64_to_cpu(ags->ghes_addr_le);
> > + hest_addr = le64_to_cpu(ags->hest_addr_le);
> >
> > if (!physical_address) {
> > return -1;
> > }
> >
> > - start_addr += source_id * sizeof(uint64_t);
> > + err_source_struct = hest_addr + ACPI_GHES_ERROR_SOURCE_COUNT;
> >
> > - cpu_physical_memory_read(start_addr, &error_block_addr,
> > - sizeof(error_block_addr));
> > + /*
> > + * Currently, HEST Error source navigates only for GHESv2 tables
> > + */
> > + for (i = 0; i < ACPI_GHES_ERROR_SOURCE_COUNT; i++) {
> > + uint64_t addr = err_source_struct;
> > + uint16_t type, src_id;
> >
> > - error_block_addr = le64_to_cpu(error_block_addr);
> > + cpu_physical_memory_read(addr, &type, sizeof(type));
> >
> > - read_ack_register_addr = start_addr +
> > - ACPI_GHES_ERROR_SOURCE_COUNT * sizeof(uint64_t);
> > + /* For now, we only know the size of GHESv2 table */
> > + assert(type == ACPI_GHES_SOURCE_GENERIC_ERROR_V2);
> >
> > - cpu_physical_memory_read(read_ack_register_addr,
> > - &read_ack_register, sizeof(read_ack_register));
> > + /* It is GHES. Compare CPER source address */
> > + addr += sizeof(type);
> > + cpu_physical_memory_read(addr, &src_id, sizeof(src_id));
> >
> > - /* zero means OSPM does not acknowledge the error */
> > - if (!read_ack_register) {
> > - error_report("OSPM does not acknowledge previous error,"
> > - " so can not record CPER for current error anymore");
> > - } else if (error_block_addr) {
> > - read_ack_register = cpu_to_le64(0);
> > - /*
> > - * Clear the Read Ack Register, OSPM will write it to 1 when
> > - * it acknowledges this error.
> > - */
> > - cpu_physical_memory_write(read_ack_register_addr,
> > - &read_ack_register, sizeof(uint64_t));
> > + if (src_id == source_id) {
> > + break;
> > + }
> >
> > - ret = acpi_ghes_record_mem_error(error_block_addr,
> > - physical_address);
> > - } else {
> > + err_source_struct += HEST_GHES_V2_TABLE_SIZE;
> > + }
> > + if (i == ACPI_GHES_ERROR_SOURCE_COUNT) {
> > error_report("can not find Generic Error Status Block");
> > +
> > + return -1;
> > }
> >
> > + /* Navigate though table address pointers */
> > +
> > + hest_err_block_addr = err_source_struct + GHES_ERR_ST_ADDR_OFFSET;
> > + hest_read_ack_start_addr = err_source_struct + GHES_ACK_OFFSET;
> > +
> > + cpu_physical_memory_read(hest_err_block_addr, &error_block_addr,
> > + sizeof(error_block_addr));
> > +
> > + cpu_physical_memory_read(error_block_addr, &cper_addr,
> > + sizeof(error_block_addr));
> > +
> > + cpu_physical_memory_read(hest_read_ack_start_addr, &read_ack_start_addr,
> > + sizeof(read_ack_start_addr));
> > +
> > + /* Update ACK offset to notify about a new error */
> > +
> > + cpu_physical_memory_read(read_ack_start_addr,
> > + &read_ack, sizeof(read_ack));
> > +
> > + read_ack = cpu_to_le64(0);
> > + cpu_physical_memory_write(read_ack_start_addr,
> > + &read_ack, sizeof(read_ack));
> > +
> > + ret = acpi_ghes_record_mem_error(error_block_addr, physical_address);
> > return ret;
> > }
> >
> > diff --git a/include/hw/acpi/ghes.h b/include/hw/acpi/ghes.h
> > index c9bbda4740e2..7485f54c28f2 100644
> > --- a/include/hw/acpi/ghes.h
> > +++ b/include/hw/acpi/ghes.h
> > @@ -23,6 +23,7 @@
> > #define ACPI_GHES_H
> >
> > #include "hw/acpi/bios-linker-loader.h"
> > +#include "qapi/error.h"
> >
> > /*
> > * Values for Hardware Error Notification Type field
> > @@ -74,7 +75,10 @@ void acpi_build_hest(GArray *table_data, GArray *hardware_errors,
> > const char *oem_id, const char *oem_table_id);
> > void acpi_ghes_add_fw_cfg(AcpiGhesState *vms, FWCfgState *s,
> > GArray *hardware_errors);
> > -int acpi_ghes_record_errors(uint8_t source_id, uint64_t error_physical_addr);
> > +int acpi_ghes_record_errors(uint8_t source_id,
> > + uint64_t error_physical_addr);
> > +void ghes_record_cper_errors(const void *cper, size_t len,
> > + uint16_t source_id, Error **errp);
> >
> > /**
> > * acpi_ghes_present: Report whether ACPI GHES table is present
>
Thanks,
Mauro