Re: [PATCH V17 01/11] acpi: apei: read ack upon ghes record consumption

From: Baicar, Tyler
Date: Fri Jun 30 2017 - 12:47:30 EST


On 6/30/2017 4:10 AM, Robert Richter wrote:
Tyler,

On 19.05.17 14:32:03, Tyler Baicar wrote:
A RAS (Reliability, Availability, Serviceability) controller
may be a separate processor running in parallel with OS
execution, and may generate error records for consumption by
the OS. If the RAS controller produces multiple error records,
then they may be overwritten before the OS has consumed them.

The Generic Hardware Error Source (GHES) v2 structure
introduces the capability for the OS to acknowledge the
consumption of the error record generated by the RAS
controller. A RAS controller supporting GHESv2 shall wait for
the acknowledgment before writing a new error record, thus
eliminating the race condition.

Add support for parsing of GHESv2 sub-tables as well.

Signed-off-by: Tyler Baicar <tbaicar@xxxxxxxxxxxxxx>
CC: Jonathan (Zhixiong) Zhang <zjzhang@xxxxxxxxxxxxxx>
Reviewed-by: James Morse <james.morse@xxxxxxx>
---
drivers/acpi/apei/ghes.c | 59 +++++++++++++++++++++++++++++++++++++++++++++---
drivers/acpi/apei/hest.c | 7 ++++--
include/acpi/ghes.h | 5 +++-
3 files changed, 65 insertions(+), 6 deletions(-)
static int ghes_proc(struct ghes *ghes)
{
int rc;
@@ -661,6 +704,16 @@ static int ghes_proc(struct ghes *ghes)
ghes_estatus_cache_add(ghes->generic, ghes->estatus);
}
ghes_do_proc(ghes, ghes->estatus);
+
+ /*
+ * GHESv2 type HEST entries introduce support for error acknowledgment,
+ * so only acknowledge the error if this support is present.
+ */
+ if (is_hest_type_generic_v2(ghes)) {
+ rc = ghes_ack_error(ghes->generic_v2);
+ if (rc)
+ return rc;
+ }
out:
ghes_clear_estatus(ghes);
return rc;
was there any specific reason why the ack is sent before clearing the
block status? Spec says the ack should be sent at last.

Also, the block is never cleared if ghes_ack_error() returns an error.
IMO we should fall through and clear the block status (this will
change anyway if the bloc status is cleared first).
Hello Robert,

Thank you for pointing this out. I will send a patch to move the ack after the ghes_clear_estatus. This is probably the right thing to do since right now if the FW populates an invalid estatus, we will fail to read the estatus, jump to 'out:', and never send the ack.

Thanks,
Tyler

--
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.