Re: [PATCH] Add support of NVDIMM memory error notification in ACPI 6.2

From: Dan Williams
Date: Wed Jun 07 2017 - 15:09:45 EST


On Wed, Jun 7, 2017 at 11:49 AM, Toshi Kani <toshi.kani@xxxxxxx> wrote:
> ACPI 6.2 defines a new ACPI notification value to NVDIMM Root Device
> in Table 5-169.
>
> 0x81 Unconsumed Uncorrectable Memory Error Detected
> Used to pro-actively notify OSPM of uncorrectable memory errors
> detected (for example a memory scrubbing engine that continuously
> scans the NVDIMMs memory). This is an optional notification. Only
> locations that were mapped in to SPA by the platform will generate
> a notification.
>
> Add support of this notification value by initiating an ARS scan. This
> will find new error locations and add their badblocks information.
>
> Link: http://www.uefi.org/sites/default/files/resources/ACPI_6_2.pdf
> Signed-off-by: Toshi Kani <toshi.kani@xxxxxxx>
> Cc: Dan Williams <dan.j.williams@xxxxxxxxx>
> Cc: Rafael J. Wysocki <rjw@xxxxxxxxxxxxx>
> Cc: Vishal Verma <vishal.l.verma@xxxxxxxxx>
> ---
> drivers/acpi/nfit/core.c | 28 ++++++++++++++++++++++------
> drivers/acpi/nfit/nfit.h | 1 +
> 2 files changed, 23 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
> index 656acb5..cc22778 100644
> --- a/drivers/acpi/nfit/core.c
> +++ b/drivers/acpi/nfit/core.c
> @@ -2967,7 +2967,7 @@ static int acpi_nfit_remove(struct acpi_device *adev)
> return 0;
> }
>
> -void __acpi_nfit_notify(struct device *dev, acpi_handle handle, u32 event)
> +static void acpi_nfit_update_notify(struct device *dev, acpi_handle handle)
> {
> struct acpi_nfit_desc *acpi_desc = dev_get_drvdata(dev);
> struct acpi_buffer buf = { ACPI_ALLOCATE_BUFFER, NULL };
> @@ -2975,11 +2975,6 @@ void __acpi_nfit_notify(struct device *dev, acpi_handle handle, u32 event)
> acpi_status status;
> int ret;
>
> - dev_dbg(dev, "%s: event: %d\n", __func__, event);
> -
> - if (event != NFIT_NOTIFY_UPDATE)
> - return;
> -
> if (!dev->driver) {
> /* dev->driver may be null if we're being removed */
> dev_dbg(dev, "%s: no driver found for dev\n", __func__);
> @@ -3016,6 +3011,27 @@ void __acpi_nfit_notify(struct device *dev, acpi_handle handle, u32 event)
> dev_err(dev, "Invalid _FIT\n");
> kfree(buf.pointer);
> }
> +
> +static void acpi_nfit_uc_error_notify(struct device *dev, acpi_handle handle)
> +{
> + struct acpi_nfit_desc *acpi_desc = dev_get_drvdata(dev);
> +
> + acpi_nfit_ars_rescan(acpi_desc);

I wonder if we should gate re-scanning with a similar:

if (acpi_desc->scrub_mode == HW_ERROR_SCRUB_ON)

...check that we do in the mce notification case? Maybe not since we
don't get an indication of where the error is without a rescan.
However, at a minimum I think we need support for the new Start ARS
flag ("If set to 1 the firmware shall return data from a previous
scrub, if any, without starting a new scrub") and use that for this
case.

Another thing that seems to be missing in both this and the mce case
is a notification to userspace that something changed. We have calls
to sysfs_notify_dirent() to notify scrub completion events and DIMM
health status change events, I think we need a similar notifier
mechanism for new un-correctable errors.