Re: [PATCH] Add support of NVDIMM memory error notification in ACPI 6.2

From: Kani, Toshimitsu
Date: Wed Jun 07 2017 - 16:58:07 EST


On Wed, 2017-06-07 at 12:09 -0700, Dan Williams wrote:
> On Wed, Jun 7, 2017 at 11:49 AM, Toshi Kani <toshi.kani@xxxxxxx>
> wrote:
:
> > +
> > +static void acpi_nfit_uc_error_notify(struct device *dev, acpi_handle handle)
> > +{
> > +        struct acpi_nfit_desc *acpi_desc = dev_get_drvdata(dev);
> > +
> > +        acpi_nfit_ars_rescan(acpi_desc);
>
> I wonder if we should gate re-scanning with a similar:
>
>     if (acpi_desc->scrub_mode == HW_ERROR_SCRUB_ON)
>
> ...check that we do in the mce notification case? Maybe not since we
> don't get an indication of where the error is without a rescan.

I think the MCE case is different, since the MCE handler already knows
where the new poison location is and can update the badblocks
information for it. Starting ARS there is an optional precaution.
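
To illustrate the difference between the two paths, here is a rough
userspace mock in C. It is not driver code: struct mock_nfit_desc, the
mock_* helpers, and HW_ERROR_SCRUB_OFF are made up for the illustration;
only HW_ERROR_SCRUB_ON and the scrub_mode field name come from this
thread. It assumes the MCE path records the bad block itself and rescans
only when scrubbing is enabled, while the notification path always
rescans because it has no address to work with.

```c
#include <assert.h>
#include <stddef.h>

/* Mock of the scrub setting; only HW_ERROR_SCRUB_ON appears in the thread. */
enum mock_scrub_mode { HW_ERROR_SCRUB_OFF, HW_ERROR_SCRUB_ON };

/* Hypothetical stand-in for struct acpi_nfit_desc, with counters for testing. */
struct mock_nfit_desc {
	enum mock_scrub_mode scrub_mode;
	unsigned int rescans;          /* number of ARS rescans requested */
	unsigned int badblock_updates; /* number of direct badblocks updates */
};

/* Stand-in for acpi_nfit_ars_rescan(): just counts the request. */
static void mock_ars_rescan(struct mock_nfit_desc *desc)
{
	desc->rescans++;
}

/*
 * MCE path: the poison address is known, so badblocks can be updated
 * directly. The rescan is an optional precaution, gated on scrub_mode.
 */
static void mock_mce_notify(struct mock_nfit_desc *desc,
			    unsigned long long addr)
{
	assert(desc != NULL);
	(void)addr; /* a real handler would record this range as bad */
	desc->badblock_updates++;
	if (desc->scrub_mode == HW_ERROR_SCRUB_ON)
		mock_ars_rescan(desc);
}

/*
 * Notification path: no address is delivered, so a rescan is the only
 * way to locate the new error. It is therefore not gated on scrub_mode.
 */
static void mock_uc_error_notify(struct mock_nfit_desc *desc)
{
	assert(desc != NULL);
	mock_ars_rescan(desc);
}
```

With scrub_mode off, the mock MCE path updates badblocks without a
rescan, while the mock notification path still triggers one.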

> However, at a minimum I think we need support for the new Start ARS
> flag ("If set to 1 the firmware shall return data from a previous
> scrub, if any, without starting a new scrub") and use that for this
> case.

That's an interesting idea. But I wonder how users would know whether
it is OK to set this flag, since it relies on BIOS implementation
behavior that is not described in ACPI...

> Another thing that seems to be missing in both this and the mce case
> is a notification to userspace that something changed. We have calls
> to sysfs_notify_dirent() to notify scrub completion events and DIMM
> health status change events, I think we need a similar notifier
> mechanism for new un-correctable errors.

Good point. I think this can be a badblocks population event, generated
whenever badblocks information is updated, both at boot time and at
run time via this notification and MCE.

Thanks,
-Toshi