Re: [PATCH] acpi/nfit: badrange report spill over to clean range

From: Jane Chu
Date: Fri Jul 15 2022 - 13:38:38 EST

In reply to: Dan Williams: "Re: [PATCH] acpi/nfit: badrange report spill over to clean range"
Next in thread: Dan Williams: "Re: [PATCH] acpi/nfit: badrange report spill over to clean range"

On 7/14/2022 5:58 PM, Dan Williams wrote:
[..]
>>>
>>>>> However, the ARS engine likely can return the precise error ranges so I
>>>>> think the fix is to just use the address range indicated by 1UL <<
>>>>> MCI_MISC_ADDR_LSB(mce->misc) to filter the results from a short ARS
>>>>> scrub request to ask the device for the precise error list.
>>>>
>>>> You mean for nfit_handle_mce() callback to issue a short ARS per each
>>>> poison report over a 4K range
>>>
>>> Over a L1_CACHE_BYTES range...
>>>
[..]
>>>
>>> For the badrange tracking, no. So this would just be a check to say
>>> "Yes, CPU, I see you think the whole 4K is gone, but let's double check
>>> with more precise information for what gets placed in the badrange
>>> tracking".
>>
>> Okay, process-wise, this is what I am seeing -
>>
>> - for each poison, nfit_handle_mce() issues a short ARS given (addr,
>> 64bytes)
>
> Why would the short-ARS be performed over a 64-byte span when the MCE
> reported a 4K aligned event?

Cuz you said so, see above. :) Yes, 4K range as reported by the MCE
makes sense.
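
To make the range arithmetic concrete, here is a minimal userspace sketch
(not the driver code) of deriving the span that an MCE covers from its
address and the least-significant-valid-bit field, i.e. the value the
kernel reads with MCI_MISC_ADDR_LSB(mce->misc). The names struct span and
mce_span() are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type: a physical address range. */
struct span {
	uint64_t start;
	uint64_t len;
};

/*
 * lsb stands in for MCI_MISC_ADDR_LSB(mce->misc): the number of low
 * address bits that are not valid in the reported address.
 * The span is 1 << lsb bytes long and aligned to its own size.
 */
static struct span mce_span(uint64_t addr, unsigned int lsb)
{
	assert(lsb < 64);

	uint64_t len = UINT64_C(1) << lsb;
	struct span s = { addr & ~(len - 1), len };

	return s;
}
```

With lsb = 12 the span is the 4K-aligned page containing the address; with
lsb = 6 it is a single 64-byte cache line.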

>
>> - and short ARS returns to say that's actually (addr, 256bytes),
>> - and then nvdimm_bus_add_badrange() logs the poison in (addr, 512bytes)
>> anyway.
>
> Right, I am reacting to the fact that the patch is picking 512 as an
> arbitrary blast radius. It's ok to expand the blast radius from
> hardware when, for example, recording a 64-byte MCE in badrange which
> only understands 512 byte records, but it's not ok to take a 4K MCE and
> trim it to 512 bytes without asking hardware for a more precise report.

Agreed.
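
As I read it, the rule is: expanding outward to the 512-byte badrange
granularity is fine, trimming below what the MCE reported is only
allowed when a short ARS supplies a more precise range. A rough userspace
sketch of that policy follows. All names here (struct span,
expand_to_granule(), pick_badrange()) are hypothetical, and falling back
to the full MCE span when the ARS result does not overlap it is my
assumption, not something settled in this thread.

```c
#include <assert.h>
#include <stdint.h>

#define BADRANGE_GRANULE 512u	/* badrange records are 512 bytes, per the thread */

/* Hypothetical helper type: a physical address range. */
struct span {
	uint64_t start;
	uint64_t len;
};

static uint64_t round_down(uint64_t x)
{
	return x & ~(uint64_t)(BADRANGE_GRANULE - 1);
}

static uint64_t round_up(uint64_t x)
{
	return round_down(x + BADRANGE_GRANULE - 1);
}

/* Grow a span outward so both ends sit on a 512-byte boundary. */
static struct span expand_to_granule(struct span s)
{
	assert(s.len > 0);

	uint64_t start = round_down(s.start);
	uint64_t end = round_up(s.start + s.len);

	return (struct span){ start, end - start };
}

/*
 * mce: span reported by the machine check.
 * ars: precise span from a short ARS, or NULL if ARS is unavailable.
 * Only an overlapping ARS result may narrow the MCE span; otherwise the
 * full MCE span is kept so that poison is never under-recorded.
 */
static struct span pick_badrange(struct span mce, const struct span *ars)
{
	struct span pick = mce;

	if (ars) {
		uint64_t start = ars->start > mce.start ? ars->start : mce.start;
		uint64_t ars_end = ars->start + ars->len;
		uint64_t mce_end = mce.start + mce.len;
		uint64_t end = ars_end < mce_end ? ars_end : mce_end;

		if (start < end) {
			pick.start = start;
			pick.len = end - start;
		}
	}

	return expand_to_granule(pick);
}
```

So a 4K MCE with no ARS answer stays 4K, a 4K MCE with a 256-byte ARS
answer is recorded as the enclosing 512 bytes, and a 64-byte MCE is
expanded to 512 bytes.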

>
> Recall that the NFIT driver supports platforms that may not offer ARS.
> In that case the 4K MCE from the CPU is all that the driver gets and
> there is no data source for a more precise answer.
>
> So the ask is to avoid trimming the blast radius of MCE reports unless
> and until a short-ARS says otherwise.
>

What happens to a short ARS request on a platform that doesn't support
ARS? Does it fail with -EOPNOTSUPP?

>> The precise badrange from short ARS is lost in the process, given the
>> time spent visiting the BIOS, what's the gain?
>
> Generic support for not under-recording poison on platforms that do not
> support ARS.
>
>> Could we defer the precise badrange until there is consumer of the
>> information?
>
> Ideally the consumer is immediate and this precise information can make
> it to the filesystem which might be able to make a better decision about
> what data got clobbered.
>
> See dax_notify_failure() infrastructure currently in linux-next that can
> convey poison events to filesystems. That might be a path to start
> tracking and reporting precise failure information to address the
> constraints of the badrange implementation.

Yes, I'm aware of dax_notify_failure(), but I would appreciate it if you
could elaborate on how that code path could be leveraged for a precise
badrange implementation.
My understanding is that dax_notify_failure() is in the path of a
synchronous fault accompanied by SIGBUS with BUS_MCEERR_AR.
But a badrange could be recorded without the poison being consumed, even
without a DAX filesystem in the picture.

thanks,
-jane
