Re: [PATCH] acpi/nfit: badrange report spill over to clean range

From: Jane Chu
Date: Thu Jul 14 2022 - 19:22:52 EST


On 7/13/2022 5:24 PM, Dan Williams wrote:
> Jane Chu wrote:
>> On 7/12/2022 5:48 PM, Dan Williams wrote:
>>> Jane Chu wrote:
>>>> Commit 7917f9cdb503 ("acpi/nfit: rely on mce->misc to determine poison
>>>> granularity") changed nfit_handle_mce() callback to report badrange for
>>>> each poison at an alignment indicated by 1ULL << MCI_MISC_ADDR_LSB(mce->misc)
>>>> instead of the hardcoded L1_CACHE_BYTES. However recently on a server
>>>> populated with Intel DCPMEM v2 dimms, it appears that
>>>> 1UL << MCI_MISC_ADDR_LSB(mce->misc) turns out is 4KiB, or 8 512-byte blocks.
>>>> Consequently, injecting 2 back-to-back poisons via ndctl, and it reports
>>>> 8 poisons.
>>>>
>>>> [29076.590281] {3}[Hardware Error]: physical_address: 0x00000040a0602400
>>>> [..]
>>>> [29076.619447] Memory failure: 0x40a0602: recovery action for dax page: Recovered
>>>> [29076.627519] mce: [Hardware Error]: Machine check events logged
>>>> [29076.634033] nfit ACPI0012:00: addr in SPA 1 (0x4080000000, 0x1f80000000)
>>>> [29076.648805] nd_bus ndbus0: XXX nvdimm_bus_add_badrange: (0x40a0602000, 0x1000)
>>>> [..]
>>>> [29078.634817] {4}[Hardware Error]: physical_address: 0x00000040a0602600
>>>> [..]
>>>> [29079.595327] nfit ACPI0012:00: addr in SPA 1 (0x4080000000, 0x1f80000000)
>>>> [29079.610106] nd_bus ndbus0: XXX nvdimm_bus_add_badrange: (0x40a0602000, 0x1000)
>>>> [..]
>>>> {
>>>> "dev":"namespace0.0",
>>>> "mode":"fsdax",
>>>> "map":"dev",
>>>> "size":33820770304,
>>>> "uuid":"a1b0f07f-747f-40a8-bcd4-de1560a1ef75",
>>>> "sector_size":512,
>>>> "align":2097152,
>>>> "blockdev":"pmem0",
>>>> "badblock_count":8,
>>>> "badblocks":[
>>>> {
>>>> "offset":8208,
>>>> "length":8,
>>>> "dimms":[
>>>> "nmem0"
>>>> ]
>>>> }
>>>> ]
>>>> }
>>>>
>>>> So, 1UL << MCI_MISC_ADDR_LSB(mce->misc) is an unreliable indicator for poison
>>>> radius and shouldn't be used. More over, as each injected poison is being
>>>> reported independently, any alignment under 512-byte appear works:
>>>> L1_CACHE_BYTES (though inaccurate), or 256-bytes (as ars->length reports),
>>>> or 512-byte.
>>>>
>>>> To get around this issue, 512-bytes is chosen as the alignment because
>>>> a. it happens to be the badblock granularity,
>>>> b. ndctl inject-error cannot inject more than one poison to a 512-byte block,
>>>> c. architecture agnostic
>>>
>>> I am failing to see the kernel bug? Yes, you injected less than 8
>>> "badblocks" of poison and the hardware reported 8 blocks of poison, but
>>> that's not the kernel's fault, that's the hardware. What happens when
>>> hardware really does detect 8 blocks of consective poison and this
>>> implementation decides to only record 1 at a time?
>>
>> In that case, there will be 8 reports of the poisons by APEI GHES,
>
> Why would there be 8 reports for just one poison consumption event?

I meant to say there would be 8 calls to the nfit_handle_mce() callback,
one call for each poison with accurate address.

Also, short ARS would find 2 poisons.

I attached the console output, my annotation is prefixed with "<==".

It is from these information I concluded that no poison will be missed
in terms of reporting.

>
>> ARC scan will also report 8 poisons, each will get to be added to the
>> bad range via nvdimm_bus_add_badrange(), so none of them will be missed.
>
> Right, that's what I'm saying about the proposed change, trim the
> reported poison by what is return from a "short" ARS. Recall that
> short-ARS just reads from a staging buffer that the BIOS knows about, it
> need not go all the way to hardware.

Okay, that confirms my understanding of your proposal. More below.

>
>> In the above 2 poison example, the poison in 0x00000040a0602400 and in
>> 0x00000040a0602600 were separately reported.
>
> Separately reported, each with a 4K alignment?

Yes, and so twice nfit_handle_mce() call
nvdimm_bus_add_badrange() with addr: 0x40a0602000, length: 0x1000
complete overlap.

>
>>> It seems the fix you want is for the hardware to report the precise
>>> error bounds and that 1UL << MCI_MISC_ADDR_LSB(mce->misc) does not have
>>> that precision in this case.
>>
>> That field describes a 4K range even for a single poison, it confuses
>> people unnecessarily.
>
> I agree with you on the problem statement, it's the fix where I have
> questions.
>
>>> However, the ARS engine likely can return the precise error ranges so I
>>> think the fix is to just use the address range indicated by 1UL <<
>>> MCI_MISC_ADDR_LSB(mce->misc) to filter the results from a short ARS
>>> scrub request to ask the device for the precise error list.
>>
>> You mean for nfit_handle_mce() callback to issue a short ARS per each
>> poison report over a 4K range
>
> Over a L1_CACHE_BYTES range...
>
>> in order to decide the precise range as a workaround of the hardware
>> issue? if there are 8 poisoned detected, there will be 8 short ARS,
>> sure we want to do that?
>
> Seems ok to me, short ARS is meant to be cheap. I would hope there are
> no latency concerns in this path.

Yeah, accumulated latency is the concern here. I know the upstream
user call stack has timeout for getting a response for certain
database transaction. And the test folks might inject dozens of
back-to-back poisons to adjacent pages...
>
>> also, for now, is it possible to log more than 1 poison per 512byte
>> block?
>
> For the badrange tracking, no. So this would just be a check to say
> "Yes, CPU I see you think the whole 4K is gone, but lets double check
> with more precise information for what gets placed in the badrange
> tracking".

Okay, process-wise, this is what I am seeing -

- for each poison, nfit_handle_mce() issues a short ARS given (addr,
64bytes)
- and short ARS returns to say that's actually (addr, 256bytes),
- and then nvdimm_bus_add_badrange() logs the poison in (addr, 512bytes)
anyway.

The precise badrange from short ARS is lost in the process, given the
time spent visiting the BIOS, what's the gain? Could we defer the
precise badrange until there is consumer of the information?

thanks!
-jane




# dmesg
# ndctl inject-error namespace0.0 -n 2 -B 8210
[29076.551909] {3}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
[29076.561136] {3}[Hardware Error]: event severity: recoverable
[29076.567453] {3}[Hardware Error]: Error 0, type: recoverable
[29076.573769] {3}[Hardware Error]: section_type: memory error
[29076.580182] {3}[Hardware Error]: error_status: Storage error in DRAM memory (0x0000000000000400)
[29076.590281] {3}[Hardware Error]: physical_address: 0x00000040a0602400 <== 1st poison @ 0x400
[29076.597664] {3}[Hardware Error]: physical_address_mask: 0xffffffffffffff00
[29076.605532] {3}[Hardware Error]: node:0 card:0 module:1
[29076.611655] {3}[Hardware Error]: error_type: 14, scrub uncorrected error
[29076.619447] Memory failure: 0x40a0602: recovery action for dax page: Recovered
[29076.627519] mce: [Hardware Error]: Machine check events logged
[29076.634033] nfit ACPI0012:00: addr in SPA 1 (0x4080000000, 0x1f80000000)
[29076.648805] nd_bus ndbus0: XXX nvdimm_bus_add_badrange: (0x40a0602000, 0x1000) <== 1st call to nfit_handle_mce
[29077.877682] EDAC MC0: 1 UE memory read error on CPU_SrcID#0_MC#0_Chan#0_DIMM#1 (channel:0 slot:1 page:0x40a0602 offset:0x400 grain:32 - err_code:0x0000:0x009f SystemAddress:0x40a0602400 DevicePhysicalAddress:0x30602400 ProcessorSocketId:0x0 MemoryControllerId:0x0 ChannelAddress:0x430602400 ChannelId:0x0 PhysicalRankId:0x0 DimmSlotId:0x1 ChipSelect:0x4)

[29078.596454] {4}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
[29078.605682] {4}[Hardware Error]: event severity: recoverable
[29078.611997] {4}[Hardware Error]: Error 0, type: recoverable
[29078.618313] {4}[Hardware Error]: section_type: memory error
[29078.624727] {4}[Hardware Error]: error_status: Storage error in DRAM memory (0x0000000000000400)
[29078.634817] {4}[Hardware Error]: physical_address: 0x00000040a0602600 <== 2nd poison @ 0x600
[29078.642200] {4}[Hardware Error]: physical_address_mask: 0xffffffffffffff00
[29078.650069] {4}[Hardware Error]: node:0 card:0 module:1
[29078.656184] {4}[Hardware Error]: error_type: 14, scrub uncorrected error
[29078.663944] Memory failure: 0x40a0602: recovery action for dax page: Recovered
[29078.672011] mce: [Hardware Error]: Machine check events logged
[29079.595327] nfit ACPI0012:00: XXX addr in SPA 1 (0x4080000000, 0x1f80000000)
[29079.603204] nfit ACPI0012:00: XXX new code nvdimm_bus_add_badrange
[29079.610106] nd_bus ndbus0: XXX nvdimm_bus_add_badrange: (0x40a0602000, 0x1000) <== 2nd call to nfit_handle_mce
[29079.949531] EDAC MC0: 1 UE memory read error on CPU_SrcID#0_MC#0_Chan#0_DIMM#1 (channel:0 slot:1 page:0x40a0602 offset:0x600 grain:32 - err_code:0x0000:0x009f SystemAddress:0x40a0602600 DevicePhysicalAddress:0x30602600 ProcessorSocketId:0x0 MemoryControllerId:0x0 ChannelAddress:0x430602600 ChannelId:0x0 PhysicalRankId:0x0 DimmSlotId:0x1 ChipSelect:0x4)

[29102.630372] nd_bus ndbus0: XXX nvdimm_bus_add_badrange: (0x40a0602400, 0x100) <== short ARS found 2 poisons
[29102.638341] nd_bus ndbus0: XXX nvdimm_bus_add_badrange: (0x40a0602600, 0x100)
{
"dev":"namespace0.0",
"mode":"fsdax",
"map":"dev",
"size":33820770304,
"uuid":"a1b0f07f-747f-40a8-bcd4-de1560a1ef75",
"sector_size":512,
"align":2097152,
"blockdev":"pmem0",
"badblock_count":8,
"badblocks":[
{
"offset":8208,
"length":8,
"dimms":[
"nmem0"
]
}
]
}