[PATCH] acpi/nfit: badrange report spill over to clean range

From: Jane Chu
Date: Mon Jul 11 2022 - 19:27:47 EST


Commit 7917f9cdb503 ("acpi/nfit: rely on mce->misc to determine poison
granularity") changed the nfit_handle_mce() callback to report a badrange for
each poison at the alignment indicated by 1UL << MCI_MISC_ADDR_LSB(mce->misc)
instead of the hardcoded L1_CACHE_BYTES. However, on a server populated with
Intel DCPMEM v2 DIMMs, 1UL << MCI_MISC_ADDR_LSB(mce->misc) recently turned out
to be 4KiB, i.e. eight 512-byte blocks. Consequently, injecting 2 back-to-back
poisons via ndctl results in 8 poisons being reported.

[29076.590281] {3}[Hardware Error]: physical_address: 0x00000040a0602400
[..]
[29076.619447] Memory failure: 0x40a0602: recovery action for dax page: Recovered
[29076.627519] mce: [Hardware Error]: Machine check events logged
[29076.634033] nfit ACPI0012:00: addr in SPA 1 (0x4080000000, 0x1f80000000)
[29076.648805] nd_bus ndbus0: XXX nvdimm_bus_add_badrange: (0x40a0602000, 0x1000)
[..]
[29078.634817] {4}[Hardware Error]: physical_address: 0x00000040a0602600
[..]
[29079.595327] nfit ACPI0012:00: addr in SPA 1 (0x4080000000, 0x1f80000000)
[29079.610106] nd_bus ndbus0: XXX nvdimm_bus_add_badrange: (0x40a0602000, 0x1000)
[..]
{
  "dev":"namespace0.0",
  "mode":"fsdax",
  "map":"dev",
  "size":33820770304,
  "uuid":"a1b0f07f-747f-40a8-bcd4-de1560a1ef75",
  "sector_size":512,
  "align":2097152,
  "blockdev":"pmem0",
  "badblock_count":8,
  "badblocks":[
    {
      "offset":8208,
      "length":8,
      "dimms":[
        "nmem0"
      ]
    }
  ]
}
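
The arithmetic behind the eight reported blocks can be sketched in user-space
C as follows. This is only an illustration, not kernel code: misc_addr_lsb()
mirrors the kernel's MCI_MISC_ADDR_LSB() macro (the low six bits of the MISC
register), while align_down() and sectors_covered() are hypothetical helpers.
An LSB value of 12 is assumed here because it corresponds to the 4KiB
alignment seen in the log above.

```c
#include <stdint.h>

#define SECTOR_BYTES 512ULL	/* badblock granularity, per "sector_size":512 */

/*
 * Mirrors MCI_MISC_ADDR_LSB(): the low 6 bits of the MISC register give
 * the least significant valid bit of the reported address.
 */
static inline unsigned int misc_addr_lsb(uint64_t misc)
{
	return misc & 0x3f;
}

/* Round addr down to a power-of-two boundary (hypothetical helper). */
static inline uint64_t align_down(uint64_t addr, uint64_t align)
{
	return addr & ~(align - 1);
}

/* Number of 512-byte sectors spanned by a badrange of len bytes. */
static inline uint64_t sectors_covered(uint64_t len)
{
	return len / SECTOR_BYTES;
}
```

With an LSB of 12, both injected addresses (0x40a0602400 and 0x40a0602600)
align down to 0x40a0602000 with a length of 0x1000, which spans eight 512-byte
sectors and matches the badblock_count of 8.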

So 1UL << MCI_MISC_ADDR_LSB(mce->misc) is an unreliable indicator of the poison
radius and shouldn't be used. Moreover, as each injected poison is reported
independently, any alignment of 512 bytes or less appears to work:
L1_CACHE_BYTES (though inaccurate), 256 bytes (as ars->length reports),
or 512 bytes.

To get around this issue, 512 bytes is chosen as the alignment because
a. it happens to be the badblock granularity,
b. ndctl inject-error cannot inject more than one poison into a 512-byte block,
c. it is architecture agnostic.
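
As a rough user-space sketch of the effect of a 512-byte alignment (not the
kernel implementation): struct badrange and sector_badrange() below are
hypothetical names, and SECTOR_BYTES stands in for the kernel's SECTOR_SIZE.
Each poisoned address maps to exactly one sector-sized range.

```c
#include <stdint.h>

#define SECTOR_BYTES 512ULL	/* stand-in for the kernel's SECTOR_SIZE */

/* Hypothetical container for a (start, length) bad range. */
struct badrange {
	uint64_t start;
	uint64_t len;
};

/* Map a poisoned address to the single 512-byte sector containing it. */
static inline struct badrange sector_badrange(uint64_t addr)
{
	struct badrange br = {
		.start = addr & ~(SECTOR_BYTES - 1),	/* round down */
		.len = SECTOR_BYTES,
	};
	return br;
}
```

For the two addresses in the log above, this yields (0x40a0602400, 0x200) and
(0x40a0602600, 0x200): two distinct, non-overlapping sectors instead of one
shared 4KiB range.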

Fixes: 7917f9cdb503 ("acpi/nfit: rely on mce->misc to determine poison granularity")
Signed-off-by: Jane Chu <jane.chu@xxxxxxxxxx>
---
drivers/acpi/nfit/mce.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/acpi/nfit/mce.c b/drivers/acpi/nfit/mce.c
index d48a388b796e..eeacc8eb807f 100644
--- a/drivers/acpi/nfit/mce.c
+++ b/drivers/acpi/nfit/mce.c
@@ -32,7 +32,6 @@ static int nfit_handle_mce(struct notifier_block *nb, unsigned long val,
 	 */
 	mutex_lock(&acpi_desc_lock);
 	list_for_each_entry(acpi_desc, &acpi_descs, list) {
-		unsigned int align = 1UL << MCI_MISC_ADDR_LSB(mce->misc);
 		struct device *dev = acpi_desc->dev;
 		int found_match = 0;

@@ -64,7 +63,8 @@ static int nfit_handle_mce(struct notifier_block *nb, unsigned long val,

 		/* If this fails due to an -ENOMEM, there is little we can do */
 		nvdimm_bus_add_badrange(acpi_desc->nvdimm_bus,
-				ALIGN_DOWN(mce->addr, align), align);
+				ALIGN_DOWN(mce->addr, SECTOR_SIZE),
+				SECTOR_SIZE);
 		nvdimm_region_notify(nfit_spa->nd_region,
 				NVDIMM_REVALIDATE_POISON);


base-commit: e35e5b6f695d241ffb1d223207da58a1fbcdff4b
--
2.18.4