RE: [RFC PATCH] mm: memory-failure: add soft-offline stat in mf_stats
From: Tomohiro Misono (Fujitsu)
Date: Tue Dec 10 2024 - 03:48:06 EST
> >>> And
> >>> 1. total = recovered + ignored + failed + delayed
> >>> 2. recovered = soft_offline + hard_offline
> >> Do you mean mf_stats now have 7 entries in sysfs?
> >> (total, ignored, failed, delayed, recovered, hard_offline, soft_offline, then recovered = hard_offline + soft_offline)
> >> Or 6 entries ? (in that case, hard_offline = recovered - soft_offline)
> >> It might be simpler for users to understand if total is just the sum of the other entries, as in this RFC,
> >> but I'd like to know other opinions.
> > Would it be better to have the items below?
> > "
> > total
> > ignored
> > failed
> > delayed
> > hard_offline
> > soft_offline
> > "
>
> The existing "ignored, failed, delayed, recovered" apply to UEs while
> "soft_offline" applies to CE. The difference between UE and CE is that
> even a recovered UE page has PG_hwpoison set, but a soft offlined page
> does not and thus could be re-deployed.
Hi, thanks for your comments.
If I understand correctly, PG_hwpoison is also set on a soft-offlined page (so the
page is counted in HardwareCorrupted too):
https://github.com/torvalds/linux/blob/v6.13-rc2/mm/memory-failure.c#L206
Also, unpoison works, but it is only available via debugfs through the hwpoison-inject module.
Is this correct?
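For anyone who wants to check this on a test machine, below is a minimal
userspace sketch that reads the HardwareCorrupted field from /proc/meminfo.
The helper names are my own (hypothetical) and are not from the kernel tree.
Comparing the value before and after soft-offlining a page should show
whether the page is counted.

```c
#include <stdio.h>

/* Try to parse one /proc/meminfo line as the HardwareCorrupted field.
 * Returns 1 and stores the value (in kB) on a match, 0 otherwise. */
static int parse_hw_corrupted_line(const char *line, unsigned long *kb)
{
	return sscanf(line, "HardwareCorrupted: %lu kB", kb) == 1;
}

/* Scan a meminfo-style file for the field. Returns 0 on success, -1 if
 * the file cannot be opened or the field is absent. */
static int read_hw_corrupted_kb(const char *path, unsigned long *kb)
{
	char line[256];
	FILE *fp = fopen(path, "r");
	int ret = -1;

	if (!fp)
		return -1;
	while (fgets(line, sizeof(line), fp)) {
		if (parse_hw_corrupted_line(line, kb)) {
			ret = 0;
			break;
		}
	}
	fclose(fp);
	return ret;
}
```

Calling read_hw_corrupted_kb("/proc/meminfo", &kb) once before and once after
the soft offline gives the two values to compare.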
>
> So if we want to flag CE pages, they seem to belong to a different
> category, something like -
>
> /sys/devices/system/node/node0/memory_failure/Uncorrected/{ignored, delayed, failed, recovered}
> /sys/devices/system/node/node0/memory_failure/Corrected/{offlined}
This makes sense. But as I stated in another thread, I don't think we can change the
current I/F for "Uncorrected". Is it worth creating only the "Corrected" dir?
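To illustrate what userspace sees with the current I/F, here is a rough
sketch that reads the per-node counters and checks that total equals the sum
of the others. It assumes the memory_failure directory contains total,
ignored, failed, delayed, and recovered. The struct and function names are
hypothetical, and no soft_offline entry is read because it is only proposed.

```c
#include <stdio.h>

/* Counters assumed to exist under .../nodeN/memory_failure/ today. */
struct mf_stats {
	unsigned long total, ignored, failed, delayed, recovered;
};

/* Read one counter file such as <dir>/total. Returns 0 on success. */
static int read_counter(const char *dir, const char *name, unsigned long *val)
{
	char path[512];
	FILE *fp;
	int ok;

	snprintf(path, sizeof(path), "%s/%s", dir, name);
	fp = fopen(path, "r");
	if (!fp)
		return -1;
	ok = fscanf(fp, "%lu", val) == 1;
	fclose(fp);
	return ok ? 0 : -1;
}

/* Fill *s from the given directory. Returns 0 only if every file was read. */
static int read_mf_stats(const char *dir, struct mf_stats *s)
{
	if (read_counter(dir, "total", &s->total) ||
	    read_counter(dir, "ignored", &s->ignored) ||
	    read_counter(dir, "failed", &s->failed) ||
	    read_counter(dir, "delayed", &s->delayed) ||
	    read_counter(dir, "recovered", &s->recovered))
		return -1;
	return 0;
}

/* Nonzero if total is exactly the sum of the other four counters. */
static int mf_stats_total_is_sum(const struct mf_stats *s)
{
	return s->total == s->ignored + s->failed + s->delayed + s->recovered;
}
```

With a separate "Corrected" dir, a check like this would keep working
unchanged, since the existing five files would not move.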
Regards
Tomohiro Misono