Re: [RFC PATCH] mm: memory-failure: add soft-offline stat in mf_stats

From: jane . chu
Date: Fri Dec 06 2024 - 19:18:07 EST


And
1. total = recovered + ignored + failed + delayed
2. recovered = soft_offline + hard_offline
Do you mean mf_stats now has 7 entries in sysfs?
(total, ignored, failed, delayed, recovered, hard_offline, soft_offline, where recovered = hard_offline + soft_offline)
Or 6 entries? (In that case, hard_offline = recovered - soft_offline.)
It might be simpler for users to understand if total is just the sum of the other entries, as in this RFC,
but I'd like to know other opinions.
Would it be better to have the items below?
"
total
ignored
failed
delayed
hard_offline
soft_offline
"
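
To make the accounting options above concrete, here is a minimal sketch in Python that compares the 7-entry variant (with a derived "recovered") against the flat list just quoted. The counter values are made up for illustration, and the helper names are hypothetical; nothing here reflects real kernel output or code.

```python
# Hypothetical raw counters, for illustration only (not real kernel output).
raw = {"ignored": 1, "failed": 2, "delayed": 0,
       "hard_offline": 3, "soft_offline": 4}

def seven_entry_view(s):
    """7-entry variant: recovered = hard_offline + soft_offline,
    total = recovered + ignored + failed + delayed."""
    recovered = s["hard_offline"] + s["soft_offline"]
    total = recovered + s["ignored"] + s["failed"] + s["delayed"]
    return {"total": total, "recovered": recovered, **s}

def flat_view(s):
    """Flat variant: no 'recovered' entry; total is the plain sum
    of the five other entries."""
    return {"total": sum(s.values()), **s}

seven = seven_entry_view(raw)
flat = flat_view(raw)
```

Both views report the same total; the only difference is whether "recovered" is exposed as its own entry or left for the user to derive.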

The existing "ignored, failed, delayed, recovered" counters apply to uncorrected errors (UEs), while "soft_offline" applies to corrected errors (CEs). The difference between the two is that even a recovered UE page has PG_hwpoison set, whereas a soft-offlined page does not and thus could be re-deployed.

So if we want to flag CE pages, they seem to belong to a different category, something like:

/sys/devices/system/node/node0/memory_failure/Uncorrected/{ignored, delayed, failed, recovered}
/sys/devices/system/node/node0/memory_failure/Corrected/{offlined}
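
As a rough illustration of how a userspace tool might consume such a split layout, here is a minimal Python sketch. It assumes the proposed per-category subdirectories exist and that each file holds a single integer; the layout is only a suggestion in this message, and the function name read_mf_counters is hypothetical.

```python
import os

def read_mf_counters(base):
    """Read every counter file under base/<category>/<name>.

    Assumes the layout proposed above (e.g. Uncorrected/, Corrected/),
    with one integer per file. Returns {category: {name: value}}.
    """
    result = {}
    for category in sorted(os.listdir(base)):
        cat_dir = os.path.join(base, category)
        if not os.path.isdir(cat_dir):
            continue  # skip any plain files at the top level
        counters = {}
        for name in sorted(os.listdir(cat_dir)):
            with open(os.path.join(cat_dir, name)) as f:
                counters[name] = int(f.read().strip())
        result[category] = counters
    return result
```

A caller would pass the node's memory_failure directory as base and sum result["Uncorrected"].values() to get the UE total, keeping CE counts separate.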

Thanks,

-jane


though this will break the previous interface.
Any thoughts?

Thanks.