Re: [RFC PATCH v1 0/4] mm, hwpoison: improve handling workload related to hugetlb and memory_hotplug

From: David Hildenbrand
Date: Thu Apr 28 2022 - 04:47:01 EST

>> 2) It happens rarely (ever?), so do we even care?
> I'm not certain of the rarity. Some cloud service providers who maintain
> lots of servers may care?

About replacing broken DIMMs? I'm not so sure, especially because it
requires a special setup with ZONE_MOVABLE (i.e., movablecore) to be
somewhat reliable and individual DIMMs can usually not get replaced at all.

>> 3) Once the memory is offline, we can re-online it and lost HWPoison.
>> The memory can be happily used.
>> 3) can happen easily if our DIMM consists of multiple memory blocks and
>> offlining of some memory block fails -> we'll re-online all already
>> offlined ones. We'll happily reuse previously HWPoisoned pages, which
>> feels more dangerous to me then just leaving the DIMM around (and
>> eventually hwpoisoning all pages on it such that it won't get used
>> anymore?).
> I see. This scenario can often happen.
>> So maybe we should just fail offlining once we stumble over a hwpoisoned
>> page?
> That could be one choice.
> Maybe another is like this: offlining can succeed but HWPoison flags are
> kept over offline-reonline operations. If the system noticed that the
> re-onlined blocks are backed by the original DIMMs or NUMA nodes, then the
> saved HWPoison flags are still effective, so keep using them. If the
> re-onlined blocks are backed by replaced DIMMs/NUMA nodes, then we can clear
> all HWPoison flags associated with replaced physical address range. This
> can be done automatically in re-onlining if there's a way for kernel to know
> whether DIMM/NUMA nodes are replaced with new ones. But if there isn't,
> system applications have to check the HW and explicitly reset the HWPoison
> flags.

Offline memory sections have a stale memmap, so there is no trusting on
that. And trying to work around that or adjusting memory onlining code
overcomplicates something we really don't care about supporting.

So if we continue allowing offlining memory blocks with poisoned pages,
we could simply remember that that memory block had any posioned page
(either for the memory section or maybe better for the whole memory
block). We can then simply reject/fail memory onlining of these memory

So that leaves us with either

1) Fail offlining -> no need to care about reonlining
2) Succeed offlining but fail re-onlining


David / dhildenb