Re: [PATCH] mm: Teach pfn_to_online_page() about ZONE_DEVICE section collisions

From: David Hildenbrand
Date: Wed Jan 06 2021 - 04:57:55 EST


On 06.01.21 05:07, Dan Williams wrote:
> While pfn_to_online_page() is able to determine pfn_valid() at
> subsection granularity it is not able to reliably determine if a given
> pfn is also online if the section is mixed with ZONE_DEVICE pfns.
>
> Update move_pfn_range_to_zone() to flag (SECTION_TAINT_ZONE_DEVICE) a
> section that mixes ZONE_DEVICE pfns with other online pfns. With
> SECTION_TAINT_ZONE_DEVICE to delineate, pfn_to_online_page() can fall
> back to a slow-path check for ZONE_DEVICE pfns in an online section.
>
> With this implementation of pfn_to_online_page() pfn-walkers mostly only
> need to check section metadata to determine pfn validity. In the
> rare case of mixed-zone sections the pfn-walker will skip offline
> ZONE_DEVICE pfns as expected.
>
> Other notes:
>
> Because the collision case is rare, and for simplicity, the
> SECTION_TAINT_ZONE_DEVICE flag is never cleared once set.
>
> pfn_to_online_page() was already borderline too large to be a macro /
> inline function, but the additional logic certainly pushed it over that
> threshold, and so it is moved to an out-of-line helper.
>
> Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
> Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> Reported-by: Michal Hocko <mhocko@xxxxxxxx>
> Reported-by: David Hildenbrand <david@xxxxxxxxxx>
> Signed-off-by: Dan Williams <dan.j.williams@xxxxxxxxx>

[...]

> +#define SECTION_MARKED_PRESENT (1UL<<0)
> +#define SECTION_HAS_MEM_MAP (1UL<<1)
> +#define SECTION_IS_ONLINE (1UL<<2)
> +#define SECTION_IS_EARLY (1UL<<3)
> +#define SECTION_TAINT_ZONE_DEVICE (1UL<<4)
> +#define SECTION_MAP_LAST_BIT (1UL<<5)
> +#define SECTION_MAP_MASK (~(SECTION_MAP_LAST_BIT-1))
> +#define SECTION_NID_SHIFT 3
>
> static inline struct page *__section_mem_map_addr(struct mem_section *section)
> {
> @@ -1318,6 +1319,13 @@ static inline int online_section(struct mem_section *section)
> return (section && (section->section_mem_map & SECTION_IS_ONLINE));
> }
>
> +static inline int online_device_section(struct mem_section *section)
> +{
> + unsigned long flags = SECTION_IS_ONLINE | SECTION_TAINT_ZONE_DEVICE;
> +
> + return section && ((section->section_mem_map & flags) == flags);
> +}
> +
> static inline int online_section_nr(unsigned long nr)
> {
> return online_section(__nr_to_section(nr));
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index f9d57b9be8c7..9f36968e6188 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -300,6 +300,47 @@ static int check_hotplug_memory_addressable(unsigned long pfn,
> return 0;
> }
>
> +/*
> + * Return page for the valid pfn only if the page is online. All pfn
> + * walkers which rely on the fully initialized page->flags and others
> + * should use this rather than pfn_valid && pfn_to_page
> + */
> +struct page *pfn_to_online_page(unsigned long pfn)
> +{
> + unsigned long nr = pfn_to_section_nr(pfn);
> + struct dev_pagemap *pgmap;
> + struct mem_section *ms;
> +
> + if (nr >= NR_MEM_SECTIONS)
> + return NULL;
> +
> + ms = __nr_to_section(nr);
> +
> + if (!online_section(ms))
> + return NULL;
> +
> + if (!pfn_valid_within(pfn))
> + return NULL;
> +
> + if (!online_device_section(ms))
> + return pfn_to_page(pfn);
> +
> + /*
> + * Slowpath: when ZONE_DEVICE collides with
> + * ZONE_{NORMAL,MOVABLE} within the same section some pfns in
> + * the section may be 'offline' but 'valid'. Only
> + * get_dev_pagemap() can determine sub-section online status.
> + */
> + pgmap = get_dev_pagemap(pfn, NULL);
> + put_dev_pagemap(pgmap);
> +
> + /* The presence of a pgmap indicates ZONE_DEVICE offline pfn */
> + if (pgmap)
> + return NULL;
> + return pfn_to_page(pfn);
> +}
> +EXPORT_SYMBOL_GPL(pfn_to_online_page);

Note that this is not sufficient in the general case. I already
mentioned that we effectively override an already initialized memmap.

---

[ SECTION ]
Before:
[ ZONE_NORMAL ][ Hole ]

The hole has some node/zone (currently 0/0, discussions ongoing on how
to optimize that to e.g., ZONE_NORMAL in this example) and is
PG_reserved - looks like an ordinary memory hole.

After memremap:
[ ZONE_NORMAL ][ ZONE_DEVICE ]

The already initialized memmap was converted to ZONE_DEVICE. Your
slowpath will work.

After memunmap (no poisoning):
[ ZONE_NORMAL ][ ZONE_DEVICE ]

The slow path is no longer working. pfn_to_online_page() might return
something that is ZONE_DEVICE.

After memunmap (poisoning):
[ ZONE_NORMAL ][ POISONED ]

The slow path is no longer working. pfn_to_online_page() might return
something that will BUG_ON via page_to_nid() etc.

---

The reason is that pfn_to_online_page() does not care about sub-sections.
And until now, it didn't have to. If there was an online section, it was either

a) Completely present. The whole memmap is initialized to sane values.
b) Partially present. The whole memmap is initialized to sane values.

memremap/memunmap messes with case b)

We'll have to further tweak pfn_to_online_page(). You'll have to also
check pfn_section_valid() *at least* on the slow path. Less hacky would
be checking it also in the "somewhat-faster" path - that would cover
silently overriding a memmap that's visible via pfn_to_online_page().
Might slow things down a bit.
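
To illustrate the memunmap case, here is a rough userspace model, not
kernel code. Every name prefixed with model_ is hypothetical, the
subsection sizes are toy values, and it assumes that memunmap clears the
subsection bit and drops the pgmap while the taint flag stays set. The
extra check is placed ahead of the taint test, so it covers both the
slow path and the "somewhat-faster" path.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy geometry, not the kernel's real values. */
#define MODEL_SUBSECTIONS         4UL
#define MODEL_PFNS_PER_SUBSECTION 8UL

struct model_section {
	bool online;                  /* stands in for SECTION_IS_ONLINE */
	bool taint_zone_device;       /* stands in for SECTION_TAINT_ZONE_DEVICE */
	unsigned long subsection_map; /* bit i set: subsection i is present */
	unsigned long pgmap_map;      /* bit i set: a live pgmap covers subsection i */
};

static unsigned long model_subsection_index(unsigned long pfn)
{
	return (pfn / MODEL_PFNS_PER_SUBSECTION) % MODEL_SUBSECTIONS;
}

/* Models the idea behind pfn_section_valid(): is the subsection present? */
static bool model_pfn_section_valid(const struct model_section *ms,
				    unsigned long pfn)
{
	return (ms->subsection_map & (1UL << model_subsection_index(pfn))) != 0;
}

/* Models a get_dev_pagemap() lookup that finds a live pgmap for this pfn. */
static bool model_has_pgmap(const struct model_section *ms, unsigned long pfn)
{
	return (ms->pgmap_map & (1UL << model_subsection_index(pfn))) != 0;
}

/*
 * Returns true when the pfn would be reported as online.
 * check_subsection enables the additional subsection check.
 */
static bool model_pfn_online(const struct model_section *ms,
			     unsigned long pfn, bool check_subsection)
{
	if (!ms->online)
		return false;
	if (check_subsection && !model_pfn_section_valid(ms, pfn))
		return false;
	if (!ms->taint_zone_device)
		return true;
	/* Slow path: a live pgmap means ZONE_DEVICE, hence not online. */
	return !model_has_pgmap(ms, pfn);
}

/* Walks the memremap -> memunmap sequence for one mixed section. */
int model_walkthrough(void)
{
	const unsigned long device_pfn = 2 * MODEL_PFNS_PER_SUBSECTION;
	struct model_section s = {
		.online = true,
		.taint_zone_device = true,
		.subsection_map = 0xFUL, /* subsections 0-1 normal, 2-3 device */
		.pgmap_map = 0xCUL,
	};

	/* After memremap: the pgmap lookup alone rejects the device pfn. */
	assert(!model_pfn_online(&s, device_pfn, false));

	/* After memunmap: pgmap gone, subsections removed, taint flag stays. */
	s.subsection_map = 0x3UL;
	s.pgmap_map = 0;
	assert(model_pfn_online(&s, device_pfn, false)); /* stale memmap leaks out */
	assert(!model_pfn_online(&s, device_pfn, true)); /* extra check catches it */
	assert(model_pfn_online(&s, 0, true));           /* ZONE_NORMAL pfn unaffected */
	return 0;
}
```

In this model the pgmap lookup alone is enough while the device mapping
exists, but once it is torn down only the subsection check keeps the
stale pfn from being reported as online.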


Not completely opposed to this, but I would certainly still prefer just
avoiding this corner case completely instead of patching around it. Thanks!

--
Thanks,

David / dhildenb