Re: [PATCH 1/1] mm: compaction: avoid fast_isolate_around() to set pageblock_skip on reserved pages

From: Andrea Arcangeli
Date: Wed Nov 25 2020 - 13:29:37 EST


On Wed, Nov 25, 2020 at 07:45:30AM +0100, David Hildenbrand wrote:
> Before that change, the memmap of memory holes were only zeroed
> out. So the zones/nid was 0, however, pages were not reserved and
> had a refcount of zero - resulting in other issues.

So maybe that "0,0" zoneid/nid was not actually the thing that
introduced the regression? Note: I didn't bisect anything yet, it was
just a guess.

In fact, I need first thing to boot with an old v5.3 kernel and to
dump the page->flags around the "Unknown E820 type" region to be 100%
certain that the older kernels had a correct page->flags for reserved
pages. I will clear this up before the end of the day.

> Most pfn walkers shouldn‘t mess with reserved pages and simply skip
> them. That would be the right fix here.

What would then be the advantage of keeping wrong page->flags in
reserved pages compared to actually set it correct?

How do you plan to enforce that every VM bit will call PageReserved
(as it should be introduced in pageblock_pfn_to_page in that case) if
it's not possible to disambiguate a real DMA zone and nid 0 from the
uninitialized value? How can we patch page_nodenum to enforce it won't
read an uninitialized value because a PageReserved check was missing
in the caller?

I don't see this easily enforceable by code review, it'd be pretty
flakey to leave a 0,0 wrong value in there with no way to verify if a
PageResered check in the caller was missing.

> > It'd be preferable if the pfn_valid fails and the
> > pfn_to_section_nr(pfn) returns an invalid section for the intermediate
> > step. Even better memset 0xff over the whole page struct until the
> > second stage comes around.
>
> I recently discussed with Baoquan to
> 1. Using a new pagetype to mark memory holes
> 2. Using special nid/zid to mark memory holes
>
> Setting the memmap to 0xff would be even more dangerous - pfn_zone() might simply BUG_ON.

What would need to call pfn_zone in between first and second stage?

If something calls pfn_zone in between first and second stage isn't it
a feature if it crashes the kernel at boot?

Note: I suggested 0xff kernel crashing "until the second stage comes
around" during meminit at boot, not permanently.

/*
* Use a fake node/zone (0) for now. Some of these pages
* (in memblock.reserved but not in memblock.memory) will
* get re-initialized via reserve_bootmem_region() later.
*/

Specifically I relied on the comment "get re-initialized via
reserve_bootmem_region() later".

I assumed the second stage overwrites the 0,0 to the real zoneid/nid
value, which is clearly not happening, hence it'd be preferable to get
a crash at boot reliably.

Now I have CONFIG_DEFERRED_STRUCT_PAGE_INIT=n so the second stage
calling init_reserved_page(start_pfn) won't do much with
CONFIG_DEFERRED_STRUCT_PAGE_INIT=n but I already tried to enable
CONFIG_DEFERRED_STRUCT_PAGE_INIT=y yesterday and it didn't help, the
page->flags were still wrong for reserved pages in the "Unknown E820
type" region.

> > Whenever pfn_valid is true, it's better that the zoneid/nid is correct
> > all times, otherwise if the second stage fails we end up in a bug with
> > weird side effects.
>
> Memory holes with a valid memmap might not have a zone/nid. For now, skipping reserved pages should be good enough, no?

zoneid is always a pure abstract concept, not just the RAM-holes but
the RAM itself doesn't have a zoneid at all.

Nid is hardware defined for RAM, but for reserved RAM holes it becomes
a pure software construct as the zoneid.

So while it's right that they have no zone/nid in the hardware, they
still have a very well defined zoneid/nid in the software implementation.

The page struct belongs to one and only one mem_map array, that has a
specific nodeid and nid. So it can be always initialized right even
for non-RAM.

Only if the pfn is !pfn_valid, then it has no page struct and then
there's no zone/nid association, but in that case there's no reason to
even worry about it since nobody can possibly get a wrong value out of
the page struct because there's no page struct in the case.

Last but not the least, RAM pages can be marked reserved and assigned
to hardware and so it'd be really messy if real reserved RAM pages
given to hw, behaved different than non-RAM that accidentally got a
struct page because of section alignment issues.

Thanks,
Andrea