Re: [PATCH v2 1/2] mm: uninitialized struct page poisoning sanity checking

From: Pavel Tatashin
Date: Sat Apr 07 2018 - 10:45:56 EST


> Let me study your trace, perhaps I will be able to figure out the issue
> without reproducing it.

Hi Sasha,

I've been studying this problem more. The issue happens in this stack:

...subsys_init...
topology_init()
register_one_node(nid)
link_mem_sections(nid, pgdat->node_start_pfn, pgdat->node_spanned_pages)
register_mem_sect_under_node(mem_blk, nid)
get_nid_for_pfn(pfn)
pfn_to_nid(pfn)
page_to_nid(page)
PF_POISONED_CHECK(page)

We are trying to get the nid from a struct page that has not been
initialized. My patches add extra scrutiny to make sure that we
never get an invalid nid from a "struct page", by adding
PF_POISONED_CHECK() to page_to_nid(). So the bug already exists in
Linux: an incorrect nid is being read. The question is, why does this
happen?
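
To make the failure mode concrete, here is a minimal user-space sketch
of what a poison check in page_to_nid() buys us. It is not the kernel
code: struct page_model, the model_* helpers, and the 4-bit node field
are all assumptions made for illustration. The only idea taken from the
discussion above is that an uninitialized struct page still holds a
poison pattern, so reading the nid from it yields garbage unless it is
checked first.

```c
#include <stdbool.h>
#include <string.h>

/* Simplified stand-in for the kernel's struct page (hypothetical). */
struct page_model {
	unsigned long flags;
};

/* Assumed layout: the node id lives in the top 4 bits of flags. */
#define MODEL_NODES_SHIFT    4
#define MODEL_NODES_PGSHIFT  (sizeof(unsigned long) * 8 - MODEL_NODES_SHIFT)
#define MODEL_NODES_MASK     ((1UL << MODEL_NODES_SHIFT) - 1)
/* Assumed poison pattern: every bit set, as left by memset(0xff). */
#define MODEL_POISON_PATTERN (~0UL)

/* Mark a page as "allocated but never initialized". */
static void model_poison(struct page_model *p)
{
	memset(p, 0xff, sizeof(*p));
}

/* Initialize a page so that it carries a real node id. */
static void model_init_page(struct page_model *p, unsigned long nid)
{
	p->flags = (nid & MODEL_NODES_MASK) << MODEL_NODES_PGSHIFT;
}

static bool model_page_poisoned(const struct page_model *p)
{
	return p->flags == MODEL_POISON_PATTERN;
}

/* Without the check: silently returns whatever bits are there. */
static int model_page_to_nid_unchecked(const struct page_model *p)
{
	return (int)((p->flags >> MODEL_NODES_PGSHIFT) & MODEL_NODES_MASK);
}

/* With the check: report the poisoned page instead of a bogus nid. */
static int model_page_to_nid(const struct page_model *p)
{
	if (model_page_poisoned(p))
		return -1;
	return model_page_to_nid_unchecked(p);
}
```

In this model the unchecked variant returns the all-ones node id for a
poisoned page, which looks like a plausible nid to the caller; the
checked variant turns the same situation into a visible error.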

At first, I thought that perhaps the struct page was not yet
initialized. But the initcalls are done after deferred pages are
initialized, and thus every struct page must be initialized by now.
Also, if deferred pages were enabled, we would take a slightly
different path and avoid this bug by getting the nid from memblock
instead of from the struct page:

get_nid_for_pfn(pfn)
#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
if (system_state < SYSTEM_RUNNING)
return early_pfn_to_nid(pfn);
#endif

I also verified in your config that CONFIG_DEFERRED_STRUCT_PAGE_INIT
is not set. So, one way to fix this issue is to remove this "#ifdef"
(I have not checked for dependencies), but this simply addresses the
symptom; it does not fix the actual issue.

Thus, we have a "struct page" backing memory for this pfn, but we have
not initialized it. For some reason memmap_init_zone() decided to skip
it, and I am not sure why. Looking at the code, we skip initializing
if either of these is true:
!early_pfn_valid(pfn) (aka !pfn_valid(pfn)), or !early_pfn_in_nid(pfn, nid).
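
The two skip conditions can be sketched in user space as follows. This
is only a model of the loop shape described above, not
memmap_init_zone() itself: struct memmap_model, its valid[] and
nid_of[] tables (stand-ins for early_pfn_valid() and the nid lookup),
and the "flags = nid" encoding are all hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_NR_PAGES 8
#define MODEL_POISON   (~0UL)	/* assumed poison pattern */

struct memmap_model {
	unsigned long flags[MODEL_NR_PAGES];	/* one word per "struct page" */
	bool valid[MODEL_NR_PAGES];		/* stand-in for early_pfn_valid() */
	int nid_of[MODEL_NR_PAGES];		/* stand-in for the pfn's node */
};

/* Everything starts out poisoned, valid, and owned by node 0. */
static void model_memmap_setup(struct memmap_model *m)
{
	for (size_t pfn = 0; pfn < MODEL_NR_PAGES; pfn++) {
		m->flags[pfn] = MODEL_POISON;
		m->valid[pfn] = true;
		m->nid_of[pfn] = 0;
	}
}

/*
 * Initialize the pages of one node. Returns how many pfns were
 * skipped; those entries keep their poison pattern.
 */
static size_t model_memmap_init(struct memmap_model *m, int nid)
{
	size_t skipped = 0;

	for (size_t pfn = 0; pfn < MODEL_NR_PAGES; pfn++) {
		if (!m->valid[pfn]) {		/* !early_pfn_valid(pfn) */
			skipped++;
			continue;
		}
		if (m->nid_of[pfn] != nid) {	/* !early_pfn_in_nid(pfn, nid) */
			skipped++;
			continue;
		}
		m->flags[pfn] = (unsigned long)nid;	/* "initialized" */
	}
	return skipped;
}
```

Any pfn that takes either "continue" still has backing memory in the
model but is left poisoned, which is the state the trace appears to
show.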

I suspect this has something to do with !pfn_valid(pfn). But without
a machine on which I can reproduce this problem, I cannot study it
further to determine exactly why the pfn is not valid.

Please replace !pfn_valid_within() with !pfn_valid() in
get_nid_for_pfn() and see if the problem still happens. If it does
not happen, let's study the memory map, the pgdat's start and end, and
the value of this pfn.
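
For reference, the effect of that substitution can be sketched in user
space like this. The model assumes that pfn_valid_within() is a no-op
that always evaluates to true in this configuration (which I believe is
the case when CONFIG_HOLES_IN_ZONE is not set, but please double check),
and the model_* tables and the garbage nid value are hypothetical.

```c
#include <stdbool.h>

#define MODEL_NR_PFNS     4
#define MODEL_GARBAGE_NID 1023	/* what a poisoned page might decode to */

/* Hypothetical memory map: pfn 2 is a hole with an uninitialized page. */
static const bool model_valid[MODEL_NR_PFNS] = { true, true, false, true };
static const int model_page_nid[MODEL_NR_PFNS] = {
	0, 0, MODEL_GARBAGE_NID, 1
};

/* Real validity check against the memory map. */
static bool model_pfn_valid(unsigned long pfn)
{
	return pfn < MODEL_NR_PFNS && model_valid[pfn];
}

/* Assumed behaviour without CONFIG_HOLES_IN_ZONE: always true. */
static bool model_pfn_valid_within(unsigned long pfn)
{
	(void)pfn;
	return true;
}

/* Current shape: the guard never fires, so the hole is read. */
static int model_get_nid_current(unsigned long pfn)
{
	if (!model_pfn_valid_within(pfn))
		return -1;
	return model_page_nid[pfn];
}

/* Suggested experiment: guard with the real validity check. */
static int model_get_nid_proposed(unsigned long pfn)
{
	if (!model_pfn_valid(pfn))
		return -1;
	return model_page_nid[pfn];
}
```

If the real problem goes away with the stricter guard, that would point
at a hole inside the node's pfn range, as in pfn 2 of this model.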

Thank you,
Pasha