Re: [PATCH -next] mm/hotplug: skip bad PFNs from pfn_to_online_page()

From: Dan Williams
Date: Thu Jun 13 2019 - 21:22:32 EST


On Thu, Jun 13, 2019 at 11:42 AM Qian Cai <cai@xxxxxx> wrote:
>
> On Wed, 2019-06-12 at 12:37 -0700, Dan Williams wrote:
> > On Wed, Jun 12, 2019 at 12:16 PM Qian Cai <cai@xxxxxx> wrote:
> > >
> > > The linux-next commit "mm/sparsemem: Add helpers track active portions
> > > of a section at boot" [1] causes a crash below when the first kmemleak
> > > scan kthread kicks in. This is because kmemleak_scan() calls
> > > pfn_to_online_page(() which calls pfn_valid_within() instead of
> > > pfn_valid() on x86 due to CONFIG_HOLES_IN_ZONE=n.
> > >
> > > The commit [1] did add an additional check of pfn_section_valid() in
> > > pfn_valid(), but forgot to add it in the above code path.
> > >
> > > page:ffffea0002748000 is uninitialized and poisoned
> > > raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff
> > > raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff
> > > page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
> > > ------------[ cut here ]------------
> > > kernel BUG at include/linux/mm.h:1084!
> > > invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI
> > > CPU: 5 PID: 332 Comm: kmemleak Not tainted 5.2.0-rc4-next-20190612+ #6
> > > Hardware name: Lenovo ThinkSystem SR530 -[7X07RCZ000]-/-[7X07RCZ000]-,
> > > BIOS -[TEE113T-1.00]- 07/07/2017
> > > RIP: 0010:kmemleak_scan+0x6df/0xad0
> > > Call Trace:
> > > kmemleak_scan_thread+0x9f/0xc7
> > > kthread+0x1d2/0x1f0
> > > ret_from_fork+0x35/0x4
> > >
> > > [1] https://patchwork.kernel.org/patch/10977957/
> > >
> > > Signed-off-by: Qian Cai <cai@xxxxxx>
> > > ---
> > > include/linux/memory_hotplug.h | 1 +
> > > 1 file changed, 1 insertion(+)
> > >
> > > diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> > > index 0b8a5e5ef2da..f02be86077e3 100644
> > > --- a/include/linux/memory_hotplug.h
> > > +++ b/include/linux/memory_hotplug.h
> > > @@ -28,6 +28,7 @@
> > > unsigned long ___nr = pfn_to_section_nr(___pfn); \
> > > \
> > > if (___nr < NR_MEM_SECTIONS && online_section_nr(___nr) && \
> > > + pfn_section_valid(__nr_to_section(___nr), pfn) && \
> > > pfn_valid_within(___pfn)) \
> > > ___page = pfn_to_page(___pfn); \
> > > ___page; \
> >
> > Looks ok to me:
> >
> > Acked-by: Dan Williams <dan.j.williams@xxxxxxxxx>
> >
> > ...but why is pfn_to_online_page() a multi-line macro instead of a
> > static inline like all the helper routines it invokes?
>
> Sigh, probably because it is a mess over there.
>
> memory_hotplug.h and mmzone.h are included each other. Converted it directly to
> a static inline triggers compilation errors because mmzone.h was included
> somewhere else and found pfn_to_online_page() needs things like
> pfn_valid_within() and online_section_nr() etc which are only defined later in
> mmzone.h.

Ok, makes sense I had I assumed it was something horrible like that.

Qian, can you send more details on the reproduction steps for the
failures you are seeing? Like configs and platforms you're testing.
I've tried enabling kmemleak and offlining memory and have yet to
trigger these failures. I also have a couple people willing to help me
out with tracking down the PowerPC issue, but I assume they need some
help with the reproduction as well.