Re: kernel panic due to https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2830bf6f05fb3e05bc4743274b806c821807a684

From: Michal Hocko
Date: Fri Jan 25 2019 - 11:39:45 EST


On Fri 25-01-19 11:16:30, robert shteynfeld wrote:
> Attached is the dmesg from patched kernel.

Your Node1 physical memory range precedes Node0, which is quite unusual
but shouldn't be a huge problem on its own. But the memory ranges are
not aligned to memory section boundaries:

[ 0.286954] Early memory node ranges
[ 0.286955] node 1: [mem 0x0000000000001000-0x0000000000090fff]
[ 0.286955] node 1: [mem 0x0000000000100000-0x00000000dbdf8fff]
[ 0.286956] node 1: [mem 0x0000000100000000-0x0000001423ffffff]
[ 0.286956] node 0: [mem 0x0000001424000000-0x0000002023ffffff]

As you can see, the last pfn of Node1 is inside a section and Node0
starts right after it. This is quite unusual as well, if for no other
reason than that the memmap for those struct pages will be remote for
one node or the other. Actually I am not even sure we can handle that
properly because we do expect a 1:1 mapping between sections and nodes.
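
To make the misalignment concrete, here is a rough sketch that maps the
ranges from the dmesg above onto memory sections and reports any section
containing memory from more than one node. It assumes the x86_64
default of 128 MiB sections (SECTION_SIZE_BITS = 27), which is not
stated in the report; the helper names are made up for illustration.

```python
# Assumption: x86_64 default, SECTION_SIZE_BITS = 27 (128 MiB sections).
SECTION_SIZE_BITS = 27
SECTION_SIZE = 1 << SECTION_SIZE_BITS

# Early memory node ranges from the dmesg: (node, start, inclusive end).
RANGES = [
    (1, 0x0000000000001000, 0x0000000000090fff),
    (1, 0x0000000000100000, 0x00000000dbdf8fff),
    (1, 0x0000000100000000, 0x0000001423ffffff),
    (0, 0x0000001424000000, 0x0000002023ffffff),
]


def section_nr(addr):
    """Section number containing a physical address (hypothetical helper)."""
    return addr >> SECTION_SIZE_BITS


def nodes_per_section(ranges):
    """Map each section number to the set of nodes with memory in it."""
    owners = {}
    for node, start, end in ranges:
        for nr in range(section_nr(start), section_nr(end) + 1):
            owners.setdefault(nr, set()).add(node)
    return owners


# Keep only the sections that more than one node has memory in.
shared = {
    nr: nodes
    for nr, nodes in nodes_per_section(RANGES).items()
    if len(nodes) > 1
}

for nr, nodes in sorted(shared.items()):
    base = nr << SECTION_SIZE_BITS
    print(f"section {nr:#x} at {base:#x} has memory from nodes {sorted(nodes)}")
```

Under that assumption the Node1/Node0 boundary at 0x1424000000 sits
64 MiB into a section rather than on a section boundary.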

Now it also makes some sense why 2830bf6f05fb ("mm, memory_hotplug:
initialize struct pages for the full memory section") made a
difference. We simply write over a potentially initialized struct page
and blow up on that. I strongly suspect that the commit just uncovered
a pre-existing problem. Let me think about what we can do about that.
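
A back-of-the-envelope sketch of the overlap, assuming (as the commit
title suggests) that initialization now continues from the end of the
node's range to the end of the containing section. The 4 KiB page size
and 128 MiB section size are x86_64 defaults assumed here, not taken
from the report, and the helper name is hypothetical.

```python
# Assumptions: 4 KiB pages and 128 MiB sections (x86_64 defaults).
PAGE_SHIFT = 12
SECTION_SIZE_BITS = 27
PAGES_PER_SECTION = 1 << (SECTION_SIZE_BITS - PAGE_SHIFT)

# Boundary pfns derived from the dmesg ranges.
NODE1_END_PFN = (0x1423ffffff + 1) >> PAGE_SHIFT  # exclusive end of Node1
NODE0_START_PFN = 0x1424000000 >> PAGE_SHIFT      # first pfn of Node0


def section_end_pfn(pfn):
    """Round pfn up to the next section boundary (unchanged if aligned)."""
    return -(-pfn // PAGES_PER_SECTION) * PAGES_PER_SECTION


# If Node1's struct pages are initialized up to the section end, every
# pfn from Node0's start up to that point is touched a second time.
init_end = section_end_pfn(NODE1_END_PFN)
overlap_pages = max(0, init_end - max(NODE1_END_PFN, NODE0_START_PFN))

print(f"Node1 init would run to pfn {init_end:#x}")
print(f"struct pages also belonging to Node0: {overlap_pages}")
```

With these numbers the overlap is 16384 struct pages, i.e. the first
64 MiB of Node0.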

> I'm not an expert at debugging the kernel, obviously. I tried setting
> up a serial console before without much luck as part of this debugging
> session.

Ubuntu has a nice howto for netconsole configuration:
https://wiki.ubuntu.com/Kernel/Netconsole
It is quite important to capture the actual failure.
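
On the receiving machine, netconsole output arrives as plain UDP
datagrams, so anything that can listen on a UDP port will do. Below is
a minimal sketch of such a listener. It assumes the default netconsole
target port of 6666 and that the panicking box has already been pointed
at this machine; the function names are made up for illustration.

```python
import socket


def open_listener(host="0.0.0.0", port=6666):
    """Bind a UDP socket; 6666 is assumed to be the netconsole target port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def receive_messages(sock, count, bufsize=4096):
    """Block until `count` datagrams have arrived and return them as text."""
    messages = []
    for _ in range(count):
        data, _sender = sock.recvfrom(bufsize)
        # Kernel log text is normally ASCII; replace anything undecodable.
        messages.append(data.decode("utf-8", errors="replace"))
    return messages
```

To follow the log live, call open_listener() once and then print the
result of receive_messages(sock, 1) in a loop, ideally also appending
it to a file so that the last lines before the panic are preserved.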
--
Michal Hocko
SUSE Labs