Re: [PATCH v3 1/5] mm: page_alloc: remain memblock_next_valid_pfn() when CONFIG_HAVE_ARCH_PFN_VALID is enable

From: Jia He
Date: Mon Apr 02 2018 - 05:17:58 EST




On 4/2/2018 4:12 PM, Wei Yang wrote:
On Wed, Mar 28, 2018 at 05:49:23PM +0800, Jia He wrote:

On 3/28/2018 5:18 PM, Wei Yang wrote:
Oops, I should have replied to this thread. Please disregard my reply on the other thread.

On Sun, Mar 25, 2018 at 08:02:15PM -0700, Jia He wrote:
Commit b92df1de5d28 ("mm: page_alloc: skip over regions of invalid pfns
where possible") optimized the loop in memmap_init_zone(). But it could
cause a panic, so Daniel Vacek reverted it later.

Why does this have a bug? Do you have a link about it?

If readers knew the potential risk, it would help them review the code
and decide whether to take it back.
Hi Wei,
Paul first submitted commit b92df1de5 to improve the loop in
memmap_init_zone().
Daniel then tried to fix a BUG_ON panic on x86 in commit 864b75f9d6b,
because there was evidence that this BUG_ON was caused by b92df1de5 [1].

But things didn't get better: 864b75f9d6b caused a boot hang on
arm{64} [2].
So the maintainer decided to revert both b92df1de5 and 864b75f9d6b.

[1] https://patchwork.kernel.org/patch/10251145/
[2] https://lkml.org/lkml/2018/3/14/469
I took some time to look into the discussion, but the root cause still
seems unclear?

Frankly speaking, the root cause of that BUG_ON is not completely clear
to me :-) Daniel once gave me the following hints, but I currently have
no x86 platform on which to dig into the details.

"On arm and arm64, memblock is used by default. But generic version of
pfn_valid() is based on mem sections and memblock_next_valid_pfn()
does not always return the next valid one but skips more resulting in
some valid frames to be skipped (as if they were invalid). And that's
why kernel was eventually crashing on some !arm machines."
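
To make the hint above a bit more concrete, here is a toy userspace sketch
(not kernel code). Everything in it is an assumption for illustration: the
tiny PAGES_PER_SECTION value, the two-region memory layout, and the helper
names section_pfn_valid(), toy_next_valid_pfn() and count_skipped_valid()
are all hypothetical. It only models the general idea that a section-based
validity check treats every pfn of a populated section as valid, while a
memblock-style "next valid pfn" lookup jumps straight to the start of the
next memory region.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy value; real section sizes are far larger. */
#define PAGES_PER_SECTION 8UL
#define MAX_PFN 32UL

/* A memory region as a half-open pfn range [start_pfn, end_pfn). */
struct region {
	unsigned long start_pfn;
	unsigned long end_pfn;
};

/* Hypothetical layout, sorted by start: a hole spans pfns 8..19. */
static const struct region regions[] = { { 0, 8 }, { 20, 32 } };
#define NR_REGIONS (sizeof(regions) / sizeof(regions[0]))

/*
 * Section-based check: a pfn counts as valid if its section overlaps
 * any memory region, even when the pfn itself sits in a hole.
 */
static bool section_pfn_valid(unsigned long pfn)
{
	unsigned long sec_start = pfn - (pfn % PAGES_PER_SECTION);
	unsigned long sec_end = sec_start + PAGES_PER_SECTION;
	size_t i;

	for (i = 0; i < NR_REGIONS; i++)
		if (regions[i].start_pfn < sec_end &&
		    regions[i].end_pfn > sec_start)
			return true;
	return false;
}

/*
 * Region-based lookup: return the first pfn after `pfn` that lies
 * inside a region, or MAX_PFN if there is none.
 */
static unsigned long toy_next_valid_pfn(unsigned long pfn)
{
	unsigned long next = pfn + 1;
	size_t i;

	for (i = 0; i < NR_REGIONS; i++) {
		if (next < regions[i].start_pfn)
			return regions[i].start_pfn;
		if (next < regions[i].end_pfn)
			return next;
	}
	return MAX_PFN;
}

/*
 * Walk the pfn range, skipping ahead with the region-based lookup
 * whenever the section-based check fails, then count the pfns that
 * the section-based check calls valid but the walk never visited.
 */
static unsigned long count_skipped_valid(void)
{
	bool visited[MAX_PFN] = { false };
	unsigned long pfn, skipped = 0;

	for (pfn = 0; pfn < MAX_PFN; pfn++) {
		if (!section_pfn_valid(pfn)) {
			pfn = toy_next_valid_pfn(pfn) - 1;
			continue;
		}
		visited[pfn] = true;
	}

	for (pfn = 0; pfn < MAX_PFN; pfn++)
		if (section_pfn_valid(pfn) && !visited[pfn])
			skipped++;

	return skipped;
}
```

With this layout, the walk jumps from pfn 8 directly to pfn 20, so
count_skipped_valid() returns 4: pfns 16..19 belong to a populated section
and pass the section-based check, yet they are never visited. Whether this
is exactly what happened on the crashing x86 machines is not confirmed
here; it is only one way to read the hint.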

--
Cheers,
Jia