Re: [Update] Regression in 4.18 - 32-bit PowerPC crashes on boot - bisected to commit 1d40a5ea01d5

From: Larry Finger
Date: Fri Jun 29 2018 - 22:38:27 EST


On 06/29/2018 04:01 PM, Linus Torvalds wrote:
On Fri, Jun 29, 2018 at 1:42 PM Larry Finger <Larry.Finger@xxxxxxxxxxxx> wrote:

I have more information regarding this BUG. Line 700 of page-flags.h is the
macro PAGE_TYPE_OPS(Table, table). For further debugging, I manually expanded
the macro, and found that the bug line is VM_BUG_ON_PAGE(!PageTable(page), page)
in routine __ClearPageTable(), which is called from pgtable_page_dtor() in
include/linux/mm.h. I also added a printk call to PageTable() that logs
page->page_type. The routine was called twice. The first had page_type of
0xfffffbff, which would have been expected for a page table. The second call had
0xffffffff, which led to the BUG.

So it looks to me like the tear-down of the page tables first found a
page that is indeed a page table, and cleared the page table bit
(well, it set it - the bits are reversed).

Then it took an exception (that "interrupt: 700") and that causes
do_exit() again, and it tries to free the same page table - and now
it's no longer marked as a page table, because it already went through
the __ClearPageTable() dance once.

So on the second path through, it catches that "the bit already said
it wasn't a page table" and does the BUG.
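
The sequence described above can be replayed with a small user-space model. This is only a sketch: struct fake_page, page_is_table(), set_page_table(), clear_page_table() and demo() are hypothetical stand-ins, and the PAGE_TYPE_BASE and PG_table values are assumed from include/linux/page-flags.h as of 4.18, so check them against your own tree. It shows why a set type reads as 0xfffffbff and why the second teardown pass trips the check.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed values, believed to match include/linux/page-flags.h in 4.18. */
#define PAGE_TYPE_BASE 0xf0000000u
#define PG_table       0x00000400u

/* Hypothetical stand-in for struct page: only the field of interest. */
struct fake_page {
    uint32_t page_type;   /* all ones means "no type set" */
};

/* The bits are inverted: a type is set when its bit is CLEAR
 * and the top nibble is still intact. */
static bool page_is_table(const struct fake_page *p)
{
    return (p->page_type & (PAGE_TYPE_BASE | PG_table)) == PAGE_TYPE_BASE;
}

/* Marks the page as a page table by clearing the PG_table bit. */
static void set_page_table(struct fake_page *p)
{
    p->page_type &= ~PG_table;
}

/* Returns false where the kernel's VM_BUG_ON_PAGE() would fire. */
static bool clear_page_table(struct fake_page *p)
{
    if (!page_is_table(p))
        return false;
    p->page_type |= PG_table;   /* "clearing" the type sets the bit */
    return true;
}

/* Replays the two teardown passes over the same page. */
void demo(void)
{
    struct fake_page pg = { .page_type = 0xffffffffu };

    set_page_table(&pg);
    printf("marked as table: 0x%08x\n", (unsigned)pg.page_type);
    printf("first dtor:      %s\n", clear_page_table(&pg) ? "ok" : "BUG");
    printf("after clearing:  0x%08x\n", (unsigned)pg.page_type);
    printf("second dtor:     %s\n", clear_page_table(&pg) ? "ok" : "BUG");
}
```

Calling demo() from a main() walks through the same two page_type values seen in the printk output. The model says nothing about why the first pass took the exception; it only illustrates why the second pass cannot succeed once the first has run.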

But the real question is what the problem was the *first* time around.
I assume that has scrolled off the screen? This part:

_exception_pkey+0x58/0x128
ret_from_except_full+0x0/0x4
--- interrupt: 700 at free_pgd_range+0x19c/0x30c
LR = free_pgd_range+0x19c/0x30c
free_pgtables+0xa/0xb
exit_mmap+0xf4/0x16c
mmput+0x64/0xf0

Does reverting that commit 1d40a5ea01d5 make everything work for you?
Because if so, judging by the deafening silence on this so far, I
think that's what we should do.

That said, can some ppc person who knows the 32-bit ppc code and maybe
knows what that "interrupt: 700" means talk about that oddity in the
trace, please?

The deafening silence may be due to my having an old Microsoft address for Matthew Wilcox in my first posting. He should now have received the BUG report, and he may have some suggestions. Yes, reverting commit 1d40a5ea01d5 does permit the box to boot.

Kirill's patch also works, which seems like a better solution. If any other architecture bugs on boot, at least we will know where to look. :)

@Kirill: You may add a Reported-by: and Tested-by: Larry Finger <Larry.Finger@xxxxxxxxxxxx> to the patch.

Thanks for the help,

Larry