Re: kernel panic due to https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2830bf6f05fb3e05bc4743274b806c821807a684

From: Michal Hocko
Date: Fri Jan 25 2019 - 02:37:10 EST


On Fri 25-01-19 17:48:32, Linus Torvalds wrote:
> [ Just adding a lot of other people to the cc ]
>
> Robert, could you add a dmesg of a successful boot to that bugzilla,
> or just as an attachment in email to this group of people..
>
> This looks to be with the Fedora kernel config. Two people reporting
> it, it looks like similar machines.
>
> I assume it's some odd memory sizing detail that happens to trigger a
> particular case.

Quite possible.

> I absolutely *hate* those "let's lazily clear 'struct page' array"
> patches. They've caused problems before, and I'm not convinced the
> pain has been worth it. Maybe we should revert them (again) and
> promise to never ever take things like that again? Andrew?

On the other hand, the performance numbers were pretty dramatic. This
was especially visible with the initialization of very large NVDIMMs,
where userspace was timing out without these patches applied.

I am certainly not very happy about the regressions which we still see.
I was worried about that early on when reviewing these patches, because
it is really hard to find all those weird places which simply happened
to work even though they were broken before. E.g. uninitialized struct
pages were simply zeroed, which means that they seemed to belong to node
zero and zone DMA, and nothing really blew up. With the poisoning in
place we have an explicit VM_BUG_ON, and Fedora kernels do enable VM
debugging, so those problems are visible.
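
To illustrate the difference, here is a toy model in Python. It is not
kernel code: the field widths, the bit positions, the helper names and
the all-ones poison value are assumptions made up for the example. It
only shows why a zero-filled flags word quietly decodes to node 0 and
the first zone, while a poisoned one can be caught by an explicit check.

```python
# Toy model of node/zone decoding from a page flags word.
# All constants and names below are illustrative assumptions,
# not the kernel's actual layout or API.

FLAGS_BITS = 64
NODE_BITS = 10                      # assumed width of the node field
ZONE_BITS = 2                       # assumed width of the zone field
NODE_SHIFT = FLAGS_BITS - NODE_BITS
ZONE_SHIFT = NODE_SHIFT - ZONE_BITS
ZONE_NAMES = ["DMA", "DMA32", "Normal", "Movable"]

# Assumed poison pattern: every bit set, as left by a 0xff byte fill.
POISON = (1 << FLAGS_BITS) - 1


def page_nid(flags):
    """Decode the node id without any sanity check."""
    return (flags >> NODE_SHIFT) & ((1 << NODE_BITS) - 1)


def page_zone(flags):
    """Decode the zone name without any sanity check."""
    return ZONE_NAMES[(flags >> ZONE_SHIFT) & ((1 << ZONE_BITS) - 1)]


def checked_page_nid(flags):
    """Decode the node id, refusing a poisoned (uninitialized) page.

    This stands in for a debug assertion: instead of silently returning
    a plausible value, it fails loudly.
    """
    if flags == POISON:
        raise RuntimeError("access to uninitialized struct page")
    return page_nid(flags)


if __name__ == "__main__":
    # A zeroed struct page looks like a valid page on node 0, zone DMA.
    print(page_nid(0), page_zone(0))
    # A poisoned struct page is caught by the checked accessor.
    try:
        checked_page_nid(POISON)
    except RuntimeError as err:
        print("caught:", err)
```

In this model the zeroed case passes every check because zero is a
valid encoding, which is how the old behavior hid those bugs.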

I still think we should chase after those issues and fix them
regardless. Reverting the lazy initialization will not solve those
problems. It will just paper over them.
--
Michal Hocko
SUSE Labs