Re: [PATCH 4.4 00/37] 4.4.110-stable review

From: Pavel Tatashin
Date: Fri Jan 05 2018 - 20:17:36 EST


Hi Hugh,

Thank you very much for your very thoughtful input.

I am quite positive this problem is a PTI regression, because I see exactly the same problem with kernel 4.1, to which I back-ported all the necessary PTI patches from 4.4.110. I will provide this thread with more information as I collect it. I will also try to root-cause the problem.

The bug behaves like memory corruption, but with both the 4.1 and 4.4 kernels the problem goes away when I boot with the noefi parameter. So, EFI + PTI is the culprit for this memory corruption.

Thank you,
Pavel

On 01/05/2018 06:15 PM, Hugh Dickins wrote:
On Fri, Jan 5, 2018 at 1:03 PM, Pavel Tatashin
<pasha.tatashin@xxxxxxxxxx> wrote:
The hardware works :) I meant that before the patch linked in
https://lkml.org/lkml/2018/1/5/534, I was never able to boot 4.4.110. But
with that patch applied, I was able to boot it at least once, but it could
be accidental. The hang/panic does not happen at the same time on every
boot.

I get the feeling that it was accidental: it seems to me that you have
a memory corruption problem, that gets shifted around by the different
patches (or "noefi" or "nopti").

Because yesterday your boots were able to get way beyond the "EFI
Variables Facility" message, and I can't imagine why the EFI issue
would not have been equally debilitating on yesterday's 110-rc, if it
were in play.

I did intend to ask you to send your System.map, for us to scan
through: maybe some variable is marked __init and should not be, then
the "Freeing unused kernel memory" frees it for random reuse.
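
One rough way to do such a scan is sketched below. It assumes the System.map has the usual "address type name" layout and contains the __init_begin and __init_end boundary symbols. The function name init_range_data_symbols and the choice to report only data/bss symbol types (b, B, d, D) are illustrative assumptions, not anything from this thread. It only lists candidates; deciding whether a symbol is wrongly marked __init still needs a look at the source.

```python
def init_range_data_symbols(map_lines, kinds="bBdD"):
    """Return (address, type, name) for data/bss symbols that lie inside
    the region freed by "Freeing unused kernel memory".

    Assumes each line looks like "ffffffff81d00100 d some_symbol" and that
    __init_begin and __init_end are present (KeyError otherwise).
    """
    symbols = []
    for line in map_lines:
        parts = line.split()
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        symbols.append((int(parts[0], 16), parts[1], parts[2]))

    # Look up the boundaries of the init region by symbol name.
    addr_of = {name: addr for addr, _, name in symbols}
    start, end = addr_of["__init_begin"], addr_of["__init_end"]

    # Keep only data/bss symbols whose address falls in [start, end).
    return [s for s in symbols if start <= s[0] < end and s[1] in kinds]
```

With a real map this could be called as init_range_data_symbols(open("System.map")), and each reported name checked against code that still runs after init memory has been freed.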

But today you didn't get anywhere near the "Freeing unused kernel
memory", so that can't be it - or do you sometimes get that far today?

You mention that the hang/panic does not happen at the same time on
every boot: I think all I can ask is for you to keep supplying us with
different examples (console messages) of where it occurs, in the hope
that one of them will point us in the right direction.

And it even seems possible that this has nothing to do with the
4.4.110 changes - that 4.4.109 plus some other random patches would
unleash similar corruption. Though on balance that does seem unlikely.

Hugh