Re: ktime_get_ts64() splat during resume

From: Rafael J. Wysocki
Date: Fri Jun 17 2016 - 21:11:23 EST


On Fri, Jun 17, 2016 at 11:03 PM, Rafael J. Wysocki <rafael@xxxxxxxxxx> wrote:
> On Fri, Jun 17, 2016 at 6:12 PM, Borislav Petkov <bp@xxxxxxxxx> wrote:
>> On Fri, Jun 17, 2016 at 05:28:10PM +0200, Rafael J. Wysocki wrote:
>>> A couple of questions:
>>> - I guess this is reproducible 100% of the time?
>>
>> Yap.
>>
>> I took latest Linus + tip/master which has your commit.
>>
>>> - If you do "echo disk > /sys/power/state" instead of using s2disk,
>>> does it still crash in the same way?
>>
>> My suspend to disk script does:
>>
>> echo 3 > /proc/sys/vm/drop_caches
>> echo "shutdown" > /sys/power/disk
>> echo "disk" > /sys/power/state
>>
>> I don't use anything else for years now.
>>
>>> - Are both the image and boot kernels the same binary?
>>
>> Yep.
>
> OK, we need to find out what's wrong, then.
>
> First, please revert the changes in hibernate_asm_64.S alone and see
> if that makes any difference.
>
> Hibernation should still work then most of the time, but the bug fixed
> by this commit will be back.

The memory corruption you are seeing is 100% reproducible and the same
address appears to be corrupted in the same way every time. The new
code added by the commit in question only writes to dynamically
allocated memory, and you've already verified that
https://patchwork.kernel.org/patch/9185165/ doesn't help. Given all
that, it is quite unlikely that the memory corruption comes from that
commit itself.

However, I see a couple of ways in which that commit might uncover a latent bug.

First, it changed the layout of the kernel text by adding the
PAGE_SIZE alignment of restore_registers(). That likely pushed stuff
behind it to new offsets, possibly including the static struct field
that is now corrupted. Now, say that the memory corruption has always
happened at the same memory location, but previously there was nothing
there, or whatever was there wasn't vital. In that case the memory
corruption might have gone unnoticed until the commit in question
caused things to move to new locations, so that the corrupted location
now contains a vital piece of data. This is my current theory.

Second, it added two invocations of get_safe_page() that in theory
might push things a bit too far towards the limit, so that they start
to break there. I don't see how that can happen ATM, but I'm not
excluding this possibility yet. It seems, though, that in that case
the corruption would be more random and I certainly wouldn't expect it
to happen at the same location every time.

One more indicator is that multiple people have reported success with
that commit over many hibernation runs, so the problem appears to be
very specific to your system and/or kernel configuration. It is also
interesting that the memory corruption only becomes visible during the
thawing of tasks: given the piece of data that is corrupted, it should
have become visible much earlier if the memory had been corrupted
during image restoration by the boot kernel.

In any case, reverting the changes in hibernate_asm_64.S alone should
show us the direction. If it makes things work again, I would then try
changing the restore_registers() alignment to something smaller, like
64 (which should be safe), and see what happens.
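
For illustration, the layout effect behind the first theory comes down
to alignment arithmetic. Below is a minimal sketch, assuming a made-up
unaligned address for restore_registers(); the real address, and
whether anything placed after it actually moves, depends on the kernel
build and the linker script. The helper names are hypothetical too.

```python
PAGE_SIZE = 4096  # page size on x86-64


def align_up(addr, alignment):
    """Round addr up to the next multiple of alignment (a power of two)."""
    return (addr + alignment - 1) & ~(alignment - 1)


def padding_for(addr, alignment):
    """Bytes of padding inserted in front of the aligned symbol.

    Anything laid out after that symbol in the same section moves by
    this amount.
    """
    return align_up(addr, alignment) - addr


# Hypothetical address, for illustration only (not taken from a real build).
unaligned = 0xffffffff8105c3a4

for alignment in (64, PAGE_SIZE):
    start = align_up(unaligned, alignment)
    pad = padding_for(unaligned, alignment)
    print(f"alignment {alignment:4d}: start {start:#x}, padding {pad} bytes")
```

With an alignment of 64 the padding can never exceed 63 bytes, whereas
with PAGE_SIZE it can be up to 4095 bytes, so whatever follows
restore_registers() would be displaced much less in the former case.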