Re: Hibernate resume bug around 3.18-rc2 - Full PAT support

From: Luis R. Rodriguez
Date: Mon Nov 23 2015 - 13:56:44 EST


On Sat, Nov 21, 2015 at 01:49:06PM +0200, Vassilis Virvilis wrote:
> On 11/20/2015 02:23 PM, Juergen Gross wrote:
> >On 20/11/15 11:04, vasvir@xxxxxxxxxxxxxxxxx wrote:
> >>>I've just found a potential issue: In case MTRR is disabled by the BIOS
> >>>the PAT register of the boot processor won't be restored after resume.
> >>>
> >>>Can you check whether pr_info("MTRR: Disabled\n") has been executed in
> >>>early boot? If yes, this might be a BIOS option.
> >>>
> >>
> >>I don't have access right now. I will test it later tonight (This is my
> >>home machine).
> >>
> >>Would $dmesg | grep -i mtrr suffice or I need to look for the mtrr
> >>somewere else e.g. /proc /sys etc?
> >
> >I think grepping for MTRR in dmesg should be enough.
>
> kernel 4.3 +nopat also died on the 4th or the 5th hibernate on the familiar (see previously attached image) "Calling lapic..." place.
>
> $dmesg | grep -i mtr for 4.3 kernel with notpat
> [ 0.189113] calling mtrr_if_init+0x0/0x5f @ 1
> [ 0.189116] initcall mtrr_if_init+0x0/0x5f returned 0 after 0 usecs
> [ 0.189222] pmd_set_huge: Cannot satisfy [mem 0xf8000000-0xf8200000] with a huge-page mapping due to MTRR override.
> [ 0.189559] calling mtrr_init_finialize+0x0/0x3a @ 1
> [ 0.189560] initcall mtrr_init_finialize+0x0/0x3a returned 0 after 0 usecs
> [ 8.994140] mtrr: type mismatch for e0000000,10000000 old: write-back new: write-combining
> [ 8.994154] Failed to add WC MTRR for [00000000e0000000-00000000efffffff]; performance may suffer.

It's not clear from the log who made the MTRR call for WC that failed. I
hope we didn't attempt a WC write on a WB region. Who owns
00000000e0000000-00000000efffffff?
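
One way to answer the ownership question is to look the range up in
/proc/iomem. Below is a minimal sketch, assuming the usual
"start-end : name" layout with two spaces of indentation per nesting level.
The helper name owners_of and the SAMPLE text are made up for illustration;
they are not output from the machine in this thread. Note that /proc/iomem
shows zeroed addresses when read without root.

```python
import re

# Matches /proc/iomem-style lines such as "  e0000000-efffffff : some name".
IOMEM_LINE = re.compile(r"^(\s*)([0-9a-fA-F]+)-([0-9a-fA-F]+) : (.*)$")


def owners_of(iomem_text, start, end):
    """Return (depth, lo, hi, name) for every entry overlapping [start, end]."""
    hits = []
    for line in iomem_text.splitlines():
        m = IOMEM_LINE.match(line)
        if not m:
            continue
        indent, lo, hi, name = m.groups()
        lo, hi = int(lo, 16), int(hi, 16)
        # Two inclusive ranges overlap when each starts before the other ends.
        if lo <= end and hi >= start:
            hits.append((len(indent) // 2, lo, hi, name))
    return hits


# Hypothetical sample input, NOT taken from the reporter's system.
SAMPLE = """\
00000000-00000fff : Reserved
c0000000-ffffffff : PCI Bus 0000:00
  e0000000-efffffff : hypothetical-device
  f8000000-fbffffff : hypothetical-mmconfig
"""

for depth, lo, hi, name in owners_of(SAMPLE, 0xE0000000, 0xEFFFFFFF):
    print("  " * depth + f"{lo:08x}-{hi:08x} : {name}")
```

On a real system, pass the contents of /proc/iomem (read as root) instead of
SAMPLE. The most deeply nested hit is normally the device that owns the range.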

What does your log show right before and after this? To find out try:

dmesg | grep -5 -i mtrr

Not being able to use WC is not fatal; it's just a performance issue. But if we
tried to switch a region to WC that we should not have, one that the BIOS might
rely on not being WC, that could be a big issue.
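
To make the failed request concrete, the "type mismatch" line can be decoded
into an address range and the old/new types. This is only a sketch: it
assumes the message carries base and size in hex, as the quoted log suggests,
and decode_mismatch is a hypothetical helper name.

```python
import re

# Pattern for the "mtrr: type mismatch for BASE,SIZE old: X new: Y" message.
MISMATCH = re.compile(
    r"mtrr: type mismatch for ([0-9a-f]+),([0-9a-f]+) old: (\S+) new: (\S+)"
)


def decode_mismatch(line):
    """Return the range and types named in a mismatch line, or None."""
    m = MISMATCH.search(line)
    if m is None:
        return None
    base = int(m.group(1), 16)
    size = int(m.group(2), 16)
    return {
        "start": base,
        "end": base + size - 1,  # inclusive end address
        "old": m.group(3),
        "new": m.group(4),
    }


# The line quoted earlier in this thread.
line = ("[ 8.994140] mtrr: type mismatch for e0000000,10000000 "
        "old: write-back new: write-combining")
info = decode_mismatch(line)
print(f"{info['start']:08x}-{info['end']:08x} {info['old']} -> {info['new']}")
```

The decoded range is the same e0000000-efffffff range named in the
"Failed to add WC MTRR" line, so both messages refer to one request: a WC MTRR
asked for on top of an existing write-back one.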

> $dmesg | grep -i mtr for 4.3 kernel with default pat enabled
> [ 0.189368] calling mtrr_if_init+0x0/0x5f @ 1
> [ 0.189370] initcall mtrr_if_init+0x0/0x5f returned 0 after 0 usecs
> [ 0.189478] pmd_set_huge: Cannot satisfy [mem 0xf8000000-0xf8200000] with a huge-page mapping due to MTRR override.
> [ 0.189814] calling mtrr_init_finialize+0x0/0x3a @ 1
> [ 0.189815] initcall mtrr_init_finialize+0x0/0x3a returned 0 after 0 usecs

The fact that we don't see a conflict in this log doesn't mean an issue or
conflict wasn't triggered. If PAT didn't see something the BIOS did, the kernel
may have assumed it could do something that it was not able to. The MTRR init
code should pick up on this and let the kernel PAT code know if there could be
a conflict, but if for some reason that was missed, that could be an issue.

> I also checked my BIOS. I found nothing about mtrr. My BIOS manual is ftp://europe.asrock.com/Manual/H97%20Pro4.pdf. Can you see any option about MTRR?
>
> Question: If we assume your theory is correct about mtrr/pat, wouldn't lockup/hang reboot every time the system goes to hibernate/resume? Can this assumption explain why the first hibernation/resume cycles in rapid succession after system boot are working and the long ones fail somewhat more consistently?
>
> Note: With PAT enabled the system boots up significantly faster.
>
> In the weekend I will return to 3.18-rc2 and I will try to verify my bisection is correct. Double guessing your self is a terrible thing...
>
> I will also try with nopat and I will run dmesg | grep -i mtr and post results
>
> Unless you have any other suggestions...

Bisection on the merge commit would help.

Luis