Re: Hibernate resume bug around 3.18-rc2 - Full PAT support

From: Luis R. Rodriguez
Date: Mon Nov 23 2015 - 13:48:27 EST


On Thu, Nov 19, 2015 at 06:39:28AM +0100, Juergen Gross wrote:
> On 18/11/15 22:43, Vassilis Virvilis wrote:
> > Hi,
> >
> > I have been hit by a hibernate/resume bug. Other people may have too:
> > The following links are consistent with my observations
> >
> > https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1490494
> > https://bugs.archlinux.org/task/44807
> >
> > Some observations:
> > 1) The first few rapid hibernation / resume cycles do not fail.
> >
> > 2) If the computer is loaded (eclipse + chromium + firefox/iceweasel +
> > thunderbird/icedove + Konsole) helps to reproduce and lock up during resume

Let's try to speed up reproducing this.

I have a hunch this might be related to some BIOS-controlled MTRRs and a
mismatch that leads the kernel to think a certain type of MTRR write is OK
when in fact it's not. Given your workload description, this could be
related to fan control, with the BIOS' control of the fans conflicting with
some other device's MTRR. More on this suspicion on another thread where
you provide more logs.

On a kernel that you know fails, can you try replacing this workload with one
that brings your CPU to its knees quickly? Perhaps run 'make -j' on a Linux
build for 2, 4, 8, or 16 minutes, then hit CTRL-C and continue to hibernation,
to see if making the CPU fan kick in accelerates the issue. If 'make -j' is so
nuts that you can't even CTRL-C it, try 'make -j 16'. Note that if this is
true, a hot CPU could still trigger CPU fan control on a fresh boot if the
previous boot was CPU intensive.
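
The load-then-hibernate test above could be scripted roughly as follows. This
is only a sketch: KSRC (where the kernel tree lives), the use of
"systemctl hibernate" and the DRY_RUN switch are assumptions made for
illustration, so substitute however you normally hibernate. By default it
only prints what it would run.

```shell
#!/bin/sh
# Sketch: load the CPU with a kernel build for N minutes, interrupt it the
# way CTRL-C would (SIGINT), then hibernate.
# Assumptions: KSRC points at a kernel tree, and "systemctl hibernate" is
# how this machine hibernates. DRY_RUN=1 (the default) only prints commands.
KSRC=${KSRC:-$HOME/linux}
JOBS=${JOBS:-16}
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

stress_then_hibernate() {
    minutes=$1
    # timeout(1) sends SIGINT once the time is up; its non-zero exit status
    # is expected here, so it is ignored.
    run timeout -s INT "${minutes}m" make -C "$KSRC" -j"$JOBS" || true
    run systemctl hibernate
}

# Try one duration per hibernate/resume cycle: 2, 4, 8, then 16 minutes.
stress_then_hibernate "${1:-2}"
```

Run it with DRY_RUN=0 and the number of minutes as the argument once the
printed commands look right. A 'make clean' beforehand makes sure there is
enough left to build to keep the CPU busy for the whole interval.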

If this doesn't do it, let's try forcing an MTRR-capable driver. Graphics is
the obvious target, so try perhaps some 3D stuff or a screen saver prior to
hibernation. Note that even if you boot with nomtrr the BIOS may still use
MTRRs, and PAT use on Linux could assume MTRRs are not being used by drivers
while the BIOS still does something behind the scenes. This is actually one
reason why we can't just remove MTRR support from Linux: the BIOS may still do
some wacky stuff with MTRRs. One example I was given is that CPU fan control
might use WC MTRRs, so the kernel must be aware of this even if no MTRRs are
ever used by the Linux kernel itself -- this is the case as of v4.3 and
onwards.
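
To see what the BIOS and the kernel have actually set up, one could snapshot
the MTRR and PAT state before hibernating and again after a successful resume,
then diff the two. A minimal sketch, assuming debugfs is mounted at
/sys/kernel/debug and that you run it as root; the function name and the
output layout are made up for illustration.

```shell
#!/bin/sh
# Sketch: save MTRR and PAT related state into a directory for later diffing.
# /proc/mtrr needs CONFIG_MTRR; pat_memtype_list needs debugfs and PAT
# enabled. Anything unreadable is recorded as such rather than failing.
snapshot_memtypes() {
    outdir=$1
    mkdir -p "$outdir" || return 1
    for f in /proc/mtrr /sys/kernel/debug/x86/pat_memtype_list; do
        name=$(basename "$f")
        if [ -r "$f" ]; then
            cat "$f" > "$outdir/$name"
        else
            echo "not readable: $f" > "$outdir/$name"
        fi
    done
    # Kernel log lines mentioning MTRR or PAT; may be empty or need root.
    { dmesg 2>/dev/null || true; } |
        grep -i -e mtrr -e 'x86/PAT' -e 'x86 PAT' > "$outdir/dmesg" || true
}

out=${1:-$(mktemp -d)}
snapshot_memtypes "$out"
echo "saved to $out"
```

Taking one snapshot into, say, ./before and one into ./after and running
'diff -ru before after' would show whether anything changed across the
hibernate/resume cycle.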

If that doesn't help speed it up, maybe try all of it together: screen saver +
some 3D stuff + CPU-intensive stuff.

To help you speed up testing you can try reducing your build time by reducing
the amount of crap you have to build:

make localmodconfig

That should only build things your kernel has loaded as modules or is already
enabled (=y).

> > 3) Long hibernation times (overnight) helps to reproduce and lock up
> > during resume
> >
> > 4) For the bad commits (where the lockup during resume takes place) -
> > the image loading during resume is significantly faster. It is fast and
> > then it locks.
> >
> > How I hit the problem and what I have done:
> >
> > I am running debian unstable
> >
> > Debian went from 3.16 to 3.19 - hence the problem raised its ugly head.
> > I upgraded diligently up to 4.2.6 - The problem persists
> >
> > I started kernel bisection from 3.16 to 3.19 following
> > https://wiki.debian.org/DebianKernel/GitBisect
> >
> > One month and 25 kernels later see below for the bisect log
>
> Wow! Thanks for doing this work!
>

Vassilis, indeed, the amount of work you have put into this is extremely
appreciated!

> Juergen
>
> >
> > I hit some untestable kernel that weren't booting. They were hanging at
> > "Loading ramdisk..." before any actual kernel message.
> >
> > Looks like the first bad / untestable commit is from Juergen Gross /
> > Thomas Gleixner Merge branch 'x86-mm-for-linus' of
> > git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip [full PAT support]
> >

That is commit a023748d53c10850650fe86b1c4a7d421d576451
("Merge branch 'x86-mm-for-linus' of
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip")

Git is smart enough to tell you you've hit a merge commit and that any of the
commits in that merge could be the issue. This is why your bisect log shows a
slew of commits. The next step is to linearize the merge and then bisect
through that; this will then let us identify the exact commit that may have
caused the issue.

There are a few ways to do this; my preferred way is to "unfold" a merge
commit manually.

To help keep things separate (without affecting other tests you might
have on your other git tree, and to avoid forcing you to lose fresh
objects as you continue to build and test on the other tree), I'd do
something like this:

mkdir -p ~/tmp
cd ~/tmp
git clone ~/linux/.git linux-dev-test

cd linux-dev-test

Notice how if you do git log and search for a023748d53c10850650fe86b1c4a7d421d576451
you'll see that the commit listed before this is 773fed910d41e443e495a6bfa9ab1c2b7b13e012
("Merge branches 'x86-platform-for-linus' and 'x86-uv-for-linus' of
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip")

To be clear the list of commits you typically would see is just:

a023748d53c10850650fe86b1c4a7d421d576451 - Merge branch 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
773fed910d41e443e495a6bfa9ab1c2b7b13e012 - Merge branches 'x86-platform-for-linus' and 'x86-uv-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

We want to go down into the commits in the merge commit a023748d53c and
then zero in on exactly which commit caused the issue. To do that, in your
linux-dev-test directory you can do this:

git checkout -b test-merge-commit a023748d53c10850650fe86b1c4a7d421d576451

That will create a branch for testing based on the merge commit.
Now do this:

git rebase -i 773fed910d41e443e495a6bfa9ab1c2b7b13e012

Then don't change any of the commits listed, just save and exit the
editor, and git will actually "unfold" the merge commit for you -- it
magically will apply each commit in that merge commit linearly onto
your git history.

For instance the rebase should show 22 commits as follows; just
leave the defaults set as below and save and quit (ESC + :wq if
in vi):

pick 96e70f832856 x86/mm: Avoid overlap the fixmap area on i386
pick 63e7b6d90c1e x86: mm: Re-use the early_ioremap fixed area
pick bdee237c0343 x86: mm: Use 2GB memory block size on large-memory x86-64 systems
pick 281d4078bec3 x86: Make page cache mode a real type
pick c27ce0af896b x86: Use new cache mode type in include/asm/fb.h
pick 2d85ebf8e12e x86: Use new cache mode type in drivers/video/fbdev/gbefb.c
pick 5006e45a6bc2 x86: Use new cache mode type in drivers/video/fbdev/vermilion
pick 1c64216be164 x86: Use new cache mode type in arch/x86/pci
pick 2df58b6d3530 x86: Use new cache mode type in arch/x86/mm/init_64.c
pick d85f33342a0f x86: Use new cache mode type in asm/pgtable.h
pick 49a3b3cbdf16 x86: Use new cache mode type in mm/iomap_32.c
pick 2a3746984c98 x86: Use new cache mode type in track_pfn_remap() and track_pfn_insert()
pick 102e19e1955d x86: Remove looking for setting of _PAGE_PAT_LARGE in pageattr.c
pick c06814d8419a x86: Use new cache mode type in setting page attributes
pick b14097bd911c x86: Use new cache mode type in mm/ioremap.c
pick e00c8cc93c1a x86: Use new cache mode type in memtype related functions
pick 87ad0b713b10 x86: Clean up pgtable_types.h
pick f439c429c320 x86: Support PAT bit in pagetable dump for lower levels
pick f5b2831d6541 x86: Respect PAT bit when copying pte values between large and normal pages
pick bd809af16e3a x86: Enable PAT to use cache mode translation tables
pick 47591df50512 xen: Support Xen pv-domains using PAT
pick 0dbcae884779 x86: mm: Move PAT only functions to mm/pat.c

You should see:

Successfully rebased and updated refs/heads/test-merge-commit.
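
If you want to convince yourself of what the rebase does before touching the
real tree, the same steps can be tried on a throwaway repository. This is a
toy sketch: the repository, branch names and file names are all made up, with
"prior" standing in for 773fed910d41 and "merge" for a023748d53c1.

```shell
#!/bin/sh
# Toy demonstration of "unfolding" a merge commit with rebase.
# Everything here (repo, branch and file names) is made up.
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/mainline

commit() {  # usage: commit <file> <subject>
    echo "$2" >> "$1"
    git add "$1"
    git commit -q -m "$2"
}

commit base.txt "base"
git checkout -q -b topic
commit topic.txt "topic 1"
commit topic.txt "topic 2"
commit topic.txt "topic 3"
git checkout -q mainline
commit main.txt "mainline work"
prior=$(git rev-parse HEAD)     # commit just before the merge
git merge -q --no-ff -m "Merge branch 'topic'" topic
merge=$(git rev-parse HEAD)     # the merge commit itself

git checkout -q -b test-merge-commit "$merge"
# GIT_SEQUENCE_EDITOR=true accepts the todo list unchanged, which is the
# same as saving and quitting the editor without touching anything.
GIT_SEQUENCE_EDITOR=true git rebase -q -i "$prior"

linear_count=$(git rev-list --count "$prior"..HEAD)
merge_count=$(git rev-list --count --merges "$prior"..HEAD)
git log --oneline --graph
echo "commits on top of prior: $linear_count, merges among them: $merge_count"
```

The three topic commits end up stacked directly on top of the pre-merge
commit and the merge commit itself is gone, which is the shape we want for
the 22 commits listed above.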

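Now if you do git log you will see the above commits as a linear history
of atomic commits. You can now bisect this merge commit atomically, so do:

git bisect start 099487de0934e3d5e326666914a426af89a0868b 773fed910d41e443e495a6bfa9ab1c2b7b13e012

If 099487de0934 is not present on your tree (a rebase creates new commit
IDs), use the tip of your test-merge-commit branch as the bad commit instead.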

Note that this assumes that the commit prior to the merge commit is fine.
Is this true, can you confirm? (git checkout -b test-prior-merge-gtest 773fed910d4,
build and see if it doesn't break there)

If we know for sure 773fed910d4 did not break anything then the above bisect
should let us zero in on the exact atomic commit ID that caused the issue.
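
For reference, here is how a bisect over a linear history plays out on a
throwaway repository. This is a toy sketch with made-up commits and a made-up
"bug"; it uses 'git bisect run' with a trivial check only because the toy bug
can be detected by a script, whereas here each step needs a build and a
hibernate/resume cycle followed by a manual 'git bisect good' or
'git bisect bad'.

```shell
#!/bin/sh
# Toy demonstration of bisecting a linear history between a known good and
# a known bad commit. Repo, file names and the "bug" are made up.
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
cd "$repo"
git init -q .

# Eight commits; the fifth one introduces the "bug".
for i in 1 2 3 4 5 6 7 8; do
    if [ "$i" -eq 5 ]; then
        echo BUG >> file.txt
    else
        echo "line $i" >> file.txt
    fi
    git add file.txt
    git commit -q -m "commit $i"
done
good=$(git rev-list --max-parents=0 HEAD)   # "commit 1", known good
bad=$(git rev-parse HEAD)                   # "commit 8", known bad

# Argument order: bad commit first, then the good one.
git bisect start "$bad" "$good" >/dev/null
# Exit status 0 marks a commit good, 1 marks it bad.
git bisect run sh -c '! grep -q BUG file.txt' >/dev/null
first_bad=$(git log -1 --format=%s refs/bisect/bad)
git bisect reset >/dev/null 2>&1
echo "first bad commit: $first_bad"
```

With 22 commits in the unfolded merge, the real bisect should need about
five build-and-test rounds.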

> > Full disclaimer: I may have fucked up the bisection. Finding bad commits
> > was semi easy - finding good commits needs a run time for 2-3 days.

Reducing the amount of time it takes to reproduce a bug is an art, but perhaps
we can reduce that time.

> >
> > I would really appreciate some help and directions to nail this down.
> >

The amount of time and patience on your side is appreciated as well.

> >
> > Regards
> >
> > Vassilis Virvilis
> >
> >
> >
> > bill@localhost:~/Downloads/linux$ git bisect log
> > git bisect start
> > # good: [19583ca584d6f574384e17fe7613dfaeadcdc4a6] Linux 3.16
> > git bisect good 19583ca584d6f574384e17fe7613dfaeadcdc4a6
> > # bad: [bfa76d49576599a4b9f9b7a71f23d73d6dcff735] Linux 3.19
> > git bisect bad bfa76d49576599a4b9f9b7a71f23d73d6dcff735
> > # good: [754c780953397dd5ee5191b7b3ca67e09088ce7a] Merge branch
> > 'for-v3.18' of git://git.linaro.org/people/mszyprowski/linux-dma-mapping
> > git bisect good 754c780953397dd5ee5191b7b3ca67e09088ce7a
> > # bad: [7ef58b32f571bffb7763c6252ad7527562081f34] Merge tag
> > 'devicetree-for-linus' of
> > git://git.kernel.org/pub/scm/linux/kernel/git/glikely/linux
> > git bisect bad 7ef58b32f571bffb7763c6252ad7527562081f34
> > # good: [53429290a054b30e4683297409fc4627b2592315] Merge
> > git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc
> > git bisect good 53429290a054b30e4683297409fc4627b2592315
> > # good: [3a647c1d7ab08145cee4b650f5e797d168846c51] Merge tag
> > 'drivers-for-linus' of
> > git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
> > git bisect good 3a647c1d7ab08145cee4b650f5e797d168846c51
> > # bad: [1366f5d3129f2abde606214de7afc3dd61781fa3] Merge branch
> > 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
> > git bisect bad 1366f5d3129f2abde606214de7afc3dd61781fa3
> > # good: [151cd97630f87451cab412e40750d0e5f7581c98] Merge tag
> > 'defconfig-for-linus' of
> > git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
> > git bisect good 151cd97630f87451cab412e40750d0e5f7581c98
> > # good: [ecb50f0afd35a51ef487e8a54b976052eb03d729] Merge branch
> > 'irq-core-for-linus' of
> > git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
> > git bisect good ecb50f0afd35a51ef487e8a54b976052eb03d729
> > # bad: [3a5dc1fafb016560315fe45bb4ef8bde259dd1bc] Merge branch
> > 'x86-microcode-for-linus' of
> > git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
> > git bisect bad 3a5dc1fafb016560315fe45bb4ef8bde259dd1bc
> > # good: [b6444bd0a18eb47343e16749ce80a6ebd521f124] Merge branch
> > 'x86-boot-for-linus' of
> > git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
> > git bisect good b6444bd0a18eb47343e16749ce80a6ebd521f124
> > # bad: [a023748d53c10850650fe86b1c4a7d421d576451] Merge branch
> > 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
> > git bisect bad a023748d53c10850650fe86b1c4a7d421d576451
> > # good: [773fed910d41e443e495a6bfa9ab1c2b7b13e012] Merge branches
> > 'x86-platform-for-linus' and 'x86-uv-for-linus' of
> > git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
> > git bisect good 773fed910d41e443e495a6bfa9ab1c2b7b13e012
> > # good: [49a3b3cbdf1621678a39bd95a3e67c0f858539c7] x86: Use new cache
> > mode type in mm/iomap_32.c
> > git bisect good 49a3b3cbdf1621678a39bd95a3e67c0f858539c7
> > # skip: [87ad0b713b1034b6caf559976c35ce47f6d1d1e9] x86: Clean up
> > pgtable_types.h
> > git bisect skip 87ad0b713b1034b6caf559976c35ce47f6d1d1e9
> > # skip: [c06814d8419a74528500f85faf5fc01f67f8e7e6] x86: Use new cache
> > mode type in setting page attributes
> > git bisect skip c06814d8419a74528500f85faf5fc01f67f8e7e6
> > # skip: [e00c8cc93c1ac01ecd5049929a50fb47b62bb041] x86: Use new cache
> > mode type in memtype related functions
> > git bisect skip e00c8cc93c1ac01ecd5049929a50fb47b62bb041
> > # skip: [bd809af16e3ab1f8d55b3e2928c47c67e2a865d2] x86: Enable PAT to
> > use cache mode translation tables
> > git bisect skip bd809af16e3ab1f8d55b3e2928c47c67e2a865d2
> > # skip: [f439c429c320981943f8b64b2a4049d946cb492b] x86: Support PAT bit
> > in pagetable dump for lower levels
> > git bisect skip f439c429c320981943f8b64b2a4049d946cb492b
> > # skip: [47591df505129c9774af6cca2debf283a6e56ed7] xen: Support Xen
> > pv-domains using PAT
> > git bisect skip 47591df505129c9774af6cca2debf283a6e56ed7
> > # skip: [b14097bd911c2554b0b5271b3a6b2d84044d1843] x86: Use new cache
> > mode type in mm/ioremap.c
> > git bisect skip b14097bd911c2554b0b5271b3a6b2d84044d1843
> > # skip: [102e19e1955d85f31475416b1ee22980c6462cf8] x86: Remove looking
> > for setting of _PAGE_PAT_LARGE in pageattr.c
> > git bisect skip 102e19e1955d85f31475416b1ee22980c6462cf8
> > # skip: [f5b2831d654167d77da8afbef4d2584897b12d0c] x86: Respect PAT bit
> > when copying pte values between large and normal pages
> > git bisect skip f5b2831d654167d77da8afbef4d2584897b12d0c
> > # skip: [0dbcae884779fdf7e2239a97ac7488877f0693d9] x86: mm: Move PAT
> > only functions to mm/pat.c
> > git bisect skip 0dbcae884779fdf7e2239a97ac7488877f0693d9
> > # skip: [2a3746984c98b17b565e6a2c2bbaaaef757db1b4] x86: Use new cache
> > mode type in track_pfn_remap() and track_pfn_insert()
> > git bisect skip 2a3746984c98b17b565e6a2c2bbaaaef757db1b4
> > # only skipped commits left to test
> > # possible first bad commit: [a023748d53c10850650fe86b1c4a7d421d576451]
> > Merge branch 'x86-mm-for-linus' of
> > git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
> > # possible first bad commit: [0dbcae884779fdf7e2239a97ac7488877f0693d9]
> > x86: mm: Move PAT only functions to mm/pat.c
> > # possible first bad commit: [47591df505129c9774af6cca2debf283a6e56ed7]
> > xen: Support Xen pv-domains using PAT
> > # possible first bad commit: [bd809af16e3ab1f8d55b3e2928c47c67e2a865d2]
> > x86: Enable PAT to use cache mode translation tables
> > # possible first bad commit: [f5b2831d654167d77da8afbef4d2584897b12d0c]
> > x86: Respect PAT bit when copying pte values between large and normal pages
> > # possible first bad commit: [f439c429c320981943f8b64b2a4049d946cb492b]
> > x86: Support PAT bit in pagetable dump for lower levels
> > # possible first bad commit: [87ad0b713b1034b6caf559976c35ce47f6d1d1e9]
> > x86: Clean up pgtable_types.h
> > # possible first bad commit: [e00c8cc93c1ac01ecd5049929a50fb47b62bb041]
> > x86: Use new cache mode type in memtype related functions
> > # possible first bad commit: [b14097bd911c2554b0b5271b3a6b2d84044d1843]
> > x86: Use new cache mode type in mm/ioremap.c
> > # possible first bad commit: [c06814d8419a74528500f85faf5fc01f67f8e7e6]
> > x86: Use new cache mode type in setting page attributes
> > # possible first bad commit: [102e19e1955d85f31475416b1ee22980c6462cf8]
> > x86: Remove looking for setting of _PAGE_PAT_LARGE in pageattr.c
> > # possible first bad commit: [2a3746984c98b17b565e6a2c2bbaaaef757db1b4]
> > x86: Use new cache mode type in track_pfn_remap() and track_pfn_insert()
>

--
Luis Rodriguez, SUSE LINUX GmbH
Maxfeldstrasse 5; D-90409 Nuernberg