Re: Hibernate resume bug around 3,18-rc2 - Full PAT support

From: vasvir
Date: Tue Nov 24 2015 - 04:37:17 EST


> Let's try to speed up reproducing this.
>
> I have a hunch perhaps this might be related to some BIOS controlled
> MTRRs and a mismatch which then enables the kernel to think that a type
> of MTRR write might be OK, but in fact its not. Due to the work load
> description of this perhaps this could be related to fan control and BIOS
> control on them and against some other device MTRR. More on this suspicion
> on another thread where you provide more logs.
>
> On a kernel that you know fails can you try replacing this work load by
> making
> you CPU crawl to its knees quickly, perhaps 'make -j' on Linux building
> for 2,
> 4, 8, 16, minutes and then hit CTRL C to continue to hibernation to see if
> making the CPU fan trigger would accelerate the issue. If 'make -j' is
> too nuts
> to the point you can't even CTRL C it, try 'make -j 16' . Note that if
> this is
> true then that means a hot CPU could still trigger CPU fan controls on on
> a
> fresh boot if the previous boot was CPU intensive.

OK that nailed it - with kernel 4.3 a known "bad" kernel I was able to
reproduce it in the second hibernate/resume cycle. Here is what I did in
my own words so you can spot inconsistencies.

I started a kernel compile with make -j 32. My computer was very
responsive which is an impressive feat by the way.
In a second tab in my Konsole (I am running KDE) I run $watch sensors. I
watched the temperature of the cores to go from 38 to ~70 and the cpu fan
from ~1630 to ~1900. Then the first time I hit Ctrl+C - stopped the
compilation and hibernated from the KDE. I always hibernate from the KDE
start menu. Previously I had made some tests where I was hibernating from
the VT console (although sddm may was running in VT7) and I have managed
to reproduce it - so (in my mind) it was not graphics mode specific. From
that point I am always hibernating from KDE.

The first time it worked. For the second time I thought - why to hit
Ctrl+C let's try to hibernate with the compilation running - and it
failed. Now I don't know if it failed because it was the second cycle or
because the load of the compilation was there or because of the
temperature controlled fan register you mentioned.

Then I repeated the test with a known good kernel 3.18 (which should be
773fed910d41e443e495a6bfa9ab1c2b7b13e012 according to my git bisect logs -
I have a problem there - see below) and it survived the same test
(hibernate two times with temperature being ~70).


> If this doesn't do it lets try forcing an MTRR capable driver, say
> graphics is
> the obvious target, try perhaps some 3D stuff or a screen saver prior to
> hibernation. Note that even if you boot nomtrr the BIOS may still use
> MTRRs,
> and PAT use on Linux could assume MTRR is not being used on drivers but
> the
> BIOS may still do something behind the scenes. This is actually one reason
> why
> we can't exactly remove MTRR support from Linux, since the BIOS may still
> do
> some wacky stuff with MTRRs, one example of such I was given was CPU can
> control might use WC MTRRs, so the kernel must be aware of this, even if
> no
> MTRRs are ever used on the Linux kernel at all -- this is the case now as
> of
> v4.3 and onwards.
>
> If that doesn't help speed it up , maybe try both screen saver + some 3D
> stuff + cpu instensive stuff.

I have 3D effects enabled in my KDE. Since your tip succeed to reproduce
the problem early I didn't bother but If I should test 3D which program /
benchmark should I run? glxgears?

>
> To help you speed up testing you can try reducing your build time by
> reducing
> the amount of crap you have to build:
>
> make localmodconfig
>
> That should only build things your kernel has loaded as modules or is
> already
> enabled (=y).
>

Thanks for the tip. I don't want to change that right now. I don't mind
waiting a little bit because I a get a deb with the kernel and can retest
a known configuration. The other tip you gave if it actually works as it
looks like working would give a great boost to the debugging cycle to
actually make me the bottleneck.

>
> That is commit a023748d53c10850650fe86b1c4a7d421d576451
> ("Merge branch 'x86-mm-for-linus' of
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip")
>
> Git is smart enough to tell you you've hit a merge commit and that all the
> possible commits on that merge could be the issue. This is why you bisect
> log shows a slew of commits. The next step is to bisect through the merge
> and then bisect through that, this will then let us identify the exact
> commit
> that may have caused the issue.
>
> There are a few ways to do this, my preferred way is to "unfold" a merge
> commit manually.
>
> To help keep thing separately (without affecting other tests you might
> have on your other git tree and to avoid having to force you to loose
> fresh object as you continue to build test on the other tree), I'd do
> something like this:

we will go with your preferred way - no question about that.

>
> mkdir ~/tmp
> git clone ~/linux/.git linux-dev-test

ok I have them in paralled ~/path/linux ~/path/linux-dev-test

>
> cd linux-dev-test
>
> Notice how if you do git log and search for
> a023748d53c10850650fe86b1c4a7d421d576451
> you'll see that the commit listed before this is
> 773fed910d41e443e495a6bfa9ab1c2b7b13e012
> ("Merge branches 'x86-platform-for-linus' and 'x86-uv-for-linus' of
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip")
>
> To be clear the list of commits you typically would see is just:
>
> a023748d53c10850650fe86b1c4a7d421d576451 - Merge branch 'x86-mm-for-linus'
> of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
> 773fed910d41e443e495a6bfa9ab1c2b7b13e012 - Merge branches
> 'x86-platform-for-linus' and 'x86-uv-for-linus' of
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
>
> We want to go down into the commits in the merge commit a023748d53c and
> then zero out exactly which commit caused the issue. To do that on your
> linux-dev-test directory you can do this:

Thank you for the explanations. I thing I had understood that bit. git
bisect visualized (gitk) helped me to grasp it. git log gave me a hard
time with all these "hidden commits". Confirmation is good.

>
> git checkout -b test-merge-commit a023748d53c10850650fe86b1c4a7d421d576451
>
> That will create branch for testing based on the merge commit.
> Now do this:
>
> git rebase -i 773fed910d41e443e495a6bfa9ab1c2b7b13e012
>
> Then don't pick any commit, just save and exit the editor, and then
> git will actually "unfold" the merge commit for you -- it magically
> will apply each commit in that merge commit linearly into your git
> history.
>
> For instance the rebase should show 22 commits as follows, just
> leave the defaults set as in bewlow and just hit (ESCT + :wq if
> in vi):
>
> pick 96e70f832856 x86/mm: Avoid overlap the fixmap area on i386
> pick 63e7b6d90c1e x86: mm: Re-use the early_ioremap fixed area
> pick bdee237c0343 x86: mm: Use 2GB memory block size on large-memory
> x86-64 systems
> pick 281d4078bec3 x86: Make page cache mode a real type
> pick c27ce0af896b x86: Use new cache mode type in include/asm/fb.h
> pick 2d85ebf8e12e x86: Use new cache mode type in
> drivers/video/fbdev/gbefb.c
> pick 5006e45a6bc2 x86: Use new cache mode type in
> drivers/video/fbdev/vermilion
> pick 1c64216be164 x86: Use new cache mode type in arch/x86/pci
> pick 2df58b6d3530 x86: Use new cache mode type in arch/x86/mm/init_64.c
> pick d85f33342a0f x86: Use new cache mode type in asm/pgtable.h
> pick 49a3b3cbdf16 x86: Use new cache mode type in mm/iomap_32.c
> pick 2a3746984c98 x86: Use new cache mode type in track_pfn_remap() and
> track_pfn_insert()
> pick 102e19e1955d x86: Remove looking for setting of _PAGE_PAT_LARGE in
> pageattr.c
> pick c06814d8419a x86: Use new cache mode type in setting page attributes
> pick b14097bd911c x86: Use new cache mode type in mm/ioremap.c
> pick e00c8cc93c1a x86: Use new cache mode type in memtype related
> functions
> pick 87ad0b713b10 x86: Clean up pgtable_types.h
> pick f439c429c320 x86: Support PAT bit in pagetable dump for lower levels
> pick f5b2831d6541 x86: Respect PAT bit when copying pte values between
> large and normal pages
> pick bd809af16e3a x86: Enable PAT to use cache mode translation tables
> pick 47591df50512 xen: Support Xen pv-domains using PAT
> pick 0dbcae884779 x86: mm: Move PAT only functions to mm/pat.c
>

Ok I will do later tonight. But from my (git bisect) logs what I was
expecting was

# only skipped commits left to test
# possible first bad commit: [a023748d53c10850650fe86b1c4a7d421d576451]
Merge branch 'x86-mm-for-linus' of
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
# possible first bad commit: [0dbcae884779fdf7e2239a97ac7488877f0693d9]
x86: mm: Move PAT only functions to mm/pat.c
# possible first bad commit: [47591df505129c9774af6cca2debf283a6e56ed7]
xen: Support Xen pv-domains using PAT
# possible first bad commit: [bd809af16e3ab1f8d55b3e2928c47c67e2a865d2]
x86: Enable PAT to use cache mode translation tables
# possible first bad commit: [f5b2831d654167d77da8afbef4d2584897b12d0c]
x86: Respect PAT bit when copying pte values between large and normal
pages
# possible first bad commit: [f439c429c320981943f8b64b2a4049d946cb492b]
x86: Support PAT bit in pagetable dump for lower levels
# possible first bad commit: [87ad0b713b1034b6caf559976c35ce47f6d1d1e9]
x86: Clean up pgtable_types.h
# possible first bad commit: [e00c8cc93c1ac01ecd5049929a50fb47b62bb041]
x86: Use new cache mode type in memtype related functions
# possible first bad commit: [b14097bd911c2554b0b5271b3a6b2d84044d1843]
x86: Use new cache mode type in mm/ioremap.c
# possible first bad commit: [c06814d8419a74528500f85faf5fc01f67f8e7e6]
x86: Use new cache mode type in setting page attributes
# possible first bad commit: [102e19e1955d85f31475416b1ee22980c6462cf8]
x86: Remove looking for setting of _PAGE_PAT_LARGE in pageattr.c
# possible first bad commit: [2a3746984c98b17b565e6a2c2bbaaaef757db1b4]
x86: Use new cache mode type in track_pfn_remap() and track_pfn_insert()

commit a023748d53c10850650fe86b1c4a7d421d576451 contains all the other
commits listed below. The order is that newest is higher

Note that these commits listed above are untestable because the resulting
kernels are not bootable. They hang in the second line of boot output in
"Loading Ramdisk..." or something similar.

my last good commit was 49a3b3cbdf1621678a39bd95a3e67c0f858539c7 that
means with git bisect I had already started zeroing in in
a023748d53c10850650fe86b1c4a7d421d576451 since 49a3b3... was part of
a0237...

So based on my git bisect so far my understanding is
last good merge commit: 773fed910d41e443e495a6bfa9ab1c2b7b13e012
last bad merge commit (next after 773fed...):
a023748d53c10850650fe86b1c4a7d421d576451
last good commit (inside a023748d53c10850650fe86b1c4a7d421d576451):
49a3b3cbdf1621678a39bd95a3e67c0f858539c7
all the others from 49a3b3cbdf1... to a023748d53c1... are
untestable/unbootable kernels.

Please correct me if I am wrong - it will help me build the correct mental
model.

> You should see:
>
> Successfully rebased and updated refs/heads/test-merge-commit.
>
> Now if you do git log you will see the above commits in linear
> atomic history. You can now bisect this merge commit atomically, so do:
>
> git bisect 099487de0934e3d5e326666914a426af89a0868b
> 773fed910d41e443e495a6bfa9ab1c2b7b13e012
>
> Note that this assumes that the commit prior to the merge commit is fine.
> Is this true, can you confirm? (git checkout -b test-prior-merge-gtest
> 773fed910d4,
> build and see if it doesn't break there)
>
> If we know for sure 773fed910d4 did not break anything then the above
> bisect
> should let us zero in on the exact atomic commit ID that caused the issue.
>

Now the problem is that I tried twice to verify that
773fed910d41e443e495a6bfa9ab1c2b7b13e012 is indeed a good commit and I
ended up with an unbootable kernel (hangs in "Loading Ramdisk..."). This
is very disappointing and means that all my bisections so far are invalid.
Very disappointing indeed but it's only a setback. I will figure it out
and will make sure I have a valid setup for reproducible tests before I
bother you again.

Just for the record I did
$git checkout 773fed910d41e443e495a6bfa9ab1c2b7b13e012
$fakeroot make -j 4 CC=gcc-4.8 deb-pkg

I will do as you suggest with the unfold of commits - but if my bisection
was right (serious hints to the opposite exist) I stopped on
unbootable/untestable kernels

Thanks for the exhaustive mails with the explanations and the tips. They
are much appreciated.

Vassilis


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/