runtime regression with "x86/mm/pat: Emulate PAT when it is disabled"

From: Paul Gortmaker
Date: Thu Mar 03 2016 - 15:59:46 EST


So, the yocto folks moved from 4.1 to 4.4 and one of their automated
qemu x86-32 boot tests started failing. None of the yocto details seem
to matter since I offered to help and I've reproduced it using 100%
mainline kernels and a generic distro toolchain as well.

The test case is slightly complicated, in that it relies on uvesafb
being modular, and so one has to juggle modules within an ext4 image
that qemu boots from. We tried making uvesafb builtin, but that made
the issue magically vanish. Given PAT, this isn't too surprising.

Richard did the preliminary investigation and analysis, and from that I
did a bisect, and found the commit in $SUBJECT to be the root cause, as
per the discussion here:

http://lists.openembedded.org/pipermail/openembedded-core/2016-March/118397.html

I'd mentioned the above to bpetkov on IRC and after confirming it was
still an issue on 4.5-rc6, he'd asked if I had a portable reproducer.

Not sure how complicated that would be, I set out to make one from my
build. With a little LD_PRELOAD type magic and ensuring all the qemu
components are in ./ I have one that runs on an otherwise qemu-free
x86-64 box.

The standalone reproducer is here; launch it via 00-runme:

http://openlinux.wrs.com/pat-splat/reproducer.tar.bz2

It is nothing fancy, just a generic yocto build of "sato" (gfx enabled
rootfs). When it "works" it boots to a UI touchscreen interface. When
it fails, you get a black screen with a blinking cursor (as seen in
"vncviewer localhost:0").

Upon failure, you can do <Ctrl>-<Alt>-<2> to get to a passwd-less root
login; there you can run dmesg and see the splat. The reproducer
currently contains a bzImage and modules for 4.5-rc6, as that is what I
last tested, but any kernel can be inserted: run "make modules_install
INSTALL_MOD_PATH=here", then copy the modules from "here" into
/lib/modules of the loopback mounted image, and of course update the
bzImage on the qemu cmdline.
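
For anyone repeating that step, a rough sketch of the module juggling
could look like the following. The image name ext4.img comes from the
tarball; the mount point "mnt", the use of sudo, and the helper names
run and repopulate_modules are assumptions made for illustration and
are not part of the reproducer. It expects to be run from the top of
the built kernel tree.

```shell
#!/bin/sh
# Sketch only: refresh /lib/modules inside the loopback-mounted rootfs.
# Assumed (not from the reproducer): mount point, sudo, helper names.
# Set DRY_RUN=1 to print the commands instead of executing them.

run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

repopulate_modules() {
    img=$1      # rootfs image, e.g. ext4.img
    stage=$2    # staging dir passed as INSTALL_MOD_PATH, e.g. here
    mnt=$3      # temporary mount point (hypothetical name)

    # Install the freshly built modules into the staging directory.
    run make modules_install INSTALL_MOD_PATH="$stage" || return 1

    # Loopback mount the image and copy the modules into it.
    run mkdir -p "$mnt" || return 1
    run sudo mount -o loop "$img" "$mnt" || return 1
    run sudo mkdir -p "$mnt/lib/modules"
    run sudo cp -a "$stage/lib/modules/." "$mnt/lib/modules/"
    run sudo umount "$mnt"
}
```

Something like "DRY_RUN=1 repopulate_modules ext4.img here mnt" prints
the commands without touching anything, which is a cheap way to check
the paths before each bisect step. The bzImage on the qemu cmdline
still has to be updated separately.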

Also note that vncviewer will disconnect when it goes from early boot
80x25 to a higher res gfx mode; just reconnect and continue observing the
target.

I've ruled out yocto kernel changes, and yocto toolchain -- but maybe it
is a qemu issue this commit triggers ; who knows at this point.

Since I've NFI what component(s) cause this, I wanted to have the qemu
binary, all libraries etc as part of the reproducer and nothing left to
chance, and I've tested the reproducer on an ancient dual core w/o vmx
and w/o any qemu binaries installed. Bruce also tested it on a slightly
more modern dual socket xeon with vmx and confirmed it failed there.

Inside there is a 00-runme ; mostly a copy of qemu args the yocto
automated tests were using. There is also everything the qemu binaries
need to run ; toplevel dir is noisy since qemu only looks in ./ it
seems. There is also an ext4.img ; as mentioned earlier, this only
happens when uvesafb.ko is a module, so one has to loopback mount that
image and repopulate /lib/modules/ for each boot test/bisect step.

I've also included 00-bisect.txt, which is the output of git bisect log.
There is also a 00-configs/ dir that has the ".config" kernel file for
each build done for the bisect, plus the latest mainline build (dir
names in here are the "git describe" output, for easy correlation). The
failing commit in the subject is v4.1-rc5-22-g9cd25aac1f44.
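
Since the 00-configs/ dir names are "git describe" strings, here is a
small sketch of how one of them splits into its parts (nearest tag,
number of commits on top of that tag, abbreviated commit id). The
function name parse_describe is made up for this example.

```shell
#!/bin/sh
# Sketch: split a "git describe" string of the form <tag>-<n>-g<hash>.
# parse_describe is a hypothetical helper, not part of the reproducer.

parse_describe() {
    desc=$1
    hash=${desc##*-g}     # everything after the last "-g"
    rest=${desc%-g*}      # drop the trailing "-g<hash>"
    count=${rest##*-}     # commits on top of the tag
    tag=${rest%-*}        # the tag itself (may contain dashes)
    echo "tag=$tag commits_after_tag=$count commit=$hash"
}

parse_describe v4.1-rc5-22-g9cd25aac1f44
# prints: tag=v4.1-rc5 commits_after_tag=22 commit=9cd25aac1f44
```

So the failing commit is the 22nd commit on top of v4.1-rc5 along the
path "git describe" walked, with abbreviated id 9cd25aac1f44.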

My contribution here is largely a bisect that can be relied on and
providing a portable reproducer of the regression; I am by no means a
PAT expert ; Richard invested more time into actually understanding the
problem than I did, so I'm going to totally throw him under the bus on
this when it comes to considering the ultimate root cause and possible
fixes. :)

Paul.
--