Re: Kernel crash while doing chroot'ed grub2-mkconfig on qemu-emulated Nehalem CPU since late November 6.13 snapshot
From: Adam Williamson
Date: Fri Jan 10 2025 - 12:28:14 EST
On Fri, 2025-01-10 at 11:57 +0200, Mike Rapoport wrote:
> Hi Adam,
>
> On Thu, Jan 02, 2025 at 12:16:03PM -0800, Adam Williamson wrote:
> > On Wed, 2024-12-11 at 08:51 -0800, Adam Williamson wrote:
> > > Hi, folks. Please CC me on replies, I'm not subscribed to the list. The
> > > downstream bug report for this is
> > > https://bugzilla.redhat.com/show_bug.cgi?id=2329581 . I also filed
> > > https://bugzilla.kernel.org/show_bug.cgi?id=219554 but it looks like
> > > nobody is monitoring that ATM, hence this email. Sorry, I don't know
> > > where to send it that would be more targeted.
> > >
> > > I maintain Fedora's openQA instance - https://openqa.fedoraproject.org/
> > > (openQA is an automated testing system which runs jobs on qemu VMs,
> > > inputting keyboard and mouse events via VNC, and monitoring results via
> > > screenshots and the serial console).
> > >
> > > In openQA testing we've noticed a lot of failures of install tests
> > > since kernel-6.13.0-0.rc0.20241125git9f16d5e6f220.8.fc42 landed in
> > > Rawhide - that is, a snapshot of upstream git 9f16d5e6f220 . The
> > > previous build, kernel-6.13.0-0.rc0.20241119git158f238aa69d.2.fc42 - a
> > > snapshot of upstream 158f238aa69d - did not show this problem. The
> > > problems persist with the latest kernel build, kernel-6.13.0-
> > > 0.rc2.22.fc42 (a build of 6.13 rc2 exactly).
> > >
> > > Both BIOS and UEFI x86_64 installs are frequently hitting kernel
> > > crashes when the Fedora installer runs grub2-mkconfig as part of the
> > > install process. In the BIOS case, this causes the system to hang
> > > permanently. In the UEFI case, the system hangs for a while then
> > > reboots, and fails to boot properly as the installation did not
> > > complete.
> > >
> > > I've reproduced both BIOS and UEFI failures locally with a qemu VM
> > > configured like the one we use in the affected tests: 2 vCPUs, 4G RAM,
> > > and CPU model Nehalem - that's `-cpu Nehalem` argument to qemu. If I
> > > use host CPU config instead, the bug doesn't happen. We intentionally
> > > use the Nehalem model in this testing to ensure Fedora doesn't
> > > inadvertently stop supporting the CPU baseline it intends to support.
> > >
> > > This happens on more than 50% of install attempts, but not all of them
> > > (sometimes they work; I've set our test system to retry failures five
> > > times for now to mitigate the effects of this bug).
> > >
> > > The details of the traces we get in the kernel logs differ between
> > > occurrences and also between BIOS and UEFI, which someone suggested
> > > indicates this may be some kind of memory corruption issue. But the
> > > broad shape is consistent: the installer reaches grub2-mkconfig and we
> > > get a kernel crash.
> > >
> > > I did also try reproducing this by running `grub2-mkconfig -o
> > > /boot/grub/grub2.cfg` multiple times on an *installed* VM with the same
> > > kernel and VM config, but could not trigger a crash in this case. There
> > > must be something specific about how this happens in the installer
> > > environment (for one thing, the installer runs the command chroot'ed
> > > into the installed system environment).
> > >
> > > I'll attach sample logs from a UEFI failure and a BIOS failure.
> > >
> > > I haven't attempted to bisect this yet as I find bisecting kernel
> > > issues pretty painful (the Fedora kernel package spec is a bit weird if
> > > you're not used to it, building a full kernel takes a long time, I
> > > don't know how to do intermittent builds with the Fedora kernel spec,
> > > and since I can't yet reproduce this outside the installer I then have
> > > to build an installer image with the kernel build in to test it...).
> > > But if needs must I'll bite the bullet and do it. If anyone could e.g.
> > > guess at a commit or commit series that might be causing this so I
> > > could try a targeted reversion, though, that'd be great.
> >
> > Update on this: over the holidays, I bisected it to
> > 5185e7f9f3bd754ab60680814afd714e2673ef88 . A kernel with that commit
> > reverted does not hit the bug.
> >
> > I also did some testing with various CPU model configurations. I think
> > this actually isn't to do with Nehalem per se, but "virtual machines
> > where the CPU configuration does not exactly match the host", or
> > something like that.
> >
> > I tried a bunch of qemu CPU model settings - nehalem, sandybridge,
> > haswell, Skylake-Client and Cascadelake-Server - and got failures with
> > all of them, but when I set the model to "host", all tests passed.
> >
> > The tests get farmed out to a cluster of systems which have different
> > CPUs - one is Broadwell, one is Skylake, one is Cascade Lake - so I
> > think when I set the model to anything specific, it will match the host
> > CPU on some or none of those systems, but never *all* of them, so the
> > bug will always show up.
> >
> > I have emailed the author and reviewer of
> > 5185e7f9f3bd754ab60680814afd714e2673ef88 (also CCed on this mail) but
> > have not heard back from them yet. I've sunk over a week into this bug
> > at this point so it'd be great if someone could look at it. It's not
> > the biggest regression in the world, but it is a bit awkward for our
> > automated testing (I'll have to fiddle around to try and set CPU model
> > 'host' for the most badly-affected tests but ensure we still have
> > enough tests with 'nehalem' to confirm our baseline isn't moved).
> >
> > Thanks, and happy new year!
>
> Can you please test this patch:
>
> diff --git a/mm/execmem.c b/mm/execmem.c
> index be6b234c032e..0090a6f422aa 100644
> --- a/mm/execmem.c
> +++ b/mm/execmem.c
> @@ -266,6 +266,7 @@ static int execmem_cache_populate(struct execmem_range *range, size_t size)
>  	unsigned long vm_flags = VM_ALLOW_HUGE_VMAP;
>  	struct execmem_area *area;
>  	unsigned long start, end;
> +	unsigned int page_shift;
>  	struct vm_struct *vm;
>  	size_t alloc_size;
>  	int err = -ENOMEM;
> @@ -296,8 +297,9 @@ static int execmem_cache_populate(struct execmem_range *range, size_t size)
>  	if (err)
>  		goto err_free_mem;
>
> +	page_shift = get_vm_area_page_order(vm) + PAGE_SHIFT;
>  	err = vmap_pages_range_noflush(start, end, range->pgprot, vm->pages,
> -				       PMD_SHIFT);
> +				       page_shift);
>  	if (err)
>  		goto err_free_mem;
>
Hi Mike! Thanks. I can indeed, and I will, but also an update: on
further testing, sadly, using 'host' CPU for qemu doesn't really avoid
the bug either :/ The initial test must have just gotten lucky. I
implemented that as a 'workaround' in our openQA system and dropped the
five automatic retries per test I was using as a bludgeon, but then
failures started showing up again :/ So I've had to put the five
retries back in place for now.
Sorry if this sent you down any wrong paths. I will test the patch
unless you tell me it's useless with this new information :)
--
Adam Williamson (he/him/his)
Fedora QA
Fedora Chat: @adamwill:fedora.im | Mastodon: @adamw@xxxxxxxxxxxxx
https://www.happyassassin.net