Re: [PATCH qemu v3] x86: don't let decompressed kernel image clobber setup_data

From: Nathan Chancellor
Date: Wed Feb 08 2023 - 12:47:32 EST


Hi Jason,

On Fri, Dec 30, 2022 at 11:07:25PM +0100, Jason A. Donenfeld wrote:
> The setup_data links are appended to the compressed kernel image. Since
> the kernel image is typically loaded at 0x100000, setup_data lives at
> `0x100000 + compressed_size`, which does not get relocated during the
> kernel's boot process.
>
> The kernel typically decompresses the image starting at address
> 0x1000000 (note: there's one more zero there than the compressed image
> above). This usually is fine for most kernels.
>
> However, if the compressed image is actually quite large, then
> setup_data will live at a `0x100000 + compressed_size` that extends into
> the decompressed zone at 0x1000000. In other words, if compressed_size
> is larger than `0x1000000 - 0x100000`, then the decompression step will
> clobber setup_data, resulting in crashes.
>
> Visually, what happens now is that QEMU appends setup_data to the kernel
> image:
>
> kernel image setup_data
> |--------------------------||----------------|
> 0x100000 0x100000+l1 0x100000+l1+l2
>
> The problem is that this decompresses to 0x1000000 (one more zero). So
> if l1 is > (0x1000000-0x100000), then this winds up looking like:
>
> kernel image setup_data
> |--------------------------||----------------|
> 0x100000 0x100000+l1 0x100000+l1+l2
>
> d e c o m p r e s s e d k e r n e l
> |-------------------------------------------------------------|
> 0x1000000 0x1000000+l3
>
> The decompressed kernel seemingly overwriting the compressed kernel
> image isn't a problem, because that gets relocated to a higher address
> early on in the boot process, at the end of startup_64. setup_data,
> however, stays in the same place, since those links are self referential
> and nothing fixes them up. So the decompressed kernel clobbers it.
>
> Fix this by appending setup_data to the cmdline blob rather than the
> kernel image blob, which remains at a lower address that won't get
> clobbered.
>
> This could have been done by overwriting the initrd blob instead, but
> that poses big difficulties, such as no longer being able to use memory
> mapped files for initrd, hurting performance, and, more importantly, the
> initrd address calculation is hard coded in qboot, and it always grows
> down rather than up, which means lots of brittle semantics would have to
> be changed around, incurring more complexity. In contrast, using cmdline
> is simple and doesn't interfere with anything.
>
> The microvm machine has a gross hack where it fiddles with fw_cfg data
> after the fact. So this hack is updated to account for this appending,
> by reserving some bytes.
>
> Cc: x86@xxxxxxxxxx
> Cc: Philippe Mathieu-Daudé <philmd@xxxxxxxxxx>
> Cc: H. Peter Anvin <hpa@xxxxxxxxx>
> Cc: Borislav Petkov <bp@xxxxxxxxx>
> Cc: Eric Biggers <ebiggers@xxxxxxxxxx>
> Signed-off-by: Jason A. Donenfeld <Jason@xxxxxxxxx>

I apologize if this has already been reported/fixed already (I did a
brief search on lore.kernel.org) or if my terminology is not as precise
as it could be, I am a little out of my element here :)

After this change as commit eac7a7791b ("x86: don't let decompressed
kernel image clobber setup_data") in QEMU master, I am no longer able to
boot a kernel directly through OVMF using '-append' + '-initrd' +
'-kernel'. Instead, systemd-boot tries to start the distribution's
kernel, which fails with:

Error registering initrd: Already started

This can be reproduced with just a defconfig Linux kernel (I used
6.2-rc7), the simple buildroot images that ClangBuiltLinux uses for
boot testing [1], and OVMF. Prior to this change, the kernel would start
up but after, I am dumped into the UEFI shell (as there is no
bootloader).

The QEMU command I used was:

$ qemu-system-x86_64 \
-kernel arch/x86_64/boot/bzImage \
-append "console=ttyS0 earlycon=uart8250,io,0x3f8" \
-drive if=pflash,format=raw,file=/usr/share/edk2/x64/OVMF_CODE.fd,readonly=on
-drive if=pflash,format=raw,file=../boot-utils/images/x86_64/OVMF_VARS.fd \
-object rng-random,filename=/dev/urandom,id=rng0 \
-device virtio-rng-pci \
-cpu host \
-enable-kvm \
-smp 8 \
-display none \
-initrd .../boot-utils/images/x86_64/rootfs.cpio \
-m 512m \
-nodefaults \
-no-reboot \
-serial mon:stdio

Not sure if the OVMF version will matter but just in case:

$ pacman -Q edk2-ovmf
edk2-ovmf 202211-3

I did see a few fixes on qemu-devel for this patch such as [2] but none
I found fixed the issue for me.

If there is any additional information I can provide or patches I can
test, please let me know.

[1]: https://github.com/ClangBuiltLinux/boot-utils/blob/fec03ef13519e26ac1f226e404e1620a142d0e40/images/x86_64/rootfs.cpio.zst
[2]: https://lore.kernel.org/20230207224847.94429-1-Jason@xxxxxxxxx/

Cheers,
Nathan

# bad: [969d09c3a6186c0a4bc8a41db0c1aba1c76081fc] Merge tag 'pull-aspeed-20230207' of https://github.com/legoater/qemu into staging
# good: [b67b00e6b4c7831a3f5bc684bc0df7a9bfd1bd56] Update VERSION for v7.2.0
git bisect start '969d09c3a6186c0a4bc8a41db0c1aba1c76081fc' 'b67b00e6b4c7831a3f5bc684bc0df7a9bfd1bd56'
# good: [38cb336fe9d47507da5177c3d349532d0af6885e] hw/arm/gumstix: Use the IEC binary prefix definitions
git bisect good 38cb336fe9d47507da5177c3d349532d0af6885e
# bad: [864a3fa439276148b6d7abcf2d43ee8dbe4c4062] monitor: Rename misc.c to hmp-target.c
git bisect bad 864a3fa439276148b6d7abcf2d43ee8dbe4c4062
# good: [e471a8c9850f1af0c1bc5768ca28285348cdd6c5] target/riscv: Trap on writes to stimecmp from VS when hvictl.VTI=1
git bisect good e471a8c9850f1af0c1bc5768ca28285348cdd6c5
# bad: [c1a9ac9bdeea213162a76f7b9e2f876c89a50a94] tests: acpi: cleanup use_uefi argument usage
git bisect bad c1a9ac9bdeea213162a76f7b9e2f876c89a50a94
# good: [dc575b5e0300a7a375b4e4501a17ada21e9a6d10] hw/i2c/bitbang_i2c: Change state calling bitbang_i2c_set_state() helper
git bisect good dc575b5e0300a7a375b4e4501a17ada21e9a6d10
# good: [3b07a936d3bfe97b07ddffcfbb532985a88033dd] target/arm: Look up ARMCPRegInfo at runtime
git bisect good 3b07a936d3bfe97b07ddffcfbb532985a88033dd
# good: [d395b18dce82855d03d934e30a515caf5a10a885] hw/acpi/acpi_dev_interface: Remove unused parameter from AcpiDeviceIfClass::madt_cpu
git bisect good d395b18dce82855d03d934e30a515caf5a10a885
# bad: [eac7a7791bb6d719233deed750034042318ffd56] x86: don't let decompressed kernel image clobber setup_data
git bisect bad eac7a7791bb6d719233deed750034042318ffd56
# good: [8a8c9c3a747f77e664fa2288735b45a9d750be75] hw/pci-host: Use register definitions from PCI standard
git bisect good 8a8c9c3a747f77e664fa2288735b45a9d750be75
# good: [8a7c606016d283a1716290c657f6f45bc7c4d817] intel-iommu: Document iova_tree
git bisect good 8a7c606016d283a1716290c657f6f45bc7c4d817
# first bad commit: [eac7a7791bb6d719233deed750034042318ffd56] x86: don't let decompressed kernel image clobber setup_data