Re: [PATCH qemu v3] x86: don't let decompressed kernel image clobber setup_data

From: Jason A. Donenfeld
Date: Wed Feb 08 2023 - 12:55:18 EST


Hi Nathan (and MST),

On Wed, Feb 8, 2023 at 2:45 PM Nathan Chancellor <nathan@xxxxxxxxxx> wrote:
>
> Hi Jason,
>
> On Fri, Dec 30, 2022 at 11:07:25PM +0100, Jason A. Donenfeld wrote:
> > The setup_data links are appended to the compressed kernel image. Since
> > the kernel image is typically loaded at 0x100000, setup_data lives at
> > `0x100000 + compressed_size`, which does not get relocated during the
> > kernel's boot process.
> >
> > The kernel typically decompresses the image starting at address
> > 0x1000000 (note: there's one more zero there than the compressed image
> > above). This usually is fine for most kernels.
> >
> > However, if the compressed image is actually quite large, then
> > setup_data will live at a `0x100000 + compressed_size` that extends into
> > the decompressed zone at 0x1000000. In other words, if compressed_size
> > is larger than `0x1000000 - 0x100000`, then the decompression step will
> > clobber setup_data, resulting in crashes.
> >
> > Visually, what happens now is that QEMU appends setup_data to the kernel
> > image:
> >
> > kernel image setup_data
> > |--------------------------||----------------|
> > 0x100000 0x100000+l1 0x100000+l1+l2
> >
> > The problem is that this decompresses to 0x1000000 (one more zero). So
> > if l1 is > (0x1000000-0x100000), then this winds up looking like:
> >
> > kernel image setup_data
> > |--------------------------||----------------|
> > 0x100000 0x100000+l1 0x100000+l1+l2
> >
> > d e c o m p r e s s e d k e r n e l
> > |-------------------------------------------------------------|
> > 0x1000000 0x1000000+l3
> >
> > The decompressed kernel seemingly overwriting the compressed kernel
> > image isn't a problem, because that gets relocated to a higher address
> > early on in the boot process, at the end of startup_64. setup_data,
> > however, stays in the same place, since those links are self referential
> > and nothing fixes them up. So the decompressed kernel clobbers it.
> >
> > Fix this by appending setup_data to the cmdline blob rather than the
> > kernel image blob, which remains at a lower address that won't get
> > clobbered.
> >
> > This could have been done by overwriting the initrd blob instead, but
> > that poses big difficulties, such as no longer being able to use memory
> > mapped files for initrd, hurting performance, and, more importantly, the
> > initrd address calculation is hard coded in qboot, and it always grows
> > down rather than up, which means lots of brittle semantics would have to
> > be changed around, incurring more complexity. In contrast, using cmdline
> > is simple and doesn't interfere with anything.
> >
> > The microvm machine has a gross hack where it fiddles with fw_cfg data
> > after the fact. So this hack is updated to account for this appending,
> > by reserving some bytes.
> >
> > Cc: x86@xxxxxxxxxx
> > Cc: Philippe Mathieu-Daudé <philmd@xxxxxxxxxx>
> > Cc: H. Peter Anvin <hpa@xxxxxxxxx>
> > Cc: Borislav Petkov <bp@xxxxxxxxx>
> > Cc: Eric Biggers <ebiggers@xxxxxxxxxx>
> > Signed-off-by: Jason A. Donenfeld <Jason@xxxxxxxxx>
>
> I apologize if this has already been reported/fixed already (I did a
> brief search on lore.kernel.org) or if my terminology is not as precise
> as it could be, I am a little out of my element here :)
>
> After this change as commit eac7a7791b ("x86: don't let decompressed
> kernel image clobber setup_data") in QEMU master, I am no longer able to
> boot a kernel directly through OVMF using '-append' + '-initrd' +
> '-kernel'. Instead, systemd-boot tries to start the distribution's
> kernel, which fails with:
>
> Error registering initrd: Already started
>
> This can be reproduced with just a defconfig Linux kernel (I used
> 6.2-rc7), the simple buildroot images that ClangBuiltLinux uses for
> boot testing [1], and OVMF. Prior to this change, the kernel would start
> up but after, I am dumped into the UEFI shell (as there is no
> bootloader).
>
> The QEMU command I used was:
>
> $ qemu-system-x86_64 \
> -kernel arch/x86_64/boot/bzImage \
> -append "console=ttyS0 earlycon=uart8250,io,0x3f8" \
> -drive if=pflash,format=raw,file=/usr/share/edk2/x64/OVMF_CODE.fd,readonly=on
> -drive if=pflash,format=raw,file=../boot-utils/images/x86_64/OVMF_VARS.fd \

Oh no... Without jumping into it, at first glance, I have absolutely
no idea. I suppose I could start debugging it and probably come up
with a solution, but...

@mst - I'm beginning to think that this whole setup_data route is
cursed. This is accumulating hacks within hacks within hacks. What
would you think if I just send a patch *removing* all use of
setup_data (the rng seed and the dtb thing), and then we can gradually
add that back with an actual overarching design. For example, it'd
probably make sense to have a separate fwcfg file for setup_data
rather than trying to mangle and existing one, etc. This way, we
unbreak the tree, and let the new approach be reviewed more
reasonably.

Jason