Intermittent inability to type in graphical Plymouth on UEFI VMs since kernel 6.9

From: Adam Williamson
Date: Fri May 24 2024 - 12:08:40 EST


Hi, folks. Please CC me on replies, I'm not subscribed to the list. The downstream bug report for this is https://bugzilla.redhat.com/show_bug.cgi?id=2274770 .

I maintain Fedora's openQA instance - https://openqa.fedoraproject.org/ (openQA is an automated testing system which runs jobs on qemu VMs, inputting keyboard and mouse events via VNC, and monitoring results via screenshots and the serial console).

We have several tests that involve doing an install of Fedora with root storage encrypted and then booting it. Some of these install enough packages for us to hit the 'graphical' mode of plymouth (the bootsplash manager thingy), so we see a graphical passphrase prompt like https://openqa.fedoraproject.org/tests/2642868#step/_graphical_wait_login/3 ; some are minimal installs, so we see a text prompt like https://openqa.fedoraproject.org/tests/2642845#step/disk_guided_encrypted_postinstall/1 .

Recently I switched up our configuration so most of these tests run on UEFI VMs (previously they mostly ran on BIOS VMs). When I did that, the tests that hit the graphical prompt started failing frequently on Fedora Rawhide. The tests that hit the text prompt do not seem to be affected.

At first I figured this was caused by a plymouth change, but some testing indicates it's actually related to kernel version: it seems to have been introduced in kernel 6.9. Fedora 40 uses kernel 6.8, so tests on F40 are not usually affected by this, but I engineered some runs of an affected test on an F40 install with kernel 6.9, and they hit the bug.

So to summarize, we hit the bug when all the following conditions are met:

* Running on a UEFI qemu-kvm VM
* Graphical passphrase prompt encountered on boot
* Running kernel 6.9
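
For anyone trying to tell whether a given VM matches the first and third conditions, here is a minimal Python sketch. It is not part of the openQA tests. The helper names are hypothetical, the check for /sys/firmware/efi is the usual way to detect a UEFI boot on Linux, and treating every release from 6.9 onwards as potentially affected is an assumption (only 6.9 has been observed so far). It does not try to detect whether plymouth is in graphical mode.

```python
import os
import platform
import re


def parse_kernel_version(release):
    """Return (major, minor) from a kernel release string such as
    '6.9.0-0.rc0.20240322git8e938e398669.14.fc41'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def is_uefi_boot(efi_dir="/sys/firmware/efi"):
    """True if the running system was booted via UEFI (Linux only)."""
    return os.path.isdir(efi_dir)


def kernel_may_be_affected(release):
    """Assumption: anything at or after 6.9 might show the bug."""
    return parse_kernel_version(release) >= (6, 9)


if __name__ == "__main__":
    release = platform.release()
    print("kernel release:", release)
    print("UEFI boot:", is_uefi_boot())
    print("kernel 6.9 or later:", kernel_may_be_affected(release))
```

Run inside the guest, this prints the kernel release and the two booleans; a system that reports True for both and boots to a graphical passphrase prompt matches all three conditions.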

When it sees the passphrase prompt, the test system types the correct password. When the bug happens, this input seems to simply be ignored: plymouth does not echo dots back to the screen to represent the typed characters, and on hitting Enter the system does not attempt to proceed with decryption. (Unfortunately this also means we don't get any logs from the failure, as the test system needs a booted system to be able to upload any logs.)

Looking at results from the last month and a half, the bug happens in about 30% of the affected test runs.

I have reproduced this manually in a similar VM, but have not yet managed to reproduce it on hardware (which is unfortunate, as it would make it somewhat easier to attempt some kind of bisect).
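
Because the failure is intermittent, any bisect would need several clean boots before a kernel could be called good. As a rough sketch only: assuming each boot fails independently with the same probability (an assumption, not something that has been measured), the number of consecutive clean boots needed can be estimated as below. The function name and the confidence thresholds are illustrative.

```python
import math


def clean_boots_needed(failure_rate, max_false_good):
    """Smallest n such that a genuinely bad kernel passes n boots in a
    row with probability at most max_false_good.

    Assumes independent boots with a constant failure_rate.
    """
    if not 0 < failure_rate < 1:
        raise ValueError("failure_rate must be between 0 and 1")
    if not 0 < max_false_good < 1:
        raise ValueError("max_false_good must be between 0 and 1")
    # A bad kernel passes n boots with probability (1 - failure_rate) ** n.
    return math.ceil(math.log(max_false_good) / math.log(1 - failure_rate))


if __name__ == "__main__":
    for threshold in (0.05, 0.01):
        boots = clean_boots_needed(0.3, threshold)
        print(f"false-good chance <= {threshold}: {boots} clean boots")
```

With a failure rate around 30%, this suggests on the order of ten clean boots per bisect step, which is feasible in a VM but slow.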

The earliest build I can say for sure the bug happened with is kernel-6.9.0-0.rc0.20240322git8e938e398669.14.fc41 .
--
Adam Williamson (he/him/his)
Fedora QA
Fedora Chat: @adamwill:fedora.im | Mastodon: @adamw@xxxxxxxxxxxxx
https://www.happyassassin.net