Re: [REGRESSION][BISECTED] from bd9bbc96e835: cannot boot Win11 KVM guest
From: Juri Lelli
Date: Mon Dec 16 2024 - 10:23:48 EST

On 14/12/24 19:52, Peter Zijlstra wrote:
> On Sat, Dec 14, 2024 at 06:32:57AM +0000, Ranguvar wrote:
> > Hello, all,
> >
> > Any assistance with proper format and process is appreciated as I am new to these lists.
> > After the commit bd9bbc96e835 "sched: Rework dl_server" I am no longer able to boot my Windows 11 23H2 guest using pinned/exclusive CPU cores and passing a PCIe graphics card.
> > This setup worked for me since at least 5.10, likely earlier, with minimal changes.
> >
> > Most or all cores assigned to the guest VM report 100% usage, and many tasks on the host hang indefinitely (10min+) until the guest is forcibly stopped.
> > This happens only once the Windows kernel begins loading - its spinner appears and freezes.
> >
> > Still broken on 6.13-rc2, as well as 6.12.4 from Arch's repository.
> > When testing these, the failure is similar, but tasks on the host are slow to execute instead of stalling indefinitely, and hung tasks are not reported in dmesg. Only one guest core may show 100% utilization instead of many or all of them. This seems to be due to a separate regression which also impacts my use case [0].
> > After patching it [1], I then find the same behavior as bd9bbc96e835, with hung tasks on host.
> >
> > git bisect log: [2]
> > dmesg from 6.11.0-rc1-1-git-00057-gbd9bbc96e835, with decoded hung task backtraces: [3]
> > dmesg from arch 6.12.4: [4]
> > dmesg from arch 6.12.4 patched for svm.c regression, has hung tasks, backtraces could not be decoded: [5]
> > config for 6.11.0-rc1-1-git-00057-gbd9bbc96e835: [6]
> > config for arch 6.12.4: [7]
> >
> > If it helps, my host uses an AMD Ryzen 5950X CPU with latest UEFI and AMD WX 5100 (Polaris, GCN 4.0) PCIe graphics.
> > I use libvirt 10.10 and qemu 9.1.2, and I am passing three PCIe devices each from dedicated IOMMU groups: NVIDIA RTX 3090 graphics, a Renesas uPD720201 USB controller, and a Samsung 970 EVO NVMe disk.
> >
> > I have in kernel cmdline `iommu=pt isolcpus=1-7,17-23 rcu_nocbs=1-7,17-23 nohz_full=1-7,17-23`.
> > Removing iommu=pt does not produce a change, and dropping the core isolation freezes the host on VM startup.
> > Enabling/disabling kvm_amd.nested or kvm.enable_virt_at_load did not produce a change.
> >
> > Thank you for your attention.
> > - Devin
> >
> > #regzbot introduced: bd9bbc96e8356886971317f57994247ca491dbf1
> >
> > [0]: https://lore.kernel.org/regressions/52914da7-a97b-45ad-86a0-affdf8266c61@xxxxxxxxxxx/
> > [1]: https://lore.kernel.org/regressions/376c445a-9437-4bdd-9b67-e7ce786ae2c4@xxxxxxxxxxx/
> > [2]: https://ranguvar.io/pub/paste/linux-6.12-vm-regression/bisect.log
> > [3]: https://ranguvar.io/pub/paste/linux-6.12-vm-regression/dmesg-6.11.0-rc1-1-git-00057-gbd9bbc96e835-decoded.log
>
> Hmm, this has:
>
> [ 978.035637] sched: DL replenish lagged too much
>
> Juri, have we seen that before?

Not in the context of dl_server. Hmm, it looks like replenishment wasn't
able to catch up with the clock, or something like that (e.g. because
replenishment didn't happen for a long time).
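
To illustrate what "lagged too much" means here, below is a toy userspace
model of the replenishment step. It is a simplified Python sketch loosely
modeled on replenish_dl_entity() in kernel/sched/deadline.c, not the kernel
code itself. The DlEntity class, the replenish() helper and the numbers in
the usage note are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class DlEntity:
    """Hypothetical stand-in for a deadline entity; all times in ns."""
    runtime: int      # remaining budget (zero or negative once depleted)
    deadline: int     # current absolute deadline
    dl_runtime: int   # budget granted per period
    dl_period: int    # period length
    dl_deadline: int  # relative deadline


def replenish(se: DlEntity, now: int) -> bool:
    """Refill the budget of `se`.

    Returns True when the entity lagged so far behind `now` that it had
    to be restarted with a fresh period (the case in which the kernel
    prints "sched: DL replenish lagged too much").
    """
    # Normal path: push the deadline forward one period at a time until
    # the budget is positive again.
    while se.runtime <= 0:
        se.deadline += se.dl_period
        se.runtime += se.dl_runtime

    # If the deadline is still behind the clock after that, replenishment
    # did not keep up: drop the stale state and start a new period.
    if se.deadline < now:
        se.deadline = now + se.dl_deadline
        se.runtime = se.dl_runtime
        return True
    return False
```

With an assumed budget of 50ms every 1s, an entity whose deadline sits at
t=1s and which is replenished at t=1.5s just moves to the next period and
replenish() returns False. If the same entity is only replenished at
t=978s, one period step cannot bring the deadline past the clock, so the
function returns True and the entity restarts from the current time.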