Re: [PATCH] KVM: x86: enable TDP MMU by default

From: Stoiko Ivanov
Date: Wed Jul 27 2022 - 09:31:33 EST


On Wed, 27 Jul 2022 13:22:48 +0300
Maxim Levitsky <mlevitsk@xxxxxxxxxx> wrote:

> On Tue, 2022-07-26 at 17:43 +0200, Paolo Bonzini wrote:
> > On 7/26/22 16:57, Stoiko Ivanov wrote:
> > > Hi,
> > >
> > > Proxmox[0] recently switched to the 5.15 kernel series (based on the one
> > > for Ubuntu 22.04), which includes this commit.
> > > While it's working well on most installations, we have a few users who
> > > reported that some of their guests shutdown with
> > > `KVM: entry failed, hardware error 0x80000021` being logged under certain
> > > conditions and environments[1]:
> > > * The issue is not deterministically reproducible, and only happens
> > > eventually with certain loads (e.g. we have only one system in our
> > > office which exhibits the issue - and this only by repeatedly installing
> > > Windows 2k22 ~ one out of 10 installs will cause the guest-crash)
> > > * While most reports are referring to (newer) Windows guests, some users
> > > run into the issue with Linux VMs as well
> > > * The affected systems are from a quite wide range - our affected machine
> > > is an old IvyBridge Xeon with outdated BIOS (an equivalent system with
> > > the latest available BIOS is not affected), but we have
> > > reports of all kind of Intel CPUs (up to an i5-12400). It seems AMD CPUs
> > > are not affected.
> > >
> > > Disabling tdp_mmu seems to mitigate the issue, but I still thought you
> > > might want to know that in some cases tdp_mmu causes problems, or that you
> > > even might have an idea of how to fix the issue without explicitly
> > > disabling tdp_mmu?
> >
> > If you don't need secure boot, you can try disabling SMM. It should not
> > be related to TDP MMU, but the logs (thanks!) point at an SMM entry (RIP
> > = 0x8000, CS base=0x7ffc2000).
>
> No doubt about it. It is the issue.
>
> >
> > This is likely to be fixed by
> > https://lore.kernel.org/kvm/20220621150902.46126-1-mlevitsk@xxxxxxxxxx/.

Thanks to both of you for the quick feedback and the patches!

We ran our reproducer with the patch-series above applied on top of
5.19-rc8 from
git://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/kinetic
* without the patches, the issue occurred within 20 minutes,
* with the patches applied, the issue did not occur for 3 hours (without
them it usually occurs within 1-2 hours at most)

So, FWIW, the series seems to fix the issue on our setup.
We'll do some more internal testing and then make the fix available
(backported to our 5.15 kernel) to our affected users.

Kind regards,
stoiko