Re: [PATCH v4 6/6] KVM: VMX: Make CR0.WP a guest owned bit

From: Sean Christopherson
Date: Thu Mar 30 2023 - 13:12:41 EST


On Thu, Mar 30, 2023, Mathias Krause wrote:
> On 22.03.23 02:37, Mathias Krause wrote:
> > Guests like grsecurity that make heavy use of CR0.WP to implement kernel
> > level W^X will suffer from the implied VMEXITs.
> >
> > With EPT there is no need to intercept a guest change of CR0.WP, so
> > simply make it a guest owned bit if we can do so.
> >
> > This implies that a read of a guest's CR0.WP bit might need a VMREAD.
> > However, the only potentially affected user seems to be kvm_init_mmu()
> > which is a heavy operation to begin with. But also most callers already
> > cache the full value of CR0 anyway, so no additional VMREAD is needed.
> > The only exception is nested_vmx_load_cr3().
> >
> > This change is VMX-specific, as SVM has no such fine grained control
> > register intercept control.
>
> Just a heads up! We did more tests, especially with the backports we did
> internally already, and ran into a bug when running a nested guest on an
> ESXi host.
>
> Setup is like: ESXi (L0) -> Linux (L1) -> Linux (L2)
>
> The Linux system, especially the kernel, is the same for L1 and L2. It's
> a grsecurity kernel, so makes use of toggling CR0.WP at runtime.
>
> The bug we see is that when L2 disables CR0.WP and tries to write to an
> r/o memory region (implicitly to the r/o GDT via LTR in our use case),
> this triggers a fault (EPT violation?) that gets ignored by L1, as,
> apparently, everything is fine from its point of view.

It's not an EPT violation if there's a triple fault. Given that you're dumping
the VMCS from handle_triple_fault(), I assume that L2 gets an unexpected #PF that
leads to a triple fault in L2.

Just to make sure we're on the same page: L1 is still alive after this, correct?

> I suspect the L0 VMM to be at fault here, as the VMCS structures look
> good, IMO. Here is a dump of vmx->loaded_vmcs in handle_triple_fault():

...

> The "host" (which is our L1 VMM, I guess) has CR0.WP enabled and that is
> what I think confuses ESXi to enforce the read-only property to the L2
> guest as well -- for unknown reasons so far.

...

> I tried to reproduce the bug with different KVM based L0 VMMs (with and
> without this series; vanilla and grsecurity kernels) but no luck. That's
> why I'm suspecting a ESXi bug.

...

> I'm leaning to make CR0.WP guest owned only iff we're running on bare
> metal or the VMM is KVM to not play whack-a-mole for all the VMMs that
> might have similar bugs. (Will try to test a few others here as well.)
> However, that would prevent them from getting the performance gain, so
> I'd rather have this fixed / worked around in KVM instead.

Before we start putting bandaids on this, let's (a) confirm this appears to be
an issue with ESXi and (b) pull in VMware folks to get their input.

> Any ideas how to investigate this further?

Does the host in question support UMIP?

Can you capture a tracepoint log from L1 to verify that L1 KVM is _not_ injecting
any kind of exception? E.g. to get the KVM kitchen sink:

echo 1 > /sys/kernel/debug/tracing/tracing_on
echo 1 > /sys/kernel/debug/tracing/events/kvm/enable

cat /sys/kernel/debug/tracing/trace > log

Or if that's too noisy, a more targeted trace (exception injection + emulation)
woud be:

echo 1 > /sys/kernel/debug/tracing/tracing_on

echo 1 > /sys/kernel/debug/tracing/events/kvm/kvm_emulate_insn/enable
echo 1 > /sys/kernel/debug/tracing/events/kvm/kvm_inj_exception/enable
echo 1 > /sys/kernel/debug/tracing/events/kvm/kvm_entry/enable
echo 1 > /sys/kernel/debug/tracing/events/kvm/kvm_exit/enable

> PS: ...should have left the chicken bit of v3 to be able to disable the
> feature by a module parameter ;)

A chicken bit isn't a good solution for this sort of thing. Toggling a KVM module
param requires (a) knowing that it exists and (b) knowing the conditions under which
it is/isn't safe to toggle the bit.

E.g. if this ends up being an ESXi L0 bug, then an option might be to add something
in vmware_platform_setup() to communicate the bug to KVM so that KVM can precisely
disable the optimization on affected platforms.