Re: [PATCH AUTOSEL 5.16 07/28] x86/kvm/fpu: Limit guest user_xfeatures to supported bits of XCR0

From: Peter Xu
Date: Tue Jun 07 2022 - 20:59:43 EST


On Tue, Jun 07, 2022 at 06:47:41PM +0000, Sean Christopherson wrote:
> On Tue, Jun 07, 2022, Peter Xu wrote:
> > On Tue, Jun 07, 2022 at 03:04:27PM +0000, Sean Christopherson wrote:
> > > On Tue, Jun 07, 2022, Paolo Bonzini wrote:
> > > > On 6/6/22 23:27, Peter Xu wrote:
> > > > > On Mon, Jun 06, 2022 at 06:18:12PM +0200, Paolo Bonzini wrote:
> > > > > > > However there seems to be something missing at least to me, on why it'll
> > > > > > > fail a migration from 5.15 (without this patch) to 5.18 (with this patch).
> > > > > > > In my test case, user_xfeatures will be 0x7 (FP|SSE|YMM) if without this
> > > > > > > patch, but 0x0 if with it.
> > > > > >
> > > > > > What CPU model are you using for the VM?
> > > > >
> > > > > I didn't specify it, assuming it's qemu64 with no extra parameters.
> > > >
> > > > Ok, so indeed it lacks AVX and this patch can have an effect.
> > > >
> > > > > > For example, if the source lacks this patch but the destination has it,
> > > > > > the source will transmit YMM registers, but the destination will fail to
> > > > > > set them if they are not available for the selected CPU model.
> > > > > >
> > > > > > See the commit message: "As a bonus, it will also fail if userspace tries to
> > > > > > set fpu features (with the KVM_SET_XSAVE ioctl) that are not compatible to
> > > > > > the guest configuration. Such features will never be returned by
> > > > > > KVM_GET_XSAVE or KVM_GET_XSAVE2."
> > > > >
> > > > > IIUC you meant we should have failed KVM_SET_XSAVE when they're not aligned
> > > > > (probably by failing validate_user_xstate_header when checking against the
> > > > > user_xfeatures on dest host). But that's probably not my case, because here
> > > > > KVM_SET_XSAVE succeeded, it's just that the guest gets a double fault after
> > > > > the precopy migration completes (or for postcopy when the switchover is
> > > > > done).
> > > >
> > > > Difficult to say what's happening without seeing at least the guest code
> > > > around the double fault (above you said "fail a migration" and I thought
> > > > that was a different scenario than the double fault), and possibly which was
> > > > the first exception that contributed to the double fault.
> > >
> > > Regardless of why the guest explodes in the way it does, is someone planning on
> > > bisecting this (if necessary?) and sending a backport to v5.15? There's another
> > > bug report that is more than likely hitting the same bug.
> >
> > What's the bisection you mentioned? I actually did a bisection and I also
> > checked reverting Leo's change can also fix this issue. Or do you mean
> > something else?
>
> Oooooh, sorry! I got completely turned around. You ran into a bug with the
> fix. I thought that you were hitting the same issues as Mike where migrating
> between hosts with different capabilities is broken in v5.15, but works in v5.18.

Aha, no worry.

>
> > > https://lore.kernel.org/all/48353e0d-e771-8a97-21d4-c65ff3bc4192@xxxxxxxxxx
> >
> > That is kvm64, and I agree it could be the same problem since both qemu64
> > and kvm64 models do not have any xsave feature bit declared in cpuid 0xd,
> > so potentially we could be migrating some fpu states to it even with
> > user_xfeatures==0 on dest host.
> >
> > So today I continued the investigation, and I think what's really missing
> > is qemu seems to be ignoring the user_xfeatures check for KVM_SET_XSAVE and
> > continues even if it returns -EINVAL. IOW, I'm wondering whether we should
> > fail properly and start to check kvm_arch_put_registers() retcode. But
> > that'll be a QEMU fix, and it'll at least not causing random faults
> > (e.g. double faults) in guest but we should fail the migration gracefully.
> >
> > Sean: a side note is that I can also easily trigger one WARN_ON_ONCE() in
> > your commit 98c25ead5eda5 in kvm_arch_vcpu_ioctl_run():
> >
> > WARN_ON_ONCE(kvm_lapic_hv_timer_in_use(vcpu));
> >
> > It'll be great if you'd like to check that up.
>
> Ugh, userspace can force KVM_MP_STATE_UNINITIALIZED via KVM_SET_MP_STATE. Looks
> like QEMU does that when emulating RESET.
>
> Logically, a full RESET of the xAPIC seems like the right thing to do. I think
> we can get away with that without breaking ABI? And kvm_lapic_reset() has a
> related bug where it stops the HR timer but not doesn't handle the HV timer :-/
>
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index e69b83708f05..948aba894245 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -2395,7 +2395,7 @@ void kvm_lapic_reset(struct kvm_vcpu *vcpu, bool init_event)
> return;
>
> /* Stop the timer in case it's a reset to an active apic */
> - hrtimer_cancel(&apic->lapic_timer.timer);
> + cancel_apic_timer(&apic->lapic_timer.timer);

Needs to be:

+ cancel_apic_timer(apic);

>
> /* The xAPIC ID is set at RESET even if the APIC was already enabled. */
> if (!init_event)
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 540651cd28d7..ed2c7cb1642d 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -10912,6 +10912,9 @@ int kvm_arch_vcpu_ioctl_set_mpstate(struct kvm_vcpu *vcpu,
> mp_state->mp_state == KVM_MP_STATE_INIT_RECEIVED))
> goto out;
>
> + if (mp_state->mp_state == KVM_MP_STATE_UNINITIALIZED)
> + kvm_lapic_reset(vcpu, false);
> +
> if (mp_state->mp_state == KVM_MP_STATE_SIPI_RECEIVED) {
> vcpu->arch.mp_state = KVM_MP_STATE_INIT_RECEIVED;
> set_bit(KVM_APIC_SIPI, &vcpu->arch.apic->pending_events);
>

The change looks reasonable, but sadly I did a quick run and it still
triggers.. :-/ So there seems to be something else missing.

--
Peter Xu