Re: VMs freezing when host is running 4.14

From: Radim Krčmář
Date: Thu Nov 23 2017 - 10:59:54 EST


2017-11-23 16:20+0100, Marc Haber:
> On Wed, Nov 22, 2017 at 05:43:13PM +0100, Radim Krčmář wrote:
> > 2017-11-22 16:52+0100, Marc Haber:
> > > On Wed, Nov 22, 2017 at 04:04:42PM +0100, 王金浦 wrote:
> > > > So all guest kernels are 4.14, or also other older kernel?
> > >
> > > Guest kernels are also 4.14, but the issue disappears when the host is
> > > downgraded to an older kernel. I therefore reckoned that the guest
> > > kernel doesn't matter, but that was before I saw the trace in the log.
> >
> > The two most suspicious patches since 4.13 (which I assume works) are
> >
> > 664f8e26b00c ("KVM: X86: Fix loss of exception which has not yet been
> > injected")
>
> That one does not revert cleanly; the line in question seems to have
> been removed a bit later.
>
> Reject is:
> 141 [24/5001]mh@fan:~/linux/git/linux ((v4.14.1) %) $ cat arch/x86/kvm/vmx.c.rej
> --- arch/x86/kvm/vmx.c
> +++ arch/x86/kvm/vmx.c
> @@ -2516,7 +2516,7 @@ static void vmx_queue_exception(struct kvm_vcpu *vcpu)
>  	struct vcpu_vmx *vmx = to_vmx(vcpu);
>  	unsigned nr = vcpu->arch.exception.nr;
>  	bool has_error_code = vcpu->arch.exception.has_error_code;
> -	bool reinject = vcpu->arch.exception.injected;
> +	bool reinject = vcpu->arch.exception.reinject;
>  	u32 error_code = vcpu->arch.exception.error_code;
>  	u32 intr_info = nr | INTR_INFO_VALID_MASK;

This line can be deleted, as reinject isn't used in the function.

Btw. there have already been many fixes from Liran Alon for that patch,
and your case could be the one addressed in
https://www.spinics.net/lists/kvm/msg159158.html

That patch is incorrect, but you might only see its benefits.

> > and
> >
> > 9a6e7c39810e ("KVM: async_pf: Fix #DF due to inject "Page not Present"
> > and "Page Ready" exceptions simultaneously")
> >
> > please try reverting them to see if it helps,
>
> That one reverted cleanly. I am now running the new kernel on the
> affected machine, and I think that a second machine has joined the
> market of being affected.

That one was much less likely to be the culprit.

> Would this matter on the host only or on the guests as well?

Only on the host.

Thanks.