Re: [PATCH v7 08/11] x86, kvm/x86.c: support vcpu preempted check

From: Andrea Arcangeli
Date: Mon Dec 19 2016 - 06:42:51 EST


Hello,

On Wed, Nov 02, 2016 at 05:08:35AM -0400, Pan Xinhui wrote:
> Support the vcpu_is_preempted() functionality under KVM. This will
> enhance lock performance on overcommitted hosts (more runnable vcpus
> than physical cpus in the system) as doing busy waits for preempted
> vcpus will hurt system performance far worse than early yielding.
>
> Use one field of struct kvm_steal_time, ::preempted, to indicate
> whether a vcpu is running or not.
>
> Signed-off-by: Pan Xinhui <xinhui.pan@xxxxxxxxxxxxxxxxxx>
> Acked-by: Paolo Bonzini <pbonzini@xxxxxxxxxx>
> ---
> arch/x86/include/uapi/asm/kvm_para.h | 4 +++-
> arch/x86/kvm/x86.c | 16 ++++++++++++++++
> 2 files changed, 19 insertions(+), 1 deletion(-)
>
[..]
> +static void kvm_steal_time_set_preempted(struct kvm_vcpu *vcpu)
> +{
> + if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED))
> + return;
> +
> + vcpu->arch.st.steal.preempted = 1;
> +
> + kvm_write_guest_offset_cached(vcpu->kvm, &vcpu->arch.st.stime,
> + &vcpu->arch.st.steal.preempted,
> + offsetof(struct kvm_steal_time, preempted),
> + sizeof(vcpu->arch.st.steal.preempted));
> +}
> +
> void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
> {
> + kvm_steal_time_set_preempted(vcpu);
> kvm_x86_ops->vcpu_put(vcpu);
> kvm_put_guest_fpu(vcpu);
> vcpu->arch.last_host_tsc = rdtsc();

You can't call kvm_steal_time_set_preempted() in atomic context
(neither in the sched_out notifier nor in vcpu_put() after
preempt_disable()). __copy_to_user in kvm_write_guest_offset_cached
can sleep, and scheduling in atomic context locks up the host.
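
For reference, vcpu_put() in virt/kvm/kvm_main.c looks roughly like
this (quoting from memory, details may differ), which is why
kvm_arch_vcpu_put() always runs with preemption disabled:

    void vcpu_put(struct kvm_vcpu *vcpu)
    {
    	preempt_disable();
    	kvm_arch_vcpu_put(vcpu);	/* -> kvm_steal_time_set_preempted() */
    	preempt_notifier_unregister(&vcpu->preempt_notifier);
    	preempt_enable();
    	mutex_unlock(&vcpu->mutex);
    }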

kvm->srcu (or kvm->slots_lock) is also not taken, and
kvm_write_guest_offset_cached needs to call kvm_memslots(), which
requires one of them to be held.
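
Just to illustrate the locking rule: anything that ends up in
kvm_memslots() is expected to run inside an srcu read side along these
lines (sketch only, and it would not help with the sleep-in-atomic
problem above):

    	int idx;

    	idx = srcu_read_lock(&vcpu->kvm->srcu);
    	/* kvm_memslots(vcpu->kvm) may be dereferenced in here */
    	kvm_steal_time_set_preempted(vcpu);
    	srcu_read_unlock(&vcpu->kvm->srcu, idx);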

I think this is why postcopy live migration locks up with current
upstream; it doesn't seem related to userfaultfd at all (initially I
suspected the vmf conversion, but it wasn't that), and in theory it
can happen with heavy swapping or page migration too.

It's just that the page is written so frequently it's unlikely to be
swapped out. The page being written so frequently also means it's very
likely found re-dirtied when postcopy starts, and that pretty much
guarantees a userfault will trigger a scheduling event in
kvm_steal_time_set_preempted on the destination. So the probabilities
of reproducing this are opposite with swapping vs postcopy live
migration.

For now I applied the below two patches, but this will just skip the
write and only prevent the host instability, as nobody checks the
retval of __copy_to_user (what happens to the guest after the write is
skipped is not as clear and should be investigated, but at least the
host will survive, and not all guests will care about this flag being
updated). For this to be fully safe the preempted information should
be just a hint and not fundamental for the correctness of the guest
pv spinlock code.
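
To show why it can only ever be a hint: the guest side of the series
reads the flag along these lines (sketch from memory, the exact code
in the series may differ slightly), so a skipped update just means the
guest sees a possibly stale ->preempted value and spins or yields
suboptimally:

    static bool kvm_vcpu_is_preempted(int cpu)
    {
    	/* steal_time is the per-cpu struct kvm_steal_time area */
    	struct kvm_steal_time *src = &per_cpu(steal_time, cpu);

    	return !!src->preempted;
    }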

This bug was introduced in commit
0b9f6c4615c993d2b552e0d2bd1ade49b56e5beb in v4.9-rc7.