Re: [PATCHv3 2/2] kvm: x86: implement KVM PM-notifier

From: Maxim Levitsky
Date: Mon Jun 07 2021 - 08:54:50 EST


On Sun, 2021-06-06 at 11:10 +0900, Sergey Senozhatsky wrote:
> Implement PM hibernation/suspend prepare notifiers so that KVM
> can reliably set PVCLOCK_GUEST_STOPPED on VCPUs and properly
> suspend VMs.
>
> Signed-off-by: Sergey Senozhatsky <senozhatsky@xxxxxxxxxxxx>
> ---
> arch/x86/kvm/Kconfig | 1 +
> arch/x86/kvm/x86.c | 36 ++++++++++++++++++++++++++++++++++++
> 2 files changed, 37 insertions(+)
>
> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index fb8efb387aff..ac69894eab88 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -43,6 +43,7 @@ config KVM
> select KVM_GENERIC_DIRTYLOG_READ_PROTECT
> select KVM_VFIO
> select SRCU
> + select HAVE_KVM_PM_NOTIFIER if PM
> help
> Support hosting fully virtualized guest machines using hardware
> virtualization extensions. You will need a fairly recent
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index b594275d49b5..af1ab527a0cb 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -58,6 +58,7 @@
> #include <linux/sched/isolation.h>
> #include <linux/mem_encrypt.h>
> #include <linux/entry-kvm.h>
> +#include <linux/suspend.h>
>
> #include <trace/events/kvm.h>
>
> @@ -5615,6 +5616,41 @@ static int kvm_vm_ioctl_set_msr_filter(struct kvm *kvm, void __user *argp)
> return 0;
> }
>
> +#ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
> +static int kvm_arch_suspend_notifier(struct kvm *kvm)
> +{
> + struct kvm_vcpu *vcpu;
> + int i, ret = 0;
> +
> + mutex_lock(&kvm->lock);
> + kvm_for_each_vcpu(i, vcpu, kvm) {
> + if (!vcpu->arch.pv_time_enabled)
> + continue;
> +
> + ret = kvm_set_guest_paused(vcpu);
> + if (ret) {
> + kvm_err("Failed to pause guest VCPU%d: %d\n",
> + vcpu->vcpu_id, ret);
> + break;
> + }
> + }
> + mutex_unlock(&kvm->lock);
> +
> + return ret ? NOTIFY_BAD : NOTIFY_DONE;
> +}
> +
> +int kvm_arch_pm_notifier(struct kvm *kvm, unsigned long state)
> +{
> + switch (state) {
> + case PM_HIBERNATION_PREPARE:
> + case PM_SUSPEND_PREPARE:
> + return kvm_arch_suspend_notifier(kvm);
> + }
> +
> + return NOTIFY_DONE;
> +}
> +#endif /* CONFIG_HAVE_KVM_PM_NOTIFIER */
> +
> long kvm_arch_vm_ioctl(struct file *filp,
> unsigned int ioctl, unsigned long arg)
> {

Overall this looks OK to me.

Do you have a test case in which this patch helps to make the guest behave better after
the host suspend though? I tested this and I don't see any significant change.
(guest works after host suspend before and after, and I still have clocksource
watchdogs firing in the guest)


Also I would like to add my .02 cents on my observations on what happens when I suspend my system
with guests running, which I do once in a while.
I haven't dug deep into it yet as host suspend with VM running wasn't high on my priority list.

First of all after a host suspend/resume cycle (and that is true on all 3 machines I own),
the host TSC is reset to 0 on all CPUs, thus while it is still synchronized, it jumps backward.

Host kernel has no issues coping with this.

Guests however complain about clocksource watchdog and mark the tsc clocksource as unstable,
at least when invtsc is used (regardless of this patch, I wasn't able to notice a difference
with and without it yet).


[ 287.515864] clocksource: timekeeping watchdog on CPU0: Marking clocksource 'tsc' as unstable because the skew is too large:
[ 287.516926] clocksource: 'kvm-clock' wd_now: 4437767926 wd_last: 429c3c42f5 mask: ffffffffffffffff
[ 287.527100] clocksource: 'tsc' cs_now: c33f6ce157 cs_last: c1be2ad19f mask: ffffffffffffffff
[ 287.528493] tsc: Marking TSC unstable due to clocksource watchdog
[ 287.556640] clocksource: Switched to clocksource kvm-clock


This is from Intel system with stable TSC, but I have seen this on my AMD systems as well,
but these have other issues which might affect this (see below).

AFAIK, we have code in kvm_arch_hardware_enable for this exact case but it might not work
correctly or be not enough to deal with this.

Also I notice that this code sets kvm->arch.backwards_tsc_observed = true which
in turn disables the master clock which is not good as well.

I haven't yet allocated time to investigate this.


Another bit of information which I didn't start a discussion (but I think I should),
which is relevant to AMD systems, is in 'unsynchronized_tsc' function.

On AMD guest it will mark the TSC as unstable in the guest as long as invtsc is not used.
I patched that code out for myself, that is why I am mentioning it.

Best regards,
Maxim Levitsky