Re: [PATCH] x86/kvmclock: Stop kvmclocks for hibernate restore

From: Vitaly Kuznetsov
Date: Fri Mar 26 2021 - 08:38:54 EST


Vitaly Kuznetsov <vkuznets@xxxxxxxxxx> writes:

> Lenny Szubowicz <lszubowi@xxxxxxxxxx> writes:
>
>> On 3/17/21 9:30 AM, Vitaly Kuznetsov wrote:
>>> Lenny Szubowicz <lszubowi@xxxxxxxxxx> writes:
>>>
>>>> Turn off host updates to the registered kvmclock memory
>>>> locations when transitioning to a hibernated kernel in
>>>> resume_target_kernel().
>>>>
>>>> This is accomplished for each secondary vcpu by disabling host
>>>> clock updates for that vcpu when it is put offline. For the
>>>> primary vcpu, it's accomplished by using the existing callback
>>>> from save_processor_state() to kvm_save_sched_clock_state().
>>>>
>>>> The registered kvmclock memory locations may differ between
>>>> the currently running kernel and the hibernated kernel, which
>>>> is being restored and resumed. Kernel memory corruption is thus
>>>> possible if the host clock updates are allowed to run while the
>>>> hibernated kernel is relocated to its original physical memory
>>>> locations.
>>>>
>>>> This is similar to the problem solved for kexec by
>>>> commit 1e977aa12dd4 ("x86: KVM guest: disable clock before rebooting.")
>>>>
>>>> Commit 95a3d4454bb1 ("x86/kvmclock: Switch kvmclock data to a
>>>> PER_CPU variable") innocently increased the exposure for this
>>>> problem by dynamically allocating the physical pages that are
>>>> used for host clock updates when the vcpu count exceeds 64.
>>>> This increases the likelihood that the registered kvmclock
>>>> locations will differ for vcpus above 64.
>>>>
>>>> Reported-by: Xiaoyi Chen <cxiaoyi@xxxxxxxxxx>
>>>> Tested-by: Mohamed Aboubakr <mabouba@xxxxxxxxxx>
>>>> Signed-off-by: Lenny Szubowicz <lszubowi@xxxxxxxxxx>
>>>> ---
>>>> arch/x86/kernel/kvmclock.c | 34 ++++++++++++++++++++++++++++++++--
>>>> 1 file changed, 32 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
>>>> index aa593743acf6..291ffca41afb 100644
>>>> --- a/arch/x86/kernel/kvmclock.c
>>>> +++ b/arch/x86/kernel/kvmclock.c
>>>> @@ -187,8 +187,18 @@ static void kvm_register_clock(char *txt)
>>>> pr_info("kvm-clock: cpu %d, msr %llx, %s", smp_processor_id(), pa, txt);
>>>> }
>>>>
>>>> +/*
>>>> + * Turn off host clock updates to the registered memory location when the
>>>> + * cpu clock context is saved via save_processor_state(). Enables correct
>>>> + * handling of the primary cpu clock when transitioning to a hibernated
>>>> + * kernel in resume_target_kernel(), where the old and new registered
>>>> + * memory locations may differ.
>>>> + */
>>>> static void kvm_save_sched_clock_state(void)
>>>> {
>>>> + native_write_msr(msr_kvm_system_time, 0, 0);
>>>> + kvm_disable_steal_time();
>>>> + pr_info("kvm-clock: cpu %d, clock stopped", smp_processor_id());
>>>> }
>>>
>>> Nitpick: should we rename kvm_save_sched_clock_state() to something more
>>> generic, like kvm_disable_host_clock_updates(), to indicate that what it
>>> does is not only sched clock related?
>>
>> I see your rationale. But if I rename kvm_save_sched_clock_state(),
>> shouldn't I also rename kvm_restore_sched_clock_state()?
>> The names appear to reflect the callbacks that invoke them,
>> from save_processor_state()/restore_processor_state(), rather than
>> what these functions need to do.
>>
>> x86_platform.save_sched_clock_state = kvm_save_sched_clock_state;
>> x86_platform.restore_sched_clock_state = kvm_restore_sched_clock_state;
>>
>> For V2 of my patch, I kept these names as they were. But if you have a strong
>> desire for a different name, then I think both routines should be renamed
>> similarly, since they are meant to be a complementary pair.
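>>
>> If we do rename both, something like this is what I'd picture (names
>> purely illustrative, nothing tested):
>>
>>     /* hypothetical complementary pair replacing the current names */
>>     x86_platform.save_sched_clock_state = kvm_disable_host_clock_updates;
>>     x86_platform.restore_sched_clock_state = kvm_enable_host_clock_updates;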
>>
>>>
>>>>
>>>> static void kvm_restore_sched_clock_state(void)
>>>> @@ -311,9 +321,23 @@ static int kvmclock_setup_percpu(unsigned int cpu)
>>>> return p ? 0 : -ENOMEM;
>>>> }
>>>>
>>>> +/*
>>>> + * Turn off host clock updates to the registered memory location when a
>>>> + * cpu is placed offline. Enables correct handling of secondary cpu clocks
>>>> + * when transitioning to a hibernated kernel in resume_target_kernel(),
>>>> + * where the old and new registered memory locations may differ.
>>>> + */
>>>> +static int kvmclock_cpu_offline(unsigned int cpu)
>>>> +{
>>>> + native_write_msr(msr_kvm_system_time, 0, 0);
>>>> + pr_info("kvm-clock: cpu %d, clock stopped", cpu);
>>>
>>> I'd say this pr_info() is superfluous: on a system with hundreds of
>>> vCPUs users will get flooded with 'clock stopped' messages which don't
>>> actually mean much: in case native_write_msr() fails the error gets
>>> reported in dmesg anyway. I'd suggest we drop this and pr_info() in
>>> kvm_save_sched_clock_state().
>>>
>>
>> Agreed. I was essentially using it as a pr_debug(). Gone in V2.
>>
>>>> + return 0;
>>>
>>> Why don't we disable steal time accounting here? MSR_KVM_STEAL_TIME is
>>> also per-cpu. Can we merge kvm_save_sched_clock_state() with
>>> kvmclock_cpu_offline() maybe?
>>>
>>
>> kvm_cpu_down_prepare() in arch/x86/kernel/kvm.c already calls
>> kvm_disable_steal_time() when a vcpu is placed offline.
>> So there is no need to do that in kvmclock_cpu_offline().
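>>
>> (For reference, kvm_disable_steal_time() in arch/x86/kernel/kvm.c is,
>> if I read it right, just a write of 0 to the steal time MSR:
>>
>>     static void kvm_disable_steal_time(void)
>>     {
>>             if (!has_steal_clock)
>>                     return;
>>
>>             wrmsr(MSR_KVM_STEAL_TIME, 0, 0);
>>     }
>>
>> so repeating it in kvmclock_cpu_offline() would be redundant.)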
>>
>> In the case of the hibernation resume code path, resume_target_kernel()
>> in kernel/power/hibernate.c, the secondary cpus are placed offline,
>> but the primary is not. Instead, we are going to be switching contexts
>> of the primary cpu from the boot kernel to the kernel that was restored
>> from the hibernation image.
>>
>> This is where save_processor_state()/restore_processor_state() and
>> kvm_save_sched_clock_state()/kvm_restore_sched_clock_state() come into
>> play to stop the kvmclock of the boot kernel's primary cpu and restart
>> the kvmclock of the restored hibernated kernel's primary cpu.
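>>
>> Heavily trimmed, and from memory, so treat it as a sketch only, the
>> flow is roughly:
>>
>>     /* boot kernel: resume_target_kernel() in kernel/power/hibernate.c */
>>     hibernate_resume_nonboot_cpu_disable(); /* secondaries -> kvmclock_cpu_offline() */
>>     local_irq_disable();
>>     syscore_suspend();
>>     save_processor_state();                 /* -> kvm_save_sched_clock_state() */
>>     swsusp_arch_resume();                   /* relocate the image and jump into it */
>>
>>     /* hibernated kernel: back in create_image(), after swsusp_arch_suspend() */
>>     restore_processor_state();              /* -> kvm_restore_sched_clock_state() */
>>     syscore_resume();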
>>
>> And in this case, no one is calling kvm_disable_steal_time(),
>> so kvm_save_sched_clock_state() is doing it. (This is very similar
>> to the reason why kvm_crash_shutdown() in kvmclock.c needs to call
>> kvm_disable_steal_time().)
>>
>> However, I'm now wondering if kvm_restore_sched_clock_state()
>> needs to add a call to the equivalent of kvm_register_steal_time(),
>> because otherwise no one will do that for the primary vcpu
>> on resume from hibernation.
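>>
>> i.e. something along these lines (untested; kvm_register_steal_time()
>> is currently static in kvm.c, so it would have to be exposed or
>> open-coded):
>>
>>     static void kvm_restore_sched_clock_state(void)
>>     {
>>             kvm_register_clock("primary cpu clock, resume");
>>             /* re-enable steal time accounting for the primary vcpu */
>>             kvm_register_steal_time();
>>     }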
>
> In case this is true, steal time accounting is not our only
> problem. kvm_guest_cpu_init(), which is called from the
> smp_prepare_boot_cpu() hook, also sets up Async PF and PV EOI. Both of
> these features establish a shared guest-host memory region; if this
> doesn't happen upon resume from hibernation, we're in trouble.
>
> The smp_prepare_boot_cpu() hook is called very early from start_kernel(),
> but what happens when we switch to the context of the hibernated kernel?
>
> I'm going to set up an environment and check what's going on.

According to the log, we do indeed have a problem:

[ 15.844263] ACPI: Preparing to enter system sleep state S4
[ 15.844309] PM: Saving platform NVS memory
[ 15.844311] Disabling non-boot CPUs ...
[ 15.844625] kvm-guest: Unregister pv shared memory for cpu 1
[ 15.846272] smpboot: CPU 1 is now offline
[ 15.847124] kvm-guest: Unregister pv shared memory for cpu 2
[ 15.848720] smpboot: CPU 2 is now offline
[ 15.849637] kvm-guest: Unregister pv shared memory for cpu 3
[ 15.851452] smpboot: CPU 3 is now offline
[ 15.853295] PM: hibernation: Creating image:
[ 15.865126] PM: hibernation: Need to copy 82214 pages
[18446743485.711482] kvm-clock: cpu 0, msr 8201001, primary cpu clock, resume
[18446743485.711610] PM: Restoring platform NVS memory
[18446743485.713922] Enabling non-boot CPUs ...
[18446743485.713997] x86: Booting SMP configuration:
[18446743485.713998] smpboot: Booting Node 0 Processor 1 APIC 0x1
[18446743485.714127] kvm-clock: cpu 1, msr 8201041, secondary cpu clock
[18446743485.714484] kvm-guest: KVM setup async PF for cpu 1
[18446743485.714489] kvm-guest: stealtime: cpu 1, msr 3ecac080
[18446743485.714816] CPU1 is up
[18446743485.714846] smpboot: Booting Node 0 Processor 2 APIC 0x2
[18446743485.714954] kvm-clock: cpu 2, msr 8201081, secondary cpu clock
[18446743485.715359] kvm-guest: KVM setup async PF for cpu 2
[18446743485.715364] kvm-guest: stealtime: cpu 2, msr 3ed2c080
[18446743485.715640] CPU2 is up
[18446743485.715672] smpboot: Booting Node 0 Processor 3 APIC 0x3
[18446743485.715867] kvm-clock: cpu 3, msr 82010c1, secondary cpu clock
[18446743485.716288] kvm-guest: KVM setup async PF for cpu 3
[18446743485.716293] kvm-guest: stealtime: cpu 3, msr 3edac080
[18446743485.716564] CPU3 is up
[18446743485.716732] ACPI: Waking up from system sleep state S4
[18446743485.728139] sd 0:0:0:0: [sda] Starting disk
[18446743485.750373] OOM killer enabled.
[18446743485.750909] Restarting tasks ... done.
[18446743485.754465] PM: hibernation: hibernation exit

(this is with your v2 included). There's nothing about CPU0 for
e.g. async PF, and the timestamps are really interesting. Seems we have
issues to fix; I'm playing with it right now.
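
If it turns out CPU0 never re-runs this PV setup after the image switch,
one possible direction (completely untested, names made up by me) would
be to redo kvm_guest_cpu_init() for the boot CPU from a syscore resume
hook, which IIUC runs in the restored kernel right after
restore_processor_state():

    static void kvm_pv_syscore_resume(void)
    {
            /* re-register async PF, PV EOI and steal time for this cpu */
            kvm_guest_cpu_init();
    }

    static struct syscore_ops kvm_pv_syscore_ops = {
            .resume = kvm_pv_syscore_resume,
    };

    /* registered somewhere in kvm_guest_init():
     * register_syscore_ops(&kvm_pv_syscore_ops);
     */

Whether syscore resume really runs early enough and on the right CPU
would need checking, of course.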

--
Vitaly