Re: [RFC PATCH 0/2] kvmclock: fix ABI breakage from PVCLOCK_COUNTS_FROM_ZERO.

From: Radim KrÄmÃÅ
Date: Tue Sep 22 2015 - 10:34:03 EST

Next message: Christophe Leroy: "[PATCH 4/9] powerpc: inline ip_fast_csum()"
Previous message: Martin Schwidefsky: "Re: [PATCH -tip 2/3] sched/wake_q: Relax to acquire semantics"
In reply to: Marcelo Tosatti: "Re: [RFC PATCH 0/2] kvmclock: fix ABI breakage from PVCLOCK_COUNTS_FROM_ZERO."
Next in thread: Radim KrÄmÃÅ: "Re: [RFC PATCH 0/2] kvmclock: fix ABI breakage from PVCLOCK_COUNTS_FROM_ZERO."
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

2015-09-21 19:37-0300, Marcelo Tosatti:
> On Tue, Sep 22, 2015 at 12:00:39AM +0200, Radim KrÄmÃÅ wrote:
>> 2015-09-21 17:53-0300, Marcelo Tosatti:
>> > On Mon, Sep 21, 2015 at 10:00:27PM +0200, Radim KrÄmÃÅ wrote:
>> >> so I thought about other problems with
>> >> PVCLOCK_COUNTS_FROM_ZERO ... have you tried to hot-replug VCPU 0?
>> >> It doesn't work well ;)
>> >
>> > Can you hot-unplug vcpu 0?
>>
>> I can, I did, hot-plug bugged.
>
> Due to PVCLOCK_COUNTS_FROM_ZERO patch?

Yes, VCPU 0 writes MSR_KVM_SYSTEM_TIME_NEW to access the kvm clock, but
that very operation resets system_time. The result is a long sleep,
because at least Linux doesn't handle that shift.
(Just in case, works fine after reverting the one host patch.)

With more thinking, PVCLOCK_COUNTS_FROM_ZERO also breaks suspend and
resume as we write the MSR there. Testing shows that is isn't as bad as
the hotplug, because `sleep 1` returns in a second and `date` is fine,
but having multiple 0 points in `dmesg` isn't very nice either.

There are too many problems with PVCLOCK_COUNTS_FROM_ZERO, I'll send the
revert with cc:stable soon. (Without any guest changes to sched_clock.)

>> >> > The problem is, "selecting one read as the initial point" is inherently
>> >> > racy: that delta is relative to one moment (kvmclock read) at one vcpu,
>> >> > but must be applied to all vcpus.
>> >>
>> >> I don't think that is a problem.
>> >>
>> >> Kvmclock has a notion of a global system_time in nanoseconds (one value
>> >> that defines the time, assigned with VM ioctl KVM_SET_CLOCK) and tries
>> >> to propagate system_time into guest VCPUs as precisely as possible with
>> >> the help of TSC.
>> >
>> >Different pairs of values (system_time, tsc reads) are fundamentally
>> >broken. This is why
>> >
>> >commit 489fb490dbf8dab0249ad82b56688ae3842a79e8
>> > x86, paravirt: Add a global synchronization point for pvclock
>> >
>> >Exists.
>> >
>> >Then to guarantee monotonicity you need to stop those updates
>> >(or perform the pair update in lock step):
>> >
>> >commit d828199e84447795c6669ff0e6c6d55eb9beeff6
>> > KVM: x86: implement PVCLOCK_TSC_STABLE_BIT pvclock flag
>>
>> (Thanks for the commits, split values are the core design that avoids
>> modifying rdtsc,
>
> Which is broken (thats the point).

Yes, but I don't think we could use tsc any better without having a
correct guest's time in the host.

We chose to provide kvmclock without a host_tsc -> host_ns function in
the host. It is just impossible to give the guest a precise (tsc, ns)
tuple if the host is not using TSC for its clock. (We don't have
control over the machine and using two reads of time can't be precise.)

>> > 3 3 0 2 -1
>> ^^^
>> -1 is the overflow. Very unlikely to ever happen in Linux, as we now
>> boot other VCPUs much later than the KVM clock configuration and it can
>> only happen if VCPU1 is running as the reconfiguration takes place, but
>> a simple
>>
>> if (vcpu[x].sched_clock <= global_sched_clock_offset)
>> return 0;
>>
>> would take care of it. The time would stand still for a while, which is
>> not a huge problem for boot-only scenario.
>
> Look at
>
> commit 489fb490dbf8dab0249ad82b56688ae3842a79e8
> x86, paravirt: Add a global synchronization point for pvclock

This patch shows the place where hotplug likely fails -- system_time
goes back to 0 so time is frozen until system_time reaches the original
value, again.
It also makes the result of kvm_clock_read monotonic across all vcpus,
so [1/2] doesn't need to add code to handle initial overflow.

> And think of what an overflow does.

If the clock was slightly negative (unsigned) and then overflowed,
sched_clock would freeze because it would never reach that high value
again. Offset in [1/2] is applied later and can't reach negative.

> Note: i tried your solution before (to add offset) and saw this issue
> in practice.

Great, thanks, I will thoroughly test it to see where it fails.

>> Other VCPUs couldn't be
>> reading KVM sched before it was configured, so only operations that
>> happen before vcpu1sched_clock goes positive are affected.
>> (We have other problems if the window can be long.)
>
> The point where vcpu1sched_clock is negative is after kvmclock is
> initialized.

Existing code takes care of that -- vcpu0 reads a value before any user
of sched_clock could have so later reads must be larger.

>> > An alternative is to enable PVCLOCK_COUNTS_FROM_ZERO _if_ the guest
>> > requests the feature.
>>
>> Yes, and the alternative needs new paravirt interface.
>
> So either:
>
> Proceed with guest solution:
> -> Make sure the overflow can't happen (and write down why not in the
> code). Don't assume a small delta between kvmclock values of vcpus.

I will comment that 489fb490dbf8 ("x86, paravirt: Add a global
synchronization point for pvclock") gives all guarantees that prevent
the overflow on secondary VCPU bringup.

> -> Handle stable -> non-stable kvmclock transition.

No need to do anything here either. It will behave like before.

> -> kvmclock counts from zero should not depend on stable kvmclock
> (because nohz_full should work on hpet host systems).

Yes, I will do that.

> Enable counts-from-zero on MSR_KVM_SYSTEM_TIME_NEW:
> -> Figure out whats wrong with different kvmclock values on hotplug,
> and fix it.

It's wrong to reset the time unconditionally on an operation that can be
used anytime (and sometimes must be used even if we don't want to reset
the time), revert is the best fix.

We can do guest-only or paravirtualized zeroing.
Paravirtualized zeroing has more complicated code, but simpler context.
I mainly don't like the idea of adding an interface for a feature the
guest can implement at the same level of quality.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Christophe Leroy: "[PATCH 4/9] powerpc: inline ip_fast_csum()"
Previous message: Martin Schwidefsky: "Re: [PATCH -tip 2/3] sched/wake_q: Relax to acquire semantics"
In reply to: Marcelo Tosatti: "Re: [RFC PATCH 0/2] kvmclock: fix ABI breakage from PVCLOCK_COUNTS_FROM_ZERO."
Next in thread: Radim KrÄmÃÅ: "Re: [RFC PATCH 0/2] kvmclock: fix ABI breakage from PVCLOCK_COUNTS_FROM_ZERO."
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]