Re: [RFC PATCH 0/2] kvmclock: fix ABI breakage from PVCLOCK_COUNTS_FROM_ZERO.

From: Radim KrÄmÃÅ
Date: Mon Sep 21 2015 - 18:00:47 EST


2015-09-21 17:53-0300, Marcelo Tosatti:
> On Mon, Sep 21, 2015 at 10:00:27PM +0200, Radim KrÄmÃÅ wrote:
>> 2015-09-21 12:52-0300, Marcelo Tosatti:
>> > On Mon, Sep 21, 2015 at 05:12:10PM +0200, Radim KrÄmÃÅ wrote:
>> >> 2015-09-20 19:57-0300, Marcelo Tosatti:
>> >>> Is it counting from zero that breaks SLES10?
>> >>
>> >> Not by itself, treating MSR_KVM_SYSTEM_TIME as one-shot initializer did.
>> >> The guest wants to write to MSR_KVM_SYSTEM_TIME as much as it likes to,
>> >> while still keeping system time; we used to allow that, which means an
>> >> ABI breakage. (And we can't even say that guest's behaviour is against
>> >> the spec ...)
>> >
>> > Because this behaviour was not defined.
>>
>> It is defined by implementation.
>>
>> > Can't you just condition PVCLOCK_COUNTS_FROM_ZERO behaviour on
>> > boot_vcpu_runs_old_kvmclock == false?
>> > The patch would be much simpler.
>>
>> If you mean the hunk in cover letter, I don't like it because we presume
>> that no other guests were broken.
>
> Yes patch 1.

(I'd call them: a hunk in cover letter [0/2], patch 1 = [1/2], and
patch 2 = [2/2].)

> Do you have an example of another guest that is broken?

Yes, the hot-plug in any relatively recent Linux guest.

> Really, assuming things about the value of the msr read when you
> write to the MSR is very awkward. That SuSE code is incorrect, it fails
> on other occasions as well (see
> 54750f2cf042c42b4223d67b1bd20138464bde0e).

Yeah, I even read the SUSE implementation, sad times.

>> I really don't like it
>
> Because it changes behaviour that was previously unspecified?

No, because it adds another exemption to already ugly code.
(Talking about the hunk in [0/2].)

>> so I thought about other problems with
>> PVCLOCK_COUNTS_FROM_ZERO ... have you tried to hot-replug VCPU 0?
>
> Can't unplug VCPU 0 i think.

You can.

>> It doesn't work well ;)
>
> Can you hot-unplug vcpu 0?

I can, I did, hot-plug bugged.

>> > The problem is, "selecting one read as the initial point" is inherently
>> > racy: that delta is relative to one moment (kvmclock read) at one vcpu,
>> > but must be applied to all vcpus.
>>
>> I don't think that is a problem.
>>
>> Kvmclock has a notion of a global system_time in nanoseconds (one value
>> that defines the time, assigned with VM ioctl KVM_SET_CLOCK) and tries
>> to propagate system_time into guest VCPUs as precisely as possible with
>> the help of TSC.
>
>Different pairs of values (system_time, tsc reads) are fundamentally
>broken. This is why
>
>commit 489fb490dbf8dab0249ad82b56688ae3842a79e8
> x86, paravirt: Add a global synchronization point for pvclock
>
>Exists.
>
>Then to guarantee monotonicity you need to stop those updates
>(or perform the pair update in lock step):
>
>commit d828199e84447795c6669ff0e6c6d55eb9beeff6
> KVM: x86: implement PVCLOCK_TSC_STABLE_BIT pvclock flag

(Thanks for the commits, split values are the core design that avoids
modifying rdtsc, so I'll be grad to read its details.)

>> > 2) You rely on monotonicity across vcpus to perform
>> > the 'minus delta that was read on vcpu0' calculation, but
>> > monotonicity across vcpus can fail during runtime
>> > (say host clocksource goes tsc->hpet due to tsc instability).
>>
>> That could be a problem, but I'm adding a VCPU independent constant to
>> all reads -- does the new code rely on monoticity in places where the
>> old one didn't?
>
> The problem is overflow:
> kvmclock non-monotonic across vcpus AND delta subtraction:
>
> ptime | vcpu0kvmclock time | vcpu0sched_clock | vcpu1kvmclock time | vcpu1sched_clock
> 1 1 1
> 2 2 2 1

KVM sched clock not used before this point (I even moved the code to
make sure), so there is no problem with monotonicity.

> 3 3 0 2 -1
^^^
-1 is the overflow. Very unlikely to ever happen in Linux, as we now
boot other VCPUs much later than the KVM clock configuration and it can
only happen if VCPU1 is running as the reconfiguration takes place, but
a simple

if (vcpu[x].sched_clock <= global_sched_clock_offset)
return 0;

would take care of it. The time would stand still for a while, which is
not a huge problem for boot-only scenario. Other VCPUs couldn't be
reading KVM sched before it was configured, so only operations that
happen before vcpu1sched_clock goes positive are affected.
(We have other problems if the window can be long.)

> 4 4 1 3 0
> 5 5 2 4 1
> 6 6 3 5 2
> 7 7 4 6 3
>
> ptime is "physical time".
>
> I prefer that the host counts from zero (there are less problems to
> handle).

The main advantage of a hypervisor solution for me is one saved
subtraction and comparison on every sched_clock().

> Again, that SuSE patch is incorrect and a huge hack.

I'm not disputing that. (Which doesn't justify the breakage.)

> An alternative is to enable PVCLOCK_COUNTS_FROM_ZERO _if_ the guest
> requests the feature.

Yes, and the alternative needs new paravirt interface.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/