Re: [RFC PATCH 0/2] kvmclock: fix ABI breakage from PVCLOCK_COUNTS_FROM_ZERO.

From: Marcelo Tosatti
Date: Mon Sep 21 2015 - 18:37:41 EST


On Tue, Sep 22, 2015 at 12:00:39AM +0200, Radim KrÄmÃÅ wrote:
> 2015-09-21 17:53-0300, Marcelo Tosatti:
> > On Mon, Sep 21, 2015 at 10:00:27PM +0200, Radim KrÄmÃÅ wrote:
> >> 2015-09-21 12:52-0300, Marcelo Tosatti:
> >> > On Mon, Sep 21, 2015 at 05:12:10PM +0200, Radim KrÄmÃÅ wrote:
> >> >> 2015-09-20 19:57-0300, Marcelo Tosatti:
> >> >>> Is it counting from zero that breaks SLES10?
> >> >>
> >> >> Not by itself, treating MSR_KVM_SYSTEM_TIME as one-shot initializer did.
> >> >> The guest wants to write to MSR_KVM_SYSTEM_TIME as much as it likes to,
> >> >> while still keeping system time; we used to allow that, which means an
> >> >> ABI breakage. (And we can't even say that guest's behaviour is against
> >> >> the spec ...)
> >> >
> >> > Because this behaviour was not defined.
> >>
> >> It is defined by implementation.
> >>
> >> > Can't you just condition PVCLOCK_COUNTS_FROM_ZERO behaviour on
> >> > boot_vcpu_runs_old_kvmclock == false?
> >> > The patch would be much simpler.
> >>
> >> If you mean the hunk in cover letter, I don't like it because we presume
> >> that no other guests were broken.
> >
> > Yes patch 1.
>
> (I'd call them: a hunk in cover letter [0/2], patch 1 = [1/2], and
> patch 2 = [2/2].)
>
> > Do you have an example of another guest that is broken?
>
> Yes, the hot-plug in any relatively recent Linux guest.
>
> > Really, assuming things about the value of the msr read when you
> > write to the MSR is very awkward. That SuSE code is incorrect, it fails
> > on other occasions as well (see
> > 54750f2cf042c42b4223d67b1bd20138464bde0e).
>
> Yeah, I even read the SUSE implementation, sad times.
>
> >> I really don't like it
> >
> > Because it changes behaviour that was previously unspecified?
>
> No, because it adds another exemption to already ugly code.
> (Talking about the hunk in [0/2].)
>
> >> so I thought about other problems with
> >> PVCLOCK_COUNTS_FROM_ZERO ... have you tried to hot-replug VCPU 0?
> >
> > Can't unplug VCPU 0 i think.
>
> You can.
>
> >> It doesn't work well ;)
> >
> > Can you hot-unplug vcpu 0?
>
> I can, I did, hot-plug bugged.

Due to PVCLOCK_COUNTS_FROM_ZERO patch?

> >> > The problem is, "selecting one read as the initial point" is inherently
> >> > racy: that delta is relative to one moment (kvmclock read) at one vcpu,
> >> > but must be applied to all vcpus.
> >>
> >> I don't think that is a problem.
> >>
> >> Kvmclock has a notion of a global system_time in nanoseconds (one value
> >> that defines the time, assigned with VM ioctl KVM_SET_CLOCK) and tries
> >> to propagate system_time into guest VCPUs as precisely as possible with
> >> the help of TSC.
> >
> >Different pairs of values (system_time, tsc reads) are fundamentally
> >broken. This is why
> >
> >commit 489fb490dbf8dab0249ad82b56688ae3842a79e8
> > x86, paravirt: Add a global synchronization point for pvclock
> >
> >Exists.
> >
> >Then to guarantee monotonicity you need to stop those updates
> >(or perform the pair update in lock step):
> >
> >commit d828199e84447795c6669ff0e6c6d55eb9beeff6
> > KVM: x86: implement PVCLOCK_TSC_STABLE_BIT pvclock flag
>
> (Thanks for the commits, split values are the core design that avoids
> modifying rdtsc,

Which is broken (thats the point).

> so I'll be grad to read its details.)
>
> >> > 2) You rely on monotonicity across vcpus to perform
> >> > the 'minus delta that was read on vcpu0' calculation, but
> >> > monotonicity across vcpus can fail during runtime
> >> > (say host clocksource goes tsc->hpet due to tsc instability).
> >>
> >> That could be a problem, but I'm adding a VCPU independent constant to
> >> all reads -- does the new code rely on monoticity in places where the
> >> old one didn't?
> >
> > The problem is overflow:
> > kvmclock non-monotonic across vcpus AND delta subtraction:
> >
> > ptime | vcpu0kvmclock time | vcpu0sched_clock | vcpu1kvmclock time | vcpu1sched_clock
> > 1 1 1
> > 2 2 2 1
>
> KVM sched clock not used before this point (I even moved the code to
> make sure), so there is no problem with monotonicity.
>
> > 3 3 0 2 -1
> ^^^
> -1 is the overflow. Very unlikely to ever happen in Linux, as we now
> boot other VCPUs much later than the KVM clock configuration and it can
> only happen if VCPU1 is running as the reconfiguration takes place, but
> a simple
>
> if (vcpu[x].sched_clock <= global_sched_clock_offset)
> return 0;
>
> would take care of it. The time would stand still for a while, which is
> not a huge problem for boot-only scenario.

Look at

commit 489fb490dbf8dab0249ad82b56688ae3842a79e8
x86, paravirt: Add a global synchronization point for pvclock

And think of what an overflow does.

Note: i tried your solution before (to add offset) and saw this issue
in practice.

> Other VCPUs couldn't be
> reading KVM sched before it was configured, so only operations that
> happen before vcpu1sched_clock goes positive are affected.
> (We have other problems if the window can be long.)

The point where vcpu1sched_clock is negative is after kvmclock is
initialized.

>
> > 4 4 1 3 0
> > 5 5 2 4 1
> > 6 6 3 5 2
> > 7 7 4 6 3
> >
> > ptime is "physical time".
> >
> > I prefer that the host counts from zero (there are less problems to
> > handle).
>
> The main advantage of a hypervisor solution for me is one saved
> subtraction and comparison on every sched_clock().

To me are such complications as handling overflows.

> > Again, that SuSE patch is incorrect and a huge hack.
>
> I'm not disputing that. (Which doesn't justify the breakage.)
>
> > An alternative is to enable PVCLOCK_COUNTS_FROM_ZERO _if_ the guest
> > requests the feature.
>
> Yes, and the alternative needs new paravirt interface.

So either:

Proceed with guest solution:
-> Make sure the overflow can't happen (and write down why not in the
code). Don't assume a small delta between kvmclock values of vcpus.
-> Handle stable -> non-stable kvmclock transition.
-> kvmclock counts from zero should not depend on stable kvmclock
(because nohz_full should work on hpet host systems).

Enable counts-from-zero on MSR_KVM_SYSTEM_TIME_NEW:
-> Figure out whats wrong with different kvmclock values on hotplug,
and fix it.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/