Re: [RFC PATCH v2 0/8] timekeeping: Fix drift tracking precision and add feed-forward discipline via vmclock

From: David Woodhouse

Date: Thu May 21 2026 - 06:12:36 EST

On Thu, 2026-05-21 at 08:35 +0200, Miroslav Lichvar wrote:
> On Wed, May 20, 2026 at 01:21:46PM +0100, David Woodhouse wrote:
> > On Wed, 2026-05-20 at 12:39 +0200, Miroslav Lichvar wrote:
> > > On Tue, May 19, 2026 at 04:50:41PM +0100, David Woodhouse wrote:
> > > > The design has two major purposes:
> > >
> > > > • Avoiding the redundant work of having *hundreds* of guests on the
> > > > same host *all* calibrating the same underlying oscillator, while
> > > > enjoying the added fun of steal time as they're trying to do so.
> > >
> > > But isn't that work still duplicated, only moved to the kernel?
> >
> > Not the actual calibration of the TSC against real time, no. It is the
> > *host* which gets the 1PPS signal and does all the work of tracking and
> > smoothing the frequency drift over time. The guest basically gets the
> > same as a vDSO, *telling* it a relationship from TSC to real time.
>
> Ok, but I don't see why the phase corrections of the guest need to be
> in the kernel.

I'm not sure I understand.

There are no 'phase corrections' as such, except of course that the
phase of the guest kernel's clock does get corrected, and naturally
that does have to take effect inside the guest kernel.

I think the key here is that this is not a feedback loop based on
corrections to the existing clock output; this is a feedforward design
as described in https://dl.acm.org/doi/pdf/10.1109/TNET.2011.2158443

It seems that when Julien et al lamented that, "Until now, however,
there has been a serious practical issue inhibiting feed-forward
approaches: a lack of kernel support", the basics were actually there
in the kernel's core timekeeping all along.

We didn't have to *do* anything to the core timekeeping other than fix
a few bugs that the NTP feedback mechanism always masked — who *cares*
if there's a systematic +5PPM drift due to accounting errors, as NTP
can just interpret that as the counter running 5PPM fast and adjust for
it?

Although I don't think the errors are quite that consistent, as they
vary with tick length and even from tick to tick with the mult±1
dithering and interrupt latency — so I wouldn't be surprised if these
fixes made a detectable improvement even in the normal NTP case.

> > > I don't like the idea of adding more clock control loops to the kernel
> > > much.
> >
> > I completely agree. I am absolutely not planning to add any more clock
> > control to the kernel than we already have. As you say, we probably
> > have too many already.
>
> If the vmclock driver is feeding the PLL with the offset between the
> host and guest clocks, I think that would count as a loop.

It's not an offset; it's a direct feed-forward "when the TSC is <this>
the time is <this>" relationship, like a vDSO does.

https://uapi-group.org/specifications/specs/vmclock/
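
As a rough illustration of that kind of relationship, here is a minimal
sketch of turning a counter reading into a time. The struct and function
names are hypothetical and only loosely modelled on the fields the
vmclock specification describes; the period encoding (a fraction of a
second in 2^-(64+shift) units) is an assumption made for this example,
and unsigned __int128 is a GCC/Clang extension.

```c
#include <stdint.h>

/* Hypothetical, simplified snapshot: "when the counter was
 * counter_value, the time was time_sec + time_frac_sec". */
struct ff_snapshot {
	uint64_t counter_value;   /* counter reading at the reference point */
	uint64_t time_sec;        /* reference time, whole seconds */
	uint64_t time_frac_sec;   /* reference time, fraction in 2^-64 s */
	uint64_t period_frac_sec; /* counter period, 2^-(64+shift) s units */
	uint8_t  period_shift;    /* extra precision bits in the period */
};

struct ff_time {
	uint64_t sec;
	uint64_t frac_sec;        /* fraction in 2^-64 s */
};

/* Feed-forward conversion: no feedback from the clock's own output,
 * just delta * period added to the reference time. */
static struct ff_time ff_counter_to_time(const struct ff_snapshot *s,
					 uint64_t counter)
{
	uint64_t delta = counter - s->counter_value;
	unsigned __int128 t;
	struct ff_time out;

	/* Elapsed time since the reference point, in 2^-64 s units. */
	t = (unsigned __int128)delta * s->period_frac_sec;
	t >>= s->period_shift;

	/* Add the fractional part of the reference time; this cannot
	 * overflow 128 bits for 64-bit inputs. */
	t += s->time_frac_sec;

	out.sec = s->time_sec + (uint64_t)(t >> 64);
	out.frac_sec = (uint64_t)t;
	return out;
}
```

There is no loop state here at all: the result depends only on the
published snapshot and the counter value being converted.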

The core motivation is for virtual machines (and especially for
consistent time across live migration), but hardware implementations
should be possible using PCIe PTM. I keep meaning to get my hands on a
TimeCard and play, but there are only so many hours in the day...

> > I'm not sure what scaling the guest TSC would buy us. Sure, it would
> > minimise the frequency step at the moment of migration, but a naïve
> > guest which isn't using vmclock's disruption signal is screwed on live
> > migration *anyway*, because there's *also* a step change in the actual
> > TSC value which is bounded by the real time synchronization of the
> > source and destination host.
>
> The TSC offset can be corrected too. I thought that was already
> happening.

Yes, it is. The TSC offset (and the guest's KVM clock, which is a whole
different sad story) can be corrected a bit — but the *accuracy* with
which they can be corrected is limited to the accuracy of the source
vs. destination hosts' time synchronization.

If the guest has been using NTP or a PHC to discipline the counter of
the source host that it just came from, carefully tracking not only the
perceived time, but also error bounds in order to ensure coherency of,
say, a distributed database... there is no way that we can migrate it
to a new host and 'fake' the frequency/offset on the new host to
sufficiently match. Database corruption ensues.

The best thing to do is to advertise a disruption signal ("throw away
anything you know about the existing counter"), and provide information
on the new host in that {cycle_count, reference time, counter period,
error bounds} form to allow the guest to return to service as soon as
possible.

Which is precisely what vmclock does.
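
A minimal sketch of how a guest might react to such a signal, assuming
it already holds a consistent copy of what the host published. All of
the names below are hypothetical and chosen for illustration; the real
layout, and the rules for reading it consistently, are in the vmclock
specification.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical consistent copy of the host's published information. */
struct host_clock_info {
	uint64_t disruption_marker; /* changes when the counter is disrupted */
	uint64_t counter_value;     /* cycle count at the reference point */
	uint64_t time_sec;          /* reference time, seconds */
	uint64_t time_frac_sec;     /* reference time, 2^-64 s units */
	uint64_t counter_period;    /* counter period, units as published */
	uint64_t time_maxerror_ns;  /* error bound on the reference time */
};

/* Hypothetical guest-side state. */
struct guest_clock_state {
	bool valid;                   /* has any snapshot been seen yet? */
	uint64_t seen_marker;         /* last disruption marker observed */
	struct host_clock_info ref;   /* current reference snapshot */
	int64_t learned_freq_adj_ppb; /* locally learned about the old counter */
};

/* Take a new snapshot. Returns true if a disruption was detected, in
 * which case everything learned about the old counter is discarded. */
static bool guest_clock_update(struct guest_clock_state *g,
			       const struct host_clock_info *h)
{
	bool disrupted = g->valid && g->seen_marker != h->disruption_marker;

	if (disrupted)
		g->learned_freq_adj_ppb = 0; /* old knowledge is worthless */

	g->ref = *h;
	g->seen_marker = h->disruption_marker;
	g->valid = true;
	return disrupted;
}
```

The point of the sketch is that the guest does not try to reconcile the
old and new counters; it drops its state and resumes from the new
host's snapshot and error bounds.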

> > AFAICT scaling the TSC would just add complexity and wouldn't help
> > much.
>
> I think it's a better place to be solving this kind of problem. It's
> compensating for a hardware change. It doesn't need to happen only at
> migration. You could adjust the frequency continuously if you really
> wanted, kind of like synchronous ethernet is doing for clocks over
> network, improving the stability of the physical clock and phase
> corrections are done on top of it at a higher level.

On the *host* side I might accept a PLL on the actual hardware
oscillator and the 1PPS signal... :)

> > And TSC scaling is pretty much x86-specific; other architectures have a
> > *defined* counter frequency and don't need to support scaling.
>
> There can be a software fallback if hardware scaling and/or offset is
> not supported.

Right. This *is* the software fallback, because the hardware scaling
and offset aren't sufficient even if we only care about x86 where the
former is supported.

> > > > > There is a work in progress for chrony to support MONOTONIC_RAW as the
> > > > > main clock. It would be nice if that could be corrected in migrations.
> > > >
> > > > Not sure I understand this. I thought the whole point of MONOTONIC_RAW
> > > > is that it *isn't* skewed by NTP?
> > >
> > > It isn't adjusted, but it can be used as a stable reference avoiding
> > > the multiplier-induced jitter, interference from other processes, and
> > > synchronization loops, e.g. when an NTP client is synchronizing to an
> > > NTP server running on the same system (in different containers).
> >
> > We could just use the TSC for this, instead of MONOTONIC_RAW, couldn't
> > we?
> >
> > (for TSC, read 'arch counter, timebase, etc.' — none of this is x86-
> > specific but 'TSC' is quicker to type...)
>
> Meaning userspace would have to duplicate the kernel's handling of
> the counter (wrapping and scaling) just to avoid a single
> multiplication in the vDSO?

Hm yeah, I guess that makes sense.

The way I've done it in these proof of concept patches is counter-
based, because the interface between host and guest (and from that
theoretical hardware implementation) *is* necessarily in terms of the
hardware — we get told the relationship of the actual *counter* to
realtime.

But as long as the conversions in both directions are quick and
accurate there's no fundamental reason why it *couldn't* be expressed
in terms of MONOTONIC_RAW as it's being passed around.

In my RFC, it's just a call to timekeeping_set_reference() which uses
the *existing* mechanisms to just set tick_length and time_offset
accordingly. Which naturally takes counter-based units too.

But I certainly don't think that doing so *unconditionally* from the
vmclock driver in my proof of concept is the right thing to do.
Userspace needs to set policy like that.

And I wasn't stunningly happy with timekeeping_set_reference() passing
fractional seconds in the vmclock (seconds<<64) units instead of the
native (nanoseconds<<32) of the timekeeping code.
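
For illustration, a sketch of the conversion between those two
fixed-point forms for the sub-second part. The helper names are
hypothetical and not taken from the patches; unsigned __int128 is a
GCC/Clang extension, and both helpers truncate rather than round.

```c
#include <stdint.h>

#define NSEC_PER_SEC_ULL 1000000000ULL

/* Fraction of a second in 2^-64 s units -> nanoseconds << 32.
 * ns<<32 = (frac / 2^64) * 1e9 * 2^32 = (frac * 1e9) >> 32.
 * The result is below 1e9 << 32, so it fits in 64 bits. */
static uint64_t frac_sec_to_ns_shift32(uint64_t frac_sec)
{
	unsigned __int128 v = (unsigned __int128)frac_sec * NSEC_PER_SEC_ULL;

	return (uint64_t)(v >> 32);
}

/* Nanoseconds << 32 (less than one second) -> 2^-64 s units.
 * frac = (ns32 / 2^32 / 1e9) * 2^64 = (ns32 << 32) / 1e9. */
static uint64_t ns_shift32_to_frac_sec(uint64_t ns_shift32)
{
	unsigned __int128 v = (unsigned __int128)ns_shift32 << 32;

	return (uint64_t)(v / NSEC_PER_SEC_ULL);
}
```

Half a second, for example, is 1 << 63 in the first form and
500000000 << 32 in the second.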

So maybe timekeeping_set_reference() should take its input in
MONOTONIC_RAW terms, and the raw information from vmclock should be
converted accordingly? I can try that...

On the *host* side, I anticipate two modes of operation.

A dedicated hosting environment only really cares about disciplining
the host kernel's TSC, and absolutely doesn't *care* about the host
kernel's timekeeping. That's just for logs.

For migrating KVM guests as accurately as possible, we set the guest
*TSC* (scaling and) offset based on our understanding of the host TSC
on both source and destination. The KVM APIs for doing this based on
the kernel's own CLOCK_REALTIME are... a source of sadness. There's a
whole 30-patch series in flight to deal with that, which you can look
at if you like pain, but the tl;dr is that we get the host kernel's
timekeeping out of the picture as *much* as possible and operate in
terms of the TSC. Migrate the guest kernel's TSC as accurately as
possible, and everything *else* in the guest is derived from that.

So in that dedicated environment, userspace will take our hardware
devices which literally latch the *counter* value on a 1PPS signal, or
use NTP if they really have to fall back to that, and discipline the
*counter*, then use that information to both provide the vmclock for
guests, and migrate guests as accurately as possible. All in userspace,
*necessarily* in raw counter terms.

But hey, it's nice for logs to have good timestamps too, so we can feed
it to the kernel's CLOCK_REALTIME as an afterthought. Probably by using
a userspace hook for timekeeping_set_reference(). I haven't yet looked
at whether the existing adjtimex() can be used/abused/extended to allow
for precisely setting tick_length/time_offset like that.

And then there's the 'normal' host side, with a host kernel running
chrony and a few guests in QEMU. Obviously this mode needs to be
properly taken into account as a first class citizen, which is why I've
built the support that's already *in* QEMU (disruption signal only) and
now the vmclock_host and additional QEMU patch to expose that.

Again it needs to be in terms of the guest TSC by the time the VMM
actually puts it in the shared page, but I'm entirely open to input on
how we get it *out* of the kernel's timekeeping. I do tend to have the
opinion that what we should expose to guests is the "intended" clock,
with ntpdata->time_offset built in and *not* including the constant ±1
changes to 'mult' from the dithering, but using the *actual* intended
frequency from tick_length / cycle_interval.

But other than that, I'm prepared to consider the whole of the
vmclock_host export part as a straw man, and entirely happy to
completely reimplement it however you like, if you have strong
opinions. I just needed to get *something* implemented and working, as
a starting point.
