Re: [RFC PATCH v2 0/8] timekeeping: Fix drift tracking precision and add feed-forward discipline via vmclock

From: David Woodhouse

Date: Tue May 26 2026 - 06:01:50 EST

On Tue, 2026-05-26 at 09:10 +0200, Miroslav Lichvar wrote:
> On Mon, May 25, 2026 at 10:14:10AM +0100, David Woodhouse wrote:
> > On Mon, 2026-05-25 at 10:08 +0200, Miroslav Lichvar wrote:
> > > On Thu, May 21, 2026 at 10:54:41AM +0100, David Woodhouse wrote:
> > > > On Thu, 2026-05-21 at 08:35 +0200, Miroslav Lichvar wrote:
> > > > > Ok, but I don't see why the phase corrections of the guest need to be
> > > > > in the kernel.

...

>
> > But that's just using ->time_offset which has *always* been in the
> > kernel.
>
> time_offset is an input of the kernel PLL. My concern is that the PLL
> is fed directly by ptp_vmclock, ignoring everything else. There is no
> setting of the PLL time constant or the flags, no configuration of the
> step threshold, or any other options that a more advanced
> implementation might have. To me it feels like a bad shortcut.

Oh, undoubtedly yes. The hack I put in vmclock for the RFC is very much
a shortcut to prove the concept and enable discussion of what it
*should* look like.

We absolutely do want userspace to be in control of the policy.

Although I do think we want the kernel to be able to seed its
timekeeping at boot from the vmclock — not just the time, but the
precise tick_length. I haven't looked hard at that part yet; only
observed that with the do_settimeofday64() in the existing hack, the
guest often starts up about 100ns from the reference.

> I think this part of the loop should be in userspace, properly using
> the adjtimex() API. The feed-forward part (copying frequency settings
> of the host) is still possible.

I've no fundamental objection to using adjtimex(); I just couldn't see
how to do so with the required precision, otherwise I'd have done so.
Although I do quite like Thomas's clock_[gs]et_time_reference()
suggestion, which allows it to discipline AUX clocks too.

Let us assume that userspace, either from vmclock or direct discipline
of the arch counter against external sources, has:
• Reference time T.
• Arch counter value at time T.
• Period of a single arch counter tick.

This translates fairly directly into the kernel's tick_length and
time_offset. But *only* if you know cycle_interval, ntp_error and other
details. Which is why my timekeeping_set_reference() takes the
information in that form, and then translates it within the core
timekeeping.

If you can show me how to do that with adjtimex(), that would be great.
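
To make the inputs concrete, here is a minimal sketch of the reference
triple listed above and of how it extrapolates to an arbitrary counter
value. Everything in it is an assumption for illustration: struct
time_ref, ref_time_at() and clock_error_ns() are hypothetical names,
the 32.32 fixed-point period is just one convenient representation, and
unsigned __int128 is a GCC/Clang extension. It does not show the
translation into tick_length and time_offset, which needs
cycle_interval, ntp_error and the rest of the core timekeeping state.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical reference triple; not a kernel or vmclock structure. */
struct time_ref {
	uint64_t ref_ns;	/* reference time T, in nanoseconds */
	uint64_t ref_counter;	/* arch counter value at time T */
	uint64_t period_frac;	/* one counter tick in ns, 32.32 fixed point */
};

/* Extrapolate the reference time for a counter value at or after T. */
static uint64_t ref_time_at(const struct time_ref *r, uint64_t counter)
{
	unsigned __int128 ns;

	assert(counter >= r->ref_counter);

	/* 128-bit intermediate so that large deltas do not overflow. */
	ns = (unsigned __int128)(counter - r->ref_counter) * r->period_frac;

	return r->ref_ns + (uint64_t)(ns >> 32);
}

/*
 * Error of a clock reading taken at the same counter value:
 * positive means the clock is ahead of the reference.
 */
static int64_t clock_error_ns(const struct time_ref *r, uint64_t counter,
			      uint64_t clock_ns)
{
	return (int64_t)(clock_ns - ref_time_at(r, counter));
}
```

With a period of 0.5ns (period_frac = 1ULL << 31), a counter reading
2000 ticks after the reference point extrapolates to T + 1000ns.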

Here's a sample of the output from my test setup. The host is running
chronyd, with the QEMU patch I linked. The guest test is now entirely
in userspace, using PTP to get paired readings of the guest's
CLOCK_REALTIME vs. vmclock for precisely the same counter value.

As chrony introduces a change on the host, QEMU propagates that to the
guest (the vmclock: line is from QEMU), and the guest adjusts
accordingly. And then converges *really* slowly, as even setting the
time constant to 0 gives a half-life for time_offset of about 11
seconds.

EXT[140130] diff=+0ns counter=995f301fc2b1
EXT[140131] diff=+0ns counter=995f77da47f9
EXT[140132] diff=+0ns counter=995fbf92f419
EXT[140133] diff=+0ns counter=9960074d5bfd
EXT[140134] diff=+1ns counter=99604f088779
vmclock: host_cv=0x44e7bb427115f offset=0xfffc4ae4cee46d30 guest_cv=0x9960830b7e8f tsc_khz=2400000
EXT[140135] diff=+1ns counter=996096c408b5
EXT[140136] diff=-9ns counter=9960deab00f5
EXT[140137] diff=-9ns counter=99612660729d
EXT[140138] diff=-9ns counter=99616e184279
EXT[140139] diff=-9ns counter=9961b5d3aca9
EXT[140140] diff=-9ns counter=9961fd8e78dd
EXT[140141] diff=-9ns counter=99624549d909
EXT[140142] diff=-9ns counter=99628d053e61
EXT[140143] diff=-9ns counter=9962d4be6411
EXT[140144] diff=-9ns counter=99631c76dda9
EXT[140145] diff=-8ns counter=99636431ac05
EXT[140146] diff=-9ns counter=9963abed1f91
EXT[140147] diff=-8ns counter=9963f3a82e91
EXT[140148] diff=-8ns counter=99643b639a31
EXT[140149] diff=-8ns counter=9964831d8385
EXT[140150] diff=-8ns counter=9964cad89fe9
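
As a rough illustration of why the tail is so long, here is a toy model
of a first-order loop that removes a fixed 1/2^shift fraction of the
remaining offset once per second. This is an assumption for the sake of
the arithmetic, not the kernel's code: half_life_seconds() and its
shift parameter are hypothetical, and I am not claiming which shift a
given adjtimex() time constant maps to. A shift of 4 simply happens to
land close to the figure quoted above.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical model, not kernel code: once per second, remove
 * 1/2^shift of the remaining offset. Returns the number of whole
 * seconds until the offset has fallen to half its initial value
 * or less.
 */
static unsigned int half_life_seconds(unsigned int shift)
{
	const int64_t start_ns = 1000000;	/* arbitrary initial offset */
	int64_t offset_ns = start_ns;
	unsigned int seconds = 0;

	/* Keep the per-second chunk non-zero so the loop terminates. */
	assert(shift <= 16);

	while (offset_ns > start_ns / 2) {
		offset_ns -= offset_ns >> shift;
		seconds++;
	}

	return seconds;
}
```

In this model, half_life_seconds(4) comes out at 11 and
half_life_seconds(2) at 3. Any loop of this shape only ever takes out a
fraction of the remaining error per step, which is why a step change on
the host shows up as a slow decay in the guest rather than an
immediate correction.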

Given the simplicity of the 'bad shortcut', and the fact that we do
want the kernel to follow the reference at *boot* time, I do think I'd
like to have a mode for microvms which optionally *allows* the kernel
to continue to track the reference for itself, rather than having an
extra userspace tool that is literally just polling on /dev/vmclock in
order to feed precisely that same information back into the kernel
directly.

> > There's nothing fundamental in the actual *timekeeping* here that
> > hasn't already been in the guest kernel for decades; I'm just fixing a
> > few arithmetic errors in the core code, and then *driving* it more
> > precisely using its existing parameters (tick_length, time_offset).
>
> Fixing arithmetic errors is great. The driving part is what I'm
> concerned about, like where it is and what it is driving.
>
> > > > Right. This *is* the software fallback, because the hardware scaling
> > > > and offset aren't sufficient even if we only care about x86 where the
> > > > former is supported.
> > >
> > > IMHO it's a solution done at a wrong layer.
> >
> > Understood. What do you believe is the better solution?
>
> I think a better solution is scaling of the clocksource, i.e. a layer
> below the realtime clock. An additional multiplier applied in HW or
> SW. That would address the problem for all system clocks, not just the
> realtime clock. adjtimex() changes are applied on top of that, they
> are not in conflict.

But we literally already have a way to 'scale' the counter in order to
derive CLOCK_MONOTONIC/CLOCK_REALTIME: the kernel's timekeeping code.
Currently driven *only* by NTP/adjtimex().

And we have CLOCK_MONOTONIC_RAW which is explicitly *not* skewed
according to any external idea of time, but tracks raw counter ticks as
if they happen at some nominal frequency — and remains precisely in
sync with what userspace might see by reading the counter directly.
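
The distinction is easy to observe from userspace on Linux with
clock_gettime(). The sketch below samples both clocks and reports how
much steering was applied between two samples; clock_ns(),
struct clock_pair, sample_pair() and steering_ns() are hypothetical
helper names. The two reads in a pair are not taken at the same counter
value, so the result includes read latency and is only indicative.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Read a POSIX clock as a single nanosecond count. */
static int64_t clock_ns(clockid_t id)
{
	struct timespec ts;
	int ret = clock_gettime(id, &ts);

	assert(ret == 0);
	(void)ret;

	return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

struct clock_pair {
	int64_t mono_ns;	/* CLOCK_MONOTONIC: steered by the timekeeper */
	int64_t raw_ns;		/* CLOCK_MONOTONIC_RAW: nominal frequency only */
};

/* Back-to-back reads; not atomic with respect to the counter. */
static struct clock_pair sample_pair(void)
{
	struct clock_pair p;

	p.mono_ns = clock_ns(CLOCK_MONOTONIC);
	p.raw_ns = clock_ns(CLOCK_MONOTONIC_RAW);

	return p;
}

/*
 * Extra nanoseconds that CLOCK_MONOTONIC advanced relative to
 * CLOCK_MONOTONIC_RAW between sample a and a later sample b.
 */
static int64_t steering_ns(const struct clock_pair *a,
			   const struct clock_pair *b)
{
	return (b->mono_ns - a->mono_ns) - (b->raw_ns - a->raw_ns);
}
```

Taking two sample_pair() readings a few seconds apart and passing them
to steering_ns() gives the accumulated frequency and phase adjustment
over that interval, to within the read jitter.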

Are you suggesting that the actual clocksource driver in the kernel for
e.g. CSID_ARM_ARCH_COUNTER should *scale* the results it returns,
instead of giving raw counter reads? So we have some NTP-like process
to adjust each clocksource, in *addition* to the core kernel
timekeeping? And then those skewed clocksource values are only
meaningful under a seqlock like the existing kernel timekeeper values
are valid under the tk_data.seq seqlock?

And would we have a separate way to get the real value, to use for
CLOCK_MONOTONIC_RAW?

If I'm understanding your proposal correctly, I am... not keen.

> > Aside from the case of actually using NTP or a PHC to discipline the
> > kernel's CLOCK_REALTIME, the use cases I'm trying to enable are:
> >
> > • (Micro)VM guest is *given* the TSC→realtime relationship in a virt
> >    enlightenment, gets an interrupt whenever it changes. Can react to
> >    that interrupt and steer the kernel's timekeeping as quickly as any
> >    userspace dæmon could do anything.
> >
> > • Dedicated virtual hosting environment needs to discipline the *TSC*
> >    directly against external references (PHC, 1PPS) in order to provide
> >    said virt enlightenment directly to guests and allow for accurate
> >    migration. This environment does not care about the host's actual
> >    CLOCK_REALTIME; that's basically cosmetic for logging purposes.
> >
> > • Multi-purpose environment has a standard ntpd/chrony setup, wants
> >    QEMU to be able to provide the same virt enlightenment based on
> >    the kernel's own timekeeping.
>
> Which of those couldn't be done with the clocksource scaling and/or
> adjtimex() if all the necessary information was available to userspace?

Let us assume that (1) can be done using adjtimex() although as noted
above, I couldn't see how.

(2) is resolved by the patches that Arthur, Thomas and I have worked on
over the last few days to enable PTP to return actual counter values,
and then that 'afterthought' about feeding it into the host kernel is
the same as (1). Although if the counter values themselves end up being
*skewed* then that introduces a whole new set of issues.

(3) would still need the clock_get_time_reference() (which I've hacked
up in my proof of concept as exposing a pollable /dev/vmclock_host
directly from the kernel). And again, if the actual *counter* can't be
trusted any more, that introduces a whole new set of issues with
relating the skewed clocksource cycle count, to what guests actually
*see* and what the kernel reports from its timekeeping.

I think I like the clock_[gs]et_time_reference() model. I *really* have
to context switch back to other things this week, but at some point in
the near future I'm planning to knock up a proof of concept of that;
probably via read/write or ioctls on a miscdev for now to play with it,
and the whole boilerplate of wiring up system calls can come later,
*if* it passes muster.
