Re: [RFC PATCH v2 0/8] timekeeping: Fix drift tracking precision and add feed-forward discipline via vmclock

From: Miroslav Lichvar

Date: Wed May 27 2026 - 03:47:17 EST


On Tue, May 26, 2026 at 11:00:28AM +0100, David Woodhouse wrote:
> Let us assume that userspace, either from vmclock or direct discipline
> of the arch counter against external sources, has:
> • Reference time T.
> • Arch counter value at time T.
> • Period of a single arch counter tick.
>
> This translates fairly directly into the kernel's tick_length and
> time_offset. But *only* if you know cycle_interval, ntp_error and other
> details. Which is why my timekeeping_set_reference() takes the
> information in that form, and then translates it within the core
> timekeeping.
>
> If you can show me how to do that with adjtimex(), that would be great.

tick_length can be set by the adjtimex() modes ADJ_FREQUENCY (in
scaled units of 1/65536 ppm up to 500 ppm) and ADJ_TICK (in
microseconds per 1/USER_HZ tick).
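
To illustrate the units, here is a minimal sketch of how a total frequency
offset in ppm could be split between the two modes. It assumes USER_HZ is
100, so one microsecond of tick is 100 ppm. The helper names
fill_freq_request() and round_to_long() are made up for this example and
are not part of any existing API.

```c
#include <assert.h>
#include <string.h>
#include <sys/timex.h>

#define NOMINAL_TICK_US 10000L  /* assumption: USER_HZ == 100 */
#define PPM_PER_TICK_US 100.0   /* 1 us out of 10000 us is 100 ppm */

/* Round to nearest without needing libm. */
long round_to_long(double x)
{
    return (long)(x >= 0.0 ? x + 0.5 : x - 0.5);
}

/*
 * Hypothetical helper: fill a request that sets a total frequency
 * offset of 'ppm'. The coarse part goes to ADJ_TICK and the remainder
 * to ADJ_FREQUENCY in units of 1/65536 ppm.
 */
void fill_freq_request(struct timex *tx, double ppm)
{
    long tick_delta = round_to_long(ppm / PPM_PER_TICK_US);
    double rest_ppm = ppm - PPM_PER_TICK_US * tick_delta;

    /* The remainder is at most 50 ppm, inside the 500 ppm limit. */
    assert(rest_ppm >= -500.0 && rest_ppm <= 500.0);

    memset(tx, 0, sizeof(*tx));
    tx->modes = ADJ_FREQUENCY | ADJ_TICK;
    tx->tick = NOMINAL_TICK_US + tick_delta;
    tx->freq = round_to_long(rest_ppm * 65536.0);
}
```

The filled structure would then be passed to adjtimex(), which needs
CAP_SYS_TIME. The kernel also limits the accepted range of tick, which
this sketch does not check.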

time_offset can be set by the ADJ_OFFSET mode. The PLL needs to be
enabled first by setting the STA_PLL status (ADJ_STATUS mode) and also
the STA_FREQHOLD flag needs to be set to avoid changing the PLL
frequency.
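
A minimal sketch of such a request, assuming the current status word was
read beforehand (e.g. by an adjtimex() call with modes set to 0) so that
the other status bits are preserved. fill_offset_request() is a
hypothetical helper name. ADJ_NANO is added here only so that the offset
can be given in nanoseconds.

```c
#include <assert.h>
#include <string.h>
#include <sys/timex.h>

/*
 * Hypothetical helper: fill a request that enables the PLL, holds the
 * PLL frequency and sets the phase offset in nanoseconds.
 */
void fill_offset_request(struct timex *tx, int current_status, long offset_ns)
{
    memset(tx, 0, sizeof(*tx));
    tx->modes = ADJ_STATUS | ADJ_NANO | ADJ_OFFSET;
    /* Keep the existing bits, add STA_PLL and STA_FREQHOLD. */
    tx->status = current_status | STA_PLL | STA_FREQHOLD;
    tx->offset = offset_ns;
}
```

As with the frequency, the call itself needs CAP_SYS_TIME and the kernel
limits the magnitude of the offset it accepts in this mode.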

The ntp_error and other details need to be exposed to userspace. Maybe
in the same API that will be used for reporting the time and frequency
offsets between system clocks.

> As chrony introduces a change on the host, QEMU propagates that to the
> guest (the vmclock: line is from QEMU), and the guest adjusts
> accordingly. And then converges *really* slowly, as even setting the
> time constant to 0 gives a half-life for time_offset of about 11
> seconds.

A simple linear slew would be better for this. The offset is accurate,
there is no need for filtering.
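
By a linear slew I mean a constant temporary frequency offset applied for
a fixed interval, so the correction completes in a known time instead of
decaying exponentially. The arithmetic is sketched below. The function
names are made up for illustration and are not taken from the patches or
from any existing tool.

```c
#include <assert.h>

/* Frequency offset in ppm that removes offset_ns within duration_s. */
double slew_rate_ppm(double offset_ns, double duration_s)
{
    assert(duration_s > 0.0);
    /* (ns / s) is a ratio of 1e-9, multiplying by 1e6 gives ppm. */
    return offset_ns / (duration_s * 1000.0);
}

/* Seconds needed to remove offset_ns when slewing at a fixed max_ppm. */
double slew_duration_s(double offset_ns, double max_ppm)
{
    double magnitude = offset_ns < 0.0 ? -offset_ns : offset_ns;

    assert(max_ppm > 0.0);
    return magnitude / (max_ppm * 1000.0);
}
```

For example, a 1 ms offset corrected over 10 seconds needs a 100 ppm
slew, and a 5 ms offset slewed at 500 ppm is gone after 10 seconds.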

> Given the simplicity of the 'bad shortcut', and the fact that we do
> want the kernel to follow the reference at *boot* time, I do think I'd
> like to have a mode for microvms which optionally *allows* the kernel
> to continue to track the reference for itself rather than having an
> extra userspace tool that is literally just polling on /dev/vmclock in
> order to feed precisely that same information back into the kernel
> directly.

Setting the values on boot in the kernel makes sense to me. There is
no loop involved. It follows the setting of the system clock from the
RTC.

> > I think a better solution is scaling of the clocksource, i.e. a layer
> > below the realtime clock. An additional multiplier applied in HW or
> > SW. That would address the problem for all system clocks, not just the
> > realtime clock. adjtimex() changes are applied on top of that, they
> > are not in conflict.
>
> But we literally already have a way to 'scale' the counter in order to
> derive CLOCK_MONOTONIC/CLOCK_REALTIME: the kernel's timekeeping code.
> Currently driven *only* by NTP/adjtimex().

I see that as a different purpose than guest migrations. A migrated
guest should have its clocksource frequency corrected while the clock
is controlled by NTP/PTP. If this mechanism was shared, that would not
be possible.

> Are you suggesting that the actual clocksource driver in the kernel for
> e.g. CSID_ARM_ARCH_COUNTER should *scale* the results it returns,
> instead of giving raw counter reads? So we have some NTP-like process
> to adjust each clocksource, in *addition* to the core kernel
> timekeeping?

Not so much NTP-like. There would be no mult dithering or phase
adjustments, only frequency.

> And then those skewed clocksource values are only
> meaningful under a seqlock like the existing kernel timekeeper values
> are valid under the tk_data.seq seqlock?

I guess you are implying here this SW-fallback scaling would have a
significant impact on the performance. Could it not be applied at the
same time as the normal multiplier in the conversion to nanoseconds?
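
What I have in mind is roughly the following, shown here as a simplified
userspace sketch and not as actual kernel code. The frequency-only
correction is folded into the multiplier once, when it changes, and the
per-read conversion stays a single multiply and shift. The names
scaled_mult() and corr_ppb are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Usual clocksource-style conversion: ns = (cycles * mult) >> shift.
 * The caller has to keep 'cycles' small enough to avoid overflow.
 */
uint64_t cycles_to_ns(uint64_t cycles, uint32_t mult, uint32_t shift)
{
    return (cycles * mult) >> shift;
}

/*
 * Hypothetical: fold a frequency correction in parts per billion into
 * the multiplier. Done only when the correction changes, not per read.
 */
uint32_t scaled_mult(uint32_t mult, int32_t corr_ppb)
{
    int64_t adj = (int64_t)mult * corr_ppb / 1000000000;
    int64_t result = (int64_t)mult + adj;

    assert(result > 0 && result <= UINT32_MAX);
    return (uint32_t)result;
}
```

The resolution of the correction is limited by the size of mult. With a
24-bit mult, as in the test values I tried, a 100 ppm correction is
truncated to about 99.96 ppm.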

> And would we have a separate way to get real value, to use for
> CLOCK_MONOTONIC_RAW?

All system clocks should be scaled, that's my point.

--
Miroslav Lichvar