Re: [RFC PATCH v2 0/8] timekeeping: Fix drift tracking precision and add feed-forward discipline via vmclock

From: David Woodhouse

Date: Wed May 27 2026 - 08:34:20 EST


On Wed, 2026-05-27 at 09:46 +0200, Miroslav Lichvar wrote:
> On Tue, May 26, 2026 at 11:00:28AM +0100, David Woodhouse wrote:
> > Let us assume that userspace, either from vmclock or direct discipline
> > of the arch counter against external sources, has:
> >   • Reference time T.
> >   • Arch counter value at time T.
> >   • Period of a single arch counter tick.
> >
> > This translates fairly directly into the kernel's tick_length and
> > time_offset. But *only* if you know cycle_interval, ntp_error and other
> > details. Which is why my timekeeping_set_reference() takes the
> > information in that form, and then translates it within the core
> > timekeeping.
> >
> > If you can show me how to do that with adjtimex(), that would be great.
>
> tick_length can be set by the adjtimex() modes ADJ_FREQUENCY (in
> scaled units of 1/65536 ppm up to 500 ppm) and ADJ_TICK (in
> microseconds per 1/USER_HZ tick).
>
> time_offset can be set by the ADJ_OFFSET mode. The PLL needs to be
> enabled first by setting the STA_PLL status (ADJ_STATUS mode) and also
> the STA_FREQHOLD flag needs to be set to avoid changing the PLL
> frequency.
>
> The ntp_error and other details need to be exposed to userspace. Maybe
> in the same API that will be used for reporting the time and frequency
> offsets between system clocks.
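
For concreteness, here is a minimal sketch of the adjtimex() request
described above, as I understand it. The helper names are mine and
purely illustrative. Combining all the modes in a single call and using
ADJ_NANO for nanosecond offsets are assumptions of the sketch. It only
fills in struct timex; actually calling adjtimex() with it would need
CAP_SYS_TIME.

```c
#include <assert.h>
#include <string.h>
#include <sys/timex.h>

/*
 * Convert parts-per-billion to the adjtimex() frequency unit of
 * 2^-16 ppm ("scaled ppm"). Hypothetical helper, not a kernel API.
 */
static long ppb_to_scaled_ppm(long ppb)
{
	return (long)(((long long)ppb * 65536) / 1000);
}

/*
 * Fill in a request that sets the frequency and a phase offset, with
 * the PLL enabled and STA_FREQHOLD set so that the offset does not
 * feed back into the PLL frequency. Assumption: all modes are sent
 * in one call, and ADJ_NANO selects nanosecond units for the offset.
 */
static void fill_freq_and_offset(struct timex *tx, long ppb, long offset_ns)
{
	memset(tx, 0, sizeof(*tx));
	tx->modes  = ADJ_STATUS | ADJ_NANO | ADJ_FREQUENCY | ADJ_OFFSET;
	tx->status = STA_PLL | STA_FREQHOLD;
	tx->freq   = ppb_to_scaled_ppm(ppb);
	tx->offset = offset_ns;
}
```

Note that nothing in that request names a counter value or a tick
boundary; the offset is simply applied from whenever the kernel happens
to process the call.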

I don't think that's enough. Consider the fact that I've just had to
apply a correction to my existing timekeeping_set_reference() proof of
concept to make it calculate and set time_offset for the moment of the
*next* tick, instead of at the *prior* tick:

--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -2443,13 +2443,17 @@ int timekeeping_set_reference(const struct tk_reference *ref)
 				32 + ref->period_shift);
 	ntp_set_tick_length(tk->id, new_tl);

-	/* Compute phase offset at cycle_last and set time_offset to slew */
-	delta = tk->tkr_mono.cycle_last - ref->counter_value;
+	/*
+	 * Compute phase offset at the *next* tick boundary, where the new
+	 * tick_length will first take effect. Using cycle_last would leave
+	 * a gap where the old mult accumulates additional phase error.
+	 */
+	delta = tk->tkr_mono.cycle_last + tk->cycle_interval - ref->counter_value;
 	ref_frac = mul_u64_u64_shr(delta, ref->period_frac_sec,
 				   ref->period_shift) + ref->time_frac_sec;
 	ref_err = (s64)mul_u64_u64_shr(ref_frac,
 			(u64)NSEC_PER_SEC << tk->tkr_mono.shift, 64) -
-		  (s64)tk->tkr_mono.xtime_nsec;
+		  (s64)(tk->tkr_mono.xtime_nsec + tk->xtime_interval);
 	ntp_set_time_offset(tk->id, ref_err >> tk->tkr_mono.shift);
 	tk->ntp_error = 0;
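
To illustrate what that hunk changes, here is a standalone userspace
sketch of the same arithmetic. The struct and function names are
hypothetical stand-ins, not the kernel's; it relies on the GCC/Clang
unsigned __int128 extension in place of mul_u64_u64_shr(), and it
ignores wrap-around at second boundaries.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Hypothetical, simplified stand-in for the reference passed in. */
struct ref_sketch {
	uint64_t counter_value;   /* counter reading at reference time T */
	uint64_t time_frac_sec;   /* fraction of T, in units of 2^-64 s */
	uint64_t period_frac_sec; /* counter period, 2^-(64+period_shift) s */
	unsigned period_shift;
};

/* Hypothetical, simplified stand-in for the timekeeper state. */
struct tk_sketch {
	uint64_t cycle_last;      /* counter value at the prior tick */
	uint64_t cycle_interval;  /* counter cycles per tick */
	uint64_t xtime_nsec;      /* ns << shift at the prior tick */
	uint64_t xtime_interval;  /* (ns << shift) added per tick */
	unsigned shift;
};

/* (a * b) >> shift with a 128-bit intermediate product. */
static uint64_t mul_shr_sketch(uint64_t a, uint64_t b, unsigned shift)
{
	return (uint64_t)(((unsigned __int128)a * b) >> shift);
}

/*
 * Phase error in ns between the reference and the timekeeper,
 * evaluated at the prior tick (at_next_tick == 0) or projected to
 * the next tick boundary (at_next_tick != 0).
 */
static int64_t phase_error_ns(const struct tk_sketch *tk,
			      const struct ref_sketch *ref, int at_next_tick)
{
	uint64_t cycles = tk->cycle_last;
	uint64_t xtime = tk->xtime_nsec;

	if (at_next_tick) {
		cycles += tk->cycle_interval;
		xtime += tk->xtime_interval;
	}

	uint64_t delta = cycles - ref->counter_value;
	uint64_t ref_frac = mul_shr_sketch(delta, ref->period_frac_sec,
					   ref->period_shift) +
			    ref->time_frac_sec;
	int64_t ref_err = (int64_t)mul_shr_sketch(ref_frac,
				NSEC_PER_SEC << tk->shift, 64) -
			  (int64_t)xtime;

	return ref_err >> tk->shift;
}
```

With a timekeeper that assumes 1 ns per cycle and a reference period of
1.0001 ns, a reference taken exactly at cycle_last shows zero error at
the prior tick, but roughly 100 ns at the next tick boundary one
million cycles later. That is the error the old mult accumulates before
the new tick_length takes effect.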


I just don't think we can do this from userspace, and I don't really
see the *need* to.

It seems cleaner just to have clock_set_time_reference(), which matches
what clock_get_time_reference() exports, instead of trying to shoehorn
it into the adjtimex API and force userspace to jump through hoops to
reverse-engineer things and apply racy adjustments.

> > As chrony introduces a change on the host, QEMU propagates that to the
> > guest (the vmclock: line is from QEMU), and the guest adjusts
> > accordingly. And then converges *really* slowly, as even setting the
> > time constant to 0 gives a half-life for time_offset of about 11
> > seconds.
>
> A simple linear slew would be better for this. The offset is accurate,
> there is no need for filtering.

Perhaps so, although I was trying to avoid making any real changes to
the core timekeeping other than fixing its accounting. In fact, if I set
the time constant to zero *and* set STA_NANO, that gives a half-life of
about 2.4 seconds, which should be fine.
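
Those half-life figures can be sanity-checked with a toy model. It
assumes that once per second the remaining offset is reduced by
(offset >> shift), i.e. a fixed fraction of 2^-shift. The function name
is hypothetical, and the shift values 4 and 2 used below are picked
because they reproduce the figures above, not read out of the kernel
source.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Count whole seconds until a positive offset has decayed to half of
 * its initial value, removing (offset >> shift) once per second.
 * Returns 0 if the per-second chunk rounds down to nothing, since the
 * offset would then never decay in this model.
 */
static unsigned seconds_to_halve(int64_t offset_ns, unsigned shift)
{
	int64_t target = offset_ns / 2;
	unsigned seconds = 0;

	while (offset_ns > target) {
		int64_t chunk = offset_ns >> shift;

		if (chunk == 0)
			return 0;
		offset_ns -= chunk;
		seconds++;
	}
	return seconds;
}
```

In continuous terms the half-life is ln(2) / -ln(1 - 2^-shift), which
is about 10.7 seconds for a shift of 4 and about 2.4 seconds for a
shift of 2. The discrete model above rounds those up to 11 and 3 whole
seconds respectively.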

> > > I think a better solution is scaling of the clocksource, i.e. a layer
> > > below the realtime clock. An additional multiplier applied in HW or
> > > SW. That would address the problem for all system clocks, not just the
> > > realtime clock. adjtimex() changes are applied on top of that, they
> > > are not in conflict.
> >
> > But we literally already have a way to 'scale' the counter in order to
> > derive CLOCK_MONOTONIC/CLOCK_REALTIME: the kernel's timekeeping code.
> > Currently driven *only* by NTP/adjtimex().
>
> I see that as a different purpose than guest migrations. A migrated
> guest should have its clocksource frequency corrected while the clock
> is controlled by NTP/PTP. If this mechanism was shared, that would not
> be possible.

If the *host* wants to use hardware frequency scaling to try to mask
the effects of live migration by making the effective frequency of the
TSC on the destination match the effective frequency of the TSC on the
source at the moment of migration, then that's a choice for the host.

I don't think it's likely to happen, as it brings a bunch of complexity
on the host side for relatively little benefit.

I don't think there's *any* chance of Linux ever doing the scaling of
the clocksources on the software side.

> > Are you suggesting that the actual clocksource driver in the kernel for
> > e.g. CSID_ARM_ARCH_COUNTER should *scale* the results it returns,
> > instead of giving raw counter reads? So we have some NTP-like process
> > to adjust each clocksource, in *addition* to the core kernel
> > timekeeping?
>
> Not so much NTP-like. There would be no mult dithering or phase
> adjustments, only frequency.

So clocksources would no longer be monotonic?

> > And then those skewed clocksource values are only
> > meaningful under a seqlock like the existing kernel timekeeper values
> > are valid under the tk_data.seq seqlock?
>
> I guess you are implying here this SW-fallback scaling would have a
> significant impact on the performance. Could it not be applied at the
> same time as the normal multiplier in the conversion to nanoseconds?
>
> > And would we have a separate way to get real value, to use for
> > CLOCK_MONOTONIC_RAW?
>
> All system clocks should be scaled, that's my point.

I'm not sure you'll achieve universal consensus on the concept that
CLOCK_MONOTONIC_RAW should be skewed.

I suspect it's best to ignore the special case of live migration for
the moment. Treat it like any other update from the host which adjusts
the frequency and phase_offset. It's up to the host to make it appear
as if the guest TSC continued to tick at the source frequency while the
guest was in the ether, and provide a vmclock update the moment it
starts on the new host, letting the guest know the new frequency. The
frequency adjustment is applied almost immediately (via an interrupt,
directly *within* the kernel in my proof of concept case), and the
resulting phase delta should be tiny.
