Re: [RFC PATCH 1/4] timekeeping: Remove xtime_remainder from ntp_error accumulation

From: John Stultz

Date: Mon May 18 2026 - 21:38:21 EST

On Sat, May 16, 2026 at 1:25 AM David Woodhouse <dwmw2@xxxxxxxxxxxxx> wrote:
>
> On Fri, 2026-05-15 at 15:49 -0700, John Stultz wrote:
> > On Wed, May 13, 2026 at 2:02 PM David Woodhouse <dwmw2@xxxxxxxxxxxxx> wrote:
> > >
> > > From: David Woodhouse <dwmw@xxxxxxxxxxxx>
> > >
> > > The ntp_error accumulator tracks the difference between intended and
> > > actual clock advance. Each tick it adds ntp_tick (the intended advance)
> > > and subtracts what the clock actually advanced.
> > >
> > > The subtraction was (xtime_interval + xtime_remainder), but only
> > > xtime_interval is actually added to xtime_nsec each tick.
> > > xtime_remainder was a boot-time constant representing the rounding error
> > > from converting the tick period to an integer number of counter cycles.
> > > It was never added to xtime_nsec, so subtracting it from ntp_error
> > > created a phantom credit that biased the dithering ratio.
> > >
> > > The effect is a systematic drift whose magnitude depends on the value of
> > > xtime_remainder and the NTP frequency correction. NTP masks this by
> > > continuously adjusting the frequency to compensate, but with a fixed
> > > frequency (or an external reference clock like vmclock), the drift is
> > > exposed.
> > >
> > > Also remove xtime_remainder from the mult computation in
> > > timekeeping_adjust(), which used it to offset the division for the same
> > > (incorrect) reason.
> > >
> > > Fixes: a386b5af8edd ("time: Compensate for rounding on odd-frequency clocksources")
> >
> > Hey, thanks for sending this out!
> >
> > Looks like it would be more accurate to say it is a revert then a fix?
> >
> > Hrm. So this is maybe too far back for my goldfish brain to really remember.
> >
> > My hazy sense of it was that for coarser-grained clocksources, there
> > will be an inherent granularity error for what we want to use as an
> > accumulation interval, and the hardware allows.
> >
> > So if we want a 1ms ntp interval, taking the hardware freq and finding
> > cycles/ms, might be something like 10.4 (imaginary example here), so
> > that rounds 10 cycles per interval which is really gives us a 962us
> > interval.
> >
> > Now, the 10.4 cycles/ms is still accurate, and the resulting
> > xtime_interval is still correct, but there is a resulting 38us
> > difference between the xtime_interval and the ntp_tick due to this
> > granularity error in the calculation., When we start accumulating
> > ntp_error, it will grow at that 38us/ms rate, causing the internal
> > logic to try to steer the clock faster to compensate.
> >
> > Thus, instead of assuming the ntp-interval is actually 1ms, we want it
> > to match the actual accumulation interval, so we subtract that error
> > off of the ntp-interval. That "phantom credit" is effectively
> > intended to address this inherent error (which is why it isn't applied
> > to the other side - since it's already there as part of the
> > calculation).
> >
> > Now, again, it's been far too long, and I may be misremembering
> > details, and the correction may very well be unnecessary or otherwise
> > erronious! But I think it deserves a bit more evidence/rationale in
> > the commit message as to why.
>
> Thanks. This has been making my brain hurt for most of the last week,
> but I think I finally have a handle on it.
>
> It looks like we track three different times (ignoring units):
>
> • A: The xtime that we actually output to the vDSO/etc.
>
> • B: xtime+ntp_error is the time we *want* to be outputting right now,
> but the mult dithering and monotonicity clawback keep us from it.

I think of B, or specifically ntp_error, as the delta between our time
and what *NTP* (well, the in-kernel ntp machinery) wants the system to
be right now.

As an aside, apologies if I'm asking obvious questions here, some of
your terminology is unfamiliar. While it's not used around the code or
patches, I can understand the dithering metaphor for the long-term
error adjustments to effectively allow for sub-integer mult
adjustments over time (similar to b&w dithering to approximate levels
of grey), but there is also the common use of indecisively dithering
time away or the astrophotography sense of intentionlally adding
noise, which I worry might cause some confusion to others as to what
you mean.

Also I'm not sure its very clear what you mean by "monotonicity clawback".

> • C: xtime+ntp_error+time_offset is the time we *eventually* want to
> output, once we've finished skewing towards it.

And yeah, I'd say C is where the userland NTP wants it to slew towards.

> On each tick:
>
> • xtime (A) advances by xtime_interval (and the clawback; there's
> another patch in my tree to account for that in ntp_error now).
>
> • (B) advances by whatever tick_length happens to be at the moment
> (adjusted by second_overflow to effect a skew).
>
> • (C) advances by the original tick_length_base actually set according
> to the adjtimex frequency.
>
> So ntp_error, being the delta between (B) and (A), needs to advance by
> tick_length - xtime_interval. Before this patch, xtime_remainder was
> *also* being subtracted from the 'what xtime advanced' side, but it
> isn't actually added to xtime; it *is* roughly the amount that needed
> to be accumulated in ntp_error here (except for the fact that
> xtime_remainder was calculated once at boot time and never updated).

Again, I'm sure it could be miscalculated, or be misapplied, but as I
mentioned previously, the xtime_remainder is trying to address a
granularity error that is effectively baked into the delta between
xtime_interval and the initial ntp interval (essentially the initial
ntp_tick), which doesn't seem to be addressed here.

For fine-grained clocksources like the TSC its not likely a big issue,
but for coarser grained clocksources it seems like just removing this
would be a regression.

> I spent a while booting kernels in QEMU with a VMClock reference clock
> precisely specifying the TSC to real-time relationship for (C), and
> tracking the *nanosecond* delta of the output from what it was told.
>
> I made it redundantly track (B) and (C) above as absolute values, so
> that I could spot per-tick when the accounting of the existing *deltas*
> in ntp_error and time_offset went astray (and compare C with the actual
> refclock, to check that too).
>
> In my test tree, xtime now correctly dithers precisely around the
> desired time (B) and doesn't continually drift from it any more.
>
> And I can even inject, say, a 10μs delta via time_offset and it'll skew
> by *exactly* 10,000ns and stay there, still with with single digit
> nanosecond divergence from where it's meant to be.

Sounds like very nice results!

> So now I don't *need* the external oracle to drive the dithering per
> tick for the reference clock (as I had added in patch 3 in this
> series), because the kernel can actually stay on the y=mx+c line
> configured by tick_length and ntp_error/time_offset all by itself, so
> all I have to do is set the *existing* parameters in
> timekeeping_set_reference().
>
> (Arithmetic precision would *eventually* catch up with it, of course,
> but in reality it won't be following a hard-coded refclock and the
> reference itself will be periodically adjusted as the real counter
> varies, and reclamping will occur.)

I'm excited your efforts seem to be doing well! The ntp/timekeeping
core has been pretty static for awhile now, outside of some of the ptp
work, so its nice to see new major contributions. That said, it being
in place for so long, and apparently working for folks, means changes
do need close consideration to make sure we don't introduce subtle
regressions.

thanks
-john