Re: [PATCH v6 4/7] timekeeping: Drive time_offset skew via per-tick ntp_error transfer

From: David Woodhouse

Date: Fri Jun 19 2026 - 06:50:11 EST

On Thu, 2026-06-18 at 20:40 -0700, John Stultz wrote:
> On Sun, Jun 14, 2026 at 7:40 AM David Woodhouse <dwmw2@xxxxxxxxxxxxx> wrote:
> >
> > From: David Woodhouse <dwmw@xxxxxxxxxxxx>
> >
> > Instead of inflating tick_length to effect the time_offset slew,
> > transfer the intended per-tick skew into ntp_error to achieve the
> > desired rate.
>
> Thanks again for sending this out! Apologies for my slow reviews!
>
> I'm wondering, maybe could you better express the problem with the
> current code that your patch improves? This feels a bit opaque to me
> initially.
>
> In my (admittedly goldfish-in-the-fog) sense of current code, we
> adjust consume tick_offset in chunks added to tick_length which then
> gets copied to tk->ntp_tick, which then modifies ntp_error.
>
> It seems you're switching it to be a bit more direct, but still
> introduces a new skew_delta value, which consumes from time_offset,
> and is then the skew_delta gets pulled out of ntp_error. So while
> looking at the code I can see you're handling things more carefully,
> its not immediately clear to what end. This isn't a critique of your
> code here, just in how the motivation for it is explained.

Yeah, this is the commit message I spent most time on — and the one
where I completely threw out what the AI had proposed and rewrote it
from scratch.

I'm *conceptually* changing the model of what it's doing entirely...
while not actually changing the implementation that much, once all the
details are settled. Which I think is why it hurts :)

The old code would just adjust tick_length by an amount equivalent to
the new skew_delta variable. It would *assume* that there are HZ ticks
per second, which isn't always true. And any overrun/underrun was also
lost to the noise. We just... ran with an inflated tick_length for a
bit longer than we meant to, and never accounted for that.

Conceptually, I'm dropping that model completely. I don't want to touch
tick_length, and I want the logarithmic_accumulation() code to be able
to *precisely* account the delta between the intended tick_length and
what xtime_interval actually delivered, as $DEITY intended.

So the new model is extremely simple: each tick, drain a bit from
time_offset, and add it to ntp_error. The timekeeping "wants" to drive
ntp_error down to zero via the ntp_err_mult 'dithering' anyway, so just
let it do its job to make xtime actually catch up.

Except...

If we *only* do that, then the ntp_err_mult dithering can't actually
reach the target, and ntp_error just grows and grows. Because it can
only choose between mult and mult+1, and can't skew hard enough.

(I have *considered* automatically adjusting ntp_err_mult according to
the *magnitude* of ntp_error, not just its sign. But that's something
to think about for another day; it looks we tried that once, and I'd
need to fully understand the history.)

So what we do in *this* patch is simply adjust 'mult' too, so that
{mult, mult+1} do correctly bracket the desired effective rate
(tick_length + skew_delta) and the dithering *can* keep ntp_error at
zero.

So that part is *very* similar in practice to what the old code did by
adjusting tick_length, which had broadly the same effect on mult and
the actual xtime progression. The real difference is in the
*accounting*, which is all now done from the tick side without any
overrun or underrun or assumptions about how many ticks it'll be
applied for.

> > In second_overflow(), calculate skew_delta which is the per-tick slew
> > rate, in the same units as time_offset: (ns << NTP_SCALE_SHIFT) / HZ.
> >
> > In logarithmic_accumulation(), drain up to 'skew_delta' time units from
> > time_offset into ntp_error to drive the overall effective rate.
> >
> > In timekeeping_adjust(), take skew_delta into account when calculating
> > 'mult', such that the available choices (mult, mult+1) bracket the
> > overall effective rate including the skew — otherwise the delta would
> > just build up in ntp_error.
> >
> > This give behaviour equivalent to the old tick_length += delta approach
> > but with exact per-tick accounting of the time_offset actually imparted
> > to the clock, which was previously *extremely* approximate.
>
> Clarifying this approximateness would probably help the reader.

In the subsequent time_adjust patch I did include an example —
requesting 5ms, only 4997.5µs got delivered. I don't have numbers to
hand for the time_offset case but it should be broadly similar as the
mechanism is the same. It's large enough that when I was working on
timekeeping_set_reference() to simply set the frequency and offset and
*expect* the clock to converge to the reference as requested... it
noticeably wasn't. Which is why I even started frowning at it in the
first place.

> > +               /* Reduce to per-tick, then floor. */
> > +               ntpdata->skew_delta = div_s64(off_chunk, NTP_INTERVAL_FREQ);
> > +               if (!ntpdata->skew_delta)
> > +                       ntpdata->skew_delta = (off_chunk > 0) ? 1 : -1;
>
> Maybe would using the signof() function you added above make sense here?

Hm, yes. The signof() thing was a late addition and I thought I'd
caught all its potential use sites. Will do that one too; thanks.

> > diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
> > index bdafd599413d..5fd06d94c4b5 100644
> > --- a/kernel/time/timekeeping.c
> > +++ b/kernel/time/timekeeping.c
> > @@ -2430,17 +2431,26 @@ static __always_inline void timekeeping_apply_adjustment(struct timekeeper *tk,
> > static void timekeeping_adjust(struct timekeeper *tk, s64 offset)
> > {
> >         u64 ntp_tl = ntp_tick_length(tk->id);
> > +       s64 skew = ntp_get_skew_delta(tk->id);
> >         u32 mult;
> >
> >         /*
> > -        * Determine the multiplier from the current NTP tick length.
> > -        * Avoid expensive division when the tick length doesn't change.
> > +        * Determine the multiplier from the current NTP tick length plus
> > +        * skew_delta. The skew biases mult so that ±1 dithering can deliver
> > +        * the time_offset slew rate. Recompute when either changes.
> >          */
> > -       if (likely(tk->ntp_tick == ntp_tl)) {
> > +       if (likely(tk->ntp_tick == ntp_tl && tk->skew_delta == skew)) {
> > +               /* Revert to the base mult rate. */
> >                 mult = tk->tkr_mono.mult - tk->ntp_err_mult;
> >         } else {
> >                 tk->ntp_tick = ntp_tl;
> > -               mult = div64_u64(tk->ntp_tick >> tk->ntp_error_shift,
> > +               tk->skew_delta = skew;
> > +               /*
> > +                * skew_delta is stored pre-divided by HZ (matching time_offset);
> > +                * scale it back up to the full per-tick rate for the mult bias.
> > +                */
> > +               skew *= NTP_INTERVAL_FREQ;
> > +               mult = div64_u64((tk->ntp_tick + skew) >> tk->ntp_error_shift,
> >                                  tk->cycle_interval);
> >         }
>
> So not really an issue, but just to make sure I understand: a side
> effect here is we may be doing the division more frequently? I guess
> potentially not much, since skew is /HZ so hopefully it lines up to
> the second_overflow, but it seems like since xtime_interval won't
> necessarily match NTP_INTERVAL_FREQ, you'd potentially have one
> "drained" iteration before the next second_overflow?

I don't think so. A *prior* iteration of this did have division in the
skew_delta draining code below. But then I pre-divided skew_delta by
NTP_INTERVAL_FREQ to avoid that, which is why we have to *multiply* to
get 'skew' here. The 'mult = div64_u64(...)' part is just the same as
before; it's just that we're adding 'skew' to the dividend.

And even that hasn't really changed from before; it's just that the
skew used to be added into ntp_tick by second_overflow() and now we're
adding it here instead so that we can precisely track the accounting.

>
> > @@ -2568,6 +2578,31 @@ static u64 logarithmic_accumulation(struct timekeeper *tk, u64 offset,
> >         tk->ntp_error += tk->ntp_tick << shift;
> >         tk->ntp_error -= tk->xtime_interval << (tk->ntp_error_shift + shift);
> >
> > +       /*
> > +        * The above accounting of ntp_error includes the part of clock
> > +        * skew which was *intentional*, imparted through deliberately
> > +        * adjusting 'mult' in timekeeping_adjust() taking skew_delta
> > +        * into account.
> > +        *
> > +        * Drain the intentional skew from time_offset, and readjust
> > +        * ntp_error by the amount that *could* actually be drained.
> > +        * This ensures that any *overshoot* is correctly left in
> > +        * ntp_error and will be correctly compensated for over time.
> > +        */
> > +       if (tk->skew_delta) {
> > +               /*
> > +                * skew_delta is stored pre-divided by HZ, matching time_offset,
> > +                * so drain it directly. Fold the amount actually drained back
> > +                * into ntp_error in full clock units (× NTP_INTERVAL_FREQ); any
> > +                * undrainable overshoot is left in ntp_error to be compensated
> > +                * by the dithering over subsequent ticks.
> > +                */
> > +               s64 drain = tk->skew_delta << shift;
> > +               s64 unclaimed = ntp_drain_time_offset(tk->id, drain);
>
> bikeshed nit: drain / unclaimed feels like mixing metaphors?
> consumed / remaining, maybe?

Heh. I've bikeshedded that one all by myself about three times already,
flipping the return of ntp_drain_time_offset() (and
ntp_drain_time_adjust()) in the next patch) between returning the
*drained* vs. returning the *undrained* amount. You're right, it's
still a bit opaque.

Those two drain functions are also in a different C unit and can't be
inlined without LTO, which makes me want to combine them — which could
*also* make a lot more readable in logarithmic_accumulation(). How
about making it look like this...

/* Accumulate error between NTP and clock interval */
tk->ntp_error += tk->ntp_tick << shift;
tk->ntp_error -= tk->xtime_interval << (tk->ntp_error_shift + shift);

/*
* When skewing, do so by adjusting ntp_error to impart an extra
* target delta into ntp_error per tick, limited to what can be
* drained from time_offset / time_adjust to avoid overshoot.
*
* The base 'mult' value was calculated with the skew taken into
* account, such that the per-tick choice of 'mult' vs. 'mult+1'
* allows for the desired effective rate and ntp_error does not
* grow unbounded.
*
* Once the full desired phase offset is delivered, any remaining
* skew imparted by the adjusted 'mult', accounted above, remains
* in ntp_error and will be compensated by the dithering over time.
*/
if (tk->skew_delta)
tk->ntp_error += ntp_drain_skew(tk->id, tk->skew_delta << shift,
shift) * NTP_INTERVAL_FREQ;

... and in ntp.c:

/*
* Drain one accumulation's worth of intentional skew as it is delivered.
*
* @amount is the total intentional per-tick skew for this accumulation
* (skew_delta << shift), in time_offset units (shifted_ns / HZ). The
* adjtime() linear share is taken from time_adjust first (capped at the
* MAX_TICKADJ rate, hence @shift), then the exponential remainder from
* time_offset. Returns the amount actually claimed (same ÷HZ units).
*
* The two helpers are local to this file and inline into here, so the
* timekeeping fast path makes one out-of-line call in the case where
* skew is being applied.
*/
s64 ntp_drain_skew(unsigned int tkid, s64 amount, unsigned int shift)
{
s64 unclaimed = ntp_drain_time_adjust(tkid, amount, shift);

unclaimed = ntp_drain_time_offset(tkid, unclaimed);

/*
* Return the amount actually drained from the intentional
* phase offset in time_offset and/or time_adjust.
*/
return amount - unclaimed;
}

I'll test that, then look at how to fold it back into the two existing
time_offset / time_adjust commits, or whether to make it an additional
cleanup on top. I would prefer the latter, if you're OK with that?

Attachment: smime.p7s
Description: S/MIME cryptographic signature