Re: [PATCH] Fix rolling back of CLOCK_MONOTONIC_COARSE
From: Thomas Gleixner
Date: Sat Mar 15 2025 - 05:06:42 EST
On Fri, Mar 14 2025 at 12:21, John Stultz wrote:
> On Thu, Mar 13, 2025 at 11:32 PM Lei Chen <lei.chen@xxxxxxxxxx> wrote:
> My suspicion is that if we are coming into timekeeping_advance() more
> frequently then cycle_interval cycles, than its possible we didn't
> actually accumulate anything, but had some left over ntp error that
> triggered a mult adjustment, causing a xtime_nsec to be decremented
> without the normal accumulation before that. Opening up a window for
> the inconsistency.
>
> The "if (offset < real_tk->cycle_interval && mode == TK_ADV_TICK)"
> check there is supposed to catch that, but with
> timekeeping_advance(TK_ADV_FREQ) it looks like during an ntp
> adjustment we can try to do the mult adjustment without accumulation.
Even if the delta is > cycle_interval in the normal TK_ADV_TICK case,
then the accumulation will always have a leftover offset.
After accumulation timekeeping_advance() invokes timekeeping_adjust(offset).
timekeeping_adjust() does multiplier adjustment based on [NTP] tick
length and ntp error. That's completely independent of TV_ADV_FREQ.
So _ANY_ adjustment of the multiplier has to adjust xtime_nsec so that
T1nsec[OLD] = xtime_sec[OLD] * NSEC_PER_SEC +
(xtime_nsec[OLD] + offset * mult[OLD]) >> shift;
T1nsec[NEW] = xtime_sec[NEW] * NSEC_PER_SEC +
(xtime_nsec[NEW] + offset * mult[NEW]) >> shift;
results in:
T1nsec[OLD] == T1nsec[NEW]
So while Lei's change fulfils that requirement by advancing xtime_nsec
so that this becomes:
T1nsec[NEW] = xtime_sec[NEW] * NSEC_PER_SEC +
xtime_nsec[NEW] >> shift;
it changes a fundamental property of the timekeeping code:
cycles_last has to be incremented by multiples of cycles_interval to
ensure that the accumulation, on which the NTP correction relies, has
a periodic base. That allows to steer the clock in a coordinated way
against ntp_tick_length and guarantees smooth error accumulation.
With Lei's changes this property is lost because every multiplicator
adjustment moves cycles_last around by the random offset which happened
to be left over in the accumulation. So the period of cycles_last
advancement gets lost and becomes a self modifying random number
generator depening on the size of the reminder (offset).
Now coming back to your suspicion that this is related to TK_ADV_FREQ.
That's one way to trigger it, but even the regular steering algorithm,
which smoothes the multplicator adjustment with
mult [+-] = 1
exposes this because there is no guarantee that the remaining offset is
small enough to be not visible. All it takes is:
(offset >> shift) >= 1
which means it's greater than or equal one nanosecond. Assume a clock
frequency of 1GHz. Then
mult ~= shift
So all it takes is to hit timekeeping_advance() at a point in time where
the reminder is becomes greater than shift, which is pretty much
guaranteed to happen with NOHZ on a fully idle system.
In the 1GHz clock frequency case and HZ=100 and the in use
shift = 1 << 23 = 8388608
this opens a window of ~1.62 milliseconds, which is easy enough to
hit. It does not even require NOHZ because in the NTP steering case the
period of the tick is not longer the same as the cycle_interval. Which
means that the reminder slightly increases per tick and hits the window
where the reminder becomes large enough roughly once per second.
I've instrumented timekeeping_advance() accordingly and the
trace_printk() nicely fills the trace buffer over time with xtime_nsec
going backwards. Except for the TK_ADV_FREQ case, this is always just
one nanosecond, but it's one nanosecond too much.
As I said in the previous mail, the fundamental flaw of the coarse time
getters is the assumption that xtime_sec + xtime_nsec does not go
backwards (except when the clock was set).
That's not the case and we can't change that without rewriting the
accumulation algorithm. I looked at the math required and it becomes a
total nightmare to compensate for that. I really don't have any
interrest in chasing the resulting NTP regressions for the next year
just for that.
Providing a seperate and accurate coarse_nsec value is easy enough and
does not touch any of the well established and correct accumulation and
steering code at all.
Thanks,
tglx