Re: [PATCH] Fix rolling back of CLOCK_MONOTONIC_COARSE
From: John Stultz
Date: Fri Mar 14 2025 - 15:21:58 EST
On Thu, Mar 13, 2025 at 11:32 PM Lei Chen <lei.chen@xxxxxxxxxx> wrote:
>
> Hi John,
> Thanks for your reply.
>
> On Fri, Mar 14, 2025 at 1:20 AM John Stultz <jstultz@xxxxxxxxxx> wrote:
> >
> > On Sun, Mar 9, 2025 at 8:00 PM Lei Chen <lei.chen@xxxxxxxxxx> wrote:
> > >
> > > timekeeping_apply_adjustment try to adjust tkr_mono.mult by @mult_adj.
> > > If @mult_adj > 0, tk->tkr_mono.xtime_nsec will be decreased by @offset.
> > > Then timekeeping_update flushes shadow_timekeeper to real tk and vdso
> > > data region. Then rolling back happens.
> > >
> > > The drawing below illustrates the reason why timekeeping_apply_adjustment
> > > descreases tk->tkr_mono.xtime_nsec.
> > >
> > > cycle_interval offset clock_delta
> > > x-----------------------x----------x---------------------------x
> > >
> > > P0 P1 P2 P3
> > >
> > > N(P) means the nano sec count at the point P.
> > >
> > > Assume timekeeping_apply_adjustment runs at P1, with unaccumulated
> > > cycles @offset. Then tkr_mono.mult is adjusted from M1 to M2.
> > >
> > > Since offset happens before tkr_mono.mult adjustment, so we want to
> > > achieve:
> > > N(P3) == offset * M1 + clock_delta * M2 + N(P1) -------- (1)
> > >
> > > But at P3, the code works as following:
> > > N(P3) := (offset + clock_delta) * M2 + N(P1)
> > > = offset * M2 + clock_delta * M2 + N(P1)
> > >
> > > Apprently, N(P3) goes away from equation (1). To correct it, N(P1)
> > > should be adjusted at P2:
> > > N(P1) -= offset * (M2 - M1)
> > >
> > > To fix this issue, the patch accumulates offset into tk, and export
> > > N(P2) to real tk and vdso.
> > >
> > > tk.tkr_mono := N(P2) = N(P1) + offset * M1
> > >
> > > Then at P3, we calculate N(P3) based on N(P2) instead of N(P1):
> > > N(P3) := N(P2) + clock_delta * M2
> > >
> > > Signed-off-by: Lei Chen <lei.chen@xxxxxxxxxx>
> >
> > Thanks for the email and the patch!
> >
> > So, I'm having a bit of a hard time understanding the issue you're
> > actually seeing. It seems to be that you're seeing
> > CLOCK_MONOTONIC_COARSE go backwards?
> >
> I'm sorry for that.
> Yes, it's CLOCK_MONOTONIC_COARSE that goes backwards.
>
> I hope the code flow can help to explain it.
>
> In user space, clock_gettime(CLOCK_MONOTONIC_COARSE) actually reads
> tk->xtime_sec and tk->tkr_mono.xtime_nsec.
>
> But when ntp calls adjtimex, the code works as following:
> do_adjtimex
> timekeeping_advance
> timekeeping_apply_adjustment
> tk->tkr_mono.xtime_nsec -= offset; ------------------- (1)
> timekeeping_update
> update_vsyscall -------------------------(2)
>
> At (1) , if offset > 0, xtime_nsec will go backwards.
> And after (2) CLOCK_MONOTONIC_COARSE will go backwards.
So, I understand we subtract offset from xtime_nsec, when the mult is
incremented, as this is necessary to avoid time inconsistencies with
the non-coarse clockids, since we have unaccumulated cycles in offset.
Briefly:
mult_2 = mult_1 + 1
xtime_nsec_1 + (mult_1 * offset) == xtime_nsec_2 + (mult_2 * offset)
== xtime_nsec_2 + (mult_1 +1) * offset)
== xtime_nsec_2 + (mult_1 * offset) + offset
Then cancelling from both sides:
xtime_nsec_1 == xtime_nsec_2 + offset
And re-arranging as such:
xtime_nsec_2 == xtime_nsec_1 - offset
So yeah, I see the concern, as when we are dealing with the _COARSE
clocks, we don't use the offset value. So the subtracting offset from
xtime_nsec initially seems problematic.
But the key here is the timekeeping_adjust() logic, which adjusts the
multiplier and is all done *after* we do the accumulation, adding
(possibly multiple) (mult*cycle_interval) intervals from the offset to
xtime_nsec.
After the accumulation, offset must be smaller than cycle_interval.
So the negative (un-multiplied) offset size adjustment to xtime_nsec
should be smaller than what was immediately before added to it. Both
accumulation and adjustment are done atomically together under the
tk_core.lock, so I'd not expect to see an inconsistency here.
My suspicion is that if we are coming into timekeeping_advance() more
frequently then cycle_interval cycles, than its possible we didn't
actually accumulate anything, but had some left over ntp error that
triggered a mult adjustment, causing a xtime_nsec to be decremented
without the normal accumulation before that. Opening up a window for
the inconsistency.
The "if (offset < real_tk->cycle_interval && mode == TK_ADV_TICK)"
check there is supposed to catch that, but with
timekeeping_advance(TK_ADV_FREQ) it looks like during an ntp
adjustment we can try to do the mult adjustment without accumulation.
Thomas just tells me now he's found a fix, so hopefully that will be
incoming soon. I'll probably be drafting my own approach just to make
sure we're aligned.
I've also found this is pretty easy to reproduce but unfortunately the
kselftest skew_consistency focused on MONOTONIC and didn't check
_COARSE clockids, so I'll try to tweak that test as well so that we
have better coverage for this case.
Thanks so much for finding and reporting this!
thanks
-john