Re: Extreme time jitter with suspend/resume cycles

From: Thomas Gleixner
Date: Thu Oct 05 2017 - 14:01:33 EST


Gabriel,

On Thu, 5 Oct 2017, Gabriel Beddingfield wrote:
> On Thu, Oct 5, 2017 at 4:01 AM, Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
> > i.e. the 32bit rollover of the clocksource. So, if the clocksource->read()
> > function returns a full 64bit counter value, then it must have protection
> > against observing the rollover independent of the clock which feeds that
> > counter. Of course the frequency changes the probability of observing it,
> > but still the read function must be protected against observing the
> > rollover unconditionally.
>
> Right, but isn't this what clocksource->mask is supposed to do? When we change
> the back-end frequency, we're still using the same front-end 32-bit register and
> we don't see the same jumps.

Right. That's what the mask should protect. I was assuming that this is one
of the fancy clocksources which expose two 32bit registers of a 64bit
counter and the rollover protection was missing. So that's not the
case. Good, or not so good :)
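
For reference, the kind of rollover protection I had in mind for a split
64bit counter is the usual hi/lo/hi read loop. A minimal sketch, assuming
two memory mapped 32bit registers; read_counter64() and its parameters are
made up for illustration and are not taken from any existing driver:

```c
#include <stdint.h>

/*
 * Hypothetical helper: read a 64bit counter exposed as two 32bit
 * registers. If the low word rolls over between the two reads of the
 * high word, the high word changes and the read is retried.
 */
static uint64_t read_counter64(const volatile uint32_t *hi,
			       const volatile uint32_t *lo)
{
	uint32_t h1, h2, l;

	do {
		h1 = *hi;	/* first sample of the upper 32 bits */
		l = *lo;	/* lower 32 bits */
		h2 = *hi;	/* upper 32 bits again, to detect a carry */
	} while (h1 != h2);

	return ((uint64_t)h1 << 32) | l;
}
```

A single 32bit register combined with the clocksource mask does not need
this, which is the situation here.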

> > Which SoC/clocksource driver are you talking about?
>
> NXP i.MX 6SoloX
> drivers/clocksource/timer-imx-gpt.c

So that clocksource driver looks correct. Do you have an idea in which
context this time jump happens? Does it happen when you exercise your high
frequency suspend/resume dance or is that happening just when you let the
machine run forever as well?

The timekeeping_resume() path definitely has an issue:

	cycle_now = tk_clock_read(&tk->tkr_mono);
	if ((clock->flags & CLOCK_SOURCE_SUSPEND_NONSTOP) &&
	    cycle_now > tk->tkr_mono.cycle_last) {

This works nicely for clocksources which won't wrap across suspend/resume,
but not for those which can. That cycle_now > cycle_last check should take
cs->mask into account ...
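
To illustrate the difference, here is a small userspace sketch, not a
patch. It compares the plain greater-than test with a masked subtraction
for a counter which wrapped once between the two reads. The helper names
naive_advanced() and masked_delta() are invented for this example:

```c
#include <stdbool.h>
#include <stdint.h>

/* The check as it stands: fails when the counter wrapped in between. */
static inline bool naive_advanced(uint64_t now, uint64_t last)
{
	return now > last;
}

/*
 * Hypothetical masked variant: cycles elapsed from 'last' to 'now' on a
 * counter whose valid bits are described by 'mask'. A single wrap is
 * handled by the modular arithmetic; multiple wraps are not detectable.
 */
static inline uint64_t masked_delta(uint64_t now, uint64_t last,
				    uint64_t mask)
{
	return (now - last) & mask;
}
```

With a 32bit mask, last = 0xFFFFFFF0 and now = 0x10, the plain comparison
claims that the clocksource did not advance, while the masked delta gives
0x20 cycles.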

Of course, for clocksources which can wrap within realistic suspend times,
and 36 hours might well count as one, this would need an extra sanity
check against an RTC to determine whether the wrap time has been exceeded.
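
A rough sketch of what such a sanity check could look like, again
userspace code with invented names (wrap_seconds(), counter_delta_usable())
and under the assumption that the RTC provides the suspend duration in
seconds. The 32768 Hz input clock used in the note below is an assumption
for the arithmetic, not something read out of the driver:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: seconds until a counter with the given mask
 * wraps when clocked at freq_hz. freq_hz must be non-zero.
 */
static inline uint64_t wrap_seconds(uint64_t mask, uint64_t freq_hz)
{
	/* mask + 1 would overflow for a full 64bit mask */
	if (mask == UINT64_MAX)
		return UINT64_MAX / freq_hz;
	return (mask + 1) / freq_hz;
}

/*
 * The masked counter delta can only be trusted when the suspend time
 * reported by the RTC is shorter than one wrap period.
 */
static inline bool counter_delta_usable(uint64_t rtc_suspend_secs,
					uint64_t mask, uint64_t freq_hz)
{
	return rtc_suspend_secs < wrap_seconds(mask, freq_hz);
}
```

For a 32bit counter at 32768 Hz this yields 2^32 / 32768 = 131072 seconds,
i.e. roughly 36.4 hours.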

I haven't thought it through whether that buggered check fully explains
what you are observing, but it's wrong nevertheless. John?

Thanks,

tglx