Re: [RFC][PATCH 4/4] time: Do leapsecond adjustment in gettime fastpaths

From: John Stultz
Date: Tue Jun 02 2015 - 11:53:08 EST


On Tue, Jun 2, 2015 at 2:01 AM, Ingo Molnar <mingo@xxxxxxxxxx> wrote:
>
> * John Stultz <john.stultz@xxxxxxxxxx> wrote:
>
>> Currently, leapsecond adjustments are done at tick time.
>>
>> As a result, the leapsecond was applied at the first timer
>> tick *after* the leapsecond (~1-10ms late depending on HZ),
>> rather then exactly on the second edge.
>>
>> This was in part historical from back when we were always
>> tick based, but correcting this since has been avoided since
>> it adds extra conditional checks in the gettime fastpath,
>> which has performance overhead.
>>
>> However, it was recently pointed out that ABS_TIME
>> CLOCK_REALTIME timers set for right after the leapsecond
>> could fire a second early, since some timers may be expired
>> before we trigger the timekeeping timer, which then applies
>> the leapsecond.
>>
>> This isn't quite as bad as it sounds, since behaviorally
>> it is similar to what is possible w/ ntpd made leapsecond
>> adjustments done w/o using the kernel discipline. Where
>> due to latencies, timers may fire just prior to the
>> settimeofday call. (Also, one should note that all
>> applications using CLOCK_REALTIME timers should always be
>> careful, since they are prone to quirks from settimeofday()
>> disturbances.)
>>
>> However, the purpose of having the kernel do the leap adjustment
>> is to avoid such latencies, so I think this is worth fixing.
>>
>> So in order to properly keep those timers from firing a second
>> early, this patch modifies the gettime accessors to do the
>> extra checks to apply the leapsecond adjustment on the second
>> edge. This prevents the timer core from expiring timers too
>> early.
>>
>> This patch does not handle VDSO time implementations, so
>> userspace using vdso gettime will still see the leapsecond
>> applied at the first timer tick after the leapsecond.
>> This is a bit of a tradeoff, since the performance impact
>> would be greatest to VDSO implementations, and since vdso
>> interfaces don't provide the TIME_OOP flag, one can't
>> distinquish the leapsecond from a time discontinuity (such
>> as settimeofday), so correcting the VDSO may not be as
>> important there.
>>
>> Apologies to Richard Cochran, who pushed for such a change
>> years ago, which I resisted due to the concerns about the
>> performance overhead.
>>
>> While I suspect this isn't extremely critical, folks who
>> care about strict leap-second correctness will likely
>> want to watch this, and it will likely be a -stable candidate.
>>
>> Cc: Prarit Bhargava <prarit@xxxxxxxxxx>
>> Cc: Daniel Bristot de Oliveira <bristot@xxxxxxxxxx>
>> Cc: Richard Cochran <richardcochran@xxxxxxxxx>
>> Cc: Jan Kara <jack@xxxxxxx>
>> Cc: Jiri Bohac <jbohac@xxxxxxx>
>> Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
>> Cc: Ingo Molnar <mingo@xxxxxxxxxx>
>> Cc: Shuah Khan <shuahkh@xxxxxxxxxxxxxxx>
>> Originally-suggested-by: Richard Cochran <richardcochran@xxxxxxxxx>
>> Reported-by: Daniel Bristot de Oliveira <bristot@xxxxxxxxxx>
>> Reported-by: Prarit Bhargava <prarit@xxxxxxxxxx>
>> Signed-off-by: John Stultz <john.stultz@xxxxxxxxxx>
>> ---
>> include/linux/time64.h | 1 +
>> include/linux/timekeeper_internal.h | 7 +++
>> kernel/time/ntp.c | 73 +++++++++++++++++++++++++---
>> kernel/time/ntp_internal.h | 1 +
>> kernel/time/timekeeping.c | 97 ++++++++++++++++++++++++++++++++-----
>> 5 files changed, 159 insertions(+), 20 deletions(-)
>
> So I don't like the complexity of this at all: why do we add over 100 lines of
> code for something that occurs (literally) once in a blue moon?

So yea. I very much felt the same way before the early timer
expiration issue came up.


> ... and for that reason I'm not surprised at all that it broke in non-obvious
> ways.
>
> Instead of having these super rare special events, how about implementing leap
> second smearing instead? That's far less radical and a lot easier to test as well,
> as it's a continuous mechanism. It will also confuse user-space a lot less,
> because there are no sudden time jumps.

So yea. Leap smearing/slewing is an attractive solution. The first
issue is that there's no standard yet for the range of time that the
slew occurs (or even if the slew is linear or a curve). The second is
I don't think we can actually get away from supporting UTC w/ leap, as
applications may depend on precision. Also things like NTP sync w/
mixed systems would be problematic, as NTPd and others need to become
savvy of which mode they are working with.

The leap smearing method of only doing it in private networks and
controlling it by the NTP server is becoming more widespread, but it
has its own problems, since it doesn't handle CLOCK_TAI properly, and
since CLOCK_REALTIME isn't yet frequency steerable separately from the
other clockids, this method ends up slowing down CLOCK_TAI and
CLOCK_MONOTONIC as well.

I'd like to try to get something working in the kernel so we could
support CLOCK_UTC and CLOCK_UTCSLS (smeared-leap-second) clockids,
then allow applications that care to migrate explicitly to the one
they care about. Possibly allowing CLOCK_REALTIME to be compile-time
directed to CLOCK_UTCSLS so that most applications that don't care can
just ignore it. But finding time to do this has been hard (if anyone
is interested in working on it, I'd be excited to hear!).

But if you think this patch is complicated, creating a new separately
steered clockid is not going to be trvial (as there will be lots of
ugly edge cases, like what if a leap second is cancelled mid-way
through the slewing adjustment, etc).

> Secondly, why is there a directional flag? I thought leap seconds can only be
> inserted.

A leap delete isn't likely to occur, but its supported by the adjtimex
interface. And given the irregularity of the earths rotation, I'm not
sure I'd rule it out completely.

> So all in one, the leap second code is fragile and complex - lets re-think the
> whole topic instead of complicating it even more ...

So the core complexity with this patch is that we're basically having
to do state-machine transitions in a read-only path (since the reads
may happen before the update path runs). Since there's a number of
read-paths, there's some duplication, and in some cases variance if
the read path exports more state (ie: adjtimex).

I do agree that the complexity of the time subsystem is getting hard
to manage. I'm at the point where I think we need to avoid keeping
duplicated timespec and ktime_t data (we can leave the ktime->timespec
caching to the VDSOs). That will help cut down the read paths a bit,
but will also simplify updates since we'll have less data to keep in
sync. How we manage the ntp state also needs a rework, since the
locking rules are getting too complex (bit me in an earlier version of
this patch), and we're in effect duplicating some of that state in the
timekeeper with this patch to handle the reads safely.

But even assuming all those changes were already made, I think we'd
still need something close to this patch.

thanks
-john
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/