Re: [RFC][PATCH 4/4] time: Do leapsecond adjustment in gettime fastpaths
From: Thomas Gleixner
Date: Mon Jun 08 2015 - 15:06:05 EST
On Mon, 8 Jun 2015, John Stultz wrote:
> On Sat, Jun 6, 2015 at 2:44 AM, Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
> > On Fri, 5 Jun 2015, Peter Zijlstra wrote:
> >
> >> On Fri, 2015-06-05 at 11:04 +0200, Richard Cochran wrote:
> >> > On Fri, Jun 05, 2015 at 09:29:13AM +0200, Peter Zijlstra wrote:
> >> > > That leaves the question: for whom is this exact second edge important?
> >> >
> >> > Distributed applications using the UTC time scale.
> >> >
> >> > Many control applications are done with a 1 millisecond period.
> >> > Having the time wrong by a second for 10 or 100 loops is bad news.
> >>
> >> Firstly, I would strongly suggest such applications not use UTC because
> >> of this; I think TAI was invented for just this reason.
> >>
> >> Secondly, how would John's patches help with this? Userspace loops
> >> reading time would be using the VDSO and would still not get the right
> >> time, and timers would be subject to the same IRQ latency as an hrtimer
> >> based leap second insert, and would still very much not be in sync
> >> across the cluster.
> >
> > So the only things which are fixed are in-kernel users and therefore
> > hrtimers.
>
> Well, for vdso systems, hrtimers and adjtimex (which is the only
> interface that provides enough information to understand where in a
> leapsecond you actually are).
>
> And again, vdsos are fixable, but I hesitated due to my concerns about
> the extra performance overhead and the smaller benefit it provides
> relative to not having timers expiring early.
Right, and I'm concerned about the extra overhead of your patch. Just
look at the cache layout.
Current:
struct timekeeper {
struct tk_read_base tkr_mono; /* 0 56 */
struct tk_read_base tkr_raw; /* 56 56 */
/* --- cacheline 1 boundary (64 bytes) was 48 bytes ago --- */
u64 xtime_sec; /* 112 8 */
long unsigned int ktime_sec; /* 120 8 */
/* --- cacheline 2 boundary (128 bytes) --- */
struct timespec wall_to_monotonic; /* 128 16 */
ktime_t offs_real; /* 144 8 */
ktime_t offs_boot; /* 152 8 */
ktime_t offs_tai; /* 160 8 */
s32 tai_offset; /* 168 4 */
unsigned int clock_was_set_seq; /* 172 4 */
struct timespec raw_time; /* 176 16 */
/* --- cacheline 3 boundary (192 bytes) --- */
cycle_t cycle_interval; /* 192 8 */
u64 xtime_interval; /* 200 8 */
s64 xtime_remainder; /* 208 8 */
u32 raw_interval; /* 216 4 */
/* XXX 4 bytes hole, try to pack */
u64 ntp_tick; /* 224 8 */
s64 ntp_error; /* 232 8 */
u32 ntp_error_shift; /* 240 4 */
u32 ntp_err_mult; /* 244 4 */
};
Four cache lines, and everything which is considered hot path sits in
the first two.
With your change this becomes:
struct timekeeper {
struct tk_read_base tkr_mono; /* 0 56 */
struct tk_read_base tkr_raw; /* 56 56 */
/* --- cacheline 1 boundary (64 bytes) was 48 bytes ago --- */
u64 xtime_sec; /* 112 8 */
long unsigned int ktime_sec; /* 120 8 */
/* --- cacheline 2 boundary (128 bytes) --- */
struct timespec wall_to_monotonic; /* 128 16 */
ktime_t offs_real; /* 144 8 */
ktime_t offs_boot; /* 152 8 */
ktime_t offs_tai; /* 160 8 */
s32 tai_offset; /* 168 4 */
unsigned int clock_was_set_seq; /* 172 4 */
struct timespec raw_time; /* 176 16 */
/* --- cacheline 3 boundary (192 bytes) --- */
cycle_t cycle_interval; /* 192 8 */
u64 xtime_interval; /* 200 8 */
s64 xtime_remainder; /* 208 8 */
u32 raw_interval; /* 216 4 */
/* XXX 4 bytes hole, try to pack */
u64 ntp_tick; /* 224 8 */
s64 ntp_error; /* 232 8 */
u32 ntp_error_shift; /* 240 4 */
u32 ntp_err_mult; /* 244 4 */
time64_t next_leap_sec; /* 248 8 */
/* --- cacheline 4 boundary (256 bytes) --- */
ktime_t next_leap_ktime; /* 256 8 */
int leap_direction; /* 264 4 */
};
That's 5 cache lines, but that's not the worst thing.
For every readout of CLOCK_REALTIME, which affects in-kernel callers
AND all systems which lack a VDSO, we have to load TWO more cache
lines.
We could mitigate that partially by moving the leap second stuff into
the 3rd cacheline, but you also add a conditional into every readout.
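
As I read your patch, every readout grows something of the following
shape. This is just a simplified sketch using the field names from the
layout above, not the actual code; it only handles leap insertion and
leaves out the clocksource delta for brevity:

static ktime_t ktime_get_real_sketch(struct timekeeper *tk)
{
        ktime_t now = ktime_add(tk->tkr_mono.base, tk->offs_real);

        /*
         * The extra conditional on every readout.  next_leap_ktime and
         * leap_direction live in cachelines 3 and 4, so the fast path
         * drags in two more cachelines just for this check.
         */
        if (unlikely(tk->leap_direction > 0 &&
                     ktime_compare(now, tk->next_leap_ktime) >= 0))
                now = ktime_sub(now, ktime_set(1, 0));  /* replay 23:59:59 */

        return now;
}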
Did you try to measure the impact of your changes on the clock_gettime()
syscall path, i.e. with the VDSO disabled?
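
Something as trivial as the following loop gives a rough per call
number. Purely illustrative, and it forces the syscall so the VDSO is
out of the picture:

#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

int main(void)
{
        struct timespec ts, start, end;
        const long loops = 10 * 1000 * 1000;
        long i;
        double ns;

        clock_gettime(CLOCK_MONOTONIC, &start);
        for (i = 0; i < loops; i++)
                syscall(SYS_clock_gettime, CLOCK_REALTIME, &ts);
        clock_gettime(CLOCK_MONOTONIC, &end);

        ns = (end.tv_sec - start.tv_sec) * 1e9 +
             (end.tv_nsec - start.tv_nsec);
        printf("%.1f ns per CLOCK_REALTIME readout\n", ns / loops);
        return 0;
}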
> > That means the whole leap mess added into the gettime fast path is
> > just constant overhead for that corner case.
> >
> > We can be smarter than that and just block hrtimer delivery for clock
> > realtime timers across the leap edge. There should be ways to do that
> > non intrusive if we think hard enough about it.
>
> This approach seems like it would add more work to the timer-add
> function (to check the leap state and adjust the expiration), which
> might be a heavier use case (we adjust for each timer) than the similar
> logic done in update_base_offsets_now() at hrtimer_interrupt time
> (adjust for each hrtimer_interrupt).
No, certainly not in the timer add function. How should that work at
all? If we arm the timer to expire 24 hours from now, how should we
know about the leap state at expiry time?
> Now, it could be possible to do a lighter weight version of my patch,
> which does the adjustment only for the hrtimer_interrupt code
> (leaving the rest of the read paths alone).
Yes, that should work, as long as I can keep the cached values in the
hrtimer cpu bases and the whole thing keeps the clock_was_set_seq
logic intact.
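
Roughly what I have in mind for the cached variant, as a sketch only:
the offsets stay in the per cpu hrtimer clock bases and get refreshed
from the timekeeper only when clock_was_set_seq has moved on. Locking
and how we get at the timekeeper values are elided, and a
clock_was_set_seq copy in the cpu base is assumed:

static void hrtimer_refresh_offsets_sketch(struct hrtimer_cpu_base *base,
                                           const struct timekeeper *tk)
{
        /* Cached offsets still valid, no extra cachelines touched. */
        if (base->clock_was_set_seq == tk->clock_was_set_seq)
                return;

        base->clock_base[HRTIMER_BASE_REALTIME].offset = tk->offs_real;
        base->clock_base[HRTIMER_BASE_BOOTTIME].offset = tk->offs_boot;
        base->clock_base[HRTIMER_BASE_TAI].offset      = tk->offs_tai;
        base->clock_was_set_seq = tk->clock_was_set_seq;
}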
If we do not do the conditional version, then on every hrtimer
interrupt we write THREE cachelines for nothing.
And if we cannot cache the offsets, then we end up calling into the
timekeeping code for every timer which is not CLOCK_MONOTONIC based to
retrieve the offset. That hurts especially on 32bit machines, because
we need to protect the readout with the timekeeper sequence counter.
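
The uncached variant would look roughly like this inside the
timekeeping code, sequence loop and all. Again just an illustrative
sketch:

static ktime_t get_offs_real_uncached_sketch(void)
{
        ktime_t offs;
        unsigned int seq;

        /*
         * A 64bit ktime_t cannot be read atomically on 32bit, so every
         * non CLOCK_MONOTONIC timer has to take the sequence loop.
         */
        do {
                seq = read_seqcount_begin(&tk_core.seq);
                offs = tk_core.timekeeper.offs_real;
        } while (read_seqcount_retry(&tk_core.seq, seq));

        return offs;
}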
> If that is something you'd prefer, I'll have to think a bit to
> ensure the internal inconsistency isn't otherwise problematic.
>
> Though I suspect fixing adjtimex is worth it as well, since it's really
> the only interface that can provide a sane view of the leapsecond, and
> it isn't normally considered a hot path.
I'm not worried about adjtimex at all. You can do whatever you want
there, but please leave the fast paths alone.
Thanks,
tglx