[RFC patch 0/8] timekeeping: Implement shadow timekeeper to shorten in kernel reader side blocking

From: Thomas Gleixner
Date: Thu Feb 21 2013 - 17:53:38 EST


The vsyscall based timekeeping interfaces for userspace provide the
shortest possible reader side blocking (update of the vsyscall gtod
data structure), but the kernel side interfaces to timekeeping are
blocked over the full code sequence of calculating update_wall_time()
magic which can be rather "long" due to ntp, corner cases, etc...

Eric did some work a few years ago to disentangle the seqcount write
hold from the spinlock which is serializing the potential updaters of
the kernel internal timekeeper data. I couldn't be bothered to reread
the old mail thread and figure out why this got turned down, but I
remember that there were objections due to the potential inconsistency
between calculation, update and observation.

In hindsight that's nonsense, because even back at that time we did
the vsyscall update at the very last moment and unsynchronized to the
in kernel data update.
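
To make the idea of separating the updater serialization from the
seqcount write hold concrete, here is a minimal userspace sketch. It
is not the code from this series. All names (tk_data, tk_shadow,
tk_live, tk_publish() and friends) are made up for illustration, a
pthread mutex stands in for the spinlock and C11 atomics stand in for
the kernel seqcount. The updater does the long calculation on a
private shadow copy and holds the sequence count odd only while
copying the result to the data the readers look at.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical, heavily reduced stand-in for the timekeeper data. */
struct tk_data {
	uint64_t base_ns;	/* time at cycle_last */
	uint64_t cycle_last;	/* clocksource value at the last update */
	uint32_t mult;
	uint32_t shift;
};

/* Reader visible copy. Atomic fields keep the C11 sketch race free. */
struct tk_live_data {
	_Atomic uint64_t base_ns;
	_Atomic uint64_t cycle_last;
	_Atomic uint32_t mult;
	_Atomic uint32_t shift;
};

static pthread_mutex_t tk_lock = PTHREAD_MUTEX_INITIALIZER; /* updaters only */
static atomic_uint tk_seq;		/* odd while tk_live is being written */
static struct tk_live_data tk_live;	/* what readers observe */
static struct tk_data tk_shadow;	/* what updaters calculate on */

/* Short write hold: only the copy to tk_live is covered by tk_seq. */
static void tk_publish(const struct tk_data *src)
{
	unsigned int seq = atomic_load_explicit(&tk_seq, memory_order_relaxed);

	atomic_store_explicit(&tk_seq, seq + 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&tk_live.base_ns, src->base_ns, memory_order_relaxed);
	atomic_store_explicit(&tk_live.cycle_last, src->cycle_last, memory_order_relaxed);
	atomic_store_explicit(&tk_live.mult, src->mult, memory_order_relaxed);
	atomic_store_explicit(&tk_live.shift, src->shift, memory_order_relaxed);
	atomic_store_explicit(&tk_seq, seq + 2, memory_order_release);
}

void tk_init(uint32_t mult, uint32_t shift)
{
	pthread_mutex_lock(&tk_lock);
	tk_shadow = (struct tk_data){ .mult = mult, .shift = shift };
	tk_publish(&tk_shadow);
	pthread_mutex_unlock(&tk_lock);
}

void tk_update(uint64_t now_cycles, uint32_t new_mult)
{
	pthread_mutex_lock(&tk_lock);
	/* Potentially long calculation. Readers are not blocked here. */
	uint64_t delta = now_cycles - tk_shadow.cycle_last;
	tk_shadow.base_ns += (delta * tk_shadow.mult) >> tk_shadow.shift;
	tk_shadow.cycle_last = now_cycles;
	tk_shadow.mult = new_mult;
	/* Readers retry only for the duration of this copy. */
	tk_publish(&tk_shadow);
	pthread_mutex_unlock(&tk_lock);
}

uint64_t tk_read_ns(uint64_t now_cycles)
{
	uint64_t base, last;
	uint32_t mult, shift;
	unsigned int s0, s1;

	do {
		s0 = atomic_load_explicit(&tk_seq, memory_order_acquire);
		base = atomic_load_explicit(&tk_live.base_ns, memory_order_relaxed);
		last = atomic_load_explicit(&tk_live.cycle_last, memory_order_relaxed);
		mult = atomic_load_explicit(&tk_live.mult, memory_order_relaxed);
		shift = atomic_load_explicit(&tk_live.shift, memory_order_relaxed);
		atomic_thread_fence(memory_order_acquire);
		s1 = atomic_load_explicit(&tk_seq, memory_order_relaxed);
	} while ((s0 & 1) || s0 != s1);

	assert(now_cycles >= last);
	return base + (((now_cycles - last) * mult) >> shift);
}
```

Between the end of the calculation and tk_publish() the readers still
see the old data, which is exactly the window between calculation,
update and observation mentioned above.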

While we never got any complaints about that, there is a real issue
versus virtualization:

     VCPU0                                      VCPU1

     update_wall_time()
       write_seqlock_irqsave(&tk->lock, flags);
       ....

Host schedules out VCPU0

Arbitrary delay

Host schedules in VCPU0
                                                __vdso_clock_gettime()#1
       update_vsyscall();
                                                __vdso_clock_gettime()#2

Depending on the length of the delay which kept VCPU0 away from
executing and depending on the direction of the ntp update of the
timekeeping variables, __vdso_clock_gettime()#2 can observe time going
backwards.
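
The arithmetic behind this can be sketched in a few lines of C. This
is a simplified model, not the vsyscall code: the struct and function
names are hypothetical, the numbers are deliberately exaggerated, and
it assumes the usual base + ((cycles - cycle_last) * mult >> shift)
conversion. The old snapshot is read long after the new one was
calculated, and the new snapshot carries a lower mult.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical reduced snapshot of the gtod data a reader uses. */
struct gtod_snapshot {
	uint64_t base_ns;	/* time at cycle_last */
	uint64_t cycle_last;	/* clocksource value at the update */
	uint32_t mult;
	uint32_t shift;
};

uint64_t snapshot_read_ns(const struct gtod_snapshot *s, uint64_t cycles)
{
	assert(cycles >= s->cycle_last);
	return s->base_ns + (((cycles - s->cycle_last) * s->mult) >> s->shift);
}

/* What the updater calculates when it samples the clocksource at 'cycles'. */
struct gtod_snapshot snapshot_advance(const struct gtod_snapshot *old,
				      uint64_t cycles, uint32_t new_mult)
{
	struct gtod_snapshot next = *old;

	next.base_ns = snapshot_read_ns(old, cycles);
	next.cycle_last = cycles;
	next.mult = new_mult;
	return next;
}

/* Returns the time difference between read #2 and read #1 in ns. */
int64_t backwards_step_demo(void)
{
	const struct gtod_snapshot old = {
		.base_ns = 0, .cycle_last = 0, .mult = 1024, .shift = 10,
	};
	/* VCPU0 calculates at cycle 1000 with a lowered mult, then gets
	 * scheduled out before the result is published. */
	const struct gtod_snapshot fresh = snapshot_advance(&old, 1000, 1000);

	/* Read #1 on VCPU1: 100000 cycles later, still using the old data. */
	uint64_t t1 = snapshot_read_ns(&old, 101000);
	/* Read #2 on VCPU1: 10 cycles after #1, now using the new data. */
	uint64_t t2 = snapshot_read_ns(&fresh, 101010);

	return (int64_t)t2 - (int64_t)t1;
}
```

With these made up values read #1 extrapolates the whole delay with
the old, larger mult, while read #2 covers nearly the same interval
with the smaller one, so the difference comes out negative. The longer
the delay, the larger the backwards step.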

You can reproduce that by pinning VCPU0 to physical core 0 and VCPU1
to physical core 1. Now remove all load from physical core 1 except
VCPU1, put massive load on physical core 0 and make sure that the
NTP adjustment lowers the mult factor. It's extremely hard to
reproduce, but it's possible.

So this patch series is going to expose the same issue to the kernel
side timekeeping. I'm not too worried about that, because

- it's extremely hard to trigger

- we are aware of the issue vs. vsyscalls already

- making the kernel behave the same way as vsyscall does not make
things worse

- John Stultz already has an idea how to fix it.
See https://lkml.org/lkml/2013/2/19/569

Fixing that is outside the scope of this patch series, but I want to
make sure that it's documented.

Now the obvious question whether this is worth the trouble can be
answered easily. Preempt-RT users and HPC folks have complained about
the long write hold time of the timekeeping seqcount for years and a
quick test on a preempt-RT enabled kernel shows that this series
lowers the maximum latency on the non-timekeeping cores from 8 to 4
microseconds. That's a whopping factor of 2. Definitely worth the
trouble!

Thanks,

tglx
---
include/linux/jiffies.h | 1
include/linux/timekeeper_internal.h | 4
kernel/time/tick-internal.h | 2
kernel/time/timekeeping.c | 176 +++++++++++++++++++++---------------
4 files changed, 107 insertions(+), 76 deletions(-)


