[PATCH] time, ntp: Do not update time_state in middle of leap second [v3]

From: Prarit Bhargava
Date: Thu Feb 12 2015 - 08:58:56 EST


During leap second insertion testing it was noticed that a small window
exists in which time_state can be reset to TIME_OK. This either causes
the leap second to not occur, or causes the leap second state machine to
fail, leaving time_state = TIME_INS at the end of the leap second.

The test did the following in userspace:

	tx.modes = ADJ_STATUS;
	tx.status = STA_INS;

	/* send leap second request */
	ret = adjtimex(&tx);

	/* Check adjtimex output every half second */
	now = tx.time.tv_sec;
	while (now < next_leap+2) {
		char buf[26];
		ret = adjtimex(&tx);

		ctime_r(&tx.time.tv_sec, buf);
		buf[strlen(buf)-1] = 0; /* remove trailing \n */

		printf("%s + %6ld us\t%s\n",
		       buf,
		       tx.time.tv_usec,
		       time_state_str(ret));
		now = tx.time.tv_sec;
		/* Sleep for another half second */
		ts.tv_sec = 0;
		ts.tv_nsec = NSEC_PER_SEC/2;
		clock_nanosleep(CLOCK_MONOTONIC, 0, &ts, NULL);
	}

which was intended to mimic the insertion of a leap second. A
successful run of the test would result in the time_state transitioning
from TIME_OK to TIME_INS, then to TIME_OOP when the leap second was
inserted, and then to TIME_WAIT when the leap second was completed. While
running this code, failures were seen in which the time_state remained
TIME_INS even though the leap second had occurred.
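
The test above calls time_state_str(), which is not shown. A minimal
sketch of such a helper follows, assuming it only maps the adjtimex()
return value to the name of the corresponding TIME_* constant from
<sys/timex.h>; the body is an illustration, not the helper from the
actual test program:

```c
#include <sys/timex.h>

/*
 * Hypothetical helper (not taken from the original test): map an
 * adjtimex() return value to the name of the matching TIME_* state.
 * Anything else, including the -1 error return, yields "UNKNOWN".
 */
static const char *time_state_str(int state)
{
	switch (state) {
	case TIME_OK:    return "TIME_OK";
	case TIME_INS:   return "TIME_INS";
	case TIME_DEL:   return "TIME_DEL";
	case TIME_OOP:   return "TIME_OOP";
	case TIME_WAIT:  return "TIME_WAIT";
	case TIME_ERROR: return "TIME_ERROR";
	default:         return "UNKNOWN";
	}
}
```

With a helper like this, a successful run prints TIME_OK, TIME_INS,
TIME_OOP and TIME_WAIT in that order.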

After some investigation it was noted that the test contained a small error:
it does not reinitialize tx.status, and so reissues STA_INS every 1/2
second. As a result of this broken test, the following failure was noticed.
The output below is a mix of kernel messages and output from the test
program; the remaining annotations are printk's in the code and my own
additional notes:

[ 942.952833] time_state [1] change from TIME_OK to TIME_INS

Fri Feb 13 18:59:51 2015 + 318126 us TIME_INS
Fri Feb 13 18:59:51 2015 + 818167 us TIME_INS
Fri Feb 13 18:59:52 2015 + 318208 us TIME_INS
Fri Feb 13 18:59:52 2015 + 818248 us TIME_INS
Fri Feb 13 18:59:53 2015 + 318290 us TIME_INS
Fri Feb 13 18:59:53 2015 + 818331 us TIME_INS
Fri Feb 13 18:59:54 2015 + 318372 us TIME_INS
Fri Feb 13 18:59:54 2015 + 818413 us TIME_INS
Fri Feb 13 18:59:55 2015 + 318454 us TIME_INS
Fri Feb 13 18:59:55 2015 + 818495 us TIME_INS
Fri Feb 13 18:59:56 2015 + 318534 us TIME_INS
Fri Feb 13 18:59:56 2015 + 818575 us TIME_INS
Fri Feb 13 18:59:57 2015 + 318617 us TIME_INS
Fri Feb 13 18:59:57 2015 + 818660 us TIME_INS
Fri Feb 13 18:59:58 2015 + 318702 us TIME_INS
Fri Feb 13 18:59:58 2015 + 818744 us TIME_INS
Fri Feb 13 18:59:59 2015 + 318785 us TIME_INS
Fri Feb 13 18:59:59 2015 + 818837 us TIME_INS

[ 952.953143] time_state [4] change from TIME_INS to TIME_OOP
[ 952.953150] Clock: inserting leap second 23:59:60 UTC
[ 953.299905] process_adj_status: insert_leap_sec[1223] setting time_state back
to TIME_OK [1, 1] <<< adjtimex() call every 1/2 second
[ 953.299913] time_state [9] change from TIME_OOP to TIME_OK

Fri Feb 13 18:59:59 2015 + 318878 us TIME_OK
Fri Feb 13 18:59:59 2015 + 818931 us TIME_OK

[ 954.064237] time_state [1] change from TIME_OK to TIME_INS

Fri Feb 13 19:00:00 2015 + 318972 us TIME_INS
Fri Feb 13 19:00:00 2015 + 819012 us TIME_INS
Fri Feb 13 19:00:01 2015 + 319051 us TIME_INS
Fri Feb 13 19:00:01 2015 + 819089 us TIME_INS
Fri Feb 13 19:00:02 2015 + 319128 us TIME_INS

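As previously stated, the time_state remains TIME_INS even though the leap
second has already occurred at 952.953150. The adjtimex() call at
953.299905 reset time_state from TIME_OOP to TIME_OK while the leap second
was still in progress, after which the state machine re-entered TIME_INS
at 954.064237.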

The test was changed to reset tx.status to 0 in the loop, and the test then
succeeded on every run, with the time_state ending in TIME_WAIT.

While this is highly unlikely to ever happen in the real world, it is
still something we should protect against: userspace should not be able
to break the leap second state machine.

If time_state == TIME_OOP (i.e., the leap second is in progress), do not
allow an external update to time_state in process_adj_status(). This will
prevent external adjtimex() calls from breaking the leap second state
machine.
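
The effect of the guard can be illustrated with a small userspace model.
This is a simplified sketch, not kernel code: struct ntp_model and
model_adj_status() are hypothetical names, and only the branch touched by
this patch is modelled.

```c
#include <sys/timex.h>

/* Hypothetical stand-in for the kernel's time_state/time_status globals. */
struct ntp_model {
	int time_state;		/* TIME_OK, TIME_INS, TIME_OOP, ... */
	int time_status;	/* STA_* bits */
};

/*
 * Model of the STA_PLL-cleared branch of process_adj_status() with the
 * patch applied: time_state is left alone while a leap second is in
 * progress (TIME_OOP), and reset to TIME_OK otherwise.
 */
static void model_adj_status(struct ntp_model *m, int new_status)
{
	if ((m->time_status & STA_PLL) && !(new_status & STA_PLL)) {
		if (m->time_state != TIME_OOP)
			m->time_state = TIME_OK;
		m->time_status = STA_UNSYNC;
	}
}
```

Calling model_adj_status() with time_state == TIME_OOP and STA_PLL being
cleared leaves time_state at TIME_OOP; from any other state the same call
still resets it to TIME_OK, as before the patch.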

[v2]: Only block time_state change when TIME_OOP
[v3]: Write a much more detailed explanation of the bug.

Signed-off-by: Prarit Bhargava <prarit@xxxxxxxxxx>
Cc: John Stultz <john.stultz@xxxxxxxxxx>
Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
Cc: Miroslav Lichvar <mlichvar@xxxxxxxxxx>
Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
---
kernel/time/ntp.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/time/ntp.c b/kernel/time/ntp.c
index 28bf91c..6ff5cd5 100644
--- a/kernel/time/ntp.c
+++ b/kernel/time/ntp.c
@@ -535,7 +535,8 @@ void ntp_notify_cmos_timer(void) { }
 static inline void process_adj_status(struct timex *txc, struct timespec64 *ts)
 {
 	if ((time_status & STA_PLL) && !(txc->status & STA_PLL)) {
-		time_state = TIME_OK;
+		if (time_state != TIME_OOP)
+			time_state = TIME_OK;
 		time_status = STA_UNSYNC;
 		/* restart PPS frequency calibration */
 		pps_reset_freq_interval();
--
1.7.9.3
