Ensuring wall_to_monotonic is not positive breaks use case
From: Rick Ratzel
Date: Wed Sep 05 2018 - 17:06:02 EST
Hello,
We have a use case that was broken by the commit e1d7ba873555 (time: Always make sure wall_to_monotonic isn't positive). We've been reverting the commit in our builds, but we'd greatly prefer a solution consistent with the mainline. We also think our use case isn't unique to us, and may become more common in the near future.
Our use case is as follows: we have devices that have no notion of traceable time and often boot up with a time value of 0 (the Epoch). These devices are networked and share time using protocols such as IEEE 1588 (PTP) or IEEE 802.1AS. These protocols automatically elect one device (the "grandmaster" in PTP speak) to act as the source of time for the network and transmit its time to the other "slave" devices. This common shared time is used to synchronize I/O operations across all devices, creating a distributed measurement or control system. The devices often interoperate with third-party devices that share time using the same protocol and may also boot up with a time very near the Epoch. We have no control over the third-party devices and cannot change the time they boot up with, or the standardized algorithm they use to elect a common grandmaster.
In this case, time is used only as a means to synchronize periodic operations, where stable monotonically-increasing counts (this also implies no leap seconds!) are all that's needed and traceability to a standardized timescale is not necessary.
The problem arises when a device that's been elected grandmaster is sending out time at or very near (maybe only a few seconds past) the Epoch, and a slave device has an uptime of, say, several minutes. With the commit applied, the slave's wall-clock time cannot be set below the Epoch plus its own uptime, so the slave will never be able to synchronize to the master in this situation: the master is sending out time values lower than that bound.
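To make the constraint concrete, here is a minimal userspace sketch of the condition as we understand it. It is not kernel code: the type ns_t and both function names are hypothetical, and it assumes wall_to_monotonic is simply the offset that converts wall time to monotonic time, which the commit requires to be zero or negative.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical simplified model: times are whole nanoseconds,
 * wall time counted from the Epoch, monotonic time from boot. */
typedef int64_t ns_t;

/* Offset that, added to wall time, yields monotonic time. */
static ns_t wall_to_monotonic(ns_t wall_ns, ns_t mono_ns)
{
    return mono_ns - wall_ns;
}

/* Assumed rule after the commit: a new wall time is accepted only if
 * the resulting offset is not positive, i.e. wall time >= uptime. */
static bool wall_time_accepted(ns_t new_wall_ns, ns_t mono_ns)
{
    return wall_to_monotonic(new_wall_ns, mono_ns) <= 0;
}
```

Under this model, a slave that has been up for 300 seconds cannot accept a grandmaster time of 5 seconds past the Epoch, but it can accept any time of 300 seconds or later.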
The presence of an RTC helps mitigate this situation, but only if the RTC has been set accordingly and its batteries have not failed. We cannot guarantee these conditions, and many of the networked devices participating will not even have RTCs.
We're looking for suggestions on how best to proceed with a new change that ideally both supports the use case described above and addresses the symptoms brought up in the initial commit (negative boot time causes get_expiry() to overflow time_t, and show_stat() uses "unsigned long" to print negative btime). Any thoughts on this would be greatly appreciated.
Link to initial commit:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e1d7ba8735551ed79c7a0463a042353574b96da3
Thanks,
Rick Ratzel - National Instruments