Unreliable 11-minute RTC sync

From: Miroslav Lichvar
Date: Wed Nov 27 2019 - 06:20:23 EST


When the system clock is synchronized (i.e. the STA_UNSYNC flag is
cleared by NTP/PTP), the kernel is expected to copy the system time to
the RTC every 11 minutes.

There are reports that it doesn't work. I checked some of my machines
and indeed some have their RTC off by more than a second. IIRC this
worked better a few years ago.

In order for the RTC to be set precisely, the update needs to happen at
a specific fraction of the second (e.g. 0.5s on x86_64). Originally, the
RTC was set only if the update ran within one jiffy of that point.
Later this requirement was relaxed to 5 jiffies. With current kernels
even that seems to be met rarely. The update seems to be consistently
late by tens of milliseconds, sometimes by hundreds of milliseconds.
The attempt is then repeated every second until, with some luck, an
update is on time. Apparently, this may take days or longer.
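
To illustrate the on-time check described above, here is a simplified
Python model. It is not the kernel code. HZ=250 and the function name
update_on_time are assumptions made up for the example; the 0.5s target
and the 5-jiffy tolerance are the values mentioned above.

```python
NSEC_PER_SEC = 1_000_000_000


def update_on_time(now_nsec, target_nsec=NSEC_PER_SEC // 2,
                   tolerance_jiffies=5, hz=250):
    """Return True if the sub-second part of now_nsec is within the
    tolerance of the target fraction of the second.

    hz=250 is an assumed tick rate, chosen only for illustration.
    """
    jiffy_nsec = NSEC_PER_SEC // hz
    # Distance between the two points on a circle of one second, so
    # that being slightly early or slightly late is treated the same.
    delta = abs(now_nsec % NSEC_PER_SEC - target_nsec % NSEC_PER_SEC)
    delta = min(delta, NSEC_PER_SEC - delta)
    return delta <= tolerance_jiffies * jiffy_nsec


# With HZ=250 a jiffy is 4 ms, so the 5-jiffy window is 20 ms wide on
# each side of the target.
print(update_on_time(510_000_000))  # 10 ms late
print(update_on_time(530_000_000))  # 30 ms late
```

In this model an update that is late by tens of milliseconds falls
outside the window, which matches the behavior I'm seeing.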

I'm not sure if workqueues changed how they behave, or they now have
more work to do, preventing the RTC update from being on time. I tried
switching to the non-power-efficient wq and also the high priority wq.
The former worked best in my tests, taking about 5 attempts on average
to make an update. I suspect that may be specific to this machine and
workload.

I'm not sure what would be the best fix.

Some ideas:
- relax the requirements on accuracy even more (e.g. 0.1 second)
- limit the number of retries (e.g. to 5) and force the update on the
last one, no matter how inaccurate it is
- measure the scheduling delay and try to compensate for it
- randomize the requested delay
- switch to a timer
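
A rough Python sketch of two of the ideas above (limiting the retries
and compensating for the measured delay), just to make them concrete.
The function names, the millisecond units and the 20 ms tolerance are
all assumptions for illustration. Nothing here is existing kernel code.

```python
def try_rtc_update(lateness_ms, tolerance_ms=20, max_retries=5):
    """Walk through successive attempts, given how late each one ran.

    Returns (attempt, lateness, forced), or None if the observations
    run out before an update is made. The update is forced on attempt
    max_retries, no matter how inaccurate it is.
    """
    for attempt, late in enumerate(lateness_ms, start=1):
        if abs(late) <= tolerance_ms:
            return attempt, late, False   # on time, normal update
        if attempt >= max_retries:
            return attempt, late, True    # last retry, force it
    return None


def compensated_delay(requested_ms, measured_lateness_ms):
    """Shorten the next requested delay by the lateness measured on
    the previous attempt, never going below zero."""
    return max(0, requested_ms - measured_lateness_ms)


print(try_rtc_update([30, 45, 120, 25, 60, 10]))
print(compensated_delay(1000, 30))
```

With the first idea the worst case is bounded at max_retries seconds
instead of days, at the cost of an RTC error as large as the lateness
of the forced attempt.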

Suggestions?

--
Miroslav Lichvar