Re: [PATCH] rtc: interface: Add rtc time jump debug in rtc_timer_do_work()
From: Alexandre Belloni
Date: Tue Jun 16 2026 - 02:58:08 EST
Hi,
On 16/06/2026 10:10:19+0800, Jinjie Ruan wrote:
>
>
> On 6/15/2026 11:22 PM, Alexandre Belloni wrote:
> > Hello,
> >
> > On 25/05/2026 21:08:25+0800, Jinjie Ruan wrote:
> >> In virtualization environments like QEMU [1], or during hardware
> >> clocksource anomalies, an extreme time-warp event can occur. When
> >> the system time abruptly jumps forward, the rtc_timer_do_work() handler
> >> falls into a prolonged processing loop to clear accumulated historical
> >> timers via timerqueue_getnext(). Running this loop indefinitely under
> >> the rtc->ops_lock mutex triggers a kernel softlockup, stalling
> >> the system.
> >>
> >> Introduce an adaptive telemetry and loop guard mechanism to enhance debug
> >> visibility and prevent softlockups:
> >>
> >> 1. Record `start_jiffies` upon entry and leverage `time_after()` to
> >> check if the loop has monopolized the CPU for more than 1s (HZ). If so,
> >> the handler prints a telemetry warning, triggers a WARN stack dump, and
> >> breaks the loop to safely yield the CPU.
> >>
> >> 2. Track the execution via a `loop_count` metric. Printing this counter
> >> in the warning log provides vital diagnostics to distinguish
> >> an aggressive time-warp storm (high count) from a bogged-down callback
> >> bug (low count).
> >>
> >> 3. Utilize the kernel format specifier `%ptR` to convert the raw ktime
> >> into a human-readable timestamp (YYYY-MM-DD HH:MM:SS), allowing
> >> developers to instantly pinpoint the exact boundary of the time
> >> jump in dmesg.
> >>
> >> This non-destructive telemetry guard provides precise hardware/emulator
> >> diagnostic visibility while ensuring core kernel availability.
> >>
> >> [1]: https://lore.kernel.org/all/20260114013257.3500578-1-ruanjinjie@xxxxxxxxxx/
> >> Signed-off-by: Jinjie Ruan <ruanjinjie@xxxxxxxxxx>
> >> ---
> >> drivers/rtc/interface.c | 15 +++++++++++++--
> >> 1 file changed, 13 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/drivers/rtc/interface.c b/drivers/rtc/interface.c
> >> index 1906f4884a83..f6c5fd16cc4e 100644
> >> --- a/drivers/rtc/interface.c
> >> +++ b/drivers/rtc/interface.c
> >> @@ -927,10 +927,12 @@ static void rtc_timer_remove(struct rtc_device *rtc, struct rtc_timer *timer)
> >> */
> >> void rtc_timer_do_work(struct work_struct *work)
> >> {
> >> - struct rtc_timer *timer;
> >> + unsigned long start_jiffies = jiffies;
> >> struct timerqueue_node *next;
> >> - ktime_t now;
> >> + struct rtc_timer *timer;
> >> struct rtc_time tm;
> >> + int loop_count = 0;
> >> + ktime_t now;
> >> int err;
> >>
> >> struct rtc_device *rtc =
> >> @@ -945,6 +947,15 @@ void rtc_timer_do_work(struct work_struct *work)
> >> }
> >> now = rtc_tm_to_ktime(tm);
> >> while ((next = timerqueue_getnext(&rtc->timerqueue))) {
> >> + loop_count++;
> >> +
> >> + if (unlikely(time_after(jiffies, start_jiffies + HZ))) {
> >> + dev_warn(&rtc->dev, "RTC time jump (loop: %d) to %ptR.\n",
> >> + loop_count, &tm);
> >> + WARN_ON_ONCE(1);
> >
> > So, your issue is that it is too slow so you make it even slower? There
> > are already plenty of tracepoints that allow proper debugging in this
> > loop, I'm pretty sure we don't want to bloat the kernel with more
> > messages.
>
> Hi, Alexandre,
>
> The point here is not about the performance of the rtc_timer_do_work()
> loop — it’s about making the problem debuggable when things go wrong.
> And we can put it under a debug Kconfig option, so production kernels
> see no extra overhead at all.
But then aren't the tracepoint enough? There are 3 tracepoints in the
loop that are exactly for debugging your issue.
>
> The patch is installed in the following scenarios:
>
> If the RTC hardware fails, or if the QEMU-emulated RTC device code in a
> KVM virtual machine has a problem (for example, the x86 RTC emulation
> hardware mc146818 has an overflow issue[1]), the time may jump as shown
> in the log below, which can cause a soft lockup.
This is not clear, you are not explaining the issue. You seem to mix
system time and RTC time. What I understand is that there was a periodic
timer enqueued and then for some reason, the system time jumped forward
by a large amount and now rtc_timer_do_work is firing events for each of
the missed timers. You are not explaining the relationship with the RTC
hardware (I see none).
>
> However, when the issue occurs, it is only possible to know that too
> many pending timers have accumulated in the timerqueue (for example, the
> log shows that tens of millions of timer nodes have been processed) by
> temporarily adding diagnostic code in rtc_timer_do_work().
>
This is not true, there are 3 tracepoints to know what is happening with
the timers. Also, there are always exactly zero, one or two timers in
the queue, never tens of millions.
> To determine whether the root cause is hardware, kernel RTC code, an RTC
> driver issue, or an RTC hardware problem, more debugging is needed. But
> if the problem is indeed caused by RTC hardware, adding a diagnostic
> print of the current RTC time when the loop takes too long (as this
> diagnostic patch does) would make it easy to tell whether QEMU or the
> hardware is faulty.
>
> [1]: https://lore.kernel.org/all/20260613195116.1807273-21-mjt@xxxxxxxxxx/
>
> kworker/0:1-37 [000] .N.. 489.159634: rtc_timer_do_work:
> timerqueue_getnext handle timer node count: 13281423
> kworker/0:1-37 [000] .N.. 489.159635: rtc_timer_do_work:
> timerqueue_getnext handle timer node count: 13281424
> kworker/0:1-37 [000] .N.. 489.159635: rtc_timer_do_work:
> timerqueue_getnext handle timer node count: 13281425
> kworker/0:1-37 [000] .N.. 489.159636: rtc_timer_do_work:
> timerqueue_getnext handle timer node count: 13281426
> kworker/0:1-37 [000] .N.. 489.159637: rtc_timer_do_work:
> timerqueue_getnext handle timer node count: 13281427
> kworker/0:1-37 [000] .N.. 489.159638: rtc_timer_do_work:
> timerqueue_getnext handle timer node count: 13281428
> kworker/0:1-37 [000] .N.. 489.159638: rtc_timer_do_work:
> timerqueue_getnext handle timer node count: 13281429
> kworker/0:1-37 [000] .N.. 489.159639: rtc_timer_do_work:
> timerqueue_getnext handle timer node count: 13281430
> kworker/0:1-37 [000] .N.. 489.159640: rtc_timer_do_work:
> timerqueue_getnext handle timer node count: 13281431
> kworker/0:1-37 [000] .N.. 489.159641: rtc_timer_do_work:
> timerqueue_getnext handle timer node count: 13281432
>
>
> swapper/0-1 [001] .N.. 11.579334: __rtc_read_time: rtc:
> 0xff11000109896800, ops->read_time:2026:01:05:09:36:21
> swapper/0-1 [001] .N.. 11.579421: __rtc_read_time: rtc:
> 0xff11000109896800, ops->read_time:2026:01:05:09:36:21
> swapper/0-1 [001] .N.. 11.579469: __rtc_read_time: rtc:
> 0xff11000109896800, ops->read_time:2026:01:05:09:36:21
> swapper/0-1 [001] .N.. 11.580816: __rtc_read_time: rtc:
> 0xff11000109896800, ops->read_time:2026:01:05:09:36:21
> syz-executor.5-7492 [003] .... 129.807406: __rtc_read_time: rtc:
> 0xff11000109896800, ops->read_time:2033:05:04:07:03:51
> syz-executor.5-7492 [003] .... 129.807419:
> __rtc_update_irq_enable.part.8: rtc uie on: 0xff11000109896800, now:
> 2033:05:04:07:03:51, expire: 2033:05:04:07:03:52
>
>
> Best regards,
> Jinjie
>
> >
> >
>
--
Alexandre Belloni, co-owner and COO, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com