Re: [patch 01/11] hrtimer: Add a mechanism to catch runaway timers

From: Dmitry Vyukov
Date: Fri Sep 24 2021 - 04:16:33 EST


On Thu, 23 Sept 2021 at 18:04, Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
>
> A recent report from syzbot unearthed a problem with self rearming timers
> which fire late and try to catch up to now with a short period. That causes
> the timer to be rearmed in the past until it eventually catches up with
> now. If that rearming happens from the timer callback the hard or soft
> interrupt expiry loop can run for a long time with either interrupts or
> bottom halves disabled which causes RCU stalls and other lockups.
>
> There is no safety net to stop or at least identify such runaway timers.
>
> Detection is trivial. Cache the pointer to the last expired timer. The next
> invocation from the same loop compares the pointer with the next expiring
> hrtimer pointer and if they match 10 times in a row (in the same hard or
> soft interrupt expiry instance) then it's reasonable to declare it as a
> runaway.
>
> In that case emit a warning and skip the callback invocation which stops
> the misbehaving timer right there.
>
> It's obviously incomplete, but it's definitely better than nothing and would
> have caught the reported issue in mac80211_hwsim.
>
> Suggested-by: Dmitry Vyukov <dvyukov@xxxxxxxxxx>
> Signed-off-by: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
> ---
> kernel/time/hrtimer.c | 25 +++++++++++++++++++++++++
> 1 file changed, 25 insertions(+)
>
> --- a/kernel/time/hrtimer.c
> +++ b/kernel/time/hrtimer.c
> @@ -1714,6 +1714,13 @@ static void __run_hrtimer(struct hrtimer
> base->running = NULL;
> }
>
> +static void hrtimer_del_runaway(struct hrtimer_clock_base *base,
> + struct hrtimer *timer)
> +{
> + __remove_hrtimer(timer, base, HRTIMER_STATE_INACTIVE, 0);
> + pr_warn("Runaway hrtimer %p %ps stopped\n", timer, timer->function);

Thanks for implementing this, Thomas.
Please use some standard kernel bug reporting facility here, e.g. WARN
I think will be appropriate. The ad-hoc format won't be recognized by
any testing system.
Otherwise looks good to me.

> +}
> +
> static void __hrtimer_run_queues(struct hrtimer_cpu_base *cpu_base, ktime_t now,
> unsigned long flags, unsigned int active_mask)
> {
> @@ -1722,6 +1729,8 @@ static void __hrtimer_run_queues(struct
>
> for_each_active_base(base, cpu_base, active) {
> struct timerqueue_node *node;
> + struct hrtimer *last = NULL;
> + unsigned int cnt = 0;
> ktime_t basenow;
>
> basenow = ktime_add(now, base->offset);
> @@ -1732,6 +1741,22 @@ static void __hrtimer_run_queues(struct
> timer = container_of(node, struct hrtimer, node);
>
> /*
> + * Catch timers which rearm themself with a expiry
> + * time in the past over and over which makes this
> + * loop run forever.
> + */
> + if (IS_ENABLED(CONFIG_DEBUG_OBJECTS_TIMERS)) {
> + if (unlikely(last == timer)) {
> + if (++cnt == 10) {
> + hrtimer_del_runaway(base, timer);
> + continue;
> + }
> + }
> + last = timer;
> + cnt = 0;
> + }
> +
> + /*
> * The immediate goal for using the softexpires is
> * minimizing wakeups, not running timers at the
> * earliest interrupt after their soft expiration.
>