Re: [syzbot] INFO: rcu detected stall in syscall_exit_to_user_mode
From: Dmitry Vyukov
Date: Tue Sep 14 2021 - 14:00:33 EST
On Tue, 14 Sept 2021 at 16:58, Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
>
> On Tue, Sep 14 2021 at 20:37, Hillf Danton wrote:
>
> > On Mon, 13 Sep 2021 12:28:14 +0200 Thomas Gleixner wrote:
> >>On Tue, Aug 31 2021 at 15:45, Hillf Danton wrote:
> >>> On Mon, 30 Aug 2021 12:58:58 +0200 Dmitry Vyukov wrote:
> >>>>> ieee80211_iterate_active_interfaces_atomic+0x70/0x180 net/mac80211/util.c:829
> >>>>> mac80211_hwsim_beacon+0xd5/0x1a0 drivers/net/wireless/mac80211_hwsim.c:1861
> >>>>> __run_hrtimer kernel/time/hrtimer.c:1537 [inline]
> >>>>> __hrtimer_run_queues+0x609/0xe50 kernel/time/hrtimer.c:1601
> >>>>> hrtimer_run_softirq+0x17b/0x360 kernel/time/hrtimer.c:1618
> >>>>> __do_softirq+0x29b/0x9c2 kernel/softirq.c:558
> >>>
> >>> Add debug info only to help kasan catch the timer running longer than 2 ticks.
> >>>
> >>> Is it anything in the right direction, tglx?
> >>
> >>Not really. As Dmitry pointed out this seems to be related to
> >
> > Thanks for taking a look.
> >
> >>mac80211_hwsim and if you look at the above stacktrace then how is
> >>adding something to the timer wheel helpful?
> >
> > Given the stall was printed on CPU1 while the supposedly offending timer was
> > expiring on CPU0, what was proposed is the lame debug info only for kasan to
> > catch the timer red handed.
> >
> > It is more appreciated if the tglx dude would likely spend a couple of minutes
> > giving us a lesson on the expertises needed for collecting evidence that any
> > timer runs longer than two ticks. It helps beyond the extent of kasan.
>
> That tglx dude already picked the relevant part of the stack trace (see
> also above):
>
> >>>>> ieee80211_iterate_active_interfaces_atomic+0x70/0x180 net/mac80211/util.c:829
> >>>>> mac80211_hwsim_beacon+0xd5/0x1a0 drivers/net/wireless/mac80211_hwsim.c:1861
> >>>>> __run_hrtimer kernel/time/hrtimer.c:1537 [inline]
> >>>>> __hrtimer_run_queues+0x609/0xe50 kernel/time/hrtimer.c:1601
> >>>>> hrtimer_run_softirq+0x17b/0x360 kernel/time/hrtimer.c:1618
> >>>>> __do_softirq+0x29b/0x9c2 kernel/softirq.c:558
>
> and then asked the question how a timer wheel timer runtime check
> helps. He just omitted the appendix "if the timer in question is a
> hrtimer" as he assumed that this is pretty obvious from the stack trace.
>
> Aside of that if the wireless timer callback runs in an endless loop,
> what is a runtime detection of that in the hrtimer softirq invocation
> helping to decode the problem if the stall detector catches it when it
> hangs there?
>
> Now that mac80211 hrtimer callback might actually be not the real
> problem. It's certainly containing a bunch of loops, but I couldn't find
> an endless loop there during a cursory inspection.
>
> But that callback does rearm the hrtimer and that made me look at
> hrtimer_run_queues() which might be the reason for the endless loop as
> it only terminates when there is no timer to expire anymore.
>
> Now what happens when the mac80211 callback rearms the timer so it
> expires immediately again:
>
> hrtimer_forward(&data->beacon_timer, hrtimer_get_expires(timer),
> ns_to_ktime(bcn_int * NSEC_PER_USEC));
>
> bcn is a user space controlled value. Now lets assume that bcn_int is <=1,
> which would certainly cause the loop in hrtimer_run_queues() to keeping
> looping forever.
>
> That should be easy to verify by implementing a simple test which
> reschedules a hrtimer from the callback with a expiry time close to now.
>
> Not today as I'm about to head home to fire up the pizza oven.
This question definitely shouldn't take priority over the pizza. But I
think I saw this "rearm a timer with a user-controlled value without
any checks" pattern lots of times and hangs are inherently harder to
localize and reproduce. So I wonder if it makes sense to add a debug
config that would catch such cases right when the timer is set up
(issue a WARNING)?
However, for automated testing there is the usual question of
balancing between false positives and false negatives. The check
should not produce false positives, but at the same time it should
catch [almost] all actual stalls so that they don't manifest as
duplicate stall reports.
If I understand it correctly the timer is not actually set up as
periodic, but rather each callback invocation arms it again. Setting
up a timer for 1 ns _once_ (or few times) is probably fine (right?),
so the check needs to be somewhat more elaborate and detect "infinite"
rearming.