Re: Something is leaking RCU holds from interrupt context

From: Matthew Wilcox
Date: Sun Apr 04 2021 - 14:25:06 EST


On Sun, Apr 04, 2021 at 09:48:08AM -0700, Paul E. McKenney wrote:
> On Sun, Apr 04, 2021 at 11:24:57AM +0100, Matthew Wilcox wrote:
> > On Sat, Apr 03, 2021 at 09:15:17PM -0700, syzbot wrote:
> > > HEAD commit: 2bb25b3a Merge tag 'mips-fixes_5.12_3' of git://git.kernel..
> > > git tree: upstream
> > > console output: https://syzkaller.appspot.com/x/log.txt?x=1284cc31d00000
> > > kernel config: https://syzkaller.appspot.com/x/.config?x=78ef1d159159890
> > > dashboard link: https://syzkaller.appspot.com/bug?extid=dde0cc33951735441301
> > >
> > > Unfortunately, I don't have any reproducer for this issue yet.
> > >
> > > IMPORTANT: if you fix the issue, please add the following tag to the commit:
> > > Reported-by: syzbot+dde0cc33951735441301@xxxxxxxxxxxxxxxxxxxxxxxxx
> > >
> > > WARNING: suspicious RCU usage
> > > 5.12.0-rc5-syzkaller #0 Not tainted
> > > -----------------------------
> > > kernel/sched/core.c:8294 Illegal context switch in RCU-bh read-side critical section!
> > >
> > > other info that might help us debug this:
> > >
> > >
> > > rcu_scheduler_active = 2, debug_locks = 0
> > > no locks held by systemd-udevd/4825.
> >
> > I think we have something that's taking the RCU read lock in
> > (soft?) interrupt context and not releasing it properly in all
> > situations. This thread doesn't have any locks recorded, but
> > lock_is_held(&rcu_bh_lock_map) is true.
> >
> > Is there some debugging code that could find this? eg should
> > lockdep_softirq_end() check that rcu_bh_lock_map is not held?
> > (if it's taken in process context, then BHs can't run, so if it's
> > held at softirq exit, then there's definitely a problem).
>
> Something like the (untested) patch below?

Maybe? Will this tell us who took the lock? I was really trying to
throw out a suggestion in the hope that somebody who knows this area
better than I do would tell me I was wrong.

> Please note that it does not make sense to also check for
> either rcu_lock_map or rcu_sched_lock_map because either of
> these might be held by the interrupted code.

Yes! Although if we do it somewhere like tasklet_action_common(),
we could do something like:

+++ b/kernel/softirq.c
@@ -774,6 +774,7 @@ static void tasklet_action_common(struct softirq_action *a,

 	while (list) {
 		struct tasklet_struct *t = list;
+		unsigned long rcu_lockdep = rcu_get_lockdep_state();

 		list = list->next;

@@ -790,6 +791,10 @@ static void tasklet_action_common(struct softirq_action *a,
 			}
 			tasklet_unlock(t);
 		}
+		if (rcu_lockdep != rcu_get_lockdep_state()) {
+			printk(something useful about t);
+			RCU_LOCKDEP_WARN(... something else useful ...);
+		}

 		local_irq_disable();

where rcu_get_lockdep_state() returns a bitmap recording which of the four
RCU lockdep maps are currently held.
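
To make the idea concrete, here is a minimal userspace sketch of what such a
helper might look like. It is a model, not kernel code: the lockdep_map_model
struct, model_lock_is_held(), model_acquire() and model_release() are
hypothetical stand-ins for the real lockdep machinery, and it assumes the
four maps in question are rcu_lock_map, rcu_bh_lock_map, rcu_sched_lock_map
and rcu_callback_map. rcu_get_lockdep_state() itself does not exist in the
kernel; it is the helper proposed above.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace stand-ins for the kernel's RCU lockdep maps (assumed set). */
enum rcu_map_id {
	RCU_MAP_LOCK,		/* models rcu_lock_map */
	RCU_MAP_BH,		/* models rcu_bh_lock_map */
	RCU_MAP_SCHED,		/* models rcu_sched_lock_map */
	RCU_MAP_CALLBACK,	/* models rcu_callback_map */
	RCU_MAP_COUNT
};

struct lockdep_map_model {
	int depth;		/* nesting depth; > 0 means "held" */
};

static struct lockdep_map_model rcu_maps[RCU_MAP_COUNT];

/* Hypothetical stand-in for lock_is_held(). */
static bool model_lock_is_held(const struct lockdep_map_model *map)
{
	return map->depth > 0;
}

/* Models entering a read-side critical section of the given flavour. */
static void model_acquire(enum rcu_map_id id)
{
	assert(id < RCU_MAP_COUNT);
	rcu_maps[id].depth++;
}

/* Models leaving a read-side critical section of the given flavour. */
static void model_release(enum rcu_map_id id)
{
	assert(id < RCU_MAP_COUNT && rcu_maps[id].depth > 0);
	rcu_maps[id].depth--;
}

/* The helper proposed in this mail: one bit per map, set if it is held. */
static unsigned long rcu_get_lockdep_state(void)
{
	unsigned long state = 0;
	int i;

	for (i = 0; i < RCU_MAP_COUNT; i++)
		if (model_lock_is_held(&rcu_maps[i]))
			state |= 1UL << i;
	return state;
}
```

One limitation of a plain bitmap: it only records held versus not held, so a
callback that leaks a nested acquisition of a map that was already held
before it ran would go unnoticed. Comparing nesting depths would catch that
case too.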

We might also need something similar in __do_softirq(), in case it's
not a tasklet that's the problem.
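
As a rough illustration of how a per-handler check could name the culprit,
the following userspace sketch snapshots the state before each handler runs
and compares it afterwards. Everything here is a hypothetical model:
rcu_bh_depth stands in for "rcu_bh_lock_map is held", the handlers are
invented for the example, and run_handlers_checked() only mimics the shape
of the handler loop rather than reproducing __do_softirq().

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Models the nesting depth of RCU-bh read-side sections (stand-in only). */
static int rcu_bh_depth;

typedef void (*softirq_handler_t)(void);

/* A well-behaved handler: every lock is paired with an unlock. */
static void balanced_handler(void)
{
	rcu_bh_depth++;		/* models rcu_read_lock_bh() */
	rcu_bh_depth--;		/* models rcu_read_unlock_bh() */
}

/* A buggy handler: returns with the read-side section still open. */
static void leaky_handler(void)
{
	rcu_bh_depth++;		/* models rcu_read_lock_bh(), never released */
}

/*
 * Run the handlers in order, comparing the state before and after each one.
 * Returns the index of the first handler that changed it, or -1 if none did.
 */
static int run_handlers_checked(const softirq_handler_t *vec, size_t n)
{
	size_t i;

	assert(vec != NULL || n == 0);

	for (i = 0; i < n; i++) {
		int before = rcu_bh_depth;

		vec[i]();
		if (rcu_bh_depth != before) {
			fprintf(stderr, "handler %zu changed RCU-bh depth: %d -> %d\n",
				i, before, rcu_bh_depth);
			return (int)i;
		}
	}
	return -1;
}
```

Because the comparison happens immediately after each handler returns, the
report points at the handler that leaked rather than at whatever code later
happens to trip over the stale state.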