Re: [syzbot] WARNING: suspicious RCU usage in copy_page_range

From: Dmitry Vyukov
Date: Wed Mar 31 2021 - 05:58:07 EST


On Wed, Mar 31, 2021 at 9:31 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> On Wed, Mar 31, 2021 at 08:11:38AM +0200, Dmitry Vyukov wrote:
> > On Wed, Mar 31, 2021 at 12:26 AM syzbot
> > <syzbot+1a33233ccd8201ec2322@xxxxxxxxxxxxxxxxxxxxxxxxx> wrote:
> > >
> > > Hello,
> > >
> > > syzbot found the following issue on:
> > >
> > > HEAD commit: db24726b Merge tag 'integrity-v5.12-fix' of git://git.kern..
> > > git tree: upstream
> > > console output: https://syzkaller.appspot.com/x/log.txt?x=16c16b7cd00000
> > > kernel config: https://syzkaller.appspot.com/x/.config?x=daeff30c2474a60f
> > > dashboard link: https://syzkaller.appspot.com/bug?extid=1a33233ccd8201ec2322
> > >
> > > Unfortunately, I don't have any reproducer for this issue yet.
> > >
> > > IMPORTANT: if you fix the issue, please add the following tag to the commit:
> > > Reported-by: syzbot+1a33233ccd8201ec2322@xxxxxxxxxxxxxxxxxxxxxxxxx
> >
> > I think this is a LOCKDEP issue. +LOCKDEP maintainers.
> >
> > Another bug happened on another thread ("WARNING: possible circular
> > locking dependency detected"). Lockdep disabled lock tracking
> > ("debug_locks = 0" in the report), which probably made it miss
> > rcu_unlock somewhere, but it did not turn off reporting yet and
> > produced the false positive first.
> >
> > I think if LOCKDEP disables lock tracking, it must also disable
> > reporting of issues that require lock tracking. That would avoid false
> > positives.
>
> Still early and brain hasn't really booted yet, but features that
> require lock tracking are supposed to check debug_locks.
>
> And afaict debug_lockdep_rcu_enabled(), which is called by
> RCU_LOCKDEP_WARN(), which is called by rcu_sleep_check() does just that.

Right... yet it somehow happens.
Looking at a dozen of reports, all with 2 concurrent lockdep splats
and "debug_locks = 0" in the report, I am pretty sure there is some
kind of race in lockdep.
I see there are at least 2 places where lockdep can falsely assume rcu
lock is held:
https://elixir.bootlin.com/linux/v5.12-rc5/source/kernel/locking/lockdep.c#L5543
https://elixir.bootlin.com/linux/v5.12-rc5/source/kernel/rcu/update.c#L105
both to "avoid false positives", but for "Illegal context switch in
RCU-bh read-side critical section" it can actually lead to false
positives, right?

Is there something else that turns off tracking before setting
debug_locks=0? Perhaps we get into that window where tracking is
disabled, but debug_locks is not reset yet?

lockdep_enabled() returns false if lockdep_recursion var is set:
https://elixir.bootlin.com/linux/v5.12-rc5/source/kernel/locking/lockdep.c#L87

but lockdep_lock() sets it _before_ taking the lock:
https://elixir.bootlin.com/linux/v5.12-rc5/source/kernel/locking/lockdep.c#L111

Is it possible that lockdep_recursion is set, then the task is
rescheduled and another task sees wrong value for lockdep_recursion?
Shouldn't lockdep_recursion be set _after_ arch_spin_unlock(&__lock)?
Though, I assume lockdep_lock() is called frequently and not only on
reports, so if my reasoning would be true, it would produce false
positives all the time, not necessary on concurrent reports... this
does not agree with the observed failure mode...