Re: srcu: BUG in __synchronize_srcu
From: Paul E. McKenney
Date: Sat Mar 11 2017 - 15:06:28 EST
On Sat, Mar 11, 2017 at 02:25:14PM +0000, Mathieu Desnoyers wrote:
> ----- On Mar 10, 2017, at 5:26 PM, Paul E. McKenney paulmck@xxxxxxxxxxxxxxxxxx wrote:
>
> > On Fri, Mar 10, 2017 at 08:29:55PM +0100, Andrey Konovalov wrote:
> >> On Fri, Mar 10, 2017 at 8:28 PM, Andrey Konovalov <andreyknvl@xxxxxxxxxx> wrote:
> >> > Hi,
> >> >
> >> > I've got the following error report while fuzzing the kernel with
> >> > syzkaller on an arm64 board.
> >>
> >> This also happened on x86 a few times during fuzzing, however it
> >> wasn't reproducible.
> >>
> >> >
> >> > On linux-next commit 56b8bad5e066c23e8fa273ef5fba50bd3da2ace8 (Mar 8).
> >> >
> >> > A reproducer and .config are attached.
> >> >
> >> > The bug happens while executing the following syzkaller program in a loop.
> >> > While it looks kvm-related, it might be that kvm just stresses the
> >> > srcu subsystem.
> >> >
> >> > mmap(&(0x7f0000000000/0xfff000)=nil, (0xfff000), 0x3, 0x32,
> >> > 0xffffffffffffffff, 0x0)
> >> > r0 = openat$kvm(0xffffffffffffff9c,
> >> > &(0x7f0000a05000)="2f6465762f6b766d00", 0x0, 0x0)
> >> > ioctl$KVM_CREATE_VM(r0, 0xae01, 0x0)
> >> >
> >> > ------------[ cut here ]------------
> >> > kernel BUG at kernel/rcu/srcu.c:436!
> >
> > This is on v4.10, correct?
>
> No, as stated above, this is on linux-next as of March 8, 2017.
>
> I'm currently looking at the diff of kernel/rcu/srcu.c between v4.10 and
> next-20170310. There are a few changes there.
OK, I must have ended up in the wrong place -- I found line 436 to be
a comment, which caused me to guess a different version.
> One of them is the introduction of the new srcu algorithm with separate
> lock vs unlock counters, which simplifies read lock/unlock by removing the
> sequence counter.
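For anyone following along at home, the counter scheme is conceptually
something like the standalone sketch below.  This is an illustration of
the idea only -- the names and layout are mine, not the actual linux-next
code, and all memory ordering is deliberately omitted:

        /*
         * Readers only ever increment; the grace-period side decides that
         * an index has drained once the unlock sum catches up with the
         * lock sum.  Barriers and the index-flip logic are omitted.
         */
        #include <stdbool.h>

        #define NR_CPUS_SKETCH 4

        struct sketch_srcu_cpu {
                unsigned long lock_count[2];
                unsigned long unlock_count[2];
        };

        struct sketch_srcu {
                int completed;                  /* selects the active index */
                struct sketch_srcu_cpu per_cpu[NR_CPUS_SKETCH];
        };

        static int sketch_read_lock(struct sketch_srcu *sp, int cpu)
        {
                int idx = sp->completed & 0x1;

                sp->per_cpu[cpu].lock_count[idx]++;     /* no sequence counter */
                return idx;
        }

        static void sketch_read_unlock(struct sketch_srcu *sp, int cpu, int idx)
        {
                sp->per_cpu[cpu].unlock_count[idx]++;
        }

        /* Grace-period side: have all readers of index @idx finished? */
        static bool sketch_readers_done(struct sketch_srcu *sp, int idx)
        {
                unsigned long locks = 0, unlocks = 0;
                int cpu;

                for (cpu = 0; cpu < NR_CPUS_SKETCH; cpu++) {
                        locks += sp->per_cpu[cpu].lock_count[idx];
                        unlocks += sp->per_cpu[cpu].unlock_count[idx];
                }
                return unlocks == locks;
        }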
>
> Another change that directly impacts __synchronize_srcu() (near the
> BUG()) is bc138e7a "srcu: Allow mid-boot use of synchronize_srcu()".
>
> Odds are that this change introduced an unforeseen scenario that
> skips waiting for completion later in boot than expected. It
> might be worthwhile to review it closely once more.
SRCU is under active development, so yes, bugs are likely. I will take
another look.
At the very least, a READ_ONCE() on rcu_scheduler_active would be good.
And that fastpath is removed entirely in later commits not yet pushed out,
so rearranging or combining commits might be helpful. Though it
would of course still be good to identify the bug in case it is something
subtle that persists.
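To make that concrete, the change I have in mind is just wrapping the
flag access, along these lines (illustrative only -- the exact boot-state
test in linux-next may differ, and as noted the fastpath goes away
entirely in later commits):

        /*
         * Mid-boot fastpath: if we are still early enough in boot that
         * there can be no SRCU readers to wait for, return immediately.
         * READ_ONCE() keeps the compiler from refetching or tearing the
         * load of rcu_scheduler_active.  (The comparison value shown here
         * is illustrative; see the actual commit for the real test.)
         */
        if (READ_ONCE(rcu_scheduler_active) == RCU_SCHEDULER_INIT)
                return;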
Thanx, Paul
> Thoughts ?
>
> Thanks,
>
> Mathieu
>
> >
> >> > Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
> >> > Modules linked in: meson_drm drm_kms_helper drm dwmac_generic realtek
> >> > dwmac_meson8b stmmac_platform stmmac meson_rng rng_core meson_gxbb_wdt
> >> > ipv6
> >> > CPU: 3 PID: 4250 Comm: a.out Not tainted 4.11.0-rc1-next-20170308-xc2-dirty #3
> >> > Hardware name: Hardkernel ODROID-C2 (DT)
> >> > task: ffff800063699700 task.stack: ffff800063cfc000
> >> > PC is at[< none >] __synchronize_srcu+0x3d0/0x470
> >> > kernel/rcu/srcu.c:412
> >> > LR is at[< none >] __synchronize_srcu+0x130/0x470
> >> > kernel/rcu/srcu.c:434
> >> > pc : [<ffff20000821a3b8>] lr : [<ffff20000821a118>] pstate: 80000145
> >> > sp : ffff800063cffb00
> >> > x29: ffff800063cffb00 x28: ffff80005b1d6e00
> >> > x27: 1fffe4000156b242 x26: ffff800063cffb70
> >> > x25: 1fffe4000156b23b x24: ffff20000ab591d8
> >> > x23: ffff200009dbf000 x22: ffff20000ab591a0
> >> > x21: ffff20000ab59210 x20: ffff800063cffb70
> >> > x19: ffff20000ab59190 x18: 0000000000000a03
> >> > x17: 0000ffff944f3950 x16: ffff20000811f818
> >> > x15: 0000000000000000 x14: 0000000000000007
> >> > x13: 0000000000000002 x12: 0000000000000000
> >> > x11: 0000000000000040 x10: 1fffe400014b568c
> >> > x9 : ffff20000ab29000 x8 : 0000000000000007
> >> > x7 : 0000000000000001 x6 : 0000000000000000
> >> > x5 : 0000000000000040 x4 : 0000000000000003
> >> > x3 : ffff20000ab59208 x2 : 1fffe4000156b243
> >> > x1 : 0000000000000000 x0 : ffff80005e71fb70
> >> >
> >> > Process a.out (pid: 4250, stack limit = 0xffff800063cfc000)
> >> > Stack: (0xffff800063cffb00 to 0xffff800063d00000)
> >> > fb00: ffff800063cffbd0 ffff20000821a480 ffff20000ab59190 1ffff0000b63adc0
> >> > fb20: dfff200000000000 ffff20000ab59190 ffff80004b9a8a00 1ffff000097351bc
> >> > fb40: ffff80004b9a8de0 0000000000000000 ffff800060cad328 ffff80005b1d6e00
> >> > fb60: ffff80004b9a8a00 1ffff000097351bc ffff80004f5e7b70 ffff200008217968
> >> > fb80: ffff800000000001 dead4ead00010001 dfff2000ffffffff ffffffffffffffff
> >> > fba0: ffff20000ab4c4b0 0000000000000000 0000000000000000 ffff200009b0b358
> >> > fbc0: ffff800063cffbc0 ffff800063cffbc0 ffff800063cffbf0 ffff2000083ffd20
> >> > fbe0: ffff80005b1d6e00 0000000000000140 ffff800063cffc50 ffff2000083aedfc
> >> > fc00: ffff80004b9a8a00 ffff80004b9a8a00 ffff80004b9a8d78 0000000000000001
> >> > fc20: 00000000000002a6 ffff80004f406780 ffff80004b9a8aa0 ffff800063699ac8
> >> > fc40: 1ffff0000c6d3359 ffff800063699700 ffff800063cffd20 ffff20000810caec
> >> > fc60: ffff80004b9a8a00 ffff80004b9a8b20 ffff80004b9a8d78 0000000000000001
> >> > fc80: ffff80005ebae2d8 ffff800063699ac8 ffff800063cffca0 ffff20000840fc08
> >> > fca0: ffff800063cffce0 ffff20000843327c ffff80005ebae2d8 ffff80004b9a8a00
> >> > fcc0: ffff80005ebae0f0 ffff200009de8000 ffff800063cffce0 ffff2000084332c4
> >> > fce0: ffff800063cffd20 ffff20000810cc64 ffff80004b9a8a00 ffff80004b9a8a48
> >> > fd00: ffff80004b9a8d78 0000000000000001 ffff800063cffd20 ffff20000810cae0
> >> > fd20: ffff800063cffd60 ffff20000811db88 ffff800063699700 ffff800063699700
> >> > fd40: ffff80004b9a8a00 0000000000000001 00000000000002a6 ffff80004f406780
> >> > fd60: ffff800063cffe40 ffff20000811f694 ffff80004f406780 0000000000000000
> >> > fd80: ffff80004f40681c 0000000000000004 1ffff00009e80d03 1ffff00009e80d02
> >> > fda0: ffff80004f406810 ffff800063699bd0 ffff800063699700 ffff800063699700
> >> > fdc0: ffff800063cffe80 ffff200008490868 0000000000000000 ffff80005fd45000
> >> > fde0: ffff80006369972c ffff800063699d48 1ffff0000c6d32e5 0000000000000004
> >> > fe00: 0000000000000123 000000000000001d 1ffff0000c6d33a9 ffff800063699700
> >> > fe20: ffff800063cffe30 ffff200008813c5c ffff800063cffe40 ffff20000811f688
> >> > fe40: ffff800063cffea0 ffff20000811f838 0000000000000000 000060006d24d000
> >> > fe60: ffffffffffffffff 0000ffff944f3974 0000000000000000 0000000000000015
> >> > fe80: 0000000000000123 000000000000005e ffff200009852000 ffff20000811f82c
> >> > fea0: 0000000000000000 ffff200008083f70 0000000000000000 0000000000000015
> >> > fec0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> >> > fee0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> >> > ff00: 000000000000005e ffffff80ffffffd0 0101010101010101 0000000000000020
> >> > ff20: 0000000000000018 0000000056bcb768 0000000000000000 0000ffff945be000
> >> > ff40: 0000000000413110 0000ffff944f3950 0000000000000a03 00000000004020f8
> >> > ff60: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> >> > ff80: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> >> > ffa0: 0000000000000000 0000ffffc0cc99b0 0000000000401248 0000ffffc0cc99b0
> >> > ffc0: 0000ffff944f3974 0000000000000000 0000000000000000 000000000000005e
> >> > ffe0: 0000000000000000 0000000000000000 0403030303030100 0807060606060605
> >> > Call trace:
> >> > Exception stack(0xffff800063cff910 to 0xffff800063cffa40)
> >> > f900: ffff20000ab59190 0001000000000000
> >> > f920: ffff800063cffb00 ffff20000821a3b8 0000000080000145 000000000000003d
> >> > f940: 1fffe4000156b23b ffff800063cffb70 ffff800063cff980 0001000000000000
> >> > f960: ffff800063cff9d0 ffff2000081da8e8 ffff800063699ec0 ffff200009df9000
> >> > f980: ffff800063cff990 ffff20000891e1e0 ffff800063cff9d0 ffff20000891e23c
> >> > f9a0: ffff200009dbfe18 0000000000000040 0000000000000004 0000000000000001
> >> > f9c0: 00000000000008ac 00000000000008ac ffff80005e71fb70 0000000000000000
> >> > f9e0: 1fffe4000156b243 ffff20000ab59208 0000000000000003 0000000000000040
> >> > fa00: 0000000000000000 0000000000000001 0000000000000007 ffff20000ab29000
> >> > fa20: 1fffe400014b568c 0000000000000040 0000000000000000 0000000000000002
> >> > [<ffff20000821a3b8>] __synchronize_srcu+0x3d0/0x470 kernel/rcu/srcu.c:412
> >> > [<ffff20000821a480>] synchronize_srcu+0x28/0x60 kernel/rcu/srcu.c:516
> >> > [<ffff2000083ffd20>] __mmu_notifier_release+0x268/0x3e0 mm/mmu_notifier.c:102
> >> > [< inline >] mmu_notifier_release ./include/linux/mmu_notifier.h:235
> >> > [<ffff2000083aedfc>] exit_mmap+0x21c/0x288 mm/mmap.c:2941
> >> > [< inline >] __mmput kernel/fork.c:881
> >> > [<ffff20000810caec>] mmput+0xdc/0x2e0 kernel/fork.c:903
> >> > [< inline >] exit_mm kernel/exit.c:557
> >> > [<ffff20000811db88>] do_exit+0x648/0x2020 kernel/exit.c:865
> >> > [<ffff20000811f694>] do_group_exit+0xdc/0x260 kernel/exit.c:982
> >> > [< inline >] SYSC_exit_group kernel/exit.c:993
> >> > [<ffff20000811f838>] __wake_up_parent+0x0/0x60 kernel/exit.c:991
> >> > [<ffff200008083f70>] el0_svc_naked+0x24/0x28 arch/arm64/kernel/entry.S:813
> >> > Code: 97feee10 35fff680 17ffff1c d503201f (d4210000)
> >> > ---[ end trace b727e9858bfac1ff ]---
> >> > Kernel panic - not syncing: Fatal exception
> >
> > So the theory is that if !sp->running, all of SRCU's queues must be empty.
> > So if you are holding ->queue_lock (with irqs disabled) and you see
> > !sp->running, and then you enqueue a callback on ->batch_check0, then
> > that callback must be the first in the list. And the code preceding
> > the WARN_ON() you triggered does in fact check and enqueue while holding
> > ->queue_lock with irqs disabled.
> >
> > And rcu_batch_queue() does operate FIFO as required. (Otherwise,
> > srcu_barrier() would not work.)
> >
> > There are only three calls to rcu_batch_queue(), and the one involved with
> > the WARN_ON() enqueues to ->batch_check0. The other two enqueue to
> > ->batch_queue. Callbacks move from ->batch_queue to ->batch_check0 to
> > ->batch_check1 to ->batch_done, so nothing should slip in front.
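For reference, the enqueue pattern in __synchronize_srcu() that this
reasoning is about looks roughly like the following (paraphrased from
memory, so it may not match the file line for line):

        spin_lock_irq(&sp->queue_lock);
        if (!sp->running) {
                /*
                 * No one else owns grace-period processing, so all queues
                 * should be empty and our callback must end up first:
                 * take ownership and enqueue directly onto ->batch_check0,
                 * all under ->queue_lock with irqs disabled.
                 */
                sp->running = true;
                rcu_batch_queue(&sp->batch_check0, head);
                spin_unlock_irq(&sp->queue_lock);
        } else {
                /* Someone else owns processing; queue at the tail. */
                rcu_batch_queue(&sp->batch_queue, head);
                spin_unlock_irq(&sp->queue_lock);
        }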
> >
> > Of course, if ->running were ever set to false with any of ->batch_check0,
> > ->batch_check1, or ->batch_done non-empty, this WARN_ON() would trigger.
> > But srcu_reschedule() sets it to false only if all four batches are
> > empty (and checks and sets under ->queue_lock()), and all other cases
> > where it is set to false happen at initialization time, and also clear
> > out the queues. Of course, if someone raced an init_srcu_struct() with
> > either a call_srcu() or synchronize_srcu(), all bets are off. Now,
> > mmu_notifier.c does invoke init_srcu_struct() manually, but it does
> > so at subsys_initcall() time. Which -might- be after other things are
> > happening, so one "hail Mary" attempted fix is to remove mmu_notifier_init()
> > and replace the "static struct srcu_struct srcu" with:
> >
> > DEFINE_STATIC_SRCU(srcu);
> >
> > But this might require changing the name -- I vaguely recall some
> > strangeness where the names of statically defined per-CPU variables need
> > to be globally unique even when static. Easy enough to do, though.
> > Might need a similar change to the "srcu" instances defined in vmd.c
> > and kvm_host.h -- assuming that this change helps.
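In code, the hail-Mary change to mm/mmu_notifier.c would be along these
lines (sketched from memory, and modulo the possible rename mentioned
above):

        /* Before: dynamic initialization at subsys_initcall() time. */
        static struct srcu_struct srcu;

        static int __init mmu_notifier_init(void)
        {
                return init_srcu_struct(&srcu);
        }
        subsys_initcall(mmu_notifier_init);

        /* After: statically initialized, so usable from early boot on. */
        DEFINE_STATIC_SRCU(srcu);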
> >
> > Another possibility is that something in SRCU is messing with either the
> > queues or the ->running field without holding ->queue_lock. And that does
> > seem to be happening -- srcu_advance_batches() invokes rcu_batch_move()
> > without holding anything. Which seems like it could cause trouble
> > if someone else was invoking synchronize_srcu() concurrently. Those
> > particular invocations might be safe due to access only by a single
> > kthread/workqueue, given that all updates to ->batch_queue are protected
> > by ->queue_lock (aside from initialization).
> >
> > But ->batch_check0 is updated by __synchronize_srcu(), though protected
> > by ->queue_lock, and only if ->running is false, and with both the
> > check and the update protected by the same ->queue_lock critical section.
> > If ->running is false, then the workqueue cannot be running, so it remains
> > to see if all other updates to ->batch_check0 are either with ->queue_lock
> > held and ->running false on the one hand or from the workqueue handler
> > on the other:
> >
> > srcu_collect_new() updates with ->queue_lock held, but does not check
> > ->running. It is invoked only from process_srcu(), which in
> > turn is invoked only as a workqueue handler. The work is queued
> > from:
> >
> > call_srcu(), which does so with ->queue_lock held having just
> > set ->running to true.
> >
> > srcu_reschedule(), which invokes it if there are non-empty
> > queues. This is invoked from __synchronize_srcu()
> > in the case where it has set ->running to true
> > after finding the queues empty, which should imply
> > no other instances.
> >
> > It is also invoked from process_srcu(), which is
> > invoked only as a workqueue handler. (Yay
> > recursive inquiry!)
> >
> > srcu_advance_batches() updates without locks held. It is invoked as
> > follows:
> >
> > __synchronize_srcu() in the case where ->running was set, which
> > as noted before excludes workqueue handlers.
> >
> > process_srcu() which as noted before is only invoked from
> > a workqueue handler.
> >
> > So an SRCU workqueue is invoked only from a workqueue handler, or from
> > some other task that transitioned ->running from false to true while
> > holding ->queue_lock. There should therefore only be one SRCU workqueue
> > per srcu_struct, so this should be safe. Though I hope that it can
> > be simplified a bit. :-/
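For completeness, the handoff that ends ownership is in srcu_reschedule(),
whose core check is roughly the following (again from memory):

        spin_lock_irq(&sp->queue_lock);
        if (rcu_batch_empty(&sp->batch_done) &&
            rcu_batch_empty(&sp->batch_check1) &&
            rcu_batch_empty(&sp->batch_check0) &&
            rcu_batch_empty(&sp->batch_queue))
                sp->running = false;    /* all empty: drop ownership */
        spin_unlock_irq(&sp->queue_lock);

        /* otherwise, re-queue the workqueue handler and keep ownership */

so ->running is cleared only with all four batches empty and only under
->queue_lock, which is what the reasoning above relies on.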
> >
> > So the only suggestion I have at the moment is static definition of
> > the "srcu" variable. Lai, Josh, Steve, Mathieu, anything I missed?
> >
> > Thanx, Paul
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com
>