Re: WARNING in ib_umad_kill_port

From: Dmitry Vyukov
Date: Thu Apr 09 2020 - 09:35:17 EST


On Tue, Apr 7, 2020 at 4:35 PM Jason Gunthorpe <jgg@xxxxxxxx> wrote:
>
> On Tue, Apr 07, 2020 at 02:39:42PM +0200, Dmitry Vyukov wrote:
> > On Tue, Apr 7, 2020 at 1:55 PM Jason Gunthorpe <jgg@xxxxxxxx> wrote:
> > >
> > > On Tue, Apr 07, 2020 at 11:56:30AM +0200, Dmitry Vyukov wrote:
> > > > > I'm not sure what could be done wrong here to elicit this:
> > > > >
> > > > > sysfs group 'power' not found for kobject 'umad1'
> > > > >
> > > > > ??
> > > > >
> > > > > I've seen another similar sysfs related trigger that we couldn't
> > > > > figure out.
> > > > >
> > > > > Hard to investigate without a reproducer.
> > > >
> > > > Based on all of the sysfs-related bugs I've seen, my bet would be on
> > > > some races. E.g. one thread registers devices, while another
> > > > unregisters these.
> > >
> > > I did check that the naming is ordered right, at least we won't be
> > > concurrently creating and destroying umadX sysfs of the same names.
> > >
> > > I'm also fairly sure we can't be destroying the parent at the same
> > > time as this child.
> > >
> > > Do you see the above commonly? Could it be some driver core thing? Or
> > > is it more likely something wrong in umad?
> >
> > Mmmm... I can't say, I am looking at some bugs very briefly. I've
> > noticed that sysfs comes up periodically (or was it some other similar
> > fs?).
>
> Hmm..
>
> Looking at the git history I see several cases where there are
> ordering problems. I wonder if the rdma parent device is being
> destroyed before the rdma devices complete destruction?
>
> I see the syzkaller is creating a bunch of virtual net devices, and I
> assume it has created a software rdma device on one of these virtual
> devices.
>
> So I'm guessing that it is also destroying a parent? But I can't guess
> which.. Some simple tests with veth suggest it is OK because the
> parent is virtual. But maybe bond or bridge or something?
>
> The issue in rdma is that unregistering a netdev triggers an async
> destruction of the RDMA devices. This has to be async because the
> netdev notification is delivered with RTNL held, and a rdma device
> cannot be destroyed while holding RTNL.
>
> So there is a race, I suppose, where the netdev can complete
> destruction while rdma continues, and if someone deletes the sysfs
> holding the netdev before rdma completes, I'm going to guess, that we
> hit this warning?
>
> Could it be? I would love to know what netdev the rdma device was
> created on, but it doesn't seem to show in the trace :\
>
> This theory could be made more likely by adding a sleep to
> ib_unregister_work() to increase the race window - is there some way
> to get syzkaller to search for a reproducer with that patch?
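
To make the suspected ordering concrete, here is a toy user-space model of
it, not kernel code. Everything in it is an assumption made up for
illustration: FakeSysfs, simulate(), and the "parent0" path are hypothetical,
and the real sysfs hierarchy for umad devices may look different. It only
shows how a sleep in the async worker makes "parent removed first" the
common case:

```python
import threading
import time


class FakeSysfs:
    """Toy stand-in for sysfs: a set of directory paths guarded by a lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._dirs = set()
        self.warnings = []

    def add(self, path):
        with self._lock:
            self._dirs.add(path)

    def remove_tree(self, path):
        # Removing a directory also removes everything below it.
        with self._lock:
            prefix = path + "/"
            self._dirs = {
                d for d in self._dirs if d != path and not d.startswith(prefix)
            }

    def remove_group(self, kobj, group):
        # Record a warning if the group directory is already gone.
        name = kobj.rsplit("/", 1)[-1]
        with self._lock:
            path = kobj + "/" + group
            if path not in self._dirs:
                self.warnings.append(
                    f"sysfs group '{group}' not found for kobject '{name}'"
                )
                return
            self._dirs.discard(path)


def simulate(worker_delay):
    """Run one parent/child teardown and return the recorded warnings."""
    sysfs = FakeSysfs()
    parent = "parent0"            # hypothetical parent device directory
    child = parent + "/umad1"
    for path in (parent, child, child + "/power"):
        sysfs.add(path)

    def unregister_work():
        # Stand-in for the async child teardown; the sleep models the
        # proposed debugging delay that widens the race window.
        time.sleep(worker_delay)
        sysfs.remove_group(child, "power")
        sysfs.remove_tree(child)

    worker = threading.Thread(target=unregister_work)
    worker.start()                # child teardown is queued asynchronously
    sysfs.remove_tree(parent)     # parent teardown completes right away
    worker.join()
    return sysfs.warnings


if __name__ == "__main__":
    print(simulate(0.2))
```

With a delay of 0.2 seconds the parent removal should always win, so the
model reports the missing 'power' group. With a delay of 0 the outcome
depends on thread scheduling, which is the point of adding the sleep.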


Too bad it happened in kthread context. Otherwise it's usually possible to
pinpoint the test based on the process name.

The syz-repro utility will run the reproduction process with any kernel you give it:
https://github.com/google/syzkaller/blob/master/docs/reproducing_crashes.md

Or it's possible to run individual programs, or the whole log, with the
syz-execprog utility:
https://github.com/google/syzkaller/blob/master/docs/executing_syzkaller_programs.md

Or maybe you could pinpoint the guilty test program by hand in the log
(it's probably somewhere closer to the end):
https://syzkaller.appspot.com/x/log.txt?x=119dd16de00000