Re: [syzbot] [nfs?] INFO: task hung in nfsd_nl_listener_set_doit
From: Jeff Layton
Date: Wed Sep 04 2024 - 10:39:10 EST
On Wed, 2024-09-04 at 10:23 -0400, Chuck Lever wrote:
> On Mon, Sep 02, 2024 at 11:57:55AM +1000, NeilBrown wrote:
> > On Sun, 01 Sep 2024, syzbot wrote:
> > > syzbot has found a reproducer for the following issue on:
> >
> > I had a poke around using the provided disk image and kernel for
> > exploring.
> >
> > I think the problem is demonstrated by this stack :
> >
> > [<0>] rpc_wait_bit_killable+0x1b/0x160
> > [<0>] __rpc_execute+0x723/0x1460
> > [<0>] rpc_execute+0x1ec/0x3f0
> > [<0>] rpc_run_task+0x562/0x6c0
> > [<0>] rpc_call_sync+0x197/0x2e0
> > [<0>] rpcb_register+0x36b/0x670
> > [<0>] svc_unregister+0x208/0x730
> > [<0>] svc_bind+0x1bb/0x1e0
> > [<0>] nfsd_create_serv+0x3f0/0x760
> > [<0>] nfsd_nl_listener_set_doit+0x135/0x1a90
> > [<0>] genl_rcv_msg+0xb16/0xec0
> > [<0>] netlink_rcv_skb+0x1e5/0x430
> >
> > No rpcbind is running on this host so that "svc_unregister" takes a
> > long time. Maybe not forever but if a few of these get queued up all
> > blocking some other thread, then maybe that pushed it over the limit.
> >
> > The fact that rpcbind is not running might not be relevant as the test
> > messes up the network. "ping 127.0.0.1" stops working.
> >
> > So this bug comes down to "we try to contact rpcbind while holding a
> > mutex and if that gets no response and no error, then we can hold the
> > mutex for a long time".
> >
> > Are we surprised? Do we want to fix this? Any suggestions how?
>
> In the past, we've tried to address "hanging upcall" issues where
> the kernel part of an administrative command needs a user space
> service that isn't working or present. (eg mount needing a running
> gssd)
>
> If NFSD is using the kernel RPC client for the upcall, then maybe
> adding the RPC_TASK_SOFTCONN flag might turn the hang into an
> immediate failure.
>
> IMO this should be addressed.
>
Looking at rpcb_register_call, it looks like we already set SOFTCONN if
is_set is true. We probably did that assuming that we only call
svc_unregister on shutdown. svc_rpcb_setup does this though:
/* Remove any stale portmap registrations */
svc_unregister(serv, net);
return 0;
What would be the risk in just setting SOFTCONN unconditionally?
--
Jeff Layton <jlayton@xxxxxxxxxx>