Re: [PATCH RFC 0/9] nfs/sunrpc: stop holding netns references in client-side NFS and RPC objects
From: Jeff Layton
Date: Tue Mar 18 2025 - 07:31:33 EST
On Mon, 2025-03-17 at 22:11 +0000, Trond Myklebust wrote:
> On Mon, 2025-03-17 at 17:57 -0400, Jeff Layton wrote:
> > On Mon, 2025-03-17 at 21:35 +0000, Trond Myklebust wrote:
> > > On Mon, 2025-03-17 at 16:59 -0400, Jeff Layton wrote:
> > > > We have a long-standing problem with containers that have NFS
> > > > mounts
> > > > in
> > > > them. Best practice is to unmount gracefully, of course, but
> > > > sometimes
> > > > containers just spontaneously die (e.g. SIGSEGV in the init task
> > > > in
> > > > the
> > > > container). When that happens the orchestrator will see that all
> > > > of
> > > > the
> > > > tasks are dead, and will detach the mount namespace and kill off
> > > > the
> > > > network connection.
> > > >
> > > > If there are RPCs in flight at the time, the rpc_clnt will try to
> > > > retransmit them indefinitely, but there is no hope of them ever
> > > > contacting the server since nothing in userland can reach the
> > > > netns
> > > > at that point to fix anything.
> > > >
> > > > This patchset takes the approach of changing various nfs client
> > > > and
> > > > sunrpc objects to not hold a netns reference. Instead, when a
> > > > nfs_net
> > > > or
> > > > sunrpc_net is exiting, all nfs_server, nfs_client and rpc_clnt
> > > > objects
> > > > associated with it are shut down, and the pre_exit functions
> > > > block
> > > > until they are gone.
> > > >
> > > > With this approach, when the last userland task in the container
> > > > exits,
> > > > the NFS and RPC clients get cleaned up automatically. As a bonus,
> > > > this
> > > > fixes another bug with the gssproxy RPC client that causes net
> > > > namespace
> > > > leaks in any container where it runs (details in the patch
> > > > descriptions).
> > > >
> > >
> > > So with this approach, what happens if the NFS mount was created in
> > > a
> > > container, but got bind mounted somewhere else?
> > >
> >
> > The lifetime of these objects are tied to the net namespace. If it
> > gets
> > bind-mounted into a different mount namespace, while the tasks are
> > setns()'ed into the correct net namespace, then I expect the mount
> > would end up shut down at that point and be unusable, just like if
> > you
> > echo 1 into the shutdown file in sysfs.
> >
> > Hopefully no one is doing anything that silly. You wouldn't be able
> > to
> > upcall, for one thing, since there wouldn't be any more userland
> > processes attached to the netns.
> >
> > I'll test that scenario and get back to you though. I do want to make
> > sure that that's not going to lead to a crash or anything.
>
> I agree with you that it's not a sane scenario, and that there is no
> need to try to make it work. However the user space tools are there to
> allow it to happen, so we need to ensure that the kernel won't panic or
> cause any new exotic hangs.
Unfortunately, this does create a hang.
Bind-mounting it will cause the superblock's refcount to increase,
which keeps the nfs_server struct active. That holds a reference to the
nfs_client, which prevents everything from coming down properly in
pre_exit.
I'll have to think about how we can solve that. Let me know if you have
ideas.
--
Jeff Layton <jlayton@xxxxxxxxxx>