Re: [PATCH] sunrpc: add a rpc_clnt shutdown control in debugfs
From: Benjamin Coddington
Date: Thu Mar 13 2025 - 09:35:37 EST
On 13 Mar 2025, at 9:15, Jeff Layton wrote:
> On Wed, 2025-03-12 at 22:31 +0000, Trond Myklebust wrote:
>> On Wed, 2025-03-12 at 10:37 -0400, Jeff Layton wrote:
>>> On Wed, 2025-03-12 at 09:52 -0400, Benjamin Coddington wrote:
>>>> On 12 Mar 2025, at 9:36, Jeff Layton wrote:
>>>>
>>>>> There have been confirmed reports where a container with an NFS
>>>>> mount inside it dies abruptly, along with all of its processes, but
>>>>> the NFS client sticks around and keeps trying to send RPCs after
>>>>> the networking is gone.
>>>>>
>>>>> We have a reproducer where if we SIGKILL a container with an NFS
>>>>> mount, the RPC clients will stick around indefinitely. The
>>>>> orchestrator does a MNT_DETACH unmount on the NFS mount, and then
>>>>> tears down the networking while there are still RPCs in flight.
>>>>>
>>>>> Recently new controls were added[1] that allow shutting down an NFS
>>>>> mount. That doesn't help here since the mount namespace is detached
>>>>> from any tasks at this point.
>>>>
>>>> That's interesting - seems like the orchestrator could just reorder
>>>> its request to shut down before detaching the mount namespace. Not an
>>>> objection, just wondering why the MNT_DETACH must come first.
>>>>
>>>
>>> The reproducer we have is to systemd-nspawn a container, mount up an
>>> NFS mount inside it, start some I/O on it with fio, and then kill -9
>>> the systemd running inside the container. There isn't much the
>>> orchestrator (root-level systemd) can do at that point other than
>>> clean up what's left.
>>>
>>> I'm still working on a way to reliably detect when this has happened.
>>> For now, we just have to notice that some clients aren't dying.
>>>
>>>>> Transplant shutdown_client() to the sunrpc module, and give it a
>>>>> more distinct name. Add a new debugfs sunrpc/rpc_clnt/*/shutdown
>>>>> knob that provides the same functionality as the one in
>>>>> /sys/fs/nfs, but at the rpc_clnt level.
>>>>>
>>>>> [1]: commit d9615d166c7e ("NFS: add sysfs shutdown knob").
>>>>>
>>>>> Signed-off-by: Jeff Layton <jlayton@xxxxxxxxxx>
>>>>
>>>> I have a TODO to patch Documentation/ for this knob, mostly to write
>>>> warnings, because there are some potential "gotchas" here - for
>>>> example, you can have shared RPC clients, and shutting down one of
>>>> those can cause problems for a different mount (this is true today
>>>> with the /sys/fs/nfs/[bdi]/shutdown knob). Shutting down arbitrary
>>>> clients will definitely break things in weird ways; it's not a safe
>>>> place to explore.
>>>>
>>>
>>> Yes, you really do need to know what you're doing. 0200 permissions
>>> are essential for this file, IOW. Thanks for the R-b!
>>
>> Sorry, but NACK! We should not be adding control mechanisms to debugfs.
>>
>
> Ok. Would adding sunrpc controls under sysfs be more acceptable? I do
> agree that this is a potential footgun, however. It would be nicer to
> clean this situation up automagically.
>
>> One thing that might work in situations like this is perhaps to make
>> use of the fact that we are monitoring whether or not rpc_pipefs is
>> mounted. So if the mount is containerised, and the orchestrator
>> unmounts everything, including rpc_pipefs, we might take that as a hint
>> that we should treat any future connection errors as being fatal.
>>
>
> rpc_pipefs isn't being mounted at all in the container I'm using. I
> think that's not going to be a reliable test for this.
>
>> Otherwise, we'd have to be able to monitor the root task, and check if
>> it is still alive in order to figure out if our containerised world
>> has collapsed.
>>
>
> If by the root task, you mean the initial task in the container, then
> that method seems a little sketchy too. How would we determine that
> from the RPC layer?
>
> To be clear: the situation here is that we have a container with a veth
> device that is communicating with the outside world. Once all of the
> processes in the container exit, the veth device in the container
> disappears. The rpc_xprt holds a ref on the netns though, so that
> sticks around trying to retransmit indefinitely.
>
> I think what we really need is a lightweight reference on the netns.
> Something where we can tell that there are no userland tasks that care
> about it anymore, so we can be more aggressive about giving up on it.
>
> There is a "passive" refcount inside struct net, but that's not quite
> what we need as it won't keep the sunrpc_net in place.
>
> What if instead of holding a netns reference in the xprt, we have it
> hold a reference on a new refcount_t that lives in sunrpc_net? Then, we
> add a pre_exit pernet_ops callback that does a shutdown_client() on all
> of the rpc_clnt's attached to the xprts in that netns. The pre_exit can
> then just block until the sunrpc_net refcount goes to 0.
>
> I think that would allow everything to be cleaned up properly?
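
If I'm reading that right, it'd look very roughly like the sketch below.
This is only to check my understanding -- the new sunrpc_net fields, the
waitqueue, and the rpc_clnt_shutdown() name (standing in for the
transplanted shutdown_client(), whatever it ends up being called) are all
made up, and the locking around the client list is hand-waved away:

struct sunrpc_net {
        /* ... existing fields ... */
        refcount_t xprt_count;          /* taken by each rpc_xprt in place of the netns ref */
        wait_queue_head_t xprt_wq;      /* pre_exit sleeps here until xprt_count drains */
};

static void sunrpc_pernet_pre_exit(struct net *net)
{
        struct sunrpc_net *sn = net_generic(net, sunrpc_net_id);
        struct rpc_clnt *clnt;

        /* Kill outstanding RPCs instead of letting them retry forever */
        list_for_each_entry(clnt, &sn->all_clients, cl_clients)
                rpc_clnt_shutdown(clnt);

        /* Wait for the last rpc_xprt to drop its lightweight reference
         * (the xprt teardown path would need to wake this up). */
        wait_event(sn->xprt_wq, refcount_read(&sn->xprt_count) == 0);
}

static struct pernet_operations sunrpc_net_ops = {
        .init     = sunrpc_init_net,
        .pre_exit = sunrpc_pernet_pre_exit,
        .exit     = sunrpc_exit_net,
        .id       = &sunrpc_net_id,
        .size     = sizeof(struct sunrpc_net),
};
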
Do you think that might create unwanted behaviors for a netns that might
still be repairable? Maybe that doesn't make a lot of sense if there are no
processes in it, but I imagine a network namespace could be in this state
and we'd still want to try to use it.
I think there's an out-of-kernel way to do it with udev (I haven't tried it
yet), which, if used, creates an explicit requirement for the orchestrator
to define exactly what should happen if the veth goes away. When creating
the namespace, the orchestrator should insert a rule that says "when this
veth disappears, we shut down this fs".
Again, I'm not sure if that's even possible, but I'm willing to muck around
a bit and give it a try.
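
For what it's worth, the sort of rule I'm picturing is roughly the below
(completely untested, the match keys are a guess, and "[bdi]" stands for
whichever /sys/fs/nfs entry backs the mount):

ACTION=="remove", SUBSYSTEM=="net", KERNEL=="veth*", \
        RUN+="/bin/sh -c 'echo 1 > /sys/fs/nfs/[bdi]/shutdown'"
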
Ben