On Mon, Nov 25, 2024 at 1:48 PM Li Lingfeng <lilingfeng3@xxxxxxxxxx> wrote:
Hi, we have found a hungtask issue recently.
Commit 7746b32f467b ("NFSD: add shrinker to reap courtesy clients on low
memory condition") adds a shrinker to NFSD, which causes NFSD to try to
obtain shrinker_rwsem when starting and stopping services.
Deploying both NFS client and server on the same machine may lead to the
following issue, since they will share the global shrinker_rwsem.
        nfsd                            nfs
                                drop_cache // hold shrinker_rwsem
                                write back, wait for rpc_task to exit
// stop nfsd threads
svc_set_num_threads
// clean up xprts
svc_xprt_destroy_all
                                rpc_check_timeout
                                 rpc_check_connected
                                 // wait for the connection to be disconnected
unregister_shrinker
// wait for shrinker_rwsem
Normally, the client's rpc_task will exit after the server's nfsd thread
has processed the request.
When all the server's nfsd threads exit, the client’s rpc_task is expected
to detect the network connection being disconnected and exit.
However, although the server has executed svc_xprt_destroy_all before
waiting for shrinker_rwsem, the network connection is not actually
disconnected. Instead, the operation to close the socket is simply added
to the task_works queue.
svc_xprt_destroy_all
 ...
 svc_sock_free
  sockfd_put
   fput_many
    init_task_work // ____fput
    task_work_add // add to task->task_works
The actual disconnection of the network connection will only occur after
the current process finishes.
do_exit
 exit_task_work
  task_work_run
   ...
   ____fput // close sock
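The deferred close described above can be illustrated with a small userspace model. This is only a sketch, not kernel code: every `model_*` name is hypothetical, and the structures are simplified stand-ins for the kernel's task_work machinery. It assumes a single socket and a single task, and shows only that "destroying" the transport queues the close rather than performing it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace model only: this mimics, but is not, the kernel's callback_head. */
struct callback_head {
    struct callback_head *next;
    void (*func)(struct callback_head *);
};

/* Hypothetical stand-in for a task with a list of pending deferred work. */
struct model_task {
    struct callback_head *task_works;
};

/* Hypothetical socket; 'work' must stay the first member (see the cast below). */
struct model_sock {
    struct callback_head work;
    bool connected;
};

/* Queue a callback. It does not run until model_task_work_run() is called. */
void model_task_work_add(struct model_task *t, struct callback_head *w,
                         void (*func)(struct callback_head *))
{
    assert(t != NULL && w != NULL && func != NULL);
    w->func = func;
    w->next = t->task_works;
    t->task_works = w;
}

/* Stand-in for ____fput: the point where the socket is really closed. */
void model_fput_cb(struct callback_head *w)
{
    struct model_sock *s = (struct model_sock *)w; /* work is the first member */
    s->connected = false;
}

/* Stand-in for svc_xprt_destroy_all -> ... -> fput_many: only queues the close. */
void model_destroy_xprt(struct model_task *t, struct model_sock *s)
{
    model_task_work_add(t, &s->work, model_fput_cb);
}

/* Stand-in for task_work_run, reached via do_exit -> exit_task_work. */
void model_task_work_run(struct model_task *t)
{
    while (t->task_works != NULL) {
        struct callback_head *w = t->task_works;
        t->task_works = w->next;
        w->func(w);
    }
}
```

In this model, a socket passed to `model_destroy_xprt()` still reports `connected` as true afterwards; it only becomes false once `model_task_work_run()` drains the queue, which corresponds to the process exiting in the trace above.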
Although it is not a common practice to deploy NFS client and server on
the same machine, I think this issue still needs to be addressed,
otherwise it will cause all processes trying to acquire the shrinker_rwsem
to hang.
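One way to read the report is as a circular wait between three parties. The following sketch is a userspace illustration only; the actor names and the `has_wait_cycle()` helper are hypothetical and assume each party blocks on at most one other party. It encodes "drop_cache waits for the rpc_task, the rpc_task waits for the nfsd shutdown process to finish, and that process waits for shrinker_rwsem held by drop_cache" and checks whether a given party is part of a cycle.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical actors taken from the description above. */
enum actor { DROP_CACHE, RPC_TASK, NFSD_STOP, NR_ACTORS, NOBODY = -1 };

/*
 * waits_for[a] is the actor that a is blocked on, or NOBODY.
 * Returns true if 'start' is itself part of a wait cycle.
 */
bool has_wait_cycle(const int waits_for[NR_ACTORS], int start)
{
    assert(start >= 0 && start < NR_ACTORS);

    int cur = start;
    /* A cycle through 'start' must close within NR_ACTORS hops. */
    for (int step = 0; step < NR_ACTORS; step++) {
        cur = waits_for[cur];
        if (cur == NOBODY)
            return false; /* the chain ends: someone can make progress */
        if (cur == start)
            return true;  /* we came back to where we began */
    }
    return false;
}
```

With the edges described in the report, the check returns true for every actor. If the nfsd shutdown path did not wait on anything (for example, if the socket were closed before the lock was requested), the chain would end and the check would return false.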

I disagree with that comment. Most small companies have NFS client and
NFS server on the same machine, the client being used to allow logins
by users, or to support schroot or containers.
Mark