Re: [bug report] deploying both NFS client and server on the same machine triggle hungtask

From: Li Lingfeng
Date: Mon Nov 25 2024 - 21:29:09 EST

In reply to: Mark Liam Brown: "Re: [bug report] deploying both NFS client and server on the same machine triggle hungtask"
Next in thread: Li Lingfeng: "Re: [bug report] deploying both NFS client and server on the same machine triggle hungtask"

On 2024/11/26 1:32, Mark Liam Brown wrote:

On Mon, Nov 25, 2024 at 1:48 PM Li Lingfeng <lilingfeng3@xxxxxxxxxx> wrote:

Hi, we recently found a hung task issue.

Commit 7746b32f467b ("NFSD: add shrinker to reap courtesy clients on low
memory condition") adds a shrinker to NFSD, which causes NFSD to try to
obtain shrinker_rwsem when starting and stopping services.

Deploying both NFS client and server on the same machine may lead to the
following issue, since they will share the global shrinker_rwsem.

        nfsd                            nfs
                                drop_cache // hold shrinker_rwsem
                                write back, wait for rpc_task to exit
// stop nfsd threads
svc_set_num_threads
// clean up xprts
svc_xprt_destroy_all
                                rpc_check_timeout
                                 rpc_check_connected
                                 // wait for the connection to be disconnected
unregister_shrinker
// wait for shrinker_rwsem

Normally, the client's rpc_task will exit after the server's nfsd thread
has processed the request.
When all the server's nfsd threads exit, the client’s rpc_task is expected
to detect the network connection being disconnected and exit.
However, although the server has executed svc_xprt_destroy_all before
waiting for shrinker_rwsem, the network connection is not actually
disconnected. Instead, the operation to close the socket is simply added
to the task_works queue.

svc_xprt_destroy_all
 ...
 svc_sock_free
  sockfd_put
   fput_many
    init_task_work // ____fput
    task_work_add // add to task->task_works

The actual disconnection of the network connection will only occur after
the current process finishes.
do_exit
 exit_task_work
  task_work_run
   ...
   ____fput // close sock
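
The deferral described above can be modeled in a few lines of user-space Python. This is only an illustrative sketch, not kernel code: the `Task` and `Sock` classes and the `demo` function are hypothetical stand-ins that borrow the kernel names, and the sketch assumes, as the report states, that the queued close runs only when the task exits.

```python
from collections import deque


class Sock:
    """Toy socket: just tracks whether the connection is still up."""

    def __init__(self):
        self.connected = True

    def close(self):
        self.connected = False


class Task:
    """Toy model of a task with a deferred-work queue (not kernel code)."""

    def __init__(self, name):
        self.name = name
        self.task_works = deque()  # stands in for task->task_works

    def fput(self, sock):
        # Dropping the reference does not close the socket here; the close
        # is only queued, mirroring task_work_add() in the trace above.
        self.task_works.append(lambda: sock.close())

    def run_task_work(self):
        # Run every queued callback, oldest first (a choice of this model).
        while self.task_works:
            self.task_works.popleft()()

    def do_exit(self):
        # In this model the queue is drained only when the task exits.
        self.run_task_work()


def demo():
    nfsd = Task("nfsd")
    sock = Sock()
    nfsd.fput(sock)
    connected_after_fput = sock.connected  # close is still pending
    nfsd.do_exit()
    connected_after_exit = sock.connected  # close has now run
    return connected_after_fput, connected_after_exit
```

In this model, anything that blocks the task between `fput` and `do_exit` leaves the socket connected for as long as the task stays blocked, which is the situation the report describes.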

Although it is not a common practice to deploy NFS client and server on
the same machine, I think this issue still needs to be addressed,
otherwise it will cause all processes trying to acquire the shrinker_rwsem
to hang.
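
To make the circular dependency explicit, the waits described in this report can be written as a wait-for graph and checked for a cycle. This is a hedged sketch: the node names are informal labels chosen here rather than kernel identifiers, `find_cycle` is a hypothetical helper, and the edges encode only what the report describes.

```python
def find_cycle(waits_for):
    """Follow single wait-for edges from each node.

    Returns the nodes forming a cycle as a list, or None if there is none.
    """
    for start in waits_for:
        path = []
        node = start
        while node in waits_for and node not in path:
            path.append(node)
            node = waits_for[node]
        if node in path:
            return path[path.index(node):]
    return None


# Each key waits for its value (labels are informal, not kernel names).
WAITS_FOR = {
    "nfsd shutdown (unregister_shrinker)": "shrinker_rwsem",
    "shrinker_rwsem": "drop_cache (holds the lock)",
    "drop_cache (holds the lock)": "client rpc_task",
    "client rpc_task": "socket close",
    "socket close": "nfsd shutdown (unregister_shrinker)",
}

cycle = find_cycle(WAITS_FOR)
```

Here `cycle` contains all five nodes: the socket close is queued on the same process that is blocked waiting for shrinker_rwsem, so no participant can make progress.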

I disagree with that comment. Most small companies have NFS client and
NFS server on the same machine, the client being used to allow logins
by users, or to support schroot or containers.

Mark

Sorry for my hasty conclusion.

By the way, nfsd_reply_cache_shrinker triggers this too.

Li
