Re: [Patch 0/2] NFSD: Fix server hang when there are multiple layout conflicts

From: Benjamin Coddington

Date: Sun Nov 09 2025 - 13:36:40 EST


On 6 Nov 2025, at 12:05, Dai Ngo wrote:

> When a layout conflict triggers a call to __break_lease, the function
> nfsd4_layout_lm_break clears the fl_break_time timeout before sending
> the CB_LAYOUTRECALL. As a result, __break_lease repeatedly restarts
> its loop, waiting indefinitely for the conflicting file lease to be
> released.
>
> If the number of lease conflicts matches the number of NFSD threads (which
> defaults to 8), all available NFSD threads become occupied. Consequently,
> there are no threads left to handle incoming requests or callback replies,
> leading to a total hang of the NFS server.
>
> This issue is reliably reproducible by running the Git test suite on a
> configuration using SCSI layout.
>
> This patchset fixes this problem by introducing the new lm_breaker_timedout
> operation to lease_manager_operations and using timeout for layout
> lease break.

Hey Dai,

I like your solution here, but I worry it can cause unexpected or
unnecessary client fencing when the problem is server-side (not enough
threads). Clients might be dutifully sending LAYOUTRETURN, but the server
can't service them - and this change will cause some potentially unexpected
fencing in environments where things could be fixed (by adding more knfsd
threads). Also, I think we significantly bumped default thread counts
recently in nfs-utils:
eb5abb5c60ab (tag: nfs-utils-2-8-2-rc3) nfsd: dump default number of threads to 16
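
To make the thread-count concern concrete, here is a toy model of the
scenario as described in this thread. It is not kernel code: it assumes each
unresolved layout conflict pins exactly one knfsd thread in __break_lease,
and that a LAYOUTRETURN can only be processed if at least one thread is
free. The function name and the three outcome labels are made up for
illustration.

```python
def outcome(nthreads: int, conflicts: int, break_timeout: bool) -> str:
    """Toy model: what happens to well-behaved clients sending LAYOUTRETURN.

    Assumption (from the cover letter): every unresolved layout conflict
    occupies one knfsd thread until the layout is returned or the break
    times out.
    """
    free = max(nthreads - conflicts, 0)
    if free > 0:
        # A spare thread can process the LAYOUTRETURN, so the break completes.
        return "returned"
    if break_timeout:
        # No thread can process the LAYOUTRETURN; the break times out and
        # the client is fenced even though it did the right thing.
        return "fenced"
    # No spare thread and no timeout: the waiters never make progress.
    return "hang"


if __name__ == "__main__":
    for nthreads in (8, 16):
        for timeout in (False, True):
            print(nthreads, "threads, timeout =", timeout, "->",
                  outcome(nthreads, 8, timeout))
```

With 8 conflicts, the model hangs at 8 threads without a timeout, fences at
8 threads with one, and completes normally at 16 threads either way, which
is the case I'd like us not to turn into fencing.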

You probably have already seen previous discussions about this:
https://lore.kernel.org/linux-nfs/1CC82EC5-6120-4EE4-A7F0-019CF7BC762C@xxxxxxxxxx/

This also changes the behavior for all layouts. I haven't thought through
the implications of that, but I wish we could have a knob for this behavior,
or perhaps a knfsd-specific fl_break_time tunable.

Last thought (for now): I think Neil (or Jeff?) has some work on a dynamic
knfsd thread count, though I am having trouble finding it. Would that work
around this problem?

Regards,
Ben