RE: [PATCH v2] pNFS: deadlock in pnfs_send_layoutreturn
From: Ben Roberts
Date: Fri Apr 17 2026 - 06:26:13 EST
Hi Ben,
> Did you reproduce and diagnose this problem on a recent upstream kernel
> version?
I'm not able to switch the production systems to a more recent kernel at
this time, and I don't have a reliable way to reproduce the issue in the
wild without putting production systems back on an unpatched kernel. The
best evidence I have that this patch is needed is that we were seeing
this deadlock occur repeatedly under high load and memory pressure last
year, before applying this patch locally. The patch was rolled out to all
production systems in early January, and we have not seen a single
recurrence since. The relevant code paths look similar between the
modified EL9 kernel and the current git HEAD.
I've spent all this week trying to devise a precise reproduction (with a
lot of help from an LLM, since I'm not that familiar with kernel
development) on both the 7.0.0-rc7 (9a9c8ce300cd) and 5.14.0-611.9.1
kernels to definitively prove this patch is needed, but without success.
The initial analysis suggested the deadlock might be triggered from a
single process via a recursive call. That theory has been ruled out; all
calling paths triggered by a single process are guarded in such a way
that a recursive call cannot happen. The revised theory for why this
patch helps is that the deadlock depends on an inter-process race: one
process suffers a memory allocation failure, leaving behind bad state
that a second process then trips over, producing an unserviceable RPC
request. The timing appears to be very sensitive and hard to reproduce
on demand. The LLM-assisted analysis follows, and to my non-expert eyes
it seems compelling.
Thanks,
Ben
---
When pnfs_send_layoutreturn() fails to allocate memory for the layoutreturn
structure, it returns -ENOMEM after calling pnfs_clear_layoutreturn_waitbit()
to wake waiting processes. However, it fails to clear the
NFS_LAYOUT_RETURN_REQUESTED flag via pnfs_clear_layoutreturn_info().
This creates a race condition: processes woken by
pnfs_clear_layoutreturn_waitbit() see NFS_LAYOUT_RETURN_REQUESTED still
set and may attempt another layoutreturn on the corrupted layout state,
leading to hung tasks in rpc_wait_bit_killable().
The bug manifests during memory pressure when:
1. Process A calls pnfs_send_layoutreturn(), and the kzalloc() fails
2. The error path calls pnfs_clear_layoutreturn_waitbit() (wakes waiters)
3. The error path does NOT call pnfs_clear_layoutreturn_info() (the flag
remains set)
4. Process B wakes from wait_on_bit() at _pnfs_return_layout:1484
5. Process B calls pnfs_put_layout_hdr(), refcount reaches zero
6. pnfs_layoutreturn_before_put_layout_hdr() checks flag at line 1400
7. Flag is still set, so Process B calls pnfs_send_layoutreturn() (async)
8. RPC operates on inconsistent/dying layout state
9. Process B hangs indefinitely in rpc_wait_bit_killable()
The fix ensures that when allocation fails in the error path, both the
waitbit AND the layoutreturn info are cleared, so waiting processes wake
to a consistent state and do not attempt operations on the corrupted
layout.
The Deadlock Mechanism
Thread A (allocation failure):

    // Lines 1360-1368 in pnfs_send_layoutreturn()
    if (unlikely(lrp == NULL)) {
            status = -ENOMEM;
            spin_lock(&ino->i_lock);
            pnfs_clear_layoutreturn_waitbit(lo);    // Wakes Thread B
            spin_unlock(&ino->i_lock);
            // BUG: NFS_LAYOUT_RETURN_REQUESTED still set!
            put_cred(cred);
            pnfs_put_layout_hdr(lo);                // Decrements refcount
            goto out;
    }
Thread B (wakes up):

    // In _pnfs_return_layout() around line 1458
    if (test_bit(NFS_LAYOUT_RETURN_LOCK, &lo->plh_flags)) {
            spin_unlock(&ino->i_lock);
            if (wait_on_bit(...))                   // Thread B was blocked HERE
                    goto out_put_layout_hdr;        // Now wakes up and goes here
            spin_lock(&ino->i_lock);
    }
    // ... continues ...
    wait_on_bit(...);           // Line 1484 - passes through (bit already clear)
out_put_layout_hdr:
    pnfs_free_lseg_list(&tmp_list);
    pnfs_put_layout_hdr(lo);    // Line 1487 - THE CRITICAL CALL
Inside Thread B's pnfs_put_layout_hdr() call (line 306):

    void pnfs_put_layout_hdr(struct pnfs_layout_hdr *lo)
    {
            inode = lo->plh_inode;
            pnfs_layoutreturn_before_put_layout_hdr(lo);    // Line 313 - called FIRST
            if (refcount_dec_and_lock(&lo->plh_refcount, &inode->i_lock)) {
                    // If the refcount reaches 0, the layout is freed here
                    pnfs_detach_layout_hdr(lo);
                    pnfs_free_layout_hdr(lo);       // Layout memory FREED!
            }
    }
Inside pnfs_layoutreturn_before_put_layout_hdr() (line 1395):

    static void
    pnfs_layoutreturn_before_put_layout_hdr(struct pnfs_layout_hdr *lo)
    {
            if (!test_bit(NFS_LAYOUT_RETURN_REQUESTED, &lo->plh_flags))
                    return; // Lines 1399-1400 - would return early if the flag were clear
            // BUG: the flag is STILL SET, so execution continues!
            spin_lock(&inode->i_lock);
            if (pnfs_layout_need_return(lo)) {
                    ...
                    send = pnfs_prepare_layoutreturn(lo, &stateid, &cred, &iomode);
                    spin_unlock(&inode->i_lock);
                    if (send) {
                            // Sends ANOTHER layoutreturn RPC!
                            pnfs_send_layoutreturn(lo, &stateid, &cred, iomode,
                                            PNFS_FL_LAYOUTRETURN_ASYNC);    // Line 1412
                    }
            }
    }
Why It Deadlocks
The key issue is timing and state corruption:
1. Thread B initiates async RPC at line 1412 that references layout lo
2. Then Thread B returns to pnfs_put_layout_hdr() line 315
3. If refcount reaches 0: Layout is DETACHED and FREED (lines 318-323)
4. But the RPC from step 1 is still in flight!
The RPC is now operating on a layout structure that may be:
- partially freed
- corrupted
- holding an invalid stateid
- referencing a detached inode
When the RPC machinery tries to complete this operation, it hits
inconsistent state and hangs in rpc_wait_bit_killable() waiting for an
RPC response that can never complete properly because the underlying
state is corrupted.
---
Example stack trace for a deadlocked process:
[<0>] rpc_wait_bit_killable+0xd/0x60 [sunrpc]
[<0>] __rpc_execute+0x13a/0x300 [sunrpc]
[<0>] rpc_execute+0xc5/0xf0 [sunrpc]
[<0>] rpc_run_task+0x14d/0x1c0 [sunrpc]
[<0>] nfs4_proc_layoutreturn+0x14f/0x270 [nfsv4]
[<0>] pnfs_send_layoutreturn+0x119/0x190 [nfsv4]
[<0>] _pnfs_return_layout+0x1b6/0x280 [nfsv4]
[<0>] nfs4_evict_inode+0x6d/0x70 [nfsv4]
[<0>] evict+0xcc/0x1d0
[<0>] dispose_list+0x48/0x70
[<0>] evict_inodes+0x1a0/0x1b0
[<0>] generic_shutdown_super+0x37/0x100
[<0>] kill_anon_super+0x12/0x40
[<0>] nfs_kill_super+0x22/0x40 [nfs]
[<0>] deactivate_locked_super+0x2e/0xb0
[<0>] cleanup_mnt+0x100/0x160
[<0>] task_work_run+0x59/0x90
[<0>] do_exit+0x264/0x480
[<0>] do_group_exit+0x2d/0x90
[<0>] get_signal+0x839/0x860
[<0>] arch_do_signal_or_restart+0x25/0x100
[<0>] exit_to_user_mode_loop+0x9c/0x130
[<0>] exit_to_user_mode_prepare+0xb9/0x100
[<0>] syscall_exit_to_user_mode+0x12/0x40
[<0>] do_syscall_64+0x6b/0xe0
[<0>] entry_SYSCALL_64_after_hwframe+0x78/0x80
Ben Roberts