Re: [PATCH] [RFC] NFSv4.1: slot table draining + memory reclaim can deadlock state manager creation

From: Wang Zhaolong

Date: Mon Jan 05 2026 - 23:11:26 EST


Hi Trond,

Thanks a lot for taking the time to look into this and for the patch proposal.


I think we need to treat nfs_release_folio() as being different from
nfs_wb_folio() altogether. There are too many different ways in which
waiting on I/O can cause deadlocks, including waiting on the writeback
to complete. My suggestion is therefore that we just make this simple,
and see it as a hint that we should kick off either writeback or a
commit, but not that we should be waiting for it to complete.
So how about the following?


I agree with the overall direction.Avoiding any synchronous waits from
->release_folio() makes a lot of sense

8<------------------------------------------------------------------
From 3533434037066b610d50e7bd36f3525ace296928 Mon Sep 17 00:00:00 2001
Message-ID: <3533434037066b610d50e7bd36f3525ace296928.1767204181.git.trond.myklebust@xxxxxxxxxxxxxxx>
From: Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx>
Date: Wed, 31 Dec 2025 11:42:31 -0500
Subject: [PATCH] NFS: Fix a deadlock involving nfs_release_folio()

Wang Zhaolong reports a deadlock involving NFSv4.1 state recovery
waiting on kthreadd, which is attempting to reclaim memory by calling
nfs_release_folio(). The latter cannot make progress due to state
recovery being needed.

It seems that the only safe thing to do here is to kick off a writeback
of the folio, without waiting for completion, or else kicking off an
asynchronous commit.

Reported-by: Wang Zhaolong <wangzhaolong@xxxxxxxxxxxxxxx>
Fixes: 96780ca55e3c ("NFS: fix up nfs_release_folio() to try to release the page")
Signed-off-by: Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx>
---
fs/nfs/file.c | 3 ++-
fs/nfs/nfstrace.h | 3 +++
fs/nfs/write.c | 32 ++++++++++++++++++++++++++++++++
include/linux/nfs_fs.h | 1 +
4 files changed, 38 insertions(+), 1 deletion(-)

diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index d020aab40c64..d1c138a416cf 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -511,7 +511,8 @@ static bool nfs_release_folio(struct folio *folio, gfp_t gfp)
if ((current_gfp_context(gfp) & GFP_KERNEL) != GFP_KERNEL ||
current_is_kswapd() || current_is_kcompactd())
return false;
- if (nfs_wb_folio(folio->mapping->host, folio) < 0)
+ if (nfs_wb_folio_reclaim(folio->mapping->host, folio) < 0 ||
+ folio_test_private(folio))
return false;
}
return nfs_fscache_release_folio(folio, gfp);
diff --git a/fs/nfs/nfstrace.h b/fs/nfs/nfstrace.h
index 6ce55e8e6b67..9f9ce4a565ea 100644
--- a/fs/nfs/nfstrace.h
+++ b/fs/nfs/nfstrace.h
@@ -1062,6 +1062,9 @@ DECLARE_EVENT_CLASS(nfs_folio_event_done,
DEFINE_NFS_FOLIO_EVENT(nfs_aop_readpage);
DEFINE_NFS_FOLIO_EVENT_DONE(nfs_aop_readpage_done);
+DEFINE_NFS_FOLIO_EVENT(nfs_writeback_folio_reclaim);
+DEFINE_NFS_FOLIO_EVENT_DONE(nfs_writeback_folio_reclaim_done);
+
DEFINE_NFS_FOLIO_EVENT(nfs_writeback_folio);
DEFINE_NFS_FOLIO_EVENT_DONE(nfs_writeback_folio_done);
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 336c510f3750..5de9ec6c72a2 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -2024,6 +2024,38 @@ int nfs_wb_folio_cancel(struct inode *inode, struct folio *folio)
return ret;
}
+/**
+ * nfs_wb_folio_reclaim - Write back all requests on one page
+ * @inode: pointer to page
+ * @folio: pointer to folio
+ *
+ * Assumes that the folio has been locked by the caller
+ */
+int nfs_wb_folio_reclaim(struct inode *inode, struct folio *folio)
+{
+ loff_t range_start = folio_pos(folio);
+ size_t len = folio_size(folio);
+ struct writeback_control wbc = {
+ .sync_mode = WB_SYNC_ALL,
+ .nr_to_write = 0,
+ .range_start = range_start,
+ .range_end = range_start + len - 1,
+ .for_sync = 1,
+ };
+ int ret;

One small thing: in nfs_wb_folio_reclaim(), the else
nfs_commit_inode(inode, 0); branch seems to return an
uninitialized ret (unless I missed something).

+
+ if (folio_test_writeback(folio))
+ return -EBUSY;
+ if (folio_clear_dirty_for_io(folio)) {
+ trace_nfs_writeback_folio_reclaim(inode, range_start, len);
+ ret = nfs_writepage_locked(folio, &wbc);
+ trace_nfs_writeback_folio_reclaim_done(inode, range_start, len,
+ ret);
+ } else
+ nfs_commit_inode(inode, 0);

Maybe init ret = 0, or return the value from
nfs_commit_inode() for tracing.

+ return ret;
+}
+
/**
* nfs_wb_folio - Write back all requests on one page
* @inode: pointer to page
diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index a6624edb7226..8dd79a3f3d66 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -637,6 +637,7 @@ extern int nfs_update_folio(struct file *file, struct folio *folio,
extern int nfs_sync_inode(struct inode *inode);
extern int nfs_wb_all(struct inode *inode);
extern int nfs_wb_folio(struct inode *inode, struct folio *folio);
+extern int nfs_wb_folio_reclaim(struct inode *inode, struct folio *folio);
int nfs_wb_folio_cancel(struct inode *inode, struct folio *folio);
extern int nfs_commit_inode(struct inode *, int);
extern struct nfs_commit_data *nfs_commitdata_alloc(void);

Happy to test this with my reproducer once you post it.

Thanks,
Wang Zhaolong