[PATCH v1 04/13] ceph: fix race condition in cleanup_session_requests()

From: Ionut Nechita (Wind River)

Date: Thu Mar 12 2026 - 04:17:00 EST


From: Ionut Nechita <ionut.nechita@xxxxxxxxxxxxx>

When an MDS session is closed or reset, cleanup_session_requests()
only unregisters requests that are on the session's s_unsafe list.
However, requests are only added to s_unsafe after receiving an
"unsafe" reply from the MDS.

This creates a race condition: if a write request has been sent
but the MDS becomes unavailable before sending the unsafe reply,
the request will:

- Have r_session set (pointing to the failed session)
- Be in the request_tree
- NOT be on the s_unsafe list
- Never have r_safe_completion signaled

Meanwhile, flush_mdlog_and_wait_mdsc_unsafe_requests() iterates
the request_tree looking for write requests with r_session set,
and waits on r_safe_completion for each one. Since the request
is not on s_unsafe, cleanup_session_requests() won't unregister
it, and the completion is never signaled, causing an indefinite
hang.

This was observed in production when running xfstests generic/013
in a loop, with stack traces showing:

  INFO: task fsstress:14466 blocked for more than 122 seconds.
  Call Trace:
   wait_for_completion+0x14a/0x340
   ceph_mdsc_sync+0x4b4/0xe80
   ceph_sync_fs+0xa0/0x4c0
   sync_filesystem+0x182/0x240

Fix this by extending cleanup_session_requests() to also unregister
requests that:

- Belong to the closing session (r_session->s_mds matches)
- Have NOT received an unsafe reply (CEPH_MDS_R_GOT_UNSAFE not set)
- Have NOT received a safe reply (CEPH_MDS_R_GOT_SAFE not set)

These are requests that were in-flight when the session failed and
will never complete. Unregistering them signals r_safe_completion,
unblocking any waiters.

Requests that received an unsafe reply but not yet a safe reply
are already on s_unsafe and handled by the existing code. For
these, we preserve the original behavior of resetting r_attempts
to allow re-sending when the session reconnects.
Fixes: e3ec8d689cf4 ("ceph: clean up unsafe requests when reconnecting is denied")
Signed-off-by: Ionut Nechita <ionut.nechita@xxxxxxxxxxxxx>
---
fs/ceph/mds_client.c | 24 +++++++++++++++++++++---
1 file changed, 21 insertions(+), 3 deletions(-)

diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 37899464101f7..45abddd7f317e 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -1792,6 +1792,8 @@ static void cleanup_session_requests(struct ceph_mds_client *mdsc,
 
 	doutc(cl, "mds%d\n", session->s_mds);
 	mutex_lock(&mdsc->mutex);
+
+	/* First, handle requests on the unsafe list */
 	while (!list_empty(&session->s_unsafe)) {
 		req = list_first_entry(&session->s_unsafe,
 				       struct ceph_mds_request, r_unsafe_item);
@@ -1803,14 +1805,30 @@ static void cleanup_session_requests(struct ceph_mds_client *mdsc,
 			mapping_set_error(req->r_unsafe_dir->i_mapping, -EIO);
 		__unregister_request(mdsc, req);
 	}
-	/* zero r_attempts, so kick_requests() will re-send requests */
+
+	/*
+	 * Iterate through all pending requests for this session.
+	 * Requests that haven't received an unsafe reply yet will never
+	 * complete on this session - unregister them to signal waiters.
+	 * Requests that got unsafe but not safe are handled above via
+	 * s_unsafe list; for any remaining, reset r_attempts to allow
+	 * re-sending when session reconnects.
+	 */
 	p = rb_first(&mdsc->request_tree);
 	while (p) {
 		req = rb_entry(p, struct ceph_mds_request, r_node);
 		p = rb_next(p);
 		if (req->r_session &&
-		    req->r_session->s_mds == session->s_mds)
-			req->r_attempts = 0;
+		    req->r_session->s_mds == session->s_mds) {
+			if (!test_bit(CEPH_MDS_R_GOT_UNSAFE, &req->r_req_flags) &&
+			    !test_bit(CEPH_MDS_R_GOT_SAFE, &req->r_req_flags)) {
+				doutc(cl, " dropping pending request %llu\n",
+				      req->r_tid);
+				__unregister_request(mdsc, req);
+			} else {
+				req->r_attempts = 0;
+			}
+		}
 	}
 	mutex_unlock(&mdsc->mutex);
 }
--
2.53.0