Re: [PATCH v1 04/13] ceph: fix race condition in cleanup_session_requests()

From: Viacheslav Dubeyko

Date: Thu Mar 12 2026 - 15:32:58 EST

On Thu, 2026-03-12 at 10:16 +0200, Ionut Nechita (Wind River) wrote:
> From: Ionut Nechita <ionut.nechita@xxxxxxxxxxxxx>
>
> When an MDS session is closed or reset, cleanup_session_requests()
> only unregisters requests that are on the session's s_unsafe list.
> However, requests are only added to s_unsafe after receiving an
> "unsafe" reply from the MDS.
> This creates a race condition: if a write request has been sent
> but the MDS becomes unavailable before sending the unsafe reply,
> the request will:
> - Have r_session set (points to the failed session)
> - Be in the request_tree
> - NOT be on s_unsafe list
> - Never have r_safe_completion signaled
> Meanwhile, flush_mdlog_and_wait_mdsc_unsafe_requests() iterates
> the request_tree looking for write requests with r_session set,
> and waits on r_safe_completion for each one. Since the request
> is not on s_unsafe, cleanup_session_requests() won't unregister
> it, and the completion is never signaled - causing an indefinite
> hang.
> This was observed in production when running xfstests generic/013
> in a loop, with stack traces showing:
> INFO: task fsstress:14466 blocked for more than 122 seconds.
> Call Trace:
> wait_for_completion+0x14a/0x340
> ceph_mdsc_sync+0x4b4/0xe80
> ceph_sync_fs+0xa0/0x4c0
> sync_filesystem+0x182/0x240
> Fix this by extending cleanup_session_requests() to also unregister
> requests that:
> - Belong to the closing session (r_session->s_mds matches)
> - Have NOT received an unsafe reply (CEPH_MDS_R_GOT_UNSAFE not set)
> - Have NOT received a safe reply (CEPH_MDS_R_GOT_SAFE not set)
> These are requests that were in-flight when the session failed and
> will never complete. Unregistering them signals r_safe_completion,
> unblocking any waiters.
> Requests that received an unsafe reply but not yet a safe reply
> are already on s_unsafe and handled by the existing code. For
> these, we preserve the original behavior of resetting r_attempts
> to allow re-sending when the session reconnects.
> Fixes: e3ec8d689cf4 ("ceph: clean up unsafe requests when reconnecting is denied")
> Signed-off-by: Ionut Nechita <ionut.nechita@xxxxxxxxxxxxx>
> ---
> fs/ceph/mds_client.c | 24 +++++++++++++++++++++---
> 1 file changed, 21 insertions(+), 3 deletions(-)
>
> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> index 37899464101f7..45abddd7f317e 100644
> --- a/fs/ceph/mds_client.c
> +++ b/fs/ceph/mds_client.c
> @@ -1792,6 +1792,8 @@ static void cleanup_session_requests(struct ceph_mds_client *mdsc,
>
> doutc(cl, "mds%d\n", session->s_mds);
> mutex_lock(&mdsc->mutex);
> +
> + /* First, handle requests on the unsafe list */
> while (!list_empty(&session->s_unsafe)) {
> req = list_first_entry(&session->s_unsafe,
> struct ceph_mds_request, r_unsafe_item);
> @@ -1803,14 +1805,30 @@ static void cleanup_session_requests(struct ceph_mds_client *mdsc,
> mapping_set_error(req->r_unsafe_dir->i_mapping, -EIO);
> __unregister_request(mdsc, req);
> }
> - /* zero r_attempts, so kick_requests() will re-send requests */
> +
> + /*
> + * Iterate through all pending requests for this session.
> + * Requests that haven't received an unsafe reply yet will never
> + * complete on this session - unregister them to signal waiters.
> + * Requests that got unsafe but not safe are handled above via
> + * s_unsafe list; for any remaining, reset r_attempts to allow
> + * re-sending when session reconnects.
> + */
> p = rb_first(&mdsc->request_tree);
> while (p) {
> req = rb_entry(p, struct ceph_mds_request, r_node);
> p = rb_next(p);
> if (req->r_session &&
> - req->r_session->s_mds == session->s_mds)
> - req->r_attempts = 0;
> + req->r_session->s_mds == session->s_mds) {
> + if (!test_bit(CEPH_MDS_R_GOT_UNSAFE, &req->r_req_flags) &&
> + !test_bit(CEPH_MDS_R_GOT_SAFE, &req->r_req_flags)) {
> + doutc(cl, " dropping pending request %llu\n",
> + req->r_tid);
> + __unregister_request(mdsc, req);
> + } else {
> + req->r_attempts = 0;
> + }
> + }
> }
> mutex_unlock(&mdsc->mutex);
> }

Nice fix.

Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@xxxxxxx>

Thanks,
Slava.