Re: [EXTERNAL] [PATCH v3 03/11] ceph: harden send_mds_reconnect and handle active-MDS peer reset

From: Alex Markuze

Date: Wed May 06 2026 - 07:34:42 EST

Hi Slava,

Thanks for the careful review. Let me address the questions that
need some discussion.

xa_destroy() thread safety

Good question. At this point in send_mds_reconnect() the session is
transitioning to RECONNECTING; async creates belong to the prior
session epoch and cannot start new ones once we enter reconnect. The
delegated ino list is stale after a
session reset since the MDS will not honor those delegations.
That said, I want to ensure I'm not missing a window you're seeing.

Could you describe the specific scenario you have in mind? Is it a
case where ceph_get_deleg_ino() is called from a path that doesn't
hold s_mutex?

mutex_lock(&mdsc->mutex) under session->s_mutex

The nesting here is s_mutex -> mdsc->mutex, which is the same order
used by cleanup_session_requests() and other existing paths.
The inverted order (mdsc->mutex -> s_mutex) appears in
check_new_map() but there it drops mdsc->mutex before taking s_mutex.
So this should not introduce a new lock inversion.

If you see a path where the opposite nesting can happen
concurrently, please point it out so I can understand it?

MDS state check narrowed from >= to ==

This is intentional. When the MDS is past RECONNECT (i.e. in REJOIN,
CLIENTREPLAY, or ACTIVE), it will reject CLIENT_RECONNECT messages
because it only accepts them during its own RECONNECT window. The old
code sent a doomed reconnect that
the MDS would reject or ignore. The new behavior tears the session
down locally and lets new requests re-open a fresh session, which is
the correct recovery for a connection drop against an active MDS.

That said, I understand the concern about changing established
behavior. If you think there are scenarios where sending a reconnect
to an MDS in REJOIN or CLIENTREPLAY could still succeed, I'd be
interested to hear them. My understanding
from the MDS side is that it won't accept them outside the RECONNECT phase.

Thanks,
Alex

On Thu, Apr 30, 2026 at 12:23 AM Viacheslav Dubeyko <vdubeyko@xxxxxxxxxx> wrote:
>
> On Wed, 2026-04-29 at 12:51 +0000, Alex Markuze wrote:
> > Change send_mds_reconnect() to return an error code so callers can detect
> > and report reconnect failures instead of silently ignoring them. Add early
> > bailout checks for sessions that are already closed, rejected, or
> > unregistered, which avoids sending reconnect messages for sessions that
> > can no longer be recovered.
> >
> > The early -ESTALE and -ENOENT bailouts use a separate fail_return label
> > that skips the pr_err_client diagnostic, since these codes indicate
> > expected concurrent-teardown races rather than genuine reconnect build
> > failures.
> >
> > Move the "reconnect start" log after the early-bailout checks so it
> > only appears for sessions that actually proceed with reconnect.
> >
> > Save the prior session state before transitioning to RECONNECTING,
> > and restore it in the failure path. Without this, a transient
> > build or encoding failure (-ENOMEM, -ENOSPC) strands the session
> > in RECONNECTING indefinitely because check_new_map() only retries
> > sessions in RESTARTING state.
> >
> > Rewrite mds_peer_reset() to handle the case where the MDS is past its
> > RECONNECT phase (i.e. active). An active MDS rejects CLIENT_RECONNECT
> > messages because it only accepts them during its own RECONNECT window
> > after restart. Previously, the client would send a doomed reconnect
> > that the MDS would reject or ignore. Now, the client tears the session
> > down locally and lets new requests re-open a fresh session, which is
> > the correct recovery for this scenario. The RECONNECTING state is
> > handled on the same teardown path, since the MDS will reject reconnect
> > attempts from an active client regardless of the session's local state.
> >
> > Add explicit cases for CLOSED and REJECTED session states in
> > mds_peer_reset() since these are terminal states where a connection
> > drop is expected behavior.
> >
> > The session teardown path in mds_peer_reset() follows the established
> > drop-and-reacquire locking pattern from check_new_map(): take
> > mdsc->mutex for session unregistration, release it, then take s->s_mutex
> > separately for cleanup. This avoids introducing a new simultaneous lock
> > nesting pattern.
> >
> > Log reconnect failures from check_new_map() and mds_peer_reset() at
> > pr_warn level rather than pr_err, since return codes like -ESTALE
> > (closed/rejected session) and -ENOENT (unregistered session) are
> > expected during concurrent teardown. Log dropped messages for
> > unregistered sessions via doutc() (dynamic debug) rather than
> > pr_info, as post-reset message arrival is routine and does not
> > warrant unconditional logging.
> >
> > Signed-off-by: Alex Markuze <amarkuze@xxxxxxxxxx>
> > ---
> > fs/ceph/mds_client.c | 169 +++++++++++++++++++++++++++++++++++++++----
> > 1 file changed, 155 insertions(+), 14 deletions(-)
> >
> > diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> > index 871f0eef468d..b62abae72c4c 100644
> > --- a/fs/ceph/mds_client.c
> > +++ b/fs/ceph/mds_client.c
> > @@ -4416,9 +4416,14 @@ static void handle_session(struct ceph_mds_session *session,
> > break;
> >
> > case CEPH_SESSION_REJECT:
> > - WARN_ON(session->s_state != CEPH_MDS_SESSION_OPENING);
> > - pr_info_client(cl, "mds%d rejected session\n",
> > - session->s_mds);
> > + WARN_ON(session->s_state != CEPH_MDS_SESSION_OPENING &&
> > + session->s_state != CEPH_MDS_SESSION_RECONNECTING);
> > + if (session->s_state == CEPH_MDS_SESSION_RECONNECTING)
> > + pr_info_client(cl, "mds%d reconnect rejected\n",
> > + session->s_mds);
> > + else
> > + pr_info_client(cl, "mds%d rejected session\n",
> > + session->s_mds);
> > session->s_state = CEPH_MDS_SESSION_REJECTED;
> > cleanup_session_requests(mdsc, session);
> > remove_session_caps(session);
> > @@ -4678,6 +4683,14 @@ static int reconnect_caps_cb(struct inode *inode, int mds, void *arg)
> > cap->mseq = 0; /* and migrate_seq */
> > cap->cap_gen = atomic_read(&cap->session->s_cap_gen);
> >
> > + /*
> > + * Note: CEPH_I_ERROR_FILELOCK is not set during reconnect.
> > + * Instead, locks are submitted for best-effort MDS reclaim
> > + * via the flock_len field below. If reclaim fails (e.g.,
> > + * another client grabbed a conflicting lock), future lock
> > + * operations will fail and set the error flag at that point.
> > + */
> > +
> > /* These are lost when the session goes away */
> > if (S_ISDIR(inode->i_mode)) {
> > if (cap->issued & CEPH_CAP_DIR_CREATE) {
> > @@ -4892,20 +4905,19 @@ static int encode_snap_realms(struct ceph_mds_client *mdsc,
> > *
> > * This is a relatively heavyweight operation, but it's rare.
> > */
> > -static void send_mds_reconnect(struct ceph_mds_client *mdsc,
> > - struct ceph_mds_session *session)
> > +static int send_mds_reconnect(struct ceph_mds_client *mdsc,
> > + struct ceph_mds_session *session)
> > {
> > struct ceph_client *cl = mdsc->fsc->client;
> > struct ceph_msg *reply;
> > int mds = session->s_mds;
> > int err = -ENOMEM;
> > + int old_state;
> > struct ceph_reconnect_state recon_state = {
> > .session = session,
> > };
> > LIST_HEAD(dispose);
> >
> > - pr_info_client(cl, "mds%d reconnect start\n", mds);
> > -
> > recon_state.pagelist = ceph_pagelist_alloc(GFP_NOFS);
> > if (!recon_state.pagelist)
> > goto fail_nopagelist;
> > @@ -4917,6 +4929,32 @@ static void send_mds_reconnect(struct ceph_mds_client *mdsc,
> > xa_destroy(&session->s_delegated_inos);
>
> Is xa_destroy(&session->s_delegated_inos) thread safe now? If an in-flight async
> create is mid-way through ceph_get_deleg_ino (which does xa_for_each + xa_erase)
> while xa_destroy runs on the same XArray from a different thread, that is
> unsafe. What do you think?
>
> >
> > mutex_lock(&session->s_mutex);
> > + if (session->s_state == CEPH_MDS_SESSION_CLOSED ||
> > + session->s_state == CEPH_MDS_SESSION_REJECTED) {
> > + pr_info_client(cl, "mds%d skipping reconnect, session %s\n",
> > + mds,
> > + ceph_session_state_name(session->s_state));
> > + mutex_unlock(&session->s_mutex);
> > + ceph_msg_put(reply);
> > + err = -ESTALE;
> > + goto fail_return;
> > + }
> > +
> > + mutex_lock(&mdsc->mutex);
>
> I started to guess is it safe to call mutex_lock(&mdsc->mutex) under session-
> >s_mutex lock? Could we introduce potential deadlock here?
>
> > + if (mds >= mdsc->max_sessions || mdsc->sessions[mds] != session) {
> > + mutex_unlock(&mdsc->mutex);
> > + pr_info_client(cl,
> > + "mds%d skipping reconnect, session unregistered\n",
> > + mds);
> > + mutex_unlock(&session->s_mutex);
> > + ceph_msg_put(reply);
> > + err = -ENOENT;
> > + goto fail_return;
> > + }
> > + mutex_unlock(&mdsc->mutex);
> > +
> > + pr_info_client(cl, "mds%d reconnect start\n", mds);
> > + old_state = session->s_state;
> > session->s_state = CEPH_MDS_SESSION_RECONNECTING;
> > session->s_seq = 0;
> >
> > @@ -5046,18 +5084,34 @@ static void send_mds_reconnect(struct ceph_mds_client *mdsc,
> >
> > up_read(&mdsc->snap_rwsem);
> > ceph_pagelist_release(recon_state.pagelist);
> > - return;
> > + return 0;
> >
> > fail:
> > ceph_msg_put(reply);
> > up_read(&mdsc->snap_rwsem);
> > + /*
> > + * Restore prior session state so map-driven reconnect logic
> > + * (check_new_map) can retry. Without this, a transient build
> > + * failure strands the session in RECONNECTING indefinitely.
> > + */
> > + session->s_state = old_state;
> > mutex_unlock(&session->s_mutex);
> > fail_nomsg:
> > ceph_pagelist_release(recon_state.pagelist);
> > fail_nopagelist:
> > pr_err_client(cl, "error %d preparing reconnect for mds%d\n",
> > err, mds);
> > - return;
> > + return err;
> > +
> > +fail_return:
> > + /*
> > + * Early-exit path for expected concurrent-teardown races
> > + * (-ESTALE for closed/rejected sessions, -ENOENT for
> > + * unregistered sessions). Skip the pr_err_client diagnostic
> > + * since these are not genuine reconnect build failures.
> > + */
> > + ceph_pagelist_release(recon_state.pagelist);
> > + return err;
> > }
> >
> >
> > @@ -5138,9 +5192,15 @@ static void check_new_map(struct ceph_mds_client *mdsc,
> > */
> > if (s->s_state == CEPH_MDS_SESSION_RESTARTING &&
> > newstate >= CEPH_MDS_STATE_RECONNECT) {
> > + int rc;
> > +
> > mutex_unlock(&mdsc->mutex);
> > clear_bit(i, targets);
> > - send_mds_reconnect(mdsc, s);
> > + rc = send_mds_reconnect(mdsc, s);
> > + if (rc)
> > + pr_warn_client(cl,
> > + "mds%d reconnect failed: %d\n",
> > + i, rc);
> > mutex_lock(&mdsc->mutex);
> > }
> >
> > @@ -5204,7 +5264,11 @@ static void check_new_map(struct ceph_mds_client *mdsc,
> > }
> > doutc(cl, "send reconnect to export target mds.%d\n", i);
> > mutex_unlock(&mdsc->mutex);
> > - send_mds_reconnect(mdsc, s);
> > + err = send_mds_reconnect(mdsc, s);
> > + if (err)
> > + pr_warn_client(cl,
> > + "mds%d export target reconnect failed: %d\n",
> > + i, err);
> > ceph_put_mds_session(s);
> > mutex_lock(&mdsc->mutex);
> > }
> > @@ -6284,12 +6348,87 @@ static void mds_peer_reset(struct ceph_connection *con)
> > {
> > struct ceph_mds_session *s = con->private;
> > struct ceph_mds_client *mdsc = s->s_mdsc;
> > + int session_state;
> >
> > pr_warn_client(mdsc->fsc->client, "mds%d closed our session\n",
> > s->s_mds);
> > - if (READ_ONCE(mdsc->fsc->mount_state) != CEPH_MOUNT_FENCE_IO &&
> > - ceph_mdsmap_get_state(mdsc->mdsmap, s->s_mds) >= CEPH_MDS_STATE_RECONNECT)
> > - send_mds_reconnect(mdsc, s);
> > +
> > + if (READ_ONCE(mdsc->fsc->mount_state) == CEPH_MOUNT_FENCE_IO ||
> > + ceph_mdsmap_get_state(mdsc->mdsmap, s->s_mds) < CEPH_MDS_STATE_RECONNECT)
> > + return;
> > +
> > + if (ceph_mdsmap_get_state(mdsc->mdsmap, s->s_mds) == CEPH_MDS_STATE_RECONNECT) {
>
> We had >= CEPH_MDS_STATE_RECONNECT before. When an MDS restarts, it moves
> through these states in order:
>
> RECONNECT (10) -> waits for clients to reconnect
> REJOIN (11) -> rejoins the distributed MDS cache
> CLIENTREPLAY (12) -> replays in-flight client operations
> ACTIVE (13) -> fully operational
>
> If the connection dropped and the MDS was in any of those four states, the
> client sent a CLIENT_RECONNECT message. REJOIN (11) and CLIENTREPLAY (12) now
> fall through to the teardown path, not the reconnect path.
>
> Thanks,
> Slava.
>
> > + int rc = send_mds_reconnect(mdsc, s);
> > +
> > + if (rc)
> > + pr_warn_client(mdsc->fsc->client,
> > + "mds%d reconnect failed: %d\n",
> > + s->s_mds, rc);
> > + return;
> > + }
> > +
> > + /*
> > + * MDS is active (past RECONNECT). It will not accept a
> > + * CLIENT_RECONNECT from us, so tear the session down locally
> > + * and let new requests re-open a fresh session.
> > + *
> > + * Snapshot session state with READ_ONCE, then revalidate under
> > + * mdsc->mutex before acting. The subsequent mdsc->mutex
> > + * section rechecks s_state to catch concurrent transitions, so
> > + * the lockless snapshot here is safe. s->s_mutex is taken
> > + * separately for cleanup after unregistration, which avoids
> > + * introducing a new s->s_mutex + mdsc->mutex nesting.
> > + */
> > + session_state = READ_ONCE(s->s_state);
> > +
> > + switch (session_state) {
> > + case CEPH_MDS_SESSION_RESTARTING:
> > + case CEPH_MDS_SESSION_RECONNECTING:
> > + case CEPH_MDS_SESSION_CLOSING:
> > + case CEPH_MDS_SESSION_OPEN:
> > + case CEPH_MDS_SESSION_HUNG:
> > + case CEPH_MDS_SESSION_OPENING:
> > + mutex_lock(&mdsc->mutex);
> > + if (s->s_mds >= mdsc->max_sessions ||
> > + mdsc->sessions[s->s_mds] != s ||
> > + s->s_state != session_state) {
> > + pr_info_client(mdsc->fsc->client,
> > + "mds%d state changed to %s during peer reset\n",
> > + s->s_mds,
> > + ceph_session_state_name(s->s_state));
> > + mutex_unlock(&mdsc->mutex);
> > + return;
> > + }
> > +
> > + ceph_get_mds_session(s);
> > + s->s_state = CEPH_MDS_SESSION_CLOSED;
> > + __unregister_session(mdsc, s);
> > + __wake_requests(mdsc, &s->s_waiting);
> > + mutex_unlock(&mdsc->mutex);
> > +
> > + mutex_lock(&s->s_mutex);
> > + cleanup_session_requests(mdsc, s);
> > + remove_session_caps(s);
> > + mutex_unlock(&s->s_mutex);
> > +
> > + wake_up_all(&mdsc->session_close_wq);
> > +
> > + mutex_lock(&mdsc->mutex);
> > + kick_requests(mdsc, s->s_mds);
> > + mutex_unlock(&mdsc->mutex);
> > +
> > + ceph_put_mds_session(s);
> > + break;
> > + case CEPH_MDS_SESSION_CLOSED:
> > + case CEPH_MDS_SESSION_REJECTED:
> > + break;
> > + default:
> > + pr_warn_client(mdsc->fsc->client,
> > + "mds%d peer reset in unexpected state %s\n",
> > + s->s_mds,
> > + ceph_session_state_name(session_state));
> > + break;
> > + }
> > }
> >
> > static void mds_dispatch(struct ceph_connection *con, struct ceph_msg *msg)
> > @@ -6301,6 +6440,8 @@ static void mds_dispatch(struct ceph_connection *con, struct ceph_msg *msg)
> >
> > mutex_lock(&mdsc->mutex);
> > if (__verify_registered_session(mdsc, s) < 0) {
> > + doutc(cl, "dropping tid %llu from unregistered session %d\n",
> > + le64_to_cpu(msg->hdr.tid), s->s_mds);
> > mutex_unlock(&mdsc->mutex);
> > goto out;
> > }
>