Re: [PATCH] ceph: fix hanging __ceph_get_caps() with stale `mds_wanted`

From: Viacheslav Dubeyko

Date: Tue Apr 28 2026 - 14:47:04 EST


On Mon, 2026-04-27 at 17:58 +0200, Max Kellermann wrote:
> A reader can hang forever in __ceph_get_caps() when the client no
> longer holds `FILE_RD`, but local cap state still says that the
> capability is already wanted (via `mds_wanted`).
>
> One way to trigger this is through MDS cap revocation. If another
> client performs a conflicting operation, the MDS can revoke `FILE_RD`
> from the reader; the next read then has to reacquire `FILE_RD`. If
> the cap update that should request `FILE_RD` never reaches the MDS
> after `cap->mds_wanted` was raised, the reader is left holding only
> non-file caps while local `mds_wanted` still includes the file read
> caps.
>
> In that state, try_get_cap_refs() sees `need <= mds_wanted` and
> returns 0, so __ceph_get_caps() just waits on `i_cap_wq`. If the cap
> update that was supposed to request `FILE_RD never reaches the MDS
> after `cap->mds_wanted was` raised, no further request is sent and the
> waiter can sleep indefinitely until unrelated cap traffic happens to
> wake it up.
>
> The ordering issue is that `cap->mds_wanted` is updated in
> __prep_cap() before the `CEPH_MSG_CLIENT_CAPS message` is actually
> queued for send. That makes one field serve two different meanings at
> once: what this client wants, and what the client believes the MDS
> already knows it wants.
>
> A proper fix would be to split those states and track whether a cap
> update is actually in flight or has been observed by the MDS.
> However, simply moving the `cap->mds_wanted assignment` later would
> not be sufficient: queueing the message in the messenger does not
> guarantee that the MDS processed that specific wanted set, and
> reconnect or message loss can still invalidate that assumption.
> Fixing that properly would require a larger rework of the cap state
> machine.
>
> To allow simpler backports to stable kernels, this patch implements a
> simpler workaround:
>
> - stop waiting forever in __ceph_get_caps(); after a bounded wait,
> fall back to the renew path
>
> - make ceph_renew_caps() issue a synchronous `OPEN` request whenever
> the inode still does not actually hold the wanted caps, instead of
> only calling ceph_check_caps()
>
> The extra issued-vs-wanted check in ceph_renew_caps() is necessary
> because the previous test only checked whether the inode still had any
> real caps at all. That is not enough after revocation: the client can
> still hold something like `pLs` and yet be missing `FILE_RD`
> completely. In that case, falling back to ceph_check_caps() is not
> sufficient, because it still trusts `cap->mds_wanted` and may resend
> nothing. By requiring `(issued & wanted) == wanted` before taking the
> asynchronous path, the code only uses ceph_check_caps() when the
> `wanted caps` are already actually issued. Otherwise, it sends the
> synchronous `OPEN` renew.
>
> This preserves the existing asynchronous fast path when the wanted
> caps are already issued, avoids changing cap-state semantics, and
> fixes the hang by guaranteeing that a stalled waiter eventually
> retries through a path that does not rely on the stale `mds_wanted`
> state.
>
> Cc: stable@xxxxxxxxxxxxxxx
> Signed-off-by: Max Kellermann <max.kellermann@xxxxxxxxx>
> ---
> fs/ceph/caps.c | 20 +++++++++++++++++---
> fs/ceph/file.c | 9 +++++----
> 2 files changed, 22 insertions(+), 7 deletions(-)
>
> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> index d51454e995a8..dd11611f250b 100644
> --- a/fs/ceph/caps.c
> +++ b/fs/ceph/caps.c
> @@ -3087,11 +3087,24 @@ int __ceph_get_caps(struct inode *inode, struct ceph_file_info *fi, int need,
> flags |= NON_BLOCKING;
> while (!(ret = try_get_cap_refs(inode, need, want,
> endoff, flags, &_got))) {
> + static const unsigned long wait_timeout = 5 * HZ;

Why exactly 5 * HZ? What is the basis for this timeout? Could we re-use any
available timeouts in CephFS declarations?

> +
> if (signal_pending(current)) {
> ret = -ERESTARTSYS;
> break;
> }
> - wait_woken(&wait, TASK_INTERRUPTIBLE, MAX_SCHEDULE_TIMEOUT);
> +
> + /*
> + * If a cap update is lost after
> + * mds_wanted was raised, waiting
> + * forever will never make progress.
> + * Retry the renew path periodically
> + * so we can resend synchronously.
> + */
> + if (!wait_woken(&wait, TASK_INTERRUPTIBLE, wait_timeout)) {
> + ret = -EUCLEAN;
> + break;
> + }
> }
>
> remove_wait_queue(&ci->i_cap_wq, &wait);
> @@ -3125,8 +3138,9 @@ int __ceph_get_caps(struct inode *inode, struct ceph_file_info *fi, int need,
> continue;
> }
> if (ret == -EUCLEAN) {
> - /* session was killed, try renew caps */
> - ret = ceph_renew_caps(inode, flags);
> + /* session was killed or a waited cap
> + * request needs a retry */
> + ret = ceph_renew_caps(inode, flags & CEPH_FILE_MODE_MASK);

Frankly speaking, I don't quite follow why do we need to add flags &
CEPH_FILE_MODE_MASK?

Thanks,
Slava.

> if (ret == 0)
> continue;
> }
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index d54d71669176..47c7d4a5ffed 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -314,7 +314,7 @@ static int ceph_init_file(struct inode *inode, struct file *file, int fmode)
> }
>
> /*
> - * try renew caps after session gets killed.
> + * Retry cap acquisition after a stale session or a lost cap update.
> */
> int ceph_renew_caps(struct inode *inode, int fmode)
> {
> @@ -322,14 +322,15 @@ int ceph_renew_caps(struct inode *inode, int fmode)
> struct ceph_client *cl = mdsc->fsc->client;
> struct ceph_inode_info *ci = ceph_inode(inode);
> struct ceph_mds_request *req;
> - int err, flags, wanted;
> + int err, flags, wanted, issued;
>
> spin_lock(&ci->i_ceph_lock);
> __ceph_touch_fmode(ci, mdsc, fmode);
> wanted = __ceph_caps_file_wanted(ci);
> + issued = __ceph_caps_issued(ci, NULL);
> if (__ceph_is_any_real_caps(ci) &&
> - (!(wanted & CEPH_CAP_ANY_WR) || ci->i_auth_cap)) {
> - int issued = __ceph_caps_issued(ci, NULL);
> + (!(wanted & CEPH_CAP_ANY_WR) || ci->i_auth_cap) &&
> + (issued & wanted) == wanted) {
> spin_unlock(&ci->i_ceph_lock);
> doutc(cl, "%p %llx.%llx want %s issued %s updating mds_wanted\n",
> inode, ceph_vinop(inode), ceph_cap_string(wanted),