RE: [PATCH] fs/ceph/dir: fix `i_nlink` underrun during async unlink

From: Viacheslav Dubeyko

Date: Tue Feb 24 2026 - 15:10:41 EST

On Tue, 2026-02-24 at 14:33 +0100, Max Kellermann wrote:
> There hasn't been a review of this patch and the bug is still present
> in 6.19 / 7.0rc.
>
> On Fri, Sep 5, 2025 at 11:15 PM Max Kellermann <max.kellermann@xxxxxxxxx> wrote:
> >
> > During async unlink, we drop the `i_nlink` counter before we receive
> > the completion (that will eventually update the `i_nlink`) because "we
> > assume that the unlink will succeed". That is not a bad idea, but it
> > races against deletions by other clients (or against the completion of
> > our own unlink) and can lead to an underrun which emits a WARNING like
> > this one:
> >
> > WARNING: CPU: 85 PID: 25093 at fs/inode.c:407 drop_nlink+0x50/0x68
> > Modules linked in:
> > CPU: 85 UID: 3221252029 PID: 25093 Comm: php-cgi8.1 Not tainted 6.14.11-cm4all1-ampere #655
> > Hardware name: Supermicro ARS-110M-NR/R12SPD-A, BIOS 1.1b 10/17/2023
> > pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> > pc : drop_nlink+0x50/0x68
> > lr : ceph_unlink+0x6c4/0x720
> > sp : ffff80012173bc90
> > x29: ffff80012173bc90 x28: ffff086d0a45aaf8 x27: ffff0871d0eb5680
> > x26: ffff087f2a64a718 x25: 0000020000000180 x24: 0000000061c88647
> > x23: 0000000000000002 x22: ffff07ff9236d800 x21: 0000000000001203
> > x20: ffff07ff9237b000 x19: ffff088b8296afc0 x18: 00000000f3c93365
> > x17: 0000000000070000 x16: ffff08faffcbdfe8 x15: ffff08faffcbdfec
> > x14: 0000000000000000 x13: 45445f65645f3037 x12: 34385f6369706f74
> > x11: 0000a2653104bb20 x10: ffffd85f26d73290 x9 : ffffd85f25664f94
> > x8 : 00000000000000c0 x7 : 0000000000000000 x6 : 0000000000000002
> > x5 : 0000000000000081 x4 : 0000000000000481 x3 : 0000000000000000
> > x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff08727d3f91e8
> > Call trace:
> > drop_nlink+0x50/0x68 (P)
> > vfs_unlink+0xb0/0x2e8
> > do_unlinkat+0x204/0x288
> > __arm64_sys_unlinkat+0x3c/0x80
> > invoke_syscall.constprop.0+0x54/0xe8
> > do_el0_svc+0xa4/0xc8
> > el0_svc+0x18/0x58
> > el0t_64_sync_handler+0x104/0x130
> > el0t_64_sync+0x154/0x158
> >
> > In ceph_unlink(), a call to ceph_mdsc_submit_request() submits the
> > CEPH_MDS_OP_UNLINK to the MDS, but does not wait for completion.
> >
> > Meanwhile, between this call and the following drop_nlink() call, a
> > worker thread may process a CEPH_CAP_OP_IMPORT, CEPH_CAP_OP_GRANT or
> > just a CEPH_MSG_CLIENT_REPLY (the latter of which could be our own
> > completion). These will lead to a set_nlink() call, updating the
> > `i_nlink` counter to the value received from the MDS. If that new
> > `i_nlink` value happens to be zero, it is illegal to decrement it
> > further. But that is exactly what ceph_unlink() will do then.
> >
> > The WARNING can be reproduced this way:
> >
> > 1. Force async unlink; only the async code path is affected. Having
> > no real clue about Ceph internals, I was unable to find out why the
> > MDS wouldn't give me the "Fxr" capabilities, so I patched
> > get_caps_for_async_unlink() to always succeed.
> >
> > (Note that the WARNING dump above was found on an unpatched kernel,
> > without this kludge - this is not a theoretical bug.)
> >
> > 2. Add a sleep call after ceph_mdsc_submit_request() so the unlink
> > completion gets handled by a worker thread before drop_nlink() is
> > called. This guarantees that the `i_nlink` is already zero before
> > drop_nlink() runs.
> >
> > The solution is to skip the counter decrement when it is already zero,
> > but doing so without a lock is still racy (TOCTOU). Since
> > ceph_fill_inode() and handle_cap_grant() both hold the
> > `ceph_inode_info.i_ceph_lock` spinlock while set_nlink() runs, this
> > seems like the proper lock to protect the `i_nlink` updates.
> >
> > I found prior art in NFS and SMB (using `inode.i_lock`) and AFS (using
> > `afs_vnode.cb_lock`). All three have the zero check as well.
> >
> > Fixes: 2ccb45462aea ("ceph: perform asynchronous unlink if we have sufficient caps")
> > Cc: stable@xxxxxxxxxxxxxxx
> > Signed-off-by: Max Kellermann <max.kellermann@xxxxxxxxx>
> > ---
> > fs/ceph/dir.c | 15 ++++++++++++++-
> > 1 file changed, 14 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
> > index 8478e7e75df6..67f04e23f78a 100644
> > --- a/fs/ceph/dir.c
> > +++ b/fs/ceph/dir.c
> > @@ -1341,6 +1341,7 @@ static int ceph_unlink(struct inode *dir, struct dentry *dentry)
> > struct ceph_client *cl = fsc->client;
> > struct ceph_mds_client *mdsc = fsc->mdsc;
> > struct inode *inode = d_inode(dentry);
> > + struct ceph_inode_info *ci = ceph_inode(inode);
> > struct ceph_mds_request *req;
> > bool try_async = ceph_test_mount_opt(fsc, ASYNC_DIROPS);
> > struct dentry *dn;
> > @@ -1427,7 +1428,19 @@ static int ceph_unlink(struct inode *dir, struct dentry *dentry)
> > * We have enough caps, so we assume that the unlink
> > * will succeed. Fix up the target inode and dcache.
> > */
> > - drop_nlink(inode);
> > +
> > + /*
> > + * Protect the i_nlink update with i_ceph_lock
> > + * to precent racing against ceph_fill_inode()
> > + * handling our completion on a worker thread
> > + * and don't decrement if i_nlink has already
> > + * been updated to zero by this completion.
> > + */
> > + spin_lock(&ci->i_ceph_lock);
> > + if (inode->i_nlink > 0)
> > + drop_nlink(inode);
> > + spin_unlock(&ci->i_ceph_lock);
> > +
> > d_delete(dentry);
> > } else {
> > spin_lock(&fsc->async_unlink_conflict_lock);
> > --
> > 2.47.2
> >
>

Makes sense.

Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@xxxxxxx>

Thanks,
Slava.