Re: [PATCH v2 4/4] ceph: cap delegated inode count in ceph_parse_deleg_inos()
From: Viacheslav Dubeyko
Date: Mon Jun 29 2026 - 15:58:52 EST
On Sat, 2026-06-06 at 15:00 -0400, Michael Bommarito wrote:
> ceph_parse_deleg_inos() decodes interval sets of delegated inode
> numbers
> from an MDS create-with-delegation reply. For each set it reads a 64-
> bit
> start and a 64-bit len with ceph_decode_64_safe(), which only
> validates
> that the eight bytes are present in the message, not the value, and
> then
> loops over len while inserting entries into s_delegated_inos.
>
> len is fully attacker controlled. A malicious or compromised MDS can
> send
> one huge interval, many intervals in one reply, duplicate intervals,
> or
> repeated replies that accumulate delegated inodes on the same
> session.
> The v1 fix bounded only one interval and still allowed aggregate CPU
> and
> memory pressure.
>
> Bound both dimensions. Track the number of delegated inodes currently
> held by each MDS session, reject replies that would exceed that
> population cap, and also cap the total interval length accepted from
> one
> reply so duplicate ranges cannot spin through an attacker-controlled
> len
> while making no successful xarray inserts. The counter is decremented
> when async create consumes a delegated inode, incremented when a
> delegated inode is restored, initialized with the session xarray, and
> reset when reconnect destroys the xarray contents.
>
> The cap is a fixed, client-chosen constant rather than a value
> derived
> from the MDS. mds_client_prealloc_inos is a userspace MDS
> configuration
> option; it is never sent to the kernel client on the wire, and a
> server-supplied bound could not be trusted for a defensive limit in
> any
> case. The constant is set well above that option's documented default
> of
> 1000 (a generous multiple), so legitimate refill behavior is
> unaffected
> while the CPU and xarray memory a malformed delegation stream can
> consume
> stays bounded.
>
> Impact: a malicious or compromised Ceph MDS can no longer make a
> client
> spin through an unbounded delegated-inode interval or grow one
> session's
> delegated-inode xarray without limit.
>
> Fixes: d48464878708 ("ceph: decode interval_sets for delegated inos")
> Cc: stable@xxxxxxxxxxxxxxx
> Suggested-by: Viacheslav Dubeyko <Slava.Dubeyko@xxxxxxx>
> Assisted-by: Claude:claude-opus-4-8
> Signed-off-by: Michael Bommarito <michael.bommarito@xxxxxxxxx>
> ---
> v2:
> - Replace the per-interval cap with a per-session delegated-inode
> population counter.
> - Add a per-reply interval budget so duplicate ranges cannot run the
> insertion loop without increasing the population counter.
> - Reset the counter when reconnect destroys the delegated-inode
> xarray.
> - Document why the cap is a fixed client-side constant (the kernel
> client
> cannot read the userspace mds_client_prealloc_inos option).
>
> fs/ceph/mds_client.c | 51 ++++++++++++++++++++++++++++++++++++++----
> --
> fs/ceph/mds_client.h | 1 +
> fs/ceph/super.h | 9 ++++++++
> 3 files changed, 55 insertions(+), 6 deletions(-)
>
> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> index 4f36ac73305dc..ec5a80c9f86a7 100644
> --- a/fs/ceph/mds_client.c
> +++ b/fs/ceph/mds_client.c
> @@ -612,16 +612,34 @@ static int parse_reply_info_filelock(void **p,
> void *end,
>
> #define DELEGATED_INO_AVAILABLE xa_mk_value(1)
>
> +static int ceph_insert_deleg_ino(struct ceph_mds_session *s, u64
> ino)
static inline?
> +{
> + int err;
> +
> + if (atomic_inc_return(&s->s_num_deleg_inos) >
> CEPH_MAX_DELEG_INOS) {
> + atomic_dec(&s->s_num_deleg_inos);
The atomic_add_unless() sounds like a better way to manage this. What
do you think?
> + return -EOVERFLOW;
> + }
> +
> + err = xa_insert(&s->s_delegated_inos, ino,
> DELEGATED_INO_AVAILABLE,
> + GFP_KERNEL);
> + if (err)
> + atomic_dec(&s->s_num_deleg_inos);
> + return err;
> +}
> +
> static int ceph_parse_deleg_inos(void **p, void *end,
> struct ceph_mds_session *s)
> {
> struct ceph_client *cl = s->s_mdsc->fsc->client;
> + u64 msg_deleg_inos = 0;
> u32 sets;
>
> ceph_decode_32_safe(p, end, sets, bad);
> doutc(cl, "got %u sets of delegated inodes\n", sets);
> while (sets--) {
> u64 start, len;
> + int session_deleg_inos;
>
> ceph_decode_64_safe(p, end, start, bad);
> ceph_decode_64_safe(p, end, len, bad);
> @@ -633,16 +651,34 @@ static int ceph_parse_deleg_inos(void **p, void
> *end,
> start, len);
> continue;
> }
> +
> + if (len > CEPH_MAX_DELEG_INOS ||
> + msg_deleg_inos > CEPH_MAX_DELEG_INOS - len) {
The len is u64. Should we use the cast (u64) for CEPH_MAX_DELEG_INOS in
comparison and calculation?
> + pr_warn_ratelimited_client(cl, "too many
> delegated inodes in reply\n");
Should we show the number of delegated inodes in output?
> + return -EIO;
> + }
> +
> + session_deleg_inos = atomic_read(&s-
> >s_num_deleg_inos);
Is it possible that s_num_deleg_inos could be bigger than
CEPH_MAX_DELEG_INOS? Because, ceph_insert_deleg_ino() should not
increment it bigger than CEPH_MAX_DELEG_INOS.
> + if (session_deleg_inos > CEPH_MAX_DELEG_INOS ||
> + len > CEPH_MAX_DELEG_INOS - session_deleg_inos)
> {
> + pr_warn_ratelimited_client(cl, "too many
> delegated inodes for session\n");
> + return -EIO;
> + }
I assume that it is enough to have the check for msg_deleg_inos. Am I
wrong here?
> +
> + msg_deleg_inos += len;
> +
> while (len--) {
> - int err = xa_insert(&s->s_delegated_inos,
> start++,
> - DELEGATED_INO_AVAILABLE,
> - GFP_KERNEL);
> + int err = ceph_insert_deleg_ino(s, start++);
> +
> if (!err) {
> doutc(cl, "added delegated inode
> 0x%llx\n", start - 1);
Maybe, do we need to keep previous value in variable?
> } else if (err == -EBUSY) {
> pr_warn_client(cl,
> "MDS delegated inode 0x%llx
> more than once.\n",
> start - 1);
> + } else if (err == -EOVERFLOW) {
> + pr_warn_ratelimited_client(cl, "too
> many delegated inodes\n");
I assume that we already have error message from
ceph_insert_deleg_ino(). Do we need in this error message here?
Thanks,
Slava.
> + return -EIO;
> } else {
> return err;
> }
> @@ -660,16 +696,17 @@ u64 ceph_get_deleg_ino(struct ceph_mds_session
> *s)
>
> xa_for_each(&s->s_delegated_inos, ino, val) {
> val = xa_erase(&s->s_delegated_inos, ino);
> - if (val == DELEGATED_INO_AVAILABLE)
> + if (val == DELEGATED_INO_AVAILABLE) {
> + atomic_dec(&s->s_num_deleg_inos);
> return ino;
> + }
> }
> return 0;
> }
>
> int ceph_restore_deleg_ino(struct ceph_mds_session *s, u64 ino)
> {
> - return xa_insert(&s->s_delegated_inos, ino,
> DELEGATED_INO_AVAILABLE,
> - GFP_KERNEL);
> + return ceph_insert_deleg_ino(s, ino);
> }
> #else /* BITS_PER_LONG == 64 */
> /*
> @@ -1056,6 +1093,7 @@ static struct ceph_mds_session
> *register_session(struct ceph_mds_client *mdsc,
> INIT_LIST_HEAD(&s->s_waiting);
> INIT_LIST_HEAD(&s->s_unsafe);
> xa_init(&s->s_delegated_inos);
> + atomic_set(&s->s_num_deleg_inos, 0);
> INIT_LIST_HEAD(&s->s_cap_releases);
> INIT_WORK(&s->s_cap_release_work, ceph_cap_release_work);
>
> @@ -4971,6 +5009,7 @@ static void send_mds_reconnect(struct
> ceph_mds_client *mdsc,
> goto fail_nomsg;
>
> xa_destroy(&session->s_delegated_inos);
> + atomic_set(&session->s_num_deleg_inos, 0);
>
> mutex_lock(&session->s_mutex);
> session->s_state = CEPH_MDS_SESSION_RECONNECTING;
> diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
> index 4e6c87f8414cc..3810b51ece2e6 100644
> --- a/fs/ceph/mds_client.h
> +++ b/fs/ceph/mds_client.h
> @@ -255,6 +255,7 @@ struct ceph_mds_session {
> struct list_head s_waiting; /* waiting requests */
> struct list_head s_unsafe; /* unsafe requests */
> struct xarray s_delegated_inos;
> + atomic_t s_num_deleg_inos;
> };
>
> /*
> diff --git a/fs/ceph/super.h b/fs/ceph/super.h
> index afc89ce91804e..bfc26e4d83bb4 100644
> --- a/fs/ceph/super.h
> +++ b/fs/ceph/super.h
> @@ -634,6 +634,15 @@ static inline int ceph_ino_compare(struct inode
> *inode, void *data)
> #define CEPH_MDS_INO_LOG_OFFSET (2 * CEPH_MAX_MDS)
> #define CEPH_INO_SYSTEM_BASE ((6*CEPH_MAX_MDS) +
> (CEPH_MAX_MDS * CEPH_NUM_STRAY))
>
> +/*
> + * Upper bound on the number of delegated inodes a single MDS
> session may
> + * hold. The MDS normally hands out a small preallocation window
> (the
> + * userspace mds_client_prealloc_inos option defaults to 1000) and
> refills
> + * it as the client consumes entries. This leaves generous headroom
> while
> + * bounding the CPU and memory a malformed delegation interval can
> consume.
> + */
> +#define CEPH_MAX_DELEG_INOS 8192
> +
> static inline bool ceph_vino_is_reserved(const struct ceph_vino
> vino)
> {
> if (vino.ino >= CEPH_INO_SYSTEM_BASE ||