Re: [RFC PATCH v3] ceph: prevent a client from exceeding the MDS maximum xattr size

From: Luís Henriques
Date: Thu Jun 02 2022 - 06:27:57 EST


Xiubo Li <xiubli@xxxxxxxxxx> writes:

> On 6/2/22 5:26 PM, Luís Henriques wrote:
>> Xiubo Li <xiubli@xxxxxxxxxx> writes:
>>
>>> On 6/2/22 12:29 AM, Luís Henriques wrote:
>>>> The MDS tries to enforce a limit on the total key/values in extended
>>>> attributes. However, this limit is enforced only if doing a synchronous
>>>> operation (MDS_OP_SETXATTR) -- if we're buffering the xattrs, the MDS
>>>> doesn't have a chance to enforce these limits.
>>>>
>>>> This patch adds support for decoding the xattrs maximum size setting that is
>>>> distributed in the mdsmap. Then, when setting an xattr, the kernel client
>>>> will revert to do a synchronous operation if that maximum size is exceeded.
>>>>
>>>> While there, fix a dout() that would trigger a printk warning:
>>>>
>>>> [ 98.718078] ------------[ cut here ]------------
>>>> [ 98.719012] precision 65536 too large
>>>> [ 98.719039] WARNING: CPU: 1 PID: 3755 at lib/vsprintf.c:2703 vsnprintf+0x5e3/0x600
>>>> ...
>>>>
>>>> URL: https://tracker.ceph.com/issues/55725
>>>> Signed-off-by: Luís Henriques <lhenriques@xxxxxxx>
>>>> ---
>>>> fs/ceph/mdsmap.c | 27 +++++++++++++++++++++++----
>>>> fs/ceph/xattr.c | 12 ++++++++----
>>>> include/linux/ceph/mdsmap.h | 1 +
>>>> 3 files changed, 32 insertions(+), 8 deletions(-)
>>>>
>>>> * Changes since v2
>>>>
>>>> Well, a lot has changed since v2! Now the xattr max value setting is
>>>> obtained through the mdsmap, which needs to be decoded, and the feature
>>>> that was used in the previous revision was dropped. The drawback is that
>>>> the MDS isn't unable to know in advance if a client is aware of this xattr
>>>> max value.
>>>>
>>>> * Changes since v1
>>>>
>>>> Added support for new feature bit to get the MDS max_xattr_pairs_size
>>>> setting.
>>>>
>>>> Also note that this patch relies on a patch that hasn't been merged yet
>>>> ("ceph: use correct index when encoding client supported features"),
>>>> otherwise the new feature bit won't be correctly encoded.
>>>>
>>>> diff --git a/fs/ceph/mdsmap.c b/fs/ceph/mdsmap.c
>>>> index 30387733765d..36b2bc18ca2a 100644
>>>> --- a/fs/ceph/mdsmap.c
>>>> +++ b/fs/ceph/mdsmap.c
>>>> @@ -13,6 +13,12 @@
>>>> #include "super.h"
>>>> +/*
>>>> + * Maximum size of xattrs the MDS can handle per inode by default. This
>>>> + * includes the attribute name and 4+4 bytes for the key/value sizes.
>>>> + */
>>>> +#define MDS_MAX_XATTR_SIZE (1<<16) /* 64K */
>>>> +
>>>> #define CEPH_MDS_IS_READY(i, ignore_laggy) \
>>>> (m->m_info[i].state > 0 && ignore_laggy ? true : !m->m_info[i].laggy)
>>>> @@ -352,12 +358,10 @@ struct ceph_mdsmap *ceph_mdsmap_decode(void **p, void
>>>> *end, bool msgr2)
>>>> __decode_and_drop_type(p, end, u8, bad_ext);
>>>> }
>>>> if (mdsmap_ev >= 8) {
>>>> - u32 name_len;
>>>> /* enabled */
>>>> ceph_decode_8_safe(p, end, m->m_enabled, bad_ext);
>>>> - ceph_decode_32_safe(p, end, name_len, bad_ext);
>>>> - ceph_decode_need(p, end, name_len, bad_ext);
>>>> - *p += name_len;
>>>> + /* fs_name */
>>>> + ceph_decode_skip_string(p, end, bad_ext);
>>>> }
>>>> /* damaged */
>>>> if (mdsmap_ev >= 9) {
>>>> @@ -370,6 +374,21 @@ struct ceph_mdsmap *ceph_mdsmap_decode(void **p, void *end, bool msgr2)
>>>> } else {
>>>> m->m_damaged = false;
>>>> }
>>>> + if (mdsmap_ev >= 17) {
>>>> + /* balancer */
>>>> + ceph_decode_skip_string(p, end, bad_ext);
>>>> + /* standby_count_wanted */
>>>> + ceph_decode_skip_32(p, end, bad_ext);
>>>> + /* old_max_mds */
>>>> + ceph_decode_skip_32(p, end, bad_ext);
>>>> + /* min_compat_client */
>>>> + ceph_decode_skip_8(p, end, bad_ext);
>>> This is incorrect.
>>>
>>> If mdsmap_ev == 15 the min_compat_client will be a feature_bitset_t instead of
>>> int8_t.
>> Hmm... can you point me at where that's done in the code? As usual, I'm
>> confused with that code and simply can't see that.
>>
>> Also, if that happens only when mdsmap_ev == 15, then there's no problem
>> because that branch is only taken if it's >= 17.
>
> Yeah, so you should skip 32 or 32+64 bits instead here, just likes:
>
> 3536                 /* version >= 3, feature bits */
> 3537                 ceph_decode_32_safe(&p, end, len, bad);
> 3538                 if (len) {
> 3539                         ceph_decode_64_safe(&p, end, features, bad);
> 3540                         p += len - sizeof(features);
> 3541                 }
>
> For the ceph code please see:
>
> Please see https://github.com/ceph/ceph/blob/main/src/mds/MDSMap.cc#L925.

I still don't see what your saying. From what I understand, with <= 15 we
used to have 'min_compat_client', which is of type 'ceph_release_t',
defined in src/common/ceph_releases.h:

enum class ceph_release_t : std::uint8_t {
...
}

Then, starting with >= 16 the MDS ignores this 'min_compat_client' field
(but still encodes/decodes it), and it *adds* 'required_client_features',
which is a 'feature_bitset_t' and that is decoded immediately after (see
bellow, the ceph_decode_skip_set() call).

Cheers,
--
Luís

>>>
>>>> + /* required_client_features */
>>>> + ceph_decode_skip_set(p, end, 64, bad_ext);
>>>> + ceph_decode_64_safe(p, end, m->m_max_xattr_size, bad_ext);
>>>> + } else {
>>>> + m->m_max_xattr_size = MDS_MAX_XATTR_SIZE;
>>>> + }
>>>> bad_ext:
>>>> dout("mdsmap_decode m_enabled: %d, m_damaged: %d, m_num_laggy: %d\n",
>>>> !!m->m_enabled, !!m->m_damaged, m->m_num_laggy);
>>>> diff --git a/fs/ceph/xattr.c b/fs/ceph/xattr.c
>>>> index 8c2dc2c762a4..67f046dac35c 100644
>>>> --- a/fs/ceph/xattr.c
>>>> +++ b/fs/ceph/xattr.c
>>>> @@ -1086,7 +1086,7 @@ static int ceph_sync_setxattr(struct inode *inode, const char *name,
>>>> flags |= CEPH_XATTR_REMOVE;
>>>> }
>>>> - dout("setxattr value=%.*s\n", (int)size, value);
>>>> + dout("setxattr value size: %ld\n", size);
>>>> /* do request */
>>>> req = ceph_mdsc_create_request(mdsc, op, USE_AUTH_MDS);
>>>> @@ -1184,8 +1184,14 @@ int __ceph_setxattr(struct inode *inode, const char *name,
>>>> spin_lock(&ci->i_ceph_lock);
>>>> retry:
>>>> issued = __ceph_caps_issued(ci, NULL);
>>>> - if (ci->i_xattrs.version == 0 || !(issued & CEPH_CAP_XATTR_EXCL))
>>>> + required_blob_size = __get_required_blob_size(ci, name_len, val_len);
>>>> + if ((ci->i_xattrs.version == 0) || !(issued & CEPH_CAP_XATTR_EXCL) ||
>>>> + (required_blob_size >= mdsc->mdsmap->m_max_xattr_size)) {
>>> Shouldn't it be '>' instead ?
>> Ok, I'll fix that.
>>
>>> We'd better always force to do a sync request with old ceph. Just check if the
>>> mdsmap_ev < 17. It's not safe to buffer it because it maybe discarded as your
>>> ceph PR does.
>> Right, that can be done. So, I can simply set the m_max_xattr_size to '0'
>> if mdsmap_ev < 17. Then, this 'if' condition will always be evaluated to
>> true because required_blob_size will be > 0. Does that sound OK?
>
> Yeah, sounds good.
>
> -- Xiubo
>
>
>>
>> Cheers,
>