Re: [PATCH v3 1/3] ocfs2: give ocfs2 the ability to reclaim suballoc free bg

From: Heming Zhao
Date: Tue Oct 08 2024 - 22:57:05 EST


On 10/8/24 15:16, Joseph Qi wrote:


On 9/8/24 10:07 PM, Heming Zhao wrote:
The current ocfs2 code can't reclaim suballocator block group space.
This cause ocfs2 to hold onto a lot of space in some cases. for example,
when creating lots of small files, the space is held/managed by
'//inode_alloc'. After the user deletes all the small files, the space
never returns to '//global_bitmap'. This issue prevents ocfs2 from
providing the needed space even when there is enough free space in a
small ocfs2 volume.
This patch gives ocfs2 the ability to reclaim suballoc free space when
the block group is freed. For performance reasons, this patch keeps
the first suballocator block group.

Signed-off-by: Heming Zhao <heming.zhao@xxxxxxxx>
Reviewed-by: Su Yue <glass.su@xxxxxxxx>
---
fs/ocfs2/suballoc.c | 302 ++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 292 insertions(+), 10 deletions(-)

diff --git a/fs/ocfs2/suballoc.c b/fs/ocfs2/suballoc.c
index f7b483f0de2a..d62010166c34 100644
--- a/fs/ocfs2/suballoc.c
+++ b/fs/ocfs2/suballoc.c
@@ -294,6 +294,68 @@ static int ocfs2_validate_group_descriptor(struct super_block *sb,
return ocfs2_validate_gd_self(sb, bh, 0);
}
+/*
+ * hint gd may already be released in _ocfs2_free_suballoc_bits(),
+ * we first check gd descriptor signature, then do the
+ * ocfs2_read_group_descriptor() jobs.
+ *
+ * When the group descriptor is invalid, we return 'rc=0' and
+ * '*released=1'. The caller should handle this case. Otherwise,
+ * we return the real error code.
+ */
+static int ocfs2_read_hint_group_descriptor(struct inode *inode,
+ struct ocfs2_dinode *di, u64 gd_blkno,
+ struct buffer_head **bh, int *released)
+{
+ int rc;
+ struct buffer_head *tmp = *bh;
+ struct ocfs2_group_desc *gd;
+
+ *released = 0;

I'd like the caller is responsible for the initialization.

OK.


+
+ rc = ocfs2_read_block(INODE_CACHE(inode), gd_blkno, &tmp, NULL);
+ if (rc)
+ goto out;
+
+ gd = (struct ocfs2_group_desc *) tmp->b_data;
+ if (!OCFS2_IS_VALID_GROUP_DESC(gd)) {

How to distinguish the release case or a bug?

Good question.

Before this patch, OCFS2 never releases suballocator space.
The ocfs2_read_group_descriptor() doesn't need to handle the
case of reading a bad 'struct ocfs2_group_desc'.

After this patch, there is a gap between
_ocfs2_free_suballoc_bits() and ocfs2_read_hint_group_descriptor().


thread A thread B
-------------------------------------------------------------
ocfs2_claim_suballoc_bits
hint is not zero
ocfs2_search_one_group
+ ocfs2_read_hint_group_descriptor
| + OCFS2_IS_VALID_GROUP_DESC(gd)
| returns true
| _ocfs2_free_suballoc_bits
+ ... + free the last bit of gd
| + release gd
+ ocfs2_block_group_set_bits
uses released gd, data corruption
--------------------------------------------------------------

I plan to introduce a new cache_info flag 'OCFS2_CACHE_CLEAN_GD' to protect this case.
e.g. (just demo, not tested)


thread A thread B
-------------------------------------------------------------
ocfs2_read_hint_group_descriptor()
ocfs2_read_block()

//protect code begin
ci = INODE_CACHE(inode);
ocfs2_metadata_cache_io_lock(ci);
if (ci->ci_flags & OCFS2_CACHE_CLEAN_GD)
goto free_bh;
ocfs2_metadata_cache_io_unlock(ci);
//protect code end

gd = (struct ocfs2_group_desc *) tmp->b_data;
if (!OCFS2_IS_VALID_GROUP_DESC(gd)) {
... ...
}

_ocfs2_free_suballoc_bits()
... ...
if (ocfs2_is_cluster_bitmap(alloc_inode) ||
(le32_to_cpu(rec->c_free) != (le32_to_cpu(rec->c_total) - 1)) ||
(le16_to_cpu(cl->cl_next_free_rec) == 1)) {
goto bail;
}

//protect code begin
ci = INODE_CACHE(alloc_inode);
ocfs2_metadata_cache_io_lock(ci);
if (ci->ci_num_cached > 1) {
goto bail;
}
ci->ci_flags |= OCFS2_CACHE_CLEAN_GD;
ocfs2_metadata_cache_io_unlock(ci);
//protect code end

_ocfs2_reclaim_suballoc_to_main(handle, alloc_inode, alloc_bh, group_bh);
--------------------------------------------------------------


+ /*
+ * Invalid gd cache was set in ocfs2_read_block(),
... ...
+/*
+ * Reclaim the suballocator managed space to main bitmap.
+ * This function first works on the suballocator then switch to the
+ * main bitmap.
+ *
+ * handle: The transaction handle
+ * alloc_inode: The suballoc inode
+ * alloc_bh: The buffer_head of suballoc inode
+ * group_bh: The group descriptor buffer_head of suballocator managed.
+ * Caller should release the input group_bh.
+ */
+static int _reclaim_to_main_bm(handle_t *handle,

Better to rename it to _ocfs2_reclaim_suballoc_to_main().

OK.

+ struct inode *alloc_inode,
+ struct buffer_head *alloc_bh,
+ struct buffer_head *group_bh)
+{
+ int idx, status = 0;
+ int i, next_free_rec, len = 0;
+ __le16 old_bg_contig_free_bits = 0;
... ...
+ le32_add_cpu(&rec->c_free, count);
tmp_used = le32_to_cpu(fe->id1.bitmap1.i_used);
fe->id1.bitmap1.i_used = cpu_to_le32(tmp_used - count);
ocfs2_journal_dirty(handle, alloc_bh);
+ /*
+ * Reclaim suballocator free space.
+ * Bypass: global_bitmap, not empty rec, first rec in cl_recs[]

s/not empty rec/non empty rec

OK.

/Heming