Re: [PATCH 1/2] ocfs2: Revert "ocfs2: fix the la space leak when unmounting an ocfs2 volume"
From: Joseph Qi
Date: Wed Dec 04 2024 - 04:28:33 EST
On 12/4/24 2:46 PM, Heming Zhao wrote:
> On 12/4/24 11:47, Joseph Qi wrote:
>>
>>
>> On 12/4/24 11:32 AM, Heming Zhao wrote:
>>> This reverts commit dfe6c5692fb5 ("ocfs2: fix the la space leak when
>>> unmounting an ocfs2 volume").
>>>
>>> In commit dfe6c5692fb5, the commit log stating "This bug has existed
>>> since the initial OCFS2 code." is incorrect. The correct introduction
>>> commit is 30dd3478c3cd ("ocfs2: correctly use ocfs2_find_next_zero_bit()").
>>>
>>
>> Could you please elaborate more how it happens?
>> And it seems no difference with the new version. So we may submit a
>> standalone revert patch to those backported stable kernels (< 6.10).
>
> commit log from patch [2/2] should be revised.
> change: This bug has existed since the initial OCFS2 code.
> to : This bug was introduced by commit 30dd3478c3cd ("ocfs2: correctly use ocfs2_find_next_zero_bit()")
>
> ----
> See below for the details of patch [1/2].
>
> following is "the code before commit 30dd3478c3cd7" + "commit dfe6c5692fb525e".
>
> static int ocfs2_sync_local_to_main()
> {
> ... ...
> 1 while ((bit_off = ocfs2_find_next_zero_bit(bitmap, left, start))
> 2 != -1) {
> 3 if ((bit_off < left) && (bit_off == start)) {
> 4 count++;
> 5 start++;
> 6 continue;
> 7 }
> 8 if (count) {
> 9 blkno = la_start_blk +
> 10 ocfs2_clusters_to_blocks(osb->sb,
> 11 start - count);
> 12
> 13 trace_ocfs2_sync_local_to_main_free();
> 14
> 15 status = ocfs2_release_clusters(handle,
> 16 main_bm_inode,
> 17 main_bm_bh, blkno,
> 18 count);
> 19 if (status < 0) {
> 20 mlog_errno(status);
> 21 goto bail;
> 22 }
> 23 }
> 24 if (bit_off >= left)
> 25 break;
> 26 count = 1;
> 27 start = bit_off + 1;
> 28 }
> 29
> 30 /* clear the contiguous bits until the end boundary */
> 31 if (count) {
> 32 blkno = la_start_blk +
> 33 ocfs2_clusters_to_blocks(osb->sb,
> 34 start - count);
> 35
> 36 trace_ocfs2_sync_local_to_main_free();
> 37
> 38 status = ocfs2_release_clusters(handle,
> 39 main_bm_inode,
> 40 main_bm_bh, blkno,
> 41 count);
> 42 if (status < 0)
> 43 mlog_errno(status);
> 44 }
> ... ...
> }
>
> bug flow:
> 1. the left:10000, start:0, bit_off:9000, and there are zeros from 9000 to the end of bitmap.
> 2. when 'start' is 9999, code runs to line 3, where bit_off is 10000 (the 'left' value), it doesn't trigger line 3.
> 3. code runs to line 8 (where 'count' is 9999), this area releases 9999 bytes of space to main_bm.
> 4. code runs to line 24, triggering "bit_off == left" and 'break' the loop. at this time, the 'count' still retains its old value 9999.
> 5. code runs to line 31, this area code releases space to main_bm for the same gd again.
>
> kernel will report the following likely error:
> OCFS2: ERROR (device dm-0): ocfs2_block_group_clear_bits: Group descriptor # 349184 has bit count 15872 but claims 19871 are freed. num_bits 7878
>
Okay, IIUC, it seems we have to:
1. revert commit dfe6c5692fb5 (so does stable kernel).
2. fix 30dd3478c3cd in following way:
diff --git a/fs/ocfs2/localalloc.c b/fs/ocfs2/localalloc.c
index 5df34561c551..f0feadac2ef1 100644
--- a/fs/ocfs2/localalloc.c
+++ b/fs/ocfs2/localalloc.c
@@ -971,9 +971,9 @@ static int ocfs2_sync_local_to_main(struct ocfs2_super *osb,
start = count = 0;
left = le32_to_cpu(alloc->id1.bitmap1.i_total);
- while ((bit_off = ocfs2_find_next_zero_bit(bitmap, left, start)) <
+ while ((bit_off = ocfs2_find_next_zero_bit(bitmap, left, start)) <=
left) {
- if (bit_off == start) {
+ if ((bit_off < left) && (bit_off == start)) {
count++;
start++;
continue;
@@ -997,7 +997,8 @@ static int ocfs2_sync_local_to_main(struct ocfs2_super *osb,
goto bail;
}
}
-
+ if (bit_off >= left)
+ break;
count = 1;
start = bit_off + 1;
}
Thanks,
Joseph