Re: [PATCH] ext4: add a configurable parameter to prevent endless loop in ext4_mb_discard_group_p

From: Wen Yang
Date: Thu Apr 08 2021 - 14:50:50 EST


> On Apr 7, 2021, at 5:16 AM, riteshh <riteshh@xxxxxxxxxxxxx> wrote:
>>
>> On 21/04/07 03:01PM, Wen Yang wrote:
>>> From: Wen Yang <simon.wy@xxxxxxxxxxxxxxx>
>>>
>>> The kworker has occupied 100% of the CPU for several days:
>>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
>>> 68086 root 20 0 0 0 0 R 100.0 0.0 9718:18 kworker/u64:11
>>>
>>> And the stack obtained through sysrq is as follows:
>>> [20613144.850426] task: ffff8800b5e08000 task.stack: ffffc9001342c000
>>> [20613144.850438] Call Trace:
>>> [20613144.850439] [<ffffffffa0244209>]ext4_mb_new_blocks+0x429/0x550 [ext4]
>>> [20613144.850439] [<ffffffffa02389ae>] ext4_ext_map_blocks+0xb5e/0xf30 [ext4]
>>> [20613144.850441] [<ffffffffa0204b52>] ext4_map_blocks+0x172/0x620 [ext4]
>>> [20613144.850442] [<ffffffffa0208675>] ext4_writepages+0x7e5/0xf00 [ext4]
>>> [20613144.850443] [<ffffffff811c487e>] do_writepages+0x1e/0x30
>>> [20613144.850444] [<ffffffff81280265>] __writeback_single_inode+0x45/0x320
>>> [20613144.850444] [<ffffffff81280ab2>] writeback_sb_inodes+0x272/0x600
>>> [20613144.850445] [<ffffffff81280ed2>] __writeback_inodes_wb+0x92/0xc0
>>> [20613144.850445] [<ffffffff81281238>] wb_writeback+0x268/0x300
>>> [20613144.850446] [<ffffffff812819f4>] wb_workfn+0xb4/0x380
>>> [20613144.850447] [<ffffffff810a5dc9>] process_one_work+0x189/0x420
>>> [20613144.850447] [<ffffffff810a60ae>] worker_thread+0x4e/0x4b0
>>>
>>> The cpu resources of the cloud server are precious, and the server
>>> cannot be restarted after running for a long time, so a configuration
>>> parameter is added to prevent this endless loop.
>>
>> Strange, if there is a endless loop here. Then I would definitely see
>> if there is any accounting problem in pa->pa_count. Otherwise busy=1
>> should not be set everytime. ext4_mb_show_pa() function may help debug this.
>>
>> If yes, then that means there always exists either a file preallocation
>> or a group preallocation. Maybe it is possible, in some use case.
>> Others may know of such use case, if any.

> If this code is broken, then it doesn't make sense to me that we would
> leave it in the "run forever" state after the patch, and require a sysfs
> tunable to be set to have a properly working system?

> Is there anything particularly strange about the workload/system that
> might cause this? Filesystem is very full, memory is very low, etc?

Hi Ritesh and Andreas,

Thank you for your reply. Since there is still a faulty machine, we have analyzed it again and found it is indeed a very special case:


crash> struct ext4_group_info ffff8813bb5f72d0
struct ext4_group_info {
bb_state = 0,
bb_free_root = {
rb_node = 0x0
},
bb_first_free = 1681,
bb_free = 0,
bb_fragments = 0,
bb_largest_free_order = -1,
bb_prealloc_list = {
next = 0xffff880268291d78,
prev = 0xffff880268291d78 ---> *** The list is empty
},
alloc_sem = {
count = {
counter = 0
},
wait_list = {
next = 0xffff8813bb5f7308,
prev = 0xffff8813bb5f7308
},
wait_lock = {
raw_lock = {
{
val = {
counter = 0
},
{
locked = 0 '\000',
pending = 0 '\000'
},
{
locked_pending = 0,
tail = 0
}
}
}
},
osq = {
tail = {
counter = 0
}
},
owner = 0x0
},
bb_counters = 0xffff8813bb5f7328
}
crash>


crash> list 0xffff880268291d78 -l ext4_prealloc_space.pa_group_list -s ext4_prealloc_space.pa_count
ffff880268291d78
pa_count = {
counter = 1 ---> ****pa->pa_count
}
ffff8813bb5f72f0
pa_count = {
counter = -30701
}


crash> struct -xo ext4_prealloc_space
struct ext4_prealloc_space {
[0x0] struct list_head pa_inode_list;
[0x10] struct list_head pa_group_list;
union {
struct list_head pa_tmp_list;
struct callback_head pa_rcu;
[0x20] } u;
[0x30] spinlock_t pa_lock;
[0x34] atomic_t pa_count;
[0x38] unsigned int pa_deleted;
[0x40] ext4_fsblk_t pa_pstart;
[0x48] ext4_lblk_t pa_lstart;
[0x4c] ext4_grpblk_t pa_len;
[0x50] ext4_grpblk_t pa_free;
[0x54] unsigned short pa_type;
[0x58] spinlock_t *pa_obj_lock;
[0x60] struct inode *pa_inode;
}
SIZE: 0x68


crash> rd 0xffff880268291d68 20
ffff880268291d68: ffff881822f8a4c8 ffff881822f8a4c8 ..."......."....
ffff880268291d78: ffff8813bb5f72f0 ffff8813bb5f72f0 .r_......r_.....
ffff880268291d88: 0000000000001000 ffff880db2371000 ..........7.....
ffff880268291d98: 0000000100000001 0000000000000000 ................
ffff880268291da8: 0000000000029c39 0000001700000c41 9.......A.......
ffff880268291db8: ffff000000000016 ffff881822f8a4d8 ..........."....
ffff880268291dc8: ffff881822f8a268 ffff880268291af8 h.."......)h....
ffff880268291dd8: ffff880268291dd0 ffffea00069a07c0 ..)h............
ffff880268291de8: 00000000004d5bb7 0000000000001000 .[M.............
ffff880268291df8: ffff8801a681f000 0000000000000000 ................



Before "goto repeat", it is necessary to check whether grp->bb_prealloc_list is empty.
This patch may fix it.
Please kindly give us your comments and suggestions.
Thanks.


diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 99bf091..8082e2d 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -4290,7 +4290,7 @@ static void ext4_mb_new_preallocation(struct ext4_allocation_context *ac)
free_total += free;

/* if we still need more blocks and some PAs were used, try again */
- if (free_total < needed && busy) {
+ if (free_total < needed && busy && !list_empty(&grp->bb_prealloc_list)) {
ext4_unlock_group(sb, group);
cond_resched();
busy = 0;



--
Best wishes,
Wen