Re: [PATCHv5 3/5] ext4: mballoc: Introduce pcpu seqcnt for freeing PA to improve ENOSPC handling
From: Marek Szyprowski
Date: Wed Jun 03 2020 - 02:48:57 EST
Hi Ritesh,
On 20.05.2020 08:40, Ritesh Harjani wrote:
> There could be a race in function ext4_mb_discard_group_preallocations()
> where the 1st thread may iterate through group's bb_prealloc_list and
> remove all the PAs and add to function's local list head.
> Now if the 2nd thread comes in to discard the group preallocations,
> it will see that the group->bb_prealloc_list is empty and will return 0.
>
> Consider for a case where we have less number of groups
> (for e.g. just group 0),
> this may even return an -ENOSPC error from ext4_mb_new_blocks()
> (where we call for ext4_mb_discard_group_preallocations()).
> But that is wrong, since 2nd thread should have waited for 1st thread
> to release all the PAs and should have retried for allocation.
> Since 1st thread was anyway going to discard the PAs.
>
> The algorithm using this percpu seq counter goes below:
> 1. We sample the percpu discard_pa_seq counter before trying for block
> allocation in ext4_mb_new_blocks().
> 2. We increment this percpu discard_pa_seq counter when we either allocate
> or free these blocks i.e. while marking those blocks as used/free in
> mb_mark_used()/mb_free_blocks().
> 3. We also increment this percpu seq counter when we successfully identify
> that the bb_prealloc_list is not empty and hence proceed for discarding
> of those PAs inside ext4_mb_discard_group_preallocations().
>
> Now to make sure that the regular fast path of block allocation is not
> affected, as a small optimization we only sample the percpu seq counter
> on that cpu. Only when the block allocation fails and when freed blocks
> found were 0, that is when we sample percpu seq counter for all cpus using
> below function ext4_get_discard_pa_seq_sum(). This happens after making
> sure that all the PAs on grp->bb_prealloc_list got freed or if it's empty.
>
> It can be well argued that why don't just check for grp->bb_free to
> see if there are any free blocks to be allocated. So here are the two
> concerns which were discussed:-
>
> 1. If for some reason the blocks available in the group are not
> appropriate for allocation logic (say for e.g.
> EXT4_MB_HINT_GOAL_ONLY, although this is not yet implemented), then
> the retry logic may result into infinte looping since grp->bb_free is
> non-zero.
>
> 2. Also before preallocation was clubbed with block allocation with the
> same ext4_lock_group() held, there were lot of races where grp->bb_free
> could not be reliably relied upon.
> Due to above, this patch considers discard_pa_seq logic to determine if
> we should retry for block allocation. Say if there are are n threads
> trying for block allocation and none of those could allocate or discard
> any of the blocks, then all of those n threads will fail the block
> allocation and return -ENOSPC error. (Since the seq counter for all of
> those will match as no block allocation/discard was done during that
> duration).
>
> Signed-off-by: Ritesh Harjani <riteshh@xxxxxxxxxxxxx>
This patch landed in yesterday's linux-next and causes following
WARNING/BUG on various Samsung Exynos-based boards:
ÂBUG: using smp_processor_id() in preemptible [00000000] code: logsave/552
Âcaller is ext4_mb_new_blocks+0x404/0x1300
ÂCPU: 3 PID: 552 Comm: logsave Tainted: GÂÂÂÂÂÂÂ W 5.7.0-next-20200602 #4
ÂHardware name: Samsung Exynos (Flattened Device Tree)
Â[<c011184c>] (unwind_backtrace) from [<c010d250>] (show_stack+0x10/0x14)
Â[<c010d250>] (show_stack) from [<c05185fc>] (dump_stack+0xbc/0xe8)
Â[<c05185fc>] (dump_stack) from [<c0ab689c>]
(check_preemption_disabled+0xec/0xf0)
Â[<c0ab689c>] (check_preemption_disabled) from [<c03a7b38>]
(ext4_mb_new_blocks+0x404/0x1300)
Â[<c03a7b38>] (ext4_mb_new_blocks) from [<c03780f4>]
(ext4_ext_map_blocks+0xc7c/0x10f4)
Â[<c03780f4>] (ext4_ext_map_blocks) from [<c03902b4>]
(ext4_map_blocks+0x118/0x5a0)
Â[<c03902b4>] (ext4_map_blocks) from [<c0394524>]
(mpage_map_and_submit_extent+0x134/0x9c0)
Â[<c0394524>] (mpage_map_and_submit_extent) from [<c03958c8>]
(ext4_writepages+0xb18/0xcb0)
Â[<c03958c8>] (ext4_writepages) from [<c02588ec>] (do_writepages+0x20/0x94)
Â[<c02588ec>] (do_writepages) from [<c024c688>]
(__filemap_fdatawrite_range+0xac/0xcc)
Â[<c024c688>] (__filemap_fdatawrite_range) from [<c024c700>]
(filemap_flush+0x28/0x30)
Â[<c024c700>] (filemap_flush) from [<c037eedc>]
(ext4_release_file+0x70/0xac)
Â[<c037eedc>] (ext4_release_file) from [<c02c3310>] (__fput+0xc4/0x234)
Â[<c02c3310>] (__fput) from [<c014eb74>] (task_work_run+0x88/0xcc)
Â[<c014eb74>] (task_work_run) from [<c010ca40>]
(do_work_pending+0x52c/0x5cc)
Â[<c010ca40>] (do_work_pending) from [<c0100094>]
(slow_work_pending+0xc/0x20)
ÂException stack(0xec9c1fb0 to 0xec9c1ff8)
Â1fa0:ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ 00000000 0044969c 0000006c
00000000
Â1fc0: 00000001 0045a014 00000241 00000006 00000000 be91abb4 be91abb0
0000000c
Â1fe0: 00459fd4 be91ab90 00448ed4 b6e43444 60000050 00000003
Please let me know how I can help debugging this issue. The above log is
from linux-next 20200602 compiled from exynos_defconfig running on ARM
32bit Samsung Exynos4412-based Odroid U3 board, however I don't think
this is Exynos specific issue. Probably I've observed it, because
exynos_defconfig has most of the debugging options enabled.
> ...
Best regards
--
Marek Szyprowski, PhD
Samsung R&D Institute Poland