Re: [PATCH] f2fs: avoid deadlock in gc thread under low memory
From: Jaegeuk Kim
Date: Wed Apr 13 2022 - 22:42:56 EST
On 04/14, Wu Yan wrote:
> On 4/14/22 10:18, Jaegeuk Kim wrote:
> > On 04/14, Wu Yan wrote:
> > > On 4/14/22 01:00, Jaegeuk Kim wrote:
> > > > On 04/13, Rokudo Yan wrote:
> > > > > There is a potential deadlock in the gc thread that may happen
> > > > > under low memory, as below:
> > > > >
> > > > > gc_thread_func
> > > > > -f2fs_gc
> > > > > -do_garbage_collect
> > > > > -gc_data_segment
> > > > > -move_data_block
> > > > > -set_page_writeback(fio.encrypted_page);
> > > > > -f2fs_submit_page_write
> > > > > As f2fs_submit_page_write tries to merge IO when possible, the
> > > > > encrypted_page is marked PG_writeback but may not be submitted
> > > > > to the block layer immediately. If the system runs low on memory
> > > > > when the gc thread tries to move the next data block, it may do
> > > > > direct reclaim and enter the fs layer as below:
> > > > > -move_data_block
> > > > > -f2fs_grab_cache_page(index=?, for_write=false)
> > > > > -grab_cache_page
> > > > > -find_or_create_page
> > > > > -pagecache_get_page
> > > > > -__page_cache_alloc -- __GFP_FS is set
> > > > > -alloc_pages_node
> > > > > -__alloc_pages
> > > > > -__alloc_pages_slowpath
> > > > > -__alloc_pages_direct_reclaim
> > > > > -__perform_reclaim
> > > > > -try_to_free_pages
> > > > > -do_try_to_free_pages
> > > > > -shrink_zones
> > > > > -mem_cgroup_soft_limit_reclaim
> > > > > -mem_cgroup_soft_reclaim
> > > > > -mem_cgroup_shrink_node
> > > > > -shrink_node_memcg
> > > > > -shrink_list
> > > > > -shrink_inactive_list
> > > > > -shrink_page_list
> > > > > -wait_on_page_writeback -- the page was marked
> > > > > writeback during the previous move_data_block call
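
For reference, the reason __GFP_FS matters here: in the vmscan code of
this era, reclaim only sleeps on a writeback page when the allocating
context allows fs recursion. A simplified paraphrase of the
shrink_page_list() logic (not a verbatim quote):

	may_enter_fs = (sc->gfp_mask & __GFP_FS) ||
		       (PageSwapCache(page) && (sc->gfp_mask & __GFP_IO));
	...
	if (PageWriteback(page)) {
		if (sane_reclaim(sc) || !PageReclaim(page) || !may_enter_fs) {
			/* mark the page for later and move on */
			SetPageReclaim(page);
			goto activate_locked;
		} else {
			/* legacy memcg reclaim with __GFP_FS set: sleep */
			unlock_page(page);
			wait_on_page_writeback(page);
		}
	}

So a NOFS allocation makes may_enter_fs false, and reclaim skips the
page instead of sleeping on it.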
> > > > >
> > > > > The gc thread waits for the encrypted_page writeback to
> > > > > complete, but as the gc thread holds sbi->gc_lock, the writeback
> > > > > & sync threads may be blocked waiting for sbi->gc_lock, so the
> > > > > bio containing the encrypted_page may never be submitted to the
> > > > > block layer and complete the writeback, which causes a deadlock.
> > > > > To avoid this deadlock condition, we mark the gc thread with the
> > > > > PF_MEMALLOC_NOFS flag, so it will never enter the fs layer when
> > > > > trying to allocate a cache page during move_data_block.
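
A side note on the API: memalloc_nofs_save() is meant to be used as a
scoped pair with memalloc_nofs_restore(). A minimal sketch of the usual
pattern (the grab_cache_page() call is just illustrative):

	unsigned int flags;

	flags = memalloc_nofs_save();
	/*
	 * Every allocation in this section implicitly drops __GFP_FS,
	 * so direct reclaim cannot recurse into the filesystem.
	 */
	page = grab_cache_page(mapping, index);
	memalloc_nofs_restore(flags);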
> > > > >
> > > > > Signed-off-by: Rokudo Yan <wu-yan@xxxxxxx>
> > > > > ---
> > > > > fs/f2fs/gc.c | 6 ++++++
> > > > > 1 file changed, 6 insertions(+)
> > > > >
> > > > > diff --git a/fs/f2fs/gc.c b/fs/f2fs/gc.c
> > > > > index e020804f7b07..cc71f77b98c8 100644
> > > > > --- a/fs/f2fs/gc.c
> > > > > +++ b/fs/f2fs/gc.c
> > > > > @@ -38,6 +38,12 @@ static int gc_thread_func(void *data)
> > > > > wait_ms = gc_th->min_sleep_time;
> > > > > + /*
> > > > > + * Make sure that no allocation from the gc thread will ever
> > > > > + * recurse into the fs layer, to avoid deadlock: the thread
> > > > > + * holds sbi->gc_lock during garbage collection.
> > > > > + */
> > > > > + memalloc_nofs_save();
> > > >
> > > > I think this cannot cover all the f2fs_gc() call cases. Can we just avoid by:
> > > >
> > > > --- a/fs/f2fs/gc.c
> > > > +++ b/fs/f2fs/gc.c
> > > > @@ -1233,7 +1233,7 @@ static int move_data_block(struct inode *inode, block_t bidx,
> > > > CURSEG_ALL_DATA_ATGC : CURSEG_COLD_DATA;
> > > >
> > > > /* do not read out */
> > > > - page = f2fs_grab_cache_page(inode->i_mapping, bidx, false);
> > > > + page = f2fs_grab_cache_page(inode->i_mapping, bidx, true);
> > > > if (!page)
> > > > return -ENOMEM;
> > > >
> > > > Thanks,
> > > >
> > > > > set_freezable();
> > > > > do {
> > > > > bool sync_mode, foreground = false;
> > > > > --
> > > > > 2.25.1
> > >
> > > Hi, Jaegeuk
> > >
> > > I'm not sure if any other case may trigger the issue, but the stack traces I
> > > have caught so far are all the same as below:
> > >
> > > f2fs_gc-253:12 D 226966.808196 572 302561 150976 0x1200840 0x0 572 237207473347056
> > > <ffffff889d88668c> __switch_to+0x134/0x150
> > > <ffffff889e764b6c> __schedule+0xd5c/0x1100
> > > <ffffff889e76554c> io_schedule+0x90/0xc0
> > > <ffffff889d9fb880> wait_on_page_bit+0x194/0x208
> > > <ffffff889da167b4> shrink_page_list+0x62c/0xe74
> > > <ffffff889da1d354> shrink_inactive_list+0x2c0/0x698
> > > <ffffff889da181f4> shrink_node_memcg+0x3dc/0x97c
> > > <ffffff889da17d44> mem_cgroup_shrink_node+0x144/0x218
> > > <ffffff889da6610c> mem_cgroup_soft_limit_reclaim+0x188/0x47c
> > > <ffffff889da17a40> do_try_to_free_pages+0x204/0x3a0
> > > <ffffff889da176c8> try_to_free_pages+0x35c/0x4d0
> > > <ffffff889da05d60> __alloc_pages_nodemask+0x7a4/0x10d0
> > > <ffffff889d9fc82c> pagecache_get_page+0x184/0x2ec
> >
> > Is this deadlock trying to grab a lock, instead of waiting for writeback?
> > Could you share all the backtraces of the tasks?
> >
> > For the writeback above, looking at the code, f2fs_gc uses three
> > mappings: meta, node, and data. The meta/node inodes have their gfp
> > mask restricted to GFP_NOFS in f2fs_iget(), while the data inode does
> > not. So the above f2fs_grab_cache_page() in move_data_block() actually
> > allocates w/o NOFS.
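
To illustrate, paraphrased from fs/f2fs/inode.c and f2fs.h rather than
quoted verbatim: the meta/node mappings get their gfp mask forced to
GFP_NOFS at iget time, so page cache allocations for them can never
recurse into the fs:

	/* f2fs.h */
	#define GFP_F2FS_ZERO	(GFP_NOFS | __GFP_ZERO)

	/* f2fs_iget(), for the node/meta inodes only */
	mapping_set_gfp_mask(inode->i_mapping, GFP_F2FS_ZERO);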
> >
> > > <ffffff889dbf8860> do_garbage_collect+0xfe0/0x2828
> > > <ffffff889dbf7434> f2fs_gc+0x4a0/0x8ec
> > > <ffffff889dbf6bf4> gc_thread_func+0x240/0x4d4
> > > <ffffff889d8de9b0> kthread+0x17c/0x18c
> > > <ffffff889d88567c> ret_from_fork+0x10/0x18
> > >
> > > Thanks
> > > yanwu
>
> Hi, Jaegeuk
>
> The gc thread is blocked on wait_on_page_writeback() (for the encrypted
> page submitted earlier) when it tries to grab the data inode page; the
> parsed stack trace is as below:
>
> ppid=572 pid=572 D cpu=1 prio=120 wait=378s f2fs_gc-253:12
> Native callstack:
> vmlinux wait_on_page_bit_common(page=0xFFFFFFBF7D2CD700, state=2, lock=false) + 304 <mm/filemap.c:1035>
> vmlinux wait_on_page_bit(page=0xFFFFFFBF7D2CD700, bit_nr=15) + 400 <mm/filemap.c:1074>
> vmlinux wait_on_page_writeback(page=0xFFFFFFBF7D2CD700) + 36 <include/linux/pagemap.h:557>
> vmlinux shrink_page_list(page_list=0xFFFFFF8011E83418, pgdat=contig_page_data, sc=0xFFFFFF8011E835B8, ttu_flags=0, stat=0xFFFFFF8011E833F0, force_reclaim=false) + 1576 <mm/vmscan.c:1171>
> vmlinux shrink_inactive_list(lruvec=0xFFFFFFE003C304C0, sc=0xFFFFFF8011E835B8, lru=LRU_INACTIVE_FILE) + 700 <mm/vmscan.c:1966>
> vmlinux shrink_list(lru=LRU_INACTIVE_FILE, lruvec=0xFFFFFF8011E834B8, sc=0xFFFFFF8011E835B8) + 128 <mm/vmscan.c:2350>
> vmlinux shrink_node_memcg(pgdat=contig_page_data, memcg=0xFFFFFFE003C1A300, sc=0xFFFFFF8011E835B8, lru_pages=0xFFFFFF8011E835B0) + 984 <mm/vmscan.c:2726>
> vmlinux mem_cgroup_shrink_node(memcg=0xFFFFFFE003C1A300, gfp_mask=21102794, noswap=false, pgdat=contig_page_data, nr_scanned=0xFFFFFF8011E836A0) + 320 <mm/vmscan.c:3416>
> vmlinux mem_cgroup_soft_reclaim(root_memcg=0xFFFFFFE003C1A300, pgdat=contig_page_data) + 164 <mm/memcontrol.c:1643>
> vmlinux mem_cgroup_soft_limit_reclaim(pgdat=contig_page_data, order=0, gfp_mask=21102794, total_scanned=0xFFFFFF8011E83720) + 388 <mm/memcontrol.c:2913>
> vmlinux shrink_zones(zonelist=contig_page_data + 14784, sc=0xFFFFFF8011E83790) + 352 <mm/vmscan.c:3094>
> vmlinux do_try_to_free_pages(zonelist=contig_page_data + 14784, sc=0xFFFFFF8011E83790) + 512 <mm/vmscan.c:3164>
> vmlinux try_to_free_pages(zonelist=contig_page_data + 14784, order=0, gfp_mask=21102794, nodemask=0) + 856 <mm/vmscan.c:3370>
> vmlinux __perform_reclaim(gfp_mask=300431548, order=0, ac=0xFFFFFF8011E83900) + 60 <mm/page_alloc.c:3831>
> vmlinux __alloc_pages_direct_reclaim(gfp_mask=300431548, order=0, alloc_flags=300431604, ac=0xFFFFFF8011E83900) + 60 <mm/page_alloc.c:3853>
> vmlinux __alloc_pages_slowpath(gfp_mask=300431548, order=0, ac=0xFFFFFF8011E83900) + 1244 <mm/page_alloc.c:4240>
> vmlinux __alloc_pages_nodemask() + 1952 <mm/page_alloc.c:4463>
> vmlinux __alloc_pages(gfp_mask=21102794, order=0, preferred_nid=0) + 16 <include/linux/gfp.h:515>
> vmlinux __alloc_pages_node(nid=0, gfp_mask=21102794, order=0) + 16 <include/linux/gfp.h:528>
> vmlinux alloc_pages_node(nid=0, gfp_mask=21102794, order=0) + 16 <include/linux/gfp.h:542>
> vmlinux __page_cache_alloc(gfp=21102794) + 16 <include/linux/pagemap.h:226>
> vmlinux pagecache_get_page() + 384 <mm/filemap.c:1520>
> vmlinux find_or_create_page(offset=209) + 112 <include/linux/pagemap.h:333>
> vmlinux grab_cache_page(index=209) + 112 <include/linux/pagemap.h:399>
> vmlinux f2fs_grab_cache_page(index=209, for_write=false) + 112 <fs/f2fs/f2fs.h:2429>
Ok, I think this should be enough.
--- a/fs/f2fs/gc.c
+++ b/fs/f2fs/gc.c
@@ -1233,7 +1233,7 @@ static int move_data_block(struct inode *inode, block_t bidx,
CURSEG_ALL_DATA_ATGC : CURSEG_COLD_DATA;
/* do not read out */
- page = f2fs_grab_cache_page(inode->i_mapping, bidx, false);
+ page = f2fs_grab_cache_page(inode->i_mapping, bidx, true);
if (!page)
return -ENOMEM;
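
To spell out why for_write=true is enough: in the kernels discussed
here, f2fs_grab_cache_page() then takes the write_begin path with
AOP_FLAG_NOFS, so the page cache allocation is done without __GFP_FS
and reclaim cannot block on our own writeback page. Paraphrased from
f2fs.h:

	static inline struct page *f2fs_grab_cache_page(
			struct address_space *mapping,
			pgoff_t index, bool for_write)
	{
		...
		if (!for_write)
			return grab_cache_page(mapping, index);
		/* AOP_FLAG_NOFS clears __GFP_FS for this allocation */
		return grab_cache_page_write_begin(mapping, index,
						   AOP_FLAG_NOFS);
	}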
>
> vmlinux move_data_block(inode=0xFFFFFFDFD578EEA0, gc_type=300432152, segno=21904, off=145) + 3584 <fs/f2fs/gc.c:1119>
> vmlinux gc_data_segment(sbi=0xFFFFFFE007C03000, sum=0xFFFFFF8011E83B10, gc_list=0xFFFFFF8011E83AB8, segno=21904, gc_type=300432152) + 3644 <fs/f2fs/gc.c:1475>
> vmlinux do_garbage_collect(sbi=0xFFFFFFE007C03000, start_segno=21904, gc_list=0xFFFFFF8011E83CF0, gc_type=0) + 4060 <fs/f2fs/gc.c:1592>
> vmlinux f2fs_gc(sbi=0xFFFFFFE007C03000, background=true, segno=4294967295) + 1180 <fs/f2fs/gc.c:1684>
> vmlinux gc_thread_func(data=0xFFFFFFE007C03000) + 572 <fs/f2fs/gc.c:118>
> vmlinux kthread() + 376 <kernel/kthread.c:232>
> vmlinux ret_from_fork() +
>
> Thanks
> yanwu