Re: [PATCH -v5 0/9] migrate_pages(): batch TLB flushing
From: Hugh Dickins
Date: Tue Feb 21 2023 - 17:25:53 EST
On Tue, 21 Feb 2023, Huang, Ying wrote:
>
> On second thought, I think that it may be better to provide a fix as
> simple as possible firstly. Then we can work on a more complex fix as
> we discussed above. The simple fix is easy to review now. And, we will
> have more time to test and review the complex fix.
>
> In the following fix, I disabled the migration batching except for the
> MIGRATE_ASYNC mode, or the split folios of a THP folio. After that, I
> will work on the complex fix to enable migration batching for all modes.
>
> What do you think about that?
I don't think there's a need to rush in the wrong fix so quickly.
Your series was in (though sometimes out of) linux-next for some
while, without causing any widespread problems. Andrew did send
it to Linus yesterday, I expect he'll be pushing it out later today
or tomorrow, but I don't think it's going to cause big problems.
Aiming for a fix in -rc2 would be good. Why would it be complex?
Hugh
>
> Best Regards,
> Huang, Ying
>
> -------------------------------8<---------------------------------
> From 8e475812eacd9f2eeac76776c2b1a17af3e59b89 Mon Sep 17 00:00:00 2001
> From: Huang Ying <ying.huang@xxxxxxxxx>
> Date: Tue, 21 Feb 2023 16:37:50 +0800
> Subject: [PATCH] migrate_pages: fix deadlock in batched migration
>
> Two deadlock bugs were reported for the migrate_pages() batching
> series. Thanks Hugh and Pengfei! For example, in the following
> deadlock trace snippet,
>
> INFO: task kworker/u4:0:9 blocked for more than 147 seconds.
> Not tainted 6.2.0-rc4-kvm+ #1314
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:kworker/u4:0 state:D stack:0 pid:9 ppid:2 flags:0x00004000
> Workqueue: loop4 loop_rootcg_workfn
> Call Trace:
> <TASK>
> __schedule+0x43b/0xd00
> schedule+0x6a/0xf0
> io_schedule+0x4a/0x80
> folio_wait_bit_common+0x1b5/0x4e0
> ? __pfx_wake_page_function+0x10/0x10
> __filemap_get_folio+0x73d/0x770
> shmem_get_folio_gfp+0x1fd/0xc80
> shmem_write_begin+0x91/0x220
> generic_perform_write+0x10e/0x2e0
> __generic_file_write_iter+0x17e/0x290
> ? generic_write_checks+0x12b/0x1a0
> generic_file_write_iter+0x97/0x180
> ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
> do_iter_readv_writev+0x13c/0x210
> ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
> do_iter_write+0xf6/0x330
> vfs_iter_write+0x46/0x70
> loop_process_work+0x723/0xfe0
> loop_rootcg_workfn+0x28/0x40
> process_one_work+0x3cc/0x8d0
> worker_thread+0x66/0x630
> ? __pfx_worker_thread+0x10/0x10
> kthread+0x153/0x190
> ? __pfx_kthread+0x10/0x10
> ret_from_fork+0x29/0x50
> </TASK>
>
> INFO: task repro:1023 blocked for more than 147 seconds.
> Not tainted 6.2.0-rc4-kvm+ #1314
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:repro state:D stack:0 pid:1023 ppid:360 flags:0x00004004
> Call Trace:
> <TASK>
> __schedule+0x43b/0xd00
> schedule+0x6a/0xf0
> io_schedule+0x4a/0x80
> folio_wait_bit_common+0x1b5/0x4e0
> ? compaction_alloc+0x77/0x1150
> ? __pfx_wake_page_function+0x10/0x10
> folio_wait_bit+0x30/0x40
> folio_wait_writeback+0x2e/0x1e0
> migrate_pages_batch+0x555/0x1ac0
> ? __pfx_compaction_alloc+0x10/0x10
> ? __pfx_compaction_free+0x10/0x10
> ? __this_cpu_preempt_check+0x17/0x20
> ? lock_is_held_type+0xe6/0x140
> migrate_pages+0x100e/0x1180
> ? __pfx_compaction_free+0x10/0x10
> ? __pfx_compaction_alloc+0x10/0x10
> compact_zone+0xe10/0x1b50
> ? lock_is_held_type+0xe6/0x140
> ? check_preemption_disabled+0x80/0xf0
> compact_node+0xa3/0x100
> ? __sanitizer_cov_trace_const_cmp8+0x1c/0x30
> ? _find_first_bit+0x7b/0x90
> sysctl_compaction_handler+0x5d/0xb0
> proc_sys_call_handler+0x29d/0x420
> proc_sys_write+0x2b/0x40
> vfs_write+0x3a3/0x780
> ksys_write+0xb7/0x180
> __x64_sys_write+0x26/0x30
> do_syscall_64+0x3b/0x90
> entry_SYSCALL_64_after_hwframe+0x72/0xdc
> RIP: 0033:0x7f3a2471f59d
> RSP: 002b:00007ffe567f7288 EFLAGS: 00000217 ORIG_RAX: 0000000000000001
> RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f3a2471f59d
> RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000005
> RBP: 00007ffe567f72a0 R08: 0000000000000010 R09: 0000000000000010
> R10: 0000000000000010 R11: 0000000000000217 R12: 00000000004012e0
> R13: 00007ffe567f73e0 R14: 0000000000000000 R15: 0000000000000000
> </TASK>
>
> The page migration task has held the lock of the shmem folio A, and is
> waiting the writeback of the folio B of the file system on the loop
> block device to complete. While the loop worker task which writes
> back the folio B is waiting to lock the shmem folio A, because the
> folio A backs the folio B in the loop device. Thus deadlock is
> triggered.
>
> In general, if we have locked some other folios except the one we are
> migrating, it's not safe to wait synchronously, for example, to wait
> the writeback to complete or wait to lock the buffer head.
>
> To fix the deadlock, in this patch, we avoid to batch the page
> migration except for MIGRATE_ASYNC mode or the split folios of a THP
> folio. In MIGRATE_ASYNC mode, synchronous waiting is avoided. And
> there isn't any dependency relationship among the split folios of a
> THP folio.
>
> The fix can be improved via converting migration mode from synchronous
> to asynchronous if we have locked some other folios except the one we
> are migrating. We will do that in the near future.
>
> Link: https://lore.kernel.org/linux-mm/87a6c8c-c5c1-67dc-1e32-eb30831d6e3d@xxxxxxxxxx/
> Link: https://lore.kernel.org/linux-mm/874jrg7kke.fsf@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/
> Signed-off-by: "Huang, Ying" <ying.huang@xxxxxxxxx>
> Reported-by: Hugh Dickins <hughd@xxxxxxxxxx>
> Reported-by: "Xu, Pengfei" <pengfei.xu@xxxxxxxxx>
> Cc: Christoph Hellwig <hch@xxxxxx>
> Cc: Stefan Roesch <shr@xxxxxxxxxxxx>
> Cc: Tejun Heo <tj@xxxxxxxxxx>
> Cc: Xin Hao <xhao@xxxxxxxxxxxxxxxxx>
> Cc: Zi Yan <ziy@xxxxxxxxxx>
> Cc: Yang Shi <shy828301@xxxxxxxxx>
> Cc: Baolin Wang <baolin.wang@xxxxxxxxxxxxxxxxx>
> Cc: Matthew Wilcox <willy@xxxxxxxxxxxxx>
> Cc: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
> ---
> mm/migrate.c | 13 +++++++++----
> 1 file changed, 9 insertions(+), 4 deletions(-)
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index ef68a1aff35c..bc04c34543f3 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1937,7 +1937,7 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
> enum migrate_mode mode, int reason, unsigned int *ret_succeeded)
> {
> int rc, rc_gather;
> - int nr_pages;
> + int nr_pages, batch;
> struct folio *folio, *folio2;
> LIST_HEAD(folios);
> LIST_HEAD(ret_folios);
> @@ -1951,6 +1951,11 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
> mode, reason, &stats, &ret_folios);
> if (rc_gather < 0)
> goto out;
> +
> + if (mode == MIGRATE_ASYNC)
> + batch = NR_MAX_BATCHED_MIGRATION;
> + else
> + batch = 1;
> again:
> nr_pages = 0;
> list_for_each_entry_safe(folio, folio2, from, lru) {
> @@ -1961,11 +1966,11 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
> }
>
> nr_pages += folio_nr_pages(folio);
> - if (nr_pages > NR_MAX_BATCHED_MIGRATION)
> + if (nr_pages >= batch)
> break;
> }
> - if (nr_pages > NR_MAX_BATCHED_MIGRATION)
> - list_cut_before(&folios, from, &folio->lru);
> + if (nr_pages >= batch)
> + list_cut_before(&folios, from, &folio2->lru);
> else
> list_splice_init(from, &folios);
> rc = migrate_pages_batch(&folios, get_new_page, put_new_page, private,
> --
> 2.39.1